
OpenAI’s New Audio Models Are Bringing AI Voices to Life

AI voice agents now sound smarter, more human, and highly customizable. New audio models improve transcription, enhance tone control, and simplify development for voice AI.

OpenAI’s Speech Models Released

Beyond Text: A Shift Toward Voice-First AI

OpenAI has rolled out a suite of next-generation audio models that mark a major move from plain text interaction to voice-enabled AI agents. Building on recent advancements like Operator and the Agents SDK, this launch aims to help developers create intelligent voice agents that are more human, helpful, and engaging.

“This marks a big step toward intuitive, voice-first AI experiences.”

These updates allow for natural spoken conversations, making AI assistants feel more responsive and emotionally aware.

Sharper Ears: Big Gains in Speech-to-Text Accuracy

Two new speech-to-text models—gpt-4o-transcribe and gpt-4o-mini-transcribe—deliver state-of-the-art transcription accuracy, even in tricky conditions.

Key enhancements include:

  • Lower Word Error Rates (WER) across 100+ languages
  • Robust performance in noisy settings, fast speech, and varied accents
  • Better results than previous Whisper models and leading competitors on benchmarks such as FLEURS

These models were trained using:

  • Specialized audio datasets
  • Advanced reinforcement learning techniques
  • Self-play interactions to mimic real conversations

“Fewer errors, better recognition—no matter how fast or accented the speaker.”
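For developers, a minimal sketch of calling one of the new transcription models through the OpenAI Python SDK might look like the following; the file name and API key setup are illustrative assumptions, not details from the announcement.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Transcribe a local recording with the new gpt-4o-transcribe model.
    # "meeting.wav" is a placeholder file name used for illustration.
    with open("meeting.wav", "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="gpt-4o-transcribe",
            file=audio_file,
        )

    print(transcript.text)

The smaller gpt-4o-mini-transcribe can be swapped in the same way when lower cost or latency matters more than peak accuracy.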

Steerable Voices: A Voice With Personality

The new gpt-4o-mini-tts model introduces steerability—allowing developers to tell the AI not just what to say, but how to say it.

Voice styles include:

  • Calm customer service rep
  • Expressive bedtime storyteller
  • Professional meeting narrator
  • Even a medieval knight giving directions

Developers can now fine-tune elements like:

  • Tone
  • Emotion
  • Pacing
  • Energy level

You can test these variations on OpenAI.fm, an interactive demo where the new voices can be tried out in real time.
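As a rough sketch, steering the voice with the OpenAI Python SDK could look like this; the instruction text, voice name, and output file are illustrative choices rather than part of OpenAI's examples.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Ask gpt-4o-mini-tts to read a line in a specific style.
    # The "instructions" wording and the "coral" voice are illustrative.
    with client.audio.speech.with_streaming_response.create(
        model="gpt-4o-mini-tts",
        voice="coral",
        input="Thanks for calling. I can help you reset your password.",
        instructions="Speak as a calm, patient customer service representative.",
    ) as response:
        response.stream_to_file("greeting.mp3")

Changing only the instructions string is enough to move the same script from a calm support rep to an expressive storyteller.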

Easy Integration: Voice With Just a Few Lines of Code

OpenAI updated its Agents SDK to make converting a text agent into a voice agent incredibly simple.

“Just 9 lines of code can turn your chatbot into a full-on talkbot.”

This includes built-in handling for:

  • Speech input
  • Language processing
  • Spoken response output
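A hedged sketch of that flow, based on the voice extension of the open-source Agents SDK (openai-agents), is shown below; the class names follow its published voice-pipeline quickstart and may differ between versions, and the silent buffer stands in for real microphone input.

    import asyncio
    import numpy as np
    from agents import Agent
    from agents.voice import AudioInput, SingleAgentVoiceWorkflow, VoicePipeline

    # A plain text agent, wrapped so it can listen and speak.
    agent = Agent(name="Assistant", instructions="You are a helpful support agent.")

    async def main():
        pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(agent))
        # Placeholder: three seconds of silence instead of real microphone audio.
        audio = AudioInput(buffer=np.zeros(24000 * 3, dtype=np.int16))
        result = await pipeline.run(audio)
        # Stream back the spoken reply; event.data holds audio chunks to play.
        async for event in result.stream():
            if event.type == "voice_stream_event_audio":
                pass  # send event.data to your audio output device

    asyncio.run(main())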

Companies like EliseAI are already using this to build emotionally rich voice agents for tenant support, resulting in:

  • Higher customer satisfaction
  • More calls resolved automatically

Looking Ahead: Custom Voices and Multimodal Agents

OpenAI isn’t stopping here. Future plans include:

  • Allowing developers to upload custom voices
  • Ensuring all voice models align with strong safety standards
  • Expanding into other modes like video to enable fully multimodal AI agents

The company continues working with researchers, developers, and policymakers to address the societal impact of synthetic voices.

“The future of AI is multimodal—and voice is just the beginning.”