{"id":3416,"date":"2025-08-29T00:26:04","date_gmt":"2025-08-29T00:26:04","guid":{"rendered":"https:\/\/violethoward.com\/new\/in-crowded-voice-ai-market-openai-bets-on-instruction-following-and-expressive-speech-to-win-enterprise-adoption\/"},"modified":"2025-08-29T00:26:04","modified_gmt":"2025-08-29T00:26:04","slug":"in-crowded-voice-ai-market-openai-bets-on-instruction-following-and-expressive-speech-to-win-enterprise-adoption","status":"publish","type":"post","link":"https:\/\/violethoward.com\/new\/in-crowded-voice-ai-market-openai-bets-on-instruction-following-and-expressive-speech-to-win-enterprise-adoption\/","title":{"rendered":"In crowded voice AI market, OpenAI bets on instruction-following and expressive speech to win enterprise adoption"},"content":{"rendered":" \r\n<br><div>\n\t\t\t\t<p>OpenAI is adding to an increasingly competitive enterprise voice AI market with gpt-realtime, a new model that follows complex instructions and offers voices \u201cthat sound more natural and expressive.\u201d<\/p>\n\n\n\n<p>As voice AI grows and customers find use cases such as customer service calls and real-time translation, the market for realistic-sounding AI voices that also offer enterprise-grade security is heating up. OpenAI claims its new model provides a more human-like voice, but it still needs to compete against companies like ElevenLabs.<\/p>\n\n\n\n<p>The model will be available on the Realtime API, which the company also made generally available. 
Along with the gpt-realtime model, OpenAI released two new voices on the API, Cedar and Marin, and updated its other voices to work with the latest model.<\/p>\n\n\n\n<p>OpenAI said in a livestream that it worked with customers building voice applications to train gpt-realtime and \u201ccarefully aligned the model to evals that are built on real-world scenarios like customer support and academic tutoring.\u201d<\/p>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><p>\n<iframe loading=\"lazy\" title=\"Introducing gpt-realtime in the API\" width=\"500\" height=\"281\" src=\"https:\/\/www.youtube.com\/embed\/nfBbmtMJhX0?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/p><\/figure>\n\n\n\n<p>The company touted the model\u2019s ability to create emotive, natural-sounding voices that also align with how developers build with the technology.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" 
id=\"h-speech-to-speech-models\">Speech-to-speech models<\/h2>\n\n\n\n<p>The model operates within a speech-to-speech framework, enabling it to understand spoken prompts and respond vocally. Speech-to-speech models are ideally suited for real-time responses, where a person, typically a customer, interacts with an application.<\/p>\n\n\n\n<p>For example, a customer who wants to return some products might call a customer service platform and talk to an AI voice assistant that responds to questions and requests as a human agent would.<\/p>\n\n\n\n<p>In a livestream, OpenAI customer T-Mobile showcased an AI voice-powered agent that helps people find new phones. Another customer, the real estate search platform Zillow, showcased an agent that helps someone narrow down a neighborhood to find the perfect place.<\/p>\n\n\n\n<p>OpenAI said gpt-realtime is its \u201cmost advanced, production-ready voice model.\u201d Like its other voice models, it can switch languages mid-sentence. OpenAI researchers added that gpt-realtime can also follow more complex instructions like \u201cspeak empathetically in a French accent.\u201d<\/p>\n\n\n\n<p>But gpt-realtime faces competition from other models that many brands already use. ElevenLabs released Conversational AI 2.0 in May. SoundHound partners with fast food franchises for an AI voice drive-thru. Empathic AI startup Hume has launched its EVI 3 model, which allows users to generate AI versions of their own voice.<\/p>\n\n\n\n<p>As enterprises discover various use cases for voice AI, even more general model providers that offer multimodal LLMs are making a case for themselves. Mistral released its new Voxtral model, stating it would work well with real-time translation. 
Google is enhancing its audio capabilities and gaining popularity with an audio feature on NotebookLM that converts research notes into a podcast.\u00a0<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-better-instruction-following\">Better instruction following<\/h2>\n\n\n\n<p>OpenAI said gpt-realtime is smarter and understands native audio better, including the ability to catch non-verbal cues like laughs or sighs.\u00a0<\/p>\n\n\n\n<p>Benchmarking using the Big Bench Audio eval showed the model scoring 82.8% in accuracy, compared to its previous model, which scored 65.6%. OpenAI did not provide numbers testing gpt-realtime against models from its competitors.\u00a0<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img fetchpriority=\"high\" decoding=\"async\" width=\"777\" height=\"465\" src=\"https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/08\/image_abee53.png\" alt=\"\" class=\"wp-image-3016219\" srcset=\"https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/08\/image_abee53.png 777w, https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/08\/image_abee53.png?resize=300,180 300w, https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/08\/image_abee53.png?resize=768,460 768w, https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/08\/image_abee53.png?resize=400,239 400w, https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/08\/image_abee53.png?resize=750,449 750w, https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/08\/image_abee53.png?resize=578,346 578w\" sizes=\"(max-width: 777px) 100vw, 777px\"\/><\/figure>\n\n\n\n<p>OpenAI focused on improving the model\u2019s instruction-following capabilities, ensuring the model would adhere to directions more effectively. The new model achieves a score of 30.5% on the MultiChallenge audio benchmark. 
The engineers also improved function calling so gpt-realtime calls the correct tools.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-realtime-api-updates\">Realtime API updates<\/h2>\n\n\n\n<p>To support the new model and enhance how enterprises integrate real-time AI capabilities into their applications, OpenAI has added several new features to the Realtime API.<\/p>\n\n\n\n<p>The API now supports the Model Context Protocol (MCP) and accepts image inputs, allowing the model to tell users what it sees in real time. This is a capability Google heavily emphasized during its Project Astra presentation last year.<\/p>\n\n\n\n<p>The Realtime API also supports the Session Initiation Protocol (SIP), which connects apps to telephony endpoints such as the public phone network and desk phones, opening up more contact center use cases. Users can also save and reuse prompts on the API.<\/p>\n\n\n\n<p>So far, early testers are impressed with the model, although these are still initial tests of a recently released model.<\/p>\n\n\n\n<figure class=\"wp-block-embed is-type-rich is-provider-twitter wp-block-embed-twitter\"><div class=\"wp-block-embed__wrapper\">\n<blockquote class=\"twitter-tweet\" data-width=\"500\" data-dnt=\"true\"><p lang=\"en\" dir=\"ltr\">Tbh, the MCP and SIP features are the real story here, not just another model. <\/p><p>The ability to connect to external tools and systems seamlessly is what will finally move these models from being impressive demos to being integrated into actual workflows. 
<\/p><p>The real time aspect\u2026<\/p>\u2014 JK (@_junaidkhalid1) <a href=\"https:\/\/twitter.com\/_junaidkhalid1\/status\/1961119224107307237?ref_src=twsrc%5Etfw\">August 28, 2025<\/a><\/blockquote>\n<\/div><\/figure>\n\n\n\n<figure class=\"wp-block-embed is-type-rich is-provider-twitter wp-block-embed-twitter\"><div class=\"wp-block-embed__wrapper\">\n<blockquote class=\"twitter-tweet\" data-width=\"500\" data-dnt=\"true\"><p lang=\"en\" dir=\"ltr\">Testing out gpt-realtime<\/p><p>Initial review:<br\/>\u2013 Noticable audio improvement<br\/>\u2013 It&#8217;s a stickler for the instructions (very good)<br\/>\u2013 Feels fast <a href=\"https:\/\/t.co\/LtyCs0QLXV\">pic.twitter.com\/LtyCs0QLXV<\/a><\/p>\u2014 Jake Colling (@JacobColling) <a href=\"https:\/\/twitter.com\/JacobColling\/status\/1961153126993371631?ref_src=twsrc%5Etfw\">August 28, 2025<\/a><\/blockquote>\n<\/div><\/figure>\n\n\n\n<figure class=\"wp-block-embed is-type-rich is-provider-twitter wp-block-embed-twitter\"><div class=\"wp-block-embed__wrapper\">\n<blockquote class=\"twitter-tweet\" data-width=\"500\" data-dnt=\"true\"><p lang=\"en\" dir=\"ltr\">Well, GPT-realtime got a livestream not because most users are interested, but for strategic business reasons<\/p><p>Call centers are a major target for LLM providers and the first company to reach a real breakthrough will get massive revenue<\/p>\u2014 AnKo (@anko_979) <a href=\"https:\/\/twitter.com\/anko_979\/status\/1961145546011275650?ref_src=twsrc%5Etfw\">August 28, 2025<\/a><\/blockquote>\n<\/div><\/figure>\n\n\n\n<figure class=\"wp-block-embed is-type-rich is-provider-twitter wp-block-embed-twitter\"><div class=\"wp-block-embed__wrapper\">\n<blockquote class=\"twitter-tweet\" data-width=\"500\" data-dnt=\"true\"><p lang=\"en\" dir=\"ltr\">Pros &amp; Cons from <a href=\"https:\/\/twitter.com\/OpenAI?ref_src=twsrc%5Etfw\">@OpenAI<\/a> real-time update from someone building in AI audio:<\/p><p>Pro: Better function calling, more emotion, 20% 
cheaper, better control, image is cool but won&#8217;t use<\/p><p>Con: no custom voices (creative experience MUST HAVE), still *expensive* vs TTS-LLM-STT pipelines<\/p>\u2014 Gavin Purcell (@gavinpurcell) <a href=\"https:\/\/twitter.com\/gavinpurcell\/status\/1961119146621481407?ref_src=twsrc%5Etfw\">August 28, 2025<\/a><\/blockquote>\n<\/div><\/figure>\n\n\n\n<p>OpenAI reduced prices for gpt-realtime by 20% to $32 per million audio input tokens and $64 for audio output tokens.\u00a0<\/p>\n\t\t\t<\/div>\r\n<br>\r\n<br><a href=\"https:\/\/venturebeat.com\/ai\/in-crowded-voice-ai-market-openai-bets-on-instruction-following-and-expressive-speech-to-win-enterprise-adoption\/\">Source link <\/a>","protected":false},"excerpt":{"rendered":"<p>OpenAI adds to an increasingly competitive AI voice market for enterprises with its new model, gpt-realtime, that follows complex instructions and with voices \u201cthat sound more natural and expressive.\u201d [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":3417,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[33],"tags":[],"class_list":["post-3416","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-automation"],"aioseo_notices":[],"jetpack_featured_media_url":"https:\/\/violethoward.com\/new\/wp-content\/uploads\/2025\/08\/crimedy7_illustration_of_a_half_machine_half_human_person_spe_63e21c7b-3093-4e77-b336-a83287f6af4a_2.png","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts\/3416","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/comments?post=3416"}],"version-history":[{"count":0,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts\/3416\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/media\/3417"}],"wp:attachment":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/media?parent=3416"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/categories?post=3416"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/tags?post=3416"}],"curies":[{"name":"wp","href":"https:\/\/api.w.or
g\/{rel}","templated":true}]}}