{"id":736,"date":"2025-03-21T10:52:26","date_gmt":"2025-03-21T10:52:26","guid":{"rendered":"https:\/\/violethoward.com\/new\/openais-new-voice-ai-model-gpt-4o-transcribe-lets-you-add-speech-to-your-existing-text-apps-in-seconds\/"},"modified":"2025-03-21T10:52:26","modified_gmt":"2025-03-21T10:52:26","slug":"openais-new-voice-ai-model-gpt-4o-transcribe-lets-you-add-speech-to-your-existing-text-apps-in-seconds","status":"publish","type":"post","link":"https:\/\/violethoward.com\/new\/openais-new-voice-ai-model-gpt-4o-transcribe-lets-you-add-speech-to-your-existing-text-apps-in-seconds\/","title":{"rendered":"OpenAI&#8217;s new voice AI model gpt-4o-transcribe lets you add speech to your existing text apps in seconds"},"content":{"rendered":" \r\n<br><div>\n\t\t\t\t<div id=\"boilerplate_2682874\" class=\"post-boilerplate boilerplate-before\">\n<p><em>Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More<\/em><\/p>\n\n\n\n<hr class=\"wp-block-separator has-css-opacity is-style-wide\"\/>\n<\/div><p>OpenAI\u2018s voice AI models have gotten it into trouble before with actor Scarlett Johansson, but that isn\u2019t stopping the company from continuing to advance its offerings in this category.<\/p>\n\n\n\n<p>Today, the ChatGPT maker has unveiled three new proprietary voice models: gpt-4o-transcribe, gpt-4o-mini-transcribe and gpt-4o-mini-tts. These models will initially be available through the ChatGPT maker\u2019s application programming interface (API) for third-party software developers to build their own apps. 
They will also be available on a custom demo site, OpenAI.fm, that individual users can access for limited testing and fun.<\/p>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-4-3 wp-has-aspect-ratio\"><p>\n<iframe loading=\"lazy\" title=\"Audio Models in the API\" width=\"500\" height=\"281\" src=\"https:\/\/www.youtube.com\/embed\/lXb0L16ISAc?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/p><\/figure>\n\n\n\n<p>Moreover, the gpt-4o-mini-tts model voices can be customized from several pre-sets via text prompt to change their accents, pitch, tone and other vocal qualities, including conveying whatever emotions the user asks them to. That flexibility should go a long way toward addressing concerns that OpenAI is deliberately imitating any particular person\u2019s voice (the company previously denied that was the case with Johansson, but pulled down the ostensibly imitative voice option anyway). Now, it\u2019s up to the user to decide how they want their AI voice to sound when speaking back.<\/p>\n\n\n\n<p>In a demo with VentureBeat delivered over a video call, OpenAI technical staff member Jeff Harris showed how, using text alone on the demo site, a user could get the same voice to sound like a cackling mad scientist or a zen, calm yoga teacher.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-discovering-and-refining-new-capabilities-within-gpt-4o-base\">Discovering and refining new capabilities within GPT-4o base<\/h2>\n\n\n\n<p>The models are variants of the existing GPT-4o model, which OpenAI launched back in May 2024 and which currently powers the ChatGPT text and voice experience for many users. The company took that base model and post-trained it with additional data to make it excel at transcription and speech. 
The company didn\u2019t specify when the models might come to ChatGPT.<\/p>\n\n\n\n<p>\u201cChatGPT has slightly different requirements in terms of cost and performance trade-offs, so while I expect they will move to these models in time, for now, this launch is focused on API users,\u201d Harris said. <\/p>\n\n\n\n<p>The new family is meant to supersede OpenAI\u2019s two-year-old open-source Whisper speech-to-text model, offering lower word error rates across industry benchmarks and improved performance in noisy environments, with diverse accents, and at varying speech speeds across 100+ languages.<\/p>\n\n\n\n<p>The company posted a chart on its website showing just how much lower the gpt-4o-transcribe models\u2019 error rates are at identifying words across 33 languages compared to Whisper \u2014 with an impressively low 2.46% in English.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img fetchpriority=\"high\" decoding=\"async\" width=\"949\" height=\"515\" src=\"https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/03\/openai-gpt-4o-transcribe-benchmarks.png?w=800\" alt=\"\" class=\"wp-image-3001240\" srcset=\"https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/03\/openai-gpt-4o-transcribe-benchmarks.png 949w, https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/03\/openai-gpt-4o-transcribe-benchmarks.png?resize=300,163 300w, https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/03\/openai-gpt-4o-transcribe-benchmarks.png?resize=768,417 768w, https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/03\/openai-gpt-4o-transcribe-benchmarks.png?resize=800,434 800w, https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/03\/openai-gpt-4o-transcribe-benchmarks.png?resize=400,217 400w, https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/03\/openai-gpt-4o-transcribe-benchmarks.png?resize=750,407 750w, https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/03\/openai-gpt-4o-transcribe-benchmarks.png?resize=578,314 578w, 
https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/03\/openai-gpt-4o-transcribe-benchmarks.png?resize=930,505 930w\" sizes=\"(max-width: 949px) 100vw, 949px\"\/><\/figure>\n\n\n\n<p>\u201cThese models include noise cancellation and a semantic voice activity detector, which helps determine when a speaker has finished a thought, improving transcription accuracy,\u201d said Harris.<\/p>\n\n\n\n<p>Harris told VentureBeat that the new gpt-4o-transcribe model family is not designed to offer \u201cdiarization,\u201d or the capability to label and differentiate between different speakers. Instead, it is designed primarily to receive one voice (or possibly several voices) as a single input channel and respond to all inputs with a single output voice in that interaction, however long it takes. <\/p>\n\n\n\n<p>The company is <span style=\"box-sizing: border-box; margin: 0px; padding: 0px;\">also hosting a competition for the general public to find the most creative examples of using its demo voice site OpenAI.fm and share them online by tagging the\u00a0@openAI account on X. The winner will receive a custom Teenage Engineering radio with the\u00a0<\/span>OpenAI logo, which OpenAI Head of Product, Platform Olivier Godement said is one of only three in the world.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-an-audio-applications-gold-mine\">An audio applications gold mine<\/h2>\n\n\n\n<p>These enhancements make the new models particularly well-suited for applications such as customer call centers, meeting note transcription, and AI-powered assistants. 
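<\/p>

<p>To make the integration concrete, here is a minimal, hedged sketch of what a gpt-4o-transcribe call involves, using only the Python standard library. It follows OpenAI's documented audio transcription endpoint (a multipart POST of a model name and an audio file), but it only builds the request rather than sending it, since a real call needs a funded API key; the file name and boundary string are illustrative.<\/p>

```python
# A minimal sketch of the request behind a gpt-4o-transcribe call, built
# with only the standard library. It is constructed but never sent, since
# sending requires a real OPENAI_API_KEY.
import os
import urllib.request

API_URL = "https://api.openai.com/v1/audio/transcriptions"

def build_transcription_request(audio_bytes: bytes,
                                model: str = "gpt-4o-transcribe") -> urllib.request.Request:
    """Construct (but do not send) a multipart transcription request."""
    boundary = "sketch-boundary-7f3a"  # illustrative boundary string
    body = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="file"; filename="audio.wav"\r\n'
        "Content-Type: audio/wav\r\n\r\n"
    ).encode("utf-8") + audio_bytes + f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', '')}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
        method="POST",
    )

# Build a request for a (fake) audio payload; req.data holds the POST body.
req = build_transcription_request(b"fake-wav-bytes")
```

<p>In practice, developers would typically reach for OpenAI's official SDKs, which wrap this endpoint in a single call; the raw request is shown only to make the moving parts visible.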
<\/p>\n\n\n\n<p>Impressively, the Agents SDK the company launched last week also allows developers who have already built apps atop its text-based large language models, like the regular GPT-4o, to add fluid voice interactions with only about \u201cnine lines of code,\u201d according to a presenter during an OpenAI YouTube livestream announcing the new models (embedded above).<\/p>\n\n\n\n<p>For example, an e-commerce app built atop GPT-4o could now respond to turn-based user questions like \u201cTell me about my last orders\u201d in speech after just a few seconds of code changes to add these new models.<\/p>\n\n\n\n<p>\u201cFor the first time, we\u2019re introducing streaming speech-to-text, allowing developers to continuously input audio and receive a real-time text stream, making conversations feel more natural,\u201d Harris said.<\/p>\n\n\n\n<p>Still, for those devs looking for low-latency, real-time AI voice experiences, OpenAI recommends using its speech-to-speech models in the Realtime API.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-pricing-and-availability\">Pricing and availability<\/h2>\n\n\n\n<p>The new models are available immediately via OpenAI\u2019s API, with pricing as follows:<\/p>\n\n\n\n<p>\u2022 <strong>gpt-4o-transcribe:<\/strong> $6.00 per 1M audio input tokens (~$0.006 per minute)<\/p>\n\n\n\n<p>\u2022 <strong>gpt-4o-mini-transcribe:<\/strong> $3.00 per 1M audio input tokens (~$0.003 per minute)<\/p>\n\n\n\n<p>\u2022 <strong>gpt-4o-mini-tts:<\/strong> $0.60 per 1M text input tokens, $12.00 per 1M audio output tokens (~$0.015 per minute)<\/p>\n\n\n\n<p>However, they arrive <span style=\"box-sizing: border-box; margin: 0px; padding: 0px;\">at a time of fiercer-than-ever competition in the AI transcription and speech space, with dedicated speech AI firms such as\u00a0ElevenLabs offering their new Scribe model,\u00a0which supports diarization and boasts a similarly reduced, though not quite as low, word error rate of 3.3% in English. 
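<\/span><\/p>

<p>Those per-minute approximations follow directly from the token prices. A small sanity check, assuming the roughly 1,000 audio tokens per minute implied by OpenAI's own pairing of $6.00 per million tokens with about $0.006 per minute (an inference for illustration, not a published constant):<\/p>

```python
# Sanity-checking the quoted per-minute transcription prices. The
# audio-tokens-per-minute figure is inferred from OpenAI pairing
# $6.00 per 1M audio tokens with roughly $0.006 per minute; treat it
# as an assumption, not a documented constant.
PRICE_PER_MILLION_AUDIO_TOKENS = {
    "gpt-4o-transcribe": 6.00,
    "gpt-4o-mini-transcribe": 3.00,
}
AUDIO_TOKENS_PER_MINUTE = 1_000  # implied by $6.00/1M tokens ~= $0.006/min

def cost_per_minute(model: str) -> float:
    """Approximate dollars per minute of input audio for a given model."""
    per_token = PRICE_PER_MILLION_AUDIO_TOKENS[model] / 1_000_000
    return per_token * AUDIO_TOKENS_PER_MINUTE

# One hour of audio through gpt-4o-transcribe comes to about $0.36.
hourly_cost = 60 * cost_per_minute("gpt-4o-transcribe")
```

<p>At these rates, an hour of audio through gpt-4o-transcribe runs about $0.36, a useful scale when reading competitors' hourly prices.<\/p>

<p><span style=\"box-sizing: border-box; margin: 0px; padding: 0px;\">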
It is priced at<\/span> $0.40 per hour of input audio (or $0.006 per minute, roughly equivalent).<\/p>\n\n\n\n<p><span style=\"box-sizing: border-box; margin: 0px; padding: 0px;\">Another startup,\u00a0Hume AI, offers a new model, Octave TTS,\u00a0with sentence-level and even word-level customization of pronunciation and emotional inflection \u2014 based entirely on the user\u2019s instructions, not any pre-set voices.<\/span> The pricing of Octave TTS isn\u2019t directly comparable, but there is a free tier offering 10 minutes of audio, and costs increase from there.<\/p>\n\n\n\n<p>Meanwhile, more advanced audio and speech models are also coming to the open source community, including one called Orpheus 3B, which is available with a permissive Apache 2.0 license, meaning developers don\u2019t have to pay any costs to run it \u2014 provided they have the right hardware or cloud servers.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-industry-adoption-and-early-results\">Industry adoption and early results<\/h2>\n\n\n\n<p>According to testimonials shared by OpenAI with VentureBeat, several companies have already integrated OpenAI\u2019s new audio models into their platforms, reporting significant improvements in voice AI performance.<\/p>\n\n\n\n<p>EliseAI, a company focused on property management automation, found that OpenAI\u2019s text-to-speech model enabled more natural and emotionally rich interactions with tenants. <\/p>\n\n\n\n<p>The enhanced voices made AI-powered leasing, maintenance, and tour scheduling more engaging, leading to higher tenant satisfaction and improved call resolution rates.<\/p>\n\n\n\n<p>Decagon, which builds AI-powered voice experiences, saw a 30% improvement in transcription accuracy using OpenAI\u2019s speech recognition model. <\/p>\n\n\n\n<p>This increase in accuracy has allowed Decagon\u2019s AI agents to perform more reliably in real-world scenarios, even in noisy environments. 
The integration process was quick, with Decagon incorporating the new model into its system within a day.<\/p>\n\n\n\n<p>Not all reactions to OpenAI\u2019s latest release have been warm. Ben Hylak (@benhylak), co-founder of the AI app analytics software Dawn and a former Apple human interfaces designer, posted on X that while the models seem promising, the announcement \u201cfeels like a retreat from real-time voice,\u201d suggesting a shift away from OpenAI\u2019s previous focus on low-latency conversational AI via ChatGPT.<\/p>\n\n\n\n<p>Additionally, the launch was preceded by an early leak on X (formerly Twitter). TestingCatalog News (@testingcatalog) posted details on the new models several minutes before the official announcement, listing the names of gpt-4o-mini-tts, gpt-4o-transcribe, and gpt-4o-mini-transcribe. The leak was credited to @StivenTheDev, and the post quickly gained traction.<\/p>\n\n\n\n<p>Looking ahead, OpenAI plans to continue refining its audio models and exploring custom voice capabilities while ensuring safety and responsible AI use. Beyond audio, OpenAI is also investing in multimodal AI, including video, to enable more dynamic and interactive agent-based experiences.<\/p>\n\t\t\t<\/div>\r\n<br>\r\n<br><a href=\"https:\/\/venturebeat.com\/ai\/openais-new-voice-ai-models-gpt-4o-transcribe-let-you-add-speech-to-your-existing-text-apps-in-seconds\/\">Source link <\/a>","protected":false},"excerpt":{"rendered":"<p>Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More OpenAI\u2018s voice AI models have gotten it into trouble before with actor Scarlett Johansson, but that isn\u2019t stopping the company from continuing to advance its offerings in this category. Today, the ChatGPT maker has unveiled three [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":737,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[33],"tags":[],"class_list":["post-736","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-automation"],"aioseo_notices":[],"jetpack_featured_media_url":"https:\/\/violethoward.com\/new\/wp-content\/uploads\/2025\/03\/robot-answers-phone.png","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts\/736","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/comments?post=736
"}],"version-history":[{"count":0,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts\/736\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/media\/737"}],"wp:attachment":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/media?parent=736"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/categories?post=736"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/tags?post=736"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}<!-- This website is optimized by Airlift. Learn more: https://airlift.net. Template:. Learn more: https://airlift.net. Template: 69e302c146fa5c92dc28ac12. Config Timestamp: 2026-04-18 04:04:16 UTC, Cached Timestamp: 2026-04-28 23:15:43 UTC -->