{"id":1460,"date":"2025-04-27T12:40:57","date_gmt":"2025-04-27T12:40:57","guid":{"rendered":"https:\/\/violethoward.com\/new\/a-new-open-source-text-to-speech-model-called-dia-has-arrived-to-challenge-elevenlabs-openai-and-more\/"},"modified":"2025-04-27T12:40:57","modified_gmt":"2025-04-27T12:40:57","slug":"a-new-open-source-text-to-speech-model-called-dia-has-arrived-to-challenge-elevenlabs-openai-and-more","status":"publish","type":"post","link":"https:\/\/violethoward.com\/new\/a-new-open-source-text-to-speech-model-called-dia-has-arrived-to-challenge-elevenlabs-openai-and-more\/","title":{"rendered":"A new, open source text-to-speech model called Dia has arrived to challenge ElevenLabs, OpenAI and more"},"content":{"rendered":" \r\n<br><div>\n\t\t\t\t<div id=\"boilerplate_2682874\" class=\"post-boilerplate boilerplate-before\">\n<p><em>Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More<\/em><\/p>\n\n\n\n<hr class=\"wp-block-separator has-css-opacity is-style-wide\"\/>\n<\/div><p>A two-person startup by the name of Nari Labs has introduced Dia, a 1.6 billion parameter text-to-speech (TTS) model designed to produce naturalistic dialogue directly from text prompts \u2014 and one of its creators claims it surpasses the performance of competing proprietary offerings from the likes of ElevenLabs and Google\u2019s hit NotebookLM AI podcast generation product.<\/p>\n\n\n\n<p>It could also threaten uptake of OpenAI\u2019s recent gpt-4o-mini-tts.<\/p>\n\n\n\n<p>\u201cDia rivals NotebookLM\u2019s podcast feature while surpassing ElevenLabs Studio and Sesame\u2019s open model in quality,\u201d said Toby Kim, one of the co-creators of Nari and Dia, in a post on the social network X.<\/p>\n\n\n\n<p>In a separate post, Kim noted that the model was built with \u201czero funding,\u201d and added across a thread: \u201c\u2026we were not AI experts from the beginning. 
It all started when we fell in love with NotebookLM\u2019s podcast feature when it was released last year. We wanted more\u2014more control over the voices, more freedom in the script. We tried every TTS API on the market. None of them sounded like real human conversation.\u201d<\/p>\n\n\n\n<p>Kim further credited Google for giving him and his collaborator access to the company\u2019s Tensor Processing Unit chips (TPUs) for training Dia through Google\u2019s Research Cloud.<\/p>\n\n\n\n<p>Dia\u2019s code and weights \u2014 the numerical parameters the model learned during training \u2014 are now available for download and local deployment by anyone from Hugging Face or GitHub. Individual users can try generating speech from it on a Hugging Face Space. <\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-advanced-controls-and-more-customizable-features\">Advanced controls and more customizable features<\/h2>\n\n\n\n<p>Dia supports nuanced features like emotional tone, speaker tagging, and nonverbal audio cues\u2014all from plain text. <\/p>\n\n\n\n<p>Users can mark speaker turns with tags like [S1] and [S2], and include cues like (laughs), (coughs), or (clears throat) to enrich the resulting dialogue with nonverbal behaviors. <\/p>\n\n\n\n<p>These tags are correctly interpreted by Dia during generation\u2014something not reliably supported by other available models, according to the company\u2019s examples page.<\/p>\n\n\n\n<p>The model is currently English-only and not tied to any single speaker\u2019s voice, producing different voices per run unless users fix the generation seed or provide an audio prompt. Audio conditioning, or voice cloning, lets users guide speech tone and voice likeness by uploading a sample clip. 
<\/p>\n\n\n\n<p>Nari Labs offers example code to facilitate this process and a Gradio-based demo so users can try it without setup.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-comparison-with-elevenlabs-and-sesame\">Comparison with ElevenLabs and Sesame<\/h2>\n\n\n\n<p>Nari offers a host of example audio files generated by Dia on its Notion website, comparing it to other leading text-to-speech rivals, specifically ElevenLabs Studio and Sesame CSM-1B, the latter a new text-to-speech model from Oculus VR headset co-creator Brendan Iribe that went somewhat viral on X earlier this year. <\/p>\n\n\n\n<p>Side-by-side examples shared by Nari Labs show how Dia outperforms the competition in several areas:<\/p>\n\n\n\n<p>In standard dialogue scenarios, Dia handles both natural timing and nonverbal expressions better. For example, in a script ending with (laughs), Dia interprets and delivers actual laughter, whereas ElevenLabs and Sesame output textual substitutions like \u201chaha\u201d.<\/p>\n\n\n\n<p>Here\u2019s Dia\u2026<\/p>\n\n\n\n<figure class=\"wp-block-audio\"><audio controls=\"\" src=\"https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/04\/audio_8.wav\"\/><\/figure>\n\n\n\n<p>\u2026and the same sentence spoken by ElevenLabs Studio<\/p>\n\n\n\n<figure class=\"wp-block-audio\"><audio controls=\"\" src=\"https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/04\/ElevenLabs_Untitled_Project_1.mp3\"\/><\/figure>\n\n\n\n<p>In multi-turn conversations with emotional range, Dia demonstrates smoother transitions and tone shifts. One test included a dramatic, emotionally charged emergency scene. Dia rendered the urgency and speaker stress effectively, while competing models often flattened delivery or lost pacing.<\/p>\n\n\n\n<p>Dia uniquely handles nonverbal-only scripts, such as a humorous exchange involving coughs, sniffs, and laughs. 
Competing models failed to recognize these tags or skipped them entirely.<\/p>\n\n\n\n<p>Even with rhythmically complex content like rap lyrics, Dia generates fluid, performance-style speech that maintains tempo. This contrasts with more monotone or disjointed outputs from ElevenLabs and Sesame\u2019s 1B model.<\/p>\n\n\n\n<p>Using audio prompts, Dia can extend or continue a speaker\u2019s voice style into new lines. An example using a conversational clip as a seed showed how Dia carried vocal traits from the sample through the rest of the scripted dialogue. This feature isn\u2019t robustly supported in other models.<\/p>\n\n\n\n<p>In one set of tests, Nari Labs noted that Sesame\u2019s best website demo likely used an internal 8B version of the model rather than the public 1B checkpoint, resulting in a gap between advertised and actual performance.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-model-access-and-tech-specs\">Model access and tech specs<\/h2>\n\n\n\n<p>Developers can access Dia from Nari Labs\u2019 GitHub repository and its Hugging Face model page. <\/p>\n\n\n\n<p>The model runs on PyTorch 2.0+ and CUDA 12.6 and requires about 10GB of VRAM. <\/p>\n\n\n\n<p>Inference on workstation-class GPUs like the NVIDIA A4000 delivers roughly 40 tokens per second. <\/p>\n\n\n\n<p>While the current version runs only on GPUs, Nari plans to offer CPU support and a quantized release to improve accessibility.<\/p>\n\n\n\n<p>The startup offers both a Python library and a CLI tool to further streamline deployment. <\/p>\n\n\n\n<p>Dia\u2019s flexibility opens use cases from content creation to assistive technologies and synthetic voiceovers. <\/p>\n\n\n\n<p>Nari Labs is also developing a consumer version of Dia aimed at casual users looking to remix or share generated conversations. 
Interested users can sign up via email to a waitlist for early access.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-fully-open-source\">Fully open source<\/h2>\n\n\n\n<p>The model is distributed under a fully open source Apache 2.0 license, which means it can be used for commercial purposes \u2014 something that will obviously appeal to enterprises or indie app developers.<\/p>\n\n\n\n<p>Nari Labs explicitly prohibits usage that includes impersonating individuals, spreading misinformation, or engaging in illegal activities. The team encourages responsible experimentation and has taken a stance against unethical deployment.<\/p>\n\n\n\n<p>Dia\u2019s development credits support from the Google TPU Research Cloud, Hugging Face\u2019s ZeroGPU grant program, and prior work on SoundStorm, Parakeet, and Descript Audio Codec. <\/p>\n\n\n\n<p>Nari Labs itself comprises just two engineers\u2014one full-time and one part-time\u2014but they actively invite community contributions through their Discord server and GitHub.<\/p>\n\n\n\n<p>With a clear focus on expressive quality, reproducibility, and open access, Dia adds a distinctive new voice to the landscape of generative speech models.<\/p>\n<div id=\"boilerplate_2660155\" class=\"post-boilerplate boilerplate-after\"><div class=\"Boilerplate__newsletter-container vb\">\n<div class=\"Boilerplate__newsletter-main\">\n<p><strong>Daily insights on business use cases with VB Daily<\/strong><\/p>\n<p class=\"copy\">If you want to impress your boss, VB Daily has you covered. We give you the inside scoop on what companies are doing with generative AI, from regulatory shifts to practical deployments, so you can share insights for maximum ROI.<\/p>\n<p class=\"Form__newsletter-legal\">Read our Privacy Policy<\/p>\n<p class=\"Form__success\" id=\"boilerplateNewsletterConfirmation\">\n\t\t\t\t\tThanks for subscribing. 
Check out more VB newsletters here.\n\t\t\t\t<\/p>\n<p class=\"Form__error\">An error occurred.<\/p>\n<\/p><\/div>\n<div class=\"image-container\">\n\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/venturebeat.com\/wp-content\/themes\/vb-news\/brand\/img\/vb-daily-phone.png\" alt=\"\"\/>\n\t\t\t\t<\/div>\n<\/p><\/div>\n<\/div>\t\t\t<\/div>\r\n<br>\r\n<br><a href=\"https:\/\/venturebeat.com\/ai\/a-new-open-source-text-to-speech-model-called-dia-has-arrived-to-challenge-elevenlabs-openai-and-more\/\">Source link <\/a>","protected":false},"excerpt":{"rendered":"<p>Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More A two-person startup by the name of Nari Labs has introduced Dia, a 1.6 billion parameter text-to-speech (TTS) model designed to produce naturalistic dialogue directly from text prompts \u2014 and one of its creators claims it [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":1461,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[33],"tags":[],"class_list":["post-1460","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-automation"],"aioseo_notices":[],"jetpack_featured_media_url":"https:\/\/violethoward.com\/new\/wp-content\/uploads\/2025\/04\/cfr0z3n_vector_art_line_art_flat_illustration_graphic_novel_s_fb1d5b43-fc73-495d-a097-b7a9a874c81e_0.png","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts\/1460","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href
":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/comments?post=1460"}],"version-history":[{"count":0,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts\/1460\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/media\/1461"}],"wp:attachment":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/media?parent=1460"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/categories?post=1460"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/tags?post=1460"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}