{"id":1460,"date":"2025-04-27T12:40:57","date_gmt":"2025-04-27T12:40:57","guid":{"rendered":"https:\/\/violethoward.com\/new\/a-new-open-source-text-to-speech-model-called-dia-has-arrived-to-challenge-elevenlabs-openai-and-more\/"},"modified":"2025-04-27T12:40:57","modified_gmt":"2025-04-27T12:40:57","slug":"a-new-open-source-text-to-speech-model-called-dia-has-arrived-to-challenge-elevenlabs-openai-and-more","status":"publish","type":"post","link":"https:\/\/violethoward.com\/new\/a-new-open-source-text-to-speech-model-called-dia-has-arrived-to-challenge-elevenlabs-openai-and-more\/","title":{"rendered":"A new, open source text-to-speech model called Dia has arrived to challenge ElevenLabs, OpenAI and more"},"content":{"rendered":" \r\n
A two-person startup by the name of Nari Labs has introduced Dia, a 1.6 billion parameter text-to-speech (TTS) model designed to produce naturalistic dialogue directly from text prompts, and one of its creators claims it surpasses the performance of competing proprietary offerings from the likes of ElevenLabs and Google’s hit NotebookLM AI podcast generation product.
It could also threaten uptake of OpenAI’s recent gpt-4o-mini-tts.
“Dia rivals NotebookLM’s podcast feature while surpassing ElevenLabs Studio and Sesame’s open model in quality,” said Toby Kim, one of the co-creators of Nari and Dia, in a post from his account on the social network X.
In a separate post, Kim noted that the model was built with “zero funding,” and added across a thread: “…we were not AI experts from the beginning. It all started when we fell in love with NotebookLM’s podcast feature when it was released last year. We wanted more—more control over the voices, more freedom in the script. We tried every TTS API on the market. None of them sounded like real human conversation.”
Kim further credited Google for giving him and his collaborator access to the company’s Tensor Processing Units (TPUs) for training Dia through Google’s Research Cloud.
Dia’s code and weights (the model’s internal connection settings) are now available for download and local deployment by anyone from Hugging Face or GitHub. Individual users can try generating speech from it on a Hugging Face Space.
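For readers who want to run the model themselves, the following is a minimal sketch of local generation. The import path, checkpoint name, and method names mirror the example published in the nari-labs/dia repository at the time of writing; treat them as assumptions rather than a stable API.

```python
# Minimal local-generation sketch. The import path, checkpoint name,
# and method names follow the nari-labs/dia README and may change.
import soundfile as sf

from dia.model import Dia

# Downloads the 1.6B-parameter weights from Hugging Face on first use.
model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# Generate a short two-speaker exchange directly from plain text.
audio = model.generate("[S1] Hello there. [S2] Hi, nice to meet you.")

# The model produces 44.1 kHz audio; save it as a WAV file.
sf.write("dialogue.wav", audio, 44100)
```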
Advanced controls and more customizable features
Dia supports nuanced features like emotional tone, speaker tagging, and nonverbal audio cues, all from plain text.

Users can mark speaker turns with tags like [S1] and [S2], and include cues like (laughs), (coughs), or (clears throat) to enrich the resulting dialogue with nonverbal behaviors.

These tags are correctly interpreted by Dia during generation, something not reliably supported by other available models, according to the company’s examples page.
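To illustrate, a script using the tag and cue syntax described above might look like the following sketch; it reuses the hypothetical loading code from the earlier example, and the exact set of supported cues should be checked against Nari Labs’ examples page.

```python
import soundfile as sf

from dia.model import Dia  # assumed import path, per the project README

model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# [S1]/[S2] mark speaker turns; parenthesized cues request nonverbal
# sounds rather than being read aloud as words.
script = (
    "[S1] Did you hear the new demo? (gasps) It sounds like a real podcast. "
    "[S2] I know! (laughs) And it came from a two-person team. "
    "[S1] (clears throat) Alright, let's try cloning a voice next."
)

audio = model.generate(script)
sf.write("tagged_dialogue.wav", audio, 44100)
```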
The model is currently English-only and not tied to any single speaker’s voice, producing different voices per run unless users fix the generation seed or provide an audio prompt. Audio conditioning, or voice cloning, lets users guide speech tone and voice likeness by uploading a sample clip.

Nari Labs offers example code to facilitate this process and a Gradio-based demo so users can try it without setup.
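The repository’s example code is the authoritative reference; as a rough sketch, audio-prompted generation might look something like this. The `audio_prompt` keyword and the convention of prepending the reference clip’s transcript to the script are assumptions based on the project’s published examples, and fixing the random seed is one way to keep the otherwise randomly chosen voice stable across runs.

```python
import torch
import soundfile as sf

from dia.model import Dia  # assumed import path, per the project README

torch.manual_seed(42)  # fixed seed: keeps the randomly chosen voice stable

model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# Assumption: the transcript of the reference clip is prepended to the
# new lines, and the clip itself is passed as an audio prompt.
script = (
    "[S1] This sentence is the transcript of my reference recording. "
    "[S1] And this is the new line I want spoken in the same voice."
)

audio = model.generate(script, audio_prompt="reference_clip.wav")  # assumed kwarg
sf.write("cloned_voice.wav", audio, 44100)
```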
Comparison with ElevenLabs and Sesame
Nari offers a host of example audio files generated by Dia on its Notion website, comparing it to other leading text-to-speech rivals, specifically ElevenLabs Studio and Sesame CSM-1B, the latter a new model from Oculus VR headset co-creator Brendan Iribe that went somewhat viral on X earlier this year.
Side-by-side examples shared by Nari Labs show how Dia outperforms the competition in several areas:

In standard dialogue scenarios, Dia handles both natural timing and nonverbal expressions better. For example, in a script ending with (laughs), Dia interprets and delivers actual laughter, whereas ElevenLabs and Sesame output textual substitutions like “haha”.