<h1>Voice AI that actually converts: New TTS model boosts sales 15% for major brands</h1>

<p>Generating voices that are not only humanlike and nuanced but <em>diverse</em> continues to be a struggle in conversational AI.</p>

<p>At the end of the day, people want to hear voices that sound like them, or are at least natural, not just the 20th-century American broadcast standard.</p>

<p>Startup Rime is tackling this challenge with Arcana, a new text-to-speech (TTS) spoken language model that can quickly generate “infinite” new voices of varying genders, ages, demographics and languages based on a simple text description of the intended characteristics.</p>

<p>The model has helped boost customer sales by 15% for the likes of Domino’s and Wingstop.</p>

<p>“It’s one thing to have a really high-quality, lifelike, real-person-sounding model,” Lily Clifford, Rime CEO and co-founder, told VentureBeat. “It’s another to have a model that can create not just one voice, but infinite variability of voices along demographic lines.”</p>

<h2 class="wp-block-heading" id="h-a-voice-model-that-acts-human-nbsp">A voice model that ‘acts human’</h2>

<p>Rime’s multimodal, autoregressive TTS model was trained on natural conversations with real people (as opposed to voice actors). Users simply type a text prompt describing a voice with the desired demographic characteristics and language.</p>

<p>For instance: “I want a 30-year-old female who lives in California and is into software,” or “Give me an Australian man’s voice.”</p>

<figure class="wp-block-image size-large"><img src="https://venturebeat.com/wp-content/uploads/2025/06/Screenshot-51.png?w=800" alt="" width="955" height="353" class="wp-image-3010541" /></figure>

<p>“Every time you do that, you’re going to get a different voice,” said Clifford.</p>

<p>Rime’s Mist v2 TTS model, meanwhile, was built for high-volume, business-critical applications, allowing enterprises to craft unique voices for their business needs. “The customer hears a voice that allows for a natural, dynamic conversation without needing a human agent,” said Clifford.</p>

<p>For those looking for out-of-the-box options, Rime offers eight flagship speakers with unique characteristics:</p>

<ul class="wp-block-list">
<li>Luna (female, chill but excitable, Gen-Z optimist)</li>
<li>Celeste (female, warm, laid-back, fun-loving)</li>
<li>Orion (male, older, African-American, happy)</li>
<li>Ursa (male, 20 years old, encyclopedic knowledge of 2000s emo music)</li>
<li>Astra (female, young, wide-eyed)</li>
<li>Esther (female, older, Chinese-American, loving)</li>
<li>Estelle (female, middle-aged, African-American, sounds so sweet)</li>
<li>Andromeda (female, young, breathy, yoga vibes)</li>
</ul>

<p>The model can switch between languages, and it can whisper, be sarcastic and even mocking. Arcana can also insert laughter into speech when given the token <code>&lt;laugh&gt;</code>, returning varied, realistic outputs, from “a small chuckle to a big guffaw,” Rime says. The model also interprets <code>&lt;chuckle&gt;</code>, <code>&lt;sigh&gt;</code> and even <code>&lt;hum&gt;</code> correctly, even though it wasn’t explicitly trained to do so.</p>

<p>“It infers emotion from context,” Rime writes in a technical paper. “It laughs, sighs, hums, audibly breathes and makes subtle mouth noises. It says ‘um’ and other disfluencies naturally. It has emergent behaviors we are still discovering. In short, it acts human.”</p>

<h2 class="wp-block-heading" id="h-capturing-natural-conversations">Capturing natural conversations</h2>

<p>Rime’s model generates audio tokens that are decoded into speech using a codec-based approach, which Rime says allows for “faster-than-real-time synthesis.” At launch, time to first audio was 250 milliseconds, and public cloud latency was roughly 400 milliseconds.</p>

<p>Arcana was trained in three stages:</p>

<ul class="wp-block-list">
<li>Pre-training: Rime used open-source large language models (LLMs) as a backbone and pre-trained on a large corpus of text-audio pairs to help Arcana learn general linguistic and acoustic patterns.</li>
<li>Supervised fine-tuning: The model was then fine-tuned on a “massive” proprietary dataset.</li>
<li>Speaker-specific fine-tuning: Rime identified the speakers in its dataset whose conversations it found “most exemplary” for quality and reliability.</li>
</ul>

<p>Rime’s data incorporates sociolinguistic conversation techniques (factoring in social context like class, gender and location), idiolect (individual speech habits) and paralinguistic nuances (the non-verbal aspects of communication that accompany speech).</p>

<p>The model was also trained on accent subtleties, filler words (those subconscious ‘uhs’ and ‘ums’), pauses, prosodic stress patterns (intonation, timing and the stressing of certain syllables) and multilingual code-switching (when multilingual speakers switch back and forth between languages).</p>

<p>The company has taken a unique approach to collecting all this data. Clifford explained that model builders typically gather snippets from voice actors, then create a model to reproduce the characteristics of that person’s voice based on text input. Or they’ll scrape audiobook data.</p>

<p>“Our approach was very different,” she explained. “It was, ‘How do we create the world’s largest proprietary dataset of conversational speech?’”</p>

<p>To do so, Rime built its own recording studio in a San Francisco basement and spent several months recruiting people off Craigslist and through word of mouth, or simply casually gathering themselves, friends and family. Rather than scripted conversations, they recorded natural conversations and chitchat.</p>

<p>They then annotated the voices with detailed metadata encoding gender, age, dialect, speech affect and language. This has allowed Rime to achieve 98 to 100% accuracy.</p>

<p>Clifford noted that they are constantly augmenting this dataset.</p>

<p>“How do we get it to sound personal? You’re never going to get there if you’re just using voice actors,” she said. “We did the insanely hard thing of collecting really naturalistic data. The huge secret sauce of Rime is that these aren’t actors. These are real people.”</p>

<h2 class="wp-block-heading" id="h-a-personalization-harness-that-creates-bespoke-voices">A ‘personalization harness’ that creates bespoke voices</h2>

<p>Rime intends to give customers the ability to find the voices that work best for their applications. It built a “personalization harness” tool that lets users run A/B tests with various voices. After a given interaction, the API reports back to Rime, which provides an analytics dashboard identifying the best-performing voices based on success metrics.</p>

<p>Of course, customers have different definitions of what constitutes a successful call. In food service, that might be upselling an order of fries or extra wings.</p>

<p>“The goal for us is: how do we create an application that makes it easy for our customers to run those experiments themselves?” said Clifford. “Because our customers aren’t voice casting directors, and neither are we. The challenge becomes how to make that personalization analytics layer really intuitive.”</p>

<p>Another KPI customers are maximizing is the caller’s willingness to talk to the AI. Rime has found that, after customers switch to its voices, callers are 4X more likely to talk to the bot.</p>

<p>“For the first time ever, people are like, ‘No, you don’t need to transfer me. I’m perfectly willing to talk to you,’” said Clifford. “Or, when they’re transferred, they say ‘Thank you.’” (In fact, 20% of callers are cordial when ending conversations with a bot.)</p>

<h2 class="wp-block-heading" id="h-powering-100-million-calls-a-month">Powering 100 million calls a month</h2>

<p>Rime counts Domino’s, Wingstop, ConverseNow and Ylopo among its customers, and it does a lot of work with large contact centers, enterprise developers building interactive voice response (IVR) systems and telecoms, Clifford noted.</p>

<p>“When we switched to Rime, we saw an immediate double-digit improvement in the likelihood of our calls succeeding,” said Akshay Kayastha, director of engineering at ConverseNow. “Working with Rime means we solve a ton of the last-mile problems that come up in shipping a high-impact application.”</p>

<p>Ylopo CPO Ge Juefeng noted that, for his company’s high-volume outbound application, it needs to build immediate trust with the consumer. “We tested every model on the market and found that Rime’s voices converted customers at the highest rate,” he reported.</p>

<p>Rime already helps power close to 100 million phone calls a month, said Clifford. “If you call Domino’s or Wingstop, there’s an 80 to 90% chance that you hear a Rime voice,” she said.</p>

<p>Looking ahead, Rime will push further into on-premises offerings to support low latency; in fact, the company anticipates that, by the end of 2025, 90% of its volume will be on-prem. “The reason for that is you’re never going to be as fast if you’re running these models in the cloud,” said Clifford.</p>

<p>Rime also continues to fine-tune its models to address other linguistic challenges, such as phrases the model has never encountered, like Domino’s tongue-tying “Meatza ExtravaganZZa.” As Clifford noted, even if a voice is personalized, natural and responds in real time, it’s going to fail if it can’t handle a company’s unique needs.</p>

<p>“There are still a lot of problems that our competitors see as last-mile problems, but that our customers see as first-mile problems,” said Clifford.</p>

<p><a href="https://venturebeat.com/ai/voice-ai-that-actually-converts-new-tts-model-boosts-sales-15-for-major-brands/">Source link</a></p>