
Largest Text-to-Speech AI Model Yet Shows 'Emergent Abilities'

Amazon researchers have trained the largest text-to-speech model ever created, and they say it exhibits "emergent" qualities that improve its ability to speak even complex sentences naturally. The breakthrough could be what the technology needs to escape the uncanny valley.

These models were always going to grow and improve, but the researchers were specifically hoping to see the kind of leap in capability observed once language models grew beyond a certain size. For reasons not yet understood, once LLMs pass a certain point they become much more robust and versatile, capable of performing tasks they were not trained for.

That doesn't mean they're gaining sentience or anything like that; it's just that past a certain point their performance on certain conversational AI tasks takes off like a hockey stick. The Amazon AGI team (it's no secret what they're aiming for) figured the same thing might happen as text-to-speech models grow, and their research suggests that this is, in fact, the case.

The new model is called Big Adaptive Streamable TTS with Emergent abilities, which the team has shortened to the abbreviation BASE TTS. The largest version of the model was trained on 100,000 hours of public domain speech, 90% of which is in English and the rest in German, Dutch and Spanish.

At 980 million parameters, BASE-large appears to be the largest model in this category. For comparison, they also trained 400M- and 150M-parameter models on 10,000 and 1,000 hours of audio respectively. The idea is that if one of these models shows emergent behaviors and another does not, you have bracketed the range where those behaviors begin to emerge.
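The bracketing logic described here can be sketched in a few lines. This is purely illustrative: `emergence_bracket` is a hypothetical helper, and the True/False labels per model size are placeholders, not the paper's measurements.

```python
# Illustrative sketch of the bracketing idea: if emergent behavior appears at
# one model size but not the next size down, the threshold lies between them.
from typing import Dict, Optional, Tuple

def emergence_bracket(observed: Dict[int, bool]) -> Optional[Tuple[int, int]]:
    """Given {parameter_count: shows_emergent_behavior}, return the
    (largest_size_without, smallest_size_with) pair bracketing the threshold,
    or None if no transition is observed."""
    sizes = sorted(observed)
    for smaller, larger in zip(sizes, sizes[1:]):
        if not observed[smaller] and observed[larger]:
            return (smaller, larger)
    return None

# If, say, the 150M model lacked the behaviors while the 400M and 980M models
# showed them, the threshold would fall between 150M and 400M parameters:
bracket = emergence_bracket({150_000_000: False, 400_000_000: True, 980_000_000: True})
```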

It turns out that the medium-sized model showed the leap in capability the team was looking for, not necessarily in ordinary speech quality (it is rated better, but only by a couple of points), but in the set of emergent abilities they observed and measured. Here are examples of the challenging text mentioned in the paper:

  • Compound nouns: The Beckhams decided to rent a charming stone-built holiday home in the countryside.
  • Emotions: "Oh my God! Are we really going to the Maldives? That is incredible!" Jennie squealed, bouncing on the balls of her feet with uncontrollable joy.
  • Foreign words: "Mr. Henry, famous for his mise en place, orchestrated a seven-course meal, each of which was a pièce de résistance."
  • Paralinguistics (i.e. non-word vocalizations): "Shh, Lucy, shhh, we mustn't wake your little brother," whispered Tom, as they tiptoed through the nursery.
  • Punctuations: Received a strange text message from his brother: 'Emergency at home; call as soon as possible! Mom and Dad are worried...#familyissues.'
  • Questions: But the question about Brexit remains: after all the trials and tribulations, will ministers find the answers in time?
  • Syntactic complexities: The film starring De Moya, recently awarded the lifetime achievement award, was a box office success in 2022, despite mixed reviews.

"These sentences are designed to contain challenging tasks: parsing difficult sentences, placing emphasis on long compound nouns, producing emotional or whispered speech, or producing the correct phonemes for foreign words like 'qi' or punctuations like '@', none of which BASE TTS is explicitly trained to perform," the authors write.

These features typically trip up text-to-speech engines, which mispronounce words, skip them, use odd intonations, or commit some other blunder. BASE TTS still had trouble, but it did far better than contemporaries such as Tortoise and VALL-E.
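The paper's seven categories lend themselves to a small evaluation harness: run one sentence per category through an engine and collect the outputs for rating. The sketch below is a hedged stand-in, since BASE TTS itself is not released; `synthesize` is a hypothetical placeholder for a real TTS API, and the sentences are abbreviated from the examples above.

```python
# Hedged sketch: one abbreviated sentence per challenge category from the paper.
CHALLENGE_SET = {
    "compound_nouns": "The Beckhams decided to rent a charming stone-built holiday home.",
    "emotions": '"Oh my God! Are we really going to the Maldives?" Jennie squealed.',
    "foreign_words": "Mr. Henry, famous for his mise en place, orchestrated a seven-course meal.",
    "paralinguistics": '"Shh, Lucy, shhh, we mustn\'t wake your little brother," whispered Tom.',
    "punctuations": "'Emergency at home; call as soon as possible! ...#familyissues.'",
    "questions": "Will ministers find the answers in time?",
    "syntactic_complexities": "The film starring De Moya was a box office success in 2022.",
}

def synthesize(text: str) -> bytes:
    """Hypothetical TTS call; a real engine would return audio samples,
    not text bytes. This placeholder just echoes the input."""
    return text.encode("utf-8")

def run_challenge_set(tts=synthesize) -> dict:
    """Run every category through the engine and collect raw outputs,
    ready for human or automated per-category rating."""
    return {category: tts(text) for category, text in CHALLENGE_SET.items()}
```

Swapping `synthesize` for a call into any real engine (Tortoise, VALL-E, or a future BASE TTS release) would let the same harness compare models category by category.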

There are plenty of examples of these difficult texts spoken quite naturally by the new model on the site built to showcase it. Of course, these were chosen by the researchers, so they are necessarily cherry-picked, but it's impressive nonetheless.

Because all three BASE TTS models share an architecture, it seems clear that the size of the model and the extent of its training data are the cause of its ability to handle some of the above complexities. Keep in mind that this is still an experimental model and process, not a commercial one. Further research will need to identify the tipping point for the emergent capability and how to train and deploy the resulting model efficiently.

In particular, this model is "streamable," as its name implies: rather than outputting entire sentences at once, it proceeds moment by moment at a relatively low bitrate. The team also attempted to package speech metadata, such as emotionality and prosody, into a separate low-bandwidth stream that can accompany the basic audio.
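The streaming idea can be sketched as a generator that yields short audio chunks, each paired with a low-bandwidth metadata record, so a client can begin playback before the sentence is finished. All names here are illustrative assumptions, not Amazon's API, and the "audio" is placeholder bytes.

```python
# Hedged sketch of streamable TTS output: incremental chunks plus a
# low-bitrate metadata side channel (emotion, prosody), as described above.
from dataclasses import dataclass
from typing import Iterator

@dataclass
class Chunk:
    audio: bytes    # one short time slice of audio (placeholder bytes here)
    metadata: dict  # low-bandwidth side channel, e.g. emotion and offset

def stream_tts(text: str, chunk_chars: int = 16) -> Iterator[Chunk]:
    """Toy generator that yields output incrementally, mimicking
    moment-by-moment streaming instead of whole-sentence synthesis."""
    for i in range(0, len(text), chunk_chars):
        piece = text[i:i + chunk_chars]
        yield Chunk(audio=piece.encode("utf-8"),
                    metadata={"emotion": "neutral", "offset": i})

# A consumer can start playback as soon as the first chunk arrives:
chunks = list(stream_tts("Shh, Lucy, shhh, we mustn't wake your little brother."))
```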

It looks like text-to-speech models may have a breakout moment in 2024, just in time for the election! But the usefulness of this technology cannot be denied, particularly for accessibility. The team notes that it declined to publish the model's source and other data due to the risk of bad actors taking advantage of it, though the secret is likely to get out sooner or later.
