ElevenLabs, an AI startup that just raised a $180 million funding round, is primarily known for its audio-generating capabilities. The company took a step in another technological direction by launching its first standalone speech-to-text model called Scribe.
The startup, valued at $3.3 billion, has helped many other companies provide text-to-speech services through its vast library of voices. However, the company is now looking to enter speech recognition and compete with others in this space such as Gladia, Speechmatics, AssemblyAI, Deepgram and OpenAI's Whisper models.
ElevenLabs' Scribe model supports over 99 languages at launch. The company rates more than 25 of them as having excellent accuracy, which it defines as a word error rate below 5%. This list includes English (97% accuracy), French, German, Hindi, Indonesian, Japanese, Kannada, Malayalam, Polish, Portuguese, Spanish, and Vietnamese. The company groups the other languages into lower accuracy tiers: high (5-10% word error rate), good (10-20% word error rate), and moderate (25-50% word error rate).
The company said the model outperformed Google Gemini 2.0 Flash and Whisper Large V3 across multiple languages in the FLEURS and Common Voice benchmarks.
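For readers unfamiliar with the metric behind these accuracy tiers, word error rate is conventionally computed as the minimum number of word substitutions, deletions, and insertions needed to turn the transcript into the reference text, divided by the number of words in the reference. The sketch below illustrates that standard definition in Python. It is not ElevenLabs' evaluation code, the function name `word_error_rate` is chosen here for illustration, and real benchmarks typically normalize casing and punctuation first, which this sketch does not do.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return word-level edit distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One wrong word out of six gives a word error rate of about 16.7%.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Under this definition, a figure such as 97% accuracy corresponds roughly to a 3% word error rate, assuming accuracy is reported as one minus the error rate.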

ElevenLabs had already developed a speech-to-text component for its conversational AI agent platform, which it launched last year. However, this is the first time the company has released a standalone speech recognition model.
“We want to better understand what is being said in a conversation. We are working on ways to move away from just generating content to understanding and transcribing speech,” ElevenLabs co-founder and CEO Mati Staniszewski said at the time. “Many people say speech-to-text is a solved problem. But for many languages, it’s quite poor. We believe we can build better speech recognition models because we have internal teams to annotate data and give us feedback quickly.”
The model also features diarization (the process of identifying and separating the voices of different speakers in an audio recording or video stream) to identify who is speaking, word-level timestamps for accurate captions, and automatic tagging of sound events such as audience laughter. The startup is also providing a way for customers to directly transcribe video content for captioning.
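To show why word-level timestamps matter for captioning, here is a minimal Python sketch that groups timed words into SubRip (SRT) caption cues. The `Word` structure, the `words_to_srt` function, and the fixed words-per-cue rule are assumptions made for illustration. They do not reflect Scribe's actual response format or ElevenLabs' captioning tool.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical transcript word with start and end times in seconds."""
    text: str
    start: float
    end: float


def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: list[Word], max_words: int = 7) -> str:
    """Group consecutive words into numbered SRT cues of up to max_words each."""
    cues = []
    for index, offset in enumerate(range(0, len(words), max_words), start=1):
        chunk = words[offset:offset + max_words]
        start = format_timestamp(chunk[0].start)  # cue starts with its first word
        end = format_timestamp(chunk[-1].end)     # and ends with its last word
        text = " ".join(word.text for word in chunk)
        cues.append(f"{index}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)  # blank line between cues, as SRT expects


if __name__ == "__main__":
    sample = [Word("Hello", 0.0, 0.4), Word("world", 0.5, 0.9), Word("again", 1.0, 1.5)]
    print(words_to_srt(sample, max_words=2))
```

A production captioner would more likely break cues at pauses, sentence boundaries, or speaker changes than at a fixed word count. The point is only that each cue's timing comes directly from the first and last word it contains.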
Scribe currently works only with pre-recorded audio. The company said it will soon release a low-latency, real-time version of the model. Until then, it is not well suited to real-time uses such as live transcription or voice note-taking.
ElevenLabs has priced Scribe at $0.40 per hour of transcribed audio. While the rate is competitive, some of its rivals currently offer lower prices for audio transcription, with some differences in their features.