Flash v2.5 (model_id: eleven_flash_v2_5) is ElevenLabs’ fastest speech synthesis model, engineered specifically for real-time applications where the time between text input and audio output is the primary constraint. The 75ms figure — excluding application and network latency — means the model itself contributes less than a tenth of a second to the total latency budget of a voice interaction.
To understand why 75ms matters, consider the conversational voice agent use case. A user speaks a sentence. The STT model transcribes it (Scribe v2 Realtime: sub-150ms). An LLM generates a response (variable, typically 300ms–1s for the first token). The TTS model converts that response to speech. Because end-to-end latency determines whether the exchange feels natural or awkward, every stage's delay adds to the pause the user perceives. At 75ms, Flash v2.5's contribution is effectively imperceptible — the pause feels like normal speech timing rather than AI processing delay.
For context on how Flash v2.5 fits into a full conversational AI voice agent architecture, see our ElevenLabs Conversational AI builder’s guide.
Flash v2.5 vs All ElevenLabs Models: Full Comparison
| Model | model_id | Latency | Languages | Char Limit | Cost/Char | Best For |
|---|---|---|---|---|---|---|
| Eleven v3 | eleven_v3 | Higher | 70+ | 3,000 | ~1.5–2x | Expressive performance, audio tags, character dialogue |
| Multilingual v2 | eleven_multilingual_v2 | Standard | 29 | 10,000 | 1 credit | Production narration, audiobooks, podcasts, highest quality |
| Flash v2.5 | eleven_flash_v2_5 | ~75ms | 32 | 40,000 | 0.5 credits | Real-time agents, voice chatbots, interactive apps, bulk production |
| Flash v2 | eleven_flash_v2 | ~75ms | English only | 40,000 | 0.5 credits | English-only real-time applications |
| Turbo v2.5 | eleven_turbo_v2_5 | ~250–300ms | 32 | 40,000 | 0.5 credits | Balance of speed and quality when 75ms is not required |
| Turbo v2 | eleven_turbo_v2 | ~250–300ms | English only | 40,000 | 0.5 credits | Legacy — use Flash v2 or v2.5 instead |
When to Use Flash v2.5 vs Other Models
Use Flash v2.5 when:
- Building conversational AI voice agents where response latency directly affects user experience — the 75ms TTS latency is a prerequisite, not a luxury.
- Generating speech for real-time interactive applications: live gaming NPCs, interactive stories, voice chatbots, call centre agents.
- Processing large-scale bulk text-to-speech at the lowest cost — Flash v2.5’s 0.5 credit/character pricing is 50% cheaper than Multilingual v2 for the same character volume.
- Building multilingual applications covering languages not in Multilingual v2 — Flash v2.5 adds Hungarian, Norwegian, and Vietnamese.
Use Multilingual v2 instead when:
- Production quality for audiobooks, podcasts, or narrated video is the priority and the additional latency of the standard model is acceptable.
- Text contains phone numbers, dates, currencies, or other content requiring normalisation — Multilingual v2 handles these correctly by default where Flash v2.5 may not.
- Highly emotional or nuanced delivery is required — Multilingual v2 provides more expressive range than Flash v2.5.
Use Eleven v3 instead when:
- Maximum expressiveness, audio tags ([whispers], [excited], [laughs]), or multi-character dialogue via the Text to Dialogue API is the requirement. v3 is not designed for real-time applications.
Flash v2.5 Technical Specifications
| Specification | Value | Notes |
|---|---|---|
| model_id | eleven_flash_v2_5 | API parameter for all requests |
| Latency | ~75ms | Excluding application and network latency |
| Languages | 32 | All Multilingual v2 languages + Hungarian, Norwegian, Vietnamese |
| Character limit | 40,000 per request | ~40 minutes of audio per single API call |
| Cost | 0.5 credits per character | 50% lower than Multilingual v2 (1 credit/char) |
| Text normalisation | Disabled by default | Enable via apply_text_normalization=on (Enterprise) |
| Streaming | Yes | Start playback before generation completes |
| Seed parameter | Yes | Same seed + same input = consistent output |
| Max request size | 40,000 characters | Split longer content into multiple requests |
| Turbo v2.5 equivalent | eleven_turbo_v2_5 | Same languages/cost but higher latency (~250ms) |
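As a concrete starting point, a Flash v2.5 request can be assembled as in the sketch below. The endpoint path and field names (`text`, `model_id`, `seed`) follow the public ElevenLabs text-to-speech API, but treat them as assumptions to verify against the current API reference; the voice ID is a placeholder.

```python
API_BASE = "https://api.elevenlabs.io/v1"

def build_tts_request(voice_id, text, seed=None):
    """Assemble the URL and JSON body for a Flash v2.5 synthesis request.

    Field names mirror the public ElevenLabs TTS API; verify against the
    current API reference before relying on them.
    """
    url = f"{API_BASE}/text-to-speech/{voice_id}"
    body = {
        "text": text,
        "model_id": "eleven_flash_v2_5",  # Flash v2.5 model_id from the table above
    }
    if seed is not None:
        # Same seed + same text + same voice settings => consistent output
        body["seed"] = seed
    return url, body

url, body = build_tts_request("YOUR_VOICE_ID", "Hello from Flash v2.5.", seed=42)
```

Send the body as JSON with your API key in the request headers; the same body shape works for the non-streaming and streaming variants of the endpoint.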
The Text Normalisation Caveat: What Developers Must Know
Flash v2.5 disables text normalisation by default to maintain 75ms latency. Normalisation — the process of converting symbols and abbreviations into spoken form — adds processing time. Without it, the model speaks text literally rather than converting it to natural speech patterns. This creates specific failure modes:
| Content Type | Without Normalisation | With Normalisation (Multilingual v2) | Impact |
|---|---|---|---|
| Phone numbers | Four-one-five, five-five-five… | Four fifteen, five fifty-five… | Caller confusion in IVR applications |
| Dates | 12/04/2026 as ‘twelve slash four’ | ‘April twelfth, twenty twenty-six’ | Date communication errors |
| Currencies | $500 as ‘dollar five hundred’ | ‘Five hundred dollars’ | Financial content errors |
| Abbreviations | ‘Dr.’ as ‘Dr’ not ‘Doctor’ | ‘Doctor’ | Professional title misread |
| URLs | ‘www.site.com’ literally | Varies — often still literal | Links in content awkward |
| Times | ‘3:45’ as ‘three colon forty-five’ | ‘Three forty-five’ | Time communication errors |
The practical mitigation: pre-process your text before sending to Flash v2.5. Expand abbreviations, spell out numbers in the desired spoken form, write dates in full (‘April 12, 2026’), and convert symbols to words (‘five hundred dollars’). This pre-processing step adds negligible latency to your application while ensuring Flash v2.5 generates the correct speech output.
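A minimal pre-processing pass might look like the sketch below. The three rules (currency symbols, the ‘Dr.’ abbreviation, clock times) are illustrative, not exhaustive; a production pipeline needs locale-aware handling for dates, phone numbers, and full number-to-words expansion.

```python
import re

def preprocess_for_flash(text):
    """Expand normalisation-sensitive content to spoken form before TTS.

    A sketch covering three of the failure modes in the table above;
    extend with locale-aware rules for dates and phone numbers.
    """
    # "$500" -> "500 dollars": move the currency word after the amount so it
    # is spoken in natural order rather than "dollar five hundred"
    text = re.sub(r"\$(\d+)", lambda m: f"{m.group(1)} dollars", text)
    # "Dr." before a capitalised name -> "Doctor"
    text = re.sub(r"\bDr\.(?=\s+[A-Z])", "Doctor", text)
    # "3:45" -> "3 45" so it is read as a time, not "three colon forty-five"
    text = re.sub(r"\b(\d{1,2}):(\d{2})\b", r"\1 \2", text)
    return text
```

Run this once per utterance before the API call; the regex work is microseconds against a 75ms synthesis budget.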
For Enterprise customers where pre-processing is impractical — high-volume IVR systems reading dynamic data — the apply_text_normalization API parameter can be set to ‘on’, which re-enables normalisation at a small latency cost. This option is only available to Enterprise plan customers.
For the full ElevenLabs API implementation guide including Flash v2.5 streaming setup, see our ElevenLabs API developer guide.
Flash v2.5 for Voice Agents: The Real-Time Stack
Flash v2.5 is ElevenLabs’ recommended model for the Conversational AI agents platform. The real-time voice agent stack using ElevenLabs components:
- Speech input: Scribe v2 Realtime (WebSocket, <150ms) — transcribes user speech.
- LLM processing: OpenAI GPT-4o or Gemini Flash (via ElevenLabs agent configuration) — generates response text.
- Speech output: Flash v2.5 streaming (75ms) — converts response to audio.
- End-to-end latency: approximately 300–600ms for most interactions — within the threshold for natural conversational feel.
This stack is available as a managed platform via ElevenLabs Conversational AI (no-code/low-code visual builder) or via the Conversation WebSocket API for custom implementations. Flash v2.5 is the default TTS model in ElevenLabs Agents, with Scribe v2 Realtime available as an optional STT upgrade for maximum accuracy.
Flash v2.5 for Bulk Production: The Cost Mathematics
For content production pipelines where audio is generated in batch — narrated articles, automated video voiceovers, e-learning module narration — Flash v2.5’s 0.5 credit/character pricing represents a significant cost reduction versus Multilingual v2 when quality requirements are met.
| Volume | Multilingual v2 Cost | Flash v2.5 Cost | Monthly Saving |
|---|---|---|---|
| 100K chars (Creator plan) | 100K credits ($22/mo) | 50K credits (~$11 equivalent) | 50% on characters |
| 500K chars (Pro plan) | 500K credits ($99/mo) | 250K credits (~$49.50 equivalent) | ~$49.50/mo |
| 2M chars (Scale plan) | 2M credits ($330/mo) | 1M credits (~$165 equivalent) | ~$165/mo |
| 10M chars | Business plan or Enterprise | Scale plan may suffice | Significant tier drop |
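The savings in the table reduce to simple arithmetic on the per-character rates; a small helper makes the comparison explicit (rates taken from the comparison above):

```python
# Credits per character, per the model comparison table
RATES = {
    "eleven_flash_v2_5": 0.5,
    "eleven_multilingual_v2": 1.0,
}

def credits_for(chars, model):
    """Credits consumed to synthesise `chars` characters with `model`."""
    return chars * RATES[model]

monthly_chars = 500_000  # Pro-plan example row
saving = (credits_for(monthly_chars, "eleven_multilingual_v2")
          - credits_for(monthly_chars, "eleven_flash_v2_5"))
# saving == 250_000 credits: half the character budget back each month
```

The dollar figures in the table follow directly by multiplying the saved credits by the plan's effective price per credit.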
The caveat: Flash v2.5’s lower expressive range means quality-sensitive content (audiobooks, emotional storytelling, professional narration) still requires Multilingual v2 or Eleven v3 despite the cost. The bulk cost advantage applies specifically to applications where Flash v2.5’s voice quality is acceptable — IVR systems, notification audio, quick news summaries, live translation outputs.
Flash v2.5 Language Support: The Full 32-Language List
Flash v2.5 supports all 29 Multilingual v2 languages plus Hungarian, Norwegian, and Vietnamese. The full 29 Multilingual v2 languages: English (US, UK, Australia, Canada), Japanese, Chinese, German, Hindi, French (France, Canada), Korean, Portuguese (Brazil, Portugal), Italian, Spanish (Spain, Mexico), Indonesian, Dutch, Turkish, Filipino, Polish, Swedish, Bulgarian, Romanian, Arabic (Saudi Arabia, UAE), Czech, Greek, Finnish, Croatian, Malay, Slovak, Danish, Tamil, Ukrainian, and Russian.
The three additional Flash v2.5 languages — Hungarian, Norwegian, and Vietnamese — make it the broader multilingual choice whenever language coverage is the selection criterion. Quality can vary by language, however; for the highest quality in any specific language, test both Flash v2.5 and Multilingual v2 with your target content before committing to production.
Optimising Flash v2.5 Performance: Best Practices
1. Use streaming for real-time applications
Enable streaming (POST /v1/text-to-speech/:voice_id/stream) to begin audio playback before the full generation completes. This reduces perceived latency further — the user hears the first words while the rest of the audio is still generating. Combined with Flash v2.5’s 75ms first-chunk latency, streaming makes the response feel immediate.
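A sketch of the streaming setup, assuming the endpoint path named above; the chunk-consumption pattern in the comment uses the third-party `requests` library and a hypothetical audio sink, so treat both as illustrative:

```python
API_BASE = "https://api.elevenlabs.io/v1"

def stream_url(voice_id):
    """URL for the streaming variant of the TTS endpoint (path as cited above)."""
    return f"{API_BASE}/text-to-speech/{voice_id}/stream"

# Consuming the stream (sketch): feed each audio chunk to the player as it
# arrives instead of waiting for the full body. With `requests` it looks like:
#
#   with requests.post(stream_url(vid), json=body, headers=h, stream=True) as r:
#       for chunk in r.iter_content(chunk_size=4096):
#           player.feed(chunk)   # hypothetical audio sink
```

The first chunk arrives at roughly Flash v2.5's first-chunk latency, so playback starts while the remainder of the utterance is still being generated.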
2. Keep requests under 1,000 characters for lowest latency
While Flash v2.5 supports up to 40,000 characters per request, latency scales with request length. For real-time conversational applications where each agent turn is a sentence or short paragraph, keeping requests under 1,000 characters maintains the sub-100ms latency profile. For bulk generation, longer requests are fine.
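Splitting at sentence boundaries is usually enough to keep conversational requests under the threshold; a minimal sketch (real pipelines should also respect clause boundaries and avoid splitting inside quoted speech):

```python
import re

def split_for_latency(text, limit=1000):
    """Split text at sentence boundaries so each request stays under `limit` chars.

    Sentences longer than `limit` are emitted as-is rather than cut mid-word.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > limit:
            chunks.append(current)   # flush the chunk before it overflows
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk becomes one synthesis request, keeping every call inside the sub-100ms latency profile.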
3. Use the seed parameter for reproducibility
Flash v2.5 is nondeterministic — the same text with the same settings may produce slightly different audio on each generation. The seed parameter enables reproducible output: the same seed, text, and voice settings produce consistent audio. Document seeds for any audio that needs to be regenerated identically (e.g. system prompt audio for IVR menus).
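One lightweight way to document seeds is a registry keyed by prompt ID; the IDs and seed values below are illustrative, and the body field names follow the public TTS API:

```python
# Record the seed used for each fixed prompt so IVR menu audio can be
# regenerated consistently later. IDs and seeds are illustrative.
PROMPT_SEEDS = {
    "ivr_main_menu": 1001,
    "ivr_hold_message": 1002,
}

def seeded_request_body(prompt_id, text):
    """JSON body with a pinned seed: same seed + text + voice settings
    produce consistent audio across regenerations."""
    return {
        "text": text,
        "model_id": "eleven_flash_v2_5",
        "seed": PROMPT_SEEDS[prompt_id],
    }
```

Keep the registry in version control next to the prompt text so a later edit to one menu line can be regenerated without disturbing the others.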
4. Pre-process text for normalisation-sensitive content
Expand numbers, dates, currencies, and abbreviations to spoken form before passing to Flash v2.5. This eliminates normalisation failures without requiring the Enterprise normalisation parameter or switching to Multilingual v2.
5. Use WebSocket connections for persistent real-time applications
For voice agent applications that handle multiple consecutive exchanges, establish a persistent WebSocket connection rather than a new HTTP request per utterance. This eliminates TCP handshake overhead from the latency budget for each turn.
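The framing below assumes the publicly documented `stream-input` WebSocket endpoint, with the model pinned in the query string and each agent turn sent as a JSON message carrying a `text` field; confirm the exact message protocol (initial handshake message, end-of-input signal) against the current ElevenLabs WebSocket reference before building on it:

```python
import json

def ws_endpoint(voice_id):
    """WebSocket endpoint for multi-turn synthesis over one connection.

    Path and query assume the stream-input API; verify against the
    current ElevenLabs WebSocket documentation.
    """
    return (f"wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}"
            f"/stream-input?model_id=eleven_flash_v2_5")

def turn_message(text):
    """One agent turn as a JSON text frame on the open socket."""
    return json.dumps({"text": text})

# With a WebSocket client (e.g. the `websockets` package), each agent turn
# is a single send on the existing connection: no new TCP/TLS handshake
# enters the per-turn latency budget.
```

The connection is opened once when the conversation starts and reused for every subsequent turn.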
Flash v2.5 vs Google Cloud TTS vs Azure TTS: Competitive Context
| Metric | Flash v2.5 | Google Cloud TTS (Neural) | Azure TTS (Neural) |
|---|---|---|---|
| Latency | ~75ms | ~200ms | ~100–200ms |
| Languages | 32 | 100+ (Wavenet/Neural) | 140+ (Neural) |
| Voice quality | High naturalness | Professional grade | Professional grade |
| Pricing | 0.5 credits/char (platform plan) | ~$0.016/1K chars (Neural) | ~$15/1M chars (Neural) |
| Normalisation | Disabled default (Enterprise opt-in) | Yes (default) | Yes (default) |
| Character limit per request | 40,000 | 5,000 (SSML), varies | ~10,000 |
| Voice cloning | Yes (via ElevenLabs platform) | Limited (Custom voices) | Yes (Custom Neural Voice) |
| Ecosystem | ElevenLabs STT, agents, dubbing | Google Cloud suite | Azure Cognitive Services suite |
Flash v2.5’s latency advantage over Google and Azure is real and measurable in voice agent applications. The language coverage gap (32 vs 100–140+) is the primary competitive disadvantage for globally diverse deployments. For applications where 32 languages is sufficient and latency is the critical criterion, Flash v2.5 is the correct choice.
The Future of Flash Models in 2027
ElevenLabs’ model roadmap signals continued investment in the Flash architecture. The pattern of Flash v2 (English-only) → Flash v2.5 (32 languages) will likely continue — a Flash v3 incorporating the expressive capabilities of Eleven v3 at Flash latency would be the most significant voice agent quality improvement possible. ElevenLabs has indicated that text normalisation for Flash models is on the Enterprise roadmap, suggesting full normalisation at Flash latency is technically achievable. The 75ms benchmark will also face competition as Google and Azure accelerate their own low-latency TTS investment.
Key Takeaways
- Flash v2.5 is the correct TTS model for real-time voice agents, conversational AI, and interactive applications — 75ms latency is the industry’s fastest commercial TTS.
- 50% lower cost versus Multilingual v2 makes Flash v2.5 the correct choice for bulk production when quality requirements are met.
- Text normalisation is disabled by default — pre-process numbers, dates, currencies, and abbreviations before sending to Flash v2.5 to prevent mispronunciation.
- 32 languages covers all 29 Multilingual v2 languages plus Hungarian, Norwegian, and Vietnamese — Flash v2.5 is the broader multilingual choice when language count is the selection criterion.
- Use streaming + WebSocket + requests under 1,000 characters for the lowest achievable latency in real-time applications.
Conclusion
Flash v2.5 is the defining TTS model for real-time voice applications in 2026. Its 75ms latency makes conversational AI feel natural rather than mechanical. Its 50% cost reduction versus Multilingual v2 makes high-volume production financially viable at scale. Its 32-language support covers all major global markets. The text normalisation caveat is a known limitation with a clear mitigation path. For any application where latency matters — voice agents, interactive games, live translation, IVR systems — Flash v2.5 is the unambiguous model choice within the ElevenLabs platform.
Frequently Asked Questions
What is ElevenLabs Flash v2.5?
ElevenLabs’ fastest TTS model — 75ms latency, 32 languages, 40,000 character limit per request, 0.5 credits per character (50% cheaper than Multilingual v2). Designed for real-time voice agents, conversational AI, and interactive applications.
How does Flash v2.5 compare to Multilingual v2?
Flash v2.5: 75ms latency, 32 languages, 0.5 credits/char, 40K char limit, text normalisation disabled by default. Multilingual v2: standard latency (~300ms), 29 languages, 1 credit/char, 10K char limit, text normalisation on. Use Flash for real-time applications and cost-sensitive bulk production; use Multilingual v2 for highest quality narration.
Why does Flash v2.5 mispronounce phone numbers and dates?
Text normalisation is disabled by default on Flash v2.5 to maintain 75ms latency. Pre-process your text to expand numbers, dates, and abbreviations to spoken form before sending to the API. Enterprise customers can enable normalisation via the apply_text_normalization parameter at a small latency cost.
Which languages does Flash v2.5 support?
32 languages: all 29 Multilingual v2 languages (English, Japanese, Chinese, German, Hindi, French, Korean, Portuguese, Italian, Spanish, Indonesian, Dutch, Turkish, Filipino, Polish, Swedish, Bulgarian, Romanian, Arabic, Czech, Greek, Finnish, Croatian, Malay, Slovak, Danish, Tamil, Ukrainian, Russian) plus Hungarian, Norwegian, and Vietnamese.
Can I use Flash v2.5 for audiobooks?
Flash v2.5’s expressive range is lower than Multilingual v2, and its text normalisation limitations make it less suitable for long-form narration. For audiobooks, use Multilingual v2 (or Eleven v3 for maximum expressiveness). Flash v2.5 is optimised for real-time and bulk short-form applications.
Methodology
Model specifications from official ElevenLabs documentation (Models page, TTS API page, help centre model comparison). Latency figures from ElevenLabs official documentation (75ms, excluding application and network latency) and WaveSpeed AI’s Flash v2.5 overview. Pricing from ElevenLabs help centre model comparison (0.5 credits/char Flash, 1 credit/char Multilingual v2). Competitive latency data from ElevenLabs TTS API FAQ (Google 200ms, Azure 100–200ms). Drafted with AI assistance, reviewed by ElevenLabsMagazine.com editorial team.
References
ElevenLabs. (2026). Models documentation. https://elevenlabs.io/docs/overview/models
ElevenLabs. (2026). Text to Speech documentation. https://elevenlabs.io/docs/overview/capabilities/text-to-speech
ElevenLabs. (2026). What models do you offer? https://help.elevenlabs.io/hc/en-us/articles/17883183930129
ElevenLabs. (2026). Text to Speech API. https://elevenlabs.io/text-to-speech-api
WaveSpeed AI. (2025). Introducing ElevenLabs Flash v2.5. https://wavespeed.ai/blog/posts/introducing-elevenlabs-flash-v2-5-on-wavespeedai/
