ElevenLabs Scribe v2: The Complete Speech-to-Text Guide (2026)

Key Takeaways

  • Scribe v2 Realtime achieves the lowest Word Error Rate of any low-latency ASR model on the FLEURS multilingual benchmark across 30 languages, with under 150ms latency — outperforming OpenAI Whisper, Google Gemini Flash, Amazon Transcribe, and Deepgram on both accuracy and multilingual coverage in ElevenLabs’ benchmarks.
  • Scribe v2 (batch) is purpose-built for long-form audio — podcasts, legal transcripts, meeting recordings, medical dictation — with speaker diarization for up to 48 distinct speakers, dynamic audio tagging for non-speech events, entity detection across 56 PII categories, and keyterm prompting with up to 1,000 domain-specific terms.
  • Four major feature upgrades in the March 2026 update: PII auto-redaction during transcription (before storage), No Verbatim mode (automatic filler word removal), keyterm capacity expanded from 100 to 1,000 terms, and mixed-script handling for Indic languages. Together these make Scribe v2 enterprise-ready for healthcare, finance, and customer service compliance workflows.

What Scribe v2 Is — and the Two-Model Architecture

ElevenLabs’ speech-to-text platform consists of two specialised models serving distinct use cases. Scribe v2 (batch) launched January 9, 2026 and is optimised for long-form audio processing: batch transcription, subtitling, captioning, and structured audio analysis at scale. Scribe v2 Realtime launched January 6, 2026 and is purpose-built for live applications such as conversational AI agents, meeting assistants, voice-enabled apps, and real-time captioning, where latency is the primary constraint.

Both models support 90+ languages with automatic detection, handle accents and diverse acoustic conditions without manual configuration, and are available via REST API and WebSocket. Both carry full enterprise compliance coverage: SOC 2, ISO 27001, PCI DSS L1, HIPAA, and GDPR, with EU and India data residency and zero retention mode for regulated environments.

For context on how Scribe fits into the broader STT market alongside Deepgram, AssemblyAI, and Whisper, see our best speech-to-text software comparison for 2026 (https://elevenlabsmagazine.com/best-speech-to-text-software-2026/).

Scribe v2 Realtime: Architecture and Performance

Scribe v2 Realtime uses predictive transcription: it anticipates the most probable next words and punctuation from context rather than waiting for complete utterances. This reduces perceived latency. The model processes audio in under 150ms, and because predicted text is displayed as the speaker talks, the experience feels faster than the raw latency figure suggests.

On the FLEURS multilingual benchmark, which measures accuracy across 30 languages, Scribe v2 Realtime achieved the lowest Word Error Rate (WER) of any low-latency ASR model, outperforming Gemini Flash, OpenAI’s real-time STT, and Deepgram on this multilingual measure. In ElevenLabs’ internal benchmarks across hundreds of challenging English conversation samples (poor audio quality, diverse accents, filler words), Scribe v2 Realtime captured user intent more accurately than any competing real-time ASR model.
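
For readers unfamiliar with the metric, WER is the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the reference word count. The sketch below is a minimal illustration of that definition, not ElevenLabs’ or FLEURS’ evaluation code; published benchmarks typically apply extra text normalisation (punctuation, numerals) that is omitted here, and the function name is our own.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words processed so far
    # and the first j hypothesis words (word-level Levenshtein distance).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "the patient takes metformin daily"
hypothesis = "the patient takes met forming daily"
wer = word_error_rate(reference, hypothesis)  # one substitution + one insertion over 5 words
```

A lower value is better, and because insertions count as errors, WER can exceed 1.0 on very poor transcripts.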

Scribe v2 Realtime: Key Technical Capabilities

Capability | Detail | Use Case
Latency | Under 150ms (30–80ms optimised configurations) | Conversational AI agents, live voice apps
Languages | 90+ with automatic detection | Multilingual agents, global live captioning
Commit control | Developer control over when to finalise transcripts | Custom streaming, fine-tuned accuracy pipelines
VAD (Voice Activity Detection) | Automatically detects speech start/stop | Smoother live processing, cleaner segmentation
Connection resilience | Continues transcription seamlessly after connection resets | Production-grade live deployments
Complex vocabulary | Built-in support for technical language, medications, proper nouns | Healthcare, legal, financial voice agents
Agents platform integration | Available as optional upgrade in ElevenLabs Agents | Voice agents requiring highest STT accuracy
Pricing | $0.28/hour of audio processed | Scales from startup to enterprise
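
To make the partial versus committed distinction concrete, here is a minimal client-side sketch of how an application might track both kinds of transcript. The event shape (`type` of `"partial"` or `"committed"`, plus a `text` field) is a hypothetical stand-in for illustration and is not taken from the ElevenLabs API; check the official documentation for the real message format.

```python
from dataclasses import dataclass, field


@dataclass
class TranscriptBuffer:
    """Keeps finalised text separate from the still-changing partial text."""
    committed: list[str] = field(default_factory=list)
    partial: str = ""

    def handle(self, event: dict) -> None:
        # Hypothetical event shape: {"type": "partial" | "committed", "text": "..."}
        kind = event.get("type")
        text = event.get("text", "").strip()
        if kind == "partial":
            # Each partial replaces the previous one; it may still be revised.
            self.partial = text
        elif kind == "committed":
            # Committed text is final, so store it and clear the partial.
            if text:
                self.committed.append(text)
            self.partial = ""
        # Any other event type is ignored in this sketch.

    def display(self) -> str:
        parts = self.committed + ([self.partial] if self.partial else [])
        return " ".join(parts)


buffer = TranscriptBuffer()
buffer.handle({"type": "partial", "text": "book a"})
buffer.handle({"type": "partial", "text": "book a table"})
buffer.handle({"type": "committed", "text": "Book a table for two."})
buffer.handle({"type": "partial", "text": "at seven"})
```

Keeping the two apart lets a UI render provisional text differently (for example, greyed out) and pass only committed text to downstream logic.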

Scribe v2 Batch: Architecture and Performance

Scribe v2 batch is designed for accuracy over speed. It processes long podcast episodes, meeting recordings, legal dictation, medical notes, and video subtitles, where the richness of output matters more than real-time delivery. It improves on Scribe v1’s stability across long-form audio, handling pauses, tone changes, and extended silences without the accuracy degradation common in models optimised for shorter utterances.

Scribe v2 Batch: Feature Set

Feature | Capability | Enterprise Value
Speaker diarization | Up to 48 distinct speakers, intuitive labelling | Meeting transcription, multi-person interviews, call centre recording
Dynamic audio tagging | Detects non-speech events: laughter, footsteps, applause | Content indexing, accessibility enrichment, context preservation
Entity detection | 56 PII categories with exact timestamps: names, SSNs, credit cards, medical conditions | HIPAA compliance, GDPR data mapping, financial regulation
PII redaction | Three modes: complete [REDACTED], categorised [CREDIT_CARD], enumerated [CREDIT_CARD_1] | Automated compliance without manual review
Keyterm prompting | Up to 1,000 context-aware domain-specific terms | Technical vocabulary, product names, medical terminology
Multi-language in single file | Automatic language detection, no manual segmentation | Multilingual meetings, international interview content
No Verbatim mode | Removes filler words (um, uh), repeated phrases, stuttering | Meeting notes, subtitles, polished written records
Mixed-script handling | English words remain in Latin script within Indic language audio | Hindi-English, Telugu-English, Kannada-English code-switching

The March 2026 Feature Upgrades in Detail

1. PII Auto-Redaction During Transcription

The most significant enterprise capability added in the March 2026 update. PII redaction now happens during transcription — sensitive data is removed before it reaches storage or downstream systems. Three redaction modes give compliance teams the right tool for each requirement: complete redaction replaces entities with [REDACTED] for maximum privacy; categorised redaction replaces with [CREDIT_CARD] or [SSN] for audit trail purposes; enumerated redaction uses [CREDIT_CARD_1], [CREDIT_CARD_2] for cases where tracking distinct instances matters.
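
The difference between the three modes is easiest to see on one sentence. The sketch below is a local illustration of the output styles only: it assumes entity spans have already been detected, and the function name, mode strings, and span format are our own choices rather than the ElevenLabs API. It also assumes enumerated mode numbers each occurrence in order, which the source does not spell out.

```python
from collections import defaultdict

MODES = {"complete", "categorised", "enumerated"}


def redact(text: str, entities: list[tuple[int, int, str]], mode: str) -> str:
    """Replace non-overlapping (start, end, category) character spans with labels."""
    if mode not in MODES:
        raise ValueError(f"unknown redaction mode: {mode!r}")
    counters: dict[str, int] = defaultdict(int)
    pieces = []
    cursor = 0
    for start, end, category in sorted(entities):
        pieces.append(text[cursor:start])  # keep the text before the entity
        if mode == "complete":
            label = "[REDACTED]"
        elif mode == "categorised":
            label = f"[{category}]"
        else:
            # Assumption: every occurrence gets the next number for its category.
            counters[category] += 1
            label = f"[{category}_{counters[category]}]"
        pieces.append(label)
        cursor = end
    pieces.append(text[cursor:])  # keep the text after the last entity
    return "".join(pieces)


text = "Pay with 4111 1111 1111 1111 or 5500 0000 0000 0004."
cards = ["4111 1111 1111 1111", "5500 0000 0000 0004"]
spans = [(text.index(c), text.index(c) + len(c), "CREDIT_CARD") for c in cards]

complete = redact(text, spans, "complete")
categorised = redact(text, spans, "categorised")
enumerated = redact(text, spans, "enumerated")
```

Complete mode discards even the entity type, categorised mode keeps the type for audit trails, and enumerated mode additionally shows that two different card mentions occurred.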

For healthcare teams transcribing patient calls, financial services recording client conversations, and customer support centres capturing personal details, this eliminates a post-processing step that previously required a separate PII-scanning service or manual review. The data never enters storage unredacted.

2. No Verbatim Mode

No Verbatim mode automatically removes filler words (um, uh), repeated phrases, and stuttering from transcripts — producing clean, polished written records without manual editing. It is activated per-request in the API. For meeting notes, subtitle generation, executive dictation, and any workflow where the goal is a readable document rather than an exact capture of every spoken sound, No Verbatim mode eliminates significant post-processing time.

3. Keyterm Prompting Expanded to 1,000 Terms

Keyterm capacity expanded tenfold from 100 to 1,000 terms in the March update. Unlike standard custom vocabulary that blindly inserts provided terms, Scribe v2 keyterm prompting is context-aware — the model uses surrounding audio to determine whether a keyterm applies before transcribing it. This prevents false positives where a similar-sounding word would trigger incorrect term substitution. For enterprise deployments with large technical vocabularies, product catalogues, or domain-specific terminology, 1,000 terms provides sufficient coverage for most production use cases. Requests with more than 100 keyterms have a minimum billable unit of 20 seconds.
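
The billing rule for large keyterm lists can be sketched as a simple floor on billed duration. This is our reading of "minimum billable unit of 20 seconds", treated as an assumption: clips shorter than 20 seconds are billed as 20 seconds once a request carries more than 100 keyterms. The function and constant names are hypothetical.

```python
MAX_KEYTERMS = 1000          # upper limit stated for keyterm prompting
LARGE_LIST_THRESHOLD = 100   # above this, the 20-second minimum applies
MIN_BILLABLE_SECONDS = 20.0


def billable_seconds(audio_seconds: float, keyterm_count: int) -> float:
    """Return the billed duration under the assumed 20-second floor."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    if not 0 <= keyterm_count <= MAX_KEYTERMS:
        raise ValueError(f"keyterm_count must be between 0 and {MAX_KEYTERMS}")
    if keyterm_count > LARGE_LIST_THRESHOLD:
        # Assumption: the minimum acts as a floor, not as a rounding increment.
        return max(audio_seconds, MIN_BILLABLE_SECONDS)
    return audio_seconds


short_clip_large_list = billable_seconds(5, 250)   # floor applies
short_clip_small_list = billable_seconds(5, 100)   # exactly 100 terms: no floor
long_clip_large_list = billable_seconds(45, 250)   # already above the floor
```

In practice this means very short clips, such as voice commands, cost proportionally more when sent with a large keyterm list, so it may be worth keeping a trimmed list of 100 or fewer terms for them.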

4. Mixed-Script Handling for Indic Languages

Scribe v2 now correctly transcribes English words in Latin script within Indic language audio — Hindi, Telugu, Kannada, and other Indic language code-switching. Many transcription systems previously transliterated English words into Indic scripts, producing unusable transcripts for bilingual content. This fix works automatically with no language configuration required, making Scribe v2 the most practical STT choice for India-market deployments where English-Indic code-switching is common in professional settings.

Scribe v2 vs Competing STT APIs: 2026 Comparison

Capability | Scribe v2 Realtime | Deepgram Nova-3 | AssemblyAI Universal-2 | OpenAI Whisper (managed) | Google Cloud STT
Real-time latency | <150ms (30–80ms optimised) | Sub-300ms | Streaming available | ~200ms+ | Fast
Multilingual WER (FLEURS) | Lowest of any low-latency model | Excellent on English and noisy audio | Strong across datasets | Wide language support | 125+ languages
Languages | 90+ | 50+ | Multilingual | 99 (open source) | 125+
Keyterm prompting | Yes, up to 1,000 terms, context-aware | Custom vocabulary | Custom vocabulary | No | Custom vocabulary
PII entity detection | Yes, 56 categories, timestamps, auto-redaction | Limited | Yes (PII redaction) | No | Limited
No Verbatim mode | Yes (March 2026) | No | No | No | No
Speaker diarization | Yes, up to 48 speakers (batch) | Yes | Yes | No (managed) | Yes
Audio tagging (non-speech) | Yes (laughter, footsteps, etc.) | No | No | No | No
Mixed-script Indic | Yes (March 2026) | Limited | Limited | Partial | Partial
HIPAA/SOC2/GDPR | Yes, full enterprise stack | Yes | Yes | Limited | Yes
Zero retention mode | Yes | No | No | No | No
Pricing (real-time) | $0.28/hr | $4.50/hr (bundled agent) | Pay-as-you-go | Pay-as-you-go | ~$0.024/min

Pricing: Scribe v2 in 2026

Scribe v2 Realtime is priced at $0.28 per hour of audio processed, well below Deepgram’s bundled agent rate of $4.50 per hour and below Google Cloud STT’s rate of roughly $0.024 per minute (about $1.44 per hour). Enterprise clients benefit from higher concurrency limits (30+ simultaneous streams) and dedicated support. Annual Business plan subscribers receive volume discounts.
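
To put the rates on a common footing, the sketch below converts the figures from the comparison table to dollars per hour and estimates the cost of a given audio volume. It uses only the list rates quoted in this article, and it ignores volume discounts, free tiers, and the fact that the Deepgram figure is a bundled agent rate rather than STT alone.

```python
# List rates quoted in this article, normalised to USD per hour of audio.
HOURLY_RATES_USD = {
    "Scribe v2 Realtime": 0.28,
    "Deepgram (bundled agent)": 4.50,
    "Google Cloud STT": 0.024 * 60,  # ~$0.024 per minute
}


def estimated_cost(hours: float) -> dict[str, float]:
    """Estimated cost in USD for `hours` of audio at each quoted rate."""
    if hours < 0:
        raise ValueError("hours must be non-negative")
    return {name: round(rate * hours, 2) for name, rate in HOURLY_RATES_USD.items()}


costs = estimated_cost(1000)  # e.g. 1,000 hours of live audio per month
for name, usd in costs.items():
    print(f"{name}: ${usd:,.2f}")
```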

For Scribe v2 batch, pricing follows ElevenLabs’ standard API credit structure integrated with the broader platform. Teams using ElevenLabs for TTS, voice cloning, and SFX alongside Scribe benefit from a unified credit system rather than managing separate API accounts and billing for each service.

For the full ElevenLabs credit system and API pricing breakdown, see our ElevenLabs API pricing guide (https://elevenlabsmagazine.com/elevenlabs-api-pricing-guide-2026/).

Production Use Cases: Where Scribe v2 Excels

Conversational AI Agents

Scribe v2 Realtime is the STT layer for ElevenLabs’ own Agents platform — available as an optional upgrade from the default model. For voice agents where the STT accuracy directly determines whether the agent understands what the user said, Scribe v2 Realtime’s performance on accented speech, noisy environments, and technical vocabulary makes it the highest-accuracy option within the ElevenLabs ecosystem. Agent teams handling Spanish, Portuguese, Hindi, and other non-English languages benefit most from Scribe v2’s multilingual accuracy advantage.

For how STT fits into the full voice agent stack, see our ElevenLabs Conversational AI builder’s guide (https://elevenlabsmagazine.com/elevenlabs-conversational-ai-guide-2026/).

Healthcare Transcription — HIPAA Compliance

Scribe v2’s PII auto-redaction during transcription removes names, medical conditions, SSNs, and other protected health information before storage. Combined with HIPAA compliance, BAA availability, and zero retention mode, this makes it production-ready for clinical dictation, patient call transcription, and medical documentation workflows. Healthcare teams must contact ElevenLabs Sales to sign a BAA before deploying in any HIPAA-regulated context.

Meeting Intelligence and Corporate Notes

Speaker diarization for up to 48 speakers, No Verbatim mode for clean transcripts, entity detection for key information extraction, and integration with ElevenLabs Studio for editing and caption export make Scribe v2 batch the most complete meeting transcription tool within the ElevenLabs ecosystem. For organisations already using ElevenLabs for TTS and voice agent infrastructure, Scribe v2 adds meeting intelligence without adding a separate vendor.
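
As an illustration of how diarized output becomes readable meeting notes, the sketch below merges consecutive words from the same speaker into turns. It assumes each word arrives as a dict with `text` and `speaker_id` fields; those field names and the `speaker_0` style labels are assumptions for this example, so adapt them to the actual response schema.

```python
from itertools import groupby


def speaker_turns(words: list[dict]) -> list[tuple[str, str]]:
    """Merge consecutive words by the same speaker into (speaker, utterance) turns."""
    turns = []
    # groupby only groups adjacent items, which is what we want here:
    # a speaker who talks again later starts a new turn.
    for speaker, group in groupby(words, key=lambda w: w["speaker_id"]):
        turns.append((speaker, " ".join(w["text"] for w in group)))
    return turns


words = [
    {"text": "Shall", "speaker_id": "speaker_0"},
    {"text": "we", "speaker_id": "speaker_0"},
    {"text": "start?", "speaker_id": "speaker_0"},
    {"text": "Yes,", "speaker_id": "speaker_1"},
    {"text": "please.", "speaker_id": "speaker_1"},
    {"text": "Great.", "speaker_id": "speaker_0"},
]
turns = speaker_turns(words)
for speaker, utterance in turns:
    print(f"{speaker}: {utterance}")
```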

Media Production: Subtitles and Captioning

Scribe v2 is now used in ElevenLabs Studio for automated subtitle and caption generation for podcasts, videos, and interviews. Dynamic audio tagging enriches transcripts with non-speech event markers — [laughter], [applause], [background noise] — providing context that pure word transcription loses. For WCAG-compliant caption production at scale, Scribe v2 batch with audio tagging produces the most complete accessible transcript output available within a single TTS-STT platform.
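
For teams exporting captions outside Studio, timestamped segments map directly onto the SubRip (SRT) subtitle format. The sketch below assumes segments are already available as dicts with `start` and `end` times in seconds and a `text` field; that input shape is an assumption for illustration, not the documented response format. Audio event tags such as [laughter] are passed through as ordinary caption text.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[dict]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        timing = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        cues.append(f"{index}\n{timing}\n{seg['text']}")
    return "\n\n".join(cues) + "\n"


segments = [
    {"start": 0.0, "end": 2.4, "text": "Welcome back to the show."},
    {"start": 2.4, "end": 3.1, "text": "[laughter]"},
]
srt_text = to_srt(segments)
```

A production exporter would also cap cue length and line width, which this sketch leaves out.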

Customer Support — Financial Services

Call centre transcription for financial services requires PII redaction of credit card numbers, account details, and personal identifiers captured during calls. Scribe v2’s enumerated redaction mode ([CREDIT_CARD_1], [CREDIT_CARD_2]) allows compliance teams to track distinct instances across a call transcript for audit purposes, while preventing raw sensitive data from entering storage. PCI DSS L1 compliance covers payment card data handling requirements.

API Integration: Getting Started with Scribe v2

Scribe v2 Realtime uses WebSocket streaming — authenticate with an API key, send audio chunks, and receive partial or final transcripts with configurable VAD and commit controls. Scribe v2 batch uses the standard REST endpoint accepting MP4, MOV, MP3, WAV, and other common formats. Code examples in Python, JavaScript, and other languages are available in ElevenLabs’ documentation.

Keyterm prompting is added via a keyterms array parameter in the API request. PII redaction mode is set per-request via the redaction_config parameter. No Verbatim mode is toggled with a boolean parameter. All features are available via the same endpoint — no separate API configuration required for enterprise features.

Future of Scribe v2 in 2027

ElevenLabs’ roadmap for Scribe points toward deeper integration with the Agents Platform — Scribe v2 Realtime is already an optional upgrade within the Agents platform and will likely become the default model as quality advantage becomes more pronounced at scale. Speaker diarization for Realtime (currently batch-only) is a logical next capability for meeting assistant applications that need live speaker identification. The mixed-script capability for Indic languages signals investment in emerging market voice deployments where ElevenLabs’ multilingual TTS and STT combination creates the most integrated non-English voice AI platform available.

Key Takeaways

  • Use Scribe v2 Realtime for conversational AI agents, meeting assistants, and live captioning — lowest WER of any low-latency model on FLEURS multilingual benchmark, under 150ms latency.
  • Use Scribe v2 batch for long-form audio — podcasts, legal transcripts, medical dictation — with 48-speaker diarization, entity detection, and audio tagging.
  • PII auto-redaction during transcription (March 2026) makes Scribe v2 enterprise-ready for healthcare and financial services without a separate post-processing step.
  • No Verbatim mode eliminates filler words and stuttering automatically — the most practical feature for meeting notes, subtitles, and executive dictation workflows.
  • 1,000-term context-aware keyterm prompting is the most powerful domain vocabulary customisation available in any commercial STT API in 2026.
  • HIPAA deployment requires a BAA with ElevenLabs Sales before production — do not deploy in regulated healthcare contexts without completing this step.

Conclusion

Scribe v2 completes ElevenLabs’ audio loop — the platform can now generate speech with TTS and voice cloning, process it with STT, dub it across languages, and build voice agents around the full pipeline. For teams already in the ElevenLabs ecosystem, Scribe v2 eliminates the need for a separate STT vendor while delivering accuracy that matches or exceeds dedicated STT platforms on multilingual and noisy audio. For enterprise teams specifically needing PII redaction, HIPAA compliance, and large technical vocabulary support, the March 2026 feature set makes Scribe v2 the most capable enterprise STT option within any unified voice AI platform.

Frequently Asked Questions

What is the latency of Scribe v2 Realtime?

Under 150ms in standard configuration, with 30–80ms achievable in optimised deployments. Predictive transcription further reduces perceived latency by displaying partial results as the speaker talks rather than waiting for utterance completion.

How many languages does Scribe v2 support?

90+ languages with automatic detection. No manual language configuration is required — the model detects which language is being spoken and handles code-switching between languages within the same audio file automatically.

Is ElevenLabs Scribe v2 HIPAA compliant?

Yes, with a BAA in place. Healthcare teams must contact ElevenLabs Sales to sign a Business Associate Agreement (BAA) before deploying in any HIPAA-regulated context. Zero Retention mode, where audio is deleted immediately after processing, is available for stricter data control.

What is keyterm prompting in Scribe v2?

A feature allowing up to 1,000 domain-specific words or phrases to bias the model toward accurate transcription of those terms. Unlike standard custom vocabulary, keyterm prompting is context-aware — the model uses surrounding audio to determine whether a keyterm applies before transcribing it, preventing false positives.

What is No Verbatim mode?

A transcription setting that automatically removes filler words (um, uh), repeated phrases, and stuttering from transcripts — producing clean, readable records without manual post-editing. Activated per-request in the API.

How does Scribe v2 compare to Deepgram?

Scribe v2 Realtime outperforms Deepgram on the FLEURS multilingual benchmark. Deepgram Nova-3 leads on noisy English audio accuracy, with a reported 54.2% WER reduction. Scribe v2 has stronger multilingual performance, more extensive PII redaction capabilities, and native integration with the ElevenLabs Agents platform. Deepgram has a broader ecosystem for custom deployment and self-hosted options.

Methodology

Accuracy benchmark data from ElevenLabs’ official Scribe v2 Realtime documentation, FLEURS benchmark results published by ElevenLabs, and GenMediaLab’s independent Scribe v2 analysis (January 2026). Feature specifications from the official Scribe v2 Realtime introduction (January 6, 2026), Scribe v2 batch introduction (January 9, 2026), and the Scribe v2 upgrade blog post (March 2026). Pricing from ElevenLabs documentation and Quasa.io’s launch coverage. This article was drafted with AI assistance and reviewed by the editorial team at ElevenLabsMagazine.com.

References

ElevenLabs. (2026, January 6). Introducing Scribe v2 Realtime. https://elevenlabs.io/blog/introducing-scribe-v2-realtime

ElevenLabs. (2026, January 9). Introducing Scribe v2. https://elevenlabs.io/blog/introducing-scribe-v2

ElevenLabs. (2026, March). Scribe v2 just got an upgrade — four new features. https://elevenlabs.io/blog/scribe-v2-just-got-an-upgrade

ElevenLabs. (2026). Scribe v2 Realtime live in ElevenLabs Agents. https://elevenlabs.io/blog/scribe-v2-realtime-in-elevenlabs-agents

ElevenLabs. (2026). Speech to Text documentation. https://elevenlabs.io/docs/overview/capabilities/speech-to-text

GenMediaLab. (2026). ElevenLabs Launches Scribe v2. https://www.genmedialab.com/news/elevenlabs-scribe-v2-speech-to-text/
