ElevenLabs Scribe Realtime Guide 2026: Live STT, Keyterms & Voice Agents

Scribe Realtime is ElevenLabs’ live speech-to-text service, launched in January 2026 as part of the Scribe v2 product family. Where Scribe v2 batch processes pre-recorded audio files with maximum accuracy, Scribe Realtime processes live audio streams with sub-150ms latency — fast enough for natural voice interaction without perceptible delay.

The model is delivered over a WebSocket connection, which keeps a persistent two-way channel open: the client application streams audio to ElevenLabs’ servers while transcripts stream back. As audio arrives, Scribe Realtime returns incremental results: partial transcripts that update as more audio is received, followed by final transcripts when a natural pause or sentence boundary is detected. This streaming architecture lets applications display live captions, trigger agent responses, and process voice input without waiting for complete utterances.

April 27, 2026 Update: New Parameters

Keyterms Parameter

The keyterms parameter accepts an array of up to 50 strings, each up to 20 characters, that bias the Scribe Realtime model toward correct recognition of specific terms. This is critical for domain-specific applications, where standard speech recognition often fails on technical vocabulary, product names, proper nouns, and other specialised language the model has seen less of in training.

Without keyterms, a medical voice agent transcribing ‘the patient presented with dysphagia and odynophagia’ may produce ‘the patient presented with dis-fay-jia and oden-fay-jia’: phonetically similar but clinically incorrect. With keyterms, passing [‘dysphagia’, ‘odynophagia’] in the keyterms array biases the model toward the correct medical terminology. The same principle applies to product names, brand names, technical acronyms, and any other specialised vocabulary your application regularly encounters.
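Because the limits are strict (50 terms, 20 characters each), it is worth validating a keyterm list before opening a session. The Python sketch below assumes only the limits stated in this article; prepare_keyterms is a hypothetical helper, not part of any ElevenLabs SDK, and the case-insensitive de-duplication is a design choice rather than documented server behaviour.

```python
from typing import Iterable

# Limits as stated in this article for the April 2026 update;
# confirm them against the current ElevenLabs documentation.
MAX_KEYTERMS = 50
MAX_KEYTERM_CHARS = 20


def prepare_keyterms(terms: Iterable[str]) -> list[str]:
    """Hypothetical helper: trim, de-duplicate and validate keyterms.

    Raises ValueError instead of silently truncating, so an oversized
    vocabulary is caught before the session is opened.
    """
    seen: set[str] = set()
    cleaned: list[str] = []
    for raw in terms:
        term = raw.strip()
        if not term:
            continue  # skip blank entries
        if len(term) > MAX_KEYTERM_CHARS:
            raise ValueError(f"keyterm too long ({len(term)} chars): {term!r}")
        key = term.casefold()  # treat "Dysphagia" and "dysphagia" as one term
        if key not in seen:
            seen.add(key)
            cleaned.append(term)
    if len(cleaned) > MAX_KEYTERMS:
        raise ValueError(f"{len(cleaned)} keyterms exceeds the limit of {MAX_KEYTERMS}")
    return cleaned
```

For example, prepare_keyterms(["dysphagia", " odynophagia ", "Dysphagia"]) returns ["dysphagia", "odynophagia"].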

Both the keyterms parameter and the no_verbatim boolean are echoed back in the session_started event — confirming the configuration is active before transcription begins.
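A client can treat this echo as a handshake check and refuse to stream audio if the settings come back different. This minimal sketch assumes the event arrives as a parsed JSON object with a "type" field and the echoed settings under a "config" key; that layout is a hypothetical illustration, so map it onto the real field names from the ElevenLabs documentation.

```python
from typing import Any


def confirm_session_config(event: dict[str, Any], keyterms: list[str], no_verbatim: bool) -> None:
    """Raise RuntimeError unless `event` is a session_started echo of our settings."""
    if event.get("type") != "session_started":
        raise RuntimeError(f"expected session_started, got {event.get('type')!r}")
    # Hypothetical layout: echoed settings nested under a "config" key.
    echoed = event.get("config", {})
    if echoed.get("keyterms", []) != keyterms:
        raise RuntimeError("keyterms were not echoed as sent")
    if bool(echoed.get("no_verbatim", False)) != no_verbatim:
        raise RuntimeError("no_verbatim was not echoed as sent")
```

Calling this on the first event of every session turns a silently ignored parameter into an immediate, visible error.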

No_verbatim Mode

The no_verbatim boolean parameter removes filler words, false starts, and disfluencies from transcripts in real time. When set to true, output like ‘uh, I was, uh, thinking that we should, you know, maybe consider…’ becomes ‘I was thinking that we should maybe consider…’ in the transcript.
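To make the before-and-after concrete, the crude regex below reproduces that example locally. It is only an illustration and not how ElevenLabs implements the feature: no_verbatim is applied server-side during transcription, and a fixed word list like this cannot detect false starts or tell filler ‘like’ from the verb, which is why ‘like’ is left out. In a real integration you would simply set no_verbatim to true.

```python
import re

# Illustrative local approximation only, using a fixed filler list.
# "like" is deliberately excluded because it is often a real verb.
_FILLERS = re.compile(r",?\s*\b(?:uh|um|you know)\b,?", re.IGNORECASE)


def strip_fillers(text: str) -> str:
    """Remove simple filler words plus adjacent commas, then normalise spacing."""
    without_fillers = _FILLERS.sub("", text)
    return " ".join(without_fillers.split())
```

Applied to the example above, strip_fillers("uh, I was, uh, thinking that we should, you know, maybe consider…") returns "I was thinking that we should maybe consider…".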

For voice agent applications where the transcript drives downstream processing — LLM prompting, intent classification, database queries — cleaner input without filler words produces more accurate processing. For live captioning in professional contexts (webinars, meeting transcription, broadcast), no_verbatim produces more readable real-time captions without requiring post-processing cleanup.

Scribe Realtime vs Scribe v2 Batch

Dimension	Scribe Realtime	Scribe v2 Batch
Latency	Under 150ms — live output	Minutes per hour of audio — post-processing
Use case	Live captions, voice agents, real-time transcription	Meeting transcription, podcast, long-form audio
Speaker diarization	Yes — real-time speaker identification	Yes — up to 48 speakers
Keyterm support	Up to 50 terms (April 2026 update)	Up to 1,000 terms
PII redaction	Limited	56 entity categories
No verbatim mode	Yes (April 2026 update)	Yes — No Verbatim mode
Languages	90+ languages	90+ languages
Delivery	WebSocket streaming	REST API, file upload
Character timestamps	Yes	Yes — plus word-level timestamps
Best for	Voice agents, live captions, real-time STT	Podcast transcripts, meeting notes, accurate long-form

Use Cases

Voice Agent Input Processing

Scribe Realtime is the hearing layer for ElevenLabs Conversational AI agents. When a user speaks to a voice agent, Scribe Realtime transcribes the speech in real time, the transcript is passed to the LLM reasoning layer, and ElevenLabs TTS delivers the response. The under-150ms latency of Scribe Realtime is what makes this interaction feel natural rather than mechanical — the transcription step is effectively invisible to the user. The keyterms parameter is particularly valuable for voice agents in specialised domains: customer service agents for specific products can be configured with product names and terminology, medical voice agents with clinical vocabulary, and legal voice agents with legal terminology.
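The hear, reason, speak loop described above can be sketched as an event handler that starts an agent turn only on a final transcript. The llm and speak callables are placeholders for your reasoning layer and your ElevenLabs TTS call, and the event fields (type, is_final, text) are assumed for illustration.

```python
from typing import Callable


def make_turn_handler(llm: Callable[[str], str],
                      speak: Callable[[str], None]) -> Callable[[dict], None]:
    """Build an event handler that runs one agent turn per final transcript."""
    def on_event(event: dict) -> None:
        # Partials are ignored so the agent never answers half a sentence.
        if event.get("type") != "transcript" or not event.get("is_final"):
            return
        text = event.get("text", "").strip()
        if text:
            speak(llm(text))  # reasoning layer, then TTS
    return on_event
```

Injecting llm and speak as plain callables keeps the turn logic independent of any particular LLM or TTS client, which also makes it easy to test with stubs.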

Live Meeting Captioning

Scribe Realtime provides live captions for video calls, webinars, and in-person meetings. The no_verbatim mode produces clean, readable captions without filler words, which suits professional broadcast and webinar contexts where captions are visible to the audience. Speaker diarization identifies different speakers in real time, enabling multi-speaker caption displays that indicate who is speaking. For corporate communications, legal proceedings, and contexts where accessibility is required, Scribe Realtime’s accuracy and language breadth make it viable live captioning infrastructure.
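For multi-speaker caption displays, consecutive final segments from the same speaker are usually merged into one caption line. This sketch assumes each final segment carries a speaker label and its text; the Segment type and its field names are hypothetical stand-ins for whatever the diarization output actually contains.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    speaker: str  # hypothetical diarization label, e.g. "speaker_0"
    text: str


def caption_lines(segments: list[Segment]) -> list[str]:
    """Merge consecutive segments from the same speaker into caption lines."""
    lines: list[str] = []
    last_speaker: Optional[str] = None
    for seg in segments:
        if seg.speaker == last_speaker:
            lines[-1] += " " + seg.text  # same speaker: extend the current line
        else:
            lines.append(f"{seg.speaker}: {seg.text}")  # new speaker: new line
            last_speaker = seg.speaker
    return lines
```

For example, segments from A ("Hello."), A ("Welcome.") and B ("Thanks.") render as "A: Hello. Welcome." and "B: Thanks.".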

Real-Time Podcast Transcription

For podcast producers who want live transcription during recording — to monitor transcript accuracy, catch misstatements in real time, and generate show notes during production rather than in post — Scribe Realtime provides the live layer. The keyterms parameter allows podcast-specific terminology, guest names, and topic keywords to be configured at session start, improving accuracy for content-specific vocabulary that general models may struggle with.
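Because Scribe Realtime accepts at most 50 keyterms per session, a producer may need to decide which guest names, sponsor names, and topic keywords make the cut. The sketch below merges sources in priority order and reports what was left out instead of silently discarding it; build_episode_keyterms is a hypothetical helper, and the 50-term and 20-character limits are the ones stated in this article.

```python
MAX_KEYTERMS = 50       # per-session limit stated in this article
MAX_KEYTERM_CHARS = 20  # per-term limit stated in this article


def build_episode_keyterms(*sources: list[str]) -> tuple[list[str], list[str]]:
    """Merge keyterm sources in priority order (earlier sources win).

    Returns (kept, dropped): `dropped` lists terms that were too long or did
    not fit under the cap, so they can be reviewed by hand.
    """
    kept: list[str] = []
    dropped: list[str] = []
    seen: set[str] = set()
    for source in sources:
        for raw in source:
            term = raw.strip()
            key = term.casefold()
            if not term or key in seen:
                continue  # skip blanks and case-insensitive duplicates
            seen.add(key)
            if len(term) > MAX_KEYTERM_CHARS or len(kept) >= MAX_KEYTERMS:
                dropped.append(term)
            else:
                kept.append(term)
    return kept, dropped
```

A typical call passes guest names first, then sponsors, then topics, for example build_episode_keyterms(guests, sponsors, topics), so that the terms most visible in show notes are never the ones dropped.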

Accessibility Applications

Scribe Realtime’s 90+ language support and under-150ms latency make it appropriate for real-time accessibility applications: live transcription of spoken content, real-time captioning for hearing-impaired users, and voice-to-text communication aids. Low latency is particularly critical in these use cases, because delays between speech and caption display create comprehension difficulties.

Developer Implementation Guide

Basic WebSocket Connection

Scribe Realtime uses a WebSocket connection to stream audio and receive transcripts. The connection URL is wss://api.elevenlabs.io/v1/speech-to-text/stream-input. Authentication uses your ElevenLabs API key passed as a header. Immediately after connecting, the client configures the session by sending a JSON configuration message containing the keyterms array and the no_verbatim preference. The server confirms these parameters by echoing them back in a session_started event.
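The connection steps above can be sketched in Python as follows. Several details are assumptions rather than documented API: the xi-api-key header name is borrowed from the ElevenLabs REST API, and the shape of the configuration message (including its type field) is hypothetical. The socket code uses the third-party websockets package (version 13 or later, which provides websockets.asyncio.client.connect with an additional_headers argument), imported inside main() so the pure helpers work without it.

```python
import json
import os
from typing import Any

# Endpoint as stated in this article; confirm it against current ElevenLabs docs.
SCRIBE_REALTIME_URL = "wss://api.elevenlabs.io/v1/speech-to-text/stream-input"


def build_headers(api_key: str) -> dict[str, str]:
    # Assumption: the "xi-api-key" header used by the ElevenLabs REST API.
    return {"xi-api-key": api_key}


def build_config(keyterms: list[str], no_verbatim: bool) -> str:
    # Hypothetical message shape; only keyterms and no_verbatim come from the article.
    return json.dumps({"type": "config", "keyterms": keyterms, "no_verbatim": no_verbatim})


async def start_session(ws: Any, keyterms: list[str], no_verbatim: bool = True) -> dict:
    """Send the configuration message on an open socket and return the first reply."""
    await ws.send(build_config(keyterms, no_verbatim))
    return json.loads(await ws.recv())  # expected to be a session_started event


async def main() -> None:
    # Third-party dependency (pip install websockets), imported lazily.
    from websockets.asyncio.client import connect

    headers = build_headers(os.environ["ELEVENLABS_API_KEY"])
    async with connect(SCRIBE_REALTIME_URL, additional_headers=headers) as ws:
        event = await start_session(ws, ["dysphagia", "odynophagia"])
        print(event)
        # Audio streaming and transcript handling would continue on `ws` here.
```

To try it, set ELEVENLABS_API_KEY and run asyncio.run(main()). Because start_session only needs an object with async send and recv methods, it can be tested against a fake socket without network access.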

Audio Streaming

Audio is streamed as binary WebSocket messages. Supported formats include PCM 16-bit at 16kHz (recommended for lowest latency), as well as MP3, OGG, and other formats. The model accepts continuous audio streaming and returns incremental transcript messages as speech is detected. Final transcript messages are returned when a natural pause or end-of-utterance is detected.
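For raw PCM, the chunk size follows directly from the audio format: 16,000 samples per second × 2 bytes per sample × 1 channel = 32,000 bytes per second, so a 100 ms chunk is 3,200 bytes. The 100 ms duration used here is an assumption chosen to balance latency against per-message overhead, not a documented requirement, and pcm_chunks is a hypothetical helper.

```python
from typing import Iterator


def pcm_chunks(pcm: bytes, sample_rate: int = 16_000, sample_width: int = 2,
               channels: int = 1, chunk_ms: int = 100) -> Iterator[bytes]:
    """Yield fixed-duration slices of raw PCM audio (the last may be shorter).

    Sizes are computed in whole sample frames so a chunk never splits a sample.
    """
    frame_bytes = sample_width * channels              # 2 bytes at the defaults
    frames_per_chunk = sample_rate * chunk_ms // 1000  # 1,600 frames at the defaults
    chunk_bytes = frames_per_chunk * frame_bytes       # 3,200 bytes at the defaults
    if chunk_bytes <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]
```

Each chunk would then be sent as one binary WebSocket message; with the websockets package, passing a bytes object to send produces a binary frame.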

Handling Transcript Events

Scribe Realtime returns three event types during a session: session_started (confirms configuration and signals readiness), transcript (contains partial or final transcript text with speaker labels and confidence scores), and session_ended (signals the session has closed). Partial transcripts should be displayed as tentative text that updates in real time. Final transcripts replace the partial text and trigger any downstream processing — LLM prompting, database storage, caption display.
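The partial-versus-final rule above can be captured in a small state object: partials overwrite each other as tentative text, and a final commits the text and hands it to downstream processing. The event field names (type, text, is_final) are assumptions for illustration, and speaker labels and confidence scores are ignored to keep the sketch short.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TranscriptState:
    """Committed (final) transcript text plus the current tentative partial."""
    finals: list[str] = field(default_factory=list)
    partial: str = ""

    def handle(self, event: dict) -> Optional[str]:
        """Apply one event; return the text of a final transcript, else None."""
        kind = event.get("type")
        if kind == "transcript":
            text = event.get("text", "")
            if event.get("is_final"):
                self.finals.append(text)
                self.partial = ""  # the final replaces the tentative text
                return text        # caller passes this to LLM, storage, captions
            self.partial = text    # each partial supersedes the previous one
        elif kind == "session_ended":
            self.partial = ""      # drop any uncommitted partial text
        return None

    def display(self) -> str:
        """Text to render: finals as-is, with the partial appended as tentative text."""
        parts = self.finals + ([self.partial] if self.partial else [])
        return " ".join(parts)
```

A caption UI would typically render display() after every event, styling the trailing partial differently (for example greyed out) until a final arrives.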

SDK Integration

The @elevenlabs/client browser SDK, Python SDK, and JavaScript SDK all support Scribe Realtime with the April 2026 keyterms and no_verbatim parameters. The SDK handles WebSocket connection management, authentication, audio format conversion, and event parsing — significantly reducing the implementation complexity compared to raw WebSocket integration.

Three Insights Most Scribe Realtime Coverage Misses

1. Keyterms Is the Feature That Makes Scribe Realtime Enterprise-Ready

The April 27, 2026 addition of the keyterms parameter matters more for enterprise adoption than its changelog entry suggests. General-purpose speech recognition models frequently fail on domain-specific vocabulary, which is the primary reason enterprise customers have historically built custom speech recognition models or paid for specialised medical, legal, or financial ASR providers. The keyterms parameter offers a lightweight alternative: configure the 50 most critical domain terms at session start and improve recognition of specialised vocabulary without custom model training. For enterprise voice agent deployments in regulated industries, this removes a key barrier to running Scribe Realtime in production.

2. No_verbatim Changes the Quality of LLM-Driven Voice Agent Responses

Voice agent developers often underestimate how much filler words and disfluencies affect downstream LLM processing. An LLM prompt containing ‘uh, I was thinking, you know, that the, uh, shipment should maybe be, like, rerouted’ tends to produce lower-quality intent classification and less accurate responses than the same prompt without the filler content. The no_verbatim parameter effectively pre-processes the LLM input, improving agent response quality without any additional prompt engineering. For voice agent developers who have struggled with LLM response quality on voice input, enabling no_verbatim is a simple configuration change with a meaningful quality impact.

3. Scribe Realtime’s Language Breadth at 150ms Latency Is Technically Impressive

Real-time STT systems with broad language support typically pay for that breadth in accuracy or latency: wider language coverage usually means larger models, more computation, and higher latency. Holding latency under 150ms across 90+ languages is therefore technically significant, and commercially valuable for any voice agent or live captioning application serving a global or multilingual user base. It also reinforces ElevenLabs’ broader multilingual positioning and the language coverage that global enterprise deployments require.

Scribe Realtime in 2027

The Scribe Realtime development trajectory points toward three likely improvements over the next 12 months. The 50-term keyterms limit will probably increase: Scribe v2 batch already supports 1,000 keyterms, and closing this gap for real-time is a natural progression. Latency should continue to fall as model efficiency and infrastructure optimisation advance; sub-100ms real-time transcription would meaningfully improve the naturalness of voice agent interactions. And fuller PII redaction, currently limited in Scribe Realtime but available across 56 entity categories in Scribe v2 batch, will likely arrive as enterprise voice agent deployments in regulated industries demand it for compliance.

Key Takeaways

Scribe Realtime delivers live speech-to-text at under 150ms latency via WebSocket — the hearing layer for ElevenLabs voice agents and live captioning applications.
The April 27, 2026 update added keyterms (up to 50 domain-specific terms for vocabulary biasing) and no_verbatim (real-time filler-word removal), both supported in the @elevenlabs/client browser SDK, the Python SDK, and the JavaScript SDK.
Keyterms is the feature that makes Scribe Realtime enterprise-ready for specialised domains — medical, legal, financial, product-specific vocabulary.
No_verbatim improves LLM-driven voice agent response quality by cleaning input before it reaches the reasoning layer.
Use Scribe Realtime for live applications, Scribe v2 batch for maximum accuracy on recorded audio.

Conclusion

ElevenLabs Scribe Realtime is the critical infrastructure layer for any voice agent or live transcription application built on ElevenLabs. The April 27, 2026 update — keyterms and no_verbatim — moves it from a capable general-purpose live STT tool to an enterprise-ready real-time transcription system with domain vocabulary biasing and clean transcript output. For developers building voice agents with ElevenLabs Conversational AI, Scribe Realtime is not optional infrastructure — it is the component that determines whether users feel they are being heard accurately, which in turn determines whether the voice agent feels intelligent or frustrating. Implement keyterms from day one with your domain vocabulary, enable no_verbatim for all production agent deployments, and the Scribe Realtime layer will be effectively invisible to your users — which is exactly what great infrastructure should be.

Frequently Asked Questions

What is ElevenLabs Scribe Realtime?

A WebSocket-based live speech-to-text service from ElevenLabs that transcribes audio with under 150ms latency. It is designed for voice agent applications and live captioning where real-time transcription speed is essential. Supports 90+ languages, speaker diarization, keyterm biasing, and no-verbatim filler word removal.

What is the keyterms parameter in Scribe Realtime?

Added April 27, 2026 — an array of up to 50 strings that bias the transcription model toward correct recognition of specific domain vocabulary, product names, proper nouns, or technical terms. Configured at session start and echoed in the session_started event.

What does no_verbatim mode do?

Removes filler words (uh, um, you know, like), false starts, and disfluencies from the real-time transcript output. Produces cleaner text for downstream LLM processing in voice agents and more readable live captions in professional captioning contexts.

What is the difference between Scribe Realtime and Scribe v2?

Scribe Realtime is optimised for low-latency live transcription (under 150ms). Scribe v2 batch is optimised for maximum accuracy on pre-recorded audio, with features like 48-speaker diarization, PII redaction across 56 entity categories, and 1,000-term keyterm support. Use Realtime for live applications and batch for post-processing.

Which ElevenLabs plans include Scribe Realtime?

Scribe Realtime is available on paid ElevenLabs plans. Check the current ElevenLabs pricing page for the specific plan tiers that include Scribe Realtime API access, as availability may vary by tier.

Methodology

Scribe Realtime specifications from ElevenLabs official documentation. April 27, 2026 keyterms and no_verbatim update from ElevenLabs official changelog (elevenlabs.io/docs/changelog). SDK availability from releasebot.io ElevenLabs changelog summary (April 2026). Scribe v2 batch comparison from ElevenLabs Scribe documentation. This article was drafted with AI assistance and reviewed by the editorial team at ElevenLabsMagazine.com.

References

ElevenLabs. (2026). Scribe documentation. https://elevenlabs.io/docs/api-reference/speech-to-text

ElevenLabs. (April 27, 2026). Changelog — Realtime keyterms and verbatim mode. https://elevenlabs.io/docs/changelog

Releasebot. (April 2026). ElevenLabs April 2026 release notes. https://releasebot.io/updates/eleven-labs
