The ElevenLabs API gives developers programmatic access to every capability on the platform. Unlike competitors that sell a TTS API, a voice cloning tool, and an STT product as separate offerings, ElevenLabs provides a unified API surface where TTS, STT, voice cloning, voice design, sound effects, music, dubbing, and conversational AI agents are all accessible through the same authentication, SDK, and credit system. For teams building voice-enabled applications, this eliminates vendor management overhead and simplifies cost tracking.
| API Capability | Endpoint Category | Use Case | Latency Profile |
| --- | --- | --- | --- |
| Text to Speech | POST /v1/text-to-speech | Voiceover, narration, TTS applications | Flash v2.5: 75ms; Multilingual v2: standard |
| Text to Speech (streaming) | POST /v1/text-to-speech/:id/stream | Real-time voice agents, live TTS | Flash v2.5: 75ms streaming |
| Speech to Text (batch) | POST /v1/speech-to-text | Transcription, subtitles, content pipelines | Async — not real-time |
| Speech to Text (realtime) | WebSocket /v1/speech-to-text/stream | Voice agents, live captioning, meetings | Scribe v2 Realtime: <150ms |
| Voice Design | POST /v1/text-to-voice/design | Generate custom voices from descriptions | Seconds per generation |
| Voice Cloning (IVC) | POST /v1/voices/add | Instant voice clone from audio sample | Fast — single upload |
| Sound Effects | POST /v1/sound-generation | Generate SFX from text prompts | Seconds per generation |
| Music Generation | POST /v1/music-generation | Generate music from text prompts | Seconds to minutes |
| Dubbing | POST /v1/dubbing | Translate and dub video/audio files | Async — depends on duration |
| Conversational AI | WebSocket /v1/convai/conversation | Voice agent conversations | Real-time, <1s end-to-end |
Authentication and Setup
All ElevenLabs API requests require an API key passed in the xi-api-key header. API keys are generated in your ElevenLabs account settings. For production applications, store API keys as environment variables, never in source code. The base URL for all API requests is https://api.elevenlabs.io.
Install the official SDK:
Python: pip install elevenlabs
Node.js: npm install @elevenlabs/elevenlabs-js
The SDK handles authentication, request formatting, error handling, and streaming automatically. For Python, initialise the client with ElevenLabsClient(api_key=os.getenv('ELEVENLABS_API_KEY')). For JavaScript, use import { ElevenLabsClient } from '@elevenlabs/elevenlabs-js'.
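If you prefer not to pull in the SDK, the same authentication works with plain HTTP. The sketch below uses only the standard library; the xi-api-key header and base URL come from the section above, while the ELEVENLABS_API_KEY environment variable name is a convention, not a requirement.

```python
import os
import json
import urllib.request

BASE_URL = "https://api.elevenlabs.io"

def auth_headers(api_key: str) -> dict:
    """Headers required on every ElevenLabs API request."""
    return {
        "xi-api-key": api_key,       # from account settings; never hard-code
        "Content-Type": "application/json",
    }

def get_json(path: str, api_key: str) -> dict:
    """Minimal authenticated GET helper using only the standard library."""
    req = urllib.request.Request(BASE_URL + path, headers=auth_headers(api_key))
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

if __name__ == "__main__":
    key = os.environ["ELEVENLABS_API_KEY"]
    print(get_json("/v1/voices", key))   # e.g. list available voices
```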
Text to Speech API: Core Implementation
Basic TTS Request
The core TTS endpoint (POST /v1/text-to-speech/:voice_id) converts text to audio. Required parameters: voice_id (from voice library or cloned voices), text, model_id. Optional parameters: voice_settings (stability, similarity_boost, style, use_speaker_boost), output_format (mp3_44100_128 default, pcm_44100 for Pro+ plans, various quality tiers).
Model Selection Guide
| Model | model_id | Latency | Use Case | Credit Cost |
| --- | --- | --- | --- | --- |
| Flash v2.5 | eleven_flash_v2_5 | 75ms | Real-time agents, conversational AI, live streaming | Standard |
| Multilingual v2 | eleven_multilingual_v2 | Standard | Production narration, audiobooks, podcasts, 29 languages | Standard |
| Eleven v3 (alpha) | eleven_v3 | Moderate | Emotional character performance, audio tags, multi-speaker dialogue | ~1.5–2x |
| Turbo v2.5 | eleven_turbo_v2_5 | Fast | Balance of speed and quality, high-volume production | Standard |
Streaming TTS for Real-Time Applications
For voice agent applications where the agent needs to respond while text is still being generated, use the streaming endpoint (POST /v1/text-to-speech/:voice_id/stream). This returns audio chunks as they are generated rather than waiting for the complete audio file. Combined with Flash v2.5’s 75ms latency, streaming TTS enables conversational AI applications where the voice response begins playing before the full text is synthesised.
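The consumer side of a streaming response follows this pattern: start playback on the first chunk, then keep appending. The function below is a local sketch testable offline; in production, `chunks` would be the iterator of bytes the streaming endpoint (or SDK) returns.

```python
from typing import Callable, Iterable, Optional

def play_stream(chunks: Iterable[bytes], buffer: bytearray,
                first_chunk_hook: Optional[Callable] = None) -> int:
    """Feed streamed audio chunks into a playback buffer as they arrive.

    first_chunk_hook fires once when the first chunk lands -- e.g. to
    start the audio device before the full file is synthesised.
    Returns the total number of bytes buffered.
    """
    total = 0
    for i, chunk in enumerate(chunks):
        if i == 0 and first_chunk_hook:
            first_chunk_hook()
        buffer.extend(chunk)   # playback reads from this buffer concurrently
        total += len(chunk)
    return total
```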
Voice Settings Optimisation
stability (0–1): Higher values produce more consistent delivery; lower values introduce emotional variation. For neutral narration: 0.75. For emotional content: 0.5–0.6. similarity_boost (0–1): How closely AI adheres to the original voice. For voice clones: 0.85. For library voices: 0.75. style (0–1): Style exaggeration — amplifies the speaker’s natural style. Keep at 0 for most production use cases; only increase for character voice applications. Higher values reduce stability.
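The recommendations above can be captured as presets. The numeric values mirror this section's guidance; treat them as tuning starting points, and note the profile names here are our own labels, not API values.

```python
def voice_settings(profile: str) -> dict:
    """Map a content profile to the voice_settings values described above."""
    presets = {
        "narration": {"stability": 0.75, "similarity_boost": 0.75, "style": 0.0},
        "emotional": {"stability": 0.55, "similarity_boost": 0.75, "style": 0.0},
        "clone":     {"stability": 0.75, "similarity_boost": 0.85, "style": 0.0},
        "character": {"stability": 0.50, "similarity_boost": 0.75, "style": 0.3},
    }
    settings = dict(presets[profile])   # copy so callers can tweak safely
    settings["use_speaker_boost"] = True
    return settings
```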
Scribe v2 STT API: Batch and Realtime
Batch Transcription
Scribe v2 batch (POST /v1/speech-to-text) accepts audio/video files up to the plan limit. Key parameters: audio (file upload), model_id ('scribe_v2'), language_code (optional — auto-detected if omitted), diarize (boolean — speaker identification), keyterms (array of up to 1,000 domain-specific terms), num_speakers (integer — expected speaker count for diarization), tag_audio_events (boolean — detect non-speech events), entity_detection (boolean — identify PII categories), redact_pii (boolean — remove PII before returning transcript).
Non-verbatim mode parameter: remove_filler_words (boolean) — automatically removes um, uh, repeated phrases, and stuttering from the returned transcript.
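A helper can assemble the non-file form fields and enforce the documented limits. The field names follow the parameter list above; how the keyterms array is serialised in multipart form data is an assumption here (JSON-encoded string), so verify against the API reference.

```python
import json
from typing import Optional

def scribe_batch_fields(diarize: bool = False,
                        language_code: Optional[str] = None,
                        keyterms: Optional[list] = None,
                        remove_filler_words: bool = False) -> dict:
    """Assemble non-file form fields for POST /v1/speech-to-text."""
    fields = {"model_id": "scribe_v2"}
    if language_code:
        fields["language_code"] = language_code   # omit for auto-detection
    if diarize:
        fields["diarize"] = "true"
    if keyterms:
        if len(keyterms) > 1000:
            raise ValueError("keyterms is limited to 1,000 entries")
        fields["keyterms"] = json.dumps(keyterms)
    if remove_filler_words:
        fields["remove_filler_words"] = "true"
    return fields
```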
Realtime Transcription via WebSocket
Scribe v2 Realtime uses WebSocket streaming. Connect to wss://api.elevenlabs.io/v1/speech-to-text/stream with your API key. Send audio chunks as binary messages. Receive partial transcripts in real time (under 150ms latency) as JSON objects with is_final flag indicating whether the transcript for that segment is complete or will be updated as more audio is processed. Use commit control to signal utterance boundaries for higher accuracy on speaker turn detection.
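The message handling reduces to two small functions. The JSON field names used below (`transcript`, `is_final`, `type: commit`) follow the description above but may differ by API version, so check the Scribe v2 Realtime schema before relying on them.

```python
import json

def parse_transcript_event(raw: str) -> tuple:
    """Extract (text, is_final) from a realtime transcript message.

    is_final=False means the segment will still be updated as more
    audio is processed; True means it is complete.
    """
    event = json.loads(raw)
    return event.get("transcript", ""), bool(event.get("is_final", False))

def commit_message() -> str:
    """Signal an utterance boundary (commit control) to the server.

    The exact message shape is illustrative, not confirmed schema.
    """
    return json.dumps({"type": "commit"})
```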
Voice Design API
POST /v1/text-to-voice/design accepts: voice_description (string, 20–1,000 chars), text (optional preview text, 100–1,000 chars), guidance_scale (float — lower for creative freedom, higher for prompt adherence; high guidance with vague prompt produces robotic output), seed (integer — same seed + same prompt = same voice for reproducibility), loudness (float, −1 to 1; 0 ≈ −24 LUFS), auto_generate_text (boolean — AI generates preview text matching the description), return_previews (boolean — include audio in response or return only IDs for streaming).
Returns three voice preview objects. Stream previews via GET /v1/text-to-voice/:generated_voice_id/stream. Save a preview to the voice library via POST /v1/text-to-voice/:generated_voice_id/save. Saved voices are immediately usable in TTS requests via the returned voice_id.
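Validating the documented limits before sending the request avoids burning a round trip on a 422. The ranges below come from the parameter list above; the default guidance_scale of 5.0 is an arbitrary illustrative midpoint, not an API default.

```python
from typing import Optional

def voice_design_payload(description: str, *,
                         preview_text: Optional[str] = None,
                         guidance_scale: float = 5.0,
                         seed: Optional[int] = None,
                         loudness: float = 0.0) -> dict:
    """Build and validate a body for POST /v1/text-to-voice/design."""
    if not 20 <= len(description) <= 1000:
        raise ValueError("voice_description must be 20-1,000 characters")
    if preview_text is not None and not 100 <= len(preview_text) <= 1000:
        raise ValueError("preview text must be 100-1,000 characters")
    if not -1.0 <= loudness <= 1.0:
        raise ValueError("loudness must be within -1 to 1")
    body = {"voice_description": description,
            "guidance_scale": guidance_scale,
            "loudness": loudness}
    if preview_text is not None:
        body["text"] = preview_text
    else:
        body["auto_generate_text"] = True
    if seed is not None:
        body["seed"] = seed   # same seed + same prompt = same voice
    return body
```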
Conversational AI SDK
ElevenLabs’ Conversational AI uses WebSocket-based real-time communication. The Python SDK provides a Conversation class that handles the full duplex audio pipeline: microphone input → streaming STT (Scribe v2 Realtime) → LLM processing → streaming TTS (Flash v2.5) → speaker output. For web applications, the JavaScript SDK provides equivalent functionality with browser-compatible audio APIs.
Key configuration parameters: agent_id (from Conversational AI dashboard), voice_id (Flash v2.5 recommended for latency), conversation_config_override (dynamic per-conversation configuration — useful for personalising the agent’s initial context per user session), and callback handlers for on_agent_response, on_user_transcript, on_agent_audio, and on_disconnect events.
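A minimal way to organise those callback handlers is an event router. The event names mirror this section; the router class itself is a local sketch for structuring your application code, not part of the ElevenLabs SDK.

```python
from typing import Callable, Dict

class AgentEventRouter:
    """Route conversational-AI events to registered callback handlers."""

    EVENTS = ("on_agent_response", "on_user_transcript",
              "on_agent_audio", "on_disconnect")

    def __init__(self):
        self.handlers: Dict[str, Callable] = {}

    def on(self, event: str, handler: Callable) -> None:
        """Register a handler for one of the known event names."""
        if event not in self.EVENTS:
            raise ValueError(f"unknown event: {event}")
        self.handlers[event] = handler

    def dispatch(self, event: str, payload=None):
        """Invoke the handler for an incoming event, if one is registered."""
        handler = self.handlers.get(event)
        return handler(payload) if handler else None
```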
For the full Conversational AI platform, including MCP integration and 11.ai, see our ElevenLabs Conversational AI builder’s guide.
API Pricing at Scale: Credit Economics
ElevenLabs charges credits per character for TTS. At the Creator plan ($22/month, 100,000 credits): 1 credit ≈ 1 character. A 500-character script narration costs approximately 500 credits. At scale, real-world credit consumption is 1.5–2x the raw character count due to regenerations for quality correction. Plan your credit budget at 1.75x the raw character volume you expect to generate monthly.
| Plan | Monthly Cost | Credits | Cost per 1M Chars | Best For |
| --- | --- | --- | --- | --- |
| Creator | $22/mo | 100K | $220/1M | Small applications, prototyping, low-volume production |
| Pro | $99/mo | 500K | $198/1M | Medium applications, production voice apps |
| Scale | $330/mo | 2M | $165/1M | High-volume applications, enterprise production |
| Business | $1,320/mo | 11M | $120/1M | Large-scale enterprise production |
| Enterprise | Custom | Custom | Negotiated | Very high volume, custom terms |
Scribe v2 Realtime is priced separately at $0.28/hour of audio processed. Enterprise clients at 30+ concurrent streams receive higher concurrency limits and volume discounts. STT credits are separate from TTS credits in billing.
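The credit arithmetic above reduces to two small helpers. The plan figures in the usage comment come from the table; the 1.75x default reflects the observed 1.5–2x regeneration overhead.

```python
def monthly_credit_budget(raw_chars: int, regen_multiplier: float = 1.75) -> int:
    """Estimate monthly credits needed for a raw character volume.

    1 credit is roughly 1 character; the multiplier covers regenerations
    for quality correction in real-world production.
    """
    return int(raw_chars * regen_multiplier)

def plan_cost_per_million(monthly_cost: float, credits: int) -> float:
    """Effective cost per 1M characters for a plan tier."""
    return monthly_cost / credits * 1_000_000

# Creator ($22/mo, 100K credits) -> $220 per 1M characters;
# Scale ($330/mo, 2M credits) -> $165 per 1M characters.
```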
Production Architecture Patterns
Pattern 1: Async Content Production Pipeline
For content teams generating narration for videos, podcasts, or audiobooks at volume: queue scripts for batch TTS generation using Multilingual v2, store generated audio in object storage (S3, GCS), trigger downstream processing (normalization, chapter segmentation) on completion. This pattern minimises API latency sensitivity since content is generated ahead of playback. Credit monitoring via the /v1/user/subscription endpoint prevents unexpected overages.
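The skeleton of that pipeline looks like this. The synthesize and store callables are injected so the pattern is testable offline; in production they would wrap the batch TTS endpoint and an S3/GCS client respectively.

```python
import queue
from typing import Callable, Iterable

def run_tts_pipeline(scripts: Iterable[str],
                     synthesize: Callable,
                     store: Callable) -> list:
    """Drain a script queue through TTS and into object storage.

    Returns the storage keys/URLs for each generated audio asset, in
    submission order. In a real deployment the queue would be a durable
    broker (SQS, Pub/Sub) rather than an in-process queue.Queue.
    """
    jobs = queue.Queue()
    for script in scripts:
        jobs.put(script)
    stored = []
    while not jobs.empty():
        script = jobs.get()
        audio = synthesize(script)    # batch TTS (Multilingual v2)
        stored.append(store(audio))   # upload; trigger downstream processing
        jobs.task_done()
    return stored
```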
Pattern 2: Real-Time Voice Agent
For conversational AI applications: use Flash v2.5 streaming TTS + Scribe v2 Realtime WebSocket for STT + ElevenLabs Conversational AI SDK for the full duplex agent loop. Implement connection resilience — Scribe v2 Realtime continues transcription seamlessly after connection resets. Cache frequently used voice IDs and settings to avoid repeated API calls on each session initialisation.
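Connection resilience for the WebSocket leg can be sketched as retry-with-backoff around the connect call. The connect callable and the specific delay schedule are local choices, not SDK behaviour.

```python
import time
from typing import Callable

def connect_with_backoff(connect: Callable, max_attempts: int = 5,
                         base_delay: float = 0.5, sleep=time.sleep):
    """Retry a WebSocket connect with exponential backoff.

    connect is any zero-argument callable that raises OSError on
    failure; delays double each attempt (0.5s, 1s, 2s, ...). The last
    failure is re-raised so the caller can surface it.
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except OSError:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```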
Pattern 3: Dynamic Voice Generation Application
For applications where users create custom voices (gaming character creators, personalised assistant apps): implement Voice Design API for voice generation from user descriptions, Voice Remixing API for iterative adjustment, and stream preview audio to the user before committing the voice to their account. Implement rate limiting on your application side — Voice Design requests are subject to ElevenLabs’ API rate limits which increase by plan tier.
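Application-side rate limiting is commonly done with a token bucket. The sketch below allows `rate` requests per second with bursts up to `capacity`; the actual limits you configure should match your ElevenLabs plan tier.

```python
import time

class TokenBucket:
    """Client-side rate limiter, e.g. in front of Voice Design requests."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity      # maximum burst size
        self.tokens = float(capacity)
        self.clock = clock            # injectable for testing
        self.last = clock()

    def allow(self) -> bool:
        """Consume one token if available; False means throttle the request."""
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```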
Error Handling and Rate Limits
ElevenLabs API returns standard HTTP status codes: 200 (success), 400 (bad request — check parameters), 401 (authentication failed — check API key), 422 (validation error — check request body), 429 (rate limited — implement exponential backoff), 500 (server error — retry with backoff). Credit exhaustion returns a 429 with a specific error code indicating credit limit reached rather than request rate limit.
Implement exponential backoff for 429 and 500 errors. Log the xi-request-id header from every response for debugging — this ID allows ElevenLabs support to locate the specific request in their logs. Monitor credit consumption via the /v1/user endpoint which returns current credit balance and plan limits.
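The retry policy above can be expressed as a small wrapper. The call signature here (a callable returning a status code and body) is a local abstraction for testability; in real use it would wrap your HTTP client, and you would log the xi-request-id response header on every attempt.

```python
import random
import time
from typing import Callable

RETRYABLE = {429, 500}

def with_backoff(call: Callable, max_attempts: int = 5,
                 base: float = 1.0, sleep=time.sleep) -> tuple:
    """Retry an API call on 429/500 with jittered exponential backoff.

    Non-retryable statuses (including 400/401/422) return immediately
    so client errors are surfaced rather than retried.
    """
    for attempt in range(max_attempts):
        status, body = call()
        if status not in RETRYABLE:
            return status, body
        if attempt < max_attempts - 1:
            sleep(base * (2 ** attempt) + random.random() * 0.1)
    return status, body   # still failing after all attempts
```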
Key Takeaways
- Flash v2.5 (75ms latency) is the correct TTS model for real-time conversational AI. Multilingual v2 is the correct model for batch production content. Eleven v3 is for expressive/character applications at higher credit cost.
- Scribe v2 Realtime WebSocket is the correct STT architecture for conversational agents. Scribe v2 batch REST is the correct architecture for content pipelines and long-form transcription.
- Budget 1.75x raw character volume for real-world credit consumption — regenerations add approximately 50–75% to raw generation cost in production.
- Use the seed parameter in Voice Design to reproduce specific voices — document seeds alongside prompts for version control.
- Implement exponential backoff for 429 errors and log xi-request-id for every request to enable effective debugging with ElevenLabs support.
Conclusion
ElevenLabs’ unified API is the most complete developer toolkit for voice-enabled applications in 2026. The combination of Flash v2.5’s 75ms TTS latency, Scribe v2 Realtime’s sub-150ms STT, the Conversational AI SDK’s full-duplex agent infrastructure, and the Voice Design and Remixing APIs for programmatic voice creation covers every production voice application pattern without assembling multiple vendor APIs. For developers building production voice applications, the ElevenLabs API is the correct starting point.
Frequently Asked Questions
Which ElevenLabs model should I use for a voice agent?
Flash v2.5 (eleven_flash_v2_5) for TTS — 75ms latency enables real-time conversational responses. Scribe v2 Realtime via WebSocket for STT — sub-150ms transcription. Use the Conversational AI SDK which handles the full duplex pipeline automatically.
How do I implement streaming TTS?
Use POST /v1/text-to-speech/:voice_id/stream with Flash v2.5. The endpoint returns audio chunks as they are generated. Process chunks in your application’s audio playback buffer rather than waiting for the complete file. The Python and JavaScript SDKs handle streaming response processing automatically.
How much does ElevenLabs API cost at scale?
TTS: approximately $165–$220 per 1 million characters depending on plan tier (Scale plan: $165/M, Creator: $220/M). Scribe v2 Realtime STT: $0.28/hour. Budget 1.75x raw character count for real-world production credit consumption including regenerations.
Can I reproduce a specific Voice Design generation?
Yes — use the seed parameter in the Voice Design API. The same seed with the same prompt and parameters produces the same voice. Document seeds and prompts as part of your voice asset management for reproducible deployments.
Methodology
All endpoint and parameter data from official ElevenLabs API documentation. Model latency figures from ElevenLabs’ official model specifications (Flash v2.5: 75ms) and Scribe v2 Realtime documentation (<150ms). Pricing from official ElevenLabs pricing page as of March 2026. Credit consumption multiplier (1.5–2x) from ElevenLabsMagazine.com’s own review findings and published user testing. Drafted with AI assistance, reviewed by ElevenLabsMagazine.com editorial team.
References
ElevenLabs. (2026). API Documentation. https://elevenlabs.io/docs/api-reference
ElevenLabs. (2026). Text to Speech API. https://elevenlabs.io/docs/api-reference/text-to-speech
ElevenLabs. (2026). Speech to Text documentation. https://elevenlabs.io/docs/overview/capabilities/speech-to-text
ElevenLabs. (2026). Voice Design API. https://elevenlabs.io/docs/api-reference/text-to-voice/design
ElevenLabs. (2026). ElevenLabs Python SDK. https://github.com/elevenlabs/elevenlabs-python
ElevenLabs. (2026). Pricing. https://elevenlabs.io/pricing
