The ElevenLabs API gives developers programmatic access to every capability on the platform. Unlike competitors that sell a TTS API, a voice cloning tool, and an STT product as separate offerings, ElevenLabs provides a unified API surface where TTS, STT, voice cloning, voice design, sound effects, music, dubbing, and conversational AI agents are all accessible through the same authentication, SDK, and credit system. For teams building voice-enabled applications, this eliminates vendor management overhead and simplifies cost tracking.
| API Capability | Endpoint Category | Use Case | Latency Profile |
| --- | --- | --- | --- |
| Text to Speech | POST /v1/text-to-speech | Voiceover, narration, TTS applications | Flash v2.5: 75ms; Multilingual v2: standard |
| Text to Speech (streaming) | POST /v1/text-to-speech/:id/stream | Real-time voice agents, live TTS | Flash v2.5: 75ms streaming |
| Speech to Text (batch) | POST /v1/speech-to-text | Transcription, subtitles, content pipelines | Async — not real-time |
| Speech to Text (realtime) | WebSocket /v1/speech-to-text/stream | Voice agents, live captioning, meetings | Scribe v2 Realtime: <150ms |
| Voice Design | POST /v1/text-to-voice/design | Generate custom voices from descriptions | Seconds per generation |
| Voice Cloning (IVC) | POST /v1/voices/add | Instant voice clone from audio sample | Fast — single upload |
| Sound Effects | POST /v1/sound-generation | Generate SFX from text prompts | Seconds per generation |
| Music Generation | POST /v1/music-generation | Generate music from text prompts | Seconds to minutes |
| Dubbing | POST /v1/dubbing | Translate and dub video/audio files | Async — depends on duration |
| Conversational AI | WebSocket /v1/convai/conversation | Voice agent conversations | Real-time, <1s end-to-end |
Authentication and Setup
All ElevenLabs API requests require an API key passed in the xi-api-key header. API keys are generated in your ElevenLabs account settings. For production applications, store API keys as environment variables, never in source code. The base URL for all API requests is https://api.elevenlabs.io.
Install the official SDK:
Python: pip install elevenlabs
Node.js: npm install @elevenlabs/elevenlabs-js
The SDK handles authentication, request formatting, error handling, and streaming automatically. For Python, initialise the client with ElevenLabsClient(api_key=os.getenv('ELEVENLABS_API_KEY')). For JavaScript, use import { ElevenLabsClient } from '@elevenlabs/elevenlabs-js'.
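If you prefer not to pull in the SDK, the same authentication works with plain HTTP. The sketch below uses only the standard library; the xi-api-key header and base URL come from the section above, while the ELEVENLABS_API_KEY environment variable name is a convention, not a requirement.

```python
import os
import json
import urllib.request

BASE_URL = "https://api.elevenlabs.io"

def auth_headers(api_key: str) -> dict:
    """Headers required on every ElevenLabs API request."""
    return {
        "xi-api-key": api_key,       # from account settings; never hard-code
        "Content-Type": "application/json",
    }

def get_json(path: str, api_key: str) -> dict:
    """Minimal authenticated GET helper using only the standard library."""
    req = urllib.request.Request(BASE_URL + path, headers=auth_headers(api_key))
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

if __name__ == "__main__":
    key = os.environ["ELEVENLABS_API_KEY"]
    print(get_json("/v1/voices", key))   # e.g. list available voices
```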
Text to Speech API: Core Implementation
Basic TTS Request
The core TTS endpoint (POST /v1/text-to-speech/:voice_id) converts text to audio. Required parameters: voice_id (from voice library or cloned voices), text, model_id. Optional parameters: voice_settings (stability, similarity_boost, style, use_speaker_boost), output_format (mp3_44100_128 default, pcm_44100 for Pro+ plans, various quality tiers).
Model Selection Guide
| Model | model_id | Latency | Use Case | Credit Cost |
| --- | --- | --- | --- | --- |
| Flash v2.5 | eleven_flash_v2_5 | 75ms | Real-time agents, conversational AI, live streaming | Standard |
| Multilingual v2 | eleven_multilingual_v2 | Standard | Production narration, audiobooks, podcasts, 29 languages | Standard |
| Eleven v3 (alpha) | eleven_v3 | Moderate | Emotional character performance, audio tags, multi-speaker dialogue | ~1.5–2x |
| Turbo v2.5 | eleven_turbo_v2_5 | Fast | Balance of speed and quality, high-volume production | Standard |
Streaming TTS for Real-Time Applications
For voice agent applications where the agent needs to respond while text is still being generated, use the streaming endpoint (POST /v1/text-to-speech/:voice_id/stream). This returns audio chunks as they are generated rather than waiting for the complete audio file. Combined with Flash v2.5’s 75ms latency, streaming TTS enables conversational AI applications where the voice response begins playing before the full text is synthesised.
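The consumer side of a streaming response follows this pattern: start playback on the first chunk, then keep appending. The function below is a local sketch testable offline; in production, `chunks` would be the iterator of bytes the streaming endpoint (or SDK) returns.

```python
from typing import Callable, Iterable, Optional

def play_stream(chunks: Iterable[bytes], buffer: bytearray,
                first_chunk_hook: Optional[Callable] = None) -> int:
    """Feed streamed audio chunks into a playback buffer as they arrive.

    first_chunk_hook fires once when the first chunk lands -- e.g. to
    start the audio device before the full file is synthesised.
    Returns the total number of bytes buffered.
    """
    total = 0
    for i, chunk in enumerate(chunks):
        if i == 0 and first_chunk_hook:
            first_chunk_hook()
        buffer.extend(chunk)   # playback reads from this buffer concurrently
        total += len(chunk)
    return total
```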
Voice Settings Optimisation
stability (0–1): Higher values produce more consistent delivery; lower values introduce emotional variation. For neutral narration: 0.75. For emotional content: 0.5–0.6. similarity_boost (0–1): How closely AI adheres to the original voice. For voice clones: 0.85. For library voices: 0.75. style (0–1): Style exaggeration — amplifies the speaker’s natural style. Keep at 0 for most production use cases; only increase for character voice applications. Higher values reduce stability.
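The recommendations above can be captured as presets. The numeric values mirror this section's guidance; treat them as tuning starting points, and note the profile names here are our own labels, not API values.

```python
def voice_settings(profile: str) -> dict:
    """Map a content profile to the voice_settings values described above."""
    presets = {
        "narration": {"stability": 0.75, "similarity_boost": 0.75, "style": 0.0},
        "emotional": {"stability": 0.55, "similarity_boost": 0.75, "style": 0.0},
        "clone":     {"stability": 0.75, "similarity_boost": 0.85, "style": 0.0},
        "character": {"stability": 0.50, "similarity_boost": 0.75, "style": 0.3},
    }
    settings = dict(presets[profile])   # copy so callers can tweak safely
    settings["use_speaker_boost"] = True
    return settings
```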
Scribe v2 STT API: Batch and Realtime
Batch Transcription
Scribe v2 batch (POST /v1/speech-to-text) accepts audio/video files up to the plan limit. Key parameters: audio (file upload), model_id ('scribe_v2'), language_code (optional — auto-detected if omitted), diarize (boolean — speaker identification), keyterms (array of up to 1,000 domain-specific terms), num_speakers (integer — expected speaker count for diarization), tag_audio_events (boolean — detect non-speech events), entity_detection (boolean — identify PII categories), redact_pii (boolean — remove PII before returning transcript).
Non-verbatim mode parameter: remove_filler_words (boolean) — automatically removes um, uh, repeated phrases, and stuttering from the returned transcript.
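A helper can assemble the non-file form fields and enforce the documented limits. The field names follow the parameter list above; how the keyterms array is serialised in multipart form data is an assumption here (JSON-encoded string), so verify against the API reference.

```python
import json
from typing import Optional

def scribe_batch_fields(diarize: bool = False,
                        language_code: Optional[str] = None,
                        keyterms: Optional[list] = None,
                        remove_filler_words: bool = False) -> dict:
    """Assemble non-file form fields for POST /v1/speech-to-text."""
    fields = {"model_id": "scribe_v2"}
    if language_code:
        fields["language_code"] = language_code   # omit for auto-detection
    if diarize:
        fields["diarize"] = "true"
    if keyterms:
        if len(keyterms) > 1000:
            raise ValueError("keyterms is limited to 1,000 entries")
        fields["keyterms"] = json.dumps(keyterms)
    if remove_filler_words:
        fields["remove_filler_words"] = "true"
    return fields
```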
Realtime Transcription via WebSocket
Scribe v2 Realtime uses WebSocket streaming. Connect to wss://api.elevenlabs.io/v1/speech-to-text/stream with your API key. Send audio chunks as binary messages. Receive partial transcripts in real time (under 150ms latency) as JSON objects with is_final flag indicating whether the transcript for that segment is complete or will be updated as more audio is processed. Use commit control to signal utterance boundaries for higher accuracy on speaker turn detection.
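The message handling reduces to two small functions. The JSON field names used below (`transcript`, `is_final`, `type: commit`) follow the description above but may differ by API version, so check the Scribe v2 Realtime schema before relying on them.

```python
import json

def parse_transcript_event(raw: str) -> tuple:
    """Extract (text, is_final) from a realtime transcript message.

    is_final=False means the segment will still be updated as more
    audio is processed; True means it is complete.
    """
    event = json.loads(raw)
    return event.get("transcript", ""), bool(event.get("is_final", False))

def commit_message() -> str:
    """Signal an utterance boundary (commit control) to the server.

    The exact message shape is illustrative, not confirmed schema.
    """
    return json.dumps({"type": "commit"})
```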
Voice Design API
POST /v1/text-to-voice/design accepts: voice_description (string, 20–1,000 chars), text (optional preview text, 100–1,000 chars), guidance_scale (float — lower for creative freedom, higher for prompt adherence; high guidance with vague prompt produces robotic output), seed (integer — same seed + same prompt = same voice for reproducibility), loudness (float, −1 to 1; 0 ≈ −24 LUFS), auto_generate_text (boolean — AI generates preview text matching the description), return_previews (boolean — include audio in response or return only IDs for streaming).
Returns three voice preview objects. Stream previews via GET /v1/text-to-voice/:generated_voice_id/stream. Save a preview to the voice library via POST /v1/text-to-voice/:generated_voice_id/save. Saved voices are immediately usable in TTS requests via the returned voice_id.
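Validating the documented limits before sending the request avoids burning a round trip on a 422. The ranges below come from the parameter list above; the default guidance_scale of 5.0 is an arbitrary illustrative midpoint, not an API default.

```python
from typing import Optional

def voice_design_payload(description: str, *,
                         preview_text: Optional[str] = None,
                         guidance_scale: float = 5.0,
                         seed: Optional[int] = None,
                         loudness: float = 0.0) -> dict:
    """Build and validate a body for POST /v1/text-to-voice/design."""
    if not 20 <= len(description) <= 1000:
        raise ValueError("voice_description must be 20-1,000 characters")
    if preview_text is not None and not 100 <= len(preview_text) <= 1000:
        raise ValueError("preview text must be 100-1,000 characters")
    if not -1.0 <= loudness <= 1.0:
        raise ValueError("loudness must be within -1 to 1")
    body = {"voice_description": description,
            "guidance_scale": guidance_scale,
            "loudness": loudness}
    if preview_text is not None:
        body["text"] = preview_text
    else:
        body["auto_generate_text"] = True
    if seed is not None:
        body["seed"] = seed   # same seed + same prompt = same voice
    return body
```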
Conversational AI SDK
ElevenLabs’ Conversational AI uses WebSocket-based real-time communication. The Python SDK provides a Conversation class that handles the full duplex audio pipeline: microphone input → streaming STT (Scribe v2 Realtime) → LLM processing → streaming TTS (Flash v2.5) → speaker output. For web applications, the JavaScript SDK provides equivalent functionality with browser-compatible audio APIs.
Key configuration parameters: agent_id (from Conversational AI dashboard), voice_id (Flash v2.5 recommended for latency), conversation_config_override (dynamic per-conversation configuration — useful for personalising the agent’s initial context per user session), and callback handlers for on_agent_response, on_user_transcript, on_agent_audio, and on_disconnect events.
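A minimal way to organise those callback handlers is an event router. The event names mirror this section; the router class itself is a local sketch for structuring your application code, not part of the ElevenLabs SDK.

```python
from typing import Callable, Dict

class AgentEventRouter:
    """Route conversational-AI events to registered callback handlers."""

    EVENTS = ("on_agent_response", "on_user_transcript",
              "on_agent_audio", "on_disconnect")

    def __init__(self):
        self.handlers: Dict[str, Callable] = {}

    def on(self, event: str, handler: Callable) -> None:
        """Register a handler for one of the known event names."""
        if event not in self.EVENTS:
            raise ValueError(f"unknown event: {event}")
        self.handlers[event] = handler

    def dispatch(self, event: str, payload=None):
        """Invoke the handler for an incoming event, if one is registered."""
        handler = self.handlers.get(event)
        return handler(payload) if handler else None
```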
For the full Conversational AI platform, including MCP integration and 11.ai, see our ElevenLabs Conversational AI builder’s guide.
API Pricing at Scale: Credit Economics
ElevenLabs charges credits per character for TTS. At the Creator plan ($22/month, 100,000 credits): 1 credit ≈ 1 character. A 500-character script narration costs approximately 500 credits. At scale, real-world credit consumption is 1.5–2x the raw character count due to regenerations for quality correction. Plan your credit budget at 1.75x the raw character volume you expect to generate monthly.
| Plan | Monthly Cost | Credits | Cost per 1M Chars | Best For |
| --- | --- | --- | --- | --- |
| Creator | $22/mo | 100K | $220/1M | Small applications, prototyping, low-volume production |
| Pro | $99/mo | 500K | $198/1M | Medium applications, production voice apps |
| Scale | $330/mo | 2M | $165/1M | High-volume applications, enterprise production |
| Business | $1,320/mo | 11M | $120/1M | Large-scale enterprise production |
| Enterprise | Custom | Custom | Negotiated | Very high volume, custom terms |
Scribe v2 Realtime is priced separately at $0.28/hour of audio processed. Enterprise clients at 30+ concurrent streams receive higher concurrency limits and volume discounts. STT credits are separate from TTS credits in billing.
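The credit arithmetic above reduces to two small helpers. The plan figures in the usage comment come from the table; the 1.75x default reflects the observed 1.5–2x regeneration overhead.

```python
def monthly_credit_budget(raw_chars: int, regen_multiplier: float = 1.75) -> int:
    """Estimate monthly credits needed for a raw character volume.

    1 credit is roughly 1 character; the multiplier covers regenerations
    for quality correction in real-world production.
    """
    return int(raw_chars * regen_multiplier)

def plan_cost_per_million(monthly_cost: float, credits: int) -> float:
    """Effective cost per 1M characters for a plan tier."""
    return monthly_cost / credits * 1_000_000

# Creator ($22/mo, 100K credits) -> $220 per 1M characters;
# Scale ($330/mo, 2M credits) -> $165 per 1M characters.
```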
Production Architecture Patterns
Pattern 1: Async Content Production Pipeline
For content teams generating narration for videos, podcasts, or audiobooks at volume: queue scripts for batch TTS generation using Multilingual v2, store generated audio in object storage (S3, GCS), trigger downstream processing (normalization, chapter segmentation) on completion. This pattern minimises API latency sensitivity since content is generated ahead of playback. Credit monitoring via the /v1/user/subscription endpoint prevents unexpected overages.
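The skeleton of that pipeline looks like this. The synthesize and store callables are injected so the pattern is testable offline; in production they would wrap the batch TTS endpoint and an S3/GCS client respectively.

```python
import queue
from typing import Callable, Iterable

def run_tts_pipeline(scripts: Iterable[str],
                     synthesize: Callable,
                     store: Callable) -> list:
    """Drain a script queue through TTS and into object storage.

    Returns the storage keys/URLs for each generated audio asset, in
    submission order. In a real deployment the queue would be a durable
    broker (SQS, Pub/Sub) rather than an in-process queue.Queue.
    """
    jobs = queue.Queue()
    for script in scripts:
        jobs.put(script)
    stored = []
    while not jobs.empty():
        script = jobs.get()
        audio = synthesize(script)    # batch TTS (Multilingual v2)
        stored.append(store(audio))   # upload; trigger downstream processing
        jobs.task_done()
    return stored
```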
Pattern 2: Real-Time Voice Agent
For conversational AI applications: use Flash v2.5 streaming TTS + Scribe v2 Realtime WebSocket for STT + ElevenLabs Conversational AI SDK for the full duplex agent loop. Implement connection resilience — Scribe v2 Realtime continues transcription seamlessly after connection resets. Cache frequently used voice IDs and settings to avoid repeated API calls on each session initialisation.
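Connection resilience for the WebSocket leg can be sketched as retry-with-backoff around the connect call. The connect callable and the specific delay schedule are local choices, not SDK behaviour.

```python
import time
from typing import Callable

def connect_with_backoff(connect: Callable, max_attempts: int = 5,
                         base_delay: float = 0.5, sleep=time.sleep):
    """Retry a WebSocket connect with exponential backoff.

    connect is any zero-argument callable that raises OSError on
    failure; delays double each attempt (0.5s, 1s, 2s, ...). The last
    failure is re-raised so the caller can surface it.
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except OSError:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```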
Pattern 3: Dynamic Voice Generation Application
For applications where users create custom voices (gaming character creators, personalised assistant apps): implement Voice Design API for voice generation from user descriptions, Voice Remixing API for iterative adjustment, and stream preview audio to the user before committing the voice to their account. Implement rate limiting on your application side — Voice Design requests are subject to ElevenLabs’ API rate limits which increase by plan tier.
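Application-side rate limiting is commonly done with a token bucket. The sketch below allows `rate` requests per second with bursts up to `capacity`; the actual limits you configure should match your ElevenLabs plan tier.

```python
import time

class TokenBucket:
    """Client-side rate limiter, e.g. in front of Voice Design requests."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity      # maximum burst size
        self.tokens = float(capacity)
        self.clock = clock            # injectable for testing
        self.last = clock()

    def allow(self) -> bool:
        """Consume one token if available; False means throttle the request."""
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```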
Error Handling and Rate Limits
ElevenLabs API returns standard HTTP status codes: 200 (success), 400 (bad request — check parameters), 401 (authentication failed — check API key), 422 (validation error — check request body), 429 (rate limited — implement exponential backoff), 500 (server error — retry with backoff). Credit exhaustion returns a 429 with a specific error code indicating credit limit reached rather than request rate limit.
Implement exponential backoff for 429 and 500 errors. Log the xi-request-id header from every response for debugging — this ID allows ElevenLabs support to locate the specific request in their logs. Monitor credit consumption via the /v1/user endpoint which returns current credit balance and plan limits.
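The retry policy above can be expressed as a small wrapper. The call signature here (a callable returning a status code and body) is a local abstraction for testability; in real use it would wrap your HTTP client, and you would log the xi-request-id response header on every attempt.

```python
import random
import time
from typing import Callable

RETRYABLE = {429, 500}

def with_backoff(call: Callable, max_attempts: int = 5,
                 base: float = 1.0, sleep=time.sleep) -> tuple:
    """Retry an API call on 429/500 with jittered exponential backoff.

    Non-retryable statuses (including 400/401/422) return immediately
    so client errors are surfaced rather than retried.
    """
    for attempt in range(max_attempts):
        status, body = call()
        if status not in RETRYABLE:
            return status, body
        if attempt < max_attempts - 1:
            sleep(base * (2 ** attempt) + random.random() * 0.1)
    return status, body   # still failing after all attempts
```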
Key Takeaways
- Flash v2.5 (75ms latency) is the correct TTS model for real-time conversational AI. Multilingual v2 is the correct model for batch production content. Eleven v3 is for expressive/character applications at higher credit cost.
- Scribe v2 Realtime WebSocket is the correct STT architecture for conversational agents. Scribe v2 batch REST is the correct architecture for content pipelines and long-form transcription.
- Budget 1.75x raw character volume for real-world credit consumption — regenerations add approximately 50–75% to raw generation cost in production.
- Use the seed parameter in Voice Design to reproduce specific voices — document seeds alongside prompts for version control.
- Implement exponential backoff for 429 errors and log xi-request-id for every request to enable effective debugging with ElevenLabs support.
Conclusion
ElevenLabs’ unified API is the most complete developer toolkit for voice-enabled applications in 2026. The combination of Flash v2.5’s 75ms TTS latency, Scribe v2 Realtime’s sub-150ms STT, the Conversational AI SDK’s full-duplex agent infrastructure, and the Voice Design and Remixing APIs for programmatic voice creation covers every production voice application pattern without assembling multiple vendor APIs. For developers building production voice applications, the ElevenLabs API is the correct starting point.
Frequently Asked Questions
Which ElevenLabs model should I use for a voice agent?
Flash v2.5 (eleven_flash_v2_5) for TTS — 75ms latency enables real-time conversational responses. Scribe v2 Realtime via WebSocket for STT — sub-150ms transcription. Use the Conversational AI SDK which handles the full duplex pipeline automatically.
How do I implement streaming TTS?
Use POST /v1/text-to-speech/:voice_id/stream with Flash v2.5. The endpoint returns audio chunks as they are generated. Process chunks in your application’s audio playback buffer rather than waiting for the complete file. The Python and JavaScript SDKs handle streaming response processing automatically.
How much does ElevenLabs API cost at scale?
TTS: approximately $165–$220 per 1 million characters depending on plan tier (Scale plan: $165/M, Creator: $220/M). Scribe v2 Realtime STT: $0.28/hour. Budget 1.75x raw character count for real-world production credit consumption including regenerations.
Can I reproduce a specific Voice Design generation?
Yes — use the seed parameter in the Voice Design API. The same seed with the same prompt and parameters produces the same voice. Document seeds and prompts as part of your voice asset management for reproducible deployments.
Methodology
All endpoint and parameter data from official ElevenLabs API documentation. Model latency figures from ElevenLabs’ official model specifications (Flash v2.5: 75ms) and Scribe v2 Realtime documentation (<150ms). Pricing from official ElevenLabs pricing page as of March 2026. Credit consumption multiplier (1.5–2x) from ElevenLabsMagazine.com’s own review findings and published user testing. Drafted with AI assistance, reviewed by ElevenLabsMagazine.com editorial team.
References
ElevenLabs. (2026). API Documentation. https://elevenlabs.io/docs/api-reference
ElevenLabs. (2026). Text to Speech API. https://elevenlabs.io/docs/api-reference/text-to-speech
ElevenLabs. (2026). Speech to Text documentation. https://elevenlabs.io/docs/overview/capabilities/speech-to-text
ElevenLabs. (2026). Voice Design API. https://elevenlabs.io/docs/api-reference/text-to-voice/design
ElevenLabs. (2026). ElevenLabs Python SDK. https://github.com/elevenlabs/elevenlabs-python
ElevenLabs. (2026). Pricing. https://elevenlabs.io/pricing
