voice_id — and is honest about the one thing that decides
whether a clone sounds great or gets rejected: the quality of your reference
clip.
How it fits together
Enrollment is quick but not instant: you upload a clip, the voice starts as pending, and it becomes ready once it passes the quality gate. A ready
voice_id works immediately in POST /v1/audio/speech and as an Omni agent’s
voice.
Voice cloning is English-only today and requires the
voice:clone scope.
Cloning copies a real person’s voice, so only clone voices you have explicit
permission to use.
What makes a good reference clip
The cloner is gated on real acoustic quality, not file metadata. The single most important requirement: the clip must carry genuine full-band audio (real energy up to ~24 kHz, as from a true 48 kHz capture), not an 8 kHz phone call that has been upsampled to look like a 48 kHz file. Upsampling adds samples, not bandwidth, and the gate sees through it. A clip that passes cleanly is:
- ~6–15 seconds of continuous, natural speech (not a single word, not a 3-minute monologue).
- Genuinely wideband — recorded at 24 kHz or higher with real high-frequency content. A mid-quality phone mic in a quiet room is fine; a telephone recording is not.
- One speaker only. No second voice, no crosstalk, no background conversation.
- Clean — minimal background noise, no music, no reverb-heavy rooms, no compression artifacts.
- Consistent — even volume, no clipping, no long silences.
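You can catch several of these problems locally before uploading. The sketch below uses only Python's standard library and assumes an uncompressed 16-bit PCM WAV. The duration and sample-rate thresholds mirror the list above; the clipping cutoff (samples at or above 32,700) and the 0.1% tolerance are illustrative assumptions, not the service's actual gate. It cannot detect upsampling: a narrowband call resampled to 48 kHz passes this check because only the header rate is inspected, so the server-side bandwidth gate remains the final word.

```python
import array
import io
import sys
import wave
from dataclasses import dataclass


@dataclass
class ClipReport:
    duration_s: float
    sample_rate: int
    channels: int
    clipped_fraction: float
    problems: list[str]


def check_reference_clip(source, min_s=6.0, max_s=15.0, min_rate=24_000,
                         max_clipped_fraction=0.001) -> ClipReport:
    """Local pre-flight checks on a 16-bit PCM WAV (path, file object, or bytes).

    Thresholds are illustrative assumptions, not the service's real gate.
    A 48 kHz header does NOT prove wideband content: upsampled narrowband
    audio passes this check and will still be rejected server-side.
    """
    if isinstance(source, (bytes, bytearray)):
        source = io.BytesIO(source)
    with wave.open(source, "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
        width = wf.getsampwidth()
        n_frames = wf.getnframes()
        raw = wf.readframes(n_frames)

    problems = []
    duration = n_frames / rate
    if not min_s <= duration <= max_s:
        problems.append(f"duration {duration:.1f}s is outside {min_s:g}-{max_s:g}s")
    if rate < min_rate:
        problems.append(f"sample rate {rate} Hz is below {min_rate} Hz")

    clipped_fraction = 0.0
    if width != 2:
        problems.append(f"expected 16-bit PCM, got {8 * width}-bit")
    else:
        samples = array.array("h")
        samples.frombytes(raw)
        if sys.byteorder == "big":  # WAV sample data is little-endian
            samples.byteswap()
        near_full_scale = 32_700  # assumed cutoff for "at the rail"
        clipped = sum(1 for s in samples if abs(s) >= near_full_scale)
        clipped_fraction = clipped / len(samples) if samples else 0.0
        if clipped_fraction > max_clipped_fraction:
            problems.append(f"{clipped_fraction:.2%} of samples are at or near full scale")

    return ClipReport(duration, rate, channels, clipped_fraction, problems)
```

Run it on your candidate clip and resolve anything in `problems` before enrolling; an empty list means the clip cleared these basic checks, not that it will pass the real gate.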
Build it
Prepare the reference clip
Trim to a clean 6–15 second span where one person speaks continuously. Keep
it as WAV/PCM if you can; avoid re-encoding a lossy file or upsampling a
narrowband source — neither adds the bandwidth the gate needs.
Enroll the voice
POST /v1/voice/clones is a multipart upload: a name and the audio file.
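Python's standard library has no multipart encoder, so this sketch builds the body by hand. The base URL, the `VOICE_API_BASE` / `VOICE_API_KEY` environment variables, Bearer authentication, and the form field names `name` and `file` are all assumptions; check the API reference for the real ones.

```python
import json
import os
import urllib.request
import uuid

# Hypothetical configuration: adjust to the real base URL and auth scheme.
BASE_URL = os.environ.get("VOICE_API_BASE", "https://api.example.com")
API_KEY = os.environ.get("VOICE_API_KEY", "")


def encode_multipart(fields: dict[str, str], file_field: str, filename: str,
                     data: bytes, content_type: str = "audio/wav") -> tuple[bytes, str]:
    """Build a multipart/form-data body; returns (body, Content-Type header)."""
    boundary = uuid.uuid4().hex
    parts = []
    for key, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{key}"'
            f"\r\n\r\n{value}\r\n".encode()
        )
    parts.append(
        (f'--{boundary}\r\nContent-Disposition: form-data; name="{file_field}"; '
         f'filename="{filename}"\r\nContent-Type: {content_type}\r\n\r\n').encode()
        + data + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def enroll(name: str, wav_path: str) -> dict:
    """POST the clip; the Voice JSON should carry an id and a status."""
    with open(wav_path, "rb") as fh:
        audio = fh.read()
    # "name" and "file" are assumed field names.
    body, ctype = encode_multipart({"name": name}, "file",
                                   os.path.basename(wav_path), audio)
    req = urllib.request.Request(
        f"{BASE_URL}/v1/voice/clones", data=body, method="POST",
        headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": ctype},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

A key without the `voice:clone` scope would surface here as a `urllib.error.HTTPError` with status 403.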
It returns a Voice with an id and a status, typically pending while
the clip is processed.
Wait until it's ready
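Waiting is a plain polling loop. A minimal sketch, assuming a hypothetical base URL and Bearer auth read from the `VOICE_API_BASE` / `VOICE_API_KEY` environment variables, and a Voice JSON whose `status` field takes the values pending, ready, and failed. The `fetch` and `sleep` parameters exist only so the loop can be exercised without a network; the 2 s interval and 300 s timeout are arbitrary choices.

```python
import json
import os
import time
import urllib.request
from typing import Callable

BASE_URL = os.environ.get("VOICE_API_BASE", "https://api.example.com")  # hypothetical
API_KEY = os.environ.get("VOICE_API_KEY", "")


def get_voice(voice_id: str) -> dict:
    """GET /v1/voice/clones/{id} and decode the Voice JSON."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/voice/clones/{voice_id}",
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def wait_until_ready(voice_id: str, fetch: Callable[[str], dict] = get_voice,
                     interval_s: float = 2.0, timeout_s: float = 300.0,
                     sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll until status is ready; raise if it fails or never settles."""
    deadline = time.monotonic() + timeout_s
    while True:
        voice = fetch(voice_id)
        status = voice.get("status")
        if status == "ready":
            return voice
        if status == "failed":
            # Check the rejection table for what to fix in the clip.
            raise RuntimeError(f"clone {voice_id} failed the quality gate: {voice}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"clone {voice_id} still {status!r} after {timeout_s}s")
        sleep(interval_s)
```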
Poll GET /v1/voice/clones/{id} until status flips to ready. If the clip
fails the quality gate, the status goes to failed; see the rejection table
below for what to fix.
Preview it
Synthesize a short line to sanity-check the clone before you ship it. This is
just POST /v1/audio/speech with your new voice_id. Listen critically: if it
sounds muffled, robotic, or off-timbre, the clip is almost always the cause.
Re-record per the requirements above rather than re-running enrollment on the
same audio.
Synthesize with the cloned voice
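A minimal preview sketch. Only the endpoint and the `voice` field come from this guide; the base URL, Bearer auth, the `input` field for the text, and the assumption that the response body is raw audio you can write straight to `hello.wav` are all hypothetical.

```python
import json
import os
import urllib.request

BASE_URL = os.environ.get("VOICE_API_BASE", "https://api.example.com")  # hypothetical
API_KEY = os.environ.get("VOICE_API_KEY", "")


def build_speech_request(voice_id: str, text: str) -> urllib.request.Request:
    # "voice" is documented; "input" is an assumed field name for the text.
    payload = {"voice": voice_id, "input": text}
    return urllib.request.Request(
        f"{BASE_URL}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={"Authorization": f"Bearer {API_KEY}",
                 "Content-Type": "application/json"},
    )


def synthesize(voice_id: str, text: str, out_path: str = "hello.wav") -> str:
    """Request speech and save the (assumed raw audio) response body."""
    with urllib.request.urlopen(build_speech_request(voice_id, text)) as resp:
        audio = resp.read()
    with open(out_path, "wb") as fh:
        fh.write(audio)
    return out_path
```

With your own id, `synthesize("voice_abc", "Hello from my cloned voice.")` would write the preview to `hello.wav`. Because the clone is used exactly like a stock voice, the same request works for any other voice id.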
Once you’re happy, the clone is a first-class voice everywhere TTS is
accepted. Pass voice: voice_abc exactly as you would a stock voice id.
Use the clone in an Omni agent
A ready cloned voice can be an Omni agent’s speaking voice. In the
console Agent Builder, set the agent’s voice to your voice_id; every realtime
session then speaks in the cloned voice with no code change. Connect exactly
as in the browser voice agent guide; only the agent’s configured voice differs.
Run it
Play hello.wav: it should be recognizably the speaker from your clip. Manage
your clones any time with GET /v1/voice/clones (list) and delete one with
DELETE /v1/voice/clones/{id}. Clones are tenant-isolated, so you only ever see
and touch your own.
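Listing and deleting are plain GET and DELETE calls. A sketch under stated assumptions: the base URL, Bearer auth, and the `VOICE_API_BASE` / `VOICE_API_KEY` environment variables are hypothetical, and because the list response shape is not documented here, `list_clones` returns the decoded JSON untouched. `urlopen` raises `urllib.error.HTTPError` for a 403 or 404, which lines up with the rejection table below.

```python
import json
import os
import urllib.parse
import urllib.request

BASE_URL = os.environ.get("VOICE_API_BASE", "https://api.example.com")  # hypothetical
API_KEY = os.environ.get("VOICE_API_KEY", "")


def _request(method: str, path: str) -> urllib.request.Request:
    return urllib.request.Request(
        f"{BASE_URL}{path}", method=method,
        headers={"Authorization": f"Bearer {API_KEY}"},
    )


def build_list_request() -> urllib.request.Request:
    return _request("GET", "/v1/voice/clones")


def build_delete_request(voice_id: str) -> urllib.request.Request:
    # Percent-encode the id so it cannot alter the path.
    return _request("DELETE", f"/v1/voice/clones/{urllib.parse.quote(voice_id, safe='')}")


def list_clones():
    """Return the decoded list response (its exact shape is not documented here)."""
    with urllib.request.urlopen(build_list_request()) as resp:
        return json.load(resp)


def delete_clone(voice_id: str) -> int:
    """Delete one clone and return the HTTP status code.

    A 404 (another tenant's id, or a typo) raises urllib.error.HTTPError.
    """
    with urllib.request.urlopen(build_delete_request(voice_id)) as resp:
        return resp.status
```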
“Why was my clip rejected?”
The most common support question, answered honestly. A failed status almost
always traces to one of these:
| What you hear / see | Root cause | Fix |
|---|---|---|
| failed immediately; “insufficient bandwidth” | Narrowband audio (e.g. an 8 kHz phone call) upsampled to look like 48 kHz | Record genuinely wideband at ≥24 kHz; upsampling adds samples, not bandwidth, and the gate detects it |
| failed; “clip too short/long” | Outside the ~6–15 s window | Trim to a continuous 6–15 s of speech |
| failed; “multiple speakers” | Two voices, crosstalk, or background conversation | Use a clip with exactly one speaker and no overlap |
| Clone sounds muffled or dull | Real high frequencies missing (lossy/telephone source) | Re-record from a wideband source; don’t denoise away the highs |
| Clone sounds robotic or unstable | Background music, reverb, or clipping in the clip | Record in a quiet, dry room; keep levels below clipping |
| Timbre is “almost right” but off | Too little usable speech, or inconsistent volume | Provide a fuller, evenly-leveled 10–15 s sample |
| 403 forbidden on enroll | Key missing the voice:clone scope | Add voice:clone in the console |
| Non-English clip behaves oddly | Cloning is English-only today | Use an English reference clip |
| 404 on get/delete | Voice belongs to another tenant (or wrong id) | Clones are tenant-isolated; use an id your key owns |
Next steps
Browser voice agent
Put your cloned voice on a live Omni agent in the browser.
Conversation intelligence
Transcribe and analyze the calls your agents handle.
Authentication & scopes
The voice:clone scope and key management.
API reference
Full /v1/voice/clones and /v1/audio/speech reference.