Clone a voice end to end

A cloned voice lets you speak in a specific person’s voice across every PyAI surface: text-to-speech, voicemail, IVR prompts, and live Omni agents. This guide takes you from a raw audio clip to a production voice_id — and is honest about the one thing that decides whether a clone sounds great or gets rejected: the quality of your reference clip.

How it fits together

Enrollment is quick but not instant: you upload a clip, the voice starts as pending, and it becomes ready once the clip passes the quality gate. A ready voice_id works immediately in POST /v1/audio/speech and as an Omni agent’s voice.

Voice cloning is English-only today and requires the voice:clone scope. Cloning copies a real person’s voice — only clone voices you have explicit permission to use.

What makes a good reference clip

The cloner is gated on real acoustic quality, not file metadata. The single most important requirement: the clip must carry genuine wideband audio (ideally a 48 kHz capture, which can hold real energy up to ~24 kHz), not an 8 kHz phone call that’s been upsampled to look like a 48 kHz file. Upsampling adds samples, not bandwidth; the gate sees through it. A clip that passes cleanly is:

~6–15 seconds of continuous, natural speech (not a single word, not a 3-minute monologue).
Genuinely wideband, recorded at a sample rate of 24 kHz or higher with real high-frequency content. A mid-quality smartphone mic in a quiet room is fine; a recording of a telephone call is not.
One speaker only. No second voice, no crosstalk, no background conversation.
Clean — minimal background noise, no music, no reverb-heavy rooms, no compression artifacts.
Consistent — even volume, no clipping, no long silences.

The best clip is boring: one person reading two or three sentences in a quiet room with a decent mic. Record at 48 kHz mono WAV and don’t normalize, denoise, or add effects — let the audio be real.
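
Several of the requirements above can be checked locally before you upload. The following is a minimal sketch, not part of the PyAI SDK: `preflight` is a hypothetical helper that uses Python's standard `wave` module to compare duration, sample rate, and channel count against the guidance in this section. It only reads the file header, so it cannot tell a genuine wideband recording from an upsampled one; the quality gate still has the final say.

```python
import wave


def preflight(path, min_seconds=6.0, max_seconds=15.0, min_rate=24_000):
    """Return a list of problems found in a WAV header (empty list = looks fine).

    `path` may be a filename or an open binary file object.
    The thresholds mirror the guidance in this guide; they are not read from the API.
    """
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        channels = w.getnchannels()
        seconds = w.getnframes() / rate

    problems = []
    if not (min_seconds <= seconds <= max_seconds):
        problems.append(
            f"duration {seconds:.1f}s is outside {min_seconds:g}-{max_seconds:g}s"
        )
    if rate < min_rate:
        problems.append(f"sample rate {rate} Hz is below {min_rate} Hz")
    if channels != 1:
        problems.append(f"{channels} channels found; expected mono")
    return problems
```

Calling `preflight("reference.wav")` returns an empty list when the header looks fine, or a list of human-readable problems to fix before enrolling.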

Build it

Prepare the reference clip

Trim to a clean 6–15 second span where one person speaks continuously. Keep it as WAV/PCM if you can; avoid re-encoding a lossy file or upsampling a narrowband source — neither adds the bandwidth the gate needs.

# Trim to a 10s window, keep native sample rate, mono — no upsampling tricks.
ffmpeg -i raw.wav -ss 00:00:04 -t 10 -ac 1 -c:a pcm_s16le reference.wav
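
Because a header can claim 48 kHz while the audio holds nothing above telephone bandwidth, it can help to look at where the energy actually sits. The sketch below is an illustration only: it is not how the PyAI quality gate works (this guide does not describe that), and `high_band_ratio`, the 4 kHz split, and the 500 Hz probe spacing are assumptions chosen for the example. It uses the Goertzel recurrence to measure power on a coarse grid of frequencies and reports the share that lies above the split.

```python
import math


def goertzel_power(samples, sample_rate, freq):
    """Squared magnitude of the signal's spectrum at a single frequency."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / sample_rate)
    s_prev = s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev * s_prev + s_prev2 * s_prev2 - coeff * s_prev * s_prev2


def high_band_ratio(samples, sample_rate, split_hz=4000.0, step_hz=500.0):
    """Fraction of the probed power that lies above split_hz.

    Probes every step_hz up to (not including) the Nyquist frequency.
    An 8 kHz source can only hold content below 4 kHz, hence the default split.
    """
    low = high = 0.0
    freq = step_hz
    while freq < sample_rate / 2:
        power = goertzel_power(samples, sample_rate, freq)
        if freq > split_hz:
            high += power
        else:
            low += power
        freq += step_hz
    total = low + high
    return high / total if total else 0.0
```

Pass the clip as a sequence of floats. Pure Python is slow here, so analyzing about one second of speech is enough for a rough read. A ratio very close to zero in a file whose header says 48 kHz is a hint that the source was narrowband; treat it as a prompt to re-record, not as a prediction of the gate's verdict.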

Enroll the voice

POST /v1/voice/clones is a multipart upload: a name and the audio file. It returns a Voice with an id and a status — typically pending while the clip is processed.

import os
from pyai import PyAI

pyai = PyAI(api_key=os.environ["PYAI_API_KEY"])

with open("reference.wav", "rb") as f:
    voice = pyai.voice.clones.create(name="Ava — brand voice", file=f)

print(voice.id, voice.status)  # voice_abc  pending

Wait until it's ready

Poll GET /v1/voice/clones/{id} until status flips to ready. If the clip fails the quality gate the status goes to failed — see the rejection table below for what to fix.

import time

def wait_ready(voice_id, timeout=120):
    deadline = time.time() + timeout
    while time.time() < deadline:
        v = pyai.voice.clones.get(voice_id)
        if v.status in ("ready", "failed"):
            return v
        time.sleep(2)
    raise TimeoutError(voice_id)

voice = wait_ready(voice.id)
if voice.status == "failed":
    raise SystemExit("Clip rejected — see the troubleshooting table")
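
If you enroll many voices, a fixed two-second poll can be swapped for a backoff that starts fast and slows down. This is a minimal sketch under stated assumptions: `poll_until` is a hypothetical helper, not an SDK function, and the delays are arbitrary starting points. It takes the fetch call as a parameter, so it works with `pyai.voice.clones.get` from the example above without depending on anything else in the SDK.

```python
import time


def poll_until(fetch, is_done, timeout=120.0, first_delay=1.0, max_delay=10.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call fetch() until is_done(result) is true, doubling the wait each round.

    Raises TimeoutError if the next wait would run past the deadline.
    `sleep` and `clock` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout
    delay = first_delay
    while True:
        result = fetch()
        if is_done(result):
            return result
        if clock() + delay > deadline:
            raise TimeoutError("status did not settle before the timeout")
        sleep(delay)
        delay = min(delay * 2, max_delay)


# Usage with the client from the enrollment step:
# voice = poll_until(
#     lambda: pyai.voice.clones.get(voice.id),
#     lambda v: v.status in ("ready", "failed"),
# )
```

The waits run 1 s, 2 s, 4 s, 8 s, then stay at 10 s, which keeps quick enrollments responsive while reducing requests for slow ones.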

Preview it

Synthesize a short line to sanity-check the clone before you ship it. This is just POST /v1/audio/speech with your new voice_id.

audio = pyai.audio.speech(
    input="Hi! This is a preview of my cloned voice.",
    voice=voice.id,            # voice_abc
    response_format="wav",
)
with open("preview.wav", "wb") as f:  # context manager closes the file reliably
    f.write(audio)

Listen critically: if it sounds muffled, robotic, or off-timbre, the clip is almost always the cause — re-record per the requirements above rather than re-running enrollment on the same audio.

Synthesize with the cloned voice

Once you’re happy, the clone is a first-class voice everywhere TTS is accepted — pass voice: voice_abc exactly as you would a stock voice id.

audio = pyai.audio.speech(
    input="Your appointment is confirmed for Thursday at 2 PM.",
    voice=voice.id,
    response_format="mp3",
)
with open("confirmation.mp3", "wb") as f:  # context manager closes the file reliably
    f.write(audio)
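
For voicemail or IVR use you will often render a whole set of prompts at once. The sketch below shows one way to do that; `render_prompts` and `slugify` are hypothetical helpers, and the file-naming scheme is an assumption. The synthesis call is passed in as a function, on the assumption (consistent with the examples above) that it returns the audio as bytes.

```python
import re
from pathlib import Path


def slugify(text, max_len=40):
    """Turn a prompt into a short, filesystem-safe name."""
    slug = re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
    return slug[:max_len].rstrip("-") or "prompt"


def render_prompts(synthesize, prompts, out_dir, ext="mp3"):
    """Synthesize each prompt and write it to out_dir; return the written paths.

    `synthesize` is any callable that maps text to audio bytes.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for index, text in enumerate(prompts, start=1):
        path = out / f"{index:02d}-{slugify(text)}.{ext}"
        path.write_bytes(synthesize(text))
        written.append(path)
    return written


# Usage with the client and voice from the steps above:
# render_prompts(
#     lambda text: pyai.audio.speech(input=text, voice=voice.id, response_format="mp3"),
#     ["Thanks for calling.", "Press 1 for sales."],
#     "prompts",
# )
```

Numbering the files keeps them in menu order, and the slug makes each one identifiable without opening it.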

Use the clone in an Omni agent

A ready cloned voice can be an Omni agent’s speaking voice. In the console Agent Builder, set the agent’s voice to your voice_id — every realtime session then speaks in the cloned voice with no code change. Connect exactly as in the browser voice agent guide; only the agent’s configured voice differs.

Run it

# 1. enroll
pyai voice clones create --name "Ava — brand voice" --file reference.wav
# 2. it prints voice_abc (pending) → poll until ready, then:
curl https://api.pyai.com/v1/audio/speech \
  -H "Authorization: Bearer $PYAI_API_KEY" -H "Content-Type: application/json" \
  -d '{"input":"Hello in my own voice.","voice":"voice_abc","response_format":"wav"}' --output hello.wav

Play hello.wav; it should be recognizably the speaker from your clip. List your clones at any time with GET /v1/voice/clones and delete one with DELETE /v1/voice/clones/{id}. Clones are tenant-isolated, so you only ever see and touch your own.
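
The list and delete endpoints can also be called without the SDK. The following sketch builds the requests with Python's standard `urllib`; `clones_request` is a hypothetical helper, the host is taken from the curl example above, and the response body shape is not documented in this guide, so the example stops at constructing the request and leaves parsing to you.

```python
import urllib.request

BASE_URL = "https://api.pyai.com"  # host as used in the curl example


def clones_request(api_key, voice_id=None, method="GET"):
    """Build (but do not send) a request for /v1/voice/clones or /v1/voice/clones/{id}."""
    path = "/v1/voice/clones" + (f"/{voice_id}" if voice_id else "")
    return urllib.request.Request(
        BASE_URL + path,
        method=method,
        headers={"Authorization": f"Bearer {api_key}"},
    )


# Sending is one more call; the body format is an assumption to verify
# against the API reference before parsing it:
# with urllib.request.urlopen(clones_request(key)) as resp:
#     body = resp.read()
# urllib.request.urlopen(clones_request(key, "voice_abc", method="DELETE"))
```

A 404 on get or delete surfaces as `urllib.error.HTTPError`, which matches the tenant-isolation behavior described above: an id your key does not own looks the same as one that does not exist.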

“Why was my clip rejected?”

The most common support question, answered honestly. A failed status almost always traces to one of these:

What you hear / see	Root cause	Fix
`failed` immediately; “insufficient bandwidth”	Narrowband audio (e.g. an 8 kHz phone call) upsampled to look like 48 kHz	Record genuinely wideband at ≥24 kHz; upsampling adds samples, not bandwidth, and the gate detects it
`failed`; “clip too short/long”	Outside the ~6–15 s window	Trim to a continuous 6–15 s of speech
`failed`; “multiple speakers”	Two voices, crosstalk, or background conversation	Use a clip with exactly one speaker and no overlap
Clone sounds muffled or dull	Real high frequencies missing (lossy/telephone source)	Re-record from a wideband source; don’t denoise away the highs
Clone sounds robotic or unstable	Background music, reverb, or clipping in the clip	Record in a quiet, dry room; keep levels below clipping
Timbre is “almost right” but off	Too little usable speech, or inconsistent volume	Provide a fuller, evenly-leveled 10–15 s sample
`403 forbidden` on enroll	Key missing the `voice:clone` scope	Add `voice:clone` in the console
Non-English clip behaves oddly	Cloning is English-only today	Use an English reference clip
`404` on get/delete	Voice belongs to another tenant (or wrong id)	Clones are tenant-isolated; use an id your key owns

Next steps

Browser voice agent

Put your cloned voice on a live Omni agent in the browser.

Conversation intelligence

Transcribe and analyze the calls your agents handle.

Authentication & scopes

The voice:clone scope and key management.

API reference

Full /v1/voice/clones and /v1/audio/speech reference.
