A cloned voice lets you speak in a specific person’s voice across every PyAI surface: text-to-speech, voicemail, IVR prompts, and live Omni agents. This guide takes you from a raw audio clip to a production voice_id — and is honest about the one thing that decides whether a clone sounds great or gets rejected: the quality of your reference clip.

How it fits together

Enrollment is quick but not instant: you upload a clip, the voice starts in the pending state, and it becomes ready once the clip passes the quality gate. A ready voice_id works immediately in POST /v1/audio/speech and as an Omni agent’s voice.
Voice cloning is English-only today and requires the voice:clone scope. Cloning copies a real person’s voice — only clone voices you have explicit permission to use.

What makes a good reference clip

The cloner is gated on real acoustic quality, not file metadata. The single most important requirement: the clip must carry genuine full-band audio, meaning a true 48 kHz capture with real energy extending toward ~24 kHz, not an 8 kHz phone call that has been upsampled to look like a 48 kHz file. Upsampling adds samples, not bandwidth; the gate sees through it. A clip that passes cleanly is:
  • ~6–15 seconds of continuous, natural speech (not a single word, not a 3-minute monologue).
  • Genuinely wideband — recorded at 24 kHz or higher with real high-frequency content. A mid-quality phone mic in a quiet room is fine; a telephone recording is not.
  • One speaker only. No second voice, no crosstalk, no background conversation.
  • Clean — minimal background noise, no music, no reverb-heavy rooms, no compression artifacts.
  • Consistent — even volume, no clipping, no long silences.
The best clip is boring: one person reading two or three sentences in a quiet room with a decent mic. Record at 48 kHz mono WAV and don’t normalize, denoise, or add effects — let the audio be real.
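
The header-level items in the checklist above (sample rate, channel count, duration) can be checked locally before uploading. The following is a minimal sketch using only Python's standard `wave` module; the function name `preflight` and the constants are illustrative, not part of the PyAI SDK, and the limits are simply the numbers quoted in this guide. It reads file metadata only, so it cannot tell a genuine wideband recording from an upsampled one; that judgment stays with the quality gate.

```python
import wave

# Limits copied from the checklist in this guide; the gate's real thresholds may differ.
MIN_SECONDS, MAX_SECONDS = 6.0, 15.0
MIN_SAMPLE_RATE = 24_000

def preflight(path):
    """Return a list of problems found in a PCM WAV header (empty list = looks OK)."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        seconds = wav.getnframes() / rate
    if rate < MIN_SAMPLE_RATE:
        problems.append(f"sample rate {rate} Hz is below {MIN_SAMPLE_RATE} Hz")
    if channels != 1:
        problems.append(f"{channels} channels; a mono clip is recommended")
    if not MIN_SECONDS <= seconds <= MAX_SECONDS:
        problems.append(
            f"duration {seconds:.1f} s is outside {MIN_SECONDS:.0f} to {MAX_SECONDS:.0f} s"
        )
    return problems
```

Calling `preflight("reference.wav")` on a clip that matches the checklist should return an empty list; anything else is worth fixing before you spend an enrollment attempt on it.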

Build it

1. Prepare the reference clip

Trim to a clean 6–15 second span where one person speaks continuously. Keep it as WAV/PCM if you can; avoid re-encoding a lossy file or upsampling a narrowband source — neither adds the bandwidth the gate needs.
# Trim to a 10s window, keep native sample rate, mono — no upsampling tricks.
ffmpeg -i raw.wav -ss 00:00:04 -t 10 -ac 1 -c:a pcm_s16le reference.wav
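
To see why upsampling a narrowband source does not help, it is useful to measure where a signal's energy actually sits. The sketch below uses the Goertzel recurrence (a standard way to get the power at a single frequency) in pure Python and compares two synthetic 48 kHz signals. The function names and probe frequencies are illustrative choices; this is a crude heuristic for intuition, not the algorithm the quality gate uses.

```python
import math

def goertzel_power(samples, rate, freq_hz):
    """Signal power at one frequency, computed with the Goertzel recurrence."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq_hz / rate)
    s1 = s2 = 0.0
    for x in samples:
        s1, s2 = x + coeff * s1 - s2, s1
    return s1 * s1 + s2 * s2 - coeff * s1 * s2

def high_band_share(samples, rate,
                    low_probes=(500, 1000, 2000, 3000),
                    high_probes=(6000, 8000, 10000, 12000)):
    """Fraction of the probed energy that lies above the telephone band (0.0 to 1.0)."""
    low = sum(goertzel_power(samples, rate, f) for f in low_probes)
    high = sum(goertzel_power(samples, rate, f) for f in high_probes)
    total = low + high
    return high / total if total else 0.0

def tone(rate, seconds, *freqs):
    """Synthetic test signal: a sum of equal-amplitude sine waves."""
    n = int(rate * seconds)
    return [sum(math.sin(2 * math.pi * f * i / rate) for f in freqs) for i in range(n)]

narrow = tone(48_000, 0.1, 1000)          # 48 kHz sample rate, telephone-band content only
wide = tone(48_000, 0.1, 1000, 10_000)    # same rate, with real high-frequency content
print(round(high_band_share(narrow, 48_000), 3), round(high_band_share(wide, 48_000), 3))
```

Both signals are "48 kHz files", yet only the second has any energy in the upper band. Real speech is far messier than two sine waves, so treat the share as a rough indicator rather than a pass/fail test.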

2. Enroll the voice

POST /v1/voice/clones is a multipart upload: a name and the audio file. It returns a Voice with an id and a status — typically pending while the clip is processed.
import os
from pyai import PyAI

pyai = PyAI(api_key=os.environ["PYAI_API_KEY"])

with open("reference.wav", "rb") as f:
    voice = pyai.voice.clones.create(name="Ava — brand voice", file=f)

print(voice.id, voice.status)  # voice_abc  pending

3. Wait until it's ready

Poll GET /v1/voice/clones/{id} until status flips to ready. If the clip fails the quality gate, the status goes to failed; see the rejection table below for what to fix.
import time

def wait_ready(voice_id, timeout=120):
    deadline = time.time() + timeout
    while time.time() < deadline:
        v = pyai.voice.clones.get(voice_id)
        if v.status in ("ready", "failed"):
            return v
        time.sleep(2)
    raise TimeoutError(voice_id)

voice = wait_ready(voice.id)
if voice.status == "failed":
    raise SystemExit("Clip rejected — see the troubleshooting table")
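
If you enroll many voices, a fixed 2-second poll can be swapped for a capped exponential backoff. The variant below is a sketch, not SDK code: `wait_for_terminal` is a hypothetical helper that takes any zero-argument callable returning the status string, and the `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.

```python
import time

def wait_for_terminal(fetch_status, timeout=120.0, first_delay=1.0, max_delay=10.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until it returns "ready" or "failed", backing off between calls."""
    deadline = clock() + timeout
    delay = first_delay
    while True:
        status = fetch_status()
        if status in ("ready", "failed"):
            return status
        if clock() + delay > deadline:
            raise TimeoutError(f"still {status!r} after {timeout} s")
        sleep(delay)
        delay = min(delay * 2, max_delay)  # 1 s, 2 s, 4 s, 8 s, then capped at 10 s
```

With the client from the enrollment step, a call might look like `wait_for_terminal(lambda: pyai.voice.clones.get(voice.id).status)`. Using `time.monotonic` for the deadline keeps the timeout correct even if the system clock is adjusted mid-wait.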

4. Preview it

Synthesize a short line to sanity-check the clone before you ship it. This is just POST /v1/audio/speech with your new voice_id.
audio = pyai.audio.speech(
    input="Hi! This is a preview of my cloned voice.",
    voice=voice.id,            # voice_abc
    response_format="wav",
)
# Write the returned audio bytes and close the file deterministically.
with open("preview.wav", "wb") as f:
    f.write(audio)
Listen critically: if it sounds muffled, robotic, or off-timbre, the clip is almost always the cause — re-record per the requirements above rather than re-running enrollment on the same audio.

5. Synthesize with the cloned voice

Once you’re happy, the clone is a first-class voice everywhere TTS is accepted — pass voice: voice_abc exactly as you would a stock voice id.
audio = pyai.audio.speech(
    input="Your appointment is confirmed for Thursday at 2 PM.",
    voice=voice.id,
    response_format="mp3",
)
# Write the returned audio bytes and close the file deterministically.
with open("confirmation.mp3", "wb") as f:
    f.write(audio)

6. Use the clone in an Omni agent

A ready cloned voice can be an Omni agent’s speaking voice. In the console Agent Builder, set the agent’s voice to your voice_id — every realtime session then speaks in the cloned voice with no code change. Connect exactly as in the browser voice agent guide; only the agent’s configured voice differs.

Run it

# 1. enroll
pyai voice clones create --name "Ava — brand voice" --file reference.wav
# 2. it prints voice_abc (pending) → poll until ready, then:
curl https://api.pyai.com/v1/audio/speech \
  -H "Authorization: Bearer $PYAI_API_KEY" -H "Content-Type: application/json" \
  -d '{"input":"Hello in my own voice.","voice":"voice_abc","response_format":"wav"}' --output hello.wav
Play hello.wav — it should be recognizably the speaker from your clip. Manage your clones any time with GET /v1/voice/clones (list) and delete one with DELETE /v1/voice/clones/{id}; clones are tenant-isolated, so you only ever see and touch your own.
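
For scripts that do not use the SDK, the list and delete calls can be expressed as plain HTTP requests. The sketch below only builds the requests with Python's standard `urllib.request`; it does not send them. The host and Bearer header are taken from the curl example above, `clones_request` is a hypothetical helper, `sk-example` is a placeholder key, and the response bodies are not described in this guide, so parsing is left out.

```python
import urllib.request

API_BASE = "https://api.pyai.com"  # host taken from the curl example in this guide

def clones_request(api_key, method="GET", voice_id=None):
    """Build (but do not send) a request for the clone list or for a single clone."""
    path = "/v1/voice/clones" + (f"/{voice_id}" if voice_id else "")
    return urllib.request.Request(
        API_BASE + path,
        method=method,
        headers={"Authorization": f"Bearer {api_key}"},
    )

list_req = clones_request("sk-example")                                     # list all clones
delete_req = clones_request("sk-example", "DELETE", voice_id="voice_abc")   # delete one clone
print(list_req.get_method(), list_req.full_url)
print(delete_req.get_method(), delete_req.full_url)
```

Passing either object to `urllib.request.urlopen` would perform the call. Because clones are tenant-isolated, a delete for an id your key does not own should be expected to return 404 rather than touch another tenant's voice.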

“Why was my clip rejected?”

The most common support question, answered honestly. A failed status almost always traces to one of these:
| What you hear / see | Root cause | Fix |
| --- | --- | --- |
| failed immediately; “insufficient bandwidth” | Narrowband audio (e.g. an 8 kHz phone call) upsampled to look like 48 kHz | Record genuinely wideband at ≥24 kHz; upsampling adds samples, not bandwidth, and the gate detects it |
| failed; “clip too short/long” | Outside the ~6–15 s window | Trim to a continuous 6–15 s of speech |
| failed; “multiple speakers” | Two voices, crosstalk, or background conversation | Use a clip with exactly one speaker and no overlap |
| Clone sounds muffled or dull | Real high frequencies missing (lossy/telephone source) | Re-record from a wideband source; don’t denoise away the highs |
| Clone sounds robotic or unstable | Background music, reverb, or clipping in the clip | Record in a quiet, dry room; keep levels below clipping |
| Timbre is “almost right” but off | Too little usable speech, or inconsistent volume | Provide a fuller, evenly-leveled 10–15 s sample |
| 403 forbidden on enroll | Key missing the voice:clone scope | Add voice:clone in the console |
| Non-English clip behaves oddly | Cloning is English-only today | Use an English reference clip |
| 404 on get/delete | Voice belongs to another tenant (or wrong id) | Clones are tenant-isolated; use an id your key owns |
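
In an automated pipeline, the first three rows of the table can be turned into a small lookup so a failed enrollment produces an actionable message. This is a sketch under assumptions: the guide quotes the rejection messages but does not say which field carries them, so `suggest_fix` (a hypothetical helper) simply takes whatever reason string you have, and it treats “clip too short/long” as two separate messages.

```python
# Keys are phrases from the rejection messages quoted in the table above;
# values paraphrase the table's "Fix" column.
FIXES = {
    "insufficient bandwidth": "Re-record from a genuinely wideband source; upsampling will not pass.",
    "too short": "Trim to a continuous 6 to 15 s of speech.",
    "too long": "Trim to a continuous 6 to 15 s of speech.",
    "multiple speakers": "Use a clip with exactly one speaker and no overlap.",
}

def suggest_fix(reason):
    """Map a rejection message to a next step, matching case-insensitively on known phrases."""
    text = (reason or "").lower()
    for phrase, fix in FIXES.items():
        if phrase in text:
            return fix
    return "Unrecognized reason; re-check the clip against the reference-clip requirements."
```

Substring matching is deliberately loose so that small wording changes in the message do not break the lookup; anything unrecognized falls through to a generic hint instead of raising.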

Next steps

  • Browser voice agent: Put your cloned voice on a live Omni agent in the browser.
  • Conversation intelligence: Transcribe and analyze the calls your agents handle.
  • Authentication & scopes: The voice:clone scope and key management.
  • API reference: Full /v1/voice/clones and /v1/audio/speech reference.