Phone networks (PSTN, SIP, Twilio Media Streams) carry 8 kHz G.711 audio — μ-law in North America and Japan, A-law across most of EMEA/APAC. This page covers how to produce and consume it cleanly: ask Speak for native telephony codecs, decode μ-law to PCM for the recognizer/agent leg, and — only if you truly must — resample with the correct integer ratios.
Never resample by dropping or duplicating samples. Naive decimation (keeping every Nth sample) without a low-pass filter aliases high-frequency energy back into the speech band; sample duplication or zero-stuffing without one leaves spectral images above the original band. Either way the result sounds “tinny” or “robotic” to humans and wrecks transcription accuracy, because the recognizer sees energy that was never in the original speech. Always rate-convert with a real resampling filter (polyphase / sinc, or a maintained library).
POST /v1/audio/speech can emit phone-ready audio so you never resample on the client. Shipping now: request a G.711 codec, which is always 8 kHz mono:
curl https://api.pyai.com/v1/audio/speech \
  -H "Authorization: Bearer $PYAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "pyai-voice",
    "input": "Your call is important to us.",
    "voice": "stock_emma_en_gb",
    "response_format": "g711_ulaw"
  }' \
  --output prompt.ulaw
The response is raw, headerless G.711 (Content-Type audio/basic) — 8-bit companded bytes you can hand straight to the carrier.
| Want | response_format | sample_rate | Content-Type |
| --- | --- | --- | --- |
| Twilio / PSTN (North America, Japan) | g711_ulaw | fixed 8 kHz (omit, or set 8000) | audio/basic |
| A-law markets (most of EMEA/APAC) | g711_alaw | fixed 8 kHz (omit, or set 8000) | audio/basic |
| Linear PCM at the phone rate | pcm | 8000 | audio/pcm |
| Wideband (16 kHz session) | pcm | 16000 | audio/pcm |
For g711_ulaw / g711_alaw the rate is fixed at 8 kHz: omit sample_rate, or pass exactly 8000 — any other value is rejected. Speak’s default response_format is mp3 and its native rate is 24 kHz, so for linear output pass the sample_rate you actually need rather than inheriting 24 kHz.
Because Speak renders at the codec/rate you ask for, requesting g711_ulaw (or pcm at 8000/16000) means no resampling in your code at all — the cleanest possible path.
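The rate rule above can be checked before a request ever leaves your code. The following is a minimal sketch, assuming a hypothetical helper named `buildSpeakBody` (it is not part of any SDK); only the field names `model`, `input`, `voice`, `response_format` and `sample_rate` come from this page.

```javascript
// Hypothetical helper: builds the JSON body for POST /v1/audio/speech and
// applies the G.711 rate rule client-side so a bad request fails early.
function buildSpeakBody({ input, voice, responseFormat, sampleRate }) {
  const isG711 = responseFormat === "g711_ulaw" || responseFormat === "g711_alaw";
  if (isG711 && sampleRate !== undefined && sampleRate !== 8000) {
    throw new RangeError(`${responseFormat} is fixed at 8000 Hz, got ${sampleRate}`);
  }
  const body = { model: "pyai-voice", input, voice, response_format: responseFormat };
  // G.711 implies 8 kHz, so the rate is omitted; linear output passes it explicitly.
  if (!isG711 && sampleRate !== undefined) body.sample_rate = sampleRate;
  return body;
}
```

Passing `JSON.stringify(buildSpeakBody({ ... }))` as the request body keeps the codec and rate choice in one place, so a 24 kHz default cannot slip into a phone leg by accident.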

Native μ-law: it’s companding, not a sample rate

μ-law (and A-law) is an 8-bit logarithmic companding of linear PCM samples at 8 kHz: a codec, not a rate. Decoding μ-law to 16-bit PCM is an exact table lookup; encoding PCM to μ-law quantizes each sample to 8 bits, so it is lossy but deterministic. Neither direction is resampling, so neither can alias.
  • Outbound (Speak → phone): g711_ulaw hands you μ-law bytes ready for the carrier — no encode step on your side.
  • Inbound (phone → Hear/Omni): the realtime surfaces take PCM16 (Hear streaming encoding=pcm16; Omni format=pcm16), so decode the carrier’s μ-law to PCM16 first. Keep the session at 8 kHz and the only conversion is companding — never a resample.
μ-law ↔ PCM16 (companding only — no resampling)
import { mulaw } from "alawmulaw";

// phone μ-law (8 kHz) → PCM16 (8 kHz) for Hear streaming / Omni
const pcm16 = mulaw.decode(ulawBytes); // Int16Array

// PCM16 (8 kHz) from Speak/agent → μ-law (8 kHz) for the phone
const ulaw = mulaw.encode(pcm16); // Uint8Array
Running both legs at 8 kHz (Twilio ↔ μ-law decode ↔ 8 kHz session) is the Twilio guide’s entire codec path: companding only, zero resampling.
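To make the companding step concrete, here is a minimal sketch of the standard G.711 μ-law sample conversion (bias 0x84, clip 32635). The function names are illustrative and are not the alawmulaw API; in production, prefer a maintained library. Decoding maps each of the 256 codes to one fixed PCM value, while encoding quantizes a linear sample onto one of those steps, so an arbitrary PCM value generally does not come back exactly.

```javascript
const BIAS = 0x84;  // 132, added before the logarithmic segment search
const CLIP = 32635; // largest magnitude that still fits in 15 bits after adding BIAS

// One μ-law byte -> one signed 16-bit linear sample.
function decodeUlawSample(code) {
  const u = ~code & 0xff; // G.711 transmits the bits inverted
  const t = (((u & 0x0f) << 3) + BIAS) << ((u >> 4) & 0x07);
  return (u & 0x80) ? BIAS - t : t - BIAS;
}

// One signed 16-bit linear sample -> one μ-law byte (lossy quantization).
function encodeUlawSample(sample) {
  const sign = sample < 0 ? 0x80 : 0;
  const magnitude = Math.min(Math.abs(sample), CLIP) + BIAS;
  let exponent = 7;
  for (let mask = 0x4000; exponent > 0 && (magnitude & mask) === 0; mask >>= 1) exponent--;
  const mantissa = (magnitude >> (exponent + 3)) & 0x0f;
  return ~(sign | (exponent << 4) | mantissa) & 0xff;
}

// Decoding a whole buffer is a 256-entry table lookup, one output sample per input byte.
const ULAW_TO_PCM16 = Int16Array.from({ length: 256 }, (_, code) => decodeUlawSample(code));
function decodeUlaw(bytes) {
  return Int16Array.from(bytes, (code) => ULAW_TO_PCM16[code]);
}
```

Note that the output has exactly as many samples as the input has bytes: the sample count and the 8 kHz rate are untouched, which is why this step needs no filter.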

If you must resample: exact integer ratios

Sometimes you can’t choose the rate at the source — e.g. you already hold a 24 kHz asset (Speak’s native rate) and need 16 kHz for Hear or 8 kHz for a phone leg. Rate-convert with these rational ratios — upsample by L, low-pass, then downsample by M:
| From → to | Ratio (up / down) | Required low-pass |
| --- | --- | --- |
| 24 kHz → 16 kHz | up 2 / down 3 | cutoff < 8 kHz, between the up and down steps |
| 24 kHz → 8 kHz | up 1 / down 3 | cutoff < 4 kHz, before the down step |
| 48 kHz → 16 kHz | up 1 / down 3 | cutoff < 8 kHz, before the down step (browser mic) |
| 8 kHz → 16 kHz | up 2 / down 1 | cutoff < 4 kHz, after the up step (phone leg into a 16 kHz session) |
The low-pass filter is not optional — it’s the anti-aliasing (on downsample) and anti-imaging (on upsample) filter that keeps the warning above from biting you. Set its cutoff just below the lower of the two Nyquist frequencies (half the smaller sample rate).
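The up / filter / down recipe can be sketched in a few dozen lines. This is an illustrative implementation, not a tuned one: it assumes a Hamming-windowed sinc filter with the cutoff at 90% of the lower Nyquist, and the names `designLowPass`, `resamplePcm16` and `tapsPerSide` are made up for this example. A maintained resampling library remains the better choice for production audio.

```javascript
// Windowed-sinc low-pass FIR. `cutoff` is in cycles per sample (0 to 0.5)
// at the rate the filter runs at; taps are normalized to unity DC gain.
function designLowPass(numTaps, cutoff) {
  const taps = new Float64Array(numTaps);
  const mid = (numTaps - 1) / 2;
  let sum = 0;
  for (let i = 0; i < numTaps; i++) {
    const t = i - mid;
    const sinc = t === 0 ? 2 * cutoff : Math.sin(2 * Math.PI * cutoff * t) / (Math.PI * t);
    const hamming = 0.54 - 0.46 * Math.cos((2 * Math.PI * i) / (numTaps - 1));
    taps[i] = sinc * hamming;
    sum += taps[i];
  }
  for (let i = 0; i < numTaps; i++) taps[i] /= sum;
  return taps;
}

// Rational resampler: conceptually zero-stuff by `up`, low-pass, then keep
// every `down`-th sample. Stuffed zeros are skipped rather than multiplied,
// which is the saving a polyphase implementation exploits.
function resamplePcm16(input, up, down, tapsPerSide = 16) {
  const factor = Math.max(up, down);
  const numTaps = 2 * tapsPerSide * factor + 1; // odd, so the filter delay is an integer
  const mid = (numTaps - 1) / 2;
  // 90% of the lower Nyquist, expressed at the upsampled rate.
  const taps = designLowPass(numTaps, (0.9 * 0.5) / factor);
  const output = new Int16Array(Math.floor((input.length * up) / down));
  for (let n = 0; n < output.length; n++) {
    let acc = 0;
    for (let i = 0; i < numTaps; i++) {
      const j = n * down + mid - i; // index into the zero-stuffed signal
      if (j < 0 || j % up !== 0) continue; // before the start, or a stuffed zero
      const k = j / up;
      if (k < input.length) acc += taps[i] * input[k];
    }
    const y = Math.round(acc * up); // gain of `up` restores the level lost to zero-stuffing
    output[n] = Math.max(-32768, Math.min(32767, y));
  }
  return output;
}
```

With this sketch, `resamplePcm16(asset24k, 1, 3)` would produce the 8 kHz phone leg and `resamplePcm16(asset24k, 2, 3)` the 16 kHz leg. A 6 kHz tone in a 24 kHz input illustrates the point of the filter: keeping every third sample folds it down to 2 kHz at full amplitude, whereas the filtered path removes it almost entirely.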
The simplest fix is to not create 24 kHz in the first place: ask Speak for g711_ulaw (telephony), or pcm at sample_rate: 8000 / 16000, and keep capture and sessions on a single rate end to end. Resample only legacy audio you don’t control.

Common telephony recipes

| Scenario | Path | Resampling? |
| --- | --- | --- |
| Speak → Twilio / PSTN | g711_ulaw → base64 → Twilio media | None |
| Twilio caller → Omni / Hear | μ-law decode → PCM16 @ 8 kHz → session at 8 kHz | None (companding only) |
| Speak 24 kHz asset → 8 kHz phone | resample up 1 / down 3 with a filter | Yes, filtered |
| Speak 24 kHz asset → 16 kHz Hear | resample up 2 / down 3 with a filter | Yes, filtered |
| Browser mic 48 kHz → 16 kHz Hear | resample up 1 / down 3 with a filter | Yes, filtered |
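The first recipe (g711_ulaw → base64 → Twilio media) can be sketched as follows for Node.js. The helper name `toTwilioMediaMessages` is hypothetical, and the 160-byte frame (20 ms at 8 kHz, one byte per sample) is an assumed pacing choice rather than something this page specifies; the message shape follows Twilio's Media Streams `media` event.

```javascript
// Split raw, headerless μ-law bytes (8 kHz, 1 byte per sample) into
// Twilio Media Streams "media" messages, ready to send over the WebSocket.
// frameBytes = 160 is 20 ms of audio; this frame size is an assumption.
function toTwilioMediaMessages(ulawBytes, streamSid, frameBytes = 160) {
  const messages = [];
  for (let offset = 0; offset < ulawBytes.length; offset += frameBytes) {
    const frame = ulawBytes.subarray(offset, offset + frameBytes);
    messages.push(JSON.stringify({
      event: "media",
      streamSid,
      media: { payload: Buffer.from(frame).toString("base64") },
    }));
  }
  return messages;
}
```

Each returned string would then be sent as one text frame on the stream's WebSocket. Because the bytes are already μ-law at 8 kHz, nothing in this path touches the samples themselves.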

See also

Phone agent with Twilio

The μ-law ↔ PCM16 bridge at 8 kHz, end to end.

Stream speech-to-text

Feed live call audio into Hear streaming.

FreeSWITCH integration

Fork SIP/PSTN audio into Omni at 16 kHz.

Errors & limits

Close codes, rate limits, and concurrency.