Telephony audio (8 kHz μ-law)

Phone networks (PSTN, SIP, Twilio Media Streams) carry 8 kHz G.711 audio — μ-law in North America and Japan, A-law across most of EMEA/APAC. This page covers how to produce and consume it cleanly: ask Speak for native telephony codecs, decode μ-law to PCM for the recognizer/agent leg, and — only if you truly must — resample with the correct integer ratios.

Never resample by dropping or duplicating samples. Naive decimation (keeping every Nth sample) and sample duplication / zero-stuffing without a low-pass filter alias high-frequency energy back into the speech band. It sounds “tinny” or “robotic” to humans and wrecks transcription accuracy — the recognizer sees energy that was never in the original speech. Always rate-convert with a real resampling filter (polyphase / sinc, or a maintained library).
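To make the aliasing concrete, the following minimal sketch computes where a pure tone ends up when audio is sampled at a lower rate with no low-pass filter in front. `aliasFrequency` is an illustrative helper written for this page, not part of any API.

```typescript
// Where a pure tone lands after sampling at `fs` with no anti-alias filter.
function aliasFrequency(toneHz: number, fs: number): number {
  const folded = toneHz % fs; // sampling cannot tell f apart from f mod fs
  return folded > fs / 2 ? fs - folded : folded; // reflect around Nyquist (fs / 2)
}

// Keeping every 3rd sample of 24 kHz audio yields an 8 kHz stream (Nyquist 4 kHz).
const examples = [3000, 5000, 7000, 11000].map((inputHz) => ({
  inputHz,
  heardAtHz: aliasFrequency(inputHz, 8000),
}));
// 3000 -> 3000 (untouched), 5000 -> 3000, 7000 -> 1000, 11000 -> 3000
```

Sibilants and fricatives carry energy above 4 kHz in a 24 kHz recording; without the filter, that energy folds onto the 1 to 3 kHz range where speech intelligibility lives.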

Speak: ask for telephony audio directly (recommended)

POST /v1/audio/speech can emit phone-ready audio so you never resample on the client. Shipping now: request a G.711 codec, which is always 8 kHz mono:

curl https://api.pyai.com/v1/audio/speech \
  -H "Authorization: Bearer $PYAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "pyai-voice",
    "input": "Your call is important to us.",
    "voice": "stock_emma_en_gb",
    "response_format": "g711_ulaw"
  }' \
  --output prompt.ulaw

The response is raw, headerless G.711 (Content-Type audio/basic) — 8-bit companded bytes you can hand straight to the carrier.

Want	`response_format`	`sample_rate`	Content-Type
Twilio / PSTN (North America, Japan)	`g711_ulaw`	fixed 8 kHz — omit, or set `8000`	`audio/basic`
A-law markets (most of EMEA/APAC)	`g711_alaw`	fixed 8 kHz — omit, or set `8000`	`audio/basic`
Linear PCM at the phone rate	`pcm`	`8000`	`audio/pcm`
Wideband (16 kHz session)	`pcm`	`16000`	`audio/pcm`

For g711_ulaw / g711_alaw the rate is fixed at 8 kHz: omit sample_rate, or pass exactly 8000 — any other value is rejected. Speak’s default response_format is mp3 and its native rate is 24 kHz, so for linear output pass the sample_rate you actually need rather than inheriting 24 kHz.
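The rule above can be mirrored client-side so that a bad combination fails before the request leaves your process. This is a sketch under assumptions: `buildSpeakRequest` is a hypothetical helper, and the `ResponseFormat` union lists only the formats named on this page, not necessarily everything the endpoint accepts.

```typescript
// Formats mentioned on this page only (assumption: the API may accept more).
type ResponseFormat = "mp3" | "pcm" | "g711_ulaw" | "g711_alaw";

interface SpeakRequest {
  model: string;
  input: string;
  voice: string;
  response_format: ResponseFormat;
  sample_rate?: number;
}

// Hypothetical helper: builds the JSON body and enforces the fixed 8 kHz G.711 rate.
function buildSpeakRequest(
  input: string,
  format: ResponseFormat,
  sampleRate?: number,
): SpeakRequest {
  const isG711 = format === "g711_ulaw" || format === "g711_alaw";
  if (isG711 && sampleRate !== undefined && sampleRate !== 8000) {
    throw new RangeError(`${format} is fixed at 8000 Hz, got ${sampleRate}`);
  }
  const body: SpeakRequest = {
    model: "pyai-voice",
    input,
    voice: "stock_emma_en_gb",
    response_format: format,
  };
  // Only send sample_rate when the caller chose one; G.711 may omit it.
  if (sampleRate !== undefined) body.sample_rate = sampleRate;
  return body;
}
```

Serialize the result with `JSON.stringify` and POST it to `/v1/audio/speech` with the same headers as the curl example. For linear output, pass the rate explicitly, for example `buildSpeakRequest(text, "pcm", 16000)`.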

Because Speak renders at the codec/rate you ask for, requesting g711_ulaw (or pcm at 8000/16000) means no resampling in your code at all — the cleanest possible path.

Native μ-law: it’s companding, not a sample rate

μ-law (and A-law) is an 8-bit logarithmic companding of linear PCM samples at 8 kHz: a codec, not a rate. Decoding μ-law to 16-bit PCM is an exact table lookup. Encoding quantizes each sample to one of 256 levels, so it is lossy, but neither direction changes the sample rate, so neither can alias.

Outbound (Speak → phone): g711_ulaw hands you μ-law bytes ready for the carrier — no encode step on your side.
Inbound (phone → Hear/Omni): the realtime surfaces take PCM16 (Hear streaming encoding=pcm16; Omni format=pcm16), so decode the carrier’s μ-law to PCM16 first. Keep the session at 8 kHz and the only conversion is companding — never a resample.

μ-law ↔ PCM16 (companding only — no resampling)

import { mulaw } from "alawmulaw";

// phone μ-law (8 kHz) → PCM16 (8 kHz) for Hear streaming / Omni
const pcm16 = mulaw.decode(ulawBytes); // Int16Array

// PCM16 (8 kHz) from Speak/agent → μ-law (8 kHz) for the phone
const ulaw = mulaw.encode(pcm16); // Uint8Array

Running both legs at 8 kHz (Twilio ↔ μ-law decode ↔ 8 kHz session) is the Twilio guide’s entire codec path: companding only, zero resampling.
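For intuition about what the companding step does, here is a minimal sketch of the standard G.711 μ-law bit layout (sign bit, 3-bit exponent, 4-bit mantissa, stored inverted). The function names are illustrative and are not the alawmulaw API; in production, prefer the library call shown above.

```typescript
const BIAS = 0x84;  // 132, added before the segment search
const CLIP = 32635; // largest magnitude that still fits after adding BIAS

// One μ-law byte -> one PCM16 sample.
function ulawDecodeSample(code: number): number {
  const u = ~code & 0xff; // G.711 transmits the byte bit-inverted
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS;
  return (u & 0x80) !== 0 ? -magnitude : magnitude;
}

// One PCM16 sample -> one μ-law byte (this direction quantizes).
function ulawEncodeSample(sample: number): number {
  const sign = sample < 0 ? 0x80 : 0x00;
  const magnitude = Math.min(Math.abs(sample), CLIP) + BIAS;
  let exponent = 7;
  // Find the segment: position of the highest set bit between bit 14 and bit 7.
  for (let mask = 0x4000; exponent > 0 && (magnitude & mask) === 0; mask >>= 1) exponent--;
  const mantissa = (magnitude >> (exponent + 3)) & 0x0f;
  return ~(sign | (exponent << 4) | mantissa) & 0xff;
}

// Decoding is a 256-entry lookup, built once.
const ULAW_TO_PCM16 = new Int16Array(256);
for (let code = 0; code < 256; code++) ULAW_TO_PCM16[code] = ulawDecodeSample(code);

function ulawDecode(bytes: Uint8Array): Int16Array {
  const pcm = new Int16Array(bytes.length);
  for (let i = 0; i < bytes.length; i++) pcm[i] = ULAW_TO_PCM16[bytes[i]];
  return pcm;
}
```

Decoding any of the 256 codes and re-encoding the result yields a code that decodes to the same value, so the decode leg adds no distortion of its own. A fresh PCM16 value is quantized on encode: 1000 comes back as 988 after one encode/decode trip.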

If you must resample: exact integer ratios

Sometimes you can’t choose the rate at the source — e.g. you already hold a 24 kHz asset (Speak’s native rate) and need 16 kHz for Hear or 8 kHz for a phone leg. Rate-convert with these rational ratios — upsample by L, low-pass, then downsample by M:

From → to	Ratio (up / down)	Required low-pass
24 kHz → 16 kHz	up 2 / down 3	cutoff < 8 kHz, between the up and down steps
24 kHz → 8 kHz	up 1 / down 3	cutoff < 4 kHz, before the down step
48 kHz → 16 kHz	up 1 / down 3	cutoff < 8 kHz, before the down step (browser mic)
8 kHz → 16 kHz	up 2 / down 1	cutoff < 4 kHz, after the up step (phone leg into a 16 kHz session)
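The ratios in this table follow from dividing both rates by their greatest common divisor. A small sketch, with `resampleRatio` as an illustrative name rather than a library function:

```typescript
function gcd(a: number, b: number): number {
  while (b !== 0) [a, b] = [b, a % b];
  return a;
}

// Smallest integer up/down factors for a rate change, plus the upper
// bound for the low-pass cutoff (the lower of the two Nyquist frequencies).
function resampleRatio(fromHz: number, toHz: number) {
  const g = gcd(fromHz, toHz);
  return {
    up: toHz / g,
    down: fromHz / g,
    cutoffLimitHz: Math.min(fromHz, toHz) / 2,
  };
}

const speakToHear = resampleRatio(24000, 16000); // { up: 2, down: 3, cutoffLimitHz: 8000 }
const speakToPhone = resampleRatio(24000, 8000); // { up: 1, down: 3, cutoffLimitHz: 4000 }
```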

The low-pass filter is not optional — it’s the anti-aliasing (on downsample) and anti-imaging (on upsample) filter that keeps the warning above from biting you. Set its cutoff just below the lower of the two Nyquist frequencies (half the smaller sample rate).
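As an illustration of the upsample, low-pass, downsample chain, here is a minimal windowed-sinc resampler. It is a sketch, not a tuned implementation: the Hann window, the 16 zero crossings per side, and the cutoff at 90% of the lower Nyquist are assumptions chosen for brevity, and a maintained resampling library remains the better production choice. Samples are floats in the range -1 to 1.

```typescript
// Hann-windowed sinc low-pass. `cutoff` is in cycles per sample (0 to 0.5).
function designLowPass(numTaps: number, cutoff: number): Float64Array {
  const h = new Float64Array(numTaps);
  const center = (numTaps - 1) / 2;
  let sum = 0;
  for (let i = 0; i < numTaps; i++) {
    const x = 2 * Math.PI * cutoff * (i - center);
    const sinc = x === 0 ? 1 : Math.sin(x) / x;
    const hann = 0.5 - 0.5 * Math.cos((2 * Math.PI * i) / (numTaps - 1));
    h[i] = sinc * hann;
    sum += h[i];
  }
  for (let i = 0; i < numTaps; i++) h[i] /= sum; // unity gain at DC
  return h;
}

// Rational resampler: conceptually zero-stuff by `up`, filter, keep every `down`-th sample.
function resample(input: Float32Array, up: number, down: number, zeroCrossings = 16): Float32Array {
  const widest = Math.max(up, down);
  const numTaps = 2 * zeroCrossings * widest + 1;  // odd, so there is a center tap
  const h = designLowPass(numTaps, 0.45 / widest); // 90% of the lower Nyquist
  const center = (numTaps - 1) / 2;
  const output = new Float32Array(Math.ceil((input.length * up) / down));
  for (let n = 0; n < output.length; n++) {
    const k = n * down + center; // position in the zero-stuffed stream, delay-compensated
    let acc = 0;
    // Visit only taps that line up with real input samples (skips the stuffed zeros).
    for (let j = k % up; j < numTaps; j += up) {
      const i = (k - j) / up;
      if (i >= 0 && i < input.length) acc += h[j] * input[i];
    }
    output[n] = acc * up; // restore the gain lost to zero-stuffing
  }
  return output;
}
```

`resample(audio24k, 1, 3)` produces 8 kHz audio and `resample(audio24k, 2, 3)` produces 16 kHz. PCM16 input needs dividing by 32768 first, and the output needs scaling back and clamping to the 16-bit range before μ-law encoding.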

The simplest fix is to not create 24 kHz in the first place: ask Speak for g711_ulaw (telephony), or pcm at sample_rate: 8000 / 16000, and keep capture and sessions on a single rate end to end. Resample only legacy audio you don’t control.

Common telephony recipes

Scenario	Path	Resampling?
Speak → Twilio / PSTN	`g711_ulaw` → base64 → Twilio `media`	None
Twilio caller → Omni / Hear	μ-law `decode` → PCM16 @ 8 kHz → session at 8 kHz	None (companding only)
Speak 24 kHz asset → 8 kHz phone	resample up 1 / down 3 with a filter	Yes — filtered
Speak 24 kHz asset → 16 kHz Hear	resample up 2 / down 3 with a filter	Yes — filtered
Browser mic 48 kHz → 16 kHz Hear	resample up 1 / down 3 with a filter	Yes — filtered
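The first recipe (`g711_ulaw` → base64 → Twilio `media`) can be sketched as follows. Several details are assumptions to check against the Twilio guide: the message shape follows Twilio's Media Streams `media` event as commonly documented, the 160-byte frame (20 ms at 8 kHz, one byte per sample) is a conventional chunk size rather than a requirement stated here, and `toTwilioMedia` is a hypothetical helper. Base64 is hand-rolled only to keep the sketch free of runtime-specific APIs.

```typescript
const FRAME_BYTES = 160; // 20 ms of 8 kHz G.711 at one byte per sample
const B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Standard base64 with "=" padding.
function toBase64(bytes: Uint8Array): string {
  let out = "";
  for (let i = 0; i < bytes.length; i += 3) {
    const b1 = i + 1 < bytes.length ? bytes[i + 1] : 0;
    const b2 = i + 2 < bytes.length ? bytes[i + 2] : 0;
    const n = (bytes[i] << 16) | (b1 << 8) | b2;
    out += B64.charAt((n >> 18) & 63) + B64.charAt((n >> 12) & 63);
    out += i + 1 < bytes.length ? B64.charAt((n >> 6) & 63) : "=";
    out += i + 2 < bytes.length ? B64.charAt(n & 63) : "=";
  }
  return out;
}

interface TwilioMediaMessage {
  event: "media";
  streamSid: string;
  media: { payload: string };
}

// Split raw μ-law bytes from Speak into frames and wrap each as a media message.
function toTwilioMedia(ulaw: Uint8Array, streamSid: string): TwilioMediaMessage[] {
  const messages: TwilioMediaMessage[] = [];
  for (let offset = 0; offset < ulaw.length; offset += FRAME_BYTES) {
    const frame = ulaw.subarray(offset, offset + FRAME_BYTES);
    messages.push({ event: "media", streamSid, media: { payload: toBase64(frame) } });
  }
  return messages;
}
```

Each message is sent over the stream's WebSocket as `JSON.stringify(message)`. In Node, `Buffer.from(frame).toString("base64")` can replace `toBase64`.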

See also

Phone agent with Twilio

The μ-law ↔ PCM16 bridge at 8 kHz, end to end.

Stream speech-to-text

Feed live call audio into Hear streaming.

FreeSWITCH integration

Fork SIP/PSTN audio into Omni at 16 kHz.

Errors & limits

Close codes, rate limits, and concurrency.
