Speak: ask for telephony audio directly (recommended)
POST /v1/audio/speech can emit phone-ready audio so you never resample on the
client. Shipping now: request a G.711 codec (g711_ulaw or g711_alaw), which is
always 8 kHz mono and is returned as audio/basic: 8-bit companded bytes you can
hand straight to the carrier.
| Want | response_format | sample_rate | Content-Type |
|---|---|---|---|
| Twilio / PSTN (North America, Japan) | g711_ulaw | fixed 8 kHz — omit, or set 8000 | audio/basic |
| A-law markets (most of EMEA/APAC) | g711_alaw | fixed 8 kHz — omit, or set 8000 | audio/basic |
| Linear PCM at the phone rate | pcm | 8000 | audio/pcm |
| Wideband (16 kHz session) | pcm | 16000 | audio/pcm |
For g711_ulaw / g711_alaw the rate is fixed at 8 kHz: omit sample_rate,
or pass exactly 8000; any other value is rejected. Speak’s default
response_format is mp3 and its native rate is 24 kHz, so for linear
output pass the sample_rate you actually need rather than inheriting 24 kHz.

Requesting g711_ulaw (or pcm at 8000/16000) means no resampling in your code
at all, which is the cleanest possible path.
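The rules above can be sketched as a small request builder. This is a minimal illustration, not the official client: `response_format` and `sample_rate` come from the table, while the `input` field name, the `https://api.example.com` host, and the bearer-token header are assumptions made only for the example.

```python
import json
import urllib.request

G711_FORMATS = {"g711_ulaw", "g711_alaw"}


def build_speech_body(text, response_format="g711_ulaw", sample_rate=None):
    """Build a JSON-serialisable body for POST /v1/audio/speech."""
    # "input" is an assumed field name for the text to synthesize.
    body = {"input": text, "response_format": response_format}
    if response_format in G711_FORMATS and sample_rate not in (None, 8000):
        # G.711 is fixed at 8 kHz: omit sample_rate or pass exactly 8000.
        raise ValueError("g711 formats only accept sample_rate=8000")
    if sample_rate is not None:
        body["sample_rate"] = sample_rate
    return body


def synthesize(text, api_key, base_url="https://api.example.com", **kwargs):
    """Send the request and return raw audio bytes (hypothetical host and auth)."""
    request = urllib.request.Request(
        base_url + "/v1/audio/speech",
        data=json.dumps(build_speech_body(text, **kwargs)).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

Validating the rate locally mirrors the server-side rejection described above, so a bad `sample_rate` fails before any network call is made.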
Native μ-law: it’s companding, not a sample rate
μ-law (and A-law) is an 8-bit logarithmic companding of 16-bit samples at 8 kHz: a codec, not a rate. Converting between μ-law and 16-bit PCM is a per-sample table lookup, not resampling, so it never aliases. Decoding μ-law to PCM16 is exact; encoding PCM16 to μ-law quantizes each sample to 8 bits.

- Outbound (Speak → phone): g711_ulaw hands you μ-law bytes ready for the carrier, with no encode step on your side.
- Inbound (phone → Hear/Omni): the realtime surfaces take PCM16 (Hear streaming encoding=pcm16; Omni format=pcm16), so decode the carrier’s μ-law to PCM16 first. Keep the session at 8 kHz and the only conversion is companding, never a resample.
μ-law ↔ PCM16 (companding only — no resampling)
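A minimal sketch of the companding bridge, written in pure Python because the standard-library `audioop` module was removed in Python 3.13. It follows the widely used G.711 μ-law reference arithmetic (bias 132, clip 32635); the function names are illustrative, not part of any SDK.

```python
import struct

BIAS = 0x84   # 132, added before the segment search
CLIP = 32635  # largest magnitude that still fits after adding BIAS


def linear_to_ulaw(sample):
    """Encode one signed 16-bit sample as a μ-law byte (quantizes to 8 bits)."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    exponent, mask = 7, 0x4000
    # Find the segment: position of the highest set bit, from bit 14 down to bit 7.
    while exponent > 0 and not magnitude & mask:
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def ulaw_to_linear(byte):
    """Decode one μ-law byte to a signed 16-bit sample."""
    byte = ~byte & 0xFF
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    sample = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -sample if byte & 0x80 else sample


# Decoding is a pure 256-entry table lookup.
DECODE_TABLE = [ulaw_to_linear(b) for b in range(256)]


def ulaw_decode(data):
    """μ-law bytes -> little-endian PCM16 bytes, same sample count and rate."""
    return struct.pack("<%dh" % len(data), *(DECODE_TABLE[b] for b in data))


def ulaw_encode(pcm16):
    """Little-endian PCM16 bytes -> μ-law bytes, same sample count and rate."""
    count = len(pcm16) // 2
    samples = struct.unpack("<%dh" % count, pcm16[:count * 2])
    return bytes(linear_to_ulaw(s) for s in samples)
```

Both directions keep one output sample per input sample, so the rate stays at 8 kHz and no filter is involved. A 20 ms carrier frame of 160 μ-law bytes decodes to 320 PCM16 bytes.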
If you must resample: exact integer ratios
Sometimes you can’t choose the rate at the source, e.g. you already hold a 24 kHz asset (Speak’s native rate) and need 16 kHz for Hear or 8 kHz for a phone leg. Rate-convert with these rational ratios: upsample by L, low-pass, then downsample by M.

| From → to | Ratio (up / down) | Required low-pass |
|---|---|---|
| 24 kHz → 16 kHz | up 2 / down 3 | cutoff < 8 kHz, between the up and down steps |
| 24 kHz → 8 kHz | up 1 / down 3 | cutoff < 4 kHz, before the down step |
| 48 kHz → 16 kHz | up 1 / down 3 | cutoff < 8 kHz, before the down step (browser mic) |
| 8 kHz → 16 kHz | up 2 / down 1 | cutoff < 4 kHz, after the up step (phone leg into a 16 kHz session) |
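The ratios in the table can be derived and applied with a short pure-Python sketch: zero-stuff by L, low-pass with a Hamming-windowed sinc, keep every M-th sample. The 63-tap length and the 0.45 cutoff factor (90% of the lower Nyquist limit) are illustrative choices, not values from this guide; a production pipeline would more likely use an optimized polyphase routine from a DSP library.

```python
import math


def ratio(src_rate, dst_rate):
    """Return (up, down) as the reduced fraction dst_rate / src_rate."""
    g = math.gcd(src_rate, dst_rate)
    return dst_rate // g, src_rate // g


def lowpass_taps(cutoff, num_taps=63):
    """Hamming-windowed sinc; cutoff is in cycles per sample (0 < cutoff < 0.5)."""
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        sinc = 2 * cutoff if x == 0 else math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]  # unity gain at DC


def resample(samples, src_rate, dst_rate, num_taps=63):
    """Rational resampler: upsample by L, low-pass, downsample by M."""
    up, down = ratio(src_rate, dst_rate)
    # Cutoff below the lower Nyquist limit, relative to the upsampled rate.
    cutoff = 0.45 * min(src_rate, dst_rate) / (src_rate * up)
    # Scale by `up` to restore the amplitude lost to zero-stuffing.
    taps = [t * up for t in lowpass_taps(cutoff, num_taps)]
    mid = (num_taps - 1) // 2
    out = []
    for k in range(len(samples) * up // down):
        center = k * down  # position in the (virtual) upsampled stream
        acc = 0.0
        for j, tap in enumerate(taps):
            i = center + mid - j
            if i % up == 0:  # every other position is a stuffed zero
                s = i // up
                if 0 <= s < len(samples):
                    acc += tap * samples[s]
        out.append(acc)
    return out
```

Calling `ratio(24000, 16000)` gives `(2, 3)` and `ratio(24000, 8000)` gives `(1, 3)`, matching the table. The filter is centered on each output sample, so the result is not delayed relative to the input; the first and last few samples are affected by the edges of the buffer.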
Common telephony recipes
| Scenario | Path | Resampling? |
|---|---|---|
| Speak → Twilio / PSTN | g711_ulaw → base64 → Twilio media | None |
| Twilio caller → Omni / Hear | μ-law decode → PCM16 @ 8 kHz → session at 8 kHz | None (companding only) |
| Speak 24 kHz asset → 8 kHz phone | resample up 1 / down 3 with a filter | Yes — filtered |
| Speak 24 kHz asset → 16 kHz Hear | resample up 2 / down 3 with a filter | Yes — filtered |
| Browser mic 48 kHz → 16 kHz Hear | resample up 1 / down 3 with a filter | Yes — filtered |
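For the first recipe, the "g711_ulaw → base64 → Twilio media" path might look like the sketch below. It wraps μ-law bytes in the `media` message shape used by Twilio Media Streams; the 160-byte (20 ms) frame size is a common convention rather than a requirement, and the function name is illustrative.

```python
import base64
import json

FRAME_BYTES = 160  # 20 ms of 8 kHz μ-law at 1 byte per sample


def twilio_media_messages(ulaw_audio, stream_sid, frame_bytes=FRAME_BYTES):
    """Yield JSON text frames carrying base64 μ-law for a Twilio Media Stream."""
    for start in range(0, len(ulaw_audio), frame_bytes):
        chunk = ulaw_audio[start:start + frame_bytes]
        yield json.dumps({
            "event": "media",
            "streamSid": stream_sid,
            "media": {"payload": base64.b64encode(chunk).decode("ascii")},
        })
```

Each yielded string is sent as one WebSocket text message on the open stream; the bytes from Speak pass through untouched apart from base64 encoding.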
See also
- Phone agent with Twilio: the μ-law ↔ PCM16 bridge at 8 kHz, end to end.
- Stream speech-to-text: feed live call audio into Hear streaming.
- FreeSWITCH integration: fork SIP/PSTN audio into Omni at 16 kHz.
- Errors & limits: close codes, rate limits, and concurrency.