Endpoint:
wss://api.pyai.com/v1/audio/transcriptions/stream. The full wire
protocol — query params, the JSON frame schema, the {"type":"commit"} flush, and
close codes — is published in the API reference
(GET /v1/audio/transcriptions/stream, generated from
https://api.pyai.com/openapi.json) and summarized in
Wire protocol below. Last verified 2026-06-16 against
api.pyai.com.
How it fits together
One socket carries two things: you send binary PCM16 audio frames up and receive JSON text result frames down. Results come in two flavors, fast-but-revisable partials and stable finals, and your UI’s whole job is to show the former without committing to them, then lock in the latter.
Streaming vs. batch: pick the right tool
| | Hear streaming (this guide) | Batch jobs (/v1/transcription/jobs) |
|---|---|---|
| Latency | Sub-second first partial; live | Seconds-to-minutes; asynchronous |
| Output | Interim partials + finals as you speak | One complete transcript when done |
| Best for | Live captions, voice UI, agent-assist | Recordings, archives, post-call analytics |
| Diarization | Speaker labeling is a batch strength | channel / diarize (exact + model-based) |
| Cost | Realtime tier | Discounted batch tier |
Audio format: PCM16, one sample rate
Hear streaming consumes PCM16 little-endian mono, the same family the rest of the realtime stack speaks. Pick 16 kHz for speech (16000) and keep every
stage — capture, any resampling, and the wire — on that one rate. Sample-rate
mismatches are the number-one cause of “it transcribes gibberish.”
- Browser mic typically runs the AudioContext at 48 kHz. Either request a 16 kHz context, or decimate 3:1 on capture (48000 / 16000 = 3) before sending.
- Telephony is usually 8 kHz; upsample 2:1 to 16 kHz (16000 / 8000 = 2) with a real resampling filter. See Telephony audio for the exact ratios and the native μ-law option.
- Frame size: ~20 ms per message (320 samples, or 640 bytes, @ 16 kHz) keeps latency low without excessive per-message overhead.
capture-processor.js — Float32 → PCM16 @ 16 kHz
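A minimal sketch of the conversion this listing is named for, assuming a 48 kHz capture context. The names `decimate`, `floatToPcm16`, and `CaptureFramer` are illustrative and not part of any PyAI SDK, and the box-average filter is a stand-in for a proper resampling filter. The helpers avoid Web Audio APIs on purpose, so you can exercise them outside a worklet.

```javascript
const TARGET_RATE = 16000; // wire sample rate used throughout this guide
const FRAME_SAMPLES = (TARGET_RATE * 20) / 1000; // 20 ms at 16 kHz = 320 samples

// Decimate by an integer factor, averaging each group of `factor` samples.
// Assumption: the box average is only a crude anti-aliasing filter; a real
// resampler would apply a proper low-pass FIR before dropping samples.
function decimate(input, factor) {
  const out = new Float32Array(Math.floor(input.length / factor));
  for (let i = 0; i < out.length; i++) {
    let sum = 0;
    for (let j = 0; j < factor; j++) sum += input[i * factor + j];
    out[i] = sum / factor;
  }
  return out;
}

// Convert Float32 samples in [-1, 1] to little-endian signed 16-bit PCM.
function floatToPcm16(input) {
  const buffer = new ArrayBuffer(input.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i])); // clamp to avoid wraparound
    view.setInt16(i * 2, Math.round(s < 0 ? s * 0x8000 : s * 0x7fff), true);
  }
  return buffer;
}

// Hypothetical helper: buffers raw capture samples and emits 20 ms PCM16 frames.
class CaptureFramer {
  constructor(inputRate = 48000) {
    if (inputRate % TARGET_RATE !== 0) {
      throw new Error("inputRate must be an integer multiple of 16000");
    }
    this.factor = inputRate / TARGET_RATE; // 3 for a 48 kHz context
    this.rawPerFrame = FRAME_SAMPLES * this.factor; // 960 raw samples per frame
    this.pending = new Float32Array(0);
  }

  // Append raw Float32 samples; return zero or more 640-byte ArrayBuffers.
  push(samples) {
    const merged = new Float32Array(this.pending.length + samples.length);
    merged.set(this.pending, 0);
    merged.set(samples, this.pending.length);
    const frames = [];
    let offset = 0;
    while (merged.length - offset >= this.rawPerFrame) {
      const raw = merged.subarray(offset, offset + this.rawPerFrame);
      frames.push(floatToPcm16(decimate(raw, this.factor)));
      offset += this.rawPerFrame;
    }
    this.pending = merged.slice(offset); // keep the remainder for the next call
    return frames;
  }
}
```

Inside an AudioWorkletProcessor's process() callback you could pass the first input channel (when one is present) to `framer.push(...)` and post each returned buffer to the main thread for sending. Buffering raw samples across calls matters because a 128-sample render quantum does not divide evenly by 3.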
The two-pass display pattern
This is the heart of a good live-transcription UI. Treat the transcript as a list of committed final lines plus one pending partial at the end:
- A partial arrives → render it greyed/italic as the “current” line. Partials are revisable: replace, never append.
- A newer partial arrives → overwrite the same pending line.
- A final arrives → commit it as solid text and clear the pending line.
- Repeat for the next phrase.
Two-pass transcript renderer
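The steps above can be sketched as a small state holder, assuming your UI layer draws from a plain view object. `TwoPassTranscript` and its method names are hypothetical; the only rule it encodes is the one described here: partials replace the pending line, finals append and clear it.

```javascript
// Hypothetical two-pass transcript model: committed lines plus one pending line.
class TwoPassTranscript {
  constructor() {
    this.committed = []; // final lines, never revised once pushed
    this.pending = "";   // current partial, overwritten on every update
  }

  // A partial (or a newer partial) replaces the pending line; it never appends.
  onPartial(text) {
    this.pending = text;
  }

  // A final commits solid text and clears the pending line.
  onFinal(text) {
    if (text) this.committed.push(text);
    this.pending = "";
  }

  // Snapshot for the UI: draw `committed` as solid text and `pending` greyed.
  view() {
    return { committed: [...this.committed], pending: this.pending };
  }
}
```

A renderer would call `view()` after each frame and redraw only the last line, which keeps the committed text from flickering while the partial changes.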
Wire protocol
Connect. Open a WebSocket to the streaming transcription endpoint and pick your audio format with query params.
Authenticate with Sec-WebSocket-Protocol: pyai-key.<API_KEY>; server-side clients may use
?api_key= instead. The key needs the hear:stream scope.
Client → server. Stream binary audio frames continuously (PCM16
little-endian mono at sample_rate, or opus), ~20 ms each. To force-finalize the
current utterance — e.g. when your own VAD detects end-of-turn — send a JSON text
frame {"type":"commit"} (the literal text frame EOF is also accepted as an
equivalent flush); closing the socket also flushes a final.
Server → client. JSON text frames, emitted with bare type names:
| type | When | Payload |
|---|---|---|
| partial | every eager tick | { text, utterance_id, t_ms } (may also carry stable_text / active_text): the live hypothesis; replace it per utterance_id |
| partial_stable | when a prefix locks in | { text, utterance_id, t_ms }: the portion the recognizer no longer expects to revise |
| speech_final | on endpoint / after commit | { text, utterance_id, t_ms, audio_ms }: end of an utterance, stable |
| final | follows speech_final | { text, utterance_id, t_ms, audio_ms }: corrected full-context transcript |
| error | on fault | { code, message } |
utterance_id groups the partials and finals of one phrase; t_ms is the audio
position of the hypothesis; audio_ms is the utterance’s active-speech length.
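One way to fold these frames into per-utterance state is a small reducer keyed by utterance_id. This is a sketch under assumptions: `createTranscript` and `applyFrame` are hypothetical names, and ignoring a partial that arrives after a speech_final or final for the same utterance is a defensive choice, not documented server behavior.

```javascript
// Hypothetical reducer: folds server frames into one line per utterance_id.
function createTranscript() {
  // Map keeps insertion order, so lines stay in first-seen order.
  return { lines: new Map(), lastError: null };
}

function applyFrame(state, frame) {
  if (frame.type === "error") {
    state.lastError = { code: frame.code, message: frame.message };
    return state;
  }
  const known = ["partial", "partial_stable", "speech_final", "final"];
  if (!known.includes(frame.type)) return state; // ignore unknown frame types

  let line = state.lines.get(frame.utterance_id);
  if (!line) {
    line = { text: "", stable: "", status: "partial" };
    state.lines.set(frame.utterance_id, line);
  }

  switch (frame.type) {
    case "partial":
      // Assumption: a partial after speech_final/final for this id is stale.
      if (line.status === "partial") line.text = frame.text;
      break;
    case "partial_stable":
      // The locked-in prefix; tracked separately from the live hypothesis.
      if (line.status === "partial") line.stable = frame.text;
      break;
    case "speech_final":
      // Assumption: never downgrade a line that already holds the final text.
      if (line.status !== "final") {
        line.text = frame.text;
        line.status = "speech_final";
      }
      break;
    case "final":
      // The corrected transcript replaces the speech_final text.
      line.text = frame.text;
      line.status = "final";
      break;
  }
  return state;
}
```

Lines whose status is "partial" map to the greyed pending line; "speech_final" and "final" lines are safe to draw as solid text, with "final" overwriting the earlier wording in place.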
Cue (grounding). Send
{"type":"config","grounding":true} as the first
JSON frame to ground transcripts against your knowledge base. The speech_final
and final frames then carry a grounding array ([{ content, score }, ...],
top-3 passages; [] when nothing is bound). A grounded session is billed as
Cue.
Close codes. 1000 normal · 1008 auth/policy (bad key or missing
hear:stream scope) · 1011 engine error · 4429 over the concurrency cap. A
1011 that arrives immediately after a commit-flushed final is a known,
benign close (fix in progress); treat it as success if you already received the
final.
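A small classifier makes the close-code handling explicit, including the benign 1011 case. This is a sketch: `classifyClose` and the `sawFinalAfterCommit` flag are hypothetical, and the `retry` hints are an assumption about sensible client behavior rather than documented guidance.

```javascript
// Hypothetical helper: turn a WebSocket close code into a client decision.
// `sawFinalAfterCommit` should be true only if a final frame arrived after
// the client sent {"type":"commit"} and before the socket closed.
function classifyClose(code, { sawFinalAfterCommit = false } = {}) {
  switch (code) {
    case 1000:
      return { ok: true, retry: false, reason: "normal" };
    case 1008:
      // Bad key or missing hear:stream scope; retrying with the same key will not help.
      return { ok: false, retry: false, reason: "auth_or_policy" };
    case 1011:
      // Known benign close right after a commit-flushed final.
      return sawFinalAfterCommit
        ? { ok: true, retry: false, reason: "benign_after_commit" }
        : { ok: false, retry: true, reason: "engine_error" };
    case 4429:
      // Over the concurrency cap; assumption: retry after closing idle sockets.
      return { ok: false, retry: true, reason: "concurrency_cap" };
    default:
      return { ok: false, retry: true, reason: "unknown" };
  }
}
```

Calling this from the socket's close handler keeps the special case in one place, so it is easy to delete once the 1011 fix ships.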
Wire it up
Open the socket, stream capture frames as binary, and route text frames through one handler. Authenticate the upgrade with the key as a subprotocol, the browser-safe pattern that’s stable across PyAI’s realtime surfaces (server-side clients may use ?api_key= instead).
Connect + route frames
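A minimal sketch of the connect-and-route step using the standard WebSocket API. Treat the details as assumptions: `openHearStream` and the `handlers` object are hypothetical, `sample_rate` is assumed to be the query parameter name (check the API reference for the exact params), and the `WebSocketImpl` option exists only so the function can be exercised with a stand-in socket.

```javascript
const HEAR_URL = "wss://api.pyai.com/v1/audio/transcriptions/stream";

// Single place where frame types are interpreted.
function handleFrame(frame, handlers) {
  switch (frame.type) {
    case "partial":
    case "partial_stable":
      handlers.onPartial(frame);
      break;
    case "speech_final":
    case "final":
      handlers.onFinal(frame);
      break;
    case "error":
      handlers.onError(frame);
      break;
    default:
      break; // ignore unknown types so new ones do not break the client
  }
}

// Hypothetical wrapper: opens the socket and returns send/commit/close helpers.
function openHearStream({
  apiKey,
  sampleRate = 16000,
  handlers,
  onClose = () => {},
  WebSocketImpl = globalThis.WebSocket,
}) {
  const url = new URL(HEAR_URL);
  url.searchParams.set("sample_rate", String(sampleRate)); // assumed param name

  // The key travels as a subprotocol: Sec-WebSocket-Protocol: pyai-key.<API_KEY>
  const ws = new WebSocketImpl(url.toString(), [`pyai-key.${apiKey}`]);
  ws.binaryType = "arraybuffer";

  ws.onmessage = (event) => {
    if (typeof event.data !== "string") return; // results arrive as JSON text
    let frame;
    try {
      frame = JSON.parse(event.data);
    } catch {
      return; // skip anything that is not valid JSON
    }
    handleFrame(frame, handlers);
  };
  ws.onclose = (event) => onClose(event.code);

  const isOpen = () => ws.readyState === 1; // 1 === WebSocket.OPEN
  return {
    sendAudio(buffer) {
      if (isOpen()) ws.send(buffer); // one ~20 ms PCM16 frame per message
    },
    commit() {
      if (isOpen()) ws.send(JSON.stringify({ type: "commit" }));
    },
    close() {
      ws.close(1000);
    },
  };
}
```

In a browser you would call `openHearStream({ apiKey, handlers })` and pass each capture frame to `sendAudio` as it is produced; `commit()` is the hook for your own end-of-turn detection.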
The endpoint path, the frame schema, and the
{"type":"commit"} control message
are part of the published contract — see the API reference
(GET /v1/audio/transcriptions/stream) or the Wire protocol
summary above. Keep all framing inside handleFrame so adding a field later is a
one-place change.
Latency expectations
- First partial: sub-second. Once audio is flowing you should see an initial partial within a few hundred milliseconds — that immediacy is the entire point of streaming.
- Finals lag partials slightly. A phrase finalizes once the recognizer is confident (typically at a pause or end of utterance). This is normal — show partials so the UI never feels stalled while waiting for a final.
- Keep frames small and steady. 20 ms frames sent as they’re produced minimize end-to-end latency; don’t batch several seconds of audio into one message.
- Don’t add your own buffering on top. Send frames straight from the capture worklet; extra queues only add delay.
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| Transcription is gibberish | Sample-rate mismatch | Capture, resample, and send all at one rate (16 kHz); decimate 3:1 from a 48 kHz context |
| No partials appear | Audio not flowing, or frames too large | Confirm binary PCM16 frames are being sent (~20 ms each) from the capture worklet |
| Text “jumps around” before settling | Treating partials as final | Render partials as a single replaceable greyed line; only append on a final |
| Connection closes immediately | Bad credential / wrong subprotocol | Check the close code in Errors & limits; auth via pyai-key.<key> subprotocol |
| 429 concurrency_limit_exceeded | Too many live sessions | Close idle sockets or raise the limit on your plan |
| High latency / choppy partials | Frames batched or buffered before send | Send each ~20 ms frame immediately; remove extra jitter buffers |
| Transcript never finalizes | Waiting on a flush, or sent the wrong control frame | Send {"type":"commit"} to force a final (note: {"type":"end"} is ignored), or close the socket |
Next steps
- Telephony audio (8 kHz μ-law). Stream phone-call audio into Hear: native μ-law and the exact resample ratios.
- Conversation intelligence. When the audio is finished, batch-transcribe at the discounted rate with diarization.
- Browser voice agent. The same PCM16 capture pipeline, driving a full-duplex Omni agent.
- Errors & limits. Close codes, rate limits, and concurrency.