Stream speech-to-text in real time

When you need words on screen while someone is still talking — live captions, a voice search box, agent-assist, meeting notes as they happen — you want streaming speech-to-text, not a batch job. This guide covers the architecture and the UX pattern for Hear streaming: a WebSocket that emits interim (“partial”) results within a few hundred milliseconds and final results when a phrase settles.

Endpoint: wss://api.pyai.com/v1/audio/transcriptions/stream. The full wire protocol — query params, the JSON frame schema, the {"type":"commit"} flush, and close codes — is published in the API reference (GET /v1/audio/transcriptions/stream, generated from https://api.pyai.com/openapi.json) and summarized in Wire protocol below. Last verified 2026-06-16 against api.pyai.com.

How it fits together

One socket carries two things: you send binary PCM16 audio frames up, and receive text JSON result frames down. Results come in two flavors — fast-but-revisable partials and stable finals — and your UI’s whole job is to show the former without committing to them, then lock in the latter.

Streaming vs. batch — pick the right tool

	Hear streaming (this guide)	Batch jobs (`/v1/transcription/jobs`)
Latency	Sub-second first partial; live	Seconds-to-minutes; asynchronous
Output	Interim partials + finals as you speak	One complete transcript when done
Best for	Live captions, voice UI, agent-assist	Recordings, archives, post-call analytics
Diarization	Speaker labeling is a batch strength	`channel` / `diarize` (exact + model-based)
Cost	Realtime tier	Discounted batch tier

If the audio has already finished (a recording, a voicemail, a call you’ve hung up on), use batch jobs: they’re cheaper and give you diarization. Reach for streaming only when a human is watching the words appear.

Audio format: PCM16, one sample rate

Hear streaming consumes PCM16 little-endian mono, the same family the rest of the realtime stack speaks. Pick 16 kHz for speech (16000) and keep every stage — capture, any resampling, and the wire — on that one rate. Sample-rate mismatches are the number-one cause of “it transcribes gibberish.”

Browser mic typically runs the AudioContext at 48 kHz. Either request a 16 kHz context, or low-pass filter and decimate 3:1 on capture (48000 / 16000 = 3) before sending; dropping samples without filtering causes aliasing.
Telephony is usually 8 kHz; upsample 2:1 to 16 kHz (16000 / 8000 = 2) with a real resampling filter — see Telephony audio for the exact ratios and the native μ-law option.
Frame size: ~20 ms per message (320 samples @ 16 kHz, i.e. 640 bytes of PCM16) keeps latency low without excessive per-message overhead.
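
The ratios and frame arithmetic above can be sketched as follows. This is a minimal illustration, not part of the published API: decimate3 and floatToPcm16 are hypothetical helper names, and the 3-sample average is only a crude stand-in for a real anti-aliasing filter.

```javascript
// Sketch only: a 3-tap box filter is a crude anti-alias stage; production
// code would typically use a windowed-sinc or polyphase resampler.
const SAMPLE_RATE = 16000;
const FRAME_MS = 20;
const FRAME_SAMPLES = (SAMPLE_RATE * FRAME_MS) / 1000; // 320 samples
const FRAME_BYTES = FRAME_SAMPLES * 2;                 // 640 bytes of PCM16

// Average each group of 3 input samples into 1 output sample (48 kHz -> 16 kHz).
function decimate3(input) {
  const out = new Float32Array(Math.floor(input.length / 3));
  for (let i = 0; i < out.length; i++) {
    const j = i * 3;
    out[i] = (input[j] + input[j + 1] + input[j + 2]) / 3;
  }
  return out;
}

// Clamp to [-1, 1] and scale to signed 16-bit, matching the capture worklet.
function floatToPcm16(input) {
  const out = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i]));
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}
```

An AudioWorklet typically delivers 128 samples per call, which is not a multiple of 3, so a streaming version of decimate3 would need to carry leftover samples from one call to the next.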

capture-processor.js — Float32 → PCM16 @ 16 kHz

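// Assumes the AudioContext runs at 16 kHz, e.g. new AudioContext({ sampleRate: 16000 }).
// This processor converts the sample format only; it does not resample.
const FRAME = 320; // 20 ms @ 16 kHz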

class CaptureProcessor extends AudioWorkletProcessor {
  constructor() {
    super();
    this._buf = new Int16Array(FRAME);
    this._n = 0;
  }
  process(inputs) {
    const ch = inputs[0]?.[0];
    if (!ch) return true;
    for (let i = 0; i < ch.length; i++) {
      const s = Math.max(-1, Math.min(1, ch[i]));
      this._buf[this._n++] = s < 0 ? s * 0x8000 : s * 0x7fff;
      if (this._n === FRAME) {
        const out = this._buf.slice();
        this.port.postMessage(out.buffer, [out.buffer]);
        this._n = 0;
      }
    }
    return true;
  }
}
registerProcessor("capture-processor", CaptureProcessor);

The two-pass display pattern

This is the heart of a good live-transcription UI. Treat the transcript as a list of committed final lines plus one pending partial at the end:

A partial arrives → render it greyed/italic as the “current” line. Partials are revisable — replace, never append.
A newer partial arrives → overwrite the same pending line.
A final arrives → commit it as solid text and clear the pending line.
Repeat for the next phrase.

The effect users love: text appears almost instantly, wobbles a little as the recognizer reconsiders, then snaps to clean final text — exactly like live captions on a video call.

Two-pass transcript renderer

const finals = [];     // committed lines
let pending = "";      // current partial (greyed)

function onPartial(text) {
  pending = text;      // replace, don't append
  paint();
}

function onFinal(text) {
  finals.push(text);   // commit
  pending = "";        // clear the greyed line
  paint();
}

function paint() {
  els.committed.textContent = finals.join(" ");
  els.pending.textContent = pending;      // styled grey/italic in CSS
}

Wire protocol

Connect. Open a WebSocket to the streaming transcription endpoint and pick your audio format with query params:

wss://api.pyai.com/v1/audio/transcriptions/stream
  ?model=pyai-hear&language=en&sample_rate=16000&encoding=pcm16&interim_results=true
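
If you assemble the URL in code, building it from named options avoids hand-escaping mistakes. A minimal sketch, assuming only the query parameters shown above; buildHearStreamUrl and its option names are hypothetical, not part of any SDK.

```javascript
// Hypothetical helper: the query parameter names mirror the URL above.
function buildHearStreamUrl({
  model = "pyai-hear",
  language = "en",
  sampleRate = 16000,
  encoding = "pcm16",
  interimResults = true,
} = {}) {
  const url = new URL("wss://api.pyai.com/v1/audio/transcriptions/stream");
  url.search = new URLSearchParams({
    model,
    language,
    sample_rate: String(sampleRate),
    encoding,
    interim_results: String(interimResults),
  }).toString();
  return url.toString();
}
```

Calling buildHearStreamUrl() with no arguments reproduces the URL above; pass sampleRate together with a matching capture pipeline if you change the rate.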

Authenticate the upgrade with your key as a subprotocol (browser-safe): Sec-WebSocket-Protocol: pyai-key.<API_KEY>. Server-side clients may use ?api_key= instead. The key needs the hear:stream scope.

Client → server. Stream binary audio frames continuously (PCM16 little-endian mono at sample_rate, or opus), ~20 ms each. To force-finalize the current utterance, e.g. when your own VAD detects end-of-turn, send the JSON text frame {"type":"commit"} (recommended). The literal text frame EOF is accepted as an alias, and closing the socket also flushes a final.

{"type":"end"} is ignored by the engine and will not finalize your audio: use commit, EOF, or close the socket.
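
One way to decide when to send the commit is a simple energy-based end-of-turn detector on the frames you are already sending. The sketch below is an illustration under stated assumptions: createTurnDetector is a hypothetical helper, and the RMS threshold and 600 ms silence window are placeholder values, not documented or tuned defaults.

```javascript
// Hypothetical energy-based end-of-turn detector. The threshold and silence
// window are illustrative values, not tuned or documented defaults.
function createTurnDetector({ onEndOfTurn, frameMs = 20, silenceMs = 600, threshold = 500 }) {
  let silentMs = 0;
  let heardSpeech = false;

  // Feed one PCM16 frame (Int16Array) per call, in capture order.
  return function feed(frame) {
    let sumSquares = 0;
    for (let i = 0; i < frame.length; i++) sumSquares += frame[i] * frame[i];
    const rms = Math.sqrt(sumSquares / frame.length);

    if (rms >= threshold) {
      heardSpeech = true;
      silentMs = 0;
      return;
    }
    if (!heardSpeech) return; // ignore silence before any speech
    silentMs += frameMs;
    if (silentMs >= silenceMs) {
      heardSpeech = false;    // fire once per utterance
      silentMs = 0;
      onEndOfTurn();
    }
  };
}
```

In use, onEndOfTurn would send the {"type":"commit"} text frame on the open socket, and feed would be called with each frame just before it is sent.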

Server → client. JSON text frames, emitted with bare type names:

`type`	When	Payload
`partial`	every eager tick	`{ text, utterance_id, t_ms }` (may also carry `stable_text` / `active_text`) — the live hypothesis; replace it per `utterance_id`
`partial_stable`	when a prefix locks in	`{ text, utterance_id, t_ms }` — the portion the recognizer no longer expects to revise
`speech_final`	on endpoint / after `commit`	`{ text, utterance_id, t_ms, audio_ms }` — end of an utterance, stable
`final`	follows `speech_final`	`{ text, utterance_id, t_ms, audio_ms }` — corrected full-context transcript
`error`	on fault	`{ code, message }`

utterance_id groups the partials and finals of one phrase; t_ms is the audio position of the hypothesis; audio_ms is the utterance’s active-speech length.
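
Because every frame carries an utterance_id, client state can be kept as one entry per utterance that later frames overwrite. A minimal sketch of that bookkeeping, assuming only the frame fields listed in the table; createTranscript and its return shape are hypothetical.

```javascript
// Sketch: one entry per utterance_id; later frames overwrite earlier ones.
function createTranscript() {
  const utterances = new Map(); // utterance_id -> { text, stable }

  function apply(frame) {
    const { type, utterance_id: id, text } = frame;
    if (type === "partial") {
      const current = utterances.get(id);
      // Assumption: a late partial must not overwrite an already-stable line.
      if (!current || !current.stable) utterances.set(id, { text, stable: false });
    } else if (type === "speech_final" || type === "final") {
      // `final` follows `speech_final` for the same id and replaces its text.
      utterances.set(id, { text, stable: true });
    }
    // partial_stable and error frames are left to the caller in this sketch.
  }

  // Map keeps first-insertion order, so lines come out in first-seen order.
  function lines() {
    return [...utterances.values()];
  }

  return { apply, lines };
}
```

Keying by utterance_id means the corrected final replaces the earlier speech_final text in place instead of appearing as a second line.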

Cue (grounding). Send {"type":"config","grounding":true} as the first JSON frame to ground transcripts against your knowledge base. The speech_final and final frames then carry a grounding array ([{ content, score }, ...], top-3 passages; [] when nothing is bound). A grounded session is billed as Cue.
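
A grounded session can be sketched as two small pieces: sending the config frame before anything else, and reading the grounding array off stable frames. The helper names startGroundedSession and passagesOf are hypothetical; only the frame shapes come from the description above.

```javascript
// Sketch: send the grounding config first, then read passages off stable frames.
const GROUNDING_CONFIG = JSON.stringify({ type: "config", grounding: true });

// `socket` only needs a send() method, so a real WebSocket or a test stub works.
function startGroundedSession(socket) {
  socket.send(GROUNDING_CONFIG); // must be the first JSON frame
}

// Return the grounding passages of a speech_final/final frame, or [] otherwise.
function passagesOf(frame) {
  const isStable = frame.type === "speech_final" || frame.type === "final";
  if (!isStable || !Array.isArray(frame.grounding)) return [];
  return frame.grounding;
}
```

With a browser WebSocket, startGroundedSession would be called from the open handler, before the first audio frame is forwarded.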

Close codes. 1000 normal · 1008 auth/policy (bad key or missing hear:stream scope) · 1011 engine error · 4429 over the concurrency cap. A 1011 that arrives immediately after a commit-flushed final is a known, benign close (fix in progress) — treat it as success if you already received the final.
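
The close codes above map naturally onto a small classifier that a close handler can consult. This is a sketch: classifyClose is a hypothetical helper, and the retry flags are an assumed client policy rather than documented behavior.

```javascript
// Hypothetical classifier; the retry flags are an assumed policy, not documented behavior.
function classifyClose(code, sawFinal) {
  switch (code) {
    case 1000:
      return { ok: true, retry: false, reason: "normal" };
    case 1008:
      return { ok: false, retry: false, reason: "auth or policy: check key and hear:stream scope" };
    case 1011:
      // Known benign close when it follows a commit-flushed final.
      return sawFinal
        ? { ok: true, retry: false, reason: "benign 1011 after final" }
        : { ok: false, retry: true, reason: "engine error" };
    case 4429:
      return { ok: false, retry: true, reason: "over the concurrency cap" };
    default:
      return { ok: false, retry: true, reason: `unexpected close code ${code}` };
  }
}
```

A close handler would pass the event's code plus a flag recording whether a final frame had already arrived on that socket.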

Wire it up

Open the socket, stream capture frames as binary, and route text frames through one handler. Authenticate the upgrade with the key as a subprotocol — the browser-safe pattern that’s stable across PyAI’s realtime surfaces (server-side clients may use ?api_key= instead).

Connect + route frames

const HEAR_STREAM_URL =
  "wss://api.pyai.com/v1/audio/transcriptions/stream" +
  "?model=pyai-hear&language=en&sample_rate=16000&encoding=pcm16&interim_results=true";

function connectHearStream(apiKey, capture) {
  const ws = new WebSocket(HEAR_STREAM_URL, [`pyai-key.${apiKey}`]);
  ws.binaryType = "arraybuffer";

  // mic/call PCM16 frames → server (binary)
  capture.port.onmessage = (e) => {
    if (ws.readyState === WebSocket.OPEN) ws.send(e.data);
  };

  // text frames are JSON results — all framing lives in handleFrame()
  ws.onmessage = (e) => {
    if (typeof e.data === "string") handleFrame(JSON.parse(e.data));
  };
  return ws;
}

// Force-finalize the current utterance (e.g. on your own end-of-turn signal).
// The flush is {"type":"commit"} — {"type":"end"} is ignored by the engine.
function commit(ws) {
  if (ws.readyState === WebSocket.OPEN) ws.send(JSON.stringify({ type: "commit" }));
}

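// The one place that knows the frame schema. The engine emits bare types:
// partial / partial_stable (interim) and speech_final / final (stable).
// A `final` follows the `speech_final` of the same utterance_id, so it must
// replace that committed line rather than be appended as a second copy.
const committedAt = new Map(); // utterance_id → index in finals[]

function handleFrame(msg) {
  switch (msg.type) {
    case "partial":
    case "partial_stable":
      onPartial(msg.text);
      break;
    case "speech_final":
      onFinal(msg.text);
      committedAt.set(msg.utterance_id, finals.length - 1);
      break;
    case "final":
      if (committedAt.has(msg.utterance_id)) {
        finals[committedAt.get(msg.utterance_id)] = msg.text; // corrected text
        committedAt.delete(msg.utterance_id);
        paint();
      } else {
        onFinal(msg.text); // no speech_final was seen for this utterance
      }
      break;
    case "error":
      console.error("hear-stream error:", msg.code, msg.message);
      break;
  }
}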

The endpoint path, the frame schema, and the {"type":"commit"} control message are part of the published contract — see the API reference (GET /v1/audio/transcriptions/stream) or the Wire protocol summary above. Keep all framing inside handleFrame so adding a field later is a one-place change.

Latency expectations

First partial: sub-second. Once audio is flowing you should see an initial partial within a few hundred milliseconds — that immediacy is the entire point of streaming.
Finals lag partials slightly. A phrase finalizes once the recognizer is confident (typically at a pause or end of utterance). This is normal — show partials so the UI never feels stalled while waiting for a final.
Keep frames small and steady. 20 ms frames sent as they’re produced minimize end-to-end latency; don’t batch several seconds of audio into one message.
Don’t add your own buffering on top. Send frames straight from the capture worklet; extra queues only add delay.
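
To check these expectations in your own app, you can time the gap between the first audio frame sent and the first partial received. A minimal sketch; createLatencyMeter is a hypothetical helper, and the clock is injectable so the logic can be exercised without real time passing.

```javascript
// Sketch: time from the first audio frame sent to the first partial received.
// `now` defaults to performance.now() and can be replaced with a fake clock.
function createLatencyMeter(now = () => performance.now()) {
  let firstSentAt = null;
  let firstPartialMs = null;
  return {
    // Call when an audio frame is sent; only the first call is recorded.
    markAudioSent() {
      if (firstSentAt === null) firstSentAt = now();
    },
    // Call with every parsed result frame; only the first partial is recorded.
    markFrame(frame) {
      if (frame.type === "partial" && firstSentAt !== null && firstPartialMs === null) {
        firstPartialMs = now() - firstSentAt;
      }
    },
    // Milliseconds to first partial, or null if none has arrived yet.
    result() {
      return firstPartialMs;
    },
  };
}
```

This measures what the client observes, so it includes network round-trip time as well as recognizer latency.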

Troubleshooting

Symptom	Likely cause	Fix
Transcription is gibberish	Sample-rate mismatch	Capture, resample, and send all at one rate (16 kHz); decimate 3:1 from a 48 kHz context
No partials appear	Audio not flowing, or frames too large	Confirm binary PCM16 frames are being sent (~20 ms each) from the capture worklet
Text “jumps around” before settling	Treating partials as final	Render partials as a single replaceable greyed line; only append on a final
Connection closes immediately	Bad credential / wrong subprotocol	Check the close code in Errors & limits (1008 means a bad key or a missing hear:stream scope); auth via `pyai-key.<key>` subprotocol
`429 concurrency_limit_exceeded` (or close code 4429)	Too many live sessions	Close idle sockets or raise the limit on your plan
High latency / choppy partials	Frames batched or buffered before send	Send each ~20 ms frame immediately; remove extra jitter buffers
Transcript never finalizes	Waiting on a flush, or sent the wrong control frame	Send `{"type":"commit"}` to force a final (note: `{"type":"end"}` is ignored), or close the socket

Next steps

Telephony audio (8 kHz μ-law)

Stream phone-call audio into Hear: native μ-law and the exact resample ratios.

Conversation intelligence

When the audio is finished, batch-transcribe at the discounted rate with diarization.

Browser voice agent

The same PCM16 capture pipeline, driving a full-duplex Omni agent.

Errors & limits

Close codes, rate limits, and concurrency.