When you need words on screen while someone is still talking — live captions, a voice search box, agent-assist, meeting notes as they happen — you want streaming speech-to-text, not a batch job. This guide covers the architecture and the UX pattern for Hear streaming: a WebSocket that emits interim (“partial”) results within a few hundred milliseconds and final results when a phrase settles.
Endpoint: wss://api.pyai.com/v1/audio/transcriptions/stream. The full wire protocol — query params, the JSON frame schema, the {"type":"commit"} flush, and close codes — is published in the API reference (GET /v1/audio/transcriptions/stream, generated from https://api.pyai.com/openapi.json) and summarized in Wire protocol below. Last verified 2026-06-16 against api.pyai.com.

How it fits together

One socket carries two things: you send binary PCM16 audio frames up, and receive text JSON result frames down. Results come in two flavors — fast-but-revisable partials and stable finals — and your UI’s whole job is to show the former without committing to them, then lock in the latter.

Streaming vs. batch — pick the right tool

| | Hear streaming (this guide) | Batch jobs (/v1/transcription/jobs) |
| --- | --- | --- |
| Latency | Sub-second first partial; live | Seconds-to-minutes; asynchronous |
| Output | Interim partials + finals as you speak | One complete transcript when done |
| Best for | Live captions, voice UI, agent-assist | Recordings, archives, post-call analytics |
| Diarization | Speaker labeling is a batch strength | channel / diarize (exact + model-based) |
| Cost | Realtime tier | Discounted batch tier |
If the audio has already finished (a recording, a voicemail, a call you’ve hung up on), use batch jobs — it’s cheaper and gives you diarization. Reach for streaming only when a human is watching the words appear.

Audio format: PCM16, one sample rate

Hear streaming consumes PCM16 little-endian mono, the same family the rest of the realtime stack speaks. Pick 16 kHz for speech (16000) and keep every stage — capture, any resampling, and the wire — on that one rate. Sample-rate mismatches are the number-one cause of “it transcribes gibberish.”
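  • Browser mic: the AudioContext typically runs at 48 kHz. Either request a 16 kHz context, or low-pass filter and decimate 3:1 on capture (48000 / 16000 = 3) before sending. Dropping two of every three samples without filtering aliases high frequencies into the speech band.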
  • Telephony is usually 8 kHz; upsample 2:1 to 16 kHz (16000 / 8000 = 2) with a real resampling filter — see Telephony audio for the exact ratios and the native μ-law option.
  • Frame size: ~20 ms per message (320 samples @ 16 kHz) keeps latency low without paying per-message overhead.
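The frame-size arithmetic in the list above can be checked with a small helper. This is only a sketch: `frameSamples` and `frameBytes` are hypothetical names, not part of any PyAI SDK, and the byte count assumes PCM16 mono (2 bytes per sample).

```javascript
// Hypothetical helpers (not part of any SDK): size of one audio frame.

// Number of samples in a frame of `frameMs` milliseconds at `sampleRate` Hz.
function frameSamples(sampleRate, frameMs) {
  return Math.round((sampleRate * frameMs) / 1000);
}

// Bytes on the wire for the same frame, assuming PCM16 mono (2 bytes/sample).
function frameBytes(sampleRate, frameMs) {
  return frameSamples(sampleRate, frameMs) * 2;
}

console.log(frameSamples(16000, 20)); // 320
console.log(frameBytes(16000, 20));   // 640
```

At 16 kHz a 20 ms frame is therefore 640 bytes of payload per WebSocket message.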
capture-processor.js — Float32 → PCM16 @ 16 kHz
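// Assumes the AudioContext was created at 16 kHz: this worklet only converts
// Float32 to PCM16 and frames the audio. It does not resample.
const FRAME = 320; // 20 ms @ 16 kHz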

class CaptureProcessor extends AudioWorkletProcessor {
  constructor() {
    super();
    this._buf = new Int16Array(FRAME);
    this._n = 0;
  }
  process(inputs) {
    const ch = inputs[0]?.[0];
    if (!ch) return true;
    for (let i = 0; i < ch.length; i++) {
      const s = Math.max(-1, Math.min(1, ch[i]));
      this._buf[this._n++] = s < 0 ? s * 0x8000 : s * 0x7fff;
      if (this._n === FRAME) {
        const out = this._buf.slice();
        this.port.postMessage(out.buffer, [out.buffer]);
        this._n = 0;
      }
    }
    return true;
  }
}
registerProcessor("capture-processor", CaptureProcessor);
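If the AudioContext cannot be created at 16 kHz, the 48 kHz input has to be reduced 3:1 before the Float32 to PCM16 step. The sketch below is one minimal way to do that, assuming a 48 kHz input: it averages each group of three samples, which acts as a crude low-pass filter before decimation. `Decimator3` is a hypothetical helper, not a PyAI API, and a production pipeline would likely use a proper FIR resampling filter instead. It keeps state across calls because Web Audio render blocks (128 samples) are not a multiple of 3.

```javascript
// Hypothetical helper: 3:1 decimation (48 kHz -> 16 kHz) by averaging
// each run of three input samples. Averaging is a crude low-pass filter;
// swap in a real FIR resampler if quality matters.
class Decimator3 {
  constructor() {
    this._acc = 0;   // running sum of the current group
    this._count = 0; // samples accumulated in the current group (0..2)
  }

  // Accepts a Float32Array of 48 kHz samples and returns a Float32Array of
  // 16 kHz samples. Leftover samples carry over to the next call.
  process(input) {
    const out = [];
    for (let i = 0; i < input.length; i++) {
      this._acc += input[i];
      this._count += 1;
      if (this._count === 3) {
        out.push(this._acc / 3);
        this._acc = 0;
        this._count = 0;
      }
    }
    return Float32Array.from(out);
  }
}
```

Inside a capture worklet, the output of `process` would feed the same clamp-and-scale loop shown above in place of the raw input channel.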

The two-pass display pattern

This is the heart of a good live-transcription UI. Treat the transcript as a list of committed final lines plus one pending partial at the end:
  1. A partial arrives → render it greyed/italic as the “current” line. Partials are revisable — replace, never append.
  2. A newer partial arrives → overwrite the same pending line.
  3. A final arrives → commit it as solid text and clear the pending line.
  4. Repeat for the next phrase.
The effect users love: text appears almost instantly, wobbles a little as the recognizer reconsiders, then snaps to clean final text — exactly like live captions on a video call.
Two-pass transcript renderer
const finals = [];     // committed lines
let pending = "";      // current partial (greyed)

function onPartial(text) {
  pending = text;      // replace, don't append
  paint();
}

function onFinal(text) {
  finals.push(text);   // commit
  pending = "";        // clear the greyed line
  paint();
}

function paint() {
  els.committed.textContent = finals.join(" ");
  els.pending.textContent = pending;      // styled grey/italic in CSS
}

Wire protocol

Connect. Open a WebSocket to the streaming transcription endpoint and pick your audio format with query params:
wss://api.pyai.com/v1/audio/transcriptions/stream
  ?model=pyai-hear&language=en&sample_rate=16000&encoding=pcm16&interim_results=true
Authenticate the upgrade with your key as a subprotocol (browser-safe): Sec-WebSocket-Protocol: pyai-key.<API_KEY>. Server-side clients may use ?api_key= instead. The key needs the hear:stream scope.
Client → server. Stream binary audio frames continuously (PCM16 little-endian mono at sample_rate, or opus), ~20 ms each. To force-finalize the current utterance, for example when your own VAD detects end-of-turn, send the JSON text frame {"type":"commit"} (recommended). The literal text frame EOF is accepted as an alias, and closing the socket also flushes a final.
{"type":"end"} is ignored by the engine and will not finalize your audio. Use commit, EOF, or close the socket.
Server → client. JSON text frames, emitted with bare type names:
| type | When | Payload |
| --- | --- | --- |
| partial | every eager tick | { text, utterance_id, t_ms } (may also carry stable_text / active_text). The live hypothesis; replace it per utterance_id |
| partial_stable | when a prefix locks in | { text, utterance_id, t_ms }. The portion the recognizer no longer expects to revise |
| speech_final | on endpoint / after commit | { text, utterance_id, t_ms, audio_ms }. End of an utterance, stable |
| final | follows speech_final | { text, utterance_id, t_ms, audio_ms }. Corrected full-context transcript |
| error | on fault | { code, message } |
utterance_id groups the partials and finals of one phrase; t_ms is the audio position of the hypothesis; audio_ms is the utterance’s active-speech length.
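Because every result frame carries utterance_id, a renderer can keep one line per utterance and let later frames overwrite earlier ones for the same phrase. The following is a minimal sketch of that idea, not the reference client: `TranscriptState` is a hypothetical class, it deliberately ignores partial_stable (assuming partial already carries the full live hypothesis), and it drops any partial that arrives after an utterance has settled, which is an assumption about desired behavior rather than documented engine behavior.

```javascript
// Hypothetical transcript model keyed by utterance_id.
class TranscriptState {
  constructor() {
    this.order = [];        // utterance ids in first-seen order
    this.lines = new Map(); // id -> { text, settled }
  }

  // Feed every parsed server frame through here.
  apply(msg) {
    const isInterim = msg.type === "partial";
    const isStable = msg.type === "speech_final" || msg.type === "final";
    if (!isInterim && !isStable) return; // partial_stable, error, etc.

    const id = msg.utterance_id;
    if (!this.lines.has(id)) {
      this.order.push(id);
      this.lines.set(id, { text: "", settled: false });
    }
    const line = this.lines.get(id);
    if (isInterim && line.settled) return; // late partial: keep stable text

    line.text = msg.text; // replace, never append
    if (isStable) line.settled = true; // final overwrites speech_final in place
  }

  // Split into committed (solid) and pending (greyed) text for display.
  render() {
    const committed = [];
    const pending = [];
    for (const id of this.order) {
      const line = this.lines.get(id);
      (line.settled ? committed : pending).push(line.text);
    }
    return { committed: committed.join(" "), pending: pending.join(" ") };
  }
}
```

With this shape, the corrected final for an utterance replaces its speech_final text instead of being appended as a second copy of the same phrase.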
Cue (grounding). Send {"type":"config","grounding":true} as the first JSON frame to ground transcripts against your knowledge base. The speech_final and final frames then carry a grounding array ([{ content, score }, ...], top-3 passages; [] when nothing is bound). A grounded session is billed as Cue.
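A minimal sketch of the grounding handshake described above, assuming a socket-like object with a `send` method. `enableGrounding` and `groundingPassages` are hypothetical helper names, not part of a PyAI SDK; only the config frame and the grounding array shape come from the protocol text.

```javascript
// Hypothetical helper: send the config frame. Call it on open, before any
// other JSON frame, so it is the first JSON frame of the session.
function enableGrounding(ws) {
  ws.send(JSON.stringify({ type: "config", grounding: true }));
}

// Hypothetical helper: return the grounding passages attached to a
// speech_final or final frame, or [] for any other frame or when absent.
function groundingPassages(msg) {
  if (msg.type !== "speech_final" && msg.type !== "final") return [];
  return Array.isArray(msg.grounding) ? msg.grounding : [];
}
```

Returning an empty array for frames without grounding lets display code iterate unconditionally, which matches the documented [] when nothing is bound.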
Close codes. 1000 normal · 1008 auth/policy (bad key or missing hear:stream scope) · 1011 engine error · 4429 over the concurrency cap. A 1011 that arrives immediately after a commit-flushed final is a known, benign close (fix in progress) — treat it as success if you already received the final.
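One way to act on these close codes is to map each to a coarse outcome in a single function. This is a sketch under stated assumptions: `classifyClose` and its outcome labels are hypothetical, and treating unlisted codes as retryable is a design choice of this example, not documented behavior.

```javascript
// Hypothetical helper: turn a WebSocket close code into a coarse outcome.
// `gotFinalAfterCommit` is true if a final frame arrived after you sent
// {"type":"commit"} on this socket.
function classifyClose(code, gotFinalAfterCommit) {
  switch (code) {
    case 1000:
      return "ok";       // normal closure
    case 1008:
      return "auth";     // bad key or missing hear:stream scope
    case 1011:
      // Known benign close right after a commit-flushed final.
      return gotFinalAfterCommit ? "ok" : "retry";
    case 4429:
      return "backoff";  // over the concurrency cap
    default:
      return "retry";    // assumption: unlisted codes are treated as transient
  }
}
```

A caller would typically invoke it from the socket's close handler, for example `ws.onclose = (e) => handle(classifyClose(e.code, sawFinal))`, where `handle` and `sawFinal` are your own.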

Wire it up

Open the socket, stream capture frames as binary, and route text frames through one handler. Authenticate the upgrade with the key as a subprotocol — the browser-safe pattern that’s stable across PyAI’s realtime surfaces (server-side clients may use ?api_key= instead).
Connect + route frames
const HEAR_STREAM_URL =
  "wss://api.pyai.com/v1/audio/transcriptions/stream" +
  "?model=pyai-hear&language=en&sample_rate=16000&encoding=pcm16&interim_results=true";

function connectHearStream(apiKey, capture) {
  const ws = new WebSocket(HEAR_STREAM_URL, [`pyai-key.${apiKey}`]);
  ws.binaryType = "arraybuffer";

  // mic/call PCM16 frames → server (binary)
  capture.port.onmessage = (e) => {
    if (ws.readyState === WebSocket.OPEN) ws.send(e.data);
  };

  // text frames are JSON results — all framing lives in handleFrame()
  ws.onmessage = (e) => {
    if (typeof e.data === "string") handleFrame(JSON.parse(e.data));
  };
  return ws;
}

// Force-finalize the current utterance (e.g. on your own end-of-turn signal).
// The flush is {"type":"commit"} — {"type":"end"} is ignored by the engine.
function commit(ws) {
  if (ws.readyState === WebSocket.OPEN) ws.send(JSON.stringify({ type: "commit" }));
}

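// The one place that knows the frame schema. The engine emits bare types:
// partial / partial_stable (interim) and speech_final / final (stable).
// A final always follows its speech_final with the corrected text, so only
// final is committed here. Committing both would add each utterance twice.
function handleFrame(msg) {
  switch (msg.type) {
    case "partial":
    case "partial_stable":
      onPartial(msg.text);
      break;
    case "speech_final":
      // Stable end-of-utterance text: show it on the pending line until the
      // corrected final arrives and replaces it.
      onPartial(msg.text);
      break;
    case "final":
      onFinal(msg.text);
      break;
    case "error":
      console.error("hear-stream error:", msg.code, msg.message);
      break;
  }
}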
The endpoint path, the frame schema, and the {"type":"commit"} control message are part of the published contract — see the API reference (GET /v1/audio/transcriptions/stream) or the Wire protocol summary above. Keep all framing inside handleFrame so adding a field later is a one-place change.

Latency expectations

  • First partial: sub-second. Once audio is flowing you should see an initial partial within a few hundred milliseconds — that immediacy is the entire point of streaming.
  • Finals lag partials slightly. A phrase finalizes once the recognizer is confident (typically at a pause or end of utterance). This is normal — show partials so the UI never feels stalled while waiting for a final.
  • Keep frames small and steady. 20 ms frames sent as they’re produced minimize end-to-end latency; don’t batch several seconds of audio into one message.
  • Don’t add your own buffering on top. Send frames straight from the capture worklet; extra queues only add delay.
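To check the first-partial expectation in your own client, you can time the gap between the first audio frame sent and the first partial received. The sketch below assumes you call it from your send path and your frame handler; `createLatencyProbe` is a hypothetical helper, and the injectable clock exists only to make it easy to test.

```javascript
// Hypothetical helper: measures time from first audio sent to first partial.
// `now` returns milliseconds; defaults to Date.now.
function createLatencyProbe(now = () => Date.now()) {
  let firstAudioAt = null;
  let firstPartialAt = null;
  return {
    // Call each time a binary audio frame is sent; only the first is recorded.
    audioSent() {
      if (firstAudioAt === null) firstAudioAt = now();
    },
    // Call with every parsed server frame; only the first partial is recorded.
    frameReceived(msg) {
      if (msg.type === "partial" && firstAudioAt !== null && firstPartialAt === null) {
        firstPartialAt = now();
      }
    },
    // Milliseconds to first partial, or null if none has arrived yet.
    firstPartialMs() {
      return firstPartialAt === null ? null : firstPartialAt - firstAudioAt;
    },
  };
}
```

If the measured value is consistently well above a few hundred milliseconds, frame batching or extra buffering before the send is the first thing to check.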

Troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Transcription is gibberish | Sample-rate mismatch | Capture, resample, and send all at one rate (16 kHz); decimate 3:1 from a 48 kHz context |
| No partials appear | Audio not flowing, or frames too large | Confirm binary PCM16 frames are being sent (~20 ms each) from the capture worklet |
| Text “jumps around” before settling | Treating partials as final | Render partials as a single replaceable greyed line; only append on a final |
| Connection closes immediately | Bad credential / wrong subprotocol | Check the close code in Errors & limits; auth via pyai-key.<key> subprotocol |
| 429 concurrency_limit_exceeded | Too many live sessions | Close idle sockets or raise the limit on your plan |
| High latency / choppy partials | Frames batched or buffered before send | Send each ~20 ms frame immediately; remove extra jitter buffers |
| Transcript never finalizes | Waiting on a flush, or sent the wrong control frame | Send {"type":"commit"} to force a final (note: {"type":"end"} is ignored), or close the socket |

Next steps

Telephony audio (8 kHz μ-law)

Stream phone-call audio into Hear: native μ-law and the exact resample ratios.

Conversation intelligence

When the audio is finished, batch-transcribe at the discounted rate with diarization.

Browser voice agent

The same PCM16 capture pipeline, driving a full-duplex Omni agent.

Errors & limits

Close codes, rate limits, and concurrency.