Stream transcription (WebSocket)
Upgrade to a WebSocket for low-latency streaming speech-to-text with eager partials (first partial typically within ~300 ms). Transcription-only — distinct from /v1/realtime (Omni duplex). Requires the hear:stream scope. With knowledge-base grounding enabled this surface becomes Cue (turn detection + retrieved context for your own LLM/voice pipeline), billed at the Cue rate.
Auth: browsers can’t set Authorization on a WebSocket, so send the key as a subprotocol: Sec-WebSocket-Protocol: pyai-key.<API_KEY> (server clients may use ?api_key= instead). The key is validated and swapped for the internal upstream credential on the upgrade.
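As a minimal sketch of the two auth options above, the helpers below build the subprotocol token and the ?api_key= variant of a URL. The function names and the example host are hypothetical; only the pyai-key.<API_KEY> token shape and the api_key query parameter come from this page.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def key_subprotocol(api_key: str) -> str:
    """Return the Sec-WebSocket-Protocol token that carries the API key."""
    if not api_key:
        raise ValueError("api_key must be non-empty")
    return f"pyai-key.{api_key}"


def url_with_key(ws_url: str, api_key: str) -> str:
    """Append api_key to the URL (server clients), keeping existing query params."""
    parts = urlsplit(ws_url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("api_key", api_key))
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    # "example.invalid" is a placeholder host, not the real endpoint.
    print(key_subprotocol("pyai_test_abc"))
    print(url_with_key("wss://example.invalid/stream?x=1", "pyai_test_abc"))
```

In a browser the token would be passed as the second argument to new WebSocket(url, [token]); with the third-party Python websockets package it would typically go in the subprotocols argument of websockets.connect.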
Client -> server: stream binary audio frames continuously (PCM16 at the negotiated sample_rate, or opus). Send a JSON {"type":"commit"} text frame to force-finalize the current utterance (e.g. when your VAD detects end-of-turn); the literal text frame EOF is also accepted as an equivalent flush. Closing the socket also flushes a final for any buffered audio.
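The client side of this exchange can be sketched as follows: slice PCM16 audio into binary frames and build the commit text frame. The 20 ms frame size, mono audio, and the 16 kHz rate in the example are assumptions for illustration; this page does not specify a frame size or channel count.

```python
import json
from typing import Iterator


def pcm16_frames(pcm: bytes, sample_rate: int, frame_ms: int = 20) -> Iterator[bytes]:
    """Split mono PCM16 audio into fixed-duration binary frames.

    frame_ms=20 is an assumed chunk size, not a documented requirement.
    """
    bytes_per_frame = (sample_rate * frame_ms // 1000) * 2  # 2 bytes per sample
    if bytes_per_frame <= 0:
        raise ValueError("sample_rate and frame_ms must yield a non-empty frame")
    for start in range(0, len(pcm), bytes_per_frame):
        # The last frame may be shorter than bytes_per_frame.
        yield pcm[start:start + bytes_per_frame]


def commit_message() -> str:
    """JSON text frame that force-finalizes the current utterance."""
    return json.dumps({"type": "commit"}, separators=(",", ":"))


if __name__ == "__main__":
    one_second_of_silence = bytes(16000 * 2)  # 16 kHz, 16-bit mono
    frames = list(pcm16_frames(one_second_of_silence, sample_rate=16000))
    print(len(frames), len(frames[0]), commit_message())
```

Each yielded chunk would be sent as a binary WebSocket message, and commit_message() as a text message once the client's own VAD decides the turn has ended.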
Server -> client (JSON text frames):
| type | When | Payload |
|---|---|---|
| partial | every eager tick | {text, stable_text, active_text, utterance_id, t_ms}; the live hypothesis for that utterance_id |
| speech_final | on endpoint/commit | {text, utterance_id, t_ms, audio_ms}; stable, end of an utterance |
| final | follows speech_final | {text, utterance_id, t_ms, audio_ms}; corrected full-context transcript |
| error | on fault | {code, message} |
t_ms is the audio-timeline position of the hypothesis; audio_ms is the utterance’s active-speech length (the billed signal); utterance_id groups partials/finals for one utterance. With grounding enabled (Cue), speech_final/final frames also carry a grounding array: [{content, score}, ...] (top-3 KB passages, [] when no KB is bound or retrieval times out).
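One way a client might consume these frames is to keep a single record per utterance_id and let each later frame overwrite the earlier hypothesis. This is a sketch, not the reference client: the class names are hypothetical, and ignoring partials that arrive after a speech_final is an assumed policy, since the page does not say whether such frames can occur.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Utterance:
    text: str = ""
    status: str = "partial"  # "partial" -> "speech_final" -> "final"
    audio_ms: int = 0
    grounding: List[dict] = field(default_factory=list)


class TranscriptState:
    """Tracks the latest hypothesis for each utterance_id."""

    def __init__(self) -> None:
        self.utterances: Dict[object, Utterance] = {}

    def handle(self, raw: str) -> None:
        msg = json.loads(raw)
        kind = msg.get("type")
        if kind == "error":
            raise RuntimeError(f"{msg.get('code')}: {msg.get('message')}")
        if kind not in ("partial", "speech_final", "final"):
            return  # ignore frame types this sketch does not know about
        utt = self.utterances.setdefault(msg["utterance_id"], Utterance())
        if kind == "partial" and utt.status != "partial":
            return  # assumed policy: drop partials after finalization
        utt.text = msg["text"]
        utt.status = kind
        if kind != "partial":
            utt.audio_ms = msg.get("audio_ms", utt.audio_ms)
            # Present only when grounding (Cue) is enabled.
            utt.grounding = msg.get("grounding", utt.grounding)

    def transcript(self) -> str:
        """Join utterance texts in order of first appearance."""
        return " ".join(u.text for u in self.utterances.values() if u.text)
```

Because final carries the corrected full-context transcript, the sketch lets it replace the speech_final text for the same utterance rather than appending to it.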
Close codes: 1000 normal · 1008 auth/policy (bad key, scope, revoked token) · 1011 engine error · 4429 over concurrency cap.
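A client could map these close codes to a reconnect decision along the following lines. The code meanings come from the list above; the retry choices (retry on 1011 and 4429, never on 1000 or 1008) and the backoff constants are assumptions for illustration, not documented behavior.

```python
from typing import Tuple

# code -> (meaning from this page, assumed "worth retrying" flag)
CLOSE_CODES = {
    1000: ("normal", False),
    1008: ("auth/policy", False),          # bad key, scope, revoked token
    1011: ("engine error", True),
    4429: ("over concurrency cap", True),
}


def classify_close(code: int) -> Tuple[str, bool]:
    """Return (meaning, should_retry) for a WebSocket close code."""
    return CLOSE_CODES.get(code, ("unknown", False))


def backoff_seconds(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Capped exponential backoff delay for the given retry attempt (0-based)."""
    return min(cap, base * 2 ** attempt)


if __name__ == "__main__":
    for code in (1000, 1008, 1011, 4429, 1006):
        print(code, classify_close(code))
```

Treating 1008 as non-retryable reflects that a bad key or missing scope will not fix itself on reconnect, whereas a concurrency cap may clear once another session closes.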
Billing: metered active audio at the Hear rate (0.015/min instead.
Authorizations
Use Authorization: Bearer pyai_live_... (or pyai_test_...).
Query Parameters
Streaming STT model.
ISO-639-1 hint, e.g. 'en'.
Input PCM sample rate in Hz.
Audio frame encoding. Available options: pcm16, opus.
Emit eager partial hypotheses.
Optional determinism seed for reproducible eval runs. Forwarded to the engine and honored once the engine supports it (PLATFORM_ASK_EVALS_ENGINE); no effect when omitted.
Optional sampling temperature for reproducible eval runs. Forwarded to the engine and honored once the engine supports it (PLATFORM_ASK_EVALS_ENGINE); no effect when omitted.
Response
Switching Protocols — the streaming transcription WebSocket is open.