Stream transcription (WebSocket)

curl --request GET \
  --url https://api.pyai.com/v1/audio/transcriptions/stream \
  --header 'Authorization: Bearer <token>'

{
  "error": {
    "message": "<string>",
    "type": "<string>",
    "param": "<string>"
  }
}

Upgrade to a WebSocket for low-latency streaming speech-to-text with eager partials (the first partial typically arrives within ~300 ms). This endpoint is transcription-only and is distinct from /v1/realtime (Omni duplex). It requires the hear:stream scope. With knowledge-base grounding enabled, this surface becomes Cue (turn detection plus retrieved context for your own LLM/voice pipeline) and is billed at the Cue rate.

Auth: browsers can’t set Authorization on a WebSocket, so send the key as a subprotocol: Sec-WebSocket-Protocol: pyai-key.<API_KEY> (server clients may use ?api_key= instead). The key is validated and swapped for the internal upstream credential on the upgrade.
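The two authentication options above can be sketched as follows. This is a minimal illustration, not an official client: the wss:// form of the documented URL and the helper names are assumptions, and only the `pyai-key.<API_KEY>` subprotocol and the `?api_key=` query parameter come from the text.

```python
from urllib.parse import urlencode

# Assumption: the WebSocket endpoint is the documented https:// URL with a wss:// scheme.
STREAM_URL = "wss://api.pyai.com/v1/audio/transcriptions/stream"


def key_subprotocol(api_key: str) -> str:
    """Value to offer in Sec-WebSocket-Protocol (the browser-friendly option)."""
    return f"pyai-key.{api_key}"


def url_with_key(api_key: str) -> str:
    """URL carrying the key as ?api_key= (the server-client option)."""
    return f"{STREAM_URL}?{urlencode({'api_key': api_key})}"
```

In a browser the subprotocol string would be passed as the second argument of the `WebSocket` constructor. Server clients may prefer the subprotocol form too, since a key in the query string can end up in logs.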

Client -> server: stream binary audio frames continuously (PCM16 at the negotiated sample_rate, or opus). Send a JSON {"type":"commit"} text frame to force-finalize the current utterance (e.g. when your VAD detects end-of-turn); the literal text frame EOF is also accepted as an equivalent flush. Closing the socket also flushes a final for any buffered audio.
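As a rough sketch of the client side, the helpers below split raw PCM16 audio into binary frames and build the commit text frame. The 20 ms frame length and mono layout are assumptions chosen for illustration; the text above does not specify a frame size or channel count.

```python
import json
from typing import Iterator


def pcm16_frames(audio: bytes, sample_rate: int = 16000, frame_ms: int = 20) -> Iterator[bytes]:
    """Yield fixed-size chunks of PCM16 audio (assumed mono, 2 bytes per sample)."""
    bytes_per_frame = sample_rate * 2 * frame_ms // 1000
    if bytes_per_frame <= 0:
        raise ValueError("frame_ms is too small for this sample_rate")
    for start in range(0, len(audio), bytes_per_frame):
        # The last chunk may be shorter than bytes_per_frame.
        yield audio[start:start + bytes_per_frame]


def commit_message() -> str:
    """Text frame that force-finalizes the current utterance."""
    return json.dumps({"type": "commit"}, separators=(",", ":"))
```

Each yielded chunk would be sent as a binary WebSocket frame, and `commit_message()` as a text frame once your VAD detects end-of-turn.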

Server -> client (JSON text frames):

| `type` | When | Payload |
| --- | --- | --- |
| `partial` | every eager tick | `{text, stable_text, active_text, utterance_id, t_ms}`: the live hypothesis for that `utterance_id` |
| `speech_final` | on endpoint/commit | `{text, utterance_id, t_ms, audio_ms}`: stable, end of an utterance |
| `final` | follows `speech_final` | `{text, utterance_id, t_ms, audio_ms}`: corrected full-context transcript |
| `error` | on fault | `{code, message}` |

t_ms is the audio-timeline position of the hypothesis; audio_ms is the utterance’s active-speech length (the billed signal); utterance_id groups partials/finals for one utterance. With grounding enabled (Cue), speech_final/final frames also carry a grounding array: [{content, score}, ...] (top-3 KB passages, [] when no KB is bound or retrieval times out).
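One way to consume these frames is to keep per-utterance state keyed by `utterance_id`. The sketch below assumes each JSON frame carries `type` at the top level next to the payload fields; the `TranscriptState` class and its attribute names are hypothetical.

```python
import json
from dataclasses import dataclass, field


@dataclass
class TranscriptState:
    live: dict = field(default_factory=dict)      # utterance_id -> latest partial text
    finals: dict = field(default_factory=dict)    # utterance_id -> latest finalized text
    audio_ms: dict = field(default_factory=dict)  # utterance_id -> active-speech length

    def handle(self, raw: str) -> None:
        """Apply one server text frame to the state."""
        msg = json.loads(raw)
        kind = msg.get("type")
        if kind == "error":
            raise RuntimeError(f"{msg.get('code')}: {msg.get('message')}")
        uid = msg.get("utterance_id")
        if kind == "partial":
            self.live[uid] = msg["text"]
        elif kind in ("speech_final", "final"):
            # `final` replaces the `speech_final` text with the corrected transcript.
            self.live.pop(uid, None)
            self.finals[uid] = msg["text"]
            self.audio_ms[uid] = msg.get("audio_ms", 0)
```

Storing `audio_ms` per utterance rather than summing it on every frame avoids counting the same utterance twice when both `speech_final` and `final` arrive.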

Close codes: 1000 normal · 1008 auth/policy (bad key, scope, revoked token) · 1011 engine error · 4429 over concurrency cap.
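A client might map these close codes to a reconnect decision. The meanings below come from the list above; the retry policy (retry on 1011 and 4429 only) is an assumption for illustration, not documented behavior.

```python
# Close-code meanings as listed in the documentation.
CLOSE_CODES = {
    1000: "normal",
    1008: "auth/policy",
    1011: "engine error",
    4429: "over concurrency cap",
}

# Assumption: only transient conditions are worth retrying, ideally with backoff.
RETRYABLE = {1011, 4429}


def classify_close(code: int) -> tuple[str, bool]:
    """Return (meaning, should_retry) for a WebSocket close code."""
    return CLOSE_CODES.get(code, "unknown"), code in RETRYABLE
```

A 1008 close is treated as non-retryable here because reconnecting with the same bad key, missing scope, or revoked token would presumably fail again.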

Billing: metered active audio at the Hear rate ($0.003/min); speech time is derived from the transcript timing, not connection wall-clock. With grounding enabled (Cue), the session bills a single `cue.minutes` line at $0.015/min instead.
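For a rough cost estimate, the `audio_ms` values reported per utterance can be summed and multiplied by the applicable rate. This is only a sketch: the rates are the ones quoted above, while the function name is hypothetical and any rounding or minimum-charge rules are not described here and are not modeled.

```python
HEAR_RATE_PER_MIN = 0.003  # USD per minute of active audio
CUE_RATE_PER_MIN = 0.015   # USD per minute with grounding enabled (Cue)


def estimate_cost(audio_ms_per_utterance, grounding: bool = False) -> float:
    """Estimate session cost in USD from per-utterance active-speech lengths."""
    minutes = sum(audio_ms_per_utterance) / 60_000
    rate = CUE_RATE_PER_MIN if grounding else HEAR_RATE_PER_MIN
    return minutes * rate
```

For example, two utterances of 30 s and 90 s total 2 minutes of active audio, which comes to about $0.006 at the Hear rate or $0.03 at the Cue rate.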

Authorizations

Authorization (string, header, required)
Use Authorization: Bearer pyai_live_... (or pyai_test_...).

Query Parameters

model (string, default: pyai-hear)
Streaming STT model.

language (string)
ISO-639-1 hint, e.g. 'en'.

sample_rate (integer, default: 16000)
Input PCM sample rate in Hz.

encoding (enum<string>, default: pcm16)
Audio frame encoding. Available options: pcm16, opus.

interim_results (boolean, default: true)
Emit eager partial hypotheses.

seed (integer)
Optional determinism seed for reproducible eval runs. Forwarded to the engine and honored once the engine supports it (PLATFORM_ASK_EVALS_ENGINE); no effect when omitted.

temperature (number)
Optional sampling temperature for reproducible eval runs. Forwarded to the engine and honored once the engine supports it (PLATFORM_ASK_EVALS_ENGINE); no effect when omitted.
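The query parameters above can be assembled into a connection URL along these lines. The wss:// scheme, the helper name, and the lowercase `true`/`false` serialization of booleans are assumptions; the parameter names and defaults are taken from the list above.

```python
from urllib.parse import urlencode

# Assumption: the WebSocket endpoint is the documented https:// URL with a wss:// scheme.
STREAM_URL = "wss://api.pyai.com/v1/audio/transcriptions/stream"


def stream_url(**params) -> str:
    """Build the stream URL, skipping parameters left as None."""
    query = {}
    for name, value in params.items():
        if value is None:
            continue  # omitted parameters fall back to server defaults
        # Assumption: booleans are sent as lowercase "true"/"false".
        query[name] = str(value).lower() if isinstance(value, bool) else value
    return f"{STREAM_URL}?{urlencode(query)}" if query else STREAM_URL
```

For instance, `stream_url(language="en", encoding="opus", interim_results=False)` produces a URL ending in `?language=en&encoding=opus&interim_results=false`.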

Response

Switching Protocols — the streaming transcription WebSocket is open.
