How it fits together
Two streams share one socket: binary messages carry raw audio both ways, and text messages carry JSON session events (and, depending on your agent, the live transcript). The whole loop runs client-side.
This guide is correct on transport, codecs, resampling, and the event behaviors
you build against. The exact JSON payloads for each event are defined by the
Omni wire protocol, which is the source of truth, so we
isolate event handling behind a single function and you can fill in field names
once you’ve checked the reference.
Prerequisites
An Omni agent and a key
Create an agent in the console and copy its
agent_id. Grab a pyai_test_ sandbox key to build against; it works
instantly, with hard daily caps and no billing.
Pick one sample rate and stick to it
The single biggest source of “it sounds like chipmunks/robots” bugs is a sample rate mismatch. Omni speaks PCM16 little-endian; for browser/WebRTC use 24 kHz, declared on the URL as ?format=pcm16&rate=24000.
The clean trick: ask the browser for a 24 kHz AudioContext so the mic capture,
the worklets, and the wire all agree — no resampling needed.
If the browser does not honor that rate (some force 48 kHz), resample 2:1 between the context and the wire.
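A minimal sketch of that request, using only the standard Web Audio constructor option. createAudioContext and WIRE_RATE are hypothetical names, and the constructor is passed in as an argument only so the helper can be exercised outside a browser.

```javascript
// Hypothetical helper: ask for a 24 kHz context and report whether the
// browser actually granted it.
const WIRE_RATE = 24000;

function createAudioContext(AudioContextCtor, targetRate = WIRE_RATE) {
  let ctx;
  try {
    // sampleRate is a standard AudioContextOptions field.
    ctx = new AudioContextCtor({ sampleRate: targetRate });
  } catch (err) {
    // The constructor may reject an unsupported rate; use the device default.
    ctx = new AudioContextCtor();
  }
  // ctx.sampleRate is the rate the context really runs at.
  return { ctx, needsResample: ctx.sampleRate !== targetRate };
}

// In a page:
//   const { ctx, needsResample } = createAudioContext(window.AudioContext);
```

If needsResample comes back true, the worklets run at the context rate and you convert to 24 kHz before sending and after receiving.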
Build it
Capture worklet: mic Float32 → PCM16
An AudioWorkletProcessor runs on the audio thread and hands you 128-sample
blocks. Buffer them into ~20 ms frames (480 samples @ 24 kHz), convert
Float32 [-1, 1] to little-endian Int16, and post the bytes to the main
thread.
capture-processor.js
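A sketch of what this file could look like, not the reference implementation. The names floatToPcm16, Framer, and the "capture-processor" registration name are assumptions, and it assumes a mono input on a 24 kHz context (24000 × 0.020 s = 480 samples, 960 bytes per frame). The typeof guard exists only so the helpers can also be loaded outside a worklet.

```javascript
const FRAME_SAMPLES = 480; // 20 ms at 24 kHz

// Convert Float32 samples in [-1, 1] to little-endian PCM16 bytes.
function floatToPcm16(samples) {
  const view = new DataView(new ArrayBuffer(samples.length * 2));
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp out-of-range input
    view.setInt16(i * 2, Math.round(s < 0 ? s * 0x8000 : s * 0x7fff), true);
  }
  return view.buffer;
}

// Collect 128-sample render blocks into fixed-size frames.
class Framer {
  constructor(frameSamples, onFrame) {
    this.buf = new Float32Array(frameSamples);
    this.fill = 0;
    this.onFrame = onFrame;
  }
  push(block) {
    let offset = 0;
    while (offset < block.length) {
      const n = Math.min(block.length - offset, this.buf.length - this.fill);
      this.buf.set(block.subarray(offset, offset + n), this.fill);
      this.fill += n;
      offset += n;
      if (this.fill === this.buf.length) {
        this.onFrame(floatToPcm16(this.buf)); // fresh ArrayBuffer per frame
        this.fill = 0;
      }
    }
  }
}

// registerProcessor only exists inside an AudioWorkletGlobalScope.
if (typeof registerProcessor === "function") {
  class CaptureProcessor extends AudioWorkletProcessor {
    constructor() {
      super();
      // Transfer each frame to the main thread instead of copying it.
      this.framer = new Framer(FRAME_SAMPLES, (bytes) =>
        this.port.postMessage(bytes, [bytes]));
    }
    process(inputs) {
      const channel = inputs[0][0]; // first channel of the first input
      if (channel) this.framer.push(channel);
      return true; // keep the processor alive
    }
  }
  registerProcessor("capture-processor", CaptureProcessor);
}
```

Because 480 is not a multiple of 128, the framer carries the leftover samples of each block into the next frame rather than padding or dropping them.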
Playback worklet: a ring buffer of agent audio
Agent audio arrives in bursts; speakers consume it at a steady 24 kHz. Bridge
the two with a ring buffer. Crucially, expose a clear message: that’s how
barge-in stays snappy (next step).
playback-processor.js
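One way this file could be written, as a sketch under stated assumptions: PcmRing, the "playback-processor" registration name, the 10-second capacity, and the drop-on-overflow policy are all illustrative choices, and the node is assumed to have a single mono output. The message contract is an ArrayBuffer of PCM16 for audio and the string "clear" for barge-in.

```javascript
// Ring buffer of Float32 samples fed with little-endian PCM16 bytes.
class PcmRing {
  constructor(capacity) {
    this.data = new Float32Array(capacity);
    this.readPos = 0;
    this.size = 0; // samples currently queued
  }
  write(pcm16Bytes) {
    const view = new DataView(pcm16Bytes);
    // Assumed policy: when the ring is full, drop the samples that don't fit.
    const count = Math.min(view.byteLength >> 1, this.data.length - this.size);
    let w = (this.readPos + this.size) % this.data.length;
    for (let i = 0; i < count; i++) {
      this.data[w] = view.getInt16(i * 2, true) / 0x8000;
      w = (w + 1) % this.data.length;
    }
    this.size += count;
  }
  read(out) {
    const n = Math.min(out.length, this.size);
    for (let i = 0; i < n; i++) {
      out[i] = this.data[this.readPos];
      this.readPos = (this.readPos + 1) % this.data.length;
    }
    out.fill(0, n); // underrun: pad the rest of the block with silence
    this.size -= n;
  }
  clear() {
    this.size = 0; // forget everything queued; next read outputs silence
  }
}

// registerProcessor only exists inside an AudioWorkletGlobalScope.
if (typeof registerProcessor === "function") {
  class PlaybackProcessor extends AudioWorkletProcessor {
    constructor() {
      super();
      this.ring = new PcmRing(24000 * 10); // assumed: 10 s of headroom
      this.port.onmessage = (e) => {
        if (e.data === "clear") this.ring.clear();
        else this.ring.write(e.data);
      };
    }
    process(_inputs, outputs) {
      const out = outputs[0][0]; // mono output channel
      if (out) this.ring.read(out);
      return true;
    }
  }
  registerProcessor("playback-processor", PlaybackProcessor);
}
```

clear only resets a counter, so it takes effect on the very next 128-sample render block.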
Connect to Omni and wire the audio loop
Open the WebSocket with the key as a subprotocol (browser-safe, since browsers
can’t set headers on the upgrade), forward capture frames as binary, and feed
binary replies into the playback ring. The official
@pyai/sdk ships
realtimeURL / realtimeSubprotocol helpers so you don’t hand-build the URL.
Handle events: transcript + barge-in
Text frames are JSON session events. You build against the known event
names below; render whatever transcript fields your agent emits. Keep all of
this in one function so the payload details live in exactly one place.
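A sketch of the socket wiring from the previous step together with that single event function. Assumptions to check against the wire protocol: the event name is read from a type field (hypothetical), and flush is the only event name handled explicitly because it is the one this guide describes. wireAudioLoop and handleEvent are illustrative names, and renderTranscript is your own UI callback. The socket is passed in already constructed, so the SDK helper signatures are not guessed at here.

```javascript
// All payload knowledge lives in this one function.
function handleEvent(event, playbackPort, renderTranscript) {
  switch (event.type) { // ASSUMPTION: the event name is in a `type` field
    case "flush":
      // Barge-in: drop queued agent audio immediately.
      playbackPort.postMessage("clear");
      break;
    default:
      // Everything else goes to one place until field names are confirmed.
      renderTranscript(event);
  }
}

function wireAudioLoop(ws, capturePort, playbackPort, renderTranscript) {
  ws.binaryType = "arraybuffer"; // receive agent audio as ArrayBuffer, not Blob

  // Mic frames from the capture worklet go out as binary messages.
  capturePort.onmessage = (e) => {
    if (ws.readyState === 1 /* WebSocket.OPEN */) ws.send(e.data);
  };

  ws.onmessage = (e) => {
    if (typeof e.data === "string") {
      // Text message: a JSON session event.
      handleEvent(JSON.parse(e.data), playbackPort, renderTranscript);
    } else {
      // Binary message: agent audio. Transfer it to the audio thread.
      playbackPort.postMessage(e.data, [e.data]);
    }
  };
}

// In a page, with url and subprotocol built however you prefer:
//   const ws = new WebSocket(url, [subprotocol]);
//   wireAudioLoop(ws, captureNode.port, playbackNode.port, renderTranscript);
```

Frames captured before the socket opens are dropped rather than queued, which keeps stale audio from being sent in a burst on connect.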
The exact fields on each event (and the transcript payload shape) are defined
in the Omni wire protocol. Map them inside
renderTranscript / the cases above once you’ve confirmed names there; the
transport and event names used here are stable.
Run it
Drop capture-processor.js, playback-processor.js, and an index.html (a
Start button calling start()) in one folder and serve it over http://localhost with any static file server.
Once it loads, interrupt the agent mid-reply to exercise barge-in (the flush → clear path).
Barge-in, latency & quality
- Barge-in is the difference between a demo and a product. The server detects your speech and sends flush; your only job is to stop playing queued agent audio right then, which is the single playback.port.postMessage("clear") call. Don’t wait for the socket to drain.
- Keep frames small (~20 ms). Smaller frames lower latency; much smaller and you pay per-message overhead. 480 samples @ 24 kHz is a good default.
- Let the browser do AEC. echoCancellation: true stops the agent’s own voice from being captured and looping back as user speech.
- Don’t add your own jitter buffer on top of the worklet ring; the ring is already the buffer. Extra queuing only adds latency.
- Resume the AudioContext from a user gesture. Browsers start it suspended; call ctx.resume() inside the click handler if playback is silent.
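The last two points can be sketched as the first lines of a click handler. startAudio is a hypothetical name; ctx is your AudioContext and mediaDevices is navigator.mediaDevices, passed in so the function has no hidden globals. The channelCount constraint is an assumption that you want a mono mic track.

```javascript
// Hypothetical body for the Start button's click handler.
async function startAudio(ctx, mediaDevices) {
  // A context created before any user gesture starts out suspended.
  if (ctx.state === "suspended") await ctx.resume();

  // Let the browser's echo canceller keep the agent's voice out of the mic.
  return mediaDevices.getUserMedia({
    audio: { echoCancellation: true, channelCount: 1 },
  });
}

// In a page:
//   button.onclick = async () => {
//     const stream = await startAudio(ctx, navigator.mediaDevices);
//     // connect ctx.createMediaStreamSource(stream) to the capture worklet
//   };
```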
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| Agent sounds high/low-pitched or sped up | Sample-rate mismatch | Ensure the AudioContext, worklets, and ?rate= all agree on 24 kHz; if the context forced 48 kHz, decimate/upsample 2:1 |
| Connection closes immediately | Bad credential, missing scope, or malformed agent_id | Check the close code against Errors & limits; 4401 = bad key, 4403 = missing omni:session scope, 400 invalid_agent_id = malformed agent id |
| 403 origin_not_allowed on close | Publishable token origin not allow-listed | Add your origin in the console (production tokens only) |
| No mic prompt / getUserMedia throws | Insecure context | Serve over http://localhost or HTTPS, not file:// or a LAN IP |
| Agent hears itself / echoes | AEC off, or playback feeding back | Set echoCancellation: true; never connect capture to destination |
| Choppy or robotic playback | Ring buffer starved or overrun | Confirm 20 ms frames; check the socket isn’t backpressured |
| 429 concurrency_limit_exceeded | Too many live sessions | Close old sockets; raise the limit on your plan |
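For the sample-rate row, a deliberately crude sketch of the 2:1 conversion when the context is stuck at 48 kHz and the wire is 24 kHz. downsample2 and upsample2 are hypothetical names. Pair averaging and linear interpolation are stand-ins for a proper filter, and neither function carries state across block boundaries, so treat this as a starting point rather than a production resampler.

```javascript
// 48 kHz -> 24 kHz: average each pair of samples (a rough low-pass).
// An odd trailing sample is dropped; 128-sample render blocks are even.
function downsample2(input) {
  const out = new Float32Array(input.length >> 1);
  for (let i = 0; i < out.length; i++) {
    out[i] = (input[2 * i] + input[2 * i + 1]) / 2;
  }
  return out;
}

// 24 kHz -> 48 kHz: keep each sample and insert the midpoint to the next.
// The last sample has no neighbor in this block, so it is repeated.
function upsample2(input) {
  const out = new Float32Array(input.length * 2);
  for (let i = 0; i < input.length; i++) {
    const next = i + 1 < input.length ? input[i + 1] : input[i];
    out[2 * i] = input[i];
    out[2 * i + 1] = (input[i] + next) / 2;
  }
  return out;
}
```

Apply downsample2 in the capture path before the PCM16 conversion, and upsample2 in the playback path after decoding.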
Next steps
Omni wire protocol
Exact event payloads, close codes, and golden frames.
Phone agent with Twilio
Bridge the same Omni session to a real phone number.
FreeSWITCH integration
Fork SIP/PSTN audio into Omni.
Errors & limits
Close codes, rate limits, and concurrency.