Build a browser voice agent

By the end of this guide you’ll have a single HTML page that opens your mic, streams audio to an Omni agent in real time, plays the agent’s reply through your speakers, and renders the live transcript as you both talk. No framework, no build step — just the Web Audio API and a WebSocket.

How it fits together

Two streams share one socket: binary messages carry raw audio both ways, and text messages carry JSON session events (and, depending on your agent, the live transcript). The whole loop runs client-side.

This guide covers transport, codecs, resampling, and the event behaviors you build against. The exact JSON payloads for each event are defined by the Omni wire protocol, which is the source of truth, so event handling is isolated behind a single function where you can fill in field names once you’ve checked the reference.

Prerequisites

An Omni agent and a key

Create an agent in the console and copy its agent_id. Grab a pyai_test_ sandbox key to build against — it works instantly with hard daily caps and no billing.

A local static server

getUserMedia requires a secure context, which includes http://localhost. Any static server works: npx serve, python3 -m http.server, etc.

A pyai_test_ key is fine while you build locally. Never ship a secret pyai_live_ key in client-side code — in production, mint a short-lived publishable token (origin allow-listed in the console) and hand that to the browser instead. Everything below is identical; only the credential changes.

Pick one sample rate and stick to it

The single biggest source of “it sounds like chipmunks/robots” bugs is a sample rate mismatch. Omni speaks PCM16 little-endian; for browser/WebRTC use 24 kHz, declared on the URL as ?format=pcm16&rate=24000. The clean trick: ask the browser for a 24 kHz AudioContext so the mic capture, the worklets, and the wire all agree — no resampling needed.

const ctx = new AudioContext({ sampleRate: 24000 });

Most browsers honor this. If yours pins the context to its hardware rate (commonly 48 kHz), ctx.sampleRate will report that rate instead of 24000. In that case, run at 48 kHz and convert at the edges: decimate 2:1 on capture (keep every other sample) and upsample 2:1 on playback (duplicate each sample). The ratio is an exact integer because 48000 / 24000 = 2, so no fractional interpolation is needed. This guide uses the 24 kHz context path; the fallback is a single if on ctx.sampleRate.
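As a rough sketch of the 48 kHz fallback described above, the two helpers below convert by an exact factor of 2. The names decimate2 and upsample2 are illustrative and not part of any Omni SDK. This naive version skips the low-pass filter a production resampler would apply before decimating, so some aliasing on capture is possible.

```javascript
// Hypothetical helpers for the 48 kHz fallback; names are illustrative.

// Capture side: 48 kHz Float32 -> 24 kHz Float32 by keeping every other sample.
function decimate2(input) {
  const out = new Float32Array(Math.floor(input.length / 2));
  for (let i = 0; i < out.length; i++) out[i] = input[i * 2];
  return out;
}

// Playback side: 24 kHz Int16 -> 48 kHz Int16 by writing each sample twice.
function upsample2(input) {
  const out = new Int16Array(input.length * 2);
  for (let i = 0; i < input.length; i++) {
    out[2 * i] = input[i];
    out[2 * i + 1] = input[i];
  }
  return out;
}
```

In the worklets, you would call decimate2 on each input block before the PCM16 conversion and upsample2 on each incoming chunk before writing it to the ring, guarded by a check such as sampleRate === 48000.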

Build it

Capture worklet: mic Float32 → PCM16

An AudioWorkletProcessor runs on the audio thread and hands you 128-sample blocks. Buffer them into ~20 ms frames (480 samples @ 24 kHz), convert Float32 [-1, 1] to little-endian Int16, and post the bytes to the main thread.

capture-processor.js

// 20 ms @ 24 kHz = 480 samples per frame.
const FRAME = 480;

class CaptureProcessor extends AudioWorkletProcessor {
  constructor() {
    super();
    this._buf = new Int16Array(FRAME);
    this._n = 0;
  }

  process(inputs) {
    const ch = inputs[0]?.[0];
    if (!ch) return true;
    for (let i = 0; i < ch.length; i++) {
      // clamp then scale: negatives use 0x8000, positives 0x7FFF
      const s = Math.max(-1, Math.min(1, ch[i]));
      this._buf[this._n++] = s < 0 ? s * 0x8000 : s * 0x7fff;
      if (this._n === FRAME) {
        // copy the frame so _buf can be reused, then transfer the copy's
        // ArrayBuffer to the main thread (no second copy)
        const out = this._buf.slice();
        this.port.postMessage(out.buffer, [out.buffer]);
        this._n = 0;
      }
    }
    return true;
  }
}

registerProcessor("capture-processor", CaptureProcessor);
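To sanity-check the conversion outside the audio thread, the same clamp-and-scale rule can be pulled into a plain function. This is a minimal sketch; floatToPcm16 is an illustrative name, not something the worklet above exports.

```javascript
// Same clamp-and-scale rule as the capture worklet, as a plain function.
function floatToPcm16(samples) {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    // Assigning to an Int16Array truncates the fractional part.
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}

const pcm = floatToPcm16(new Float32Array([0, 1, -1, 0.5, 2, -2]));
console.log(Array.from(pcm)); // 0, 32767, -32768, 16383, 32767, -32768
```

Out-of-range inputs (2 and -2) clamp to full scale instead of wrapping. Typed arrays use the platform's byte order, which is little-endian on the hardware browsers commonly run on; DataView.setInt16(offset, value, true) makes the byte order explicit if you need the guarantee.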

Playback worklet: a ring buffer of agent audio

Agent audio arrives in bursts; speakers consume it at a steady 24 kHz. Bridge the two with a ring buffer. Crucially, expose a clear message: that is how barge-in stays snappy (wired up in the event-handling step).

playback-processor.js

const CAP = 24000 * 10; // up to 10 s of buffered audio

class PlaybackProcessor extends AudioWorkletProcessor {
  constructor() {
    super();
    this._ring = new Float32Array(CAP);
    this._r = 0;
    this._w = 0;
    this.port.onmessage = (e) => {
      if (e.data === "clear") {
        this._r = this._w = 0; // drop everything queued (barge-in)
        return;
      }
      const pcm = new Int16Array(e.data);
      for (let i = 0; i < pcm.length; i++) {
        this._ring[this._w] = pcm[i] / 0x8000; // Int16 → Float32
        this._w = (this._w + 1) % CAP;
      }
    };
  }

  process(_inputs, outputs) {
    const out = outputs[0][0];
    for (let i = 0; i < out.length; i++) {
      out[i] = this._r === this._w ? 0 : this._ring[this._r];
      if (this._r !== this._w) this._r = (this._r + 1) % CAP;
    }
    return true;
  }
}

registerProcessor("playback-processor", PlaybackProcessor);
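The ring above assumes the writer never laps the reader, which holds as long as less than 10 s of audio is queued. As one possible variation, the sketch below tracks a sample count so an overrun drops the oldest audio instead of silently resetting the queue. PcmRing is a hypothetical name, and the drop-oldest policy is an assumption, not documented Omni behavior.

```javascript
// Hypothetical standalone ring buffer; PcmRing is an illustrative name.
class PcmRing {
  constructor(capacity) {
    this.buf = new Float32Array(capacity);
    this.r = 0;     // next index to read
    this.w = 0;     // next index to write
    this.count = 0; // samples currently queued
  }

  // Queue Int16 samples as Float32; when full, overwrite the oldest audio.
  push(pcm) {
    for (let i = 0; i < pcm.length; i++) {
      this.buf[this.w] = pcm[i] / 0x8000;
      this.w = (this.w + 1) % this.buf.length;
      if (this.count === this.buf.length) {
        this.r = (this.r + 1) % this.buf.length; // drop the oldest sample
      } else {
        this.count++;
      }
    }
  }

  // Fill `out` with queued samples, padding with silence on underrun.
  pull(out) {
    for (let i = 0; i < out.length; i++) {
      if (this.count === 0) {
        out[i] = 0;
        continue;
      }
      out[i] = this.buf[this.r];
      this.r = (this.r + 1) % this.buf.length;
      this.count--;
    }
  }

  // Barge-in: discard everything queued.
  clear() {
    this.r = this.w = this.count = 0;
  }
}
```

Inside the playback worklet, push would replace the loop in port.onmessage and pull would replace the loop in process. Because the class has no Web Audio dependencies, it can be unit-tested in Node.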

Connect to Omni and wire the audio loop

Open the WebSocket with the key as a subprotocol (browser-safe — browsers can’t set headers on the upgrade), forward capture frames as binary, and feed binary replies into the playback ring. The official @pyai/sdk ships realtimeURL / realtimeSubprotocol helpers so you don’t hand-build the URL.

const API_KEY = "pyai_test_..."; // publishable/short-lived token in prod
const AGENT = "agent_123";

const url =
  `wss://api.pyai.com/v1/omni?agent_id=${AGENT}` +
  `&format=pcm16&rate=24000`;

async function start() {
  const ctx = new AudioContext({ sampleRate: 24000 });
  await ctx.audioWorklet.addModule("capture-processor.js");
  await ctx.audioWorklet.addModule("playback-processor.js");

  const stream = await navigator.mediaDevices.getUserMedia({
    audio: { channelCount: 1, echoCancellation: true, noiseSuppression: true },
  });

  const src = ctx.createMediaStreamSource(stream);
  const capture = new AudioWorkletNode(ctx, "capture-processor");
  const playback = new AudioWorkletNode(ctx, "playback-processor");
  src.connect(capture);          // mic → capture (no audible monitor)
  playback.connect(ctx.destination); // playback → speakers

  const ws = new WebSocket(url, [`pyai-key.${API_KEY}`]);
  ws.binaryType = "arraybuffer";

  // mic frames → server
  capture.port.onmessage = (e) => {
    if (ws.readyState === WebSocket.OPEN) ws.send(e.data);
  };

  ws.onmessage = (e) => {
    if (typeof e.data === "string") {
      handleEvent(JSON.parse(e.data), playback); // see next step
    } else {
      playback.port.postMessage(e.data); // agent audio → speakers
    }
  };

  ws.onclose = (e) => console.log("closed", e.code, e.reason);
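  // Named stopAgent so it doesn't shadow the built-in window.stop().
  window.stopAgent = () => {
    ws.close();
    stream.getTracks().forEach((t) => t.stop());
    ctx.close(); // release the audio hardware
  };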
}
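If you hand-build the URL rather than using the SDK helpers, it is safer to let URLSearchParams handle the encoding. The sketch below is an assumption about how such a helper could look; buildOmniUrl and keySubprotocol are hypothetical names, and the official realtimeURL / realtimeSubprotocol helpers may have different signatures.

```javascript
// Hypothetical helper; the official SDK's realtimeURL may differ.
function buildOmniUrl({ agentId, rate = 24000, base = "wss://api.pyai.com/v1/omni" }) {
  const url = new URL(base);
  url.searchParams.set("agent_id", agentId); // percent-encodes as needed
  url.searchParams.set("format", "pcm16");
  url.searchParams.set("rate", String(rate));
  return url.toString();
}

// Subprotocol string in the same `pyai-key.<key>` form used above.
function keySubprotocol(apiKey) {
  return `pyai-key.${apiKey}`;
}

console.log(buildOmniUrl({ agentId: "agent_123" }));
// wss://api.pyai.com/v1/omni?agent_id=agent_123&format=pcm16&rate=24000
```

With these in place, the connection line becomes new WebSocket(buildOmniUrl({ agentId: AGENT }), [keySubprotocol(API_KEY)]).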

Handle events: transcript + barge-in

Text frames are JSON session events. You build against the known event names below; render whatever transcript fields your agent emits. Keep all of this in one function so the payload details live in exactly one place.

function handleEvent(msg, playback) {
  switch (msg.type) {
    case "hello":            // connection accepted, before session_started
    case "session_started":  // session is live; start talking
      break;

    case "flush":
      // Barge-in: you started speaking, so the agent's queued audio is
      // now stale. Drop it immediately for a snappy turn-handoff.
      playback.port.postMessage("clear");
      break;

    case "dtmf":             // a keypad digit was detected (telephony bridges)
      break;

    case "transfer_to_human": // agent decided to escalate
      break;

    case "session_ending":   // server is closing the session
      break;

    default:
      // Transcript / partials arrive as JSON text frames too — render the
      // text fields your agent emits here (exact shape: protocol reference).
      renderTranscript(msg);
  }
}

The exact fields on each event (and the transcript payload shape) are defined in the Omni wire protocol. Map them inside renderTranscript / the cases above once you’ve confirmed names there — the transport and event names used here are stable.
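One way to keep the field mapping in a single place is a small lookup over candidate field names. The sketch below is illustrative only: the names "text" and "transcript" are placeholders, not confirmed Omni fields, and the lines array stands in for whatever DOM element you render into.

```javascript
// Placeholder field names: replace with the ones in the Omni wire protocol.
const TEXT_FIELDS = ["text", "transcript"];

// Return the first non-empty string field found on the event, or null.
function extractText(msg, fields = TEXT_FIELDS) {
  for (const name of fields) {
    const value = msg?.[name];
    if (typeof value === "string" && value.length > 0) return value;
  }
  return null;
}

// Append transcript text to a list of lines; `lines` stands in for the DOM.
function renderTranscript(msg, lines = []) {
  const text = extractText(msg);
  if (text !== null) lines.push(text);
  return lines;
}
```

Events that carry none of the listed fields are ignored, so unknown event types pass through the default branch without throwing.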

Run it

Drop capture-processor.js, playback-processor.js, and an index.html (a Start button calling start()) in one folder and serve it:

npx serve .   # then open http://localhost:3000

Click Start, allow the mic, and say hello. You should hear the agent reply within a few hundred milliseconds and see the transcript fill in. Talk over it — the agent’s audio should cut out as soon as you speak (that’s the flush → clear path).

Barge-in, latency & quality

Barge-in is the difference between a demo and a product. The server detects your speech and sends flush; your only job is to stop playing queued agent audio right then — that’s the single playback.port.postMessage("clear") call. Don’t wait for the socket to drain.
Keep frames small (~20 ms). Smaller frames lower latency; much smaller and you pay per-message overhead. 480 samples @ 24 kHz is a good default.
Let the browser do AEC. echoCancellation: true stops the agent’s own voice from being captured and looping back as user speech.
Don’t add your own jitter buffer on top of the worklet ring — the ring is already the buffer. Extra queuing only adds latency.
Resume the AudioContext from a user gesture. Browsers start it suspended; call ctx.resume() inside the click handler if playback is silent.
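The frame-size trade-off above comes down to simple arithmetic. As a quick illustration, the helper below (frameInfo is an illustrative name) computes samples, bytes, and messages per second for mono PCM16 at a given frame duration.

```javascript
// Frame arithmetic for mono PCM16 (2 bytes per sample).
function frameInfo(sampleRate, frameMs) {
  const samples = Math.round((sampleRate * frameMs) / 1000);
  return {
    samples,
    bytes: samples * 2,
    framesPerSecond: 1000 / frameMs,
  };
}

console.log(frameInfo(24000, 20)); // samples: 480, bytes: 960, framesPerSecond: 50
console.log(frameInfo(24000, 5));  // samples: 120, bytes: 240, framesPerSecond: 200
```

Dropping from 20 ms to 5 ms frames saves 15 ms of buffering per frame but quadruples the number of WebSocket messages for the same audio.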

Troubleshooting

Symptom	Likely cause	Fix
Agent sounds high/low-pitched or sped up	Sample-rate mismatch	Ensure the `AudioContext`, worklets, and `?rate=` all agree on 24 kHz; if the context forced 48 kHz, decimate/upsample 2:1
Connection closes immediately	Bad credential, missing scope, or malformed `agent_id`	Check the close code against Errors & limits; `4401` = bad key, `4403` = missing `omni:session` scope, `400 invalid_agent_id` = malformed agent id
`403 origin_not_allowed` on close	Publishable token origin not allow-listed	Add your origin in the console (production tokens only)
No mic prompt / `getUserMedia` throws	Insecure context	Serve over `http://localhost` or HTTPS, not `file://` or plain HTTP on a LAN IP
Agent hears itself / echoes	AEC off, or playback feeding back	Set `echoCancellation: true`; never connect `capture` to `destination`
Choppy or robotic playback	Ring buffer starved or overrun	Confirm 20 ms frames; check the socket isn’t backpressured
`429 concurrency_limit_exceeded`	Too many live sessions	Close old sockets; raise the limit on your plan
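To surface these causes during development, the close handler can translate codes into hints. The sketch below maps only the two WebSocket close codes named in the table; describeClose is a hypothetical helper, and any other code should be looked up in Errors & limits.

```javascript
// Close codes taken from the table above; anything else falls through.
const CLOSE_HINTS = {
  4401: "bad key: check the credential passed as the subprotocol",
  4403: "missing omni:session scope on the key",
};

// Return a human-readable hint for a WebSocket close code.
function describeClose(code) {
  return CLOSE_HINTS[code] ?? `unrecognized close code ${code}: see Errors & limits`;
}

console.log(describeClose(4401)); // bad key: check the credential passed as the subprotocol
```

In the page, this would plug into the existing handler as ws.onclose = (e) => console.log(describeClose(e.code), e.reason).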

Next steps

Omni wire protocol

Exact event payloads, close codes, and golden frames.

Phone agent with Twilio

Bridge the same Omni session to a real phone number.

FreeSWITCH integration

Fork SIP/PSTN audio into Omni.

Errors & limits

Close codes, rate limits, and concurrency.
