Language support

This page is the single source of truth for which languages work on which product today. We publish what is GA and benchmarked, what is accepted but not yet measured, and what is on the roadmap, so you can plan an integration without guessing.

Short version: English is GA on every product, with published accuracy benchmarks for Hear and Cue. Other languages, including Indian languages and code-mixed speech, are an active area of work: the language hint is accepted and forwarded to the engine, but recognition accuracy is not yet published, and native non-English voices (TTS) are not in the catalog yet. For non-English voice today, the supported pattern is Cue (turn detection + your knowledge base) plus your own LLM and TTS.

Status at a glance

Product	English	Other languages (STT hint)	Native non-English voices
Hear — speech-to-text	GA · benchmarked	Accepted as a hint · accuracy not yet published	n/a
Cue — turn detection + KB grounding	GA · benchmarked	Accepted as a hint · accuracy not yet published	n/a (you bring TTS)
Omni — agentic voice (STT + brain + TTS)	GA · end-to-end	Roadmap (end-to-end non-English)	Roadmap
Speak — text-to-speech	GA	n/a	Not yet — English voices only (incl. Indian-English accents)

Legend. GA · benchmarked = supported and we publish accuracy numbers. Accepted as a hint = the language parameter is forwarded to the engine and may work, but we make no published accuracy guarantee yet. Roadmap = planned, not available today.

Hear & Cue (speech-to-text)

The streaming endpoint (GET /v1/audio/transcriptions/stream) and the batch endpoint (POST /v1/transcription/jobs) accept an optional language parameter:

GET /v1/audio/transcriptions/stream?model=pyai-hear&language=en&sample_rate=16000&encoding=pcm16

language is an ISO 639-1 code, forwarded to the engine as a hint.
It is one hint per session: there is no mid-session auto-detect and no per-turn language switching today.
English is the GA, benchmarked language. Published English accuracy: ~1.6% WER on clean audio and ~4.8% WER on an 8 kHz telephony/accented corpus (see Benchmarks).
Other ISO 639-1 codes are accepted and forwarded, but we do not yet publish accuracy for them. Treat non-English STT as unmeasured until the numbers land on the benchmarks page.
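
As a minimal sketch of setting the hint, the Python helper below builds the streaming URL from the parameters shown above and rejects values that are not two-letter codes before a session starts. The host api.example.com, the https scheme, and the helper name build_stream_url are assumptions for illustration, not part of the PyAI API; only the path and the four query parameters come from this page.

```python
from urllib.parse import urlencode

# Hypothetical host and scheme for illustration; substitute the real API base URL.
BASE_URL = "https://api.example.com"


def build_stream_url(language: str = "en", sample_rate: int = 16000,
                     encoding: str = "pcm16", model: str = "pyai-hear") -> str:
    """Build the streaming transcription URL with a single language hint."""
    code = language.strip().lower()
    # ISO 639-1 codes are exactly two ASCII letters (for example "en", "hi", "ta").
    if len(code) != 2 or not code.isascii() or not code.isalpha():
        raise ValueError(f"expected a two-letter ISO 639-1 code, got {language!r}")
    query = urlencode({
        "model": model,
        "language": code,
        "sample_rate": sample_rate,
        "encoding": encoding,
    })
    return f"{BASE_URL}/v1/audio/transcriptions/stream?{query}"
```

Because the hint is fixed for the session, a caller that switches language would need to open a new session with a new URL.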

Indian languages & code-mixed speech

This is a frequent request, so we are explicit about it. Today PyAI does not publish word-error-rate numbers for Hindi, Tamil, Telugu, Bengali, Marathi, Kannada, Malayalam, Gujarati, or Punjabi, nor for code-mixed Hinglish / Tanglish (Roman + native script mixed with English). The language hint will accept these codes, but you should not assume production accuracy until we publish measured results.

If your product depends on Indian-language or code-mixed transcription accuracy, validate on your own audio before committing. Talk to us and we will run a joint accuracy evaluation against your real call samples — that measurement is the right go/no-go signal, not this page. The benchmark harness we use for this lives in the repo at evals/benchmarks/hear-multilingual.benchmark.json.
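
To validate on your own audio, you need reference transcripts and a way to score engine output against them. The sketch below computes word error rate as word-level edit distance divided by the number of reference words. It is an illustration only, not the harness in evals/; it assumes whitespace tokenization and applies no text normalization, both of which matter for native-script and code-mixed text.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Averaging this over a representative set of your own calls gives a number you can compare across languages and audio conditions. Note that WER can exceed 1.0 when the hypothesis contains many inserted words.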

Omni (agentic voice)

Omni runs the full loop (speech-to-text, the brain, and text-to-speech) on the PyAI engine. Today that loop is English. Multilingual Omni (non-English end-to-end) is on the roadmap. The configure frame has a language field reserved for this, but it is not yet honored: sending it today is a no-op (see the Omni protocol reference).

For a non-English voice agent right now, use the composable path:

Cue — stream call audio for turn detection (and optional knowledge-base grounding) on GET /v1/audio/transcriptions/stream.
Your LLM — generate the reply from the Cue transcript + grounding.
Your TTS — synthesize the reply in the target language and play it back.

This is the supported pattern for Indian-language voice agents until native multilingual Omni ships.
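
The three steps of the composable path can be sketched as one function per conversational turn. Here generate_reply and synthesize are hypothetical callables standing in for your own LLM and TTS clients; nothing in this block is a PyAI SDK call, and the transcript and grounding are assumed to have already arrived from Cue.

```python
from typing import Callable

# Hypothetical stand-ins for your own integrations.
GenerateReply = Callable[[str, list[str]], str]  # (transcript, grounding) -> reply text
Synthesize = Callable[[str, str], bytes]         # (text, language) -> audio bytes


def handle_turn(transcript: str, grounding: list[str], language: str,
                generate_reply: GenerateReply, synthesize: Synthesize) -> bytes:
    """Run one turn: Cue transcript + grounding -> your LLM -> your TTS."""
    if not transcript.strip():
        return b""  # nothing was said, so skip the LLM and TTS calls
    reply = generate_reply(transcript, grounding)
    return synthesize(reply, language)


# Usage with stub callables in place of real LLM and TTS clients.
audio = handle_turn(
    "mera order kahan hai",
    [],
    "hi",
    generate_reply=lambda text, docs: f"echo: {text}",
    synthesize=lambda text, lang: text.encode("utf-8"),
)
```

Passing the clients in as callables keeps the turn logic independent of any one LLM or TTS vendor, which suits a pattern where both are yours to choose.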

Speak (text-to-speech)

The Speak catalog (GET /v1/voices) contains English voices only today. It includes several Indian-English accent voices (filter with ?region=india), but those speak English (language: "en"), not Hindi, Tamil, or other Indian languages. There is no native Indian-language (or other non-English) TTS in the catalog yet, and POST /v1/audio/speech has no language parameter. For non-English speech output, bring your own TTS in the composable path above. Non-English voices and non-English voice cloning are on the roadmap.
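
One way to encode this rule in an integration is a small routing check, sketched below in Python. The helper names voices_url and needs_external_tts and the base URL argument are assumptions for illustration; only the /v1/voices path, the region filter, and the English-only catalog come from this page.

```python
from typing import Optional
from urllib.parse import urlencode


def voices_url(base_url: str, region: Optional[str] = None) -> str:
    """Build the catalog URL, optionally filtered by region (for example "india")."""
    url = f"{base_url.rstrip('/')}/v1/voices"
    if not region:
        return url
    query = urlencode({"region": region})
    return f"{url}?{query}"


def needs_external_tts(target_language: str) -> bool:
    """True when the reply language is not English, so your own TTS is required."""
    # Treat regional tags such as "en-IN" or "en_IN" as English.
    primary = target_language.strip().lower().replace("_", "-").split("-")[0]
    return primary != "en"
```

With this check, an Indian-English accent voice from the catalog is selected only when the reply text itself is English; any other target language goes to your own TTS.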

How accuracy gets published

We gate engine quality in CI with an offline benchmark harness (evals/) and publish the headline numbers on the Benchmarks page. As languages are measured, their numbers appear there and this matrix is updated. If you need a number that isn’t published yet, ask — an unpublished number means “not measured to our bar,” not “hidden.”

See also

Stream speech-to-text

Feed live call audio into Hear / Cue streaming.

Omni wire protocol

Connect, configure, and the live-vs-roadmap field table.

Telephony audio (8 kHz)

μ-law ↔ PCM16 at 8 kHz for phone legs.

Pricing & metering

Per-second billing and the rate card.
