How it fits together
Batch transcription is asynchronous: you submit a job, it returns 202
immediately, and the result arrives either via a signed webhook or by polling.
Because batch is latency-tolerant, Hear batch is billed at a discounted rate
versus realtime — the right tool for processing call archives at scale. (See
the pricing page for current rates.)
This guide is fully grounded in the shipped jobs API
(/v1/transcription/jobs). The only deliberately provider-agnostic piece is the
summarization step: it calls your LLM (any OpenAI-compatible chat endpoint) and is
clearly marked, because PyAI doesn’t prescribe one.
Prerequisites
A key with the jobs scopes
Create a key in the console with the
hear:transcribe and transcribe:jobs scopes. A pyai_test_ sandbox key
works instantly for building against, with hard daily caps and no billing.
Call recordings you can reach
Either an https URL we can fetch (privacy-cleanest — the input is never
stored), or a local file you’ll upload as multipart. Stereo telephony
recordings (one party per channel) give the most accurate speaker split.
Choose your diarization mode
Speaker separation is what turns a transcript into conversation intelligence. Pick the mode that matches how the call was recorded:

| Recording | Set | What you get |
|---|---|---|
| Stereo (each party on its own channel — common in telephony) | channel: true | Exact, model-free separation: each word is tagged with the channel that spoke it. Most accurate — prefer it whenever you have it. |
| Mono (everyone mixed into one track) | diarize: true | Model-based diarization (speaker_0, speaker_1, …); words are aligned to detected speaker turns. |
Build it
Submit the transcription job
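As a rough sketch of this step, a URL-based submit might be assembled like this in Python. The path, `audio_url`, `webhook_url`, `channel`, `diarize`, and the `Idempotency-Key` header come from this guide; the base URL, the Bearer auth scheme, and the `formats` request field name are assumptions made for illustration, so check them against the API reference before relying on them.

```python
import json
import urllib.request
import uuid

# Hypothetical base URL; substitute the host shown in your console.
API_BASE = "https://api.example.invalid"


def build_submit_request(api_key, audio_url, stereo, webhook_url, idempotency_key):
    """Build (but do not send) a POST to /v1/transcription/jobs."""
    body = {
        "audio_url": audio_url,
        "webhook_url": webhook_url,
        # Assumed field name for the requested output formats.
        "formats": ["json", "srt", "vtt"],
    }
    # Stereo recordings use exact per-channel separation; mono uses diarization.
    if stereo:
        body["channel"] = True
    else:
        body["diarize"] = True

    return urllib.request.Request(
        API_BASE + "/v1/transcription/jobs",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            # Bearer auth is an assumption, not confirmed by this guide.
            "Authorization": "Bearer " + api_key,
            "Content-Type": "application/json",
            "Idempotency-Key": idempotency_key,
        },
    )


req = build_submit_request(
    api_key="pyai_test_xxx",  # placeholder sandbox key
    audio_url="https://recordings.example.com/call-001.wav",
    stereo=True,
    webhook_url="https://hooks.example.com/pyai",
    idempotency_key=str(uuid.uuid4()),  # generate once, reuse on retries
)
# To send: urllib.request.urlopen(req); the guide says to expect a 202.
```

Building the request separately from sending it keeps the retry path simple: a retry after a network failure resends the same `req`, and therefore the same key and body.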
Send the recording, choose your diarization mode, ask for srt/vtt
alongside json if you want subtitles, and register a webhook_url for the
completion callback. Pass an Idempotency-Key so a retried submit can’t
create a duplicate job.
For a mono recording, swap channel: true for diarize: true. To upload a
local file instead of a URL, post multipart/form-data with an audio part
and the same fields as form fields.
Receive the result: webhook (recommended) or polling
When the job finishes, PyAI POSTs a signed callback to your
webhook_url. Verify the X-PyAI-Signature header before trusting the body,
then fetch the full job.
No public URL? Poll instead. Polling is equally valid for batch jobs and back-end
pipelines.
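Both delivery paths can be sketched in Python. The signature scheme shown here (HMAC-SHA256, hex-encoded, keyed with a shared webhook secret) is an assumption, as is the `processing` status name; `fetch_job` is a hypothetical callable standing in for whatever get-job-by-id call your client makes.

```python
import hashlib
import hmac
import time


def verify_signature(secret: bytes, raw_body: bytes, signature_header: str) -> bool:
    """Recompute the HMAC over the raw body and compare in constant time.

    Assumes HMAC-SHA256 with a hex digest; confirm the real scheme in the docs.
    """
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    received = signature_header.strip()
    return hmac.compare_digest(expected.encode("utf-8"), received.encode("utf-8"))


def poll_until_done(fetch_job, job_id, interval=5.0, timeout=600.0, sleep=time.sleep):
    """Call fetch_job(job_id) until the job leaves a pending status."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_job(job_id)
        # "queued" appears in this guide; "processing" is an assumed status name.
        if job.get("status") not in ("queued", "processing"):
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still pending after {timeout}s")
        sleep(interval)
```

Verify against the raw request bytes, not a re-serialized JSON object, since any change in whitespace or key order changes the digest.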
Read the diarized result
A completed job carries a result with the full text, a speakers count,
audio_seconds, diarized segments (each with start, end, text, and a
speaker and/or channel), per-word timings, and a formats map of signed
URLs for the SRT/VTT you requested. Large results are offloaded to a
signed result_url instead of being inlined, so handle both.
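One way to handle both cases is a small normalizing helper. This is a sketch that assumes the document behind `result_url` is JSON with the same shape as an inline `result`; `fetch_json` and `load_result` are illustrative names, not part of any SDK.

```python
import json
import urllib.request


def fetch_json(url: str) -> dict:
    """Download and parse a JSON document from a signed URL."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def load_result(job: dict, fetch=fetch_json) -> dict:
    """Return the result dict whether it was inlined or offloaded."""
    result = job.get("result")
    if result:  # inline result present and non-empty
        return result
    result_url = job.get("result_url")
    if result_url:  # large result offloaded to a signed URL
        return fetch(result_url)
    raise ValueError("job has neither an inline result nor a result_url")
```

Passing `fetch` as a parameter lets you swap in your own HTTP client or a stub in tests without touching the branching logic.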
Compute talk-ratio and track keywords
The diarized segments are all you need. Sum each speaker’s segment
durations for talk-ratio, and scan segment text for the phrases you care
about (competitors, pricing, objections) for keyword tracking. Use
channel as the speaker key when you transcribed stereo, speaker
otherwise.
A rep talking 80% of a discovery call is a coaching signal; a spike in
“pricing” near the end is a buying signal. These two metrics are the core
of a Gong-style scorecard.
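A minimal sketch of both metrics, using only the segment fields named above (`start`, `end`, `text`, `speaker`, `channel`). The function names and the sample segments are made up for illustration, and the keyword match is a plain case-insensitive substring test, which you may want to replace with word-boundary matching.

```python
from collections import defaultdict


def speaker_key(segment: dict):
    """Prefer the channel (stereo) and fall back to the diarized speaker (mono)."""
    if segment.get("channel") is not None:
        return segment["channel"]
    return segment.get("speaker")


def talk_ratio(segments: list) -> dict:
    """Fraction of total spoken time attributed to each speaker."""
    seconds = defaultdict(float)
    for seg in segments:
        seconds[speaker_key(seg)] += max(0.0, seg["end"] - seg["start"])
    total = sum(seconds.values())
    return {who: t / total for who, t in seconds.items()} if total else {}


def keyword_hits(segments: list, phrases: list) -> dict:
    """Start times of every segment whose text mentions each phrase."""
    hits = {p: [] for p in phrases}
    for seg in segments:
        text = seg["text"].lower()
        for p in phrases:
            if p.lower() in text:
                hits[p].append(seg["start"])
    return hits


# Illustrative data only, not real API output.
segments = [
    {"start": 0.0, "end": 8.0, "text": "Thanks for joining today.", "channel": 0},
    {"start": 8.0, "end": 10.0, "text": "What does pricing look like?", "channel": 1},
]
print(talk_ratio(segments))  # {0: 0.8, 1: 0.2}
print(keyword_hits(segments, ["pricing", "competitor"]))  # {'pricing': [8.0], 'competitor': []}
```

Returning start times rather than counts keeps the "near the end" question answerable: compare each hit against `audio_seconds` to see where in the call a phrase clusters.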
Run it
Wire the steps into one analyzeCall(job) (or run the CLI for a one-off), point
it at a real recording, and you’ll get back a structured record per call.
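To process an archive rather than a single call, you would typically walk the job list page by page. The sketch below assumes cursor-style pagination; the `jobs` and `next_cursor` keys and the `list_page` callable are hypothetical stand-ins, so adapt them to whatever the pagination reference specifies.

```python
def iter_jobs(list_page):
    """Yield every job, following cursors until the listing is exhausted.

    list_page(cursor) is a caller-supplied function returning one page as a
    dict. The "jobs" and "next_cursor" keys are assumed names, not confirmed.
    """
    cursor = None
    while True:
        page = list_page(cursor)
        yield from page.get("jobs", [])
        cursor = page.get("next_cursor")
        if not cursor:
            break
```

Because it is a generator, each job can be analyzed as it arrives without holding the whole archive in memory.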
Cost & scale notes
- Batch is cheaper than realtime. Routing call processing through /v1/transcription/jobs (rather than realtime transcription) is billed at the discounted batch rate; see the pricing page for current figures. The result.audio_seconds field is the exact billed quantity; reconcile it against the x-pyai-units header on your own ledger.
- Prefer audio_url over upload when you can. The input is fetched and never stored, which is the cleanest posture for customer-call data.
- Idempotency keys are per logical job. Reuse the same key only when retrying the exact same submit after a network blip; a new recording gets a new key.
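One possible way to get "same submit, same key" without storing keys is to derive the key from the request itself. This is a sketch of that idea, not a documented requirement: the helper name is made up, and whether a 64-character hex string is acceptable as an Idempotency-Key value is an assumption to verify.

```python
import hashlib
import json


def idempotency_key(audio_url: str, options: dict) -> str:
    """Derive a stable key from the recording and the job options."""
    canonical = json.dumps(
        {"audio_url": audio_url, "options": options},
        sort_keys=True,  # key order must not change the digest
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


key = idempotency_key("https://recordings.example.com/call-001.wav", {"channel": True})
```

A retry of the identical submit reproduces the same key, while a different recording or different options yields a new one, which also sidesteps reusing a key with a changed body.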
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| Speakers are merged / mislabeled | Mono recording sent without diarization | Set diarize: true for mono, or record dual-channel and use channel: true |
| channel: true returns one speaker | Recording is actually mono | Confirm the file is true stereo; otherwise use diarize: true |
| result is empty but result_url is set | Large result was offloaded | Fetch and parse result_url; don’t assume result is inline |
| 403 forbidden on submit | Key missing transcribe:jobs (or hear:transcribe) | Add both scopes in the console |
| 409 idempotency_conflict | Same Idempotency-Key reused with a different body | Use a fresh key per distinct job |
| 402 credit_exhausted | Org out of prepaid credit | Add credit, or build with a pyai_test_ key |
| Webhook ignored / spoofable | Signature not verified | Recompute the HMAC over the raw body and compare X-PyAI-Signature |
| Job stuck queued | Source URL unreachable / not https | Ensure audio_url is a reachable https URL; check the job’s error once it fails |
Next steps
Streaming speech-to-text
Live transcription when you need partials in real time, not after the call.
Voice cloning
Give your agents and voicemail a branded, custom voice.
Pricing & metering
How batch usage is measured and the discounted batch tier.
Errors & limits
Idempotency, pagination, and the full code catalog.