By the end of this guide you’ll have a pipeline that takes a recorded call, transcribes it with PyAI Hear batch jobs (diarized, with SRT/VTT subtitles), and turns the result into the metrics every revenue team wants: who talked how much, which keywords came up, and an LLM-written summary with action items. It’s the “mine your calls” recipe — and every step here runs on the shipped async jobs API, so it’s concrete and runnable today.

How it fits together

Batch transcription is asynchronous: you submit a job, it returns 202 immediately, and the result arrives either via a signed webhook or by polling. Because batch is latency-tolerant, Hear batch is billed at a discounted rate versus realtime — the right tool for processing call archives at scale. (See the pricing page for current rates.)
This guide is fully grounded in the shipped jobs API (/v1/transcription/jobs). The only deliberately provider-agnostic piece is the summarization step — it calls your LLM (any OpenAI-compatible chat endpoint), clearly marked, because PyAI doesn’t prescribe one.

Prerequisites

1. A key with the jobs scopes

Create a key in the console with the hear:transcribe and transcribe:jobs scopes. A pyai_test_ sandbox key works immediately for development: it has hard daily caps and incurs no billing.
export PYAI_API_KEY=pyai_test_...

2. Call recordings you can reach

Either an https URL we can fetch (privacy-cleanest — the input is never stored), or a local file you’ll upload as multipart. Stereo telephony recordings (one party per channel) give the most accurate speaker split.

3. An LLM endpoint (for the summary step)

Any OpenAI-compatible chat completions endpoint and its key. This is the one “bring your own” piece of the pipeline.

Choose your diarization mode

Speaker separation is what turns a transcript into conversation intelligence. Pick the mode that matches how the call was recorded:
Recording | Set | What you get
Stereo (each party on its own channel; common in telephony) | channel: true | Exact, model-free separation: each word is tagged with the channel that spoke it. Most accurate; prefer it whenever you have it.
Mono (everyone mixed into one track) | diarize: true | Model-based diarization (speaker_0, speaker_1, …); words are aligned to detected speaker turns.
Don’t set both. Use channel: true for stereo and diarize: true for mono. If your telephony provider can record dual-channel, turn that on — it’s strictly better than guessing speakers from a mixed track.
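If your recordings are local WAV files, the mode can be chosen from the file header instead of by hand. The following is a minimal sketch under that assumption; `diarization_options` is a hypothetical helper name, not part of the PyAI SDK, and it only reads the channel count, so it cannot tell a true dual-channel recording from a mono mix saved as stereo.

```python
import wave

def diarization_options(wav_path):
    """Return the single diarization flag to send, based on the WAV channel count."""
    with wave.open(wav_path, "rb") as wav:
        channels = wav.getnchannels()
    # Two or more channels: use exact per-channel separation.
    # One channel: fall back to model-based diarization. Never both.
    return {"channel": True} if channels >= 2 else {"diarize": True}
```

The returned dict holds exactly one of the two flags, so it can be passed as keyword arguments alongside the other submit options without risking both being set.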

Build it

1. Submit the transcription job

Send the recording, choose your diarization mode, ask for srt/vtt alongside json if you want subtitles, and register a webhook_url for the completion callback. Pass an Idempotency-Key so a retried submit can’t create a duplicate job.
import os, uuid
from pyai import PyAI

pyai = PyAI(api_key=os.environ["PYAI_API_KEY"])

job = pyai.transcription_jobs.create(
    audio_url="https://recordings.example.com/calls/abc123.wav",
    channel=True,            # stereo telephony → exact per-channel speakers
    numerals=True,           # "fifty bucks" → "$50"-friendly digits
    output_formats=["json", "srt", "vtt"],
    webhook_url="https://app.example.com/webhooks/transcription",
    idempotency_key=str(uuid.uuid4()),
)
print(job.job_id, job.status)  # job_aZ09...  queued
For a mono recording, swap channel: true for diarize: true. To upload a local file instead of a URL, post multipart/form-data with an audio part and the same fields as form fields.

2. Receive the result — webhook (recommended) or polling

When the job finishes, PyAI POSTs a signed callback to your webhook_url. Verify the X-PyAI-Signature header before trusting the body, then fetch the full job.
Node — webhook handler
import crypto from "node:crypto";

app.post("/webhooks/transcription", async (req, reply) => {
  const raw = req.rawBody; // the exact bytes we signed
  const expected = crypto
    .createHmac("sha256", process.env.PYAI_WEBHOOK_SECRET!)
    .update(raw)
    .digest("hex");
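  const got = req.headers["x-pyai-signature"];
  // timingSafeEqual throws when the buffers differ in length, so compare lengths first.
  const a = Buffer.from(typeof got === "string" ? got : "");
  const b = Buffer.from(expected);
  if (a.length !== b.length || !crypto.timingSafeEqual(a, b)) {
    return reply.code(401).send();
  }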

  const { job_id } = JSON.parse(raw);
  const job = await pyai.transcriptionJobs.get(job_id); // fetch authoritative result
  await analyzeCall(job);
  reply.code(200).send();
});
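For a Python back end, the same check might look like the sketch below. It mirrors the Node handler's assumption that the header carries a hex-encoded HMAC-SHA256 of the raw body; `signature_is_valid` is a hypothetical helper name, and how you obtain the raw bytes depends on your web framework.

```python
import hashlib
import hmac

def signature_is_valid(raw_body: bytes, header_value, secret: str) -> bool:
    """Check a hex HMAC-SHA256 signature computed over the raw request body."""
    if not header_value:
        return False
    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time and returns False on unequal lengths.
    return hmac.compare_digest(expected.encode(), header_value.strip().encode())
```

Verify against the bytes exactly as received; re-serializing parsed JSON can change whitespace or key order and make a valid signature fail.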
No public URL? Poll instead — equally valid for batch jobs and back-end pipelines:
import time

def wait_for(job_id, timeout=900):
    deadline = time.time() + timeout
    while time.time() < deadline:
        job = pyai.transcription_jobs.get(job_id)
        if job.status in ("completed", "failed", "cancelled"):
            return job
        time.sleep(3)
    raise TimeoutError(job_id)
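When many jobs are polled at once, a fixed 3-second interval can generate a lot of requests. One common alternative is exponential backoff with jitter, sketched here. The function name, delays, and the injectable `fetch_status`, `sleep`, and `clock` parameters are illustrative choices, not SDK features; `fetch_status` would typically wrap the job lookup shown above and return its status.

```python
import random
import time

TERMINAL = {"completed", "failed", "cancelled"}

def wait_with_backoff(fetch_status, timeout=900, first_delay=2.0, max_delay=30.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until it returns a terminal status or time runs out."""
    deadline = clock() + timeout
    delay = first_delay
    while True:
        status = fetch_status()
        if status in TERMINAL:
            return status
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError("job did not finish in time")
        # Jitter spreads out pollers that started together; never sleep past the deadline.
        sleep(min(remaining, delay * random.uniform(0.5, 1.0)))
        delay = min(max_delay, delay * 2)
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real waiting.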
The pyai CLI does the submit-and-poll loop in one line — handy for backfilling an archive:
pyai transcribe --url https://recordings.example.com/calls/abc123.wav \
  --channel --numerals --formats json,srt,vtt --poll

3. Read the diarized result

A completed job carries a result with the full text, a speakers count, audio_seconds, diarized segments (each with start, end, text, and a speaker and/or channel), per-word timings, and a formats map of signed URLs for the SRT/VTT you requested. Large results are offloaded to a signed result_url instead of being inlined — handle both.
Python — normalize inline vs offloaded
import requests

def load_result(job):
    # Large results are offloaded to a signed URL; small ones arrive inline.
    if getattr(job, "result_url", None):
        resp = requests.get(job.result_url, timeout=30)
        resp.raise_for_status()
        return resp.json()
    return job.result

result = load_result(job)
print(result["speakers"], "speakers,", result["audio_seconds"], "s of audio")

# SRT/VTT are signed URLs in result["formats"]
srt = requests.get(result["formats"]["srt"], timeout=30)
srt.raise_for_status()
with open("call.srt", "wb") as f:
    f.write(srt.content)
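Segments can be short, so a readable transcript usually needs consecutive segments from the same speaker merged into turns. This is a rough sketch that assumes segments arrive sorted by start time and carry the start, end, text, and speaker or channel fields described above; `merge_turns` and the one-second `max_gap` are illustrative choices.

```python
def merge_turns(segments, max_gap=1.0):
    """Collapse back-to-back segments from the same speaker into single turns."""
    turns = []
    for seg in segments:
        who = seg.get("channel", seg.get("speaker"))
        last = turns[-1] if turns else None
        if last and last["who"] == who and seg["start"] - last["end"] <= max_gap:
            # Same speaker, short pause: extend the current turn.
            last["end"] = seg["end"]
            last["text"] += " " + seg["text"]
        else:
            turns.append({"who": who, "start": seg["start"],
                          "end": seg["end"], "text": seg["text"]})
    return turns
```

A longer pause starts a new turn even for the same speaker, which keeps silences visible in the output.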

4. Compute talk-ratio and track keywords

The diarized segments are all you need. Sum each speaker’s segment durations for talk-ratio, and scan segment text for the phrases you care about (competitors, pricing, objections) for keyword tracking. Use channel as the speaker key when you transcribed stereo, speaker otherwise.
Python — metrics from segments
from collections import defaultdict

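def speaker_key(seg):
    # stereo → channel (0/1); mono diarization → "speaker_0" etc.
    # A segment may carry both keys or a null channel, so test the value, not the key.
    if seg.get("channel") is not None:
        return f"ch{seg['channel']}"
    return seg.get("speaker", "unknown")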

def talk_ratio(segments):
    secs = defaultdict(float)
    for s in segments:
        secs[speaker_key(s)] += s["end"] - s["start"]
    total = sum(secs.values()) or 1.0
    return {k: round(v / total, 3) for k, v in secs.items()}

def track_keywords(segments, terms):
    hits = {term: [] for term in terms}  # every tracked term appears, even with no hits
    for s in segments:
        low = s["text"].lower()
        for term in terms:
            if term.lower() in low:
                hits[term].append({"at": s["start"], "speaker": speaker_key(s)})
    return hits

segments = result["segments"]
ratios = talk_ratio(segments)          # {"ch0": 0.62, "ch1": 0.38}
keywords = track_keywords(segments, ["pricing", "competitor", "contract", "discount"])
A rep talking 80% of a discovery call is a coaching signal; a spike in “pricing” near the end is a buying signal. These two functions are the core of a Gong-style scorecard.
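Plain substring matching counts "contractor" as a hit for "contract". If that matters for your scorecard, a whole-word variant is one option. The sketch below uses regular-expression word boundaries; `track_keywords_exact` is a hypothetical name, and the approach assumes terms that begin and end with letters or digits.

```python
import re

def compile_terms(terms):
    """Map each term to a case-insensitive, whole-word regex."""
    return {t: re.compile(r"\b" + re.escape(t) + r"\b", re.IGNORECASE) for t in terms}

def track_keywords_exact(segments, terms):
    """Record the start time of every segment that mentions a term as a whole word."""
    patterns = compile_terms(terms)
    hits = {t: [] for t in terms}
    for seg in segments:
        for term, pattern in patterns.items():
            if pattern.search(seg["text"]):
                hits[term].append({"at": seg["start"]})
    return hits
```

The trade-off is that inflected forms such as "contracts" no longer match, so list the variants you care about explicitly.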

5. Summarize with your LLM of choice

Hand the transcript to any OpenAI-compatible chat endpoint for a summary, next steps, and sentiment. PyAI doesn’t run this step — point it at whichever model you’ve standardized on.
Python — provider-agnostic summary (YOUR LLM)
from openai import OpenAI  # works against any OpenAI-compatible endpoint

# ↓↓↓ Your LLM of choice — not a PyAI endpoint ↓↓↓
llm = OpenAI(base_url=os.environ["LLM_BASE_URL"], api_key=os.environ["LLM_API_KEY"])

def labeled_transcript(segments):
    return "\n".join(f"[{speaker_key(s)}] {s['text']}" for s in segments)

def summarize(segments, ratios):
    resp = llm.chat.completions.create(
        model="your-model",
        messages=[
            {"role": "system", "content":
             "You are a sales-call analyst. Return a 3-sentence summary, "
             "a bulleted list of action items, and the customer's sentiment."},
            {"role": "user", "content":
             f"Talk-ratio: {ratios}\n\nTranscript:\n{labeled_transcript(segments)}"},
        ],
    )
    return resp.choices[0].message.content

insights = summarize(segments, ratios)
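A long call may not fit in the context window of whichever model you choose. One way to cope is to split the labeled transcript into chunks, summarize each, and then summarize the summaries. The helper below is a sketch of the splitting step only; `chunk_lines` and the 12,000-character budget are arbitrary assumptions to tune for your model.

```python
def chunk_lines(lines, max_chars=12000):
    """Group transcript lines into chunks of at most max_chars characters.

    A single line longer than max_chars is kept whole in its own chunk.
    """
    chunks, current, size = [], [], 0
    for line in lines:
        cost = len(line) + 1  # +1 for the newline that joins lines
        if current and size + cost > max_chars:
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(line)
        size += cost
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Splitting on whole lines keeps each speaker label attached to its text, so every chunk remains readable on its own.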

Run it

Wire the steps into one analyzeCall(job) (or run the CLI for a one-off), point it at a real recording, and you’ll get back a structured record per call:
{
  "job_id": "job_aZ09...",
  "duration_s": 742.5,
  "talk_ratio": { "ch0": 0.62, "ch1": 0.38 },
  "keywords": { "pricing": [{ "at": 611.2, "speaker": "ch1" }], "discount": [] },
  "summary": "Customer is evaluating two vendors and is price-sensitive...",
  "subtitles": { "srt": "call.srt", "vtt": "call.vtt" }
}
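A record of that shape could be assembled with a small helper like the one below. It is a sketch, not SDK code: `build_call_record` is a hypothetical name, and mapping `duration_s` to the result's `audio_seconds` is an assumption about where that figure comes from.

```python
import json

def build_call_record(job_id, result, ratios, keywords, summary, subtitles=None):
    """Assemble one JSON-serializable record per call from the earlier steps."""
    return {
        "job_id": job_id,
        "duration_s": result["audio_seconds"],  # assumed source of the duration
        "talk_ratio": dict(ratios),
        "keywords": {term: list(hits) for term, hits in keywords.items()},
        "summary": summary,
        "subtitles": dict(subtitles or {}),
    }
```

Copying into plain dicts and lists means `json.dumps(record)` works even when the inputs are `defaultdict` instances.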
Backfill an archive by listing past jobs — the list is cursor-paginated, newest first:
Python — page through jobs
cursor = None
while True:
    page = pyai.transcription_jobs.list(limit=100, cursor=cursor)
    for job in page.data:
        ...  # reconcile / re-index
    if not page.has_more:
        break
    cursor = page.next_cursor
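If several scripts need to walk the archive, the loop can be wrapped in a generator so callers simply iterate. This sketch assumes only the page attributes used above (`data`, `has_more`, `next_cursor`); `iter_jobs` is a hypothetical helper, and `list_page` is whatever callable returns one page, such as the SDK's list method.

```python
def iter_jobs(list_page, limit=100):
    """Yield every job by following cursors until has_more is false."""
    cursor = None
    while True:
        page = list_page(limit=limit, cursor=cursor)
        yield from page.data
        if not page.has_more:
            return
        cursor = page.next_cursor
```

Because it is lazy, a caller can stop early (for example, once it reaches a job it has already indexed) without fetching the remaining pages.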

Cost & scale notes

  • Batch is cheaper than realtime. Call processing routed through /v1/transcription/jobs (rather than realtime transcription) is billed at the discounted batch rate; see the pricing page for current figures. The result.audio_seconds field is the exact billed quantity, so record it in your own ledger and reconcile it against the x-pyai-units header.
  • Prefer audio_url over upload when you can — the input is fetched and never stored, which is the cleanest posture for customer-call data.
  • Idempotency keys are per logical job. Reuse the same key only when retrying the exact same submit after a network blip; a new recording gets a new key.
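A random UUID generated at submit time only protects retries that reuse the same variable; if a crashed worker restarts and resubmits, it would mint a new key. One way to get "one key per logical job" is to derive the key from the job's inputs, as in this sketch. `idempotency_key` is a hypothetical helper, and whether the API accepts a 64-character hex key is an assumption to confirm against the errors and limits reference.

```python
import hashlib
import json

def idempotency_key(audio_url, **options):
    """Derive a stable key from the recording URL and the submit options."""
    # sort_keys makes the key independent of keyword-argument order.
    payload = json.dumps({"audio_url": audio_url, **options}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because the key changes whenever the options change, this approach should also avoid reusing a key with a different body. The flip side is that deliberately re-transcribing the same recording with identical options would need an extra salt, such as a run identifier.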

Troubleshooting

Symptom | Likely cause | Fix
Speakers are merged / mislabeled | Mono recording sent without diarization | Set diarize: true for mono, or record dual-channel and use channel: true
channel: true returns one speaker | Recording is actually mono | Confirm the file is true stereo; otherwise use diarize: true
result is empty but result_url is set | Large result was offloaded | Fetch and parse result_url; don’t assume result is inline
403 forbidden on submit | Key missing transcribe:jobs (or hear:transcribe) | Add both scopes in the console
409 idempotency_conflict | Same Idempotency-Key reused with a different body | Use a fresh key per distinct job
402 credit_exhausted | Org out of prepaid credit | Add credit, or build with a pyai_test_ key
Webhook ignored / spoofable | Signature not verified | Recompute the HMAC over the raw body and compare X-PyAI-Signature
Job stuck queued | Source URL unreachable / not https | Ensure audio_url is a reachable https URL; check the job’s error once it fails

Next steps

Streaming speech-to-text

Live transcription when you need partials in real time, not after the call.

Voice cloning

Give your agents and voicemail a branded, custom voice.

Pricing & metering

How batch usage is measured and the discounted batch tier.

Errors & limits

Idempotency, pagination, and the full code catalog.