Build a conversation-intelligence feature

By the end of this guide you’ll have a pipeline that takes a recorded call, transcribes it with PyAI Hear batch jobs (diarized, with SRT/VTT subtitles), and turns the result into the metrics every revenue team wants: who talked how much, which keywords came up, and an LLM-written summary with action items. It’s the “mine your calls” recipe — and every step here runs on the shipped async jobs API, so it’s concrete and runnable today.

How it fits together

Batch transcription is asynchronous: you submit a job, it returns 202 immediately, and the result arrives either via a signed webhook or by polling. Because batch is latency-tolerant, Hear batch is billed at a discounted rate versus realtime — the right tool for processing call archives at scale. (See the pricing page for current rates.)

Every PyAI call in this guide goes to the jobs API (/v1/transcription/jobs). The one deliberately provider-agnostic piece is the summarization step: it calls your own LLM (any OpenAI-compatible chat endpoint) and is clearly marked, because PyAI doesn’t prescribe one.

Prerequisites

A key with the jobs scopes

Create a key in the console with the hear:transcribe and transcribe:jobs scopes. A pyai_test_ sandbox key works instantly for building against — hard daily caps, no billing.

export PYAI_API_KEY=pyai_test_...

Call recordings you can reach

Either an https URL we can fetch (privacy-cleanest — the input is never stored), or a local file you’ll upload as multipart. Stereo telephony recordings (one party per channel) give the most accurate speaker split.

An LLM endpoint (for the summary step)

Any OpenAI-compatible chat completions endpoint and its key. This is the one “bring your own” piece of the pipeline.

Choose your diarization mode

Speaker separation is what turns a transcript into conversation intelligence. Pick the mode that matches how the call was recorded:

Recording	Set	What you get
Stereo (each party on its own channel — common in telephony)	`channel: true`	Exact, model-free separation: each word is tagged with the channel that spoke it. Most accurate — prefer it whenever you have it.
Mono (everyone mixed into one track)	`diarize: true`	Model-based diarization (`speaker_0`, `speaker_1`, …); words are aligned to detected speaker turns.

Don’t set both. Use channel: true for stereo and diarize: true for mono. If your telephony provider can record dual-channel, turn that on — it’s strictly better than guessing speakers from a mixed track.
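If you aren’t sure how an archive was recorded, you can check the file before you submit it. The following is a minimal sketch, assuming local WAV files. The helper name diarization_options is hypothetical (it is not part of the PyAI SDK), and it only reads the channel count from the WAV header with Python’s standard wave module; other containers would need a different probe.

```python
import wave

def diarization_options(path):
    """Return the job flag that matches a WAV file's channel layout.

    Hypothetical helper for illustration, not part of the PyAI SDK.
    """
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
    if channels == 2:
        return {"channel": True}   # stereo: one party per channel
    if channels == 1:
        return {"diarize": True}   # mono: model-based diarization
    raise ValueError(f"unexpected channel count: {channels}")
```

You could then spread the result into the submit call, for example create(audio_url=..., **diarization_options("abc123.wav")). A two-channel header doesn’t prove the parties are actually split across channels, so spot-check a few recordings before trusting it for a whole archive.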

Build it

Submit the transcription job

Send the recording, choose your diarization mode, ask for srt/vtt alongside json if you want subtitles, and register a webhook_url for the completion callback. Pass an Idempotency-Key so a retried submit can’t create a duplicate job.

import os, uuid
from pyai import PyAI

pyai = PyAI(api_key=os.environ["PYAI_API_KEY"])

job = pyai.transcription_jobs.create(
    audio_url="https://recordings.example.com/calls/abc123.wav",
    channel=True,            # stereo telephony → exact per-channel speakers
    numerals=True,           # render spoken numbers as digits ("fifty" → "50")
    output_formats=["json", "srt", "vtt"],
    webhook_url="https://app.example.com/webhooks/transcription",
    idempotency_key=str(uuid.uuid4()),
)
print(job.job_id, job.status)  # job_aZ09...  queued

For a mono recording, swap channel: true for diarize: true. To upload a local file instead of a URL, post multipart/form-data with an audio part and the same fields as form fields.

Receive the result — webhook (recommended) or polling

When the job finishes, PyAI POSTs a signed callback to your webhook_url. Verify the X-PyAI-Signature header before trusting the body, then fetch the full job.

Node — webhook handler

import crypto from "node:crypto";

app.post("/webhooks/transcription", async (req, reply) => {
  const raw = req.rawBody; // the exact bytes we signed
  const expected = crypto
    .createHmac("sha256", process.env.PYAI_WEBHOOK_SECRET!)
    .update(raw)
    .digest("hex");
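  const got = String(req.headers["x-pyai-signature"] ?? "");
  const a = Buffer.from(got);
  const b = Buffer.from(expected);
  // timingSafeEqual throws on unequal lengths, so check length first
  if (a.length !== b.length || !crypto.timingSafeEqual(a, b)) {
    return reply.code(401).send();
  }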

  const { job_id } = JSON.parse(raw);
  const job = await pyai.transcriptionJobs.get(job_id); // fetch authoritative result
  await analyzeCall(job);
  reply.code(200).send();
});
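If your webhook receiver is written in Python, the same check can be sketched with the standard library. This assumes the scheme shown in the Node handler above: a hex-encoded HMAC-SHA256 of the raw request body, keyed with your webhook secret. The function name verify_signature is hypothetical.

```python
import hashlib
import hmac
from typing import Optional

def verify_signature(raw_body: bytes, signature: Optional[str], secret: str) -> bool:
    """True when `signature` equals the hex HMAC-SHA256 of the raw body.

    Assumes the signing scheme from the Node handler; hypothetical helper.
    """
    if not signature:
        return False
    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time and accepts unequal lengths
    return hmac.compare_digest(expected.encode(), signature.encode())
```

Pass the bytes exactly as they arrived on the wire. Re-serializing parsed JSON can change whitespace or key order, and the signature will no longer match.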

No public URL? Poll instead — equally valid for batch jobs and back-end pipelines:

import time

def wait_for(job_id, timeout=900):
    deadline = time.time() + timeout
    while time.time() < deadline:
        job = pyai.transcription_jobs.get(job_id)
        if job.status in ("completed", "failed", "cancelled"):
            return job
        time.sleep(3)
    raise TimeoutError(job_id)
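For long recordings or a large backfill, a fixed 3-second sleep makes a lot of requests. One possible variant is exponential backoff with jitter, sketched below. The name wait_with_backoff and its delay values are illustrative choices, not PyAI recommendations; get_job is any callable that returns an object with a status attribute, such as pyai.transcription_jobs.get.

```python
import random
import time

TERMINAL = {"completed", "failed", "cancelled"}

def wait_with_backoff(get_job, job_id, timeout=900, first_delay=2.0,
                      max_delay=30.0, sleep=time.sleep, clock=time.monotonic):
    """Poll get_job(job_id) until a terminal status, doubling the delay each round."""
    deadline = clock() + timeout
    delay = first_delay
    while True:
        job = get_job(job_id)
        if job.status in TERMINAL:
            return job
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError(job_id)
        # Jitter spreads out many workers that started polling together
        sleep(min(delay * random.uniform(0.5, 1.0), remaining))
        delay = min(delay * 2, max_delay)
```

Usage would look like wait_with_backoff(pyai.transcription_jobs.get, job.job_id). The sleep and clock parameters exist only so the loop can be tested without waiting.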

The pyai CLI does the submit-and-poll loop in one line — handy for backfilling an archive:

pyai transcribe --url https://recordings.example.com/calls/abc123.wav \
  --channel --numerals --formats json,srt,vtt --poll

Read the diarized result

A completed job carries a result with the full text, a speakers count, audio_seconds, diarized segments (each with start, end, text, and a speaker and/or channel), per-word timings, and a formats map of signed URLs for the SRT/VTT you requested. Large results are offloaded to a signed result_url instead of being inlined — handle both.

Python — normalize inline vs offloaded

import requests

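def load_result(job):
    # Large results are offloaded to a signed URL; small ones arrive inline
    if getattr(job, "result_url", None):
        resp = requests.get(job.result_url, timeout=30)
        resp.raise_for_status()
        return resp.json()
    return job.result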

result = load_result(job)
print(result["speakers"], "speakers,", result["audio_seconds"], "s of audio")

# SRT/VTT are signed URLs in result["formats"]
srt = requests.get(result["formats"]["srt"], timeout=30)
srt.raise_for_status()  # don't write an error page to disk
with open("call.srt", "wb") as f:
    f.write(srt.content)
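To eyeball a result before computing metrics, it can help to print the segments as a speaker-labeled timeline. The sketch below relies only on the segment fields described above (start, text, and a speaker and/or channel). The helper names and the sample segments are made up for illustration.

```python
def fmt_ts(seconds):
    """Format seconds as MM:SS, or H:MM:SS past one hour."""
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    return f"{h}:{m:02d}:{s:02d}" if h else f"{m:02d}:{s:02d}"

def timeline(segments):
    """One line per segment: [timestamp] speaker: text."""
    lines = []
    for seg in segments:
        if seg.get("channel") is not None:
            who = f"ch{seg['channel']}"          # stereo job
        else:
            who = seg.get("speaker", "unknown")  # mono diarization
        lines.append(f"[{fmt_ts(seg['start'])}] {who}: {seg['text']}")
    return "\n".join(lines)

# Hypothetical sample data, shaped like the segments described above
sample = [
    {"start": 0.0, "end": 4.2, "channel": 0, "text": "Thanks for joining."},
    {"start": 4.5, "end": 9.1, "channel": 1, "text": "Happy to be here."},
    {"start": 3725.0, "end": 3730.0, "speaker": "speaker_0", "text": "Let's wrap up."},
]
print(timeline(sample))
# [00:00] ch0: Thanks for joining.
# [00:04] ch1: Happy to be here.
# [1:02:05] speaker_0: Let's wrap up.
```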

Compute talk-ratio and track keywords

The diarized segments are all you need. Sum each speaker’s segment durations for talk-ratio, and scan segment text for the phrases you care about (competitors, pricing, objections) for keyword tracking. Use channel as the speaker key when you transcribed stereo, speaker otherwise.

Python — metrics from segments

from collections import defaultdict

def speaker_key(seg):
    # stereo → channel (0/1); mono diarization → "speaker_0" etc.
    # A segment can carry a speaker and/or a channel; prefer the channel when set
    if seg.get("channel") is not None:
        return f"ch{seg['channel']}"
    return seg.get("speaker", "unknown")

def talk_ratio(segments):
    secs = defaultdict(float)
    for s in segments:
        secs[speaker_key(s)] += s["end"] - s["start"]
    total = sum(secs.values()) or 1.0
    return {k: round(v / total, 3) for k, v in secs.items()}

def track_keywords(segments, terms):
    hits = {term: [] for term in terms}  # every term gets a key, even with no hits
    for s in segments:
        low = s["text"].lower()
        for term in terms:
            if term.lower() in low:
                hits[term].append({"at": s["start"], "speaker": speaker_key(s)})
    return hits

segments = result["segments"]
ratios = talk_ratio(segments)          # {"ch0": 0.62, "ch1": 0.38}
keywords = track_keywords(segments, ["pricing", "competitor", "contract", "discount"])

A rep talking 80% of a discovery call is a coaching signal; a spike in “pricing” near the end is a buying signal. These two functions are the core of a Gong-style scorecard.
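Substring matching is simple, but it also counts “contractor” as a hit for “contract”. If that noise matters for your scorecard, one option is whole-word matching with a regular expression. This is a sketch of that variant, with hypothetical function names; the trade-off is that it no longer matches plurals such as “contracts” unless you list them as terms.

```python
import re

def compile_terms(terms):
    """Map each term to a case-insensitive, whole-word pattern."""
    return {t: re.compile(rf"\b{re.escape(t)}\b", re.IGNORECASE) for t in terms}

def count_keywords(segments, terms):
    """Count whole-word mentions of each term across all segments."""
    patterns = compile_terms(terms)
    counts = {t: 0 for t in terms}
    for seg in segments:
        for term, pattern in patterns.items():
            counts[term] += len(pattern.findall(seg["text"]))
    return counts

# Hypothetical sample segments
sample = [
    {"text": "Our contractor reviewed the contract."},
    {"text": "Pricing, pricing, pricing!"},
]
print(count_keywords(sample, ["contract", "pricing"]))
# {'contract': 1, 'pricing': 3}
```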

Summarize with your LLM of choice

Hand the transcript to any OpenAI-compatible chat endpoint for a summary, next steps, and sentiment. PyAI doesn’t run this step — point it at whichever model you’ve standardized on.

Python — provider-agnostic summary (YOUR LLM)

import os

from openai import OpenAI  # works against any OpenAI-compatible endpoint

# ↓↓↓ Your LLM of choice — not a PyAI endpoint ↓↓↓
llm = OpenAI(base_url=os.environ["LLM_BASE_URL"], api_key=os.environ["LLM_API_KEY"])

def labeled_transcript(segments):
    return "\n".join(f"[{speaker_key(s)}] {s['text']}" for s in segments)

def summarize(segments, ratios):
    resp = llm.chat.completions.create(
        model="your-model",
        messages=[
            {"role": "system", "content":
             "You are a sales-call analyst. Return a 3-sentence summary, "
             "a bulleted list of action items, and the customer's sentiment."},
            {"role": "user", "content":
             f"Talk-ratio: {ratios}\n\nTranscript:\n{labeled_transcript(segments)}"},
        ],
    )
    return resp.choices[0].message.content

insights = summarize(segments, ratios)

Run it

Wire the steps into one analyzeCall(job) (or run the CLI for a one-off), point it at a real recording, and you’ll get back a structured record per call:

{
  "job_id": "job_aZ09...",
  "duration_s": 742.5,
  "talk_ratio": { "ch0": 0.62, "ch1": 0.38 },
  "keywords": { "pricing": [{ "at": 611.2, "speaker": "ch1" }], "discount": [] },
  "summary": "Customer is evaluating two vendors and is price-sensitive...",
  "subtitles": { "srt": "call.srt", "vtt": "call.vtt" }
}
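The wiring itself is mostly assembly. Here is one way the record above could be put together from the pieces computed in earlier steps. The function name build_call_record is hypothetical, and mapping duration_s to result["audio_seconds"] is an assumption based on the fields this guide describes.

```python
import json

def build_call_record(job_id, result, ratios, keywords, summary, subtitle_paths):
    """Assemble one JSON-serializable record per call (hypothetical helper)."""
    return {
        "job_id": job_id,
        "duration_s": result["audio_seconds"],  # assumed source of the duration
        "talk_ratio": dict(ratios),
        "keywords": {term: list(hits) for term, hits in keywords.items()},
        "summary": summary,
        "subtitles": dict(subtitle_paths),
    }

def to_json(record):
    """Serialize a record for storage or indexing."""
    return json.dumps(record, indent=2)
```

Copying the keyword hits into plain dicts and lists keeps the record serializable even when the inputs are defaultdict instances.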

Backfill an archive by listing past jobs — the list is cursor-paginated, newest first:

Python — page through jobs

cursor = None
while True:
    page = pyai.transcription_jobs.list(limit=100, cursor=cursor)
    for job in page.data:
        ...  # reconcile / re-index
    if not page.has_more:
        break
    cursor = page.next_cursor

Cost & scale notes

Batch is cheaper than realtime. Routing call processing through /v1/transcription/jobs (rather than realtime transcription) is billed at the discounted batch rate (see the pricing page for current figures). The result.audio_seconds field is the exact billed quantity; reconcile it against the x-pyai-units header in your own ledger.
Prefer audio_url over upload when you can — the input is fetched and never stored, which is the cleanest posture for customer-call data.
Idempotency keys are per logical job. Reuse the same key only when retrying the exact same submit after a network blip; a new recording gets a new key.
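A random UUID only protects a retry if you remember to reuse it. One alternative, sketched here, is to derive the key from the job’s own inputs, so a retried submit produces the same key and a different recording or option set produces a different one. The helper name idempotency_key is hypothetical, and whether this fits depends on your pipeline: deliberately re-transcribing the same recording with the same options would reuse the key too.

```python
import json
import uuid

def idempotency_key(audio_url, **options):
    """Derive a stable, UUID-formatted key from the recording URL and job options.

    Hypothetical helper: same inputs give the same key, any change gives a new one.
    """
    payload = json.dumps({"audio_url": audio_url, **options}, sort_keys=True)
    return str(uuid.uuid5(uuid.NAMESPACE_URL, payload))

key = idempotency_key(
    "https://recordings.example.com/calls/abc123.wav",
    channel=True,
    numerals=True,
)
```

Because the key covers the options as well as the URL, changing the request body also changes the key, which avoids reusing one key with two different bodies.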

Troubleshooting

Symptom	Likely cause	Fix
Speakers are merged / mislabeled	Mono recording sent without diarization	Set `diarize: true` for mono, or record dual-channel and use `channel: true`
`channel: true` returns one speaker	Recording is actually mono	Confirm the file is true stereo; otherwise use `diarize: true`
`result` is empty but `result_url` is set	Large result was offloaded	Fetch and parse `result_url`; don’t assume `result` is inline
`403 forbidden` on submit	Key missing `transcribe:jobs` (or `hear:transcribe`)	Add both scopes in the console
`409 idempotency_conflict`	Same `Idempotency-Key` reused with a different body	Use a fresh key per distinct job
`402 credit_exhausted`	Org out of prepaid credit	Add credit, or build with a `pyai_test_` key
Webhook ignored / spoofable	Signature not verified	Recompute the HMAC over the raw body and compare `X-PyAI-Signature`
Job stuck `queued`	Source URL unreachable / not https	Ensure `audio_url` is a reachable https URL; check the job’s `error` once it fails

Next steps

Streaming speech-to-text

Live transcription when you need partials in real time, not after the call.

Voice cloning

Give your agents and voicemail a branded, custom voice.

Pricing & metering

How batch usage is measured and the discounted batch tier.

Errors & limits

Idempotency, pagination, and the full code catalog.