May 8, 2026

Engineering

Why We Went All-In on Streaming for Voice Agents

Arjun k.


For a long time we treated streaming as a nice-to-have. Once we started measuring what users actually felt during a voice conversation, it stopped being optional.

The latency that matters

Time-to-first-audio is the metric that maps to perceived responsiveness. Total response time barely registers if the first 200 milliseconds of audio arrive on time. We rebuilt the agent pipeline around that single number.
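To make that number concrete, here is a minimal sketch of how time-to-first-audio could be recorded around any stream of audio chunks. The helper name `measureFirstAudio`, its parameters, and the injectable `now` clock are illustrative assumptions, not part of our actual pipeline.

```ts
// Hypothetical helper: passes audio chunks through unchanged and reports,
// exactly once, how long after `turnStartMs` the first chunk arrived.
async function* measureFirstAudio<T>(
  chunks: AsyncIterable<T>,
  turnStartMs: number,
  onFirstAudio: (elapsedMs: number) => void,
  now: () => number = Date.now, // injectable clock, useful in tests
): AsyncGenerator<T> {
  let seenFirst = false;
  for await (const chunk of chunks) {
    if (!seenFirst) {
      seenFirst = true;
      onFirstAudio(now() - turnStartMs);
    }
    yield chunk;
  }
}
```

What counts as the start of a turn is a design choice. The moment the user stops speaking is one reasonable option, since that is when the wait begins from their point of view.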

What changed in our stack

  • Speech-to-text now emits partial transcripts, not just finals.
  • The model layer streams tokens directly into TTS without buffering whole sentences.
  • TTS runs in chunks small enough to start playback while the rest is still synthesizing.

Each layer used to wait for the layer before it to finish. Now they overlap. The whole turn loop boils down to a single async generator:

ts
// `stt`, `llm`, and `tts` are the stage clients; their setup is omitted here.
async function* runTurn(audioIn: AsyncIterable<Buffer>) {
  // 1. STT streams partial transcripts as the user speaks.
  const transcript = stt.partials(audioIn);

  // 2. LLM streams tokens as soon as the transcript stabilizes.
  const tokens = llm.stream({ messages: transcript });

  // 3. TTS synthesizes in small chunks, so playback can start
  //    while the rest of the sentence is still being generated.
  for await (const token of tokens) {
    const audioChunk = await tts.synth(token);
    yield audioChunk;
  }
}
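The loop above hands TTS one token at a time, which keeps the example short. One possible way to form the "small chunks" is to group tokens into short phrases at punctuation boundaries before synthesis. The sketch below assumes tokens arrive as plain strings. The helper name `chunkForTts` and the `maxChars` cutoff are hypothetical.

```ts
// Hypothetical helper: groups streamed tokens into short phrases so TTS
// receives speakable units instead of single tokens.
async function* chunkForTts(
  tokens: AsyncIterable<string>,
  maxChars = 60, // assumed cutoff so a long clause cannot delay playback
): AsyncGenerator<string> {
  let pending = "";
  for await (const token of tokens) {
    pending += token;
    // Flush at clause punctuation, or when the phrase is getting long.
    const atBoundary = /[.,!?;:]\s*$/.test(pending);
    if (atBoundary || pending.length >= maxChars) {
      const phrase = pending.trim();
      if (phrase.length > 0) yield phrase;
      pending = "";
    }
  }
  // Flush whatever is left when the model stops generating.
  const rest = pending.trim();
  if (rest.length > 0) yield rest;
}
```

Smaller phrases start playback sooner but give the synthesizer less context for prosody, so the cutoff is worth tuning against time-to-first-audio.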

What we'd do differently

Backpressure between stages is the part we underestimated. When TTS is slower than token generation, you need a clear protocol for what the upstream layer should do — drop, queue, or slow down. Pick one early.
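As an illustration of the three options, here is a minimal sketch of a bounded buffer that sits between two stages and applies one policy when it fills up. The class name `StageBuffer`, the policy names, and the single-consumer assumption are all hypothetical. Here "drop" discards the newest item, which is only one of several possible choices.

```ts
type OverflowPolicy = "drop" | "queue" | "slow-down";

// Hypothetical buffer between an upstream stage (push) and one downstream
// consumer (for await). Assumes a single consumer.
class StageBuffer<T> {
  private readonly capacity: number;
  private readonly policy: OverflowPolicy;
  private items: T[] = [];
  private waitingConsumer: (() => void) | null = null;
  private waitingProducers: Array<() => void> = [];
  private closed = false;

  constructor(capacity: number, policy: OverflowPolicy) {
    this.capacity = capacity;
    this.policy = policy;
  }

  // Resolves once the item has been accepted or dropped.
  async push(item: T): Promise<void> {
    if (this.items.length >= this.capacity) {
      if (this.policy === "drop") return; // discard the newest item
      if (this.policy === "slow-down") {
        // Block the producer until the consumer frees a slot.
        while (this.items.length >= this.capacity) {
          await new Promise<void>((resolve) => {
            this.waitingProducers.push(resolve);
          });
        }
      }
      // "queue": fall through and let the buffer grow past capacity.
    }
    this.items.push(item);
    this.wakeConsumer();
  }

  // Signals that the upstream stage has finished.
  close(): void {
    this.closed = true;
    this.wakeConsumer();
  }

  async *[Symbol.asyncIterator](): AsyncGenerator<T> {
    while (true) {
      if (this.items.length > 0) {
        const item = this.items.shift() as T;
        this.waitingProducers.shift()?.(); // free one blocked producer
        yield item;
      } else if (this.closed) {
        return;
      } else {
        await new Promise<void>((resolve) => {
          this.waitingConsumer = resolve;
        });
      }
    }
  }

  private wakeConsumer(): void {
    const wake = this.waitingConsumer;
    this.waitingConsumer = null;
    wake?.();
  }
}
```

In a pipeline like ours, the token stage would call `push` and the TTS stage would iterate the buffer. With "slow-down", awaiting `push` is what propagates the pressure upstream. With "queue", memory and latency grow instead, which is why the choice is worth making deliberately.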
