Mar 27, 2026

Infrastructure

Latency Budgets That Actually Work

Rajesh R


A latency budget is only useful if it's allocated. "Keep p95 under 500ms" is an aspiration, not a budget.

How we allocate

For a 500ms turn budget we split the work across the four stages that run sequentially:

50ms   Input handling: STT partials, validation, routing

200ms  Model inference: LLM tokens to the first useful chunk

150ms  Tool calls: typical-case external dependency lookups

100ms  Output + transmit: TTS buffering, stream to client

Each owner knows their slice and knows when they've blown it.
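To make "knows when they've blown it" concrete, here is a minimal sketch of the four slices above as data, with a check that reports which stages overran on a given turn. The stage keys and the `blown_stages` helper are hypothetical names chosen for illustration, not part of any existing tooling.

```python
# Slices from the allocation above, in milliseconds.
# The stage keys are illustrative names, not identifiers from a real system.
TURN_BUDGET_MS = 500
STAGE_BUDGET_MS = {
    "input_handling": 50,
    "model_inference": 200,
    "tool_calls": 150,
    "output_transmit": 100,
}

# The stages run sequentially, so the slices must add up to the turn budget.
assert sum(STAGE_BUDGET_MS.values()) == TURN_BUDGET_MS


def blown_stages(measured_ms: dict[str, float]) -> dict[str, float]:
    """Return the overrun in ms for each stage that exceeded its slice."""
    return {
        stage: measured_ms[stage] - budget
        for stage, budget in STAGE_BUDGET_MS.items()
        if measured_ms.get(stage, 0.0) > budget
    }


# One hypothetical turn: inference ran 60ms over, everything else fit.
turn = {
    "input_handling": 42,
    "model_inference": 260,
    "tool_calls": 120,
    "output_transmit": 95,
}
print(blown_stages(turn))  # {'model_inference': 60}
```

In this made-up turn the total is 517ms, so the turn budget is blown, and the per-stage check points straight at the slice responsible instead of leaving four teams to argue over one end-to-end number.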

When budgets break

When one stage routinely exceeds its slice, the question isn't "can we make it faster"; it's "should we re-allocate." Sometimes the answer is that the budget was wrong, not the implementation.
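One way to frame the re-allocation question is to compare how often a stage overruns with how much slack the other stages typically leave. The sketch below is only an illustration under stated assumptions: the 50% "routine" threshold, the use of the median as the typical case, and the `review` helper are all hypothetical choices, and a budget defined on p95 would need a percentile in place of the median.

```python
from statistics import median

# Same slices as the allocation above, in milliseconds (illustrative keys).
STAGE_BUDGET_MS = {
    "input_handling": 50,
    "model_inference": 200,
    "tool_calls": 150,
    "output_transmit": 100,
}


def overrun_rate(samples_ms: list[float], budget_ms: float) -> float:
    """Fraction of turns in which a stage exceeded its slice."""
    return sum(s > budget_ms for s in samples_ms) / len(samples_ms)


def review(samples_by_stage: dict[str, list[float]],
           routine_threshold: float = 0.5) -> dict:
    """Flag routine offenders and check whether slack elsewhere covers them."""
    offenders = []
    net_slack_ms = 0.0
    for stage, budget in STAGE_BUDGET_MS.items():
        samples = samples_by_stage[stage]
        # "Routine" here means overrunning in more than half of turns (assumed).
        if overrun_rate(samples, budget) > routine_threshold:
            offenders.append(stage)
        # Positive slack: the stage typically finishes under its slice.
        net_slack_ms += budget - median(samples)
    return {
        "offenders": offenders,
        "net_slack_ms": net_slack_ms,
        # If total typical slack is non-negative, moving milliseconds between
        # slices can fix the overrun without raising the 500ms turn budget.
        "reallocation_feasible": net_slack_ms >= 0,
    }


# Hypothetical per-stage timings from three turns.
samples = {
    "input_handling": [30, 35, 40],
    "model_inference": [230, 240, 190],
    "tool_calls": [100, 110, 120],
    "output_transmit": [90, 95, 105],
}
result = review(samples)
print(result["offenders"], result["reallocation_feasible"])
```

With these made-up samples, inference is the routine offender, but the other three stages leave enough typical slack to cover it, which suggests the split was wrong rather than the implementation. If the net slack were negative, no re-allocation inside 500ms would help, and the choice would be between a faster implementation and a larger turn budget.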
