Latency Budgets That Actually Work

A latency budget is only useful if it's allocated. "Keep p95 under 500ms" is an aspiration, not a budget.

How we allocate

For a 500ms turn budget we split the work across the four stages that run sequentially:

50ms   Input handling: STT partials, validation, routing
200ms  Model inference: LLM tokens to the first useful chunk
150ms  Tool calls: typical-case external dependency lookups
100ms  Output + transmit: TTS buffering, stream to client

Each owner knows their slice and knows when they've blown it.
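One way to make "knows when they've blown it" concrete is to keep the slices as data and check measured stage timings against them. The following is a minimal sketch, not a description of any particular system: the stage keys and the `overruns` helper are hypothetical names, and it assumes per-stage timings are already collected in milliseconds.

```python
# Slices from the allocation above, in milliseconds.
# The stage keys are hypothetical names chosen for this sketch.
BUDGET_MS = {
    "input_handling": 50,
    "model_inference": 200,
    "tool_calls": 150,
    "output_transmit": 100,
}
TURN_BUDGET_MS = 500

# The slices must add up to the turn budget, otherwise it is not an allocation.
assert sum(BUDGET_MS.values()) == TURN_BUDGET_MS


def overruns(measured_ms: dict[str, float]) -> dict[str, float]:
    """Return the overshoot in ms for each stage that exceeded its slice."""
    return {
        stage: measured_ms[stage] - limit
        for stage, limit in BUDGET_MS.items()
        if measured_ms.get(stage, 0.0) > limit
    }


if __name__ == "__main__":
    turn = {
        "input_handling": 40,
        "model_inference": 260,
        "tool_calls": 150,
        "output_transmit": 90,
    }
    print(overruns(turn))
```

A stage that lands exactly on its slice is treated as within budget here; only a strict overshoot is reported to the owner.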

When budgets break

When one stage routinely exceeds its slice, the question isn't "can we make it faster"; it's "should we re-allocate." Sometimes the answer is that the budget was wrong, not the implementation.
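The re-allocation question can be roughed out numerically by comparing each stage's observed p95 with its slice and checking whether the stages under budget have enough headroom to cover the ones over it. This is a sketch under stated assumptions: the function names are hypothetical, the nearest-rank percentile is one convention among several, and per-stage percentiles do not add up exactly to the turn percentile, so treat the result as a prompt for discussion rather than a verdict.

```python
def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list of timings."""
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def headroom(budget_ms: dict[str, float],
             samples_ms: dict[str, list[float]]) -> dict[str, float]:
    """Slice minus observed p95 per stage: positive is spare, negative is over."""
    return {stage: budget_ms[stage] - p95(samples_ms[stage]) for stage in budget_ms}


def can_reallocate(report: dict[str, float]) -> bool:
    """True if some stage is over budget and the spare ms elsewhere covers it."""
    spare = sum(h for h in report.values() if h > 0)
    deficit = -sum(h for h in report.values() if h < 0)
    return deficit > 0 and spare >= deficit


if __name__ == "__main__":
    budget = {"model_inference": 200, "tool_calls": 150}
    samples = {
        "model_inference": [230.0] * 20,  # routinely over its slice
        "tool_calls": [100.0] * 20,       # routinely under its slice
    }
    report = headroom(budget, samples)
    print(report, can_reallocate(report))
```

If `can_reallocate` is false while a stage is still over, no amount of shuffling slices fits the turn budget, and the conversation moves back to the implementation or to the turn budget itself.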
