
A latency budget is only useful if it's allocated. "Keep p95 under 500ms" is an aspiration, not a budget.
How we allocate
For a 500ms turn budget we split the work across the four stages that run sequentially:
50ms   Input handling: STT partials, validation, routing
200ms  Model inference: LLM tokens to the first useful chunk
150ms  Tool calls: Typical-case external dependency lookups
100ms  Output + transmit: TTS buffering, stream to client
Each owner knows their slice and knows when they've blown it.
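One way to make "knows when they've blown it" concrete is a per-turn check against the slices. The sketch below is illustrative only: the stage keys and the `blown_stages` helper are hypothetical names, and it assumes each stage already reports its own elapsed time in milliseconds. The budget values are the ones listed above.

```python
# Per-stage slices in milliseconds, copied from the allocation above.
# The key names are hypothetical; use whatever your tracing already calls them.
STAGE_BUDGET_MS = {
    "input_handling": 50,
    "model_inference": 200,
    "tool_calls": 150,
    "output_transmit": 100,
}
TURN_BUDGET_MS = 500

# If the slices do not add up to the turn budget, it is not an allocation.
assert sum(STAGE_BUDGET_MS.values()) == TURN_BUDGET_MS


def blown_stages(measured_ms: dict[str, float]) -> dict[str, float]:
    """Return {stage: overage_ms} for each stage that exceeded its slice.

    Stages missing from `measured_ms` are treated as not having run.
    """
    return {
        stage: measured_ms[stage] - budget
        for stage, budget in STAGE_BUDGET_MS.items()
        if measured_ms.get(stage, 0.0) > budget
    }


# Example turn: inference ran 60ms over its slice, everything else was inside.
turn = {
    "input_handling": 42,
    "model_inference": 260,
    "tool_calls": 120,
    "output_transmit": 95,
}
print(blown_stages(turn))  # {'model_inference': 60}
```

Reporting the overage per stage, rather than only the turn total, is what lets the owning team see the miss without reading anyone else's traces.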
When budgets break
When one stage routinely exceeds its slice, the question isn't "can we make it faster" but "should we re-allocate." Sometimes the answer is that the budget was wrong, not the implementation.
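The re-allocation question can be framed with a small report over recent turns. This is a rough sketch under stated assumptions, not a prescribed method: "routinely" is taken to mean that at least 20% of turns exceed the slice (an arbitrary threshold), the function names are hypothetical, and every stage is assumed to have at least one sample. Summing per-stage p95 values is only a coarse proxy, since it is not the same as the p95 of the whole turn.

```python
# Slices in milliseconds from the allocation; key names are hypothetical.
STAGE_BUDGET_MS = {
    "input_handling": 50,
    "model_inference": 200,
    "tool_calls": 150,
    "output_transmit": 100,
}
TURN_BUDGET_MS = 500


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile; assumes a non-empty list."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def reallocation_report(samples_by_stage, routine_threshold=0.2):
    """Per stage: p95, share of turns over the slice, and headroom at p95."""
    report = {}
    for stage, budget in STAGE_BUDGET_MS.items():
        samples = samples_by_stage[stage]
        overrun_rate = sum(s > budget for s in samples) / len(samples)
        stage_p95 = p95(samples)
        report[stage] = {
            "p95_ms": stage_p95,
            "overrun_rate": overrun_rate,
            # Assumed definition of "routinely exceeds its slice".
            "routine": overrun_rate >= routine_threshold,
            # Negative headroom means the slice is too small at p95.
            "headroom_ms": budget - stage_p95,
        }
    return report


def fits_within_turn(report) -> bool:
    """Coarse check: could the observed p95s fit if slices were redrawn?"""
    return sum(r["p95_ms"] for r in report.values()) <= TURN_BUDGET_MS
```

If a stage is flagged as routine and `fits_within_turn` is true, other stages have enough headroom to cover it, which points toward moving the lines rather than optimizing. If it is false, no re-allocation fits the turn budget and something has to get faster.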

