Term
Latency budget
The maximum acceptable end-to-end response time for an LLM feature, allocated across retrieval, model inference, post-processing, and network. Drives architecture decisions: model size, streaming, caching, parallelism.
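One way to make the allocation concrete is to record the per-stage split and compare it with measured timings. The sketch below is illustrative only: the `LatencyBudget` class, its method names, and the 2000 ms split across stages are assumptions made for this example, not figures or an API from this entry.

```python
from dataclasses import dataclass


@dataclass
class LatencyBudget:
    """A total end-to-end limit split across pipeline stages (all values in ms)."""
    total_ms: float
    stages_ms: dict[str, float]

    def allocated_ms(self) -> float:
        # Sum of what has been handed out to individual stages.
        return sum(self.stages_ms.values())

    def headroom_ms(self) -> float:
        # Positive: unallocated slack. Negative: the split exceeds the total.
        return self.total_ms - self.allocated_ms()

    def overruns(self, measured_ms: dict[str, float]) -> dict[str, float]:
        # Stages whose measured time exceeds their allocation, with the excess.
        return {
            stage: measured_ms[stage] - limit
            for stage, limit in self.stages_ms.items()
            if measured_ms.get(stage, 0.0) > limit
        }


# Hypothetical split for a 2-second chat surface.
budget = LatencyBudget(
    total_ms=2000,
    stages_ms={
        "retrieval": 300,
        "inference": 1400,
        "post_processing": 100,
        "network": 200,
    },
)

# Hypothetical measurements from one request.
slow_stages = budget.overruns({"retrieval": 350, "inference": 1200})
```

Keeping the split explicit shows which stage has to give when a new step is added: any allocation that pushes `headroom_ms()` below zero must be paid for elsewhere, for example with a smaller model or a cache.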
Background
A chat interface tolerates a time-to-first-token of about 2 seconds. An autocomplete tool needs <300 ms. A voice assistant needs a first token in <500 ms. The budget shapes everything else: small distilled models for low-latency surfaces, large reasoning models behind asynchronous jobs, prompt caching to skip re-encoding the same system prompt, and speculative decoding when the model supports it.
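A first-token target can be checked by timing how long a streaming response takes to yield its first token. This is a minimal sketch under stated assumptions: `measure_ttft_ms`, `within_target`, and `fake_stream` are hypothetical helpers, and `fake_stream` stands in for a real streaming client. The targets reuse the chat and voice figures above.

```python
import time
from typing import Iterable, Iterator, Optional

# First-token targets in ms, taken from the chat and voice figures above.
TTFT_TARGET_MS = {"chat": 2000, "voice": 500}


def measure_ttft_ms(stream: Iterable[str]) -> tuple[float, Optional[str]]:
    """Return (elapsed ms until the first token, the first token or None)."""
    start = time.perf_counter()
    first = next(iter(stream), None)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return elapsed_ms, first


def within_target(surface: str, ttft_ms: float) -> bool:
    """True if the measured time-to-first-token meets the surface's target."""
    return ttft_ms <= TTFT_TARGET_MS[surface]


def fake_stream(delay_s: float, tokens: list[str]) -> Iterator[str]:
    """Hypothetical stand-in for a model stream: waits, then yields tokens."""
    time.sleep(delay_s)  # simulates queueing plus prompt processing
    yield from tokens


ttft_ms, first_token = measure_ttft_ms(fake_stream(0.05, ["Hi", " there"]))
meets_voice_target = within_target("voice", ttft_ms)
```

Because `fake_stream` is a generator, its delay runs only when the first token is requested, so the measurement covers the wait a user would actually experience. In practice the same wrapper would go around the real client's token iterator.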