Term

Latency budget

The maximum acceptable end-to-end response time for an LLM feature, allocated across retrieval, model inference, post-processing, and network. Drives architecture decisions: model size, streaming, caching, parallelism.

Background

A latency budget is the ceiling you set on how long an LLM feature may take from request to usable response, chosen from the user experience the feature needs to deliver. An autocomplete suggestion might be given a few hundred milliseconds, a chat reply a couple of seconds to first token, and a background batch job minutes.

The budget is a design constraint you allocate across every stage of the pipeline, because end-to-end time is the sum of the stages that run one after another: input preprocessing, embedding and vector search for retrieval, network round-trips, the model's own time-to-first-token, per-token generation time, and any post-processing, tool calls, or reranking. If the whole feature must answer within two seconds and retrieval takes four hundred milliseconds, generation and everything else must fit in the remaining 1.6 seconds. Two quantities usually dominate: time-to-first-token, which governs how responsive the feature feels, and total generation time, which scales with output length and model size.

Working within a budget drives concrete engineering choices, such as streaming tokens so the user sees output as soon as the first token arrives, capping max output tokens, picking a smaller or faster model, caching frequent responses, trimming retrieved context, or running independent steps in parallel. The budget matters because latency is a hard product constraint, not a nice-to-have: past a threshold users perceive the feature as broken and abandon it, so the budget forces those trade-offs to be made deliberately rather than discovered in production.
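The allocation arithmetic described above can be sketched in a few lines of Python. This is a minimal illustration, not a measurement tool: the stage names, the millisecond figures, and the helper names `remaining_budget_ms` and `max_output_tokens` are all hypothetical, and it assumes the stages run sequentially and that generation time is roughly time-to-first-token plus a constant per-token cost.

```python
from dataclasses import dataclass


@dataclass
class StageEstimate:
    name: str
    latency_ms: float  # assumed estimate for this stage (e.g. a p95 figure)


def remaining_budget_ms(total_budget_ms: float, stages: list[StageEstimate]) -> float:
    """Return what is left of the budget after the listed sequential stages."""
    return total_budget_ms - sum(s.latency_ms for s in stages)


def max_output_tokens(generation_budget_ms: float, ttft_ms: float, ms_per_token: float) -> int:
    """Largest n such that ttft_ms + n * ms_per_token fits in the generation budget."""
    if generation_budget_ms <= ttft_ms:
        return 0  # not even the first token fits
    return int((generation_budget_ms - ttft_ms) // ms_per_token)


# Illustrative numbers only: a 2-second budget with 400 ms spent on retrieval.
stages = [
    StageEstimate("preprocessing", 50),
    StageEstimate("retrieval", 400),
    StageEstimate("network", 100),
    StageEstimate("post-processing", 50),
]

left = remaining_budget_ms(2000, stages)  # time left for model inference
cap = max_output_tokens(left, ttft_ms=300, ms_per_token=20)
print(f"generation budget: {left} ms, output cap: {cap} tokens")
```

With these assumed figures the non-model stages consume 600 ms, so a non-streamed response would have to cap output length to stay within the ceiling. With streaming, the user-visible wait is closer to the time-to-first-token, which is one reason streaming appears among the choices listed above.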

Tools that use it