Term

p99

The 99th-percentile response time: out of 100 requests, 99 finish within this time and the slowest 1 takes longer. It is the key metric to watch for LLM latency, since tail latency on chat APIs can be 5-10× the median.

Background

p99 is the 99th-percentile latency: the response time below which 99% of requests complete, so only the slowest 1% take longer. It is computed by collecting the latency of every request over a window, sorting the values, and reading off the value at the 99% mark. Unlike an average, it is not pulled down by the many fast requests; it characterises the tail of the distribution. If p50 (the median) is 300 ms but p99 is 4 s, then half your requests finish within 300 ms while one in a hundred takes 4 s or longer, information the mean would hide. Percentiles such as p50, p95, p99, and p99.9 are the standard way to state a latency SLO because users experience the tail, not the average, and a single user issuing many calls will hit that tail regularly.

p99 matters acutely for LLM-backed software because generation latency is highly variable: long outputs, queueing under load, cold starts, retries, and slow upstream tools all fatten the tail. A chatbot or agent that feels fine in testing can still have a p99 that makes real sessions stall.

In practice, monitor p99 (and often p99.9) rather than averages, track time-to-first-token separately from total generation time, and design timeouts, retries, streaming, and fallbacks around the tail rather than the median. Note that percentiles do not average across services: the mean of several per-service p99 values is not the p99 of the combined traffic. Aggregate from raw measurements or histograms, not by combining per-service p99 numbers.
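As a rough illustration of the points above, the following Python sketch computes p50 and p99 from raw samples, estimates how often a multi-call session hits the tail, and shows why per-service p99 values cannot simply be averaged. The function names `percentile` and `chance_of_hitting_tail`, the nearest-rank method, the independence assumption, and all latency values are choices made for this example only. Production monitoring systems usually rely on interpolation or histogram estimates and may report slightly different numbers.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def chance_of_hitting_tail(calls, p=99):
    """Probability that at least one of `calls` requests is slower than
    the p-th percentile, ASSUMING requests are independent."""
    return 1 - (p / 100) ** calls


# Hypothetical window of 100 requests (seconds): mostly fast, two slow.
latencies = [0.3] * 98 + [4.0] * 2
mean = statistics.mean(latencies)
p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
print(f"mean={mean:.3f}s p50={p50}s p99={p99}s")

# Hypothetical per-service windows: service A is fast, service B is slow.
service_a = [0.2] * 900
service_b = [4.0] * 100

# Wrong: averaging the two per-service p99 values.
averaged_p99 = (percentile(service_a, 99) + percentile(service_b, 99)) / 2
# Right: pool the raw measurements, then take the percentile.
pooled_p99 = percentile(service_a + service_b, 99)
print(f"averaged p99={averaged_p99:.2f}s pooled p99={pooled_p99}s")

print(f"P(tail hit in 100 calls)={chance_of_hitting_tail(100):.3f}")
```

In this made-up data the mean sits close to the median and hides the 4 s tail that p99 exposes, and the averaged p99 disagrees with the pooled one. Under the independence assumption, a session that makes 100 calls has roughly a 63% chance of seeing at least one request slower than p99.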