vibedonalds.com
Term

p99

The 99th-percentile response time: out of 100 requests, 99 finish within this time and the slowest 1 takes longer. It is the metric to watch for LLM latency, because tail latency on chat APIs can be 5-10× the median.
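As a rough illustration of the definition, the sketch below computes p50 and p99 from a list of latency samples using the nearest-rank method. The `percentile` helper and the sample values are made up for this example (they are not measurements from any provider), and other percentile conventions such as interpolation give slightly different numbers.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # 1-based rank; multiply before dividing to limit float rounding.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds for 100 requests:
# most are fast, a few are slow, two are very slow.
latencies_ms = [800] * 90 + [1500] * 8 + [6000] * 2

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
tail_ratio = p99 / p50

print(f"p50={p50} ms, p99={p99} ms, p99/p50={tail_ratio}")
```

With these invented samples the median looks healthy while p99 is several times larger, which is why an average or a median alone can hide the slow requests that users actually notice.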

Background

LLM providers' p99 latency is often dominated by traffic spikes hitting capacity limits, GPU queuing, and cold-start overhead. Common reduction strategies are provisioned throughput (on Anthropic, OpenAI, or Gemini), sticky routing to warm replicas, prompt caching to amortise system-prompt encoding, and fallback to a smaller, faster model on timeout.
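The last strategy, fallback to a smaller model on timeout, can be sketched with `asyncio.wait_for`. This is a minimal sketch under stated assumptions: `call_model`, the model names `large-model` and `small-model`, and the simulated latencies are all hypothetical stand-ins, and no real provider SDK is called.

```python
import asyncio

# Hypothetical model names with simulated latencies in seconds.
SIMULATED_LATENCY_S = {"large-model": 0.2, "small-model": 0.01}


async def call_model(model: str, prompt: str) -> str:
    """Stand-in for a real provider call; sleeps to simulate latency."""
    await asyncio.sleep(SIMULATED_LATENCY_S[model])
    return f"[{model}] reply to: {prompt}"


async def complete_with_fallback(
    prompt: str,
    primary: str = "large-model",
    fallback: str = "small-model",
    timeout_s: float = 0.05,
) -> str:
    """Try the primary model; if it exceeds timeout_s, cancel it and
    answer with the fallback model instead."""
    try:
        return await asyncio.wait_for(call_model(primary, prompt), timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for has already cancelled the slow primary call here.
        return await call_model(fallback, prompt)


# Primary is slower than the timeout, so the fallback answers.
slow_case = asyncio.run(complete_with_fallback("hello"))
# With a generous timeout the primary answers.
fast_case = asyncio.run(complete_with_fallback("hello", timeout_s=1.0))

print(slow_case)
print(fast_case)
```

The timeout caps the time spent waiting on the primary model, so worst-case latency becomes roughly the timeout plus the fallback's latency, at the cost of a lower-quality answer on the slow requests. In a real system the timeout would likely be set near the primary model's observed p95 or p99 rather than a fixed constant.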