Tail latency
Tail latency refers to the high-percentile latency values — typically p99, p99.9, or p99.99 — that represent the slowest fraction of requests. While median (p50) latency reflects typical performance, tail latency captures the worst-case experiences that real users encounter due to GC pauses, resource contention, network retransmissions, or cache misses. In distributed systems, tail latency is particularly damaging because it compounds across service calls.
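The compounding effect can be made concrete with a short calculation. This is a minimal sketch (the `p_all_fast` helper is hypothetical) of the fraction of composite requests that avoid the tail as fan-out grows, assuming independent calls that each beat their p99 with probability 0.99:

```python
def p_all_fast(n: int, per_call: float = 0.99) -> float:
    """Probability that all n independent parallel calls finish
    under their p99 latency, given each does so with probability per_call."""
    return per_call ** n

for n in (1, 10, 100):
    print(f"fan-out {n:3d}: {p_all_fast(n):.2%} of requests avoid the tail")
```

At a fan-out of 100, only about 37% of composite requests dodge every per-call tail, which is the arithmetic behind the fan-out problem below.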
Formula

The fan-out problem: A service making N parallel calls, each with p99 latency L, experiences worst-case latency close to L on most requests, not just 1% of them. With 100 parallel calls at p99 = 10 ms, the probability that all complete under 10 ms is 0.99^100 ≈ 37% — meaning roughly 63% of composite requests are slowed by at least one tail event.

Why it matters in practice

Tail latency is the main reason distributed systems feel slower than their individual component benchmarks suggest. A microservice architecture with 10 sequential calls, each at p99 = 20 ms, has a worst-case latency budget of roughly 200 ms, and its composite tail is far worse than any single service's benchmark implies — even if each service looks fast in isolation. Strategies to reduce tail latency include hedged requests (sending a duplicate request to a second server if the first doesn't respond within a threshold), aggressive timeouts, and circuit breakers that fail fast rather than queue indefinitely.
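The hedged-request strategy described above can be sketched with threads. This is a simplified illustration, not a production client: the replica delays are simulated, and the `call_replica`/`hedged_request` names are assumptions for this example.

```python
import concurrent.futures as cf
import time

def call_replica(replica_id: int, delay: float) -> str:
    # Stand-in for a network call; replica 1 is stuck in its latency tail.
    time.sleep(delay)
    return f"reply from replica {replica_id}"

def hedged_request(hedge_after: float = 0.05) -> str:
    """Send the request to one replica; if it hasn't answered within
    hedge_after seconds, duplicate it to a second replica and return
    whichever reply arrives first."""
    pool = cf.ThreadPoolExecutor(max_workers=2)
    try:
        first = pool.submit(call_replica, 1, 0.5)    # simulated tail event
        done, _ = cf.wait([first], timeout=hedge_after)
        if done:
            return first.result()                    # fast path: no hedge needed
        second = pool.submit(call_replica, 2, 0.01)  # hedge: duplicate request
        done, _ = cf.wait([first, second], return_when=cf.FIRST_COMPLETED)
        return next(iter(done)).result()
    finally:
        pool.shutdown(wait=False)

print(hedged_request())  # the hedge wins: "reply from replica 2"
```

The hedge threshold is typically set near the observed p95 or p99, so duplicates are only sent for the small fraction of requests already in the tail.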
Common mistakes
- Treating tail latency as an edge case affecting few users — at p99, 1% of requests is potentially millions per day for high-traffic systems.
- Not using hedged requests for latency-critical paths — the cost of an occasional duplicate request is low compared to the user experience improvement.
- Load testing without simulating GC pauses, background jobs, or noisy neighbours — these are the primary sources of production tail latency.
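The measurement point in the list above can be illustrated by computing percentiles over simulated latencies. This is a minimal sketch with made-up numbers: a fast baseline with occasional "GC pause" outliers, and a hypothetical nearest-rank `percentile` helper.

```python
import math
import random

def percentile(samples, q):
    """Nearest-rank percentile: the smallest value such that at least
    q% of the samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

random.seed(7)
# Simulated latencies in ms: ~10 ms baseline, ~2% outliers at 150 ms
# standing in for GC pauses or noisy neighbours.
latencies = [150 if random.random() < 0.02 else 10 + random.random()
             for _ in range(100_000)]

print(f"p50 = {percentile(latencies, 50):.1f} ms")
print(f"p99 = {percentile(latencies, 99):.1f} ms")
```

The median barely moves when outliers are added, while the p99 jumps to the outlier value — which is why a load test that omits pause-like events will report a misleadingly flat tail.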
Related Terms
p99 latency
The 99th percentile of request durations: 1% of requests are at least this slow.
Fan-out (distributed systems)
A single request triggers multiple parallel downstream calls.
Thundering herd
Many clients simultaneously hit a resource that has just become available.
Latency budget
Total time allocated for a complete user-facing request across all architectural hops.