dkduckkit.dev

Tail latency

Latency & SRE

Tail latency refers to the high-percentile latency values — typically p99, p99.9, or p99.99 — that represent the slowest fraction of requests. While median (p50) latency reflects typical performance, tail latency captures the worst-case experiences that real users encounter due to GC pauses, resource contention, network retransmissions, or cache misses. In distributed systems, tail latency is particularly damaging because it compounds across service calls.
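Percentiles like these can be computed directly from raw latency samples. A minimal sketch in Python, using a nearest-rank percentile and a made-up bimodal distribution (a fast path plus rare slow outliers standing in for GC pauses — all numbers are illustrative assumptions, not benchmarks):

```python
import random

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample >= p% of all samples."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

# Hypothetical latencies: 99% fast (~10 ms), 1% slow outliers (~120 ms).
random.seed(1)
latencies_ms = [
    random.gauss(10, 2) if random.random() < 0.99 else random.gauss(120, 30)
    for _ in range(100_000)
]

p50 = percentile(latencies_ms, 50)     # typical request: ~10 ms
p99 = percentile(latencies_ms, 99)
p999 = percentile(latencies_ms, 99.9)  # deep in the outlier mode
```

Note how p50 stays near 10 ms regardless of the outliers, while p99.9 lands squarely in the slow mode — which is exactly why median dashboards hide tail problems.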

Formula

The fan-out problem: a service making N parallel calls, each with p99 latency L, sees at least one call exceed L far more often than 1% of the time. With 100 parallel calls at p99 = 10 ms, the probability that all complete under 10 ms is 0.99^100 ≈ 37% — meaning roughly 63% of composite requests include at least one slow call.
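A quick check of that arithmetic, under the assumption that the 100 calls are independent (real backends often slow down together, which makes the composite picture worse, not better):

```python
# Probability that a fan-out request avoids every per-call tail:
# each of n parallel calls independently finishes under its own p99
# with probability per_call (0.99) — an independence assumption.
def p_all_fast(n, per_call=0.99):
    return per_call ** n

print(f"{p_all_fast(100):.2%} of 100-way fan-outs avoid the tail")   # ~36.6%
print(f"{1 - p_all_fast(100):.2%} hit at least one slow call")       # ~63.4%
```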

Why it matters in practice

Tail latency is the main reason distributed systems feel slower than their individual component benchmarks suggest. A microservice architecture with 10 sequential calls, each at p99 = 20 ms, sees at least one call exceed its p99 on roughly 1 − 0.99^10 ≈ 10% of requests — a 1-in-100 event per call becomes a 1-in-10 event per composite request, even though each service looks fast in isolation. Strategies to reduce tail latency include hedged requests (sending a duplicate request to a second server if the first doesn't respond within a threshold), aggressive timeouts, and circuit breakers that fail fast rather than queue indefinitely.
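A hedged request can be sketched in a few lines with asyncio. Here `call` is an assumed zero-argument coroutine factory for the backend RPC, and the 50 ms hedge threshold is an arbitrary placeholder — in practice it is usually derived from the observed p95 or p99 so that hedges fire on only a few percent of requests:

```python
import asyncio

async def hedged(call, hedge_after_s=0.05):
    """Hedged request sketch: if the first attempt hasn't finished within
    hedge_after_s, launch a duplicate and return whichever finishes first."""
    first = asyncio.ensure_future(call())
    try:
        # shield() keeps the first attempt running if the timer fires.
        return await asyncio.wait_for(asyncio.shield(first), hedge_after_s)
    except asyncio.TimeoutError:
        second = asyncio.ensure_future(call())
        done, pending = await asyncio.wait(
            {first, second}, return_when=asyncio.FIRST_COMPLETED)
        for task in pending:
            task.cancel()          # don't leak the slower duplicate
        return done.pop().result()

async def demo():
    delays = iter([0.5, 0.01])     # first attempt is slow, the hedge is fast
    async def call():
        await asyncio.sleep(next(delays))
        return "ok"
    return await hedged(call)

print(asyncio.run(demo()))  # "ok" — served by the hedge long before 0.5 s
```

Cancelling the loser matters: without it, hedging doubles load on exactly the requests that are already slow.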

Common mistakes

  • Treating tail latency as an edge case affecting few users — at p99, 1% of requests is potentially millions per day for high-traffic systems.
  • Not using hedged requests for latency-critical paths — the cost of an occasional duplicate request is low compared to the user experience improvement.
  • Load testing without simulating GC pauses, background jobs, or noisy neighbours — these are the primary sources of production tail latency.
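The last point can be made concrete with a toy simulation: the same base latency distribution, measured with and without a rare 100 ms stop-the-world pause. All parameters here are invented for illustration, not drawn from any real system:

```python
import random

def simulate_p99(n=100_000, pause_prob=0.0, pause_ms=100):
    """Toy latency model: exponential service time (mean 5 ms), plus an
    occasional stop-the-world pause (GC, background job, noisy neighbour)."""
    random.seed(0)
    samples = []
    for _ in range(n):
        t = random.expovariate(1 / 5.0)   # base latency, mean 5 ms
        if random.random() < pause_prob:
            t += pause_ms                 # rare pause lands on this request
        samples.append(t)
    samples.sort()
    return samples[int(0.99 * n)]         # nearest-rank p99

clean_p99 = simulate_p99()                  # idealized load test, no pauses
noisy_p99 = simulate_p99(pause_prob=0.01)   # 1% of requests hit a pause
print(f"p99 without pauses: {clean_p99:.1f} ms, with pauses: {noisy_p99:.1f} ms")
```

The p50 of the two runs is nearly identical; only the tail moves — which is why a load test that omits these effects passes while production pages.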