dkduckkit.dev

Latency budget

A latency budget is the total time allocated for a complete user-facing request, distributed across every architectural hop: DNS resolution, TLS handshake, load balancer, application servers, database queries, cache reads, and downstream RPC calls. If the sum of all hop latencies exceeds the budget, the request misses its SLA. A budget makes latency a first-class engineering constraint rather than an afterthought discovered in production.

Formula

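total_latency = sum(sequential_hops) + max(parallel_group). Hops that run in parallel cost only as much as the slowest member of the group. Safety rule: keep measured p99 (or whichever percentile your SLA is written against) at least 20% below your SLA ceiling to absorb GC pauses and traffic spikes.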
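As a rough illustration, the formula and the 20% safety rule can be sketched in Python. The function names and hop timings below are hypothetical, and summing the slowest member of several parallel groups is an assumed generalisation of the single-group formula above.

```python
def total_latency_ms(sequential_hops, parallel_groups=()):
    """Sum the sequential hops, then add the slowest member of each parallel group.

    Supporting more than one parallel group is an assumption layered on the
    single-group formula; with one group it reduces to sum(...) + max(...).
    """
    slowest_per_group = (max(group) for group in parallel_groups if group)
    return sum(sequential_hops) + sum(slowest_per_group)


def within_budget(measured_ms, sla_ms, headroom=0.20):
    """True if the measured latency sits at least `headroom` below the SLA ceiling."""
    return measured_ms <= sla_ms * (1 - headroom)


# Hypothetical timings in milliseconds, for illustration only.
sequential = [2, 10, 3, 40]   # DNS, TLS handshake, load balancer, application
parallel = [[25, 60, 15]]     # three downstream calls issued concurrently

total = total_latency_ms(sequential, parallel)
print(total, within_budget(total, sla_ms=200))
```

With these made-up numbers the sequential hops contribute 55 ms and the parallel group contributes its slowest call, 60 ms, so the total is 115 ms against a 160 ms target (200 ms minus 20% headroom).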

Why it matters in practice

Without an explicit budget, teams optimise individual services in isolation while the end-to-end latency silently grows. A 5 ms database query looks fine, but if it sits in a loop that runs 20 times per request, it consumes 100 ms — half a 200 ms SLA. Budgets force architects to allocate time across hops before writing code, not after a production incident reveals the real numbers. They also make capacity planning concrete: if you add a new microservice hop, you must remove latency elsewhere.
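The arithmetic in this paragraph can be written down as a small budget table. This is only a sketch: the hop names and allocations are invented for illustration, with the database loop taken from the 5 ms × 20 example above.

```python
SLA_MS = 200

# Hypothetical per-hop allocations in milliseconds.
allocations = {
    "tls": 20,
    "load_balancer": 5,
    "app": 60,
    "db_loop": 5 * 20,   # a 5 ms query executed 20 times per request
    "cache": 5,
}


def remaining_ms(allocations, sla_ms):
    """Budget left after every hop has taken its allocation (negative = over budget)."""
    return sla_ms - sum(allocations.values())


def budget_share(allocations, hop, sla_ms):
    """Fraction of the SLA consumed by a single hop."""
    return allocations[hop] / sla_ms


print(budget_share(allocations, "db_loop", SLA_MS))   # share taken by the query loop
print(remaining_ms(allocations, SLA_MS))              # slack before adding anything

# Adding a new hop without trimming another one pushes the request over budget.
with_new_hop = {**allocations, "new_service": 25}
print(remaining_ms(with_new_hop, SLA_MS))
```

Here the query loop alone takes half the SLA, and the assumed 25 ms new hop leaves the budget 15 ms short, which is the amount that would have to be recovered from other hops.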

Common mistakes

  • Measuring only p50 (median) and declaring success — a good median hides the p99 tail that real users experience during peak traffic.
  • Forgetting serialisation overhead — marshalling a large JSON response can add 5–20 ms that never appears in database or network traces.
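  • Not accounting for retries: a retry adds the full cost of the failed attempt (often a timeout) plus any backoff on top of the successful attempt, so it can double the latency of the affected requests or worse, and that cost lands in the tail percentiles.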
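The first and last mistakes can be illustrated together with a small, made-up sample. The sketch below assumes a nearest-rank percentile and a simple retry model (timeout, then backoff, then a successful second attempt). All numbers are hypothetical.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def retried_latency_ms(attempt_ms, timeout_ms, backoff_ms=0):
    """Latency seen by the caller when the first attempt times out and one retry succeeds."""
    return timeout_ms + backoff_ms + attempt_ms


# Hypothetical sample: 97 requests succeed first time in 40 ms,
# 3 requests hit a 200 ms timeout, back off 10 ms, then succeed in 40 ms.
slow = retried_latency_ms(attempt_ms=40, timeout_ms=200, backoff_ms=10)
samples = [40] * 97 + [slow] * 3

print(percentile(samples, 50), percentile(samples, 99))
```

In this sample the median stays at 40 ms while the p99 is 250 ms, so a dashboard that tracks only p50 would report a healthy service even though retried requests exceed a 200 ms SLA.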