duckkit.dev

System Latency Budget Calculator

Build your architecture from reference hop latencies and see instantly if you fit your SLA — interactive latency numbers for engineers.

Last updated: March 2026

TL;DR

Stack measured or reference hop latencies (DNS, TLS, gateways, RPC, queues) against a single SLA target to see whether the architecture fits.

Formula: Total latency ≈ Σ (hop latency × iterations); compare to SLA in milliseconds.

When to use this

  • Design reviews for user-facing request paths and microservice graphs.
  • Arguing for fewer hops or faster regions with numbers, not opinions.

How the math works

LaTeX model and TypeScript reference — same logic as the calculator on this page.

This describes the implementation behind the numbers as of 2026-03-26. It is engineering documentation, not legal or compliance advice.

Specification citation

The logic is our proprietary implementation, checked against public SRE literature: the Google SRE Book chapter on monitoring distributed systems.

This snippet represents the core logic of our proprietary calculation engine, verified against public SRE literature on latency budgets and service level objectives.

Model (LaTeX source)
Latency budget (duckkit.dev model)

S = \mathrm{SLA}_{\mathrm{ms}} \cdot 1000 \quad \text{(SLA target in microseconds)}

T_{50} = \sum_i p_{50,i} \cdot \max(1, \mathrm{iter}_i)

T_{99} = \mathrm{round}(T_{50} \cdot m_{99}), \qquad T_{999} = \mathrm{round}(T_{99} \cdot m_{999})

where the tail multipliers m_{99} and m_{999} depend on architecture complexity (simple | moderate | complex).

H = 100 \cdot \frac{S - T_{50}}{S} \quad \text{when } S > 0

Status: over if T_{50} > S; else tight if H < 20; else safe.
Reference implementation (TypeScript, excerpt from shipped modules)
// lib/latency-budget/calculate.ts (excerpt)
export function calculateBudget(
  inputs: LatencyBudgetInputs,
): LatencyBudgetResult {
  const { slaTargetMs, hops, complexity } = inputs
  const slaTargetUs = slaTargetMs * 1_000

  let totalP50Us = 0
  for (const hop of hops) {
    const it = Math.max(1, hop.iterations)
    totalP50Us += hop.p50Us * it
  }

  const { p99: p99Mult, p999: p999Mult } = TAIL_MULTIPLIERS[complexity]
  const totalP99Us = Math.round(totalP50Us * p99Mult)
  const totalP999Us = Math.round(totalP99Us * p999Mult)

  const headroomUs = slaTargetUs - totalP50Us
  const headroomPct =
    slaTargetUs > 0
      ? parseFloat(((headroomUs / slaTargetUs) * 100).toFixed(1))
      : 0

  const status: BudgetStatus =
    totalP50Us > slaTargetUs ? 'over' : headroomPct < 20 ? 'tight' : 'safe'
  // … hop contributions, warnings …
}
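The excerpt above omits its types and tail-multiplier table, so here is a self-contained sketch of how the pieces compose, using the example hops from this page. The p99 multipliers follow the 1.2/1.4/2.0 figures quoted elsewhere on this page; the p999 values and the `Hop` type shape are illustrative assumptions, not the shipped module.

```typescript
// Minimal standalone version of the budget calculation, for illustration only.
type Complexity = 'simple' | 'moderate' | 'complex'

interface Hop { name: string; p50Us: number; iterations: number }

// p99 values match the figures quoted on this page; p999 values are assumptions.
const TAIL_MULTIPLIERS: Record<Complexity, { p99: number; p999: number }> = {
  simple: { p99: 1.2, p999: 2.0 },
  moderate: { p99: 1.4, p999: 2.5 },
  complex: { p99: 2.0, p999: 3.0 },
}

// Sum of each hop's p50 times its iteration count (minimum one).
function totalP50Us(hops: Hop[]): number {
  return hops.reduce((sum, h) => sum + h.p50Us * Math.max(1, h.iterations), 0)
}

// The example configuration from this page: three same-AZ hops.
const hops: Hop[] = [
  { name: 'Redis GET (same AZ)', p50Us: 200, iterations: 1 },
  { name: 'PostgreSQL simple', p50Us: 1_000, iterations: 1 },
  { name: 'HTTP same AZ', p50Us: 1_000, iterations: 1 },
]

const p50 = totalP50Us(hops)                                // 2200 μs = 2.20 ms
const p99 = Math.round(p50 * TAIL_MULTIPLIERS.moderate.p99) // 3080 μs = 3.08 ms
console.log({ p50, p99 })
```

Running it reproduces the 2.20 ms p50 and 3.08 ms p99 shown in the results below.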


At a glance

  • Status: safe
  • p50 total: 2.20 ms
  • Headroom: 98%

Configuration

SLA target: p50, in milliseconds.

Architecture hops (3 added):

  • Redis GET (same AZ)
  • PostgreSQL simple
  • HTTP same AZ

Results

Within budget. Bottleneck: PostgreSQL simple.

  • p50 latency: 2.20 ms
  • p99 latency: 3.08 ms
  • p999 latency: 8.80 ms
  • Headroom: 98%

p99 and p999 are estimated using a static multiplier per architecture complexity. Real tail latency compounds non-linearly across hops (statistical convolution). For precise tail latency modelling, instrument your system with HDR Histogram and measure empirically.

Latency breakdown

Against the SLA target:

  • Redis GET (same AZ): 200 μs · 9.1%
  • PostgreSQL simple: 1.00 ms · 45.5% (bottleneck)
  • HTTP same AZ: 1.00 ms · 45.5%

Methodology

End-to-end p50 is the sum of each hop's p50 multiplied by its iteration count (minimum one). p99 and p999 are synthesized by applying fixed multipliers to that p50 sum based on architecture complexity (simple / moderate / complex) — they are not measured distributions. Headroom is SLA minus p50 sum; status flags compare that to thresholds. The tool does not simulate queuing, retries, or parallel fan-out beyond the iteration field.

Latency numbers every programmer should know

Jeff Dean and others popularised orders-of-magnitude reference latencies—from L1 cache to cross-region RTT—so engineers can reason about systems without benchmarking every layer. A latency budget turns those numbers into a design constraint: you add the cost of each hop in your critical path and compare the sum to your SLA. This tool is an interactive take on that idea: pick realistic hops, tune counts, and see whether your latency budget still fits.

Why p99 estimates are approximations

Tail latency across distributed hops does not sum linearly. The p99 of a system with 5 hops is not 5× the p99 of one hop — it depends on the latency distribution of each component (log-normal for network, heavy-tail for GC pauses). The multipliers used here (1.2× for simple, 1.4× for moderate, 2.0× for complex architectures) are empirically derived conservative estimates suitable for SLA planning. For production SLO definition, always instrument with real traffic using HDR Histogram or OpenTelemetry percentile metrics.

Building a latency budget

A latency budget allocates your end-to-end SLA across services, databases, caches, and network segments. Teams often target p50 for capacity planning but must still understand p99 and tail behaviour: small headroom at p50 means retries, GC, or one slow dependency will breach the SLA under load. We surface p99 and p999 using multipliers that scale with architecture complexity, so simple paths are not over-penalised and highly fanned-out systems get a more conservative tail estimate.

Many teams aim for roughly 20% headroom below the SLA at p50 so bursts and jitter do not immediately violate customer-facing targets.
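That rule of thumb falls directly out of the status logic in the model. A minimal sketch, with the thresholds used on this page:

```typescript
// 'over' if p50 exceeds the SLA; 'tight' below 20% headroom; 'safe' otherwise.
function budgetStatus(slaMs: number, totalP50Ms: number): 'over' | 'tight' | 'safe' {
  if (totalP50Ms > slaMs) return 'over'
  const headroomPct = (100 * (slaMs - totalP50Ms)) / slaMs
  return headroomPct < 20 ? 'tight' : 'safe'
}

console.log(budgetStatus(100, 85)) // 'tight': only 15% headroom for bursts and jitter
console.log(budgetStatus(100, 60)) // 'safe': 40% headroom absorbs retries and GC pauses
```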

The tail latency problem

Median latency hides worst-case behaviour. Garbage collection pauses, packet loss, thread pool exhaustion, and synchronous fan-out all widen the tail. That is why p99 can exceed a naïve sum of medians and why p999 is often several times worse than p99. When you stack many hops, the probability that at least one is slow rises—another reason to treat distributed systems as higher complexity in this calculator.

Why some latencies have a physical lower bound

Speed of light in optical fibre is approximately 200,000 km/s—about 30% slower than in vacuum. A London–New York round trip is ~11,000 km along the straight-line geodesic, a lower bound of roughly 55ms in fibre; real submarine routes are longer, pushing the theoretical minimum nearer ~70ms before switching and queuing, and measured RTT is often ~80ms once routing and equipment are included. This is why cross-continental HTTP will not reach sub-50ms latency, and why CDN edge nodes exist—to move work physically closer to users. For latency-sensitive operations, geographic distribution is not optional; it is physics.
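The bound is simple arithmetic: round-trip cable distance divided by the speed of light in fibre. A sketch using the ~200,000 km/s figure above (the 14,000 km cable-path value is an illustrative assumption for a longer real route):

```typescript
// Theoretical minimum round-trip time over optical fibre, ignoring switching and queuing.
const FIBRE_KM_PER_MS = 200 // ~200,000 km/s = 200 km per millisecond

function minRttMs(roundTripKm: number): number {
  return roundTripKm / FIBRE_KM_PER_MS
}

console.log(minRttMs(11_000)) // 55 ms: London–New York geodesic round trip
console.log(minRttMs(14_000)) // 70 ms: a longer real submarine cable path
```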

Common latency killers

Disk seeks on spinning rust, N+1 database patterns (modelled here with per-hop iterations), uncached cross-AZ or cross-region calls, and cold starts on serverless all consume budget fast. Eliminating even one dominant hop often matters more than micro-optimising the rest.

Related calculators

Once you know your p99 latency, use the p99LatencyMs input in our API Rate Limit Calculator to set safe retry windows. For Kafka-based architectures, see the Kafka Consumer Lag Predictor to factor in messaging and processing time.

Copy-paste solution

flowchart LR
  Client --> Edge[CDN / edge]
  Edge --> GW[API gateway]
  GW --> Svc[Service]
  Svc --> DB[(Database)]
  %% Label each edge with p50/p99 ms from this calculator

Paste into any Mermaid-compatible doc; replace hop labels with measured latency from this tool.

Frequently asked questions

What is a latency budget?
A latency budget is the total time allocated for a request to complete, distributed across all architectural hops (database queries, HTTP calls, cache lookups). If the sum of hop latencies exceeds your SLA target, the request misses its deadline. Good practice: maintain 20% headroom below the SLA to absorb GC pauses, network jitter, and traffic spikes.
What is the difference between p50, p99, and p999 latency?
p50 (median) is the latency that 50% of requests are faster than. p99 means 99% of requests complete faster — 1 in 100 is slower. p999 means 999 in 1000 are faster — 1 in 1000 hits the tail. For SLAs in distributed systems, p99 matters most: in a microservice with 10 sequential calls, each at p99 independently, the combined probability of hitting a slow response is much higher than per-component p99 suggests.
The tail (p50 → p99 → p99.9) covers increasingly rare slow paths — SLOs usually track p99, not the mean.
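The compounding in the answer above can be made concrete. Assuming each hop's slowness is independent, the chance that at least one of n sequential calls lands in its slow 1% tail is:

```typescript
// P(at least one of n independent hops is p99-slow) = 1 − 0.99^n.
function pAnySlow(n: number, perHopFast = 0.99): number {
  return 1 - perHopFast ** n
}

// For 10 sequential calls, ~9.6% of requests hit at least one p99-slow hop —
// nearly ten times the 1% a single hop's p99 suggests.
console.log(pAnySlow(10).toFixed(3)) // "0.096"
```

Real hops are rarely fully independent, so treat this as a lower bound on the intuition, not a measurement.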
Why does cross-continent HTTP have a physical lower bound?
Speed of light in optical fibre is approximately 200,000 km/s (about 30% slower than in vacuum). A New York to Tokyo round trip (~21,700 km along the geodesic) has a theoretical minimum near ~108ms, regardless of server performance. The actual ~150ms includes longer cable routes, routing overhead, and protocol processing. This is why CDN edge nodes exist — to bring responses physically closer to the user.
What is the N+1 query problem and why does it matter for latency?
N+1 occurs when code executes 1 query to fetch a list, then N additional queries to fetch details for each item. A single PostgreSQL query at 1ms looks fine in isolation. But fetching 50 user records in a loop = 50ms just for database calls, consuming half a 100ms SLA. Detection: look for loops containing database calls. Fix: use batched queries (SELECT WHERE id IN (...)) or eager loading.
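The arithmetic in that answer, as a sketch — one list query plus one detail query per item:

```typescript
// Total database time for the N+1 pattern: 1 list query + N per-item queries.
function nPlusOneMs(items: number, perQueryMs: number): number {
  return (1 + items) * perQueryMs
}

// 50 items at 1 ms/query: the loop alone costs 50 ms, 51 ms with the list query —
// half of a 100 ms SLA gone before any application work.
console.log(nPlusOneMs(50, 1)) // 51
```

Modelled in this calculator, that is a single database hop with its iterations field set to 51.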
What is AWS Lambda cold start and how does it affect latency budgets?
Lambda cold start is the initialization time when a new Lambda execution environment is created. JVM runtimes (Java/Kotlin): 500ms–1s. Node.js/Python: 100–300ms. During cold start, the function cannot serve requests. For latency-sensitive APIs, cold starts can breach SLAs by 10×. Mitigations: provisioned concurrency, Lambda SnapStart (JVM), or using lighter runtimes like Node.js.
