
Kafka Consumer Lag Predictor

Predict when your consumer group will overflow. Size consumers and simulate failures before production incidents.

Last updated: March 2026

TL;DR

The predictor compares produce rate to effective consume rate (capped by partitions) to estimate how fast lag grows and when you hit dangerous backlog.

Formula: Lag growth rate ≈ max(0, produceRate − min(consumers, partitions) × per-consumer throughput).

When to use this

  • Planning consumer count and partition count before peak traffic.
  • Simulating instance loss to see time-to-threshold on lag alerts.

How the math works

LaTeX model and TypeScript reference — same logic as the calculator on this page.

This describes the implementation behind the numbers as of 2026-03-26. It is engineering documentation, not legal or compliance advice.

Specification citation

The logic is our proprietary implementation, written against a public specification: the Apache Kafka documentation on consumers.

The snippet below is the core of that calculation engine, verified against Kafka's consumer throughput, partition, and lag concepts.

Model (LaTeX source)
Consumer lag growth (duckkit.dev model)

Let λ_in be produce rate (msg/s), T_inst messages/s per consumer instance
(including batch mode via maxPollRecords), C consumers, P partitions.

Effective consumers: C_eff = min(C, P)
Group throughput: T_g = C_eff · T_inst
Lag growth rate: λ_lag = λ_in − T_g (msg/s; clamped to 0 in UI when recovering)

Time until lag reaches threshold L_max from current lag L:
t_overflow = max(0, L_max − L) / λ_lag   when λ_lag > 0
Reference implementation (TypeScript, excerpt from shipped modules)
// lib/kafka-lag-predictor/current-state.ts — throughput & lagRate
const effectiveConsumers = Math.min(consumerInstances, partitionCount)
const groupThroughput = parseFloat(
  (effectiveConsumers * instanceThroughput).toFixed(2),
)
const lagRate = parseFloat(
  (incomingRatePerSecond - groupThroughput).toFixed(2),
)

// lib/kafka-lag-predictor/predictions.ts — time to max tolerated lag
let timeToOverflow: number | null = null
if (lagRate > 0 && maxToleratedLag > 0) {
  const remainingLagBudget = Math.max(0, maxToleratedLag - currentLag)
  timeToOverflow =
    remainingLagBudget > 0
      ? parseFloat((remainingLagBudget / lagRate).toFixed(1))
      : 0
}


At a glance

  • Time to threshold: 1m 53s
  • Lag rate: 880.0 msg/s
  • Headroom: -733%

Configuration

The calculator takes traffic & processing inputs, a processing mode, infrastructure settings (including a partition skew factor), and a failure simulation. Skew factor: 1.0 = perfectly even; 1.2 = heaviest partition gets 20% more than average. Check your Kafka monitoring dashboard for the actual ratio.

Results

Current state

  • Group throughput: 120.00 msg/s
  • Lag rate: 880.00 msg/s
  • Headroom: -733.3%
  • Utilization: 100.0%

Critical: Lag may hit the threshold in ~2m. Increase to at least 14 instances or add partitions (target ≥ 50).

Predictions

  • Time to threshold: 1m 53s
  • Lag recovery: N/A
  • Min consumers (steady state): 12
  • Min partitions needed: 50

Failure analysis

[Chart: projected lag (0–120k messages) over 0s–4m, normal state vs. lag threshold]

  • Time to threshold (after failure): 1m 53s
  • Safe failure count: 0
  • Safe zone: 0.0%
  • Partitions per consumer (baseline): 2.00 (even split across active consumers)
  • Partitions per consumer (after failure): 2.00 (when failed instances are removed)
  • Per-consumer partition load increase: 0.0% (vs baseline; latency risk when high)

Recommendations

  • Consumer group cannot scale further without more partitions — need capacity for ~50 consumers but only 12 partition(s). Lag is growing at 880 msg/s.
  • Lag will reach threshold in 1m 53s. Immediate action required.
  • Only 0 instance failure(s) are safe before lag grows. Consider adding at least 13 instances for resilience.
  • This calculation assumes no partition skew (skewFactor 1×). If you see uneven lag per partition in your monitoring, increase skewFactor to match your observed peak/average ratio.
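To make the skew caveat concrete, here is one plausible way to fold a skew factor into the model — an illustrative sketch, not the calculator's actual skew handling: the heaviest partition receives skewFactor times the average per-partition rate, and the single consumer that owns it becomes the bottleneck.

```typescript
// Illustrative sketch (assumption): model the hottest partition's lag growth.
// The function name and signature are hypothetical, not a shipped module.
function hottestPartitionLagRate(
  incomingRate: number,    // msg/s into the whole topic
  partitions: number,      // partition count
  perInstanceRate: number, // msg/s one consumer can process
  skewFactor: number,      // observed peak / average per-partition ratio
): number {
  // The hottest partition gets skewFactor × the average per-partition inflow.
  const hottestInflow = (incomingRate / partitions) * skewFactor
  // Lag on that partition grows by whatever its owner cannot absorb.
  return Math.max(0, hottestInflow - perInstanceRate)
}
```

With 1200 msg/s over 12 partitions and 100 msg/s per consumer, the group looks exactly balanced at skewFactor 1.0, but skewFactor 1.5 makes the hottest partition accumulate 50 msg/s of lag even though aggregate throughput matches aggregate inflow.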

Methodology

Results come from chaining calculateAll: current state (from partitions, consumers, and processing time), then predictions (catch-up and threshold timing), then failure analysis (degraded scenarios), then recommendations. Throughput uses a simplified messages-per-second model that ignores broker byte rates, rebalance churn, and consumer group coordinator delays.
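Under that simplified model, the chain can be sketched roughly like this. Everything except the calculateAll name — the Inputs shape, field names, and return value — is illustrative, not the shipped module's API:

```typescript
// Sketch of the calculation chain (illustrative, not the shipped module).
interface Inputs {
  incomingRatePerSecond: number
  processingTimeMs: number
  consumerInstances: number
  partitionCount: number
  currentLag: number
  maxToleratedLag: number
}

function calculateAll(i: Inputs) {
  // 1. Current state: effective consumers, group throughput, lag rate.
  const effectiveConsumers = Math.min(i.consumerInstances, i.partitionCount)
  const instanceThroughput = 1000 / i.processingTimeMs // simplified model
  const groupThroughput = effectiveConsumers * instanceThroughput
  const lagRate = i.incomingRatePerSecond - groupThroughput

  // 2. Predictions: time until lag crosses the threshold (only while growing).
  const timeToOverflow =
    lagRate > 0 ? Math.max(0, i.maxToleratedLag - i.currentLag) / lagRate : null

  // 3. Sizing: minimum consumers needed to keep up at steady state.
  const minConsumers = Math.ceil(i.incomingRatePerSecond / instanceThroughput)

  return { effectiveConsumers, groupThroughput, lagRate, timeToOverflow, minConsumers }
}
```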

How Kafka consumer lag works

In a consumer group, each partition is assigned to at most one consumer at a time. That cap is why Kafka consumer scaling stops helping once you have at least as many consumers as partitions. Consumer lag is the difference between the last produced offset and the last committed consumer offset: it tells you how far behind the group is. This Kafka lag calculator models steady-state throughput and predicts when backlog would cross your alert threshold — before it shows up in production dashboards.

Why lag accumulates: the math

Effective consumers are min(consumerInstances, partitionCount). Per-instance throughput is roughly 1000 / processingTimeMs messages per second (simplified model). Group throughput is effective consumers times that rate. If the incoming rate exceeds group throughput, lag grows at incomingRate − groupThroughput messages per second. That is the core signal for any Kafka consumer lag investigation: either add capacity (more partitions and consumers) or speed up processing. Use this tool as a planning companion alongside real metrics from your cluster.
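As a worked example, these assumed inputs (100 ms processing time, 20 consumers, 12 partitions, 1000 msg/s incoming) reproduce the kind of numbers shown above:

```typescript
// Worked example of the lag-growth math with assumed sample inputs.
const processingTimeMs = 100
const perInstance = 1000 / processingTimeMs     // 10 msg/s per consumer
const effective = Math.min(20, 12)              // partition ceiling: 12
const groupThroughput = effective * perInstance // 120 msg/s for the group
const lagGrowth = 1000 - groupThroughput        // 880 msg/s of new backlog
console.log({ perInstance, effective, groupThroughput, lagGrowth })
```

Note that 8 of the 20 consumers contribute nothing here: the partition ceiling, not instance count, sets the group's throughput.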

Planning for instance failures

Failures reduce active consumers and therefore group throughput. The simulator shows time to breach your max tolerated lag after losing instances. "Safe failure count" estimates how many instances you can lose before you fall below the steady-state consumer count implied by your load. Headroom matters: even when Kafka consumer lag is flat, low headroom means spikes go straight into growing backlog.
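A minimal sketch of that failure simulation, under the same simplified throughput model (the function name and signature are illustrative, not the shipped module's API):

```typescript
// Sketch: seconds until lag breaches the threshold after losing `failed`
// instances, or null if the surviving group still keeps up. Illustrative only.
function timeToBreachAfterFailure(
  consumers: number,       // instances before the failure
  partitions: number,      // partition count (parallelism ceiling)
  perInstanceRate: number, // msg/s one consumer can process
  incomingRate: number,    // msg/s produced into the topic
  currentLag: number,      // backlog right now (messages)
  maxLag: number,          // max tolerated lag (messages)
  failed: number,          // instances lost in the scenario
): number | null {
  const surviving = Math.max(0, consumers - failed)
  const throughput = Math.min(surviving, partitions) * perInstanceRate
  const lagRate = incomingRate - throughput
  if (lagRate <= 0) return null // lag shrinks or holds steady; no breach
  return Math.max(0, maxLag - currentLag) / lagRate
}
```

For example, a group of 12 consumers on 12 partitions at 100 msg/s each exactly matches 1000 msg/s of inflow with two instances down; losing a third starts a 100 msg/s backlog that reaches a 60k threshold in 600 seconds.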

The partition bottleneck

If you need more parallel consumers than partitions, extra instances stay idle. Kafka does not assign two consumers in the same group to one partition for normal processing, so partition count is the ceiling on parallelism. When sizing topics, plan for future scale; increasing partitions later is possible, but decreasing them is not.

See also: Kafka Message Size Calculator to estimate per-message overhead before sizing your consumer group. Factor in produce/consume and cross-AZ replication latency with the System Latency Budget Calculator.

Consumer saturation: when adding instances stops helping

Consumer saturation is the point at which your consumer group has reached the partition ceiling — adding more consumer instances beyond partitionCount provides zero additional throughput. Each extra instance sits idle, consuming memory and increasing rebalance duration without improving lag recovery.

Effective parallelism is always min(consumerInstances, partitionCount), so saturation begins the moment consumerInstances exceeds partitionCount. If your monitoring shows consumers with 0 messages/s throughput, you have hit consumer saturation. The only fix is to increase the partition count — but note that Kafka does not support decreasing partitions once created, so plan with future growth in mind. A common rule of thumb: provision 20–30% more partitions than your current consumer count to leave room for horizontal scaling without topic recreation.
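That rule of thumb can be sketched as a one-liner; the 25% default here is an assumption within the stated 20–30% range, and the function name is illustrative:

```typescript
// Sketch of the 20–30% partition headroom rule of thumb (illustrative).
// headroom = 0.25 assumes the middle of the stated 20–30% range.
function recommendedPartitions(currentConsumers: number, headroom = 0.25): number {
  // Round up: partition counts are integers, and under-provisioning is the
  // failure mode this rule exists to avoid.
  return Math.ceil(currentConsumers * (1 + headroom))
}
// recommendedPartitions(12) -> 15
```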

The silent killer: partition skew

Kafka distributes messages across partitions based on the message key. With a high-cardinality key like user_id, distribution is typically even. But with a low-cardinality or grouped key like country_id or event_type, one partition may receive 5–10× the average load.

This is why a group-level throughput calculator can look green while one consumer is drowning. Global group_throughput = effectiveConsumers × instanceThroughput tells you nothing about per-partition load. If you see a single consumer lagging while others are nearly idle, partition skew is the most likely culprit.

Detection: compare per-partition lag in your monitoring tool (Grafana, Confluent Control Center, or Kafdrop). If one partition has 10× the lag of others, review your partitioning key. Switching to round-robin assignment (null key) distributes load evenly at the cost of message ordering guarantees.
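A small sketch of that detection step, assuming you can export per-partition lag as an array. The 10× ratio mirrors the heuristic above; comparing against the median (rather than the mean) keeps one hot partition from hiding itself by inflating the average:

```typescript
// Sketch: flag skewed partitions from per-partition lag samples (illustrative).
// Returns the indices of partitions whose lag exceeds `ratio` × the median lag.
function detectSkew(perPartitionLag: number[], ratio = 10): number[] {
  const sorted = [...perPartitionLag].sort((a, b) => a - b)
  const median = sorted[Math.floor(sorted.length / 2)]
  return perPartitionLag
    .map((lag, partition) => ({ partition, lag }))
    .filter(({ lag }) => median > 0 && lag > ratio * median)
    .map(({ partition }) => partition)
}
```

On a lag snapshot like [100, 120, 110, 5000], partition 3 is flagged; on an even snapshot nothing is.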

Rebalance impact and heartbeat timeouts

Every change in consumer group membership — adding an instance, removing one, or a crash — triggers a group rebalance. Under the eager rebalance protocol (the only option before Kafka 2.4, and still the default assignor behavior), all consumers in the group stop processing simultaneously while partitions are reassigned. With large numbers of partitions, this stop-the-world pause can last 10–30 seconds and grows with partition count.

Kafka 2.4+ introduced the cooperative sticky assignor, which performs incremental rebalancing — only the partitions that need to move are reassigned, and consumers continue processing the rest. Enable it with partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor.

The max.poll.interval.ms trap (poll loop)

Kafka uses two separate mechanisms to detect dead consumers. The heartbeat thread sends periodic signals via heartbeat.interval.ms (default: 3s). Separately, max.poll.interval.ms (default: 5 minutes) enforces a maximum time between consecutive poll() calls.

If message processing takes longer than max.poll.interval.ms — for example, a slow database write inside the processing loop — Kafka removes the consumer from the group and triggers a rebalance, even though the heartbeat thread is still sending signals normally. This produces a confusing pattern: the consumer appears alive in application logs, but Kafka repeatedly evicts and re-adds it, causing a rebalance loop.

The fix is to either increase max.poll.interval.ms to match your worst-case processing time, or reduce max.poll.records to process fewer messages per batch. This calculator flags this condition automatically when processingTimeMs × maxPollRecords > maxPollIntervalMs.
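The flagged condition is simple enough to check yourself; a sketch using Kafka's documented 5-minute default (the function name is illustrative, not the calculator's API):

```typescript
// Sketch of the poll-loop eviction check described above (illustrative).
// A batch of maxPollRecords messages, each taking processingTimeMs, must
// finish before max.poll.interval.ms elapses or the consumer is evicted.
function riskOfPollLoopEviction(
  processingTimeMs: number,
  maxPollRecords: number,
  maxPollIntervalMs = 300_000, // Kafka default: 5 minutes
): boolean {
  return processingTimeMs * maxPollRecords > maxPollIntervalMs
}
// riskOfPollLoopEviction(700, 500) -> true (350 000 ms > 300 000 ms)
```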

Kafka vs Pulsar vs RabbitMQ

  • Apache Kafka — ordered partitions with replay; best when you need retention and consumer groups at scale (kafka.apache.org).
  • Apache Pulsar — tiered storage and separated brokers/bookies; strong multi-tenancy patterns.
  • RabbitMQ — flexible routing exchanges; great for task queues when replay semantics differ from log-based systems.

Copy-paste solution: Prometheus alert rule

groups:
  - name: kafka_consumer_lag
    rules:
      - alert: ConsumerLagHigh
        expr: kafka_consumergroup_lag > 1000000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Consumer group lag above threshold — check partitions vs consumers"

Related tools