duckkit.dev

Kafka Consumer Lag Predictor

Predict when your consumer group will overflow. Size consumers and simulate failures before production incidents.

Last updated: March 2026

TL;DR

The predictor compares produce rate to effective consume rate (capped by partitions) to estimate how fast lag grows and when you hit dangerous backlog.

Formula: Lag growth rate ≈ max(0, produceRate − min(consumers, partitions) × per-consumer throughput).

When to use this

  • Planning consumer count and partition count before peak traffic.
  • Simulating instance loss to see time-to-threshold on lag alerts.

How the math works

LaTeX model and TypeScript reference — same logic as the calculator on this page.

This describes the implementation behind the numbers as of 2026-03-26. It is engineering documentation, not legal or compliance advice.

Specification citation

The logic is a proprietary implementation informed by a public specification: the Apache Kafka documentation on consumers.

The snippets on this page show the core logic of the calculation engine, checked against the Kafka documentation's consumer throughput, partition, and lag concepts.

Model (LaTeX source)
Consumer lag growth (duckkit.dev model)

Let λ_in be produce rate (msg/s), T_inst messages/s per consumer instance
(including batch mode via maxPollRecords), C consumers, P partitions.

Effective consumers: C_eff = min(C, P)
Group throughput: T_g = C_eff · T_inst
Lag growth rate: λ_lag = λ_in − T_g (msg/s; clamped to 0 in UI when recovering)

Time until lag reaches threshold L_max from current lag L:
t_overflow = max(0, L_max − L) / λ_lag   when λ_lag > 0

Reference implementation (TypeScript, excerpt from shipped modules)
// lib/kafka-lag-predictor/current-state.ts — throughput & lagRate
const effectiveConsumers = Math.min(consumerInstances, partitionCount)
const groupThroughput = parseFloat(
  (effectiveConsumers * instanceThroughput).toFixed(2),
)
const lagRate = parseFloat(
  (incomingRatePerSecond - groupThroughput).toFixed(2),
)

// lib/kafka-lag-predictor/predictions.ts — time to max tolerated lag
let timeToOverflow: number | null = null
if (lagRate > 0 && maxToleratedLag > 0) {
  const remainingLagBudget = Math.max(0, maxToleratedLag - currentLag)
  timeToOverflow =
    remainingLagBudget > 0
      ? parseFloat((remainingLagBudget / lagRate).toFixed(1))
      : 0
}
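To run the numbers yourself, here is a self-contained sketch that wires the two excerpts together. The wrapper function and input type (`predictLag`, `LagInputs`) are our illustrative names, not the shipped module; the example inputs are chosen so the output matches the scenario shown on this page.

```typescript
interface LagInputs {
  incomingRatePerSecond: number
  consumerInstances: number
  partitionCount: number
  instanceThroughput: number // msg/s one consumer instance can process
  currentLag: number
  maxToleratedLag: number
}

// Same logic as the excerpts above: cap consumers at partition count,
// derive group throughput and lag growth, then time to the lag budget.
function predictLag(i: LagInputs) {
  const effectiveConsumers = Math.min(i.consumerInstances, i.partitionCount)
  const groupThroughput = parseFloat(
    (effectiveConsumers * i.instanceThroughput).toFixed(2),
  )
  const lagRate = parseFloat(
    (i.incomingRatePerSecond - groupThroughput).toFixed(2),
  )

  let timeToOverflow: number | null = null
  if (lagRate > 0 && i.maxToleratedLag > 0) {
    const remainingLagBudget = Math.max(0, i.maxToleratedLag - i.currentLag)
    timeToOverflow =
      remainingLagBudget > 0
        ? parseFloat((remainingLagBudget / lagRate).toFixed(1))
        : 0
  }
  return { effectiveConsumers, groupThroughput, lagRate, timeToOverflow }
}

// 1000 msg/s incoming, 12 consumers on 12 partitions at 10 msg/s each,
// 100k messages of tolerated lag:
const r = predictLag({
  incomingRatePerSecond: 1000,
  consumerInstances: 12,
  partitionCount: 12,
  instanceThroughput: 10,
  currentLag: 0,
  maxToleratedLag: 100000,
})
console.log(r.lagRate, r.timeToOverflow) // 880 113.6  (~1m 53s)
```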


At a glance

  • Time to threshold: 1m 53s
  • Lag rate: 880.0 msg/s
  • Headroom: -733%

Configuration

Inputs are grouped into traffic & processing, processing mode, infrastructure, and failure simulation.

Skew factor: 1.0 = perfectly even. 1.2 = heaviest partition gets 20% more than average. Check your Kafka monitoring dashboard for the actual ratio.

Results

Current state

  • Group throughput: 120.00 msg/s
  • Lag rate: 880.00 msg/s
  • Headroom: -733.3%
  • Utilization: 100.0%

Critical: Lag may hit the threshold in ~2m. Increase to at least 14 instances or add partitions (target ≥ 50).

Predictions

  • Time to threshold: 1m 53s
  • Lag recovery: N/A
  • Min consumers (steady state): 12
  • Min partitions needed: 50

Failure analysis

[Chart: projected lag vs. time against the lag threshold, normal state.]

  • Time to threshold (after failure): 1m 53s
  • Safe failure count: 0
  • Safe zone: 0.0%
  • Partitions per consumer (baseline): 2.00 (even split across active consumers)
  • Partitions per consumer (after failure): 2.00 (when failed instances are removed)
  • Per-consumer partition load increase: 0.0% vs baseline (latency risk when high)

Recommendations

  • Consumer group cannot scale further without more partitions — need capacity for ~50 consumers but only 12 partition(s). Lag is growing at 880 msg/s.
  • Lag will reach threshold in 1m 53s. Immediate action required.
  • Only 0 instance failure(s) are safe before lag grows. Consider adding at least 13 instances for resilience.
  • This calculation assumes 0% partition skew (skewFactor 1×). If you see uneven lag per partition in your monitoring, increase skewFactor to match your observed peak/average ratio.

Methodology

Results are produced by a calculateAll chain: current state from partitions, consumers, and processing time; then predictions for catch-up and threshold timing; then failure analysis for degraded scenarios; then recommendations. Throughput uses a simplified messages-per-second model (it ignores broker byte rates, rebalance churn, and consumer group coordinator delays).

How Kafka consumer lag works

In a consumer group, each partition is assigned to at most one consumer at a time. That cap is why adding Kafka consumers stops helping once you have at least as many consumers as partitions. Consumer lag is the difference between the high watermark (log end offset) and the last committed consumer offset: it tells you how far behind the group is. This lag calculator models steady-state throughput and when backlog would cross your alert threshold — before it shows up in production dashboards.

Why lag accumulates: the math

Effective consumers = min(consumerInstances, partitionCount). Per-instance throughput is roughly 1000 / processingTimeMs messages per second (simplified model). Group throughput is effective consumers times that rate. If the incoming rate exceeds group throughput, lag grows at incomingRate − groupThroughput messages per second. That is the core signal for any consumer lag investigation: either add capacity (more partitions and consumers) or speed up processing. Use this tool as a planning companion alongside real metrics from your cluster.
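As a sanity check on that arithmetic, here is a minimal sketch of the sizing rule under the simplified model (the function name is ours, not the tool's):

```typescript
// Minimum consumers needed at steady state, assuming each instance
// handles 1000 / processingTimeMs messages per second.
function minConsumersNeeded(
  incomingRatePerSecond: number,
  processingTimeMs: number,
): number {
  const perInstance = 1000 / processingTimeMs // msg/s per consumer
  return Math.ceil(incomingRatePerSecond / perInstance)
}

// 1000 msg/s at 50 ms per message: each instance handles 20 msg/s,
// so you need 50 consumers, and therefore at least 50 partitions.
console.log(minConsumersNeeded(1000, 50)) // 50
```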

Planning for instance failures

Failures reduce active consumers and therefore group throughput. The simulator shows time to breach your max tolerated lag after losing instances. "Safe failure count" estimates how many instances you can lose before you fall below the steady-state consumer count implied by your load. Headroom matters: even when consumer lag is flat, low headroom means spikes go straight into growing backlog.
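The failure simulation boils down to recomputing effective consumers from the survivors. A hedged sketch (names are ours, not the shipped module's):

```typescript
// After losing `failures` instances, capacity drops by one instance's
// throughput each, still capped by the partition count.
function lagRateAfterFailures(
  failures: number,
  consumerInstances: number,
  partitionCount: number,
  instanceThroughput: number,
  incomingRatePerSecond: number,
): number {
  const survivors = Math.max(0, consumerInstances - failures)
  const effective = Math.min(survivors, partitionCount)
  return incomingRatePerSecond - effective * instanceThroughput
}

// 6 consumers at 200 msg/s each against 1000 msg/s incoming:
console.log(lagRateAfterFailures(0, 6, 12, 200, 1000)) // -200 (recovering)
console.log(lagRateAfterFailures(1, 6, 12, 200, 1000)) // 0 (flat, zero headroom)
console.log(lagRateAfterFailures(2, 6, 12, 200, 1000)) // 200 (growing) → safe failure count is 1
```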

The partition bottleneck

If you need more parallel consumers than partitions, extra instances stay idle. Kafka does not assign two consumers in the same group to one partition for normal processing, so partition count is the ceiling on parallelism. When sizing topics, plan for future scale; increasing partitions later is possible, but decreasing them is not.

See also: Kafka Message Size Calculator to estimate per-message overhead before sizing your consumer group. Factor in produce/consume and cross-AZ replication latency with the System Latency Budget Calculator.

Consumer saturation: when adding instances stops helping

Consumer saturation is the point at which your consumer group has reached the partition ceiling — adding more consumer instances beyond partitionCount provides zero additional throughput. Each extra instance sits idle, consuming memory and increasing rebalance duration without improving lag recovery. Monitor max.poll.interval.ms to ensure consumers don't get marked dead during long processing. Each rebalance temporarily stops consumption while partitions are reassigned.

Effective parallelism is always min(consumerInstances, partitionCount), so partitionCount is the saturation point. If your monitoring shows consumers with 0 messages/s throughput, you have hit consumer saturation. The only fix is to increase the partition count — but note that Kafka does not support decreasing partitions once created, so plan with future growth in mind. A common rule of thumb: provision 20–30% more partitions than your current consumer count to leave room for horizontal scaling without topic recreation.
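The saturation check and the headroom rule of thumb can be sketched as follows (the function name and the 25% midpoint are our choices, not the tool's):

```typescript
// Count idle instances past the partition ceiling and suggest a
// partition count with ~25% headroom over current consumers.
function partitionPlan(consumerInstances: number, partitionCount: number) {
  const active = Math.min(consumerInstances, partitionCount)
  const idle = consumerInstances - active // saturated if > 0
  const suggestedPartitions = Math.ceil(consumerInstances * 1.25)
  return { active, idle, saturated: idle > 0, suggestedPartitions }
}

// 16 consumers on 12 partitions: 4 instances sit idle.
console.log(partitionPlan(16, 12))
// { active: 12, idle: 4, saturated: true, suggestedPartitions: 20 }
```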

The silent killer: partition skew

Kafka distributes messages across partitions based on the message key. With a high-cardinality key like user_id, distribution is typically even. But with a low-cardinality or grouped key like country_id or event_type, one partition may receive 5–10× the average load.

This is why a group-level throughput calculator can look green while one consumer is drowning. The global groupThroughput = effectiveConsumers × instanceThroughput tells you nothing about per-partition load. If you see a single consumer lagging while others are nearly idle, partition skew is the most likely culprit.

Detection: compare per-partition lag in your monitoring tool (Grafana, Confluent Control Center, or Kafdrop). If one partition has 10× the lag of others, review your partitioning key. Switching to round-robin assignment (null key) distributes load evenly at the cost of message ordering guarantees.
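The skew effect above can be sketched directly, where skewFactor is the hottest partition's load relative to the per-partition average (the function name is illustrative):

```typescript
// A group-level calculator averages load across partitions; this checks
// whether the hottest partition alone exceeds one consumer's capacity.
function hottestPartitionOverloaded(
  incomingRatePerSecond: number,
  partitionCount: number,
  skewFactor: number, // hottest partition load / per-partition average
  instanceThroughput: number, // msg/s one consumer can process
): boolean {
  const avgPerPartition = incomingRatePerSecond / partitionCount
  return avgPerPartition * skewFactor > instanceThroughput
}

// 1200 msg/s over 12 partitions averages 100 msg/s per partition; with a
// 5× hot key one partition sees 500 msg/s, far beyond a 150 msg/s consumer.
console.log(hottestPartitionOverloaded(1200, 12, 5, 150)) // true
console.log(hottestPartitionOverloaded(1200, 12, 1, 150)) // false (even spread)
```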

Rebalance impact and heartbeat timeouts

Every change in consumer group membership — adding an instance, removing one, or a crash — triggers a group rebalance. Under the eager rebalance protocol (the only option before Kafka 2.4, and still the default afterwards), all consumers in the group stop processing simultaneously while partitions are reassigned. With large numbers of partitions, this stop-the-world pause can last 10–30 seconds and grows with partition count.

Kafka 2.4+ introduced the cooperative sticky assignor, which performs incremental rebalancing — only the partitions that need to move are reassigned, and consumers continue processing the rest. Enable it with partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor.

The max.poll.interval.ms trap (poll loop)

Kafka uses two separate mechanisms to detect dead consumers. The heartbeat thread sends periodic signals via heartbeat.interval.ms (default: 3s). Separately, max.poll.interval.ms (default: 5 minutes) enforces a maximum time between consecutive poll() calls.

If message processing takes longer than max.poll.interval.ms — for example, a slow database write inside the processing loop — Kafka removes the consumer from the group and triggers a rebalance, even though the heartbeat thread is still sending signals normally. This produces a confusing pattern: the consumer appears alive in application logs, but Kafka repeatedly evicts and re-adds it, causing a rebalance loop.

The fix is to either increase max.poll.interval.ms to match your worst-case processing time, or reduce max.poll.records to process fewer messages per batch. This calculator flags this condition automatically when processingTimeMs × maxPollRecords > maxPollIntervalMs.
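That flag condition can be expressed directly. The defaults below are Kafka's documented defaults; the function name is ours:

```typescript
// Worst-case batch time is processingTimeMs × maxPollRecords; if it
// exceeds max.poll.interval.ms, Kafka will evict the consumer mid-batch.
function pollIntervalAtRisk(
  processingTimeMs: number,
  maxPollRecords: number, // max.poll.records (Kafka default: 500)
  maxPollIntervalMs: number = 300000, // max.poll.interval.ms (default: 5 min)
): boolean {
  return processingTimeMs * maxPollRecords > maxPollIntervalMs
}

// 800 ms per message × 500 records per poll = 400 s per batch > 300 s limit.
console.log(pollIntervalAtRisk(800, 500)) // true
console.log(pollIntervalAtRisk(100, 500)) // false (50 s per batch)
```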

Kafka vs Pulsar vs RabbitMQ

  • Apache Kafka — ordered partitions with replay; best when you need retention and consumer groups at scale (kafka.apache.org).
  • Apache Pulsar — tiered storage and separated brokers/bookies; strong multi-tenancy patterns.
  • RabbitMQ — flexible routing exchanges; great for task queues when replay semantics differ from log-based systems.

Copy-paste solution

groups:
  - name: kafka_consumer_lag
    rules:
      - alert: ConsumerLagHigh
        expr: kafka_consumergroup_lag > 1000000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Consumer group lag above threshold — check partitions vs consumers"

Frequently asked questions

What is Kafka consumer lag?
Consumer lag is the difference between the Log End Offset (latest produced message) and the Consumer Committed Offset (last processed message) per partition. Lag = 0 means real-time processing. Growing lag means your consumers cannot keep up with the incoming rate. At the group level, total lag = sum of per-partition lags.
[Diagram: produce rate appends to the partition log (LEO grows) while the consumer drains it; lag grows when produce − effective consume > 0.]
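The per-partition definition can be made concrete with a toy example (all offsets hypothetical):

```typescript
// Lag per partition = log end offset (LEO) minus committed offset;
// group-level lag is the sum across partitions.
const logEndOffsets = [1200, 900, 1500] // latest produced offset per partition
const committedOffsets = [1100, 900, 1300] // last committed offset per partition
const perPartitionLag = logEndOffsets.map((leo, p) => leo - committedOffsets[p])
const totalLag = perPartitionLag.reduce((sum, lag) => sum + lag, 0)
console.log(perPartitionLag, totalLag) // [ 100, 0, 200 ] 300
```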
How many Kafka consumers do I need for a given throughput?
Minimum consumers = ceil(incomingRatePerSecond / (1000 / processingTimeMs)). However, Kafka enforces a maximum of 1 consumer per partition per consumer group. If your formula requires more consumers than partitions, additional consumers will be idle. Always ensure partitionCount >= minConsumersNeeded.
Why does increasing consumer instances sometimes not reduce lag?
Consumer saturation: once consumerInstances > partitionCount, additional instances are idle and contribute zero throughput. Kafka assigns partitions exclusively to one consumer per group. To scale beyond your current partition count, you must first increase partitions (which cannot be decreased later).
What is max.poll.interval.ms and why does it cause rebalances?
max.poll.interval.ms (default: 300 seconds) is the maximum time between consecutive poll() calls. If processing a batch takes longer than this value, Kafka considers the consumer dead and triggers a group rebalance — even if the heartbeat thread is still running. This is separate from session.timeout.ms. Fix: increase max.poll.interval.ms or reduce max.poll.records to process smaller batches.
What is a consumer group rebalance and how long does it take?
A rebalance is triggered by any membership change: consumer added, removed, or crashed. During rebalance, all consumers in the group pause processing (Eager Rebalance Protocol, the default). Duration depends on partition count and number of consumers — typically 5–30 seconds. Kafka 2.4+ introduced Cooperative Sticky Assignor which rebalances incrementally, reducing pause time.
Can consumer lag cause data loss?
Yes, indirectly. If lag grows beyond the topic's retention period (log.retention.hours, default 168h), older messages are deleted before being consumed. The consumer's committed offset can then become invalid, producing an offset out-of-range error. Monitor lag relative to retention: if the backlog would take longer to drain than the retention window, data loss is possible.

Related tools