dkduckkit.dev

Kafka Consumer Lag: How to Calculate Time-to-Overflow Before the Incident

A practical guide to consumer lag math: partition bottleneck, max.poll.interval.ms misconfiguration, and the alerts you need before production incidents.

It's 3:00 AM. A P1 alert fires — not a database outage, but a consumer group at 2.3 million messages of lag with 8 minutes to overflow. At current ingestion velocity, unconsumed messages will hit the retention limit and be purged permanently. Restarting pods won't help. This is a structural throughput failure.

Here's the math you need to avoid being in that position.

What is consumer lag and why it grows

Consumer lag is the delta between the log end offset (the latest produced message) and the consumer group's last committed offset. Per the official Apache Kafka documentation, lag is the distance between a partition's high watermark (the highest offset replicated to all in-sync replicas, and the furthest point a consumer may read) and the current consumer position.

Use this formula as a manual Kafka consumer lag calculator to quantify your consumption deficit:

lag_growth_rate = incomingRatePerSecond
                − min(consumerInstances, partitionCount) × (1000 / processingTimeMs)

Each variable defines a specific architectural limit:

  • incomingRatePerSecond — aggregate producer velocity into the topic
  • consumerInstances — active pods/processes assigned to partitions
  • partitionCount — ceiling of consumer parallelism (Kafka's fundamental constraint)
  • processingTimeMs — wall-clock time per record including DB commits, external API calls

If this formula returns a positive number, lag grows linearly. If you know your broker's retention (by size or time), you can compute exactly when unconsumed messages will be deleted. "We have lag" becomes "we have 42 minutes."
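Under the assumptions above (steady producer rate, sequential per-record processing), the formula translates into a small helper. This is a sketch, not production code; `retention_headroom_msgs` is an input you supply, i.e. the message count your broker's retention settings allow before deletion:

```python
def time_to_overflow(incoming_rate_per_sec, consumer_instances,
                     partition_count, processing_time_ms,
                     current_lag, retention_headroom_msgs):
    """Seconds until unconsumed messages hit the retention limit,
    or None if the group is keeping up (lag flat or shrinking)."""
    # Only min(consumers, partitions) instances do useful work
    effective_consumers = min(consumer_instances, partition_count)
    # Each effective consumer drains 1000 / processingTimeMs records/sec
    drain_rate = effective_consumers * (1000.0 / processing_time_ms)
    lag_growth = incoming_rate_per_sec - drain_rate
    if lag_growth <= 0:
        return None  # throughput deficit is zero or negative
    return (retention_headroom_msgs - current_lag) / lag_growth
```

With 10,000 msgs/s incoming, 20 pods on 12 partitions, and 2 ms per record, the group drains 6,000 msgs/s and the deficit is 4,000 msgs/s; at 2.3 M lag against 12.38 M of headroom, that is 2,520 seconds, i.e. 42 minutes.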

The partition bottleneck — when adding consumers doesn't help

The instinctive response to a lag spike is scaling out the consumer deployment. If you run on Kubernetes, doubling the replica count feels like the right lever. It isn't — unless you have spare partitions.

Kafka assigns each partition to exactly one consumer within a group:

effectiveConsumers = min(consumerInstances, partitions)

With a 12-partition topic and 20 consumer pods, only 12 pods process records. The remaining 8 idle pods consume CPU for heartbeats and contribute zero throughput. As Confluent and LinkedIn Engineering have documented, this is why partition over-provisioning is critical during design — not during incidents.

If your Kafka consumer lag calculator shows a deficit even when consumers == partitions, your only levers are:

  1. Reduce processingTimeMs via batching or async I/O
  2. Increase partition count — which risks temporarily worsening lag through rebalances and breaking key-ordering guarantees
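As a sketch, the min() ceiling and its corollary, the per-record time budget needed to break even at full parallelism, look like this (function names are illustrative, not a Kafka API):

```python
def effective_consumers(consumer_instances, partitions):
    # Kafka assigns each partition to exactly one consumer in a group,
    # so instances beyond the partition count idle at zero throughput.
    return min(consumer_instances, partitions)

def breakeven_processing_ms(incoming_rate_per_sec, partitions):
    # With consumers == partitions, lag stops growing only when each
    # consumer handles a record within this many milliseconds.
    return 1000.0 * partitions / incoming_rate_per_sec
```

For the 12-partition, 20-pod example: only 12 consumers are effective, and at 6,000 msgs/s incoming each one must stay under a 2 ms per-record budget just to hold lag flat.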

The max.poll.interval.ms trap

Sometimes lag grows while consumers appear healthy: high CPU, but zero offset progress. This is the max.poll.interval.ms trap, rooted in KIP-62, which separated the heartbeat thread from the processing loop.

Since KIP-62, the consumer client tracks the time between poll() calls. If your batch processing time exceeds max.poll.interval.ms (default 5 minutes, i.e. 300,000 ms), the consumer is evicted from the group, even while its background heartbeats look healthy.

Calculate your eviction risk:

batchProcessingTimeMs = processingTimeMs × max.poll.records

If batchProcessingTimeMs > max.poll.interval.ms, you enter a rebalance loop: the consumer finishes work, tries to commit, discovers it no longer owns those partitions, rejoins the group, processes the same heavy batch, times out again. Zero offset progress, infinite rebalance churn.
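A quick self-check of that inequality, using Kafka's shipped defaults for max.poll.records (500) and max.poll.interval.ms (300,000 ms); the 0.8 safety factor is an illustrative choice, not a Kafka setting:

```python
def eviction_risk(processing_time_ms, max_poll_records=500,
                  max_poll_interval_ms=300_000):
    """True if a full batch cannot finish before the poll deadline."""
    batch_ms = processing_time_ms * max_poll_records
    return batch_ms > max_poll_interval_ms

def safe_max_poll_records(processing_time_ms,
                          max_poll_interval_ms=300_000,
                          safety_factor=0.8):
    # Leave headroom: finishing at exactly the deadline is still
    # one GC pause away from a rebalance.
    return max(1, int(safety_factor * max_poll_interval_ms
                      / processing_time_ms))
```

At 800 ms per record, the default batch of 500 takes 400 seconds, past the 300-second deadline; capping max.poll.records at roughly 300 restores headroom without touching the timeout.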

This cascade was a contributing factor in documented Kafka incidents where slow processing halted consumption across critical consumer groups despite all consumers being "up."

Setting up lag alerts that matter

Alerting on raw lag numbers is a trap. A lag of 1 million messages may be acceptable for a logging topic but catastrophic for a payment service. Alert on trajectory, not magnitude.

The Fix — copy-paste ready
```yaml
groups:
  - name: kafka_lag_alerts
    rules:
      # Alert if lag will reach the retention danger zone within 20 minutes
      - alert: KafkaConsumerLagCriticalTrajectory
        expr: |
          predict_linear(kafka_consumergroup_group_lag[10m], 1200) > 10000000
          and
          deriv(kafka_consumergroup_group_lag[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Predictive overflow: {{ $labels.group }}"
          description: >
            Consumer group {{ $labels.group }} is projected to exceed
            retention limits in under 20 minutes.

      # Alert on a stalled group (lag exists but no offset commits)
      - alert: KafkaConsumerGroupStalled
        expr: |
          sum by (group) (kafka_consumergroup_group_lag) > 5000
          and
          sum by (group) (rate(kafka_consumergroup_group_offset[5m])) == 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Consumer group stalled: {{ $labels.group }}"
          description: >
            Group {{ $labels.group }} has significant lag but zero offset
            commits in 10 minutes. Check for max.poll.interval.ms timeouts.
```

Note the use of deriv() rather than rate() on the lag metric: lag is a gauge that moves in both directions, and rate() is defined for monotonically increasing counters.

The predict_linear rule shifts your response from "how much lag do we have?" to "when will this lag break us?" — the difference between reactive firefighting and proactive platform engineering.
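For intuition, predict_linear is just a least-squares line through the window, extrapolated forward. A minimal Python equivalent (a sketch that glosses over PromQL details such as extrapolating from the evaluation timestamp rather than the last sample):

```python
def predict_linear(samples, horizon_s):
    """Least-squares projection of (timestamp, value) samples to
    horizon_s seconds past the last sample -- the same idea as
    PromQL's predict_linear()."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    # Ordinary least-squares slope and intercept
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    t_last = samples[-1][0]
    return intercept + slope * (t_last + horizon_s)
```

Feed it lag samples growing at 1 msg/s and a 1,200-second horizon, and it projects exactly 1,200 messages of additional lag: the trajectory, not the snapshot, is what trips the alert.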

For an interactive version of these formulas with failure simulation, use the Kafka Consumer Lag Predictor.
