Consumer lag (Kafka)

Consumer lag is the per-partition difference between the Log End Offset (the latest message written to a partition) and the Consumer Committed Offset (the last position the consumer group has confirmed processing). Total group lag is the sum of the per-partition lags. Growing lag is the primary indicator that consumers cannot keep up with the produce rate.

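**Formula:** `lag_per_partition = log_end_offset - committed_offset`. `lag_growth_rate = produce_rate - min(consumers, partitions) × per_consumer_throughput` (msg/s), where `min(consumers, partitions)` is the number of consumers that can actually be assigned a partition.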

**Precision note:** Lag should technically be measured against the **high watermark (HWM)**, not the Log End Offset (LEO). Messages between HWM and LEO are written but not yet replicated to all in-sync replicas — not yet consumable. The correct formula is:

`consumer_lag = high_watermark_offset - committed_offset`

In practice the difference is small (milliseconds of replication lag) but relevant when setting tight SLO thresholds on lag alerting.

Formula

`lag_per_partition = high_watermark_offset - committed_offset`

`lag_growth_rate = produce_rate - min(consumers, partitions) × per_consumer_throughput` (msg/s)
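
The two formulas can be sketched in a few lines of Python. This is an illustration only: the function names and the sample offsets are hypothetical, and the offsets are assumed to have been fetched already (for example from a Kafka client or a monitoring exporter), so no broker is contacted here.

```python
from typing import Dict


def lag_per_partition(high_watermarks: Dict[int, int],
                      committed: Dict[int, int]) -> Dict[int, int]:
    """Per-partition lag measured against the high watermark."""
    # Clamp at zero: the two offsets are sampled at different moments, so a
    # committed offset can briefly read ahead of a stale watermark sample.
    return {p: max(hwm - committed[p], 0) for p, hwm in high_watermarks.items()}


def lag_growth_rate(produce_rate: float, consumers: int, partitions: int,
                    per_consumer_throughput: float) -> float:
    """Lag growth in msg/s; a negative value means the backlog is shrinking."""
    effective = min(consumers, partitions)  # consumers beyond partitions are idle
    return produce_rate - effective * per_consumer_throughput


# Hypothetical sample offsets for a three-partition topic.
high_watermarks = {0: 1500, 1: 1200, 2: 900}
committed = {0: 1400, 1: 1200, 2: 650}

lags = lag_per_partition(high_watermarks, committed)
total = sum(lags.values())
growth = lag_growth_rate(produce_rate=1000.0, consumers=4, partitions=4,
                         per_consumer_throughput=200.0)

print(lags)    # {0: 100, 1: 0, 2: 250}
print(total)   # 350
print(growth)  # 200.0
```

The growth-rate call uses the same numbers as the worked example in this entry: 1,000 msg/s produced against 4 × 200 = 800 msg/s consumed.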

Why it matters in practice

Lag is a leading indicator: it shows trouble approaching before users notice delayed processing. At a produce rate of 1,000 messages/s and a group throughput of 800 messages/s, lag grows at 200 messages/s. Starting from zero lag, a 100,000-message threshold is breached in 500 seconds, which gives roughly 8 minutes of warning before the system degrades. The hidden danger is log retention: if lag grows so large that the unconsumed messages age past `log.retention.hours` (default 168 hours), the broker deletes them before the consumer reads them. That is data loss, and no exception is thrown at produce time.
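
The arithmetic above can be checked with a small sketch. The function names are hypothetical, and the retention check is a simplification: it assumes time-based retention only and ignores that Kafka deletes whole log segments, so the real deletion point can be later than this estimate.

```python
def seconds_until_threshold(current_lag: float, threshold: float,
                            growth_rate: float) -> float:
    """Seconds until lag reaches the alert threshold at a constant growth rate."""
    if current_lag >= threshold:
        return 0.0              # already breached
    if growth_rate <= 0:
        return float("inf")     # lag is flat or shrinking, never breached
    return (threshold - current_lag) / growth_rate


def retention_margin_seconds(time_lag_s: float,
                             retention_hours: float = 168.0) -> float:
    """Seconds left before the oldest unconsumed message ages out of retention.

    time_lag_s is the age of the oldest unconsumed message. The default of
    168 hours mirrors the log.retention.hours default.
    """
    return retention_hours * 3600.0 - time_lag_s


warning = seconds_until_threshold(current_lag=0, threshold=100_000,
                                  growth_rate=200.0)
print(warning)                 # 500.0
print(round(warning / 60, 1))  # 8.3

# A consumer that is one hour behind still has most of the window left.
print(retention_margin_seconds(time_lag_s=3600.0))  # 601200.0
```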

Common mistakes

• Monitoring only total group lag without a per-partition breakdown: a group with 12 partitions can have zero lag on 11 and critical lag on 1 because of partition skew, and still appear healthy at the group level.
• Ignoring lag growth rate and throughput in favour of the absolute lag value: 50,000 messages of lag means little without knowing consumer throughput. At 10,000 msg/s that is 5 seconds of backlog; at 100 msg/s it is 500 seconds.
• Adding consumer instances beyond the partition count: within a consumer group each partition is assigned to at most one consumer, so extra consumers sit idle and contribute no additional throughput.
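
The three mistakes above map to three small checks, sketched below. The helper names and the sample numbers are hypothetical; the per-partition lags are assumed to come from whatever monitoring source is already in place.

```python
from typing import Dict, Tuple


def worst_partition(lags: Dict[int, int]) -> Tuple[int, int]:
    """Return (partition, lag) for the most lagging partition."""
    return max(lags.items(), key=lambda kv: kv[1])


def time_lag_seconds(lag_messages: float, consume_rate: float) -> float:
    """Convert a message backlog into seconds of work at the given consume rate."""
    if consume_rate <= 0:
        return float("inf")  # a stalled consumer never drains the backlog
    return lag_messages / consume_rate


def effective_consumers(consumers: int, partitions: int) -> int:
    """Consumers in one group that can actually be assigned a partition."""
    return min(consumers, partitions)


# Mistake 1: skew hidden by an aggregate. 11 healthy partitions, 1 stuck.
lags = {p: 0 for p in range(12)}
lags[7] = 60_000
mean_lag = sum(lags.values()) / len(lags)
print(mean_lag)               # 5000.0
print(worst_partition(lags))  # (7, 60000)

# Mistake 2: the same absolute lag at two different throughputs.
print(time_lag_seconds(50_000, 10_000))  # 5.0
print(time_lag_seconds(50_000, 100))     # 500.0

# Mistake 3: 16 consumers on a 12-partition topic leaves 4 idle.
idle = 16 - effective_consumers(16, 12)
print(idle)  # 4
```

Alerting on the worst partition and on time lag, rather than on the group total alone, is one way to catch the first two cases; the exact thresholds depend on the workload.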

Related Terms

Consumer offset (Kafka)

Kafka time lag

Effective consumers (Kafka lag)
