Consumer lag (Kafka)
Consumer lag is the per-partition difference between the Log End Offset (the offset of the latest message written to the partition) and the Consumer Committed Offset (the last position the consumer group has confirmed processing). Total group lag is the sum of the per-partition lags. Growing lag is the primary indicator that consumers cannot keep up with the produce rate.
**Formula:** `lag_per_partition = log_end_offset − committed_offset`. `lag_growth_rate = produce_rate − min(consumers, partitions) × per_consumer_throughput` (msg/s).
**Precision note:** Lag should technically be measured against the **high watermark (HWM)**, not the Log End Offset (LEO). Messages between HWM and LEO are written but not yet replicated to all in-sync replicas — not yet consumable. The correct formula is:
`consumer_lag = high_watermark_offset - committed_offset`
In practice the difference is small (milliseconds of replication lag) but relevant when setting tight SLO thresholds on lag alerting.
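The per-partition calculation can be sketched as follows. This is a minimal illustration that assumes the high watermark and committed offsets have already been fetched (for example through a Kafka client or admin tool) and are passed in as plain dicts keyed by partition number. The function names and sample offsets are hypothetical and not part of any Kafka API. Clamping at zero is also an assumption: the two offsets are typically read at slightly different moments, so a naive subtraction can briefly go negative.

```python
from typing import Dict


def partition_lags(high_watermarks: Dict[int, int], committed: Dict[int, int]) -> Dict[int, int]:
    """Return lag per partition as high_watermark_offset - committed_offset."""
    return {
        # Clamp at 0 (assumption): offsets sampled at different times can disagree briefly.
        partition: max(0, hwm - committed[partition])
        for partition, hwm in high_watermarks.items()
    }


def total_group_lag(lags: Dict[int, int]) -> int:
    """Total group lag is the sum of the per-partition lags."""
    return sum(lags.values())


# Illustrative offsets for a 3-partition topic (made-up numbers).
high_watermarks = {0: 1500, 1: 1200, 2: 900}
committed = {0: 1450, 1: 1200, 2: 400}

lags = partition_lags(high_watermarks, committed)
print(lags)                   # {0: 50, 1: 0, 2: 500}
print(total_group_lag(lags))  # 550
```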
Formula
lag_per_partition = high_watermark_offset - committed_offset
lag_growth_rate = produce_rate - min(consumers, partitions) × per_consumer_throughput
Why it matters in practice
Lag is a leading indicator: it signals approaching trouble before users notice delayed processing. At a produce rate of 1,000 messages/s and a group throughput of 800 messages/s, lag grows at 200 messages/s. Starting from zero lag, a 100,000-message threshold is breached in 500 seconds, which gives roughly 8 minutes of warning before the system degrades. The hidden danger is log retention: if lag grows so large that unconsumed messages age past `log.retention.hours` (default 168 hours), the broker deletes them before the consumer reads them. That is data loss, and no exception is thrown at produce time.
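The arithmetic above can be reproduced with a short sketch. The split of the 800 messages/s group throughput into 4 consumers at 200 messages/s each, and the 12-partition topic, are assumptions made for illustration; the function names are hypothetical.

```python
from typing import Optional


def lag_growth_rate(produce_rate: float, consumers: int, partitions: int,
                    per_consumer_throughput: float) -> float:
    """lag_growth_rate = produce_rate - min(consumers, partitions) * per_consumer_throughput (msg/s)."""
    active_consumers = min(consumers, partitions)  # consumers beyond the partition count are idle
    return produce_rate - active_consumers * per_consumer_throughput


def seconds_until_threshold(current_lag: float, threshold: float,
                            growth_rate: float) -> Optional[float]:
    """Seconds until lag reaches the threshold, or None if lag is not growing."""
    if current_lag >= threshold:
        return 0.0
    if growth_rate <= 0:
        return None  # lag is flat or shrinking, so the threshold is never reached
    return (threshold - current_lag) / growth_rate


# Assumed setup: 4 consumers at 200 msg/s each on a 12-partition topic.
rate = lag_growth_rate(produce_rate=1000, consumers=4, partitions=12,
                       per_consumer_throughput=200)
print(rate)                                       # 200
print(seconds_until_threshold(0, 100_000, rate))  # 500.0
```

A negative growth rate means the group is draining its backlog, which is why the second function returns `None` in that case instead of a time.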
Common mistakes
- Monitoring only total group lag without a per-partition breakdown: a group with 12 partitions can have zero lag on 11 and critical lag on 1 due to partition skew, and still appear healthy at the group level.
- Ignoring lag growth rate in favour of absolute lag value: 50,000 messages of lag means nothing without knowing throughput. At 10,000 msg/s that is 5 seconds of backlog; at 100 msg/s it is 500 seconds.
- Adding consumer instances beyond the partition count: within a consumer group each partition is assigned to at most one consumer, so the extra consumers sit idle and contribute zero additional throughput.
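The first two mistakes can be illustrated with a small sketch: one helper surfaces the worst partition that a group-level total would hide, and another converts a message count into seconds of backlog. The helper names, the 90,000-message figure, and the choice of partition 7 are hypothetical values picked for the example.

```python
from typing import Dict, Tuple


def worst_partition(lags: Dict[int, int]) -> Tuple[int, int]:
    """Return (partition, lag) for the partition with the highest lag."""
    partition = max(lags, key=lags.get)
    return partition, lags[partition]


def lag_in_seconds(lag_messages: float, throughput_msg_per_s: float) -> float:
    """Express lag as the time needed to process it at the given throughput."""
    return lag_messages / throughput_msg_per_s


# 12 partitions: 11 with zero lag, one skewed partition holding all of it.
lags = {partition: 0 for partition in range(12)}
lags[7] = 90_000

print(sum(lags.values()))     # 90000, below a 100,000 group-level threshold
print(worst_partition(lags))  # (7, 90000)

# The same 50,000 messages of lag at two different throughputs.
print(lag_in_seconds(50_000, 10_000))  # 5.0
print(lag_in_seconds(50_000, 100))     # 500.0
```

Alerting on the worst partition and on lag expressed in seconds, alongside the group total, is one way to avoid both blind spots.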