dkduckkit.dev

Kafka partition skew


Kafka partition skew occurs when messages are distributed unevenly across a topic's partitions, so that one partition receives significantly more traffic than the others. The usual cause is an uneven message key distribution: when the partition is chosen as hash(key) % numPartitions, every message with the same key lands on the same partition, so high-frequency keys (one user generating 80% of events, one country code accounting for 70% of traffic) create hot partitions. The consumer assigned to a hot partition may lag while other consumers in the same group sit idle.
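The effect of hash(key) % numPartitions on a skewed key set can be illustrated with a small simulation. This is a minimal sketch, not Kafka's actual partitioner: CRC32 stands in for the hash only because it is deterministic (Kafka's default partitioner applies murmur2 to the serialized key), and the key names and the 80/20 workload are hypothetical.

```python
import zlib
from collections import Counter

def partition_for(key: str, num_partitions: int) -> int:
    """Stand-in for hash(key) % numPartitions (CRC32 is an assumption, not Kafka's hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def partition_counts(keys, num_partitions):
    """Count how many messages land on each partition."""
    counts = Counter(partition_for(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]

# Hypothetical workload: one user produces 80% of 10,000 events.
keys = ["user-hot"] * 8000 + [f"user-{i}" for i in range(2000)]

counts = partition_counts(keys, 12)
hot = partition_for("user-hot", 12)
print(counts)
print(f"partition {hot} holds {counts[hot] / len(keys):.0%} of all messages")
```

Whatever hash function is used, all 8,000 messages for the hot key end up on one partition, while the remaining 2,000 are spread across all 12.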

Why it matters in practice

Group-level lag metrics hide partition skew. Consider a group consuming 12 partitions, one consumer per partition, where 11 partitions have zero lag and one partition has a lag of 100,000 messages. A single total or average for the group does not show that all of the lag sits in one partition, so the group can look healthy on a dashboard. In reality one consumer is overwhelmed and the other 11 are idle. Standard horizontal scaling (adding more consumers) does not help, because within a consumer group the hot partition is already assigned to exactly one consumer. The fix is to change the partitioning key, add a sub-partitioning strategy, or use the Parallel Consumer library, which lets one consumer process a single partition concurrently.
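The difference between the group-level view and the per-partition view can be sketched in a few lines. The function below is hypothetical and assumes the per-partition lag values have already been collected elsewhere (for example from the LAG column printed by kafka-consumer-groups --describe); it only does the arithmetic.

```python
from typing import Dict

def lag_summary(lag_by_partition: Dict[int, int]) -> dict:
    """Summarize lag per group and expose the worst single partition."""
    if not lag_by_partition:
        raise ValueError("no partitions given")
    total = sum(lag_by_partition.values())
    worst_partition = max(lag_by_partition, key=lag_by_partition.get)
    worst_lag = lag_by_partition[worst_partition]
    return {
        "total_lag": total,
        "mean_lag": total / len(lag_by_partition),
        "worst_partition": worst_partition,
        "worst_lag": worst_lag,
        # Fraction of the group's lag that sits in one partition.
        "worst_share": worst_lag / total if total else 0.0,
    }

# The scenario from the text: 11 partitions at zero lag, one at 100,000.
lags = {p: 0 for p in range(12)}
lags[7] = 100_000  # partition number chosen arbitrarily for the example

summary = lag_summary(lags)
print(summary)
```

Here the mean lag is about 8,333 messages per partition, yet worst_share is 1.0: the entire backlog belongs to one partition. Alerting on the worst partition, or on its share of total lag, surfaces skew that a group total hides.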

Common mistakes

  • Using a low-cardinality key as the partition key: country codes, event types, and status fields typically have skewed distributions.
  • Monitoring only total group lag rather than per-partition lag breakdown.
  • Assuming that adding partitions fixes skew: spreading keys over more partitions helps only if the key distribution becomes more even after the change. A single hot key still maps to exactly one partition, however many partitions the topic has.
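The last point, and the sub-partitioning option mentioned earlier, can be compared in a short simulation. This is a sketch under stated assumptions: CRC32 again stands in for the partitioner's hash, the country-code workload is made up, and salted_key is a hypothetical helper showing one possible sub-partitioning scheme (appending a bucket suffix to the key), not a Kafka API.

```python
import zlib
from collections import Counter

def partition_for(key: str, num_partitions: int) -> int:
    """Stand-in for hash(key) % numPartitions (CRC32 is an assumption, not Kafka's hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def max_share(keys, num_partitions) -> float:
    """Fraction of all messages that land on the busiest partition."""
    counts = Counter(partition_for(k, num_partitions) for k in keys)
    return max(counts.values()) / len(keys)

def salted_key(key: str, sequence: int, buckets: int) -> str:
    """Hypothetical sub-partitioning: spread one logical key over `buckets` sub-keys."""
    return f"{key}#{sequence % buckets}"

# Hypothetical workload: one country code accounts for 70% of traffic.
events = ["DE"] * 7000 + ["FR"] * 1500 + ["US"] * 1500

# More partitions alone: the hot key still lands on a single partition.
for n in (12, 48):
    print(f"{n} partitions, plain keys: busiest share = {max_share(events, n):.0%}")

# Salting the key splits the hot key across up to 8 partitions.
salted = [salted_key(k, i, 8) for i, k in enumerate(events)]
print(f"48 partitions, salted keys: busiest share = {max_share(salted, 48):.0%}")
```

The trade-off is ordering: messages for one logical key are now spread over several partitions, so per-key ordering holds only within each bucket, and any consumer that needs a per-key view must combine the buckets itself.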