
replica.fetch.max.bytes

Kafka

`replica.fetch.max.bytes` is a Kafka broker configuration that limits the maximum number of bytes a follower replica fetches from the leader per fetch request. The default is 1,048,576 bytes (1 MB). If `message.max.bytes` (the maximum message size the broker accepts) is increased without also increasing `replica.fetch.max.bytes`, the broker can accept oversized messages from producers that follower replicas cannot replicate. Such a message then exists only on the leader and is permanently lost if the leader fails before the configuration is corrected and the followers catch up.

When `replica.fetch.max.bytes` is set below `message.max.bytes`, follower replicas cannot fetch oversized messages. The leader records a `MessageSizeTooLargeException` in its broker logs, so the failure is visible to operators watching those logs. At the application layer, however, it is silent: producers receive no error, consumers see no new data, and the partition simply stops replicating without any client-side signal. This makes it one of the most operationally dangerous misconfigurations despite being logged server-side.

Formula

replica.fetch.max.bytes ≥ message.max.bytes ≥ max.request.size (producer)
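This ordering can be enforced as a simple pre-deploy check. The sketch below is illustrative (the function name and return shape are not a Kafka API); it just encodes the inequality above:

```python
def check_size_configs(replica_fetch_max_bytes: int,
                       message_max_bytes: int,
                       max_request_size: int) -> list[str]:
    """Return violations of
    replica.fetch.max.bytes >= message.max.bytes >= max.request.size."""
    problems = []
    if replica_fetch_max_bytes < message_max_bytes:
        problems.append(
            "replica.fetch.max.bytes (%d) < message.max.bytes (%d): "
            "followers cannot replicate the largest accepted message"
            % (replica_fetch_max_bytes, message_max_bytes))
    if message_max_bytes < max_request_size:
        problems.append(
            "message.max.bytes (%d) < max.request.size (%d): "
            "producers can send requests the broker will reject"
            % (message_max_bytes, max_request_size))
    return problems

# All three at the ~1 MB defaults: consistent, no violations.
print(check_size_configs(1048576, 1048576, 1048576))  # []
# message.max.bytes raised to 5 MB without replica.fetch.max.bytes: violation.
print(check_size_configs(1048576, 5242880, 1048576))
```

Running such a check in CI against the intended broker and producer configs catches the mismatch before it reaches a cluster.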

Why it matters in practice

This misconfiguration is particularly insidious because: (1) producers often receive success acknowledgements while replication stalls — with `acks=1`, the leader alone confirms the write; (2) consumers never see the stuck message, because the high watermark cannot advance past an offset the followers have not replicated; and (3) although `MessageSizeTooLargeException` appears in broker logs, nothing surfaces in application metrics unless broker health is wired into your dashboards. The `UnderReplicatedPartitions` metric exceeding zero is the critical signal. The message is effectively a time bomb — safe until the leader broker fails, at which point a follower that has never seen the message becomes the new leader, and the message is gone. In banking or compliance contexts this is a data integrity incident.
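A consistent set of values satisfying the formula might look like the following (the numbers are illustrative, not recommendations):

```properties
# server.properties (broker) — illustrative values
message.max.bytes=5242880          # 5 MB: largest record batch the broker accepts
replica.fetch.max.bytes=6291456    # 6 MB: headroom so followers can always fetch it

# producer configuration (client side)
max.request.size=5242880           # must not exceed message.max.bytes
```

The key discipline is that any change raising `message.max.bytes` also raises `replica.fetch.max.bytes` in the same change.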

Common mistakes

  • Increasing message.max.bytes without simultaneously increasing replica.fetch.max.bytes — the most common cause of this issue.
  • Not monitoring kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions — broker logs may show fetch errors, but clients get no signal; this metric catches the replication stall early.
  • Not testing message replication in staging with the same configuration as production — data loss may only appear after leader failover, when followers that never replicated the message take over.