replica.fetch.max.bytes
`replica.fetch.max.bytes` is a Kafka broker configuration that sets how many bytes of messages a follower replica attempts to fetch from the leader for each partition in a fetch request. The default is 1,048,576 bytes (1 MiB). If `message.max.bytes` (the maximum message size the broker accepts) is increased without also increasing `replica.fetch.max.bytes`, the broker accepts oversized messages from producers while follower replicas struggle to replicate them efficiently.
In older Kafka versions (< 0.10.1), this could cause permanent data loss: the follower could never fetch the oversized message, so it existed only on the leader and was lost if the leader failed. Modern Kafka lets an oversized batch through in a fetch response to ensure forward progress, but the misconfiguration still degrades replication performance, increases inter-broker latency, and raises the risk of data loss during broker failures.
Formula
replica.fetch.max.bytes ≥ message.max.bytes ≥ max.request.size (producer)
Why it matters in practice
This misconfiguration is insidious because nothing surfaces in application metrics: producers receive success acknowledgements while follower fetches slow down, and fetch errors appear only in broker logs unless you wire broker health into your dashboards. On Kafka before 0.10.1 the failure mode was worse: replication could stall permanently, and a leader failure would lose the message outright. On modern Kafka (0.10.1 and later, after KIP-74) replication makes progress, but undersized fetch limits inflate replication lag and shrink the in-sync replica safety margin, so a leader failure during that lag window can still lose data when producers use a weaker setting than `acks=all`. An `UnderReplicatedPartitions` value above zero is the critical early signal. In banking or compliance contexts this remains a data integrity concern.
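One way to act on that signal is sketched below in Python. It assumes the broker's `UnderReplicatedPartitions` gauge is already being scraped into a list of periodic samples by whatever monitoring stack you use. The function name and the `min_consecutive` debounce parameter are hypothetical and not part of Kafka. The default of 1 matches the rule above (any value above zero is a signal); a larger value is only an illustration of how brief spikes during planned restarts might be tolerated.

```python
def under_replication_alert(samples, min_consecutive=1):
    """Return True if the most recent `min_consecutive` samples are all above zero.

    `samples` is a chronological list of UnderReplicatedPartitions readings
    (oldest first). Both the function and `min_consecutive` are illustrative.
    """
    if min_consecutive < 1:
        raise ValueError("min_consecutive must be at least 1")
    if len(samples) < min_consecutive:
        return False  # not enough data to decide yet
    return all(value > 0 for value in samples[-min_consecutive:])


# Hypothetical readings: healthy, then two partitions fall behind.
readings = [0, 0, 0, 2, 2]
print(under_replication_alert(readings))                     # strict rule
print(under_replication_alert(readings, min_consecutive=3))  # debounced rule
```

With these readings the strict rule fires, while the three-sample debounce does not fire yet, because only the last two samples are above zero.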
Common mistakes
- Increasing `message.max.bytes` without simultaneously increasing `replica.fetch.max.bytes`. This is the most common cause of the issue.
- Not monitoring `kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions`. Broker logs may show fetch errors, but clients get no signal; this metric catches the replication stall early.
- Not testing message replication in staging with the same configuration as production. Data loss may only appear after leader failover, when followers that never replicated the message take over.
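The first mistake can be caught before deployment by checking the ordering from the formula against the actual configuration files. The following is a minimal Python sketch, not an official Kafka tool. The helper names `parse_properties` and `check_size_ordering` are hypothetical, the parser handles only plain `key=value` lines, and the sample values are invented for illustration. All three settings must be present explicitly; the sketch deliberately does not assume defaults.

```python
def parse_properties(text):
    """Parse simple `key=value` lines, skipping blank lines and `#` comments."""
    props = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no '='
            props[key.strip()] = value.strip()
    return props


def check_size_ordering(broker_props, producer_props):
    """Return violations of
    replica.fetch.max.bytes >= message.max.bytes >= max.request.size.

    A missing key raises KeyError instead of silently assuming a default.
    """
    replica_fetch = int(broker_props["replica.fetch.max.bytes"])
    message_max = int(broker_props["message.max.bytes"])
    max_request = int(producer_props["max.request.size"])

    violations = []
    if replica_fetch < message_max:
        violations.append(
            f"replica.fetch.max.bytes ({replica_fetch}) < message.max.bytes ({message_max})"
        )
    if message_max < max_request:
        violations.append(
            f"message.max.bytes ({message_max}) < max.request.size ({max_request})"
        )
    return violations


# Hypothetical configs: message.max.bytes was raised to 5 MiB,
# but replica.fetch.max.bytes was left at 1 MiB.
BROKER_PROPERTIES = """
# server.properties (excerpt)
message.max.bytes=5242880
replica.fetch.max.bytes=1048576
"""
PRODUCER_PROPERTIES = "max.request.size=5242880"

for violation in check_size_ordering(
    parse_properties(BROKER_PROPERTIES), parse_properties(PRODUCER_PROPERTIES)
):
    print("VIOLATION:", violation)
```

For the sample values this reports a single violation, the undersized `replica.fetch.max.bytes`. Running a check like this in CI against both staging and production configs also helps with the third mistake, since it flags drift between environments before a failover exposes it.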