Thundering herd

Latency & SRE

A thundering herd is a failure pattern in which a large number of clients try to access the same resource at the same moment, typically right after it becomes available again or right after a shared timer runs out. Common triggers: a cache expiry that sends thousands of requests to the database at once, a server restart after which every reconnecting client hits the new instance together, or a rate limit window reset that lets all throttled clients retry at the same instant.

Why it matters in practice

Thundering herds turn a recovery event into a second failure. A database that just recovered from a brief outage receives 10,000 simultaneous reconnection requests and crashes again. A cache that expires a popular key causes thousands of cache misses to hit the database in parallel, overwhelming it. The irony is that the protective mechanism (rate limiting, caching) becomes the cause of the next incident if thundering herds are not explicitly managed.

Common mistakes

  • Not adding jitter to cache TTLs: objects cached at the same time with the same TTL also expire at the same time, causing a coordinated stampede.
  • Not using request coalescing (single-flight) for cache misses: multiple simultaneous misses for the same key should result in one upstream request, not N.
  • Implementing retry logic without jitter: clients that failed together retry after the same backoff delay, creating a new thundering herd with each retry wave.
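
The three fixes implied by the list above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the names jittered_ttl, backoff_with_full_jitter and SingleFlight, and the default values (10% spread, 0.1 s base delay, 30 s cap), are assumptions chosen for the example and do not come from any particular library.

```python
import random
import threading
from concurrent.futures import Future


def jittered_ttl(base_ttl_s, spread=0.1):
    """Return base_ttl_s shifted randomly by up to +/- spread (a fraction),
    so keys written together do not all expire together."""
    return base_ttl_s * (1.0 + random.uniform(-spread, spread))


def backoff_with_full_jitter(attempt, base_s=0.1, cap_s=30.0):
    """Delay before retry number `attempt` (0-based): a uniform random value
    between 0 and the capped exponential backoff, so retries spread out."""
    return random.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))


class SingleFlight:
    """Coalesce concurrent calls for the same key into one upstream call."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = {}  # key -> Future shared by all waiting callers

    def do(self, key, fn):
        with self._lock:
            fut = self._in_flight.get(key)
            leader = fut is None
            if leader:
                fut = Future()
                self._in_flight[key] = fut
        if not leader:
            # Followers block here and reuse the leader's result or exception.
            return fut.result()
        try:
            result = fn()  # only the leader calls the upstream
        except BaseException as exc:
            fut.set_exception(exc)
            raise
        else:
            fut.set_result(result)
            return result
        finally:
            # Forget the key so a later miss triggers a fresh upstream call.
            with self._lock:
                del self._in_flight[key]
```

As a usage illustration, a cache write might pass jittered_ttl(300) instead of a fixed 300 seconds, a retry loop might sleep for backoff_with_full_jitter(attempt) between attempts, and a cache-miss handler might call flights.do(key, load_from_database), where flights is a shared SingleFlight instance and load_from_database is a hypothetical loader function. This sketch coalesces calls only within a single process; coalescing across several processes or hosts would need a shared lock or lease, which is outside what is shown here.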