
Retry storm


A retry storm is a failure amplification pattern in which clients receive errors, retry in lockstep after the same fixed delay, and their retries arrive at the backend simultaneously; the backend fails again, and the cycle repeats, often worsening with each iteration. Retry storms convert brief transient failures into extended outages, and they are a primary cause of cascading failures in microservice architectures.

Why it matters in practice

A retry storm typically starts small: a database hiccup causes 1,000 clients to receive 500 errors in a 1-second window. Without jitter, all 1,000 clients retry after their fixed 5-second backoff — the database receives 1,000 simultaneous requests exactly 5 seconds later, when it may not be fully recovered. It fails again. The cycle repeats until an operator intervenes or the backoff grows large enough to stop synchronisation. The storm can also spread beyond the original failure: the retries consume connection pool capacity and CPU, causing adjacent services to degrade.
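
To see concretely how jitter breaks the synchronisation, here is a minimal sketch of exponential backoff with full jitter in Python. The TransientError type and the request callable are illustrative placeholders, not part of any particular library:

    import random
    import time

    class TransientError(Exception):
        """Placeholder for a retryable failure such as an HTTP 500 or a timeout."""

    def backoff_with_jitter(attempt, base=0.5, cap=30.0):
        # Full jitter: sleep a random duration in [0, min(cap, base * 2**attempt)].
        # Randomising the delay de-synchronises clients, so retries spread out
        # instead of arriving at the backend in one simultaneous wave.
        return random.uniform(0, min(cap, base * (2 ** attempt)))

    def call_with_retries(request, max_attempts=5):
        for attempt in range(max_attempts):
            try:
                return request()
            except TransientError:
                if attempt == max_attempts - 1:
                    raise  # out of attempts; surface the failure to the caller
                time.sleep(backoff_with_jitter(attempt))

With full jitter, the 1,000 clients above would spread their first retry uniformly over an interval instead of hitting the database in a single wave, and the interval doubles with each failed attempt.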

Common mistakes

  • Implementing retry logic without circuit breakers: a circuit breaker trips after N consecutive failures and fails fast instead of calling the backend, preventing a retry storm from perpetuating indefinitely (first sketch below).
  • Not handling the Retry-After header: servers that are rate-limiting or temporarily overloaded signal the correct wait time via Retry-After; clients that ignore it and fall back to a fixed backoff will storm again (second sketch below).
  • Retrying non-idempotent operations without deduplication: a retry storm against a payment endpoint can double-charge customers unless the server can recognise and deduplicate repeated requests (third sketch below).
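
A minimal circuit breaker sketch; the threshold and timeout values are placeholders to be tuned per service, and real deployments usually limit the half-open state to a small number of trial requests:

    import time

    class CircuitBreaker:
        def __init__(self, failure_threshold=5, reset_timeout=30.0):
            self.failure_threshold = failure_threshold
            self.reset_timeout = reset_timeout
            self.failures = 0
            self.opened_at = None  # None means the circuit is closed

        def call(self, fn):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_timeout:
                    # Fail fast without touching the backend, giving it time to recover.
                    raise RuntimeError("circuit open")
                self.opened_at = None  # half-open: let one trial request through
            try:
                result = fn()
            except Exception:
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self.opened_at = time.monotonic()  # trip the breaker
                raise
            self.failures = 0  # a success closes the circuit
            return result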
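
Honouring Retry-After takes little code. The sketch below assumes a requests-style response object with a headers mapping and reuses backoff_with_jitter from the earlier sketch; both forms of the header (delta-seconds and HTTP-date) are defined in RFC 9110:

    import email.utils
    import time

    def parse_retry_after(value):
        # Retry-After is either delta-seconds ("120") or an HTTP-date
        # ("Fri, 31 Dec 1999 23:59:59 GMT").
        try:
            return max(0.0, float(value))
        except ValueError:
            dt = email.utils.parsedate_to_datetime(value)
            return max(0.0, dt.timestamp() - time.time())

    def wait_before_retry(response, attempt):
        retry_after = response.headers.get("Retry-After")
        if retry_after is not None:
            return parse_retry_after(retry_after)  # honour the server's hint
        return backoff_with_jitter(attempt)  # otherwise fall back to jittered backoff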
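
For the payment case, the usual fix is an idempotency key generated once and reused across retries so the server can deduplicate. The client object, its post signature, and the /v1/charges path below are hypothetical; the Idempotency-Key header follows the convention used by payment APIs such as Stripe. TransientError and backoff_with_jitter come from the first sketch:

    import time
    import uuid

    def charge_with_retries(client, amount_cents, currency, max_attempts=5):
        # Generate the key once, before the first attempt, and send the same
        # key on every retry so the server recognises duplicates of this charge.
        key = str(uuid.uuid4())
        for attempt in range(max_attempts):
            try:
                return client.post(
                    "/v1/charges",
                    json={"amount": amount_cents, "currency": currency},
                    headers={"Idempotency-Key": key},
                )
            except TransientError:
                if attempt == max_attempts - 1:
                    raise
                time.sleep(backoff_with_jitter(attempt))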