Retry Pattern: Handling Transient Failures in Distributed Systems

The retry pattern is a resilience strategy that automatically repeats a failed operation when the failure is likely temporary. In distributed systems, failures like network timeouts, database deadlocks, or service unavailability are often transient. The retry pattern handles these by retrying the operation after a delay, turning a temporary failure into a successful operation.

The retry pattern is essential for building resilient applications that communicate over networks. To understand retries properly, it helps to be familiar with client-server communication, HTTP status codes, and the circuit breaker pattern, which often works alongside retries.

What Is the Retry Pattern?

The retry pattern enables an application to automatically repeat a failed operation in the hope that the failure is temporary. Many failures in distributed systems resolve themselves quickly, such as a momentary network glitch, a database deadlock that clears, or a service that becomes available again after a brief overload.

  • Transient Failure: A temporary problem that often resolves without intervention, like a network timeout or connection pool exhaustion.
  • Retry: Repeating the same operation one or more times after a failure.
  • Backoff: Waiting between retries to give the system time to recover and to avoid overwhelming it.
  • Idempotency: The property that performing an operation multiple times has the same effect as performing it once, which is critical for safe retries.
  • Retry Budget: A limit on total retry attempts across operations to prevent cascading failures.
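
The terms above can be tied together in a minimal sketch. It assumes a hypothetical `TransientError` exception that the caller raises for temporary failures, and a fixed delay between attempts; the names `retry`, `max_retries`, and `delay_seconds` are illustrative and not taken from any particular library.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry(operation, max_retries=3, delay_seconds=0.5, sleep=time.sleep):
    """Call operation(); on TransientError, wait and try again.

    max_retries counts the extra attempts made after the first one fails.
    The sleep function is injectable so tests do not have to wait.
    """
    attempt = 0
    while True:
        try:
            return operation()
        except TransientError:
            if attempt >= max_retries:
                raise  # retries exhausted: surface the last failure
            attempt += 1
            sleep(delay_seconds)  # back off before the next attempt
```

Any other exception type propagates immediately, which matches the idea that only failures believed to be transient are retried.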

When to Use Retries

Not all failures should be retried. Understanding which failures are transient and which are permanent is essential for effective retry strategies.

Retry Suitable for Transient Failures | Avoid Retry for Permanent Failures
Network timeouts and connection resets | Authentication failures, invalid credentials
HTTP 408 Request Timeout, 429 Too Many Requests, 500 Internal Server Error, 503 Service Unavailable | HTTP 400 Bad Request, 401 Unauthorized, 403 Forbidden, 404 Not Found
Database deadlocks and connection failures | Constraint violations and data validation errors
Temporary service unavailability | Business logic errors
Rate limiting when waiting respects the limit | Operations that are not idempotent
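
As a rough illustration, the HTTP rows of this comparison could be encoded as a small decision function. The function name `should_retry` and the `idempotent` flag are assumptions made for this sketch; a real client would usually also consider the error type and any Retry-After header.

```python
# Status codes the comparison above lists as suitable for retry.
RETRYABLE_STATUS_CODES = frozenset({408, 429, 500, 503})


def should_retry(status_code, idempotent):
    """Return True when a failed HTTP call looks worth retrying.

    Non-idempotent operations are never retried here, regardless of status,
    because a repeat could apply the side effect twice.
    """
    if not idempotent:
        return False
    return status_code in RETRYABLE_STATUS_CODES
```

For example, `should_retry(503, idempotent=True)` is true, while `should_retry(404, idempotent=True)` and `should_retry(503, idempotent=False)` are both false.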

Backoff Strategies

The delay between retries is critical. Retrying immediately is often useless because the cause of failure probably still exists. Different backoff strategies serve different scenarios.

Fixed Backoff

Wait the same amount of time between each retry. This is simple but can overwhelm a struggling service if many clients retry simultaneously. It works well when failures are independent and recovery time is predictable.

Linear Backoff

Increase the wait time by a fixed amount after each retry. For example, wait 1 second, then 2 seconds, then 3 seconds. This spreads out retries but still has predictable timing.

Exponential Backoff

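Double the wait time after each retry. For example, wait 1 second, then 2 seconds, then 4 seconds, then 8 seconds. This is the most common and recommended strategy because each successive delay gives the system more time to recover and the retry rate drops quickly while a failure persists. On its own, however, it does not desynchronise clients that failed at the same moment, since they all follow the same schedule.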
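
The fixed, linear, and exponential schedules described so far differ only in how the delay is computed from the retry number. A minimal sketch, assuming a 1-based `attempt` counter and a `base` delay in seconds (both names are illustrative):

```python
def fixed_delay(attempt, base=1.0):
    """Same wait before every retry."""
    return base


def linear_delay(attempt, base=1.0):
    """Wait grows by a fixed amount: base, 2*base, 3*base, ..."""
    return base * attempt


def exponential_delay(attempt, base=1.0, factor=2.0):
    """Wait is multiplied by factor each time: base, 2*base, 4*base, ..."""
    return base * factor ** (attempt - 1)


# Delays in seconds for the first four retries under each strategy.
schedules = {
    "fixed": [fixed_delay(n) for n in range(1, 5)],
    "linear": [linear_delay(n) for n in range(1, 5)],
    "exponential": [exponential_delay(n) for n in range(1, 5)],
}
```

With the defaults, the exponential schedule reproduces the 1, 2, 4, 8 second example from the text.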

Exponential Backoff with Jitter

Add randomness to exponential backoff to prevent retry storms. When many clients experience failure simultaneously, they could all retry at the same time with pure exponential backoff. Adding jitter randomises the wait time, spreading out the load and preventing coordinated retry spikes.
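
One common way to add jitter is to pick a random delay between zero and the exponential value, often called "full jitter". The sketch below is one variant among several, not a canonical definition; the `cap` parameter and the injectable `rng` are assumptions added so the example is bounded and testable.

```python
import random


def backoff_with_jitter(attempt, base=1.0, cap=30.0, rng=random):
    """Return a random delay in [0, min(cap, base * 2**(attempt - 1))].

    attempt is the 1-based retry number. rng may be the random module or
    a random.Random instance (useful for reproducible tests).
    """
    ceiling = min(cap, base * 2 ** (attempt - 1))
    return rng.uniform(0.0, ceiling)
```

Because each client draws its own random value, clients that failed at the same instant are unlikely to retry at the same instant.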

Variable Backoff Based on Error Type

Different failures may need different delays. A rate limiting error might include a Retry-After header specifying when to retry. A database deadlock might resolve quickly. A service unavailable error might need longer recovery time.
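
For the rate limiting case, the Retry-After header value may be either a number of seconds or an HTTP date. The following sketch converts either form into a delay in seconds; the function name `retry_after_seconds` and the `now` parameter are assumptions for illustration, and it returns `None` when the value cannot be parsed so the caller can fall back to its normal backoff.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None):
    """Convert a Retry-After header value into a delay in seconds."""
    if header_value is None:
        return None
    value = header_value.strip()
    if value.isdigit():
        return float(value)  # delay given directly in seconds
    try:
        when = parsedate_to_datetime(value)  # HTTP date form
    except (TypeError, ValueError):
        return None  # unparseable: let the caller use its own backoff
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    # A date in the past means the caller may retry immediately.
    return max(0.0, (when - now).total_seconds())
```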

Retry Configuration Parameters

Proper retry configuration balances reliability against performance and system load.

  • Max Retries: The maximum number of retry attempts. Typical values range from 3 to 10. Too few retries reduce effectiveness. Too many increase latency and system load.
  • Initial Delay: The wait time before the first retry. Typical values range from 1 millisecond to 1 second.
  • Max Delay: A cap on the maximum wait time to prevent unbounded delays. Typical values range from 10 seconds to 60 seconds.
  • Timeout per Attempt: How long to wait for each individual attempt before considering it failed.
  • Total Deadline: The absolute maximum time allowed for all retries combined. Once this deadline passes, the operation fails even if retries remain.
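
These parameters can be grouped into one configuration object and enforced in a single loop. The sketch below is one possible arrangement, not a reference implementation: `RetryConfig`, `call_with_retry`, and the convention that `operation` accepts a per-attempt timeout are all assumptions, and the default values are examples chosen from the ranges above.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryConfig:
    max_retries: int = 3
    initial_delay: float = 0.1    # seconds before the first retry
    max_delay: float = 30.0       # cap on any single wait
    attempt_timeout: float = 5.0  # budget for one attempt
    total_deadline: float = 60.0  # budget for all attempts and waits


def call_with_retry(operation, config, clock=time.monotonic, sleep=time.sleep):
    """Run operation(timeout) with exponential backoff under a total deadline.

    operation is assumed to raise TimeoutError or ConnectionError on a
    transient failure and to honour the timeout it is given.
    """
    start = clock()
    delay = config.initial_delay
    for attempt in range(config.max_retries + 1):
        remaining = config.total_deadline - (clock() - start)
        if remaining <= 0:
            raise TimeoutError("total deadline exceeded")
        try:
            # Never give one attempt more time than the deadline has left.
            return operation(min(config.attempt_timeout, remaining))
        except (TimeoutError, ConnectionError):
            if attempt == config.max_retries:
                raise  # retry count exhausted
            remaining = config.total_deadline - (clock() - start)
            if delay >= remaining:
                raise  # waiting would overrun the deadline
            sleep(delay)
            delay = min(config.max_delay, delay * 2)
```

The deadline check takes priority over the retry count, so the call fails once the time budget is spent even if retries remain.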

Idempotency and Retries

Idempotency is the foundation of safe retries. An idempotent operation produces the same result no matter how many times it is executed. This concept is explored in depth in our idempotency guide.

Without idempotency, retrying a failed operation could cause problems. Consider a money transfer operation that fails due to a network timeout. The transfer might have succeeded on the server, but the client never received the response. Retrying could transfer the money twice. Idempotency keys solve this problem by allowing the server to recognise and ignore duplicate requests.
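
The money transfer scenario can be sketched as follows. This is a simplified, in-memory illustration under stated assumptions: `TransferService`, its account names, and the response shape are hypothetical, and a production system would store keys durably and handle concurrent requests.

```python
import uuid


class TransferService:
    """Toy server that remembers the response for each idempotency key."""

    def __init__(self):
        self.balances = {"alice": 100, "bob": 0}
        self._responses = {}  # idempotency key -> stored response

    def transfer(self, idempotency_key, source, target, amount):
        # A repeated key means the client is retrying: return the earlier
        # response without moving money a second time.
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]
        self.balances[source] -= amount
        self.balances[target] += amount
        response = {"status": "completed", "amount": amount}
        self._responses[idempotency_key] = response
        return response


service = TransferService()
key = str(uuid.uuid4())  # generated once by the client, reused on retry
first = service.transfer(key, "alice", "bob", 30)
retried = service.transfer(key, "alice", "bob", 30)  # e.g. after a timeout
```

After both calls the balances reflect a single transfer, because the second request is recognised as a duplicate.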

Retry in Different Layers

Retries can be implemented at multiple levels of an application stack.

Client-Side Retries

The client application implements retry logic when calling external services. This is the most common location for retries. The client has context about the operation and can decide whether retrying makes sense.

Database Retries

Database drivers and ORMs often include built-in retry logic for transient database errors like deadlocks. These are configured separately from application-level retries.

Message Queue Retries

When processing messages from a queue fails, the message can be returned to the queue for later retry. Many message brokers support dead letter queues for messages that exceed retry limits.
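
The requeue and dead letter behaviour can be illustrated with an in-memory queue. This is only a sketch: real brokers track delivery counts and dead letter routing themselves, and the names `process_queue` and `max_attempts` are assumptions for this example.

```python
from collections import deque


def process_queue(queue, handler, max_attempts=3):
    """Drain a queue of (message, attempts) pairs.

    A message whose handler raises is put back at the end of the queue
    until it has failed max_attempts times, then moved to the dead letter
    list, which is returned.
    """
    dead_letters = []
    while queue:
        message, attempts = queue.popleft()
        try:
            handler(message)
        except Exception:
            attempts += 1
            if attempts >= max_attempts:
                dead_letters.append(message)  # give up on this message
            else:
                queue.append((message, attempts))  # retry later
    return dead_letters
```

Requeuing at the back lets other messages make progress while a failing message waits for its next attempt.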

API Gateway Retries

An API gateway or reverse proxy can retry failed requests to backend services without the client knowing. This is transparent to the client but requires the backend operation to be idempotent.

Retry Anti-Patterns

Poorly implemented retries can cause more harm than good. Avoid these common mistakes.

  • Retrying Immediately: If a service is overloaded, retrying immediately adds more load and makes things worse. Always include a delay.
  • Unlimited Retries: Without a maximum limit, retries can continue indefinitely, consuming resources and delaying failure detection.
  • Retrying Non-Idempotent Operations: This can cause data corruption, duplicate records, or incorrect state changes.
  • Retrying All Errors: Permanent errors like 404 Not Found will never succeed. Retrying wastes resources and time.
  • Ignoring Retry-After Headers: Some failures include a Retry-After header specifying when to retry. Ignoring it can make rate limiting worse.
  • Synchronised Retry Storms: Many clients retrying at the same time can overwhelm a recovering service. Jitter prevents this.
  • No Circuit Breaker Integration: Retrying a service that is completely down is wasteful. Circuit breakers should stop retries when a service is failing persistently.

Retry Pattern with Circuit Breaker

Retries and circuit breakers are complementary patterns. The retry pattern handles temporary, short-lived failures. The circuit breaker pattern stops retries when a service is persistently failing, giving it time to recover.

A typical integration works like this: When an operation fails, the retry logic attempts it several times with backoff. If all retries fail, the circuit breaker records the failure. After enough failures, the circuit breaker opens and stops all requests, including retries. After a timeout, the circuit breaker allows a test request. If successful, the circuit closes and normal operations resume.
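
The integration described above can be sketched by wrapping the retried operation inside a circuit breaker, so that one breaker failure corresponds to one exhausted retry sequence. The classes and thresholds below are illustrative assumptions, not the API of any specific library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open")
            # Timeout elapsed: let this call through as the test request.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed test
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result


def with_retries(operation, max_retries=2, base_delay=0.1, sleep=time.sleep):
    """Retry operation() on ConnectionError with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)
```

A caller would then write something like `breaker.call(lambda: with_retries(fetch))`, where `fetch` is the hypothetical remote call. While the circuit is open, neither the first attempt nor any retry reaches the service.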

This combination prevents a failing service from being bombarded with repeated retries while still handling transient failures effectively.

Retry Configuration Examples by Scenario

Scenario | Recommended Retry Strategy
Network API call to external service | Exponential backoff with jitter, 3-5 retries, 1 second initial delay, 30 second max delay
Database query with deadlock risk | Fixed or linear backoff, 3 retries, short delays of 10-100 milliseconds
Message queue processing | Exponential backoff, 5-10 retries, with dead letter queue for final failures
User-facing synchronous request | Low retry count, short total deadline to avoid user waiting too long
Background batch job | Higher retry count, longer delays, aggressive exponential backoff
Rate-limited API call | Use Retry-After header if provided, otherwise long fixed delay respecting the rate limit

Testing Retry Logic

Retry logic must be tested deliberately because the failures it handles are rare and hard to reproduce on demand in production.

  • Inject Faults: Temporarily make a service fail in controlled ways to verify retry behavior.
  • Test Idempotency: Verify that duplicate requests do not cause unintended side effects.
  • Simulate Timeouts: Test what happens when operations hang or take too long.
  • Test Partial Failures: Some attempts fail, some succeed. Verify correct handling.
  • Test Exhaustion: Verify behavior when all retry attempts fail and the deadline passes.
  • Monitor Retry Metrics: Track retry counts, success rates after retry, and total time spent in retries.
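
Fault injection and exhaustion tests can be written with a small test double that fails a set number of times. The sketch below assumes a hypothetical `FlakyService` helper and a simplified `retry` function standing in for the code under test; the test functions follow the naming convention used by pytest but can also be called directly.

```python
class FlakyService:
    """Test double that fails a fixed number of times, then succeeds."""

    def __init__(self, failures_before_success):
        self.remaining_failures = failures_before_success
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("injected fault")
        return "ok"


def retry(operation, max_retries=3, sleep=lambda seconds: None):
    """Stand-in for the retry logic under test (no real waiting by default)."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_retries:
                raise
            sleep(0.1 * 2 ** attempt)


def test_recovers_after_transient_failures():
    service = FlakyService(failures_before_success=2)
    assert retry(service, max_retries=3) == "ok"
    assert service.calls == 3  # two injected faults, then success


def test_gives_up_when_retries_exhausted():
    service = FlakyService(failures_before_success=10)
    try:
        retry(service, max_retries=2)
    except ConnectionError:
        pass
    else:
        raise AssertionError("expected ConnectionError")
    assert service.calls == 3  # first attempt plus two retries
```

Injecting the sleep function keeps these tests fast while still exercising the backoff code path.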

Retry Best Practices

  • Only Retry Transient Failures: Distinguish between transient and permanent errors. Do not waste retries on errors that will never succeed.
  • Use Exponential Backoff with Jitter: This is the most robust strategy for production systems and prevents retry storms.
  • Set Maximum Retry Limits: Always cap the number of retries and total time to prevent unbounded delays.
  • Ensure Idempotency for Write Operations: Make every operation that changes state idempotent before enabling retries for it.
  • Integrate with Circuit Breakers: Use circuit breakers to stop retrying services that are persistently failing.
  • Log Retry Attempts: Record when retries happen, including the attempt number, delay, and eventual outcome.
  • Respect Retry-After Headers: When the server provides a suggested retry time, use it.
  • Consider the Client Context: Interactive user requests should have tighter retry budgets than background jobs.
  • Make Retry Configurable: Different environments and error types may need different retry settings.
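
As an illustration of the logging practice, the sketch below records the attempt number, the delay, and the eventual outcome using the standard `logging` module. The logger name `"retry"`, the function name, and the message wording are assumptions made for this example.

```python
import logging
import time

logger = logging.getLogger("retry")


def retry_with_logging(operation, max_retries=3, base_delay=0.5, sleep=time.sleep):
    """Retry operation() on ConnectionError, logging every attempt."""
    for attempt in range(1, max_retries + 2):
        try:
            result = operation()
        except ConnectionError as exc:
            if attempt > max_retries:
                logger.error("giving up after %d attempts: %s", attempt, exc)
                raise
            delay = base_delay * 2 ** (attempt - 1)
            logger.warning(
                "attempt %d failed (%s); retrying in %.2fs", attempt, exc, delay
            )
            sleep(delay)
        else:
            logger.info("succeeded on attempt %d", attempt)
            return result
```

The same call sites are a natural place to increment retry metrics, such as a counter per attempt and a counter per final outcome.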

Frequently Asked Questions

  1. What is the difference between retry pattern and circuit breaker pattern?
    The retry pattern handles temporary, short-lived failures by repeating the operation. The circuit breaker pattern stops requests entirely when a service is persistently failing, preventing repeated failed attempts. They work together: retry handles transient issues, and the circuit breaker opens when retries keep failing.
  2. How many retries should I use?
    There is no universal number. Most applications use 3 to 5 retries. Critical background jobs might use more. User-facing operations might use fewer, like 2 retries, to keep latency low. The right number balances reliability against user experience and system load.
  3. What is the difference between retry and timeout?
    A timeout sets how long to wait for a single attempt before giving up. Retries repeat the entire operation after it fails. They work together: each attempt has a timeout, and after a timeout failure, the retry logic decides whether to try again.
  4. Should I retry POST requests?
    Only if the operation is idempotent or you use idempotency keys. Without idempotency, retrying a POST could create duplicate resources. GET, PUT, and DELETE are generally safe to retry. POST requires additional design considerations covered in our idempotency guide.
  5. What is jitter and why is it important?
    Jitter adds randomness to retry delays. Without jitter, many clients experiencing the same failure will retry at exactly the same time, creating a retry storm that can overwhelm the recovering service. Jitter spreads out the retries, reducing coordinated load spikes.
  6. What should I learn next after the retry pattern?
    After mastering retries, explore the circuit breaker pattern for handling persistent failures, idempotency for safe retries of write operations, timeout patterns for bounding operation duration, and bulkhead pattern for isolating failures.