Rate Limiting: How APIs Control Request Volume

Rate limiting restricts the number of API requests a client can make.

Every public API needs rate limiting. Without it, a single aggressive client can consume all available server resources, degrading the experience for everyone else. Rate limiting is how APIs stay fair, responsive, and protected against both accidental overuse and deliberate abuse.

What Is Rate Limiting

Rate limiting restricts how many requests a client can make to an API within a defined time window. When the limit is exceeded, the API rejects further requests until the window resets or more capacity becomes available. The goal is to ensure that no single client can monopolise server resources at the expense of others.

Rate limits can be applied in different ways depending on what the API needs to protect. Some APIs apply limits per user account, so each registered user gets their own quota. Others apply limits per API key, which is useful when multiple applications share the same account but need separate tracking. IP-based limiting adds an additional layer that catches unauthenticated traffic or shared key abuse. Some APIs also apply separate limits per endpoint, allowing generous limits on cheap read operations while being more restrictive on expensive write or compute-heavy operations.

Rate limiting is not only about stopping bad actors. Even well-intentioned clients can accidentally hammer an API with a bug in their retry logic or a runaway background job. Rate limiting catches these situations and prevents them from cascading into an outage that affects all users of the service.

Common Rate Limiting Strategies

Several different algorithms are used to implement rate limiting, each with different characteristics around how smoothly they handle traffic and how they behave when limits are approached. Choosing the right algorithm depends on whether you need strict fairness, burst tolerance, or smooth throughput.

Algorithm | How It Works | Behaviour at Limit
Fixed Window | Counts requests within fixed time slots, such as 100 per minute, resetting at the top of each minute | Simple to implement but allows traffic spikes at window boundaries, where a burst at the end of one window and the start of the next can double the effective rate
Sliding Window | Counts requests in a rolling time period, for example the last 60 seconds from the current moment | Smoother than fixed window and eliminates boundary spikes, but requires more memory to track request timestamps
Token Bucket | Tokens are added to a bucket at a fixed rate up to a maximum capacity; each request consumes one token | Allows short bursts up to the bucket capacity while enforcing a long-term average rate, which suits APIs that expect occasional spikes in legitimate usage
Leaky Bucket | Requests enter a queue and are processed at a fixed rate regardless of how fast they arrive | Produces the smoothest output rate and prevents bursts entirely, but adds latency for requests that must wait in the queue

The token bucket algorithm is one of the most widely used in practice because it strikes a good balance between fairness and flexibility. A client that has been idle for a period accumulates tokens up to the bucket capacity and can spend them in a short burst, which matches the natural usage pattern of many real applications. The leaky bucket is preferred in situations where a constant output rate is critical, such as protecting a downstream service that cannot handle any spikes at all.
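The token bucket behaviour described above can be sketched in a few lines. This is an illustrative in-memory version under simple assumptions, not any particular library's implementation; the clock is injected so refill behaviour is easy to test.

```python
import time

class TokenBucket:
    """Illustrative token bucket: refills at `rate` tokens per second
    up to `capacity`; each request spends one token."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity          # start full: an idle client may burst
        self.clock = clock
        self.last_refill = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to the time elapsed since the last check.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

With a rate of 1 token per second and capacity 2, a client can make two requests back to back, is then refused, and regains one request per second of idleness.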

HTTP 429 Too Many Requests

When a client exceeds the rate limit, the server responds with HTTP status code 429, which means Too Many Requests. A well-designed rate limit response includes headers that give the client the information it needs to recover gracefully rather than simply failing with no context.

HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1711017600

{
  "error": "rate_limit_exceeded",
  "message": "Too many requests. Try again in 30 seconds."
}

The Retry-After header tells the client how many seconds to wait before making another attempt. The X-RateLimit-Reset header provides a Unix timestamp indicating when the current window expires and the quota refreshes. A client that reads and respects these headers can back off precisely rather than guessing how long to wait, which reduces unnecessary load on the server during the cooldown period.
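The recovery logic above can be captured in a small helper. This is a sketch: it prefers Retry-After when present and falls back to X-RateLimit-Reset, and the plain `headers` dict is an assumption about how your HTTP client exposes response headers.

```python
def seconds_to_wait(headers, now):
    """Return how long to pause after a 429 response.

    `headers` is a dict of header name -> string value; `now` is the
    current Unix time. Prefers Retry-After, falls back to the reset
    timestamp, and uses a short default when the server gives no guidance.
    """
    if "Retry-After" in headers:
        return int(headers["Retry-After"])
    if "X-RateLimit-Reset" in headers:
        # Reset is an absolute Unix timestamp; never wait a negative amount.
        return max(0, int(headers["X-RateLimit-Reset"]) - int(now))
    return 1  # no guidance from the server: fall back to a minimal pause
```

For the example response above, both headers yield the same 30-second wait.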

Common Rate Limit Headers

Rate limit headers are included in API responses to give clients visibility into their current usage and remaining quota. Not all APIs use the same header names, but the following set has become a widely adopted convention and is used by APIs from GitHub, Stripe, Twitter, and many others.

Header | What It Contains
X-RateLimit-Limit | The total number of requests allowed in the current window
X-RateLimit-Remaining | The number of requests still available in the current window
X-RateLimit-Reset | A Unix timestamp indicating when the window resets and the quota is restored
Retry-After | The number of seconds the client should wait before making another request after receiving a 429

Reading these headers proactively on every response, not just on 429 responses, lets a client slow down before hitting the limit rather than after. If X-RateLimit-Remaining drops to a low number, a well-behaved client can introduce a small delay between requests to avoid triggering the limit at all.
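One way to act on these headers proactively is to pace requests so the remaining quota is spread across the rest of the window. A minimal sketch, assuming the header values have already been parsed into numbers:

```python
def pacing_delay(remaining, reset_ts, now):
    """Seconds to sleep before the next request so that the remaining
    quota lasts until the window resets. Returns 0 when there is no
    time left in the window."""
    if remaining <= 0:
        return max(0.0, reset_ts - now)   # quota exhausted: wait for the reset
    window_left = max(0.0, reset_ts - now)
    return window_left / remaining        # spread requests evenly
```

With 10 requests remaining and 100 seconds until reset, the client paces itself to one request every 10 seconds instead of burning the quota immediately.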

Handling Rate Limits as a Client

How a client handles rate limiting says a lot about how robust its API integration is. A naive client that retries immediately after receiving a 429 will typically keep receiving 429 responses and may make the situation worse by adding to the load on the server. A well-designed client treats rate limit responses as signals to slow down, not to retry faster.

  • Always read the X-RateLimit-Remaining header on each response and slow down proactively as it approaches zero
  • When a 429 is received, wait for the duration specified in Retry-After before retrying rather than retrying immediately
  • Implement exponential backoff with jitter when retrying, so that multiple clients recovering from a rate limit event do not all retry at exactly the same moment
  • Cache API responses wherever the data does not need to be real-time, reducing the total number of requests the client needs to make
  • Use webhooks or server-sent events where the API supports them instead of polling repeatedly on a short interval
  • Batch requests where the API supports it, combining multiple operations into a single call rather than sending them individually
  • Consider upgrading to a higher API tier if legitimate usage consistently approaches or exceeds the allowed limits
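Several of the practices above combine naturally into a retry loop. This is a sketch, not a library API: `send` stands in for whatever function performs the request, the response object is assumed to expose `.status` and `.headers`, and `sleep` is injectable so the loop can be exercised without real delays.

```python
import random
import time

def request_with_backoff(send, max_retries=5, base=1.0, cap=60.0,
                         sleep=time.sleep, rng=random.random):
    """Call `send()` until the response is not a 429, waiting between
    attempts: honour Retry-After when present, otherwise back off
    exponentially with full jitter."""
    for attempt in range(max_retries + 1):
        response = send()
        if response.status != 429:
            return response
        retry_after = response.headers.get("Retry-After")
        if retry_after is not None:
            delay = float(retry_after)                     # server-specified wait
        else:
            delay = rng() * min(cap, base * 2 ** attempt)  # full jitter
        sleep(delay)
    raise RuntimeError("still rate limited after %d retries" % max_retries)
```

Injecting `sleep` and `rng` also makes the backoff schedule deterministic in tests, which is worth doing in any real client.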

Rate Limiting on the Server Side

For developers building APIs rather than consuming them, rate limiting should be implemented as early in the request pipeline as possible. Applying rate limits at a reverse proxy or API gateway layer means that rejected requests never reach your application servers at all, which is the most efficient way to protect backend resources.

Redis is a popular choice for storing rate limit counters because it is fast, supports atomic increment operations, and can set expiry times on keys natively. A sliding window counter can be implemented using Redis sorted sets, where each request is recorded with its timestamp and old entries are pruned on each check. For simpler fixed window counting, a single Redis key per client with an expiry matching the window duration is often sufficient.
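The fixed-window pattern described here can be sketched without a live Redis instance; the dictionary below is a stand-in for one Redis key per client and window, where INCR and EXPIRE would make the count atomic and self-cleaning on the server.

```python
import time

class FixedWindowCounter:
    """In-memory sketch of the Redis fixed-window pattern: one counter
    per (client, window) pair. In Redis this would be INCR on a key
    with EXPIRE set to the window duration."""

    def __init__(self, limit, window, clock=time.time):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.counters = {}   # (client, window number) -> request count

    def allow(self, client):
        bucket = (client, int(self.clock() // self.window))
        count = self.counters.get(bucket, 0) + 1
        self.counters[bucket] = count
        return count <= self.limit
```

Each client gets an independent count, and a new window number effectively resets it, just as an expired Redis key would.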

Nginx provides built-in rate limiting through the limit_req module, which uses a leaky bucket algorithm and can be configured with just a few lines in the server configuration. API gateway platforms like Kong, AWS API Gateway, and Apigee offer rate limiting as a managed feature with dashboards, per-plan quotas, and detailed analytics, which makes them a practical choice for public APIs with many consumers at different subscription tiers.
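As a sketch of the Nginx approach, a minimal limit_req configuration looks like the following; the zone name, rate, burst value, and upstream name are illustrative, not prescriptive.

```nginx
# Shared-memory zone keyed by client IP: 10 MB of state,
# drained at 10 requests per second (the leaky bucket rate).
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

server {
    location /api/ {
        # Queue bursts of up to 20 requests above the steady rate;
        # anything beyond that is rejected outright.
        limit_req zone=api_limit burst=20;
        # Respond with 429 instead of the default 503 when limited.
        limit_req_status 429;
        proxy_pass http://backend;
    }
}
```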

Rate Limiting and API Plans

Many commercial APIs use rate limits as a way to differentiate between subscription tiers. A free tier might allow 100 requests per day, a basic paid tier might allow 10,000 requests per day, and an enterprise tier might offer custom limits negotiated directly. This tiered approach aligns the cost of running the API with the revenue generated from its consumers and gives developers a clear path to scale their usage as their product grows.

When designing rate limits for a tiered API, set the free tier high enough to be useful for evaluation and development, but low enough that running a production workload without upgrading is impractical. Documenting the limits clearly and surfacing them in response headers reduces developer frustration and cuts down on support requests from confused clients.

Frequently Asked Questions

  1. Does rate limiting apply to all APIs?
    All well-designed public APIs implement rate limiting. Internal APIs running behind a private network may not enforce limits, but doing so is still a good practice because it protects against bugs in internal clients and gives visibility into unexpected usage patterns. If an internal service suddenly starts making ten times its normal number of requests, a rate limit will surface that anomaly before it causes an outage.
  2. Can I be rate limited by IP even when using an API key?
    Yes, depending on how the API is implemented. Many APIs apply limits at multiple layers simultaneously. An API key-based limit controls how much a specific integration can do, while an IP-based limit prevents abuse through key sharing or credential stuffing. You may hit either limit independently, and they may have different thresholds and reset windows.
  3. What tools implement rate limiting?
    Nginx's limit_req module implements leaky bucket rate limiting at the web server level. Kong API Gateway, AWS API Gateway, and Apigee offer managed rate limiting with per-plan configuration. For custom implementations, Redis-backed counters using sorted sets for sliding windows or expiring keys for fixed windows are the most common approach in application code.
  4. What is exponential backoff and why does it matter?
    Exponential backoff is a retry strategy where each successive retry waits twice as long as the previous one. For example, the first retry waits 1 second, the second waits 2 seconds, the third waits 4 seconds, and so on up to a maximum cap. Adding a small random jitter to each wait time prevents the thundering herd problem, where many clients recovering from a rate limit event all retry at exactly the same moment and immediately cause another spike. Exponential backoff with jitter is the recommended retry pattern for any client dealing with rate limited APIs.
  5. Should rate limits be enforced per endpoint or globally?
    It depends on how different your endpoints are in terms of cost. A single global limit is simpler to implement and communicate, but it does not account for the fact that some endpoints are far more expensive to serve than others. Applying tighter limits specifically to expensive operations, such as bulk exports, report generation, or machine learning inference endpoints, while allowing higher limits on cheap read endpoints gives a more accurate model of the actual load different request types impose on your infrastructure.

Conclusion

Rate limiting is a non-negotiable feature of any public API. It protects infrastructure from overload, prevents individual clients from degrading the experience for others, and provides a clear mechanism for differentiating service tiers. Implementing rate limits at the proxy or gateway layer keeps the protection as close to the edge as possible, and surfacing quota information through standard headers gives API consumers the transparency they need to build robust integrations. As a client, handling 429 responses gracefully with exponential backoff, proactive header monitoring, and intelligent caching is what separates a brittle integration from a production-ready one. See also HTTP status codes, REST API design, and idempotency.