Rate Limiting

8 min read · Updated 2026-04-25

Rate limiting controls how many requests a client (or tenant, IP, or other principal) can make in a given period. It's one of the most universally needed patterns in SaaS — for protecting against abuse, noisy neighbors, billing-tier violations, and accidental traffic spikes.

This lesson covers the algorithms, where to enforce limits, and the multi-dimensional approach that production multi-tenant SaaS needs.

Why Rate Limit

Abuse protection
Stop bots, scrapers, and brute-force attacks. Without rate limiting, a single abusive client can hammer your service.
Noisy neighbor mitigation
One tenant in a multi-tenant SaaS shouldn't degrade others. Rate limit per tenant.
Billing enforcement
Free tier 100 req/day; pro tier 10,000 req/day; enterprise unlimited. Rate limits enforce monetization.
Backend protection
If your database can handle 5,000 QPS sustained, cap upstream traffic so spikes don't cause cascading failure.

The Algorithms

Token Bucket

The most common, most flexible algorithm.

Bucket size: B (max tokens)
Refill rate: R (tokens per second)

On request:
  refill tokens earned since the last request (up to B)
  if bucket has tokens:
    take 1 token, allow request
  else:
    reject request

How it works
Tokens accumulate, requests consume them
Bucket of capacity B. Tokens refill at R/sec. Each request consumes 1 (or more). Empty bucket = rate limited.
Strengths
Burst-friendly + sustained limit
Allows short bursts up to B requests, sustained throughput of R requests/sec. Matches real traffic patterns where occasional bursts are normal.
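As a sketch, a single-process token bucket in Python (class name and interface are illustrative, not from any particular library):

```python
import time

class TokenBucket:
    """Token bucket: capacity B, refilled at R tokens/sec."""
    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity              # start full: allows an initial burst
        self.last_refill = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A burst of up to B requests drains the bucket; after that, sustained throughput settles at R requests/sec.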

Leaky Bucket

Bucket size: B (queue capacity)
Drain rate: R (requests per second)

On request:
  if bucket has space:
    add request to bucket
  else:
    reject request

Worker drains R requests per second from the bucket.

Smoother output rate — drains at constant R regardless of input pattern. Good for shaping traffic to a downstream system that needs predictable load.
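A minimal admission-side sketch (a real deployment would pair this with a worker that actually processes the queued requests; names are illustrative):

```python
import time
from collections import deque

class LeakyBucket:
    """Leaky bucket: queue of capacity B, drained at R requests/sec."""
    def __init__(self, capacity: int, drain_rate: float):
        self.capacity = capacity
        self.drain_rate = drain_rate
        self.queue = deque()
        self.last_drain = time.monotonic()

    def offer(self, request) -> bool:
        now = time.monotonic()
        # Simulate the worker: drain whole requests earned since the last check.
        drained = int((now - self.last_drain) * self.drain_rate)
        if drained:
            for _ in range(min(drained, len(self.queue))):
                self.queue.popleft()
            self.last_drain = now
        # Admit only if the queue has space; otherwise shed the request.
        if len(self.queue) < self.capacity:
            self.queue.append(request)
            return True
        return False
```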

Fixed Window Counter

Count requests per fixed time window (e.g., per minute).
At window boundary: reset counter.

On request:
  if counter < limit:
    increment counter, allow
  else:
    reject

Simple, but it has a boundary problem: a client can fire 2× the limit in a short span straddling two windows (e.g., the last second of one window plus the first second of the next). Usually replaced by sliding-window approaches.
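A sketch that makes the boundary problem concrete (timestamps passed in explicitly for clarity; names are illustrative):

```python
class FixedWindowCounter:
    """Fixed window: at most `limit` requests per aligned `window`-second window."""
    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.current_window = -1
        self.count = 0

    def allow(self, now: float) -> bool:
        w = int(now // self.window)   # which aligned window this instant falls in
        if w != self.current_window:  # crossed a boundary: reset the counter
            self.current_window = w
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False
```

With a 10 req/min limit, 10 requests at t=59s and 10 more at t=61s all pass: 20 requests in two seconds.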

Sliding Window Log

Track timestamps of recent requests. Count those in the last N seconds.

On request at time T:
  remove entries older than (T - window) from log
  if log size < limit:
    append T, allow
  else:
    reject

Most accurate; expensive memory-wise (one entry per request).
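The log can be sketched with a deque of timestamps (illustrative, single-process):

```python
from collections import deque

class SlidingWindowLog:
    """Exact limit: at most `limit` requests in any trailing `window` seconds."""
    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.log = deque()   # one timestamp per allowed request -> O(N) memory

    def allow(self, now: float) -> bool:
        # Evict timestamps that have fallen out of the trailing window.
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False
```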

Sliding Window Counter

Hybrid: keeps fixed-window counts but interpolates across the boundary.

current_count = (count_in_previous_window × overlap_fraction)
              + count_in_current_window

if current_count < limit: allow

The pragmatic choice. Approximate but uses constant memory; avoids boundary issues.
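The interpolation above can be sketched as follows (class and field names are illustrative):

```python
class SlidingWindowCounter:
    """Approximate limit from two fixed-window counts plus interpolation."""
    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.current_window = 0
        self.current_count = 0
        self.previous_count = 0

    def allow(self, now: float) -> bool:
        w = int(now // self.window)
        if w != self.current_window:
            # Slide forward; the previous count is zero if a whole window was skipped.
            self.previous_count = self.current_count if w == self.current_window + 1 else 0
            self.current_count = 0
            self.current_window = w
        # Fraction of the previous window still overlapping the trailing window.
        overlap = 1.0 - (now % self.window) / self.window
        if self.previous_count * overlap + self.current_count < self.limit:
            self.current_count += 1
            return True
        return False
```

Only two integers per key, regardless of request volume — that is the constant-memory property.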

Comparison

| Algorithm       | Burst | Smooth output | Memory | Common use                           |
|-----------------|-------|---------------|--------|--------------------------------------|
| Token bucket    | ✅    | ❌            | O(1)   | API rate limits, AWS APIs            |
| Leaky bucket    | ❌    | ✅            | O(B)   | Traffic shaping to downstream        |
| Fixed window    | ✅    | ❌            | O(1)   | Simple, accept the boundary issue    |
| Sliding log     | ❌    | ❌            | O(N)   | Precise limits where memory is cheap |
| Sliding counter | ✅    | ❌            | O(1)   | Production default for most APIs     |

Where to Enforce

CDN / Edge
Cloudflare, AWS WAF. Block at the edge before traffic hits your infrastructure. Best for DDoS and abuse.
API gateway
Kong, Ambassador, AWS API Gateway. Per-tenant, per-endpoint, per-tier limits. Most common production location.
Service mesh
Istio, Envoy. Internal service-to-service rate limits. Protects internal services from each other.
Application code
Distributed rate limiting via Redis. When you need per-user-action limits the gateway can't see (e.g., per-user-per-resource).

Distributed Rate Limiting

For multi-instance services, rate limiters must share state.

Centralized (Redis)
Single source of truth
All instances check/update Redis. Simple, accurate. Redis becomes critical infrastructure. Adds latency (often 1-2ms per request).
Distributed counter
Local + sync
Each instance tracks locally, syncs periodically. No single point. Eventual consistency in counts — small over-allowance possible during sync.

For most SaaS, Redis-based central rate limiting is the right answer. The 1-2ms cost is acceptable; the simplicity is real.

A canonical Redis pattern (a sliding window log kept in a sorted set):

# Lua script for atomicity: evict, count, and record in a single round trip
script = """
local key = KEYS[1]
local now = tonumber(ARGV[1])      -- current time, in seconds
local window = tonumber(ARGV[2])   -- window length, in seconds
local limit = tonumber(ARGV[3])
local member = ARGV[4]             -- unique per request; a bare timestamp would
                                   -- collide when two requests share a score

redis.call('ZREMRANGEBYSCORE', key, 0, now - window)
local count = redis.call('ZCARD', key)
if count < limit then
  redis.call('ZADD', key, now, member)
  redis.call('EXPIRE', key, window)
  return 1
end
return 0
"""

Multi-Dimensional Rate Limiting in SaaS

For a real multi-tenant SaaS, single-dimension limits aren't enough. You need multi-dimensional rate limiting:

Per tenant
Total requests per tenant, regardless of user. Caps tenant-level cost.
Per user
Per-user limits within a tenant. Stops one user (or compromised account) from monopolizing.
Per endpoint
Cheap endpoints (read) high limit; expensive endpoints (bulk export) low limit. Reflects backend cost.
Per tier
Free tier 100 req/min; pro 1,000; enterprise unlimited. Different limits per pricing plan.
Per API key
For programmatic access — separate quota per key, regardless of which user owns it.
Per IP
Last-line defense for unauthenticated abuse. Usually low limit.

Multiple dimensions are enforced simultaneously: a request must pass all of them to be allowed.
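A sketch of the composition (the `SimpleLimit` helper is purely illustrative; in production each dimension would be one of the limiters above, keyed by tenant, user, endpoint, and so on):

```python
class SimpleLimit:
    """Stand-in for a real per-dimension limiter; counts with no time component."""
    def __init__(self, limit: int):
        self.limit = limit
        self.count = 0

    def allow(self) -> bool:
        if self.count < self.limit:
            self.count += 1
            return True
        return False

def check_request(limiters) -> bool:
    # A request must pass every dimension. Note: all() short-circuits, so
    # earlier dimensions consume quota even when a later one rejects; a real
    # implementation should check all dimensions first, then commit (or
    # refund the consumed quota on rejection).
    return all(limiter.allow() for limiter in limiters)
```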

Response Patterns

When rate-limited, return:

HTTP 429 Too Many Requests
Standard status. Use this — not 503 or some custom code.
Retry-After header
Tell the client how long to wait. Either seconds (Retry-After: 60) or HTTP-date.
Rate limit headers
X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset. Lets clients self-throttle proactively.
Friendly error body
Specific reason (which limit was hit), upgrade path, support link. Don't make customers guess.
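Putting those pieces together, a rate-limited response might look like this (all values are illustrative):

```
HTTP/1.1 429 Too Many Requests
Retry-After: 60
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1767225660
Content-Type: application/json

{
  "error": "rate_limited",
  "message": "Per-tenant limit of 1000 requests/minute exceeded.",
  "docs": "https://example.com/docs/rate-limits"
}
```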

Client Behavior

Even with great rate limits, clients can DDoS you with thundering retries. Best-practice client behavior:

Respect Retry-After
Wait at least as long as the server says before retrying.
Exponential backoff
Double the delay after each failed attempt, up to a cap.
Jitter
Randomize delays so a fleet of clients doesn't retry in lockstep.
Give up eventually
Bound total retries and surface the error instead of retrying forever.
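A minimal client-side sketch of exponential backoff with full jitter (the function name and defaults are illustrative):

```python
import random
from typing import Optional

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  retry_after: Optional[float] = None) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    Honors the server's Retry-After when given; otherwise uses exponential
    backoff with full jitter so a fleet of clients doesn't retry in lockstep.
    """
    if retry_after is not None:
        return retry_after
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```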

Recap