Rate Limiting

8 min read · Updated 2026-04-25

Rate limiting controls how many requests a client (or tenant, IP, or other principal) can make in a given period. It's one of the most universally needed patterns in SaaS — for protecting against abuse, noisy neighbors, billing-tier violations, and accidental traffic spikes.

This lesson covers the algorithms, where to enforce limits, and the multi-dimensional approach that production multi-tenant SaaS needs.

Why Rate Limit

Abuse protection
Stop bots, scrapers, and brute-force attacks. Without rate limiting, a single abusive client can hammer your service.
Noisy neighbor mitigation
One tenant in a multi-tenant SaaS shouldn't degrade others. Rate limit per tenant.
Billing enforcement
Free tier 100 req/day; pro tier 10,000 req/day; enterprise unlimited. Rate limits enforce monetization.
Backend protection
If your database can handle 5,000 QPS sustained, cap upstream traffic so spikes don't cause cascading failure.

The Algorithms

Token Bucket

The most common, most flexible algorithm.

Bucket size: B (max tokens)
Refill rate: R (tokens per second)

On request:
  refill tokens earned since the last request (up to B)
  if bucket has tokens:
    take 1 token, allow request
  else:
    reject request

How it works
Tokens accumulate, requests consume them
Bucket of capacity B. Tokens refill at R/sec. Each request consumes 1 (or more). Empty bucket = rate limited.
Strengths
Burst-friendly + sustained limit
Allows short bursts up to B requests, sustained throughput of R requests/sec. Matches real traffic patterns where occasional bursts are normal.
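As a sketch, a single-process token bucket in Python (class name and interface are illustrative, not from any particular library):

```python
import time

class TokenBucket:
    """Token bucket: capacity B, refilled at R tokens/sec."""
    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity              # start full: allows an initial burst
        self.last_refill = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A burst of up to B requests drains the bucket; after that, sustained throughput settles at R requests/sec.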

Leaky Bucket

Bucket size: B (queue capacity)
Drain rate: R (requests per second)

On request:
  if bucket has space:
    add request to bucket
  else:
    reject request

Worker drains R requests per second from the bucket.

Smoother output rate — drains at constant R regardless of input pattern. Good for shaping traffic to a downstream system that needs predictable load.
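A minimal admission-side sketch (a real deployment would pair this with a worker that actually processes the queued requests; names are illustrative):

```python
import time
from collections import deque

class LeakyBucket:
    """Leaky bucket: queue of capacity B, drained at R requests/sec."""
    def __init__(self, capacity: int, drain_rate: float):
        self.capacity = capacity
        self.drain_rate = drain_rate
        self.queue = deque()
        self.last_drain = time.monotonic()

    def offer(self, request) -> bool:
        now = time.monotonic()
        # Simulate the worker: drain whole requests earned since the last check.
        drained = int((now - self.last_drain) * self.drain_rate)
        if drained:
            for _ in range(min(drained, len(self.queue))):
                self.queue.popleft()
            self.last_drain = now
        # Admit only if the queue has space; otherwise shed the request.
        if len(self.queue) < self.capacity:
            self.queue.append(request)
            return True
        return False
```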

Fixed Window Counter

Count requests per fixed time window (e.g., per minute).
At window boundary: reset counter.

On request:
  if counter < limit:
    increment counter, allow
  else:
    reject

Simple, but it has a boundary problem: a client can fire 2× the limit in a short span straddling two windows (e.g., the last second of one window plus the first second of the next). Usually replaced by sliding-window approaches.
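A sketch that makes the boundary problem concrete (timestamps passed in explicitly for clarity; names are illustrative):

```python
class FixedWindowCounter:
    """Fixed window: at most `limit` requests per aligned `window`-second window."""
    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.current_window = -1
        self.count = 0

    def allow(self, now: float) -> bool:
        w = int(now // self.window)   # which aligned window this instant falls in
        if w != self.current_window:  # crossed a boundary: reset the counter
            self.current_window = w
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False
```

With a 10 req/min limit, 10 requests at t=59s and 10 more at t=61s all pass: 20 requests in two seconds.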

Sliding Window Log

Track timestamps of recent requests. Count those in the last N seconds.

On request at time T:
  remove entries older than (T - window) from log
  if log size < limit:
    append T, allow
  else:
    reject

Most accurate; expensive memory-wise (one entry per request).
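The log can be sketched with a deque of timestamps (illustrative, single-process):

```python
from collections import deque

class SlidingWindowLog:
    """Exact limit: at most `limit` requests in any trailing `window` seconds."""
    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.log = deque()   # one timestamp per allowed request -> O(N) memory

    def allow(self, now: float) -> bool:
        # Evict timestamps that have fallen out of the trailing window.
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False
```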

Sliding Window Counter

Hybrid: keeps fixed-window counts but interpolates across the boundary.

current_count = (count_in_previous_window × overlap_fraction)
              + count_in_current_window

if current_count < limit: allow

The pragmatic choice. Approximate but uses constant memory; avoids boundary issues.
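The interpolation above can be sketched as follows (class and field names are illustrative):

```python
class SlidingWindowCounter:
    """Approximate limit from two fixed-window counts plus interpolation."""
    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.current_window = 0
        self.current_count = 0
        self.previous_count = 0

    def allow(self, now: float) -> bool:
        w = int(now // self.window)
        if w != self.current_window:
            # Slide forward; the previous count is zero if a whole window was skipped.
            self.previous_count = self.current_count if w == self.current_window + 1 else 0
            self.current_count = 0
            self.current_window = w
        # Fraction of the previous window still overlapping the trailing window.
        overlap = 1.0 - (now % self.window) / self.window
        if self.previous_count * overlap + self.current_count < self.limit:
            self.current_count += 1
            return True
        return False
```

Only two integers per key, regardless of request volume — that is the constant-memory property.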

Comparison

| Algorithm       | Burst | Smooth output | Memory | Common use                           |
|-----------------|-------|---------------|--------|--------------------------------------|
| Token bucket    | ✅    | ❌            | O(1)   | API rate limits, AWS APIs            |
| Leaky bucket    | ❌    | ✅            | O(B)   | Traffic shaping to downstream        |
| Fixed window    | ✅    | ❌            | O(1)   | Simple, accept the boundary issue    |
| Sliding log     | ❌    | ❌            | O(N)   | Precise limits where memory is cheap |
| Sliding counter | ✅    | ❌            | O(1)   | Production default for most APIs     |

Where to Enforce

CDN / Edge
Cloudflare, AWS WAF. Block at the edge before traffic hits your infrastructure. Best for DDoS and abuse.
API gateway
Kong, Ambassador, AWS API Gateway. Per-tenant, per-endpoint, per-tier limits. Most common production location.
Service mesh
Istio, Envoy. Internal service-to-service rate limits. Protects internal services from each other.
Application code
Distributed rate limiting via Redis. When you need per-user-action limits the gateway can't see (e.g., per-user-per-resource).

Distributed Rate Limiting

For multi-instance services, rate limiters must share state.

Centralized (Redis)
Single source of truth
All instances check/update Redis. Simple, accurate. Redis becomes critical infrastructure. Adds latency (often 1-2ms per request).
Distributed counter
Local + sync
Each instance tracks locally, syncs periodically. No single point. Eventual consistency in counts — small over-allowance possible during sync.

For most SaaS, Redis-based central rate limiting is the right answer. The 1-2ms cost is acceptable; the simplicity is real.

A canonical Redis pattern (a sliding window log kept in a sorted set):

# Lua script for atomicity: evict, count, and record in a single round trip
script = """
local key = KEYS[1]
local now = tonumber(ARGV[1])      -- current time, in seconds
local window = tonumber(ARGV[2])   -- window length, in seconds
local limit = tonumber(ARGV[3])
local member = ARGV[4]             -- unique per request; a bare timestamp would
                                   -- collide when two requests share a score

redis.call('ZREMRANGEBYSCORE', key, 0, now - window)
local count = redis.call('ZCARD', key)
if count < limit then
  redis.call('ZADD', key, now, member)
  redis.call('EXPIRE', key, window)
  return 1
end
return 0
"""

Multi-Dimensional Rate Limiting in SaaS

For a real multi-tenant SaaS, single-dimension limits aren't enough. You need multi-dimensional rate limiting:

Per tenant
Total requests per tenant, regardless of user. Caps tenant-level cost.
Per user
Per-user limits within a tenant. Stops one user (or compromised account) from monopolizing.
Per endpoint
Cheap endpoints (read) high limit; expensive endpoints (bulk export) low limit. Reflects backend cost.
Per tier
Free tier 100 req/min; pro 1,000; enterprise unlimited. Different limits per pricing plan.
Per API key
For programmatic access — separate quota per key, regardless of which user owns it.
Per IP
Last-line defense for unauthenticated abuse. Usually low limit.

Multiple dimensions are enforced simultaneously: a request must pass all of them to be allowed.
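A sketch of the composition (the `SimpleLimit` helper is purely illustrative; in production each dimension would be one of the limiters above, keyed by tenant, user, endpoint, and so on):

```python
class SimpleLimit:
    """Stand-in for a real per-dimension limiter; counts with no time component."""
    def __init__(self, limit: int):
        self.limit = limit
        self.count = 0

    def allow(self) -> bool:
        if self.count < self.limit:
            self.count += 1
            return True
        return False

def check_request(limiters) -> bool:
    # A request must pass every dimension. Note: all() short-circuits, so
    # earlier dimensions consume quota even when a later one rejects; a real
    # implementation should check all dimensions first, then commit (or
    # refund the consumed quota on rejection).
    return all(limiter.allow() for limiter in limiters)
```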

Response Patterns

When rate-limited, return:

HTTP 429 Too Many Requests
Standard status. Use this — not 503 or some custom code.
Retry-After header
Tell the client how long to wait. Either seconds (Retry-After: 60) or HTTP-date.
Rate limit headers
X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset. Lets clients self-throttle proactively.
Friendly error body
Specific reason (which limit was hit), upgrade path, support link. Don't make customers guess.
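Putting those pieces together, a rate-limited response might look like this (all values are illustrative):

```
HTTP/1.1 429 Too Many Requests
Retry-After: 60
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1767225660
Content-Type: application/json

{
  "error": "rate_limited",
  "message": "Per-tenant limit of 1000 requests/minute exceeded.",
  "docs": "https://example.com/docs/rate-limits"
}
```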

Client Behavior

Even with great rate limits, clients can DDoS you with thundering retries. Best-practice client behavior:

Respect Retry-After
Wait at least as long as the server says before retrying.
Exponential backoff
Double the delay after each failed attempt, up to a cap.
Jitter
Randomize delays so a fleet of clients doesn't retry in lockstep.
Give up eventually
Bound total retries and surface the error instead of retrying forever.
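A minimal client-side sketch of exponential backoff with full jitter (the function name and defaults are illustrative):

```python
import random
from typing import Optional

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  retry_after: Optional[float] = None) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    Honors the server's Retry-After when given; otherwise uses exponential
    backoff with full jitter so a fleet of clients doesn't retry in lockstep.
    """
    if retry_after is not None:
        return retry_after
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```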

Recap