Step 1: Requirements
Functional:
- Limit each user to N requests per time window (e.g. 100 req/min)
- Return HTTP 429 Too Many Requests when limit exceeded
- Include headers:
X-RateLimit-Limit,X-RateLimit-Remaining,X-RateLimit-Reset - Support multiple limit types: per-user, per-IP, per-API-key, per-endpoint
Non-functional:
- Works across N app servers (distributed)
- Low overhead: rate check adds < 5ms latency
- Soft failures: if the rate limiter is down, allow traffic through (fail open) โ donโt take down the whole API
Step 2: Algorithms
Token Bucket (most common in interviews)
Each user has a bucket with max N tokens.
Tokens refill at R tokens/second.
Each request consumes 1 token.
If bucket is empty โ 429.
Burst allowed: user can consume all N tokens instantly (bursty traffic ok).
Implementation (in-memory โ single server):
import time
from dataclasses import dataclass, field
@dataclass
class TokenBucket:
capacity: int # max tokens
refill_rate: float # tokens per second
tokens: float = 0.0
last_refill: float = field(default_factory=time.time)
def allow(self) -> bool:
now = time.time()
elapsed = now - self.last_refill
# Refill
self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
self.last_refill = now
if self.tokens >= 1:
self.tokens -= 1
return True
return False
Problem: This is per-server. In a distributed system, requests from the same user hit different servers โ each server has its own bucket โ effective limit is N ร (number of servers).
Fixed Window Counter
Divide time into fixed windows (e.g., 60-second buckets).
Count requests per user per window.
If count > limit โ 429.
Redis implementation:
import redis
import time
r = redis.Redis()
def is_allowed(user_id: str, limit: int, window_secs: int) -> bool:
window_key = int(time.time()) // window_secs
key = f"rl:{user_id}:{window_key}"
# Atomic increment + set expiry
count = r.incr(key)
if count == 1:
r.expire(key, window_secs * 2) # 2x window to be safe
return count <= limit
Problem: boundary spike. If the window resets at :00 and a user sends 100 requests at :59 and 100 more at :01, they send 200 requests in 2 seconds but never exceed the per-minute limit within any single window.
Sliding Window Counter (best for most cases)
Approximates a sliding window using two fixed-window counters:
current_window_count + (previous_window_count ร overlap_ratio)
def is_allowed_sliding(user_id: str, limit: int, window_secs: int) -> bool:
now = time.time()
current_window = int(now) // window_secs
prev_window = current_window - 1
# Overlap ratio: how much of the previous window is in our sliding window
elapsed_in_current = now % window_secs
overlap = 1 - (elapsed_in_current / window_secs)
current_key = f"rl:{user_id}:{current_window}"
prev_key = f"rl:{user_id}:{prev_window}"
# Atomic pipeline
pipe = r.pipeline()
pipe.get(prev_key)
pipe.incr(current_key)
pipe.expire(current_key, window_secs * 2)
prev_count, current_count, _ = pipe.execute()
estimated = int(prev_count or 0) * overlap + int(current_count)
if estimated > limit:
# Rollback the increment
r.decr(current_key)
return False
return True
Sliding Window Log (most accurate, most memory)
Store a sorted set of timestamps per user. Count entries in the last N seconds.
def is_allowed_log(user_id: str, limit: int, window_secs: int) -> bool:
now = time.time()
window_start = now - window_secs
key = f"rl:log:{user_id}"
pipe = r.pipeline()
# Remove timestamps outside the window
pipe.zremrangebyscore(key, 0, window_start)
# Count remaining
pipe.zcard(key)
# Add current request timestamp
pipe.zadd(key, {str(now): now})
pipe.expire(key, window_secs)
_, count, _, _ = pipe.execute()
return (count + 1) <= limit
Memory: O(unique requests in window) per user โ expensive at high scale. Use for low-traffic, high-accuracy scenarios.
Step 3: Distributed rate limiter architecture
Client
โ
โผ
[ API Gateway / Nginx / Envoy ]
โ โ enforce rate limits here (before reaching app servers)
โ
โผ
[ Rate Limiter Middleware ]
โ reads/writes โ Redis Cluster
โ
โโ Redis Node 1 (users A-M) โ consistent hashing
โโ Redis Node 2 (users N-Z)
โโ Redis Node 3 (replica)
App servers (stateless โ no local rate limit state)
Why Redis?
- INCR is atomic โ no race conditions for the increment operation
- Sub-millisecond reads/writes
- Built-in TTL for automatic key expiry
- Redis Cluster for horizontal scaling
Step 4: Race condition & Lua scripts
INCR + EXPIRE is two commands โ not atomic. If the server crashes between them, the key has no TTL and never expires (ghost user forever blocked at limit).
Fix: Lua script (executes atomically on Redis):
-- KEYS[1] = rate limit key
-- ARGV[1] = limit, ARGV[2] = window_secs
local count = redis.call('INCR', KEYS[1])
if count == 1 then
redis.call('EXPIRE', KEYS[1], ARGV[2])
end
if count > tonumber(ARGV[1]) then
return 0 -- rejected
end
return 1 -- allowed
rate_limit_script = r.register_script(lua_script)
def is_allowed(user_id: str, limit: int, window_secs: int) -> bool:
key = f"rl:{user_id}:{int(time.time()) // window_secs}"
result = rate_limit_script(keys=[key], args=[limit, window_secs * 2])
return bool(result)
Step 5: Response headers
HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 43
X-RateLimit-Reset: 1720000060 โ Unix timestamp when window resets
Retry-After: 30 โ only on 429; seconds to wait
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
Retry-After: 30
Step 6: Where to enforce
| Location | Pros | Cons |
|---|---|---|
| API Gateway (Kong, AWS API GW) | Central; protects all services | Gateway is a potential bottleneck |
| Reverse proxy (nginx, Envoy) | Fast; before app server | Harder to implement per-user logic |
| Middleware (Express/FastAPI) | Full access to auth context | Distributed state still needs Redis |
| Client SDK | Reduces server hits | Client can be bypassed |
Best practice: Enforce at API gateway for global protection + re-enforce in service middleware for business-logic limits (per-endpoint, per-plan tier).
Comparison table
| Algorithm | Accuracy | Memory | Burst | Implementation |
|---|---|---|---|---|
| Token bucket | Good | Low | Allows | Medium |
| Fixed window | OK (boundary spike) | Very low | Not controlled | Easy |
| Sliding window counter | Good (approx.) | Low | Not controlled | Medium |
| Sliding window log | Exact | High | Not controlled | Hard |