HLD: Design a Rate Limiter

Design a distributed rate limiter โ€” token bucket, sliding window counter, Redis implementation, and where to enforce limits.

must hard โฑ 28 min hldrate-limitertoken-bucketsliding-windowredisdistributed
Mastery:
Why interviewers ask this
Rate limiter is a very common HLD question because it touches algorithms, distributed state, consistency trade-offs, and API gateway design.

Step 1: Requirements

Functional:

  • Limit each user to N requests per time window (e.g. 100 req/min)
  • Return HTTP 429 Too Many Requests when limit exceeded
  • Include headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset
  • Support multiple limit types: per-user, per-IP, per-API-key, per-endpoint

Non-functional:

  • Works across N app servers (distributed)
  • Low overhead: rate check adds < 5ms latency
  • Soft failures: if the rate limiter is down, allow traffic through (fail open) โ€” donโ€™t take down the whole API

Step 2: Algorithms

Token Bucket (most common in interviews)

Each user has a bucket with max N tokens.
Tokens refill at R tokens/second.
Each request consumes 1 token.
If bucket is empty โ†’ 429.

Burst allowed: user can consume all N tokens instantly (bursty traffic ok).

Implementation (in-memory โ€” single server):

import time
from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    capacity: int        # max tokens
    refill_rate: float   # tokens per second
    tokens: float = 0.0
    last_refill: float = field(default_factory=time.time)

    def allow(self) -> bool:
        now = time.time()
        elapsed = now - self.last_refill
        # Refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

Problem: This is per-server. In a distributed system, requests from the same user hit different servers โ†’ each server has its own bucket โ†’ effective limit is N ร— (number of servers).


Fixed Window Counter

Divide time into fixed windows (e.g., 60-second buckets).
Count requests per user per window.
If count > limit โ†’ 429.

Redis implementation:

import redis
import time

r = redis.Redis()

def is_allowed(user_id: str, limit: int, window_secs: int) -> bool:
    window_key = int(time.time()) // window_secs
    key = f"rl:{user_id}:{window_key}"
    
    # Atomic increment + set expiry
    count = r.incr(key)
    if count == 1:
        r.expire(key, window_secs * 2)  # 2x window to be safe
    
    return count <= limit

Problem: boundary spike. If the window resets at :00 and a user sends 100 requests at :59 and 100 more at :01, they send 200 requests in 2 seconds but never exceed the per-minute limit within any single window.


Sliding Window Counter (best for most cases)

Approximates a sliding window using two fixed-window counters:

current_window_count + (previous_window_count ร— overlap_ratio)
def is_allowed_sliding(user_id: str, limit: int, window_secs: int) -> bool:
    now = time.time()
    current_window = int(now) // window_secs
    prev_window = current_window - 1
    
    # Overlap ratio: how much of the previous window is in our sliding window
    elapsed_in_current = now % window_secs
    overlap = 1 - (elapsed_in_current / window_secs)
    
    current_key = f"rl:{user_id}:{current_window}"
    prev_key = f"rl:{user_id}:{prev_window}"
    
    # Atomic pipeline
    pipe = r.pipeline()
    pipe.get(prev_key)
    pipe.incr(current_key)
    pipe.expire(current_key, window_secs * 2)
    prev_count, current_count, _ = pipe.execute()
    
    estimated = int(prev_count or 0) * overlap + int(current_count)
    
    if estimated > limit:
        # Rollback the increment
        r.decr(current_key)
        return False
    return True

Sliding Window Log (most accurate, most memory)

Store a sorted set of timestamps per user. Count entries in the last N seconds.

def is_allowed_log(user_id: str, limit: int, window_secs: int) -> bool:
    now = time.time()
    window_start = now - window_secs
    key = f"rl:log:{user_id}"
    
    pipe = r.pipeline()
    # Remove timestamps outside the window
    pipe.zremrangebyscore(key, 0, window_start)
    # Count remaining
    pipe.zcard(key)
    # Add current request timestamp
    pipe.zadd(key, {str(now): now})
    pipe.expire(key, window_secs)
    _, count, _, _ = pipe.execute()
    
    return (count + 1) <= limit

Memory: O(unique requests in window) per user โ€” expensive at high scale. Use for low-traffic, high-accuracy scenarios.


Step 3: Distributed rate limiter architecture

Client
  โ”‚
  โ–ผ
[ API Gateway / Nginx / Envoy ]
  โ”‚  โ† enforce rate limits here (before reaching app servers)
  โ”‚
  โ–ผ
[ Rate Limiter Middleware ]
  โ”‚  reads/writes โ†’ Redis Cluster
  โ”‚
  โ”œโ”€ Redis Node 1 (users A-M)  โ† consistent hashing
  โ”œโ”€ Redis Node 2 (users N-Z)
  โ””โ”€ Redis Node 3 (replica)
  
  App servers (stateless โ€” no local rate limit state)

Why Redis?

  • INCR is atomic โ€” no race conditions for the increment operation
  • Sub-millisecond reads/writes
  • Built-in TTL for automatic key expiry
  • Redis Cluster for horizontal scaling

Step 4: Race condition & Lua scripts

INCR + EXPIRE is two commands โ€” not atomic. If the server crashes between them, the key has no TTL and never expires (ghost user forever blocked at limit).

Fix: Lua script (executes atomically on Redis):

-- KEYS[1] = rate limit key
-- ARGV[1] = limit, ARGV[2] = window_secs

local count = redis.call('INCR', KEYS[1])
if count == 1 then
  redis.call('EXPIRE', KEYS[1], ARGV[2])
end
if count > tonumber(ARGV[1]) then
  return 0  -- rejected
end
return 1    -- allowed
rate_limit_script = r.register_script(lua_script)

def is_allowed(user_id: str, limit: int, window_secs: int) -> bool:
    key = f"rl:{user_id}:{int(time.time()) // window_secs}"
    result = rate_limit_script(keys=[key], args=[limit, window_secs * 2])
    return bool(result)

Step 5: Response headers

HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 43
X-RateLimit-Reset: 1720000060   โ† Unix timestamp when window resets
Retry-After: 30                 โ† only on 429; seconds to wait
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
Retry-After: 30

Step 6: Where to enforce

LocationProsCons
API Gateway (Kong, AWS API GW)Central; protects all servicesGateway is a potential bottleneck
Reverse proxy (nginx, Envoy)Fast; before app serverHarder to implement per-user logic
Middleware (Express/FastAPI)Full access to auth contextDistributed state still needs Redis
Client SDKReduces server hitsClient can be bypassed

Best practice: Enforce at API gateway for global protection + re-enforce in service middleware for business-logic limits (per-endpoint, per-plan tier).


Comparison table

AlgorithmAccuracyMemoryBurstImplementation
Token bucketGoodLowAllowsMedium
Fixed windowOK (boundary spike)Very lowNot controlledEasy
Sliding window counterGood (approx.)LowNot controlledMedium
Sliding window logExactHighNot controlledHard

Say it out loud
โ€œIโ€™d use a sliding window counter backed by Redis. Each request atomically increments a key for the current window and reads the previous windowโ€™s count; the weighted sum gives an approximate request count over the true sliding window. I use a Lua script to make INCR + EXPIRE atomic. The rate limiter runs as middleware before reaching app servers and reads from a Redis cluster โ€” consistent hashing maps each user to a shard. If Redis is unavailable, I fail open (allow traffic) rather than taking down the API.โ€

Likely follow-up questions
  • What is the difference between token bucket and leaky bucket?
  • How do you implement a sliding window counter?
  • How do you rate-limit in a distributed environment where requests hit different servers?
  • Where should you enforce rate limits โ€” API gateway, middleware, or service level?
  • How do you handle the race condition in a Redis-based rate limiter?

References