Distributed Rate Limiter: The System Design Interview Walkthrough

Most rate limiter interviews go fine until minute 28. You've drawn the Redis box, written INCR, explained the 429 response. Then the interviewer leans in: "What happens when one of your power users fires 50,000 requests per second?"

Your Redis key is now a hot partition. Your rate limiter is the bottleneck. The whole design falls apart at exactly the moment you need it most.

This walkthrough gets you past minute 28.

Clarify First, Draw Second

Five questions shape everything. Don't touch the whiteboard until you've asked them.

What are we rate limiting? API calls, login attempts, message sends, something else?
What's the limiting dimension? Per user, per IP, per API key, or a combination?
What scale are we designing for? (QPS and user count determine the storage tier)
What happens when the limit is hit? Hard block, queue, or graceful degradation?
How accurate does it have to be? Can we allow 1-5% over-limit for better performance?

That last question is the one candidates skip, and it's the one that signals senior-level thinking. Cloudflare's production rate limiter deliberately accepts a 0.003% error rate to avoid the coordination overhead that perfect accuracy requires. Real systems make this tradeoff explicitly.

For this walkthrough: 10 million users, 1,000 requests per user per minute, 99.9% uptime. Hard block at limit. 1-5% tolerance is acceptable.

Pick the Algorithm Before You Design the System

The algorithm shapes the data model. Pick the wrong one and the whole design needs rebuilding halfway through. Candidates do this constantly.

Fixed window counter is the naive choice. Increment a key like user:123:minute:1718089380 on every request. Simple. But a user can fire 1,000 requests in the last second of one window and 1,000 more in the first second of the next, doubling the allowed rate at the boundary. Happy hour at midnight, every minute.

Sliding window log is accurate but expensive. Store every request timestamp in a sorted set, then count events in the last 60 seconds. Memory grows linearly with requests. At 1,000 req/min across 10 million users, that's tens of billions of entries. Your Redis bill arrives looking like a ransom note.

Sliding window counter is the right answer for this scale.

The idea: approximate the sliding window using two adjacent fixed-window counters. The estimate at any point in time is:

estimated = previous_count × overlap + current_count

Where overlap = (window_size - elapsed_time_in_current_window) / window_size. If you're 45 seconds into a 60-second window, 25% of the previous window's count still applies.

def is_rate_limited(user_id: str, limit: int, window_seconds: int) -> bool:
    now = time.time()
    current_window = int(now // window_seconds) * window_seconds
    previous_window = current_window - window_seconds
    elapsed = now - current_window
    overlap = 1.0 - (elapsed / window_seconds)

    current_key = f"rl:{user_id}:{current_window}"
    prev_key = f"rl:{user_id}:{previous_window}"

    current_count, prev_count = redis.mget(current_key, prev_key)
    current_count = int(current_count or 0)
    prev_count = int(prev_count or 0)

    estimated = prev_count * overlap + current_count
    return estimated >= limit

Here's what the weighted overlap looks like at 45 seconds elapsed:

$Two adjacent rate limit windows showing the 25% overlap calculation. The previous window contributes a weighted fraction of its count to the current estimate.$

45s into a 60s window: only 25% of the previous window's traffic counts. Two Redis keys, one formula.

Cloudflare validated this approach on 400 million requests and measured a 0.003% error rate. Two Redis keys per user regardless of traffic volume. Two reads per check. The math is embarrassingly simple and the accuracy is more than good enough.

Token bucket is also valid and worth mentioning, especially if burst tolerance is a feature requirement. See the tradeoffs in depth here.

The Distributed Rate Limiter Architecture Fits on a Napkin

Architecture diagram showing Client, Load Balancer, and API Gateway routing to a Redis Cluster. The gateway returns allow or 429 based on Redis counter state, and only allowed requests reach Backend Services.

Reject at the gateway. Backend never pays the cost of over-limit requests.

The rate limiter belongs in the API gateway, not in individual backend services. Centralizing enforcement means you reject requests before they consume expensive downstream compute. A backend that never receives an over-limit request can't be overwhelmed by one.

The gateway makes one Redis call per request. If the counter is at limit, return 429 Too Many Requests with a Retry-After header. If Redis is unavailable, more on that shortly.

The Race Condition Lives in Your Data Model

Here's the naive Redis implementation:

count = redis.get(key)
if count and int(count) >= limit:
    return 429
redis.incr(key)
redis.expire(key, window_size * 2)

This has a race condition. Between GET and INCR, another gateway instance reads the same count and both allow the request. Under load, with 100 gateway nodes, you can exceed the limit by 100x before any check fires. Congratulations, your rate limiter is rate-limited by its own correctness.

The fix is a Lua script. Redis executes Lua atomically, serialized at the server, no interleaving possible:

local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])

local count = redis.call('INCR', key)
if count == 1 then
    redis.call('EXPIRE', key, window * 2)
end

return count

One network round trip. One atomic operation. This is the answer to "how do you prevent over-counting under concurrent load." The TTL is 2x the window so you always have the previous window's count without accumulating stale keys indefinitely.

For the full sliding window counter, the Lua script reads two keys, computes the weighted estimate, and returns allow or deny. All atomically.

Where This Breaks Under Real Traffic

Hot Keys Serialize Through One Redis Core

With 50,000 requests per second from one user hitting the same key across all gateway instances, every INCR serializes through a single Redis slot. A Redis node handles roughly 100,000 operations per second total. One abusive user pegs an entire shard and degrades rate limiting for everyone else on it. Your angry user accidentally becomes a denial-of-service attack on your infrastructure.

This is the hot partition problem, and it has a specific playbook.

Side-by-side comparison of hot key problem. Left: all gateway instances write to one Redis shard on every request, saturating it. Right: each gateway keeps a local cache and syncs to Redis every N requests, reducing load 100x.

Without local caching, one abusive user can peg an entire Redis shard. With caching, 99% of writes never leave the gateway.

Three solutions at increasing cost:

Local caching: Cache the counter value in each gateway instance for 100-500ms. Accept minor over-counting during that window. Each gateway writes to Redis once every N requests instead of N times. Reduces Redis pressure by 100x. This is what most production systems do.

Cell-based sharding: Shard the counter across N Redis keys, rl:user_123:shard_0 through rl:user_123:shard_7. Each request picks a shard. To check the total, sum all shards. Reduces write contention by 8x but adds 8 reads per check.

Token bucket with async sync: Each gateway maintains a local token bucket. A background job syncs state to Redis every 100ms. The coordinator distributes token refills. Extremely fast at the expense of some state complexity. This is what Kong uses at high scale.

Pick local caching first. Introduce cell-based sharding only when you can measure the hot key contention.

What Happens When Redis Goes Down

Two choices. Neither is obviously right.

Fail open: Allow all requests during the outage. No downtime, but rate limiting stops working. An attacker who knows your Redis is flaky can deliberately exploit the window. Not great.

Fail closed: Block all requests. Rate limiting holds, but your service goes dark. Also not great, in the other direction.

For most API rate limiters, fail open is correct. You're protecting against abuse, not enforcing exclusion for safety. A brief window of unthrottled traffic is better than a full outage. Your SLA cares more about availability than perfect throttling accuracy.

The production answer is fail open with a local fallback: when Redis is unreachable, each gateway instance applies its own in-memory counter at 1/N of the global limit (where N is the number of gateway nodes). You get approximate rate limiting without any Redis dependency. Wrap it in a circuit breaker that detects Redis failures and switches modes automatically.

Consistent Hashing Routes State Without Coordination

An alternative to centralized Redis: route each user's requests to the same gateway node using consistent hashing. That node owns the rate limit state locally. No cross-node Redis coordination.

This is fast (no network hop to Redis) but loses state when a node goes down and requires sticky routing at the load balancer. Worth mentioning as a performance optimization for single-datacenter deployments. For multi-region or stateless gateways, centralized Redis is cleaner.

Accuracy vs Performance: You Can't Have Both

Decision	Accurate	Fast
Counting	Centralized Lua script	Local cache + periodic sync
Algorithm	Sliding window log	Sliding window counter
Redis failure	Fail closed	Fail open with local fallback
Routing	Consistent hashing	Random with centralized state

The load-bearing insight: every percentage point of accuracy you add costs latency or hot-key pressure. Stripe, Cloudflare, and Uber all run approximate rate limiters in production because the alternative introduces bottlenecks at the exact moment you need rate limiting most, which is traffic spikes.

State this tradeoff explicitly. Interviewers want to hear you reason about it, not just pick one side and hope they don't probe.

Pace Yourself: The 45-Minute Clock Has No Mercy

Interviewers are pacing you whether or not they say so. They have a rubric with checkboxes. Arriving at "distributed challenges" at minute 38 with 7 minutes left means those boxes stay empty.

0-5 min: Five clarifying questions. Don't draw anything yet.
5-15 min: Algorithm choice. Explain fixed window's boundary problem, sliding window log's memory cost, why sliding window counter wins. Show the formula.
15-25 min: High-level design. Three boxes (client, gateway, Redis). Mention the Lua script for atomicity.
25-35 min: Distributed challenges. Hot keys, consistent hashing, Redis failure behavior. This is where candidates separate.
35-42 min: Scale math. 10M users × 2 keys × 8 bytes = 160MB of Redis state. Trivial. The real bottleneck is write QPS. Assume 1% of users are active at peak: 100K users × 1000 req/min / 60 ≈ 1.7M writes/sec, needing ~17 Redis shards. With local caching absorbing 99% of writes, that drops to 2-3 shards.
42-45 min: What you'd add with more time. Rate limit analytics, per-endpoint limits, tenant-level quotas, async Redis writes.

The classic mistake is spending 20 minutes on algorithm comparison and reaching the distributed discussion at minute 38. The hot key problem, the failure mode, and the consistency tradeoff are where strong candidates demonstrate depth. Get there with time to breathe.

Quick Recap

Sliding window counter: two Redis keys per user, 0.003% error rate validated at Cloudflare scale
Lua script for atomicity eliminates the GET-then-INCR race condition
Hot keys serialize through one Redis core; local caching solves it for most cases
Fail open on Redis downtime with a local approximation fallback
State the accuracy vs performance tradeoff explicitly; don't pretend there isn't one
Budget 10 minutes for the distributed challenges section; that's where seniority shows

The distributed discussion is much easier to practice under pressure than to reason about cold. SpaceComplexity runs voice-based system design mock interviews with rubric feedback, so you can feel exactly what minute 28 is like before it counts.