Rate Limiter System Design: Most Candidates Skip Clarification

- Fixed window counter has a boundary spike bug: two adjacent windows each at the limit allow 2x requests in 2 seconds with no violation recorded.
- Sliding window counter is the practical default: two integers per user, 0.003% error rate measured by Cloudflare, no sorted-set memory blowup.
- Token bucket absorbs controlled bursts and is Stripe's production choice; all state lives in two Redis hash fields per user updated atomically via Lua.
- Redis Lua scripts make token bucket atomic. A plain
INCR + EXPIREpair has a race condition if the process dies between commands. - Fail open with alerting beats fail closed for availability-sensitive APIs; blocking all traffic during a Redis outage is almost always the worse outcome.
- Name two algorithm options with their tradeoffs, then commit to one; "it depends" without a condition and a choice is not a complete answer.
You have a server. It's running. Someone is hitting it 10,000 times a second. What do you do?
"Token bucket," says the candidate, and starts drawing boxes. The interviewer writes down: jumped to solution.
Rate limiter system design is one of those interview problems where everyone knows the algorithm names and almost nobody knows which question they're answering. The algorithms are easy to look up. What matters is whether you can establish scope, pick the right tradeoff for the stated requirements, and explain why. This covers the whole path from clarifying questions to Redis keys to scaling gotchas, in the order you'd present it in a 45-minute interview.
Ask the Questions Your Interviewer Already Wrote Down
Spend the first five minutes asking questions that drive algorithm selection. These are the ones that matter:
Who or what gets limited? User ID, API key, IP address, or a combination? IP is easy to spoof. User ID requires an authenticated request. API key is the most common choice for developer-facing APIs.
What shape of limit? Requests per second, per minute, per day? Multi-tier (100/min and 10,000/day simultaneously)? Should short bursts be tolerated, or do you need smooth output?
How accurate? A 0.003% error rate is fine for protecting an API from abuse. It is not fine for a billing system counting transactions.
Where does the limiter live? Inside each application server, at an API gateway layer, or as a dedicated service?
What happens when the rate limiter fails? Fail open (allow all traffic) or fail closed (reject all traffic)?
None of this is stalling. It is the difference between designing a token bucket and designing a leaky bucket. Naming those questions signals that you understand the problem before you start solving it.
Ask these first. Your algorithm choice falls out of the answers.
Five Algorithms. One Is Right for Your Requirements.
Fixed Window Counter: Simple, Broken in One Specific Way
Fixed window counter divides time into intervals and increments a counter per interval. When the counter exceeds the limit, reject. Simple to implement with a single Redis INCR and EXPIRE.
The problem is the boundary spike. If your limit is 100 requests per minute and a client sends 100 at 11:59:59, the window resets, and they send 100 more at 12:00:01, you have 200 requests in two seconds with no violation recorded. Both windows look clean.
Each window reports "OK." The two seconds straddling the boundary serve 200 requests. The client is delighted.
Sliding Window Log: Exact, Memory-Hungry
Sliding window log stores every request timestamp in a sorted set. On each request, discard entries older than the window, then count what remains. Accurate, but every active user consumes memory proportional to their request rate. At 1,000 requests per minute per user, that is 1,000 entries in the sorted set.
Sliding Window Counter: The One You Should Default To
Sliding window counter is the practical default, and most candidates don't know it exists. Keep only two counters: the previous window and the current one. Estimate the in-window count as:
estimated = prev_count × (window_size − elapsed) / window_size + curr_count
If the previous minute had 80 requests, and you are 30 seconds into the current minute with 20 requests:
estimated = 80 × (60 − 30) / 60 + 20
= 80 × 0.5 + 20
= 60
The user is at 60, well under 100. Cloudflare measured this approximation against exact sliding windows in production. The error rate is 0.003%. You get two integers per user instead of a sorted set that grows with traffic.
Two integers. Near-zero error. This is the right default for most abuse-protection use cases.
Token Bucket: When Bursts Are Fine
Token bucket holds up to C tokens, refilling at rate r per second. Each request consumes one token. If the bucket is empty, the request is rejected. The bucket fills between requests, so short bursts up to capacity are absorbed.
This is Stripe's production algorithm. They store two values per user in Redis: tokens and the last refill timestamp. On each request, a Lua script computes tokens accrued since the last request, caps at capacity, and checks whether the request proceeds. Lua scripts execute atomically on Redis, so no two scripts for the same key interleave. That is the key correctness property.
RATE_LIMIT_SCRIPT = """ local tokens_key = KEYS[1] local rate = tonumber(ARGV[1]) -- tokens per second local capacity = tonumber(ARGV[2]) -- max burst local now = tonumber(ARGV[3]) local data = redis.call("HMGET", tokens_key, "tokens", "ts") local tokens = tonumber(data[1]) or capacity local last_ts = tonumber(data[2]) or now local elapsed = now - last_ts local new_tokens = math.min(capacity, tokens + elapsed * rate) if new_tokens < 1 then return 0 end redis.call("HMSET", tokens_key, "tokens", new_tokens - 1, "ts", now) redis.call("EXPIRE", tokens_key, math.ceil(capacity / rate) * 2) return 1 """
Leaky Bucket: When You Need Smooth Output
Leaky bucket puts requests in a FIFO queue that drains at a fixed rate. It enforces strictly smooth output with no bursting. Useful for traffic shaping, absorbing upstream spikes before they reach a fragile backend. Not the right default for user-facing rate enforcement. A burst that fills the queue just delays legitimate requests that arrive later.
Where Does the Limiter Live? (This One Has a Right Answer)
In-process: fast but each server has its own counter. API gateway: shared state, right default. Dedicated service: cross-team quotas.
In-process middleware runs inside each application server. Zero network latency. But each server keeps its own counter, so a user can circumvent limits by spreading requests across N servers. Breaks immediately in any distributed setup.
API gateway is the right default for most systems. Kong, AWS API Gateway, and Nginx with a rate-limiting module all handle this. One place to configure rules, one place to push updates. Enforcement adds roughly 1ms per request for a Redis-backed check.
Dedicated rate-limiting service (a sidecar pattern, as Envoy uses) makes sense when multiple internal services all need to share limits against a single user quota. More operational overhead. Worth it when your microservices sprawl and you need consistent enforcement across all of them.
What Goes in Redis
Two integers for sliding window. One hash for token bucket. Set TTLs slightly longer than the window.
For a sliding window counter:
rl:{user_id}:prev → integer (previous window count)
rl:{user_id}:curr → integer (current window count)
rl:{user_id}:window_start → unix timestamp
For a token bucket:
rl:{user_id} → HASH { tokens: 9.5, ts: 1706745600.123 }
Set the TTL slightly longer than the window. An expired key recreates itself on the next request with a fresh counter. That is the intended behavior.
One trap worth naming explicitly: if you run INCR key and then EXPIRE key 60 as two separate commands, you have a window where the key exists with no expiry if the process dies between them. A single Lua script eliminates this race:
local current = redis.call("INCR", KEYS[1]) if current == 1 then redis.call("EXPIRE", KEYS[1], ARGV[1]) end return current
For the underlying hash behavior in Redis, the hash table time complexity post covers why O(1) lookup holds and when it doesn't.
Where Centralized Redis Falls Apart
Two servers both read 99, both compute 100, both write 100. One request evaporates. Redis INCR serializes this correctly.
Centralized Redis gives you a single source of truth. A single Redis node handles around 100,000 operations per second. Redis Cluster scales horizontally. The cost is a network round trip on every rate-limit check.
Local counters with periodic sync lets each application server track counts locally and flush to Redis every few hundred milliseconds. Much faster. Allows temporary over-limit traffic between syncs. Acceptable for batch workloads, risky for anything security-sensitive.
Sticky sessions route each user to the same server, eliminating shared state entirely. Fails when a server goes down. Not worth the fragility.
Fail open vs fail closed is the question most candidates forget entirely. If Redis is unreachable, do you allow all traffic (fail open) or block all traffic (fail closed)? Stripe uses fail open: if the rate-limit check fails, allow the request and alert loudly. The alternative is rejecting all traffic while your Redis cluster recovers, which is usually the worse outcome for an API business. This is exactly the kind of availability-vs-correctness tradeoff discussed in The Tradeoff Maze.
Race conditions in distributed setups also matter. Two servers reading a counter of 99, both incrementing, both writing 100 instead of 101, is the classic lost update. Redis INCR is atomic and avoids this. The Lua script approach for token bucket avoids it too, because the entire read-compute-write sequence runs as one atomic operation.
Tell Clients When to Back Off
RFC 6585 standardized the 429 Too Many Requests status code in 2012. It optionally allows a Retry-After header. In practice, every major API adds three more headers that have become universal convention even though they are not in the RFC:
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1706745660
Retry-After: 42
X-RateLimit-Reset is the Unix timestamp of when the current window resets. Retry-After is seconds to wait. Send these on every response, not just 429s. Clients that can see their remaining quota will throttle themselves, reducing the number of 429s you have to generate.
(Fun fact: the most relatable HTTP status code. Every API developer has both sent and received it at 2am.)
How to Not Say "It Depends" and Walk Away
System design interviews fail not because the candidate doesn't know the material but because the candidate never establishes what they are optimizing for. "It depends" is a complete non-answer. Every "it depends" needs a condition and a choice attached to it, or the interviewer hears: I haven't thought this through.
Minutes 0 to 5: clarifying questions. Minimum: who gets limited, what shape of limit, and what happens on failure. Write the requirements on the board before drawing anything.
Minutes 5 to 20: high-level architecture. Place the rate limiter (API gateway is almost always right for a first pass). Name the backing store. Don't touch algorithm internals yet.
Minutes 20 to 35: algorithm deep dive. Name two options and their tradeoff, then commit to one. "Fixed window is simplest but has a boundary spike vulnerability. Sliding window counter eliminates that at the cost of approximate counting at a 0.003% error rate. For a public API protecting against abuse, I'd use sliding window counter." Then implement it.
Minutes 35 to 45: scaling and edge cases. Distributed coordination, Redis Cluster, fail-open behavior, multi-tier limits (per second and per day simultaneously), the race condition in naive INCR implementations.
The one phrase that kills candidates is "it depends" with no follow-through. "If burst tolerance is acceptable, token bucket. If we need smooth output rate, leaky bucket." State the condition, state the choice, move on.
If you want to practice this kind of structured communication under real time pressure, SpaceComplexity runs voice-based mock system design sessions with rubric scoring on your reasoning, not just whether you mentioned Redis.
Recap
- Clarify who gets limited, what burst behavior is acceptable, and what happens when the rate limiter fails before naming an algorithm.
- Fixed window counter has a boundary spike bug. Sliding window log is exact but memory-heavy. Sliding window counter is the practical default: two integers per user, 0.003% error.
- Token bucket allows controlled bursts. Leaky bucket enforces smooth output. Token bucket is the right choice for most user-facing APIs.
- Redis Lua scripts make token bucket atomic. A plain
INCR + EXPIREpair is not atomic. Use a Lua script or pipeline with care. - Fail open with alerting beats fail closed for availability-sensitive systems.
- Name two algorithm options with tradeoffs, then commit to one. "It depends" is not a complete sentence.
Further Reading
- RFC 6585: Additional HTTP Status Codes (the 429 specification)
- Scaling your API with rate limiters (Stripe Engineering)
- Build 5 Rate Limiters with Redis (Redis official documentation)
- Rate Limiting Algorithms: System Design (GeeksforGeeks)
- 429 Too Many Requests (MDN Web Docs)