Rate Limiter System Design: Most Candidates Skip Clarification

You have a server. It's running. Someone is hitting it 10,000 times a second. What do you do?

"Token bucket," says the candidate, and starts drawing boxes. The interviewer writes down: jumped to solution.

Rate limiter system design is one of those interview problems where everyone knows the algorithm names and almost nobody knows which question they're answering. The algorithms are easy to look up. What matters is whether you can establish scope, pick the right tradeoff for the stated requirements, and explain why. This covers the whole path from clarifying questions to Redis keys to scaling gotchas, in the order you'd present it in a 45-minute interview.

Ask the Questions Your Interviewer Already Wrote Down

Spend the first five minutes asking questions that drive algorithm selection. These are the ones that matter:

Who or what gets limited? User ID, API key, IP address, or a combination? IP is easy to spoof. User ID requires an authenticated request. API key is the most common choice for developer-facing APIs.

What shape of limit? Requests per second, per minute, per day? Multi-tier (100/min and 10,000/day simultaneously)? Should short bursts be tolerated, or do you need smooth output?

How accurate? A 0.003% error rate is fine for protecting an API from abuse. It is not fine for a billing system counting transactions.

Where does the limiter live? Inside each application server, at an API gateway layer, or as a dedicated service?

What happens when the rate limiter fails? Fail open (allow all traffic) or fail closed (reject all traffic)?

None of this is stalling. It is the difference between designing a token bucket and designing a leaky bucket. Naming those questions signals that you understand the problem before you start solving it.

Decision tree for rate limiter algorithm selection: who is limited, burst ok, accuracy requirement Ask these first. Your algorithm choice falls out of the answers.

Five Algorithms. One Is Right for Your Requirements.

Fixed Window Counter: Simple, Broken in One Specific Way

Fixed window counter divides time into intervals and increments a counter per interval. When the counter exceeds the limit, reject. Simple to implement with a single Redis INCR and EXPIRE.

The problem is the boundary spike. If your limit is 100 requests per minute and a client sends 100 at 11:59:59, the window resets, and they send 100 more at 12:00:01, you have 200 requests in two seconds with no violation recorded. Both windows look clean.

Fixed window boundary spike: two adjacent 1-minute windows each at 100, with 200 requests in a 2-second window straddling the boundary Each window reports "OK." The two seconds straddling the boundary serve 200 requests. The client is delighted.

Sliding Window Log: Exact, Memory-Hungry

Sliding window log stores every request timestamp in a sorted set. On each request, discard entries older than the window, then count what remains. Accurate, but every active user consumes memory proportional to their request rate. At 1,000 requests per minute per user, that is 1,000 entries in the sorted set.

Sliding Window Counter: The One You Should Default To

Sliding window counter is the practical default, and most candidates don't know it exists. Keep only two counters: the previous window and the current one. Estimate the in-window count as:

estimated = prev_count × (window_size − elapsed) / window_size + curr_count

If the previous minute had 80 requests, and you are 30 seconds into the current minute with 20 requests:

estimated = 80 × (60 − 30) / 60 + 20
          = 80 × 0.5 + 20
          = 60

The user is at 60, well under 100. Cloudflare measured this approximation against exact sliding windows in production. The error rate is 0.003%. You get two integers per user instead of a sorted set that grows with traffic.

Sliding window counter diagram: two overlapping fixed windows with the weighted-average formula annotated on the overlap region Two integers. Near-zero error. This is the right default for most abuse-protection use cases.

Token Bucket: When Bursts Are Fine

Token bucket holds up to C tokens, refilling at rate r per second. Each request consumes one token. If the bucket is empty, the request is rejected. The bucket fills between requests, so short bursts up to capacity are absorbed.

This is Stripe's production algorithm. They store two values per user in Redis: tokens and the last refill timestamp. On each request, a Lua script computes tokens accrued since the last request, caps at capacity, and checks whether the request proceeds. Lua scripts execute atomically on Redis, so no two scripts for the same key interleave. That is the key correctness property.

RATE_LIMIT_SCRIPT = """
local tokens_key = KEYS[1]
local rate       = tonumber(ARGV[1])   -- tokens per second
local capacity   = tonumber(ARGV[2])   -- max burst
local now        = tonumber(ARGV[3])

local data     = redis.call("HMGET", tokens_key, "tokens", "ts")
local tokens   = tonumber(data[1]) or capacity
local last_ts  = tonumber(data[2]) or now

local elapsed   = now - last_ts
local new_tokens = math.min(capacity, tokens + elapsed * rate)

if new_tokens < 1 then
  return 0
end

redis.call("HMSET", tokens_key, "tokens", new_tokens - 1, "ts", now)
redis.call("EXPIRE", tokens_key, math.ceil(capacity / rate) * 2)
return 1
"""

Leaky Bucket: When You Need Smooth Output

Leaky bucket puts requests in a FIFO queue that drains at a fixed rate. It enforces strictly smooth output with no bursting. Useful for traffic shaping, absorbing upstream spikes before they reach a fragile backend. Not the right default for user-facing rate enforcement. A burst that fills the queue just delays legitimate requests that arrive later.

Where Does the Limiter Live? (This One Has a Right Answer)

Three rate limiter architectures: in-process middleware vs API gateway vs dedicated sidecar service In-process: fast but each server has its own counter. API gateway: shared state, right default. Dedicated service: cross-team quotas.

In-process middleware runs inside each application server. Zero network latency. But each server keeps its own counter, so a user can circumvent limits by spreading requests across N servers. Breaks immediately in any distributed setup.

API gateway is the right default for most systems. Kong, AWS API Gateway, and Nginx with a rate-limiting module all handle this. One place to configure rules, one place to push updates. Enforcement adds roughly 1ms per request for a Redis-backed check.

Dedicated rate-limiting service (a sidecar pattern, as Envoy uses) makes sense when multiple internal services all need to share limits against a single user quota. More operational overhead. Worth it when your microservices sprawl and you need consistent enforcement across all of them.

What Goes in Redis

Redis key namespace for rate limiting: sliding window counter keys and token bucket hash keys with TTL annotations Two integers for sliding window. One hash for token bucket. Set TTLs slightly longer than the window.

For a sliding window counter:

rl:{user_id}:prev         →  integer (previous window count)
rl:{user_id}:curr         →  integer (current window count)
rl:{user_id}:window_start →  unix timestamp

For a token bucket:

rl:{user_id}  →  HASH { tokens: 9.5, ts: 1706745600.123 }

Set the TTL slightly longer than the window. An expired key recreates itself on the next request with a fresh counter. That is the intended behavior.

One trap worth naming explicitly: if you run INCR key and then EXPIRE key 60 as two separate commands, you have a window where the key exists with no expiry if the process dies between them. A single Lua script eliminates this race:

local current = redis.call("INCR", KEYS[1])
if current == 1 then
  redis.call("EXPIRE", KEYS[1], ARGV[1])
end
return current

For the underlying hash behavior in Redis, the hash table time complexity post covers why O(1) lookup holds and when it doesn't.

Where Centralized Redis Falls Apart

Race condition in naive GET+SET rate limiting vs atomic Redis INCR Two servers both read 99, both compute 100, both write 100. One request evaporates. Redis INCR serializes this correctly.

Centralized Redis gives you a single source of truth. A single Redis node handles around 100,000 operations per second. Redis Cluster scales horizontally. The cost is a network round trip on every rate-limit check.

Local counters with periodic sync lets each application server track counts locally and flush to Redis every few hundred milliseconds. Much faster. Allows temporary over-limit traffic between syncs. Acceptable for batch workloads, risky for anything security-sensitive.

Sticky sessions route each user to the same server, eliminating shared state entirely. Fails when a server goes down. Not worth the fragility.

Fail open vs fail closed is the question most candidates forget entirely. If Redis is unreachable, do you allow all traffic (fail open) or block all traffic (fail closed)? Stripe uses fail open: if the rate-limit check fails, allow the request and alert loudly. The alternative is rejecting all traffic while your Redis cluster recovers, which is usually the worse outcome for an API business. This is exactly the kind of availability-vs-correctness tradeoff discussed in The Tradeoff Maze.

Race conditions in distributed setups also matter. Two servers reading a counter of 99, both incrementing, both writing 100 instead of 101, is the classic lost update. Redis INCR is atomic and avoids this. The Lua script approach for token bucket avoids it too, because the entire read-compute-write sequence runs as one atomic operation.

Tell Clients When to Back Off

RFC 6585 standardized the 429 Too Many Requests status code in 2012. It optionally allows a Retry-After header. In practice, every major API adds three more headers that have become universal convention even though they are not in the RFC:

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1706745660
Retry-After: 42

X-RateLimit-Reset is the Unix timestamp of when the current window resets. Retry-After is seconds to wait. Send these on every response, not just 429s. Clients that can see their remaining quota will throttle themselves, reducing the number of 429s you have to generate.

(Fun fact: the most relatable HTTP status code. Every API developer has both sent and received it at 2am.)

How to Not Say "It Depends" and Walk Away

System design interviews fail not because the candidate doesn't know the material but because the candidate never establishes what they are optimizing for. "It depends" is a complete non-answer. Every "it depends" needs a condition and a choice attached to it, or the interviewer hears: I haven't thought this through.

Minutes 0 to 5: clarifying questions. Minimum: who gets limited, what shape of limit, and what happens on failure. Write the requirements on the board before drawing anything.

Minutes 5 to 20: high-level architecture. Place the rate limiter (API gateway is almost always right for a first pass). Name the backing store. Don't touch algorithm internals yet.

Minutes 20 to 35: algorithm deep dive. Name two options and their tradeoff, then commit to one. "Fixed window is simplest but has a boundary spike vulnerability. Sliding window counter eliminates that at the cost of approximate counting at a 0.003% error rate. For a public API protecting against abuse, I'd use sliding window counter." Then implement it.

Minutes 35 to 45: scaling and edge cases. Distributed coordination, Redis Cluster, fail-open behavior, multi-tier limits (per second and per day simultaneously), the race condition in naive INCR implementations.

The one phrase that kills candidates is "it depends" with no follow-through. "If burst tolerance is acceptable, token bucket. If we need smooth output rate, leaky bucket." State the condition, state the choice, move on.

If you want to practice this kind of structured communication under real time pressure, SpaceComplexity runs voice-based mock system design sessions with rubric scoring on your reasoning, not just whether you mentioned Redis.

Recap

Clarify who gets limited, what burst behavior is acceptable, and what happens when the rate limiter fails before naming an algorithm.
Fixed window counter has a boundary spike bug. Sliding window log is exact but memory-heavy. Sliding window counter is the practical default: two integers per user, 0.003% error.
Token bucket allows controlled bursts. Leaky bucket enforces smooth output. Token bucket is the right choice for most user-facing APIs.
Redis Lua scripts make token bucket atomic. A plain INCR + EXPIRE pair is not atomic. Use a Lua script or pipeline with care.
Fail open with alerting beats fail closed for availability-sensitive systems.
Name two algorithm options with tradeoffs, then commit to one. "It depends" is not a complete sentence.