Token Bucket vs Leaky Bucket: Rate-Limiting Algorithms Decoded

- Token bucket allows controlled bursts by accumulating tokens during idle time; AWS, Stripe, and GitHub all use it for user-facing APIs.
- Leaky bucket smooths requests into a constant output rate, making it a traffic shaper rather than just a traffic policer.
- Fixed window counter has a boundary amplification bug that can double your allowed rate at window edges — never propose it in an interview.
- Sliding window counter approximates perfect accuracy using two counters and a weighted average with ~5% error; Cloudflare uses it at the edge.
- Redis with Lua scripting is the standard implementation for atomicity, but multi-region rate limiting requires accepting eventual consistency or paying extra latency.
- Pick token bucket for user-facing APIs, sliding window counter at very large scale, and leaky bucket for strict downstream pipeline protection.
You have an API. It's running. Someone's going to abuse it.
Maybe it's a script kiddie, maybe it's your own mobile app sending retries during a bad network, maybe it's a batch job your biggest customer wrote at 2am. Doesn't matter. At some point, something is going to send ten thousand requests in a second and it's your job to not let that become your problem.
The interviewer is not checking whether you know the names. They want to see if you can pick the right algorithm for the constraints given, and defend that choice against the alternatives.
If a system design interviewer says "design a rate limiter," you have about two minutes before the algorithm question lands. Know all five, have a recommendation ready, and know which ones to cheerfully throw under the bus.
Five Rate-Limiting Algorithms, One Map
Before diving in, here is the full landscape:
| Algorithm | Burst Handling | Memory per Client | Accuracy |
|---|---|---|---|
| Fixed Window Counter | Poor (boundary bug) | Tiny (one counter) | Low |
| Sliding Window Log | Good | High (all timestamps) | Perfect |
| Sliding Window Counter | Good | Tiny (two counters) | ~95% |
| Token Bucket | Controlled bursts | Low (two values) | Good |
| Leaky Bucket | None | Medium (queue size) | Exact |
Most production systems land on token bucket or sliding window counter. The other three exist for good reasons. Knowing when to reject them is part of the answer.
Token Bucket: Controlled Bursts Are a Feature
The token bucket algorithm gives every client a bucket with a maximum capacity of b tokens. Tokens accumulate at a fixed rate r tokens per second. Each request spends one token. No tokens left, request denied. That's it.
import time class TokenBucket: def __init__(self, capacity: int, refill_rate: float): self.capacity = capacity self.tokens = capacity self.refill_rate = refill_rate # tokens per second self.last_refill = time.time() def allow_request(self) -> bool: now = time.time() elapsed = now - self.last_refill self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate) self.last_refill = now if self.tokens >= 1: self.tokens -= 1 return True return False
A user who goes quiet for 30 seconds and then fires off 50 requests gets all 50 through, because idle time accumulated tokens. That is a deliberate design choice: users who stay under their average rate are allowed to exceed it briefly. It feels fair because it actually is.
AWS API Gateway defaults to 10,000 requests per second with a burst capacity of 5,000. Stripe uses similar parameters. Both use token bucket because their customers have legitimate bursty traffic patterns. Batch jobs. Retry storms. End-of-day reporting from the finance team that somehow always runs at 11:59pm. Penalizing those bursts would hurt product quality.
The tradeoff: a client idle for an hour can burst at full capacity immediately. If b is too generous, you can still get hammered. Tuning it requires understanding your actual traffic distribution, not just the average.
Leaky Bucket: Smooth Output, No Mercy for Bursts
The leaky bucket algorithm flips the model. Requests enter a queue (the "bucket") and drain at a constant rate r. Queue full, request dropped. The output is always a steady stream, regardless of how bursty the input is.
import time from collections import deque class LeakyBucket: def __init__(self, capacity: int, leak_rate: float): self.capacity = capacity self.queue: deque = deque() self.leak_rate = leak_rate # requests processed per second self.last_leak = time.time() def allow_request(self) -> bool: self._leak() if len(self.queue) < self.capacity: self.queue.append(time.time()) return True return False def _leak(self): now = time.time() elapsed = now - self.last_leak leaked = int(elapsed * self.leak_rate) for _ in range(min(leaked, len(self.queue))): self.queue.popleft() self.last_leak = now
Leaky bucket is a traffic shaper, not just a traffic policer. It does not just allow or deny requests. It smooths them into an even stream. That property matters in specific contexts: video processing pipelines, payment systems, any downstream service that breaks under short bursts even when the average is fine.
For most web APIs, leaky bucket is the wrong choice. A user who pauses five seconds and sends ten requests gets throttled even when they are well under their hourly average. The queue memory overhead at scale is also harder to manage than a simple counter. Explaining that in an interview makes you look like someone who has thought about it, not just memorized the definition.
Why Fixed Window Gets You Rejected
Fixed window looks correct until someone does the math. Divide time into 60-second windows. Allow 100 requests per window.
The boundary bug: a user sends 100 requests at 12:00:59 and 100 more at 12:01:00. Both windows show 100, both are within limit. You just served 200 requests in two seconds against a 100-per-minute policy.
Your rate limiter has a secret second limit that is exactly double the stated one, and it activates every minute on the minute. Congratulations.
Never propose fixed window in an interview unless the interviewer explicitly asks about it as a starting point. It is a pedagogical tool. Its main job in the interview is to explain why the other algorithms exist.
Sliding Window Log: Perfect Accuracy at a Memory Cost
Sliding window log stores the timestamp of every request, typically in a Redis sorted set. On each new request, it removes all timestamps older than the window, counts what remains, and decides.
It is perfectly accurate. No boundary amplification. The problem is memory: every single request timestamp lives in storage. A service with 10 million users sending 100 requests per minute holds a billion timestamps in Redis. That number ends careers.
Use it for low-traffic endpoints where accuracy is non-negotiable. Financial audit APIs. OAuth token issuance. The kind of endpoints where you would rather be slow and right than fast and slightly wrong.
Sliding Window Counter: The Production Sweet Spot
The sliding window counter approximates the sliding window log using only two counters per client. It stores the count for the current window and the previous window, then estimates the current rate with a weighted average.
rate = current_count + (previous_count × (window_size - elapsed_in_current_window) / window_size)
If the previous window had 80 requests and the current window is 25% complete with 15 so far, the estimated rate is 15 + (80 × 0.75) = 75. Allow if under limit, deny if over.
This approximation is accurate to within a few percent under real traffic patterns, with the same tiny memory footprint as fixed window. Cloudflare adopted this algorithm at the edge specifically for the memory efficiency. When you are running rate limiting logic on servers handling millions of connections, "two integers per user" versus "a sorted set of timestamps per user" is not a minor consideration.
The Distributed Problem Nobody Mentions Until Asked

A single Redis instance with atomic Lua scripting handles rate limiting state cleanly. Redis is standard here because it supports atomic operations and has sub-millisecond latency. The Lua script reads the current count, checks it against the limit, increments, and returns the decision atomically. No race condition possible.
local key = KEYS[1] local limit = tonumber(ARGV[1]) local current = redis.call("INCR", key) if current == 1 then redis.call("EXPIRE", key, ARGV[2]) end if current > limit then return 0 else return 1 end
The problem appears when you have multiple regions. A globally consistent rate limit requires either a centralized store (adding latency on every request) or eventual consistency (accepting that users can briefly exceed limits during sync lag).
Most production systems accept eventual consistency for rate limiting. A user who gets a few extra requests through during a 50ms sync window is not a security catastrophe. Charging extra latency on every request, or blocking legitimate traffic for perfect accuracy, costs more than the small error. The distributed cache system design guide covers the replication and consistency models behind this tradeoff in more depth.
The moment you raise this in an interview without being prompted, you have just confirmed that you think about failure modes proactively. That is a different signal than someone who only talks about the happy path.
How to Answer This in an Interview
The pattern that works: state the landscape, pick two, defend your pick.
"There are five main rate limiting algorithms. For this use case, I want to use token bucket. Here's why."
Then walk through the tradeoffs for whatever the interviewer described. Payment API with strict throughput and no burst tolerance: leaky bucket. User-facing REST API with bursty mobile clients: token bucket or sliding window counter. High-value low-traffic endpoint with perfect accuracy requirements: sliding window log.
What fails: listing all five without a recommendation, or picking one without explaining why the others are wrong for this context. The interviewer is not a Wikipedia entry. They want a recommendation with reasoning, not a reading from the algorithm menu.
When discussing implementation, be specific. Not "you set a limit" but "I would start with a 60-second window, a capacity of 1,000 requests, and a refill rate of 20 per second, then tune from p99 traffic patterns in production." Specificity signals you have built this before. Or at least thought hard enough about it that you might as well have.
Practicing that explanation out loud is a different skill than knowing the algorithms. Algorithm knowledge and interview performance are separate things. SpaceComplexity lets you run through system design scenarios with voice-based feedback so you can hear whether your explanation actually lands, not just whether the information is correct.
Pick One. Here Is the Map.
Token bucket: User-facing APIs with bursty traffic. Standard choice for most web services. AWS, Stripe, and GitHub all use it. Start here unless you have a specific reason not to.
Sliding window counter: When memory efficiency matters at very large scale and you can accept ~5% approximation error. Cloudflare's edge rate limiting. Two integers per user, works at any scale.
Leaky bucket: Strict pipeline protection. Video encoding services, payment processors, any downstream that cannot handle short spikes even if the average rate is within limits.
Sliding window log: High-value, low-traffic endpoints where perfect accuracy is required and memory is not a constraint. Financial audit APIs.
Fixed window: Nowhere in production. Use it to illustrate why the other algorithms exist, then never speak of it again.
Further Reading
- Redis Rate Limiting Tutorial, official Redis documentation with Lua script examples for multiple algorithms
- GeeksforGeeks: Token Bucket vs Leaky Bucket, clear visual walkthrough of both core algorithms
- AWS API Gateway throttling documentation, real parameters for how AWS implements token bucket in production
- NGINX rate limiting guide, practical implementation of token bucket at the reverse proxy layer
- Wikipedia: Leaky bucket algorithm, formal definition and the two interpretations (as a meter vs as a queue)