Token Bucket vs Leaky Bucket: Rate-Limiting Algorithms Decoded

You have an API. It's running. Someone's going to abuse it.

Maybe it's a script kiddie, maybe it's your own mobile app sending retries during a bad network, maybe it's a batch job your biggest customer wrote at 2am. Doesn't matter. At some point, something is going to send ten thousand requests in a second and it's your job to not let that become your problem.

The interviewer is not checking whether you know the names. They want to see if you can pick the right algorithm for the constraints given, and defend that choice against the alternatives.

If a system design interviewer says "design a rate limiter," you have about two minutes before the algorithm question lands. Know all five, have a recommendation ready, and know which ones to cheerfully throw under the bus.

Five Rate-Limiting Algorithms, One Map

Before diving in, here is the full landscape:

Algorithm	Burst Handling	Memory per Client	Accuracy
Fixed Window Counter	Poor (boundary bug)	Tiny (one counter)	Low
Sliding Window Log	Good	High (all timestamps)	Perfect
Sliding Window Counter	Good	Tiny (two counters)	~95%
Token Bucket	Controlled bursts	Low (two values)	Good
Leaky Bucket	None	Medium (queue size)	Exact

Most production systems land on token bucket or sliding window counter. The other three exist for good reasons. Knowing when to reject them is part of the answer.

Token Bucket: Controlled Bursts Are a Feature

The token bucket algorithm gives every client a bucket with a maximum capacity of b tokens. Tokens accumulate at a fixed rate r tokens per second. Each request spends one token. No tokens left, request denied. That's it.

import time

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.last_refill = time.time()

    def allow_request(self) -> bool:
        now = time.time()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

A user who goes quiet for 30 seconds and then fires off 50 requests gets all 50 through, because idle time accumulated tokens. That is a deliberate design choice: users who stay under their average rate are allowed to exceed it briefly. It feels fair because it actually is.

AWS API Gateway defaults to 10,000 requests per second with a burst capacity of 5,000. Stripe uses similar parameters. Both use token bucket because their customers have legitimate bursty traffic patterns. Batch jobs. Retry storms. End-of-day reporting from the finance team that somehow always runs at 11:59pm. Penalizing those bursts would hurt product quality.

The tradeoff: a client idle for an hour can burst at full capacity immediately. If b is too generous, you can still get hammered. Tuning it requires understanding your actual traffic distribution, not just the average.

Leaky Bucket: Smooth Output, No Mercy for Bursts

The leaky bucket algorithm flips the model. Requests enter a queue (the "bucket") and drain at a constant rate r. Queue full, request dropped. The output is always a steady stream, regardless of how bursty the input is.

import time
from collections import deque

class LeakyBucket:
    def __init__(self, capacity: int, leak_rate: float):
        self.capacity = capacity
        self.queue: deque = deque()
        self.leak_rate = leak_rate  # requests processed per second
        self.last_leak = time.time()

    def allow_request(self) -> bool:
        self._leak()
        if len(self.queue) < self.capacity:
            self.queue.append(time.time())
            return True
        return False

    def _leak(self):
        now = time.time()
        elapsed = now - self.last_leak
        leaked = int(elapsed * self.leak_rate)
        for _ in range(min(leaked, len(self.queue))):
            self.queue.popleft()
        self.last_leak = now

Leaky bucket is a traffic shaper, not just a traffic policer. It does not just allow or deny requests. It smooths them into an even stream. That property matters in specific contexts: video processing pipelines, payment systems, any downstream service that breaks under short bursts even when the average is fine.

For most web APIs, leaky bucket is the wrong choice. A user who pauses five seconds and sends ten requests gets throttled even when they are well under their hourly average. The queue memory overhead at scale is also harder to manage than a simple counter. Explaining that in an interview makes you look like someone who has thought about it, not just memorized the definition.

Why Fixed Window Gets You Rejected

Fixed window looks correct until someone does the math. Divide time into 60-second windows. Allow 100 requests per window.

The boundary bug: a user sends 100 requests at 12:00:59 and 100 more at 12:01:00. Both windows show 100, both are within limit. You just served 200 requests in two seconds against a 100-per-minute policy.

Your rate limiter has a secret second limit that is exactly double the stated one, and it activates every minute on the minute. Congratulations.

Never propose fixed window in an interview unless the interviewer explicitly asks about it as a starting point. It is a pedagogical tool. Its main job in the interview is to explain why the other algorithms exist.

Sliding Window Log: Perfect Accuracy at a Memory Cost

Sliding window log stores the timestamp of every request, typically in a Redis sorted set. On each new request, it removes all timestamps older than the window, counts what remains, and decides.

It is perfectly accurate. No boundary amplification. The problem is memory: every single request timestamp lives in storage. A service with 10 million users sending 100 requests per minute holds a billion timestamps in Redis. That number ends careers.

Use it for low-traffic endpoints where accuracy is non-negotiable. Financial audit APIs. OAuth token issuance. The kind of endpoints where you would rather be slow and right than fast and slightly wrong.

Sliding Window Counter: The Production Sweet Spot

The sliding window counter approximates the sliding window log using only two counters per client. It stores the count for the current window and the previous window, then estimates the current rate with a weighted average.

rate = current_count + (previous_count × (window_size - elapsed_in_current_window) / window_size)

If the previous window had 80 requests and the current window is 25% complete with 15 so far, the estimated rate is 15 + (80 × 0.75) = 75. Allow if under limit, deny if over.

This approximation is accurate to within a few percent under real traffic patterns, with the same tiny memory footprint as fixed window. Cloudflare adopted this algorithm at the edge specifically for the memory efficiency. When you are running rate limiting logic on servers handling millions of connections, "two integers per user" versus "a sorted set of timestamps per user" is not a minor consideration.

The Distributed Problem Nobody Mentions Until Asked

Token Bucket vs Leaky Bucket algorithm comparison diagram

A single Redis instance with atomic Lua scripting handles rate limiting state cleanly. Redis is standard here because it supports atomic operations and has sub-millisecond latency. The Lua script reads the current count, checks it against the limit, increments, and returns the decision atomically. No race condition possible.

local key = KEYS[1]
local limit = tonumber(ARGV[1])
local current = redis.call("INCR", key)
if current == 1 then
    redis.call("EXPIRE", key, ARGV[2])
end
if current > limit then
    return 0
else
    return 1
end

The problem appears when you have multiple regions. A globally consistent rate limit requires either a centralized store (adding latency on every request) or eventual consistency (accepting that users can briefly exceed limits during sync lag).

Most production systems accept eventual consistency for rate limiting. A user who gets a few extra requests through during a 50ms sync window is not a security catastrophe. Charging extra latency on every request, or blocking legitimate traffic for perfect accuracy, costs more than the small error. The distributed cache system design guide covers the replication and consistency models behind this tradeoff in more depth.

The moment you raise this in an interview without being prompted, you have just confirmed that you think about failure modes proactively. That is a different signal than someone who only talks about the happy path.

How to Answer This in an Interview

The pattern that works: state the landscape, pick two, defend your pick.

"There are five main rate limiting algorithms. For this use case, I want to use token bucket. Here's why."

Then walk through the tradeoffs for whatever the interviewer described. Payment API with strict throughput and no burst tolerance: leaky bucket. User-facing REST API with bursty mobile clients: token bucket or sliding window counter. High-value low-traffic endpoint with perfect accuracy requirements: sliding window log.

What fails: listing all five without a recommendation, or picking one without explaining why the others are wrong for this context. The interviewer is not a Wikipedia entry. They want a recommendation with reasoning, not a reading from the algorithm menu.

When discussing implementation, be specific. Not "you set a limit" but "I would start with a 60-second window, a capacity of 1,000 requests, and a refill rate of 20 per second, then tune from p99 traffic patterns in production." Specificity signals you have built this before. Or at least thought hard enough about it that you might as well have.

Practicing that explanation out loud is a different skill than knowing the algorithms. Algorithm knowledge and interview performance are separate things. SpaceComplexity lets you run through system design scenarios with voice-based feedback so you can hear whether your explanation actually lands, not just whether the information is correct.

Pick One. Here Is the Map.

Token bucket: User-facing APIs with bursty traffic. Standard choice for most web services. AWS, Stripe, and GitHub all use it. Start here unless you have a specific reason not to.

Sliding window counter: When memory efficiency matters at very large scale and you can accept ~5% approximation error. Cloudflare's edge rate limiting. Two integers per user, works at any scale.

Leaky bucket: Strict pipeline protection. Video encoding services, payment processors, any downstream that cannot handle short spikes even if the average rate is within limits.

Sliding window log: High-value, low-traffic endpoints where perfect accuracy is required and memory is not a constraint. Financial audit APIs.

Fixed window: Nowhere in production. Use it to illustrate why the other algorithms exist, then never speak of it again.