Reservoir Sampling: Pick Randomly From a Stream You Can't See

You're given a linked list of unknown length. Pick a random node. Every node must have an equal chance. You can't count the nodes first.

That's LeetCode 382, and if you've never seen reservoir sampling before, your first instinct is: count the nodes, generate a random index, traverse again. Two passes, clean, done. The interviewer smiles and says "what if the data is a live stream that never ends?"

Reservoir sampling solves this. One pass, fixed memory, provably fair. And once you understand it, you'll see it everywhere.

You're Sampling Blind

Picture this: you're tailing a server log as it streams in. You want a random sample of 1,000 lines for analysis. You have no idea how many lines the server will produce. Maybe 2,000. Maybe 200 million. The server doesn't know either. It's having a rough day.

You need to select k items uniformly at random from a sequence of n items, where n is unknown or too large to store.

If you could see the whole sequence upfront, you'd shuffle it and take the first k elements (see Fisher-Yates shuffle). But you can't. The data arrives one item at a time, like a conveyor belt where the belt might stop in five seconds or five hours.

Why the Naive Approaches Fail

The tempting solution: first pass to count n, generate k random indices, second pass to collect them.

Two problems. If the stream is live (network logs, sensor data, user events), there is no second pass. The data is gone. The server logged it and moved on. You're not getting it back.

Even if you have a file, reading 500GB twice is just reading 500GB twice. Some people have done it. They regret it.

The other tempting move: read everything into memory, then sample. This works until n is larger than your RAM, which is the entire reason you're sampling in the first place. You don't need a random 1,000 items because you love randomness. You need them because you can't fit the rest.

The constraint that makes this problem interesting is exactly that you can only see each item once.

Algorithm R: One Pass, No Cheating

The algorithm, due to Alan Vitter (1985):

Fill the reservoir with the first k items.
For the (i+1)th item (where i >= k), keep it with probability k / (i+1).
If you keep it, evict a random item from the reservoir.
After processing all n items, return the reservoir.

That's it. One pass, O(k) memory, and every item ends up with equal probability. Vitter didn't win any awards for conciseness, but he did get one pass out of it.

The k=1 case is the cleanest version to hold in your head: keep the first item, then for each new item, replace your current selection with probability 1/i where i is the current count. By the time the stream ends, you've kept exactly one item, and every item had the same 1/n shot.

Why Does This Actually Work?

Reservoir sampling flow: stream of items on the left, fixed-size reservoir on the right, probability k/i acceptance rule, invariant annotation at the bottom

The proof by induction is short enough to sketch in an interview without embarrassing yourself.

After processing i items, every item seen so far has exactly k/i probability of being in the reservoir.

Base case: after processing k items (filling the reservoir), each item has k/k = 1 probability. Correct.

Inductive step: assume after i items, each is in the reservoir with probability k/i. Item i+1 arrives. We keep it with probability k/(i+1). If we keep it, we evict one of the k reservoir slots uniformly, so each existing item survives with probability (1 - (k/(i+1)) * (1/k)) = 1 - 1/(i+1) = i/(i+1).

An existing item's new probability: (k/i) * (i/(i+1)) = k/(i+1). The new item's probability: k/(i+1). Equal. QED.

Each new item challenges the reservoir fairly, and the replacement probability is calibrated to maintain the invariant exactly. The math is almost suspiciously clean.

The k=1 Special Case: LeetCode 382 and 398

LeetCode 382 and 398 both use k=1. You maintain a single "current best" candidate and replace it with decreasing probability as more items arrive. The later items are long shots. The first item started at 100% and has been quietly defending its title ever since.

import random

class Solution:
    def __init__(self, head):
        self.head = head

    def getRandom(self):
        result = None
        node = self.head
        i = 1
        while node:
            if random.randint(1, i) == 1:
                result = node.val
            node = node.next
            i += 1
        return result

Walk through it: first node (i=1), you always take it (1/1 = 100%). Second node (i=2), you take it with probability 1/2. The first node survives with probability 1 * (1 - 1/2) = 1/2. After two nodes, each has 1/2. Keep going and the pattern holds.

random.randint(1, i) == 1 succeeds with probability 1/i, which is exactly what Algorithm R needs for k=1.

LeetCode 398 is the same idea applied to an array where a target value appears multiple times. You want a uniform random index among all matching indices. Same algorithm, just filter on nums[i] == target before doing the reservoir update.

Scaling to k Items

import random

def reservoir_sample(stream, k):
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

random.randint(0, i) generates a number in [0, i], giving i+1 possible values. The item lands in the reservoir if j falls in [0, k-1], which happens with probability k/(i+1). That's the acceptance rule from Algorithm R, translated directly into code.

TypeScript, if that's your interview language:

function reservoirSample<T>(stream: T[], k: number): T[] {
    const reservoir: T[] = stream.slice(0, k);
    for (let i = k; i < stream.length; i++) {
        const j = Math.floor(Math.random() * (i + 1));
        if (j < k) {
            reservoir[j] = stream[i];
        }
    }
    return reservoir;
}

Math.random() returns [0, 1), so Math.floor(Math.random() * (i + 1)) gives a uniform integer in [0, i]. Same behavior as the Python version. You're welcome.

Time and Space: The Good News

Reservoir sampling runs in O(n) time and O(k) space, which is optimal for a single-pass algorithm.

You touch each item once. The reservoir never grows beyond k. For the interview case (k=1, unknown n), this collapses to O(n) time and O(1) space. You process a billion-line log using the same memory you'd use for a ten-line one.

The important thing to say clearly in an interview: k is your sample size, not n. If you're pulling 1,000 items from a billion-line log, you're using constant memory relative to input size.

Where This Shows Up in Real Systems

Log sampling is the canonical case. Distributed systems can emit millions of events per second. Storing all of them is expensive and largely pointless for most analyses. The agent at each host runs reservoir sampling and ships a statistically representative subset. Nobody misses the other 999,000 events per second.

Recommendation systems use it during training data prep. A user with 50,000 interaction events is a problem. You don't want to train on all of them (slow) or just the recent ones (biased). Reservoir sampling gives you a uniform sample without reading the full history twice.

Database query optimizers use it for cardinality estimation. A/B testing frameworks use it for user bucketing when the population is growing in real time. The common thread: reservoir sampling shows up wherever you have a stream, a memory budget, and a fairness requirement.

How to Explain This in an Interview

The algorithm is short enough to fit on a napkin. What separates candidates is whether they can explain why it works.

A strong answer covers three things:

Why can't we just count first? (Stream, or the second pass costs too much.)
What's the invariant? (After i items, each has k/i probability of being in the reservoir.)
How does the inductive step hold? (New item acceptance and existing item survival both work out to k/(i+1).)

If you can sketch the proof in 60 seconds, you signal mathematical fluency that matters at mid-to-senior level. Most people can implement the algorithm once shown. Fewer can explain why it's correct without checking their notes first.

Explaining probability-based algorithms out loud is harder than it looks on paper. If you're prepping for interviews that include randomization questions, SpaceComplexity lets you practice these explanations with a voice-based interviewer that gives rubric feedback on your reasoning, not just whether your code compiles.