Bloom Filter vs Hash Set: When a Wrong Answer Saves Gigabytes

May 28, 202611 min read
dsaalgorithmsinterview-prepdata-structures
Bloom Filter vs Hash Set: When a Wrong Answer Saves Gigabytes
TL;DR
  • Bloom filters use 9.6 bits/element at 1% FPR; hash sets use 8 to 200 bytes, a 10 to 100x gap in RAM
  • False positives are possible (bit collisions from other insertions); false negatives are impossible by construction
  • Optimal hash functions k = (m/n) × ln(2); every 10x reduction in FPR costs ~4.8 more bits per element
  • Four hard limits: no deletion, no retrieval, no iteration, and FPR climbs past the element count you sized for
  • The canonical pattern: Bloom filter in memory as a cheap gatekeeper before a slow disk or network lookup (Cassandra SSTables, Chrome Safe Browsing)
  • Decision rule: choose a Bloom filter only when false positives are tolerable and memory is the actual constraint

You have a billion URLs. You want to check, in microseconds, whether you've seen a given one before. A hash set works perfectly. It also costs you roughly 12 GB of RAM per server. Per server.

Now suppose you can tolerate getting the answer wrong in one specific direction. You'll never miss a URL you've seen. You might occasionally say "seen it" for something you haven't. 1% of the time. In exchange, you use 9.6 bits per URL instead of hundreds of bytes.

That's the bloom filter vs hash set tradeoff in one sentence.

Hash Sets Are Exact. That Exactness Costs Memory.

A hash set computes hash(element), maps that to a bucket, and stores the element (or enough of it to resolve collisions). Lookup, insert, and delete are O(1) average. No false positives. No uncertainty.

If a hash set says yes, the element is in the set. If it says no, it isn't. The price is storing enough data to distinguish every member from every possible non-member.

For a billion 64-bit integers, at 8 bytes each plus load factor overhead, you're looking at 10 to 20 GB. For strings, you store either the full string or a 64-bit hash with pointer overhead on top. Python's set adds garbage collection headers: roughly 200 bytes per element. Two hundred bytes. Per element. In Python. Java's HashSet is a comparatively frugal 48 bytes, and Go maps land at 8 to 16.

None of this is a knock on hash sets. It's just physics. The O(1) guarantee says nothing about how many bytes protect each operation.

See /blog/hash-set-time-complexity for the internals.

Bloom Filters Trade Certainty for Space

A Bloom filter is a probabilistic data structure. "Probabilistic" sounds fancy but it just means: this thing can be wrong, in one specific direction, at a rate you control.

Concretely: a bit array of m bits, all initialized to 0, paired with k independent hash functions.

Insert: run the element through all k hash functions, set each of the k resulting bit positions to 1.

Query: run the element through all k hash functions, check all k positions. If any bit is 0, the element is definitely not in the filter. If all bits are 1, the element is probably in the filter.

Bloom filter operation: insert google.com (bits 2,7,13) and bing.com (bits 0,7,11), then query yahoo.com which finds bit 9 is 0 and returns definitely-not-in-set, and query amazon.com which finds all bits set and returns a false positive even though amazon.com was never inserted Bits set by two inserts. "yahoo.com" hits a zero bit and gets a clean answer. "amazon.com" gets a false positive: all its hash positions happen to be 1, but it was never inserted.

A false positive happens when bits set by previous insertions happen to match all k positions for a new element. The filter has no memory of which element set which bit. It just sees 1s.

False negatives are impossible. If you inserted an element, its k bits are permanently set. Every subsequent query finds all k bits as 1. The filter cannot forget.

That asymmetry, no false negatives but possible false positives, is the entire foundation of its usefulness.

What the Error Rate Formula Actually Says

After inserting n elements with k hash functions into an m-bit array, what is the probability a non-member gets a false positive?

Each hash function independently sets one of m bits. After all n insertions, the probability any specific bit is still 0 is approximately:

P(bit = 0) ≈ (1 - 1/m)^(kn) ≈ e^(-kn/m)

The approximation uses the classical limit (1 - 1/m)^m → e^(-1) for large m. So the probability a specific bit is 1 is 1 - e^(-kn/m).

A false positive requires all k bits to be 1 for a non-member. Since the hash functions are independent:

FPR ≈ (1 - e^(-kn/m))^k

To minimize this over k, take the derivative and set it to zero. The optimal number of hash functions is:

k_opt = (m/n) × ln(2) ≈ 0.693 × (m/n)

Plug this back in and solve for m given a target false positive rate p:

m = -(n × ln(p)) / (ln(2))²
k = log₂(1/p)

Plugging in numbers:

Target FPRBits per elementHash functions
1%9.6 bits7
0.1%14.4 bits10
0.01%19.2 bits13

Every factor-of-10 reduction in error rate costs roughly 4.8 more bits per element. Going from 1% to 0.1% adds 4.8 bits. From 0.1% to 0.01%, another 4.8. You are, quite literally, buying precision by the bit.

FPR vs bits per element curve with k chosen optimally at each point, log scale on FPR axis, operating points marked at 1% (9.6 bits), 0.1% (14.4 bits), and 0.01% (19.2 bits) The curve falls steeply at first then flattens. The three marked points match the rows in the table above.

The Space Advantage Is Not Subtle

At 1% FPR, a Bloom filter uses 9.6 bits = 1.2 bytes per element. Here's the comparison:

StructureSpace per elementFalse positivesDelete
Python set~200 bytesNeverYes
Java HashSet~48 bytesNeverYes
C++ unordered_set (int64)~16 bytesNeverYes
Go map[T]struct{}~8 to 16 bytesNeverYes
Bloom filter at 1% FPR1.2 bytes1 in 100No
Bloom filter at 0.1% FPR1.8 bytes1 in 1,000No

One billion URLs at 1% FPR costs 1.2 GB in a Bloom filter. A Go hash set for the same data costs 10 to 20 GB.

One non-obvious property makes the gap even starker: the space required by a Bloom filter does not depend on key size. Whether your keys are 8-byte integers or 200-byte strings, the filter uses the same 9.6 bits per element at 1% FPR. A hash set storing a billion 50-byte URLs needs at minimum 50 GB just for the raw string data. The Bloom filter at 0.1% FPR needs 1.8 GB and does not care how long the URLs are.

Horizontal log-scale bar chart comparing bytes per element: Python set 200B, Java HashSet 48B, C++ unordered_set 16B, Go map 12B, Bloom 1% FPR 1.2B, Bloom 0.1% FPR 1.8B. The Bloom filter bars are barely visible on this scale. Log scale. The Bloom filter bars are not glitches.

Four Things You Cannot Do

The space savings are real. So is what you give up.

No deletions. Setting a bit to 1 is irreversible. You cannot unset it because other elements may share that bit. Counting Bloom filters replace each bit with a small counter (typically 4 bits), enabling deletion at roughly 4 times the space cost.

No retrieval. The filter stores no actual data. You cannot get an element back out.

No iteration. Same reason. There is no way to enumerate what has been inserted.

FPR climbs past capacity. The formulas above assume exactly n insertions. Add more than n elements and error rate rises above your target. Size with headroom and track your insertion count. You will forget to track the insertion count. Everyone does.

If you need any of deletion, retrieval, or certainty, you need a hash set.

The Filter Always Goes in Front

Most Bloom filter tutorials describe it as an alternative to a hash set. That framing leads you astray. The canonical pattern is gatekeeper: the Bloom filter lives in memory, eliminating most lookups before they reach the slow thing behind it.

Chrome's Safe Browsing feature is the clearest example. The browser held a local Bloom filter of millions of hashed malicious URL prefixes, compressed to a few megabytes. When you navigated to a URL:

  1. Check the local Bloom filter. If the answer is "definitely not malicious," no network request. Done.
  2. If the answer is "maybe malicious," make a real HTTPS request to Google's servers for an exact check.

Every truly malicious URL triggered the server check. Most safe URLs never generated a network request. The false positives sent some harmless URLs through the expensive path unnecessarily. That was fine. The cost of an extra HTTPS request is low. The cost of missing a malicious URL is not.

Two-path flowchart: browser extracts URL prefix, hashes it, checks local Bloom filter. Left path shows bit equals 0, definitely safe, request continues with no network call. Right path shows all bits are 1, maybe malicious, goes to HTTPS request at Google servers, returns exact block or allow verdict. The filter handles the cheap case. Google's servers handle the rest. Most requests never reach Google.

Cassandra uses the same logic for every SSTable file on disk. Before reading from a file, Cassandra checks a per-file Bloom filter. If the partition key is definitely not in that SSTable, skip the disk read entirely. One partition might be spread across dozens of SSTable files. Without Bloom filters, a read for a missing key would touch every file. With them, most files get skipped in microseconds from memory.

BigTable uses Bloom filters the same way for row/column pair lookups, cutting unnecessary disk access for non-existent keys.

The pattern is always the same: cheap probabilistic filter in memory, expensive exact store on disk or network.

Bloom Filter vs Hash Set: Which Do You Actually Need?

Use a Bloom filter when:

  • You have millions or billions of elements and memory is genuinely constrained
  • False positives are manageable (an extra lookup, not a wrong answer to the user)
  • You never need to delete, retrieve, or iterate elements
  • You are pre-filtering in front of a slower, exact check

Use a hash set when:

  • You need exact membership with zero false positives
  • You need deletion, retrieval, or iteration
  • Element count is small enough that memory is not the bottleneck
  • A false positive has unacceptable consequences (access control, financial deduplication, legal records)

The deciding question: can your system tolerate a small false positive rate, and does memory actually constrain you? If yes to both, the Bloom filter earns its place.

Sizing a Bloom Filter for a Billion URLs

A web crawler tracking visited URLs hits the memory wall fast. Here is how to size a Bloom filter for one billion URLs at 1% FPR:

import math n = 1_000_000_000 # one billion URLs p = 0.01 # 1% false positive rate m = int(-(n * math.log(p)) / (math.log(2) ** 2)) k = round(math.log(1 / p) / math.log(2)) print(f"Array size: {m:,} bits ({m / 8 / 1e9:.2f} GB)") print(f"Hash functions: {k}") # Array size: 9,585,058,387 bits (1.20 GB) # Hash functions: 7

1.2 GB. 7 hash functions. One billion URLs at 1% FPR.

The 1% of new URLs incorrectly flagged as "already visited" get skipped until the next crawl cycle. For most crawlers that's an acceptable loss. For one that cannot miss any page ever, it is not, and you would want a hash set or a database.

Production implementations use MurmurHash3 or xxHash for the hash functions, and Kirsch-Mitzenmacher double hashing to simulate k hash functions from just two, keeping the computation cheap.

For the full internals of Bloom filter construction including the FPR derivation, the double-hashing trick, counting variants, and real-world implementations, see /blog/bloom-filter-data-structure.

Recap

  • A Bloom filter is a bit array plus k hash functions. Insertions set bits; queries check all k bits.
  • False positives happen when bit collisions make a non-member look like a member. False negatives are impossible by construction.
  • At 1% FPR: 9.6 bits per element, regardless of key size. Hash sets use 8 to 200 bytes per element.
  • The space advantage is 10 to 100 times, but you lose deletion, retrieval, and certainty.
  • The canonical pattern: Bloom filter in front of an expensive lookup, filtering out most queries cheaply.
  • Decision question: can your system tolerate a small false positive rate, and is memory the real constraint?

If you're preparing for system design interviews, Bloom filters come up in database internals, CDN design, and distributed systems questions. You need to be able to explain the tradeoff out loud, not just write it down. SpaceComplexity runs voice-based mock interviews with rubric-based feedback so you can practice exactly that.

Further Reading