Bloom Filter: The Data Structure That Answers "Definitely Not" in O(1)

- Bloom filters store a lossy fingerprint of a set: they prove absence in O(1) but only suggest presence, making false positives possible and false negatives impossible
- The false positive rate is (1 - e^(-kn/m))^k, and at 1% FPR you need just 9.6 bits per element, which is 1.44x the information-theoretic minimum
- Two hash functions are enough to simulate k independent hashes via the Kirsch-Mitzenmacher double hashing trick
- Every insert and lookup is O(1) since k is a small constant chosen at filter creation time
- Bloom filters cannot delete, enumerate, or count, but counting variants trade 4x space for deletion support
- Production systems using Bloom filters include BigTable SSTable lookups, Akamai cache admission, Chrome Safe Browsing, and Cassandra disk-read avoidance
You have a billion URLs. A user types one in. You need to know: is this URL malicious?
You could store every malicious URL in a hash set. That works. It also eats gigabytes of RAM for breakfast. Or you could use a data structure that answers the same question in constant time, using roughly one byte per URL, with one small asterisk: it occasionally says "yes" when it should say "no." It will never, ever say "no" when it should say "yes."
That structure is the Bloom filter. It trades perfect accuracy for radical space efficiency, and the math behind the trade-off is surprisingly clean.
Burton Howard Bloom published the idea in 1970 with a title only a researcher could love: "Space/Time Trade-offs in Hash Coding with Allowable Errors" (Communications of the ACM). The original use case was hyphenation dictionaries. Storing 500,000 English words took too much memory for 1970s hardware, so Bloom asked: what if we tolerate a small error rate and compress the set into a bit array? The answer turned out to be useful far beyond dictionaries. Google uses Bloom filters in BigTable. Akamai uses them to avoid caching content nobody asks for twice. Bitcoin used them to let lightweight wallets sync without downloading every block. The structure is 56 years old and still shipping in production. Not bad for something born the same year as the floppy disk.
What does a Bloom filter actually store?
A Bloom filter is not a set. It does not store the elements themselves. It stores the shadow of a set: a compressed, lossy fingerprint that can prove absence but can only suggest presence.
The structure has two components:
- A bit array of m bits, all initialized to 0.
- A family of k hash functions, each mapping an element to one of the m positions.
That is the entire data structure. No pointers, no linked lists, no tree nodes. Just bits. Your linked list is jealous.
How insertion works
To insert an element x, compute all k hash functions on x. Each one returns a position in the bit array. Flip those k bits to 1. That's it. No collision resolution, no probing, no rebalancing. Just flip and walk away.
Suppose m = 16, k = 3, and you insert "apple." The three hash functions return positions 2, 7, and 13. You flip those bits:
Index: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Bits: 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0
Now insert "banana." The three hash functions return positions 4, 7, and 11:
Index: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Bits: 0 0 1 0 1 0 0 1 0 0 0 1 0 1 0 0
Position 7 was already set from "apple." That is fine. Bits only go from 0 to 1, never back. This one-way door is exactly why deletion is impossible in a standard Bloom filter: clearing bit 7 to remove "banana" would also lobotomize "apple."
Inserting "apple" and "banana" into a 16-bit Bloom filter with k=3. Position 7 is the shared bit that makes deletion impossible.
How lookup works
To check if x is in the set, compute the same k hash functions. Check whether all k bits are 1.
- If any bit is 0, the element was definitely never inserted. Case closed.
- If all bits are 1, the element was probably inserted. But those bits might have been set by other elements entirely. This is the probabilistic "maybe."
Check "cherry" with hash positions 2, 5, and 11. Position 5 is 0. Cherry was never inserted. Guaranteed. The zero bit is like finding a dry sidewalk and concluding it hasn't rained.
Check "mango" with hash positions 2, 7, and 11. All three are 1. But "apple" set positions 2 and 7, and "banana" set position 11. Mango was never inserted, but the filter says it might have been. This is a false positive. Mango walked through three different puddles left by other fruits and got blamed for the rain.
Two lookups on the same filter. Cherry hits a zero at position 5, so it is definitely absent. Mango finds all its bits set by other elements, triggering a false positive.
Why false negatives are impossible
The proof fits in a tweet. When you insert element x, you set k specific bits to 1. Bits in a Bloom filter only move in one direction: from 0 to 1. No operation ever clears a bit. So when you check x later, every bit that was set during insertion is still set. The check will always find all k bits at 1, and the filter will always report "probably yes."
A Bloom filter cannot miss an element it has seen. This one-directional bit guarantee is the foundation every application depends on. If the filter says "no," you can take that to the bank.
You can derive the exact false positive rate
Most probabilistic data structures wave their hands around error bounds. The Bloom filter hands you a closed-form equation. The false positive rate is a function of exactly three variables: m (bits), n (elements inserted), and k (hash functions).
Start with a single bit
Assume each hash function maps uniformly and independently to any of the m positions. After inserting one element using one hash function, the probability that a specific bit is still 0 is:
1 - 1/m
After inserting n elements with k hash functions each (that is kn total bit-setting operations), the probability that a specific bit is still 0 is:
(1 - 1/m)^(kn)
Using the approximation (1 - 1/m)^m ≈ e^(-1) for large m:
(1 - 1/m)^(kn) ≈ e^(-kn/m)
So the probability that a specific bit IS 1 is:
1 - e^(-kn/m)
Multiply k independent chances
A false positive occurs when all k hash positions for a non-member happen to be set to 1. Since the k hash functions are independent, the false positive probability is:
p = (1 - e^(-kn/m))^k
This is the Bloom filter equation. Everything else follows from it.
How many hash functions do you need?
Too few hash functions and each lookup checks too few bits, so collisions run wild. Too many hash functions and the bit array fills up too fast, so collisions run wild again. The Goldilocks zone exists, and calculus can find it.
Minimize p with respect to k by taking the derivative and setting it to zero. The calculation simplifies if you substitute g = e^(-kn/m) and note that at the optimum, exactly half the bits in the array are set (g = 1/2). The result:
k_optimal = (m/n) * ln(2) ≈ 0.693 * (m/n)
At this optimal k, the false positive probability becomes:
p = (1/2)^k = (0.6185)^(m/n)
That 0.6185 is 2^(-ln 2), and it makes the relationship exponential. Every extra bit per element roughly halves the false positive rate. That is an absurdly good deal. One bit gets you a 2x improvement in accuracy. Try getting that ROI from your hash map.
The false positive rate drops exponentially with bits per element. The sweet spot for most applications sits between 8 and 12 bits per element.
How big should the bit array be?
Rearranging for m given a target false positive rate p and n elements:
m = -(n * ln(p)) / (ln(2))^2
And the corresponding optimal number of hash functions:
k = -log₂(p)
Some practical numbers:
| Target FPR | Bits per element (m/n) | Hash functions (k) |
|---|---|---|
| 10% | 4.79 | 3.32 ≈ 3 |
| 1% | 9.58 | 6.64 ≈ 7 |
| 0.1% | 14.38 | 9.97 ≈ 10 |
| 0.01% | 19.17 | 13.29 ≈ 13 |
At 1% false positive rate, you need roughly 9.6 bits per element. That is 1.2 bytes. Compare that to a hash set storing 64-bit keys: at least 8 bytes per key, plus pointer overhead, plus load factor padding. A hash set easily costs 40 to 100+ bytes per element. The Bloom filter uses 10 to 80 times less space. Your hash set is eating steak. Your Bloom filter is eating rice. Same answer, wildly different bill.
How close to perfect is this?
The information-theoretic lower bound for any data structure that answers membership queries with false positive rate p is log₂(1/p) bits per element. At 1% FPR, that is log₂(100) ≈ 6.64 bits.
The Bloom filter uses 9.58 bits. The ratio is 9.58 / 6.64 ≈ 1.44. Bloom filters use exactly 1.44x the theoretically minimum space. That factor is log₂(e), and it comes from the gap between natural and binary logarithms. A structure made of nothing but a bit array and some hash functions gets within 44% of the information-theoretic limit. That is like finishing a marathon only 44% slower than the speed of sound. (Not a great analogy. The point is, it is very good.)
Every operation is O(1), and not the amortized kind
| Operation | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Create | O(m) | O(m) | Initialize m bits to 0 |
| Insert | O(k) | O(1) | Compute k hashes, set k bits |
| Lookup | O(k) | O(1) | Compute k hashes, check k bits |
| Union | O(m) | O(m) | Bitwise OR of two filters with same m, k |
| Count elements | N/A | N/A | Not supported |
| Delete | N/A | N/A | Not supported (use counting variant) |
| Enumerate | N/A | N/A | Not supported |
Since k is a constant chosen at filter creation (typically 3 to 13), every operation is effectively O(1). Worst case. Not amortized. Not "expected." Actually constant.
Why insert is O(1)
You compute k hash functions (constant) and set k bits in an array (constant time each, direct addressing). No resizing, no collision resolution, no rebalancing. The bit array never grows. It just sits there getting more bits flipped, like a light switch panel at a party.
Why lookup is O(1)
Same as insert: compute k hash functions, check k positions. If any is 0, return false. If all are 1, return true. No chains to traverse, no probing sequences. Hash maps wish they could promise this unconditionally.
Why union works
Two Bloom filters with the same m and k (and same hash functions) can be merged by bitwise OR of their bit arrays. The result is a valid Bloom filter for the union of both sets. This works because any bit set by either filter remains set, and any bit not set by either remains clear. The false positive rate of the union is at most the FPR you would get from inserting all elements into a single filter. This property makes Bloom filters useful in distributed systems where nodes can merge their filters.
Why you cannot delete
Clearing a bit might destroy information about other elements. If "apple" and "banana" both set bit 7, clearing bit 7 to evict "banana" also evicts "apple." The bits are shared, and you have no way to tell who contributed what. It is a studio apartment with no walls: you cannot remove one roommate's furniture without also removing the floor.
Two hash functions are enough (seriously)
Computing k independent hash functions sounds expensive. Good news: Kirsch and Mitzenmacher proved in 2006 that you only need two. The rest are free. Their result: given two independent hash functions h₁(x) and h₂(x), define:
gᵢ(x) = h₁(x) + i · h₂(x) mod m, for i = 0, 1, ..., k-1
This linear combination preserves the asymptotic false positive guarantee. The proof shows that pairwise independence is sufficient. In practice, you hash the element once with a 128-bit hash (like MurmurHash3), split the result into two 64-bit halves, and derive all k positions from those two halves. One hash call, k positions. Cassandra's Bloom filter implementation uses exactly this trick. So does almost everyone else who has read the paper.
The Kirsch-Mitzenmacher trick: one hash call produces a 128-bit digest that splits into h1 and h2, which generate all k positions via linear combination.
One Structure, Every Language
Every implementation below supports three operations: create a Bloom filter with target capacity and false positive rate, insert an element, and check membership. All use the Kirsch-Mitzenmacher double hashing trick for efficiency.
import math import hashlib import struct class BloomFilter: def __init__(self, expected_elements, fpr=0.01): self.size = self._optimal_size(expected_elements, fpr) self.hash_count = self._optimal_hash_count(self.size, expected_elements) self.bit_array = bytearray(math.ceil(self.size / 8)) def add(self, item): for i in range(self.hash_count): pos = self._get_position(item, i) self.bit_array[pos // 8] |= 1 << (pos % 8) def __contains__(self, item): for i in range(self.hash_count): pos = self._get_position(item, i) if not (self.bit_array[pos // 8] & (1 << (pos % 8))): return False return True def _get_position(self, item, index): h = hashlib.md5(str(item).encode()).digest() h1 = struct.unpack_from("<Q", h, 0)[0] h2 = struct.unpack_from("<Q", h, 8)[0] return (h1 + index * h2) % self.size @staticmethod def _optimal_size(n, p): return int(-n * math.log(p) / (math.log(2) ** 2)) @staticmethod def _optimal_hash_count(m, n): return max(1, int(m / n * math.log(2))) bf = BloomFilter(1000, fpr=0.01) bf.add("apple") bf.add("banana") print("apple" in bf) # True print("cherry" in bf) # False (almost certainly)
Where Bloom filters show up in production
Bloom filters answer one question: "Have I seen this before?" Use them anywhere you need to avoid expensive work (a disk read, a network call, a database query) and can tolerate the rare false positive but cannot tolerate a false negative.
The pattern is always the same: cheap probabilistic check first, expensive definitive check only when needed. Think of it as the bouncer at a nightclub. The bouncer might let in someone who is not on the list (false positive). But the bouncer will never turn away someone who is on the list (no false negatives). And the bouncer is very, very fast.
The universal Bloom filter pattern: cheap probabilistic check first, expensive definitive check only when the filter says "maybe."
-
Database key existence. Before reading a 4KB page from disk to find a row, check a Bloom filter in memory. If the filter says "no," skip the disk read entirely. Cassandra, HBase, LevelDB, and RocksDB all do this for their SSTables. Cassandra reports that Bloom filters eliminate over 90% of unnecessary disk reads.
-
Cache admission. Akamai discovered that 75% of objects in their CDN cache were one-hit wonders: requested exactly once, then never again. Seventy-five percent! They added a Bloom filter as a gatekeeper. An object only enters the cache on its second request. The first request adds the URL to the filter. On the second request, the filter confirms the URL was seen before, and the object gets cached. This freed 75% of their cache capacity for content people actually wanted.
-
Malicious URL detection. Chrome's Safe Browsing used to check every URL against a local Bloom filter before making a network request to Google's servers. If the filter says "definitely not malicious," skip the network call. Only on a "maybe" result does the browser do the expensive server lookup.
-
Recommendation deduplication. Medium uses Bloom filters to avoid re-recommending articles a user has already read. Storing every user-article pair in a database is expensive. A Bloom filter per user compresses those read receipts to a few kilobytes.
-
Lightweight blockchain clients. Bitcoin's BIP 37 protocol used Bloom filters to let SPV wallets tell full nodes "send me only transactions matching this filter." Instead of downloading every transaction in every block, lightweight clients get only the ones that might be theirs. (Privacy problems later led to better protocols, but the Bloom filter idea was sound.)
-
Spell checkers. The original use case from Bloom's 1970 paper was essentially this: check whether a word is in the dictionary without storing the entire dictionary. A false positive means occasionally accepting a misspelling. A false negative would mean rejecting a valid word. Bloom filters guarantee the latter never happens.
-
Web crawler deduplication. A crawler maintaining a set of visited URLs needs fast membership checks. Storing billions of URLs in a hash set is expensive. A Bloom filter of visited URLs lets the crawler skip already-crawled pages with minimal memory.
How to spot a Bloom filter problem in an interview
How do you know a problem calls for a Bloom filter? Look for these four signals:
-
Membership testing, not retrieval. You need to know "is X in the set?" but you do not need to retrieve X's value. If you need the value, you need a hash map. If you only need the yes/no, a Bloom filter might work.
-
Asymmetric error tolerance. False positives are acceptable. False negatives are not. This asymmetry is the key signal. If both error types are equally bad, a Bloom filter does not help.
-
The set is large relative to available memory. If the set fits comfortably in a hash set, use a hash set. Bloom filters earn their keep when the set is too large for exact storage, or when memory is scarce.
-
A definitive check exists but is expensive. The Bloom filter is the fast first pass. You need a slower, exact second pass for the positive results. Without this backup, a false positive becomes a real error.
Worked example: web crawler URL deduplication
Problem. You are building a web crawler. You have already crawled 10 billion URLs. Before following a link, you need to check whether you have already visited that URL. Memory is limited to 2 GB.
Why this calls for a Bloom filter. A hash set of 10 billion URLs at, say, 80 bytes per entry (URL string plus overhead) would require 800 GB. That is 400x your budget. That is not a budget problem, that is a fantasy.
With a Bloom filter at 1% FPR, you need 9.6 bits per element. That is 9.6 * 10^10 bits = 12 GB. Still too much. At 10% FPR, you need 4.8 bits per element = 6 GB. Still too much.
But wait. Do you actually need the filter to cover all 10 billion URLs? Partition by time. Keep a Bloom filter for the last 1 billion URLs (1.2 GB at 1% FPR) and store older URLs in a disk-backed structure. The Bloom filter handles the hot set in memory, and you only hit disk for URLs the filter does not recognize.
False positives mean you occasionally skip a URL you have not actually crawled. That is fine. Crawling a page twice is cheap. Missing a page entirely (false negative) would be a real problem. The asymmetry matches perfectly.
Worked example: username availability check
Problem. Your registration system needs to tell users immediately whether a username is taken. The database query takes 50ms. You have 100 million registered usernames.
Why this calls for a Bloom filter. Build a Bloom filter with all 100 million usernames. At 1% FPR, that is 9.6 * 10^8 bits ≈ 115 MB. Check the filter first.
If the filter says "definitely not taken," return "available" immediately. No database query needed. If the filter says "maybe taken," query the database to confirm. The 1% false positive rate means 1 in 100 available usernames triggers an unnecessary database query. That is a rounding error in your load.
False negatives would be catastrophic: telling a user the name is available when it is taken would let two users register the same name. A Bloom filter guarantees this never happens.
The counting Bloom filter enables deletion
Standard Bloom filters cannot delete because bits are shared. "I would like a refund" is not a supported operation. Counting Bloom filters, introduced by Fan, Cao, Almeida, and Broder in 2000, fix this by replacing each bit with a small counter (typically 4 bits). Insertion increments the counters at all k positions. Deletion decrements them. A position is "set" if its counter is greater than zero.
The cost is space: 4x for 4-bit counters. At 1% FPR, that means roughly 38.4 bits per element instead of 9.6. Still far less than a hash set, but the advantage narrows. You are trading the sedan for an SUV: more features, bigger footprint.
Counter overflow is theoretically possible but negligible in practice. Fan et al. showed that with 4-bit counters, the probability of any counter exceeding 15 is approximately 1.37 * 10^(-15) * m. For a filter with a million bits, that is effectively zero. You are more likely to be struck by lightning while winning the lottery.
Standard Bloom filter bits cannot be cleared without corrupting other elements. Counting Bloom filters replace bits with counters, enabling safe deletion by decrementing.
Where Bloom filters fall short
Every data structure has a "yeah, but..." section. Here is this one's.
- No enumeration. You cannot list the elements in the filter. The elements are not stored. You know something is in there. You just cannot ask what.
- No counting. You cannot ask "how many elements are in the filter?" (Cardinality estimation is possible but approximate.)
- No deletion in the standard version. Use a counting Bloom filter if you need it.
- No shrinking. The bit array size is fixed at creation. If you underestimate n, the FPR degrades as the filter fills up. If you overestimate, you waste space. Choose wisely at birth.
- Parameter sensitivity. Choosing m and k wrong can be costly. Too small an m means unacceptable false positives. Too large an m wastes memory. The formulas above give you the right values, so use them. This is one of those rare cases where the math is easier than guessing.
The whole story in 30 seconds
- A Bloom filter is a bit array of m bits with k hash functions. Insert sets k bits. Lookup checks k bits.
- False positives are possible. False negatives are not. This asymmetry is the defining property and the entire reason it is useful.
- The false positive probability is (1 - e^(-kn/m))^k. The optimal k is (m/n) * ln(2).
- At 1% FPR, you need 9.6 bits per element, which is 1.44x the information-theoretic minimum.
- The Kirsch-Mitzenmacher trick: two hash functions can simulate k with no loss in accuracy.
- Every operation (insert, lookup) is O(k), which is O(1) since k is a constant.
- Bloom filters cannot delete, enumerate, or count. They answer exactly one question: "Is this element definitely not in the set?"
- Use them as a cheap first pass before an expensive definitive check: disk reads, network calls, database queries.
Bloom filters show up in systems design interviews and in real production systems, often in the same week. If you want to practice recognizing when a probabilistic data structure is the right call, and articulate the trade-offs out loud instead of staring at a whiteboard, SpaceComplexity runs voice-based mock interviews that test exactly that reasoning.
If you found the hash table internals article useful, Bloom filters are the natural companion: same hash function machinery, completely different trade-off. And for the data structure that Bloom filters are often compared against, see the hash set deep dive.
Further reading
- Bloom filter (Wikipedia)
- Bloom, B.H. "Space/Time Trade-offs in Hash Coding with Allowable Errors" (Communications of the ACM, 1970)
- Kirsch and Mitzenmacher, "Less Hashing, Same Performance: Building a Better Bloom Filter" (ESA 2006)
- Bloom Filters by Example (GeeksforGeeks)
- Summary Cache: Bloom Filters Math (University of Wisconsin)