What Is a Hash Collision? Causes, Fixes, and Interview Impact

You have a hash map. You insert two keys. They end up in the same bucket. You stare at it. You blame your code. Then you read the math. It's not your code. It's the pigeonhole principle, and it's been waiting for you.

A hash collision is when two different keys produce the same bucket index. It's not a bug. It's a guarantee. And once you understand why, hash map behavior stops being mysterious and starts being predictable.

Collisions Are Mathematically Guaranteed

A hash function maps an arbitrarily large input space to a fixed range of bucket indices. Say you have 256 buckets. Your hash function maps every possible string onto indices 0 through 255. There are infinitely more strings than there are buckets, so two different keys will eventually land in the same slot. That's the pigeonhole principle: put 257 pigeons in 256 holes, and at least one hole has two pigeons. The pigeons did not consent to sharing.

The birthday paradox makes this worse than intuition suggests. With 256 buckets, you'd expect your first collision somewhere around key 17. Not key 257. The math on this one is genuinely startling: the probability of at least one collision after inserting k keys into m buckets is approximately 1 - e^(-k²/2m). Plug in m=256 and k=20 and you're already at 85%. Your hash table is colliding constantly from the start. It's designed to handle this. Don't panic.

Watch a Hash Collision Happen

Say you're building a hash table with 8 buckets. Your hash function is deliberately terrible for illustration:

def hash_key(key: str, size: int) -> int:
    return sum(ord(c) for c in key) % size

Insert two keys:

hash_key("ab", 8)  # ord('a')=97, ord('b')=98 → 195 % 8 = 3
hash_key("ba", 8)  # ord('b')=98, ord('a')=97 → 195 % 8 = 3

Both land in bucket 3. Collision. "listen" and "silent" would do the same thing. Any anagram pair breaks a sum-of-characters hash, because addition is commutative and your hash function did not notice.

Production hash functions are far more sophisticated. Polynomial rolling hashes weight each character by its position. But they still collide. The goal isn't to eliminate collisions. It's to spread them out so no single bucket becomes a traffic jam.

Your Language Already Picked a Strategy

Two main approaches handle collisions. Every language picks one, and you'll be expected to know which.

Chaining

Each bucket holds a linked list. When two keys collide in bucket 3, they both live there as nodes in the same list.

Bucket 3: ["ab"] → ["ba"] → null

Lookup scans the list until it finds a key match. Java's HashMap uses chaining, and upgrades chains to red-black trees when any single bucket holds more than 8 elements. At that point the O(n) list scan becomes O(log n) tree search.

Time: O(1) average, O(n) worst case if everything piles into one bucket. Space overhead: each node needs an extra pointer.

Open Addressing

No linked lists. Everything stays in the array. When bucket 3 is occupied, probe for the next open slot.

Linear probing checks 3, 4, 5, 6 in sequence. Cache-friendly, simple, fast in practice. The downside: occupied cells form long runs. Every collision into an existing run makes it longer. This is called primary clustering, and it compounds.

Quadratic probing steps by 1, 4, 9, 16 instead. Breaks up primary clusters but creates secondary clusters for keys that share the same initial hash.

Double hashing computes the step size with a second hash function. Each key gets its own probe sequence, which eliminates clustering at the cost of two hash computations per operation.

Python's dict uses open addressing with a randomized probe sequence derived from the object's hash. Load factor threshold is around 0.66 before it resizes.

Chaining vs open addressing: same collision, two different solutions

Chaining grows the bucket sideways. Open addressing finds the next empty slot in the array. Both work. Neither eliminates the collision.

At a Glance

Strategy	Cache Behavior	Extra Memory	Load Factor Limit
Chaining	Poor (pointer chasing)	Linked list nodes	Can exceed 1.0
Linear probing	Excellent	None	~0.7
Double hashing	Good	None	~0.7
Java HashMap	Mixed (list then tree)	List nodes	0.75 (default)

Load Factor: The Number That Controls Everything

The load factor (α) is stored keys divided by buckets.

α = keys / buckets

This single ratio determines average lookup time more than any other design choice. At α = 0.5, expected probes per lookup is about 1.5 for open addressing. At α = 0.9, it jumps to 5.5. At α = 1.0, you might as well be searching an array. Congratulations, you made a list.

Standard library resize thresholds:

Java HashMap: 0.75
Python dict: ~0.66
C++ unordered_map: 1.0 (brave choice)
Go maps: ~6.5 items per bucket (a different model entirely)

When the load factor hits the threshold, the table doubles in size and every key rehashes. That single rehash costs O(n), but it happens rarely enough that insertion stays O(1) amortized. The math: if you double at n elements, the next rehash happens after you've inserted n more. You spent O(n) once to handle n operations. One dollar per insert.

When O(1) Becomes O(n)

The O(1) guarantee for hash map operations is an average-case claim, not a worst-case guarantee. It depends on a well-distributed hash function and keys that don't conspire against you.

If every key hashes to bucket 3, your O(1) hash map is now an O(n) linked list with extra steps. With normal data and a good hash function, this doesn't happen. But an attacker can make it happen deliberately.

In 2011, researchers Klink and Wälde demonstrated this at the 28C3 conference. By crafting HTTP POST requests with 65,536 form parameters that all collided in PHP 5's hash table, they consumed 30 seconds of CPU per request. The attack worked against PHP, Python, Java, and Ruby simultaneously, because all of their hash functions were deterministic and therefore predictable. If you can compute which keys collide before you send them, you can turn any hash map lookup into a full bucket scan.

The fix was hash randomization. Python 3.3 enabled it by default via PYTHONHASHSEED. Java added extra bit-mixing. Go uses a per-map random seed generated at map creation. Now hash("hello") returns a different value on every Python startup, so an attacker can't precompute colliding inputs.

This is why "hash tables are O(1)" is a qualified statement: O(1) with a good hash function, randomization, and non-adversarial inputs. O(n) if those assumptions break.

The Caveat That Separates Good Answers From Great Ones

Most interview problems treat hash map operations as O(1). That's the right default. But the subtlety surfaces in specific places.

Custom keys require custom hash functions. In Python, tuples are hashable but lists aren't, because mutating a list after insertion would corrupt the table (the bucket it lives in would change without the table knowing). In Java, you override hashCode() and equals() together or you get silent incorrect behavior.

Worst-case guarantees sometimes matter. If an interviewer asks for O(log n) guaranteed time on all operations, a hash map isn't the answer. A balanced BST (Java's TreeMap, Python's SortedList) gives O(log n) worst case. For more, see BST vs Hash Map: O(log n) Beats O(1) More Often Than You Think.

The load factor explains surprise slow insertions. When the table just resized, that single insert cost O(n). Amortized over all inserts it's still O(1), but explaining why is what separates a candidate who says "O(1)" from one who understands it. The full analysis is in Dynamic Array Time Complexity: What Every append() Is Actually Doing.

Equal objects must hash identically. The classic bug: two objects compare equal but return different hashes. The table treats them as different keys even though your code treats them as the same.

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        return self.x == other.x and self.y == other.y

    def __hash__(self):
        return hash((self.x, self.y))  # tuple hash is well-defined

If you define __eq__ without __hash__, Python makes the object unhashable. You'll find out at runtime.

For how hash functions actually compute their output, see How Hash Functions Work: Keys, Buckets, and the O(1) Guarantee.

Three Interview Problems Where Collision Logic Is the Insight

Several interview problems involve collision thinking without the word "hash" appearing anywhere.

Group Anagrams (LeetCode 49) works by mapping each word to its sorted form as a key. Two anagrams produce the same sorted key. That's intentional collision: you want them in the same bucket. Understanding that hash maps group by key equality, not just key existence, is the whole insight.

Subarray Sum Equals K (LeetCode 560) stores prefix sums in a hash map. The "collision" that matters is when the same prefix sum appears twice, meaning a zero-sum subarray exists. Tracking the frequency of each sum value is the entire algorithm.

Rabin-Karp string matching uses a rolling hash to scan for pattern matches in O(1) per window. It handles collisions with a full string comparison fallback. The hash comparison is O(1) and fast; the string comparison catches false positives. Collision handling is the correctness guarantee, not a bug to avoid.

What to Know Cold

A hash collision is two different keys mapping to the same bucket. Unavoidable by the pigeonhole principle.
Chaining uses linked lists per bucket. Open addressing probes for the next free slot.
Load factor (keys / buckets) controls average performance. Most standard libraries resize at 0.66 to 0.75.
O(1) average degrades to O(n) worst case when collisions concentrate in one bucket.
Hash randomization (Python, Java, Go) prevents adversarial collision attacks.
Objects that compare equal must hash identically. Define __hash__ and __eq__ together or both stay home.

If you want to practice explaining these tradeoffs out loud under time pressure, SpaceComplexity runs voice-based mock interviews that score your DSA explanations the same way a real interviewer would.