How Hash Functions Work: Keys, Buckets, and the O(1) Guarantee

May 26, 202611 min read
dsaalgorithmsinterview-prepdata-structures
How Hash Functions Work: Keys, Buckets, and the O(1) Guarantee
TL;DR
  • Hash function maps any key to an array index in O(1): direct array access replaces linear search entirely.
  • Polynomial rolling hash gives each character position a unique weight, so anagrams collide only by coincidence.
  • Collision handling splits into two strategies: chaining (linked list per bucket) or open addressing (probe the flat array).
  • Load factor α = n/m is the single control knob; open addressing catastrophically slows above α ≈ 0.9 due to the 1/(1−α)² probe formula.
  • Amortized O(1) insertion holds because doubling the array at each resize makes rehash events exponentially rare, spreading the O(n) cost across n insertions.
  • Hash flooding lets an attacker force O(n) lookups by crafting colliding keys; SipHash with a per-runtime random seed is the standard fix.
  • SUHA proof shows expected lookup is Θ(1 + α): keep α bounded by a constant and O(1) is a theorem, not an assumption.

Python looks up a dictionary key in the same time whether the dictionary has 10 entries or 10 million. Java's HashMap.get() is as fast on line one of a log file as on line one million. You probably know this is O(1). Here is how hash functions pull that trick off, and what it costs to keep pulling it off.

You Don't Search. You Calculate.

A hash table is an array with a mathematical shortcut on top. Instead of scanning the array for your key, you call a function that tells you exactly which index to check.

index = hash(key) % array_size
value = array[index]

That is the whole trick: turn the key into an index in O(1), then do a direct array access in O(1). No comparisons. No iteration. Everything else in hash table design, collision handling, load factors, resizing, exists to make that one line hold up under pressure.

Key to hash to index to bucket: the full pipeline hash(key) % m gives you a direct array index. No searching, just arithmetic.

Integers Are Simple to Hash

For integers, the simplest approach is the division method:

h(k) = k mod m

where m is the number of buckets. Pick m as a prime not close to a power of 2 or 10, and keys tend to spread across buckets evenly.

Knuth's multiplication method is slightly more uniform. Compute k * A mod 1 (the fractional part of k times a constant A), then multiply by m and take the floor:

h(k) = floor(m × (k × A mod 1))

Knuth recommends A ≈ (sqrt(5) - 1) / 2 ≈ 0.618, the golden ratio. The fractional part of k × A distributes near-uniformly across [0, 1) regardless of k, so the resulting index spreads keys evenly even when m is a power of 2.

Strings Need a Polynomial

Strings need more care. A common mistake is XOR-ing all character codes together. It seems reasonable. It is wrong. "abc" and "bca" XOR to the same value because XOR ignores position. You have built a hash function that cannot tell an anagram from the original word.

The fix is polynomial rolling hash.

Pick a prime base p (31 is standard for lowercase ASCII) and treat the string as a polynomial evaluated at p:

h(s) = s[0] × p^(n-1) + s[1] × p^(n-2) + ... + s[n-1] × p^0   (mod m)

Each character gets a unique weight (a distinct power of p), so reordering characters changes the hash. Two different strings collide only when the polynomial evaluates to the same value, which happens with probability at most (n-1)/m under uniform key distribution.

def polynomial_hash(s: str, p: int = 31, m: int = 10**9 + 9) -> int: h, power = 0, 1 for ch in s: h = (h + (ord(ch) - ord('a') + 1) * power) % m power = (power * p) % m return h

Choose m as a large prime (10^9 + 9 is a popular pick) to minimize the chance that two strings share a hash. Java's String.hashCode() uses this pattern with p=31, wrapping at 32-bit integers. Then HashMap applies one extra step before computing the bucket index:

// Java HashMap: mix high bits into low bits int h = key.hashCode(); h = h ^ (h >>> 16);

Java's bucket count is always a power of 2. The final index is h & (m - 1), which uses only the lowest k bits. Keys whose hash codes differ only in the upper bits would all collide without this XOR. Spreading the high bits downward is the cheapest possible fix.

Three Things Every Hash Function Must Do

  • Determinism. Same key, same index, every time. A hash function that returns different values for the same input is not a hash function. It is a bug.
  • Speed. Hashing a fixed-width key (integer, pointer) should be O(1). Hashing a string of length k is O(k), which is fine since you have to read it anyway.
  • Spread. Keys land in different buckets. A function that maps everything to bucket 0 turns your hash table into a linked list. You still call it O(1) at parties and people believe you.

The formal ideal is called the Simple Uniform Hashing Assumption (SUHA): every key is equally likely to hash into any of the m buckets, independently of other keys. No real-world hash function perfectly satisfies this, but well-designed ones come close enough that the analysis holds in practice.

Collisions Are Guaranteed. Plan for Them.

No matter how good your hash function is, two distinct keys will eventually hash to the same bucket.

Skeletor birthday paradox meme The birthday paradox: 23 people, 50% chance of a shared birthday. Collisions arrive faster than your intuition suggests.

The birthday paradox explains why: with n items and m buckets, the expected number of colliding pairs is roughly n(n-1) / (2m). Insert 50 items into a 100-bucket table and you already expect about 12 collisions.

You cannot design around collisions. You can only choose how to handle them.

Two strategies dominate.

Chaining: Each Bucket Holds a List

Every bucket is a linked list. A collision means the new item is appended to that bucket's list.

bucket 3: → ("apple", 1.50) → ("grape", 2.00) → None
bucket 7: → ("mango", 3.00) → None
bucket 9: → None

Chaining: bucket array with linked lists showing collision resolution When two keys hash to bucket 3, they both live in the same list. Lookup walks the chain until it finds a match.

Insertion prepends to the list in O(1). Lookup walks the list comparing keys until it finds a match, which takes O(length of that list). Under uniform hashing, the expected list length at any bucket is α (defined below), so expected lookup is O(1 + α).

Open Addressing: Probe the Array Directly

No linked lists. Everything lives in the flat array. On a collision, probe for the next empty slot.

# Linear probing: check index, index+1, index+2, ... i = hash(key) % m while table[i] is not None and table[i][0] != key: i = (i + 1) % m

Open addressing: linear probing hops over occupied slots until it finds an empty one "apple" hashes to slot 3, which is taken. Slot 4 is also taken. Slot 5 is empty. Inserted. Three probes, zero extra allocations.

All data sits in one contiguous block of memory. Modern CPUs prefetch cache lines aggressively, so walking a few adjacent slots after a collision is cheap. The tradeoff: clustering. Dense runs of occupied slots grow over time, slowing both lookups and insertions.

Python's dict uses open addressing with a pseudorandom probe sequence (i = (5*i + 1 + perturb) % m where perturb starts at the original hash and shifts right each step) to break up clusters and visit every bucket.

Load Factor Is the Single Control Knob

The load factor α = n/m, where n is the number of stored items and m is the number of buckets.

α is the number that determines everything about your hash table's performance. Lower α means fewer collisions, faster lookups, more wasted space. Higher α means more collisions, slower lookups, less wasted space.

For open addressing with linear probing, the expected number of probes on an unsuccessful search is (1/2)(1 + 1/(1-α)^2). At α = 0.5, that is about 2.5 probes. At α = 0.9, it is about 50. At α = 0.99, it is about 5,000. The formula has a vertical asymptote at α = 1. This is why open-addressing tables never let α approach 1.

Load factor vs expected probes: probe count explodes as alpha approaches 1 That curve is not steep because of a formatting error. α = 0.75 is Java's default resize threshold for a reason.

α = 0.50: fast, half the array empty
α = 0.75: Java HashMap's default resize threshold
α = 0.90: noticeably slower, especially for open addressing
α → 1.0: catastrophic for open addressing

Chaining degrades more gently because list length grows linearly with α, not as 1/(1-α)^2. Go's map uses 8-slot chaining buckets and allows α up to about 6.5 before resizing.

Resizing Keeps α Bounded

When α hits the threshold, the table allocates a new array twice as large and rehashes every key into it.

The resize itself costs O(n): visit every item, compute its new index, place it. But consider how often it happens. Right after a resize, you have n/2 items in a 2m-slot table, so α = 1/4. You need n/2 more insertions before hitting the threshold again.

Total cost across all insertions: if you double at every power of 2, the resize costs form the series 1 + 2 + 4 + ... + n/2, which sums to n. Spread that O(n) cost over n insertions and you get O(1) amortized per insertion.

This is why doubling (or any constant growth factor greater than 1) is mandatory. If you grew the table by a fixed increment instead, the series would be arithmetic rather than geometric, summing to O(n^2) for n insertions.

The O(1) Guarantee Is a Theorem, Not a Hope

Under SUHA, the expected time for an unsuccessful search in a chaining hash table is Θ(1 + α).

Proof: hashing to a bucket is O(1). Then you walk the entire chain. The expected chain length at any bucket i is E[n_i]. Each of the n stored items falls in bucket i with probability 1/m (by SUHA), independently. So E[n_i] = n × (1/m) = n/m = α. Expected time is O(1) for hashing plus O(α) for the walk = Θ(1 + α).

For a successful search, you find the key partway through the list, not at the end. On average, you examine about α/2 items before the match. Expected time is Θ(1 + α/2).

If you keep α ≤ some constant c, then Θ(1 + α) = Θ(1 + c) = Θ(1). Resizing is what enforces that constant. Without it, α would grow without bound and the O(1) guarantee collapses.

When the Assumption Breaks: Hash Flooding

SUHA is an assumption. It holds when keys are chosen without knowledge of your hash function. An attacker who knows the function can craft inputs that all map to the same bucket, turning O(1) lookup into O(n).

In 2011, researchers demonstrated this at the 28C3 conference. PHP, Python, Java, and Ruby were all vulnerable. Sending 65,536 specially crafted POST parameter names to a PHP 5 server would pin one server core for about 30 seconds. A single laptop could bring down a cluster.

The fix is keyed hashing. On startup, the runtime generates a random secret. The hash function becomes h(key, secret). Without the secret, an attacker cannot predict which strings will collide.

Python switched to SipHash in version 3.4. Go seeds its hash function per-process. Java XORs a random seed into each HashMap at construction time. SipHash was designed by Jean-Philippe Aumasson and Daniel J. Bernstein in 2012, specifically in response to these attacks. It runs in sub-nanosecond time per byte while being strong enough that guessing the seed is not feasible.

The Short Version

  • A hash function maps keys to array indices. Every O(1) claim in hash table analysis comes from this single array-access step.
  • Polynomial rolling hash handles strings: prime base p gives each position a unique weight, prime modulus m keeps values bounded.
  • Collisions are mathematically unavoidable. Chaining appends to a list per bucket; open addressing probes the array for the next empty slot.
  • Load factor α = n/m determines expected collision frequency. Higher α means more probes, especially in open addressing.
  • Resizing doubles the array when α hits a threshold. The O(n) rehash cost amortizes to O(1) per insertion because resizes happen exponentially rarely.
  • Under SUHA with constant α, expected lookup is Θ(1). Worst case is still O(n), and adversarial inputs can force that worst case.
  • Modern runtimes use randomized hash seeds (SipHash) to neutralize adversarial collision attacks.

Explaining these tradeoffs out loud, not just knowing them, is exactly what separates a "hire" from a "strong hire" debrief. SpaceComplexity runs voice-based DSA mock interviews with rubric-based feedback, so you can practice the reasoning, not just read it.

For the full per-operation time complexity analysis, see Hash Table Time Complexity: Why Lookup Is O(1) (and When It Secretly Isn't). For why chaining is slower than open addressing in practice despite identical Big-O, see Array vs Linked List Performance: Same Big-O, Completely Different Speed. The amortized O(1) for resizing is the same argument used in Dynamic Array Time Complexity: What Every append() Is Actually Doing.

Further Reading