What Is a Hash Function? The Concept Every O(1) Lookup Depends On

June 4, 20269 min read
dsaalgorithmsinterview-prepdata-structures
What Is a Hash Function? The Concept Every O(1) Lookup Depends On
TL;DR
  • Hash function converts a key to an array index in O(k) time, making O(1) lookup possible without searching.
  • Polynomial rolling hash (prime 31) makes position count, so "dune" and "nude" hash differently despite sharing the same characters.
  • Collision is inevitable; chaining stores multiple keys per bucket as a linked list, keeping retrieval correct.
  • Load factor (keys / buckets) triggers rehashing — Java at 0.75, Python at 2/3 — which keeps average-case O(1).
  • The O(1) guarantee is average-case; adversarial keys can force O(n), which is why Python randomizes its hash seed since 3.3.
  • Frequency counting, complement lookup, canonical grouping, and sliding window state are the four hash map patterns that appear most in coding interviews.

You've been told hash maps are O(1). You've been nodding along and writing seen = {} in Two Sum without thinking too hard about why it works. At some point, an interviewer is going to ask you how. And "it's, like, fast because... pointers?" is not going to land.

A hash function converts a key into an array index. That one sentence explains why hash maps are fast, why they use memory the way they do, and why that O(1) occasionally lies to your face.

A Shelf-Finding System That Actually Works

Imagine you run a library with exactly 10 shelves. Every time a new book arrives, you need to decide which shelf it goes on. Option A: search every shelf until you find a spot. Option B: invent a rule.

Option A is O(n). You built a library app on top of option A. Your users are unhappy.

Option B: add up the ASCII values of the title's characters, take the result modulo 10.

def shelf_number(title: str) -> int: return sum(ord(c) for c in title) % 10

shelf_number("dune") computes 100 + 117 + 110 + 101 = 428, then 428 % 10 = 8. Shelf 8. Done. To find it later, run the same rule. No searching, no drama.

That is a hash function: a deterministic rule that converts a key to an index.

The rule is fast (one pass over the key), deterministic (same title, same shelf, always), and produces a number in the right range. Three properties, zero magic.

Why the Library Rule Breaks and What Fixes It

The shelf system has a problem. "dune" and "nude" have the same characters, different order, identical hash. You shelved Dune, a customer asks for Nude (presumably a very different book), and you confidently hand them the wrong one.

This is called a collision. It happens when two distinct keys produce the same index. It is not an edge case. It is Tuesday.

Good hash functions minimize collisions by making output sensitive to the position of characters, not just their presence. The standard approach for strings is polynomial rolling hash:

def hash_string(s: str, table_size: int) -> int: PRIME = 31 h = 0 for c in s: h = (h * PRIME + ord(c)) % table_size return h

Multiplying by a prime at each step makes position matter. "abc" and "bac" now hash differently. Java's String.hashCode() uses exactly this algorithm with prime 31, because prime numbers play nicer with modular arithmetic and reduce clustering. Python's hash() takes a different approach: it's randomized every time the interpreter starts, for security reasons we'll get to shortly.

Three properties any hash function needs:

  • Deterministic. Same key, same index, always. Otherwise you store data on row 7 and retrieve from row 14 and nothing makes sense.
  • Fast. O(k) where k is the key length. For integers, that's O(1).
  • Uniform. Keys should spread across the table. Clustering is the enemy.

Every Hash Table Is Just an Array With a Good Filing System

A hash table is an array where the hash function decides which slot to use. Not a tree, not a linked list, not a heap. An array. That is the entire reason it's fast.

class HashTable: def __init__(self, size: int = 16): self.size = size self.buckets = [[] for _ in range(size)] def put(self, key: str, value) -> None: index = hash_string(key, self.size) for pair in self.buckets[index]: if pair[0] == key: pair[1] = value return self.buckets[index].append([key, value]) def get(self, key: str): index = hash_string(key, self.size) for pair in self.buckets[index]: if pair[0] == key: return pair[1] return None

Each slot in buckets is a list. When two keys hash to the same index, both go into the same list. This is called chaining. You search within the bucket (usually tiny, one or two elements) rather than across the whole table.

Most languages use chaining or open addressing, where a collision means "probe the next empty slot" instead of maintaining a list. Open addressing is more cache-friendly but gets cranky near full capacity.

The whole structure takes O(n) space for n stored key-value pairs. The array slots you don't use are wasted, which is the trade-off you make for constant-time access.

Collisions Are Inevitable. Load Factor Is What You Control.

This is where interviewers watch you either get it or not.

If you have more keys than buckets, some bucket must hold multiple keys by pigeonhole. Easy case. The sneaky case: even with far fewer keys than buckets, the birthday problem guarantees collisions start appearing when the table is about half-full. You don't need to be past capacity to have problems.

The load factor is stored keys divided by buckets. It is the single number that controls whether your hash table stays fast or slowly turns into a linked list wearing a hash map costume.

Java's HashMap rehashes when load factor exceeds 0.75. Python's dict rehashes around 2/3. Rehashing allocates a new, larger array and re-inserts every key. It costs O(n), but happens infrequently enough that the amortized average stays O(1).

OperationAverageWorst Case
InsertO(1)O(n)
LookupO(1)O(n)
DeleteO(1)O(n)

The worst case requires either a terrible hash function or adversarial input. In 2011, a security researcher demonstrated that PHP 5 could be brought to its knees by sending 65,536 HTTP parameters with carefully crafted keys that all hashed to the same bucket. The server spent 30 seconds on what should have taken milliseconds. One web request. Thirty seconds. Server down.

Python 3.3 responded by randomizing its hash seed at startup (PYTHONHASHSEED). Java added extra mixing to its hash function. This is why hash("abc") returns a different number every time you restart a Python process, and why dictionary insertion order is not something you should assume is deterministic across runs.

The lesson: a hash function that an attacker can predict becomes a denial-of-service vulnerability. Interview problems won't hand you adversarial keys, but this story is a good one to know.

The O(1) Has Fine Print

Everyone says O(1) and means average case. Two clarifications actually matter.

Hashing a key is not free. For integer keys, it costs O(1). For string keys of length k, it costs O(k). Most interview problems use short strings or integers, so this is irrelevant. For problems with very long strings as keys, it is worth a sentence. The hash map time complexity breakdown covers this in detail.

The O(1) also assumes good distribution. An interview problem won't hand you adversarial keys, but saying "average case O(1), worst case O(n) in the degenerate case of all keys colliding" is the kind of thing that gets you to the next round. It shows you know what you're using, not just that you can use it.

Four Patterns Hash Functions Power in Interviews

Once you understand what a hash function does, these patterns become logical instead of something you memorized from a flashcard.

Frequency counting. One pass, count each element:

def first_non_repeating(s: str) -> str: counts = {} for c in s: counts[c] = counts.get(c, 0) + 1 for c in s: if counts[c] == 1: return c return ""

Complement lookup (Two Sum). The O(n²) approach checks every pair. The hash map turns "has this complement appeared before?" into an O(1) question:

def two_sum(nums: list[int], target: int) -> list[int]: seen = {} for i, n in enumerate(nums): complement = target - n if complement in seen: return [seen[complement], i] seen[n] = i return []

For every element you check in O(1) whether its partner was already seen. The full Two Sum breakdown covers extensions to this pattern.

Grouping by canonical key. Hash maps accept composite keys. Sort each word to get its canonical anagram form, then group:

from collections import defaultdict def group_anagrams(words: list[str]) -> list[list[str]]: groups = defaultdict(list) for word in words: key = tuple(sorted(word)) groups[key].append(word) return list(groups.values())

The hash function runs on the sorted tuple, not the original string. Two anagrams get the same key automatically.

Sliding window state. When you need to track what's inside a moving window, a frequency hash map serves as the window's state. Increment when the right pointer advances, decrement when the left catches up. State stays O(alphabet size), not O(window size).

Two Mistakes That Are Completely Avoidable

Assuming hash maps preserve insertion order everywhere. Python 3.7+ dicts do. JavaScript objects mostly do for string keys. Java's HashMap does not. If your solution depends on ordered traversal, use Java's LinkedHashMap or explicitly sort your output. This is the kind of thing that fails silently and looks correct on small examples.

Using a hash map when an array suffices. If the key domain is small and bounded (lowercase English letters, digits, ASCII), an array of size 26 or 128 is faster and simpler. No hash function, no collision handling, direct index. The array vs hash map tradeoff matters in practice because your interviewer will notice when you reach for a dict to track 26 possible characters.

If you want to practice explaining these trade-offs out loud under actual interview pressure, SpaceComplexity runs voice-based mock interviews with rubric-based feedback. Knowing the concept is one thing. Articulating it at 1 AM before a Google interview is a different skill entirely.

Further Reading