What Is a Rolling Hash? The O(1) Window Behind Fast String Search

June 18, 20269 min read
dsaalgorithmsinterview-prepdata-structures
What Is a Rolling Hash? The O(1) Window Behind Fast String Search
TL;DR
  • Rolling hash updates a window fingerprint in O(1) by removing the outgoing character and adding the incoming one — no re-read of the middle.
  • The polynomial formula is h = Σ s[i] * b^(m-1-i) mod p; the slide step is new_h = (old_h - s[i] * b^(m-1)) * b + s[i+m] mod p.
  • Rabin-Karp uses this to search in O(n + m) average time instead of O(n·m) brute force — character comparison only triggers on hash match.
  • Collisions never produce wrong answers; verify on every match, or use double hashing to reduce collision probability to roughly 1/(p1 × p2).
  • A prefix hash array lets you retrieve any substring hash in O(1) after O(n) preprocessing, enabling fast arbitrary-length substring comparison.
  • Rolling hash unlocks LeetCode 187 (Repeated DNA Sequences) in O(n) and LeetCode 1044 (Longest Duplicate Substring) in O(n log n) via binary search on length.
  • Reach for rolling hash when you need to fingerprint O(n) substrings of non-trivial length and brute-force re-hashing would cost O(n·m).

If you've ever wondered how Rabin-Karp searches a text without recomputing a hash from scratch at every position, the answer is rolling hash. It's a single arithmetic insight, and it collapses what would otherwise be an O(n·m) algorithm into something that genuinely feels like cheating.


The Problem With the Obvious Approach

Suppose you're searching for a pattern of length m inside a text of length n. The brute-force move is to check every window of size m in the text: does position 0 through m-1 match? Does 1 through m match? And so on.

That gives you O(n) windows. Each comparison takes O(m) time. Total: O(n·m).

For a text of 10,000 characters and a pattern of 100, that's a million operations. Fine. But at a text of 10 million characters with a slightly longer pattern, you're the engineer explaining to your manager why the search feature "takes a sec."

The core idea behind rolling hash: instead of comparing characters, compare fingerprints. If you can compute a short numeric fingerprint for each window in O(1) time, you only do the expensive character-by-character comparison when fingerprints match. Most windows won't match, so most comparisons never happen.

The trick is computing O(n) fingerprints without spending O(m) on each one.

Bigfoot sitting contemplatively, captioned: "When you realize most legacy enterprise search is brute force and users don't complain"

Most users don't complain. Until the 50 million character log file.


What Makes a Hash "Roll"

A regular hash reads every character in a string and produces a number. Slide the window one position right and you'd need to re-read all m characters to update it. That's O(m) per slide. Useless.

A rolling hash is designed so you can compute the next window's hash from the current one using just two operations:

  1. Remove the contribution of the character leaving the window (the leftmost one).
  2. Add the contribution of the character entering the window (the new rightmost one).

Both steps take O(1). The window "rolls" forward and the hash updates without touching the characters in the middle. The middle characters don't need to be consulted. They're not even invited.


The Math Behind the Roll

The standard choice is a polynomial hash. For a window containing characters s[i], s[i+1], ..., s[i+m-1]:

hash = s[i] * b^(m-1) + s[i+1] * b^(m-2) + ... + s[i+m-1] * b^0

where b is a base (commonly 31 for lowercase letters, 131 for general ASCII) and everything is taken modulo a large prime p (like 10^9 + 7) to keep numbers manageable.

When the window slides right by one:

  • The old leftmost character s[i] had contribution s[i] * b^(m-1). Subtract it.
  • Every remaining character's exponent drops by one (multiply the whole hash by b).
  • The new rightmost character s[i+m] enters with exponent 0, so just add it.

Putting it together:

new_hash = (old_hash - s[i] * b^(m-1)) * b + s[i+m]

Everything modulo p. You precompute b^(m-1) mod p once at setup. After that, each slide is four arithmetic operations. Four.

Here's a concrete Python implementation:

def search(text: str, pattern: str) -> list[int]: n, m = len(text), len(pattern) if m > n: return [] BASE = 131 MOD = 10**9 + 7 power = pow(BASE, m - 1, MOD) pattern_hash = 0 window_hash = 0 for i in range(m): pattern_hash = (pattern_hash * BASE + ord(pattern[i])) % MOD window_hash = (window_hash * BASE + ord(text[i])) % MOD matches = [] for i in range(n - m + 1): if window_hash == pattern_hash: if text[i:i + m] == pattern: matches.append(i) if i + m < n: window_hash = (window_hash - ord(text[i]) * power) % MOD window_hash = (window_hash * BASE + ord(text[i + m])) % MOD window_hash %= MOD return matches

The initial hash computation is O(m). Every subsequent slide is O(1). Character-by-character verification only happens on hash matches, which for a good hash function and non-adversarial input is rare.


Hash Collisions: The One Catch

Two different strings can produce the same hash. This is called a collision, and it's unavoidable with any fixed-size hash. At some point, two totally different windows will have identical fingerprints. This is mathematics being annoying.

A collision doesn't give you a wrong answer, it just wastes time. When hashes match, you verify with a real character comparison before reporting a match. The algorithm stays correct. Worst case, if your input is adversarially crafted by someone who has read your source code, memorized your modulus, and apparently has nothing better to do, collisions happen constantly and you're back to O(n·m). In practice, with a large prime modulus and a good base, collision probability per window is around 1/p, which for p around 10^9 is negligible.

A common defensive technique is double hashing: compute two independent hashes with different bases and moduli. The combined collision probability drops to roughly 1/p1 × 1/p2. For p1 and p2 both around 10^9, that's about one in 10^18. In practice, never.


Each Slide Costs O(1). The Full Pass Costs O(n + m).

StepComplexity
Initial hash setupO(m)
Each window slideO(1)
Total for n windowsO(n + m)
Verification on matchO(m) per match
Extra spaceO(1)

For the typical case with few matches, total time is O(n + m). This is the same asymptotic result as KMP, but rolling hash has a simpler implementation and generalizes to 2D patterns, multiple patterns, and longest-common-substring problems in ways KMP does not.


Where This Appears in Interviews

Rolling hash unlocks a category of problems that look like they require O(n²) or worse. These are the problems where you look at the constraints, think "O(n²) won't pass," and then stare at the screen for a while.

Repeated DNA Sequences (LeetCode 187). Find all 10-letter substrings that appear more than once. The brute-force reads all O(n·10) substrings into a hash set. A rolling hash lets you hash each window in O(1), keeping total time O(n).

Longest Duplicate Substring (LeetCode 1044). Binary search on the answer (length), then use rolling hash to check whether any substring of that length appears twice. The binary search runs log(n) iterations, each costing O(n) with rolling hash: O(n log n) total. Without rolling hash this is O(n² log n) or worse.

Rabin-Karp Multi-Pattern Search. Put the hashes of multiple patterns into a set, then scan the text once. Each window check is O(1) against the set. You pay O(m) for verification only on matches.

Problems involving sliding windows and string matching are the obvious home for rolling hash, but the binary-search-on-length pattern, paired with rolling hash to validate each candidate length, is a genuinely powerful combination that shows up on harder problems.

For a deeper walkthrough of the Rabin-Karp algorithm built around this idea, Rolling Hash Algorithm: The O(1) Window Trick for String Search goes into the full implementation with edge cases.


Querying Any Substring in O(1)

Some problems need to compare arbitrary substrings, not just a fixed-length sliding window. The solution is a prefix hash array computed once upfront.

def build_prefix_hash(s: str, base: int = 131, mod: int = 10**9 + 7) -> tuple: n = len(s) prefix = [0] * (n + 1) power = [1] * (n + 1) for i in range(n): prefix[i + 1] = (prefix[i] * base + ord(s[i])) % mod power[i + 1] = power[i] * base % mod return prefix, power def get_hash(prefix, power, l, r, mod: int = 10**9 + 7) -> int: return (prefix[r] - prefix[l] * power[r - l]) % mod

get_hash(prefix, power, l, r) returns the hash of any substring in O(1) after O(n) preprocessing. Two substrings compare in O(1) instead of O(m). This is the technique behind longest-common-prefix queries on suffix arrays and anywhere you need rapid substring equality checks across a large string.


Where Rolling Hash Fits in Your Prep

Rolling hash sits between the basic string patterns (two pointers, sliding window) and the specialist algorithms (suffix arrays, Aho-Corasick). You probably won't see it on a phone screen. It appears with real frequency on harder rounds at companies that test algorithmic depth.

The signal to reach for it: you need to compare or identify substrings efficiently, the naive approach involves hashing or comparing O(n) substrings of non-trivial length, and the total budget is O(n) or O(n log n).

Practicing this under timed conditions, with someone pushing back on your collision handling, is where voice-based mock interviews help. SpaceComplexity runs DSA mock interviews in exactly that format, with rubric-based feedback on both your solution and your explanation.


In Short

  • A rolling hash computes the next window's hash from the current one in O(1) by removing the outgoing character and adding the incoming one.
  • The polynomial hash formula is h = sum(s[i] * b^(m-1-i)) mod p, and the roll formula is new_h = (old_h - s[i] * b^(m-1)) * b + s[i+m].
  • Collisions don't produce wrong answers, only extra work. Verify with a real character comparison on every hash match.
  • Double hashing (two independent hashes) reduces collision probability to near zero in practice.
  • Core use cases: Rabin-Karp, repeated substring detection, longest duplicate substring via binary search, multi-pattern search.
  • Time: O(n + m) for a single pattern. Space: O(1) for the sliding version, O(n) for the prefix array version.

Further Reading