KMP vs Rabin-Karp: Same O(n+m), Different Bets

May 28, 202611 min read
dsaalgorithmsinterview-prepdata-structures
KMP vs Rabin-Karp: Same O(n+m), Different Bets
TL;DR
  • KMP guarantees O(n+m) worst-case; Rabin-Karp degrades to O(nm) under adversarial hash collisions, making KMP the only safe choice for security-sensitive or competitive programming contexts.
  • For multiple patterns, Rabin-Karp wins: hash all patterns into a set upfront, then one pass over the text finds all of them with no extra cost per added pattern.
  • Small alphabets (DNA, binary) favor KMP: frequent partial matches trigger more LPS jumps that skip useless alignments, a structural advantage Rabin-Karp cannot exploit.
  • KMP overlapping-match bug: after a full match set j = lps[m-1], not j = 0, or overlapping occurrences are silently dropped.
  • Rabin-Karp verification bug: always character-compare after a hash match; skipping it turns a probabilistic algorithm into a wrong one, not just a slow one.
  • Neither str.find nor String.indexOf uses either algorithm; production runtimes use Boyer-Moore-Horspool variants for better average-case performance on large alphabets.

KMP and Rabin-Karp both find a pattern in a text in O(n+m). Both improve on naive O(nm) search. If that's all you know, you'll pick whichever you feel like, and most of the time you'll be fine. Occasionally you'll pick the wrong one and not know why your code is slow, or in one specific case, wrong.

The difference matters in three scenarios: adversarial input, multi-pattern search, and small alphabets like DNA or binary strings. Each scenario has a clear winner.


Naive String Matching and Why It Fails

The obvious approach is O(nm): slide the pattern one position at a time, compare up to m characters at each stop.

def naive_search(text, pattern): n, m = len(text), len(pattern) for i in range(n - m + 1): if text[i:i+m] == pattern: yield i

For random text this is often fast in practice. For a pattern like "AAAB" in "AAAAAAAAAAAAB", every position forces nearly a full match before failing. You match three characters, fail on the fourth, advance by one, repeat. All that prefix work is thrown away. It's like reading a book, getting to page 400, deciding that's not the right book, and starting over from page 1.

KMP saves that work by restructuring how the pattern pointer retreats. Rabin-Karp avoids most of those comparisons entirely by comparing fingerprints instead of characters.

Three-column diagram comparing naive backtracking, KMP skip via LPS, and Rabin-Karp hash windows for the same text and pattern Naive backtracks and re-reads. KMP jumps via the LPS table. Rabin-Karp skips positions without reading them at all.


KMP: The Text Pointer Never Goes Backward

Knuth, Morris, and Pratt published the algorithm in 1977 in the SIAM Journal on Computing. The core observation: when you mismatch at position j in the pattern, the characters you already matched encode information about where to try next. You don't have to start over.

The failure function (commonly called the LPS array) captures this. For each position i in the pattern, lps[i] is the length of the longest proper prefix of pattern[0..i] that is also a suffix.

Pattern:  A  B  A  B  C
Index:    0  1  2  3  4
lps:      0  0  1  2  0

"ABAB" ends with "AB" which also starts it, so lps[3] = 2. After matching four characters and failing on the fifth, jump the pattern pointer to 2, not 0. The text pointer stays where it is.

def build_lps(pattern): m = len(pattern) lps = [0] * m length = 0 i = 1 while i < m: if pattern[i] == pattern[length]: length += 1 lps[i] = length i += 1 elif length != 0: length = lps[length - 1] # fall back without advancing i else: lps[i] = 0 i += 1 return lps def kmp_search(text, pattern): n, m = len(text), len(pattern) lps = build_lps(pattern) i = j = 0 while i < n: if text[i] == pattern[j]: i += 1 j += 1 if j == m: yield i - j j = lps[j - 1] # NOT 0: handles overlapping matches elif i < n and text[i] != pattern[j]: if j != 0: j = lps[j - 1] else: i += 1

The key invariant: i never decrements. The text pointer only moves forward. j can fall back via lps, but j can only decrease as much as it previously increased. Total j-movement is bounded by n. That gives O(n) for the search pass plus O(m) for building the LPS table.

One bug worth burning into memory: after a full match (j == m), set j = lps[m-1], not j = 0. Resetting to zero misses overlapping occurrences. The pattern "ABA" in "ABABA" appears at both index 0 and index 2. The fallback to lps[2] = 1 is what finds the second one.

KMP tracing pattern "ABABC" in text "ABABABABC" showing the text pointer holding still while j jumps via LPS on consecutive mismatches i marches right. j bounces around via the LPS table. The text pointer never looks back, which is the entire point.


Rabin-Karp: Hash First, Verify Later

Karp and Rabin published their algorithm in 1987 in the IBM Journal of Research and Development. The approach is almost the opposite of KMP: don't analyze the pattern's internal structure at all. Fingerprint it, then compare fingerprints as you slide a window across the text.

A naive hash-per-window would recompute from scratch each time, taking O(m) per position. The rolling hash eliminates this cost: given the hash of text[i..i+m-1], compute text[i+1..i+m] in O(1) by subtracting the outgoing character and adding the incoming one. It's like a checkout conveyor belt where you push one item off the back and add one item to the front, instead of rescanning the whole belt every time.

def rabin_karp_search(text, pattern): n, m = len(text), len(pattern) if m > n: return d, q = 256, 1_000_000_007 h_power = pow(d, m - 1, q) # d^(m-1) mod q: weight of the leading char ph = th = 0 for i in range(m): ph = (d * ph + ord(pattern[i])) % q th = (d * th + ord(text[i])) % q for i in range(n - m + 1): if ph == th and text[i:i+m] == pattern: # always verify yield i if i < n - m: th = (d * (th - ord(text[i]) * h_power) + ord(text[i + m])) % q th = (th + q) % q # subtraction can go negative

When hashes match but characters don't, that's a spurious hit: a ghost. Two completely different strings happened to hash to the same value. With a large prime modulus, the probability of a spurious hit per window is roughly 1/q. At q = 10^9+7 over a million windows, you'd expect fewer than 0.001 spurious hits total. Essentially never.

The catch: pathological or adversarial input can produce O(n) collisions, each costing O(m) to verify, degrading the algorithm to O(nm). This is why Python's str.find and Java's String.indexOf both use Boyer-Moore-Horspool variants instead. For adversarial inputs, either use a randomized hash seed, use double hashing, or switch to KMP.

Rolling hash window sliding across text "ABCABDABC" for pattern "ABC", annotating the drop of the outgoing character and addition of the incoming character at each step Each step drops the leftmost character from the hash and adds the rightmost one. O(1) per slide. The guard (th + q) % q prevents negative values in every language except Python.


The KMP vs Rabin-Karp Guarantee Gap

KMPRabin-Karp
Worst-case timeO(n+m)O(nm)
Average-case timeO(n+m)O(n+m)
SpaceO(m)O(1)
CorrectnessDeterministicProbabilistic
Multi-patternNo (use Aho-Corasick)Yes

KMP gives you a hard O(n+m) with no asterisks. Rabin-Karp gives you O(n+m) on typical inputs, but O(nm) is a real possibility under adversarial conditions.

For most interview problems, both work. For competitive programming with adversarial test cases, or security-sensitive applications, that difference is the whole question. Picking Rabin-Karp for security-sensitive code because "the expected case is fine" is the same logic as not wearing a seatbelt because you usually don't crash.


Rabin-Karp's Real Strength: Multiple Patterns at Once

For k patterns simultaneously, KMP requires k separate scans or the substantially more complex Aho-Corasick automaton. Rabin-Karp handles it by hashing all patterns into a set upfront:

def rabin_karp_multi(text, patterns): if not patterns: return m = len(patterns[0]) # assumes equal-length patterns d, q = 256, 1_000_000_007 h_power = pow(d, m - 1, q) pattern_hashes = {} for p in patterns: ph = 0 for c in p: ph = (d * ph + ord(c)) % q pattern_hashes[ph] = p th = 0 for c in text[:m]: th = (d * th + ord(c)) % q for i in range(len(text) - m + 1): if th in pattern_hashes and text[i:i+m] == pattern_hashes[th]: yield i, pattern_hashes[th] if i < len(text) - m: th = (d * (th - ord(text[i]) * h_power) + ord(text[i + m])) % q th = (th + q) % q

One pass over the text finds all k patterns. The hash set lookup is O(1), so adding more patterns barely changes the running time. Plagiarism detection, code similarity tools, multi-keyword document search: these are Rabin-Karp's natural domain.

Rabin-Karp also powers a useful interview trick: the binary-search approach to the longest duplicate substring. Binary search on candidate length L, and for each L use Rabin-Karp to check in O(n) whether any duplicate of that length exists. Total time is O(n log n). There's no clean equivalent using KMP.


Small Alphabets Tilt the Scales Toward KMP

Counterintuitive, but it follows from how the LPS table works. KMP benefits from small alphabets. On DNA (four characters: A, C, G, T) or binary strings, partial matches are frequent. The text contains lots of "AC" and "ACA" prefixes that partly match your pattern before failing, and each of those fires a KMP fallback that skips a useless alignment. The smaller the alphabet, the more the LPS jumps fire, and the more wasted comparisons get avoided.

Rabin-Karp has no such property. The rolling hash computes a fingerprint for every window regardless of how much of the pattern it resembles. Small alphabet, large alphabet: every window costs O(1). The algorithm doesn't exploit partial structure at all.

On large alphabets (natural language text, Unicode), most windows differ from the pattern in the first character or two, and the hash check dismisses them instantly. The difference between the algorithms shrinks. But in bioinformatics, where you're scanning a four-character genome for a motif, KMP's internal structure pays off more.


Two Bugs to Burn Into Memory

Both bugs produce silently wrong answers, not crashes. That's what makes them dangerous. Silent wrong answers are the worst category of bug. Your tests pass, your code ships, and somewhere downstream data is quietly wrong. No stack trace. No error message. Just vibes.

KMP: resetting j to 0 after a full match. When j reaches m, you must set j = lps[m-1], not j = 0. The LPS value tells you how much of the pattern re-aligns with what you just matched. Zero loses overlapping occurrences.

Rabin-Karp: skipping the character verification. If hashes match, you still do text[i:i+m] == pattern. Skipping that step turns a probabilistic algorithm into a wrong one. The verification is not optional. You're not doing extra work out of paranoia. You're doing it because hashes lie sometimes.

The second Rabin-Karp trap: the subtraction in the rolling hash update can produce a negative intermediate value in most languages. Always add q before the final % q: (th + q) % q. Python's arbitrary-precision integers mask this; every other language does not.


The Decision in Two Questions

Ask them in order.

One pattern or many? Searching for multiple patterns simultaneously, Rabin-Karp with a hash set is the natural choice. Single pattern, continue.

Do you need a worst-case guarantee? Input you control or known random: Rabin-Karp's expected O(n+m) is fine. Adversarial input, security context, competitive programming with stress tests: use KMP.

If correctness must hold regardless of input, KMP is the only safe choice. Tiebreaker for single-pattern, non-adversarial: small alphabet (DNA, binary) leans toward KMP because partial matches are common. Large alphabet (ASCII text, Unicode) makes either equally good.

For complete coverage of each algorithm's internals, the KMP reference walks through every case in the failure function build, and the Rabin-Karp reference covers double hashing and the Mersenne prime option. When you need multi-pattern at scale across variable-length patterns, Aho-Corasick is the full generalization.


SpaceComplexity runs voice-based DSA mock interviews where you explain algorithm tradeoffs like this one out loud and get rubric-based feedback on your reasoning, not just whether you got the right answer.


Recap

  • Naive: O(nm). Slides and compares from scratch at each position.
  • KMP: O(n+m) guaranteed. Builds the LPS table in O(m), scans in O(n) with the text pointer never retreating. Best for single pattern, small alphabet, deterministic guarantee. Classic bug: set j = lps[m-1] after a full match, not j = 0.
  • Rabin-Karp: O(n+m) average, O(nm) worst case. Rolling hash fingerprints each window in O(1). Best for multi-pattern search, plagiarism detection, longest duplicate substring via binary search. Classic bugs: always verify after a hash match; guard against negative values with (th + q) % q.
  • KMP benefits from small alphabets because partial matches trigger more LPS jumps and save more comparisons.
  • Rabin-Karp dominates multi-pattern because adding more patterns to a hash set costs nearly nothing.
  • Neither Python's str.find nor Java's String.indexOf uses KMP or Rabin-Karp in production. They use Boyer-Moore-Horspool variants.

Further Reading