The Z-Algorithm: Build One Array, Find Every Prefix Match in O(n)

You have a 1 MB log file and a 100-character search pattern. The naive approach burns 100 million comparisons in the worst case. The Z-algorithm uses n + m. Total. The trick: when you've already matched a long prefix somewhere earlier in the string, you don't throw that result away. You index it and reuse it.

The core idea is caching. For a string s of length n, the Z-array stores at each index i the length of the longest substring starting at position i that is also a prefix of s. Once you have that array, finding every occurrence of a pattern in a text takes a single pass.

Reach for the Z-algorithm when a problem reduces to prefix matching: finding all occurrences of a pattern in text, detecting repeating structure in a string, or asking how much of the beginning of a string you can see starting from each position.

What the Z-Array Actually Tells You

For a string s, z[i] answers: "how long a match do you get if you line up position i of s with the very start of s and compare forward?"

Take s = "aabcaab". The Z-array is:

s:    a  a  b  c  a  a  b
i:    0  1  2  3  4  5  6
z:    0  1  0  0  3  1  0

Position 4 has z[4] = 3 because s[4..6] = "aab" matches s[0..2] = "aab". Position 5 has z[5] = 1 because s[5] = "a" matches s[0] = "a" but s[6] = "b" ≠ s[1] = "a". Position 1 has z[1] = 1 for the same reason.

By convention, z[0] = 0. The full string trivially matches itself from position 0, but that answer is useless in practice, and setting z[0] = n would break the loop's inheritance logic.

Z-array for "aabcaab": each cell shows its z-value as a vertical bar. Position 4 has the tallest bar at height 3, corresponding to the 3-character prefix match "aab".

Z-values plotted as bars. The taller the bar, the longer the prefix match at that position.

The Z-Box: How You Avoid Redundant Work

The naive approach computes each z[i] from scratch in a while loop. That loop can run up to n iterations per position, giving O(n²) overall. The Z-algorithm stays O(n) by maintaining a window called the Z-box.

At any point during the scan, the Z-box [l, r) is the rightmost window we've already matched: the substring s[l..r-1] equals s[0..r-l-1]. Think of it as the furthest-right prefix match we know about.

When you arrive at position i, one of two things is true.

Outside the box (i >= r). No previously computed information helps. Run the naive comparison, extending until a mismatch. If you match anything at all, this becomes the new Z-box.

Inside the box (i < r). Here's the key. Since s[l..r-1] = s[0..r-l-1], position i inside the box corresponds to position k = i - l inside the prefix. You already know z[k]. That means z[i] is at least min(z[k], r - i).

Two sub-cases:

z[k] < r - i: The previous match from position k ended strictly inside the box. The same mismatch reason applies at position i (shifted by l). So z[i] = z[k], exactly. No while loop needed.
z[k] >= r - i: The previous match from position k reached or passed the box boundary. You know z[i] >= r - i, but you need to extend past r to find the exact length. Run the while loop starting from offset r - i.

In the second sub-case every iteration of the while loop advances r rightward. Since r is bounded by n and never decreases, the while loop runs at most n iterations total across the entire algorithm.

Z-box alignment diagram: two copies of the string, offset by l, showing how position i in the original maps to position k=i-l in the prefix copy. A dashed line connects them, and an annotation shows z[5] inherits from z[1].

The key insight: because the Z-box is a perfect copy of the prefix, position i inside it mirrors position k in the prefix. z[i] gets z[k] for free.

Three cases at position i: Case 1 (i outside box) does naive comparison; Case 2 (z[k] fits inside box) reads z[k] in O(1); Case 3 (z[k] hits boundary) extends beyond r, advancing r each time.

Only Case 3 does new work, and r can advance at most n times total. The amortized cost is O(1) per position.

Run It by Hand

s = "aabcaab", with the Z-box [l, r) updated at each step:

i=1: outside (r=0). Compare s[0..] vs s[1..]: 'a'='a' ✓, 'b'≠'a' ✗. z[1]=1. Box: [1, 2).
i=2: outside (i≥r). 'b'≠'a'. z[2]=0. Box unchanged.
i=3: outside. 'c'≠'a'. z[3]=0. Box unchanged.
i=4: outside. 'a'='a'✓ 'a'='a'✓ 'b'='b'✓ OOB. z[4]=3. Box: [4, 7).
i=5: inside (5<7). k=5-4=1. z[1]=1. r-i=2. 1<2 → z[5]=1, no loop.
i=6: inside (6<7). k=6-4=2. z[2]=0. r-i=1. 0<1 → z[6]=0, no loop.

Result: z = [0, 1, 0, 0, 3, 1, 0].

The critical moment is i=5. You could naively compare s[5..] against s[0..]. Instead, because you're inside the box built at i=4, you know position 5 corresponds to position k=1 in the prefix. z[1]=1 < r-5=2, so z[5] = 1 exactly, in O(1), no comparisons.

This is what makes the algorithm satisfying. You do the work once. You pay for it once. Every position inside a box is a freeloader riding on someone else's comparisons.

Build and Query Complexity

Operation	Time	Space	Notes
Build Z-array	O(n)	O(n)	Amortized; r advances at most n times total
Pattern match (text T, pattern P)	O(n + m)	O(n + m)	Combined string is n + m + 1 long
Smallest repeating period	O(n)	O(n)	One pass over Z-array after building it

Why O(n) holds

Every iteration of the while loop increments r by at least 1. r starts at 0, never decreases, and is bounded by n. So the while loop can run at most n times total across all n-1 iterations of the outer for loop. The outer loop contributes O(1) work per iteration outside the while loop. Total: O(n).

The amortized argument is standard: assign each "r advancement" a charge of 1. The while loop can only run as many times as r advances. r advances at most n times. Done.

Why O(n + m) for pattern matching

You build a combined string of length n + m + 1 (pattern + separator + text) and run the O(n) Z-function on it. The preprocessing is free: the separator trick is not a separate pass, it happens inside the single Z-function call.

One Structure, Every Language

All implementations below export two functions: zFunction(s) returning the Z-array (with z[0] = 0 by convention), and findPattern(text, pattern) returning 0-indexed match positions in text.

def z_function(s: str) -> list[int]:
    n = len(s)
    z = [0] * n
    l, r = 0, 0
    for i in range(1, n):
        if i < r:
            z[i] = min(r - i, z[i - l])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > r:
            l, r = i, i + z[i]
    return z

def find_pattern(text: str, pattern: str) -> list[int]:
    combined = pattern + "$" + text
    z = z_function(combined)
    m = len(pattern)
    return [i - m - 1 for i in range(m + 1, len(combined)) if z[i] == m]

What You Can Build With This

The Z-array encodes one thing at every position: how far a match with the beginning of s extends from there. That turns out to be enough to solve a surprising number of problems. Algorithms people love a good "one weird trick" moment, and the separator trick is exactly that.

Exact pattern matching. Find every occurrence of pattern P (length m) in text T (length n) in O(n + m) time and space. The separator trick prevents false positives across the boundary.

String period detection. A string s of length n has period k if repeating s[0..k-1] reconstructs s. The Z-algorithm finds the smallest such k in O(n): it's the smallest i where i + z[i] == n and i divides n.

LeetCode 2223: Sum of Scores of Built Strings. For each index i, the score is the length of the longest common prefix between the suffix s[i..n-1] and s itself. That's exactly z[i], with the score at position 0 equal to n. The answer is n + sum(z[1..n-1]).

Rotate String (LeetCode 796). Checking if goal is a rotation of s is equivalent to checking if goal appears in s + s. Form goal + "$" + s + s and check if any Z-value equals len(goal).

Repeated Substring Pattern (LeetCode 459). A string is constructed by repeating a shorter substring if and only if it has a period k where k < n and k divides n. One pass over the Z-array finds it.

When to Reach for the Z-Algorithm

The Z-algorithm is the right tool when you see these signals in a problem:

"Find all occurrences of P in T" and the naive O(nm) approach is too slow.
"Is string A a rotation of string B?" (rotation means A appears in B + B).
"What is the shortest string whose repetition equals s?" (period detection).
"Count how many positions in s share a prefix of length exactly k with s" (scan Z-array for z[i] >= k).
The problem involves comparing prefixes of a string against every suffix of the same string.

The unifying shape is any question of the form "how does each position of a string relate to the beginning of that same string?"

Worked Example 1: Pattern Matching in `"aabxaabxaa"`

Problem. Find all positions where "aab" occurs in "aabxaabxaa".

Why Z fits. A match at position j in text means text[j..j+m-1] = pattern. That is exactly the condition z[m + 1 + j] = m on the combined string. One array, all matches.

Walk-through. Form combined = "aab$aabxaabxaa" (n=14, m=3).

combined: a  a  b  $  a  a  b  x  a  a  b  x  a  a
index:    0  1  2  3  4  5  6  7  8  9 10 11 12 13
z:        0  1  0  0  3  1  0  0  3  1  0  0  2  1

Positions where z[i] = m = 3: i=4 and i=8.

i=4: text position = 4 - 3 - 1 = 0. Match at index 0.
i=8: text position = 8 - 3 - 1 = 4. Match at index 4.

Verification. "aabxaabxaa" contains "aab" at positions 0 and 4. Correct.

At i=5, the Z-box from i=4 kicks in: k = 5 - 4 = 1, z[1] = 1, r - 5 = 2. Since 1 < 2, z[5] = 1 exactly, no comparisons. Same at i=6: z[6] = 0 from the box, no comparisons. The box saved four character comparisons in this small example. On real-world text it saves millions.

Combined string "aab$aabxaabxaa" with Z-values annotated. Bars of height 3 at positions 4 and 8 are highlighted in amber. Positions 5 and 6 show dashed arrows pointing back to the prefix, indicating Z-box inheritance.

Bars at i=4 and i=8 reach height 3 = m, flagging both matches. The Z-box fills i=5 and i=6 for free.

Worked Example 2: Detecting the Smallest Repeating Period

Problem. Given "abcabcabc", find the shortest string whose repetition builds it.

Why Z fits. If s has period k, then s[k..n-1] is a prefix of s. So z[k] = n - k, which means k + z[k] = n. Find the smallest such k that also divides n.

Walk-through. s = "abcabcabc", n = 9.

s: a  b  c  a  b  c  a  b  c
i: 0  1  2  3  4  5  6  7  8
z: 0  0  0  6  0  0  3  0  0

Check each i from 1 to n-1:

i=1: 1 + z[1] = 1 + 0 = 1 ≠ 9.
i=2: 2 + 0 = 2 ≠ 9.
i=3: 3 + z[3] = 3 + 6 = 9 == 9. Does 3 divide 9? Yes. Period found: "abc".

The Z-value z[3] = 6 comes efficiently from the algorithm: at i=3, outside the box, it compares s[3..8] = "abcabc" against s[0..5] = "abcabc". Six matches, then out of bounds. This sets the box to [3, 9). All subsequent positions (i=4 through i=8) are filled from the box in O(1) each.

Period detection for "abcabcabc": Z-values annotated on each cell, a span bracket shows z[3]=6 characters match the prefix, and the equation 3+6=9=n confirms period 3. Three colored "abc" blocks at the bottom show the repeating unit.

z[3] = 6 is the smoking gun. The span reaches the end of the string, and 3 divides 9, so "abc" is the generator.

The Parts That Bite You

The separator must be absent from both strings. The '$' convention works for lowercase-letter problems. For arbitrary binary strings or Unicode text, pick a sentinel that cannot appear. If the separator appears in the text, z[i] can exceed m and you would need to check z[i] == m rather than z[i] >= m anyway, but the whole guarantee depends on the separator blocking the match.

z[0] = 0 is a deliberate choice, not a limitation. Technically the entire string matches the prefix starting at position 0. But if you set z[0] = n, the Z-box initialization during the loop would read z[i - l] with l = 0, picking up z[0] = n as a previous value and giving wrong results. Leaving z[0] = 0 keeps the loop correct from i=1 onward. For problems like LeetCode 2223 that need the "score at position 0," you add n manually: answer = n + sum(z).

All-same strings are O(n), not O(n²). For s = "aaaa...a", the naive algorithm would choke. But the Z-box does not. At i=1 you run the while loop once per advancing character until you hit the end of the string at z[1] = n-1. The box is now [1, n). At i=2, z[i-l] = z[1] = n-1, r-i = n-2. Since n-1 >= n-2, you try one more character (which goes out of bounds immediately). Every subsequent position is filled from the box in O(1). Total while-loop iterations across the whole string: O(n).

KMP vs Z-algorithm. Both are O(n + m). KMP's failure function tells you how to shift the pattern on a mismatch. The Z-array tells you the length of the longest prefix match at every position. For single-pattern matching, they are equivalent. The Z-algorithm is often preferred in competitive programming because the core invariant (one sliding window) is easier to reason about and implement correctly without the two-phase KMP construction. For problems that are naturally "prefix match everywhere" rather than "skip and resume," Z is cleaner.

Interestingly, Z and KMP encode the same information in different shapes. If you deeply understand one, you can derive the other. Most people learn one and use it for everything. That's fine.

Where to Go From Here

The Z-algorithm sits within a wider family of linear-time string preprocessing structures. The suffix array answers more general substring queries in O(n log n) preprocessing and O(m + log n) search. The trie handles prefix queries across a set of strings rather than a single one. For the closest cousin, the KMP failure function is covered in most algorithm textbooks as an equivalent alternative to Z for single-pattern matching.

If you want to practice narrating algorithms like this under time pressure, SpaceComplexity runs voice-based mock interviews with rubric feedback. Knowing the O(n) proof matters. Being able to talk through it while coding is a different skill entirely.

Quick Recap

z[i] = length of longest substring starting at i that is also a prefix of s.
Maintain [l, r), the rightmost prefix match, to reuse previously computed values.
Outside the box: naive comparison, update box. Inside the box: read from mirror position, extend only if at boundary.
r is monotonically non-decreasing. Total while-loop iterations across all i is at most n. Algorithm is O(n).
Pattern matching: form pattern + "$" + text, build Z-array, report positions where z[i] == m.
Period detection: smallest i where i + z[i] == n and i | n.
The separator character must be absent from both input strings.
z[0] = 0 by convention. Add n separately when summing prefix match counts.