Suffix Automaton: Build It Once, Answer Every Substring Query

May 27, 202633 min read
dsaalgorithmsinterview-prepdata-structures
Suffix Automaton: Build It Once, Answer Every Substring Query
TL;DR
  • Suffix automaton is the smallest DFA accepting all suffixes of a string, implicitly indexing every substring in O(n) time and O(n) space
  • Endpos equivalence classes group substrings that end at identical positions; Lemma 2 proves sets nest or stay disjoint, forming the suffix link tree
  • The clone operation in extend() keeps every state spanning consecutive length ranges, which limits the structure to at most 2n-1 states and 3n-4 transitions
  • Count distinct substrings in O(n) with the formula Σ(len[v] - len[link[v]]) over all non-initial states
  • Longest common substring runs in O(|S|+|T|): build the SAM for S, stream T with suffix link fallback, track maximum match length
  • Clones get cnt=0, not cnt=1; propagating counts in anything other than decreasing-len order silently corrupts every occurrence count

You have a string. You need to know how many times a pattern appears in it. And how many distinct substrings it has. And the lexicographically k-th one. And the longest substring it shares with another string.

Most people reach for four different tools. The suffix automaton is one structure that does all four. Build it once in O(n), query it in O(pattern length). No other single structure matches that combination. It is, frankly, a little embarrassing for the competition.

It was called the DAWG (Directed Acyclic Word Graph) when Blumer, Blumer, Ehrenfeucht, Haussler, Chen, and Seiferas published "The Smallest Automaton Recognizing the Subwords of a Text" in 1985 in Theoretical Computer Science. Six authors. The name stuck less than the structure. Competitive programmers eventually won the naming war and called it the SAM, which is both shorter and sounds less like a 1980s video game peripheral.

The one-sentence mental model: the suffix automaton is the smallest deterministic finite automaton that accepts every suffix of a string, and it implicitly indexes every substring.


What Lives Inside Each State

The automaton is an acyclic directed graph. Nodes are states. Edges are labeled with characters. Starting from the initial state and following edges, every path you can trace spells out a substring of the original string.

That part is unsurprising. Tries do the same thing. What keeps the SAM small is the interesting bit, and it comes down to one insight: most substrings are basically the same substring wearing different lengths.

The key idea is endpos equivalence. For any substring t, define endpos(t) as the set of positions in s where t ends. In the string abbc (0-indexed), endpos("b") = {1, 2} and endpos("ab") = {1}.

Two substrings are endpos-equivalent when they always end at exactly the same positions. In abbc, the substrings c, bc, bbc, and abbc all end only at position 3. So endpos("c") = endpos("bc") = endpos("bbc") = endpos("abbc") = {3}. They all belong to the same equivalence class. Four strings, one state.

Endpos sets for "abbc": substrings that share the same ending positions collapse into one state

Four substrings, one state. That is where the O(n) bound comes from.

Three lemmas nail down the structure. You do not need to memorize proofs. You do need to believe them, because the construction only makes sense if you accept what they say:

Lemma 1. Two substrings u and w are endpos-equivalent if and only if u occurs in s only as a suffix of w (or vice versa).

Lemma 2. For any two substrings, their endpos sets are either disjoint or one is a proper subset of the other. They can never partially overlap.

Lemma 3. Within a single state, the substrings form a consecutive length sequence. If the longest substring in state v has length len(v), and the longest substring of state link(v) has length len(link(v)), then state v contains exactly one substring of each length from len(link(v)) + 1 up through len(v).

Lemma 2 is the one that makes everything work. Because endpos sets only nest or stay disjoint, they form a tree. That tree of endpos containment is exactly the tree of suffix links.


The Suffix Link Tree

Each state (except the initial state) has a suffix link pointing to the parent state whose endpos set strictly contains it.

More concretely: if state v has longest element w, then link(v) points to the state whose longest element is the shortest proper suffix of w that appears at strictly more positions.

In abbc:

  • State 5 represents {"c", "bc", "bbc", "abbc"} with endpos = {3}

  • Its suffix link points to state 0 (the initial state), whose endpos is all positions

  • State 4 represents {"b"} with endpos = {1, 2}

  • Its suffix link points to state 0

  • States 2 and 3 both link to state 4, because "b" is their parent in the containment tree

Suffix link tree for "abbc": parent endpos always strictly contains child endpos

The suffix links form a proper tree. The root has endpos = all positions. Every other state's endpos is a strict subset of its parent.

Parent always has a strictly larger endpos set than child. The root (initial state) contains the empty string and has every position in its endpos set.


Memory Layout

Each state stores three things:

struct State {
    int len;        // length of longest substring in this class
    int link;       // suffix link: index of parent state (-1 for root)
    map<char, int> next;  // transitions: character → next state
}

For a string of length n, you allocate at most 2n - 1 states. In practice, use a fixed array of 2n states to avoid reallocation. The initial state is index 0.

The number of transitions is at most 3n - 4. The proof splits transitions into two types. A transition (p → q via c) is continuous when len(p) + 1 == len(q). These form a spanning tree of the automaton's longest-path DAG structure, giving at most 2n - 2 continuous edges. A transition is non-continuous when len(p) + 1 < len(q). Each non-continuous transition corresponds to a different suffix of s, and there are at most n - 1 suffixes of length at least 2. So total transitions ≤ (2n - 2) + (n - 1) = 3n - 3, and the cases that achieve 2n - 1 states cannot simultaneously maximize transitions, giving the tighter 3n - 4.


The Construction Algorithm

The algorithm is online: feed characters one at a time, and after each extension the automaton correctly represents all substrings of the current prefix. You do not need the full string upfront. That sounds like a footnote until you realize the entire O(n) proof depends on it.

The extend(c) function:

  1. Create a new state cur with len[cur] = len[last] + 1.
  2. Walk backward from last via suffix links, adding a c-transition to cur from every state that lacks one.
  3. When you hit a state p that already has a c-transition (to some state q):
    • If len[p] + 1 == len[q], the existing transition is already continuous. Set link[cur] = q.
    • If len[p] + 1 < len[q], state q spans a range of lengths that needs to be split. Clone q: create clone with len[clone] = len[p] + 1, copy q's transitions and link into clone, then redirect all transitions that were pointing to q back up the suffix link chain to point to clone instead. Set link[q] = link[cur] = clone.
  4. If you walk all the way to the root without finding a c-transition, set link[cur] = 0.
  5. Update last = cur.

The clone operation is what maintains minimality. Without it, states would represent non-contiguous length ranges, violating Lemma 3.

Why It Is O(n)

len[last] increases by exactly 1 with each call to extend. The suffix link chain in step 2 walks through states with strictly decreasing len values. The total number of steps across all those chain walks is bounded by the total increase of len[last], which is n. Each clone operation costs O(1) and happens at most once per character. So the total work is O(n) with a fixed alphabet, or O(n log |Σ|) if transitions are sorted maps.


Construction Walk-Through: "abbc"

The critical step is extending with the second b. This is where the clone fires. It is also where most first implementations produce a subtly wrong occurrence counter and spend two hours staring at a test case before realizing it.

After "ab", the automaton has states 0 (init), 1 ("a", len=1), 2 ("b"/"ab", len=2). The suffix link tree has 0 at the root with 1 and 2 as children.

Now extend with b:

  • Create cur=3, len=3.
  • Walk from last=2: state 2 has no b transition. Add 2→3. Walk to link[2]=0.
  • State 0 already has a b transition to q=2.
  • Check: len[0]+1 = 1, len[2] = 2. Not continuous. Clone.
  • clone=4, len[4]=1. Copy transitions: state 2's transitions at this moment include {b:3}. Copy to clone.
  • Clone's link = link[2] = 0.
  • Walk back redirecting: state 0's b transition was pointing to 2, redirect to 4.
  • Set link[2] = 4, link[cur=3] = 4.

Clone operation: before and after extending "ab" with a second 'b'

State 4 (the clone) splits off "b" from "ab". Before the clone, state 2 was carrying both. After, each has its own state with its own endpos set.

State 4 (the clone) now represents "b" alone with endpos={1,2}. State 2 represents "ab" with endpos={1}. State 3 represents {"bb","abb"} with endpos={2}.


Core Operations

OperationTimeSpaceNotes
ConstructionO(n·|Σ|)O(n·|Σ|)O(n) with hash maps or fixed alphabet
Pattern searchO(m)O(1)Walk transitions; suffix link fallback if stuck
Count occurrencesO(n + m)O(n)Precompute cnt via topological sort on suffix link tree
Count all occurrencesO(n + m + k)O(n)k = number of match positions
Count distinct substringsO(n)O(1)Formula: Σ(len[v] − len[link[v]])
Longest common substringO(n + m)O(n)Build SAM for S, stream T with fallback
k-th lexicographic substringO(k·|Σ|)O(n)DP path counts, greedy descent
Shortest absent substringO(n)O(n)BFS/DP for missing transitions

Construction is O(n·|Σ|) because each state stores up to |Σ| transitions. With a hash map, each transition is O(1) average and the total is O(n). With a sorted map, it is O(n log |Σ|). For competitive programming with small alphabets (26 letters), a fixed int next[26] array initialized to -1 gives true O(n).

Why pattern search is O(m). Follow transitions character by character from the initial state. If no transition exists for the next character, the pattern does not appear in s. The automaton is deterministic, so there is no backtracking. This is the part that makes the SAM strictly better than KMP for multiple patterns against the same text: KMP re-processes the text for each pattern. The SAM processes each pattern in O(pattern length) with zero extra preprocessing.

Why counting occurrences requires O(n) preprocessing. The automaton itself does not store how many times each substring appears. You propagate counts up the suffix link tree. Mark non-cloned states with cnt = 1. Then process states in decreasing order of len (reverse topological order of the suffix link tree): cnt[link[v]] += cnt[v]. After this, cnt[v] equals the size of endpos(v) = the number of times the longest substring in state v appears in s.

Why distinct substrings is O(1) after build. Each state v contributes exactly len[v] - len[link[v]] distinct substrings: these are all the substrings of lengths from len[link[v]] + 1 up to len[v] that end at exactly the positions in endpos(v). Sum over all non-initial states.

Why the longest common substring algorithm is O(n + m). Build the SAM for S in O(n). Then stream T character by character, maintaining current state v and match length l. At each character c: if v has a c-transition, follow it and increment l. If not, follow suffix links until you find one or reach the root. Track the maximum l seen.

LCS algorithm: stream T through SAM(S) one character at a time, following suffix links on mismatch

Every time you hit a dead end, the suffix link bails you out. You slide to a shorter match that still appears in S, then try again.


One Structure, Every Language

class State: __slots__ = ("len", "link", "next", "cnt") def __init__(self): self.len = 0 self.link = -1 self.next: dict[str, int] = {} self.cnt = 0 class SuffixAutomaton: def __init__(self): self.st = [State()] self.last = 0 def extend(self, c: str) -> None: if c in self.st[self.last].next: q = self.st[self.last].next[c] if self.st[self.last].len + 1 == self.st[q].len: self.last = q return cur = len(self.st) self.st.append(State()) self.st[cur].len = self.st[self.last].len + 1 self.st[cur].cnt = 1 p = self.last while p != -1 and c not in self.st[p].next: self.st[p].next[c] = cur p = self.st[p].link if p == -1: self.st[cur].link = 0 else: q = self.st[p].next[c] if self.st[p].len + 1 == self.st[q].len: self.st[cur].link = q else: clone = len(self.st) self.st.append(State()) self.st[clone].len = self.st[p].len + 1 self.st[clone].next = dict(self.st[q].next) self.st[clone].link = self.st[q].link while p != -1 and self.st[p].next.get(c) == q: self.st[p].next[c] = clone p = self.st[p].link self.st[q].link = self.st[cur].link = clone self.last = cur def build_counts(self) -> None: order = sorted(range(len(self.st)), key=lambda i: -self.st[i].len) for v in order: if self.st[v].link != -1: self.st[self.st[v].link].cnt += self.st[v].cnt def count_distinct(self) -> int: return sum( self.st[i].len - self.st[self.st[i].link].len for i in range(1, len(self.st)) ) def count_occurrences(self, pattern: str) -> int: v = 0 for c in pattern: if c not in self.st[v].next: return 0 v = self.st[v].next[c] return self.st[v].cnt def longest_common_substring(s: str, t: str) -> str: sam = SuffixAutomaton() for c in s: sam.extend(c) sam.build_counts() v, l, best, best_pos = 0, 0, 0, 0 for i, c in enumerate(t): while v != 0 and c not in sam.st[v].next: v = sam.st[v].link l = sam.st[v].len if c in sam.st[v].next: v = sam.st[v].next[c] l += 1 if l > best: best, best_pos = l, i return t[best_pos - best + 1 : best_pos + 1]

What Problems It Actually Solves

You built the thing. Here is what you get for your trouble.

The suffix automaton handles any query about substrings of a fixed string s. Build it once. Then:

Substring frequency queries. After buildCounts(), countOccurrences(pattern) runs in O(|pattern|). This is strictly better than KMP for multiple patterns against the same text, because KMP re-processes the text for each new pattern. The SAM never re-processes anything.

Counting distinct substrings. The formula Σ(len[v] − len[link[v]]) gives the exact count in O(n). You do not need to generate them, enumerate them, or stuff them into a hash set that silently runs out of memory at n=10^5. The structure already knows.

Longest common substring of two strings. Build the SAM for S, then stream T character by character with suffix link fallback. Naive comparison is O(|S|·|T|). Rolling-hash binary search is O((|S|+|T|) log(|S|+|T|)) with extra constants. The SAM does it in O(|S|+|T|). No approximation, no probabilistic hash, just exact.

Lexicographic k-th substring. Compute dp[v] = number of distinct paths from v. Each path spells a distinct substring. Descend greedily, choosing the smallest character whose subtree has enough paths to contain rank k.

Shortest string not appearing in s. BFS on the automaton, tracking minimum length of paths that miss a transition. The first state where some character has no outgoing edge gives the answer.

String compression problems. The SAM can detect the smallest rotation of a string, find minimal periods, and identify the longest palindromic substring with some augmentation.


When to Use the Suffix Automaton

Five signals this problem wants a suffix automaton:

  1. A fixed string s with many queries about its substrings.
  2. You need how many times a pattern appears, not just whether it appears.
  3. You need to count distinct substrings of s.
  4. You need the longest common substring of two strings (not subsequence).
  5. n up to 10^5 or 10^6, and you need something faster than suffix array binary search.

The one signal that points away from the SAM: you need queries with online updates (inserting into the middle of s). The standard SAM only supports appending to the end.

Worked Example 1: How Many Distinct Substrings?

Problem: given s, count the number of distinct substrings. (Classic competitive programming problem. Every beginner reaches for a set. The set reaches back.)

Brute force: generate all O(n^2) substrings, stuff them into a hash set. O(n^2) time and space. Works fine for n up to a few thousand. Quietly kills your process at n=10^5.

Suffix array approach: build the SA and LCP array, answer is n(n+1)/2 - sum(LCP). O(n log n) build, O(n) formula.

SAM approach: build in O(n), then sum len[v] - len[link[v]] over all non-initial states. O(n) total. Correctness follows from Lemma 3: each state v contributes the substrings with lengths in (len[link[v]], len[v]], and those ranges partition all distinct substrings exactly.

For abbc: states 1("a"), 4("b"), 2("ab"), 3("abb","bb"), 5("c","bc","bbc","abbc") contribute 1+1+1+2+4 = 9. Enumerate them: {a, b, c, ab, bb, bc, abb, bbc, abbc}. Correct.

Worked Example 2: Longest Common Substring

Problem: given strings S and T, find the longest substring that appears in both.

Nested loops are O(|S|·|T|). Binary search on sorted suffixes is O((|S|+|T|) log(|S|+|T|)).

Build the SAM for S. Then process T left to right, maintaining two variables: current state v and current match length l. At each character c of T:

  • If v has a c-transition: follow it, increment l.
  • If v has no c-transition: keep following suffix links (each time resetting l to len[link[v]]) until you find a c-transition or hit the root. If you hit the root and still no c-transition, reset l to 0 and stay at root.

Track the maximum l seen. The LCS length is that maximum.

Why this is correct. Each state v represents all substrings of S that end at the same set of positions. When you follow v's suffix link, you land in the parent state whose substrings are slightly shorter but appear more often. The current match length l is always at most len[v], maintaining the invariant that the last l characters of T[0..i] appear as a substring of S.

Total time: O(|S|) to build + O(|T|) to stream = O(|S| + |T|).


The Non-Obvious Pitfall (Clones Do Not Count)

Cloned states do not count as occurrences. When you clone a state during extend, the clone represents substrings that existed before the new character. No new occurrence happened. Initializing cnt = 0 for clones is mandatory. Set cnt = 1 only for freshly created (non-clone) states, then propagate up the suffix link tree.

Initialize all states to cnt = 1 and every occurrence count will be wrong. The bugs are completely silent. The automaton structure is correct. The transition graph is correct. Only the counts are off. You will submit a wrong answer, stare at the structure for twenty minutes, convince yourself the logic is right, and only then remember this paragraph.

A second trap: the topological order for propagating counts must be decreasing by len. The suffix link tree has the property that parents always have smaller len than children, so processing in decreasing len order processes all children before their parents. Propagating in any other order produces garbage.


Quick Recap

  • The suffix automaton is the smallest DFA accepting all suffixes of s, implicitly indexing every substring.
  • Each state represents an endpos equivalence class: all substrings that end at exactly the same positions in s.
  • Endpos sets form a tree (via Lemma 2: disjoint or one contains the other). Suffix links trace this tree.
  • Every state v contains substrings of consecutive lengths [len(link[v])+1, len[v]]. This is what makes the size linear.
  • Construction is online, O(n) amortized, creates at most 2n-1 states and 3n-4 transitions.
  • The extend(c) function has three cases: no conflict (link to root), continuous conflict (link directly), non-continuous conflict (clone + redirect).
  • Count distinct substrings in O(n) via the formula. Count occurrences in O(pattern length) after O(n) preprocessing. Longest common substring in O(|S| + |T|).
  • Initialize cnt = 1 for non-clone states only. Propagate counts in decreasing-len order.
  • If you need ordered suffix operations (binary search, LCP), prefer the suffix array. If you need multiple patterns against a static text, the SAM wins.

If you want to see how quickly you can explain suffix links under pressure, SpaceComplexity will put you in a realistic voice-based mock interview and tell you exactly which part of your explanation lost the interviewer.


For related string algorithm foundations, the KMP algorithm covers the failure function which shares structural DNA with suffix links. The suffix array is the suffix automaton's cache-friendlier sibling for sorted-suffix problems. And for the trie analogy that makes DAWGs click, trie data structure is worth reading first if the automaton concept feels abstract.


Further Reading