Suffix Automaton: Build It Once, Answer Every Substring Query

- Suffix automaton is the smallest DFA accepting all suffixes of a string, implicitly indexing every substring in O(n) time and O(n) space
- Endpos equivalence classes group substrings that end at identical positions; Lemma 2 proves sets nest or stay disjoint, forming the suffix link tree
- The clone operation in extend() keeps every state spanning consecutive length ranges, which limits the structure to at most 2n-1 states and 3n-4 transitions
- Count distinct substrings in O(n) with the formula Σ(len[v] - len[link[v]]) over all non-initial states
- Longest common substring runs in O(|S|+|T|): build the SAM for S, stream T with suffix link fallback, track maximum match length
- Clones get cnt=0, not cnt=1; propagating counts in anything other than decreasing-len order silently corrupts every occurrence count
You have a string. You need to know how many times a pattern appears in it. And how many distinct substrings it has. And the lexicographically k-th one. And the longest substring it shares with another string.
Most people reach for four different tools. The suffix automaton is one structure that does all four. Build it once in O(n), query it in O(pattern length). No other single structure matches that combination. It is, frankly, a little embarrassing for the competition.
It was called the DAWG (Directed Acyclic Word Graph) when Blumer, Blumer, Ehrenfeucht, Haussler, Chen, and Seiferas published "The Smallest Automaton Recognizing the Subwords of a Text" in 1985 in Theoretical Computer Science. Six authors. The name stuck less than the structure. Competitive programmers eventually won the naming war and called it the SAM, which is both shorter and sounds less like a 1980s video game peripheral.
The one-sentence mental model: the suffix automaton is the smallest deterministic finite automaton that accepts every suffix of a string, and it implicitly indexes every substring.
What Lives Inside Each State
The automaton is an acyclic directed graph. Nodes are states. Edges are labeled with characters. Starting from the initial state and following edges, every path you can trace spells out a substring of the original string.
That part is unsurprising. Tries do the same thing. What keeps the SAM small is the interesting bit, and it comes down to one insight: most substrings are basically the same substring wearing different lengths.
The key idea is endpos equivalence. For any substring t, define endpos(t) as the set of positions in s where t ends. In the string abbc (0-indexed), endpos("b") = {1, 2} and endpos("ab") = {1}.
Two substrings are endpos-equivalent when they always end at exactly the same positions. In abbc, the substrings c, bc, bbc, and abbc all end only at position 3. So endpos("c") = endpos("bc") = endpos("bbc") = endpos("abbc") = {3}. They all belong to the same equivalence class. Four strings, one state.

Four substrings, one state. That is where the O(n) bound comes from.
Three lemmas nail down the structure. You do not need to memorize proofs. You do need to believe them, because the construction only makes sense if you accept what they say:
Lemma 1. Two substrings u and w are endpos-equivalent if and only if u occurs in s only as a suffix of w (or vice versa).
Lemma 2. For any two substrings, their endpos sets are either disjoint or one is a proper subset of the other. They can never partially overlap.
Lemma 3. Within a single state, the substrings form a consecutive length sequence. If the longest substring in state v has length len(v), and the longest substring of state link(v) has length len(link(v)), then state v contains exactly one substring of each length from len(link(v)) + 1 up through len(v).
Lemma 2 is the one that makes everything work. Because endpos sets only nest or stay disjoint, they form a tree. That tree of endpos containment is exactly the tree of suffix links.
The Suffix Link Tree
Each state (except the initial state) has a suffix link pointing to the parent state whose endpos set strictly contains it.
More concretely: if state v has longest element w, then link(v) points to the state whose longest element is the shortest proper suffix of w that appears at strictly more positions.
In abbc:
-
State 5 represents
{"c", "bc", "bbc", "abbc"}withendpos = {3} -
Its suffix link points to state 0 (the initial state), whose endpos is all positions
-
State 4 represents
{"b"}withendpos = {1, 2} -
Its suffix link points to state 0
-
States 2 and 3 both link to state 4, because
"b"is their parent in the containment tree

The suffix links form a proper tree. The root has endpos = all positions. Every other state's endpos is a strict subset of its parent.
Parent always has a strictly larger endpos set than child. The root (initial state) contains the empty string and has every position in its endpos set.
Memory Layout
Each state stores three things:
struct State {
int len; // length of longest substring in this class
int link; // suffix link: index of parent state (-1 for root)
map<char, int> next; // transitions: character → next state
}
For a string of length n, you allocate at most 2n - 1 states. In practice, use a fixed array of 2n states to avoid reallocation. The initial state is index 0.
The number of transitions is at most 3n - 4. The proof splits transitions into two types. A transition (p → q via c) is continuous when len(p) + 1 == len(q). These form a spanning tree of the automaton's longest-path DAG structure, giving at most 2n - 2 continuous edges. A transition is non-continuous when len(p) + 1 < len(q). Each non-continuous transition corresponds to a different suffix of s, and there are at most n - 1 suffixes of length at least 2. So total transitions ≤ (2n - 2) + (n - 1) = 3n - 3, and the cases that achieve 2n - 1 states cannot simultaneously maximize transitions, giving the tighter 3n - 4.
The Construction Algorithm
The algorithm is online: feed characters one at a time, and after each extension the automaton correctly represents all substrings of the current prefix. You do not need the full string upfront. That sounds like a footnote until you realize the entire O(n) proof depends on it.
The extend(c) function:
- Create a new state
curwithlen[cur] = len[last] + 1. - Walk backward from
lastvia suffix links, adding ac-transition tocurfrom every state that lacks one. - When you hit a state
pthat already has ac-transition (to some stateq):- If
len[p] + 1 == len[q], the existing transition is already continuous. Setlink[cur] = q. - If
len[p] + 1 < len[q], stateqspans a range of lengths that needs to be split. Clone q: createclonewithlen[clone] = len[p] + 1, copy q's transitions and link into clone, then redirect all transitions that were pointing toqback up the suffix link chain to point tocloneinstead. Setlink[q] = link[cur] = clone.
- If
- If you walk all the way to the root without finding a
c-transition, setlink[cur] = 0. - Update
last = cur.
The clone operation is what maintains minimality. Without it, states would represent non-contiguous length ranges, violating Lemma 3.
Why It Is O(n)
len[last] increases by exactly 1 with each call to extend. The suffix link chain in step 2 walks through states with strictly decreasing len values. The total number of steps across all those chain walks is bounded by the total increase of len[last], which is n. Each clone operation costs O(1) and happens at most once per character. So the total work is O(n) with a fixed alphabet, or O(n log |Σ|) if transitions are sorted maps.
Construction Walk-Through: "abbc"
The critical step is extending with the second b. This is where the clone fires. It is also where most first implementations produce a subtly wrong occurrence counter and spend two hours staring at a test case before realizing it.
After "ab", the automaton has states 0 (init), 1 ("a", len=1), 2 ("b"/"ab", len=2). The suffix link tree has 0 at the root with 1 and 2 as children.
Now extend with b:
- Create cur=3, len=3.
- Walk from last=2: state 2 has no
btransition. Add 2→3. Walk to link[2]=0. - State 0 already has a
btransition to q=2. - Check: len[0]+1 = 1, len[2] = 2. Not continuous. Clone.
- clone=4, len[4]=1. Copy transitions: state 2's transitions at this moment include {b:3}. Copy to clone.
- Clone's link = link[2] = 0.
- Walk back redirecting: state 0's
btransition was pointing to 2, redirect to 4. - Set link[2] = 4, link[cur=3] = 4.

State 4 (the clone) splits off "b" from "ab". Before the clone, state 2 was carrying both. After, each has its own state with its own endpos set.
State 4 (the clone) now represents "b" alone with endpos={1,2}. State 2 represents "ab" with endpos={1}. State 3 represents {"bb","abb"} with endpos={2}.
Core Operations
| Operation | Time | Space | Notes |
|---|---|---|---|
| Construction | O(n·|Σ|) | O(n·|Σ|) | O(n) with hash maps or fixed alphabet |
| Pattern search | O(m) | O(1) | Walk transitions; suffix link fallback if stuck |
| Count occurrences | O(n + m) | O(n) | Precompute cnt via topological sort on suffix link tree |
| Count all occurrences | O(n + m + k) | O(n) | k = number of match positions |
| Count distinct substrings | O(n) | O(1) | Formula: Σ(len[v] − len[link[v]]) |
| Longest common substring | O(n + m) | O(n) | Build SAM for S, stream T with fallback |
| k-th lexicographic substring | O(k·|Σ|) | O(n) | DP path counts, greedy descent |
| Shortest absent substring | O(n) | O(n) | BFS/DP for missing transitions |
Construction is O(n·|Σ|) because each state stores up to |Σ| transitions. With a hash map, each transition is O(1) average and the total is O(n). With a sorted map, it is O(n log |Σ|). For competitive programming with small alphabets (26 letters), a fixed int next[26] array initialized to -1 gives true O(n).
Why pattern search is O(m). Follow transitions character by character from the initial state. If no transition exists for the next character, the pattern does not appear in s. The automaton is deterministic, so there is no backtracking. This is the part that makes the SAM strictly better than KMP for multiple patterns against the same text: KMP re-processes the text for each pattern. The SAM processes each pattern in O(pattern length) with zero extra preprocessing.
Why counting occurrences requires O(n) preprocessing. The automaton itself does not store how many times each substring appears. You propagate counts up the suffix link tree. Mark non-cloned states with cnt = 1. Then process states in decreasing order of len (reverse topological order of the suffix link tree): cnt[link[v]] += cnt[v]. After this, cnt[v] equals the size of endpos(v) = the number of times the longest substring in state v appears in s.
Why distinct substrings is O(1) after build. Each state v contributes exactly len[v] - len[link[v]] distinct substrings: these are all the substrings of lengths from len[link[v]] + 1 up to len[v] that end at exactly the positions in endpos(v). Sum over all non-initial states.
Why the longest common substring algorithm is O(n + m). Build the SAM for S in O(n). Then stream T character by character, maintaining current state v and match length l. At each character c: if v has a c-transition, follow it and increment l. If not, follow suffix links until you find one or reach the root. Track the maximum l seen.

Every time you hit a dead end, the suffix link bails you out. You slide to a shorter match that still appears in S, then try again.
One Structure, Every Language
class State: __slots__ = ("len", "link", "next", "cnt") def __init__(self): self.len = 0 self.link = -1 self.next: dict[str, int] = {} self.cnt = 0 class SuffixAutomaton: def __init__(self): self.st = [State()] self.last = 0 def extend(self, c: str) -> None: if c in self.st[self.last].next: q = self.st[self.last].next[c] if self.st[self.last].len + 1 == self.st[q].len: self.last = q return cur = len(self.st) self.st.append(State()) self.st[cur].len = self.st[self.last].len + 1 self.st[cur].cnt = 1 p = self.last while p != -1 and c not in self.st[p].next: self.st[p].next[c] = cur p = self.st[p].link if p == -1: self.st[cur].link = 0 else: q = self.st[p].next[c] if self.st[p].len + 1 == self.st[q].len: self.st[cur].link = q else: clone = len(self.st) self.st.append(State()) self.st[clone].len = self.st[p].len + 1 self.st[clone].next = dict(self.st[q].next) self.st[clone].link = self.st[q].link while p != -1 and self.st[p].next.get(c) == q: self.st[p].next[c] = clone p = self.st[p].link self.st[q].link = self.st[cur].link = clone self.last = cur def build_counts(self) -> None: order = sorted(range(len(self.st)), key=lambda i: -self.st[i].len) for v in order: if self.st[v].link != -1: self.st[self.st[v].link].cnt += self.st[v].cnt def count_distinct(self) -> int: return sum( self.st[i].len - self.st[self.st[i].link].len for i in range(1, len(self.st)) ) def count_occurrences(self, pattern: str) -> int: v = 0 for c in pattern: if c not in self.st[v].next: return 0 v = self.st[v].next[c] return self.st[v].cnt def longest_common_substring(s: str, t: str) -> str: sam = SuffixAutomaton() for c in s: sam.extend(c) sam.build_counts() v, l, best, best_pos = 0, 0, 0, 0 for i, c in enumerate(t): while v != 0 and c not in sam.st[v].next: v = sam.st[v].link l = sam.st[v].len if c in sam.st[v].next: v = sam.st[v].next[c] l += 1 if l > best: best, best_pos = l, i return t[best_pos - best + 1 : best_pos + 1]
What Problems It Actually Solves
You built the thing. Here is what you get for your trouble.
The suffix automaton handles any query about substrings of a fixed string s. Build it once. Then:
Substring frequency queries. After buildCounts(), countOccurrences(pattern) runs in O(|pattern|). This is strictly better than KMP for multiple patterns against the same text, because KMP re-processes the text for each new pattern. The SAM never re-processes anything.
Counting distinct substrings. The formula Σ(len[v] − len[link[v]]) gives the exact count in O(n). You do not need to generate them, enumerate them, or stuff them into a hash set that silently runs out of memory at n=10^5. The structure already knows.
Longest common substring of two strings. Build the SAM for S, then stream T character by character with suffix link fallback. Naive comparison is O(|S|·|T|). Rolling-hash binary search is O((|S|+|T|) log(|S|+|T|)) with extra constants. The SAM does it in O(|S|+|T|). No approximation, no probabilistic hash, just exact.
Lexicographic k-th substring. Compute dp[v] = number of distinct paths from v. Each path spells a distinct substring. Descend greedily, choosing the smallest character whose subtree has enough paths to contain rank k.
Shortest string not appearing in s. BFS on the automaton, tracking minimum length of paths that miss a transition. The first state where some character has no outgoing edge gives the answer.
String compression problems. The SAM can detect the smallest rotation of a string, find minimal periods, and identify the longest palindromic substring with some augmentation.
When to Use the Suffix Automaton
Five signals this problem wants a suffix automaton:
- A fixed string s with many queries about its substrings.
- You need how many times a pattern appears, not just whether it appears.
- You need to count distinct substrings of s.
- You need the longest common substring of two strings (not subsequence).
- n up to 10^5 or 10^6, and you need something faster than suffix array binary search.
The one signal that points away from the SAM: you need queries with online updates (inserting into the middle of s). The standard SAM only supports appending to the end.
Worked Example 1: How Many Distinct Substrings?
Problem: given s, count the number of distinct substrings. (Classic competitive programming problem. Every beginner reaches for a set. The set reaches back.)
Brute force: generate all O(n^2) substrings, stuff them into a hash set. O(n^2) time and space. Works fine for n up to a few thousand. Quietly kills your process at n=10^5.
Suffix array approach: build the SA and LCP array, answer is n(n+1)/2 - sum(LCP). O(n log n) build, O(n) formula.
SAM approach: build in O(n), then sum len[v] - len[link[v]] over all non-initial states. O(n) total. Correctness follows from Lemma 3: each state v contributes the substrings with lengths in (len[link[v]], len[v]], and those ranges partition all distinct substrings exactly.
For abbc: states 1("a"), 4("b"), 2("ab"), 3("abb","bb"), 5("c","bc","bbc","abbc") contribute 1+1+1+2+4 = 9. Enumerate them: {a, b, c, ab, bb, bc, abb, bbc, abbc}. Correct.
Worked Example 2: Longest Common Substring
Problem: given strings S and T, find the longest substring that appears in both.
Nested loops are O(|S|·|T|). Binary search on sorted suffixes is O((|S|+|T|) log(|S|+|T|)).
Build the SAM for S. Then process T left to right, maintaining two variables: current state v and current match length l. At each character c of T:
- If v has a c-transition: follow it, increment l.
- If v has no c-transition: keep following suffix links (each time resetting l to
len[link[v]]) until you find a c-transition or hit the root. If you hit the root and still no c-transition, reset l to 0 and stay at root.
Track the maximum l seen. The LCS length is that maximum.
Why this is correct. Each state v represents all substrings of S that end at the same set of positions. When you follow v's suffix link, you land in the parent state whose substrings are slightly shorter but appear more often. The current match length l is always at most len[v], maintaining the invariant that the last l characters of T[0..i] appear as a substring of S.
Total time: O(|S|) to build + O(|T|) to stream = O(|S| + |T|).
The Non-Obvious Pitfall (Clones Do Not Count)
Cloned states do not count as occurrences. When you clone a state during extend, the clone represents substrings that existed before the new character. No new occurrence happened. Initializing cnt = 0 for clones is mandatory. Set cnt = 1 only for freshly created (non-clone) states, then propagate up the suffix link tree.
Initialize all states to cnt = 1 and every occurrence count will be wrong. The bugs are completely silent. The automaton structure is correct. The transition graph is correct. Only the counts are off. You will submit a wrong answer, stare at the structure for twenty minutes, convince yourself the logic is right, and only then remember this paragraph.
A second trap: the topological order for propagating counts must be decreasing by len. The suffix link tree has the property that parents always have smaller len than children, so processing in decreasing len order processes all children before their parents. Propagating in any other order produces garbage.
Quick Recap
- The suffix automaton is the smallest DFA accepting all suffixes of s, implicitly indexing every substring.
- Each state represents an endpos equivalence class: all substrings that end at exactly the same positions in s.
- Endpos sets form a tree (via Lemma 2: disjoint or one contains the other). Suffix links trace this tree.
- Every state v contains substrings of consecutive lengths [len(link[v])+1, len[v]]. This is what makes the size linear.
- Construction is online, O(n) amortized, creates at most 2n-1 states and 3n-4 transitions.
- The
extend(c)function has three cases: no conflict (link to root), continuous conflict (link directly), non-continuous conflict (clone + redirect). - Count distinct substrings in O(n) via the formula. Count occurrences in O(pattern length) after O(n) preprocessing. Longest common substring in O(|S| + |T|).
- Initialize
cnt = 1for non-clone states only. Propagate counts in decreasing-len order. - If you need ordered suffix operations (binary search, LCP), prefer the suffix array. If you need multiple patterns against a static text, the SAM wins.
If you want to see how quickly you can explain suffix links under pressure, SpaceComplexity will put you in a realistic voice-based mock interview and tell you exactly which part of your explanation lost the interviewer.
For related string algorithm foundations, the KMP algorithm covers the failure function which shares structural DNA with suffix links. The suffix array is the suffix automaton's cache-friendlier sibling for sorted-suffix problems. And for the trie analogy that makes DAWGs click, trie data structure is worth reading first if the automaton concept feels abstract.