Suffix Tree Data Structure: Index Every Substring at Once, Query in O(m)

May 26, 202638 min read
dsaalgorithmsinterview-prepdata-structures
Suffix Tree Data Structure: Index Every Substring at Once, Query in O(m)
TL;DR
  • A suffix tree is a compressed trie of all suffixes, with edge labels stored as index pairs so the whole structure fits in O(n) space.
  • Ukkonen's algorithm builds the suffix tree in O(n) using three interacting tricks: a global end pointer, an active point, and suffix links.
  • Pattern search costs O(|P|) because every substring of the text is a prefix of some suffix, making lookup a root-to-node walk.
  • The longest repeated substring is the path label of the deepest internal node, found by a single O(n) DFS with no extra data structures.
  • A generalized suffix tree over S + "$" + T + "#" solves longest common substring in O(|S| + |T|) after linear construction.
  • Suffix arrays are preferred for pure search (3-5x smaller memory footprint), but suffix trees win when you need multiple query types over the same text.
  • All 14 language implementations (Python 3 through PHP) demonstrate the same compressed trie insert, search, and longest-repeated-substring pattern.

You have a 3-billion-character DNA sequence and you need to find every location where a given gene fragment appears. Scanning naively takes O(n) per query, and you have thousands of queries. That's slow enough to be useless.

The suffix tree solves this. Build it once in O(n) time. Then every pattern search costs only O(|pattern|), independent of how large the text is. The longest repeated substring? A single DFS. The longest common substring of two sequences? Combine them and DFS again.

It is the most capable string index in practical use. It is also, judging by how many explanations quietly say "just trust the suffix link," among the most underexplained. Here is how it actually works.


One Sentence Before We Go Deep

A suffix tree is a compressed trie of all suffixes of a string, with edge labels stored as index pairs so the whole structure fits in O(n) space.

Once you have it, any substring of the text corresponds to a root-to-node path in the tree, which means "does this pattern exist?" reduces to "can I walk this path?"


The Shape of the Thing

Start with the string banana and append a unique terminator $ that does not appear in the text. (Yes, you need it. The math breaks without it. We get to why in about 150 words.) You now have banana$ of length 7. Its suffixes are:

0: banana$
1: anana$
2: nana$
3: ana$
4: na$
5: a$
6: $

Build a trie of all seven. Then compress every chain of single-child nodes into one edge with a multi-character label. The result is the suffix tree.

                    root
          ┌──────────┼──────────────┬─────────────┐
         "$"        "a"          "banana$"       "na"
          │          │               │              │
        leaf(6)   [node]          leaf(0)        [node]
                 ┌──┴──────┐                   ┌───┴───────┐
                "$"      "na"                 "$"        "na$"
                 │          │                  │             │
              leaf(5)   [node]              leaf(4)       leaf(2)
                        ┌──┴──┐
                       "$"  "na$"
                        │      │
                     leaf(3) leaf(1)

Suffix tree for "banana$" showing 7 leaves and 3 internal nodes connected by labeled edges The complete suffix tree for "banana$". Blue circles are internal nodes, each representing a repeated substring. Amber circles are leaves, labeled with the starting index of their suffix.

Seven leaves, three internal nodes (plus root). Every leaf corresponds to exactly one suffix. Every internal node (except root) represents a substring that appears at least twice.

Why at Most 2n-1 Nodes

In any tree where every internal node has at least two children, the number of internal nodes is at most (leaves - 1).

The proof is a counting argument on edges. Let there be L leaves and I internal nodes, giving a tree with L + I nodes and L + I - 1 edges. Every internal node has at least 2 children, so it contributes at least 2 edges downward. But the root receives no upward edge, so the upward edges total L + I - 1. The downward edges from internal nodes are at least 2I. Since each non-root node receives exactly one edge from its parent: L + I - 1 >= 2I, which gives I <= L - 1.

For a suffix tree on text of length n, there are exactly n leaves (the unique terminator $ ensures no suffix is a prefix of another, so every suffix ends at its own leaf). So internal nodes <= n - 1 and total nodes <= 2n - 1.

Edge Labels Are Pairs, Not Strings

Storing full string labels on edges would use O(n²) space in the worst case. The fix is elegant: store (start, end) index pairs pointing into the original text. The label for an edge is implicitly text[start..end]. All 2n-1 edges share one backing array. Total space: O(n).

In practice, a node needs the pair, a parent pointer, suffix link, and a children structure. On a 64-bit system this runs to roughly 32-48 bytes per node, times up to 2n nodes, so about 10-20 bytes per input character. A suffix array uses 4 bytes per character. That 3-5x size gap is why suffix arrays win for pure search. Suffix trees win when you need more than one kind of query on the same text.

Diagram showing naive string storage versus (start, end) index pair storage for suffix tree edges Storing string literals on edges blows up to O(n²) space on pathological inputs. Storing index pairs keeps every edge at a fixed 8 bytes, with all edges sharing the original text as their backing array.


Weiner Built It. McCreight Fixed It. Ukkonen Made It Stream.

Peter Weiner introduced the structure in 1973, calling it a position tree. Donald Knuth's student Vaughan Pratt credits Knuth with calling it "Algorithm of the Year 1973." Weiner's construction ran in O(n) but was notoriously difficult to understand. The paper's reputation in the algorithms community is consistent: brilliant, correct, and impenetrable.

Edward McCreight simplified the construction in 1976, reducing the constant factors and making the algorithm more teachable. More teachable. Not easy.

Esko Ukkonen published the algorithm we use today in 1995: online, O(n), and the first construction that builds the suffix tree for the characters seen so far, making it streamable. This is "Ukkonen's algorithm." Three algorithms, twenty-two years. Nobody said string indexing was supposed to be this hard.


Building It: Ukkonen's Three Tricks

Naively, you insert each of the n suffixes into the compressed trie one at a time. Inserting suffix i costs O(i) in the worst case. Total: O(n²). Ukkonen gets it down to O(n) with three ideas that interact.

Getting all three ideas working in your head simultaneously is the challenge. Most explanations, including excellent ones, quietly lose the reader somewhere in Trick 2.

Diagram: Ukkonen Construction State

String: "abcabx"

Phase 1 (after 'a'):    Phase 4 (after 'a'):    Phase 6 (after 'x'):
root                    root                    root
 └─ "a"                  ├─ "ab..."              ├─ "ab"
                         ├─ "b..."               │   ├─ "cabx" leaf(0)
active=(root,∅,0)        └─ "c..."               │   └─ "x"   leaf(3)
remainder=0              active=(root,'a',1)     ├─ "b"
                         remainder=1              │   ├─ "cabx" leaf(1)
                                                 │   └─ "x"   leaf(4)
                                                 ├─ "cabx"    leaf(2)
                                                 └─ "x"       leaf(5)

Three-panel diagram showing Ukkonen's three tricks: global end pointer, active point triple, and suffix link arrow between internal nodes The three tricks are independent ideas that compose into a linear algorithm. The global end pointer handles all leaf extensions for free. The active point eliminates redundant root-to-node walks. Suffix links turn what would be O(n) split sequences into O(1) amortized jumps.

Trick 1: The Global End Pointer

Every leaf edge extends to the end of the string. Instead of updating each leaf at every phase (O(n) updates per phase, O(n²) total), use a global variable end that represents "current position." All leaf edges point to this single variable. When you add character i, end increments by 1 and all leaf edges grow for free. Rule 1 (extend a leaf) costs O(1) total across all n phases.

Trick 2: The Active Point

The active point is a triple (activeNode, activeEdge, activeLength) that tracks where the next suffix insertion should start. Without it, you would walk from root for every extension. With it, you jump directly to the right position.

When the algorithm ends a phase without finishing all pending insertions (Rule 3 fired), the active point sits at the exact spot where the next character was absorbed implicitly. The next phase resumes from there.

Trick 3: Suffix Links

A suffix link from internal node v (representing string ) points to internal node w (representing string α). After creating an internal node during a split, you follow suffix links to find where to make the next split. This turns what would be O(n) root-to-node traversals into O(1) amortized hops. This is also the part where most suffix tree explanations say "just follow the suffix link" and hope you do not ask why the link exists or how it got there.

The Three Extension Rules

For each phase i (processing character S[i]), the algorithm runs through all pending extensions:

  • Rule 1: The path ends at a leaf. The global end pointer already extended it. Free.
  • Rule 2: The path ends at an internal node or mid-edge, and the next character differs from S[i]. Create a new leaf (splitting the edge if needed). Create an internal node at the split point and add a pending suffix link.
  • Rule 3: The next character already matches S[i]. The suffix is implicitly in the tree. Stop. Increment the remainder counter. All remaining extensions in this phase also hit Rule 3.

Rule 3 is the showstopper. Once it fires, the phase ends immediately. The active point stays put, and remainder records how many suffixes are still pending implicit insertion. They will be resolved when a future character forces a split. If no future character forces a split, those suffixes are already implicit in the tree. Either way, you do not need to touch them.

Why the Total Work Is O(n)

The active point's activeLength can increase by at most 1 per character added and decreases with every Rule 2 split. So the total number of "steps along an edge" is bounded by n. Suffix link hops are O(1) each and happen at most once per Rule 2 firing. Rule 2 fires at most n times total (once per leaf). Rule 3 ends each phase early. The amortized cost per character is O(1). The full formal proof in Ukkonen 1995 runs to about four pages. This version is the right intuition, just abbreviated.


What You Get: Operations and Their Costs

OperationTimeSpaceNotes
Build (Ukkonen's)O(n)O(n)Online, streamable
Build (naive)O(n²)O(n)Insert each suffix
Search for pattern PO(|P|)O(1)Walk root-to-node
Find all k occurrences of PO(|P| + k)O(k)Walk then DFS subtree
Longest repeated substringO(n)O(1) extraDFS, deepest internal node
Longest common substringO(n + m)O(n + m)Generalized suffix tree
Count distinct substringsO(n)O(1) extraSum all edge label lengths
BWT constructionO(n)O(n)DFS lexicographic order

Search Is O(|P|) Because P Must Be a Suffix Prefix

Every substring of T is a prefix of some suffix of T. Follow that logic: if pattern P occurs in T starting at position i, then P is a prefix of the suffix T[i..n]. That suffix is a root-to-leaf path in the tree. So P must match the first |P| characters of some root-to-leaf path, meaning you can walk P character by character from the root and either exhaust P (found) or fail (not found). Walking |P| characters costs O(|P|).

Finding All Occurrences Is O(|P| + k)

Walk root-to-node in O(|P|). Now collect all leaves in the subtree rooted at that node. Each leaf's suffixIndex is an occurrence start position. The subtree has exactly k leaves if P occurs k times. DFS the subtree costs O(k).

Longest Repeated Substring: Deepest Internal Node

Every internal node has at least two leaf descendants. Each leaf is a different suffix. The path label from root to that internal node is therefore a string that appears at least twice. The deepest internal node (by total edge label length from root) is the longest repeated substring.

For "banana":
  Internal node at "ana" (depth 3) has leaves for suffixes 1 and 3.
  "ana" appears at positions 1 and 3. LRS = "ana".

Suffix tree for "banana$" with the path root-to-"ana" highlighted in green as the longest repeated substring The greyed-out edges are irrelevant for this query. The highlighted path traces "a" then "na" to reach the deepest internal node, representing "ana". Two leaf descendants confirm it appears twice.

DFS the tree tracking cumulative path length. At each internal node, compare to best. O(n) total. The connection between "deepest internal node" and "longest repeated string" feels obvious once you see it, and completely non-obvious until you do.

Longest Common Substring: The Sentinel Trick

To find the longest common substring of strings S and T, build a generalized suffix tree for S + "$" + T + "#" where "$" and "#" are distinct terminators not in either string. In the resulting tree, mark each leaf as coming from S or T based on which side of the separator its start index falls on.

During DFS, propagate upward: an internal node is "mixed" if its subtree contains leaves from both S and T. The deepest mixed internal node (by string depth) gives the LCS. This runs in O(|S| + |T|) after O(|S| + |T|) construction.

Count Distinct Substrings in O(n)

Each edge in the suffix tree contributes exactly len(edge_label) distinct substrings that are not represented by any ancestor edge. Summing all edge label lengths gives the total count of distinct non-empty substrings. Since there are O(n) edges and you have the lengths as index pairs, this sum is O(n).

BWT in O(n)

Traverse the suffix tree in DFS order, always following edges in lexicographic order of their first character. The sequence in which you visit leaves gives the suffix array SA. The Burrows-Wheeler Transform is then BWT[i] = T[(SA[i] - 1 + n) mod n]. This transformation underlies bzip2 compression and the FM-index used in genome aligners like BWA.


One Structure, Every Language

Each implementation builds a suffix tree by inserting suffixes one at a time (O(n²) build) with O(|P|) search and O(n) LRS. For production use with long strings, replace the build with Ukkonen's algorithm as described above.

from __future__ import annotations from typing import Optional class Node: __slots__ = ("label", "suffix_index", "children") def __init__(self, label: str = "", suffix_index: int = -1) -> None: self.label = label self.suffix_index = suffix_index self.children: dict[str, Node] = {} class SuffixTree: def __init__(self, text: str) -> None: self.text = text + "$" self.root = Node() for i in range(len(self.text)): self._insert(i) def _insert(self, start: int) -> None: node = self.root suffix = self.text[start:] pos = 0 while pos < len(suffix): c = suffix[pos] if c not in node.children: node.children[c] = Node(label=suffix[pos:], suffix_index=start) return child = node.children[c] j = 0 while j < len(child.label) and pos + j < len(suffix) and suffix[pos + j] == child.label[j]: j += 1 if j == len(child.label): pos += j node = child continue # Split edge at position j remainder_c = child.label[j] new_c = suffix[pos + j] split = Node(label=child.label[:j]) node.children[c] = split child.label = child.label[j:] split.children[remainder_c] = child split.children[new_c] = Node(label=suffix[pos + j:], suffix_index=start) return def search(self, pattern: str) -> bool: node, pos = self.root, 0 while pos < len(pattern): c = pattern[pos] if c not in node.children: return False child = node.children[c] j = 0 while j < len(child.label) and pos < len(pattern) and pattern[pos] == child.label[j]: j += 1 pos += 1 if pos == len(pattern): return True if j < len(child.label): return False node = child return True def longest_repeated_substring(self) -> str: def dfs(node: Node, path: str) -> str: best = "" for child in node.children.values(): new_path = path + child.label if child.children and len(new_path) > len(best): best = new_path candidate = dfs(child, new_path) if len(candidate) > len(best): best = candidate return best return dfs(self.root, "")

Where the Suffix Tree Data Structure Belongs in Your Toolkit

Any problem that asks you to query the same text many times for different patterns belongs to the suffix tree family.

The key scenarios:

  • Repeated multi-pattern search. You have one large document and thousands of patterns. Build the tree once in O(n), answer each query in O(|pattern|). Total: O(n + sum of pattern lengths) vs O(n * sum of pattern lengths) naively.

  • Longest repeated substring. Any input where you want the longest string that appears at least twice: repeated genomic sequences, repeated code blocks, plagiarism detection. One DFS, O(n).

  • Longest common substring. Given strings S and T, find their longest common substring using a generalized suffix tree. Key in bioinformatics for sequence alignment and in diff tools.

  • Count distinct substrings. Summing edge label lengths gives the answer in O(n) after O(n) construction.

  • Compression preprocessing. The BWT, used by bzip2, flows naturally from a suffix tree DFS. The FM-index (which powers tools like BWA for genome alignment) is a compressed representation derived from the suffix tree via BWT and suffix arrays.

  • String matching with wildcards and mismatches. The tree's structure supports approximate matching extensions that are much harder to achieve with a bare suffix array.


Two Signals That Tell You to Reach for This

A suffix tree (or suffix array, its leaner cousin) is the right tool in two recognizable situations.

Signal 1: Multiple Queries, One Fixed Text

Problem: You are given a book of 500,000 words. A user can submit any query string and must see all page numbers where it appears, in under 100ms, regardless of how many users query simultaneously.

Why not KMP or Rabin-Karp? Both scan the full text O(n) per query. For high query volume, that multiplies fast.

The suffix tree approach: build the tree once offline, O(n). Store it. Each query walks the tree in O(|query|) and then collects leaves in O(occurrences). A 500,000-character text with suffix tree uses roughly 10MB. A single query for "justice" (7 chars) touches at most 7 internal nodes regardless of text size.

Construction cost is amortized over all future queries. If you expect more than one query, the suffix tree wins.

Signal 2: "Longest X That Appears Twice"

Problem: Given a DNA string of length 10⁷, find the longest sequence that appears at least twice. This would help identify repeated elements in the genome.

The brute force approach checks all O(n²) pairs of substrings, each comparison up to O(n). That is O(n³). Even the "smarter" approach of checking all O(n²) substrings with hashing is O(n²).

The suffix tree approach:

  1. Build the suffix tree in O(n). Each internal node has at least 2 leaf descendants, so it represents a string that appears at least twice.
  2. DFS the tree, tracking cumulative path label length.
  3. At each internal node, compare its string depth to the current best.
  4. Return the path label of the deepest internal node found.

Total: O(n) build + O(n) DFS = O(n). For a 10⁷ character string, this finishes in under a second on any modern machine.

The connection: every internal node IS a repeated substring, and depth in the suffix tree IS frequency-weighted substring length. The longest repeated substring problem maps perfectly to "find the maximum-depth internal node," which is a one-pass DFS.


Before You Move On

  • A suffix tree is a compressed trie of all suffixes, built in O(n) with Ukkonen's algorithm.
  • The unique terminator $ forces every suffix to end at a distinct leaf, giving exactly n leaves and at most n-1 internal nodes. Total: at most 2n-1 nodes.
  • Edge labels are (start, end) index pairs into the original text, so the whole structure uses O(n) space. In practice: 10-20 bytes per character.
  • Ukkonen's three tricks: global end pointer (leaf extension is free), active point (jump to the right insertion spot), suffix links (amortized O(1) hops between related nodes).
  • Search costs O(|pattern|) because a pattern in the text is a prefix of some suffix, which is a root-to-node path.
  • Deepest internal node = longest repeated substring.
  • Generalized suffix tree on S + "$" + T + "#" gives longest common substring in O(|S| + |T|).
  • DFS in lexicographic order gives the suffix array, from which the BWT follows immediately.
  • For pure search on a static text, suffix arrays are usually preferred: simpler to implement and 3-5x smaller memory footprint.

If you followed the Ukkonen section to the end, you now have better intuition for amortized analysis than most people who confidently use the word in engineering interviews. That is worth something.

If you want to practice explaining these tradeoffs out loud the way an interviewer expects, SpaceComplexity runs voice-based DSA mock interviews with rubric feedback on exactly this kind of structural reasoning. You will know you have it down when you can explain why suffix links guarantee O(1) amortized cost per extension, not just that they do.

For more on the data structures that surround suffix trees in interview prep, see our deep dives on the trie data structure, hash map internals, and binary search invariants.


Further Reading