What Is a Suffix Tree? Every Suffix Indexed, One Query at a Time

June 4, 20269 min read
dsaalgorithmsinterview-prepdata-structures
What Is a Suffix Tree? Every Suffix Indexed, One Query at a Time
TL;DR
  • Suffix tree compresses all n suffixes into a trie where edges store index pairs rather than characters, achieving O(n) space with O(m) pattern search after a one-time O(n) build
  • Ukkonen's algorithm constructs the suffix tree online in O(n) time using suffix links and an active point to batch implicit suffix insertions
  • Longest repeated substring, longest common substring, and distinct substring count all fall out of the tree topology without extra preprocessing
  • Space constant is brutal: each node needs 20-40 bytes, so a 10MB string produces a 400-800MB suffix tree — 40-80x the original size
  • Suffix arrays store equivalent information as n sorted integers, using roughly 8x the original string with better cache locality and simpler construction
  • Interview angle: recognize when the O(m)-per-query property applies; implementing Ukkonen's cold in 45 minutes is not a realistic expectation

You have a 100MB genome sequence. A biologist wants to search for 10,000 different patterns in it. Naive string search means 10,000 passes over 100 million characters. A trillion comparisons. Your server fan audibly sighs. Your cloud bill quietly does not.

A suffix tree turns that into one O(n) preprocessing step, then O(m) per query, where m is the pattern length. Build the index once. The text never gets scanned twice.

That is the problem suffix trees exist to solve.

What a Suffix Is

A suffix of a string is any substring that runs all the way to the end. For "banana":

  • "banana" (starts at index 0)
  • "anana" (starts at index 1)
  • "nana" (starts at index 2)
  • "ana" (starts at index 3)
  • "na" (starts at index 4)
  • "a" (starts at index 5)

A string of length n has exactly n suffixes.

The Compression That Makes O(n) Space Work

Put all n suffixes into a regular trie and you get O(m) pattern search. But storing every character explicitly costs O(n²) space. All suffixes together have n + (n-1) + ... + 1 = n(n+1)/2 total characters. You just wanted a search index, not a monument.

A suffix tree compresses this trie so every edge represents a substring, not a single character. Any internal node with only one child collapses into its parent's edge. The tree still supports O(m) search, now in O(n) space.

Here is what makes the space bound tight: edges do not store characters at all. Each edge stores two integers, a start index and an end index into the original string. The characters live in memory once. Every edge borrows a slice of them. A 10-million-character string only needs about 2n integers for all edge labels, regardless of how long those substrings are.

That is the trick. Character storage stays O(n) because it was already O(n) in the original string. The tree only adds O(n) bookkeeping on top of it.

Walking Through "banana"

Take "banana". First, append a sentinel character "$" that does not appear elsewhere in the string. This guarantees every suffix ends at a unique leaf. Without it, "a" is a prefix of "ana", and those paths would share a leaf incorrectly. The "$" looks decorative. It is load-bearing.

The full suffix list: "banana$", "anana$", "nana$", "ana$", "na$", "a$", "$".

In the completed tree, paths from root to leaf spell out these suffixes. Common prefixes share edges. The two suffixes starting with "a", "anana$" and "ana$", both follow the edge labeled "a" from the root, then split at an internal node where their remaining characters diverge. The shared prefix "ana" is represented once, not twice.

Suffix tree for the string banana$, showing all seven suffixes as leaf nodes with shared prefixes compressed onto internal edges

From the root, the structure branches on first characters: one branch handles all "b"-suffixes, one handles all "a"-suffixes, one handles all "n"-suffixes, and so on. Each branch collapses into a single edge wherever only one path continues. Internal nodes exist only at genuine branching points.

A suffix tree for a string of length n has exactly n leaves (one per suffix) and at most n-1 internal nodes, since every internal node must have at least two children. Total nodes: O(n). That is why the space bound holds.

Pattern Search in O(m)

Given a pattern P of length m, walk the tree from the root. At each node, find the outgoing edge whose label starts with the next unmatched character of P. Follow it. Step through the edge's characters one by one against P's remaining characters.

If you exhaust P successfully, the pattern exists in the text. The subtree below your current position contains all occurrences. Each leaf in that subtree stores a starting index in the original text, so you can enumerate all k occurrences in O(m + k) total time.

If you fall off the tree before consuming all of P, the pattern does not exist. No further search needed. Done.

This is what makes repeated queries fast. Build the tree once in O(n) time. Then each of your 10,000 queries runs in O(m), not O(n). The gain compounds with query volume.

Other Problems That Fall Out for Free

Pattern matching is the headline use case, but a built suffix tree handles a range of string problems without additional preprocessing. The tree topology encodes information about the entire string's substring structure. You get a whole toolkit from one build step.

Longest repeated substring. The deepest internal node in the tree corresponds to the longest substring appearing at least twice. The path from root to that node spells it out. O(n) to find after the O(n) build.

Longest common substring of two strings. Build a generalized suffix tree over both strings, separated by a unique marker character. The deepest internal node that has leaves from both strings gives the answer directly.

Number of distinct substrings. Every path from root to any node represents a distinct substring. Sum the edge lengths across all root-to-node paths.

Shortest unique substring at each position. For each position i, find the shortest prefix of suffix i that appears nowhere else in the text. This reads directly off the tree structure without additional passes.

These are not features bolted on after the fact. They fall out naturally from the tree's structure once you understand what depth and branching represent.

Building the Suffix Tree: Ukkonen's O(n) Algorithm

The naive approach is O(n²): insert each suffix into the compressed trie one at a time, merging edges as you go. For long strings, that is too slow.

Ukkonen's algorithm (1995) builds the suffix tree online, left-to-right, in O(n) time. It processes one character at a time and maintains an "active point" (active node, active edge, active length) plus a count of suffixes waiting to be inserted explicitly.

The key insight: after processing position i, many suffixes ending at i already exist implicitly in the tree from prior extensions. Ukkonen tracks what still needs explicit insertion and batches those insertions using suffix links, pointers between internal nodes that skip redundant traversal.

The algorithm is notoriously hard to implement without bugs. It involves roughly half a dozen bookkeeping invariants around suffix links, active point updates across three distinct extension rules, and careful handling of the string's end. Not "tricky but manageable" hard. More like "read the paper, understood nothing, rewrote it three times, still have an off-by-one in the suffix link update" hard.

Most engineers who need this in production use a library or switch to a suffix array. Reading the original paper twice and still feeling uncertain is not unusual. It is the most complex O(n) string algorithm most people ever encounter.

A tweet reading "6 hours of debugging can save you 5 minutes of reading documentation"

The Ukkonen paper does not help as much as you would hope. Read it anyway.

For alphabets of constant size, Ukkonen's runs in O(n) time and O(n) space. For general alphabets, sorting edge labels adds an O(n log n) factor.

Space: Why the Numbers Surprise You

The asymptotic space is O(n). The constant, however, is brutal in practice. Each node stores several fields: a map from first character to child node (a hash map or fixed-size array over the alphabet), a parent pointer, a suffix link, and the (start, end) edge label. That comes to roughly 20 to 40 bytes per node, with up to 2n nodes.

For a 10MB input string (n = 10,000,000), the suffix tree uses somewhere between 400MB and 800MB. That is 40 to 80 times the size of the original string. Your genome sequence fits in an email attachment. The suffix tree wants a dedicated VM and two apologies.

Programmers in 1975 fit games in kilobytes of machine code; by 2025 an AI-generated calculator app uses 1GB RAM

The suffix tree constant factor: a proud tradition.

A suffix array stores the same positional information as n integers. At 32-bit width, that is 40MB for the same input. Pair it with an LCP array and you cover nearly everything a suffix tree can do, in roughly 80MB total. The suffix array is why suffix trees are textbook material but rarely what production systems deploy.

Suffix Tree vs Suffix Array: Why Production Uses Arrays

Both structures encode equivalent information about a string's substrings. A suffix array is the sorted permutation of all n suffix starting indices. With a longest-common-prefix (LCP) array alongside it, you can answer the same queries a suffix tree answers.

Why suffix arrays win in practice:

  • 4 to 8 times smaller memory footprint
  • Better cache locality during queries (array traversal beats pointer chasing through scattered nodes)
  • Simpler construction: O(n log n) with comparison sort, O(n) with SA-IS or DC3 algorithms
  • Far fewer implementation footguns (the suffix tree had a whole armory)

Why suffix trees still matter:

  • The tree structure makes algorithmic reasoning explicit and easy to visualize
  • Generalized suffix trees over multiple strings are more natural to express
  • Some proof arguments read more clearly in tree form

For interviews, understand the tree structure and what it enables. Know that suffix arrays are the practical alternative and roughly why.

What Interviewers Actually Test

You will almost certainly not be asked to implement Ukkonen's algorithm in a 45-minute interview. It is too involved, and no reasonable interviewer expects it cold.

What you might encounter:

  • A recognition problem. "Find the longest substring appearing at least K times." The answer is the deepest node with K or more leaves below it. The test is recognizing that a suffix structure is the right tool, not implementing one.
  • A follow-up on efficiency. After a naive string matching solution: "How would you make this O(m) per query for repeated searches?" Suffix tree or suffix array is the answer.
  • System design context. Full-text search engines like Elasticsearch and Lucene, and DNA analysis tools, build specialized string indices whose design descends directly from suffix structures.

For most string interview problems, KMP and Rabin-Karp are far more frequently tested. Suffix trees are the powerful conceptual layer behind the hardest string problems, not a replacement for pattern-matching fundamentals.

If you want to practice the kind of reasoning that makes suffix structures click under pressure, SpaceComplexity runs voice-based mock interviews where you work through string problems out loud, the way you will in an actual interview.

Further Reading