Suffix Array: How 4n Bytes Beat the Suffix Tree at Its Own Game

The suffix tree is the theoretical ideal for string indexing. It answers any substring query you can dream up, in optimal time. Udi Manber, who helped invent the suffix array, once called the suffix tree "the most versatile data structure for string processing." High praise. The problem is that a good suffix tree implementation costs 15 to 20 bytes per character. On a 10 MB file, that is 200 MB of memory before you have done anything useful. You have burned your whole L3 cache on pointers before reading a single result.

The suffix array is the practical answer. Four bytes per character. Same asymptotic bounds for nearly every application you care about. Built in linear time. Every modern bioinformatics aligner, every production full-text search system, and bzip2 all run on suffix arrays. This is not a compromise people settled for. It is the right tool.

The mental model: sort every suffix of your string, keep only their starting positions, and you have a suffix array.

Reach for one when you need to search a large text repeatedly, find the longest repeated substring, count distinct substrings, or produce a Burrows-Wheeler Transform. It is the backbone of offline string algorithms the way a hash map is the backbone of lookup problems.

What Even Is a Suffix Array

Take the string "banana". It has six suffixes, one starting at each position:

index | suffix
------+--------
  0   | banana
  1   | anana
  2   | nana
  3   | ana
  4   | na
  5   | a

Sort those suffixes lexicographically and record their starting positions:

rank | suffix   | start
-----+----------+------
  0  | a        |  5
  1  | ana      |  3
  2  | anana    |  1
  3  | banana   |  0
  4  | na       |  4
  5  | nana     |  2

The suffix array is just the right-hand column: SA = [5, 3, 1, 0, 4, 2].

That is it. An array of n integers, each a valid index into the original string. No pointers. No nodes. No edge labels. The suffixes themselves are never stored. You never allocate them. SA[i] holds the starting position of the i-th lexicographically smallest suffix, and you reconstruct any suffix on demand with a slice of the original string.

The suffix array for "banana" alongside its sorted suffixes: SA = [5, 3, 1, 0, 4, 2]

Sorted suffixes of "banana" with their starting positions. The right column is your entire data structure.

The original string sits at one address. The suffix array sits right next to it, a flat block of n four-byte integers. A suffix tree for the same string would have up to 2n - 1 nodes, each carrying a parent pointer, child pointers, edge label indices, and a suffix link. On a modern 64-bit system those pointers are eight bytes each. Implementations land at 15 to 20 bytes per input character. That is a lot of memory for what is essentially a sorted phone book you never look at linearly.

Side-by-side memory comparison: a suffix tree's pointer tangle versus a suffix array's flat integer block

Left: the suffix tree for "banana" with its pointers and suffix links. Right: the suffix array. Same data, 4-5x less memory.

The suffix array is just 4n bytes and nothing else.

Building a Suffix Array

Naive O(n² log n): You Could, But Why Would You

The obvious approach: collect all n suffix indices, sort them by comparing the corresponding suffix strings with a standard comparator, done. Sorting n strings takes O(n log n) comparisons; each comparison walks up to n characters. Total: O(n² log n). Fine for strings under a few thousand characters. Do not use this on a genome.

Prefix Doubling: O(n log n)

Manber and Myers introduced the prefix doubling algorithm in their 1990 tech report and 1993 SIAM paper. The idea is to sort suffixes by progressively longer prefixes, doubling the comparison window each round until no two suffixes share a rank.

Start by assigning each position a rank equal to its character's code point. After k rounds, every suffix is ranked by its first 2^k characters. Because ranks fit in a single integer, each round sorts n integer pairs and can be done with radix sort in O(n). After at most ⌈log₂ n⌉ rounds, all ranks are distinct and the sort is complete.

Each round, you replace a string comparison with an integer pair comparison. The comparison window doubles but the comparator stays O(1). You are basically telling the algorithm: stop reading characters, just compare two ints.

For "banana", the rounds look like this:

Round 0 (gap=1, sort by first 1 char):
  SA = [1,3,5,0,2,4]
  ranks: a→0, b→1, n→2
  rank = [1,0,2,0,2,0]   (indices 0..5)

Round 1 (gap=1, sort by (rank[i], rank[i+1])):
  Keys:  5→(0,−1)  3→(0,2)  1→(0,2)  0→(1,0)  4→(2,0)  2→(2,0)
  SA = [5,3,1,0,4,2]
  rank = [2,1,3,1,3,0]   ties at "an" and "na"

Round 2 (gap=2, sort by (rank[i], rank[i+2])):
  Keys:  5→(0,−1)  3→(1,0)  1→(1,1)  0→(2,3)  4→(3,−1)  2→(3,3)
  SA = [5,3,1,0,4,2]
  rank = [3,2,5,1,4,0]   all distinct → done

All ranks are distinct after gap=2, so the final suffix array for "banana" is [5, 3, 1, 0, 4, 2].

Three rounds of prefix doubling on "banana": ranks go from character codes to fully distinct integers

Each round doubles the comparison window. The rank array goes from "full of ties" to "all distinct" in ⌈log₂ n⌉ steps.

Each round requires a sort of integer pairs. With radix sort that is O(n) per round and O(n log n) total. Comparison-based sort raises this to O(n log²n). Most practical implementations use the comparison-based version with a good constant; the asymptotic difference rarely matters outside of very long strings.

SA-IS: O(n), and Yes That Is Real

Nong, Zhang, and Chan achieved true linear time in 2009 with the SA-IS (Suffix Array by Induced Sorting) algorithm. It classifies each position as S-type (the suffix is lexicographically smaller than its neighbor) or L-type (larger), identifies LMS (leftmost-S) positions as the "interesting" ones, sorts only those LMS substrings recursively, and then induces the full sorted order from the sorted LMS positions in two linear passes. The proof of correctness is non-trivial. The implementation is around 200 lines of C.

The libsais library by Ilya Grebnov is the gold-standard implementation, handling strings up to 2³¹ − 1 characters.

All implementations below use prefix doubling. It is readable, fast enough for any interview, and fast enough for most production workloads. Swap in libsais when you are indexing a chromosome.

The LCP Array: Your Neighbors Have Been Talking

The suffix array alone handles binary-search-based pattern matching fine. But you want more. Many more operations open up with a companion structure: the LCP (Longest Common Prefix) array.

LCP[i] is the length of the longest common prefix between SA[i-1] and SA[i], the two adjacent suffixes in sorted order.

For "banana" with SA = [5, 3, 1, 0, 4, 2]:

rank | suffix   | start | LCP
-----+----------+-------+----
  0  | a        |  5    |  0   (no predecessor)
  1  | ana      |  3    |  1   (a)
  2  | anana    |  1    |  3   (ana)
  3  | banana   |  0    |  0
  4  | na       |  4    |  0
  5  | nana     |  2    |  2   (na)

LCP = [0, 1, 3, 0, 0, 2].

LCP array for "banana" showing shared prefixes between adjacent sorted suffixes

LCP[i] counts how many characters the current suffix shares with its neighbor. These six numbers unlock longest-repeated-substring and distinct-substring-count in O(n).

Building the LCP Array: Kasai's O(n) Trick

You could compute every LCP entry naively in O(n²) total. Kasai et al. (2001) showed how to do it in O(n) using a single pass over the original string order, not the sorted order. It looks like a magic trick until you understand the lemma.

The key lemma: if the suffix starting at position i has LCP value k with its predecessor in sorted order, then the suffix starting at position i + 1 has LCP value at least k - 1 with its own predecessor.

Why? Remove the first character from both the suffix at i and its sorted predecessor. The remaining strings match for at least k - 1 characters (the full k-character match minus the deleted character). The suffix at i + 1 is exactly s[i+1:], and its sorted predecessor will share at least those k - 1 characters.

This lemma means the variable k (tracking the current LCP) can decrease by at most 1 per iteration and can only increase when we find new matching characters. The total number of character comparisons across all n iterations is bounded by 2n: k increments at most n times (once per character compared and found equal) and decrements at most n - 1 times. The algorithm is O(n) amortized.

def build_lcp(s: str, sa: list[int]) -> list[int]:
    n = len(s)
    rank = [0] * n
    for i, idx in enumerate(sa):
        rank[idx] = i
    lcp = [0] * n
    k = 0
    for i in range(n):
        if rank[i] == 0:
            k = 0
            continue
        j = sa[rank[i] - 1]
        while i + k < n and j + k < n and s[i + k] == s[j + k]:
            k += 1
        lcp[rank[i]] = k
        if k:
            k -= 1
    return lcp

Every Operation and Its Cost

Operation	Time	Space	Notes
Build: naive	O(n² log n)	O(n)	Sorting suffix indices with string comparisons
Build: prefix doubling	O(n log n)	O(n)	Manber & Myers 1990/1993; O(n log²n) with comparison sort
Build: SA-IS	O(n)	O(n)	Nong, Zhang & Chan 2009
Build LCP: Kasai	O(n)	O(n)	After SA is built; amortized analysis
Search: binary	O(m log n)	O(1)	m = pattern length; log n steps, each O(m)
Search: LCP-accelerated	O(m + log n)	O(n)	Requires LCP array plus sparse table RMQ
Count all occurrences	O(m log n)	O(1)	Two binary searches for left/right boundary
Longest repeated substring	O(n)	O(n)	max(LCP) gives length; adjacent pair gives positions
Distinct substring count	O(n)	O(n)	n(n+1)/2 − sum(LCP)
Longest common substring (2 strings)	O(n + m)	O(n + m)	Concatenate with sentinel, scan LCP
LCP of arbitrary pair (SA[i], SA[j])	O(1) query	O(n log n)	Sparse table RMQ on LCP array

Why Search Is O(m log n)

Binary search on the suffix array makes ⌈log₂ n⌉ probes, each comparing the pattern against s[SA[mid] : SA[mid] + m] and touching at most m characters. Total: O(m log n).

The suffixes are sorted. Binary search on them exactly like you binary search an array of numbers, except the comparator reads characters out of the original string on demand. No extra memory. No tree to traverse. No node following. Just integer arithmetic and a string slice.

Finding "ana" in three probes. Each probe compares the pattern against the suffix at SA[mid]. The sorted order guarantees you discard half the search space every time.

LCP-Accelerated Search: O(m + log n)

The basic binary search re-compares characters at every step. With the LCP array and a sparse table for range minimum queries, you can track how many leading characters of the pattern already match the current lower and upper boundaries of the search interval. When the two boundaries share a long LCP, you skip ahead past the confirmed match. Total character comparisons across all steps is O(m + log n). The technique is described in the original 1993 Manber-Myers paper and simulates a walk through the corresponding suffix tree without building one.

Why Distinct Substrings is n(n+1)/2 − sum(LCP)

The total number of substrings of a length-n string is n(n+1)/2: the suffix at position i contributes n − i substrings (all its own prefixes). Across all n positions that sums to 1 + 2 + ... + n = n(n+1)/2.

In the sorted suffix array, the suffix at rank r shares its first LCP[r] characters with the suffix at rank r − 1. Those LCP[r] strings were already counted when we processed the previous suffix. So each suffix contributes exactly (n − SA[r]) − LCP[r] new distinct substrings. Summing over all ranks:

distinct = Σ (n − SA[r] − LCP[r])
         = n(n+1)/2 − sum(LCP)

For "banana": 21 − (0 + 1 + 3 + 0 + 0 + 2) = 21 − 6 = 15. Verified by enumeration.

BWT Comes for Free (No, Really)

The Burrows-Wheeler Transform, used by bzip2 and the FM-index behind every modern short-read aligner, reduces to reading one character before each sorted suffix. If you are new to the BWT: it rearranges a string into a form that compresses extremely well by clustering repeated characters together. Your DNA sequencer uses it. Your .bz2 files use it. The bioinformatics pipeline that assembled the human reference genome used it. All of that traces back to one formula.

BWT[i] = s[(SA[i] − 1 + n) % n]

If you have the suffix array, you get the BWT in one O(n) pass. The suffix array is the central structure. The BWT is a free side effect.

BWT derivation from the suffix array for "banana": one character before each sorted suffix

BWT("banana") = "nnbaaa". Read the character just before each sorted suffix (wrapping around). That is literally it.

One Structure, Every Language

Each implementation below builds the suffix array with O(n log n) prefix doubling, computes the LCP array with Kasai's O(n) algorithm, and provides binary search for a pattern returning the first occurrence index or -1.

def build_suffix_array(s: str) -> list[int]:
    n = len(s)
    sa = list(range(n))
    rank = [ord(c) for c in s]
    gap = 1
    while gap < n:
        g = gap
        r = rank[:]
        def key(i: int) -> tuple[int, int]:
            return (r[i], r[i + g] if i + g < n else -1)
        sa.sort(key=key)
        tmp = [0] * n
        for i in range(1, n):
            tmp[sa[i]] = tmp[sa[i - 1]] + (0 if key(sa[i]) == key(sa[i - 1]) else 1)
        rank = tmp
        if rank[sa[-1]] == n - 1:
            break
        gap <<= 1
    return sa


def build_lcp(s: str, sa: list[int]) -> list[int]:
    n = len(s)
    rank = [0] * n
    for i, idx in enumerate(sa):
        rank[idx] = i
    lcp = [0] * n
    k = 0
    for i in range(n):
        if rank[i] == 0:
            k = 0
            continue
        j = sa[rank[i] - 1]
        while i + k < n and j + k < n and s[i + k] == s[j + k]:
            k += 1
        lcp[rank[i]] = k
        if k:
            k -= 1
    return lcp


def search(s: str, sa: list[int], pattern: str) -> int:
    lo, hi, m = 0, len(sa), len(pattern)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(sa) and s[sa[lo]:sa[lo] + m] == pattern:
        return sa[lo]
    return -1

The Problems This Structure Actually Solves

Pattern matching with offline preprocessing. Build the suffix array once in O(n log n) and answer every subsequent pattern query in O(m log n). If the text is fixed and the patterns vary, this beats running KMP or Boyer-Moore from scratch every time. The text is already sorted. You just binary search it.

Longest repeated substring. Scan the LCP array for its maximum value. The two suffixes adjacent at that position share the longest repeated substring. One pass, O(n). The suffix tree version of this requires a DFS over up to 2n nodes. The suffix array version is two lines: build LCP, call max.

Counting distinct substrings. Apply n(n+1)/2 − sum(LCP). One pass, O(n). Common in competitive programming, useful in compression analysis, and genuinely surprising that it works when you first see it.

Longest common substring of two strings. Concatenate S = A + '$' + B where '$' is a sentinel smaller than any character in either string. Build the suffix array and LCP array for S. Scan the LCP array for pairs where one suffix came from A and the other from B. The maximum such LCP value is the answer. Total time O(|A| + |B|). The sentinel prevents matches from crossing the boundary between the two strings. Without it you would find substrings spanning both inputs, which is not what you want.

Burrows-Wheeler Transform. Read BWT[i] = s[(SA[i] − 1 + n) % n] in one O(n) pass. The BWT drives both bzip2 compression and the FM-index used by bioinformatics aligners like BWA and Bowtie2. Every time you decompress a .bz2 file, this formula ran in the background.

Arbitrary LCP queries. Build a sparse table (range minimum query structure) on the LCP array in O(n log n). Any query "what is the length of the longest common prefix of the suffixes starting at positions i and j?" answers in O(1). This makes the suffix array and LCP array together nearly equivalent in power to a suffix tree, at a fraction of the memory cost.

When Should You Actually Use This

The structural signal is: you are dealing with a single large string (or a small number of strings), and you need to answer multiple queries about substrings, repetitions, or occurrences.

Three signals narrow it down fast:

The text is fixed; the patterns vary. Preprocessing the text once pays off across many queries.
The problem uses the words "repeated," "common," or "distinct" alongside "substring."
The string is long enough that O(n²) brute force times out but O(n log n) is fine.

Worked Example 1: Longest Repeated Substring

Problem: Given a string, find the longest substring that appears at least twice.

Brute force first. For every pair of positions (i, j), compute the LCP of s[i:] and s[j:]. That is O(n²) pairs, each comparison O(n), giving O(n³) total. With hashing and binary search on length you can get to O(n² log n). Neither will pass on a string of length 10⁵.

Why a suffix array fits. Sort all suffixes. Any repeated substring must appear as a common prefix of two adjacent suffixes in the sorted order, and the LCP array records exactly these lengths. The answer is max(LCP).

def longest_repeated_substring(s: str) -> str:
    sa = build_suffix_array(s)
    lcp = build_lcp(s, sa)
    max_len = max(lcp)
    if max_len == 0:
        return ""
    idx = lcp.index(max_len)
    return s[sa[idx]:sa[idx] + max_len]

Total time: O(n log n) to build the suffix array, O(n) to build LCP, O(n) to scan. Space O(n).

Worked Example 2: Count Distinct Substrings

Problem: Given a string of length n, count the number of distinct (non-empty) substrings.

Brute force. Collect all O(n²) substrings into a hash set. O(n²) time and O(n²) space. Your interviewer will say "can we do better?" before you finish typing.

Why a suffix array fits. Each suffix SA[i] of length n − SA[i] contributes all its prefixes as substrings. The LCP[i] prefixes it shares with SA[i−1] were already counted. The number of new distinct substrings contributed by SA[i] is exactly (n − SA[i]) − LCP[i]. Summing over all ranks gives n(n+1)/2 − sum(LCP).

def count_distinct_substrings(s: str) -> int:
    n = len(s)
    sa = build_suffix_array(s)
    lcp = build_lcp(s, sa)
    return n * (n + 1) // 2 - sum(lcp)

Total time O(n log n). Space O(n). No hash set, no deduplication, no string storage.

The Short Version (Bookmark This Part)

A suffix array is a flat integer array of size n holding the starting positions of all suffixes in sorted order. It uses exactly 4n bytes.
Prefix doubling builds it in O(n log n) by sorting pairs of integer ranks across ⌈log₂ n⌉ rounds. SA-IS achieves O(n) via induced sorting.
The LCP array stores the length of the longest common prefix between adjacent sorted suffixes. Kasai's algorithm builds it in O(n) using the observation that LCP values can decrease by at most 1 per step.
Binary search on the suffix array finds any pattern in O(m log n). With an LCP sparse table, search drops to O(m + log n).
Longest repeated substring = max(LCP). Distinct substring count = n(n+1)/2 − sum(LCP). Both take O(n) after the arrays are built.
The BWT, used by bzip2 and every modern short-read aligner, is a free O(n) side product of the suffix array.
Suffix arrays and their LCP arrays together answer most questions you would otherwise ask a suffix tree, at roughly 4× lower memory.

If you want to practice explaining these trade-offs out loud under interview pressure, SpaceComplexity runs voice-based DSA mock interviews with rubric-based feedback. String algorithms come up more often than people expect once you hit senior-level interviews.

For related deep dives, the trie data structure covers prefix-based string indexing and complements suffix arrays well. Binary search invariants covers the exact technique the suffix array search relies on. And hash map time complexity provides context for when a hash-based approach is the right alternative.