Suffix Array Search: Sort Once, Query in O(m log n)

You have a text. Queries keep arriving. Your naive O(n·m) scanner handles the first one with dignity. By query five thousand it's rereading the entire string from the beginning every single time, like someone who forgets they already checked if "aardvark" is in the dictionary and flips back to page one to make sure.

A suffix array fixes this. Preprocess once. Every query after that costs O(m log n). For most string-heavy workloads, that is a genuinely good trade.

Every Suffix, Sorted

A suffix is any substring starting at position i and running to the end of the string.

For "banana":

index 0: banana
index 1: anana
index 2: nana
index 3: ana
index 4: na
index 5: a

Sort those six suffixes lexicographically and record the starting index of each:

a       starts at 5
ana     starts at 3
anana   starts at 1
banana  starts at 0
na      starts at 4
nana    starts at 2

A suffix array is the array of starting indices in that sorted order: [5, 3, 1, 0, 4, 2].

Six integers. That is the entire data structure. The original text stays in memory; the suffix array just records how to visit its suffixes in sorted order. No suffix strings stored. No tree nodes. No pointers. A flat array of integers.

Why Sorting Makes Suffix Array Search Fast

When suffixes are sorted, something useful falls out automatically. Every suffix that starts with "ana" groups together in a contiguous block, right between suffixes starting with "an" and "b". That structure is exactly what binary search needs.

Finding all occurrences of a pattern is now two binary searches: one to find the left boundary of the block, one to find the right.

def search(s: str, sa: list[int], pattern: str) -> list[int]:
    n = len(s)
    m = len(pattern)

    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid] : sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    left = lo

    lo, hi = left, n
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid] : sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[i] for i in range(left, lo))

Each comparison reads m characters. Binary search runs O(log n) steps. Per query: O(m log n). No matter how much text you have, that log n stays polite.

For "banana" with SA = [5, 3, 1, 0, 4, 2] and pattern "ana":

Index 5 → suffix "a". Less than "ana". Left boundary moves right.
Index 3 → suffix "ana". Prefix matches. Left boundary = 1.
Index 1 → suffix "anana". Prefix matches. Continue.
Right boundary = 3 (first position where prefix exceeds "ana").

Occurrences returned: [1, 3]. "banana" contains "ana" at positions 1 and 3.

How You Actually Build It

The simplest build: generate all n suffixes, sort them, record the indices.

def build_suffix_array(s: str) -> list[int]:
    return sorted(range(len(s)), key=lambda i: s[i:])

One line. It works. The cost is O(n² log n): sorting n suffixes takes O(n log n) comparisons, and each comparison can touch up to O(n) characters in the worst case. For a 100,000-character file, you will notice. For a LeetCode input of a few thousand characters, you probably will not.

The O(n log n) algorithm by Manber and Myers (1990) avoids re-comparing characters using a prefix-doubling trick. Each pass ranks every suffix by its first 2^k characters. After log n passes, every suffix has a unique rank and no character has been compared twice across the full sort.

SA-IS (2003) pushes this to O(n) and is what production libraries actually ship. For interviews, the one-liner is almost always fine. The faster algorithms are context for design discussions, not something you would be expected to code cold.

What the LCP Array Unlocks

Once you have a suffix array you might as well squeeze more out of it.

lcp[i] stores the length of the longest common prefix between the suffix at position i in sorted order and the one at position i-1. For "banana":

SA:   [5,  3,  1,  0,  4,  2]
       a   ana anana banana na nana
LCP:  [0,  1,  3,   0,   0,  2]

Between "a" and "ana": one shared character. LCP = 1. Between "ana" and "anana": three shared characters. LCP = 3. Between "anana" and "banana": none. LCP = 0.

With the LCP array, the longest repeated substring in the text is just the maximum value in the LCP array. For "banana", that maximum is 3, pointing to "ana" appearing at both index 1 and index 3.

Kasai's algorithm builds the LCP array in O(n) by exploiting a cheap lower bound: if lcp[rank[i]] = k, then lcp[rank[i+1]] >= k - 1. That means you extend rather than restart each comparison, so the total work across all n suffixes is linear.

def build_lcp_array(s: str, sa: list[int]) -> list[int]:
    n = len(s)
    rank = [0] * n
    for i, suffix_start in enumerate(sa):
        rank[suffix_start] = i

    lcp = [0] * n
    k = 0
    for i in range(n):
        if rank[i] == 0:
            k = 0
            continue
        j = sa[rank[i] - 1]
        while i + k < n and j + k < n and s[i + k] == s[j + k]:
            k += 1
        lcp[rank[i]] = k
        if k:
            k -= 1
    return lcp

Space: Why Practitioners Prefer This Over the Suffix Tree

A suffix tree encodes the same information but stores actual substrings inside tree nodes. In theory it is O(n) space. In practice each node carries pointer overhead, and that overhead is not cheap. On a modern 64-bit system a suffix tree lands between 10n and 20n bytes per character.

A suffix array is exactly n integers. At four bytes each, that is 4n bytes. Add the LCP array: 8n bytes.

For a 1 GB text file, the suffix tree wants 10 to 20 GB of RAM. The suffix array wants 8 GB. That is the kind of number that ends architectural debates.

The suffix tree finds an exact match in O(m) versus the suffix array's O(m log n). The log n factor in query time rarely matters. The 2-to-5x difference in RAM almost always does. This is why genome assemblers, full-text search engines, and log analysis tools reach for suffix arrays even though suffix trees are theoretically equivalent.

Where Suffix Arrays Actually Show Up

Most string problems in standard coding interviews reach for KMP or a hash-based approach. The suffix array appears at the harder end of the problem spectrum, and more often in system design discussions than in algorithm rounds.

Longest repeated substring. The answer is the maximum value in the LCP array. One scan after building SA and LCP.

Number of distinct substrings. A string of length n has n(n+1)/2 total substrings if all were unique. Subtract the sum of the LCP array to remove duplicates.

Longest common substring of two strings. Concatenate with a separator that does not appear in either, build a single SA and LCP over the combined string, scan for the highest LCP value between suffixes from different strings.

Multiple pattern queries over the same text. Build once at O(n log n) or O(n), answer each query at O(m log n). The more queries you run, the better the amortized cost per query. At a few hundred queries the suffix array has already paid back its build time many times over.

If you need to walk through trade-offs like these out loud under interview pressure, SpaceComplexity runs voice-based mock interviews where you work through both the algorithm design and the complexity analysis in real time.

Complexity at a Glance

	Build SA	Build LCP	Pattern search
Naive	O(n² log n)	O(n)	O(m log n)
Manber-Myers	O(n log n)	O(n)	O(m log n)
SA-IS	O(n)	O(n)	O(m log n)
Space	4n bytes	4n bytes	,

The O(m log n) query cost is the same regardless of how you built the SA. The build algorithm is the variable. The query is not.

The Short Version

A suffix array stores the starting indices of all suffixes in lexicographic order.
Pattern search is binary search: O(m log n) per query after a one-time build.
The LCP array adds O(n) space and unlocks repeated-substring and distinct-substring queries.
Practical space is 4n to 8n bytes, which is why it beats the suffix tree in memory-constrained settings.
For most coding interviews, understand the concept and the naive build. Full O(n) algorithms are out of scope.
Where it appears: competitive programming, staff-level string design questions, and any problem asking about substring frequencies across a fixed text.