What Is Lexicographic Order? The String Comparison Behind Half Your Sorting Problems

- Lexicographic order compares sequences character by character left to right; the first mismatch decides the winner
- Character values rule sorting: digits (48) < uppercase (65) < lowercase (97) in ASCII, so '0' < 'A' < 'a'
- Sorting numbers as strings gives wrong results: "10" comes before "9" because '1' < '9', not because 10 < 9
- String comparison costs O(m) per pair, making sort O(mn log n) rather than O(n log n) for strings of average length m
- Tuple comparison is lexicographic by default in Python and Java, giving multi-key sort without writing a custom comparator
- Tries store strings in lexicographic order by construction, so prefix search and sorted traversal come for free
- Next permutation means the lexicographically smallest array strictly larger than the current one — the algorithm only makes sense once you see it that way
You sorted ["10", "9", "2"] and got ["10", "2", "9"] back. Congratulations. You have now experienced the exact confusion that has broken production software at companies you definitely recognize and respect.
That output is not a bug. It is lexicographic order, working exactly as designed. It is the rule your language uses by default to sort strings. It is how tries work internally. And it is the invisible concept hiding inside a dozen interview problems that look like they are about something else entirely.
Understanding it takes five minutes. Not understanding it will cost you more than that.
It's the Dictionary, Generalized
Lexicographic order is dictionary ordering applied to sequences. Open any physical dictionary. "ant" comes before "antelope" because when "ant" runs out of letters, the shorter string loses. "apple" comes before "banana" because 'a' < 'b'. "care" comes before "cart" because at position 3, 'e' < 't'.
The formal rule: scan two sequences left to right until you find a mismatch. Whoever has the smaller element at that position comes first. If one sequence is a prefix of the other, the shorter one is smaller.
That is the whole thing. A trie is just this rule applied structurally. "Next permutation" is just this rule applied to arrays of numbers. Custom sort comparators that look complicated are usually just this rule with a different character ranking. Once you see it, you cannot unsee it.
Characters Are Numbers
Here is the thing your CS professor glossed over: computers do not have a concept of "letters." Under the hood, every character is a number. In ASCII and Unicode, 'a' is 97, 'b' is 98, 'z' is 122. Uppercase letters get lower numbers: 'A' is 65. Digits are lower still: '0' is 48.
Three categories, in ascending order: digits, uppercase, lowercase. So 'A' < 'a' < 'z'. And '0' < 'A' < 'a'. When you sort strings, you are sorting integers, just with extra steps.
print("Apple" < "apple") # True: 'A' (65) < 'a' (97) print("b" < "ba") # True: "b" is a prefix of "ba" print("z" < "aa") # False: 'z' (122) > 'a' (97)
Most of the time this does not bite you, because interviews usually involve strings of the same case. But if you are comparing mixed-case strings or sorting filenames, keep the ordering in mind or you will spend an afternoon debugging something embarrassing.
The Gotcha That Breaks Number Sorting
This is the one that gets people.
When you sort numbers as strings, "10" comes before "9" because '1' (49) < '9' (57). The numeric value is irrelevant. Character values are all that matter. Your program does not know you meant numbers. It sees characters.
sorted(["10", "9", "2", "100"]) # ['10', '100', '2', '9'] # '1' < '2' < '9', so everything starting with '1' clusters first
This is one of the most common bugs in custom sort comparators. You sort version numbers, log file names, or IDs as strings and the order looks wrong. Someone has to file a ticket. Someone has to fix it. That someone could have read this post.
The fix is always the same: either convert to integers before sorting, or use a key function that extracts the numeric part.
The interview problem that makes this explicit is LeetCode 179 (Largest Number). You have integers like [3, 30, 34, 5, 9] and need to arrange them to form the largest possible number. Neither numeric sort nor lexicographic sort gives the right answer. You need a custom comparator that compares "330" vs "303" by concatenating in both orders and picking the larger result. The problem only makes sense once you understand why default string sort gives the wrong answer here.
Comparison Is Not Free
Integer comparison is O(1). You load two values, compare them, done.
String comparison is O(m) in the worst case, where m is the length of the shorter string. You scan character by character until you find a mismatch or one string ends.
This means sorting n strings of average length m costs O(mn log n), not O(n log n). For short strings, the difference disappears into constants. For long strings with long common prefixes (URLs, file paths, DNA sequences), the comparison cost starts to dominate.
a = "pneumonoultramicroscopicsilicovolcanoconiosis_a" b = "pneumonoultramicroscopicsilicovolcanoconiosis_b" # Every comparison scans ~45 characters before finding the diff
The comparison itself uses O(1) extra space. But when you materialize a comparison key, that changes. Sorting each string's characters to build an anagram key allocates O(m) per string, O(mn) total. For the anagram problem that allocation is necessary. For sorting strings directly, no extra space is required beyond the sort itself.
This is one reason tries exist. Inserting n strings of length m into a trie costs O(mn). A depth-first traversal visits strings in lexicographic order automatically. Total cost: O(mn). A comparison sort needs O(mn log n). When you have a large dictionary to sort, the trie wins on complexity.
It Hides in More Interview Problems Than You Realize
Anagram grouping
The standard approach for grouping anagrams is to sort each string's characters and use the result as a hash key.
from collections import defaultdict def group_anagrams(strs): groups = defaultdict(list) for s in strs: key = "".join(sorted(s)) # lexicographic sort of characters groups[key].append(s) return list(groups.values())
Any two anagrams produce the same sorted key because they contain exactly the same characters. "eat", "tea", and "ate" all produce "aet". Sorting a string's characters is lexicographic sort applied to one string at a time. Simple idea, weirdly satisfying when it clicks.
Next permutation
LeetCode 31 asks for the next permutation of an array. The algorithm:
- Scan right to left for the first element smaller than its right neighbor (index
i). - Scan right to left for the first element larger than
nums[i](indexj). - Swap
nums[i]andnums[j]. - Reverse everything after index
i.
This is completely opaque without the right framing. "Next permutation" means the smallest array, in lexicographic order, that is strictly larger than the current one. Once you know that, steps 1 through 4 become reconstructible from logic rather than memorizable steps. You stop trying to memorize a four-step dance and start understanding what you are actually building.
def next_permutation(nums): n = len(nums) i = n - 2 while i >= 0 and nums[i] >= nums[i + 1]: i -= 1 if i >= 0: j = n - 1 while nums[j] <= nums[i]: j -= 1 nums[i], nums[j] = nums[j], nums[i] nums[i + 1:] = reversed(nums[i + 1:])
Custom comparators
Whenever you write a custom sort, you are defining a comparison order that is a variant of lexicographic order with a different ranking for the elements. The custom comparator bugs that fail in interviews almost always trace back to either getting the ordering wrong or applying lexicographic comparison when numeric comparison was needed.
Encoding a multi-key sort as a tuple sort gets you the lexicographic behavior for free. Python and Java both compare tuples lexicographically by default, so sorting pairs (value, index) gives you value-first ordering with index as the tiebreaker without writing any comparator at all.
items = [(3, 2), (1, 5), (3, 0), (1, 2)] sorted(items) # [(1, 2), (1, 5), (3, 0), (3, 2)] # Sorted by first element; ties broken by second element
This shows up in sweep-line algorithms, interval merging, and any problem where you sort events with tie-breaking rules. It is one of those tricks that looks like a hack until you realize it is actually principled.
Prefix search on sorted strings
If you have a sorted list of strings, binary search finds all strings with a given prefix in O(m log n) time, where m is the prefix length. Search for the prefix as the lower bound, and the prefix with its last character incremented as the upper bound.
import bisect words = ["apple", "application", "apply", "banana", "band"] lo = bisect.bisect_left(words, "app") hi = bisect.bisect_left(words, "apq") # 'q' = chr(ord('p') + 1) print(words[lo:hi]) # ['apple', 'application', 'apply']
This works because the list is already in lexicographic order, so the prefix range is contiguous. One sorted array, zero trie overhead.
Tries Are Lexicographic Order Made Physical
A trie stores strings by spelling them out character by character from root to leaf. When you traverse a trie in depth-first order, visiting children in sorted character order, you get all stored strings in lexicographic order automatically.
A trie is a physical materialization of the lexicographic comparison function. The structure of the tree is the sorted order. You store in a way that sorting is already done. It is not a fancy data structure. It is the dictionary rule, turned into memory.
That is why tries answer "find all strings with prefix X" in O(m) time. The prefix corresponds to a path from the root. Everything in the subtree below that path shares the prefix, already in lexicographic order. No sort needed.
Five Rules That Actually Matter
Shorter strings are smaller when they are a prefix of the longer one. "app" < "apple" < "applet".
Character order by Unicode value: digits < uppercase < lowercase. '0' is 48, 'A' is 65, 'a' is 97.
Sorting numbers as strings gives wrong results for variable-length numbers. Always convert to integers first, or use a numeric key function.
String comparison costs O(m), not O(1). Sorting n strings of average length m costs O(mn log n).
Tuple comparison is lexicographic by default in Python and Java. Use it for multi-key sort without writing a custom comparator.
Tries enumerate strings in lexicographic order by construction. If a problem asks for the lexicographically smallest result, a trie might give it to you for free.
The tricky part in interviews is not knowing the definition. It is recognizing when a problem is secretly asking you to reason about lexicographic order before you ever write a comparator, then explaining that reasoning out loud while you code. That spoken explanation is what separates a "strong hire" from a "hire." It is the exact gap that SpaceComplexity is designed to close: voice-based mock interviews where you walk through the reasoning out loud, not just the code.