B-Tree and B+ Tree: A Billion Keys, Three Disk Reads

- B-tree packs hundreds of keys per node to match disk page size, cutting a billion-key lookup from 30 disk reads to 3
- B+ tree stores data only in leaves and links them for sequential range scans, which is why every major database uses it
- Minimum degree t controls the whole tree: nodes hold t-1 to 2t-1 keys, height is O(log_t n)
- Proactive splitting on insert and proactive merging on delete keep the tree balanced in a single downward pass
- InnoDB's 16 KB pages hold roughly 1,000 keys per node, keeping most billion-row tables at 3 to 4 levels deep
- Range queries cost O(log n + k) in a B+ tree: find the start leaf, then walk the linked list
Your database has a billion rows. You run SELECT * FROM users WHERE id = 7891234. The answer comes back in milliseconds. Not because your SSD deserves a raise, but because the index structure touched exactly three disk pages to find that row out of a billion.
The B-tree is the data structure behind nearly every database index, every modern file system, and most key-value stores on the planet. Bayer and McCreight invented it in 1970 at Boeing Research Labs to solve a specific problem: how do you search a dataset too large to fit in memory, when every disk read costs orders of magnitude more than a memory access?
What happens when you skip the B-tree chapter.
Picture a balanced search tree where each node holds not one key but hundreds, sized to match exactly one disk page. Instead of a branching factor of 2 (like a binary search tree), you get a branching factor of 500 or 1,000. That collapses a tree that would be 30 levels deep down to 3 or 4 levels. Three levels means three disk reads. That is the entire trick.
Reach for a B-tree when you need sorted data on disk, range queries, or ordered iteration over a dataset that does not fit in memory. The B+ tree variant (the one databases actually use) adds fast sequential scans across ranges of rows.
What does the "B" stand for?
Nobody knows. Bayer and McCreight never said. Boeing, balanced, broad, bushy, and Bayer have all been suggested. McCreight once noted they were working at Boeing and could not use the name without talking to the lawyers. Fifty-six years later, the mystery is part of the folklore. It is the most famous unsolved naming problem in computer science, beating out whatever "GNU" recursively stands for.
How a B-tree works internally
A binary search tree stores one key per node. Every comparison means following a pointer, which on disk means a new page read. A disk seek takes 5 to 10 milliseconds. A RAM access takes about 100 nanoseconds. That is a 50,000x to 100,000x gap. You could fly from New York to London in the time it takes disk I/O to catch up with RAM. For a billion keys, a perfectly balanced BST has height log2(10^9) = 30. That is 30 disk reads per lookup, or 150 to 300 milliseconds. Unacceptable.
A B-tree fixes this by packing hundreds of keys into each node, so each node fills exactly one disk page. Instead of one comparison per disk read, you get hundreds. The branching factor explodes, and the height collapses.
Same billion keys. One tree needs 30 disk reads, the other needs 3.
What lives inside each node
Every B-tree node contains:
- An array of keys in sorted order
- An array of child pointers (one more than the number of keys)
- A count n of how many keys the node currently holds
- A boolean leaf flag
If a node has keys [10, 20, 30], it has four child pointers: one for keys less than 10, one for keys between 10 and 20, one for keys between 20 and 30, and one for keys greater than 30. The keys partition the search space, and the child pointers point to subtrees that cover each partition.
Three keys, four exits. Each child pointer leads to a subtree covering its labeled range.
One number controls the whole tree
Every B-tree has a parameter t called the minimum degree (CLRS notation). Some textbooks use Knuth's convention, defining a B-tree of "order m" where m is the maximum number of children (so m = 2t). Same structure, different label. Two textbook authors walk into a bar, argue about notation, leave with the same data structure. This article uses the CLRS convention because it makes the invariants cleaner. This single number t dictates the shape:
- Every node holds at most 2t - 1 keys
- Every non-root node holds at least t - 1 keys
- Every internal non-root node has at least t children and at most 2t children
- The root can have as few as 1 key (or 0 if the tree is empty)
- All leaves sit at the same depth. Always.
When t = 2, you get a 2-3-4 tree (each node has 2, 3, or 4 children). When t = 3, nodes hold 2 to 5 keys. When t = 1000 (a realistic value for disk-based trees), nodes hold 999 to 1999 keys.
Why every leaf is at the same depth
This property keeps B-trees balanced with zero effort. You never rotate nodes or recolor anything. If you have ever implemented a red-black tree and questioned your career choices, this is where you start appreciating B-trees. The tree grows and shrinks only at the root. When a split propagates all the way up and the root itself splits, a new root is created and the height increases by one. Every leaf moves down one level simultaneously. When merges cascade to the root and the root empties, the root is removed and the height decreases by one. The tree is always perfectly balanced because it only changes height at the top.
How nodes map to disk pages
B-trees earn their keep right here. A disk page (also called a block) is the smallest unit the operating system reads from disk. Typical sizes are 4 KB, 8 KB, or 16 KB. Reading one byte costs the same as reading the whole page, because the hardware fetches the entire block. You are paying for the whole burrito whether you eat it or not, so you might as well stuff it with keys.
A B-tree node is sized to fill exactly one disk page. With MySQL InnoDB's 16 KB pages and 8-byte integer keys plus 8-byte child pointers:
- Each key-pointer pair: 16 bytes
- Keys per node: roughly (16384 - overhead) / 16, which gives around 1,000
- So t is around 500, and nodes hold 500 to 1,000 keys
With a fanout of roughly 1,200 per internal page and 468 records per leaf page (based on Jeremy Cole's InnoDB analysis with 4-byte integer keys):
- Height 2 (root + leaves): ~563,000 rows
- Height 3 (root + 1 internal level + leaves): ~677 million rows
- Height 4 (root + 2 internal levels + leaves): ~814 billion rows
One disk page, one node, a thousand keys. That is why the tree is so short.
Most production tables with hundreds of millions of rows fit in 3 to 4 levels. The root page is almost always cached in the buffer pool. So a lookup in a billion-row table needs about three disk reads (four levels, root cached). A BST would need 30.
Why the height stays logarithmic
For n keys and minimum degree t, the height h satisfies:
h ≤ log_t((n + 1) / 2)
The proof counts the minimum number of keys a tree of height h can hold. The root has at least 1 key, so at least 2 children. Every other internal node has at least t children. So the number of nodes at each level is:
- Level 0: 1 node (root)
- Level 1: at least 2 nodes
- Level 2: at least 2t nodes
- Level 3: at least 2t^2 nodes
- Level i (for i >= 1): at least 2t^(i-1) nodes
Every non-root node has at least t - 1 keys. Adding up:
n >= 1 + (t-1) * sum(2 * t^(i-1), i=1..h)
= 1 + 2(t-1) * (t^h - 1) / (t - 1)
= 1 + 2(t^h - 1)
= 2t^h - 1
So t^h <= (n + 1) / 2, which gives h <= log_t((n + 1) / 2).
Plug in real numbers: n = 1 billion, t = 500. That gives h <= log_500(500,000,001) = about 3.2. The tree is at most 4 levels deep.
B+ tree: the variant databases actually use
Most databases do not use a plain B-tree. They use a B+ tree, which makes one structural change that transforms range query performance. Think of it as B-tree's more popular sibling who gets all the production jobs.
In a B+ tree, internal nodes store only keys for routing. All actual data lives in the leaf nodes. Internal nodes are pure indexes. They tell you which direction to go, but they do not hold the row data themselves.
This has three consequences:
-
Internal nodes are smaller. Without data pointers or row payloads, internal nodes can pack more keys per page. More keys per page means a higher branching factor, which means a shorter tree.
-
Every search goes to a leaf. In a plain B-tree, you might find the key in an internal node and stop. In a B+ tree, you always descend to a leaf. This makes performance predictable.
-
Leaves form a linked list. Each leaf node has a pointer to the next leaf and the previous leaf. To answer a range query like
WHERE age BETWEEN 25 AND 35, you find the leaf containing 25, then walk the linked list until you pass 35. No need to bounce back up to internal nodes.
B-Tree vs B+ Tree: what changes
Left: B-tree finds 30 in the root and stops. Right: B+ tree always descends to a leaf, but gains a linked list.
In a B-tree with keys [10, 20, 30, 40, 50]:
- An internal node might hold key 30 along with its associated data
- Search for 30 could stop at an internal node
In a B+ tree with the same keys:
- Internal nodes hold copies of keys (like 30) purely for routing
- The actual record for key 30 lives in a leaf
- All leaves are chained: [10, 20] -> [30, 40] -> [50]
When you split a leaf in a B+ tree, you copy the middle key up to the parent (the leaf keeps its copy). When you split an internal node, you move the middle key up (same as a regular B-tree). Get this wrong and your tree silently loses data. Copy vs move. Two words. Entire databases riding on which one you pick.
Why databases choose B+ trees
Range queries dominate real workloads. "Give me all orders from last month" is a range scan. "Find users with IDs 1000 to 2000" is a range scan. Pretty much any WHERE clause with BETWEEN, >, or < is a range scan. The leaf linked list in a B+ tree turns these into sequential reads, which are orders of magnitude faster than random reads on both spinning disks and SSDs.
MySQL InnoDB, PostgreSQL, MongoDB WiredTiger, and nearly every relational database engine uses B+ trees for indexing. SQLite uses B+ trees for table storage (the data rows) and plain B-trees for secondary indexes. The file systems NTFS, HFS+, XFS, and Btrfs all use B-tree variants as well.
Each engine adds its own twist. InnoDB's clustered index stores the full row data in the leaf nodes of the primary key's B+ tree, so a primary key lookup returns the row without a second I/O. PostgreSQL uses a Lehman-Yao B-link tree variant where every node has a right-link pointer to its sibling, letting readers detect and recover from concurrent page splits without holding locks. SQLite uses a B*-tree optimization that redistributes keys between siblings before splitting, keeping nodes roughly 2/3 full instead of 1/2.
What every B-tree operation costs (the cheat sheet)
| Operation | Time complexity | Disk I/Os | Space | Notes |
|---|---|---|---|---|
| Search | O(log n) | O(log_t n) | O(1) iterative | Binary search within each node |
| Insert | O(log n) | O(log_t n) | O(t) for split | Proactive splitting, single downward pass |
| Delete | O(log n) | O(log_t n) | O(t) for merge | Proactive merging, single downward pass |
| Traverse | O(n) | O(n / t) | O(h) stack | Visit every key in sorted order |
| Min / Max | O(log n) | O(log_t n) | O(1) | Follow leftmost or rightmost pointers |
| Range query (B+ tree) | O(log n + k) | O(log_t n + k/t) | O(1) | k = number of results |
| Split child | O(t) | O(1) | O(1) | Copy t-1 keys to new node |
| Merge children | O(t) | O(1) | O(1) | Combine two (t-1)-key nodes |
| Bulk load | O(n) presorted | O(n / t) | O(n) | Build bottom-up from sorted data |
The time complexities above use binary search within each node, which gives O(log t) per node and O(log t * log_t n) = O(log n) overall. With linear scan within each node (as Rust's BTreeMap does for its small B=6 nodes), the per-node cost is O(t) and the total is O(t * log_t n), which is still O(log n) in big-O terms since t is a constant for any given tree.
Search: walk down, compare, follow
Search starts at the root, like everything in life. At each node, you scan the keys to find where the target fits. If you find an exact match, you are done (in a B-tree) or you keep descending to the leaf (in a B+ tree). If you do not find it, you follow the child pointer that corresponds to the gap where the key would be.
BTreeSearch(node, key):
i = 0
while i < node.n and key > node.keys[i]:
i += 1
if i < node.n and key == node.keys[i]:
return (node, i) // found
if node.leaf:
return null // not in tree
return BTreeSearch(node.children[i], key)
The search visits exactly h nodes, where h is the height. Each node visit costs one disk read. With binary search inside each node (the keys are sorted), the in-memory work per node is O(log t). Total: O(log t * log_t n) = O(log n).
For disk I/O, only the height matters: O(log_t n) = O(log n / log t). A large t crushes the I/O count. With t = 500 and a billion keys, that is about 3 reads instead of 30.
Insert: split on the way down
The CLRS insertion algorithm uses a single downward pass with proactive splitting. The idea: before descending into any full child, split it first. This guarantees the parent always has room for the promoted median key. It is like widening the doorways before carrying a couch through the house.
Here is the procedure:
-
If the root is full (has 2t - 1 keys), create a new empty root, make the old root its child, and split the old root. The tree grows taller by one level.
-
Walk down from the root toward the appropriate leaf. At each internal node, find the child you would descend into. If that child is full, split it before descending. The split pushes the median key up into the current node (which is guaranteed to have room, because we split it on the way down if it was full).
-
When you reach a leaf, insert the key in sorted position. The leaf is guaranteed to have room.
Splitting a full node: take a node with 2t - 1 keys. The median key (at index t - 1) moves up to the parent. The left t - 1 keys stay in the original node. The right t - 1 keys move to a new node. If the node is internal, the children are redistributed as well. Think of it like a cell dividing. One becomes two, and the median key floats up to the parent like a helium balloon.
The median key (30) moves up. The remaining keys split evenly into two children.
Why proactive splitting works: when you are at node x and about to descend into child c, you check if c is full. If so, you split c. This pushes one key up into x. But x was not full (because we ensured that on the previous step). So x has room. The invariant propagates all the way down.
The alternative (split on the way up, after insertion causes overflow) requires backtracking. That is harder to implement for disk-based trees where you want to minimize seeks. The proactive approach is a single downward pass.
Delete: borrow or merge on the way down
Deletion is the most complex B-tree operation. If insertion is widening doorways, deletion is the couch going back the other way through a house that has been remodeled since. The CLRS algorithm uses the same proactive strategy as insertion but in reverse: before descending into any child with the minimum number of keys (t - 1), ensure it has at least t keys by borrowing from a sibling or merging.
Three cases arise:
Case 1: Key is in a leaf. Remove it directly. The leaf is guaranteed to have at least t keys (because of the proactive fattening on the way down), so removing one key leaves at least t - 1, which satisfies the minimum.
Case 2: Key is in an internal node x. You cannot just remove it because the key acts as a separator between two child subtrees. Three subcases:
-
2a: The child y preceding the key has at least t keys. Find the predecessor of the key (the rightmost key in y's subtree). Replace the key with the predecessor, then recursively delete the predecessor from y's subtree.
-
2b: The child z following the key has at least t keys. Symmetric: find the successor, replace, and delete.
-
2c: Both y and z have exactly t - 1 keys. Merge: move the key down from x into y, append all of z's keys and children to y, and free z. Then recursively delete from the merged node.
Case 3: Key is not in the current node, and we need to descend into child c_i. If c_i has only t - 1 keys:
-
3a: An immediate sibling of c_i has at least t keys. Rotate: move a key from the parent down into c_i, and move a key from the sibling up into the parent. This gives c_i one more key without violating any invariants.
-
3b: All immediate siblings of c_i have exactly t - 1 keys. Merge c_i with one sibling: pull a key from the parent down into the merged node. The parent loses one key and one child, but the proactive approach ensures the parent had at least t keys before this step.
All of this happens in a single downward pass. The height decreases only when the root ends up with zero keys after a merge of its two children. Then the merged child becomes the new root.
Range query in a B+ tree: find, then scan
This is where B+ trees shine. To find all keys in [lo, hi]:
- Search for lo. This takes O(log_t n) and lands you at a leaf.
- Walk the leaf linked list, collecting keys, until you pass hi.
Step 2 is sequential I/O. Each leaf node holds O(t) keys, so reading k results costs O(k/t) disk reads. The total cost is O(log_t n + k/t), which is optimal. No tree traversal can do better than finding the start point plus reading the output.
Step 1: find the starting leaf. Step 2: walk the linked list until you overshoot. Two steps, done.
Why splits are cheap over time
Insertions trigger splits, which cost O(t) each (copying keys to a new node). But splits are rare. Over n insertions into an initially empty tree, the total number of splits is at most n / (t - 1). Each split costs O(t). So the total split cost is O(n * t / (t - 1)) = O(n). Amortized, each insertion pays O(1) for splits on top of the O(log n) search cost.
The argument: every split creates a new node with t - 1 keys. To fill that node back to 2t - 1 keys (making it eligible for another split), you need t more insertions into that node's subtree. So each split is paid for by the t insertions that preceded it.
One Structure, Every Language
Below is a complete, runnable B-tree implementation in each language. Every implementation supports search, insert, and in-order traversal. The minimum degree t is configurable. Deletion is omitted for space (it roughly triples the code) but follows the algorithm described above. If you want to implement deletion as an exercise, we salute your bravery.
class BTreeNode: def __init__(self, t: int, leaf: bool = True): self.t = t self.leaf = leaf self.keys: list[int] = [] self.children: list['BTreeNode'] = [] class BTree: def __init__(self, t: int): self.t = t self.root = BTreeNode(t) def search(self, key: int, node: 'BTreeNode | None' = None) -> tuple['BTreeNode', int] | None: node = node or self.root i = 0 while i < len(node.keys) and key > node.keys[i]: i += 1 if i < len(node.keys) and key == node.keys[i]: return (node, i) if node.leaf: return None return self.search(key, node.children[i]) def insert(self, key: int): root = self.root if len(root.keys) == 2 * self.t - 1: new_root = BTreeNode(self.t, leaf=False) new_root.children.append(self.root) self._split_child(new_root, 0) self.root = new_root self._insert_non_full(self.root, key) def _split_child(self, parent: BTreeNode, i: int): t = self.t full_child = parent.children[i] new_child = BTreeNode(t, leaf=full_child.leaf) new_child.keys = full_child.keys[t:] if not full_child.leaf: new_child.children = full_child.children[t:] full_child.children = full_child.children[:t] median = full_child.keys[t - 1] full_child.keys = full_child.keys[:t - 1] parent.children.insert(i + 1, new_child) parent.keys.insert(i, median) def _insert_non_full(self, node: BTreeNode, key: int): if node.leaf: i = len(node.keys) - 1 node.keys.append(0) while i >= 0 and key < node.keys[i]: node.keys[i + 1] = node.keys[i] i -= 1 node.keys[i + 1] = key else: i = len(node.keys) - 1 while i >= 0 and key < node.keys[i]: i -= 1 i += 1 if len(node.children[i].keys) == 2 * self.t - 1: self._split_child(node, i) if key > node.keys[i]: i += 1 self._insert_non_full(node.children[i], key) def traverse(self, node: 'BTreeNode | None' = None) -> list[int]: node = node or self.root result = [] for i in range(len(node.keys)): if not node.leaf: result.extend(self.traverse(node.children[i])) result.append(node.keys[i]) if not node.leaf: result.extend(self.traverse(node.children[len(node.keys)])) return result
What problems B-trees solve
B-trees are the right tool when the data needs to stay sorted and it lives on a medium where random access is expensive compared to sequential access. If your data fits in memory and you do not need ordering, use a hash table and enjoy your life.
Database indexes. This is the primary use case. Every CREATE INDEX in MySQL, PostgreSQL, or SQLite builds a B+ tree. The index maps column values to row locations, enabling O(log n) lookups, efficient range scans, and ordered iteration. Without a B-tree index, every query is a full table scan.
File system metadata. NTFS stores its Master File Table using B+ trees. HFS+ (macOS before APFS) used B-trees for its catalog and extents files. Btrfs (the Linux file system) is named after B-trees. These file systems need to map file names to disk locations for billions of files.
Key-value stores. Any storage engine that supports range queries likely uses a B-tree or LSM-tree variant. RocksDB, LevelDB, and WiredTiger all use B-tree-inspired structures. When you need "give me all keys between A and Z" in addition to point lookups, a hash table just stares at you blankly. A B-tree can help.
In-memory ordered maps. Rust's standard library BTreeMap uses a B-tree with B=6 for its ordered map. It performs linear search within each small node, which turns out to be faster than binary search due to cache locality. When you need sorted iteration or range queries in memory, a B-tree often beats a red-black tree in practice despite having the same O(log n) complexity, because it makes better use of cache lines.
External sorting and merging. When you sort a dataset too large for memory, B-trees organize the merge passes. The branching factor determines how many sorted runs you can merge simultaneously, and each disk read brings in a full page of sorted data.
Recognizing the B-tree pattern in interviews
B-trees rarely appear as the answer to a LeetCode problem. Nobody is going to ask you to implement B-tree deletion on a whiteboard (and if they do, leave). They show up in system design interviews, storage engine discussions, and anywhere you need to reason about disk I/O.
Signal 1: "design an index" or "range queries on disk"
Problem: Design a system that stores 100 million user records and supports both point lookups by ID and range queries like "find all users with IDs between 5000 and 6000."
A hash table gives you O(1) point lookups but cannot do range queries. A sorted array gives you O(log n) search and range queries, but insertions are O(n). A balanced BST gives you O(log n) everything, but on disk each node is a separate page read.
The B-tree is almost always the answer. It gives O(log n) search, O(log n) insert, and O(log n + k) range queries. On disk, the height is O(log_t n) where t is hundreds, so the actual number of disk reads is 3 or 4 for any dataset size. The leaf linked list (B+ tree variant) makes range scans sequential.
You would explain: "I will use a B+ tree index on the user ID column. With 16 KB pages and 8-byte keys, each internal node holds about 1,000 keys. For 100 million records, the tree is 3 levels deep. Point lookups need 3 disk reads. Range queries find the start in 3 reads, then scan leaves sequentially."
Signal 2: "sorted data with frequent updates"
Problem: You are building an autocomplete service. Users type prefixes, and you need to return the top results that start with that prefix. The dataset has 50 million entries and changes frequently as new entries are added.
A trie handles prefix matching, but if you also need sorted results or range iteration, a B-tree on the string keys gives you prefix range queries naturally. Search for the prefix as a lower bound, then scan forward through the leaf list.
The insight is that a B+ tree range query [prefix, prefix + 1) is exactly a prefix search. The leaf linked list gives you all matching entries in sorted order, and you stop when the prefix no longer matches.
You would explain: "I will store entries in a B+ tree keyed by the string value. A prefix search becomes a range scan: find the first leaf where the key >= prefix, then walk the leaf list until keys no longer start with the prefix. Insertions are O(log n). The tree stays balanced automatically."
Quick reference (pin this somewhere)
- A B-tree packs hundreds of keys per node, matching the disk page size, to minimize I/O
- Minimum degree t controls everything: nodes hold t-1 to 2t-1 keys, have t to 2t children
- Height is O(log_t n). For a billion keys with t=500, that is 3 to 4 levels
- All leaves are always at the same depth. The tree only grows or shrinks at the root
- Insertion uses proactive splitting: split full nodes on the way down, single pass
- Deletion uses proactive fattening: ensure children have enough keys on the way down
- B+ trees store data only in leaves and link leaves together for range scans
- Every major database (MySQL, PostgreSQL, SQLite, MongoDB) uses B+ tree indexes
- Rust's standard library
BTreeMapis the only major stdlib B-tree implementation (B=6) - For in-memory use, B-trees beat red-black trees in practice due to cache locality
If you want to practice explaining data structure trade-offs like these in a live setting, SpaceComplexity runs voice-based mock interviews where you walk through your design decisions in real time, with rubric-based feedback on both your solution and how you communicate it.
For more on how ordered data structures compare, see our deep dive on AVL trees vs red-black trees and the hash map complexity analysis that explains why hash tables win on point lookups but lose on ordering. If you want to understand why arrays beat linked lists in traversal despite the same big-O, the cache locality deep dive explains the hardware story that B-trees exploit.