What Is a B+ Tree? The Leaf List That Powers Every Database Index

June 4, 20269 min read
dsaalgorithmsinterview-prepdata-structures
What Is a B+ Tree? The Leaf List That Powers Every Database Index
TL;DR
  • B+ tree stores all records in leaf nodes; internal nodes hold only routing keys and pointers, making them a pure index layer with no data.
  • Leaf nodes link together as a sorted linked list, so range queries cost O(log n + k) after a single O(log n) tree descent.
  • Branching factor (set by disk page size) controls tree height; with m=250, a billion-row table fits in 4 levels instead of 30.
  • B+ tree vs B-tree: the median key is copied up during a split, keeping leaves complete and the leaf-level linked list intact.
  • Clustered indexes store the full row in the leaf; secondary indexes store a pointer, adding a second traversal for every row in a range scan.
  • B+ tree vs hash map: hash maps give O(1) point lookups but can't answer range queries; B+ trees win for databases because range queries matter.

You run SELECT * FROM orders WHERE created_at BETWEEN '2024-01-01' AND '2024-03-31'. Your orders table has 400 million rows. It comes back in under 10 milliseconds. Your DBA didn't sacrifice a goat. No one's hardware is that special. That's a B+ tree doing exactly the one thing it was designed for.

B+ tree structure diagram showing internal routing nodes, leaf data nodes, and the linked list connecting leaves

Why a Binary Search Tree Fails Here

A BST finds any value in O(log n) comparisons. For one million records, that's about 20 steps. Sounds reasonable until you realize that in a database, each step might load a separate disk page. A disk read costs around 10 milliseconds. Twenty disk reads is 200 milliseconds per lookup, before anyone else touches the disk. Your p99 dashboard becomes a crime scene.

The problem isn't the algorithm. It's the shape. A BST is tall and skinny. Each node holds one key and two pointers. To cut disk reads, you need a tree that is short and wide. If a single node holds 250 keys instead of 1, a tree covering a million records needs 3 levels, not 20. That's the entire insight behind B+ trees. Short tree, fat nodes, few disk reads.

Two Node Types, Two Completely Different Jobs

A B+ tree has two kinds of nodes and they absolutely do not do the same thing.

Internal nodes are the routing layer. They hold keys and pointers to child nodes. They never hold actual data records. Their only job is to point you somewhere else. Think of them as bouncers who check your key and wave you to the right section. They know nothing about what's inside.

Leaf nodes hold everything real. Each leaf stores key-value pairs where the value is either the full row or a pointer to it. Every leaf also holds a pointer to the next leaf in key order. That linked list is what makes range queries cheap. The rest of the tree exists to find the first leaf. After that, the list takes over.

                    [20 | 40]
                   /    |    \
         [5|12|17]   [22|30|35]   [43|50|58]
              |           |            |
           (data)      (data)       (data)

Leaf linked list: [5|12|17] → [22|30|35] → [43|50|58]

The root [20 | 40] means: keys less than 20 live in the left subtree, keys from 20 to 39 live in the middle, keys 40 and above go right. No data lives in the root. No data lives in any internal node. Only at the leaves.

Every Search Takes Height Steps

To find key 30, start at the root [20 | 40]. Since 30 is between 20 and 40, descend to the middle child. You're now at leaf [22|30|35]. Scan the keys. Found it. Two nodes visited, two disk reads.

Every search in a B+ tree visits exactly one node per level. For a tree of height h, that's h disk reads total. With a branching factor of 250 (typical for an 8KB database page storing 8-byte keys), height 3 covers 250³ = 15 million records. Height 4 covers nearly 4 billion.

PostgreSQL uses 8KB pages by default, getting roughly 250 to 300 keys per internal node. MySQL InnoDB uses 16KB pages. Both end up with the same practical result: a production B+ tree for a billion-row table sits at height 4. Four disk reads, not 30.

Range Queries Are Just a Linked List Walk

Point lookups are fine. Range queries are where B+ trees genuinely earn their keep.

Say you want all keys between 22 and 50. Traverse the tree to find the leaf containing 22. That's O(log n) disk reads. You're now at [22|30|35]. Collect 22, 30, 35. Follow the next-pointer to [43|50|58]. Collect 43, 50. Stop at 58. Done.

Range queries cost O(log n + k), where k is the number of results. Once you land on the first leaf, you stop re-entering the tree. You just walk the list. It's almost embarrassingly simple once you see it.

A BST can't do this. It has no linked list between leaves. A range query on a BST requires an in-order traversal of the full tree structure, visiting every node in the range regardless of how many results you need. The B+ tree's linked list makes the range query nearly free after the initial descent.

B+ Tree vs. B-Tree: The Copy, Not the Move

The B-tree came first. It stores actual data at every level. Find a key in an internal node and you return immediately, without descending to a leaf. Sounds like a shortcut.

But that shortcut has costs. Data scattered across levels means range queries require a full in-order traversal since there's no leaf-level linked list. Internal nodes also share page space with data, which reduces their branching factor and makes the tree taller.

The B+ tree makes a different trade. All data goes to the leaves. Internal nodes become a pure index, holding only keys and pointers. Tighter packing per page, higher branching factor, shorter tree.

There's one detail that trips people up. When a B+ tree node overflows and splits, the median key is copied up to the parent. Not moved. It stays in the leaf. The leaf keeps a complete set of keys. In a B-tree, the median key is moved to the parent and disappears from the lower level. The copy rule is what keeps the leaf level complete and makes the linked list traversal correct.

Order Drives Height, Height Drives Disk Reads

In a B+ tree of order m:

  • Each internal node holds between ⌈m/2⌉ and m children.
  • Each leaf node holds between ⌈(m-1)/2⌉ and m-1 key-value pairs.
  • Non-root nodes stay at least half full. This is the rule that keeps the tree balanced after deletions.

The order m is determined by the disk page size. If your page is 8KB and each key-pointer pair costs 16 bytes, you fit roughly 500 entries per internal node. More entries per node means a shallower tree. Database engineers debate page sizes with the intensity of people who care too much about something that makes a very large difference.

The relationship between order and height is why page sizes matter. A higher branching factor directly translates to fewer disk reads per query. Everything else flows from this.

Why log_m Beats log_2

OperationTime
Point searchO(log_m n)
InsertO(log_m n)
DeleteO(log_m n)
Range query [a, b]O(log_m n + k)
SpaceO(n)

The log is base m, not base 2. For m = 250, log₂₅₀(1,000,000,000) ≈ 3.7. A BST over the same data needs log₂(1,000,000,000) ≈ 30. Both are logarithmic. The base changes everything once disk I/O enters the picture.

Space is O(n): each record lives in exactly one leaf, plus O(n/m) internal nodes for the routing layer. The routing layer is small relative to the leaf level for any reasonable branching factor.

Clustered Indexes Store Rows in the Leaves

This distinction comes up in system design interviews regularly, and it's worth having straight.

In a clustered index (what MySQL InnoDB uses for its primary key), the leaf nodes hold the actual row data. The B+ tree and the table are the same physical structure. Range scans on the primary key read data sequentially from disk.

In a non-clustered index (secondary indexes), the leaf nodes hold the indexed column plus a pointer back to the primary key. A lookup on a secondary index does two traversals: one to find the pointer, one to fetch the row. For a range query returning many rows, those secondary lookups become random disk reads instead of sequential ones.

This is why ORDER BY primary_key LIMIT 1000 outperforms ORDER BY created_at LIMIT 1000 on a large table with no composite index. The primary key scan is sequential. The secondary index scan is not. Your slow query log will confirm this at the worst possible time.

What Shows Up in Interviews

B+ trees appear in system design rounds, rarely in algorithm rounds. They surface right when you think you're just there to implement a queue or explain consistent hashing, and the interviewer asks "so how does the database actually store this?" Fun.

Three questions come up constantly:

  • "Why use a B+ tree instead of a hash map for a database index?" Hash maps give O(1) point lookups but can't answer range queries at all. B+ trees give O(log n) point lookups and O(log n + k) range queries. For a database that serves BETWEEN queries, the hash map is useless.
  • "Why not just use a BST?" Same asymptotic complexity, but a BST has a branching factor of 2. Every level of the tree is a disk read. B+ trees collapse dozens of BST levels into a few wide hops.
  • "When is a secondary index a performance problem?" When the query returns many rows and each one requires a separate pointer dereference to fetch the full row from the primary store.

For deeper index design, database indexing for system design interviews covers covering indexes, composite indexes, and when to avoid them. B-Tree and B+ Tree: A Billion Keys, Three Disk Reads works through the structural differences with more examples. The disk I/O cost comparison lives in B-tree vs Binary Search Tree.

If you're working through system design prep and want to practice explaining these tradeoffs out loud, which is how these questions actually get scored, try a mock interview at SpaceComplexity. Knowing the B+ tree internals on paper is one thing. Articulating them clearly under a 45-minute clock is the real test.

The Short List

  • All data lives in the leaves. Internal nodes are pure routing.
  • Leaf nodes link together as a list. Range queries traverse the list after one O(log n) descent.
  • Branching factor m (set by page size) determines tree height. For m = 250, a billion-row table fits in height 4.
  • The median key is copied up during a split, not moved. Leaves stay complete.
  • Range query cost is O(log n + k). Point query cost is O(log n). Hash maps can't do range queries.
  • Clustered indexes store the row in the leaf. Secondary indexes store a pointer. Pointer chasing is slow for large range scans.

Further Reading