Array vs Linked List Performance: Same Big-O, Completely Different Speed

- Cache locality determines real-world speed far more than Big-O notation when traversing data structures
- A CPU loads 64-byte cache lines, delivering 15 free integer reads alongside every single array access
- The hardware prefetcher detects sequential stride in arrays and pre-loads future cache lines before you need them
- Pointer chasing in linked lists creates a serial dependency chain that defeats the prefetcher entirely, causing one DRAM stall per node
- Stroustrup's benchmark: sorted insertion into a vector beat a linked list by 80x despite O(n) shifts versus O(1) insertion
- Linked lists win only when you already hold a direct pointer to a node and need O(1) deletion without scanning
Both structures traverse in O(n). One is routinely 5 to 10 times faster. The Big-O is identical. The hardware has other opinions.
A sequential scan over a million integers takes roughly 1 millisecond in a contiguous array. The same scan over a linked list of identical values takes 5 to 10 milliseconds. In extreme cases the gap is wider. The algorithmic complexity gives you no warning this is coming. Once you understand cache locality and what the CPU actually does with memory, the gap becomes completely obvious. And you stop reaching for linked lists by default.
The Memory Hierarchy Is Not Flat
Most programmers carry a mental model where "memory" is one thing. The CPU reads it, writes it, done. Cute model. Wrong model.
The real picture has four layers, each with dramatically different speeds.
The memory hierarchy. Each layer is 5 to 100x slower than the one above it.
| Level | Latency | Typical capacity |
|---|---|---|
| L1 cache | ~2 ns | 32-64 KB per core |
| L2 cache | ~5-10 ns | 256 KB-1 MB per core |
| L3 cache | ~15 ns | 4-64 MB shared |
| DRAM | ~80-100 ns | Gigabytes |
The gap between L1 and DRAM is 50 to 100 times. If your data is already in L1 cache, you get an answer in 2 nanoseconds. If it is in main memory, your CPU waits 80 to 100 nanoseconds, idle, stalled. Which tier your data lives in when you need it is the entire performance story.
The engineer's guide to bedroom organization.
The CPU Never Fetches One Byte
One fact changes everything: when your CPU reads a single variable from memory, it does not fetch just that variable. It fetches a 64-byte block called a cache line.
On every modern x86-64 processor, memory moves in 64-byte chunks. Always. Even if you read one int.
A 64-byte cache line holds:
- 16 ×
int32(4 bytes each) - 8 ×
int64or pointer (8 bytes each) - 4 × 16-byte structs
You asked for one integer. The CPU brought 15 of its friends.
When you access arr[0], the CPU loads arr[0] through arr[15] in one operation. Access to arr[1] is already in L1 cache. So are arr[2] through arr[15]. You asked for one element and got 15 for free. That is not a bonus feature. That is just how the hardware works.
Arrays and the Prefetcher Are Allies
Sequential array access does something even better: it triggers the hardware prefetcher.
Every modern CPU contains dedicated prefetch circuitry that watches your memory access pattern. When it sees a stride (you read address X, then X+64, then X+128, same gap each time), it starts fetching ahead. Like a very attentive waiter who noticed you are about to finish your drink and already has the next one coming. By the time your loop reaches arr[16], the cache line covering elements 16 through 31 is already sitting in L1, fetched proactively while the CPU was processing earlier elements.
The prefetcher detects the stride and loads the next cache line before you ask. The loop barely touches DRAM.
total = 0 for x in arr: # arr is a contiguous block of integers total += x
For a million-element array, this loop generates about 62,500 cache line fetches (1,000,000 elements / 16 per line). The prefetcher handles most of them before the loop needs them. You pay the DRAM penalty once at the start, then run almost entirely out of L1 cache.
Linked List Nodes Scatter Across the Heap
A linked list node stores the value plus at least one pointer to the next node. In C that is:
struct Node { int value; // 4 bytes // 4 bytes padding (alignment requirement) Node* next; // 8 bytes on 64-bit }; // Total: 16 bytes per node
In Python or Java the overhead is higher. Python wraps each object in a 28-byte structure. Java adds a 16-byte object header before you touch any data. A "linked list of integers" is really a chain of heap objects that happen to contain integers. They just happen to contain integers. Do not be fooled.
These nodes were allocated at different moments during the program's lifetime. The heap allocator placed them wherever free space existed. Node 1 is at 0x1A00. Node 2 at 0x4F80. Node 3 at 0x2230. No two adjacent nodes are likely to share a cache line. One cache miss per node, in the worst case.
Same logical sequence. Very different physical arrangement.
Pointer Chasing Serializes Everything
To traverse a linked list you follow pointers one at a time:
- Load node 1 from address
0x1A00. - Read its
nextfield: it says0x4F80. - Load node 2 from address
0x4F80. - Read its
nextfield: it says0x2230. - Repeat.
Each load depends on the result of the previous load. You cannot request node 2 until you finish loading node 1, because you do not know where node 2 lives until you read node 1's pointer. The loads form a strict serial chain. The CPU cannot issue them in parallel.
The CPU is literally idle between loads. Just sitting there, waiting for the address to arrive from DRAM. 80 to 100 nanoseconds of nothing, per node.
This is the hardware prefetcher's failure mode. It monitors memory access addresses and looks for a stride. Your access sequence looks like: 0x1A00, 0x4F80, 0x2230, 0x6B10. No stride. No pattern. The prefetcher gives up. Every node arrives cold from DRAM.
Linked list: one DRAM wait per node. Array: one DRAM wait per 16 nodes.
Every pointer dereference is a potential 80 to 100 nanosecond stall. Arrays pay that penalty once per 16 elements. Linked lists pay it once per element. At a million elements, the difference in DRAM fetches is roughly 62,500 versus 1,000,000. That factor of 16 maps almost directly to the 5 to 10x real-world speedup you measure.
The Paradox That Stroustrup Demonstrated
In 2012, Bjarne Stroustrup (creator of C++) gave a keynote at GoingNative that made people uncomfortable. He ran a benchmark: sorted insertions into a list versus a vector.
The algorithmic argument for the list is airtight. Insertion into a sorted linked list is O(1) at the insertion point. Insertion into a sorted vector is O(n) because every element above the insertion point shifts by one. By the logic we all learned, the list should win decisively as n grows.
The actual benchmark for 500,000 elements: the linked list took roughly 1 hour 47 minutes. The vector took 1 minute 18 seconds.
Let that land. An hour and forty-seven minutes versus one minute eighteen seconds. An 80-fold gap, in favor of the "worse" algorithm.
The reason is that finding the insertion point dominates the runtime, not the insertion itself. To find where to insert in a sorted structure, you traverse it. And the traversal of a vector is 5 to 10 times faster than the traversal of a linked list, for exactly the reasons above. After you find the insertion point, yes, the vector shifts elements. But shifting is a bulk memmove over contiguous memory. It is embarrassingly cache-friendly. The vector traversal is fast. The insertion is fast. The linked list traversal is the bottleneck, and it is not close.
The O(n) algorithm won because O(1) on a cache-hostile structure is slower than O(n) on a cache-friendly one at any realistic size.
When Linked Lists Actually Win
Linked lists are not useless. They have exactly one legitimate job: O(1) deletion when you already hold a direct pointer to the node.
The LRU cache is the textbook example. Every cache access requires promoting the touched node to the front of the list. If you have a pointer to the node (from your hash map), the move is O(1): update four pointers, done. No traversal at all. The pointer-chasing problem only matters when you are scanning; with direct node access you skip the scan entirely.
That is why the standard LRU implementation pairs a hash map with a doubly-linked list. The hash map gives you O(1) access to the node. The list gives you O(1) re-ordering once you have the pointer. The traversal never happens. (Here is a full breakdown of why those two structures need each other.)
For everything else, the default should be an array. Even the O(1) amortized appends of a dynamic array come with far better cache behavior than a linked list. Middle-of-array insertions that look expensive on paper are often faster in practice than the linked list alternative, because the copy is cache-friendly and the find is brutally fast.
What This Means in Interviews
Interviewers rarely ask directly about cache lines. But this knowledge reshapes your answers to questions that look like pure Big-O questions.
"Which data structure would you choose for this collection?" The answer is almost always array (or dynamic array), unless you specifically need O(1) pointer-based deletion. The burden of proof is on the linked list.
"Why is your solution efficient?" Saying "it's O(n)" is table stakes. Saying "and it accesses memory sequentially, so the prefetcher keeps everything in L1 cache" is the answer that signals you understand the constant factor, not just the growth rate. This is the same principle that explains why hash map O(1) lookups have a real cost despite looking "free" in Big-O notation. The constant matters. Memory layout is what the constant measures.
The first answer passes. The second one gets remembered.
SpaceComplexity runs voice-based mock interviews that push exactly this: not just "what is the complexity?" but "why would you choose this structure over that one, and what does the hardware think?"
The Short Version
- CPUs fetch 64 bytes at a time. Access one
int, get 15 more for free in the same cache line. - The hardware prefetcher detects sequential stride patterns and loads future cache lines before you ask.
- Linked list nodes scatter across the heap at allocation time. No two adjacent nodes are likely in the same cache line.
- Pointer chasing creates a serial dependency chain: you cannot issue load N+1 until load N completes and reveals the next address. The prefetcher cannot help.
- Each pointer dereference risks an 80 to 100 ns DRAM stall. Sequential array access costs roughly one DRAM fetch per 16 elements.
- Real benchmarks show 5 to 10x traversal speedups for arrays, and 80x for sorted insertion, because the traversal advantage swamps everything else.
- Linked lists win only when you hold a direct pointer to a node and need O(1) deletion without scanning.