What Is Memory Locality? The Cache Property Behind Real Speed

June 6, 20269 min read
dsaalgorithmsinterview-prepdata-structures
What Is Memory Locality? The Cache Property Behind Real Speed
TL;DR
  • Spatial locality means adjacent memory accesses share a 64-byte cache line, making array traversal nearly free after the first miss.
  • Temporal locality means recently-used addresses stay warm in cache, benefiting hot variables and small memoization tables.
  • A cache miss to DRAM costs ~50x an L1 cache hit, which explains why two O(n) algorithms can have 5–10x real-world speed differences.
  • Arrays beat linked lists on traversal by 5–10x because contiguous memory lets the hardware prefetcher load the next cache line before you ask.
  • Column-major matrix loops trigger near-constant cache misses; row-major loops fire almost none — the same principle behind columnar OLAP databases.
  • Interviewers ask this as "why did you pick an array?" — the answer isn't Big-O, it's spatial locality and prefetcher behavior.

Your algorithm is O(n). Your colleague's algorithm is also O(n). Their code finishes in 200ms. Yours takes a full second. You've checked the logic three times. It's correct. You've double-checked the Big-O. It's identical. You've even stared at it suspiciously on the train home.

The algorithm isn't the problem. Where your data lives in memory is.

Memory locality is the property of accessing memory addresses that are physically close together, or accessing the same address again shortly after the first read. The CPU's cache makes nearby, repeated accesses nearly free. Scattered, one-off accesses pay the full price of DRAM every time. When two O(n) solutions differ by 5-10x in real runtime, cache behavior is almost always why. That's also the complete answer to "why did you choose an array here?" in any performance conversation.

Your CPU Has a Very Good Short-Term Memory (By Design)

Modern CPUs execute billions of instructions per second. DRAM has roughly 80-100 nanoseconds of latency per access. If the CPU had to ask DRAM for every value it needed, it would spend most of its time staring at the ceiling. So CPUs cheat.

The cheat is a cache: a small, fast memory bank sitting between the CPU core and main memory. Think of it as how far you have to walk to get information. L1 is the sticky note on your monitor. L2 is the whiteboard across the room. L3 is the shelf in the hallway. DRAM is the filing cabinet in a basement you've visited exactly once, lost the combination to, and now pretend doesn't exist.

When you read a value, the CPU checks L1 first (~2ns). Miss there, it checks L2 (~5ns). Miss there, L3 (~17ns). Miss everywhere, it fetches from DRAM (~80-100ns).

CPU and memory hierarchy: L1, L2, L3 caches and DRAM with latency labels

A cache miss to DRAM costs roughly 50x an L1 hit. Every access that falls through to DRAM turns a 2ns read into something close to 100ns. The cache only helps when your access pattern cooperates with it.

The 64-Byte Windfall

When you read a single 4-byte integer, the CPU doesn't fetch just those 4 bytes. It fetches the entire 64-byte cache line containing that integer. On every modern x86-64 CPU, that's 64 bytes, which holds 16 adjacent 32-bit integers.

The next integer is already in the cache line. Free. The one after that. Also free. If you access data adjacent to what you just read, you get 15 more reads for nothing. This is spatial locality, and it's the reason arrays feel fast.

Arrays have near-perfect spatial locality. Elements live contiguously in memory, so looping through one means you're almost always reading already-cached data. The hardware prefetcher goes further: it detects the sequential stride and pre-loads the next cache line before you even ask for it. The CPU is literally reading ahead for you like an overachieving intern.

total = 0 for x in arr: # arr[0], arr[1], arr[2]... contiguous in memory total += x # one cache miss loads 16 ints; the next 15 are free

Linked Lists: Every Node a New Adventure

Linked lists have near-zero spatial locality. Each node holds a value and a pointer to the next node. Nodes are allocated separately, scattered across the heap wherever malloc found an empty slot. Following node.next means loading a random address that has almost certainly been evicted from cache since you last visited anything nearby.

You pay for DRAM on nearly every node. Worse: because each fetch must complete before the next pointer's address is known, the CPU can't pipeline or prefetch the reads. It can't read ahead because it doesn't know where ahead is. It just waits. This is sometimes called a serial load dependency or pointer-chasing chain. With arrays, the CPU can issue several memory requests simultaneously because the addresses are predictable. With linked lists, every node is a hostage situation: you don't find out where node N+1 is until node N arrives from DRAM.

node = head while node: total += node.val # random address, probably a cache miss node = node.next # another random address, another wait

Cache-aware benchmarks on million-element datasets consistently show arrays running 5-10x faster than linked lists. Stroustrup demonstrated in a widely-cited talk that a sorted vector beats a sorted linked list at almost any meaningful size, and the gap widens as you scale. Pointer chasing overwhelms any theoretical advantage long before you hit production traffic.

When the interviewer asks why your linked list traversal is 5x slower than an array in practice "All I know is how to reverse a linked list." Same energy when someone asks you to explain cache miss patterns from first principles.

The Matrix Test: Same Data, Two Very Different Days

Matrix operations make spatial locality easy to observe and measure. A row-major matrix stores matrix[i][0], matrix[i][1], matrix[i][2]... contiguously. Iterating row by row walks memory sequentially. Iterating column by column jumps an entire row width on every access.

# Cache-friendly: row by row for i in range(rows): for j in range(cols): total += matrix[i][j] # sequential, prefetcher loves this # Cache-hostile: column by column for j in range(cols): for i in range(rows): total += matrix[i][j] # jumps by one full row each step

For a large matrix, the column-major loop fires a cache miss on nearly every access while the row-major loop fires almost none. The gap on modern hardware is 5-10x, sometimes more. Same data. Same algorithm. Different address pattern.

This isn't just a toy example. NumPy defaults to row-major (C order) because of this, and most high-performance linear algebra code is written assuming C order. If you've ever seen a benchmark where switching loop order gave a 4x speedup with no algorithmic change, this is what happened. Column-oriented OLAP databases like ClickHouse, Redshift, and Parquet store one column's values contiguously so aggregate queries scan a single sequential stream instead of scattering across rows. The storage format is a locality decision, made deliberately, with this exact tradeoff in mind.

Temporal Locality: Hot Variables Stay Warm

The second flavor is temporal locality. Read the same address repeatedly in a short window and the CPU keeps it warm in L1, betting you'll ask again. Loop counters, running totals, array lengths. The CPU handles these automatically.

Memoization makes temporal locality explicit. First call: compute and store. Second call: cache hit, essentially free. But the memo table itself needs to fit in cache for this to actually help. A hundred-entry table stays warm. A table with millions of entries thrashes the cache constantly. You've traded recomputation cost for DRAM-fetch cost, and that tradeoff isn't obviously good at scale. Profile before assuming memoization is free.

What Interviewers Are Actually Asking

Locality shows up in coding interviews in a few reliable shapes.

"Why is your solution fast in practice?" When you pick an array over a linked list, the Big-O answer is often identical. The cache answer is what distinguishes a strong response: "Both are O(n), but the array walks contiguous memory. Each cache miss loads 16 elements at once, and the prefetcher loads the next batch before I ask. The linked list pays for DRAM on every node because pointer chasing prevents the CPU from pipelining reads."

"Compare these two implementations." When one chases pointers and the other does sequential reads, locality is the performance story. Name it and give the cost ratio. A DRAM access is roughly 50x slower than an L1 hit.

"Why did you use an array here instead of a linked list?" Same question, different wrapping. "Contiguous memory gives me spatial locality and prefetcher support. Pointer chasing eliminates both."

"Why are OLAP databases faster for analytics than Postgres?" This is the same question at the system design layer. Column-oriented storage concentrates one column's values in a single contiguous region, so a SELECT AVG(price) FROM orders scans just the price column sequentially. Row-oriented storage has every row's columns interleaved, so that same query scans the entire table to extract one field per row. Two storage models, same data, completely different cache behavior.

At performance-focused companies and in system design questions, this goes further. The columnar vs. row-oriented storage tradeoff is a locality tradeoff. Graph algorithms walking random neighbor pointers are cache-hostile by nature, which is why specialized graph stores use compact adjacency layouts. Every data structure decision is implicitly a cache access pattern decision. The interviewers at Google and Stripe who ask about this want you to connect algorithm choice to hardware behavior, not just asymptotic complexity.

Interviewer: Sort an array of 0s, 1s, and 2s. Me: First I will sort using Bubble Sort... "My grandma runs faster than your code" is also the reaction to cache-naive implementations at scale. Pick your data structures accordingly.

See array vs linked list performance for detailed benchmarks, and cache eviction policy to understand what happens when the cache fills up and has to decide what to throw out.

The One Number Worth Remembering

You don't need to memorize exact clock counts. A cache miss to DRAM costs roughly 50x an L1 cache hit. That turns 5ns reads into 100ns reads and compounds across every element in your data structure.

Most engineers learn Big-O and stop there. Cache behavior is the gap between theoretical complexity and actual runtime. Arrays beat linked lists on traversal by 5-10x. Consistently. Not occasionally, not in worst case. Every time, because physics. The cache line is 64 bytes. DRAM latency is 80-100ns. These numbers don't change.

When you choose a data structure, ask one extra question: are my accesses sequential or scattered? Sequential means spatial locality for free. Scattered means factoring the miss cost into the tradeoff.

Knowing the concept is the easy part. Explaining it clearly in 30 seconds while someone watches you code is the actual skill. SpaceComplexity runs voice-based mock interviews with rubric feedback on exactly this kind of reasoning, if you want to practice before the real thing.

Further Reading