What Is Cache-Friendly Code? Why Two O(n) Loops Can Run 10x Apart

You write a matrix traversal. Your friend writes the same algorithm. You both land on O(n²). You run them on the same machine. Yours takes 80 milliseconds. Theirs takes 900.

Same problem. Same complexity. Same computer. Eleven times slower.

No bugs. No inefficient data structure. No algorithm mistake. Just a different loop order.

That gap is what cache-friendly code is about, and it's the kind of thing that turns a "passes all test cases" solution into a "causes a production incident" solution.

Your CPU Has a Memory Problem

RAM is slow. Not "we should profile this sometime" slow. More like "300 clock cycles waiting for a single number" slow. At 3 GHz, that's 100 nanoseconds your processor spends completely idle, staring at the memory bus like it's waiting for a late delivery.

The fix is a tiered hierarchy of smaller, faster caches between the processor and RAM.

Memory hierarchy: L1 through RAM, each level ~10× slower

Level	Typical Size	Latency
L1 cache	32-64 KB	~1-4 ns
L2 cache	256 KB-1.25 MB	~5-12 ns
L3 cache	4-64 MB (shared)	~15-40 ns
Main RAM	GBs	~80-100 ns

When your code reads a value, the CPU checks L1 first. Hit, and you're done in nanoseconds. Miss, and it falls through to L2, then L3, then the agonizing trip to RAM. Each level is roughly an order of magnitude slower, and the gap between L3 and RAM is the one that kills you.

Most programmers never think about this. The ones who do write code that runs several times faster without changing the algorithm at all.

A Cache Miss Buys You 16 Integers. Use Them.

When the CPU fetches any value from memory, it doesn't fetch just that value. It grabs a 64-byte aligned block called a cache line. On x86 (Intel and AMD), that's always 64 bytes. An int32 is 4 bytes. So one cache miss loads 16 integers at once, whether you wanted them or not.

If your next 15 reads are those 16 integers in sequence, they cost nothing extra. They're already sitting in L1. One cache miss bought you 15 free reads.

If your next access is at some distant address, you get another miss. You loaded 15 integers you never touched, then immediately evicted them to make room. You paid for 64 bytes and used 4. The hardware was generous and you ignored it.

This is the whole game. Cache-friendly code exploits the free rides. Cache-unfriendly code throws them in the trash.

The Prefetcher Is Trying to Help. Don't Break Its Heart.

There's a piece of hardware called the prefetcher. Its job is to watch your access pattern and pre-load cache lines before you ask for them. When you read arr[0], arr[1], arr[2] in order, it loads arr[16] through arr[31] while you're still processing the first batch. The latency vanishes.

# Sequential: prefetcher is happy, latency hidden
total = 0
for x in arr:
    total += x

# Strided: new cache line every 16th element, prefetcher gives up
total = 0
for i in range(0, len(arr), 16):
    total += arr[i]

The strided loop reads every 16th element. Each read lands on a fresh cache line. You paid for 64 bytes and used 4. The prefetcher detects no recognizable pattern, gives up, and every miss becomes a full round-trip to RAM. You've burned your hardware advantage for a slightly shorter loop body.

The Loop Order That Decides the Benchmark

This is the canonical cache example. The one that shows up in interviews. The one where two versions of the same code show a 10x gap and you sit there quietly recalibrating your beliefs.

Python and C store 2D arrays in row-major order. matrix[0][0], matrix[0][1], matrix[0][2] are contiguous in memory. Row 0 first, then row 1, then row 2. Sequential.

n = 3000
matrix = [[i * n + j for j in range(n)] for i in range(n)]

def row_major_sum(matrix):
    total = 0
    for i in range(len(matrix)):
        for j in range(len(matrix[0])):
            total += matrix[i][j]   # adjacent in memory, prefetcher thriving
    return total

def col_major_sum(matrix):
    total = 0
    for j in range(len(matrix[0])):
        for i in range(len(matrix)):
            total += matrix[i][j]   # jumps 12,000 bytes between rows
    return total

The row-major version is typically 5 to 20 times faster on large matrices, despite identical O(n²) complexity.

In the row-major loop, each cache miss loads a chunk of the current row. The next 15 elements are free.

In the column-major loop, matrix[0][j], matrix[1][j], matrix[2][j] are separated by n * 4 bytes. For a 3000-column matrix, that's 12,000 bytes between consecutive accesses. Each element lives on a different cache line. You get a miss for every single read. The prefetcher, confused, goes home.

Row-major vs column-major cache access: green cells hit the cache, red cells each cause a miss

The matrix doesn't change. The algorithm doesn't change. The loop order decides whether you're cooperating with the hardware or at war with it.

Pointer Chasing: Why Linked Lists Are the Anti-Cache

Linked lists are the textbook cache-unfriendly data structure, and it's not even close. The problem is pointer chasing.

class Node:
    def __init__(self, val):
        self.val = val
        self.next = None

Nodes get allocated at arbitrary addresses. To traverse, you read node.val, then read node.next to find the next address, then jump there. That address is probably on a completely different cache line. You can't issue the load for node N+1 until node N's load completes and reveals its address. The prefetcher has no idea what's coming. It's serial cache misses all the way to RAM, one after another, like a game of hot potato with your memory subsystem.

Traversing 1 million nodes runs roughly 5 to 10 times slower than scanning the equivalent array, even though both are O(n). Bjarne Stroustrup demonstrated a version showing a 25x gap between std::vector and std::list traversal on modern hardware. Same asymptotic complexity. Wildly different real performance.

Arrays win because the hardware was built for sequential access. Linked lists win when you need O(1) insertion at a known position. Know which battle you're in.

You can read a detailed breakdown in the array vs linked list performance guide.

When Half Your Cache Line Is Dead Weight

Cache behavior also shapes how you organize data, not just how you traverse it.

Say you're processing a million coordinate pairs and only need the x values:

# Array of Structs (AoS)
# Memory: x1 y1 x2 y2 x3 y3 x4 y4 ...
points = [(x1, y1), (x2, y2), (x3, y3), ...]

# Struct of Arrays (SoA)
# Memory: x1 x2 x3 x4 ... | y1 y2 y3 y4 ...
xs = [x1, x2, x3, ...]
ys = [y1, y2, y3, ...]

With AoS, a cache line loads interleaved (x, y) pairs. If your loop only touches x, half the bytes in every cache line are y values you never read. They're just along for the ride, taking up space, not paying rent.

With SoA, x values pack together. Every byte you load is a byte you use. SoA effectively doubles your useful cache bandwidth when you access only one field.

This is why game engines use SoA for entity component systems, why column stores outperform row stores on analytics, and why NumPy defaults to contiguous per-column storage. When you're processing one attribute across millions of records, SoA isn't a micro-optimization. It's the difference between your working set fitting in L3 or spilling to RAM.

What Big-O Forgot to Mention

Big-O strips away constant factors. Two O(n) algorithms have the same asymptotic complexity. One might run 10 times faster in practice because its cache miss rate is 6% instead of 95%.

The hidden constant in most performance-sensitive code is dominated by cache behavior.

This is why a hash map lookup can be slower than binary search on a small sorted array, even though O(1) beats O(log n) in theory. A hash map computes an index and jumps to a semi-random location. Binary search on a small array stays in L1 the whole time. With 16 or 32 elements, binary search wins. The crossover depends on the working set and access pattern. For more on this, see the hash map time complexity breakdown and the binary search algorithm guide.

The algorithm analysis tells you the shape of the curve. Cache behavior sets the constant. Both matter, and in production systems, the constant is often what you're actually optimizing.

How This Shows Up in Interviews

Cache-friendly code comes up in interviews in a few specific ways, and knowing the mechanism is what separates a textbook answer from a convincing one.

In system design, questions about why column stores outperform row stores on analytics, or why data structure choices matter at scale, trace back to cache behavior. Interviewers at Google and Meta look for this kind of hardware awareness at the senior level. "The column store reads only the columns it needs" is a half-answer. "It reads only the relevant columns, which are contiguous in memory, so cache utilization is much higher" is the full one.

In DSA interviews, solving a matrix problem with a column-major traversal will sometimes prompt a follow-up. Spotting the access pattern and explaining the cache implication reads as senior-level thinking. Most candidates solve the problem and stop. Few mention why one traversal order would run faster on real hardware.

In complexity discussions, distinguishing theoretical complexity from practical performance by naming cache effects as part of the constant factor analysis is the kind of nuance that ends up in strong-hire write-ups.

If you want to practice reasoning through this kind of thing out loud, under time pressure, with a follow-up question waiting, SpaceComplexity runs voice-based DSA mock interviews where performance tradeoffs are part of the rubric.