The Call Stack: What Really Happens When You Call a Function

You type factorial(5) and get 120 back. Between those two moments, the machine tracked five simultaneous contexts, kept their local variables from colliding, and knew exactly which line to resume at after each one returned. No global state. No coordination code. Magic, apparently.

It is not magic. It is the call stack. And once you see the mechanism clearly, recursion stops feeling mysterious, stack overflow stops feeling random, and iterative DFS stops feeling like a clever trick. It starts feeling obvious.

Where Your Function's Memory Actually Lives

A process's virtual memory is carved into regions. Text holds the compiled machine code. The heap holds dynamically allocated memory: your malloc, your new, every Python object. The stack holds the active call chain.

The stack is a contiguous slab of memory that grows toward lower addresses every time you call a function and shrinks back when the function returns.

Process virtual memory layout: stack at top growing downward, heap growing upward, data segment and text segment at bottom. rsp marks the current stack boundary.

Virtual memory layout. The stack and heap grow toward each other. They never actually meet unless something has gone horribly wrong.

On Linux and macOS, the main thread's stack is 8 MB. macOS gives secondary threads 512 KB. Windows gives each thread 1 MB by default. These are not recommendations. They are walls.

What a Stack Frame Contains

Every function call gets its own chunk of that stack region: a stack frame. The frame holds everything the function needs to run and return cleanly.

Four things live in a typical frame: the return address, the saved frame pointer, local variables, and any callee-saved registers this function touched.

A single stack frame with labeled sections from top to bottom: return address (8 bytes), saved frame pointer / rbp (8 bytes), local variables (variable size), callee-saved registers. rsp points to the bottom of the frame.

One stack frame. The return address is the machine's version of a sticky note that says "come back here when you're done."

Let's define each:

Return address. When the function finishes, the CPU needs somewhere to jump. The call instruction pushes this automatically before jumping. It is the address of the instruction immediately after the call site.
Saved frame pointer. The frame pointer register (rbp on x86-64) is a stable anchor into the current frame. Every function saves the caller's rbp first, so it can restore it before returning.
Local variables. Any variable that doesn't fit in a CPU register gets carved out here.
Callee-saved registers. If the function uses rbx, r12 through r15, the ABI requires it to save and restore them. Those saved values live in the frame.

One thing that does not usually live on the stack: arguments. On x86-64 Linux and macOS, the first six integer or pointer arguments pass in registers: rdi, rsi, rdx, rcx, r8, r9. Only a seventh argument or beyond spills to the caller's frame. This is worth knowing because it means a function call with six or fewer arguments adds almost no stack overhead beyond the frame itself.

The Call: Three Instructions Do All the Setup

Take this Python function, compiled to x86-64:

def add(a, b):
    result = a + b
    return result

The caller puts a in rdi and b in rsi. Then:

call add        ; pushes return address onto stack, jumps to add

That single call instruction does two things in one step: it pushes the address of the next instruction (the one after call) onto the stack, then jumps to add. The CPU has now committed to coming back.

Inside add, the prologue runs three instructions:

add:
    pushq  %rbp          ; save caller's frame pointer
    movq   %rsp, %rbp    ; set rbp = current top of stack (our frame base)
    subq   $16, %rsp     ; carve out 16 bytes for local variables

Before/after the three prologue instructions. Left side shows rsp and rbp pointing to caller's frame base. Right side shows rsp moved down by 16 bytes, rbp anchored at the saved frame pointer, and the return address sitting at rbp+8.

Three instructions. That's the entire frame setup. The epilogue is leave; ret, which is two more. Five instructions total to enter and exit any function.

Subtracting from rsp allocates space because the stack grows downward. The compiler decides how much to subtract based on how many local variables the function has. It rounds up to maintain 16-byte alignment (an SSE requirement on x86-64).

After the prologue, rbp anchors the frame. Locals live at rbp - 8, rbp - 16, and so on. The saved frame pointer lives at rbp. The return address lives at rbp + 8.

Walking Through factorial(3)

def factorial(n):
    if n <= 1:
        return 1
    return n * factorial(n - 1)

Here is what happens when factorial(3) is called.

Frame 1 is pushed. n = 3, return address points back to the caller. We check 3 <= 1, which is false, so we recurse.

Frame 2 is pushed. n = 2, return address points to the multiply instruction inside Frame 1 (the n * part, which can't run until the recursive call returns). We check 2 <= 1, false, recurse again.

Frame 3 is pushed. n = 1, return address points to Frame 2's multiply. We check 1 <= 1, true. We return 1.

Three stack frames drawn vertically with frame 3 at the bottom (most recent, lowest address). Each frame shows its n value and return address. Arrows trace the unwinding path: frame 3 returns 1, frame 2 computes 2×1=2, frame 1 computes 3×2=6.

The stack at maximum depth. Three frames, three separate n variables, zero shared state. That's the whole trick.

Now the unwinding. Frame 3's epilogue runs:

    leave    ; restores rsp = rbp, then pops rbp (restores caller's frame pointer)
    ret      ; pops return address, jumps to it

ret pops the return address and jumps. We land inside Frame 2, right at the multiply. Frame 2 computes 2 * 1 = 2, runs its own epilogue, jumps back into Frame 1. Frame 1 computes 3 * 2 = 6, returns. Done.

The call stack is how the machine keeps "where was I?" at every nesting level, without any global bookkeeping.

Recursion Is Just a Call Stack. Literally.

This resolves most of the confusion around recursive code.

There is nothing special about a function calling itself. The CPU pushes a frame, runs the function body, pops when done. The code inside the frame is identical to its caller's code, but the frames are separate allocations with separate variables. n in Frame 1 and n in Frame 2 are different memory addresses.

The depth of recursion is the height of the stack at any given moment, which is exactly your space cost.

A linear chain like factorial(n) pushes n frames before hitting the base case: O(n) space. Binary search halves the problem each call, so the chain is at most log n frames deep: O(log n) space. A tree DFS pushes at most one frame per level of the tree: O(h) space, where h is the tree height. You can read the space complexity directly off the call chain's shape. The recursion time complexity guide works through this in more detail using recurrence trees.

The Explicit Stack Is the Same Algorithm

Iterative DFS with an explicit stack is not an optimization. It is the recursive algorithm, written out manually.

def dfs_recursive(node):
    if node is None:
        return
    visit(node)
    dfs_recursive(node.left)
    dfs_recursive(node.right)

def dfs_iterative(root):
    stack = [root]
    while stack:
        node = stack.pop()
        if node is None:
            continue
        visit(node)
        stack.append(node.right)   # right pushed first
        stack.append(node.left)    # left comes off first (LIFO)

Every item you push onto the iterative stack represents state that would have been saved in a call frame: which node are we processing, what comes next. The CPU calls it a frame. You call it a list entry. Same idea.

Side-by-side comparison: recursive call frames on the call stack labeled node=A, node=B, node=D alongside an iterative explicit stack (a Python list) containing identical values. Both hold the same information; only the storage location differs.

Same algorithm. The recursive version outsources memory management to the OS. The iterative version does it manually on the heap. The OS's version has a 8 MB limit. The heap's version is "how much RAM do you have."

The practical difference is budget. Your stack list lives on the heap and can grow as large as available memory. The CPU's call stack is capped at 8 MB. Converting recursion to iteration means trading the OS-managed call stack for a heap-allocated one: same asymptotic space, much higher ceiling. The iterative preorder traversal walkthrough shows this substitution in concrete code for trees.

When the Stack Runs Out

Push enough frames and rsp walks off the end of the stack region. On Linux, the kernel catches this as a page fault and delivers SIGSEGV. The program crashes.

Anakin Skywalker and Padme meme: "I'm going to write a recursive function", "With a base case, right?", Anakin stares silently as the image recursively repeats, shrinking infinitely.

The classic. The image is itself a stack overflow. A rare case where the meme format is also the bug.

Python adds a software guard before the OS gets involved. The interpreter raises RecursionError at 1,000 frames by default, well before the kernel's wall. The choice of 1,000 is not principled science. It is a reasonable default that stops beginners from crashing their REPL while still being low enough that you notice the problem.

import sys

sys.getrecursionlimit()       # 1000
sys.setrecursionlimit(5000)  # push the software limit out

Raising the limit does not eliminate the ceiling. It moves it. A 500,000-frame deep call on Python will still crash. The OS limit doesn't care what setrecursionlimit says.

Some compilers eliminate this problem entirely for a specific class of recursion. If a function's last act is a plain return recursive_call(...), the compiler can reuse the current frame instead of pushing a new one. The stack stays constant depth. This is tail call optimization, and GCC and Clang do it automatically at -O2. Python deliberately skips it. Guido van Rossum's reasoning: TCO erases the stack frames that make tracebacks readable, and confusing tracebacks are a worse trade than stack efficiency for most Python programs. Reasonable position. Painful if you wrote a tight tail-recursive accumulator and just discovered it on prod.

Recap

Every function call pushes a stack frame containing: the return address, the saved frame pointer, local variables, and callee-saved registers.
The call instruction pushes the return address and jumps; ret pops it and jumps back.
Three prologue instructions set up the frame: save rbp, point rbp at current rsp, subtract N from rsp to allocate locals. The epilogue (leave; ret) tears it down.
Recursive depth equals frame count, which equals space cost.
Iterative DFS with an explicit stack is the same algorithm with a bigger budget.
Stack overflow happens when rsp walks off the stack region. Python raises RecursionError at 1,000 frames as a software guard before this.
Tail call optimization reuses the current frame for the last recursive call, keeping stack depth constant. GCC and Clang do it; Python deliberately does not.

If you want practice explaining this kind of mechanism out loud under pressure, SpaceComplexity runs voice-based mock interviews with rubric-based feedback on exactly these follow-up questions: "walk me through what happens in memory when this function is called."