The Debugging Round Interview Scores Your Process, Not the Fix

- Debugging round interview formats: Amazon OA snippets, Retool failing tests, Stripe Bug Squash open-source repos — three shapes, one rubric
- Four scored dimensions: hypothesis discipline, reproduction, root-cause analysis, and narration — finding the bug is not enough
- A methodical wrong hypothesis beats a lucky silent correct answer because interviewers score diagnostic reasoning, not outcomes
- Read the codebase for 5-10 minutes before forming any hypothesis: structure, types, call graph, data flow, failing test last
- The debugging loop: state expected vs. actual, reproduce with minimal input, rank hypotheses, test one thing at a time, apply the minimal fix, verify
- Best prep: clone real GitHub issues labeled "bug", record yourself narrating out loud — LeetCode does not build this skill
You're handed a GitHub repo. There's a failing test. Your interviewer is watching, and you have 45 minutes.
This is the debugging round. It's showing up everywhere. Stripe calls it the "Bug Squash," which sounds like a summer camp activity and is actually one of the harder technical screens you'll face. Retool uses it as their first-round filter. Amazon has a dedicated debugging section in every online assessment. Meta's AI-enabled round opens with bug-fixing before any implementation. The format varies. The premise doesn't: you didn't write this code, something is wrong, find it.
Most candidates treat it like a normal coding interview. It isn't. There's no algorithm to design. The code already exists. Your job is diagnosis. Think less "write a BFS" and more "why is the BFS returning the wrong nodes."
Three Formats, One Rubric
The debugging round takes three distinct shapes.
The snippet format is what Amazon's OA uses. You get five to seven short blocks with single logic errors, twenty minutes total. Typical bugs: a flipped comparison, an off-by-one, a return placed one line too early. Speed and accuracy both matter because the round is automated and timed.
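For a sense of scale, here is a minimal sketch of a snippet-style bug in Python. The function names are invented for illustration and are not taken from any real assessment; the point is only how small a "return placed one line too early" error looks on the page.

```python
def count_evens_buggy(values):
    """Intended to count the even numbers in values."""
    count = 0
    for v in values:
        if v % 2 == 0:
            count += 1
        return count  # bug: indented inside the loop, so it exits after the first element
    return count


def count_evens_fixed(values):
    """Same logic with the return moved out of the loop body."""
    count = 0
    for v in values:
        if v % 2 == 0:
            count += 1
    return count
```

On the input [2, 4, 6] the buggy version returns 1 and the fixed version returns 3. The whole fix is one level of indentation.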
The failing test format is Retool's. You clone a small custom repo, run the tests, and find what's broken. Bugs are more varied, sometimes multiple. The interviewer is live and watching throughout.
The open-source format is Stripe's. You clone a real library (Express, Day.js, or Sass) with a known bug and a failing test. The codebase is large, dense, and uncommented, written by engineers who absolutely were not thinking about you reading it in forty-five minutes. You can use any tool you like, including documentation and Stack Overflow. The difficulty isn't algorithmic. It's navigational.
All three share one thing: the interviewer already knows where the bug is. They're not watching to see if you find it. They're watching how you look for it.
What Gets Scored
Four things.
Hypothesis discipline. Do you change one variable at a time? Strong candidates make a prediction before running any code: "I think the issue is the loop termination condition, so I'm going to add a log here and check the boundary value." Weak candidates just start editing. Random edits aren't debugging. They're guessing with extra steps.
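A prediction-first log might look like the sketch below. The function is hypothetical and the bug is planted; what matters is that the hypothesis and the expected log output are written down before anything runs, and that the only change is the log line.

```python
import logging

logging.basicConfig(level=logging.DEBUG, format="%(message)s")
log = logging.getLogger("debug-round")


def last_window_sum(values, size):
    """Hypothetical function under test: sum of the final `size` elements."""
    total = 0
    start = len(values) - size
    # Hypothesis: the loop stops one index early, so the last element is skipped.
    # Prediction: the highest index logged is len(values) - 2, not len(values) - 1.
    for i in range(start, len(values) - 1):
        log.debug("i=%d value=%d", i, values[i])  # the single change for this iteration
        total += values[i]
    return total
```

Calling last_window_sum([1, 2, 3, 4], 2) logs only index 2 and returns 3 instead of 7. The prediction held, so the hypothesis is confirmed and the boundary in range() is the place to fix. If the log had shown index 3, the hypothesis would be eliminated just as cleanly.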
Reproduction before anything else. Confirm you can trigger the bug reliably, then find the smallest input that still triggers it. If the test is already failing, run it first. See the actual error output before you assume anything.
Root cause, not symptom. An off-by-one in a loop can manifest as a crash, an empty return, a wrong value, or a subtle test failure three functions down the call stack. Interviewers are specifically watching whether you treat the symptom as the problem or trace back to the source. A broad try-catch that makes the test pass without fixing the underlying logic is the debugging equivalent of sweeping dust under a rug. They've seen it a hundred times. It reads poorly every time.
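To make the contrast concrete, here is a small sketch with an invented averaging helper. The same off-by-one divisor produces a crash on one input and a quietly wrong value on another, and the broad try/except only hides the first of those.

```python
def average_buggy(values):
    """Hypothetical helper: mean of values, but the divisor is off by one."""
    return sum(values) / (len(values) - 1)


def average_symptom_patch(values):
    # Symptom patch: a single-element list raised ZeroDivisionError,
    # so the error is swallowed. The crash disappears; the wrong math stays.
    try:
        return sum(values) / (len(values) - 1)
    except Exception:
        return 0.0


def average_root_cause_fix(values):
    # Root-cause fix: divide by the real element count.
    if not values:
        raise ValueError("average of an empty list is undefined")
    return sum(values) / len(values)
```

With the patch, [4] returns 0.0 and [2, 4] still returns 6.0. With the root-cause fix, they return 4.0 and 3.0. A test that only covered the crash would pass under both versions, which is why the interviewer watches how you got there.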

"Error on line 265" is not the bug. It's where the bug decided to say hello.
Narration. The interviewer can't credit thinking they can't see. If you find the bug in silence, you get credit for the outcome but not the process. If you explore the wrong hypothesis but explain it clearly before testing it, you demonstrate diagnostic reasoning the correct-but-silent candidate never proves they have. As one Stripe interviewer noted: the quality of the diagnosis is measured not just by what you change, but by how you arrive at the conclusion that nothing else needs to be changed.
Your narration needs to loop: what you see, what you think it means, what you'll test next. That cycle, repeated throughout the session, is the signal. Silence gets you rejected in any live technical setting.
A Methodical Wrong Hypothesis Beats a Lucky Silent Answer
Let's say you spend eight minutes on the wrong hypothesis. You add logs, inspect values, eliminate a branch, and conclude clearly: "Okay, the issue isn't in the sort comparator. Let me look at how we're building the input array." That is a very good interview moment. You showed systematic reasoning, you updated your model, and you moved on without spiraling.
A methodical candidate who explores the wrong hypothesis can outscore one who stumbles onto the right answer with no explanation.
Interviewers are watching for diagnostic reasoning, not correct outcomes. Does this person have a system? When their first hunch is wrong, do they update and move on, or do they start randomly editing things hoping something sticks? Is their mental model of the code getting sharper as they gather evidence?
You can perform well in a debugging interview even when the bug is genuinely hard. The work isn't to be brilliant. It's to be visible.
That same principle shapes how the hardest technical problems are scored: dead ends are data, not failure.
Read the Code Before You Form a Hypothesis
Before you touch the code, read it.
This is the part most candidates skip, and it's why they spend fifteen minutes chasing the wrong function. The best candidates spend five to ten minutes understanding the codebase before forming any hypothesis. Less than five and you're flying blind. More than ten and you look paralyzed. Five to ten is the zone.
Read in this order:
- Structure: what files exist, what each directory owns
- Types and signatures: what data flows through the system
- Call graph: which functions call which
- Data flow: where values come from, where they go
- Tests last: after you have context, read the failing test
Most candidates jump straight to the failing test. This feels fast. It isn't. Without context you'll form a plausible-but-wrong hypothesis and chase it fifteen minutes down a dead end because you didn't know where the actual transformation happens. The reading phase is how you build a mental model precise enough to debug against.
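In a Python codebase, the signatures and call-graph passes can be partly mechanized. The sketch below uses the standard-library ast module on an invented three-function source string; in a real round you would more likely lean on your editor's "go to definition" or a text search, so treat this as one possible aid rather than a required step.

```python
import ast

# Invented stand-in for an unfamiliar module.
SOURCE = '''
def load(path):
    return parse(read(path))

def parse(text):
    return text.split(",")

def read(path):
    return "a,b"
'''


def outline(source):
    """Map each top-level function to (argument names, plain function names it calls)."""
    tree = ast.parse(source)
    result = {}
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            args = [a.arg for a in node.args.args]
            calls = sorted({
                n.func.id
                for n in ast.walk(node)
                if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)
            })
            result[node.name] = (args, calls)
    return result
```

Here outline(SOURCE)["load"] comes back as (["path"], ["parse", "read"]), which already says that load is the entry point and that the transformation lives in parse. Method calls such as text.split are deliberately ignored to keep the sketch short.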
"I'm reading through the codebase to understand the structure before I form any hypotheses" is a sentence worth saying out loud. It buys you time and demonstrates exactly the discipline the interviewer is looking for.
The Loop
Once you have a mental model, the process is tight.
- State expected vs. actual behavior. One sentence: "The function should return the sum of values above zero, but it's returning the total sum."
- Reproduce with the smallest input. If the failing test is complex, can you write a three-line unit test that also fails? Smaller means faster iteration.
- Rank hypotheses by likelihood and cost to test. What's most probably wrong? What's cheapest to check? Pick the highest-value test first.
- Test one thing at a time. Add a log, inspect a value, comment out a branch. One action per iteration. This is not the time to delete everything and rewrite from scratch.
- Update your model. Your hypothesis was right (fix it) or wrong (eliminate it, move on). Either outcome is progress.
- Apply the minimal fix. Fix the root cause. Don't refactor. Don't improve adjacent code. The smallest correct change is the right change. Resist the urge to "clean things up while you're in there."
- Verify. Run the test. Check that adjacent tests still pass. Briefly scan for other inputs your fix might affect.
This loop works on a three-line snippet and on a ten-thousand-line open-source library. The discipline is the same. The temptation to skip steps grows with the size of the codebase. Don't.
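Here is the loop applied to the example sentence from the first step, as a minimal sketch. All names are invented; the buggy function returns the total sum where the sum of values above zero is expected.

```python
def sum_positive(values):
    """Hypothetical function under test. Expected: the sum of values above zero."""
    total = 0
    for v in values:
        total += v  # root cause: every value is added, negatives included
    return total


def sum_positive_fixed(values):
    """Minimal fix: one added condition, nothing else touched."""
    total = 0
    for v in values:
        if v > 0:
            total += v
    return total


def failing_cases(fn):
    """Return (input, expected, actual) triples where fn disagrees with the spec."""
    cases = [
        ([-1, 2], 2),    # smallest input that exposes the bug
        ([1, 2, 3], 6),  # adjacent case that already passed
        ([], 0),         # edge case the fix could plausibly affect
    ]
    return [(xs, want, fn(xs)) for xs, want in cases if fn(xs) != want]
```

failing_cases(sum_positive) reports only the [-1, 2] case, with 2 expected and 1 returned. failing_cases(sum_positive_fixed) returns an empty list, which covers the verify step: the original failure is gone and the neighbouring cases still pass.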
How to Actually Prepare
Debugging improves with deliberate practice, not more LeetCode. Solving another two-sum variant will not help you here.
Build a toy bug repository. Create small programs with intentional bugs in each category: off-by-one, null state, async race condition, wrong mutation, API contract mismatch. Time yourself finding each one. Narrate your process out loud. Yes, out loud. Into a room that might be empty. Do it anyway.
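One entry in such a repository might look like this sketch, which plants a bug in the "wrong mutation" category using Python's shared mutable default argument. The function names are made up; swap in whatever language and bug classes you expect to face.

```python
def add_tag_buggy(tag, tags=[]):
    # Seeded bug (wrong mutation): the default list is created once,
    # at definition time, and shared across every call that omits `tags`.
    tags.append(tag)
    return tags


def add_tag_fixed(tag, tags=None):
    # Reference fix: build a fresh list per call when none is supplied.
    if tags is None:
        tags = []
    tags.append(tag)
    return tags
```

Two consecutive calls to add_tag_buggy with "a" and then "b" leave the second result holding both tags, while add_tag_fixed returns a one-element list each time. Keep the fixed version in a separate file so you can check your diagnosis after the timer stops.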
Use real open-source issues. GitHub has thousands of repos with issues labeled "bug." Clone one, reproduce the reported failure, fix it. This is exactly what the Stripe round is. The bugs are real, the codebases are unfamiliar, and the skills transfer directly.
Record yourself. Play it back. Pause every time you went silent for more than thirty seconds. That silence is your score dropping. Narration feels awkward at first. It becomes automatic with repetition. Athletes watch film. You should too.
Practice the reading phase in isolation. Clone a repo you've never seen. Give yourself five minutes. Close it. Write down the structure from memory. Repeat until reading unfamiliar code feels like navigation rather than translation.
Practicing at SpaceComplexity builds the habit of narrating your reasoning live in a realistic interview setting, which matters in a debugging round exactly as much as it does anywhere else.
Your approach to debugging under pressure is the method you'll use in production at 2am. The round isn't just a test. It's a preview.
The Short Version
- The debugging round tests diagnosis, not implementation. Stripe, Retool, Amazon, and Meta all run it, in different forms.
- Four things get scored: hypothesis discipline, reproduction, root-cause analysis, narration.
- A methodical wrong hypothesis beats a lucky silent correct answer.
- Read the codebase five to ten minutes before forming any hypothesis. Structure, types, call graph, data flow, tests.
- The loop: state expected vs. actual, reproduce, rank hypotheses, test one at a time, minimal fix, verify.
- Practice on real GitHub issues labeled "bug." Record yourself. Narration is a learnable skill.