Circuit Breaker Pattern: The System Design Interview Guide

Your payment service is struggling. Response times have climbed from 100ms to 5 seconds. You open Grafana and stare at the graphs the way you stare at a car that just made a new noise. Your order service has 20 threads, each now blocked waiting for a payment response that may never come. All 20 fill up. Incoming requests queue behind them. Order service is unresponsive. So is the catalog service that calls it. And the API gateway. Your phone is buzzing. One slow service just took down your entire platform.

That is a cascading failure. And it is, objectively, a terrible time.

The circuit breaker pattern is the standard system design solution for stopping cascading failures. It is also one of those patterns where knowing the name gets you maybe 20% of the credit in an interview. The other 80% is explaining what it actually does, why the thresholds exist, and what happens when it trips.

What a Circuit Breaker Actually Does

The name comes from your home electrical panel. When a circuit gets overloaded, the physical breaker trips and stops current from flowing. It does not fix the problem. It stops the problem from spreading.

In software, a circuit breaker wraps calls to a downstream service, monitors failures, and when failures exceed a threshold, immediately rejects requests without attempting the remote call at all.

This does two things. It gives the failing service time to recover without being bombarded by traffic it cannot handle. And it keeps your service responsive by failing fast instead of blocking threads for 5 seconds each until your pool is exhausted.

Michael Nygard formalized the pattern in "Release It!" (2007). Netflix open-sourced Hystrix in 2012, and it served as the de facto reference implementation until Netflix moved it to maintenance mode in 2018. Today Resilience4j is standard for JVM services, and the pattern is built into Envoy, Istio, and AWS SDK retry configs.

Three States, One State Machine

The circuit breaker cycles through three states:

[Figure: circuit breaker state machine diagram]

CLOSED is normal operation. Requests pass through. The breaker tracks failures in a rolling window. Everything is fine. Probably.

OPEN is the tripped state. No requests reach the downstream service. Every call fails immediately (or executes a fallback). The circuit holds open for a configured wait duration, typically 30 seconds, giving the downstream service time to recover without load pressure.

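HALF-OPEN is the probe state. After the wait duration expires, the circuit allows a small number of test requests through. Resilience4j permits 10 by default; many teams configure fewer, such as three. If the probes succeed, the circuit closes and normal traffic resumes. If they fail, the circuit opens again and the clock resets. Simple implementations reopen on any probe failure, while Resilience4j applies the same failure rate threshold to the probe calls.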

The half-open state is the key insight. Think of it like cautiously opening the door after a fire alarm: you send three people in first, not three thousand. A service that just recovered might handle 3 concurrent requests fine but collapse under 3,000. The limited probe prevents a flood of backed-up requests from re-overwhelming a service the moment it comes back online.
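The three states fit in a few dozen lines. The sketch below is a minimal, single-threaded illustration, not how Resilience4j or Hystrix implement it. The class and parameter names (`CircuitBreaker`, `failure_threshold`, `probe_calls`) are made up for this example, and it counts consecutive failures instead of a rolling-window rate purely to keep the state transitions easy to see.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    # Hypothetical names and values; consecutive-failure counting is a
    # simplification of the rolling-window tracking described in the text.
    def __init__(self, failure_threshold=5, wait_seconds=30.0,
                 probe_calls=3, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.wait_seconds = wait_seconds
        self.probe_calls = probe_calls
        self.clock = clock  # injectable so tests do not have to sleep
        self.state = State.CLOSED
        self.failures = 0
        self.probe_successes = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.wait_seconds:
                raise CircuitOpenError("circuit is open; failing fast")
            # Wait duration elapsed: let probe requests through.
            self.state = State.HALF_OPEN
            self.probe_successes = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        if self.state is State.HALF_OPEN:
            self.probe_successes += 1
            if self.probe_successes >= self.probe_calls:
                self.state = State.CLOSED  # enough probes passed
                self.failures = 0
        else:
            self.failures = 0

    def _on_failure(self):
        if self.state is State.HALF_OPEN:
            self._trip()  # a failed probe reopens the circuit
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self._trip()

    def _trip(self):
        self.state = State.OPEN
        self.opened_at = self.clock()  # the wait clock (re)starts here
        self.failures = 0
```

A production breaker would also cap how many probes run concurrently in the half-open state and guard its counters with a lock. Both are omitted here.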

The Numbers Behind the Thresholds

Two metrics drive most circuit breaker implementations.

Failure rate threshold: Resilience4j defaults to 50%. If half or more of the calls in the current window fail, the circuit opens. The window size also matters: by default Resilience4j waits for at least 100 recorded calls before it evaluates the failure rate. This prevents a single error from tripping the circuit when traffic is low. One failed health check at 3am should not yank the breaker.

Slow call rate threshold: A service that responds in 8 seconds on every request is nearly as bad as one that returns errors. It is just... politely slow about ruining your day. Resilience4j lets you define a slow call duration threshold and open the circuit when too many calls exceed it. The defaults are permissive (a call counts as slow only after 60 seconds, and the slow call rate threshold is 100%), so set both explicitly. Most interview candidates only think about errors. Latency is a failure mode too.

Parameter	Resilience4j Default	Hystrix Default
Failure rate threshold	50%	50%
Slow call rate threshold	100%	N/A
Minimum calls before evaluation	100	20
Wait duration in open state	60s	5s
Probe requests in half-open	10	1
Slow call duration threshold	60s	N/A (1s execution timeout, counted as a failure)

The wait duration and minimum calls parameters are most commonly misconfigured. Set the minimum too low and a brief traffic spike trips the circuit unnecessarily. Set the wait duration too short and the circuit thrashes between open and half-open before the downstream service has actually recovered. You end up playing whack-a-mole with your own infrastructure.
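To see how the minimum-calls guard and the two rate thresholds interact, here is a small count-based window. It is a sketch under assumptions: the `SlidingWindow` class and its parameter names are invented for this example, and the caller picks every value, so nothing here is a statement about any library's defaults.

```python
from collections import deque


class SlidingWindow:
    """Count-based window over the most recent `size` call outcomes."""

    def __init__(self, size, min_calls, failure_rate_threshold,
                 slow_rate_threshold, slow_call_seconds):
        self.outcomes = deque(maxlen=size)  # oldest outcome is evicted automatically
        self.min_calls = min_calls
        self.failure_rate_threshold = failure_rate_threshold  # percent
        self.slow_rate_threshold = slow_rate_threshold        # percent
        self.slow_call_seconds = slow_call_seconds

    def record(self, failed, duration_seconds):
        is_slow = duration_seconds > self.slow_call_seconds
        self.outcomes.append((failed, is_slow))

    def rates(self):
        """Return (failure_rate, slow_rate) in percent, or None if too few calls."""
        total = len(self.outcomes)
        if total < self.min_calls:
            return None
        failures = sum(1 for failed, _ in self.outcomes if failed)
        slow = sum(1 for _, is_slow in self.outcomes if is_slow)
        return 100.0 * failures / total, 100.0 * slow / total

    def should_open(self):
        rates = self.rates()
        if rates is None:
            return False  # not enough data: one 3am failure cannot trip the circuit
        failure_rate, slow_rate = rates
        return (failure_rate >= self.failure_rate_threshold
                or slow_rate >= self.slow_rate_threshold)
```

Note that `min_calls` larger than `size` would mean the window never evaluates, which is one concrete way the "minimum calls" parameter gets misconfigured.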

Circuit Breakers, Retries, Timeouts, Bulkheads: Not the Same Thing

These four resilience patterns get conflated constantly in interviews. Candidates describe them like they are a list of synonyms. They are not.

Retry handles transient failures. The database hiccuped for 50ms. Try again. Retry assumes the problem is brief and self-resolving.

Timeout prevents waiting forever. If a service takes more than 5 seconds, abandon the call. Without timeouts, one slow service eventually exhausts your thread pools. This is how you end up with 20 blocked threads and a bad Grafana session.
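A timeout can be sketched with nothing but the standard library. The helper name `call_with_timeout` is made up for this example. One caveat worth stating out loud: this frees the caller, but the worker thread keeps running until the slow call returns, so real clients should also set timeouts on the socket or HTTP client itself.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def call_with_timeout(executor, fn, timeout_seconds, *args):
    """Run fn(*args) on the executor; give up waiting after timeout_seconds."""
    future = executor.submit(fn, *args)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        # Best effort only: a call that is already running cannot be interrupted.
        future.cancel()
        raise TimeoutError(f"call exceeded {timeout_seconds}s") from None
```

Usage would look like `call_with_timeout(pool, charge_card, 5.0, order)`, where `pool` is a `ThreadPoolExecutor` and `charge_card` is whatever function makes the remote call.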

Circuit breaker handles persistent failures. If 60% of calls are failing, retrying just generates more load on an already failing service. You are not helping. You are making it worse.

Bulkhead isolates resource pools. If payment service hangs, you do not want inventory service calls to be starved of threads. Bulkheads give each downstream dependency its own thread pool or connection pool so one bad neighbor does not evict everyone else.
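One lightweight way to picture a bulkhead is a semaphore per dependency. This is a sketch, assuming a hypothetical `Bulkhead` class that rejects immediately when its slots are full. Thread-pool-based bulkheads work differently, but the isolation idea is the same.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a dependency's concurrency slots are all in use."""


class Bulkhead:
    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Non-blocking acquire: reject right away instead of queueing behind
        # a dependency that is already saturated.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot for this dependency")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()


# One bulkhead per downstream dependency (sizes are illustrative).
payment_bulkhead = Bulkhead(max_concurrent=10)
inventory_bulkhead = Bulkhead(max_concurrent=10)
```

If every payment slot is stuck on a hung call, `payment_bulkhead.call(...)` starts rejecting, while `inventory_bulkhead` still has all of its slots.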

In a production system you compose them. A request carries a 5-second timeout, retries up to three times with exponential backoff, and the retry loop is wrapped in a circuit breaker. The bulkhead ensures payment failures do not affect inventory calls. See idempotency in system design for why retries require idempotent downstream operations.

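There are two classic interview traps here. The first is describing these as alternatives. The second is subtler: getting the nesting of retry and circuit breaker wrong. If your retry loop wraps the circuit breaker and treats the breaker's fast-fail rejection as just another retryable error, every rejected call burns its full retry budget and backoff delay on attempts that cannot succeed. With the retry loop inside the circuit breaker, the breaker records one outcome per logical operation, and when the circuit opens, the call is rejected before the first attempt, so no retries run at all. Some libraries default to the opposite nesting (Resilience4j's Spring Boot annotations, for example, apply Retry outermost). In that case, make sure the retry does not fire on the breaker's open-circuit exception.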
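Here is what "retry inside the circuit breaker" looks like when composed by hand. Everything in this sketch is hypothetical: `TinyBreaker` is a deliberately stripped-down breaker (consecutive failures, one probe) that exists only so the example runs, and `charge_with_resilience` is an invented name.

```python
import time


class CircuitOpenError(Exception):
    """The breaker rejected the call without attempting it."""


class TinyBreaker:
    """Stripped-down breaker: opens after `threshold` consecutive failed
    operations and allows a probe once `wait_seconds` have passed."""

    def __init__(self, threshold=3, wait_seconds=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.wait_seconds = wait_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None and self.clock() - self.opened_at < self.wait_seconds:
            raise CircuitOpenError("open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.threshold:
                self.opened_at = self.clock()  # trip, or re-trip after a failed probe
            raise
        self.failures, self.opened_at = 0, None
        return result


def retry(fn, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call fn up to `attempts` times with exponential backoff between tries."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


def charge_with_resilience(breaker, charge, sleep=time.sleep):
    # Breaker outermost, retry inside it: the breaker records one outcome per
    # logical operation, and an open circuit rejects before any attempt runs.
    return breaker.call(lambda: retry(charge, sleep=sleep))
```

With a threshold of 2 and a payment call that always fails, the first two operations make three attempts each and then the circuit opens. The third operation raises `CircuitOpenError` without touching the payment service at all.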

You Need a Fallback

A circuit breaker without a fallback plan is like a fire alarm that just makes noise. Useful for detecting the problem. Not useful for surviving it.

Return a cached response. If you have the user's account balance from 10 minutes ago, return it with a timestamp. Stale data beats an error page for most read operations. See caching strategies for how to structure this.

Return a degraded response. A recommendation service is down. Return the user's recently viewed items or a static popular list. The user experience degrades gracefully instead of exploding.

Queue the work. The payment service is down. Do not fail the order. Write it to a durable queue and process it when payment recovers. This requires idempotency downstream but keeps the user's transaction intact. See message queue vs pub/sub for how to design this queue layer.

Return an error fast. The honest last resort. At least you are not holding resources and not making the user wait 30 seconds for the exact same failure.
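Two of these fallbacks are easy to sketch. The functions below are illustrative only: `get_balance`, `place_order`, and the `CircuitOpenError` exception are invented names, the cache is a plain dict, and the `deque` stands in for a durable queue, which an in-memory structure is not.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Stand-in for whatever your breaker raises when it rejects a call."""


def get_balance(user_id, fetch_live, cache):
    """Cached-response fallback: serve the last known value, marked stale."""
    try:
        balance = fetch_live(user_id)
    except CircuitOpenError:
        if user_id in cache:
            cached_balance, fetched_at = cache[user_id]
            return {"balance": cached_balance, "as_of": fetched_at, "stale": True}
        raise  # nothing cached: fail fast, the honest last resort
    fetched_at = time.time()
    cache[user_id] = (balance, fetched_at)
    return {"balance": balance, "as_of": fetched_at, "stale": False}


def place_order(order, charge, pending_queue):
    """Queue-the-work fallback: accept the order, settle payment later."""
    try:
        charge(order)
    except CircuitOpenError:
        pending_queue.append(order)  # a real system needs a durable queue here
        return {"order_id": order["id"], "status": "pending"}
    return {"order_id": order["id"], "status": "confirmed"}
```

Whatever later drains the queue has to charge each order at most once, which is why this fallback leans on idempotency downstream.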

The fallback is where interviewers probe for depth. Saying "I'd use a circuit breaker" is table stakes. Saying "when the circuit opens, order service queues the order and returns a pending status to the user" signals you have thought through actual system behavior, not just pattern names.

Circuit Breaker Pattern: How to Use It in Your System Design Interview

The signal to reach for a circuit breaker is any synchronous call from Service A to Service B where B could become slow or unavailable under load. Payment services, third-party APIs, recommendation engines, ML inference endpoints. If B can be slow and A would block waiting for it, you want a circuit breaker.

The question that invites circuit breakers: "What happens if one of your services goes down?"

Walk through the three states concretely. Give real numbers. Name your fallback. A complete answer sounds like this: "I would configure a circuit breaker on order service's calls to payment service. Fifty percent failure rate threshold over a 100-call window. If it trips, order service queues the order for async processing and returns a pending status to the user. After 30 seconds, three probe requests go through. If payment has recovered, normal flow resumes."

That answer covers the mechanism, threshold, fallback, and recovery path. Most candidates cover one or two of those four, then trail off into silence hoping the interviewer does not notice.

If the interviewer asks about tradeoffs, two things matter. False positives: the circuit opens when the service is actually okay, causing unnecessary degradation. Mitigate with rate thresholds and a sufficient minimum sample size, not simple consecutive error counts. False negatives: the threshold is too high and the circuit never opens, letting cascading failures develop.

A third tradeoff worth mentioning at senior levels: distributed state. With 20 instances of order service, each holds its own circuit breaker state. They open and close independently based on what each instance happens to see. Some teams replicate circuit state across instances via Redis for consistency, at the cost of adding a centralized dependency. This is the kind of tradeoff that separates a good answer from a great one.

What the Interviewer Is Actually Scoring

When you bring up circuit breakers, you are demonstrating that you think about failure modes, not just the happy path. The gap between junior and senior system design answers is whether the candidate thinks about what happens when things break.

Most junior answers describe a system in the state where everything works. Senior answers describe a system under failure, degradation, and recovery. The circuit breaker pattern is basically a litmus test for which mode you are in.

You do not need to implement Resilience4j from memory. You need five things:

Name the problem the pattern solves (cascading failures from persistent downstream failure)
Describe the three states and transitions at a high level
Give a concrete threshold (50% failure rate, 30-second wait)
Name a fallback
Acknowledge one tradeoff

If you can also explain how it composes with retry and timeout, and why retry sits inside the circuit breaker, you are ahead of most candidates in the room.

Getting the mechanics to come out cleanly under pressure is its own skill. SpaceComplexity runs voice-based system design mock interviews with rubric feedback, which is a faster way to iron out the rough edges in your explanations than rehearsing them in silence. Running the "what if X service fails?" question out loud a few times before your interview is the prep that actually transfers.