What Is a Race Condition? The Bug Every Concurrent System Hides

June 6, 20269 min read
system-designinterview-prepdistributed-systemsalgorithms
What Is a Race Condition? The Bug Every Concurrent System Hides
TL;DR
  • Race condition: two threads read-modify-write the same value concurrently; one write is silently lost
  • Critical section is the code between read and write — any thread that enters during that window corrupts the result
  • Mutex locks serialise the critical section; compare-and-swap (CAS) achieves the same without blocking, using optimistic retries
  • Distributed race conditions (flash sale, bank transfer) require database-level optimistic or pessimistic locking — in-process locks don't cross server boundaries
  • Deadlock is the failure mode of over-locking; prevent it by always acquiring locks in the same global order
  • Interview signal: name the specific failure mode (double-sell, lost update) before proposing a fix — reaching for the solution first suggests pattern-matching, not reasoning

You write a function that increments a counter. It works in tests. You deploy. Ten servers, a million requests per second, and the number is wrong. Not catastrophically wrong. Not "database is on fire" wrong. Just slightly, randomly, maddeningly wrong. A number that keeps coming out smaller than it should be, with no error, no stack trace, nothing to debug.

A race condition happens when two operations both read a value, compute a new value from what they read, then both write back. Only one write survives. The other is silently lost.

That's it. The bug has no manners and no remorse.

The Counter That Lies

Here is the simplest version of the problem. Two threads, each incrementing a shared counter 100,000 times. Expected result: 200,000.

import threading counter = 0 def increment(): global counter for _ in range(100_000): counter += 1 # this is NOT atomic t1 = threading.Thread(target=increment) t2 = threading.Thread(target=increment) t1.start() t2.start() t1.join() t2.join() print(counter) # should be 200,000. Usually isn't.

Run this and you get 143,892 or 178,541. Different every time. The counter is lying to your face with full confidence.

counter += 1 looks like one operation but it is three: read the current value into a register, add 1, write it back. The CPU can switch threads at any point between those three steps:

Thread A: reads counter  →  5
Thread B: reads counter  →  5    ← same value, before A wrote back
Thread A: writes 6
Thread B: writes 6                ← should be 7, one increment lost

The window between the read and the write is the critical section. Any thread that sneaks in during that window corrupts the result. The CPU has no concept of "that was rude." It is doing exactly what you told it to do.

Why This Is Harder Than It Looks

Race conditions are infuriating because they do not always trigger. They depend on exact thread scheduling, which varies with CPU load, OS scheduling decisions, and apparently the phase of the moon. Your unit tests pass. CI passes. You merge with confidence. Production fails on a Friday at 4:58pm.

Three conditions have to be true for a race condition to occur: at least two threads access the same resource, at least one of them writes, and no synchronization controls the order. Break any one of those, and you are safe. Most fixes attack the third condition.

Works on my machine meme, developer certifying their own laptop as the canonical production environment, the classic defense when a race condition only reproduces under real load

Tests green. CI green. Counter still lying.

How to Fix It

Locks

A mutex ensures only one thread is in the critical section at a time:

import threading counter = 0 lock = threading.Lock() def increment(): global counter for _ in range(100_000): with lock: counter += 1 t1 = threading.Thread(target=increment) t2 = threading.Thread(target=increment) t1.start() t2.start() t1.join() t2.join() print(counter) # 200,000. Every time.

Thread B waits at the lock if Thread A holds it. Performance is slower, but the result is correct. Sometimes correctness is the feature.

Atomic Operations

Some operations are atomic at the hardware level, meaning the CPU guarantees no other thread can interrupt them mid-execution. Java exposes this directly:

import java.util.concurrent.atomic.AtomicInteger; AtomicInteger counter = new AtomicInteger(0); counter.incrementAndGet();

Under the hood, incrementAndGet uses compare-and-swap (CAS): read the current value, compute the new value, write back only if the value has not changed since the read. If another thread changed it first, CAS fails and retries. No blocking, no waiting, just optimistic retries.

Atomic operations are faster than locks under low contention because threads never have to park and wait. Redis INCR works the same way: the server is single-threaded, so every INCR is atomic by design. No lock needed if there is only ever one writer.

Immutability

If data never changes, there is no race. Functional languages lean hard into this. For a configuration object that thousands of threads read, making it immutable and swapping the whole reference atomically is safer and faster than protecting every field with a lock. The lock disappears entirely because the problem disappears.

When You Cannot Share a Lock: Distributed Systems

Single-process race conditions yield to mutex locks. Distributed race conditions are harder because there is no shared memory across servers. Two processes on two machines cannot both acquire() the same lock object. They live in completely separate universes.

The classic example is a flash sale. You have 10 tickets. Two users click "buy" at the same millisecond on different servers:

Server A: SELECT count FROM inventory WHERE id=1  →  10
Server B: SELECT count FROM inventory WHERE id=1  →  10
Server A: UPDATE inventory SET count=9 WHERE id=1
Server B: UPDATE inventory SET count=9 WHERE id=1  ← should be 8

You sold 11 tickets. You own 10. Congratulations, you have invented a new class of customer complaint and a very awkward support email.

Two database-level fixes:

Optimistic locking checks that the row has not changed since you read it:

UPDATE inventory SET count = count - 1, version = version + 1 WHERE id = 1 AND version = 5 AND count > 0; -- if 0 rows updated, someone else got there first, retry

Pessimistic locking holds the row exclusively during the transaction:

BEGIN; SELECT count FROM inventory WHERE id=1 FOR UPDATE; UPDATE inventory SET count = count - 1 WHERE id = 1; COMMIT;

For systems where the database lock is too slow under load, you move to a distributed lock: a Redis Redlock, a Zookeeper node, or a dedicated lock service. The full design tradeoffs live in distributed lock system design.

How Interviewers Test Race Conditions

System design interviews are where race conditions bite most often. Interviewers watch whether you notice the concurrent write problem before they have to tell you about it.

Common scenarios and their standard fixes:

ScenarioRace conditionTypical fix
Inventory or ticket bookingDouble-sellOptimistic locking or distributed lock
Bank account transferLost updateTransaction with SELECT FOR UPDATE
Rate limitingExceeding the limitAtomic increment (Redis INCR)
User registrationDuplicate accountUnique constraint, catch the error, retry
Cache invalidationStale read after writeWrite-through cache or short TTL

When you design any system touching shared mutable state, ask: what happens if two requests arrive at exactly the same millisecond? That single question catches most race conditions before you draw the second box on the whiteboard.

The choice between optimistic and pessimistic locking matters here. Optimistic locking works well when conflicts are rare. Pessimistic locking is worth the latency when conflicts are common or the cost of a conflict is high, like in payment processing where double-charging a card is not a theoretical concern.

Idempotency is a related defense. Even if your service handles the race internally, the client might send the same request twice after a timeout. Idempotency in system design covers how to make those duplicates safe.

For a concrete walkthrough under real load, see flash sale system design, where the inventory race is the central problem.

For coding interviews, most LeetCode problems are single-threaded. But a subset tests concurrency directly: LeetCode 1114 (Print in Order), 1115 (Print FooBar Alternately), and 1116 (Print Zero Even Odd) all require you to synchronize threads without a race.

The Cost of Synchronization

Locks have real overhead. Under high contention, threads spend more time waiting than working. You have parallelized the code and somehow made it slower than a single-threaded version, because everything is queuing for the same lock and the queue is the bottleneck now.

Keep the critical section as small as possible. Read and compute outside the lock. Write inside it. Minimize the window, minimize the waiting.

The other failure mode is deadlock. Two threads each hold a lock the other needs, so both wait forever:

Thread A: holds lock_1, waiting for lock_2
Thread B: holds lock_2, waiting for lock_1

Meme about fixing production deadlocks, two processes each holding the one thing the other needs, neither going anywhere

Two threads walk into a bar. Neither leaves.

Prevention is simple. Always acquire locks in the same order across all code paths. If every thread acquires lock_1 before lock_2, this cycle cannot form. Consistent ordering breaks the circle. There is no cleverness involved, just discipline.

What the Interviewer Is Looking For

When an interviewer describes a scenario with concurrent writes, there is an implicit test in the prompt: will you notice the race condition?

The answer is not "use a distributed lock" reflexively. It is a sequence: name the concurrent write, identify the specific consequence (double-sell, lost update, duplicate account), weigh the cost of consistency against the cost of a lock, then pick the right tool for the conflict frequency and severity.

Noticing the failure mode before proposing the fix is the signal. Proposing the fix first suggests you pattern-matched rather than reasoned. An interviewer watching you jump straight to "Redis Redlock" without explaining what it is solving has very little evidence you understood the problem.

Practicing this out loud is what SpaceComplexity trains: voice-based mock interviews where an AI plays the interviewer, probes your concurrency instincts in real time, and scores whether you named the failure mode before reaching for the solution.

Further Reading