Database Replication for System Design Interviews: The Guide

June 3, 2026 · 9 min read
interview-prep · career · system-design · algorithms
TL;DR
  • Database replication serves three distinct goals: high availability, read scaling, and geo-latency reduction — name all three when you introduce it.
  • Single-leader replication avoids write conflicts naturally but caps write throughput at one machine; it is the right starting model for most interview designs.
  • Leaderless replication uses the W + R > N quorum formula to tune the durability-vs-throughput dial at read and write time.
  • Asynchronous replication lag creates two named anomalies — read-your-writes and monotonic reads — each with a concrete, nameable fix.
  • Failover is harder than it looks: split brain and data loss are distinct failure modes that require consensus protocols or external coordinators, not just replica promotion.
  • Replication and sharding are complementary: replicate for redundancy and read scale, shard for write scale and data volume exceeding a single machine.

You're 35 minutes into designing Instagram. The boxes are drawn. You have a CDN. You mentioned Cassandra by name, which was bold. Your interviewer has been nodding.

Then they lean forward: "What happens when your primary database fails?"

Not a replica. Not a cache. The primary. The node you drew as one circle and quietly moved on from twenty minutes ago.

Most candidates say something about backups. Then slow down. Then go quiet. Database replication is the actual answer, and knowing how to explain it precisely, tradeoffs included, separates a hire from a no-hire on this question.

What Database Replication Actually Solves

Replication means keeping the same data on multiple machines. You do it for three reasons, and naming all three when you introduce it sends a signal that's disproportionate to the effort:

  1. High availability. If your primary crashes, a replica takes over with minimal downtime.
  2. Read scaling. Route reads to replicas. Your primary handles writes; replicas absorb the read load.
  3. Geo-latency. A user in Tokyo reading from a Tokyo replica doesn't wait for a round trip to Virginia.

Each reason implies different configuration choices. A single synchronous standby in the same datacenter serves high availability. A pool of async read replicas serves read scaling. Cross-region replicas serve geo-latency. Most real systems need all three simultaneously, which is why listing them instead of just saying "so it doesn't go down" makes you sound like someone who has thought about this before.

Three Models, Three Different Tradeoffs

Single-Leader Replication

One node accepts all writes. That node is the leader. Every other node is a follower. The leader writes changes to a replication log; followers apply those changes in order.

writes ──►  [ Primary ] ──► replication log ──►  [ Replica 1 ]
                                             └──►  [ Replica 2 ]
reads  ──►  any replica (or primary for strong consistency)

This is the default model in MySQL, PostgreSQL, MongoDB replica sets, and Redis. Start here for almost any interview design.

Single-leader avoids write conflicts by construction: every write goes through one node, and that node decides the order. The downside is that the primary is the write bottleneck. If write throughput outgrows one machine, single-leader replication alone doesn't help. That's when you add sharding, which is a separate tool for a separate problem (more on that distinction shortly).
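A minimal sketch of the log-shipping idea, assuming an in-memory list stands in for the replication log. The Leader and Follower classes and the limit parameter (used to simulate lag) are illustrative names, not any database's API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Follower:
    """Applies log entries strictly in order, tracking how far it has got."""
    data: dict = field(default_factory=dict)
    applied: int = 0  # number of log entries applied so far

    def catch_up(self, log: list, limit: Optional[int] = None) -> None:
        # Apply entries after our current position; `limit` simulates lag.
        end = len(log) if limit is None else min(limit, len(log))
        for key, value in log[self.applied:end]:
            self.data[key] = value
        self.applied = max(self.applied, end)


@dataclass
class Leader:
    """The only node that accepts writes; every write is appended to the log."""
    data: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def write(self, key: str, value: str) -> int:
        self.data[key] = value
        self.log.append((key, value))
        return len(self.log)  # log position of this write


leader = Leader()
fast, slow = Follower(), Follower()

leader.write("bio", "v1")
leader.write("bio", "v2")

fast.catch_up(leader.log)            # fully caught up
slow.catch_up(leader.log, limit=1)   # lagging by one entry

print(fast.data["bio"], slow.data["bio"])  # v2 v1
```

Because followers replay the same log in the same order, a lagging follower is never wrong, only behind. That is the property the lag discussion later in this guide depends on.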

Multi-Leader Replication

Multiple nodes accept writes. Each leader replicates to every other. If one datacenter goes down, the others keep accepting writes. Cross-region write availability: solved.

The cost is conflict resolution. Two leaders can accept conflicting writes for the same row at the same time. You need a policy: last-write-wins (LWW), application-level merge, or a CRDT. LWW is simple and silently drops writes. CRDTs preserve all writes but only work for specific data shapes like counters and sets. "Application-level merge" is a polite way of saying your engineers will be arguing about this at 2am one day.

Multi-leader is the right answer when routing every write through a single primary in a distant datacenter creates latency you can't accept.
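To make the difference between the conflict policies concrete, here is a small sketch under simplifying assumptions: timestamps are plain integers, the leader names are made up, and the CRDT shown is a grow-only counter (G-Counter), one of the simplest CRDTs.

```python
def lww_merge(a: tuple, b: tuple) -> tuple:
    """Last-write-wins: keep the (timestamp, value) pair with the larger timestamp."""
    return a if a[0] >= b[0] else b


def gcounter_merge(a: dict, b: dict) -> dict:
    """G-Counter CRDT: per-leader counts merged by element-wise max."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


# Two leaders accept concurrent writes to the same row.
us_write = (1000, "bio: hiking")     # (timestamp, value) accepted in us-east
eu_write = (1001, "bio: climbing")   # accepted in eu-west one tick later

winner = lww_merge(us_write, eu_write)
print(winner[1])  # bio: climbing  (the us-east write is silently discarded)

# The same kind of conflict on a like counter, modeled as a G-Counter.
us_likes = {"us-east": 3, "eu-west": 0}   # us-east saw 3 local increments
eu_likes = {"us-east": 0, "eu-west": 2}   # eu-west saw 2 local increments

merged = gcounter_merge(us_likes, eu_likes)
print(sum(merged.values()))  # 5  (no increment is lost)
```

The counter merges cleanly because each leader only ever increments its own slot. A free-text bio has no such structure, which is why it ends up with LWW or an application-level merge.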

Leaderless Replication

Any node can accept a write. The client sends writes to multiple nodes simultaneously. Reads query multiple nodes and pick the most recent value using version numbers.

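This is the Dynamo model, from Amazon's 2007 paper, implemented in open-source stores such as Cassandra and Riak. (DynamoDB, despite the name, routes each partition's writes through a leader.) Correctness comes from quorum. With N replicas, if every write is acknowledged by W nodes and every read queries R nodes, then W + R > N guarantees that every read set shares at least one node with the most recent write set, so at least one response carries the latest value.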

N = 3,  W = 2,  R = 2   →   W + R = 4 > 3   quorum achieved
N = 3,  W = 1,  R = 1   →   W + R = 2 < 3   no guarantee (high-throughput mode)

The W + R > N formula is the thing to memorize and drop in an interview. Tuning W and R lets you shift the dial: high W means strong write durability, high R means fresh reads, low both means maximum throughput with eventual consistency. The interviewer will follow up on it. Be ready.
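If the interviewer asks why the formula works, the overlap argument can be checked by brute force. The sketch below enumerates every possible write set and read set for small clusters; quorums_always_overlap is an illustrative helper, not part of any database client.

```python
from itertools import combinations


def quorums_always_overlap(n: int, w: int, r: int) -> bool:
    """Brute-force check: does every W-node write set share a node with every R-node read set?"""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


for n, w, r in [(3, 2, 2), (3, 1, 1), (5, 3, 3), (5, 1, 5)]:
    print(f"N={n} W={w} R={r}: W+R>N is {w + r > n}, "
          f"overlap guaranteed: {quorums_always_overlap(n, w, r)}")
```

For every configuration the two columns agree: overlap is guaranteed exactly when W + R > N. This is the pigeonhole principle: W + R slots drawn from N nodes must reuse at least one node.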

Synchronous or Asynchronous: Pick Your Poison

Within any model, you choose how the leader handles acknowledgment.

Synchronous. The write is not confirmed to the client until at least one replica confirms it. No acknowledged write is lost on failover, provided you promote a replica that confirmed it. The cost is higher write latency, because you wait for a network round trip before returning success.

Asynchronous. Confirmed as soon as the primary commits locally. Replicas catch up in the background. Low latency, but if the primary crashes before a replica syncs, those writes are gone. You told the client "success," the primary died, and the data went with it. That's a fun conversation to have with your on-call team at 3am.

In practice, most production systems use a hybrid: one synchronous replica, the rest asynchronous. MySQL calls this semi-synchronous replication. PostgreSQL uses synchronous_standby_names. You get durability on failover without forcing every write to wait for every replica to confirm.
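The durability difference can be shown with a toy model. This is a simulation under stated assumptions, not real replication code: the Primary class only counts writes, synchronous standbys are assumed to confirm every write before the client sees success, and asynchronous standbys only advance when ship_async is called.

```python
class Primary:
    """Toy primary that counts how many writes each standby has received."""

    def __init__(self, sync_standbys: int, async_standbys: int):
        self.committed = 0
        self.sync_received = [0] * sync_standbys
        self.async_received = [0] * async_standbys

    def write(self) -> None:
        self.committed += 1
        # Synchronous standbys must confirm before the client sees success,
        # so they are always level with the primary at acknowledgment time.
        self.sync_received = [self.committed] * len(self.sync_received)
        # Asynchronous standbys are NOT updated here; they catch up later.

    def ship_async(self) -> None:
        """Background shipping: async standbys catch up to the primary."""
        self.async_received = [self.committed] * len(self.async_received)

    def lost_on_failover(self) -> int:
        """Acknowledged writes missing from the best surviving standby."""
        best = max(self.sync_received + self.async_received)
        return self.committed - best


def run(sync_standbys: int, async_standbys: int) -> int:
    p = Primary(sync_standbys, async_standbys)
    for _ in range(100):
        p.write()
    p.ship_async()          # replicas catch up once...
    for _ in range(5):
        p.write()           # ...then 5 more writes land before the crash
    return p.lost_on_failover()


print(run(sync_standbys=0, async_standbys=2))  # 5
print(run(sync_standbys=1, async_standbys=2))  # 0
```

A single synchronous standby is enough to bring the loss to zero in this model, which is the argument for the hybrid setup: you pay one round trip per write rather than one per replica.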

For most interview designs, async replication with one synchronous standby is the right answer. Say it explicitly, then immediately acknowledge the consistency window it creates on the read path.

Lag Creates Two Specific Failure Modes

Async replication creates a window of inconsistency. Under normal conditions, replica lag is milliseconds. Under load, or after a network hiccup, it stretches. Two anomalies to know, each named after the guarantee it breaks:

Read-your-writes. A user submits a profile update, then immediately loads their profile page. The read goes to a lagging replica. They see the old value. Their update appears to have vanished. They file a bug report. You spend an afternoon convincing them the data isn't lost. Fix: route writes and any immediately-following read from the same user to the primary. Or track the user's last-write position and refuse to serve reads from replicas that haven't reached it.
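The second fix (tracking the user's last-write position) might look like the sketch below. It assumes each write returns a monotonically increasing log position and that each replica can report how far it has applied; Replica, applied_pos, and pick_node are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    applied_pos: int  # highest log position this replica has applied


def pick_node(user_last_write: int, replicas: list) -> str:
    """Return a replica that has applied the user's last write, else the primary."""
    for replica in replicas:
        if replica.applied_pos >= user_last_write:
            return replica.name
    return "primary"


replicas = [Replica("replica-1", applied_pos=41), Replica("replica-2", applied_pos=44)]

print(pick_node(user_last_write=43, replicas=replicas))  # replica-2
print(pick_node(user_last_write=45, replicas=replicas))  # primary
print(pick_node(user_last_write=0, replicas=replicas))   # replica-1
```

Only users who wrote recently are pushed toward the primary. Everyone else still reads from replicas, so the read-scaling benefit is mostly preserved.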

Monotonic reads. A user makes two reads from two different replicas. Replica 1 is ahead of Replica 2. The user sees three messages in their inbox on the first request, refreshes, and sees two. Time went backward. Fix: sticky reads. Route each user consistently to the same replica, typically by hashing the user ID.
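Sticky reads by hashing the user ID can be sketched in a few lines. The replica names are placeholders, and the sketch uses SHA-256 rather than Python's built-in hash(), which is salted per process for strings and would not give a stable mapping across servers.

```python
import hashlib


def replica_for(user_id: str, replicas: list) -> str:
    """Map a user to one replica with a stable hash (not Python's salted hash())."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    return replicas[index]


replicas = ["replica-1", "replica-2", "replica-3"]

first = replica_for("user-42", replicas)
second = replica_for("user-42", replicas)
print(first == second)  # True
```

One caveat worth mentioning out loud: plain modulo hashing remaps many users whenever the replica pool changes size, and a user whose replica dies has to be rerouted, so the guarantee can still break at those moments.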

These are not theoretical edge cases. They appear in every social app, every e-commerce cart, every notification feed. Naming them and their fixes signals that you've thought about replicated systems in production, not just on a whiteboard. The Instagram system design is a concrete example: read replicas absorb the massive read load, but the lag window means the feed can briefly show stale data.

Failover Is Harder Than It Looks

When the primary fails, a replica must be promoted. Here is what can go wrong, because something always goes wrong:

Data loss. In async replication, the newly promoted primary may not have received the last few writes. You minimize this with semi-synchronous replication, but you can't fully eliminate it in pure async mode.

Split brain. A network partition can cause two nodes to each believe they are the primary. Both accept writes. Both keep going. When the partition heals, you have conflicting data and no automatic way to reconcile. Picture two managers who each think they run the team, both sending conflicting instructions for three hours while everyone nods along. The fix is a consensus protocol (Raft or Paxos) or an external coordinator like ZooKeeper that guarantees only one node can hold the primary lease at a time.
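One common companion to a primary lease is a fencing token: a number that increases with every lease grant, which downstream storage uses to reject writes from a deposed primary. The sketch below is an illustration under assumptions, not how any particular coordinator works. Coordinator stands in for a consensus-backed lease service, and both classes are hypothetical.

```python
class Coordinator:
    """Stand-in for a consensus-backed lease service (hypothetical API)."""

    def __init__(self):
        self.epoch = 0
        self.holder = None

    def grant_lease(self, node: str) -> int:
        # Each new lease gets a strictly larger epoch number.
        self.epoch += 1
        self.holder = node
        return self.epoch


class Storage:
    """Accepts a write only if it carries the newest epoch seen so far."""

    def __init__(self):
        self.highest_epoch = 0
        self.rows = {}

    def write(self, epoch: int, key: str, value: str) -> bool:
        if epoch < self.highest_epoch:
            return False  # stale primary: reject instead of diverging
        self.highest_epoch = epoch
        self.rows[key] = value
        return True


coordinator = Coordinator()
storage = Storage()

old_epoch = coordinator.grant_lease("node-a")   # node-a is primary
storage.write(old_epoch, "stock", "10")

# node-a is partitioned away; the coordinator promotes node-b.
new_epoch = coordinator.grant_lease("node-b")
storage.write(new_epoch, "stock", "9")

# node-a comes back still believing it is primary.
print(storage.write(old_epoch, "stock", "7"))  # False
print(storage.rows["stock"])                   # 9
```

The old primary never has to learn that it was deposed. Its writes fail because its epoch is stale, which is what keeps the two "managers" from both being obeyed.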

Stale reads after promotion. If the new primary was 500ms behind at the moment of promotion and clients start routing to it immediately, those clients may find that writes they saw acknowledged moments earlier are missing.

Production tooling handles most of this: Patroni and repmgr for PostgreSQL, Orchestrator for MySQL, automatic failover in managed RDS and Cloud SQL. In an interview you don't need to name these tools. You do need to acknowledge that failover requires coordination. You don't just select the most caught-up replica and point traffic at it. You could. But you'd find out why that was wrong very soon after.

Replication Is Not Sharding

These two get conflated constantly, the way people mix up RAM and storage, except the consequences are louder.

Replication gives you redundancy and read scale by copying the same data across multiple machines. Sharding gives you write scale and storage capacity by partitioning different data across machines.

They're complementary. A production database typically does both: shard data across many primary nodes, then replicate each shard across two or three replicas. Instagram's PostgreSQL setup does this. So does most key-value store design at scale, covered in the key-value store system design guide.
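A rough sketch of how the two layers compose in a request router. The topology, node names, and hash-modulo shard selection are all assumptions for illustration; real systems often use consistent hashing or a lookup table instead.

```python
import hashlib

# Hypothetical topology: each shard owns a slice of the keys and has its own replicas.
SHARDS = [
    {"primary": "shard0-primary", "replicas": ["shard0-r1", "shard0-r2"]},
    {"primary": "shard1-primary", "replicas": ["shard1-r1", "shard1-r2"]},
    {"primary": "shard2-primary", "replicas": ["shard2-r1", "shard2-r2"]},
]


def stable_hash(key: str) -> int:
    """Process-independent hash so every router instance agrees on placement."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


def route(key: str, is_write: bool) -> str:
    """Sharding picks the shard; replication picks the node within it."""
    shard = SHARDS[stable_hash(key) % len(SHARDS)]
    if is_write:
        return shard["primary"]
    replicas = shard["replicas"]
    return replicas[stable_hash("read:" + key) % len(replicas)]


print(route("user-42", is_write=True))
print(route("user-42", is_write=False))
```

The first decision (which shard) spreads write load and data volume. The second (primary or replica) is where availability and read scaling come in.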

The interview heuristic: if the bottleneck is reads or availability, lead with replication. If the bottleneck is writes or data volume exceeding a single machine, lead with sharding. If it's both, layer them explicitly.

How to Introduce Replication Before You're Asked

You don't wait for the interviewer to probe you. You volunteer it when you draw the database layer.

When you sketch your primary database, add replicas immediately. Then say something like: "I'll add a read replica pool here, for read scaling and failover. I'll use async replication with one synchronous standby. That means there's a lag window on the read path I need to design around."

Then connect the lag to the specific consistency requirement of the system. Social feed? Eventual consistency is fine. A post appearing 200ms late is not a correctness problem. Payment system or inventory reservation? You read from the primary or use synchronous replication, because serving stale stock counts leads to overselling. That's a correctness problem. A loud one.

The signal interviewers look for is that you connect the replication model to the consistency requirement of the specific system, not that you recite the definition.

If the interviewer pushes on tradeoffs, you have three levers: synchronous vs. asynchronous (latency vs. durability), quorum configuration in leaderless systems (the W, R, N triangle), and failover coordination strategy (consensus protocol vs. managed service).

Making these conversations feel natural under pressure takes practice. SpaceComplexity runs voice-based system design mock interviews where you work through exactly these tradeoffs in real time, with rubric-based feedback on how clearly you articulate consistency decisions.

Further Reading