Raft vs Paxos: Consensus Explained for System Design Interviews

June 3, 2026 · 8 min read
Tags: interview-prep, career, system-design, algorithms
TL;DR
  • Split-brain occurs when two nodes both believe they're leader; consensus algorithms prevent it by requiring a quorum before any decision commits.
  • Quorum math: to survive F failures you need 2F+1 nodes, so a 5-node cluster tolerates 2 failures and gives you the CP side of CAP.
  • Paxos (Lamport, written around 1989, published 1998) was among the first consensus protocols with a correctness proof; Multi-Paxos powers Google Spanner and Chubby, and ZooKeeper runs ZAB, a Paxos-like protocol.
  • Raft (2014) decomposes the problem into leader election, log replication, and safety; etcd, CockroachDB, TiKV, and Consul all run it.
  • Key Raft vs Paxos difference: Raft only elects a leader whose log is at least as up-to-date as a majority's, so the new leader already holds every committed entry; Paxos allows any node to lead and has it recover missing entries afterward.
  • In a system design interview, name Raft for modern systems and quote the tradeoff: quorum writes add 1–2 ms in a datacenter, and a full cross-region round trip in geo-distributed deployments.

You have a database with three replicas. A network partition splits them: two accept a write, one doesn't. The partition heals. Which value is correct?

Nobody knows. That's it. That's the whole problem. The Raft vs Paxos comparison comes up in nearly every system design interview that touches replication, and knowing what each one actually does is one of the cleaner ways to signal distributed systems depth. (The other way is using the word "linearizable" without flinching. Both work.)

Two Leaders, One Disaster

Picture two database nodes. A network blip makes each one think the other is dead. Both promote themselves to leader, accept conflicting writes, and diverge. When the network recovers, you have two histories with no way to know which is authoritative.

This is split-brain. It's why distributed databases scare people.

Node A (thinks it's leader)    Node B (thinks it's leader)
    write: balance=100             write: balance=200
              |                              |
              +------ partition heals -------+
                             |
               balance=100 or balance=200?

The way to prevent split-brain is to guarantee that only one leader's writes can commit at any given time. A stale leader may still believe it is in charge, but without a quorum behind it, nothing it accepts is ever committed. Consensus algorithms enforce this guarantee. Basically, your nodes hold a vote before they act. Democracy, but for servers. Slightly more reliable than the other kind.

Majority Rules: The Quorum Math

Every consensus algorithm is built on one idea: a decision is only valid when a majority of nodes agree to it.

For N nodes, the quorum is floor(N/2) + 1, rounding the division down. With 5 nodes you need 3. With 3 you need 2. The reason this works is intersection: any two majority groups must share at least one member. That shared member carries the memory of any previous decision forward, so two conflicting decisions can't both reach quorum.
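
The intersection claim is small enough to check by brute force. Here is a minimal sketch; the function names are made up for illustration, and it simply enumerates every group of a given size and confirms that each pair overlaps.

```python
from itertools import combinations


def quorum_size(n: int) -> int:
    """Smallest majority of an n-node cluster: floor(n/2) + 1."""
    return n // 2 + 1


def every_pair_overlaps(n: int, group_size: int) -> bool:
    """True if any two groups of `group_size` nodes (out of n) share a member."""
    groups = [set(g) for g in combinations(range(n), group_size)]
    return all(a & b for a in groups for b in groups)


# Majorities of a 5-node cluster (size 3) always overlap.
majorities_overlap = every_pair_overlaps(5, quorum_size(5))   # True
# Groups of 2 do not: {0, 1} and {2, 3} are disjoint.
pairs_overlap = every_pair_overlaps(5, 2)                     # False
```

Drop the group size below a majority and the guarantee disappears, which is why "most of the nodes agreed" is not good enough.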

The fault tolerance formula: to survive F failures, you need 2F + 1 nodes.

Cluster size    Failures tolerated
3 nodes         1
5 nodes         2
7 nodes         3

The cost: the minority side of a network partition stops serving rather than risk a conflict. This is CAP theorem's CP side made concrete. You get consistency and partition tolerance; availability takes the hit. Your node will sit silently in the corner, refusing to answer, which is at least honest.
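
The arithmetic behind the table and the partition behavior fits in a few lines. This is a sketch with hypothetical helper names, not any particular library's API.

```python
def quorum_size(n: int) -> int:
    """Votes needed for a decision in an n-node cluster."""
    return n // 2 + 1


def failures_tolerated(n: int) -> int:
    """Nodes that can fail while a majority of the cluster is still alive."""
    return (n - 1) // 2


def can_commit(reachable_nodes: int, cluster_size: int) -> bool:
    """A partition side can commit only if it can still assemble a quorum."""
    return reachable_nodes >= quorum_size(cluster_size)


# A 5-node cluster split 3/2: the majority side keeps going, the minority stops.
majority_side = can_commit(3, 5)   # True
minority_side = can_commit(2, 5)   # False
```

One thing the formula makes obvious: a 4-node cluster tolerates 1 failure, the same as 3 nodes, which is why cluster sizes are almost always odd.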

Paxos: The One That Came First

Leslie Lamport described Paxos in 1989. It sat unpublished for nine years, partly because reviewers kept asking him to remove the fictional Greek island framing. It finally appeared in 1998 as "The Part-Time Parliament," and he followed it with Paxos Made Simple in 2001 when few people could understand the original. Paxos Made Simple is 14 pages. We should probably renegotiate what "simple" means.

Basic Paxos works in two phases:

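Phase 1 (Prepare/Promise). A proposer sends Prepare(n) to a quorum of acceptors. Each acceptor that has not already promised a number of n or higher promises not to accept any proposal numbered below n, and returns the highest-numbered proposal it has already accepted, if any.

Phase 2 (Accept/Accepted). If the proposer collects promises from a majority, it sends Accept(n, v), where v is the value of the highest-numbered proposal reported in those promises, or its own value if none was reported. Acceptors accept unless they've since promised a higher proposal number.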
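
The acceptor side of both phases is compact enough to sketch. This is a single-process illustration under simplifying assumptions (no networking, no durable storage, string values); the class and function names are invented for the example. The `choose_value` helper shows the proposer rule from Paxos Made Simple: if any promise reports an accepted value, the proposer must adopt the one with the highest proposal number.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Acceptor:
    promised_n: int = -1                  # highest proposal number promised so far
    accepted_n: int = -1                  # number of the last accepted proposal
    accepted_value: Optional[str] = None  # value of the last accepted proposal

    def on_prepare(self, n: int) -> Optional[Tuple[int, Optional[str]]]:
        """Phase 1: promise to ignore proposals below n, report what was accepted."""
        if n <= self.promised_n:
            return None                   # already promised an equal or higher number
        self.promised_n = n
        return (self.accepted_n, self.accepted_value)

    def on_accept(self, n: int, value: str) -> bool:
        """Phase 2: accept unless a higher-numbered proposal was promised since."""
        if n < self.promised_n:
            return False
        self.promised_n = n
        self.accepted_n = n
        self.accepted_value = value
        return True


def choose_value(promises: List[Tuple[int, Optional[str]]], own_value: str) -> str:
    """Proposer rule: adopt the highest-numbered accepted value, else use our own."""
    accepted = [p for p in promises if p[1] is not None]
    if not accepted:
        return own_value
    return max(accepted, key=lambda p: p[0])[1]
```

The value-adoption rule is what makes a decided value stick: a later proposer that talks to any majority will hear about it and re-propose it rather than overwrite it.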

One run of Basic Paxos decides one value. Real systems need to decide a continuous stream of values, which requires Multi-Paxos, an extension Lamport described loosely. The gap between the algorithm and a working implementation is large enough that every team fills it in differently. Google Spanner, Chubby, and Megastore all run Multi-Paxos. They each built the engineering themselves.

[Image: a programmer dressed as a clown at a computer. Caption: Attempting to implement Multi-Paxos from Lamport's original description]

Apache ZooKeeper uses ZAB (ZooKeeper Atomic Broadcast), a Paxos-like protocol with different ordering semantics and a synchronization phase added to recovery. ZooKeeper coordinates Kafka and Hadoop.

Raft: Built for Understandability

Diego Ongaro and John Ousterhout published Raft in 2014 with one explicit goal: make consensus understandable. Their paper won Best Paper at USENIX ATC. The title was literally "In Search of an Understandable Consensus Algorithm." They ran user studies showing that Raft was significantly easier to learn than Paxos, which is an impressive finding because the bar Paxos set was extremely low.

Raft separates the problem into three pieces.

Leader Election

All nodes start as followers. If a follower hears nothing from a leader within a randomized election timeout (the Raft paper suggests 150–300 ms), it becomes a candidate and requests votes. Majority wins.

The timeout is random on purpose. If all nodes had the same timeout, they'd all declare candidacy simultaneously, split the vote, and loop forever. Randomness breaks ties cheaply. Whether this is brilliant or embarrassing depends on your feelings about formal methods, but it works.
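
A rough sketch of the idea, assuming the 150–300 ms range mentioned above. The function names are hypothetical, and a real implementation re-draws the timeout every time it resets its election timer.

```python
import random
from typing import Dict, List, Tuple

ELECTION_TIMEOUT_MIN_MS = 150.0
ELECTION_TIMEOUT_MAX_MS = 300.0


def next_election_timeout_ms(rng: random.Random) -> float:
    """Draw a fresh timeout uniformly from the configured range."""
    return rng.uniform(ELECTION_TIMEOUT_MIN_MS, ELECTION_TIMEOUT_MAX_MS)


def first_to_time_out(node_ids: List[str], rng: random.Random) -> Tuple[str, Dict[str, float]]:
    """Give every node its own timeout and report which one fires first."""
    timeouts = {node: next_election_timeout_ms(rng) for node in node_ids}
    return min(timeouts, key=timeouts.get), timeouts


rng = random.Random()  # pass a seeded Random for reproducible runs
candidate, timeouts = first_to_time_out(["a", "b", "c"], rng)
```

Because the draws are independent, one node usually times out well ahead of the rest, asks for votes, and wins before anyone else becomes a candidate.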

Time is divided into terms, which act as a logical clock. Stale messages from old terms are rejected. Each term starts with an election.
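
In code, the term check is a comparison that runs on every incoming message. The sketch below uses an invented `Node` class. The step-down on a higher term is not spelled out above but comes from the Raft paper: any node that sees a newer term adopts it and reverts to follower.

```python
from dataclasses import dataclass


@dataclass
class Node:
    current_term: int = 0
    role: str = "follower"   # "follower", "candidate", or "leader"

    def observe_term(self, message_term: int) -> bool:
        """Return True if the message should be processed, False if it is stale."""
        if message_term < self.current_term:
            return False                     # sender is living in the past
        if message_term > self.current_term:
            self.current_term = message_term
            self.role = "follower"           # a newer term exists, so step down
        return True


old_leader = Node(current_term=3, role="leader")
old_leader.observe_term(2)   # stale message: rejected, still leader of term 3
old_leader.observe_term(5)   # newer term: adopts term 5 and becomes a follower
```

This is how a leader that was cut off by a partition finds out it has been replaced: the first message it sees from the new term demotes it.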

[Follower] ──timeout──> [Candidate] ──majority votes──> [Leader]
     ↑                                                       |
     └───────────────── heartbeat ───────────────────────────┘

A node only votes for a candidate whose log is at least as up-to-date as its own. This single constraint ensures elected leaders always have every committed entry. A new leader never has to fetch committed entries it is missing; it only has to bring followers' logs in line with its own.
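
"At least as up-to-date" has a precise definition in the Raft paper: compare the term of the last log entry first, and fall back to log length only when those terms are equal. A sketch, with a hypothetical function name and parameters:

```python
def candidate_log_is_up_to_date(
    candidate_last_term: int,
    candidate_last_index: int,
    voter_last_term: int,
    voter_last_index: int,
) -> bool:
    """True if the candidate's log is at least as up-to-date as the voter's."""
    if candidate_last_term != voter_last_term:
        # The log whose last entry has the later term wins, regardless of length.
        return candidate_last_term > voter_last_term
    # Same last term: the longer (or equal-length) log wins.
    return candidate_last_index >= voter_last_index


# A shorter log ending in a newer term beats a longer log ending in an older one.
vote_granted = candidate_log_is_up_to_date(3, 5, 2, 9)   # True
```

A full vote handler would also check the term of the request and whether the voter has already voted in that term; those checks are left out here.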

Log Replication

The leader handles all writes. It appends each write to its local log, then sends AppendEntries RPCs to followers. Once a majority of the cluster (the leader included) has stored the entry, it is committed and applied to the state machine. Followers learn the new commit point from the next AppendEntries call or heartbeat.

Client ──> Leader ──AppendEntries──> Follower 1  ✓
                                ──> Follower 2  ✓
                                ──> Follower 3  (partitioned)

Majority ack → commit → respond to client
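
One way a leader can work out what is safe to commit is to track, per node, the highest log index known to be stored there, then take the highest index that a majority has reached. The sketch below uses a hypothetical function name and deliberately leaves out one Raft rule: a leader only commits this way for entries from its own current term.

```python
from typing import List


def highest_majority_index(match_index: List[int]) -> int:
    """Highest log index stored on a majority of nodes.

    match_index has one entry per node, leader included.
    """
    quorum = len(match_index) // 2 + 1
    # After sorting high to low, the value at position quorum-1 is held by
    # at least `quorum` nodes.
    return sorted(match_index, reverse=True)[quorum - 1]


# Leader and two followers are at index 7; the partitioned follower is stuck at 4.
commit_index = highest_majority_index([7, 7, 7, 4])   # 7
```

With the same four nodes and only one follower caught up (`[7, 7, 4, 4]`), the result drops to 4: index 7 is on two nodes, and two out of four is not a majority.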

Safety

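Committed entries are permanent. Any future leader will have them, because only nodes with up-to-date logs can win elections. Commit requires a majority; election requires a majority. Any two majorities share at least one member, so every winning candidate's voters include a node that holds each committed entry, and that node would have refused to vote for a candidate missing it.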

Raft vs Paxos: One Practical Difference

Both achieve equivalent safety. The engineering differs.

Paxos allows any node to become leader and then reconciles the log in a repair phase afterward. This flexibility supports out-of-order log entries, useful in high-latency networks. The cost is a more complex recovery path.

Raft allows only nodes whose logs are at least as up-to-date as a majority's to become leader, which guarantees the winner holds every committed entry. The election itself carries the consistency burden, so no repair phase is needed. Entries commit in order.

For a system design interview, Raft is the concrete, nameable answer for modern systems. Paxos variants show up in older infrastructure. New projects reach for etcd or Consul; existing Google infrastructure runs Multi-Paxos.

What's Actually Running This

System              Protocol       What It Powers
etcd                Raft           Kubernetes cluster state and config
CockroachDB         Raft           Per-range replication (each range has its own Raft group; ranges were 64 MB by default in older releases, 512 MB in newer ones)
TiKV                Raft           TiDB storage layer
Consul              Raft           Service discovery, distributed KV
Google Spanner      Multi-Paxos    Global distributed relational database
Google Chubby       Multi-Paxos    Distributed lock service
Apache ZooKeeper    ZAB            Kafka broker coordination (before KRaft), Hadoop HA

When designing a distributed key-value store or a distributed cache, the replication strategy underneath is usually one of these. Knowing which system uses which protocol, and why, is the kind of detail that lands in an interview.

When and How to Bring It Up

Most system design interviews don't require walking through Raft phases. What they reward is knowing when your design needs consensus and what it costs.

Three situations call for it:

Leader election. If your design has a primary node, explain that election uses a consensus protocol. "We'd run a 5-node cluster backed by Raft, so we tolerate 2 failures and elect a new leader in under a second."

Replicated state machine. When you need strict ordering across replicas. "Writes commit after majority acknowledgment. We're CP in CAP terms: the minority partition stops serving."

Coordination services. Distributed locks, config storage, service discovery. "We'd back this with etcd, which uses Raft, so configuration changes are linearizable."

When you bring up consensus, call out the tradeoff immediately. Quorum writes add latency: a write isn't complete until a majority of nodes acknowledge it. In a 5-node cluster within a single datacenter, that's 1-2ms. Geo-distribution adds the full round-trip between regions to every write. If your interviewer asks why you'd consider eventual consistency instead, that latency number is the answer.
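
To see where those numbers come from, note that the leader already counts as one vote, so a write commits as soon as the fastest quorum-minus-one followers have answered. The sketch below is a back-of-the-envelope model, not a benchmark: the function name and the round-trip times are made up for illustration, and disk sync and processing time are ignored.

```python
from typing import List


def quorum_commit_latency_ms(follower_rtts_ms: List[float]) -> float:
    """Network time until a majority (leader included) has acknowledged a write."""
    cluster_size = len(follower_rtts_ms) + 1      # followers plus the leader
    acks_needed = cluster_size // 2               # quorum minus the leader's own vote
    if acks_needed == 0:
        return 0.0                                # single-node "cluster"
    return sorted(follower_rtts_ms)[acks_needed - 1]


# Five nodes in one datacenter: the second-fastest follower sets the pace.
same_dc = quorum_commit_latency_ms([1.0, 1.2, 1.5, 2.0])          # 1.2
# Five nodes across regions, with only one follower near the leader.
geo = quorum_commit_latency_ms([1.0, 70.0, 72.0, 140.0])          # 70.0
```

The slowest replicas never matter, which is the good news. The bad news is that once a majority cannot be formed inside one region, every write pays a cross-region round trip.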

Something like: "We need strong consistency here because [reason], so I'd use Raft-backed etcd. The tradeoff is that writes are slower than an AP system, and if we lose more than two nodes out of five, writes stop until we recover quorum."

You don't need to cite Paxos phases in most interviews. But if the design involves global distribution, knowing that Google's answer was Multi-Paxos with TrueTime shows depth. For anything else, Raft is the answer. Just use etcd. Stop suffering.

To practice threading consensus into a full system design under pressure, SpaceComplexity runs voice-based mock interviews with rubric feedback on your tradeoff reasoning.

The system design interview prep guide covers how to structure your answer before you get to the consensus layer.

Further Reading