Key-Value Store System Design: From One Hash Map to a Billion Keys

May 27, 202612 min read
interview-prepcareeralgorithmsdsa
Key-Value Store System Design: From One Hash Map to a Billion Keys
TL;DR
  • LSM-tree is the right storage engine for a durable key-value store: WAL for crash recovery, memtable for write speed, SSTables for persistence
  • Bloom filters skip irrelevant SSTables before any disk access, keeping reads fast even with seven levels of compaction
  • Consistent hashing with virtual nodes limits key migration to O(keys/N) when nodes join or leave the cluster
  • Tunable quorum (N/W/R): R+W>N gives strong consistency, W=1/R=1 gives maximum availability, and you should name the tradeoff out loud
  • Gossip, hinted handoff, and Merkle trees cover failure detection, temporary unavailability, and replica anti-entropy respectively
  • State your CAP tradeoff explicitly in the first five minutes or the interviewer will surface it as a gap later
  • Tombstone deletes mean disk space is reclaimed lazily during compaction, not at delete time

The question sounds deceptively easy. "Design a key-value store." You picture a dictionary. A hash map. Five minutes, done, thank you.

Then the follow-ups arrive. Make it persistent. Distributed. Fault-tolerant. Oh, and it needs to handle a million writes per second. Now you're designing DynamoDB, and the clock is running.

A "This is Fine" dog sitting calmly while the room burns, captioned "40-minute system design interview / design WhatsApp"

System design interviews, basically.

This question shows up at every level of the interview loop because it forces you to reason all the way from a single machine to a distributed system without losing the thread. The canonical answer sits on two foundations: the Amazon Dynamo paper (SOSP 2007) and the LSM-tree storage engine behind RocksDB. Know both and you can handle any depth the interviewer throws at you.

Ask These Questions First or Draw the Wrong System

Before you touch the whiteboard, clarify scope. The design changes substantially based on the answers.

Functional requirements:

  • put(key, value): store or overwrite
  • get(key): retrieve, null if missing
  • delete(key): remove
  • Optional: TTL, batch operations, conditional writes

Non-functional requirements:

  • Scale: 100K writes/sec? 1M?
  • Durability: can we lose writes on a crash?
  • Consistency: strong (all replicas agree instantly) or eventual (converge within seconds)?
  • Availability vs. consistency: this is where the CAP tradeoff lives

A typical interview scope is a distributed, persistent store optimized for availability and partition tolerance with tunable consistency. Think DynamoDB-style rather than Redis-style (Redis is primarily in-memory and single-node by default). State your CAP choice explicitly at the start. "I'll design a store that favors availability and partition tolerance, with eventual consistency by default and a tunable quorum." That single sentence tells the interviewer you've thought past the happy path.

The API Fits on a Napkin

def put(key: str, value: bytes) -> None: ... def get(key: str) -> bytes | None: ... def delete(key: str) -> None: ...

Keys are strings. Values are opaque bytes: the store does not care about schema. Deletes are implemented as tombstone markers, not immediate removals. Expected time complexity is O(1) amortized for all three operations.

One Node. Then Fifty. In That Order.

Nail the single-node design first. Two main options.

In-memory hash table. Fast. Simple. O(1) for every operation. Loses data on restart unless you layer persistence on top. This is basically Redis. Good for caches and ephemeral data, not for a durable store.

Log-structured storage with an LSM-tree. Writes go to an in-memory buffer first, flush to disk in sorted runs, and get merged in the background. High write throughput, durable. This is RocksDB, LevelDB, Cassandra's storage tier, and DynamoDB's underlying engine.

For a durable key-value store, the LSM-tree is the right engine. It converts random writes into sequential writes, which are orders of magnitude faster on both spinning disk and SSDs due to reduced write amplification compared to B-trees.

The Write Path: Where Your Data Lives Before It's Safe

Here is what happens on put("user:42", data):

  1. Write-Ahead Log (WAL). The pair is appended to a WAL on disk before anything else. Sequential write, extremely fast. This is your crash-recovery guarantee: if the process dies before flushing, the WAL lets you replay writes on startup.

  2. Memtable. The pair enters an in-memory sorted structure (a skip list or red-black tree). Reads from the memtable are fast because the data is already sorted.

  3. Flush to SSTable. When the memtable hits a size threshold (RocksDB defaults to 64 MB), it is frozen, written to disk as an immutable SSTable (Sorted String Table), and a new memtable takes over. SSTables are write-once and never modified.

  4. Compaction. Background threads merge SSTables, discard overwritten values and tombstones, and maintain sorted order across levels. Two main strategies: leveled compaction (lower read amplification, higher write amplification, better for read-heavy workloads) and tiered/universal compaction (lower write amplification, higher space amplification, better for write-heavy workloads).

Deletes work as tombstones. A delete("user:42") appends a tombstone record. The key disappears from query results immediately, but the disk space is reclaimed only after compaction processes that level.

Diagram showing the write path: put() call flows through WAL append on disk, into the in-memory Memtable, then flushes to an immutable SSTable at L0, and background compaction merges down through L1 to L6.

Every write hits the WAL first, then the memtable. Compaction runs in the background, merging levels and purging tombstones.

Bloom Filters Do the Heavy Lifting on Reads

Reading is harder than writing in an LSM-tree. A key might be in the memtable or in any of several SSTable levels, with newer data taking precedence over older data.

The fix is a Bloom filter per SSTable. A Bloom filter answers "is this key definitely not here?" in O(1). If the filter says no, skip that SSTable. If it says maybe, do the disk read. LevelDB uses 10 bits per key for roughly a 1% false positive rate. False positives waste one disk read. False negatives are impossible.

The read path in order:

  1. Check the active memtable (in-memory).
  2. Check any immutable memtable awaiting flush.
  3. For each SSTable level, check the Bloom filter, then binary-search within the file if needed.

At L0, SSTables may have overlapping key ranges, so you might check several. At L1 and below, key ranges within a level are disjoint, so you check at most one file per level. Seven levels total in RocksDB. That is at most seven disk reads in the worst case for a key that exists, and often zero for a key that does not.

Diagram showing the read path: get() checks the Memtable first, then probes each SSTable level. At each level, the Bloom filter either skips the file (definitely not here) or triggers a binary search within the file (maybe here). L0 may check several overlapping files; L1 through L6 check at most one file per level.

At L0, key ranges overlap so you check several SSTables. At L1 and below, ranges are disjoint per level. Bloom filters skip most of them.

How to Add a Node Without Migrating Half Your Data

One node cannot hold a billion keys. You need to shard across many machines.

Naive sharding: hash(key) % N. The problem is that when N changes, almost every key remaps. Adding one node triggers a full data migration. Every time.

Consistent hashing solves this by mapping both keys and nodes onto a hash ring. A key lands on the first node clockwise from its hash position on the ring. When you add a node, only the keys between the new node and its predecessor need to move. When a node leaves, its keys shift to the next node clockwise. Everything else stays put.

Virtual nodes make the load distribution even. Each physical node gets 100 to 200 random positions on the ring rather than one. This prevents hotspots and means removing one physical node spreads its load evenly across all remaining nodes rather than dumping everything on one neighbor. DynamoDB, Cassandra, and Riak all use virtual nodes.

Diagram showing a hash ring before and after adding a new node E between A and B. Before: four nodes A, B, C, D with keys k1-k4 distributed across arcs. After: node E is added, and only the keys in the arc between A and E move. All other keys stay with their existing nodes.

Adding node E between A and B moves only the keys in that arc. Everything outside that slice stays exactly where it is.

Three Copies, One Coordinator

Partitioning spreads data. Replication protects it.

Choose a replication factor N (typically 3). For each key, the N first distinct physical nodes clockwise from its hash position form the preference list. The first node in the list is the coordinator for that key. It accepts the write and forwards it to the other N-1 nodes asynchronously.

Replication happens in the background after the coordinator gets enough acknowledgments. The coordinator does not wait for all N replicas before returning success. This is why the system stays available when some nodes are slow or down.

Diagram showing a hash ring where key K maps to coordinator node A. Node A replicates asynchronously to nodes B and C, forming the preference list for N=3. Node D is outside the preference list. The right panel shows the quorum options: W=2 R=2 for strong consistency, W=1 R=1 for maximum availability, and W=N for strong durability at the cost of latency.

The coordinator takes the write synchronously, then fans out to replicas asynchronously. You get acknowledged before all three agree.

The Consistency Dial Has Three Knobs

The elegance of the Dynamo model is that consistency is a dial, not a switch. Three values control it: N (total replicas), W (replicas that must acknowledge a write), and R (replicas consulted on a read).

  • Strong consistency: R + W > N. With N=3, W=2, R=2, any read will overlap with at least one node that received the latest write.
  • Maximum availability: W=1, R=1. Writes and reads complete as long as any one node is up. Stale reads are possible.
  • Typical production default: N=3, W=2, R=2. Majority quorum in both directions, tolerates one node failure.

When W=N, every replica must acknowledge a write before you return success. One slow node determines your write latency. When W=1, writes are fastest but a crash before replication means data loss. There is no free lunch between latency and guarantees. Name this tradeoff explicitly in the interview.

Everything Fails. Here's the Playbook.

Gossip protocol for membership. Each node randomly picks a few peers every second and exchanges membership state. Information propagates in O(log N) rounds across the cluster. No central coordinator needed.

Hinted handoff for availability. When the target node for a write is temporarily down, a different node accepts the write and stores a "hint" that this data belongs elsewhere. When the original node recovers, the hint is replayed. Writes can always complete even when the preferred replica is unavailable.

Merkle trees for anti-entropy. How do you verify that two replicas of the same key range are consistent without sending the full dataset? Each node maintains a Merkle tree over its key ranges. Exchange tree roots. If roots match, the ranges are identical. If not, walk down the tree to find the diverging subtree. Merkle trees let you locate the specific out-of-sync keys without transferring the full dataset.

Diagram showing two replica nodes, each holding a Merkle tree over the same key range. Replica 1 has root hash 4f2a; Replica 2 has root hash 8c9f. Matching subtrees are shown in green. The divergent leaf (key k4 with different values) is shown in red. Only k4 needs to be synced.

Step 1: exchange roots. Step 2: walk the first differing subtree. Step 3: sync only the specific keys that diverge. The rest is free.

The Table Every Interviewer Waits For

A good interviewer will ask you to defend each row. Come prepared.

DecisionOption AOption B
Storage engineIn-memory hash table (Redis)LSM-tree (RocksDB)
Consistency levelStrong (R+W>N)Eventual (W=1, R=1)
Compaction strategyLeveled (low read amp)Tiered (low write amp)
Conflict resolutionLast-write-wins (simple)Vector clocks (accurate, client-resolved)
Failure detectionHeartbeat (simple)Gossip protocol (scalable)

"Why last-write-wins over vector clocks?" Because vector clocks require the client to resolve conflicts, which is complex for most applications and often unneeded when concurrent writes to the same key are rare. LWW is simpler and acceptable when the application tolerates rare data loss on concurrent updates.

45 Minutes, Zero Panic: The Clock Breakdown

Most candidates run out of road at consistent hashing and never get to failure handling. Here is how to pace yourself:

  • 0 to 5 min: Requirements and scope. Ask the questions above. State your CAP choice.
  • 5 to 10 min: API design and data model. Small section, but it anchors the rest.
  • 10 to 20 min: Single-node storage engine. WAL, memtable, SSTable, write path and read path.
  • 20 to 30 min: Partitioning and replication. Consistent hashing, virtual nodes, preference list, quorum.
  • 30 to 40 min: Failure handling and consistency. Gossip, hinted handoff, Merkle trees, conflict resolution.
  • 40 to 45 min: Tradeoffs and extensions. Compaction strategy, Bloom filter tuning, TTL, batch writes.

Do not skip the storage engine. Most candidates jump straight to consistent hashing and never explain what each node actually stores. The interviewer will ask. Walking through the write path from WAL to SSTable is what separates a pass from a strong hire.

Explaining distributed systems clearly under time pressure is its own skill, separate from knowing the concepts. SpaceComplexity runs voice-based system design mock interviews with rubric scoring, so you find out where your explanation breaks down before the real thing.

Recap

  • API: put, get, delete with opaque byte values and tombstone deletes
  • Storage engine: WAL for durability, memtable for write speed, SSTables for persistence, compaction for cleanup
  • Read optimization: Bloom filters skip irrelevant SSTables; binary search within files
  • Partitioning: Consistent hashing with virtual nodes, O(keys/N) movement on topology change
  • Replication: N-node preference list, coordinator handles forwarding
  • Consistency: Tunable via W + R > N quorum; name the latency cost explicitly
  • Failure handling: Gossip for detection, hinted handoff for availability, Merkle trees for anti-entropy
  • Conflict resolution: Last-write-wins (simple) or vector clocks (accurate, client-resolved)

Further Reading