Hot Partitions in System Design Interviews: Every Cause, Every Fix

June 3, 2026 · 10 min read
Tags: interview-prep, career, system-design, algorithms
TL;DR
  • Hot partition: one shard absorbs disproportionate traffic, causing CPU saturation and p99 spikes while the rest of the cluster sits idle
  • Three root causes: celebrity/popular keys, sequential time-based keys, and low-cardinality partition keys with skewed distribution
  • Detection: watch per-partition traffic, not cluster averages; alert when any single partition exceeds 20% of total throughput
  • Write sharding (key salting): append a random suffix to scatter writes across N sub-shards, then scatter-gather on reads
  • Hybrid fan-out: fan-out on write for regular users, fan-out on read for accounts above a follower threshold (a cutoff of roughly 1M followers is commonly cited for Twitter)
  • Proactively raise the hot key risk the moment you describe a partition strategy, before the interviewer asks "what happens when a celebrity posts?"

You design a system that handles millions of users. You shard by user ID. You test it. Traffic spreads evenly. You feel good about yourself. Then a celebrity joins your platform and suddenly one node is at 100% CPU while the rest are idle at 8%.

Load testing didn't catch it because your test users don't have 50 million followers. That's a hot partition, one of the most common distributed systems failure modes, and it comes up in nearly every system design interview because most candidates miss it until they're prompted.

What Is a Hot Partition?

A partition is a horizontal slice of your data, routed by a partition key. In a healthy system, traffic spreads roughly evenly across all shards. In practice, some keys get orders of magnitude more traffic than others.

When one partition absorbs significantly more traffic than its peers, it becomes a hot partition. Every partition has a fixed capacity budget: CPU cycles, disk I/O, network bandwidth, provisioned read and write throughput. When one partition's traffic overwhelms that budget, requests queue, p99 latency spikes, and the node starts throttling or dropping work. The other partitions sit largely idle, watching their colleague get crushed, doing nothing.

Ideal:                          Hot partition:

P1 [██████████]                 P1 [██████████]
P2 [██████████]                 P2 [███]
P3 [██████████]                 P3 [████████████████████] ← 3x share
P4 [██████████]                 P4 [██]

DynamoDB throttles when a single partition exceeds 3,000 read capacity units or 1,000 write capacity units per second. Not a lot of runway when a post goes viral. Cassandra shows this as high GC pressure and compaction stalls on a single node. Kafka shows it as growing consumer lag on one partition while others drain normally.

Three Ways Your Partition Key Backfires

The popular-key problem

Some data is just more popular. A celebrity account on a social platform generates millions of reads per hour. A trending product listing gets thousands of concurrent views. A viral hashtag concentrates write traffic on a single key.

You designed the perfect hash. Equal distribution. Beautiful. Then Drake joined your platform.

If you shard by user ID or content ID and some IDs are 1,000x more popular than others, no amount of hashing saves you. The key itself is the bottleneck. This is why it's called the celebrity problem, and it's the first thing interviewers ask about when you describe any fan-out system.

[Image: Distracted Boyfriend meme. Traffic (the boyfriend) stares at the celebrity key while every other partition can only stand there and watch.]

Sequential and time-based keys

This one is subtle in the way a slow gas leak is subtle. Everything looks fine for months. Then you turn on the lights.

If your partition key contains a timestamp, all writes go to the most recent partition. Yesterday's data is cold. Today's data is getting hammered.

Timestamp-based key:

Key: 2026-06-03 → Partition 3  ← ALL writes today
Key: 2026-06-02 → Partition 2  ← cold
Key: 2026-06-01 → Partition 1  ← cold

This shows up in IoT telemetry, financial transaction logs, audit trails, and anything append-heavy. It's also why "never use a bare timestamp as your partition key" is standard advice in DynamoDB, Cassandra, and Bigtable docs alike.

Low-cardinality keys

If you partition by a status field with values like active, inactive, and pending, and 95% of your records are active, you have a hot partition by construction. You need cardinality. You have three values and 95% of your data in one of them. The math does not work in your favor.

The same problem hits any boolean flag, region code, or category ID with highly skewed distribution.

How to Spot One Before Your Users Do

Cluster-level metrics lie.

Your dashboards will tell you everything is fine. P50 latency looks good. Average throughput is healthy. One node is silently drowning, and the aggregates have smeared that fact across the rest of the cluster until it's invisible.

The signal is divergence between partitions, not a cluster-wide percentile. If your median partition handles 200 requests per second and one handles 4,000, that's your hot partition. A common alert threshold: trigger when any single partition absorbs more than 20% of total cluster traffic. In a 10-partition cluster, each partition should sit near 10%, so 20% is twice the fair share. With more partitions, scale the threshold down accordingly.
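A minimal sketch of that check, assuming you can already export requests per second per partition. The function name, the sample numbers, and the dict shape are illustrative, not taken from any monitoring tool:

```python
from statistics import median

def find_hot_partitions(requests_per_partition, share_threshold=0.20):
    """Return partitions whose share of total traffic exceeds the threshold.

    requests_per_partition: dict of partition id -> requests per second.
    share_threshold: fraction of total cluster traffic (0.20 = the 20% rule).
    """
    total = sum(requests_per_partition.values())
    if total == 0:
        return {}
    typical = median(requests_per_partition.values())
    hot = {}
    for partition, rps in requests_per_partition.items():
        share = rps / total
        if share > share_threshold:
            # Record the traffic share and the multiple of the median partition.
            ratio = rps / typical if typical else float("inf")
            hot[partition] = (round(share, 3), round(ratio, 1))
    return hot

# Hypothetical sample: nine partitions at 200 rps, one at 4,000 rps.
sample = {f"P{i}": 200 for i in range(1, 10)}
sample["P10"] = 4000
hot_partitions = find_hot_partitions(sample)
print(hot_partitions)
```

The cluster average here is 580 rps, which looks unremarkable on a dashboard. Comparing each partition to the total and to the median is what exposes P10.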

[Image: Hide the Pain Harold at a laptop with a forced smile. Your on-call engineer looking at the cluster dashboard at 2am: "P50 is fine. P50 is fine. P50 is fi..."]

Tools worth knowing for interviews: CloudWatch Contributor Insights for DynamoDB surfaces which partition keys are consuming throughput. Per-partition consumer lag in Kafka is exposed via JMX or Prometheus. Node-level disk and compaction metrics in Cassandra reveal load skew before latency becomes obvious.

Every Fix Has a Tradeoff

Write sharding (key salting)

Append a random suffix to the partition key. Instead of writing all traffic for celebrity_id to one partition, write to celebrity_id_0 through celebrity_id_99. Writes scatter across 100 logical shards.

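import random

# `db`, `celebrity_id`, and `value` are placeholders for your storage client and data.
NUM_SHARDS = 100

# Write: scatter to one of N shards chosen at random
shard_id = random.randint(0, NUM_SHARDS - 1)
key = f"{celebrity_id}_{shard_id}"
db.write(key, value)

# Read: gather from all shards, then merge (assumes each shard holds a numeric count)
results = [db.read(f"{celebrity_id}_{i}") for i in range(NUM_SHARDS)]
total = sum(results)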

The cost lives entirely on the read side. To reconstruct the full count or feed, you scatter-gather across all 100 shards and merge. This is a good trade when writes vastly outnumber reads, and a bad trade when reads need to be fast.

Adaptive partition splitting

Some databases handle this automatically. DynamoDB's Adaptive Capacity detects sustained hot partitions and splits them, redistributing data across new partition boundaries without application changes. Cassandra's virtual nodes (256 per physical node by default before Cassandra 4.0, which lowered the default to 16) reduce variance at cluster setup time by giving each node many small token ranges instead of one large one.

This is the closest thing to a free fix, but it only stretches so far. It doesn't resolve the celebrity-key problem where a single key is genuinely that much more popular than everything else.

Caching

For read-heavy hot keys, place a cache in front of the hot partition. A Redis cache or in-process cache layer absorbs the majority of reads before requests reach the partition. If 95% of reads for a celebrity's profile hit the cache, the underlying partition sees only 5% of actual traffic.

The tradeoff is staleness and invalidation complexity. A local cache on a fleet of 100 application servers means 100 independent cache entries, each potentially returning slightly different data after a write. Acceptable for eventually-consistent reads. Not acceptable for anything requiring strict read-your-own-writes guarantees.
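To make the read-absorption and staleness tradeoff concrete, here is a small in-process read-through cache with a TTL. It is a sketch under stated assumptions: `TTLCache`, `load_profile`, and the 60-second TTL are hypothetical names and values, and a production setup would more likely use Redis or an existing caching library.

```python
import time

class TTLCache:
    """Tiny read-through cache: serve from memory until an entry expires."""

    def __init__(self, loader, ttl_seconds=5.0, clock=time.monotonic):
        self._loader = loader        # called on a miss, e.g. a database read
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}           # key -> (expires_at, value)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        # Miss or expired: read the backing partition and refresh the entry.
        self.misses += 1
        value = self._loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        # Call after a write so this process stops serving the old value.
        self._entries.pop(key, None)

# Hypothetical backing read that records every trip to the hot partition.
backend_reads = []

def load_profile(key):
    backend_reads.append(key)
    return {"id": key, "name": "celebrity"}

cache = TTLCache(load_profile, ttl_seconds=60.0)
for _ in range(100):
    cache.get("celebrity_42")
print(cache.hits, cache.misses, len(backend_reads))
```

Note that `invalidate` only clears the entry in this one process. With 100 application servers, the other 99 keep serving their copy until the TTL expires, which is exactly the staleness window described above.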

Fan-out on read (pull model)

For social media feeds, instead of pushing a post to every follower's feed at write time, store the post once in the author's partition and let followers fetch it at read time. Readers query the author's partition directly and merge the result with their own non-celebrity feed.

This avoids write amplification across follower shards, but concentrates read traffic on the celebrity's partition. You still need caching in front of the celebrity's data. Congratulations: you traded a write hot key for a read hot key.

Hybrid fan-out

This is the answer interviewers want when you're designing a social media feed. Fan-out on write for regular users, fan-out on read for accounts above a follower threshold. A cutoff of roughly 1 million followers is commonly cited for Twitter's architecture. Regular users get pre-computed feeds pushed to a timeline store at write time. Celebrity content is fetched lazily and merged at read time.

Normal user posts:
  Write → fan-out to follower timelines (pre-computed, fast read)

Celebrity posts:
  Write → store once in author's shard
  Read  → fetch from author's shard + merge with follower's own feed

No single strategy works across the full distribution of users. The hybrid approach acknowledges that directly and handles each case with the strategy that fits it.
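The two paths can be sketched with in-memory dictionaries standing in for the timeline store and the author shards. `HybridFeed` and its methods are hypothetical, and the threshold is a constructor parameter so the example can run with a handful of users:

```python
from collections import defaultdict

class HybridFeed:
    def __init__(self, celebrity_threshold=1_000_000):
        self.celebrity_threshold = celebrity_threshold
        self.followers = defaultdict(set)      # author -> users following them
        self.following = defaultdict(set)      # user -> authors they follow
        self.timelines = defaultdict(list)     # user -> pre-computed feed items
        self.author_posts = defaultdict(list)  # author -> posts, stored once

    def follow(self, user, author):
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author):
        return len(self.followers[author]) >= self.celebrity_threshold

    def post(self, author, timestamp, text):
        item = (timestamp, author, text)
        self.author_posts[author].append(item)
        if not self.is_celebrity(author):
            # Fan-out on write: push into every follower's timeline.
            for user in self.followers[author]:
                self.timelines[user].append(item)

    def read_feed(self, user, limit=10):
        # A set removes duplicates if an author crossed the threshold
        # after some of their posts were already fanned out.
        items = set(self.timelines[user])
        for author in self.following[user]:
            if self.is_celebrity(author):
                # Fan-out on read: pull celebrity posts at query time.
                items.update(self.author_posts[author])
        return sorted(items, reverse=True)[:limit]

feed = HybridFeed(celebrity_threshold=3)
for fan in ("ann", "bob", "cy"):
    feed.follow(fan, "star")
feed.follow("ann", "dee")
feed.post("dee", 1, "hello")
feed.post("star", 2, "new album")
ann_feed = feed.read_feed("ann")
print(ann_feed)
```

In a real system the celebrity read path would sit behind a cache, since every follower's feed read now touches the author's shard.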

Partition key redesign

Sometimes the right move is to rethink the key from scratch. For time-series data, replace the bare timestamp with a compound key: device_id + hour_bucket + random_shard_suffix. This distributes writes across devices, time windows, and shards simultaneously.

Old:  timestamp           → all writes cluster at "now"
New:  device_id + hour + random(0-9)
      → writes spread across time AND shards

This is the highest-leverage fix when you catch the problem at design time. It's also the hardest to retrofit after launch because it changes your data layout. The same principles apply at the shard level; for a deeper look, see the write-up on partitioning strategies and their failure modes.
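One way to build such a compound key, as a sketch. The `#` separator, the hour format, and the function names are assumptions made for illustration, not a convention of any particular database:

```python
import random
from datetime import datetime, timezone

NUM_SHARDS = 10  # matches the random(0-9) suffix in the diagram above

def write_key(device_id, ts, rng=random):
    """Key for a single write: device, hour bucket, random shard suffix."""
    hour_bucket = ts.strftime("%Y-%m-%dT%H")
    shard = rng.randrange(NUM_SHARDS)
    return f"{device_id}#{hour_bucket}#{shard}"

def read_keys(device_id, ts):
    """All keys a reader must query to see one device-hour of data."""
    hour_bucket = ts.strftime("%Y-%m-%dT%H")
    return [f"{device_id}#{hour_bucket}#{s}" for s in range(NUM_SHARDS)]

ts = datetime(2026, 6, 3, 14, 30, tzinfo=timezone.utc)
key = write_key("sensor-17", ts)
candidates = read_keys("sensor-17", ts)
print(key, len(candidates))
```

The read side pays the same scatter-gather cost as write sharding: one device-hour now spans `NUM_SHARDS` keys, so pick the shard count from your write rate rather than by default.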

Match the Fix to the Problem

| Root cause | First fix | Why |
|---|---|---|
| Celebrity or viral key | Caching + hybrid fan-out | Reads dominate; caching absorbs them |
| Time-based keys | Compound key with shard suffix | Breaks temporal clustering at write time |
| Low-cardinality keys | Redesign the partition key | No tactical fix works; there aren't enough distinct values |
| Sudden, unpredictable spike | Adaptive capacity + caching | Buys time while you assess a permanent fix |
| Write-heavy hot key | Write sharding | Distributes writes; scatter-gather acceptable if reads are infrequent |

Surface Hot Partitions Early in Your System Design Interview

The worst outcome in a system design interview is getting a follow-up question you should have preempted. "What happens when a celebrity posts?" is the interviewer's polite way of saying they already spotted the hole and are now waiting to see if you catch it. If you catch it yourself and bring it up proactively, that's a very different signal.

The moment you describe a partition strategy, flag the hot key risk and your mitigation. One sentence is enough: "I want to note that this partition key could create hot partitions in the celebrity case. We can handle reads with caching and use a hybrid fan-out for writes to keep the design tractable."

Don't wait to be asked. Problems that should trigger a hot partition discussion:

  • Social media or news feeds: the celebrity problem is almost guaranteed to come up
  • Real-time leaderboards: high-write keys for top players, time-bucketed partition keys
  • Event streaming pipelines: uneven key cardinality in Kafka topic partitions
  • IoT telemetry: sequential timestamp keys clustering writes at "now"
  • E-commerce flash sales: single product ID absorbing all write traffic for the sale window

Consistent hashing distributes keys uniformly across partitions. But uniform key distribution doesn't mean uniform traffic. A key that maps to P3 still makes P3 a hot partition if it gets 1,000x more requests than every other key. Consistent hashing solves the wrong problem here. For a deeper look at how the algorithm works, see the consistent hashing guide.
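A quick simulation shows the gap between key distribution and traffic distribution. For brevity it uses plain hash-modulo placement rather than a consistent hashing ring; the point about key counts versus request counts is the same either way. The key names and traffic numbers are made up:

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 10

def partition_for(key):
    # Stable hash of the key, reduced to a partition index.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

# 1,000 ordinary keys at 1 request each, plus one key with 1,000x the traffic.
traffic = {f"user_{i}": 1 for i in range(1000)}
traffic["celebrity"] = 1000

keys_per_partition = Counter()
requests_per_partition = Counter()
for key, requests in traffic.items():
    p = partition_for(key)
    keys_per_partition[p] += 1
    requests_per_partition[p] += requests

hot = partition_for("celebrity")
hot_share = requests_per_partition[hot] / sum(requests_per_partition.values())
print("keys per partition:", sorted(keys_per_partition.values()))
print(f"partition {hot} serves {hot_share:.0%} of requests")
```

The key counts come out roughly even, yet the partition holding the celebrity key serves at least half of all requests, because that one key accounts for 1,000 of the 2,000 total.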

If you want to practice surfacing these tradeoffs under pressure rather than reading about them, SpaceComplexity runs rubric-based mock interviews that flag when you skip over failure modes and give you specific feedback on where your system design reasoning broke down.
