Design a Chat App Like WhatsApp: The System Design Walkthrough

May 27, 2026 | 13 min read
Tags: interview-prep, career, distributed-systems, mock-interviews
TL;DR
  • WebSockets are the only viable transport for real-time chat at scale; polling at 500M users burns 100M requests/second saying nothing new.
  • A Redis routing registry maps each user to the chat server holding their open connection, refreshed on every heartbeat with a short TTL.
  • Partition Cassandra messages by conversation_id, not user_id, to avoid hot partitions on high-traffic users or groups.
  • Snowflake IDs give time-ordered, globally unique message keys without a central coordinator, solving sort order and deduplication at once.
  • The offline path writes to Cassandra before acknowledging to the sender, then flushes the queue to the client on reconnect via sequence number.
  • Presence is a heartbeat-plus-TTL pattern; naming the tradeoff between accuracy and load (1.7M heartbeats/second at scale) is a signal interviewers look for.
  • Fan-out on write for small groups, fan-out on read for very large ones; WhatsApp caps at 1,024 members where write fan-out remains tractable.

You have 45 minutes. The interviewer says "design WhatsApp." Two billion users. One hundred billion messages a day. Go.

Most engineers freeze here. Not because they lack the knowledge, but because the scope feels infinite. You don't need to design all of WhatsApp. You need to demonstrate you can think clearly about distributed systems under constraints. This walkthrough covers requirements, architecture, the data model, the tricky bits, and how to talk through all of it in 45 minutes without your brain melting.


Nail the Requirements First

The biggest interview mistake is architecting before clarifying scope. Spend the first three to five minutes narrowing the problem. The interviewer is watching whether you run straight to the whiteboard or ask smart questions first.

The questions that actually matter:

  • One-to-one messages only, or group chats too?
  • Do we need media (images, video)?
  • Read receipts (single tick, double tick, blue tick)?
  • Online presence and last seen?
  • Message history: server-side storage, or is the client the source of truth?
  • End-to-end encryption?

For a typical 45-minute interview, scope it to: 1-1 and group messaging, delivery receipts, online presence, and media sharing. Deprioritize calling, stories, and multi-device sync as extensions you can describe if time allows. You will not build them in 45 minutes. Neither will a team of 200 engineers.

Your functional requirements:

  • Send and receive text messages in real time
  • Deliver to offline users when they reconnect
  • Show message delivery status (sent, delivered, read)
  • Show online/offline presence
  • Support group chats (say, up to 1,000 members)
  • Share media files

Capacity: Make the Numbers Work

Interviewers want to see you anchor decisions in data. Back-of-envelope estimates tie your architecture to reality instead of vibes.

Assume 500 million daily active users, each sending about 40 messages per day. That's 20 billion messages per day, or roughly 230,000 messages per second. Peak traffic is about 2x average, so plan for ~500,000 messages/second at peak.

Average text message: ~200 bytes. Daily storage for messages: roughly 4 TB per day (text only, before encryption overhead). Media is separate and moves through a different pipeline entirely.

Now the connection load. If 10% of DAU are online simultaneously, that's 50 million concurrent WebSocket connections. A tuned server holds about 100,000 connections. You need 500+ chat servers just to hold connections, before redundancy.

Every architectural decision that follows is a response to two numbers: 500K messages/second at peak, and 50M concurrent connections.
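The arithmetic above is short enough to check in a few lines. This is only a sketch of the back-of-envelope math, using the assumptions already stated in this section (500M DAU, 40 messages per user, 200 bytes per message, 10% online, 100K connections per server); the variable names are illustrative.

```python
# Back-of-envelope capacity numbers from the assumptions stated in the text.
DAU = 500_000_000
MESSAGES_PER_USER_PER_DAY = 40
AVG_MESSAGE_BYTES = 200
PEAK_FACTOR = 2
ONLINE_PERCENT = 10
CONNECTIONS_PER_SERVER = 100_000
SECONDS_PER_DAY = 86_400

messages_per_day = DAU * MESSAGES_PER_USER_PER_DAY
avg_messages_per_sec = messages_per_day / SECONDS_PER_DAY
peak_messages_per_sec = avg_messages_per_sec * PEAK_FACTOR
storage_tb_per_day = messages_per_day * AVG_MESSAGE_BYTES / 1e12
concurrent_connections = DAU * ONLINE_PERCENT // 100
# Ceiling division: a partially filled server is still a server.
chat_servers = -(-concurrent_connections // CONNECTIONS_PER_SERVER)

print(f"messages/day:        {messages_per_day:,}")
print(f"avg messages/sec:    {avg_messages_per_sec:,.0f}")
print(f"peak messages/sec:   {peak_messages_per_sec:,.0f}")
print(f"text storage TB/day: {storage_tb_per_day:.1f}")
print(f"concurrent conns:    {concurrent_connections:,}")
print(f"chat servers needed: {chat_servers}")
```

The exact peak figure comes out a little under 500K messages/second; rounding up to 500K is the usual interview move because it leaves margin and is easier to reason with.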

Figure: Capacity estimates table, DAU through heartbeats per second, showing the numbers that drive every decision. The two numbers that drive everything else: 500K msg/s at peak, 50M concurrent connections.


Why WebSockets, Not HTTP

The first real design decision is the transport. You have three options:

  • Short polling: Client asks "any messages?" every few seconds. Simple but wasteful. At 500M users polling every 5 seconds, that's 100M requests/second just to say "nothing new."
  • Long polling: Client opens a request and the server holds it until there's a message. Better. Still expensive: a new HTTP request, with full headers, for every message cycle.
  • WebSockets: A single persistent, full-duplex TCP connection per client. Server pushes when a message arrives. Client pushes when it sends. This is the only viable option for real-time chat at scale.

Paired with event-driven I/O, WebSockets let a chat server hold 100K concurrent connections without 100K threads. The Erlang runtime (which WhatsApp actually uses) takes this further: each connection is a lightweight process that starts at a few hundred machine words of memory (a few kilobytes on a 64-bit system), enabling millions of connections per server.

Figure: Side-by-side comparison of short polling, long polling, and WebSocket transport options. Short polling generates 100M requests per second to say "nothing new." That's a no. WebSocket: one handshake, server pushes.

See push vs pull tradeoffs for a deeper look at why the connection model matters so much at scale.


WhatsApp System Design: Six Moving Parts

At the top level, you have six components:

  1. Clients (iOS, Android, Web) holding persistent WebSocket connections
  2. Chat servers managing those connections and routing messages
  3. A routing registry (Redis) mapping user_id → chat_server_id
  4. Message storage (Cassandra) for message history and offline queues
  5. A presence service tracking online status
  6. A media service handling uploads and CDN distribution

The chat servers are stateful: they hold the open connections. When user A sends a message to user B, the server handling A's connection needs to know which server holds B's connection. That's the routing registry's job. Redis stores user_id → server_id with a short TTL, refreshed on every heartbeat.
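To make the heartbeat-refreshed TTL concrete, here is a minimal in-memory sketch of the registry's behavior. It is a stand-in, not a Redis client: in a real deployment the `heartbeat` call would correspond to something like a Redis `SET` with an expiry. The class name, method names, and the 60-second TTL are assumptions for illustration.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class RoutingRegistry:
    """In-memory stand-in for the user_id -> chat_server_id map with a TTL."""

    def __init__(self, ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[str, float]] = {}

    def heartbeat(self, user_id: str, server_id: str) -> None:
        # Every heartbeat rewrites the entry and pushes the expiry forward.
        self._entries[user_id] = (server_id, self._clock() + self._ttl)

    def lookup(self, user_id: str) -> Optional[str]:
        entry = self._entries.get(user_id)
        if entry is None:
            return None
        server_id, expires_at = entry
        if self._clock() >= expires_at:
            # No heartbeat within the TTL: treat the user as disconnected.
            del self._entries[user_id]
            return None
        return server_id
```

A `None` from `lookup` is the signal to take the offline path rather than attempt a server-to-server forward.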

Figure: High-level WhatsApp architecture: clients, API gateway, chat server cluster, Redis routing registry, Cassandra, presence service, and media service. Six components. The critical insight: chat servers are stateful, so you need a routing registry to connect them.


Tracing One Message End to End

Walk through the happy path: Alice sends "hey" to Bob, both online.

  1. Alice's app sends the message over her open WebSocket to Chat Server 1.
  2. Chat Server 1 writes the message to Cassandra and acknowledges receipt to Alice. Alice's app shows a single tick.
  3. Chat Server 1 looks up Bob's routing entry in Redis: Bob is connected to Chat Server 3.
  4. Chat Server 1 sends the message to Chat Server 3 via an internal RPC call.
  5. Chat Server 3 pushes the message to Bob's open WebSocket.
  6. Bob's app receives the message and sends back an acknowledgment.
  7. Chat Server 3 relays the ack to Chat Server 1, which updates the message status in Cassandra and pushes the status update to Alice. Double tick.
  8. Bob opens the chat. His app sends a "read" event. Blue tick for Alice.

The offline path is the interesting one. If Bob is offline, the Redis lookup in step 3 finds no routing entry, so there is no server to forward to in step 4. Chat Server 1 instead stores the message in an offline queue in Cassandra keyed to Bob's user ID. When Bob reconnects, his client's first action is to request every undelivered message since its last known sequence number. The server flushes the queue, Bob receives the messages, and the acks flow back to the senders.
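The write-before-ack rule and the flush on reconnect can be sketched with plain in-memory structures. Everything here is a simplification under stated assumptions: the `Backend` class, the per-recipient sequence numbering, and the `"pushed"` / `"queued"` return values are hypothetical, and the list and dict stand in for Cassandra and the Redis registry.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Message:
    seq: int        # per-recipient sequence number (hypothetical numbering scheme)
    sender: str
    recipient: str
    body: str


@dataclass
class Backend:
    store: List[Message] = field(default_factory=list)     # stand-in for Cassandra
    routes: Dict[str, str] = field(default_factory=dict)   # stand-in for the Redis registry
    next_seq: Dict[str, int] = field(default_factory=dict)

    def send(self, sender: str, recipient: str, body: str) -> str:
        seq = self.next_seq.get(recipient, 0) + 1
        self.next_seq[recipient] = seq
        # The durable write happens before anything is acknowledged to the sender.
        self.store.append(Message(seq, sender, recipient, body))
        # Routing entry present: forward to that server. Absent: the stored row is the queue.
        return "pushed" if recipient in self.routes else "queued"

    def reconnect(self, user: str, server_id: str, last_seq: int) -> List[Message]:
        # Re-register the route, then return everything the client has not seen yet.
        self.routes[user] = server_id
        return [m for m in self.store if m.recipient == user and m.seq > last_seq]
```

Because the client reports its last known sequence number, a flush that is interrupted halfway can simply be repeated: the server resends anything above that number and nothing below it.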

Figure: Sequence diagram showing the online message path (Alice to Bob via Redis lookup) and the offline path (queue in Cassandra, flush on reconnect). Online path: 8 steps, sub-100ms. Offline path: write before you ack, queue in Cassandra, flush when they come back.


What to Put in the Database

The Cassandra schema is the most scrutinized part of this interview. Get the access patterns right, then let the schema follow.

Your two primary access patterns:

  1. "Give me all messages in conversation X, newest first, paginated by cursor"
  2. "Give me all conversations for user Y, sorted by last activity"

That maps to two tables:

-- Messages: partition by conversation, sort by message_id (Snowflake)
CREATE TABLE messages (
  conversation_id UUID,
  message_id      BIGINT,    -- Snowflake ID (time-sortable)
  sender_id       UUID,
  content         BLOB,      -- encrypted client-side
  type            TINYINT,   -- 0=text, 1=image, 2=video
  status          TINYINT,   -- 0=sent, 1=delivered, 2=read
  PRIMARY KEY (conversation_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);

-- Conversations per user: partition by user
CREATE TABLE user_conversations (
  user_id         UUID,
  last_message_at TIMESTAMP,
  conversation_id UUID,
  type            TINYINT,   -- 0=direct, 1=group
  unread_count    INT,
  -- conversation_id in the key keeps two threads with the same
  -- last_message_at from overwriting each other
  PRIMARY KEY (user_id, last_message_at, conversation_id)
) WITH CLUSTERING ORDER BY (last_message_at DESC, conversation_id ASC);

The message_id is a Snowflake ID: 41-bit timestamp + 10-bit machine ID + 12-bit sequence. This gives you k-sortable IDs that are globally unique without a central coordinator and sort correctly in the database.
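A Snowflake-style generator with the 41/10/12 bit layout described above fits in a few lines. This is a sketch, not a production generator: the epoch constant is an arbitrary assumption, the class and parameter names are illustrative, and a real deployment needs a policy for clock rollback beyond raising an error.

```python
import threading
import time
from typing import Callable

EPOCH_MS = 1_700_000_000_000  # arbitrary custom epoch (assumption); any fixed past instant works


class SnowflakeGenerator:
    """41-bit millisecond timestamp | 10-bit machine ID | 12-bit sequence."""

    def __init__(self, machine_id: int,
                 clock_ms: Callable[[], int] = lambda: time.time_ns() // 1_000_000) -> None:
        if not 0 <= machine_id < 1024:
            raise ValueError("machine_id must fit in 10 bits")
        self._machine_id = machine_id
        self._clock_ms = clock_ms
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            now = self._clock_ms()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards; refusing to issue IDs")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & 0xFFF
                if self._sequence == 0:
                    # 4,096 IDs already issued this millisecond: wait for the next one.
                    while now <= self._last_ms:
                        now = self._clock_ms()
            else:
                self._sequence = 0
            self._last_ms = now
            return ((now - EPOCH_MS) << 22) | (self._machine_id << 12) | self._sequence
```

Because the timestamp occupies the high bits, sorting by the integer sorts by time first, which is what lets `message_id` double as the clustering column.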

Hot partition warning: If you use user_id as the partition key for messages, a popular user's entire history lands on one Cassandra node. Partition by conversation_id instead. Conversations spread naturally.

For media, the client encrypts the file, uploads it directly to blob storage (S3/GCS) via a pre-signed URL from the Media Service, and sends the resulting URL as the message content. The server never sees plaintext media.


Keep the API Small

Three channels.

REST (via API Gateway):

POST   /auth/register               -- phone number + OTP
POST   /groups                      -- create group
PUT    /groups/{id}/members         -- add/remove members
GET    /conversations               -- list threads, cursor-paginated
GET    /conversations/{id}/messages -- message history, cursor-paginated
POST   /media/upload-url            -- get pre-signed URL for media

WebSocket (persistent, after auth):

Client → Server:  { type: "message", to: user_id, content, client_msg_id }
Server → Client:  { type: "message", from: user_id, message_id, content }
Server → Client:  { type: "ack", client_msg_id, status: "sent"|"delivered"|"read" }
Server → Client:  { type: "presence", user_id, status: "online"|"offline" }

The client_msg_id is a UUID generated on the device. The server uses it for deduplication: if the device retransmits after a timeout without receiving an ack, the server recognizes the ID and drops the duplicate. This matters. Networks are rude.
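The deduplication check can be sketched as a bounded map from `client_msg_id` to the server-assigned `message_id`. The `Deduplicator` class, its size limit, and the example IDs are assumptions for illustration; a real server would more likely keep this map in a shared store with an expiry so that retries landing on a different chat server are still recognized.

```python
import itertools
from collections import OrderedDict
from typing import Callable, Tuple


class Deduplicator:
    """Remembers recent client_msg_ids so a retransmit maps to the original message."""

    def __init__(self, max_entries: int = 10_000) -> None:
        self._seen: "OrderedDict[str, int]" = OrderedDict()  # client_msg_id -> message_id
        self._max = max_entries

    def accept(self, client_msg_id: str, assign_id: Callable[[], int]) -> Tuple[int, bool]:
        """Return (message_id, is_new). A retry gets the original ID and is_new=False."""
        if client_msg_id in self._seen:
            return self._seen[client_msg_id], False
        message_id = assign_id()
        self._seen[client_msg_id] = message_id
        if len(self._seen) > self._max:
            self._seen.popitem(last=False)  # evict the oldest remembered ID
        return message_id, True


ids = itertools.count(1000)  # stand-in for a Snowflake generator
dedup = Deduplicator()
first = dedup.accept("c1f0-example", lambda: next(ids))
retry = dedup.accept("c1f0-example", lambda: next(ids))
print(first, retry)
```

Returning the original `message_id` on a retry matters: the server can re-send the same ack, so the client stops retransmitting without a second copy ever being stored.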


Presence Is Harder Than It Looks

Presence is deceptively hard. At 500M DAU with 10% online, 50M clients are sending heartbeats continuously. At one heartbeat every 30 seconds, that's 50M / 30 ≈ 1.7 million heartbeats per second.

The design: clients ping the server every 30 seconds. The chat server updates a Redis key presence:{user_id} with a TTL of 60 seconds. If the key expires (no heartbeat for 60 seconds), the user appears offline.

When user A's contact B comes online, A needs to know. The presence service uses Redis Pub/Sub: when B's status changes, the service publishes to a channel. Chat servers holding connections of B's contacts subscribe and push the update.
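The heartbeat-plus-TTL pattern and the "publish only on change" behavior can be sketched together. This is an in-memory illustration under assumptions: `PresenceTracker`, its `sweep` method, and the `publish` callback are hypothetical names, with `publish` standing in for a Redis Pub/Sub publish and `sweep` standing in for whatever mechanism notices that a key has expired.

```python
import time
from typing import Callable, Dict


class PresenceTracker:
    """Heartbeat-plus-TTL presence that publishes only on status changes."""

    def __init__(self, publish: Callable[[str, str], None],
                 ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._publish = publish  # stand-in for publishing to a pub/sub channel
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at: Dict[str, float] = {}

    def heartbeat(self, user_id: str) -> None:
        now = self._clock()
        was_online = self._expires_at.get(user_id, 0.0) > now
        self._expires_at[user_id] = now + self._ttl
        if not was_online:
            # Only the offline -> online transition is announced, not every heartbeat.
            self._publish(user_id, "online")

    def sweep(self) -> None:
        """Mark users offline whose TTL has lapsed."""
        now = self._clock()
        expired = [u for u, t in self._expires_at.items() if t <= now]
        for user_id in expired:
            del self._expires_at[user_id]
            self._publish(user_id, "offline")
```

The point of the sketch is the asymmetry: 1.7M heartbeats per second hit the store, but only the much rarer status transitions fan out to contacts.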

This is a good place to make a tradeoff explicit in your interview. Exact presence is expensive. Many apps show "last seen" rather than live online/offline to reduce the heartbeat load. Twitter/X shows "last active X hours ago." Slack shows presence but throttles updates aggressively for large workspaces. Name the tradeoff; don't just pick one silently.

See building heartbeat systems for the reliability guarantees and failure detection details.


Group Fan-Out: Pick Your Poison

Groups introduce fan-out: one message needs to reach potentially 1,000 members.

Fan-out on write (preferred for small groups): When a message arrives, the server immediately looks up all group members and pushes the message to each online member's chat server. Offline members get queued entries. This is write-heavy but makes reads cheap.

Fan-out on read (preferred for very large groups, like Slack channels): Store the message once. When a member opens the conversation, the server fetches it. Read-heavy but simple storage. Used when group size makes write fan-out impractical.

WhatsApp groups cap at 1,024 members, where write fan-out is still manageable. At that scale, a single message generates up to 1,023 delivery attempts (every member except the sender). At 500K messages/second with an average group size of 5, you get roughly 2.5M deliveries/second, or about 5,000 per server across 500 chat servers, which leaves headroom.
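Fan-out on write comes down to splitting the member list into "online, grouped by chat server" and "offline, to be queued." The sketch below assumes a plain dict as the routing table and a hypothetical `plan_fan_out` helper; grouping by server is a design choice that lets the sender's server make one internal call per destination server instead of one per member.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def plan_fan_out(sender: str, members: Iterable[str],
                 routes: Dict[str, str]) -> Tuple[Dict[str, List[str]], List[str]]:
    """Split group members into per-server delivery batches and an offline list.

    routes maps user_id -> chat_server_id for currently connected users.
    """
    per_server: Dict[str, List[str]] = defaultdict(list)
    offline: List[str] = []
    for member in members:
        if member == sender:
            continue  # the sender already has the message
        server = routes.get(member)
        if server is None:
            offline.append(member)  # queued for delivery on reconnect
        else:
            per_server[server].append(member)
    return dict(per_server), offline


routes = {"alice": "chat-1", "bob": "chat-3", "carol": "chat-3"}
batches, queued = plan_fan_out("alice", ["alice", "bob", "carol", "dave"], routes)
print(batches, queued)
```

For very large groups the same function would produce thousands of entries per message, which is the point at which storing once and fetching on read becomes the cheaper shape.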

For end-to-end encryption in groups, WhatsApp uses the Signal Protocol's Sender Key scheme: the sender generates one symmetric key for the group and encrypts it individually for each member's public key. Subsequent messages use the shared key, so encryption cost doesn't grow with group size.

Figure: Fan-out on write: one message forks into N delivery attempts to online members, with offline members queued in Cassandra. Write fan-out for 1,024 members is manageable. For a Slack channel with 50K members, you'd switch to fan-out on read.


The Bottlenecks Worth Calling Out

Figure: Tweet reading "Why is the AWS outage affecting you? I thought you were decentralized?" with Squid Game characters. Every distributed system has a single point of failure that its designers insisted couldn't possibly be one.

The routing registry is a single point of failure. Mitigate with Redis Cluster (sharded) and a local cache on each chat server for hot users.

The Cassandra message table under a single active group generates a hot partition. Shard by (conversation_id, bucket) where bucket is a time-based shard (e.g., week number).
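A time-based bucket can be derived from the message timestamp. The sketch below assumes ISO week numbering and a hypothetical `week_bucket` helper; the bucket width (a week here) is a tuning choice, not something the schema dictates.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple


def week_bucket(sent_at: datetime) -> int:
    """Encode the ISO year and week as one integer, e.g. week 22 of 2026 -> 202622."""
    iso_year, iso_week, _weekday = sent_at.isocalendar()
    return iso_year * 100 + iso_week


def partition_key(conversation_id: str, sent_at: datetime) -> Tuple[str, int]:
    # Composite partition key: one Cassandra partition per conversation per week.
    return (conversation_id, week_bucket(sent_at))


sent = datetime(2026, 5, 27, 12, 0, tzinfo=timezone.utc)
print(partition_key("conv-123", sent))
print(partition_key("conv-123", sent + timedelta(days=7)))
```

The cost of bucketing shows up on reads: paging back through history means querying the current bucket first and then stepping to older buckets once it is exhausted.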

Chat server failover: When a chat server crashes, 100K connections drop simultaneously. Clients detect the TCP close and reconnect within seconds. The new server re-registers each client in Redis and queries Cassandra for undelivered messages. No messages are lost because they were written to Cassandra before the server acknowledged to the sender. The reconnect storm is real: 100K simultaneous reconnects to the remaining servers. Handle it with exponential backoff and jitter on the client.
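Exponential backoff with jitter is a small amount of client code. The sketch below uses the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing ceiling; the function name, base delay, and cap are assumptions for illustration.

```python
import random
from typing import Optional


def reconnect_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                    rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before reconnect attempt number `attempt` (0-based).

    Full jitter: uniform in [0, min(cap, base * 2**attempt)].
    """
    rng = rng if rng is not None else random.Random()
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


# Delays a single client might wait across its first six attempts.
delays = [reconnect_delay(attempt) for attempt in range(6)]
print([round(d, 2) for d in delays])
```

The randomness is what spreads the storm: without it, 100K clients that disconnected at the same instant would all retry at the same instants too.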

Geographic distribution: Put chat servers in every major region. Route users to the nearest region. Cross-region messages incur one extra hop (sender's region to recipient's region), adding 30 to 80ms depending on datacenter distance. Worth it to keep most messages sub-50ms.


How to Talk Through This Under Time Pressure

The interview is a conversation, not a presentation. Narrate your reasoning, not just your conclusions. Say "I'm choosing Cassandra here because messages are append-only and we need high write throughput; the tradeoff is no multi-row transactions" rather than just drawing a Cassandra box.

A rough 45-minute clock:

  • 0-5 min: Requirements and scope. Ask questions. Write down the decisions you made.
  • 5-10 min: Capacity estimates. Drive the numbers yourself; don't wait to be asked.
  • 10-25 min: High-level architecture and message flow. The WebSocket choice, the routing registry, the happy path and offline path.
  • 25-35 min: Data model and API. Schema decisions, Snowflake IDs, the fan-out tradeoff.
  • 35-45 min: Deep dive on whatever the interviewer cares about most. Presence, group scaling, encryption, failover. Follow their lead.

If you run out of time, explicitly say what you'd do next: "Given more time I'd detail the media pipeline, multi-device sync, and how we handle the reconnect storm." Showing you know what's left is as valuable as having done it.


The Short Version

  • Scope the problem before drawing anything. The requirements conversation is scored.
  • WebSockets are the only viable transport for real-time chat at scale.
  • Chat servers are stateful. A routing registry (Redis) maps users to their server.
  • Partition messages by conversation_id in Cassandra. Snowflake IDs give time-ordered, globally unique message keys.
  • The offline path: write before ack, queue for offline users, flush on reconnect.
  • Presence is a heartbeat-plus-TTL problem. Explicit tradeoff between accuracy and load.
  • Group fan-out: fan-out on write for small groups, fan-out on read for huge ones.
  • Call out bottlenecks (hot partitions, routing registry SPOF, reconnect storms) and how you'd address them.
  • Narrate your tradeoffs. "I chose X over Y because Z" is what gets written on the scorecard.

If you want to rehearse talking through this out loud under time pressure, SpaceComplexity runs voice-based system design mock interviews with rubric feedback on your communication, not just whether the diagram is right.


Further Reading