Design Zoom: The Video Conferencing System Design Interview

In December 2019, Zoom had 10 million daily meeting participants. By April 2020, that number was 300 million. The infrastructure didn't collapse. Engineers kept their cameras off, as usual, but the video kept going.

Surviving that jump isn't luck. It's the result of deliberate architectural choices: where to put the compute, how to route media, and what to trade away when the network gets ugly.

This walkthrough covers the full design: protocol stack, routing models, data model, scaling problems, and the tradeoffs interviewers actually want to hear. The 45-minute clock is at the end.

[Image: a developer's Monday standup meeting bingo card, featuring squares like "can everyone hear me?" and "you're on mute"]
The distributed system keeping this call alive is more interesting than anyone on it. (r/ProgrammerHumor)

Start With the Constraints

Before any diagrams, nail the scope. Video conferencing systems vary wildly depending on what you include.

Functional requirements:

Create and join meetings via a shareable link
Real-time audio and video with low latency (under 150ms target)
Up to 1,000 participants per meeting (say 50 with video on)
Screen sharing
Text chat within a meeting
Cloud recording (stored, not just local)
Meeting metadata: title, host, scheduled time, access controls

Non-functional requirements:

99.99% availability for the media path
P95 video latency under 150ms
Support 10 million simultaneous participants globally
Graceful degradation under network loss (audio-only fallback)
E2EE optional (note the tradeoff, discussed below)

Clarify the load. 10 million concurrent participants, in meetings averaging 5 people, is 2 million simultaneous sessions. That is your dimensioning number.
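
The arithmetic is worth writing down once, because the same numbers feed every later capacity estimate. A back-of-envelope sketch: the participant count and meeting size come from the requirements above, while the share of participants sending video and the per-stream bitrate are illustrative assumptions, not Zoom figures.

```python
CONCURRENT_PARTICIPANTS = 10_000_000  # from the non-functional requirements
AVG_MEETING_SIZE = 5                  # average participants per meeting


def concurrent_meetings(participants: int, avg_size: int) -> int:
    """Number of simultaneous sessions the system must host."""
    return participants // avg_size


def aggregate_ingress_gbps(participants: int, video_share: float,
                           kbps_per_stream: float) -> float:
    """Total upstream video traffic in Gbps.

    video_share and kbps_per_stream are assumptions made up for this
    estimate; plug in your own numbers during the interview.
    """
    return participants * video_share * kbps_per_stream / 1_000_000


sessions = concurrent_meetings(CONCURRENT_PARTICIPANTS, AVG_MEETING_SIZE)
ingress = aggregate_ingress_gbps(CONCURRENT_PARTICIPANTS, 0.5, 1_000)
```

With half of all participants sending a 1 Mbps stream, the ingress estimate comes out at 5,000 Gbps, which is why media has to terminate at regional edges rather than in one data center.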

The Two Planes You Have to Separate

Every video conferencing system has two completely different problems running in parallel. Conflate them and your design falls apart.

The signaling plane handles control messages: join, leave, mute, hand-raise, chat, SDP/ICE negotiation. This is boring infrastructure: WebSockets, a message broker, a database. Latency matters here, but not at the per-packet level the media plane demands.

The media plane carries the actual audio and video packets. This is the hard part. It demands low-latency UDP paths, specialized media servers, and adaptive encoding logic. Getting this wrong makes calls choppy.

Draw two separate boxes in your diagram from the start. Interviewers notice when candidates conflate these, and it colors everything that follows. One giant "video server" box is not an architecture.

The Protocol Stack (Briefly, Because It Matters)

WebRTC is the foundation. It handles browser-to-server (and peer-to-peer) real-time media. The relevant pieces:

ICE (Interactive Connectivity Establishment): finds the best network path between two endpoints by gathering candidates (local IPs, server-reflexive IPs from STUN, relayed IPs from TURN) and testing them in priority order.
STUN (Session Traversal Utilities for NAT): tells a client its public IP as seen from the outside. Works for roughly 80% of connections.
TURN (Traversal Using Relays around NAT): relays all media through a server when direct paths fail. It succeeds in nearly every case, but every packet your users send now takes a detour through your infrastructure. Use it as a fallback, not a default.
RTP (Real-time Transport Protocol): carries audio/video packets over UDP. Sequence numbers and timestamps let receivers reorder and buffer correctly.
RTCP (RTP Control Protocol): side-channel feedback. Reports packet loss, jitter, and round-trip time. The SFU uses this to make adaptive bitrate decisions. Think of it as the media plane's constant health check.
DTLS-SRTP: encrypts the media. Mandatory in WebRTC.
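
To make the RTP bullet concrete, here is a minimal sketch of parsing the 12-byte fixed RTP header defined in RFC 3550 and reordering a small burst by sequence number. The names `RtpHeader`, `parse_rtp_header`, and `reorder` are hypothetical, and the reordering ignores 16-bit sequence wraparound, which a real jitter buffer has to handle.

```python
import struct
from typing import NamedTuple


class RtpHeader(NamedTuple):
    version: int
    marker: bool
    payload_type: int
    sequence_number: int
    timestamp: int
    ssrc: int


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Parse the 12-byte fixed RTP header (RFC 3550, section 5.1)."""
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the 12-byte fixed header")
    # Network byte order: two flag bytes, 16-bit sequence number,
    # 32-bit timestamp, 32-bit SSRC.
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return RtpHeader(
        version=b0 >> 6,            # top two bits, always 2 for RTP
        marker=bool(b1 & 0x80),     # often marks the last packet of a frame
        payload_type=b1 & 0x7F,     # identifies the codec
        sequence_number=seq,
        timestamp=ts,
        ssrc=ssrc,
    )


def reorder(packets: list[bytes]) -> list[RtpHeader]:
    """Sort a small burst by sequence number (no wraparound handling)."""
    headers = (parse_rtp_header(p) for p in packets)
    return sorted(headers, key=lambda h: h.sequence_number)
```

The sequence number is what lets the receiver detect loss and reordering; the timestamp is what lets it schedule playout.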

The signaling channel is separate from all of this. It runs over a persistent WebSocket connection and carries SDP offers/answers plus ICE candidates so the two peers (or client and SFU) can negotiate a connection before any media flows.

[Figure: ICE connection establishment. Client A and client B discover their public IPs via STUN, exchange ICE candidates through the signaling server, then establish a direct UDP media path or fall back to a TURN relay.]
STUN finds your public IP. TURN relays your packets when the direct path is blocked. The signaling channel stitches both sides together before any media flows.
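
"Testing candidates in priority order" has a precise meaning. RFC 8445 gives a formula for candidate priority, and its recommended type preferences are what push TURN relays to the bottom of the list. A small sketch follows: the candidate addresses are made-up documentation IPs, and real ICE goes one step further and ranks candidate pairs rather than individual candidates.

```python
# Recommended type preferences from RFC 8445, section 5.1.2.2.
TYPE_PREFERENCE = {"host": 126, "prflx": 110, "srflx": 100, "relay": 0}


def candidate_priority(cand_type: str, local_preference: int = 65535,
                       component_id: int = 1) -> int:
    """priority = 2^24 * type pref + 2^8 * local pref + (256 - component ID)."""
    return ((TYPE_PREFERENCE[cand_type] << 24)
            + (local_preference << 8)
            + (256 - component_id))


# Hypothetical candidates gathered by one client.
candidates = [
    {"type": "relay", "address": "203.0.113.10"},   # allocated on a TURN server
    {"type": "host", "address": "192.0.2.15"},      # local interface
    {"type": "srflx", "address": "198.51.100.7"},   # public IP learned via STUN
]

ranked = sorted(candidates,
                key=lambda c: candidate_priority(c["type"]),
                reverse=True)
order = [c["type"] for c in ranked]
```

The resulting order is host, then server-reflexive, then relay, which matches the "TURN as fallback" rule without any special-case logic.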

P2P vs MCU vs SFU: This Is the Core Tradeoff

This is the most important decision in the design. Most candidates skip it or get it wrong.

Peer-to-peer (P2P): every client sends its stream directly to every other client. For a 5-person call, each client sends 4 streams and receives 4 streams. Per-client upload bandwidth grows as O(n), and the number of connections in the mesh grows as O(n^2). Works for 2-person calls. Falls apart at 5+. Your ISP's feelings are hurt.

MCU (Multipoint Control Unit): all clients send one stream to the server. The server decodes, composites, and re-encodes a single mixed stream back to each participant. Client bandwidth stays flat regardless of participant count. The cost: the server does transcoding work proportional to participant count. CPU-intensive, expensive, latency-additive (200-400ms). Mostly used for telephony and broadcast.

SFU (Selective Forwarding Unit): all clients send one stream to the server. The server forwards individual streams to each recipient without mixing or re-encoding. Clients receive N-1 streams and handle their own compositing layout. Server CPU stays low because there is no transcoding.

Zoom and Google Meet both use an SFU architecture. This is the right answer for a general-purpose video conferencing system at scale.

The catch: in a 50-person meeting with video on, each client would receive 49 streams. Nobody wants 49 video boxes. Their browser really doesn't want 49 video boxes. The SFU solves this with active speaker detection and stream subscription management: it only forwards the streams the client is actually rendering, typically the active speaker plus a few others. When you switch to gallery view, the SFU changes which streams it forwards to you.
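
Stream subscription can be sketched as a ranking problem: for each viewer, pick a handful of streams worth forwarding. The function below is a hypothetical illustration, not how any particular SFU does it. It assumes the SFU tracks a smoothed audio level per participant and lets a viewer pin people.

```python
def select_streams(audio_levels: dict[str, float], viewer: str,
                   max_streams: int = 4,
                   pinned: frozenset[str] = frozenset()) -> list[str]:
    """Choose which participants' video to forward to one viewer.

    audio_levels maps participant id to a recent smoothed audio level
    (an assumed input; real SFUs derive it from RTP audio-level
    extensions or by decoding audio).
    """
    others = [p for p in audio_levels if p != viewer]
    # Pinned participants first, then loudest speakers; the id breaks ties
    # so the result is stable between calls.
    ranked = sorted(others,
                    key=lambda p: (p not in pinned, -audio_levels[p], p))
    return ranked[:max_streams]
```

Calling this again whenever the active speaker changes, or when the viewer changes layout, gives the SFU the new forwarding set for that viewer.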

[Figure: P2P mesh connections vs MCU central transcoding box vs SFU selective forwarding, three architectures compared across client bandwidth, server CPU, and latency.]
P2P melts upload bandwidth. MCU melts server CPU. SFU keeps both manageable by doing less work on the server and smarter selection on what to forward.
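
The comparison becomes concrete when you count streams per client. This sketch follows the definitions above and, for the SFU case, shows the worst case before any stream selection is applied; the function names are hypothetical.

```python
def per_client_streams(topology: str, n: int) -> tuple[int, int]:
    """Return (upload_streams, download_streams) for one client in an n-person call."""
    if topology == "p2p":
        return (n - 1, n - 1)   # one copy sent to, and received from, every peer
    if topology == "mcu":
        return (1, 1)           # one stream up, one mixed stream down
    if topology == "sfu":
        return (1, n - 1)       # one stream up, every other stream forwarded down
    raise ValueError(f"unknown topology: {topology}")


def mesh_connections(n: int) -> int:
    """Total peer connections in a full P2P mesh."""
    return n * (n - 1) // 2
```

For a 5-person call the mesh needs 10 connections and 4 uploads per client, while the SFU needs a single upload per client regardless of meeting size.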

Simulcast vs SVC: Two Answers to the Same Problem

Even with stream selection, bandwidth varies by participant. The person on a phone on LTE cannot receive the same stream as the person on gigabit fiber.

Two approaches:

Simulcast: the sender encodes the same video at multiple resolutions (say 1080p, 720p, and 360p) and sends all three streams up. The SFU picks which resolution to forward to each subscriber based on their RTCP feedback. More upload bandwidth from the sender, simpler SFU logic.

SVC (Scalable Video Coding): the sender encodes a single layered stream with a base layer (360p) and enhancement layers (720p, 1080p). Each layer is additive. The SFU can drop enhancement layers to reduce bandwidth without the sender changing anything. More complex codec, lower upload cost.

Zoom is reported to use SVC, monitoring RTCP receiver reports to make downgrade decisions in real time. Codec choice matters here: H.264 SVC, VP9, and AV1 support spatial layers, while VP8 offers only temporal layers. When packet loss spikes above a threshold, the SFU stops forwarding enhancement layers within milliseconds. Simulcast is the fallback for older clients.
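
The per-subscriber decision is the same for simulcast and SVC: forward the best layer the receiver can afford, and step down when loss climbs. A minimal sketch follows. The layer bitrates and the 5% loss threshold are illustrative assumptions, and a production SFU would add hysteresis so it does not flap between layers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    kbps: int


# Ascending order; bitrates are assumed values for illustration only.
LAYERS = [Layer("360p", 500), Layer("720p", 1500), Layer("1080p", 3000)]


def pick_layer(estimated_kbps: float, loss_fraction: float,
               loss_threshold: float = 0.05) -> Layer:
    """Pick the layer to forward to one subscriber.

    estimated_kbps and loss_fraction stand in for values the SFU would
    derive from that subscriber's RTCP receiver reports.
    """
    affordable = [layer for layer in LAYERS if layer.kbps <= estimated_kbps]
    # The base layer is always forwarded, even if it does not fit.
    index = len(affordable) - 1 if affordable else 0
    # Heavy loss: drop one layer below what bandwidth alone would allow.
    if loss_fraction > loss_threshold and index > 0:
        index -= 1
    return LAYERS[index]
```

With simulcast the SFU switches which of the sender's streams it forwards; with SVC it stops forwarding the enhancement layers above the chosen one.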

How the Services Fit Together

Clients connect to the nearest edge over WebSocket for signaling and establish a DTLS-SRTP session with the nearest SFU for media.

The Signaling Service is stateless and can run as many replicas as needed. It handles SDP/ICE exchange between client and SFU. Meeting state lives in Redis, which every signaling instance can read.

The Meeting Service owns CRUD for meetings: create, schedule, close, enforce capacity limits. It writes to PostgreSQL and publishes events to Kafka.

The Recording Service consumes the raw RTP stream directly from the SFU, runs server-side decoding and compositing, writes to S3, and updates meeting records when the recording is ready.

[Figure: full system architecture. Clients connect to the edge WebSocket gateway and the SFU cluster; core services handle signaling, meeting management, and recording; the data layer uses Redis for hot state, PostgreSQL for durable records, S3 for recordings, and Kafka for events.]
Two distinct paths through the system: the signaling plane (blue) and the media plane (amber). Keep them visually separate in your interview diagram.

What Lives in Postgres vs Redis

-- PostgreSQL DDL. Enum columns need a named type created first.
CREATE TYPE meeting_status AS ENUM ('scheduled', 'active', 'ended');
CREATE TYPE recording_status AS ENUM ('processing', 'ready', 'failed');

CREATE TABLE meetings (
  id UUID PRIMARY KEY,
  host_user_id UUID NOT NULL,
  title TEXT,
  join_link_token TEXT UNIQUE NOT NULL,  -- random token, short-lived
  status meeting_status,
  max_participants INT DEFAULT 100,
  scheduled_start TIMESTAMPTZ,
  started_at TIMESTAMPTZ,
  ended_at TIMESTAMPTZ
);

-- One row per user per meeting session.
CREATE TABLE participants (
  id UUID PRIMARY KEY,
  meeting_id UUID NOT NULL REFERENCES meetings(id),
  user_id UUID NOT NULL,
  joined_at TIMESTAMPTZ,
  left_at TIMESTAMPTZ,             -- NULL while the user is still in the call
  is_host BOOLEAN DEFAULT FALSE
);

CREATE TABLE recordings (
  id UUID PRIMARY KEY,
  meeting_id UUID NOT NULL REFERENCES meetings(id),
  s3_key TEXT NOT NULL,            -- object key in the recordings bucket
  duration_seconds INT,
  status recording_status,
  created_at TIMESTAMPTZ
);

Redis holds the hot state: who is currently in the meeting, who has their mic muted, the active speaker. Structure it as a hash per meeting keyed by meeting:{id}:participants. This data is ephemeral. Lose it on restart and the meeting state rebuilds from participant join events. Which is fine. The meeting was probably going to end eventually anyway.
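
Rebuilding from events is simple enough to sketch. The version below replays an ordered event list into the per-meeting participant map that would be written back to the Redis hash. The event shape and the `leave` and `mute` event types are assumptions added for illustration; the text above only promises join events.

```python
def redis_key(meeting_id: str) -> str:
    """Key of the per-meeting hash, following the naming scheme above."""
    return f"meeting:{meeting_id}:participants"


def rebuild_participants(events: list[dict]) -> dict[str, dict]:
    """Replay ordered events into {user_id: {"muted": bool}}.

    Each event is a dict like {"type": "join", "user_id": "u1"}; this
    shape is hypothetical.
    """
    state: dict[str, dict] = {}
    for event in events:
        user = event["user_id"]
        if event["type"] == "join":
            state[user] = {"muted": False}
        elif event["type"] == "leave":
            state.pop(user, None)
        elif event["type"] == "mute" and user in state:
            state[user]["muted"] = event["muted"]
    return state
```

In a real deployment each entry would be stored as one field of the hash under `redis_key(meeting_id)`, so a single read returns the whole roster.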

The API (One Endpoint Does All the Work)

POST /meetings              -- create (returns join_link_token)
GET  /meetings/:id          -- fetch meeting details
POST /meetings/:id/join     -- exchange auth for SFU address + ICE config
POST /meetings/:id/leave    -- mark participant left, update Redis
POST /meetings/:id/record   -- start recording (host only)
GET  /meetings/:id/participants -- current participant list (REST or WebSocket subscription)

The /join response is the critical one. It returns the SFU WebSocket endpoint, the ICE server list (STUN URLs, TURN credentials), and the SDP offer from the SFU. The client completes the ICE handshake and then starts sending media. Everything else in this API is housekeeping.
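
A sketch of what that response body might look like. Every field name, hostname, and credential here is a placeholder invented for illustration; only the shape of each `ice_servers` entry (`urls`, `username`, `credential`) mirrors the standard WebRTC `RTCIceServer` dictionary, so a browser client can pass the list straight into its peer connection configuration.

```python
import json


def build_join_response(meeting_id: str, sfu_host: str, turn_username: str,
                        turn_credential: str, sdp_offer: str) -> dict:
    """Assemble a hypothetical /join response payload."""
    return {
        "meeting_id": meeting_id,
        # Where the client opens its WebSocket for this meeting.
        "sfu_endpoint": f"wss://{sfu_host}/signal",
        "ice_servers": [
            {"urls": ["stun:stun.example.com:3478"]},
            {
                "urls": ["turn:turn.example.com:3478?transport=udp"],
                # Short-lived credentials issued per join.
                "username": turn_username,
                "credential": turn_credential,
            },
        ],
        "sdp_offer": {"type": "offer", "sdp": sdp_offer},
    }


response = build_join_response(
    meeting_id="m-123",
    sfu_host="sfu-1.example.com",
    turn_username="1700000000:u-42",
    turn_credential="placeholder-secret",
    sdp_offer="v=0\r\n",  # truncated placeholder, not a usable SDP
)
body = json.dumps(response)
```

Issuing TURN credentials per join rather than baking them into the client keeps the relay from becoming a free public proxy.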

Where Things Break at Scale

SFU fan-out for large meetings. A 1,000-person webinar has one speaker and 999 listeners. Pushing 999 outbound streams for one session through a single SFU concentrates all the egress bandwidth, and the whole failure domain, on one node. The solution is a cascaded SFU (also called an SFU mesh or media forwarding tree). Edge SFUs receive from the speaker's SFU and forward to local participants. This distributes egress across the tree.

Signaling server fan-out. When the host mutes everyone, the signaling service needs to deliver that event to all participants in real time. With 1,000 participants potentially spread across 50 signaling server instances, the event has to fan out via a pub/sub layer. A channel per meeting in Redis pub/sub, or Kafka keyed by meeting ID (a Kafka topic per meeting does not scale to millions of meetings), handles this without the signaling instances needing to know about each other.
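
The fan-out pattern can be sketched in-process. `InMemoryPubSub` is a stand-in for Redis pub/sub and the outbox lists stand in for WebSocket connections; both are assumptions for illustration. The point is that each signaling instance subscribes once per meeting and then delivers to its own local sockets.

```python
from collections import defaultdict
from typing import Callable


class InMemoryPubSub:
    """Stand-in for a broker: channel name -> subscriber callbacks."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, channel: str, callback: Callable[[dict], None]) -> None:
        self._subscribers[channel].append(callback)

    def publish(self, channel: str, message: dict) -> int:
        """Deliver to every subscriber and return how many received it."""
        for callback in self._subscribers[channel]:
            callback(message)
        return len(self._subscribers[channel])


class SignalingInstance:
    """One stateless signaling replica holding some participants' sockets."""

    def __init__(self, name: str, broker: InMemoryPubSub) -> None:
        self.name = name
        self.broker = broker
        # meeting_id -> {user_id: outbox list standing in for a WebSocket}
        self.outboxes: dict[str, dict[str, list[dict]]] = defaultdict(dict)

    def attach(self, meeting_id: str, user_id: str) -> None:
        if meeting_id not in self.outboxes:
            # First local participant for this meeting: subscribe exactly once.
            self.broker.subscribe(
                f"meeting:{meeting_id}",
                lambda msg, m=meeting_id: self._deliver(m, msg),
            )
        self.outboxes[meeting_id][user_id] = []

    def _deliver(self, meeting_id: str, message: dict) -> None:
        for outbox in self.outboxes[meeting_id].values():
            outbox.append(message)
```

The broker sees one delivery per instance, not one per participant, so a 1,000-person meeting spread over 50 instances costs 50 broker deliveries per event.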

Hot meeting join storms. A meeting link shared publicly can produce thousands of simultaneous join requests. The Meeting Service becomes the bottleneck. Solutions: rate-limit joins per meeting, pre-warm meeting state in Redis on meeting creation, use a virtual waiting room with a leaky bucket admission queue.
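
The waiting-room idea can be sketched as a bounded queue drained at a fixed rate. The class and parameter names are hypothetical, and time is passed in explicitly so the behaviour is easy to reason about and test.

```python
from collections import deque


class LeakyBucketAdmission:
    """Queue join requests and admit them at a constant rate."""

    def __init__(self, rate_per_sec: float, capacity: int, start: float = 0.0) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.queue: deque[str] = deque()
        self._last = start  # time up to which admissions have been accounted for

    def offer(self, user_id: str) -> bool:
        """Enqueue a join request; False means the waiting room is full."""
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(user_id)
        return True

    def drain(self, now: float) -> list[str]:
        """Admit as many queued users as the elapsed time allows."""
        allowance = int((now - self._last) * self.rate)
        if allowance <= 0:
            return []
        # Advance by the time consumed, even if the queue was short, so
        # idle periods never build up a burst allowance.
        self._last += allowance / self.rate
        count = min(allowance, len(self.queue))
        return [self.queue.popleft() for _ in range(count)]
```

Rejected offers map naturally to an HTTP 429 with a retry hint, while queued users see the waiting room until `drain` releases them.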

Recording at scale. Recording is expensive. Do not put this on the SFU. This is in the same category of advice as "do not run untrusted code as root." Run a separate recording relay that subscribes to the SFU as a special participant, receives all streams, and writes them to disk independently. This isolates the recording failure domain from the live media path.

[Figure: cascaded SFU for large meetings. The origin SFU in US-West receives the speaker stream; three edge SFUs in Europe, Asia-Pacific, and US-East each subscribe to the origin and serve roughly 330 regional participants.]
The origin SFU fan-out stays flat at 3 regardless of how many participants join. Adding a new region means adding one edge SFU, not scaling the origin.
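
The fan-out numbers in that picture fall out of a two-line calculation. This sketch assumes a two-level tree with listeners spread evenly across edges; the function names and the per-edge stream limit are hypothetical.

```python
import math


def cascade_fanout(listeners: int, edge_count: int) -> dict[str, int]:
    """Egress stream counts in a two-level origin -> edge -> listener tree."""
    if edge_count < 1:
        raise ValueError("need at least one edge SFU")
    return {
        "origin_egress_streams": edge_count,  # one copy per edge SFU
        "max_edge_egress_streams": math.ceil(listeners / edge_count),
    }


def edges_needed(listeners: int, max_streams_per_edge: int) -> int:
    """How many edge SFUs are required for a given per-node egress limit."""
    return math.ceil(listeners / max_streams_per_edge)


webinar = cascade_fanout(listeners=999, edge_count=3)
```

For the 1,000-person webinar this gives 3 streams out of the origin and 333 out of each edge, and going to 10,000 listeners changes only the number of edges.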

The Tradeoffs Worth Discussing

E2EE vs cloud recording. True end-to-end encryption means the server never has plaintext media, so it cannot record or run server-side features like live transcription. Zoom offers both modes but they are mutually exclusive: enable E2EE and recording is disabled. This isn't a product gap. It's physics. Say this explicitly in the interview rather than treating them as orthogonal features.

SFU vs MCU. SFU scales better and costs less but shifts compositing work to the client. MCU reduces client CPU but adds server cost and encoding latency. For a general-purpose product, SFU wins. MCU makes sense only for broadcast-style events where thousands of read-only viewers need a single composited stream.

UDP vs TCP. Media flows over UDP. A few lost packets are fine: codecs and jitter buffers can conceal them. Requiring retransmission of every packet adds latency, which is worse than a dropped frame. Fall back to TCP (typically TURN over TCP or TLS on port 443) only when UDP is blocked at the firewall.

Recording storage. Store raw streams separately per participant for maximum flexibility in post-processing, then composite on read. This costs more storage but makes features like transcript alignment and layout changes in post-production possible without re-recording.

The 45-Minute Clock

0-5: Clarify requirements. Confirm participant count, video/audio only or also screen share, recording, E2EE optional.
5-12: Draw the two-plane architecture (signaling vs media). Introduce SFU and explain why P2P and MCU don't scale.
12-20: Protocol walkthrough. ICE/STUN/TURN, RTP, WebSocket signaling. Draw the connection establishment flow.
20-28: Data model and API. meetings, participants, recordings. The /join endpoint response.
28-35: Scaling deep dives. Cascaded SFU, signaling fan-out, recording service isolation.
35-42: Tradeoffs. E2EE, SVC vs simulcast, UDP vs TCP fallback.
42-45: Wrap up with monitoring. RTCP metrics, packet loss dashboards, SFU health checks.

Video Conferencing System Design: The Recap

Separate signaling (WebSocket, control messages) from the media plane (RTP over UDP) from the start.
SFU is the right routing model. P2P falls apart above 4 participants. MCU is too CPU-expensive for general use.
Active speaker detection and stream subscription management let the SFU scale to 50+ video participants without melting client bandwidth.
ICE/STUN handles most NAT traversal. TURN handles the rest. Never rely on TURN as the default path.
Redis holds live meeting state. PostgreSQL holds durable meeting records. S3 holds recordings.
Cascaded SFUs handle large meetings. The origin SFU feeds edge SFUs that serve regional participants.
E2EE and cloud recording are mutually exclusive. State the tradeoff explicitly.
Recording runs as a dedicated service subscribing to the SFU, not on the SFU itself.

If you want to practice explaining this architecture out loud, with a follow-up like "how would you handle 10,000-person webinars?" or "walk me through the E2EE key exchange," SpaceComplexity runs voice-based system design mock interviews with rubric feedback on your architecture decisions, not just your final diagram.

For the WhatsApp walkthrough, which covers the WebSocket signaling patterns in depth, see the chat system design guide. For a deep dive on video storage and CDN delivery for recordings, the YouTube system design article covers the upload and serving pipeline. The distributed cache patterns behind Redis meeting state are in the distributed cache walkthrough.