Design a Load Balancer: The System Design Interview Walkthrough

- Round robin suits uniform workloads; least connections adapts to variable request durations; consistent hashing minimizes disruption during server adds and removes
- L4 load balancers route by IP and port with high throughput; L7 load balancers read HTTP headers for content-based routing and SSL termination; production systems stack both
- Run both active health checks (periodic probes) and passive health checks (watching live traffic) to catch crashes and application-layer errors the health endpoint misses
- The load balancer is a single point of failure: fix it with a virtual IP and VRRP active-passive, or BGP ECMP for active-active at scale
- Externalize session state to Redis instead of sticky sessions so any backend can serve any request and server failures don't blow up user sessions
- Direct Server Return (DSR) lets backends reply directly to clients, bypassing the load balancer on the return path; because responses are often 80-90% of the bytes, this takes most traffic off the load balancer at extreme scale
You have two servers. Traffic is spiking. You tell your interviewer: "I'll add a load balancer." They nod, then say: "Great. Walk me through how you'd design one."
That's where most candidates freeze. Using a load balancer is table stakes. Designing one means treating it as a real system with hard tradeoffs. This load balancer system design interview walkthrough covers everything you need: the algorithms, the health check logic, L4 vs L7, how to keep the load balancer itself from being a single point of failure, and how to pace the conversation inside 45 minutes.
Start by Nailing the Scope
Before drawing anything, ask four questions. They change the entire design.
What traffic type are we handling? HTTP/HTTPS only, or raw TCP/UDP too? HTTP lets you do application-aware routing. TCP/UDP requires you to stay at layer 4 and give up content inspection.
Do we need session affinity? If users store session state in server memory (old Rails apps, WebSockets), subsequent requests must hit the same backend. If session state lives in Redis, you're free.
What scale? Thousands of requests per second or millions? Latency SLA? This decides whether you need one L7 proxy or an L4/L7 two-tier stack.
What's the health check contract? A /health endpoint? TCP ping? An app-level check that queries the database?
Spend the first five minutes here. Interviewers are grading how you scope ambiguity, not just how you draw boxes.
The Load Balancer Sits Between Clients and Backends
Clients always talk to the VIP. The LB cluster handles the picking. Backends just receive.
Clients talk to a single stable IP (or a DNS name). The load balancer picks a backend on every connection or request, proxies the traffic, and monitors whether backends are alive.
There are two flavors of this proxy depending on which OSI layer you operate at.
L4 vs L7: Which Layer Do You Work At?
L4 sees connections. L7 sees requests. Both are load balancers. Only one can read your cookies.
Layer 4 (transport layer) sees IP addresses, ports, and TCP/UDP. It cannot read HTTP headers or cookies. Routing decisions happen at connection open, not per-request. The payoff is raw throughput and low latency. Examples: AWS Network Load Balancer, HAProxy in TCP mode.
Layer 7 (application layer) reads HTTP headers, the URL path, cookies, and body content. You can route /api/* to one backend cluster and /static/* to another. You can strip or inject headers. SSL termination lives here. L7 can read your cookies. Your auth cookies. The cost is higher CPU per connection. Examples: AWS Application Load Balancer, Nginx, Envoy, HAProxy in HTTP mode.
Production systems use both in sequence. An L4 load balancer at the edge (cheap, handles DDoS volume, millions of connections) fans out to a tier of L7 proxies that do the smart routing. Saying this in your interview is an immediate signal upgrade.
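To make the content-based routing an L7 proxy performs concrete, here is a minimal sketch of longest-prefix path routing. The pool names (`api-pool`, `static-pool`, `web-pool`) and the `ROUTES` table are hypothetical, chosen only for illustration; real proxies express this in their own configuration languages.

```python
# Hypothetical routing table: path prefix -> backend pool name.
ROUTES = [
    ("/api/", "api-pool"),
    ("/static/", "static-pool"),
]
DEFAULT_POOL = "web-pool"  # assumed catch-all pool


def pick_pool(path: str) -> str:
    """Return the pool for the longest matching path prefix."""
    best_prefix, best_pool = "", DEFAULT_POOL
    for prefix, pool in ROUTES:
        if path.startswith(prefix) and len(prefix) > len(best_prefix):
            best_prefix, best_pool = prefix, pool
    return best_pool


print(pick_pool("/api/users"))      # api-pool
print(pick_pool("/static/app.js"))  # static-pool
print(pick_pool("/checkout"))       # web-pool
```

An L4 balancer cannot make this decision because it never sees the path, only the connection's addresses and ports.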
Which Routing Algorithm Should You Pick?
This is where most answers are too thin. Know each algorithm, its failure mode, and when to pick it.
Round Robin. Request 1 goes to server A, request 2 to B, request 3 to C, repeat. Simple. Breaks when requests vary wildly in cost. A server grinding through ten slow database queries gets the same next request as an idle server. Round robin does not know or care. That's your problem.
Weighted Round Robin. Same rotation but each server gets a weight proportional to its capacity. A server with 2x the CPU gets 2x the requests. Useful when your fleet is heterogeneous. Weights are static, so they still don't adapt to runtime load. You tuned them for Tuesday. It's Friday.
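A naive sketch of weighted round robin, assuming integer weights: each server appears in the rotation as many times as its weight. The server names are placeholders. Production balancers usually interleave picks more smoothly so a heavy server does not receive its share in a burst, but the proportions are the same.

```python
import itertools


def weighted_round_robin(weights):
    """Cycle through servers forever, each appearing `weight` times per cycle."""
    schedule = [name for name, weight in weights.items() for _ in range(weight)]
    return itertools.cycle(schedule)


# Server "a" has twice the capacity of "b", so it gets twice the requests.
picker = weighted_round_robin({"a": 2, "b": 1})
first_six = [next(picker) for _ in range(6)]
print(first_six)  # ['a', 'a', 'b', 'a', 'a', 'b']
```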
Least Connections. Send each new request to the server with the fewest active connections. Adapts to runtime load without any explicit weight tuning. Works well for long-lived connections and requests with variable duration. The load balancer must maintain a connection counter per backend, which adds state.
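The per-backend counter can be sketched in a few lines. This is an illustrative, single-threaded version; the class and method names are made up for the example, and a real balancer would need thread-safe or per-worker counters.

```python
class LeastConnections:
    """Track active connections per backend and pick the least loaded one."""

    def __init__(self, backends):
        self.active = {backend: 0 for backend in backends}

    def acquire(self):
        # Ties go to the backend listed first.
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1
        return backend

    def release(self, backend):
        # Call when the connection or request finishes.
        self.active[backend] -= 1


lb = LeastConnections(["a", "b"])
first = lb.acquire()   # "a" (both idle, tie goes to the first)
second = lb.acquire()  # "b"
lb.release("a")        # a's request finished quickly
third = lb.acquire()   # "a" again, because "b" is still busy
```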
IP Hash. Compute hash(client_ip) mod N to pick a server. The same client always hits the same server. Simple session affinity. Fails hard when a server goes down: all its clients scatter and lose their state, and because N changed, most other clients remap too. Also skews if your users sit behind a corporate NAT. One IP, thousands of people, all on the same server. Your most important enterprise client is now your most overloaded server.
Consistent Hashing. Place servers on a virtual ring. Each request maps to the nearest server clockwise. When a server is added or removed, only ~1/N of requests remap (versus roughly (N-1)/N for modulo hashing). Add 150 virtual nodes per physical server to even out the load distribution on the ring. This is the right answer for distributed caching (for example, client-side sharding across Memcached nodes) and for load balancers that need to minimize disruption during deploys.
Least Response Time. Combine active connection count with observed response latency. More accurate than pure least connections. Adds more bookkeeping.
In most interviews, round robin is your default. Upgrade to least connections when request durations vary. Reach for consistent hashing when you need affinity with graceful server turnover.
```python
# Consistent hashing: place virtual nodes on a sorted ring
import bisect
import hashlib


class ConsistentHashRing:
    def __init__(self, nodes, virtual_nodes=150):
        self.ring = {}          # ring position -> physical node
        self.sorted_keys = []   # ring positions in clockwise order
        for node in nodes:
            for i in range(virtual_nodes):
                key = self._hash(f"{node}-{i}")
                self.ring[key] = node
                self.sorted_keys.append(key)
        self.sorted_keys.sort()

    def _hash(self, s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def get_node(self, request_key):
        h = self._hash(request_key)
        # First ring position clockwise of h, wrapping past the end.
        idx = bisect.bisect(self.sorted_keys, h) % len(self.sorted_keys)
        return self.ring[self.sorted_keys[idx]]
```
Each server gets 150 virtual nodes spread across the ring. When Server B goes down, only the slice it owned moves to the next server clockwise. Everyone else stays put.
When a server is removed from this ring, only the keys that mapped to it get redistributed to the next server clockwise. Every other server is untouched.
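To check that claim rather than take it on faith, the sketch below compares how many keys change owner when one of four servers is removed, under modulo hashing and under a ring built the same way as above. The server names and the `user-<i>` keys are arbitrary test data, and the exact fractions depend on the hash, so treat the numbers as approximate.

```python
import bisect
import hashlib


def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)


def build_ring(nodes, virtual_nodes=150):
    """Return (sorted ring positions, owner of each position)."""
    points = sorted((h(f"{n}-{i}"), n) for n in nodes for i in range(virtual_nodes))
    return [p[0] for p in points], [p[1] for p in points]


def ring_lookup(ring, key):
    hashes, owners = ring
    return owners[bisect.bisect(hashes, h(key)) % len(hashes)]


def modulo_lookup(nodes, key):
    return nodes[h(key) % len(nodes)]


nodes = ["A", "B", "C", "D"]
survivors = ["A", "B", "C"]  # server D is removed
keys = [f"user-{i}" for i in range(2000)]

before_ring, after_ring = build_ring(nodes), build_ring(survivors)
ring_moved = sum(
    ring_lookup(before_ring, k) != ring_lookup(after_ring, k) for k in keys
) / len(keys)
mod_moved = sum(
    modulo_lookup(nodes, k) != modulo_lookup(survivors, k) for k in keys
) / len(keys)

print(f"consistent hashing moved {ring_moved:.0%} of keys")
print(f"modulo hashing moved     {mod_moved:.0%} of keys")
```

With four servers you should see roughly a quarter of keys move on the ring (only the ones D owned) and roughly three quarters move under modulo.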
Health Checks: Know When a Backend Is Dead
A routing algorithm is worthless if it sends traffic to a crashed server. Load balancers use two complementary mechanisms.
Active health checks. The load balancer periodically probes each backend on its own. TCP probe: try to open a connection. HTTP probe: send GET /health, expect 200 OK. Mark a backend unhealthy after 2-3 consecutive failures and healthy again after 2-3 consecutive successes. This hysteresis prevents flapping. Probe interval is typically 5-10 seconds; with a 10-second interval and a three-failure threshold, detecting a crash can take up to 30 seconds. Tune the interval against the blast radius you can tolerate.
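The failure and success thresholds amount to a small state machine per backend. Here is a minimal sketch; the parameter names `fall` and `rise` are borrowed from common proxy terminology and the class itself is hypothetical. It only tracks state; sending the probes is left out.

```python
class HealthState:
    """Flip state only after `fall` straight failures or `rise` straight successes."""

    def __init__(self, fall=3, rise=2):
        self.fall, self.rise = fall, rise
        self.healthy = True
        self.streak = 0  # consecutive probes that contradict the current state

    def record(self, probe_ok: bool) -> bool:
        if probe_ok == self.healthy:
            self.streak = 0  # probe agrees with current state; reset the streak
        else:
            self.streak += 1
            threshold = self.rise if probe_ok else self.fall
            if self.streak >= threshold:
                self.healthy = probe_ok
                self.streak = 0
        return self.healthy


def worst_case_detection_seconds(interval_s, fall):
    """Upper bound on crash-to-eviction time, ignoring probe timeouts."""
    return interval_s * fall


state = HealthState(fall=3, rise=2)
probes = [False, False, True, False, False, False, True, True]
history = [state.record(ok) for ok in probes]
print(history)  # [True, True, True, True, True, False, False, True]
print(worst_case_detection_seconds(10, 3))  # 30
```

Note how the single success in the third probe resets the failure streak: that is the anti-flapping behavior, and also why one lucky probe can delay eviction.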
Passive health checks. The load balancer watches real traffic. If a backend returns 5xx errors or times out on N consecutive requests, it pulls it from the pool without waiting for a probe cycle. This catches problems that the /health endpoint misses, like a process that's alive, returning 200 on /health, and timing out on every actual request. Your health endpoint is optimistic. Your users are not.
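A passive check is just bookkeeping on responses the proxy already sees. The sketch below ejects a backend after N consecutive failures; the class name and the default of 5 are assumptions for illustration, and it omits the timed re-admission a real proxy would add.

```python
class PassiveChecker:
    """Eject a backend after N consecutive 5xx responses or timeouts."""

    def __init__(self, max_consecutive_failures=5):
        self.limit = max_consecutive_failures
        self.failures = {}    # backend -> current failure streak
        self.ejected = set()

    def observe(self, backend, status_code=None, timed_out=False):
        failed = timed_out or (status_code is not None and status_code >= 500)
        if failed:
            self.failures[backend] = self.failures.get(backend, 0) + 1
            if self.failures[backend] >= self.limit:
                self.ejected.add(backend)
        else:
            self.failures[backend] = 0  # any success resets the streak


checker = PassiveChecker(max_consecutive_failures=3)
checker.observe("a", status_code=200)
checker.observe("a", status_code=503)
checker.observe("a", timed_out=True)
checker.observe("a", status_code=500)  # third consecutive failure
print(checker.ejected)  # {'a'}
```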
The production answer: run both. Active checks catch silent crashes fast. Passive checks catch application-layer failures immediately.
At a 10-second probe interval with a two-failure threshold, up to twenty seconds pass between crash and reroute. Tune the interval down to reduce blast radius. Tune it up to reduce probe overhead. There's no free lunch.
The Load Balancer Is Also a Single Point of Failure
This is the question every candidate forgets to raise. Surface it yourself. You've spent ten minutes explaining how you'll make backends resilient. You've quietly built a system where one server crash takes down everything. That server is the load balancer.
If your single load balancer dies, everything dies. The fix is two load balancers sharing a Virtual IP (VIP), managed by VRRP (Virtual Router Redundancy Protocol) via a tool like Keepalived.
Active-passive setup. The primary LB owns the VIP and handles all traffic. The backup LB monitors the primary via heartbeat (VRRP advertisements). If the primary fails, the backup claims the VIP and traffic shifts: in about one second when advertisements are tuned to sub-second intervals, and a few seconds at the 1-second default. Clients never see an IP change. This is the default production setup for on-premises deployments.
One second failover. Clients see zero IP change. The VIP is the lie that holds the system together, and it's a very useful lie.
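How fast the backup takes over depends on the advertisement interval. As a rough sketch, the VRRP specification has a backup wait three advertisement intervals plus a priority-based skew before declaring the master down. The function below applies that timer formula; the priority of 100 and the 0.3-second interval are example values, and whether your VRRP implementation accepts sub-second intervals is something to verify.

```python
def master_down_interval(advert_interval_s: float, backup_priority: int) -> float:
    """Seconds a backup waits without advertisements before taking over.

    Timer formula from the VRRP spec: 3 * advert interval + skew, where the
    skew shrinks as the backup's priority (1-254) grows.
    """
    skew = (256 - backup_priority) * advert_interval_s / 256
    return 3 * advert_interval_s + skew


default_timers = master_down_interval(1.0, backup_priority=100)
tuned_timers = master_down_interval(0.3, backup_priority=100)
print(f"{default_timers:.2f}s at 1s advertisements")    # about 3.6 s
print(f"{tuned_timers:.2f}s at 0.3s advertisements")    # about 1.1 s
```

So the one-second figure assumes tuned timers. If the interviewer pushes on failover time, this is the knob to name.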
Active-active setup. Both load balancers advertise the same VIP via BGP. Upstream routers distribute traffic across both using ECMP (Equal-Cost Multi-Path). True horizontal scaling of the load balancer layer itself. Used at large scale (Google, Cloudflare). Requires BGP-capable routers and more operational complexity.
DNS-based geo-routing sits above this. Route 53 or Cloudflare routes clients to the nearest regional load balancer cluster. Each region runs its own active-passive or active-active pair.
Sticky Sessions: The Right Answer Is to Avoid Them
Sticky sessions route a client to the same backend every time, preserving in-process session state. The load balancer does this via cookie injection (it writes a cookie naming the backend) or IP hash.
The problem: when a backend dies, all its stickied users lose their session and get an error. Load distribution becomes uneven as some backends accumulate "heavy" users. Sticky sessions are how you find out which server is overloaded. The answer is the one holding your most important enterprise account.
The better answer: externalize session state. Store session data in Redis or a database. Any backend can serve any request. Your load balancer becomes stateless, your backends become interchangeable, and you can scale or kill individual servers without blowing up user sessions.
If sticky sessions are unavoidable (legacy apps, long-lived WebSocket connections), prefer cookie-based stickiness over IP hash. Cookies survive NAT. IP hash does not.
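Cookie-based stickiness can be sketched as: honor the cookie if it names a healthy backend, otherwise pick a new backend and set the cookie. The cookie name `LB_BACKEND` and the round-robin fallback are assumptions for the example. Real balancers typically store an opaque or signed identifier rather than the raw server name.

```python
COOKIE_NAME = "LB_BACKEND"  # hypothetical cookie name


def route(cookies: dict, healthy_backends: list, counter: int):
    """Return (backend, cookies_to_set) for one request."""
    pinned = cookies.get(COOKIE_NAME)
    if pinned in healthy_backends:
        return pinned, {}  # stick to the pinned backend, nothing to set
    # No cookie, or the pinned backend is gone: fall back to round robin.
    backend = healthy_backends[counter % len(healthy_backends)]
    return backend, {COOKIE_NAME: backend}


print(route({}, ["a", "b"], counter=1))                   # ('b', {'LB_BACKEND': 'b'})
print(route({"LB_BACKEND": "a"}, ["a", "b"], counter=1))  # ('a', {})
print(route({"LB_BACKEND": "c"}, ["a", "b"], counter=0))  # ('a', {'LB_BACKEND': 'a'})
```

The third call is the failure case from above: backend "c" died, so the user is re-pinned to "a" and whatever session state lived only on "c" is lost.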
Always Terminate TLS at the Load Balancer
Terminate TLS at the load balancer. The load balancer holds the certificate and private key, decrypts incoming HTTPS, and forwards plain HTTP to backends. This:
- Offloads crypto from every backend server
- Centralizes certificate management (one renewal, not N)
- Lets the L7 load balancer actually read HTTP headers (impossible on encrypted traffic)
The internal network between LB and backends is typically trusted (private VPC), so plain HTTP is acceptable. If your security posture requires end-to-end encryption, re-encrypt to the backends at the cost of additional CPU.
Scaling Bottlenecks and How to Break Them
| Bottleneck | Symptom | Fix |
|---|---|---|
| Single LB CPU saturated | High CPU, latency rising | Active-active ECMP, or upgrade to async event-driven proxy |
| Connection table full | New connections rejected | Tune ulimit, tune ip_local_port_range, or scale out the LB tier |
| Bandwidth at NIC limit | Packet drops | Add NICs, use SR-IOV, or route directly to backends (DSR) |
| Health check storm | Too many probe threads | Use async health checker; separate health-check plane from data plane |
| Hot shard in consistent hash | One backend gets 5x traffic | Increase virtual node count; add real replicas |
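Direct Server Return (DSR) is worth naming in your interview. The load balancer handles the inbound packet and picks a backend, but the backend sends the response directly to the client, bypassing the load balancer on the return path. Requests still flow through the LB; what skips it is the response traffic, which is typically the bulk of the bytes (80-90% in read-heavy workloads). The tradeoff is that the LB never sees responses, so DSR only fits L4 forwarding, not L7 features like TLS termination. Google's Maglev uses DSR at scale.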
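The savings are easy to estimate with back-of-the-envelope arithmetic. The request and response sizes below are made-up example numbers; plug in your own traffic profile.

```python
def lb_byte_share(request_bytes: int, response_bytes: int, dsr: bool) -> float:
    """Fraction of total bytes that pass through the load balancer."""
    total = request_bytes + response_bytes
    through_lb = request_bytes if dsr else total  # with DSR, responses bypass the LB
    return through_lb / total


# Example: 1 KB requests, 9 KB responses (assumed sizes).
without_dsr = lb_byte_share(1_000, 9_000, dsr=False)
with_dsr = lb_byte_share(1_000, 9_000, dsr=True)
print(f"without DSR: {without_dsr:.0%} of bytes cross the LB")  # 100%
print(f"with DSR:    {with_dsr:.0%} of bytes cross the LB")     # 10%
```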
These Four Tradeoffs Define Your Design
Round Robin vs Least Connections. Round Robin is O(1) with almost no state (a single rotation counter). Least Connections adapts to variable load but needs a counter per backend. Pick least connections when request duration varies.
L4 vs L7. L4 is faster and simpler. L7 unlocks content-based routing, SSL termination, and richer health checks. The two-tier pattern (L4 edge, L7 middle) is the practical answer for large systems.
Active-Passive vs Active-Active HA. Active-Passive is simpler but wastes one machine. Active-Active uses all capacity but requires BGP routing infrastructure. Default to Active-Passive; mention Active-Active when the interviewer asks how you'd scale the LB itself.
Sticky Sessions vs Stateless Backends. Sticky sessions add complexity and failure modes. Stateless backends with shared Redis are almost always the right architecture. The main exception is WebSocket connections, which must stay pinned to one process for the connection lifetime.
How to Run the Load Balancer System Design Interview in 45 Minutes
Spend it like this:
- 0-5 min: Scope clarification. Traffic type, scale, session affinity, health check contract.
- 5-12 min: High-level architecture. VIP, LB cluster, backend pool. Draw it.
- 12-22 min: Routing algorithms. Cover round robin through consistent hashing. Pick one for your use case and justify it.
- 22-30 min: Health checks. Active vs passive. Thresholds and intervals. What happens when a backend dies.
- 30-38 min: HA for the load balancer itself. VIP/VRRP. Briefly mention active-active with BGP.
- 38-45 min: Deep dive on one area. The interviewer will pick: SSL termination, sticky sessions, DSR, or scaling.
If the interviewer interrupts and goes deep earlier, follow them. This clock is a fallback, not a script.
The Full Checklist
- Clarify traffic type, scale, session requirements, and health check contract before drawing anything.
- Round robin for uniform workloads; least connections when request duration varies; consistent hashing when you need graceful affinity.
- L4 for throughput, L7 for intelligence. Two-tier combines both.
- Active health checks detect crashes; passive health checks catch application errors. Run both.
- The load balancer is a SPOF. Fix it with VIP + VRRP active-passive, or BGP ECMP active-active.
- Externalize session state to Redis. Avoid sticky sessions unless the protocol forces them.
- SSL terminates at the load balancer. Backends get plain HTTP on the private network.
- DSR bypasses the load balancer on the return path. Useful at extreme scale.
If you want to practice walking through this under real interview pressure, SpaceComplexity runs voice-based mock system design interviews with rubric-based feedback on exactly this kind of question. It's the difference between knowing the design and being able to explain it in real time.
For more interview fundamentals, see the guides on consistent hashing in distributed systems, designing a distributed cache, and tradeoff framing.