Load Balancing Algorithms: The System Design Interview Guide

You have one server. It runs great. You ship, users come, it keeps running great. You smile. Then Hacker News discovers you at 2 a.m. and your single EC2 instance starts sobbing.

So you add a second server. Problem solved. Except now you have a new problem: who decides which server gets each request? That decision is load balancing. Knowing the right algorithm for the workload in front of you is what separates a box labeled "LB" from a Strong Hire answer.

The Problem One Server Can't Solve

A single server has a ceiling. CPU, memory, open file descriptors, network bandwidth. It runs out of something. The load balancer's job is to distribute traffic across a server pool so no single node becomes the bottleneck.

It also handles failure. When a server crashes, the load balancer routes around it. Apart from requests that were already in flight on the dead node, users never see the error. That combination, horizontal scaling plus fault tolerance, is why load balancers appear in nearly every system design interview once the interviewer starts asking "what happens at scale."

Seven Algorithms, One Mental Model

Every algorithm encodes an assumption about what "distributing traffic fairly" means. Knowing the assumption each one makes lets you pick the right one quickly and explain your reasoning out loud. That explanation is the whole game in an interview.

Round Robin: The Equal-Request Assumption

Round robin cycles through servers in order. Request 1 goes to server A, request 2 to server B, request 3 to server C, back to A. No state to track, no measurements, no feelings.

It assumes every request costs roughly the same and every server has roughly the same capacity. When those hold (stateless REST servers behind a CDN, identical instance sizes, uniform request types), round robin is nearly impossible to beat on simplicity and overhead.

It breaks when requests have wildly different costs. One video transcode that takes 30 seconds can saturate a server while round robin keeps cheerfully routing more requests to it. The algorithm has no idea. It does not care. It is cycling.
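A minimal sketch of the rotation, assuming an in-process balancer with a hypothetical RoundRobinBalancer class and three placeholder server names. Real balancers do this inside the proxy, but the logic is the same: no measurements, just the next slot.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out servers in a fixed rotating order."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("need at least one server")
        self._rotation = cycle(list(servers))

    def pick(self):
        # No load measurement at all: just the next server in the cycle.
        return next(self._rotation)


balancer = RoundRobinBalancer(["A", "B", "C"])
picks = [balancer.pick() for _ in range(4)]
print(picks)  # ['A', 'B', 'C', 'A']
```

Notice that pick() takes no arguments. The balancer cannot know that the request it just sent to A is a 30-second transcode.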

Weighted Round Robin: Handling Unequal Servers

If your pool includes mixed instance sizes, weighted round robin assigns each server a weight proportional to capacity. A server with weight 3 gets three requests per cycle; a weight-1 server gets one.

Use this when servers are heterogeneous but request cost is still roughly uniform. It also works well for gradual rollouts: give a new server tier a low weight, watch it carefully, then ramp up once you're confident it isn't on fire.
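One way to implement the weighting is the "smooth" variant sketched below, which interleaves picks instead of sending three requests in a row to the big server. This is an illustrative sketch, not any particular product's implementation; the class name and the server names "big" and "small" are made up.

```python
class WeightedRoundRobinBalancer:
    """Smooth weighted round robin over a dict of server -> integer weight."""

    def __init__(self, weights):
        self._weights = dict(weights)
        self._current = {name: 0 for name in self._weights}
        self._total = sum(self._weights.values())

    def pick(self):
        # Every server earns credit equal to its weight on each pick.
        for name, weight in self._weights.items():
            self._current[name] += weight
        # The server with the most credit wins and pays back the total weight.
        chosen = max(self._current, key=self._current.get)
        self._current[chosen] -= self._total
        return chosen


wrr = WeightedRoundRobinBalancer({"big": 3, "small": 1})
cycle_picks = [wrr.pick() for _ in range(4)]
print(cycle_picks)  # ['big', 'big', 'small', 'big']
```

Over any full cycle of four picks, "big" is chosen three times and "small" once. For a gradual rollout you would start the new tier at a low weight and raise it over time.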

Least Connections: Watching What's Actually Happening

Instead of cycling blindly, least connections routes each new request to the server with the fewest active connections. It adapts in real time to uneven load.

This is the right algorithm when request duration varies significantly. Long-running database queries, file uploads, WebSocket connections. Anything where one request holds a connection open for seconds or minutes. Least connections prevents a server drowning in slow requests from receiving even more.

NGINX exposes this as least_conn. HAProxy calls it leastconn and recommends it for protocols with very long sessions, such as LDAP or SQL, rather than for short-lived HTTP requests. HAProxy's own default is roundrobin.
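The bookkeeping is small. The sketch below assumes a hypothetical in-process LeastConnectionsBalancer that is told when a request starts (acquire) and when it finishes (release). It is not how NGINX or HAProxy implement it internally, only the idea.

```python
class LeastConnectionsBalancer:
    """Routes each new request to the server with the fewest in-flight requests."""

    def __init__(self, servers):
        self.active = {name: 0 for name in servers}

    def acquire(self):
        # min() breaks ties by insertion order, which is fine for a sketch.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        # Call this when the request (or connection) completes.
        self.active[server] -= 1


lc = LeastConnectionsBalancer(["A", "B"])
first = lc.acquire()    # A
second = lc.acquire()   # B
lc.release(second)      # B's request finishes; A's is still running
third = lc.acquire()    # B again, because A is still busy
print(first, second, third)  # A B B
```

Round robin would have sent the third request to A. Least connections notices that A is still tied up and sends it to B instead.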

Weighted Least Connections: Both Problems at Once

Weighted least connections normalizes the active connection count by server capacity. A server with twice the CPU gets twice the weight, so having 10 connections doesn't disqualify it the way it would a smaller instance.

Reach for this when you have heterogeneous servers and variable request duration. It's the most broadly correct general-purpose algorithm for production pools and the answer you should default to when the interviewer hasn't given you enough constraints to be specific.
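In code, the normalization is a single division. This sketch assumes weights are positive numbers chosen by the operator; the function name and the sample figures are illustrative.

```python
def pick_weighted_least_connections(active, weights):
    """Pick the server with the lowest active-connections-per-unit-of-weight."""
    return min(active, key=lambda name: active[name] / weights[name])


active = {"large": 10, "small": 6}
weights = {"large": 2, "small": 1}

choice = pick_weighted_least_connections(active, weights)
print(choice)  # large: 10 / 2 = 5.0 beats small: 6 / 1 = 6.0
```

Plain least connections would have picked "small" here because 6 is less than 10. Dividing by weight recognizes that the larger instance still has more headroom.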

IP Hash: Deterministic Routing Without State

IP hash computes a hash of the client IP and maps it to a server. The same client always routes to the same server, as long as the pool is unchanged.

This gives you session affinity without the load balancer storing any session state. Fast, stateless from the balancer's perspective.

The trap: users behind a corporate NAT or proxy all share one IP. Your entire office is one person to the internet, and that one person is hammering a single backend server while the others sit idle wondering if they missed a meeting. IP hash is also fragile to pool changes: with a plain hash-modulo-N mapping, adding or removing a server changes N, the mapping shifts, and most clients get rerouted to new servers.
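To see how fragile a hash-modulo-N mapping is, the sketch below hashes 2,000 made-up client IPs onto four servers, adds a fifth, and counts how many clients changed servers. The function name, the SHA-256 choice, and the server names are assumptions for illustration.

```python
import hashlib


def ip_hash_pick(client_ip, servers):
    """Map a client IP to a server with hash(ip) mod N."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


clients = [f"10.0.{i // 256}.{i % 256}" for i in range(2000)]
pool_before = ["s1", "s2", "s3", "s4"]
pool_after = pool_before + ["s5"]

moved_clients = sum(
    ip_hash_pick(ip, pool_before) != ip_hash_pick(ip, pool_after)
    for ip in clients
)
moved_fraction = moved_clients / len(clients)
print(f"{moved_fraction:.0%} of clients changed servers")
```

Going from four servers to five, a client keeps its server only when the hash gives the same remainder mod 4 and mod 5, which happens about one time in five. Expect roughly 80% of clients to move, and every one of them loses whatever affinity they had.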

Consistent Hashing: Pool Changes Without Chaos

Consistent hashing arranges servers on a virtual ring. A request hashes to a position on the ring and routes to the nearest server clockwise. When a server is added or removed, only the keys between that server and its neighbor on the ring reroute, roughly 1/N of them in a pool of N. The rest of the pool stays undisturbed.

This is the approach many Memcached client libraries use to spread keys across nodes. Redis Cluster solves the same problem with a fixed set of 16,384 hash slots rather than a ring. See distributed cache system design for how the ring model works in depth. In a load balancer context, consistent hashing is the right choice when server-side caching gives a real performance benefit and you want maximum cache hit rate while remaining resilient to pool changes.
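A compact sketch of the ring, assuming SHA-256 for positions and 100 virtual nodes per server to even out the distribution. The HashRing class, the vnodes parameter, and the key names are illustrative, not taken from any specific library.

```python
import bisect
import hashlib


def _position(key):
    """Place a string on the ring as a 64-bit integer."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, servers, vnodes=100):
        # Each server appears at many points so load spreads evenly.
        self._points = sorted(
            (_position(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._positions = [pos for pos, _ in self._points]

    def pick(self, key):
        # First point clockwise from the key, wrapping past the end.
        idx = bisect.bisect_right(self._positions, _position(key))
        return self._points[idx % len(self._points)][1]


keys = [f"user-{i}" for i in range(2000)]
ring_before = HashRing(["s1", "s2", "s3", "s4"])
ring_after = HashRing(["s1", "s2", "s3", "s4", "s5"])

rerouted = [k for k in keys if ring_before.pick(k) != ring_after.pick(k)]
print(f"{len(rerouted) / len(keys):.0%} of keys moved")
```

Every rerouted key lands on the new server s5, and the share that moves should be in the neighborhood of one fifth. Keys that were on s1 through s4 and did not fall into s5's arcs stay exactly where they were.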

Least Response Time: The True Signal

Least response time routes to the server with the lowest combination of active connections and measured response time. Instead of using connection count as a proxy for load, it watches how long requests are actually taking.

It's the most adaptive algorithm, but the most expensive to compute and the most sensitive to noise. A momentary GC pause can shift traffic sharply. Use it for latency-sensitive services where workloads are genuinely heterogeneous and you have solid server observability. If you don't have good observability, you're just routing by vibes.
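Implementations differ on how they combine the two signals. The sketch below assumes one plausible scoring rule, (in-flight requests + 1) multiplied by an exponentially weighted moving average of latency. The class name, the alpha parameter, and the formula are all assumptions made for illustration.

```python
class LeastResponseTimeBalancer:
    def __init__(self, servers, alpha=0.2):
        self.alpha = alpha  # higher alpha reacts faster but is noisier
        self.active = {name: 0 for name in servers}
        self.ewma_ms = {name: 0.0 for name in servers}

    def pick(self):
        # Lower score is better: few in-flight requests and low smoothed latency.
        return min(
            self.active,
            key=lambda name: (self.active[name] + 1) * self.ewma_ms[name],
        )

    def record(self, server, latency_ms):
        old = self.ewma_ms[server]
        if old == 0.0:
            self.ewma_ms[server] = latency_ms  # first sample seeds the average
        else:
            self.ewma_ms[server] = (1 - self.alpha) * old + self.alpha * latency_ms


def after_gc_pause(alpha):
    lb = LeastResponseTimeBalancer(["fast", "slow"], alpha=alpha)
    lb.record("fast", 20)
    lb.record("slow", 200)
    lb.record("fast", 2000)  # one outlier, e.g. a GC pause
    return lb.pick()


jumpy = after_gc_pause(alpha=0.2)    # 'slow': one outlier flipped the routing
steady = after_gc_pause(alpha=0.05)  # 'fast': heavier smoothing absorbs it
print(jumpy, steady)
```

With alpha at 0.2, a single 2-second outlier pushes the fast server's average to about 416 ms and traffic swings to the slow one. With alpha at 0.05 the average only reaches about 119 ms and routing holds steady. That sensitivity is the noise problem in miniature.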

Layer 4 vs. Layer 7: The Question That Decides Everything

This is the question interviewers ask to see if you understand load balancers or just know their names.

A Layer 4 load balancer operates at the TCP/UDP level. It sees source and destination IP addresses and ports. It makes routing decisions without reading the payload. Fast, low overhead, limited intelligence. It is polite. It does not pry.

A Layer 7 load balancer reads actual application-layer data: HTTP headers, cookies, URL paths, request bodies. It can route /api/* to one server pool and /static/* to another, send mobile traffic to a different fleet, and inspect cookies for session affinity. It can also terminate TLS, compress responses, and apply rate limits.

Figure: L4 vs. L7 load balancer architecture.

The tradeoff is overhead. Layer 7 must terminate the TCP connection, parse the HTTP request, make a routing decision, then open a new connection to the backend or reuse a pooled one. This adds latency and CPU cost.

For high-throughput services at millions of requests per second, that overhead matters. A tiered architecture handles it: Layer 4 at the front for raw speed, distributing across a pool of Layer 7 balancers, which do content-based routing to backends.

AWS Network Load Balancer (NLB) is Layer 4. AWS Application Load Balancer (ALB) is Layer 7. NGINX and Envoy are most often deployed at Layer 7, though both can also proxy raw TCP. In interviews, default to Layer 7 unless you have a performance argument for Layer 4. Most application workloads benefit from HTTP-level visibility.
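The difference between the layers comes down to what the routing function is allowed to look at. The sketch below is a toy model with hypothetical Connection and HttpRequest types and made-up pool names. Real balancers work on packets and parsed requests, not Python objects.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Connection:
    """Everything a Layer 4 balancer can see."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


@dataclass
class HttpRequest:
    """What a Layer 7 balancer sees after parsing the request."""
    conn: Connection
    path: str
    headers: dict = field(default_factory=dict)


def route_l4(conn, backends):
    # Decision from addresses and ports only; the payload is never read.
    key = f"{conn.src_ip}:{conn.src_port}->{conn.dst_ip}:{conn.dst_port}"
    return backends[zlib.crc32(key.encode()) % len(backends)]


def route_l7(request, pools):
    # Decision from application data: URL path first, then a header.
    if request.path.startswith("/api/"):
        return pools["api"]
    if request.path.startswith("/static/"):
        return pools["static"]
    if "Mobile" in request.headers.get("User-Agent", ""):
        return pools["mobile"]
    return pools["default"]


pools = {"api": "api-pool", "static": "static-pool",
         "mobile": "mobile-pool", "default": "web-pool"}
conn = Connection("198.51.100.4", 51234, "203.0.113.10", 443)

print(route_l4(conn, ["l7-a", "l7-b"]))
print(route_l7(HttpRequest(conn, "/api/users"), pools))  # api-pool
```

In the tiered setup, route_l4 is the front tier choosing a Layer 7 balancer, and route_l7 is that balancer choosing a backend pool.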

Health Checks: Your Load Balancer's Immune System

Health checks are how the load balancer knows when to stop sending traffic to a server that is technically still running but has become useless.

Passive health checks notice failures after a request fails. Active health checks probe servers on a schedule regardless of traffic. Active is better. You catch degraded servers before users hit them, not after.

Health checks range from simple TCP connection attempts to full HTTP requests that test a /health endpoint. A good /health endpoint returns 200 only if the server is actually ready, including downstream dependencies like the database. The load balancer removes any server failing N consecutive checks and re-adds it once it recovers.

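Set the threshold carefully. Too sensitive and a brief GC pause causes unnecessary traffic redistribution. Too lenient and users hit a broken server for seconds. A common production pattern: check every 5 seconds, remove after 2 failures, re-add after 3 successes. With those numbers a dead server is pulled within roughly 10 seconds and has to stay healthy for about 15 seconds before it takes traffic again.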
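The threshold logic is a small state machine per server. This sketch assumes the probe itself happens elsewhere and only its pass/fail result is fed in. The HealthTracker class and its fall and rise parameter names are illustrative.

```python
class HealthTracker:
    """Tracks one server: mark down after `fall` consecutive failures,
    mark up again after `rise` consecutive successes."""

    def __init__(self, fall=2, rise=3):
        self.fall = fall
        self.rise = rise
        self.healthy = True
        self._streak = 0  # consecutive results that contradict current state

    def observe(self, check_passed):
        if check_passed == self.healthy:
            self._streak = 0  # result agrees with current state; reset
            return self.healthy
        self._streak += 1
        threshold = self.fall if self.healthy else self.rise
        if self._streak >= threshold:
            self.healthy = not self.healthy
            self._streak = 0
        return self.healthy


tracker = HealthTracker(fall=2, rise=3)
results = [True, False, True, False, False, True, True, True]
states = [tracker.observe(r) for r in results]
print(states)
# [True, True, True, True, False, False, False, True]
```

The single failure at position two is ignored, which is the GC-pause case. Two failures in a row take the server out, and it needs three clean checks before it returns.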

Sticky Sessions and the State Problem

Sticky sessions, sometimes called session affinity, ensure a client always routes to the same backend server. The load balancer sets a cookie or uses IP hash to remember the mapping.

The reason you need them is server-side state. A shopping cart in memory, a WebSocket connection mid-stream, an expensive computation cached locally. Anything that doesn't live in a shared external store forces you toward session affinity.

The cost is real: sticky sessions create uneven load. If a server holding 20% of active sessions gets a traffic spike, you can't redistribute easily. A crashed server also loses all sessions mapped to it. Your users' shopping carts just vanished. Not great.
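A toy model makes the failure mode concrete. The sketch assumes cookie-based affinity with a hypothetical lb_server cookie and carts held in each server's memory. All names here are invented for illustration.

```python
COOKIE = "lb_server"  # hypothetical cookie name


class StickyBalancer:
    def __init__(self, servers):
        self.servers = list(servers)
        self._next = 0

    def route(self, cookies):
        pinned = cookies.get(COOKIE)
        if pinned in self.servers:
            return pinned  # honor the existing pin
        # New client, or pinned server is gone: assign and remember.
        server = self.servers[self._next % len(self.servers)]
        self._next += 1
        cookies[COOKIE] = server
        return server

    def remove(self, server):
        self.servers.remove(server)


lb = StickyBalancer(["A", "B"])
carts = {"A": {}, "B": {}}  # per-server in-memory state
alice = {}                  # Alice's cookie jar

home = lb.route(alice)
carts[home]["alice"] = ["book"]
assert lb.route(alice) == home  # sticky: same server on the next request

lb.remove(home)  # the server crashes...
del carts[home]  # ...and its memory goes with it

new_home = lb.route(alice)
print(new_home, carts[new_home].get("alice"))  # B None
```

Alice gets rerouted without trouble, but her cart is gone because it lived only in the crashed server's memory. If carts were a shared store instead of a per-server dictionary, the reroute would be invisible to her.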

The cleaner solution is to externalize state. Move sessions to Redis, persist the cart in a database, put caches behind a shared layer. Stateless backends let you use round robin or least connections freely and survive node failures without losing user data. The distributed cache system design walkthrough shows how shared session storage eliminates the affinity problem entirely.

Consistent hashing offers a middle path when you have server-local caches: clients hash deterministically to the same server for cache affinity, but pool changes only disrupt a fraction of clients rather than all of them.

How to Talk About This in an Interview

You don't wait to be asked. When your design has multiple servers, you draw the load balancer immediately. When you do, say two things out loud: which layer it operates at and why, and which algorithm you'd pick for this specific workload.

Most candidates draw a box labeled "Load Balancer" and move on. What gets you a Strong Hire is the sentence after: "I'd use least connections here because these inference requests have highly variable duration. Round robin would let some servers back up while others sit idle."

For stateful systems, name sticky sessions and say you'd prefer to externalize state to avoid uneven load. If the interviewer pushes on cache locality or on what happens when the pool changes, bring in consistent hashing and explain the ring model briefly.

One more thing: mention health checks. Say the load balancer actively probes servers and removes them if they fail a threshold. It's a small detail that signals you understand the system heals itself without human intervention at 3 a.m.

If you want reps doing this under real interview pressure, SpaceComplexity runs voice-based mock system design interviews where you can practice exactly this kind of live decision-making with rubric-based feedback.

For a detailed walkthrough of designing the load balancer itself, including how to make the balancer highly available (the load balancer is also a single point of failure), see Design a Load Balancer. And for how load balancing connects to rate limiting at the edge, Rate Limiter System Design covers the intersection in depth.

What Weak Answers Miss

Not naming a layer. Saying "load balancer" without Layer 4 or Layer 7 tells the interviewer you've seen the box but haven't thought about what's inside it.

Reaching for sticky sessions as the default for stateful systems. The right answer is almost always: externalize state, use stateless routing.

Forgetting health checks. They're not optional. They're how the system heals itself automatically.

Ignoring load balancer failure. A single load balancer is a single point of failure. Production deployments use an active-passive or active-active pair with a virtual IP. Bring this up if the interviewer asks about reliability.