Horizontal vs Vertical Scaling: The System Design Interview Guide

- Vertical scaling (scale up) upgrades one machine: more CPU, RAM, storage. No code changes, but there is a hard hardware ceiling (around 192 vCPUs on EC2).
- Horizontal scaling (scale out) adds machines behind a load balancer. No ceiling, but your application must be stateless first or it breaks immediately.
- The stateless requirement is the real constraint: sessions and app state must live in external stores (Redis, databases, object storage) before horizontal scaling works cleanly.
- Databases follow a different playbook: start on one large instance, add read replicas when reads are the bottleneck, shard only when writes are.
- In an interview, anchor your choice to the actual bottleneck (CPU, I/O, memory) and request volume rather than defaulting to horizontal as the "right" answer.
- Horizontal is not always better: for stateful systems or moderate traffic, one well-tuned instance is simpler, cheaper, and easier to operate than a distributed one.
Your interviewer draws a box on the whiteboard and labels it "server." Traffic grows 10x. What do you do with the box?
There are exactly two answers. You either make the box bigger, or you add more boxes. Vertical scaling adds resources to the existing machine; horizontal scaling adds more machines. That's the whole idea, and it comes up in nearly every system design interview. The interesting part is knowing which one to reach for, when, and how to reason through it out loud without fumbling.
Most candidates know the vocabulary. Fewer can explain which choice fits a given bottleneck, why stateful systems play by different rules, and how to introduce the discussion at the right moment without it sounding like a line they rehearsed in the shower.
Make the Box Bigger: Vertical Scaling
Vertical scaling (scale up) means upgrading the hardware on your existing server. More CPU cores, more RAM, faster NVMe storage. On AWS, you move from a t3.medium to an m5.2xlarge to an m7i.48xlarge. Similar progressions exist on GCP and Azure.
The ceiling is real. The largest general-purpose EC2 instances top out around 192 vCPUs and 768 GiB of RAM. Once you are on the largest instance your provider offers, there is nothing bigger to move to, and vertical scaling is done. Until you hit that ceiling, though, vertical scaling is usually the fastest way to buy yourself more headroom.
The advantages are genuine:
- No application changes required. Your code doesn't know whether it's running on 4 cores or 128.
- No distributed state problems. One machine, one memory space, no network hops between processes.
- Lower operational complexity. One instance to monitor, patch, deploy, and debug.
- Latency stays low. In-process communication is nanoseconds. Network hops are microseconds to milliseconds.
The disadvantages matter too:
- Single point of failure. That one big machine goes down and your whole service goes down with it.
- Cost curves steeply. High-end instances are disproportionately expensive, so doubling the specs can cost more than double the price.
- Downtime for upgrades. Moving to a bigger instance type usually means a restart.
Vertical scaling makes sense early in a system's life, when the system is stateful and can't scale horizontally without a redesign, or when the problem is "we need 2x capacity this week" and not "we need 100x capacity over the next two years."
And that single point of failure thing? It's not theoretical.

The most dangerous systems are the ones surviving because of one invisible engineer. Replace "engineer" with "one big server" and the story is the same.
Add More Boxes: Horizontal Scaling
Horizontal scaling (scale out) means running multiple instances of your service behind a load balancer. Instead of one 128-core machine, you run sixteen 8-core machines. Requests get distributed across all of them.
This approach has no hard hardware ceiling. You keep adding nodes. Google and Amazon run millions of servers. That's horizontal scaling taken to its logical extreme.
The advantages:
- No single point of failure. Losing one node out of ten means 10% capacity loss, not total unavailability.
- Incremental capacity. Add exactly as much as you need, when you need it.
- Rolling deployments. Push changes to one node at a time, zero downtime.
- Throughput scales roughly linearly. Double the nodes, roughly double the throughput.
The disadvantages:
- Your application must be stateless. If any request can land on any server, no server can hold state that others lack.
- Operational complexity. Load balancers, service discovery, distributed tracing, health checks, consensus on shared state.
- Network overhead. Nodes communicating over the network adds latency that shared memory never has.
The stateless requirement is the key constraint. If your API server stores user sessions in local memory, horizontal scaling breaks immediately. Server A handled the login, so it holds the session. Server B gets the next request and has no idea who this user is. The fix: externalize all state. Sessions go to Redis, user data goes to a shared database, files go to object storage. Each server becomes fungible, replaceable, interchangeable.
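The failure and the fix can be sketched in a few lines. This is a minimal illustration, not a real deployment: `ApiServer` is a hypothetical class, and a plain dictionary stands in for an external store such as Redis (no Redis client is used here).

```python
class ApiServer:
    """Hypothetical API server; `shared_store` stands in for Redis."""

    def __init__(self, name, shared_store=None):
        self.name = name
        # With no shared store, sessions live in this process's own memory.
        self.sessions = shared_store if shared_store is not None else {}

    def login(self, session_id, user):
        self.sessions[session_id] = user

    def whoami(self, session_id):
        # Returns None when this server has never seen the session.
        return self.sessions.get(session_id)


# Local state: server B cannot see the session created on server A.
a, b = ApiServer("A"), ApiServer("B")
a.login("s1", "alice")
local_result = b.whoami("s1")

# Externalized state: both servers read and write the same store.
store = {}
a2, b2 = ApiServer("A", store), ApiServer("B", store)
a2.login("s1", "alice")
shared_result = b2.whoami("s1")

print(local_result, shared_result)  # None alice
```

With the shared store, it no longer matters which server the load balancer picks for the second request.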
Horizontal vs Vertical Scaling: Side by Side
| | Vertical Scaling | Horizontal Scaling |
|---|---|---|
| Also called | Scale up | Scale out |
| Mechanism | Bigger machine | More machines |
| Hard ceiling | Yes (hardware limits) | No |
| Application changes | None | Stateless required |
| Availability | Single point of failure | High availability |
| Failure mode | Catastrophic | Graceful degradation |
| Operational complexity | Low | High |
| Best for | Databases, early growth | Stateless services, high traffic |
The Database Problem
Databases are where scaling gets complicated, and system design interviews love to probe this.
Stateless services scale horizontally with almost no friction. Add a node, update the load balancer, done. But databases hold state. Horizontal scaling for databases requires either replication or sharding, and both have real tradeoffs.
Replication copies your data across multiple nodes. Reads scale because any replica can serve a read query. Writes don't scale the same way, because every write has to propagate to replicas, and you have to decide what "consistent" means across them. Database replication is the right move when your bottleneck is read traffic.
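A toy model makes the read/write asymmetry concrete. The sketch below assumes asynchronous replication from a single primary; `ReplicatedDatabase` and its methods are hypothetical names, and dictionaries stand in for database nodes.

```python
import itertools


class ReplicatedDatabase:
    """Toy primary/replica setup; dicts stand in for database nodes."""

    def __init__(self, replica_count):
        self.primary = {}
        self.replicas = [{} for _ in range(replica_count)]
        self._next_replica = itertools.cycle(range(replica_count))
        self.pending = []  # writes accepted by the primary, not yet replicated

    def write(self, key, value):
        # Every write goes to the single primary, so writes do not scale out.
        self.primary[key] = value
        self.pending.append((key, value))

    def replicate(self):
        # Apply queued writes to every replica (asynchronous replication).
        for key, value in self.pending:
            for replica in self.replicas:
                replica[key] = value
        self.pending.clear()

    def read(self, key):
        # Reads rotate across replicas, so read capacity grows with replicas.
        replica = self.replicas[next(self._next_replica)]
        return replica.get(key)


db = ReplicatedDatabase(replica_count=2)
db.write("user:1", "alice")
stale = db.read("user:1")   # replica has not caught up yet
db.replicate()
fresh = db.read("user:1")
print(stale, fresh)  # None alice
```

The stale read before `replicate()` runs is the consistency question in miniature: a replica can answer before it has seen the latest write.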
Sharding partitions your data across multiple nodes. Each shard holds a subset. Reads and writes both distribute, but now you have to know which shard holds which data. Cross-shard queries get expensive. Resharding as you add nodes is operationally painful. The database sharding guide covers the mechanics in depth.
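To see why resharding hurts, consider the simplest routing scheme: hash the key and take it modulo the shard count. This is one illustrative scheme under assumed names (`shard_for`, `user:N` keys), not the only way to shard.

```python
import hashlib


def shard_for(key, shard_count):
    # Stable hash so the same key always maps to the same shard.
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


keys = [f"user:{i}" for i in range(10_000)]

# Count keys whose shard changes when a fifth shard is added.
moved = sum(1 for k in keys if shard_for(k, 4) != shard_for(k, 5))
moved_fraction = moved / len(keys)
print(f"{moved_fraction:.0%} of keys change shard going from 4 to 5 shards")
```

With modulo routing, a key stays put only when its hash gives the same remainder for both 4 and 5, which is true for about one key in five. The other four in five have to be copied to a different node.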
The sensible progression in an interview: start the database on a single large instance (vertical), move to read replicas when read traffic is the bottleneck, then shard when write traffic is the bottleneck. Don't jump to sharding in round one. Interviewers notice when you reach for the most complex solution before you've established what the bottleneck even is.
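That progression is simple enough to write down as a decision rule. The function below is only a mnemonic for the order of steps; the name `next_database_step` and the bottleneck labels are made up for this sketch.

```python
def next_database_step(bottleneck):
    """Map an observed bottleneck to the next scaling step.

    `bottleneck` is one of "none", "reads", or "writes" (hypothetical labels).
    """
    if bottleneck == "reads":
        return "add read replicas"
    if bottleneck == "writes":
        return "shard"
    # No measured bottleneck yet: keep the simple setup.
    return "stay on a single large instance"


for observed in ("none", "reads", "writes"):
    print(observed, "->", next_database_step(observed))
```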
How to Bring This Up in an Interview
This is where most candidates leave points on the table. They know vertical versus horizontal. They don't know when to introduce the topic or how to frame it as a reasoned argument rather than a buzzword recitation.
Mention scaling after you've established baseline requirements. Know the expected requests per second, the number of users, the read/write ratio, the consistency requirements. Then you can make an actual argument for your scaling approach.
Here's what that sounds like:
"At 1,000 requests per second, a single well-tuned instance handles this comfortably. I'd start with one m5.2xlarge and defer the complexity. Once we project hitting CPU saturation or we need high availability, we can scale up to a larger instance type or scale out behind a load balancer. If the application is stateless, scaling out is better long-term because it removes the single point of failure. For the database, I'd keep it on one instance initially and add read replicas if read traffic becomes the bottleneck."
This answer signals three things the interviewer is actually looking for: you don't automatically reach for horizontal scaling; the choice follows from the real bottleneck; application servers and databases scale on different timelines with different strategies.
Saying "we'll just scale horizontally" without reasoning tells the interviewer you're pattern-matching to a word you've heard in a YouTube video.
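The arithmetic behind a reasoned answer is short. The sketch below is a back-of-the-envelope estimate, and every number in it is an assumption you would state out loud: the 2,000 requests per second a single instance can serve, the 70% utilization target, and the one spare instance.

```python
import math


def instances_needed(peak_rps, per_instance_rps, target_utilization=0.7, spare=1):
    """Estimate how many instances a peak load requires.

    per_instance_rps: measured throughput of one instance (assumed here).
    target_utilization: fraction of that throughput you plan to use.
    spare: extra instances so losing one does not overload the rest.
    """
    usable_rps = per_instance_rps * target_utilization
    return math.ceil(peak_rps / usable_rps) + spare


baseline = instances_needed(1_000, per_instance_rps=2_000, spare=0)
after_10x = instances_needed(10_000, per_instance_rps=2_000)
print(baseline, after_10x)  # 1 9
```

Under these assumptions, 1,000 requests per second fits on one instance, while a 10x jump needs eight instances for load plus one spare.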
When the interviewer hands you a capacity jump ("now handle 10M users"), walk through it methodically. Where is the bottleneck: CPU, memory, I/O, network bandwidth? Can we buy time by scaling vertically, or are we near hardware limits? If we scale out, what changes in the application layer to make it stateless? How does the database tier scale separately from the application tier?
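The first of those questions can be framed as picking the most saturated resource. A minimal sketch, assuming you already have utilization figures between 0 and 1 for each resource (the sample values are invented):

```python
def find_bottleneck(utilization):
    """Return the most saturated resource from a name -> utilization map."""
    return max(utilization, key=utilization.get)


# Invented sample metrics for illustration only.
sample = {"cpu": 0.92, "memory": 0.40, "disk_io": 0.55, "network": 0.30}
bottleneck = find_bottleneck(sample)
print(bottleneck)  # cpu
```

Here the instance is CPU-bound, so adding memory would change nothing. The scaling decision follows from that measurement.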
Design for Stateless From Day One
One point worth making explicitly in an interview: if your services hold no local state, horizontal scaling becomes a configuration change rather than a redesign.
No in-memory sessions, no local file caches, no node-specific data. You can run 1 instance or 1,000 instances without touching a line of application code. The load balancer distributes requests. A node goes down and traffic shifts to the remaining healthy instances automatically.
This is why the most scalable architectures separate compute from state entirely. Servers are fungible, disposable, replaceable. State lives in dedicated systems: databases, distributed caches, object stores, each handling their own scaling concerns independently.
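The failover behavior described above, where a node goes down and traffic shifts to the healthy ones, can be sketched with a toy balancer. `LoadBalancer`, `mark_down`, and the node names are hypothetical; a real load balancer would detect failures through health checks rather than a manual call.

```python
import itertools


class LoadBalancer:
    """Toy round-robin balancer that skips nodes marked unhealthy."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)
        self._cursor = itertools.cycle(self.nodes)

    def mark_down(self, node):
        self.healthy.discard(node)

    def route(self):
        # Try each node at most once per call; skip unhealthy ones.
        for _ in range(len(self.nodes)):
            node = next(self._cursor)
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes available")


lb = LoadBalancer(["node-1", "node-2", "node-3"])
lb.mark_down("node-2")
routed = [lb.route() for _ in range(4)]
print(routed)  # ['node-1', 'node-3', 'node-1', 'node-3']
```

Because the nodes hold no local state, skipping one loses capacity but no data.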
The routing layer in front of your horizontal instances has its own set of algorithm choices. Load balancing algorithms covers round-robin, least connections, and consistent hashing-based routing in detail.
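As a rough illustration of two of those choices, round-robin rotates through nodes regardless of load, while least connections picks whichever node has the fewest requests in flight. The class and node names below are assumptions for the sketch.

```python
import itertools


def round_robin(nodes):
    # Endless rotation through the node list, ignoring current load.
    return itertools.cycle(nodes)


class LeastConnectionsBalancer:
    def __init__(self, nodes):
        self.active = {node: 0 for node in nodes}

    def acquire(self):
        # Pick the node with the fewest in-flight requests (ties: first listed).
        node = min(self.active, key=self.active.get)
        self.active[node] += 1
        return node

    def release(self, node):
        self.active[node] -= 1


rr = round_robin(["a", "b", "c"])
rr_order = [next(rr) for _ in range(4)]

lc = LeastConnectionsBalancer(["a", "b", "c"])
first = [lc.acquire() for _ in range(3)]  # one request lands on each node
lc.release("b")                           # b finishes its request early
next_node = lc.acquire()                  # b is now the least loaded

print(rr_order, first, next_node)
```

Round-robin would have sent the fourth request to `a`. Least connections sends it to `b`, which matters when request durations vary widely.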
For distributed data stores where nodes are added and removed frequently, consistent hashing minimizes data movement when the cluster size changes. Consistent hashing is worth understanding before any interview where data distribution across nodes comes up.
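A minimal hash ring shows the property that matters. This is a sketch under stated assumptions, not a production implementation: `HashRing`, the node names, and the choice of 100 virtual nodes per server are all illustrative.

```python
import bisect
import hashlib


def _hash(value):
    # Stable 64-bit hash derived from SHA-256.
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=100):
        # Each node is placed on the ring at many points (virtual nodes).
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        index = bisect.bisect(self._points, _hash(key)) % len(self._points)
        return self._ring[index][1]


keys = [f"user:{i}" for i in range(10_000)]
before = HashRing(["n1", "n2", "n3", "n4"])
after = HashRing(["n1", "n2", "n3", "n4", "n5"])

moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
moved_fraction = len(moved) / len(keys)
print(f"{moved_fraction:.0%} of keys moved after adding a fifth node")
```

Adding a fifth node should move roughly a fifth of the keys, and every moved key lands on the new node. The existing nodes never trade keys with each other.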
The Pattern Every Production System Follows
A startup's API server starts on a single instance. Traffic is low, deployment is simple, one machine is easy to reason about. Vertical scaling as the base case.
As the product grows, they add a load balancer and a second API instance. The application was already stateless (sessions in Redis), so this requires no application changes. The compute tier scales horizontally; the database stays on one large instance for much longer than you'd expect.
Read replicas come when analytics queries start competing with production traffic. Sharding comes much later, if ever, because it introduces significant operational overhead that isn't warranted until write throughput is actually the bottleneck.
This isn't engineers being lazy. It's because the CAP theorem tells you that distributing state comes with real consistency and availability tradeoffs, and you should only take on that complexity when the alternative is worse.
Where Candidates Go Wrong
Treating horizontal as always superior. For a system handling moderate traffic with a complex stateful design, one large well-tuned instance is simpler, cheaper, and easier to operate. Vertical scaling is a legitimate engineering choice, not a sign of naivety.
Ignoring the database. Candidates focus on the stateless application tier and wave their hand at the database. Have a real opinion on when you'd introduce read replicas versus when you'd shard. "The database will also scale horizontally" is not a plan.
Skipping the bottleneck analysis. If you're CPU-bound, adding memory doesn't help. If you're I/O-bound, adding CPU cores doesn't help. Identify the constraint first, then prescribe the scaling approach.
Reaching for microservices to solve a scaling problem. A well-structured monolith with stateless API servers behind a load balancer scales horizontally without the operational overhead of a distributed system. Microservices are not the answer to a throughput problem.

When you memorized "horizontal = stateless, vertical = one big box" but the interviewer wants to know what happens to session state during a rolling deploy.
Practice Before It Counts
Reading about scaling tradeoffs is not the same as explaining them out loud under time pressure while someone is watching how you think. System design interviews are conversational, and the interviewer is evaluating your reasoning process as much as your final answer.
If you want to practice walking through scaling decisions the way you'd do it in a real interview, SpaceComplexity runs voice-based system design mocks with rubric-based feedback. You'll get a clear sense of whether your explanation of horizontal versus vertical scaling actually lands before it matters.
Further Reading
- Scalability, Wikipedia
- Amazon EC2 Instance Types, the actual hardware ceiling for vertical scaling
- Google Cloud Architecture Framework: System Design, Google's take on scalability patterns
- High Scalability, real-world case studies on how major systems actually scaled
- Martin Fowler: Patterns of Enterprise Application Architecture, foundational patterns including stateless service design