Stateless vs Stateful Services: The System Design Decision Guide

June 3, 2026 · 10 min read
Tags: system-design, interview-prep, distributed-systems, algorithms
TL;DR
  • Stateless services process each request in isolation; any instance serves any request, making horizontal scaling and fault tolerance trivial.
  • Sticky sessions route clients to a fixed server but break fault tolerance and prevent effective load rebalancing when demand shifts.
  • Externalizing state to Redis or a database makes the service tier stateless while keeping data persistent. The cost is one network hop.
  • JWT tokens are client-side state: the signed token carries all session data so the server stores nothing per-user.
  • Stateful services are unavoidable for databases, WebSocket servers, stream processors, and consensus protocols.
  • Stateful scale-out requires state migration or replication; stateless scale-out is mechanical. That gap drives most distributed architecture decisions.

The question sounds like a vocabulary quiz. Stateless: no memory. Stateful: has memory. Next question.

Except interviewers don't ask it as a vocab quiz. They ask it because the decision controls how easily you can scale, where your failure modes hide, and whether your load balancer has to be polite about where it sends requests. Saying "stateless is better" without explaining why is like saying "use the right tool for the job" without knowing what tools exist.

The right question isn't which type is better. It's where state lives. The rest follows from that.

Stateless: Every Request Is a First Date

A stateless service has no memory of prior requests. You send a request, the service processes it using only what you sent, and returns a response. Two consecutive requests from the same client could land on entirely different servers and get the same responses they would have gotten from a single one.

HTTP is the canonical stateless protocol. REST APIs are stateless. Most microservices should be stateless by design.

                        ┌─── Server A
Client ── Load Balancer ┼─── Server B   ← any server can handle this
                        └─── Server C

Because any instance can serve any request, you can add servers freely. The load balancer doesn't care which one gets the request. Autoscaling is trivial. If Server B dies, traffic shifts to A and C without incident, and Server B is not missed.
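
A minimal sketch of why routing stops mattering for a stateless handler. The names here (handle, RoundRobinBalancer, the server labels) are hypothetical and exist only for illustration; the point is that the response is a function of the request alone, so round-robin works without any affinity.

```python
import itertools


def handle(request: dict) -> dict:
    """Stateless handler: the response depends only on the request itself."""
    return {"user": request["user"], "total": sum(request["items"])}


class RoundRobinBalancer:
    """Hands each request to the next instance in turn. No affinity needed."""

    def __init__(self, instances):
        self.instances = list(instances)
        self._cycle = itertools.cycle(self.instances)

    def route(self, request):
        instance = next(self._cycle)
        # Every instance runs the same pure function on the same input.
        return instance, handle(request)


balancer = RoundRobinBalancer(["server-a", "server-b", "server-c"])
request = {"user": "alice", "items": [3, 4, 5]}

# The same request sent three times lands on three different servers.
results = [balancer.route(request) for _ in range(3)]
servers_used = [server for server, _ in results]
responses = [response for _, response in results]
```

All three responses are equal even though each came from a different server, which is the property the load balancer relies on.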

This is why stateless services are the default choice for API layers in most distributed systems. Check out horizontal vs vertical scaling for how this plays out at the infrastructure level.

Stateful: The Service Remembers You (This Gets Complicated)

A stateful service retains information across requests. A database is stateful. A WebSocket server tracking active connections is stateful. A multiplayer game server maintaining room state is stateful.

The challenge is physical: the state lives somewhere specific, and that somewhere becomes a constraint.

If your service stores session data in memory, that data exists only on the server where the session was created. Send the next request to a different server and the data is gone. The server doesn't know who you are. It's a new relationship every time.

Client A ── Load Balancer ── Server A  (Client A's session lives here)
Client B ── Load Balancer ── Server B  (Client B's session lives here)
                          ✗
                 Server A cannot serve Client B
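
To make the diagram concrete, here is a small sketch of two servers that each keep sessions in their own process memory. InMemorySessionServer and its methods are made-up names for illustration, not a real framework API.

```python
class InMemorySessionServer:
    """A server that keeps sessions in a dict inside its own process."""

    def __init__(self, name):
        self.name = name
        self.sessions = {}  # visible to this instance only

    def login(self, session_id, user):
        self.sessions[session_id] = {"user": user}

    def whoami(self, session_id):
        session = self.sessions.get(session_id)
        return session["user"] if session else None


server_a = InMemorySessionServer("A")
server_b = InMemorySessionServer("B")

# The client logs in on Server A...
server_a.login("sess-1", "alice")

# ...and the next request is routed somewhere else.
seen_by_a = server_a.whoami("sess-1")
seen_by_b = server_b.whoami("sess-1")
```

Server A recognizes the session and Server B returns nothing for it, because the dict holding the session never left Server A's memory.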

This isn't insurmountable. But it forces architectural decisions that stateless services never have to make.

The Sticky Session Compromise

The naive fix is sticky sessions (also called session affinity): configure the load balancer to always route a specific client to the same server. See load balancing algorithms for how this fits into the broader routing picture.

Client A ── Load Balancer ── Server A  (pinned)
Client B ── Load Balancer ── Server B  (pinned)
Client C ── Load Balancer ── Server A  (also pinned to A)

This works until it doesn't. Three problems:

Uneven load. If Client A's session is very active, Server A gets hammered regardless of other servers sitting idle. The load balancer can't rebalance what it has already pinned. You built a distributed system and then immediately took away the distribution.

No fault tolerance. If Server A dies, Client A's session data is gone. The user gets logged out mid-transaction, or the session must be rebuilt from scratch. Your "distributed" system now has the fault tolerance of a single server.

Adding servers doesn't help existing users. New Server D absorbs new clients, but Server A's load doesn't decrease. Your scale-out event fixes tomorrow's problem, not today's.
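
The third problem is easy to see in a toy model. The sketch below assumes a pin-table style of affinity (the balancer remembers which server each client was first sent to); real load balancers often use cookies or source-IP hashing instead, and StickyBalancer is a hypothetical class written for this example.

```python
from collections import Counter


class StickyBalancer:
    """Pins each client to the server it was first routed to."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.pins = {}  # client_id -> server

    def route(self, client_id):
        if client_id not in self.pins:
            # New clients go to the server with the fewest pinned clients.
            load = Counter(self.pins.values())
            self.pins[client_id] = min(self.servers, key=lambda s: load[s])
        return self.pins[client_id]

    def add_server(self, server):
        self.servers.append(server)


balancer = StickyBalancer(["A", "B"])
existing_clients = ["c1", "c2", "c3", "c4"]
for client in existing_clients:
    balancer.route(client)
load_before = Counter(balancer.pins.values())

# Scale out: a new server joins the pool.
balancer.add_server("D")
load_after = Counter(balancer.route(c) for c in existing_clients)
new_client_server = balancer.route("c5")
```

The existing clients stay exactly where they were after the scale-out; only the brand-new client lands on the new server.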

Sticky sessions are a pattern to know about in interviews. They're rarely the right answer. They're a symptom of state that should have been externalized.

The Real Pattern: Externalize the State

Modern distributed systems separate concerns cleanly. Services are stateless. State lives in dedicated, purpose-built stores.

                        ┌─── Service A ─┐
Client ── Load Balancer ┼─── Service B ─┼── Redis   (sessions, cache)
                        └─── Service C ─┘      │
                                            Database  (durable state)

Any service instance can handle any request, because they all read from the same external store. The service layer scales horizontally without restriction. Fault tolerance at the service layer becomes automatic. The complexity shifts to the state layer, which is engineered to handle it.
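
A sketch of the same session flow with the state moved out of the instances. SessionStore is an in-process stand-in for something like Redis (an assumption made so the example runs without a server), and ApiInstance is a hypothetical name.

```python
class SessionStore:
    """Stand-in for an external store such as Redis, shared by all instances."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


class ApiInstance:
    """Holds no session data itself, only a handle to the shared store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def login(self, session_id, user):
        self.store.set(session_id, {"user": user})

    def whoami(self, session_id):
        session = self.store.get(session_id)
        return session["user"] if session else None


store = SessionStore()
instances = [ApiInstance(name, store) for name in ("A", "B", "C")]

# Log in through instance A, then ask every instance who the user is.
instances[0].login("sess-1", "alice")
answers = [inst.whoami("sess-1") for inst in instances]

# Instance A dies. The session survives because it never lived there.
del instances[0]
survivors = [inst.whoami("sess-1") for inst in instances]
```

In a real deployment every call to the store is a network round trip, which is the cost this design accepts in exchange for interchangeable instances.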

This pattern shows up everywhere:

  • JWT tokens instead of server-side sessions: all session state lives in the signed token the client carries. The service verifies the signature and reads the claims. No lookup needed, no shared memory, no server that has to remember anything. The catch is revocation: a token stays valid until it expires unless the server keeps a denylist, which is state again.
  • Redis for shared session state: any service instance can read or write the session. Redis handles the stateful part at scale. Read caching strategies for when Redis is the right choice versus a full database.
  • Databases for durable state: obvious, but worth stating explicitly. The database is the stateful component. Your API layer is stateless.
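
The token idea from the first bullet can be sketched with the standard library. This is a simplified JWT-style token, not a spec-compliant JWT: it has no header, no expiry claim, and no algorithm negotiation. SECRET, issue_token, and verify_token are hypothetical names, and a real service would use a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"demo-secret-not-for-production"  # hypothetical signing key


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _unb64(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(payload: str) -> str:
    return _b64(hmac.new(SECRET, payload.encode(), hashlib.sha256).digest())


def issue_token(claims: dict) -> str:
    """Encode the claims and append an HMAC-SHA256 signature."""
    payload = _b64(json.dumps(claims, sort_keys=True).encode())
    return f"{payload}.{_sign(payload)}"


def verify_token(token: str):
    """Return the claims if the signature checks out, otherwise None."""
    try:
        payload, signature = token.split(".")
    except ValueError:
        return None
    if not hmac.compare_digest(signature, _sign(payload)):
        return None
    return json.loads(_unb64(payload))


token = issue_token({"sub": "alice", "role": "admin"})
claims = verify_token(token)           # any instance holding SECRET can do this
tampered = verify_token("x" + token)   # altered payload fails the signature check
```

Verification needs only the key and the token, so no instance has to look anything up or remember the client.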

What Each Choice Actually Costs You

Property               | Stateless                                    | Stateful
-----------------------+----------------------------------------------+-------------------------------------------------------
Horizontal scaling     | Trivial, any algorithm works                 | Complex, requires state migration or replication
Fault tolerance        | High, any instance is substitutable          | Lower, losing a node means losing its in-memory state
Load balancing         | No constraints                               | Requires affinity routing or external state sync
Latency                | Extra network hop to external store          | Data in-process, no extra hop
Operational complexity | Low                                          | High, especially for failover and replication
Consistency            | External store is the single source of truth | State can diverge across instances

The latency row matters more than people expect. A stateful service that holds data in memory skips the network hop to Redis or the database on every operation. For low-latency requirements (trading systems, gaming servers, real-time analytics), this can be the deciding factor. Stateful isn't always the wrong answer. It's just the harder one.

When Stateful Is the Right Call

Most application-layer services should be stateless. But some categories require statefulness.

Databases. The entire value proposition is that they remember things. PostgreSQL, Cassandra, DynamoDB are stateful by definition. You can't externalize the state from the thing whose whole job is to hold state.

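Real-time connections. WebSocket servers maintain a persistent connection per client, and that connection is state: it lives on one specific server process. You can either route all of a client's messages through the server that holds its connection, or put a message broker (Kafka, Redis Pub/Sub) between servers so a message can enter through any instance and still reach the right connection. The broker keeps everything except the connection itself out of the service tier.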
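
A rough sketch of the broker relay, under heavy simplification: Broker is an in-process stand-in for something like Redis Pub/Sub, a "connection" is just a list acting as an inbox, and Gateway is a hypothetical name. No real WebSocket library is involved.

```python
class Broker:
    """In-process stand-in for a pub/sub broker shared by all gateways."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, message):
        for callback in self.subscribers:
            callback(message)


class Gateway:
    """Holds live connections; everything else goes through the broker."""

    def __init__(self, name, broker):
        self.name = name
        self.broker = broker
        self.connections = {}  # client_id -> inbox (the per-instance state)
        broker.subscribe(self._on_message)

    def connect(self, client_id):
        self.connections[client_id] = []

    def send(self, to, text):
        # The sender does not need to know which gateway holds the recipient.
        self.broker.publish({"to": to, "text": text})

    def _on_message(self, message):
        inbox = self.connections.get(message["to"])
        if inbox is not None:
            inbox.append(message["text"])


broker = Broker()
gw1, gw2 = Gateway("gw-1", broker), Gateway("gw-2", broker)
gw1.connect("alice")
gw2.connect("bob")

# Alice's gateway publishes; Bob's gateway delivers.
gw1.send("bob", "hi bob")
bob_inbox = gw2.connections["bob"]
alice_inbox = gw1.connections["alice"]
```

Each gateway still owns its open connections, but nothing about message routing is pinned to a particular instance.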

Stream processing. Systems like Kafka Streams or Apache Flink maintain windowed state for aggregations. A running count, a sliding average, a session window are all stateful by nature.
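
As an illustration of why windowed aggregation cannot be stateless, here is a minimal tumbling-window counter. It is a hand-rolled sketch, not the Kafka Streams or Flink API, and it assumes integer timestamps in seconds that arrive in any order.

```python
from collections import defaultdict


class TumblingWindowCounter:
    """Counts events per key in fixed, non-overlapping time windows."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        # This dict is the state: lose it and the running counts are gone.
        self.counts = defaultdict(int)  # (window_start, key) -> count

    def process(self, timestamp, key):
        window_start = timestamp - (timestamp % self.window_seconds)
        self.counts[(window_start, key)] += 1
        return window_start, self.counts[(window_start, key)]


counter = TumblingWindowCounter(window_seconds=60)
events = [(5, "click"), (30, "click"), (59, "view"), (61, "click")]
outputs = [counter.process(ts, key) for ts, key in events]
```

The result for each event depends on every earlier event in the same window, so two instances that each saw half the stream would both report wrong counts.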

Leader-based coordination. Consensus protocols (Raft, Paxos) maintain persistent state about who holds leadership and what has been committed. Unavoidably stateful.

The pattern that emerges across all of these: application logic is stateless, infrastructure is stateful.

The Common Mistake: Claiming Stateless When You're Not

There's a failure mode worth naming, because it comes up in interviews and in real codebases.

Some candidates say "our services are stateless" when they aren't. The usual tells: the system relies on sticky sessions without anyone realizing it, each instance keeps a local database that nothing else reads, or there's an instance-specific cache-warming step. The service claims to be stateless the same way someone claims to "barely use" their phone while refreshing Twitter every four minutes.

A service is stateless only if any instance can serve any request with identical results. If you need to pin a client to a specific instance for correctness, you have a stateful service, whether you call it one or not.

Before claiming statelessness in an interview, ask: can my load balancer freely route this request to any instance? If yes, you're stateless. If not, dig into why.
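
One way to turn that question into a check is to replay the same requests twice, once pinned to a single instance and once spread across several, and compare the responses. The helper below is a hypothetical sketch; a passing result is only evidence, not proof, since hidden state may not affect the particular requests you replay.

```python
def looks_stateless(make_instance, requests, n_instances=3):
    """Compare responses when pinned to one instance vs. spread across many."""
    single = make_instance()
    pinned = [single(req) for req in requests]

    pool = [make_instance() for _ in range(n_instances)]
    spread = [pool[i % n_instances](req) for i, req in enumerate(requests)]
    return pinned == spread


def make_pure():
    # The response depends only on the request.
    return lambda req: req.upper()


def make_counting():
    seen = []  # hidden per-instance state

    def handler(req):
        seen.append(req)
        return f"{req}#{len(seen)}"

    return handler


requests = ["a", "b", "c", "d"]
pure_ok = looks_stateless(make_pure, requests)
counting_ok = looks_stateless(make_counting, requests)
```

The pure handler passes, and the counting handler fails as soon as its second request lands on a fresh instance.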

Scaling Stateful Services: The State Migration Problem

One question that separates good answers from great ones: what happens when you need to scale a stateful service?

For stateless services: add instances, update the load balancer, done. Ten minutes of work.

For stateful services: you have to move or replicate state. This is where consistent hashing shows up, because it limits how many keys change owners when a node joins or leaves. You cannot add a new database node and have it magically hold existing data. You need to migrate partitions, replicate, and handle the transition window where reads might hit either old or new nodes. It's an engineering project, not a config change.
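
A compact sketch of a consistent hash ring, to show what "migrate partitions" means in terms of keys. The node names, the choice of SHA-256, and the 100 virtual nodes per server are assumptions for illustration; production systems layer replication and rebalancing on top of this.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    return int(hashlib.sha256(value.encode()).hexdigest(), 16)


class HashRing:
    """Maps keys to nodes; each node owns many points on the ring."""

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (point, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def owner(self, key):
        # First ring point at or after the key's hash, wrapping around.
        idx = bisect.bisect(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


keys = [f"user:{i}" for i in range(10_000)]
ring = HashRing(["db-1", "db-2", "db-3"])
before = {k: ring.owner(k) for k in keys}

# Scale out by one node and see which keys changed owners.
ring.add("db-4")
after = {k: ring.owner(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
moved_to_new_node = all(after[k] == "db-4" for k in moved)
```

Every key that moves goes to the new node, and in expectation only about a quarter of the keys move at all. That moved set is still data that has to be copied while traffic keeps flowing, which is the transition window the paragraph above describes.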

This ceremony is the core reason stateless designs are preferred for application tiers. Stateless scale-out is mechanical. Stateful scale-out is an incident waiting to happen at 2am.

Stateless vs Stateful in a System Design Interview: How to Reason Through It

When you're designing a system, make your state decisions explicit. Interviewers want to see you reason through options, not just announce a conclusion.

A useful frame: ask yourself "if this service instance dies right now, what breaks?"

If the answer is "nothing, the load balancer routes to the next instance," you have a stateless service. If the answer is "all active users connected to that instance lose their session," you have a stateful problem that needs a solution.

Walk through it out loud:

  1. Identify what state this component needs (session data, connection state, in-progress computation).
  2. Decide where that state should live. In-memory on the instance? External store? Client-side token?
  3. State the tradeoff explicitly: "I'm storing sessions in Redis so we can scale the API tier horizontally. The tradeoff is an extra network hop on every authenticated request. At this scale, sub-millisecond Redis reads are acceptable."

Interviewers are scoring the reasoning, not just the conclusion. Arriving at the right answer silently is worth less than walking through the wrong answer and correcting yourself out loud.

The Short Version

  • A stateless service processes each request using only what's in that request. A stateful service maintains context across requests.
  • The standard modern pattern: stateless service layer over stateful infrastructure (databases, Redis, queues).
  • Sticky sessions solve the routing problem but break fault tolerance and prevent effective load balancing.
  • Externalizing state to Redis or a database gives you stateless services with persistent state. The cost is a network hop.
  • JWT tokens are client-side state: the token carries session data, so the server stores nothing per-user.
  • In a system design interview, explain not just what you chose but what breaks if you choose wrong.

Practice narrating decisions like this under pressure at SpaceComplexity, which runs voice-based mock system design interviews with rubric-based feedback. The hardest part of state management questions isn't knowing the answer. It's explaining your reasoning clearly enough for an interviewer to follow while they're also filling out a scorecard.
