Session Store System Design Interview: The 45-Minute Walkthrough

- Sticky sessions fail on node loss and JWTs cannot be revoked before expiry; centralized Redis with an HttpOnly cookie is the only correct architecture.
- Session IDs need 128 to 160 bits of CSPRNG entropy; token_urlsafe(20) is safer than UUID v4 under high-concurrency login bursts.
- Two TTL strategies coexist: Redis EXPIRE handles idle timeout; an absoluteExpiry hash field checked in middleware enforces hard session limits regardless of activity.
- Use hash tags ({userId}) on both the session hash and the user-sessions Set so Redis Cluster places them on the same slot, enabling atomic multi-key ops.
- A single Redis instance hits a ~200K ops/sec CPU ceiling (single-threaded); Sentinel adds HA without sharding; Cluster adds both HA and horizontal write scaling.
- Session fixation prevention requires regenerating the session ID on every login; every session cookie must carry Secure, HttpOnly, SameSite=Strict, and the __Host- prefix.
Every authenticated request your app handles starts with the same hidden step: look up a session. Not once per login. Once per request. At 50,000 requests per second, that's 50,000 key-value lookups before you've done anything useful. Get the design wrong and you've built a latency floor under every feature you'll ever ship.
That's why the session store system design interview is a staple across Meta, Google, and Amazon loops. It looks like a simple CRUD problem. It isn't. There are exactly two popular approaches that seem fine until they catch fire, and one that actually works. Interviewers want to know if you can tell the difference.
Start With the Clarifying Questions
You have five minutes. Use them.
Functional scope: Do users need to view and revoke sessions from other devices? That adds a secondary index. Is there an admin force-logout flow? Does the session carry application state like cart contents, or just authentication identity?
Non-functional targets: Read latency target (p99 under 10ms is standard). Write throughput ceiling. Concurrent session count. Acceptable data loss window on a primary failure.
For a typical 10M DAU product: 500K concurrent sessions at peak, roughly 1 KB each, 80 million read operations per day, bursts to 10K+ reads per second. Those numbers determine whether you need Redis Cluster or a single Redis primary with replicas. Nail the capacity estimate early so every architecture decision has a quantitative anchor, not a vague feeling.
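The arithmetic behind that estimate can be written out in a few lines. This is a rough sketch using only the figures quoted above; it counts raw session payload and ignores Redis per-key overhead and replica copies, which is an assumption worth stating out loud in the interview.

```python
# Back-of-envelope capacity estimate using the figures from the text.
CONCURRENT_SESSIONS = 500_000   # peak concurrent sessions
SESSION_BYTES = 1_000           # roughly 1 KB per session
READS_PER_DAY = 80_000_000      # daily session lookups
PEAK_READS_PER_SEC = 10_000     # burst figure
SECONDS_PER_DAY = 86_400


def estimate() -> dict:
    """Return raw memory footprint and average vs. peak read rates."""
    memory_mb = CONCURRENT_SESSIONS * SESSION_BYTES / 1_000_000
    avg_reads_per_sec = READS_PER_DAY / SECONDS_PER_DAY
    return {
        "memory_mb": memory_mb,
        "avg_reads_per_sec": round(avg_reads_per_sec),
        "peak_to_avg_ratio": round(PEAK_READS_PER_SEC / avg_reads_per_sec, 1),
    }


print(estimate())
```

Roughly 500 MB of payload and an average under 1,000 reads per second: the burst rate, not the average, is what sizes the deployment.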
Three Architectures, Two Dead Ends
Sticky sessions (session affinity via load balancer) pin each user to one app server. The session lives in-process memory. No external hop. Simple. But a node failure wipes every session pinned to it, and you cannot scale your app servers without redirecting users mid-session. This architecture fails in the exact ways that make users angry: random logouts during deploys, support tickets at 2am. Cut it early. Say why.
Client-side tokens (JWT) encode the session in a signed token the client sends on every request. Zero server-side storage. Infinite horizontal scale. Sounds great until you need to revoke one. The fatal problem is that you cannot revoke a JWT before it expires. User logs out? Token stays valid. Credentials compromised? Attacker keeps access until expiry. Solving this requires a server-side blacklist, which reintroduces most of the complexity you tried to avoid. Congratulations, you have invented a worse version of what we're about to build.
The right answer is a centralized session store backed by Redis, with the session ID delivered as an HttpOnly cookie. Stateless app servers, immediate revocation, sub-millisecond reads within a datacenter. Commit to this architecture and draw it:

Request flow: cookie in, Redis lookup, response out. Every component to the left of Redis is stateless and disposable.
The Data Model Has Two Non-Obvious Parts
The primary key in Redis is sess:{sessionId}. Store the session as a Redis hash:
| Field | Type |
|---|---|
| sessionId | string (opaque 160-bit random token) |
| userId | string |
| createdAt | unix timestamp |
| lastActivity | unix timestamp |
| absoluteExpiry | unix timestamp |
| ipAddress | string |
| userAgent | string |
Session ID generation is a security requirement, not an implementation detail. UUID v4 technically gives 122 bits of entropy, which meets the OWASP minimum, but some implementations contend on a lock under high concurrency (Java's UUID.randomUUID() draws from a single shared SecureRandom instance) and the format spends 6 bits on version/variant markers. The safer default is a CSPRNG-generated 160-bit token:
# Python
import secrets
session_id = secrets.token_urlsafe(20)  # 27-character URL-safe base64

// Node.js
const crypto = require('node:crypto');
const sessionId = crypto.randomBytes(20).toString('base64url');
27 characters. Brute-force resistant under any realistic threat model. No lock contention. No wasted bits encoding the version marker of a spec nobody asked for.
Two TTL values coexist, and only one can live purely in Redis.
Redis handles the sliding idle timeout natively. Call EXPIRE sess:{sessionId} 1800 on every successful read and you get automatic eviction on idle sessions. But Redis TTL cannot enforce an absolute session limit regardless of activity. Store absoluteExpiry as a field in the hash and check it in your validation middleware. A user who has been continuously active for 8 hours gets terminated even mid-request. This is intentional. Banks have known this for decades.
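A minimal sketch of that split, assuming the hash layout above. The store argument only needs hgetall, expire, and delete methods (redis-py clients expose methods with those names); FakeStore is a hypothetical in-memory stand-in so the example runs without a Redis server.

```python
import time
from typing import Optional

IDLE_TIMEOUT_SECONDS = 1800  # sliding idle window from the text


def validate_session(store, session_id: str,
                     now: Optional[float] = None) -> Optional[dict]:
    """Return the session dict if still valid, else None."""
    now = time.time() if now is None else now
    key = f"sess:{session_id}"
    session = store.hgetall(key)  # empty dict: unknown or idle-expired key
    if not session:
        return None
    if now >= float(session["absoluteExpiry"]):
        store.delete(key)  # hard limit reached, even for an active user
        return None
    store.expire(key, IDLE_TIMEOUT_SECONDS)  # slide the idle window
    return session


class FakeStore:
    """Hypothetical in-memory stand-in for a Redis client."""

    def __init__(self):
        self.data = {}
        self.ttl = {}

    def hgetall(self, name):
        return dict(self.data.get(name, {}))

    def expire(self, name, seconds):
        self.ttl[name] = seconds
        return True

    def delete(self, *names):
        for name in names:
            self.data.pop(name, None)
        return len(names)


store = FakeStore()
store.data["sess:abc"] = {"userId": "42", "absoluteExpiry": "1000"}
print(validate_session(store, "abc", now=999))   # still inside the hard limit
print(validate_session(store, "abc", now=1000))  # hard limit reached
```

The idle timeout never appears in application logic; only the absolute limit does.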
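If you need device management (list all sessions for a user), add a secondary index: a Redis Set at user_sessions:{userId} containing all active session IDs. Use the same {userId} hash tag on both the Set and the session hash so Redis Cluster places them on the same slot. In practice that means the session key has to embed the user ID, for example sess:{userId}:sessionId with the braces kept literally in the key, so the cookie value must carry enough to rebuild it. With both keys on one slot, a MULTI/EXEC transaction or Lua script can delete the session hash and SREM its ID from the Set atomically. Without hash tags, any multi-key command, transaction, or script that touches both keys fails with a CROSSSLOT error and your device revocation flow breaks. This is the detail that separates candidates who've actually run Redis in production from candidates who've read about it.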

The hash tag is the trick. Without it, the session hash and the user-sessions Set can land on different cluster nodes, and any transaction that touches both fails with CROSSSLOT.
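To see why the tag works, here is a small sketch of the slot calculation: Redis Cluster hashes only the text between the first { and the next } when that text is non-empty, then takes CRC16 mod 16384. The key layout in session_key is a hypothetical example, not a required naming scheme.

```python
from binascii import crc_hqx

HASH_SLOTS = 16_384


def key_slot(key: str) -> int:
    """Compute the Redis Cluster hash slot for a key, honoring hash tags."""
    raw = key.encode()
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        if end != -1 and end != start + 1:  # an empty "{}" is not a tag
            raw = raw[start + 1:end]
    # crc_hqx with an initial value of 0 is the CRC16 (XMODEM) variant
    return crc_hqx(raw, 0) % HASH_SLOTS


def session_key(user_id: str, session_id: str) -> str:
    # Hypothetical layout: the user ID is the hash tag.
    return f"sess:{{{user_id}}}:{session_id}"


def user_sessions_key(user_id: str) -> str:
    return f"user_sessions:{{{user_id}}}"


print(session_key("42", "abc"), key_slot(session_key("42", "abc")))
print(user_sessions_key("42"), key_slot(user_sessions_key("42")))
```

Both keys print the same slot number, because only the tag contents are hashed.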
Six Endpoints Cover the Full Lifecycle
| Endpoint | Purpose |
|---|---|
| POST /sessions | Create on login, return sessionId, set cookie |
| GET /sessions/{id} | Validate on every authenticated request |
| PATCH /sessions/{id} | Update lastActivity, device metadata |
| DELETE /sessions/{id} | Logout, immediate invalidation |
| GET /users/{userId}/sessions | List active devices |
| DELETE /users/{userId}/sessions | Force-logout all devices |
The validate endpoint is the hot path. It gets called on every authenticated request. It must cost a single Redis round trip: one HGETALL, with the EXPIRE that slides the idle timeout pipelined alongside it, 0.2 to 1ms within a datacenter. If validation is slower than that, check your connection pooling before looking at anything else. Nine times out of ten the problem is opening a new connection per request instead of reusing a pool.
The create endpoint has one non-obvious rule: issue a fresh session ID on every successful login, even if the client already has a session. This prevents session fixation attacks (covered in the security section). Never reuse an old ID through a login event.
Redis Session Store: The Throughput Ceiling You Need to Know
A single Redis instance on modern hardware handles roughly 72,000 operations per second without pipelining, per the official redis-benchmark. With pipeline depth 16 the same benchmark pushes past 1 million ops per second, but session lookups arrive as independent requests and rarely batch that well. Redis executes commands on a single thread, so for this workload one node saturates its CPU at around 200K ops/sec. Know this number. It is the natural transition from single-primary to cluster. Write it down during the interview.
Redis Cluster shards the keyspace across N primary nodes using 16,384 hash slots (CRC16 mod 16384). The number 16,384 is not arbitrary: a bitmap with one bit per slot fits in 2 KB, keeping gossip heartbeats small. Adding nodes migrates slot ranges live with no downtime. For a 10M DAU product at 10K+ reads per second, three to six primary nodes is typical, each running well below its CPU ceiling.
Hot partitions are the failure mode most candidates miss. See hot partition patterns in system design interviews. Session access is not uniform. A user with 10 million followers creates a thundering herd against a single slot. Three mitigations:
- Read replicas per shard: Route reads round-robin across replicas. Reads outnumber writes 10 to 1 in a session store.
- Application-level L1 cache: Cache validated sessions in app server memory for 5 seconds. This eliminates 80 to 90% of Redis reads for active users with minimal stale-session risk.
- Avoid low-cardinality hash tags: {status} routes half your keyspace to one node. Use {sessionId} or {userId}.

Reads go to replicas. Writes go to primaries. The L1 cache on each app server kills 80-90% of Redis reads before they even leave the box.
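A minimal sketch of the app-level L1 cache, assuming a load callable that performs the real Redis lookup (hypothetical here). It is deliberately simple: no size bound and no cross-server invalidation, so a session revoked elsewhere can stay valid on this server for up to ttl_seconds.

```python
import time
from typing import Callable, Optional


class SessionL1Cache:
    """Per-process cache of validated sessions with a short TTL."""

    def __init__(self, load: Callable[[str], Optional[dict]],
                 ttl_seconds: float = 5.0,
                 clock: Callable[[], float] = time.monotonic):
        self._load = load          # falls through to Redis on a miss
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}         # session_id -> (expires_at, session)

    def get(self, session_id: str) -> Optional[dict]:
        now = self._clock()
        hit = self._entries.get(session_id)
        if hit is not None and hit[0] > now:
            return hit[1]          # fresh entry: no Redis read
        session = self._load(session_id)
        if session is None:
            self._entries.pop(session_id, None)
            return None
        self._entries[session_id] = (now + self._ttl, session)
        return session

    def invalidate(self, session_id: str) -> None:
        """Call on local logout so this server stops serving the session."""
        self._entries.pop(session_id, None)
```

The 5-second TTL is the trade-off knob: a longer TTL saves more Redis reads but widens the window in which a revoked session still validates.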
Background on sharding strategies: distributed cache system design and consistent hashing in system design interviews.
Sentinel and Cluster Do Different Things
This question comes up every time. They sound similar. They do not overlap.
Redis Sentinel provides automatic failover for a single primary plus replicas. Three Sentinel processes form a quorum. When the primary goes down, Sentinels elect a replacement within roughly 5 seconds. No sharding. All writes still go to one node. Sentinel is the right choice when your write throughput fits on one machine and you only need HA.
Redis Cluster provides both horizontal write scaling and HA. Each shard has its own primary and replicas. Failover is per-shard. Minimum six nodes (three primaries, three replicas). Use Cluster when you have hit the single-primary ceiling.
The confusion is understandable. Both handle failover. But Sentinel does not shard, and Cluster does not need a separate quorum process. If you say "I'll use Sentinel for the sharding," the interviewer will mark that down. The words sound close. The concepts are not.
Both use asynchronous replication. If the primary receives a write and crashes before syncing to any replica, that write is lost. For sessions this is usually acceptable. The user gets logged out and logs back in. For sessions carrying financial state or audit requirements, enable AOF with hybrid persistence (appendonly yes, aof-use-rdb-preamble yes, appendfsync everysec): compact RDB snapshots plus an append-only log, with at most about one second of data loss.
Security Has Four Non-Negotiables
Regenerate the session ID on every privilege change. Session fixation attacks work because an attacker can plant a pre-authentication session ID and ride it through login. Issuing the same ID before and after login hands the attacker an authenticated session. Rails: reset_session. Django: request.session.cycle_key(). Java: session.invalidate() followed by request.getSession(true). The fix is one line of code, and skipping it remains one of the most common web vulnerabilities in enterprise apps.
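Outside a framework, the same rule looks roughly like this. It is a sketch under stated assumptions: sessions is a hypothetical in-memory dict standing in for the Redis store, and login is an illustrative helper, not any framework's API.

```python
import secrets
from typing import Optional


def new_session_id() -> str:
    return secrets.token_urlsafe(20)  # 160 bits from the OS CSPRNG


def login(sessions: dict, old_session_id: Optional[str], user_id: str) -> str:
    """Rotate the session ID at login so a planted ID never gets authenticated."""
    if old_session_id is not None:
        sessions.pop(old_session_id, None)  # the pre-login ID dies here
    fresh_id = new_session_id()
    sessions[fresh_id] = {"userId": user_id}
    return fresh_id


sessions = {"planted-by-attacker": {}}
fresh = login(sessions, "planted-by-attacker", "42")
print("planted-by-attacker" in sessions, fresh in sessions)
```

Whatever ID the browser arrived with is discarded, so an attacker who planted it holds nothing after login.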
Cookie attributes, all four required:
- Secure: HTTPS only. Without this, the session ID travels in plaintext.
- HttpOnly: JavaScript cannot read it. Blocks XSS-based session theft.
- SameSite=Strict: The cookie is not sent on cross-site requests, which blocks cross-site request forgery.
- __Host- prefix: Binds the cookie to the exact HTTPS host. Browsers reject a __Host- cookie unless it is Secure, has Path=/, and omits the Domain attribute.
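One way to assemble such a header with Python's standard library is sketched below. The cookie name __Host-sid and the 1800-second Max-Age are illustrative assumptions; most web frameworks expose the same attributes through their own cookie-setting call.

```python
from http.cookies import SimpleCookie


def session_cookie_header(session_id: str, max_age: int = 1800) -> str:
    """Build the value of a Set-Cookie header carrying all four protections."""
    cookie = SimpleCookie()
    name = "__Host-sid"            # hypothetical generic name with the prefix
    cookie[name] = session_id
    morsel = cookie[name]
    morsel["secure"] = True        # HTTPS only
    morsel["httponly"] = True      # hidden from JavaScript
    morsel["samesite"] = "Strict"  # not sent on cross-site requests
    morsel["path"] = "/"           # required by the __Host- prefix
    morsel["max-age"] = max_age
    # No Domain attribute is set: the __Host- prefix forbids it.
    return morsel.OutputString()


print(session_cookie_header("example-session-id"))
```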
The session ID must come from a CSPRNG. OWASP's Session Management Cheat Sheet sets the floor at 64 bits of entropy; 128 bits or more is the safer target and costs nothing. The cookie name should be generic (id, sid) rather than PHPSESSID or JSESSIONID, which advertise the framework to attackers. This is free information for anyone running a scanner. Don't give it to them.
SameSite=Strict covers most modern browsers for CSRF, but the synchronizer token pattern (store a CSRF token in the session, embed it in every form, verify on every state-changing request) is required for older clients and is defense in depth. Belt and suspenders.
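The synchronizer token pattern is small enough to sketch. The session dict and the csrfToken field name are assumptions for illustration; the constant-time comparison is the part worth keeping in any real implementation.

```python
import hmac
import secrets


def issue_csrf_token(session: dict) -> str:
    """Store a fresh token in the session and return it for embedding in forms."""
    token = secrets.token_urlsafe(32)
    session["csrfToken"] = token  # hypothetical field name
    return token


def csrf_ok(session: dict, submitted: str) -> bool:
    """Check the submitted token in constant time on state-changing requests."""
    expected = session.get("csrfToken", "")
    if not expected:
        return False
    return hmac.compare_digest(expected.encode(), submitted.encode())


session = {}
token = issue_csrf_token(session)
print(csrf_ok(session, token), csrf_ok(session, "forged"))
```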
How to Fill 45 Minutes
| Phase | Time |
|---|---|
| Requirements and capacity estimate | 5 min |
| High-level design, eliminate sticky sessions and JWT | 10 min |
| Data model plus API (including TTL split and secondary index) | 8 min |
| Scaling (throughput numbers, hot partitions, L1 cache) | 10 min |
| HA and persistence (Sentinel vs Cluster) | 7 min |
| Security requirements | 5 min |
Most candidates run out of time after drawing "centralized Redis" and never reach hot partitions or security. Interviewers score on depth. The hash tag trick for secondary indexes, the distinction between Redis TTL for idle timeout versus a stored field for absolute timeout, and knowing Sentinel does not shard are the details that separate a passing mark from a strong hire.
Recap
- Sticky sessions fail on node loss; JWTs fail on revocation; centralized Redis is the right answer
- Session IDs need 128 to 160 bits of CSPRNG entropy, prefer token_urlsafe(20) over UUID v4
- Idle timeout lives in Redis TTL; absolute timeout lives as a hash field checked in application code
- Secondary indexes for user-to-sessions need hash tags to land on the same Redis Cluster slot
- Single Redis instance: 72K ops/sec without pipelining, CPU ceiling at 200K ops/sec single-threaded
- Hot partitions: add read replicas and an L1 app-level cache for active session validation
- Sentinel is HA without sharding; Cluster is HA plus horizontal write scaling
- Session fixation: always regenerate the session ID on login
- Cookie flags: Secure + HttpOnly + SameSite=Strict + __Host- prefix, all four required