Session Store System Design Interview: The 45-Minute Walkthrough

Every authenticated request your app handles starts with the same hidden step: look up a session. Not once per login. Once per request. At 50,000 requests per second, that's 50,000 key-value lookups every second before you've done anything useful. Get the design wrong and you've built a latency floor under every feature you'll ever ship.

That's why the session store system design interview is a staple across Meta, Google, and Amazon loops. It looks like a simple CRUD problem. It isn't. There are exactly two popular approaches that seem fine until they catch fire, and one that actually works. Interviewers want to know if you can tell the difference.

Start With the Clarifying Questions

You have five minutes. Use them.

Functional scope: Do users need to view and revoke sessions from other devices? That adds a secondary index. Is there an admin force-logout flow? Does the session carry application state like cart contents, or just authentication identity?

Non-functional targets: Read latency target (p99 under 10ms is standard). Write throughput ceiling. Concurrent session count. Acceptable data loss window on a primary failure.

For a typical 10M DAU product: 500K concurrent sessions at peak, roughly 1 KB each, 80 million read operations per day, bursts to 10K+ reads per second. Those numbers determine whether you need Redis Cluster or a single Redis primary with replicas. Nail the capacity estimate early so every architecture decision has a quantitative anchor, not a vague feeling.
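The arithmetic behind those figures is worth doing out loud. A minimal sketch, using only the numbers quoted above and assuming 1 KB means 1,000 bytes and that the working set counts raw session payload only (no Redis per-key overhead, no replicas):

```python
# Inputs taken from the estimate in the text.
CONCURRENT_SESSIONS = 500_000
SESSION_BYTES = 1_000            # "roughly 1 KB each" (assumed 1,000 bytes)
READS_PER_DAY = 80_000_000
PEAK_READS_PER_SEC = 10_000
SECONDS_PER_DAY = 86_400

# Raw payload held in memory at peak, in megabytes.
working_set_mb = CONCURRENT_SESSIONS * SESSION_BYTES / 1_000_000

# Average read rate implied by the daily total.
avg_reads_per_sec = READS_PER_DAY / SECONDS_PER_DAY

# How much spikier the peak is than the average.
peak_to_avg = PEAK_READS_PER_SEC / avg_reads_per_sec

print(f"working set: {working_set_mb:.0f} MB")
print(f"average reads/sec: {avg_reads_per_sec:.0f}")
print(f"peak-to-average ratio: {peak_to_avg:.1f}x")
```

The takeaway is that the data set is small (about half a gigabyte) while the peak is roughly ten times the average, so the design is driven by burst read rate rather than by memory.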

Three Architectures, Two Dead Ends

Sticky sessions (session affinity via load balancer) pin each user to one app server. The session lives in-process memory. No external hop. Simple. But a node failure wipes every session pinned to it, and you cannot scale your app servers without redirecting users mid-session. This architecture fails in the exact ways that make users angry: random logouts during deploys, support tickets at 2am. Cut it early. Say why.

Client-side tokens (JWT) encode the session in a signed token the client sends on every request. Zero server-side storage. Infinite horizontal scale. Sounds great until you need to revoke one. The fatal problem is that you cannot revoke a JWT before it expires. User logs out? Token stays valid. Credentials compromised? Attacker keeps access until expiry. Solving this requires a server-side blacklist, which reintroduces most of the complexity you tried to avoid. Congratulations, you have invented a worse version of what we're about to build.
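To make the revocation problem concrete, here is a minimal sketch using a simplified HMAC-signed token as a stand-in for a real JWT (no header, no library). The names SECRET, issue, verify, and the jti-based denylist are illustrative assumptions, not any library's API:

```python
import base64
import hashlib
import hmac
import json
from typing import AbstractSet

SECRET = b"demo-secret"  # hypothetical signing key, for illustration only


def _b64(data: bytes) -> str:
    """URL-safe base64 without padding."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _sign(payload: str) -> str:
    return _b64(hmac.new(SECRET, payload.encode(), hashlib.sha256).digest())


def issue(user_id: str, token_id: str, expires_at: int) -> str:
    """Build 'payload.signature' carrying the user, a token ID, and an expiry."""
    claims = {"sub": user_id, "jti": token_id, "exp": expires_at}
    payload = _b64(json.dumps(claims).encode())
    return f"{payload}.{_sign(payload)}"


def verify(token: str, now: int, denylist: AbstractSet[str] = frozenset()) -> bool:
    """Stateless check: signature and expiry. The denylist is the server-side
    state that revocation forces back into the design."""
    payload, _, signature = token.partition(".")
    if not hmac.compare_digest(signature, _sign(payload)):
        return False
    padded = payload + "=" * (-len(payload) % 4)
    claims = json.loads(base64.urlsafe_b64decode(padded))
    if now >= claims["exp"]:
        return False
    return claims["jti"] not in denylist


token = issue("u1", "tok-1", expires_at=1000)
still_valid_after_logout = verify(token, now=500)             # nothing to revoke against
revoked = not verify(token, now=500, denylist={"tok-1"})      # requires a shared lookup
```

Without the denylist argument, logging out changes nothing about what verify returns. With it, every request performs a shared lookup again, which is the server-side store you were trying to avoid.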

The right answer is a centralized session store backed by Redis, with the session ID delivered as an HttpOnly cookie. Stateless app servers, immediate revocation, sub-millisecond reads within a datacenter. Commit to this architecture and draw it:

Session store architecture: browser sends HttpOnly cookie through load balancer to stateless app servers, which validate against Redis Cluster. Optional audit DB for durable log.

Request flow: cookie in, Redis lookup, response out. Every component to the left of Redis is stateless and disposable.

The Data Model Has Two Non-Obvious Parts

The primary key in Redis is sess:{sessionId}. Store the session as a Redis hash:

sessionId        string   (opaque 160-bit random token)
userId           string
createdAt        unix timestamp
lastActivity     unix timestamp
absoluteExpiry   unix timestamp
ipAddress        string
userAgent        string

Session ID generation is a security requirement, not an implementation detail. UUID v4 technically gives 122 bits of entropy, which meets the OWASP minimum, but it spends 6 of its 128 bits on version/variant markers, and some implementations contend under high concurrency (Java's UUID.randomUUID() draws from a single shared SecureRandom instance). The safer default is a CSPRNG-generated 160-bit token:

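# Python
import secrets
session_id = secrets.token_urlsafe(20)  # 20 random bytes -> 27-character URL-safe base64

// Node.js
const crypto = require('crypto');
const sessionId = crypto.randomBytes(20).toString('base64url');  // 20 random bytes -> 27 characters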

27 characters. Brute-force resistant under any realistic threat model. No lock contention. No wasted bits encoding the version marker of a spec nobody asked for.

Two TTL values coexist, and only one can live purely in Redis.

Redis handles the sliding idle timeout natively. Call EXPIRE sess:{sessionId} 1800 on every successful read and you get automatic eviction on idle sessions. But Redis TTL cannot enforce an absolute session limit regardless of activity. Store absoluteExpiry as a field in the hash and check it in your validation middleware. A user who has been continuously active for 8 hours gets terminated even mid-request. This is intentional. Banks have known this for decades.
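The two timeouts can be sketched together. This is a minimal in-memory stand-in for Redis, not a client for it: FakeSessionStore, its method names, and the max_lifetime parameter are hypothetical, and the _deadline dictionary plays the role of the Redis key TTL.

```python
from typing import Dict, Optional

IDLE_TIMEOUT = 1800  # seconds, the EXPIRE value used in the text


class FakeSessionStore:
    """In-memory stand-in for Redis: hash fields plus a per-key TTL deadline."""

    def __init__(self) -> None:
        self._data: Dict[str, dict] = {}
        self._deadline: Dict[str, int] = {}

    def create(self, session_id: str, user_id: str, now: int, max_lifetime: int) -> None:
        key = f"sess:{session_id}"
        self._data[key] = {
            "userId": user_id,
            "createdAt": now,
            "lastActivity": now,
            "absoluteExpiry": now + max_lifetime,  # fixed at creation, never extended
        }
        self._deadline[key] = now + IDLE_TIMEOUT

    def _delete(self, key: str) -> None:
        self._data.pop(key, None)
        self._deadline.pop(key, None)

    def validate(self, session_id: str, now: int) -> Optional[dict]:
        key = f"sess:{session_id}"
        # Idle timeout: Redis would already have evicted the key.
        if key not in self._data or now >= self._deadline[key]:
            self._delete(key)
            return None
        session = self._data[key]
        # Absolute timeout: enforced in application code, regardless of activity.
        if now >= session["absoluteExpiry"]:
            self._delete(key)
            return None
        # Successful read: slide the idle window (EXPIRE sess:... 1800).
        session["lastActivity"] = now
        self._deadline[key] = now + IDLE_TIMEOUT
        return session
```

A session touched every few minutes never hits the idle deadline, yet validate still returns None once now reaches absoluteExpiry. That second check is the part Redis cannot do for you.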

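If you need device management (list all sessions for a user), add a secondary index: a Redis Set at user_sessions:{userId} containing all active session IDs. In Redis Cluster, literal braces in a key form a hash tag, and only the text inside them is hashed to pick the slot. To co-locate a user's sessions with the index, the session key has to carry the same tag, for example sess:{userId}:sessionId, which means the lookup must supply the user ID along with the session ID. With both keys on one slot, a MULTI/EXEC transaction or Lua script can delete the session hash and SREM its ID from the set atomically. Without the shared tag, that multi-key operation spans slots, fails with a CROSSSLOT error, and your device revocation flow breaks. This is the detail that separates candidates who've actually run Redis in production from candidates who've read about it.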

Redis key layout: sess:{sessionId} hash on the left with all fields; user_sessions:{userId} Set on the right with active session IDs. Both share the {userId} hash tag to land on the same cluster slot.

The hash tag is the trick. Without it, the session hash and the user's session set can land on different cluster nodes, and any transaction or script that touches both during device revocation fails with CROSSSLOT.
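How a hash tag decides placement can be checked in a few lines. Redis Cluster maps a key to a slot with CRC16 (the XMODEM variant) modulo 16,384, hashing only the text between the first "{" and the next "}" when that text is non-empty. The sketch below reimplements that rule in plain Python; the key names and the user ID u42 are hypothetical examples in which the braces are literal characters of the key.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021 and initial value 0, as used for cluster slots."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: str) -> int:
    """Return the cluster slot (0..16383) for a key, honoring hash tags."""
    raw = key.encode()
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        # Only a non-empty tag counts; "{}" means the whole key is hashed.
        if end != -1 and end > start + 1:
            raw = raw[start + 1:end]
    return crc16_xmodem(raw) % 16384


# Both keys hash only "u42", so they always share a slot.
same_slot = key_slot("sess:{u42}:abc") == key_slot("user_sessions:{u42}")  # True

# Without a shared tag, each key is hashed in full, so nothing ties the two together.
untagged = (key_slot("sess:abc"), key_slot("user_sessions:u42"))
```

Running key_slot over your real key patterns before committing to a naming scheme is a cheap way to confirm that the keys you intend to touch together actually co-locate.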

Six Endpoints Cover the Full Lifecycle

POST   /sessions                   Create on login, return sessionId, set cookie
GET    /sessions/{id}              Validate on every authenticated request
PATCH  /sessions/{id}              Update lastActivity, device metadata
DELETE /sessions/{id}              Logout, immediate invalidation
GET    /users/{userId}/sessions    List active devices
DELETE /users/{userId}/sessions    Force-logout all devices

The validate endpoint is the hot path. It gets called on every authenticated request. It must be a single round trip to Redis: one HGETALL, with the EXPIRE refresh pipelined alongside it, at 0.2 to 1ms within a datacenter. If validation is slower than that, check your connection pooling before looking at anything else. Nine times out of ten the problem is opening a new connection per request instead of reusing a pool.

The create endpoint has one non-obvious rule: issue a fresh session ID on every successful login, even if the client already has a session. This prevents session fixation attacks (covered in the security section). Never reuse an old ID through a login event.

Redis Session Store: The Throughput Ceiling You Need to Know

A single Redis instance on modern hardware handles roughly 72,000 operations per second without pipelining, per the official redis-benchmark. With pipeline depth 16 the same benchmark pushes past 1 million ops per second, but session validation rarely gets that luxury: lookups arrive one per request from many independent clients and cannot be batched that deeply. Because Redis executes commands on a single thread, one core is the real ceiling, and a reasonable planning figure for a single node under a realistic session workload is around 200K ops/sec. Know this number. It is the natural transition from single-primary to cluster. Write it down during the interview.

Redis Cluster shards the keyspace across N primary nodes using 16,384 hash slots (CRC16 mod 16384). The number 16,384 is not arbitrary: a bitmap with one bit per slot fits in 2 KB, keeping gossip heartbeats small. Adding nodes migrates slot ranges live with no downtime. For a 10M DAU product at 10K+ reads per second, three to six primary nodes is typical, each running well below its CPU ceiling.

Hot partitions are the failure mode most candidates miss. See hot partition patterns in system design interviews. Session access is not uniform. One account generating outsized traffic (a shared service login or a heavily automated API client, for example) creates a thundering herd against a single slot. Three mitigations:

Read replicas per shard: Route reads round-robin across replicas. Reads outnumber writes 10 to 1 in a session store.
Application-level L1 cache: Cache validated sessions in app server memory for 5 seconds. This eliminates 80 to 90% of Redis reads for active users. The trade-off is that a revoked session can stay valid for up to 5 seconds on any server that cached it.
Avoid low-cardinality hash tags: {status} routes half your keyspace to one node. Use {sessionId} or {userId}.

Redis Cluster with 3 primaries and 3 replicas. App servers write to primaries, read from replicas round-robin. A hot slot on Primary 2 is mitigated by replica fan-out and app-level L1 cache. Throughput numbers annotated.

Reads go to replicas. Writes go to primaries. The L1 cache on each app server kills 80-90% of Redis reads before they even leave the box.
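The L1 cache is small enough to sketch in full. This is a minimal version under stated assumptions: SessionL1Cache, fetch, and clock are hypothetical names, fetch stands in for the Redis lookup, and a production version would also bound the number of entries.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class SessionL1Cache:
    """Per-process cache that remembers validated sessions for a few seconds."""

    def __init__(
        self,
        fetch: Callable[[str], Optional[dict]],    # falls through to the session store
        ttl_seconds: float = 5.0,                  # the 5-second window from the text
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, session_id: str) -> Optional[dict]:
        now = self._clock()
        hit = self._entries.get(session_id)
        if hit is not None and now < hit[0]:
            return hit[1]                          # served locally, no network hop
        session = self._fetch(session_id)
        if session is None:
            self._entries.pop(session_id, None)    # never cache a missing session
        else:
            self._entries[session_id] = (now + self._ttl, session)
        return session

    def invalidate(self, session_id: str) -> None:
        """Drop the local copy, e.g. when this server handles the logout."""
        self._entries.pop(session_id, None)
```

Note that invalidate only clears the copy on the server that handled the logout. Other app servers keep serving their cached copy until the TTL lapses, which is why the window is kept to seconds.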

Background on sharding strategies: distributed cache system design and consistent hashing in system design interviews.

Sentinel and Cluster Do Different Things

This question comes up every time. They sound similar. They do not overlap.

Redis Sentinel provides automatic failover for a single primary plus replicas. Three Sentinel processes form a quorum. When the primary goes down, the Sentinels agree it is unreachable and promote a replica; with a tuned down-after-milliseconds setting this takes roughly 5 seconds. No sharding. All writes still go to one node. Sentinel is the right choice when your write throughput fits on one machine and you only need HA.

Redis Cluster provides both horizontal write scaling and HA. Each shard has its own primary and replicas. Failover is per-shard. Minimum six nodes (three primaries, three replicas). Use Cluster when you have hit the single-primary ceiling.

The confusion is understandable. Both handle failover. But Sentinel does not shard, and Cluster does not need a separate quorum process. If you say "I'll use Sentinel for the sharding," the interviewer will mark that down. The words sound close. The concepts are not.

Both use asynchronous replication. If the primary receives a write and crashes before syncing to any replica, that write is lost. For sessions this is usually acceptable: the user gets logged out and logs back in. For sessions carrying financial state or audit requirements, enable hybrid persistence (appendonly yes, aof-use-rdb-preamble yes, appendfsync everysec): compact RDB snapshots plus an append-only log, with roughly one second of data loss at most.
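As a config fragment, the hybrid persistence setup described above would look roughly like this in redis.conf. The three directives are standard Redis settings; everything else in your config is assumed unchanged.

```conf
# Turn on the append-only file; the two settings below only matter when this is on.
appendonly yes

# fsync the AOF once per second: bounds loss to about one second of writes.
appendfsync everysec

# On AOF rewrite, start the file with a compact RDB snapshot, then append commands.
aof-use-rdb-preamble yes
```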

Security Has Four Non-Negotiables

Regenerate the session ID on every privilege change. Session fixation attacks work because an attacker can plant a pre-authentication session ID and ride it through login: if the ID is the same before and after login, the attacker holds an authenticated session. Rails: reset_session. Django: request.session.cycle_key(). Java: session.invalidate() followed by request.getSession(true). The fix is one line of code, yet its absence is one of the most common web vulnerabilities in enterprise apps.
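Framework helpers aside, the rule itself is short. A minimal sketch, with a plain dictionary standing in for the session store and hypothetical function names (start_anonymous_session, login):

```python
import secrets
from typing import Dict, Optional

sessions: Dict[str, dict] = {}  # in-memory stand-in for the session store


def new_session_id() -> str:
    return secrets.token_urlsafe(20)  # 160-bit CSPRNG token, 27 characters


def start_anonymous_session() -> str:
    """Pre-login session, e.g. to hold a CSRF token for the login form."""
    sid = new_session_id()
    sessions[sid] = {"userId": None}
    return sid


def login(presented_sid: Optional[str], user_id: str) -> str:
    """Authenticate and return a brand-new session ID.

    Whatever ID the client presented is discarded, because an attacker
    may have planted it before the victim logged in.
    """
    if presented_sid is not None:
        sessions.pop(presented_sid, None)
    fresh_sid = new_session_id()
    sessions[fresh_sid] = {"userId": user_id}
    return fresh_sid


planted = start_anonymous_session()   # the ID an attacker could have fixed
authenticated = login(planted, "alice")
```

After login, the planted ID no longer exists in the store, so an attacker holding it gets nothing. The same function body applies to any other privilege change, such as a step-up to an admin role.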

Cookie attributes, all four required:

Secure: HTTPS only. Without this, the session ID travels in plaintext.
HttpOnly: JavaScript cannot read it. Blocks XSS-based session theft.
SameSite=Strict: Blocks cross-site request forgery.
__Host- prefix: Browsers accept the cookie only if it is Secure, has Path=/, and carries no Domain attribute, which binds it to the exact host that set it.

OWASP requires 128-bit minimum entropy from a CSPRNG for the session ID. The cookie name should be generic (id, sid) rather than PHPSESSID or JSESSIONID, which advertise the framework to attackers. This is free information for anyone running a scanner. Don't give it to them.
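Putting the cookie rules together, here is a minimal sketch that builds the Set-Cookie value with Python's standard library (the samesite attribute needs Python 3.8 or later). The cookie name __Host-sid and the helper build_session_cookie are illustrative choices, not a required convention.

```python
from http.cookies import SimpleCookie


def build_session_cookie(session_id: str) -> str:
    """Return the value for a Set-Cookie header carrying the session ID."""
    name = "__Host-sid"                    # generic name plus the __Host- prefix
    cookie = SimpleCookie()
    cookie[name] = session_id
    cookie[name]["secure"] = True          # HTTPS only
    cookie[name]["httponly"] = True        # invisible to JavaScript
    cookie[name]["samesite"] = "Strict"    # not sent on cross-site requests
    cookie[name]["path"] = "/"             # required by the __Host- prefix
    # No Domain attribute is set: the __Host- prefix forbids it.
    return cookie[name].OutputString()


header_value = build_session_cookie("abc123")
```

Most web frameworks expose the same four settings as arguments to their own set-cookie helper, so in practice this is a configuration check rather than hand-written code.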

SameSite=Strict covers most modern browsers for CSRF, but the synchronizer token pattern (store a CSRF token in the session, embed it in every form, verify on every state-changing request) is required for older clients and is defense in depth. Belt and suspenders.

How to Fill 45 Minutes

Phase	Time
Requirements and capacity estimate	5 min
High-level design, eliminate sticky sessions and JWT	10 min
Data model plus API (including TTL split and secondary index)	8 min
Scaling (throughput numbers, hot partitions, L1 cache)	10 min
HA and persistence (Sentinel vs Cluster)	7 min
Security requirements	5 min

Most candidates run out of time after drawing "centralized Redis" and never reach hot partitions or security. Interviewers score on depth. The hash tag trick for secondary indexes, the distinction between Redis TTL for idle timeout versus a stored field for absolute timeout, and knowing Sentinel does not shard are the details that separate a passing mark from a strong hire.

Explaining a design under time pressure is a different skill from knowing it. SpaceComplexity runs voice-based mock system design interviews with rubric scoring, so you can practice the actual skill the interviewer is testing, not just stare at diagrams until you feel confident.

Recap

Sticky sessions fail on node loss; JWTs fail on revocation; centralized Redis is the right answer
Session IDs need 128 to 160 bits of CSPRNG entropy, prefer token_urlsafe(20) over UUID v4
Idle timeout lives in Redis TTL; absolute timeout lives as a hash field checked in application code
Secondary indexes for user-to-sessions need hash tags to land on the same Redis Cluster slot
Single Redis instance: 72K ops/sec without pipelining, CPU ceiling at 200K ops/sec single-threaded
Hot partitions: add read replicas and an L1 app-level cache for active session validation
Sentinel is HA without sharding; Cluster is HA plus horizontal write scaling
Session fixation: always regenerate the session ID on login
Cookie flags: Secure + HttpOnly + SameSite=Strict + __Host- prefix, all four required