OTP System Design Interview: The 45-Minute Walkthrough

June 11, 2026 · 11 min read
Tags: interview-prep, career, system-design, algorithms
TL;DR
  • OTP system design interviews test generation, secure storage, delivery, and verification in one 45-minute session — most candidates stop at storage and miss the security layer
  • Hash codes with Argon2id, never plaintext; TTL-based Redis expiry handles cleanup automatically without a separate job
  • Atomic verify-and-consume via Redis Lua script prevents two simultaneous requests from both succeeding on the same code
  • Constant-time comparison (hmac.compare_digest, crypto.timingSafeEqual) blocks timing attacks that can reconstruct a 6-digit code digit by digit
  • Rate limit on three dimensions: per-phone sends (5/hr), per-OTP verification attempts (3), and per-IP sends (100/hr)
  • Your SMS provider is the real bottleneck, not your service; queue sends asynchronously, implement multi-provider failover, and circuit-break on p99 latency
  • Opaque session tokens beat JWTs when instant revocation ("log out everywhere") is a real product requirement

In most OTP system design interviews, the naive answer is "generate a 6-digit code, store it in Redis with a 5-minute TTL, and check it on verification." That answer is correct. It also gets you about 20 minutes in before the follow-ups arrive like SMS messages: out of order, delayed, and somehow three at once. What if the same OTP is submitted twice simultaneously? What if an attacker can measure response times? What happens when your SMS provider goes down mid-signup-wave? This walkthrough builds the full one-time password system, including the security properties most candidates forget to name.


Start With Clarifying Questions

Before any diagram, ask these out loud. They change the design substantially, and they buy you time.

Delivery channels. SMS only, or also email and TOTP (authenticator app)? A multi-channel system needs an abstracted delivery layer. Single-channel doesn't.

Use case. Login flow, transaction confirmation, or password reset? Each has different TTL requirements (60 seconds to 30 minutes) and different retry tolerances.

Scale. 10 million DAU with a 20% daily login rate is 2 million OTPs per day, or roughly 23 sends per second averaged over the day, with peaks running higher. That's a different problem than 1,000 per second.

Compliance. HIPAA requires 6 years of audit log retention. PCI-DSS requires 1 year. Know this before you design the schema. Find out during the interview and you'll spend the rest of the session quietly reconsidering your career.

For this walkthrough: SMS and email delivery, login plus transaction confirmation use cases, 100 OTP sends per second at peak, standard compliance logging.


Five Components, One Critical Path

Draw them left to right as the request flows through.

Client
  └─> API Gateway
        └─> OTP Service ──────────> Delivery Service ──> SMS Provider
              │                                       └─> Email Provider
              ├─> Redis (hot path: OTP records + rate counters)
              └─> PostgreSQL (audit log)

The OTP Service owns the critical path. It generates the code, stores a hash (never plaintext) in Redis, hands off to the Delivery Service asynchronously, and later verifies incoming codes with atomic check-and-consume semantics.

The Delivery Service is intentionally separate. It owns retry logic, provider failover, and delivery status tracking. When Twilio times out at 8 seconds, the OTP Service doesn't stall. The message is queued and retried.

[Figure: architecture diagram of the five-component OTP system (Client, API Gateway, OTP Service, Delivery Service, and storage tiers) with the async queue boundary.]
OTP request path: the async queue between OTP Service and Delivery Service is the key architectural boundary.


The Algorithm Behind the Six Digits

HOTP (RFC 4226) is built on HMAC-SHA-1, and TOTP (RFC 6238) builds on HOTP, using HMAC-SHA-1 by default:

HOTP(K, C) = Truncate(HMAC-SHA-1(K, C))

K is a shared secret (minimum 128 bits). C is a counter. The truncation function takes the 20-byte HMAC output, extracts a 4-byte window using a dynamic offset from the final byte, masks the MSB, and takes the result modulo 10^6. That produces a zero-padded 6-digit code.
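The truncation steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a hardened implementation; the function name hotp is an arbitrary choice, and the secret is the published test secret from RFC 4226, Appendix D.

```python
import hashlib
import hmac
import struct


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """Compute an HOTP value: HMAC-SHA-1 followed by dynamic truncation."""
    # The counter is encoded as an 8-byte big-endian integer.
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic offset: the low 4 bits of the final byte select a 4-byte window.
    offset = mac[-1] & 0x0F
    # Mask the most significant bit so the value is a positive 31-bit integer.
    value = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    # Reduce modulo 10^digits and zero-pad.
    return str(value % 10 ** digits).zfill(digits)


# Test secret and expected values from RFC 4226, Appendix D.
print(hotp(b"12345678901234567890", 0))  # 755224
print(hotp(b"12345678901234567890", 1))  # 287082
```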

TOTP extends this by replacing the counter with a time step:

T = floor((Unix time - T0) / 30)
TOTP(K) = HOTP(K, T)

Every 30 seconds, T increments and a new code is valid. The server accepts T-1, T, and T+1 to tolerate up to 30 seconds of clock drift in either direction.
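A sketch of the three-window check, assuming T0 = 0 and a 30-second step. The helper names (hotp, totp_verify) and the now parameter are illustrative choices made so the example is deterministic; the secret is the SHA-1 test secret from RFC 6238.

```python
import hashlib
import hmac
import struct
import time
from typing import Optional


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HOTP per RFC 4226: HMAC-SHA-1 followed by dynamic truncation."""
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F
    value = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % 10 ** digits).zfill(digits)


def totp_verify(secret: bytes, code: str, now: Optional[float] = None,
                step: int = 30, window: int = 1) -> bool:
    """Accept the code if it matches any time step from T-window to T+window."""
    current = int((time.time() if now is None else now) // step)
    matched = False
    for counter in range(current - window, current + window + 1):
        if counter < 0:
            continue  # no time steps before the epoch
        # Constant-time comparison; a match does not cut the loop short.
        if hmac.compare_digest(hotp(secret, counter).encode(), code.encode()):
            matched = True
    return matched


secret = b"12345678901234567890"                # RFC 6238 test secret (SHA-1)
print(totp_verify(secret, "287082", now=59))    # True: 59 s falls in step T = 1
print(totp_verify(secret, "287082", now=89))    # True: accepted as T-1
print(totp_verify(secret, "287082", now=119))   # False: the code is two steps old
```

A production verifier would also remember the last accepted time step per user so the same code cannot be replayed inside its window; that bookkeeping is left out here.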

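For SMS OTP you skip HMAC entirely. Draw a uniform random integer from 0 to 999,999 with a cryptographically secure generator (secrets.randbelow(10**6) in Python, not random.randint, whose Mersenne Twister output is predictable), zero-pad to 6 digits, hash and store. Algorithmically uninteresting, but it works.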
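A minimal sketch of the generation step in Python. It uses the standard library's secrets module, which draws from the operating system's CSPRNG; the function name generate_sms_code is an illustrative choice.

```python
import secrets


def generate_sms_code(digits: int = 6) -> str:
    """Return a uniformly random, zero-padded numeric code."""
    # secrets.randbelow(n) returns an int in [0, n) from a CSPRNG.
    return str(secrets.randbelow(10 ** digits)).zfill(digits)


code = generate_sms_code()
print(len(code), code.isdigit())  # 6 True
```

Zero-padding matters: without it, roughly one code in ten would be shorter than six characters.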

6 digits gives 1,000,000 possible codes. With a 3-attempt limit per OTP, an attacker's success probability is 3/1,000,000 = 0.0003%. Combine that with a 5-sends-per-hour limit and they get at most 15 guesses per hour. The math makes brute force irrelevant.
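The arithmetic can be checked directly. The sketch below uses the limits quoted in this walkthrough and assumes each newly sent OTP is an independent, uniformly random code; the variable names are illustrative.

```python
CODE_SPACE = 10 ** 6        # possible 6-digit codes
ATTEMPTS_PER_OTP = 3        # verification attempts before the code is poisoned
SENDS_PER_HOUR = 5          # per-phone send limit

# Three distinct guesses against one code.
p_single = ATTEMPTS_PER_OTP / CODE_SPACE
# Upper bound on guesses in an hour if the attacker burns every send.
guesses_per_hour = ATTEMPTS_PER_OTP * SENDS_PER_HOUR
# Chance that at least one of the five independent OTPs is guessed.
p_hour = 1 - (1 - p_single) ** SENDS_PER_HOUR

print(f"{p_single:.4%}")    # 0.0003%
print(guesses_per_hour)     # 15
print(f"{p_hour:.4%}")      # 0.0015%
```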

[Figure: TOTP 30-second time windows showing the T-1, T, and T+1 acceptance range for clock drift tolerance.]
The server accepts three consecutive windows to handle up to 30 seconds of clock drift in either direction.


Redis for Speed, PostgreSQL for Evidence

Two storage tiers: Redis for the hot verification path, PostgreSQL for the audit trail.

Redis OTP record:

Key:   otp:user:{user_id}:{channel_hash}
Value: {
  code_hash:   argon2id(code),
  expires_at:  unix_timestamp,
  attempts:    0,
  is_consumed: false
}
TTL:   300 seconds

Never store the plaintext code. If Redis is compromised, hashed codes expire in minutes anyway. Use Argon2id rather than bcrypt here because its memory-hardness makes every guess expensive on a GPU, which matters when the input space is only 1,000,000 six-digit strings. (Yes, someone has built a GPU cracker for 6-digit codes. People are creative in the worst ways.)
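A sketch of building and checking the record, under two loud assumptions. First, Python's standard library has no Argon2id, so hashlib.scrypt (also memory-hard) stands in for it here; a real deployment would swap in an Argon2id library. Second, the salt field is an addition for this example and is not part of the record shown above. All function names are illustrative.

```python
import hashlib
import hmac
import secrets
import time
from typing import Optional

OTP_TTL_SECONDS = 300


def hash_code(code: str, salt: bytes) -> bytes:
    # ASSUMPTION: scrypt is a stdlib stand-in for Argon2id; parameters are illustrative.
    return hashlib.scrypt(code.encode(), salt=salt, n=2 ** 14, r=8, p=1,
                          maxmem=64 * 1024 * 1024, dklen=32)


def build_otp_record(code: str, now: Optional[float] = None) -> dict:
    """Build the value to store under otp:user:{user_id}:{channel_hash}."""
    now = time.time() if now is None else now
    salt = secrets.token_bytes(16)  # hypothetical extra field, one salt per code
    return {
        "salt": salt.hex(),
        "code_hash": hash_code(code, salt).hex(),
        "expires_at": int(now) + OTP_TTL_SECONDS,
        "attempts": 0,
        "is_consumed": False,
    }


def code_matches(record: dict, candidate: str) -> bool:
    """Re-hash the candidate with the stored salt and compare in constant time."""
    expected = bytes.fromhex(record["code_hash"])
    actual = hash_code(candidate, bytes.fromhex(record["salt"]))
    return hmac.compare_digest(expected, actual)


record = build_otp_record("481623", now=1_000)
print(record["expires_at"])              # 1300
print(code_matches(record, "481623"))    # True
print(code_matches(record, "481624"))    # False
```

The slow hash sits on the verification hot path, so the cost parameters are a tradeoff between cracking resistance and verify latency.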

Rate limiting counters (Redis):

send_rate:phone:{phone_hash}   TTL: 3600s   Limit: 5 sends/hour
verify_rate:otp:{otp_id}        TTL: 300s    Limit: 3 attempts
send_rate:ip:{ip_address}       TTL: 3600s   Limit: 100 sends/hour

PostgreSQL audit log:

CREATE TABLE otp_events (
  id           UUID PRIMARY KEY,
  user_id      UUID,
  channel_hash VARCHAR,      -- SHA-256 of phone or email, never plaintext
  event_type   VARCHAR,      -- 'send', 'verify_success', 'verify_fail', 'expired'
  ip_address   INET,
  country      CHAR(2),
  created_at   TIMESTAMPTZ
);

Write audit events asynchronously so they never block the verification hot path. The PostgreSQL write is fire-and-forget from the request's point of view; a compensating job backfills any missed writes from the events buffered in Redis.


Two Endpoints, One Idempotency Problem

Two endpoints carry the entire flow.

POST /otp/send

Request:  { "phone": "+14155552671", "idempotency_key": "550e8400-e29b..." }
Response: { "expires_in": 300, "attempts_remaining": 3 }

The idempotency key matters. Mobile clients retry on network timeout. Without it, a user gets two SMS messages for one action. With it, the second request returns the cached response and no second SMS is sent. Store the key with a 24-hour TTL.

Return 202 immediately and queue the SMS delivery to an async job. This decouples your API latency from the carrier's delivery time.
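The idempotency behaviour can be sketched with an in-memory dictionary standing in for Redis and a list standing in for the job queue. The class and attribute names are hypothetical; in Redis, one common way to claim a key atomically is SET with the NX and EX options.

```python
import time
from typing import Dict, List, Optional, Tuple

IDEMPOTENCY_TTL_SECONDS = 24 * 3600


class OtpSendHandler:
    """Toy handler for POST /otp/send with idempotency-key deduplication."""

    def __init__(self) -> None:
        self._seen: Dict[str, Tuple[float, dict]] = {}  # key -> (expiry, response)
        self.queue: List[str] = []                      # stand-in for the SMS job queue

    def send(self, phone: str, idempotency_key: str,
             now: Optional[float] = None) -> dict:
        now = time.time() if now is None else now
        cached = self._seen.get(idempotency_key)
        if cached is not None and cached[0] > now:
            return cached[1]          # retry: replay the response, send nothing
        self.queue.append(phone)      # enqueue delivery; do not wait for the carrier
        response = {"status": 202, "expires_in": 300, "attempts_remaining": 3}
        self._seen[idempotency_key] = (now + IDEMPOTENCY_TTL_SECONDS, response)
        return response


handler = OtpSendHandler()
first = handler.send("+14155552671", "key-1", now=0)
retry = handler.send("+14155552671", "key-1", now=5)   # client retried after a timeout
print(first == retry, len(handler.queue))               # True 1
```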

POST /otp/verify

Request:  { "phone": "+14155552671", "code": "481623" }
Response: { "session_token": "eyJ...", "expires_in": 86400 }

Error with remaining attempts:

{ "error": "invalid_otp", "attempts_remaining": 2 }

When attempts hit zero, return 429 with "requires_new_otp": true. The current OTP is poisoned after max attempts. Force a new send. This prevents indefinite brute force on a single generated code.


Three Security Properties You Have to Name

Most candidates cover storage and rate limiting. Strong-hire candidates proactively name these three.

Constant-time comparison. String equality that exits early on the first mismatched byte leaks timing information. In principle, an attacker can submit 000000, 100000, 200000, measure response times, and reconstruct the correct code digit by digit. This sounds like something from a heist movie, but remote timing attacks have been demonstrated against real servers. Use constant-time comparison functions exclusively: hmac.compare_digest() in Python, crypto.timingSafeEqual() in Node.js, subtle.ConstantTimeCompare() in Go.
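In Python the change is a one-liner. The wrapper name codes_equal is illustrative; encoding to bytes first avoids the TypeError that hmac.compare_digest raises for non-ASCII str input.

```python
import hmac


def codes_equal(expected: str, submitted: str) -> bool:
    """Compare two codes (or digests) without an early exit on the first mismatch."""
    return hmac.compare_digest(expected.encode(), submitted.encode())


print(codes_equal("481623", "481623"))  # True
print(codes_equal("481623", "481624"))  # False
```

When the stored value is a hash, the same call compares the stored digest against the digest of the submitted code.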

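Atomic verify-and-consume. The operation that checks the code and marks it consumed must be a single transaction. Without atomicity, two simultaneous requests submitting the correct code both succeed. Use a Redis Lua script, or WATCH with MULTI/EXEC and a retry loop, to make the entire check-attempt-consume sequence atomic. A bare MULTI/EXEC block is not enough, because commands queued inside it cannot branch on values read in the same transaction.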

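-- Atomic Lua script: check, increment attempt, and consume in one transaction.
-- KEYS[1] = OTP hash key; ARGV[1] = digest of the submitted code; ARGV[2] = max attempts.
-- Assumes a deterministic digest; a salted Argon2id hash must be verified by the caller instead.
local rec = redis.call('HMGET', KEYS[1], 'code_hash', 'attempts', 'is_consumed')
if not rec[1] then return 'expired' end
if rec[3] == 'true' then return 'consumed' end
if tonumber(rec[2] or '0') >= tonumber(ARGV[2]) then return 'locked' end
if rec[1] == ARGV[1] then
  redis.call('HSET', KEYS[1], 'is_consumed', 'true')
  return 'valid'
end
redis.call('HINCRBY', KEYS[1], 'attempts', 1)
return 'invalid'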

Replay prevention. The is_consumed flag (or a verified_at timestamp in PostgreSQL) prevents a verified code from being reused. An attacker who intercepts the SMS gets one shot. The code is marked consumed the moment verification succeeds, before the session token is returned.

[Figure: sequence diagram of two concurrent OTP verify requests hitting the Redis Lua script atomically, where only the first request succeeds.]
Without the Lua script, both concurrent requests would return valid. The atomic execution guarantees only the first wins.
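The race can be reproduced in-process. In this sketch a threading.Lock plays the role of Redis executing one script at a time, and a plain SHA-256 digest stands in for the stored hash; the class name, the 'consumed' and 'locked' return values, and the key are assumptions made for the example.

```python
import hashlib
import hmac
import threading

MAX_ATTEMPTS = 3


class OtpStore:
    """In-memory stand-in for Redis; the lock models atomic script execution."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._records = {}

    def put(self, key: str, code_digest: bytes) -> None:
        with self._lock:
            self._records[key] = {"code_hash": code_digest,
                                  "attempts": 0, "is_consumed": False}

    def verify_and_consume(self, key: str, candidate_digest: bytes) -> str:
        # Check, count the attempt, and consume inside one critical section.
        with self._lock:
            rec = self._records.get(key)
            if rec is None:
                return "expired"
            if rec["is_consumed"]:
                return "consumed"
            if rec["attempts"] >= MAX_ATTEMPTS:
                return "locked"
            if hmac.compare_digest(rec["code_hash"], candidate_digest):
                rec["is_consumed"] = True
                return "valid"
            rec["attempts"] += 1
            return "invalid"


store = OtpStore()
digest = hashlib.sha256(b"481623").digest()   # stand-in for the real stored hash
store.put("otp:user:42", digest)

results = []
threads = [threading.Thread(
    target=lambda: results.append(store.verify_and_consume("otp:user:42", digest)))
    for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # ['consumed', 'valid']
```

Removing the lock and splitting the check from the write reintroduces the window in which both requests read is_consumed as false.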


Your SMS Provider Is the Bottleneck

The real bottleneck is your SMS provider, not your service.

Twilio, AWS SNS, and Vonage advertise availability in the 99.9-99.95% range; none of them promise delivery latency. Domestic SMS typically takes 2-10 seconds. International routes can exceed 30 seconds. You've watched someone stare at their phone waiting for a code. That's your service working correctly.

Three mitigations:

  • Queue sends to a job worker (SQS, Kafka, or BullMQ). Decouple your API response time from carrier delivery time.
  • Multi-provider failover. Keep two SMS providers live. If provider A's error rate exceeds 5%, shift traffic to provider B automatically.
  • Circuit breaker on the Delivery Service. If provider p99 latency exceeds 15 seconds, open the circuit and surface a clear error to the user rather than leaving them waiting.
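The failover bullet can be sketched as follows. This covers only error-rate routing with the 5% threshold; the latency-based circuit breaker is not modelled. The class name, the rolling window of 100 results, and the 20-sample minimum are assumptions for illustration, and the provider callables are hypothetical.

```python
from collections import deque
from typing import Callable, List, Tuple


class ProviderHealth:
    """Rolling error rate over the most recent send results."""

    def __init__(self, window: int = 100, max_error_rate: float = 0.05,
                 min_samples: int = 20) -> None:
        self.results = deque(maxlen=window)
        self.max_error_rate = max_error_rate
        self.min_samples = min_samples

    def record(self, ok: bool) -> None:
        self.results.append(ok)

    def healthy(self) -> bool:
        if len(self.results) < self.min_samples:
            return True  # not enough data to judge yet
        return self.results.count(False) / len(self.results) <= self.max_error_rate


Provider = Tuple[str, Callable[[str, str], None], ProviderHealth]


def send_sms(providers: List[Provider], phone: str, body: str) -> str:
    """Try healthy providers first; return the name of the one that accepted."""
    # sorted() is stable, so configured order is kept within each health group.
    ordered = sorted(providers, key=lambda p: not p[2].healthy())
    for name, send_fn, health in ordered:
        try:
            send_fn(phone, body)
        except Exception:
            health.record(False)
            continue  # fall through to the next provider
        health.record(True)
        return name
    raise RuntimeError("all SMS providers failed")


def provider_a(phone: str, body: str) -> None:   # hypothetical failing provider
    raise TimeoutError("provider A timed out")


def provider_b(phone: str, body: str) -> None:   # hypothetical working provider
    return None


providers = [("A", provider_a, ProviderHealth()), ("B", provider_b, ProviderHealth())]
print(send_sms(providers, "+14155552671", "Your code is 481623"))  # B
```

One caveat of this sketch: once a provider is demoted it only receives traffic when the healthy ones fail, so a real system would add a probe or half-open state to let it recover.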

For a deeper treatment of the rate limiting layer here, see the Rate Limiter System Design walkthrough.

Redis handles the hot path comfortably. A single instance processes roughly 100K operations per second. At 100 OTP sends per second, you're at 0.1% of capacity. Redis Cluster handles 100x growth without architecture changes.

OTP bombing (flooding a victim's phone with send requests) needs two defenses. Per-phone rate limits (5 sends per hour) make it slow. Progressive friction (CAPTCHA after the second send within 10 minutes) makes it expensive. Both layers matter because each send request costs the attacker almost nothing, while you pay for every SMS. The victim's patience is also near zero.

Sliding window rate limiting beats fixed window here. A fixed window counter can be gamed by submitting 5 requests at 11:59pm and 5 more at 12:00am, landing 10 sends in two minutes. Sliding window tracks a rolling hour and prevents this. See Distributed Cache System Design for the Redis pattern behind sliding counters.

[Figure: fixed window vs sliding window rate limiting, showing the boundary exploit in fixed window that sliding window prevents.]
Fixed window lets an attacker cram 10 sends into 2 minutes by straddling the boundary. Sliding window closes the gap.
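A sliding-window limiter is small enough to sketch in memory. The class name is illustrative and timestamps are passed in explicitly to keep the example deterministic; a Redis version is often built on a sorted set per key (ZREMRANGEBYSCORE, ZCARD, ZADD), which is an assumption about the pattern rather than something specified here.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` events per key in any rolling `window_seconds`."""

    def __init__(self, limit: int, window_seconds: int) -> None:
        self.limit = limit
        self.window = window_seconds
        self._hits = defaultdict(deque)   # key -> timestamps of accepted events

    def allow(self, key: str, now: float) -> bool:
        hits = self._hits[key]
        # Drop timestamps that have aged out of the rolling window.
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(limit=5, window_seconds=3600)
# Five sends in the last minute of one clock hour...
before = [limiter.allow("phone:abc", now=3540 + i) for i in range(5)]
# ...and five more right after the hour boundary.
after = [limiter.allow("phone:abc", now=3600 + i) for i in range(5)]
print(before)  # [True, True, True, True, True]
print(after)   # [False, False, False, False, False]
```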


SMS vs TOTP, TTLs, and Token Types

SMS vs TOTP. SMS is low friction and high risk. Anyone who controls the SIM receives the code. SIM swap attacks let an attacker port a victim's number to their own device before the code arrives. TOTP is immune to SIM swaps and SMS interception, costs nothing per verification, and needs no delivery provider. It is not phishing-resistant, though: a fake login page can relay a live code in real time. The other catch is friction: users need an authenticator app. Ship SMS first for mass-market products and offer TOTP as a security upgrade option.

TTL length. Five minutes (300 seconds) is the right default for login flows. It gives SMS time to arrive on congested carrier routes without leaving a long brute-force window. Transaction confirmations warrant 10 minutes. Password reset links should expire in 15-30 minutes, long enough for a user to find their email but short enough to keep sleeping attackers out.

JWT vs opaque session tokens. JWTs are stateless: the server verifies them without a database lookup. Opaque tokens require a Redis or database lookup on every request but can be revoked instantly. Prefer opaque tokens when logout and compromised-device scenarios are real requirements. JWT's inability to be invalidated before expiry is an acceptable tradeoff for stateless microservices, but a liability for user-facing auth where "log out everywhere" is an expected feature.
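A sketch of why opaque tokens make "log out everywhere" trivial. A dictionary stands in for Redis, and the class and method names are hypothetical; the point is that the token carries no claims, so deleting the server-side entry revokes it immediately.

```python
import secrets
import time
from typing import Dict, Optional, Tuple


class SessionStore:
    """Opaque session tokens backed by a server-side lookup table."""

    def __init__(self, ttl_seconds: int = 86400) -> None:
        self.ttl = ttl_seconds
        self._sessions: Dict[str, Tuple[str, float]] = {}  # token -> (user_id, expiry)

    def issue(self, user_id: str, now: Optional[float] = None) -> str:
        now = time.time() if now is None else now
        token = secrets.token_urlsafe(32)   # random and meaningless to the client
        self._sessions[token] = (user_id, now + self.ttl)
        return token

    def lookup(self, token: str, now: Optional[float] = None) -> Optional[str]:
        now = time.time() if now is None else now
        entry = self._sessions.get(token)
        if entry is None or entry[1] <= now:
            return None
        return entry[0]

    def revoke_all(self, user_id: str) -> int:
        """Log out everywhere: delete every session that belongs to the user."""
        doomed = [t for t, (uid, _) in self._sessions.items() if uid == user_id]
        for token in doomed:
            del self._sessions[token]
        return len(doomed)


store = SessionStore()
phone_token = store.issue("user-1", now=0)
laptop_token = store.issue("user-1", now=0)
print(store.revoke_all("user-1"))          # 2
print(store.lookup(phone_token, now=1))    # None
```

The cost is the lookup on every request, which is the tradeoff described above.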


The 45-Minute Clock

Phase                Time     What to Cover
Clarification        5 min    Channels, use cases, scale, compliance
Architecture         10 min   5-component diagram, async delivery queue
Data model and API   10 min   Redis schema, 2 endpoints, idempotency
Security             8 min    Hashing, constant-time, atomic consume, replay
Scaling              7 min    SMS provider bottleneck, Redis, rate limiting
Tradeoffs            5 min    SMS vs TOTP, TTL choices, token type

Start the security section proactively. Most candidates wait to be asked. Naming timing attacks and atomic consume before the interviewer raises them is one of the clearest strong-hire signals in system design.


What to Take Into the OTP System Design Interview

  • Generate and deliver are two separate concerns. Design for delivery failure from the start.
  • Store Argon2id hash, never plaintext. TTL-based Redis expiry handles cleanup automatically.
  • Atomic verify-and-consume prevents race conditions. Constant-time comparison prevents timing attacks.
  • Rate limit on three dimensions: per-phone sends, per-OTP verification attempts, per-IP sends.
  • The SMS provider is your real scaling bottleneck. Queue sends, implement multi-provider failover, circuit-break on latency.
  • For revocable sessions, opaque tokens beat JWTs every time.

The gap between a correct description of this system and a convincing spoken explanation of it is wider than most engineers expect. SpaceComplexity runs live voice-based system design mocks where you practice narrating these tradeoffs under interview conditions and get rubric-based feedback on where the explanation broke down.


Further Reading