Idempotency in System Design Interviews: What It Is and When to Use It
- Idempotency means applying the same operation once or a thousand times produces the same result — the property that makes retries safe in distributed systems.
- Idempotency keys are client-generated UUIDs sent as request headers; the server stores them and returns a cached response on any retry, preventing duplicate processing.
- Atomicity is non-negotiable: the idempotency key and the business operation must commit in the same transaction — separate commits lead to orphaned keys or duplicate operations.
- Redis (SET with NX flag and TTL) works for notifications and background jobs; a relational database with a unique constraint is required for payments and financial operations.
- TTL must cover your longest retry window plus a safety margin — Stripe's 24-hour default works for web clients; weekly retry schedules need 7+ day TTLs.
- Bring it up unprompted whenever the design involves payments, order creation, Kafka/SQS consumers, or any scenario where the interviewer asks "what happens if the request fails?"
You click "Pay." The request times out. You have no idea if the charge went through. Do you retry?
If the server processed the charge and the response just got lost on the way back, a retry means charging the customer twice. If the server never received the request, not retrying means a failed payment. You can't tell which case you're in, and neither can your code.
This is the problem idempotency solves. In a system design interview, it's one of the signals that separates candidates who understand distributed systems from those treating retry logic as a complete answer. If you can make the payment operation idempotent, you can retry freely, knowing the customer gets charged exactly once no matter how many times the request reaches the server. Every payment system, order system, and notification system has to grapple with this. Bringing it up unprompted signals that you think at the level of distributed systems, not just CRUD.
What "Idempotent" Actually Means (Yes, There's Math)
Mathematically: f(f(x)) = f(x). Apply the function once, apply it a thousand times, same result. You may not have expected a formal definition in the second section of a blog post, but here we are.
HTTP has its own version of this. GET, PUT, and DELETE are idempotent by specification (RFC 9110). POST is not. PUT replaces a resource with the same payload, so calling it 10 times leaves you in the same state as calling it once. POST creates a new resource each time, so 10 calls mean 10 resources. And 10 charges on someone's credit card.
There's also a related concept worth distinguishing: safety. A safe method doesn't modify state (GET, HEAD), while idempotency only guarantees repeated calls produce the same state, not that state goes unchanged. DELETE is idempotent but not safe. The resource is gone after the first call; after the 10th call it's still gone. Same final state, modified state.
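To make the PUT/POST/DELETE distinction concrete, here is a minimal sketch against an in-memory dictionary. The `store`, `put`, `post`, and `delete` names are hypothetical stand-ins for a real API backend, not any framework's API.

```python
import itertools

# Hypothetical in-memory "resource store" standing in for a real backend.
store: dict[str, dict] = {}
_next_id = itertools.count(1)


def put(resource_id: str, payload: dict) -> None:
    """PUT: replace the resource at a known ID. Repeating it changes nothing further."""
    store[resource_id] = dict(payload)


def post(payload: dict) -> str:
    """POST: create a new resource under a server-assigned ID on every call."""
    resource_id = f"charge-{next(_next_id)}"
    store[resource_id] = dict(payload)
    return resource_id


def delete(resource_id: str) -> None:
    """DELETE: idempotent but not safe. The first call removes, later calls are no-ops."""
    store.pop(resource_id, None)
```

Calling `put("order-1", ...)` ten times leaves one resource behind. Calling `post(...)` ten times leaves ten.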
Why does this matter for interviews? Because your API is almost certainly built on POST. Creating a payment, creating an order, creating a notification: all POST requests. And POST gives you zero idempotency guarantees out of the box. None. You're on your own.
Networks Fail. A Lot. More Than You Think.
In a local system you know if a function ran. In a distributed system you often don't.
A client sends a request. Then any of the following happens: the server receives it and crashes mid-process. The server processes it and the response packet is dropped. The client times out before the response arrives. The client's connection drops. The load balancer retries before the original request finishes.
The client sees the same thing in every one of these cases: a timeout or a connection error. In most of them the operation may have run partly or fully, and the error alone doesn't say which.
At-least-once delivery is the standard guarantee of reliable messaging systems. Kafka, SQS standard queues, and RabbitMQ can all be set up so that acknowledged messages aren't lost, but in that mode they make no promise about delivering a message only once. A broker restart, a consumer crash, a rebalance: any of these can cause a message to be redelivered. Your consumers will see duplicates. The question is what happens when they do.
Exactly-once delivery sounds like the right answer, but it can't be guaranteed over an unreliable network. The Two Generals Problem shows that no protocol can ensure agreement when messages can be lost. What you can achieve in practice is at-least-once delivery combined with idempotent processing, which produces exactly-once outcomes. Semantics-wise you get what you wanted. Mathematically you're just being clever about it.
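A rough sketch of "at-least-once delivery plus idempotent processing" looks like this. It assumes each message carries a stable `message_id` that survives redelivery. The class and field names are made up for illustration, and a real consumer would persist the processed IDs atomically with the side effect rather than keep them in memory.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    message_id: str  # assumed to stay the same across redeliveries
    amount: int


@dataclass
class IdempotentConsumer:
    processed_ids: set[str] = field(default_factory=set)
    balance: int = 0

    def handle(self, msg: Message) -> bool:
        """Apply the message once. Return False if it was a duplicate."""
        if msg.message_id in self.processed_ids:
            return False  # redelivery: already applied, do nothing
        self.balance += msg.amount
        self.processed_ids.add(msg.message_id)
        return True
```

The broker can deliver the same message three times. The balance only moves once.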
How Idempotency Keys Work
The standard solution is the idempotency key pattern. The client generates a unique identifier, includes it with the request, and the server uses it to detect and deduplicate retries.
The flow:
1. Client generates UUID v4
2. Client sends request with header: Idempotency-Key: <uuid>
3. Server checks idempotency store: does this key exist?
4. Not found → execute business logic, store key + result
5. Found → return cached result immediately, skip business logic

On a retry, step 3 finds the key and returns the stored result. The operation runs once. The client receives the same response either way. The customer's card does not get charged twice. Everyone sleeps.
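The five steps can be sketched in a few lines. This version keeps the store in a plain dictionary and ignores concurrency and durability, so treat it as an illustration of the control flow only. `handle_request` and `charge_card` are hypothetical names.

```python
import uuid
from typing import Callable

# In-memory stand-in for the idempotency store. A real service needs a durable one.
_responses: dict[str, dict] = {}


def handle_request(idempotency_key: str, operation: Callable[[], dict]) -> dict:
    """Run `operation` at most once per key and replay the stored result on retries."""
    cached = _responses.get(idempotency_key)  # step 3: does this key exist?
    if cached is not None:
        return cached                         # step 5: found, skip business logic
    result = operation()                      # step 4: not found, execute
    _responses[idempotency_key] = result      # step 4: store key + result
    return result


charges: list[int] = []


def charge_card() -> dict:
    charges.append(2000)
    return {"status": "succeeded", "charge_number": len(charges)}


key = str(uuid.uuid4())                       # step 1: client generates a UUID v4
first = handle_request(key, charge_card)      # step 2: request carries the key
retry = handle_request(key, charge_card)      # the retry reuses the same key
```

After the retry, `charges` still holds a single entry and `retry` is the same response as `first`.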
Stripe is the clearest real-world example. All POST requests to their API accept an Idempotency-Key header. Send the same key twice with the same parameters and Stripe returns the same response. Keys expire after 24 hours. This is how Stripe guarantees a customer gets charged once even if your server retries a failed charge three times during an outage.
curl https://api.stripe.com/v1/charges \
  -H "Idempotency-Key: AGJ6FJMkGQIpHUTX" \
  -d amount=2000 \
  -d currency=usd \
  -d source=tok_visa
One detail that trips people up: the idempotency key and the business operation must be committed in the same database transaction. If you insert the key first, then crash before committing the payment row, the key exists but the payment doesn't. If you insert the payment first, then crash before saving the key, a retry will process the payment again. You need atomicity. No cheating, no "we'll insert in order and hope."
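Here is one way the same-transaction rule might look, using Python's built-in `sqlite3` with an in-memory database as a stand-in for a production relational database. The two-table schema is deliberately simplified and the `create_payment` helper is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE idempotency_keys (key TEXT PRIMARY KEY, response TEXT NOT NULL);
    CREATE TABLE payments (id INTEGER PRIMARY KEY, amount INTEGER NOT NULL);
""")


def create_payment(conn: sqlite3.Connection, idempotency_key: str, amount: int) -> str:
    try:
        # `with conn` is one transaction: both rows commit together,
        # or both are rolled back if anything inside raises.
        with conn:
            cur = conn.execute("INSERT INTO payments (amount) VALUES (?)", (amount,))
            response = f"payment {cur.lastrowid} created"
            conn.execute(
                "INSERT INTO idempotency_keys (key, response) VALUES (?, ?)",
                (idempotency_key, response),
            )
        return response
    except sqlite3.IntegrityError:
        # Duplicate key: the payment insert above was rolled back with it.
        row = conn.execute(
            "SELECT response FROM idempotency_keys WHERE key = ?",
            (idempotency_key,),
        ).fetchone()
        return row[0]
```

Because the unique constraint fires inside the transaction, a duplicate key undoes the second payment row instead of leaving it behind.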
Where to Store the Keys
You have two main options, and the right one depends on what you're protecting.
Redis with TTL is fast and simple. One command: SET <key> <response> EX 86400 NX. The NX flag makes the check-and-set atomic: the key is written only if it doesn't already exist, so only one of several concurrent requests can claim it. Sub-millisecond lookup, automatic expiration. The downside is that Redis is not fully durable by default. With stock settings it relies on periodic snapshots and the append-only file is off, so a crash and restart can lose recent keys. For a payment, that's a very bad day.
For non-critical operations (notification deduplication, background jobs) Redis is usually fine.
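A sketch of the claim-then-send pattern for notifications follows. To keep it runnable without a Redis server, it uses a tiny in-memory class, `FakeRedis`, that mimics only the `SET ... NX EX` behavior described above. With the redis-py client the equivalent call would be `r.set(name, value, nx=True, ex=ttl_seconds)`. `should_send` and the `idem:` key prefix are assumptions for illustration.

```python
import time


class FakeRedis:
    """In-memory stand-in that mimics only SET with NX/EX. Not a real client."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def set(self, name, value, ex=None, nx=False):
        self._evict_if_expired(name)
        if nx and name in self._data:
            return None  # like redis-py: None when NX blocks the write
        expires_at = (self._clock() + ex) if ex is not None else None
        self._data[name] = (value, expires_at)
        return True

    def _evict_if_expired(self, name):
        entry = self._data.get(name)
        if entry and entry[1] is not None and entry[1] <= self._clock():
            del self._data[name]


def should_send(r, notification_id: str, ttl_seconds: int = 86400) -> bool:
    """Claim the key. Only the caller that wins the claim sends the notification."""
    return bool(r.set(f"idem:{notification_id}", "claimed", nx=True, ex=ttl_seconds))
```

The first caller gets `True` and sends. Every retry inside the TTL gets `False` and skips.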
A relational database with a unique constraint is appropriate for financial operations. The schema needs the key, the request parameters, the response code, the response body, and a timestamp.
CREATE TABLE idempotency_keys (
    user_id       BIGINT       NOT NULL,
    key           VARCHAR(255) NOT NULL,
    request_hash  VARCHAR(64)  NOT NULL,
    response_code INT,
    response_body TEXT,
    created_at    TIMESTAMPTZ  NOT NULL DEFAULT NOW(),
    PRIMARY KEY (user_id, key)
);
Storing request_hash lets you detect when a client sends the same key with different parameters, which usually indicates a bug on the client side. Or a very creative user.
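One plausible way to fill `request_hash` is a SHA-256 digest of a canonical JSON encoding of the parameters, which happens to be exactly 64 hex characters. The canonicalization choice and the "replay"/"conflict" labels are assumptions, not something the schema dictates.

```python
import hashlib
import json


def request_hash(params: dict) -> str:
    """Hash the parameters in a canonical form so key order doesn't matter."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_reuse(stored_hash: str, incoming_params: dict) -> str:
    """Same key, same params: replay. Same key, different params: conflict."""
    return "replay" if request_hash(incoming_params) == stored_hash else "conflict"
```

On a conflict, the server would typically reject the request rather than guess which set of parameters the client meant.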
The table approach adds 10 to 50ms of latency per request. For a payment a user is waiting on, this is invisible. For a high-throughput ingestion pipeline processing millions of events per second, it's not.
A practical pattern: use a database for the first write to establish durability, then cache the key and response in Redis for subsequent lookups.
TTL: Match the Retry Window (or Pay for It Later)
The key expiration time matters more than it looks.
If the TTL is shorter than your client's maximum retry window, a client retrying after the key expired will trigger the operation again. You get duplicates. The whole thing was for nothing.
If the TTL is much longer than the retry window, you're storing data you don't need. At high request volume, idle key storage adds up.
The right TTL is the maximum time any client would realistically retry, plus a safety margin. Stripe's 24-hour default works for most web and mobile clients. Background jobs with retry schedules spanning days need longer TTLs. A job that retries on a 7-day backoff needs at least a 7-day TTL. This is not complicated, but it is easy to get wrong by copying a default.
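If the client uses exponential backoff, the retry window can be computed rather than guessed. The sketch below assumes no jitter and a 25% safety margin. Both are arbitrary choices for illustration, and the function names are hypothetical.

```python
from datetime import timedelta


def retry_window(base_delay: timedelta, max_attempts: int, factor: float = 2.0) -> timedelta:
    """Time from the first attempt to the last retry under exponential backoff."""
    total = timedelta()
    delay = base_delay
    for _ in range(max_attempts - 1):  # one wait before each retry
        total += delay
        delay *= factor
    return total


def minimum_ttl(window: timedelta, safety_margin: float = 0.25) -> timedelta:
    """Retry window plus a proportional safety margin."""
    return window * (1 + safety_margin)
```

Five attempts starting at one second span 15 seconds, so a 24-hour TTL is plenty. A 7-day window with this margin needs 8 days and 18 hours, so it is not.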
The Tradeoffs, Laid Out
Every design choice here has a cost:
| Concern | Redis | Database |
|---|---|---|
| Lookup latency | Under 1ms | 10 to 50ms |
| Durability | Configurable; snapshots only by default | Durable |
| Expiration | Automatic | Requires cleanup job |
| Best for | Notifications, jobs | Payments, orders |
The other tradeoff is storage overhead. At 100,000 requests per day, 24-hour TTL, and 1KB average response, you're storing roughly 100MB of idempotency data. Manageable. At 10 million requests per day, it's 10GB daily churn through your Redis instance. You need to size accordingly, not pretend the math doesn't exist.
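The arithmetic behind those numbers, as a small helper. It assumes one stored record per request, 1 KB = 1,000 bytes, and that keys live exactly as long as the TTL.

```python
def idempotency_storage_bytes(requests_per_day: int, ttl_hours: int, avg_record_bytes: int) -> int:
    """Approximate bytes of live idempotency records at steady state."""
    # Live keys = arrival rate x time each key stays alive.
    return requests_per_day * ttl_hours * avg_record_bytes // 24


small = idempotency_storage_bytes(100_000, 24, 1_000)     # 100,000,000 bytes = 100 MB
large = idempotency_storage_bytes(10_000_000, 24, 1_000)  # 10,000,000,000 bytes = 10 GB
week = idempotency_storage_bytes(10_000_000, 24 * 7, 1_000)  # 7-day TTL: 70 GB
```

The last line is the part that bites: stretching the TTL from one day to seven multiplies the footprint by seven.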
There's also the external API timeout problem. Suppose you insert and commit the idempotency key, then call Stripe to process the charge. Stripe times out. You don't know if the charge happened, and the key is already committed. If you mark the key as "completed" you might be lying. If you mark it as "failed" a retry will try the charge again, which might double-charge.
The solution, described well in Brandur Leach's implementation guide, is recovery points: named checkpoints stored in the idempotency record that let a retry resume from the last confirmed step rather than restarting from scratch.
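A loose sketch of the recovery-point idea: the record remembers the last step that definitely finished, and a retry picks up from there. The checkpoint names, the `IdempotencyRecord` class, and the two callbacks are hypothetical. A real implementation would persist each checkpoint in the database, and this version also assumes the payment provider deduplicates on the forwarded key.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class IdempotencyRecord:
    key: str
    recovery_point: str = "started"  # hypothetical checkpoint names
    data: dict = field(default_factory=dict)


def process(
    record: IdempotencyRecord,
    create_order: Callable[[], str],
    charge_card: Callable[[str], str],
) -> IdempotencyRecord:
    """Resume from the last confirmed checkpoint instead of starting over."""
    if record.recovery_point == "started":
        record.data["order_id"] = create_order()
        record.recovery_point = "order_created"  # would be persisted here
    if record.recovery_point == "order_created":
        # Forward the same key so the provider can deduplicate the charge.
        record.data["charge_id"] = charge_card(record.key)
        record.recovery_point = "finished"
    return record
```

If `charge_card` times out, the record stays at `order_created`. The retry skips order creation and goes straight back to the charge with the same key.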
When to Bring It Up in an Interview
You don't need to explain idempotency keys in a design for a read-heavy social graph. You do need to bring it up any time the interview involves:
- Payments or financial transactions (say it in the requirements phase)
- Order creation (same)
- Notification delivery (when discussing retry behavior)
- Event-driven consumers with Kafka or SQS (when discussing reliability)
- Any system where the interviewer asks "what happens if a request fails?"
The signal you want to send: you know that networks fail, that clients retry, and that "add retry logic" is only half the answer. The other half is making the operation safe to retry.
Weak answer: "We'll add retry logic with exponential backoff."
Strong answer: "We'll add retry logic with exponential backoff, and the payment endpoint will use idempotency keys so retries don't double-charge the customer."
The strongest candidates name the specific failure mode, explain why retries alone don't fix it, and describe the key storage approach including where it lives and what TTL they'd use. For a payment system, that's: client-generated UUID, database-backed key store (not Redis), same transaction as the payment row, 24-hour TTL.
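On the client side, the strong answer boils down to one detail: generate the key once, outside the retry loop. The sketch below assumes a caller-supplied `send(payload, headers)` function that raises `TimeoutError` on failure. That function, the attempt count, and the delays are all placeholders.

```python
import time
import uuid
from typing import Callable


def call_with_retries(
    send: Callable[[dict, dict], dict],
    payload: dict,
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Retry with exponential backoff, reusing one idempotency key for every attempt."""
    key = str(uuid.uuid4())  # generated once, NOT per attempt
    for attempt in range(max_attempts):
        try:
            return send(payload, {"Idempotency-Key": key})
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable when max_attempts >= 1")
```

Generating a fresh UUID inside the loop would quietly turn every retry into a brand-new request, which defeats the whole pattern.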
For practice saying this out loud before an interview, SpaceComplexity runs voice-based mock interviews with rubric feedback on exactly this kind of system design depth, including whether you surfaced idempotency unprompted.
For more on the systems where idempotency matters most, see the payment system design walkthrough, the distributed message queue design, and the notification system design walkthrough.