Design a Payment System: A Stripe-Style Interview Walkthrough

- Scope to merchant-facing API: agree on what you're building in the first 5 minutes; you are not rebuilding Visa
- Name all four parties: merchant, acquirer, card network, and issuer; authorization and settlement are two separate operations
- Idempotency key pattern: INSERT atomically and return 409 on in-flight conflicts; SELECT-then-INSERT has a race condition
- Store money as integers: use BIGINT cents, never floats; IEEE 754 cannot exactly represent 0.10
- Treat PSP timeouts as unknown: a timeout is not a failure; poll asynchronously for the outcome rather than marking the payment failed
- Outbox pattern for webhooks: write events in the same transaction as state changes and deliver at-least-once with exponential backoff
- Reconciliation closes the loop: daily settlement file comparison catches discrepancies the real-time path misses
Most system design problems are forgiving. You drop a message, you retry. You lose a cache entry, you recompute. You get a stale read, you refresh. Nobody files a complaint.
Money doesn't work that way.
Money can't disappear, appear twice, or end up in the wrong account. Every bug has a dollar amount attached to it. Sometimes a large one. That's what makes the payment system design question genuinely hard, and why interviewers love asking it. Get the happy path right and you have a decent answer. Get idempotency wrong and you have a live incident at 2am on a Saturday.
Scope Your Payment System Design in the First Five Minutes
Before you draw a single box, align on what you're building. "Payment system" can mean the full card network, a merchant-facing API, an internal funds-transfer service, or all three.
Ask: "Are we building the merchant-facing payment API, or the full card processing infrastructure?" For most interviews, the right scope is a Stripe-like service: merchants call your API to charge a card, your service communicates with a payment processor, and you return a result. You are not rebuilding Visa.
Functional requirements (agree on these in two minutes):
- Initiate a payment (charge a card)
- Confirm or fail the payment
- Notify the merchant via webhook
- Support refunds (say you'll cover it if there's time)
Non-functional requirements matter more here than in most problems. You need strong consistency (no double charges, no lost money), high availability (99.99% uptime target), and low latency (authorization under 500ms). Scale: Stripe processes roughly 250 million API requests per day, about 3,000 requests per second on average with much higher peaks.
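The throughput figure is simple arithmetic, and it is worth doing out loud. Here is a minimal sketch of the conversion. The 5x peak multiplier is an illustrative assumption, not a published Stripe number.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rps(requests_per_day: int) -> float:
    """Average requests per second for a given daily volume."""
    return requests_per_day / SECONDS_PER_DAY


def peak_rps(requests_per_day: int, peak_factor: float) -> float:
    """Scale the average by an assumed peak-to-average ratio."""
    return average_rps(requests_per_day) * peak_factor


if __name__ == "__main__":
    daily = 250_000_000
    print(f"average: {average_rps(daily):,.0f} rps")
    print(f"assumed 5x peak: {peak_rps(daily, 5):,.0f} rps")
```

The average works out to roughly 2,900 requests per second, which is where the "about 3,000" figure comes from. Size for the peak, not the average.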
State what you're excluding. Multi-currency, payouts, and fraud scoring are real features, but they're not what makes this problem interesting. Say you'll call them out as extensions.
The Four Parties You're Actually Dealing With
The card payment ecosystem has four parties. Name them, or your architecture won't make sense.
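The merchant integrates your API. Your service sits in the PSP (payment service provider) layer: it handles the merchant-facing logic and routes each charge toward the card network, usually through an upstream processor (the "external PSP" in later sections). It is the system you are designing, not one of the four parties. The acquirer is the merchant's bank. The card network (Visa, Mastercard) routes authorization requests between banks. The issuer is the cardholder's bank, and it decides whether to approve or decline.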

Authorization and settlement are different events, separated by one to two business days.
Authorization and settlement are two separate operations. Authorization (200-400ms, synchronous) reserves funds on the cardholder's account. Settlement (T+1 to T+2 business days, batched) is when money actually moves. Capture sits in between: the merchant triggers it when they're ready to collect. An e-commerce site typically authorizes at order placement and captures at fulfillment.
Draw this flow on your whiteboard early. It anchors every subsequent design decision, and it signals to your interviewer that you've actually thought about how card payments work, not just how REST APIs work.
Five Services, One Payment
Your high-level architecture has five components.

Kafka sits between Payment Service and the downstream consumers, which means slow webhook delivery never blocks authorization.
The API Gateway authenticates merchants via API keys, enforces rate limits, and checks at the edge that every write request carries a well-formed idempotency key. It rejects malformed requests before they reach your business logic.
The Payment Service owns the state machine for each payment. It calls out to the external PSP, persists state transitions atomically, and publishes events downstream. This is the brain of the system.
The Ledger Service records every financial movement as double-entry accounting entries. It's the source of truth for money, separate from the payment service.
The Notification Service delivers webhooks to merchants when payment state changes. It consumes events from Kafka so slow or failing webhook delivery doesn't block authorization.
The Reconciliation Service runs nightly, comparing your internal ledger against PSP settlement files. Any discrepancy triggers an alert.
Idempotency Is Not Optional
This is the section most candidates hand-wave. Don't. You will be asked to go deeper.
Networks fail. Clients retry. A merchant's server crashes mid-request and restarts. If your system processes the same request twice, the cardholder gets charged twice. That's not a bug. That's a legal incident. Possibly a news article.
Every payment request must carry a client-generated idempotency key, and your system must guarantee that the same key always returns the same result, no matter how many times it's submitted.
The implementation uses a dedicated table:
    CREATE TABLE idempotency_keys (
        key         TEXT PRIMARY KEY,
        merchant_id UUID NOT NULL,
        response    JSONB,          -- stored result from first execution
        locked_at   TIMESTAMPTZ,    -- set while request is in-flight
        created_at  TIMESTAMPTZ DEFAULT now()
    );

The single atomic INSERT prevents the TOCTOU race entirely. There is no SELECT-then-INSERT here.
On each request, try to INSERT the key. If the INSERT succeeds, the key is new: process the payment, then UPDATE the row with the response and clear the lock. If the INSERT fails (key exists), SELECT the row. If response is populated, return it immediately. If response is null (another request is in-flight), return HTTP 409 and tell the client to retry in a few seconds. The locked_at timestamp is what lets you spot a lock abandoned by a crashed request, so a retry is not stuck on 409 forever.
This is an INSERT-first pattern, not SELECT-then-INSERT. The primary key makes the single INSERT atomic: two concurrent requests with the same key cannot both succeed, so the TOCTOU race doesn't exist. Idempotency keys expire after 24 hours (match your client retry window). Redis can serve completed responses for fast lookups, but Postgres stays the durable source of truth and the place where the atomic INSERT happens. If your interviewer pushes on why lookup is O(1), the hash map internals post covers the mechanics.
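The request flow above can be sketched in a few functions. This is an illustration under stated assumptions, not a production implementation: it uses Python's built-in sqlite3 in place of Postgres, and the names open_idempotency_db, claim_key, finish, and handle_charge are hypothetical.

```python
import json
import sqlite3


def open_idempotency_db() -> sqlite3.Connection:
    """In-memory stand-in for the idempotency_keys table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE idempotency_keys (
               key TEXT PRIMARY KEY,
               merchant_id TEXT NOT NULL,
               response TEXT,   -- JSON result from the first execution
               locked_at TEXT   -- set while the request is in flight
           )"""
    )
    return conn


def claim_key(conn, key, merchant_id):
    """Claim the key with one atomic INSERT; fall back to SELECT on conflict."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO idempotency_keys (key, merchant_id, locked_at) "
                "VALUES (?, ?, datetime('now'))",
                (key, merchant_id),
            )
        return "new", None
    except sqlite3.IntegrityError:
        # The key already exists: either finished or still in flight.
        row = conn.execute(
            "SELECT response FROM idempotency_keys WHERE key = ?", (key,)
        ).fetchone()
        if row[0] is None:
            return "in_flight", None
        return "replay", json.loads(row[0])


def finish(conn, key, response):
    """Store the result and clear the lock."""
    with conn:
        conn.execute(
            "UPDATE idempotency_keys SET response = ?, locked_at = NULL "
            "WHERE key = ?",
            (json.dumps(response), key),
        )


def handle_charge(conn, key, merchant_id, charge):
    """Return (http_status, body). `charge` is the side-effecting PSP call."""
    state, stored = claim_key(conn, key, merchant_id)
    if state == "replay":
        return 200, stored
    if state == "in_flight":
        return 409, {"error": "request in flight, retry shortly"}
    response = charge()
    finish(conn, key, response)
    return 200, response
```

Calling handle_charge twice with the same key runs charge once and returns the stored body the second time. A key that has been claimed but not finished yields 409.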
The Data Model: Three Tables That Do the Work
First, payments. Never store money as a float. IEEE 754 floating point cannot exactly represent 0.10 in binary. Try it yourself: 0.1 + 0.2 in your browser console returns 0.30000000000000004. Cute in a CS class. Catastrophic in a payment system. Store amounts as integers in the smallest currency unit: 1099 for $10.99, not 10.99. This is the kind of thing that quietly destroys reconciliation at scale, and your interviewer knows it.
    CREATE TABLE payments (
        id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
        merchant_id     UUID NOT NULL,
        amount          BIGINT NOT NULL,   -- cents, or smallest currency unit
        currency        CHAR(3) NOT NULL,
        status          TEXT NOT NULL,
        idempotency_key TEXT UNIQUE,
        psp_reference   TEXT,              -- PSP's transaction ID
        created_at      TIMESTAMPTZ DEFAULT now(),
        updated_at      TIMESTAMPTZ DEFAULT now()
    );
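To make the float problem concrete, here is a small sketch that parses a decimal string into integer minor units. The helper name to_minor_units is hypothetical, and the default exponent of 2 assumes a two-decimal currency such as USD.

```python
from decimal import Decimal


def to_minor_units(amount: str, exponent: int = 2) -> int:
    """Parse a decimal string such as "10.99" into integer minor units (1099)."""
    scaled = Decimal(amount).scaleb(exponent)  # shift the decimal point right
    if scaled != scaled.to_integral_value():
        raise ValueError(f"{amount} has more than {exponent} decimal places")
    return int(scaled)


if __name__ == "__main__":
    # Binary floats drift; integer cents do not.
    print(0.1 + 0.2 == 0.3)
    print(to_minor_units("0.10") + to_minor_units("0.20") == to_minor_units("0.30"))
```

The first comparison is False and the second is True. Parsing from a string matters: converting the float 10.99 to cents by multiplying by 100 reintroduces the rounding problem you were trying to avoid.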
Second, the ledger. Use double-entry bookkeeping: every debit has a matching credit, in the same atomic transaction. The invariant is that the sum of all credits equals the sum of all debits, at all times. If it doesn't, you have a bug.

Every dollar that enters the system also leaves it. The invariant is mathematical, not aspirational.
    CREATE TABLE ledger_entries (
        id         UUID PRIMARY KEY DEFAULT gen_random_uuid(),
        payment_id UUID REFERENCES payments(id),
        account_id UUID NOT NULL,
        amount     BIGINT NOT NULL,
        entry_type TEXT NOT NULL,   -- 'debit' or 'credit'
        created_at TIMESTAMPTZ DEFAULT now()
    );
A successful authorization creates two entries: debit on the merchant's receivables account, credit on the clearing account. Settlement creates two more entries moving funds to the merchant's payout account.
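The double-entry invariant is easy to check in code. The sketch below uses sqlite3 as a stand-in for Postgres. The function names and the account names merchant_receivables and clearing are illustrative assumptions.

```python
import sqlite3


def open_ledger() -> sqlite3.Connection:
    """In-memory stand-in for the ledger_entries table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE ledger_entries (
               id INTEGER PRIMARY KEY,
               payment_id TEXT NOT NULL,
               account_id TEXT NOT NULL,
               amount INTEGER NOT NULL CHECK (amount > 0),
               entry_type TEXT NOT NULL
                   CHECK (entry_type IN ('debit', 'credit'))
           )"""
    )
    return conn


def post_pair(conn, payment_id, debit_account, credit_account, amount):
    """Write a debit and its matching credit in one transaction."""
    with conn:  # both rows commit together, or neither does
        conn.executemany(
            "INSERT INTO ledger_entries "
            "(payment_id, account_id, amount, entry_type) VALUES (?, ?, ?, ?)",
            [
                (payment_id, debit_account, amount, "debit"),
                (payment_id, credit_account, amount, "credit"),
            ],
        )


def imbalance(conn) -> int:
    """Total debits minus total credits. Anything but zero is a bug."""
    (value,) = conn.execute(
        "SELECT COALESCE(SUM(CASE entry_type WHEN 'debit' THEN amount "
        "ELSE -amount END), 0) FROM ledger_entries"
    ).fetchone()
    return value
```

Because post_pair is the only write path and it always writes both legs, imbalance stays at zero. Running that query on a schedule is a cheap guard against a code path that bypasses it.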
Third, an outbox table for reliable webhook delivery (covered next).
The State Machine You Need to Draw
Every payment follows a fixed sequence of states. No state can be skipped, and some transitions are permanently one-way.

Each state transition is a single atomic database write. The payment service updates the payment row and inserts an event row into the outbox table in the same transaction; a relay then publishes that event to Kafka. Kafka cannot join a Postgres transaction, which is exactly why the outbox exists. A failed payment can never move to succeeded. This is not just a design choice. It's what makes your audit log trustworthy.
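A transition table makes the rules enforceable rather than aspirational. The state names below are assumptions chosen for illustration; only processing, succeeded, and failed are named elsewhere in this walkthrough.

```python
from enum import Enum


class Status(str, Enum):
    # Hypothetical state set for illustration.
    CREATED = "created"
    PROCESSING = "processing"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    REFUNDED = "refunded"


# Every legal move, spelled out. Terminal states map to an empty set.
ALLOWED = {
    Status.CREATED: {Status.PROCESSING},
    Status.PROCESSING: {Status.SUCCEEDED, Status.FAILED},
    Status.SUCCEEDED: {Status.REFUNDED},
    Status.FAILED: set(),
    Status.REFUNDED: set(),
}


class InvalidTransition(Exception):
    """Raised when a payment tries to make a move the table forbids."""


def transition(current: Status, target: Status) -> Status:
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise InvalidTransition(
            f"{current.value} -> {target.value} is not allowed"
        )
    return target
```

In the database, the same check is usually expressed as a conditional update (UPDATE ... WHERE id = ? AND status = ?), so two workers cannot both apply a transition from the same starting state.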
Webhooks Without Data Loss
Merchants need to know when payments succeed or fail. HTTP delivery is not reliable, so you need a delivery guarantee.
Use the transactional outbox pattern. When the payment service commits a state transition, it writes a row to an outbox table in the same database transaction. A separate worker polls the outbox and makes the HTTP call. If the worker crashes between commit and delivery, the row is still there. The next worker picks it up and retries.
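Here is a minimal sketch of the outbox write and the polling worker, again with sqlite3 standing in for Postgres. The table layout, the function names, and the send callback (which represents the HTTP call to the merchant) are all assumptions.

```python
import json
import sqlite3


def open_outbox_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE payments (id TEXT PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            payment_id TEXT NOT NULL,
            payload TEXT NOT NULL,
            delivered_at TEXT
        );
        """
    )
    return conn


def mark_succeeded(conn, payment_id):
    """State change and outbox row commit together, or not at all."""
    event = {"type": "payment.succeeded", "payment_id": payment_id}
    with conn:
        conn.execute(
            "UPDATE payments SET status = 'succeeded' WHERE id = ?",
            (payment_id,),
        )
        conn.execute(
            "INSERT INTO outbox (payment_id, payload) VALUES (?, ?)",
            (payment_id, json.dumps(event)),
        )


def deliver_pending(conn, send) -> int:
    """Try every undelivered row once; return how many were delivered."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE delivered_at IS NULL ORDER BY id"
    ).fetchall()
    delivered = 0
    for event_id, payload in rows:
        try:
            send({"event_id": event_id, **json.loads(payload)})
        except Exception:
            continue  # row stays pending and is retried on the next poll
        with conn:
            conn.execute(
                "UPDATE outbox SET delivered_at = datetime('now') WHERE id = ?",
                (event_id,),
            )
        delivered += 1
    return delivered
```

If the worker dies after send returns but before the UPDATE commits, the row is sent again on the next poll. That is the at-least-once behavior, visible in about fifty lines.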
Delivery is at-least-once, not exactly-once. Merchants must handle duplicates on their end. Your webhook payload should include a unique event ID alongside the payment ID so they can deduplicate; one payment emits several events, so the payment ID alone is not enough. Retry with exponential backoff, capped at 1 hour between attempts. After 72 hours of consecutive failures, stop and alert. Stripe's documented retry window is up to three days in live mode.
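The retry schedule is worth seeing as numbers. This sketch generates the delays with the 1-hour cap and 72-hour window stated above. The 60-second base delay is an assumption.

```python
from typing import Iterator


def retry_delays(
    base_s: int = 60,            # assumed first delay
    cap_s: int = 3600,           # 1 hour between attempts, at most
    window_s: int = 72 * 3600,   # give up after 72 hours
) -> Iterator[int]:
    """Yield delays that double from base_s, capped at cap_s, within window_s."""
    elapsed = 0
    attempt = 0
    while True:
        delay = min(base_s * 2 ** attempt, cap_s)
        if elapsed + delay > window_s:
            return  # the next retry would land outside the window
        elapsed += delay
        attempt += 1
        yield delay


if __name__ == "__main__":
    delays = list(retry_delays())
    print(delays[:7])
    print(len(delays), "retries before alerting")
```

With these defaults the delays start at 60, 120, 240, 480, 960, 1920 seconds and then sit at the 3,600-second cap. A real worker would likely add random jitter so that many failing endpoints do not retry in lockstep.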
Scaling: The Hot Merchant Problem
At low scale, one Postgres instance handles everything fine. At Stripe scale, you have two problems.
The first is raw database throughput. Shard by merchant_id. All payments for a merchant live on one shard. This keeps ledger queries fast (no cross-shard joins) and limits contention to one shard at a time.
The second is the hot merchant problem. A large merchant running a flash sale can generate thousands of writes per second, all hitting one shard. Their account balance row becomes a contention hotspot, serializing every write. This is where most candidates stop. Don't stop here.

The fix is internal account sharding: split the merchant's logical balance across N sub-account rows and aggregate them on read. Writes distribute across sub-accounts randomly. Reads do a SUM. This is the same pattern as Redis hot key sharding (scatter writes with a random suffix, gather on read). Start with N=10 and tune based on observed contention.
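The scatter-write, gather-read idea fits in a few lines. In this sketch an in-memory list stands in for the N sub-account rows. The class name ShardedBalance is hypothetical.

```python
import random
from typing import Optional


class ShardedBalance:
    """One logical balance spread across N sub-account rows."""

    def __init__(self, shards: int = 10, rng: Optional[random.Random] = None):
        self._rows = [0] * shards          # each slot stands in for a DB row
        self._rng = rng or random.Random()

    def credit(self, amount_cents: int) -> int:
        """Add to a randomly chosen sub-account; return the index used."""
        index = self._rng.randrange(len(self._rows))
        self._rows[index] += amount_cents
        return index

    def total(self) -> int:
        """Reads gather across all sub-accounts."""
        return sum(self._rows)
```

The tradeoff is on the read side: a balance check now touches N rows instead of one, and a debit that must not overdraw has to consider the aggregate, not a single sub-account.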
Kafka absorbs write pressure before it reaches the database. Partition Kafka by merchant_id so events for a merchant are processed in order. This keeps your Kafka-to-Postgres fan-in from overwhelming any single shard.
When the PSP Doesn't Pick Up
External PSPs fail. You need a strategy for each failure mode.
Put a circuit breaker in front of every PSP call. After five consecutive failures within a 30-second window, open the circuit and fail fast. This protects the PSP from a retry storm during a partial outage. After a timeout period, move to half-open and let a probe request through.
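Here is a minimal single-threaded sketch of that breaker. The threshold and window follow the numbers above. The 10-second cooldown before half-open is an assumption, and the class and exception names are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised instead of calling the PSP while the circuit is open."""


class CircuitBreaker:
    def __init__(self, threshold=5, window_s=30.0, cooldown_s=10.0,
                 clock=time.monotonic):
        self._threshold = threshold
        self._window_s = window_s
        self._cooldown_s = cooldown_s   # assumed wait before a probe
        self._clock = clock
        self._failures = []             # timestamps of consecutive failures
        self._opened_at = None          # None means the circuit is closed

    def call(self, fn):
        now = self._clock()
        if self._opened_at is not None:
            if now - self._opened_at < self._cooldown_s:
                raise CircuitOpenError("PSP circuit is open")
            # Cooldown elapsed: half-open, let this call through as a probe.
        try:
            result = fn()
        except Exception:
            self._record_failure(now)
            raise
        self._failures.clear()          # any success resets the streak
        self._opened_at = None
        return result

    def _record_failure(self, now):
        if self._opened_at is not None:
            self._opened_at = now       # probe failed: reopen
            return
        recent = [t for t in self._failures if now - t <= self._window_s]
        recent.append(now)
        self._failures = recent
        if len(recent) >= self._threshold:
            self._opened_at = now
```

Under concurrency you would also need to limit half-open to a single probe at a time; this sketch leaves that out.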
For major outages, route to a backup PSP. Call this out as a significant operational commitment: you need payment method tokens synchronized across both PSPs. Mention it as a tradeoff and move on. The simpler fallback is degraded-but-honest (return a clear error, let the merchant retry later).
The hardest case is a timeout. A timeout is not a failure. When the PSP call times out, you have no idea whether the charge happened. The PSP might have processed it. Or it crashed before processing. You genuinely don't know. Store the payment in processing state, then run an async job that polls the PSP for the outcome. Never assume timeout means failed. In payment systems, ambiguity costs real money.
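In code, the difference is one except branch. The sketch below assumes a hypothetical PSP client that raises PspTimeout or PspDeclined, and a psp_lookup callback that returns "succeeded", "failed", or None when the PSP has no answer yet.

```python
class PspTimeout(Exception):
    """The PSP did not answer in time. The outcome is unknown."""


class PspDeclined(Exception):
    """The PSP answered, and the answer was no."""


def authorize(payment: dict, psp_charge) -> dict:
    """Call the PSP once and record what we actually know."""
    try:
        payment["psp_reference"] = psp_charge(payment)
        payment["status"] = "succeeded"
    except PspDeclined:
        payment["status"] = "failed"        # a definite answer
    except PspTimeout:
        payment["status"] = "processing"    # unknown, not failed
    return payment


def resolve_unknown(payment: dict, psp_lookup) -> dict:
    """Async job: ask the PSP what happened to a payment stuck in processing."""
    if payment["status"] != "processing":
        return payment
    outcome = psp_lookup(payment["id"])
    if outcome in ("succeeded", "failed"):
        payment["status"] = outcome
    return payment                          # still processing if no answer yet
```

The job keeps polling until the PSP gives a definite answer. Only the PSP's answer, never the timeout itself, moves the payment out of processing.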
The timeout-as-unknown insight separates engineers who've thought about distributed systems from engineers who've thought about functions that return values. Say it out loud in the interview. Your interviewer will write it down.
Reconciliation Closes the Loop
Your real-time path handles the happy case. Reconciliation handles everything else.
Every PSP sends a daily settlement file listing all transactions they processed. Your reconciliation service downloads it and compares row by row against your ledger. Three cases:
- In your ledger but not in the settlement file: you recorded a success the PSP doesn't know about. Investigate before paying out.
- In the settlement file but not your ledger: the PSP processed it and you never recorded it. Revenue leak if undetected.
- Amount mismatch: usually a fee calculation error or currency rounding issue.
Each case has a different resolution workflow, and every discrepancy triggers an alert before the next settlement cycle. This is how Stripe keeps books clean at scale. If you only mention the real-time path, you've described a system that drifts over time. Your interviewer will notice.
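The three cases reduce to set operations on the PSP reference. This sketch assumes both sides have already been loaded into dictionaries keyed by PSP reference with amounts in cents. The function name and result keys are hypothetical.

```python
def reconcile(ledger: dict[str, int], settlement: dict[str, int]) -> dict[str, list]:
    """Compare internal ledger amounts against a PSP settlement file."""
    shared = ledger.keys() & settlement.keys()
    return {
        # We recorded a success the PSP does not know about.
        "missing_from_settlement": sorted(ledger.keys() - settlement.keys()),
        # The PSP processed it and we never recorded it.
        "missing_from_ledger": sorted(settlement.keys() - ledger.keys()),
        # Both sides have it, but the amounts differ: (ref, ours, theirs).
        "amount_mismatch": sorted(
            (ref, ledger[ref], settlement[ref])
            for ref in shared
            if ledger[ref] != settlement[ref]
        ),
    }


if __name__ == "__main__":
    ours = {"psp_1": 1099, "psp_2": 500, "psp_3": 250}
    theirs = {"psp_1": 1099, "psp_3": 245, "psp_4": 700}
    for case, items in reconcile(ours, theirs).items():
        print(case, items)
```

Each non-empty list feeds a different resolution workflow, which is why the function keeps the three cases separate instead of returning one flat list of problems.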
The 45-Minute Clock
Allocate the 45 minutes like this.
0-5 min: Clarify scope. Functional requirements, non-functional requirements, scale. Say what you're leaving out.
5-10 min: Draw the card network ecosystem. Name all four parties. Show the authorization flow. Mention auth vs capture vs settlement.
10-20 min: High-level architecture. Five services, Kafka between them, database.
20-30 min: Idempotency and the data model. This is where most candidates fail. The INSERT-based idempotency key pattern and the float-vs-integer gotcha are what separate a deep answer from a shallow one.
30-38 min: Scaling, hot merchant sharding, circuit breaker, and the timeout-as-unknown insight.
38-45 min: Reconciliation, open tradeoffs, and the interviewer's questions.
If you run short, finish the idempotency deep dive before anything else. An interviewer who hears the INSERT-then-UPDATE pattern and the 409 response logic will rate your answer far higher than one who sees a polished architecture diagram with no correctness reasoning.
The Tradeoffs Worth Naming
Strong candidates name their tradeoffs without being asked. The consistency vs availability decision shows up in multiple places. The push vs pull tradeoff framing maps to several of the calls below.
For the authorization path, you choose consistency. A double charge is worse than a failed authorization. Every write hits the primary. No stale reads allowed.
For the webhook path, you choose availability. Delayed delivery is better than dropped delivery. The outbox pattern plus at-least-once retry is the right call, not a synchronous HTTP call that blocks the authorization path.
For settlement reads (merchant dashboards, reporting), you can serve from read replicas. The data is historical and tolerance for slight staleness is high.
Synchronous capture vs async capture is another call. Synchronous gives merchants an immediate result. Async lets you batch capture requests to the acquirer, which can lower per-request processing overhead but adds latency and failure modes to reason about.
Recap
- Scope: merchant-facing API, not the card network infrastructure
- Four parties: merchant, acquirer, card network, issuer (the issuer approves or declines); your service is the PSP layer connecting them
- Five services: API gateway, payment service, ledger, notifications, reconciliation
- Idempotency keys: INSERT-based, not SELECT-then-INSERT; 24-hour TTL; return 409 on in-flight conflict
- Store money as integers (cents). Never floats
- Double-entry ledger: every debit has a matching credit, written atomically
- Outbox pattern for reliable webhook delivery; at-least-once means merchants deduplicate
- Shard by merchant, internal account sharding for hot merchants
- Circuit breaker in front of PSP calls; treat timeout as unknown, never as failed
- Reconciliation catches drift the real-time path misses