Design a Payment System: A Stripe-Style Interview Walkthrough

Most system design problems are forgiving. You drop a message, you retry. You lose a cache entry, you recompute. You get a stale read, you refresh. Nobody files a complaint.

Money doesn't work that way.

Money can't disappear, appear twice, or end up in the wrong account. Every bug has a dollar amount attached to it. Sometimes a large one. That's what makes the payment system design question genuinely hard, and why interviewers love asking it. Get the happy path right and you have a decent answer. Get idempotency wrong and you have a live incident at 2am on a Saturday.

Scope Your Payment System Design in the First Five Minutes

Before you draw a single box, align on what you're building. "Payment system" can mean the full card network, a merchant-facing API, an internal funds-transfer service, or all three.

Ask: "Are we building the merchant-facing payment API, or the full card processing infrastructure?" For most interviews, the right scope is a Stripe-like service: merchants call your API to charge a card, your service communicates with a payment processor, and you return a result. You are not rebuilding Visa.

Functional requirements (agree on these in two minutes):

Initiate a payment (charge a card)
Confirm or fail the payment
Notify the merchant via webhook
Support refunds (say you'll cover it if there's time)

Non-functional requirements matter more here than in most problems. You need strong consistency (no double charges, no lost money), high availability (99.99% uptime target), and low latency (authorization under 500ms). Scale: Stripe processes roughly 250 million API requests per day, about 3,000 requests per second on average with much higher peaks.
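The average-rate figure above is plain arithmetic, and it is worth doing out loud. A quick sketch; the peak multiplier is an assumption for illustration, not a published Stripe number:

```python
REQUESTS_PER_DAY = 250_000_000   # figure quoted in the text above
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400
PEAK_MULTIPLIER = 5              # assumption for illustration only

average_rps = REQUESTS_PER_DAY / SECONDS_PER_DAY
peak_rps = average_rps * PEAK_MULTIPLIER

print(f"average ~{average_rps:,.0f} rps, provision for ~{peak_rps:,.0f} rps")
```

State the multiplier you picked and why. The interviewer cares more about the reasoning than the exact number.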

State what you're excluding. Multi-currency, payouts, and fraud scoring are real features, but they're not what makes this problem interesting. Say you'll call them out as extensions.

The Four Parties You're Actually Dealing With

The card payment ecosystem has four parties. Name them, or your architecture won't make sense.

Your service plays the PSP (payment service provider) role: it handles the merchant-facing logic and forwards each charge to the acquirer. That leaves four parties to name. The merchant integrates your API. The acquirer is the merchant's bank. The card network (Visa, Mastercard) routes authorization requests between banks. The issuer is the cardholder's bank, and it decides whether to approve or decline.

The card network flow: Merchant sends charge to Your API, which routes through Acquirer, Card Network, and Issuer. Approve/decline response flows back in 200-400ms.

Authorization and settlement are different events, separated by one to two business days.

Authorization (200-400ms, synchronous) reserves funds on the cardholder's account. Settlement (T+1 to T+2 business days, batched) is when money actually moves. Capture sits in between: the merchant triggers it when they're ready to collect. An e-commerce site typically authorizes at order placement and captures at fulfillment.

Draw this flow on your whiteboard early. It anchors every subsequent design decision, and it signals to your interviewer that you've actually thought about how card payments work, not just how REST APIs work.

Five Services, One Payment

Your high-level architecture has five components.

Architecture overview: Merchant calls API Gateway, which hits Payment Service. Payment Service calls External PSP and publishes to Kafka. Kafka fans out to Ledger Service and Notification Service. Reconciliation runs nightly against Payment Service.

Kafka sits between Payment Service and the downstream consumers, which means slow webhook delivery never blocks authorization.

The API Gateway authenticates merchants via API keys, enforces rate limits, and validates idempotency keys at the edge. It rejects malformed requests before they reach your business logic.

The Payment Service owns the state machine for each payment. It calls out to the external PSP, persists state transitions atomically, and publishes events downstream. This is the brain of the system.

The Ledger Service records every financial movement as double-entry accounting entries. It's the source of truth for money, separate from the payment service.

The Notification Service delivers webhooks to merchants when payment state changes. It consumes events from Kafka so slow or failing webhook delivery doesn't block authorization.

The Reconciliation Service runs nightly, comparing your internal ledger against PSP settlement files. Any discrepancy triggers an alert.

Idempotency Is Not Optional

This is the section most candidates hand-wave. Don't. You will be asked to go deeper.

Networks fail. Clients retry. A merchant's server crashes mid-request and restarts. If your system processes the same request twice, the cardholder gets charged twice. That's not a bug. That's a legal incident. Possibly a news article.

Every payment request must carry a client-generated idempotency key, and your system must guarantee that the same key always returns the same result, no matter how many times it's submitted.

The implementation uses a dedicated table:

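CREATE TABLE idempotency_keys (
  merchant_id UUID NOT NULL,
  key         TEXT NOT NULL,  -- client-generated, unique per merchant
  response    JSONB,          -- stored result from first execution; NULL while in-flight
  locked_at   TIMESTAMPTZ,    -- set while request is in-flight
  created_at  TIMESTAMPTZ DEFAULT now(),
  PRIMARY KEY (merchant_id, key)  -- scope keys per merchant so two merchants cannot collide
);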

Idempotency key flow: INSERT key. If it succeeds, process payment and store response. If key exists and response is populated, return cached result. If key exists and response is null (in-flight), return HTTP 409.

The single atomic INSERT prevents the TOCTOU race entirely. There is no SELECT-then-INSERT here.

On each request, try to INSERT the key. If the INSERT succeeds, the key is new: process the payment, then UPDATE the row with the response and clear the lock. If the INSERT fails (key exists), SELECT the row. If response is populated, return it immediately. If response is null (another request is in-flight), return HTTP 409 and tell the client to retry in a few seconds.

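This is an INSERT-first pattern, not SELECT-then-INSERT: the uniqueness constraint on the key decides the winner, so there is no window between check and write for a second request to slip through. The locked_at column exists so that a request which died mid-flight can be detected and its key released, rather than answering 409 forever. Idempotency keys expire after 24 hours (match your client retry window). Postgres is the source of truth for the key, because the guarantee rests on that atomic INSERT; Redis can sit in front as a cache of completed responses for fast lookups. If your interviewer pushes on why lookup is O(1), the hash map internals post covers the mechanics.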
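The three branches of that flow can be sketched in a few lines. This is a minimal illustration, not a production implementation: it uses Python's built-in sqlite3 as a stand-in for Postgres, keys the table on (merchant_id, key) as an assumption so two merchants can reuse the same key string, and the helper names claim_key and store_response are hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for Postgres
conn.execute("""
    CREATE TABLE idempotency_keys (
        merchant_id TEXT NOT NULL,
        key         TEXT NOT NULL,
        response    TEXT,              -- NULL while the first request is in flight
        PRIMARY KEY (merchant_id, key)
    )
""")

def claim_key(merchant_id, key):
    """Return ("new", None), ("in_flight", None) or ("done", stored_response)."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO idempotency_keys (merchant_id, key) VALUES (?, ?)",
                (merchant_id, key),
            )
        return "new", None  # we won the insert: go process the payment
    except sqlite3.IntegrityError:
        # Key already exists. Only now do we read it.
        row = conn.execute(
            "SELECT response FROM idempotency_keys WHERE merchant_id = ? AND key = ?",
            (merchant_id, key),
        ).fetchone()
        if row[0] is None:
            return "in_flight", None       # caller answers HTTP 409
        return "done", json.loads(row[0])  # caller replays the stored response

def store_response(merchant_id, key, response):
    """Record the result of the first execution so retries can replay it."""
    with conn:
        conn.execute(
            "UPDATE idempotency_keys SET response = ? WHERE merchant_id = ? AND key = ?",
            (json.dumps(response), merchant_id, key),
        )
```

The uniqueness constraint does the arbitration. Two concurrent requests with the same key both attempt the INSERT, and the database lets exactly one succeed.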

The Data Model: Three Tables That Do the Work

First, payments. Never store money as a float. IEEE 754 floating point cannot exactly represent 0.10 in binary. Try it yourself: 0.1 + 0.2 in your browser console returns 0.30000000000000004. Cute in a CS class. Catastrophic in a payment system. Store amounts as integers in the smallest currency unit: 1099 for $10.99, not 10.99. This is the kind of thing that quietly destroys reconciliation at scale, and your interviewer knows it.

CREATE TABLE payments (
  id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  merchant_id     UUID NOT NULL,
  amount          BIGINT NOT NULL CHECK (amount > 0),  -- cents, or smallest currency unit
  currency        CHAR(3) NOT NULL,  -- ISO 4217 code, e.g. 'USD'
  status          TEXT NOT NULL,
  idempotency_key TEXT,
  psp_reference   TEXT,             -- PSP's transaction ID
  created_at      TIMESTAMPTZ DEFAULT now(),
  updated_at      TIMESTAMPTZ DEFAULT now(),
  UNIQUE (merchant_id, idempotency_key)  -- keys are scoped per merchant
);
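To make the integer rule concrete, here is a small sketch of converting a user-facing amount into the smallest currency unit. The helper to_minor_units is hypothetical, and the default of two decimal places is an assumption that holds for USD but not for every currency.

```python
from decimal import Decimal

def to_minor_units(amount: str, exponent: int = 2) -> int:
    """Convert a decimal string such as "10.99" into integer minor units.

    `exponent` is the number of decimal places the currency uses
    (2 for USD; an assumption here, look it up per currency in practice).
    """
    scaled = Decimal(amount).scaleb(exponent)  # shift the decimal point right
    if scaled != scaled.to_integral_value():
        raise ValueError(f"{amount} has more than {exponent} decimal places")
    return int(scaled)

# Floats drift: this comparison is False.
float_sum_is_exact = (0.1 + 0.2 == 0.3)

# Integers do not: 10 cents plus 20 cents is exactly 30 cents.
int_sum_is_exact = (to_minor_units("0.10") + to_minor_units("0.20") == 30)
```

Parsing from a string through Decimal avoids ever holding the amount as a float, and rejecting extra decimal places is safer than silently rounding someone's money.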

Second, the ledger. Use double-entry bookkeeping: every debit has a matching credit, in the same atomic transaction. The invariant is that the sum of all credits equals the sum of all debits, at all times. If it doesn't, you have a bug.

Double-entry ledger: authorization creates a debit on merchant.receivables and credit on clearing.account. Settlement creates a debit on clearing.account and credit on merchant.payout. Sum of debits always equals sum of credits.

Every dollar that enters the system also leaves it. The invariant is mathematical, not aspirational.

CREATE TABLE ledger_entries (
  id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  payment_id  UUID REFERENCES payments(id),
  account_id  UUID NOT NULL,
  amount      BIGINT NOT NULL CHECK (amount > 0),  -- always positive; entry_type carries direction
  entry_type  TEXT NOT NULL CHECK (entry_type IN ('debit', 'credit')),
  created_at  TIMESTAMPTZ DEFAULT now()
);

A successful authorization creates two entries: debit on the merchant's receivables account, credit on the clearing account. Settlement creates two more entries moving funds to the merchant's payout account.
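The invariant is easy to enforce in code. A minimal sketch, again using sqlite3 as a stand-in for Postgres; post_entries and ledger_imbalance are hypothetical helper names, and the account names follow the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for Postgres
conn.execute("""
    CREATE TABLE ledger_entries (
        payment_id TEXT NOT NULL,
        account_id TEXT NOT NULL,
        amount     INTEGER NOT NULL CHECK (amount > 0),
        entry_type TEXT NOT NULL CHECK (entry_type IN ('debit', 'credit'))
    )
""")

def post_entries(payment_id, entries):
    """Write (account_id, amount, entry_type) rows atomically; refuse unbalanced sets."""
    debits = sum(amount for _, amount, kind in entries if kind == "debit")
    credits = sum(amount for _, amount, kind in entries if kind == "credit")
    if debits != credits:
        raise ValueError(f"unbalanced entries: debits={debits}, credits={credits}")
    with conn:  # all rows commit together or not at all
        conn.executemany(
            "INSERT INTO ledger_entries VALUES (?, ?, ?, ?)",
            [(payment_id, account, amount, kind) for account, amount, kind in entries],
        )

def ledger_imbalance() -> int:
    """Total debits minus total credits across the whole ledger; must be 0."""
    row = conn.execute("""
        SELECT COALESCE(SUM(CASE entry_type WHEN 'debit' THEN amount ELSE -amount END), 0)
        FROM ledger_entries
    """).fetchone()
    return row[0]

# Authorization of $10.99: debit receivables, credit clearing.
post_entries("pay_1", [
    ("merchant.receivables", 1099, "debit"),
    ("clearing.account", 1099, "credit"),
])
```

Running ledger_imbalance() on a schedule is a cheap health check: any non-zero result means a write path bypassed the balanced-entry rule.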

Third, an outbox table for reliable webhook delivery (covered in the webhooks section below).

The State Machine You Need to Draw

Every payment follows a fixed sequence of states. No state can be skipped, and some transitions are permanently one-way.

Payment state machine: requires_payment_method to requires_confirmation to processing. Processing branches to succeeded or failed. Succeeded can transition to refunded. Failed can never become succeeded.

Each state transition is a single atomic database write. The payment service updates the row and writes the matching event to an outbox table in the same transaction; a relay then publishes that event to Kafka. A failed payment can never move to succeeded. This is not just a design choice. It's what makes your audit log trustworthy.
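One way to make illegal transitions impossible is to encode the diagram as data and check every change against it. A sketch using the state names from the diagram above; the transition helper and the InvalidTransition exception are hypothetical.

```python
# Allowed next states for each state, taken from the diagram above.
ALLOWED_TRANSITIONS = {
    "requires_payment_method": {"requires_confirmation"},
    "requires_confirmation": {"processing"},
    "processing": {"succeeded", "failed"},
    "succeeded": {"refunded"},
    "failed": set(),      # terminal: a failed payment never becomes succeeded
    "refunded": set(),    # terminal
}

class InvalidTransition(Exception):
    """Raised when a status change is not an edge in the state machine."""

def transition(current: str, target: str) -> str:
    """Return the new status if current -> target is allowed, else raise."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise InvalidTransition(f"{current} -> {target} is not allowed")
    return target
```

In a real service the same check would likely also live in the UPDATE itself (for example, a WHERE clause on the expected current status) so that two concurrent writers cannot both move the same payment.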

Webhooks Without Data Loss

Merchants need to know when payments succeed or fail. HTTP delivery is not reliable, so you need a delivery guarantee.

Use the transactional outbox pattern. When the payment service commits a state transition, it writes a row to an outbox table in the same database transaction. A separate relay worker polls the outbox and publishes each event to Kafka, where the Notification Service picks it up and makes the HTTP call. If the worker crashes between commit and delivery, the row is still there. The next worker picks it up and retries.
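The pattern fits in a short sketch. This one uses sqlite3 as a stand-in for Postgres; set_status and relay_once are hypothetical names, and publish stands in for whatever actually ships the event (a Kafka producer or an HTTP call).

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for Postgres
conn.executescript("""
    CREATE TABLE payments (id TEXT PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE outbox (
        id      INTEGER PRIMARY KEY AUTOINCREMENT,
        payload TEXT NOT NULL,
        sent    INTEGER NOT NULL DEFAULT 0
    );
    INSERT INTO payments VALUES ('pay_1', 'processing');
""")

def set_status(payment_id, status):
    """Change the payment and record the event in ONE transaction."""
    event = json.dumps({"payment_id": payment_id, "status": status})
    with conn:  # both writes commit together or not at all
        conn.execute("UPDATE payments SET status = ? WHERE id = ?", (status, payment_id))
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (event,))

def relay_once(publish) -> int:
    """Deliver unsent events in order; return how many were delivered."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall()
    delivered = 0
    for row_id, payload in rows:
        try:
            publish(json.loads(payload))
        except Exception:
            break  # leave this row and later ones unsent; the next poll retries
        # Mark as sent only after publish returned. A crash between the two
        # steps causes a duplicate delivery, never a lost one.
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
        delivered += 1
    return delivered
```

The ordering of "publish, then mark sent" is what makes this at-least-once. Reversing it would turn a crash into a silently dropped event.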

Delivery is at-least-once, not exactly-once. Merchants must handle duplicates on their end. Your webhook payload should include a unique event ID alongside the payment ID so they can deduplicate; the payment ID alone is not enough, because one payment emits several events (succeeded, then refunded). Retry with exponential backoff, capped at 1 hour between attempts. After 72 hours of consecutive failures, stop and alert. Stripe's actual retry window is about 3 days in live mode.
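The retry policy is worth sketching, because the cap and the give-up window interact. The 1-hour cap and 72-hour window come from the text above; the 60-second base delay and the function name retry_delays are assumptions for illustration, and jitter is left out to keep the arithmetic visible.

```python
def retry_delays(base_seconds=60, cap_seconds=3600, give_up_after_seconds=72 * 3600):
    """Return the list of waits (in seconds) between webhook delivery attempts.

    Delays double from `base_seconds` (an assumed starting point), are capped
    at `cap_seconds`, and stop once the total wait would exceed the window.
    """
    delays = []
    elapsed = 0
    attempt = 0
    while True:
        delay = min(base_seconds * 2 ** attempt, cap_seconds)
        if elapsed + delay > give_up_after_seconds:
            break  # next retry would fall outside the window: stop and alert
        delays.append(delay)
        elapsed += delay
        attempt += 1
    return delays

schedule = retry_delays()
```

In production you would likely add random jitter to each delay so that many failing endpoints do not all get retried at the same instant.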

Scaling: The Hot Merchant Problem

At low scale, one Postgres instance handles everything fine. At Stripe scale, you have two problems.

The first is raw database throughput. Shard by merchant_id. All payments for a merchant live on one shard. This keeps ledger queries fast (no cross-shard joins) and keeps one merchant's contention from spilling onto other merchants' shards.

The second is the hot merchant problem. A large merchant running a flash sale can generate thousands of writes per second, all hitting one shard. Their account balance row becomes a contention hotspot, serializing every write. This is where most candidates stop. Don't stop here.

Hot merchant problem: naive single balance row becomes a contention bottleneck under high write volume. Solution: split into N sub-account rows, distribute writes randomly, aggregate with SELECT SUM on read.

The fix is internal account sharding: split the merchant's logical balance across N sub-account rows and aggregate them on read. Writes distribute across sub-accounts randomly. Reads do a SUM. This is the same pattern as Redis hot key sharding (scatter writes with a random suffix, gather on read). Start with N=10 and tune based on observed contention.
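Scatter-on-write, gather-on-read looks like this in miniature. The sketch uses sqlite3 as a stand-in for Postgres, with N=10 from the text; the table layout and the helpers open_merchant, credit, and balance are hypothetical.

```python
import random
import sqlite3

N_SUB_ACCOUNTS = 10  # starting point from the text; tune on observed contention

conn = sqlite3.connect(":memory:")  # stand-in for Postgres
conn.execute("""
    CREATE TABLE balances (
        merchant_id TEXT NOT NULL,
        sub_account INTEGER NOT NULL,
        amount      INTEGER NOT NULL DEFAULT 0,  -- minor units
        PRIMARY KEY (merchant_id, sub_account)
    )
""")

def open_merchant(merchant_id):
    """Create the N sub-account rows that together hold one logical balance."""
    with conn:
        conn.executemany(
            "INSERT INTO balances (merchant_id, sub_account) VALUES (?, ?)",
            [(merchant_id, i) for i in range(N_SUB_ACCOUNTS)],
        )

def credit(merchant_id, amount):
    """Add funds to a randomly chosen sub-account, spreading row contention."""
    slot = random.randrange(N_SUB_ACCOUNTS)
    with conn:
        conn.execute(
            "UPDATE balances SET amount = amount + ? "
            "WHERE merchant_id = ? AND sub_account = ?",
            (amount, merchant_id, slot),
        )

def balance(merchant_id) -> int:
    """Gather: the logical balance is the sum over all sub-accounts."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM balances WHERE merchant_id = ?",
        (merchant_id,),
    ).fetchone()
    return row[0]
```

The tradeoff to name: credits scatter cleanly, but a debit that must not overdraw the merchant can no longer check a single row, so it needs either a summed read under a lock or a per-sub-account limit.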

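Kafka absorbs downstream write pressure: ledger and notification consumers read at their own pace, so a burst queues in the log instead of overwhelming any single shard. Partition Kafka by merchant_id so events for a merchant are processed in order. The authorization write itself still goes to Postgres synchronously, which is why the sub-account split above matters.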

When the PSP Doesn't Pick Up

External PSPs fail. You need a strategy for each failure mode.

Put a circuit breaker in front of every PSP call. After five consecutive failures within a 30-second window, open the circuit and fail fast. This protects the PSP from a retry storm during a partial outage. After a timeout period, move to half-open and let a probe request through.
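A minimal, single-threaded sketch of that breaker follows. The five-failure threshold and 30-second window come from the text; the 10-second cooldown, the class name, and the injectable clock are assumptions for illustration.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without trying."""

class CircuitBreaker:
    def __init__(self, max_failures=5, window_seconds=30.0,
                 cooldown_seconds=10.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self.cooldown_seconds = cooldown_seconds  # assumed value, tune per PSP
        self.clock = clock
        self.failure_times = []  # timestamps of the current failure streak
        self.opened_at = None    # None means the circuit is closed

    def call(self, fn):
        now = self.clock()
        if self.opened_at is not None and now - self.opened_at < self.cooldown_seconds:
            raise CircuitOpenError("PSP circuit is open, failing fast")
        # Either closed, or cooldown elapsed (half-open): let the call through.
        try:
            result = fn()
        except Exception:
            self._record_failure(now)
            raise
        self.failure_times.clear()  # any success ends the streak
        self.opened_at = None       # and closes the circuit
        return result

    def _record_failure(self, now):
        if self.opened_at is not None:
            self.opened_at = now  # failed probe: reopen and restart the cooldown
            return
        recent = [t for t in self.failure_times if now - t <= self.window_seconds]
        recent.append(now)
        self.failure_times = recent
        if len(recent) >= self.max_failures:
            self.opened_at = now
```

A real implementation would also need a lock so that only one probe passes in the half-open state; this sketch skips that to keep the state transitions readable.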

For major outages, route to a backup PSP. Call this out as a significant operational commitment: you need payment method tokens synchronized across both PSPs. Mention it as a tradeoff and move on. The simpler fallback is degraded-but-honest (return a clear error, let the merchant retry later).

The hardest case is a timeout. A timeout is not a failure. When the PSP call times out, you have no idea whether the charge happened. The PSP might have processed it. Or it crashed before processing. You genuinely don't know. Store the payment in processing state, then run an async job that polls the PSP for the outcome. Never assume timeout means failed. In payment systems, ambiguity costs real money.
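The distinction is easiest to see as three separate branches. In this sketch, psp_charge and psp_lookup are hypothetical callables standing in for the PSP client, and it assumes the PSP lets you look up a charge by your own payment ID.

```python
class PspDeclined(Exception):
    """The PSP answered, and the answer was no."""

class PspTimeout(Exception):
    """The PSP did not answer in time. The outcome is unknown."""

def authorize(payment: dict, psp_charge) -> dict:
    """Call the PSP and map each outcome to a payment status."""
    try:
        payment["psp_reference"] = psp_charge(payment)
        payment["status"] = "succeeded"
    except PspDeclined:
        payment["status"] = "failed"          # a definite no
    except PspTimeout:
        payment["status"] = "processing"      # NOT failed: we do not know
        payment["needs_status_check"] = True  # picked up by the async job
    return payment

def resolve_unknown(payment: dict, psp_lookup) -> dict:
    """Async job: ask the PSP what happened.

    `psp_lookup` returns "succeeded", "failed", or None if the PSP
    still cannot say (assumed contract for this sketch).
    """
    outcome = psp_lookup(payment["id"])
    if outcome in ("succeeded", "failed"):
        payment["status"] = outcome
        payment["needs_status_check"] = False
    return payment
```

Until resolve_unknown gets a definite answer, the payment stays in processing and the merchant is not told it failed.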

The timeout-as-unknown insight separates engineers who've thought about distributed systems from engineers who've thought about functions that return values. Say it out loud in the interview. Your interviewer will write it down.

Reconciliation Closes the Loop

Your real-time path handles the happy case. Reconciliation handles everything else.

Every PSP sends a daily settlement file listing all transactions they processed. Your reconciliation service downloads it and compares row by row against your ledger. Three cases:

In your ledger but not in the settlement file: you recorded a success the PSP doesn't know about. Investigate before paying out.
In the settlement file but not your ledger: the PSP processed it and you never recorded it. Revenue leak if undetected.
Amount mismatch: usually a fee calculation error or currency rounding issue.

Each case has a different resolution workflow, and every discrepancy triggers an alert before the next settlement cycle. This is how Stripe keeps books clean at scale. If you only mention the real-time path, you've described a system that drifts over time. Your interviewer will notice.
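The three cases reduce to set operations once both sides are keyed by the same reference. A sketch, assuming each side has already been loaded into a dict of PSP reference to amount in minor units; the function name and bucket names are hypothetical.

```python
def reconcile(ledger: dict, settlement: dict) -> dict:
    """Compare our ledger with a PSP settlement file.

    Both arguments map psp_reference -> amount in minor units.
    Returns the references that fall into each discrepancy bucket.
    """
    ledger_refs = ledger.keys()
    settlement_refs = settlement.keys()
    return {
        # We recorded it, the PSP did not: investigate before paying out.
        "missing_from_settlement": sorted(ledger_refs - settlement_refs),
        # The PSP processed it, we never recorded it.
        "missing_from_ledger": sorted(settlement_refs - ledger_refs),
        # Both sides have it, but the amounts disagree.
        "amount_mismatch": sorted(
            ref for ref in ledger_refs & settlement_refs
            if ledger[ref] != settlement[ref]
        ),
    }

report = reconcile(
    ledger={"ch_1": 1099, "ch_2": 2500, "ch_3": 400},
    settlement={"ch_1": 1099, "ch_3": 399, "ch_4": 750},
)
```

Any non-empty bucket in the report is what should trigger the alert and the matching resolution workflow.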

The 45-Minute Clock

Allocate the 45 minutes like this.

0-5 min: Clarify scope. Functional requirements, non-functional requirements, scale. Say what you're leaving out.

5-10 min: Draw the card network ecosystem. Name all four parties. Show the authorization flow. Mention auth vs capture vs settlement.

10-20 min: High-level architecture. Five services, Kafka between them, database.

20-30 min: Idempotency and the data model. This is where most candidates fail. The INSERT-based idempotency key pattern and the float-vs-integer gotcha are what separate a deep answer from a shallow one.

30-38 min: Scaling, hot merchant sharding, circuit breaker, and the timeout-as-unknown insight.

38-45 min: Reconciliation, open tradeoffs, and the interviewer's questions.

If you run short, finish the idempotency deep dive before anything else. An interviewer who hears the INSERT-then-UPDATE pattern and the 409 response logic will rate your answer far higher than one who sees a polished architecture diagram with no correctness reasoning.

The Tradeoffs Worth Naming

Strong candidates name their tradeoffs without being asked. The consistency vs availability decision shows up in multiple places. The push vs pull tradeoff framing maps to several of the calls below.

For the authorization path, you choose consistency. A double charge is worse than a failed authorization. Every write hits the primary. No stale reads allowed.

For the webhook path, you choose availability. Delayed delivery is better than dropped delivery. The outbox pattern plus at-least-once retry is the right call, not a synchronous HTTP call that blocks the authorization path.

For settlement reads (merchant dashboards, reporting), you can serve from read replicas. The data is historical and tolerance for slight staleness is high.

Synchronous capture vs async capture is another call. Synchronous gives merchants an immediate result. Async lets you batch capture requests to the acquirer, which cuts per-request overhead but adds latency and failure modes to reason about.

Recap

Scope: merchant-facing API, not the card network infrastructure
Four parties around your PSP: merchant, acquirer, card network, issuer (the issuer decides approve/decline)
Five services: API gateway, payment service, ledger, notifications, reconciliation
Idempotency keys: INSERT-based, not SELECT-then-INSERT; 24-hour TTL; return 409 on in-flight conflict
Store money as integers (cents). Never floats
Double-entry ledger: every debit has a matching credit, written atomically
Outbox pattern for reliable webhook delivery; at-least-once means merchants deduplicate
Shard by merchant, internal account sharding for hot merchants
Circuit breaker in front of PSP calls; treat timeout as unknown, never as failed
Reconciliation catches drift the real-time path misses
