Subscription Billing System Design Interview: The Complete Walkthrough

"Charge the customer monthly." That is the whole feature. Three words. Your API calls Stripe and you are done.

Then you ship it. A customer's card declines. You retry. The card goes through. The customer gets charged twice because a webhook fired between your two attempts. Your on-call pager goes off at 2am. Surprise: "charge the customer monthly" is a distributed systems problem wearing a product manager's blazer.

Behind those three words sits a state machine tracking six subscription states, an idempotency layer at every write, a distributed job scheduler with crash recovery, and a dunning engine with ML-powered retry logic. The surface looks simple. The subscription billing system design interview does not.

Clarify the Scope Before You Draw Anything

Before you touch the whiteboard, ask four questions. The answers change the architecture.

Billing models: flat-rate per seat, or usage-based? Usage-based requires a separate metered billing pipeline where you aggregate usage events before the billing period closes.

Scale: 100K subscribers or 10M? This determines whether you need database partitioning and how aggressively you need to design around billing bursts.

Payment methods: cards only, or ACH and SEPA too? Async methods have settlement windows measured in days, not seconds, so subscription activation has to decide what access to grant while a payment is still pending.

Tax and multi-currency: in scope or not? If so, you need a tax engine integration (Avalara, TaxJar) and FX rate snapshotting at invoice creation time, not charge time.

For this walkthrough: flat-rate SaaS with optional per-seat pricing, 5M active subscribers, cards plus ACH, global currency support.

Six Components, One Direction of Data

┌─────────────────────────────────────────────────────┐
│            API Layer  (idempotency enforced)        │
└──────────────────────────┬──────────────────────────┘
                           │
        ┌──────────────────┴──────────────────┐
        │                                     │
┌───────▼───────┐  ┌───────────────┐  ┌───────▼───────┐
│  Subscription │  │    Billing    │  │    Invoice    │
│    Service    │  │   Scheduler   │  │    Service    │
│ (state mach.) │  │ (clock + CAS) │  │ (gen + send)  │
└───────┬───────┘  └───────┬───────┘  └───────────────┘
        │                  │
        └─────────┬────────┘
                  │
       ┌──────────▼──────────┐
       │  Payment Processor  │
       │  (Stripe / Adyen)   │
       └──────────┬──────────┘
                  │
       ┌──────────▼──────────┐
       │   Dunning Engine    │
       │  (retry + email)    │
       └─────────────────────┘

Data flows in one direction: the scheduler triggers billing jobs, the invoice service generates charges, the payment processor attempts collection, and the dunning engine handles everything that goes wrong after that. The subscription service owns the state machine that links all of this together.

A Data Model That Won't Haunt You Later

Five tables do most of the work:

subscriptions
  id, customer_id, plan_id,
  status,                        -- incomplete|trialing|active|past_due|unpaid|canceled
  current_period_start, current_period_end,
  billing_cycle_anchor, cancel_at_period_end

invoices
  id, subscription_id, customer_id,
  status,                        -- draft|open|paid|void|uncollectible
  amount_due_cents, period_start, period_end

invoice_line_items
  id, invoice_id, description,
  amount_cents, proration BOOL,
  period_start, period_end

charges
  id, invoice_id, customer_id, amount_cents,
  status, payment_method_id, failure_code

idempotency_keys
  key VARCHAR PRIMARY KEY,
  request_hash, status,          -- in_progress|completed|failed
  response_body JSONB, expires_at

Two non-obvious decisions here. First, status on subscriptions is a six-state enum, not a boolean. The active <-> past_due transition is reversible on successful payment. Modeling it as is_active breaks the first time a customer's card declines and then recovers. Second, every invoice stores period_start and period_end explicitly. You need these for proration, for tax filing, and for audit trails. Do not compute them on the fly from the subscription anchor.

[Figure: subscription state machine showing all six states and which transitions are reversible.] The six-state machine. Note the only reversible pair: active and past_due. Everything else is a one-way door.
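The state machine can be sketched as a transition table. This is a minimal illustration, not a definitive model: the `incomplete` state (first payment not yet confirmed) is an assumed sixth state borrowed from Stripe's status list, and the `ALLOWED` table and `transition` helper are hypothetical names chosen for this example.

```python
from enum import Enum


class Status(Enum):
    INCOMPLETE = "incomplete"  # assumption: initial state before the first payment succeeds
    TRIALING = "trialing"
    ACTIVE = "active"
    PAST_DUE = "past_due"
    UNPAID = "unpaid"
    CANCELED = "canceled"


# Assumed transition table. Only active <-> past_due appears in both directions.
ALLOWED = {
    Status.INCOMPLETE: {Status.ACTIVE, Status.CANCELED},
    Status.TRIALING: {Status.ACTIVE, Status.CANCELED},
    Status.ACTIVE: {Status.PAST_DUE, Status.CANCELED},
    Status.PAST_DUE: {Status.ACTIVE, Status.UNPAID, Status.CANCELED},
    Status.UNPAID: {Status.CANCELED},
    Status.CANCELED: set(),
}


def transition(current: Status, target: Status) -> Status:
    """Return the new status, or raise if the move is not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


def reversible_pairs() -> set:
    """Pairs of states that can move in both directions."""
    return {
        frozenset((a, b))
        for a, targets in ALLOWED.items()
        for b in targets
        if a in ALLOWED[b]
    }
```

Keeping the table as data makes the "one-way door" property checkable: `reversible_pairs()` should contain exactly one pair.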

The Billing Clock Is Where Interviews Are Won

Most candidates draw the architecture and move on. The interviewer wants to know how charges actually fire.

A billing job is not like a web request. A web request arrives and you handle it. A billing job must fire at a specific future time, survive process restarts, and execute exactly once (or at-least-once with idempotent handlers). Getting this wrong means charges fire twice, or never.

When you say "a cron job fires the charges," the interviewer writes "has not seen this break in production." Do not say that.

The reference architecture: a Postgres billing_jobs table as the durable store, a Redis sorted set as a time index (score = Unix epoch of next fire time), and a worker pool for execution.

The scheduler polls Redis every second for jobs due now. When it finds one, it claims with a compare-and-swap update:

UPDATE billing_jobs
SET status = 'acquired', version = version + 1
WHERE id = $1
  AND status = 'scheduled'
  AND version = $2;

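Zero rows updated means another scheduler instance already claimed the job. No global lock is needed. At 5M subscriptions with 20% renewing in the same hour at month-end, that is 1M jobs in one hour, or roughly 280 claims per second on average. Plan for roughly 15 to 20 scheduler instances running this loop, and confirm the number with load testing.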
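The claim can be exercised end to end with an in-memory database. This is a sketch under stated assumptions: SQLite stands in for Postgres, the table has only the three columns the UPDATE touches, and `claim_job` is a hypothetical helper name.

```python
import sqlite3


def claim_job(conn: sqlite3.Connection, job_id: int, seen_version: int) -> bool:
    """Try to claim a job; True only if this caller won the compare-and-swap."""
    cur = conn.execute(
        """
        UPDATE billing_jobs
        SET status = 'acquired', version = version + 1
        WHERE id = ? AND status = 'scheduled' AND version = ?
        """,
        (job_id, seen_version),
    )
    conn.commit()
    # rowcount is 1 for the winner and 0 for everyone else.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE billing_jobs ("
    "id INTEGER PRIMARY KEY, status TEXT NOT NULL, version INTEGER NOT NULL)"
)
conn.execute("INSERT INTO billing_jobs VALUES (1, 'scheduled', 0)")
conn.commit()

# Two scheduler instances both read version 0, then race to claim job 1.
first = claim_job(conn, 1, 0)
second = claim_job(conn, 1, 0)
```

The second caller loses without blocking or erroring; it simply sees zero rows updated and moves on to the next due job.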

After claiming, the job goes onto a Kafka topic. Workers consume it and attempt the charge. If a worker crashes mid-processing, a watchdog resets jobs stuck in acquired with an expired heartbeat back to scheduled. Workers must be idempotent regardless: before charging, query whether a successful charge already exists for this subscription and billing period.
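The watchdog described above might look like the following sketch. The `heartbeat_at` column, the 60-second timeout, and the `reset_stuck_jobs` name are all assumptions for illustration; SQLite again stands in for Postgres.

```python
import sqlite3

HEARTBEAT_TIMEOUT_S = 60  # assumed: an acquired job silent for 60s is treated as stuck


def reset_stuck_jobs(conn: sqlite3.Connection, now_epoch: int) -> int:
    """Return stuck 'acquired' jobs to 'scheduled'; reports how many were reset."""
    cur = conn.execute(
        """
        UPDATE billing_jobs
        SET status = 'scheduled', version = version + 1
        WHERE status = 'acquired' AND heartbeat_at < ?
        """,
        (now_epoch - HEARTBEAT_TIMEOUT_S,),
    )
    conn.commit()
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE billing_jobs ("
    "id INTEGER PRIMARY KEY, status TEXT NOT NULL, "
    "version INTEGER NOT NULL, heartbeat_at INTEGER)"
)
conn.executemany(
    "INSERT INTO billing_jobs VALUES (?, ?, ?, ?)",
    [
        (1, "acquired", 1, 900),    # worker died: last heartbeat 100s ago
        (2, "acquired", 1, 990),    # worker alive: heartbeat 10s ago
        (3, "scheduled", 0, None),  # never claimed
    ],
)
conn.commit()

reset_count = reset_stuck_jobs(conn, now_epoch=1000)
```

Bumping `version` on reset means a worker that was only slow, not dead, holds a stale version if it later tries a versioned update. It can still attempt the charge, which is why the worker-side idempotency check remains necessary.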

[Figure: billing scheduler architecture with Postgres billing_jobs, Redis sorted set time index, CAS claiming, Kafka topic, and worker pool with watchdog crash recovery.] The full scheduler stack. The CAS claim is the key insight: no distributed locks, no coordination overhead, just an optimistic write that fails gracefully when another instance gets there first.

This problem overlaps directly with designing a distributed job scheduler, which is a standalone interview problem worth understanding in full.

The Double-Charge Problem

Your billing worker fires a charge request to Stripe. The network times out. Did the charge succeed? You do not know. This is Schrödinger's payment: simultaneously charged and not charged until you open the box. Retry with a new request and the customer sees two charges at 2am. Skip the retry and revenue disappears into the void.

The fix: every write request carries a UUID idempotency key. The server stores the key with a 24-hour TTL. On replay, it returns the stored response without re-executing the operation.

Server-side claim in Postgres:

INSERT INTO idempotency_keys (key, status, expires_at)
VALUES ($1, 'in_progress', NOW() + INTERVAL '24 hours')
ON CONFLICT (key) DO NOTHING
RETURNING key;

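If this returns no rows, another request already claimed the key. If that request has completed, read and return its stored response; if it is still in_progress, return a retryable error instead of executing again. If the insert returns a row, you own the key. Execute the operation, then write the result and mark the key completed in one transaction.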
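A minimal sketch of that claim-then-replay flow follows. Assumptions: SQLite stands in for Postgres, the table omits `request_hash` and `expires_at` for brevity, `handle_request` is a hypothetical handler name, and the row count is used in place of `RETURNING` to detect whether the insert won.

```python
import json
import sqlite3


def handle_request(conn: sqlite3.Connection, key: str, operation):
    """Run `operation` at most once per key; replay the stored response afterwards."""
    cur = conn.execute(
        "INSERT INTO idempotency_keys (key, status) VALUES (?, 'in_progress') "
        "ON CONFLICT (key) DO NOTHING",
        (key,),
    )
    conn.commit()
    if cur.rowcount == 0:
        # Someone else owns the key: replay if finished, otherwise ask the caller to retry.
        status, body = conn.execute(
            "SELECT status, response_body FROM idempotency_keys WHERE key = ?",
            (key,),
        ).fetchone()
        if status != "completed":
            return {"error": "request_in_progress"}
        return json.loads(body)
    result = operation()
    conn.execute(
        "UPDATE idempotency_keys SET status = 'completed', response_body = ? WHERE key = ?",
        (json.dumps(result), key),
    )
    conn.commit()
    return result


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE idempotency_keys ("
    "key TEXT PRIMARY KEY, status TEXT NOT NULL, response_body TEXT)"
)

calls = []


def charge():
    calls.append(1)  # counts how many times the charge really executes
    return {"charge_id": "ch_demo", "amount_cents": 2900}


first_response = handle_request(conn, "key-1", charge)
replayed_response = handle_request(conn, "key-1", charge)
```

The second call returns the stored body without invoking `charge` again. A production version would also handle an operation that raises, so the key does not stay `in_progress` forever.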

But idempotency keys expire. The second layer of defense is a business-logic check: before charging, query whether a charge already exists for this subscription and this billing period. The key TTL layer protects against duplicate retries within 24 hours. The business-logic check protects against duplicates after key expiry.
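The business-logic check can be sketched as a lookup against the invoice period. Assumptions: SQLite stands in for Postgres, the tables carry only the columns the query needs, `'succeeded'` is an assumed charge status value, and `already_charged` is a hypothetical helper name.

```python
import sqlite3


def already_charged(conn: sqlite3.Connection, subscription_id: int,
                    period_start: str, period_end: str) -> bool:
    """True if a successful charge exists for this subscription and billing period."""
    row = conn.execute(
        """
        SELECT 1
        FROM charges c
        JOIN invoices i ON i.id = c.invoice_id
        WHERE i.subscription_id = ?
          AND i.period_start = ?
          AND i.period_end = ?
          AND c.status = 'succeeded'
        LIMIT 1
        """,
        (subscription_id, period_start, period_end),
    ).fetchone()
    return row is not None


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices (id INTEGER PRIMARY KEY, subscription_id INTEGER, "
    "period_start TEXT, period_end TEXT)"
)
conn.execute(
    "CREATE TABLE charges (id INTEGER PRIMARY KEY, invoice_id INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?, ?)",
    [(1, 10, "2024-01-01", "2024-02-01"), (2, 10, "2024-02-01", "2024-03-01")],
)
conn.executemany(
    "INSERT INTO charges VALUES (?, ?, ?)",
    [(1, 1, "succeeded"), (2, 2, "failed")],
)
conn.commit()

january_paid = already_charged(conn, 10, "2024-01-01", "2024-02-01")
february_paid = already_charged(conn, 10, "2024-02-01", "2024-03-01")
```

A failed charge does not count, so February stays eligible for a retry. Because check-then-charge is itself a race, one option is to back this query with a unique constraint on the subscription and period.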

A third layer: use the append-only ledger (covered below) so there is no mutable balance to accidentally update twice.

[Figure: two-layer idempotency defense showing the key TTL claim path and the business-logic period check path, with the happy path executing a charge in a single transaction.] Two layers, two different failure modes covered. Network retry within 24 hours hits layer one. Retry after key expiry hits layer two. Both return the same result without re-executing the charge.

For a full treatment of the implementation patterns, see Idempotency in System Design Interviews.

When the Card Declines

Roughly 25% of subscription cancellations are involuntary, driven entirely by payment failures. The customer wanted to stay. They liked your product. Their bank just said no on their behalf.

Classify the decline before deciding what to do.

Hard declines (stolen card, closed account, invalid card number) should never be retried. The card is not coming back. Send a dunning email requesting an updated payment method.

Soft declines (insufficient funds, temporary hold, issuer timeout) are worth retrying. The card is valid. A practical schedule without ML: retry at 24 hours, then 72 hours (paycheck timing), then day 7, then day 14. Stripe's Smart Retries trains on 500+ signals (time of day, decline code, billing amount, customer history), and Stripe reports recovering 57% of failed payments. Basic fixed-schedule retry typically recovers 15 to 25%.
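The classify-then-schedule rule can be sketched in a few lines. The decline code names below are hypothetical labels mirroring the categories in the text, not any processor's actual code list. Measuring all four offsets from the first failure, and treating unknown codes as non-retryable, are both assumptions.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical code names; map your processor's real decline codes onto these sets.
HARD_DECLINES = {"stolen_card", "closed_account", "invalid_number"}
SOFT_DECLINES = {"insufficient_funds", "temporary_hold", "issuer_timeout"}

# Fixed schedule from the text, measured from the first failure (assumption).
RETRY_OFFSETS = [
    timedelta(hours=24),
    timedelta(hours=72),
    timedelta(days=7),
    timedelta(days=14),
]


def classify(failure_code: str) -> str:
    """Return 'soft' for retryable declines, 'hard' for everything else."""
    if failure_code in SOFT_DECLINES:
        return "soft"
    # Unknown codes are treated as hard here: no automated retry, email instead.
    return "hard"


def retry_schedule(failure_code: str, failed_at: datetime) -> list:
    """Retry timestamps for a soft decline; empty for a hard decline."""
    if classify(failure_code) == "hard":
        return []
    return [failed_at + offset for offset in RETRY_OFFSETS]


failed_at = datetime(2024, 1, 1, tzinfo=timezone.utc)
soft_plan = retry_schedule("insufficient_funds", failed_at)
hard_plan = retry_schedule("stolen_card", failed_at)
```

An empty schedule is the signal to send the "update your payment method" email immediately rather than wait out a retry window.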

Proration runs on the same timing logic. When a customer upgrades mid-cycle from $29/month to $99/month with 20 days remaining in a 30-day period:

credit  = $29 × (20/30) = $19.33
charge  = $99 × (20/30) = $66.00
net due = $66.00 - $19.33 = $46.67

Store the proration as a separate line item with its own period_start and period_end. Tax applies to the net amount. If the prorated charge fails, do not change the plan until payment succeeds.
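The arithmetic above can be reproduced in integer cents, which avoids floating-point drift on money. This is a sketch: the function names are hypothetical, and rounding each line item half up to the nearest cent is an assumption about rounding policy.

```python
def prorate_cents(price_cents: int, days_remaining: int, days_in_period: int) -> int:
    """Prorated amount in integer cents, rounded half up."""
    numerator = price_cents * days_remaining
    return (2 * numerator + days_in_period) // (2 * days_in_period)


def upgrade_line_items(old_price_cents: int, new_price_cents: int,
                       days_remaining: int, days_in_period: int) -> dict:
    """Credit for the unused old plan, charge for the new plan, and the net due."""
    credit = prorate_cents(old_price_cents, days_remaining, days_in_period)
    charge = prorate_cents(new_price_cents, days_remaining, days_in_period)
    return {
        "credit_cents": -credit,  # stored as a negative line item
        "charge_cents": charge,
        "net_due_cents": charge - credit,
    }


# $29/month -> $99/month with 20 days left in a 30-day period.
items = upgrade_line_items(2900, 9900, days_remaining=20, days_in_period=30)
```

For the example in the text this yields a credit of 1933 cents, a charge of 6600 cents, and 4667 cents net due, matching $46.67.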

[Figure: dunning engine flow showing decline classification into hard (never retry, email for new card) and soft (retry at 24h, 72h, 7d, 14d), with the recovery path back to active state.] Hard declines stop immediately. Soft declines enter the retry schedule. A successful retry walks the subscription back from past_due to active. The customer never noticed.

Billing Bursts Create Hot Partitions

At 5M subscribers, the subscriptions table is not the bottleneck. The payment write path is.

A charge updates the charge record, the invoice status, and the subscription status. All three rows are locked for the duration of the write. Under normal load this is fine. At month-end, when 20% of subscriptions renew in the same hour, you get lock contention on the most active accounts. Your database becomes a nightclub with one bouncer and everyone showing up at midnight.

The architectural fix is an append-only ledger. Instead of updating balance rows in place, you append immutable debit/credit entries. The current state is always SUM(entries). Two concurrent writes to the same account never conflict because they append to different rows.

CREATE TABLE ledger_entries (
  id           BIGSERIAL PRIMARY KEY,
  account_id   BIGINT NOT NULL,
  amount_cents BIGINT NOT NULL,       -- positive = credit, negative = debit
  entry_type   VARCHAR NOT NULL,
  reference_id VARCHAR NOT NULL,      -- invoice ID, charge ID
  created_at   TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

This borrows the append-only discipline of double-entry bookkeeping and applies it to distributed systems (a strict double-entry ledger would also record a balancing entry in a counterpart account). Stripe's ledger processes 5 billion events per day with this pattern. For reads, a materialized view of current balances refreshes periodically. Cold reads aggregate the full entry log; hot reads hit the materialized view.

[Figure: mutable balance vs append-only ledger, showing lock contention on the left and parallel appends on the right, with the balance computed as SUM of entries.] Left: two concurrent charges fight over the same row lock. Right: they append to different rows. No coordination, no contention. Balance is always derivable from the log.
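A small sketch of the append-and-sum pattern, assuming SQLite in place of Postgres and a trimmed-down version of the table. The `append_entry` and `balance_cents` helpers and the entry type names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE ledger_entries (
        id           INTEGER PRIMARY KEY AUTOINCREMENT,
        account_id   INTEGER NOT NULL,
        amount_cents INTEGER NOT NULL,  -- positive = credit, negative = debit
        entry_type   TEXT NOT NULL,
        reference_id TEXT NOT NULL
    )
    """
)


def append_entry(conn, account_id: int, amount_cents: int,
                 entry_type: str, reference_id: str) -> None:
    """Append one immutable entry; existing rows are never updated."""
    conn.execute(
        "INSERT INTO ledger_entries (account_id, amount_cents, entry_type, reference_id) "
        "VALUES (?, ?, ?, ?)",
        (account_id, amount_cents, entry_type, reference_id),
    )
    conn.commit()


def balance_cents(conn, account_id: int) -> int:
    """Current balance is the sum of every entry for the account."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount_cents), 0) FROM ledger_entries WHERE account_id = ?",
        (account_id,),
    ).fetchone()
    return row[0]


append_entry(conn, 42, -2900, "invoice", "inv_jan")  # January invoice issued
append_entry(conn, 42, 2900, "payment", "ch_jan")    # January charge collected
append_entry(conn, 42, -2900, "invoice", "inv_feb")  # February invoice, not yet paid
```

After these three appends the account owes 2900 cents, and an account with no entries reads as zero. Corrections are made by appending a reversing entry, never by editing a row.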

When the ledger itself outgrows a single Postgres instance, hash-shard on account_id. Before sharding, partition functionally (separate schemas for payments, settlements, and disputes) so each schema stays manageable. See Database Sharding for System Design Interviews for the mechanics.

Say These Tradeoffs Out Loud

Interviewers score reasoning, not diagrams. Verbalize each one.

CP vs AP: payment authorization must be consistent. A payment that "maybe succeeded" is catastrophically worse than temporary unavailability. Use synchronous replication for charge records and idempotency tables. Use eventual consistency for analytics dashboards and MRR reporting.

Saga vs 2PC: when a successful charge must also provision access and send a confirmation email, those span three services. Two-Phase Commit distributes locks across all three and blocks if any participant is unavailable. A Saga decomposes this into three local transactions with compensating rollbacks. The charge is the pivot transaction. If it succeeds, provisioning and email retry to completion independently. If it fails, nothing irreversible has happened yet.

Invoice draft window: holding invoices in draft for ~1 hour before finalizing lets you adjust line items (add proration credits, apply coupons) before the customer is charged. The tradeoff is finalization logic complexity versus the ability to correct invoices before they go out.
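The Saga tradeoff above can be made concrete with a small orchestrator. This is a sketch under assumptions: the three steps are plain callables, `run_billing_saga` is a hypothetical name, and a real system would persist the step log and hand unfinished steps to a durable retry queue rather than loop in memory.

```python
def run_billing_saga(charge, provision, send_email, max_attempts: int = 3) -> list:
    """Charge is the pivot; later steps are retried instead of rolled back."""
    log = []
    if not charge():
        # Pivot failed: nothing irreversible has happened, so nothing to compensate.
        log.append("charge_failed")
        return log
    log.append("charged")
    for name, step in (("provisioned", provision), ("emailed", send_email)):
        for _ in range(max_attempts):
            try:
                step()
                log.append(name)
                break
            except Exception:
                continue  # transient failure: try the same step again
        else:
            # Out of attempts: record it so a background retry can finish the step.
            log.append(f"{name}_pending")
    return log


attempts = {"provision": 0}


def flaky_provision():
    attempts["provision"] += 1
    if attempts["provision"] < 2:
        raise RuntimeError("provisioning service unavailable")


ok_log = run_billing_saga(lambda: True, flaky_provision, lambda: None)
failed_log = run_billing_saga(lambda: False, flaky_provision, lambda: None)
```

Once the charge succeeds, the saga only moves forward: a provisioning outage delays access but never triggers a refund. When the charge fails, the later steps are never invoked.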

How to Use 45 Minutes

0-5: clarifying questions (billing model, scale, payment methods, tax)
5-10: six components named, data flows described
10-20: data model, subscription state machine, invoice lifecycle
20-30: billing clock (scheduler + CAS claim), idempotency two-layer defense
30-38: dunning strategy (hard vs soft decline classification), proration formula
38-45: append-only ledger for scale, CP vs AP framing, open questions from the interviewer

Do not rush past the billing clock. That is where the conversation lives at senior and staff levels. Saying "a cron job fires the charges" signals you have never seen this break in production. Explaining distributed CAS claims and idempotent workers signals the opposite.

Eight Invariants to Walk In With

Subscription status is a six-state enum. active <-> past_due is reversible.
Store period_start and period_end on every invoice and line item.
The billing scheduler uses distributed CAS claims, not global locks.
Idempotency has two layers: the key TTL layer and the business-logic period check.
Classify declines before retrying. Hard declines never retry.
Proration is price_diff × (days_remaining / days_in_period).
The append-only ledger eliminates lock contention at scale.
Payment authorization is CP. Analytics can be AP. Say this out loud.

Practicing this kind of walkthrough out loud matters more than memorizing the components. SpaceComplexity runs voice-based system design mock interviews with rubric scoring on the tradeoffs and communication clarity described above.