Fraud Detection System Design Interview: The 45-Minute Walkthrough

May 27, 2026 · 14 min read
TL;DR
  • Two-lane architecture: split real-time scoring (under 150ms) from offline analysis (minutes to hours) before drawing anything else.
  • Feature store: precompute velocity windows and risk signals in Redis; never touch a relational database in the hot path.
  • Three-tier decision: hard block, friction/review, and allow tiers prevent analyst queues from drowning on gray-zone volume.
  • Label delay problem: chargebacks arrive 30 to 90 days after fraud, making naive supervised training impossible; use proxy labels and drift detection instead.
  • Rules plus ML: rules handle hard constraints and cold-start cases; gradient boosting generalizes patterns across the long tail.
  • Graph analysis: GNNs run offline to detect fraud rings that single-transaction scoring misses entirely.

Most candidates hit the fraud detection system design interview and immediately start drawing a machine learning pipeline. Ten minutes on model selection, another five on feature engineering, and then the interviewer asks about latency. That's when you realize you forgot the entire real-time path. The one that has to respond in 150 milliseconds or the payment network rejects you.

Fraud detection is two systems, not one. There is a real-time scoring path that must decide in under 150 milliseconds, and an offline analysis path that runs over hours or days. Most of the interesting architectural decisions follow from that split.

Get that distinction on the whiteboard early. Everything else fills in around it.


Clarify Before You Draw Anything

Good interviewers reward scope control. Spend the first five minutes here. It is free points and it prevents you from designing the wrong thing for 40 minutes.

What kind of fraud? Payment fraud (card transactions), account takeover (credential stuffing), promotion abuse (coupon farming), or spam (fake content)? The latency requirement and signal set differ for each. This walkthrough focuses on payment fraud. It is the hardest version because Visa and Mastercard have opinions about your p99 latency.

Who calls us? An internal service (your own checkout flow) or an external API (third-party merchants)? If external, you need tenant isolation and per-merchant model customization.

What action do we take? Block the transaction, add friction (challenge the user), flag for human review, or just log the score for analytics? This shapes the response contract.

What are the latency and scale requirements? Payment processors have a hard window: Visa and Mastercard require an authorization response within 100 to 200 milliseconds end to end. At 1 million transactions per day, that is roughly 12 per second on average, but you must size for peak (10 to 20x average during flash sales or holiday spikes).

Typical requirements for this walkthrough:

  • Latency: score and decide within 150ms p99
  • Scale: 10,000 transactions per second at peak
  • Fraud rate: roughly 0.3% of transactions (highly imbalanced)
  • Chargeback target: keep chargeback rate below 0.1% (Visa fines merchants above this threshold)

Two Lanes, Not One

Draw two swim lanes. Everything else fills in around them. If you draw one lane that goes transaction → ML model → decision, you have described a data science project, not a production system.

Figure: Two-lane fraud detection architecture, with the real-time path on the left feeding into Kafka and the offline path on the right processing from Kafka back into the feature store.
Real-time path (left) must return in under 150ms. Offline path (right) feeds it better features over time. The feedback loop is what makes it work.

The real-time path never touches disk. Every microsecond counts. The offline path is where you afford expensive computation: graph traversals, ensemble models over long historical windows, analyst review queues.


Four Tables That Cover the Interview

events holds the raw transaction record plus the risk outcome.

CREATE TABLE events (
    id            UUID PRIMARY KEY,
    user_id       BIGINT NOT NULL,
    session_id    UUID,
    device_id     VARCHAR(64),
    ip_address    INET,
    amount_cents  INT NOT NULL,
    currency      CHAR(3),
    merchant_id   BIGINT,
    event_type    VARCHAR(32),   -- 'payment', 'login', 'signup'
    risk_score    FLOAT,
    decision      VARCHAR(16),   -- 'allow', 'block', 'review'
    created_at    TIMESTAMPTZ DEFAULT now()
);

user_risk_profiles stores precomputed velocity windows and lifetime signals. This is what gets served from Redis into the real-time path.

CREATE TABLE user_risk_profiles (
    user_id               BIGINT PRIMARY KEY,
    account_age_days      INT,
    txn_count_1h          INT,
    txn_amount_1h_cents   BIGINT,
    txn_count_24h         INT,
    failed_auth_count_1h  INT,
    distinct_ips_7d       INT,
    distinct_devices_7d   INT,
    fraud_count_lifetime  INT,
    last_seen_ip          INET,
    last_seen_device      VARCHAR(64),
    updated_at            TIMESTAMPTZ
);

device_profiles links device fingerprints to historical fraud signals.

CREATE TABLE device_profiles (
    device_id          VARCHAR(64) PRIMARY KEY,
    fingerprint_hash   VARCHAR(64),
    first_seen         TIMESTAMPTZ,
    linked_user_count  INT,
    fraud_event_count  INT
);

review_cases manages the human review queue and feeds labels back into training. Without this table, your model never learns from analyst decisions.

CREATE TABLE review_cases (
    id           UUID PRIMARY KEY,
    event_id     UUID REFERENCES events(id),
    status       VARCHAR(16),   -- 'pending', 'resolved'
    label        BOOLEAN,       -- true = fraud confirmed
    reviewer_id  BIGINT,
    labeled_at   TIMESTAMPTZ
);

Three Endpoints, One Contract

Three endpoints cover the interview.

POST /v1/score

Synchronous. Called by checkout. Returns a decision within the latency budget.

// Request
{
  "user_id": 42,
  "device_id": "abc123",
  "ip": "203.0.113.45",
  "amount_cents": 8999,
  "currency": "USD",
  "merchant_id": 7
}

// Response
{
  "decision": "allow",   // "allow" | "block" | "review"
  "risk_score": 0.12,
  "reason_codes": ["low_velocity", "trusted_device"]
}

POST /v1/cases/{id}/label

Used by fraud analysts to submit ground-truth labels. Triggers a label event on Kafka that feeds the retraining pipeline.

GET /v1/explain/{event_id}

Returns feature values and SHAP importance scores for a given decision. Analysts need this to work the queue. Regulators need this to not fine you.


What Happens in 150 Milliseconds

You have 150 milliseconds. That includes the two network hops. Here is where the time goes.

Figure: 150ms p99 latency budget breakdown, with each component shown against the total 150ms budget and ~100ms of slack remaining.
Most of the budget is slack. You just can't spend it on a synchronous database read.

Budget breakdown for 150ms p99: network from client to service (5ms), feature lookup from Redis (5ms), rules engine evaluation (10ms), ML model inference (20ms), write decision to Kafka (5ms), network back to client (5ms). That leaves about 100ms of slack for variance. Do not spend any of it on synchronous database reads.
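
The arithmetic is easy to sanity-check. The sketch below uses the stage estimates from this section; the dictionary keys are illustrative names, not identifiers from any real system.

```python
# Per-stage p99 estimates in milliseconds, taken from the budget above.
BUDGET_MS = 150
STAGES_MS = {
    "network_in": 5,
    "redis_feature_lookup": 5,
    "rules_engine": 10,
    "ml_inference": 20,
    "kafka_write": 5,
    "network_out": 5,
}


def slack_ms(budget_ms: int, stages_ms: dict) -> int:
    """Return the part of the latency budget that no stage has claimed."""
    return budget_ms - sum(stages_ms.values())


print(sum(STAGES_MS.values()))         # 50
print(slack_ms(BUDGET_MS, STAGES_MS))  # 100
```

Adding a hypothetical synchronous database read as one more entry shows how quickly the slack disappears.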

Feature Store

Redis is the only store that fits in this budget. Precomputed features are written here by the offline path. When a transaction arrives, one Redis pipeline call fetches the user profile, device profile, and IP signals together. Single-digit millisecond retrieval is achievable at scale.

Figure: Redis pipeline, where three HGETALL commands in one round-trip produce the feature vector that feeds the ML model.
A Redis pipeline batches all three lookups into a single round-trip. No serial network waits.
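
A minimal sketch of that lookup, assuming a redis-py style client (an object exposing `pipeline()`, with `hgetall()` and `execute()` on the pipeline). The key names `user:{id}`, `device:{id}`, and `ip:{addr}` are hypothetical, not a layout this article prescribes.

```python
from typing import Any, Dict


def fetch_features(client: Any, user_id: int, device_id: str, ip: str) -> Dict[str, dict]:
    """Fetch the user, device, and IP hashes in one network round-trip.

    `client` is assumed to behave like redis.Redis from redis-py.
    Key names are illustrative only.
    """
    pipe = client.pipeline(transaction=False)  # plain batching, no MULTI/EXEC
    pipe.hgetall(f"user:{user_id}")
    pipe.hgetall(f"device:{device_id}")
    pipe.hgetall(f"ip:{ip}")
    # execute() sends the queued commands together and returns the
    # replies in the order the commands were queued.
    user, device, ip_signals = pipe.execute()
    return {"user": user, "device": device, "ip": ip_signals}
```

A missing hash comes back as an empty dict, so the caller still has to decide on default feature values for unseen users, devices, or IPs.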

The features that matter most:

| Signal | Intuition |
|---|---|
| txn_count_1h | Velocity: fraudsters burn accounts fast |
| distinct_devices_7d | Compromised accounts often switch devices |
| account_age_days | New accounts are higher risk |
| ip_fraud_score | Enriched from third-party IP reputation feed |
| amount_z_score | How unusual is this amount relative to user history? |
| merchant_category | Some merchant categories attract more fraud |
| time_of_day | Fraudsters prefer off-hours |
| card_country_mismatch | Card issuer country vs IP country |

Rules Engine

Rules run before the ML model and can short-circuit the decision. This is by design. If a device is on a known-bad list, you don't need a 20ms model inference to block it. Uber's Mastermind engine maintains thousands of such rules and evaluates them in under 10 milliseconds.

Three categories of rules:

Hard block rules (zero tolerance). Velocity exceeds threshold (more than 5 transactions in 5 minutes from the same device). IP on blocklist. Known compromised device. These trigger an immediate block regardless of ML score.

Hard allow rules (trusted entities). Verified recurring billing. Merchant whitelist. These skip ML scoring and return allow immediately. This saves latency and compute on your highest-volume safe traffic. Your Netflix subscription doesn't need a fraud model every month.

Gray zone rules (hand off to ML). Everything else. The rules engine marks the transaction for ML scoring and passes the feature vector forward.
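
As a rough illustration of the three categories, the sketch below checks hard block rules first, then hard allow rules, and hands everything else to the model. The `Txn` fields, the set arguments, and the block-before-allow ordering are assumptions made for this example, not the design of any particular rules engine.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Set


class RuleOutcome(Enum):
    BLOCK = "block"
    ALLOW = "allow"
    SCORE_WITH_ML = "score_with_ml"


@dataclass
class Txn:
    device_id: str
    ip: str
    merchant_id: int
    device_txn_count_5m: int          # precomputed velocity feature
    is_verified_recurring: bool = False


def evaluate_rules(
    txn: Txn,
    blocked_devices: Set[str],
    blocked_ips: Set[str],
    trusted_merchants: Set[int],
    max_device_txn_5m: int = 5,
) -> RuleOutcome:
    # Hard block rules: checked first so a trusted merchant cannot
    # override a known-bad device (an assumption of this sketch).
    if txn.device_id in blocked_devices or txn.ip in blocked_ips:
        return RuleOutcome.BLOCK
    if txn.device_txn_count_5m > max_device_txn_5m:
        return RuleOutcome.BLOCK
    # Hard allow rules: skip ML scoring for trusted traffic.
    if txn.is_verified_recurring or txn.merchant_id in trusted_merchants:
        return RuleOutcome.ALLOW
    # Gray zone: everything else goes to the model.
    return RuleOutcome.SCORE_WITH_ML
```

Because each check is a set lookup or an integer comparison, this shape stays cheap even as the rule lists grow.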

ML Model Scoring

The model receives a feature vector and returns a probability. Gradient boosting (XGBoost, LightGBM) is the right default for tabular fraud data. It handles missing features gracefully, trains fast, and produces calibrated probabilities. Deep learning adds complexity without proportional gain unless you have billions of labeled examples (Stripe does; most teams don't).

The model runs as a gRPC service. You want strict timeouts here. If the model service is slow or unavailable, fall back to rules-only scoring. A degraded decision is better than no decision.
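
One way to sketch the timeout-and-fallback behavior in plain Python is shown below. `model_call` and `rules_only_score` are hypothetical callables standing in for the gRPC client and the rules-only scorer, and the 30ms default is an assumed value, not a recommendation. With a real gRPC stub the same idea is usually expressed as a per-call deadline.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, Sequence, Tuple

_executor = ThreadPoolExecutor(max_workers=8)


def score_with_fallback(
    model_call: Callable[[Sequence[float]], float],
    rules_only_score: Callable[[Sequence[float]], float],
    features: Sequence[float],
    timeout_s: float = 0.03,
) -> Tuple[float, str]:
    """Return (score, source), falling back to rules if the model is slow or down."""
    future = _executor.submit(model_call, features)
    try:
        return future.result(timeout=timeout_s), "ml"
    except FutureTimeout:
        future.cancel()  # best effort; a call that is already running keeps going
        return rules_only_score(features), "rules_fallback"
    except Exception:
        # Model service raised (unavailable, bad response, and so on).
        return rules_only_score(features), "rules_fallback"
```

Returning the source alongside the score lets the decision log record how often the system ran degraded.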

Three-Tier Decision

The three tiers are not symmetric. Blocking costs you a legitimate transaction. Missing fraud costs you a chargeback plus a fee.

Figure: Three-tier decision layer, with an allow zone (score under 0.4), a friction/review zone (0.4 to 0.8), and a hard block zone (above 0.8).
The friction tier is doing most of the work. Don't route everything gray-zone to a human queue.

risk_score >= 0.8          →  block
0.4 <= risk_score < 0.8    →  review (add friction or queue for analyst)
risk_score < 0.4           →  allow

The review tier is where most systems fail. If you put every gray-zone transaction in a human queue, analysts drown. Use friction instead: require 2FA, send a verification SMS, or ask the user to confirm unusual details. Legitimate users complete friction. Fraudsters often abandon.
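
The thresholds translate into a few lines of code. This is a minimal sketch using the 0.4 and 0.8 cutoffs from this section; `review_action` and its `can_challenge_user` flag are hypothetical names added to show friction being tried before the analyst queue.

```python
def decide(risk_score: float, block_at: float = 0.8, review_at: float = 0.4) -> str:
    """Map a model probability to one of the three tiers."""
    if not 0.0 <= risk_score <= 1.0:
        raise ValueError(f"risk_score must be in [0, 1], got {risk_score}")
    if risk_score >= block_at:
        return "block"
    if risk_score >= review_at:
        return "review"
    return "allow"


def review_action(can_challenge_user: bool) -> str:
    """Prefer automated friction; use the analyst queue only as a fallback.

    `can_challenge_user` is a hypothetical flag, for example a verified
    phone number on file for an SMS or 2FA challenge.
    """
    return "challenge_user" if can_challenge_user else "analyst_queue"


print(decide(0.12))  # allow
print(decide(0.55))  # review
print(decide(0.93))  # block
```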


The Offline Path and the Label Delay Problem

Here is the part that separates candidates who have run a real system from candidates who have read about one.

Chargebacks arrive 30 to 90 days after the fraudulent transaction. That is not a footnote. Your model scores a transaction as fine, and two months later the cardholder disputes it. You now have a label. Two months later.

This has two consequences. First, your training data is always stale. You cannot train a model on yesterday's transactions because you don't have ground truth for them yet. Second, your model drifts as fraud patterns evolve, but you don't see the signal for weeks.

Figure: "Developer dives into false positives" (r/ProgrammerHumor). Accurate.

Mitigation strategies:

Proxy labels: analyst review decisions (from the human review queue) are available within hours. They are noisier than chargeback labels but let you train more frequently.

Semi-supervised learning: use the unlabeled recent transactions alongside labeled historical data. The model learns the distribution of normal behavior even without fraud labels.

Concept drift monitoring: track model score distribution over time. If the distribution shifts significantly from baseline (run a population stability index check), retrain before your metrics degrade. You don't need the labels to detect drift.
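
A population stability index check can be sketched in a few lines. The version below assumes scores lie in [0, 1] and uses ten equal-width bins; the bin count and the small `eps` floor for empty bins are choices made for this example.

```python
import math
from typing import List, Sequence


def psi(baseline: Sequence[float], current: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population stability index between two sets of scores in [0, 1]."""

    def proportions(scores: Sequence[float]) -> List[float]:
        if not scores:
            raise ValueError("need at least one score")
        counts = [0] * bins
        for s in scores:
            idx = min(int(s * bins), bins - 1)  # a score of 1.0 goes in the last bin
            counts[idx] += 1
        # Floor empty bins at eps so the log ratio below stays finite.
        return [max(c / len(scores), eps) for c in counts]

    expected = proportions(baseline)
    actual = proportions(current)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))
```

A commonly quoted rule of thumb treats values above roughly 0.25 as a significant shift, but the alerting threshold is something to tune against your own score history. Note that the check needs only model outputs, which is why it works before any chargeback labels arrive.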

The offline path runs on Flink or Spark Streaming. It reads from the Kafka event log, computes rolling window features, updates the feature store, and periodically triggers model retraining from the training pipeline.


Graph Analysis: Catching Rings, Not Just Individuals

Single-transaction scoring misses coordinated fraud. A fraud ring might use 50 different accounts, each doing one transaction that individually looks completely normal. Individually: a $30 purchase from a two-week-old account. Nothing suspicious. Together: 50 accounts sharing one device fingerprint and one proxy IP.

Graph analysis finds the connections.

Figure: Graph analysis, with a normal user cluster on the left and a fraud ring on the right sharing a single device and IP proxy node.
One confirmed fraud account in a cluster flags the whole neighborhood for review. The shared device is the tell.

Build a graph where nodes are users, devices, and IPs. Edges connect them when they share a device, share an IP, or transact with the same merchant in the same window. Apply graph neural networks or simpler community detection (Louvain algorithm) to find clusters. If a device is shared by 20 accounts and one of them got labeled fraud, the others are suspect.
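
The "one labeled account makes its neighbors suspect" idea can be shown with something much simpler than a GNN or Louvain. The sketch below uses plain connected components via union-find, which is a deliberately cruder technique, and the `user:`, `device:`, and `ip:` node prefixes are a naming convention assumed for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str]


def find_clusters(edges: Iterable[Edge]) -> List[Set[str]]:
    """Group nodes into connected components with union-find."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    clusters: Dict[str, Set[str]] = defaultdict(set)
    for node in list(parent):
        clusters[find(node)].add(node)
    return list(clusters.values())


def suspect_users(edges: Iterable[Edge], confirmed_fraud: Set[str]) -> Set[str]:
    """Users sharing a component with any confirmed-fraud user."""
    suspects: Set[str] = set()
    for cluster in find_clusters(edges):
        if cluster & confirmed_fraud:
            suspects |= {n for n in cluster if n.startswith("user:")}
    return suspects - confirmed_fraud
```

Connected components over-flag when a node is legitimately shared by many people (a university NAT, a family tablet), which is one reason to move to community detection or a learned model once the basic graph exists.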

Visa's research found GNN-based detection identified 26% more previously unknown mule accounts compared to their existing ML pipeline. This runs offline (15-minute to hourly cadence), writes risk signals back to the feature store, and influences the next round of real-time scoring.


Where It Breaks at Scale

Kafka partitioning: partition by user_id (not by transaction ID) so that all events for a user arrive in order to the same Flink operator. Order matters for velocity windows. Partition by transaction ID if you want your velocity windows to be wrong.
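
The effect of the key choice can be illustrated without a broker. In the sketch below, `crc32` is a stand-in hash and the 12-partition topic is hypothetical; Kafka's own default partitioner uses a different hash function, but the property that matters (same key, same partition) is the same.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Deterministically map a message key to a partition (illustrative hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


NUM_PARTITIONS = 12  # hypothetical topic size

# Keyed by user: every event for user 42 lands on one partition, in order.
by_user = {partition_for("user:42", NUM_PARTITIONS) for _ in range(100)}

# Keyed by transaction: the same user's 100 events scatter across partitions,
# so no single consumer sees them all in order.
by_txn = {partition_for(f"txn:{i}", NUM_PARTITIONS) for i in range(100)}

print(len(by_user))  # 1
```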

Redis hot keys: power users with thousands of transactions per day create hot keys. Shard the key by appending a random suffix (user:42:0 through user:42:15) so that writes spread across shards, and aggregate across the shards on read. In Redis Cluster, the differently suffixed keys hash to different slots, which spreads the load across nodes.
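
A minimal sketch of the suffix scheme, using a plain dict as a stand-in for Redis so it runs anywhere. The helper names are hypothetical; against a real Redis the increment would be an INCR on the sharded key and the read would fetch all 16 keys in one pipeline.

```python
import random
from typing import Dict, Optional

NUM_SHARDS = 16  # matches the user:42:0 through user:42:15 example


def shard_key(user_id: int, shard: Optional[int] = None) -> str:
    """Build a sharded key; pick a random shard when none is given."""
    if shard is None:
        shard = random.randrange(NUM_SHARDS)
    return f"user:{user_id}:{shard}"


def record_txn(store: Dict[str, int], user_id: int) -> None:
    """Increment one randomly chosen shard of the user's counter."""
    key = shard_key(user_id)
    store[key] = store.get(key, 0) + 1


def txn_count(store: Dict[str, int], user_id: int) -> int:
    """Read the counter by summing every shard."""
    return sum(store.get(shard_key(user_id, s), 0) for s in range(NUM_SHARDS))
```

The tradeoff is explicit: writes get 16 times cheaper per key, and every read pays for 16 lookups.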

Model serving: each ML inference is ~20ms with a single model. At 10,000 TPS, that is 200 seconds of compute per second, so you need roughly 200 concurrent inference threads or replicas. Horizontal scaling with a load balancer in front of the model service is straightforward. Batch requests from the same millisecond together to amortize GPU/CPU context switch overhead.
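
The sizing follows from multiplying arrival rate by service time. A small sketch, where the `target_utilization` parameter and its 0.6 example value are assumptions added to show headroom, not figures from this walkthrough:

```python
import math


def required_concurrency(tps: float, latency_ms: float,
                         target_utilization: float = 1.0) -> int:
    """In-flight inferences needed to sustain `tps` at `latency_ms` each."""
    busy_workers = tps * latency_ms / 1000  # arrival rate x service time
    return math.ceil(busy_workers / target_utilization)


print(required_concurrency(10_000, 20))       # 200
print(required_concurrency(10_000, 20, 0.6))  # 334
```

Running workers at 100% utilization leaves no room for variance, so the second figure is closer to what you would actually provision.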

Feature freshness vs cost: recomputing velocity windows for every request is expensive. Pre-aggregate them with Flink using tumbling and sliding windows, write to Redis every 30 seconds. A 30-second staleness is acceptable for most signals. For the highest-risk signals (failed auth count in 5 minutes), push updates synchronously on each event.
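
To make the windowing concrete, here is an in-process sliding-window counter. It is a toy stand-in for what a Flink sliding window computes per key, and it assumes timestamps arrive in order, which is what partitioning by user_id is meant to provide.

```python
from collections import deque
from typing import Deque


class SlidingWindowCounter:
    """Count events in the last `window_s` seconds for a single key."""

    def __init__(self, window_s: float) -> None:
        self.window_s = window_s
        self.events: Deque[float] = deque()

    def add(self, ts: float) -> None:
        # Assumes non-decreasing timestamps.
        self.events.append(ts)

    def count(self, now: float) -> int:
        # Evict events at or before the window's lower edge.
        while self.events and self.events[0] <= now - self.window_s:
            self.events.popleft()
        return len(self.events)


# Failed-auth count over a 5-minute (300s) window.
failed_auth_5m = SlidingWindowCounter(window_s=300)
for ts in (0, 100, 250, 290):
    failed_auth_5m.add(ts)

print(failed_auth_5m.count(now=299))  # 4
print(failed_auth_5m.count(now=300))  # 3
```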


The Tradeoffs Worth Knowing

| Decision | Option A | Option B | Guidance |
|---|---|---|---|
| False positive vs false negative | Block more (higher recall) | Block less (higher precision) | Cost of a blocked legitimate transaction (customer lost) vs cost of fraud (chargeback + fee). Calculate your actual dollar values. |
| Rules vs ML | Rules only | ML only | 87% of best-in-class systems use both. Rules for hard constraints and cold-start, ML for pattern generalization. |
| Online vs offline features | Precompute everything | Compute at request time | Precompute all but the most time-sensitive signals. Computing at request time breaks the latency budget. |
| Synchronous vs asynchronous decision | Block before completion | Log and review post-hoc | Synchronous for payment authorization (too late after the fact), asynchronous for spam/content moderation where reversibility is higher. |
| Explainability vs accuracy | Simple decision tree | GNN ensemble | Regulators in finance often require explainability. SHAP values on gradient boosting are a reasonable middle ground. |
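
For the first row, "calculate your actual dollar values" can be made concrete with an expected-cost comparison. The sketch assumes a calibrated fraud probability p and a single cost per error type; the $20 and $80 figures are invented for illustration.

```python
def break_even_threshold(false_positive_cost: float, fraud_cost: float) -> float:
    """Probability above which blocking has lower expected cost than allowing.

    Blocking costs (1 - p) * false_positive_cost in expectation; allowing
    costs p * fraud_cost. Setting the two equal and solving for p gives
    the ratio below.
    """
    return false_positive_cost / (false_positive_cost + fraud_cost)


# Hypothetical values: a blocked good customer costs $20 in lost margin
# and goodwill; a missed fraud costs $80 in chargeback plus fee.
print(break_even_threshold(20.0, 80.0))  # 0.2
```

The more expensive missed fraud is relative to a blocked customer, the lower the threshold at which blocking becomes the cheaper choice.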

How to Run a Fraud Detection System Design Interview in 45 Minutes

0:00  Clarify type of fraud, latency SLA, scale, who acts on decisions
5:00  Draw the two swim lanes (real-time vs offline)
10:00 Five key components in the real-time path (feature store, rules, ML, decision, write-back)
18:00 Data model: events, user_risk_profiles, device_profiles, review_cases
23:00 API: POST /v1/score request/response shape
27:00 Deep dive: latency budget breakdown, feature store design, rules categories
35:00 Label delay problem: why naive training doesn't work, proxy labels, drift detection
38:00 Bottlenecks: Kafka partitioning, Redis hot keys, model serving scale
42:00 Tradeoffs table, graph analysis mention
45:00 Stop

Flag early when you want to go deep. "I can go deeper on the feature store, the ML pipeline, or the human review queue. Which matters most to you?" That question earns points for communication regardless of which direction they pick.


The Insight That Wins the Interview

Most candidates design a fraud system as a single ML pipeline and spend 30 minutes on model selection. What separates strong candidates is the real-time vs offline split, the label delay problem, and the three-tier decision layer (hard block, friction, allow). A candidate who explains why chargebacks arrive 30 to 90 days later and how that breaks naive supervised learning has clearly operated on a real system. That is what the interviewer is listening for.



Recap

  • Two paths: real-time scoring (< 150ms) and offline analysis (minutes to hours). Design them separately.
  • Feature store: Redis for sub-10ms feature retrieval. Never read from a relational database in the hot path.
  • Three-tier decision: hard block, friction/review, allow. The review tier uses friction, not a human queue.
  • Label delay: chargebacks arrive 30 to 90 days late. Use proxy labels and drift detection to compensate.
  • Rules + ML: rules for hard constraints and cold start, gradient boosting for pattern generalization.
  • Graph analysis: GNNs find fraud rings that single-transaction scoring misses. Runs offline, writes back to feature store.
  • Latency budget: 150ms p99, spend it on Redis (5ms) + rules (10ms) + ML inference (20ms).
  • Chargeback threshold: Visa and Mastercard penalize merchants above 0.1%. This is your recall floor.
