A/B Testing System Design Interview: The 45-Minute Walkthrough

- Deterministic hashing eliminates the database for user assignment: the same user and experiment always produce the same bucket with zero storage and zero network calls.
- Layer-based traffic partitioning (Google's KDD 2010 design) lets teams run thousands of concurrent experiments without statistical interference between them.
- Separate the hot path from the cold path: assignment runs in-process at sub-millisecond latency; analysis runs as scheduled warehouse jobs hours later.
- Kafka-buffered event ingestion decouples exposure logging from the critical request path; synchronous database writes do not scale at a million QPS.
- The peeking problem inflates false positives to 30-40% unless you use sequential testing (mSPRT), which gives always-valid inference regardless of when you stop.
- Sample ratio mismatch (SRM) invalidates any experiment whose observed traffic split deviates significantly from the configured split; run a chi-squared test before reading any result.
- Holdout groups and cluster randomization are the advanced tradeoffs that separate strong hires from candidates who are reciting the design.
Everyone walks into the A/B testing system design interview thinking the hard part is statistics. It is not. The hard part is assigning a billion users to experiments without touching a database, and running a thousand simultaneous experiments without them contaminating each other. Get those two right and the rest is plumbing. As for the statistics, interviewers mostly just want to see that you've heard of a t-test.
Scope the Interview Before You Draw Anything
Five minutes up front saves fifteen later. A complete A/B testing platform needs four things:
- Experiment management: create experiments, define variants, configure traffic splits, target populations
- User assignment: given a user and an experiment, return a variant deterministically every time
- Event tracking: log exposures and metric events (purchases, clicks, conversions)
- Statistical analysis: compute significance, confidence intervals, guardrail metrics
The assignment path needs sub-millisecond latency at 1 million QPS. The analysis pipeline can lag by hours. That gap is the biggest architectural constraint in the system. Everything else flows from it.
Every Platform Has Five Components. Draw All Five.

Assignment and analytics never share infrastructure. That separation is the whole design.
Every A/B testing platform has five pieces:
- Experiment config service: stores active experiment definitions. Low write frequency, aggressively cached at the edge.
- Assignment service: the hot path. Given a user ID and an experiment, returns a variant. Must be fast enough to call on every request.
- Event ingestion pipeline: Kafka queue that buffers exposure and metric events, drains to a data warehouse asynchronously.
- Analytics engine: scheduled jobs that run statistical computations against the warehouse and write results to a read-optimized store.
- Results API and dashboard: read-heavy, tolerates data that is a few minutes stale.
The key architectural decision is separating the hot path from the cold path. Assignment is synchronous, on the critical path of every user request. Analytics is asynchronous and runs on a schedule. They should not share infrastructure.
Why Your Database Can't Handle Assignment
The naive approach: store each user's variant in a database. On every request, look it up.
This is the wrong answer. A billion users times a thousand experiments is a trillion rows; at 40 bytes per row that is 40 terabytes before indexes and replication. At 1 million QPS, your database becomes a bottleneck under every user-facing request in your system. Congratulations, you've turned a feature flag lookup into a single point of failure.
The actual approach is deterministic hashing. No storage, no network, no state.
    import hashlib

    def assign_variant(user_id: str, experiment_id: str, num_buckets: int = 10000) -> int:
        """Return a stable bucket in [0, num_buckets) for this user and experiment."""
        key = f"{user_id}/{experiment_id}"
        digest = hashlib.md5(key.encode()).hexdigest()
        return int(digest, 16) % num_buckets
Map buckets to variants: buckets 0-4999 go to control, 5000-9999 go to treatment (for a 50/50 split). The same user always lands in the same bucket for the same experiment because the hash inputs are deterministic. Stickiness is free.
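The bucket-to-variant mapping can be sketched as a cumulative-weight lookup. This is a minimal illustration, not any particular platform's implementation: `assign_bucket`, `pick_variant`, and the `(name, weight)` tuple layout are hypothetical names, and the hash is the same MD5-mod-10,000 scheme shown above.

```python
import hashlib

NUM_BUCKETS = 10_000


def assign_bucket(user_id: str, experiment_id: str) -> int:
    """Stable bucket in [0, NUM_BUCKETS) for this user and experiment."""
    key = f"{user_id}/{experiment_id}"
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_BUCKETS


def pick_variant(bucket: int, variants: list[tuple[str, float]]) -> str:
    """Map a bucket to a variant by walking cumulative weights.

    Assumes the weights sum to 1.0; the tuple layout is invented for this sketch.
    """
    upper = 0.0
    for name, weight in variants:
        upper += weight * NUM_BUCKETS
        if bucket < upper:
            return name
    return variants[-1][0]  # guard against floating-point rounding


split = [("control", 0.5), ("treatment", 0.5)]
bucket = assign_bucket("user-123", "btn_color_v2")
variant = pick_variant(bucket, split)
```

Changing the split to 90/10 only changes the weights; users who were already in control stay in control as long as control's range only grows.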

Same inputs, same bucket, every time. No writes. No reads. No network.
The experiment_id is critical. Include it in the hash input so the same user lands in different buckets across different experiments. Without it, every experiment assigns the same 50% of your users to treatment. Those users are correlated across every test you run. Your results are garbage and you don't know it yet.
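A quick way to convince yourself: count how many users land in treatment for two experiments at once. The sketch below uses made-up user and experiment IDs, and the "unsalted" path is a deliberately buggy variant written for this comparison. With the experiment ID in the key, roughly a quarter of users should be in both treatments (two independent coin flips). Without it, roughly half are, because it is the same coin.

```python
import hashlib


def bucket(key: str, num_buckets: int = 10_000) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % num_buckets


def in_treatment(user_id: str, experiment_id: str = "") -> bool:
    # An empty experiment_id models the buggy "hash the user only" scheme.
    key = f"{user_id}/{experiment_id}" if experiment_id else user_id
    return bucket(key) >= 5_000


users = [f"user-{i}" for i in range(20_000)]

# Correct: the experiment ID is part of the key, so assignments are independent.
both_salted = sum(
    in_treatment(u, "exp_a") and in_treatment(u, "exp_b") for u in users
) / len(users)

# Buggy: the key ignores the experiment, so both tests pick the same users.
both_unsalted = sum(
    in_treatment(u) and in_treatment(u) for u in users
) / len(users)
```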
Modern SDKs pull the full experiment config once and evaluate assignments entirely in-process. No RPC. No network hop. Uber moved from an RPC-based assignment service to local in-process evaluation and dropped p99 latency from 10ms to 100 microseconds. A 100x improvement from eliminating one network call. One network call. That's all it was.
See the consistent hashing guide for the full mechanics.
Running a Thousand Experiments Without Contamination
When 50 experiments all touch conversion rate, they can contaminate each other's results if users overlap across tests. The naive fix is to give each experiment its own exclusive traffic slice, but that runs out fast. A thousand experiments at 1% each uses 10x your total traffic. Physics, unfortunately.
Google's solution, published at KDD 2010: layers.
A layer is an independent traffic partition with its own hash salt. Experiments in different layers are orthogonal: a user can be in one experiment from each layer at the same time, and the assignments are statistically independent because each layer uses a different salt. Experiments in the same layer are mutually exclusive: a user is in at most one experiment per layer.
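Here is one way the layer idea might look in code. It is a sketch under assumptions, not the design from the Google paper: the `Layer` and `Experiment` classes, the bucket-range allocation, and all IDs are hypothetical. Each layer hashes the user with its own salt, and each experiment owns a non-overlapping range of that layer's buckets.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

NUM_BUCKETS = 10_000


def _bucket(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_BUCKETS


@dataclass(frozen=True)
class Experiment:
    experiment_id: str
    start: int  # first layer bucket owned by this experiment (inclusive)
    end: int    # end of the owned range (exclusive)


@dataclass(frozen=True)
class Layer:
    salt: str
    experiments: tuple  # Experiment objects with non-overlapping ranges

    def experiment_for(self, user_id: str) -> Optional[Experiment]:
        b = _bucket(f"{self.salt}/{user_id}")
        for exp in self.experiments:
            if exp.start <= b < exp.end:
                return exp
        return None  # this slice of the layer is currently unallocated


search = Layer("search-ranking", (
    Experiment("rank_v2", 0, 2_000),
    Experiment("rank_v3", 2_000, 4_000),
))
ui = Layer("ui", (Experiment("btn_color_v2", 0, 5_000),))


def active_experiments(user_id: str, layers: list) -> dict:
    """At most one experiment per layer, chosen independently across layers."""
    result = {}
    for layer in layers:
        exp = layer.experiment_for(user_id)
        if exp is not None:
            result[layer.salt] = exp.experiment_id
    return result
```

Once a user is in an experiment, the variant is picked with a second hash keyed on the experiment ID, as in the earlier snippet.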

One user, three experiments, zero contamination. Independent hash salts do the work.
You organize layers by product surface. The search ranking team gets one layer. The UI team gets another. Notifications gets a third. Teams run experiments in parallel within their layer without affecting any other team's results.
LinkedIn runs over 40,000 concurrent tests against 700 million members using this pattern. (Yes, 40,000. Everything on that site is an experiment.)
Don't Block the Request
When a user is assigned to a variant, you want to record an exposure event. When they convert, you want to record a metric event. The obvious implementation writes both synchronously to a database. This adds 5-10ms to every user request. Every. Single. One.
Do not write synchronously. Write to Kafka, return immediately, and let a background consumer drain events to the warehouse.
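The shape of that fire-and-forget write looks roughly like this. The sketch uses an in-memory queue and a background thread as a stand-in for a Kafka producer, so it runs without a broker. `ExposureLogger` and its `sink` callable are hypothetical names, and dropping events when the buffer is full is one possible policy, not the only one.

```python
import queue
import threading


class ExposureLogger:
    """Non-blocking logger; the queue stands in for a Kafka producer buffer."""

    def __init__(self, sink, max_pending: int = 10_000):
        self._pending = queue.Queue(maxsize=max_pending)
        self._sink = sink  # callable that ships one event downstream
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def log(self, event: dict) -> bool:
        """Enqueue and return immediately; report False if the buffer is full."""
        try:
            self._pending.put_nowait(event)
            return True
        except queue.Full:
            return False  # shedding an event beats blocking the user request

    def _drain(self) -> None:
        while True:
            event = self._pending.get()
            try:
                self._sink(event)
            except Exception:
                pass  # a real pipeline would retry or dead-letter the event
            finally:
                self._pending.task_done()

    def flush(self) -> None:
        """Block until everything queued so far has been handed to the sink."""
        self._pending.join()


warehouse = []  # toy sink; in production this is the consumer side of Kafka
logger = ExposureLogger(sink=warehouse.append)
accepted = logger.log({"user_id": "u1", "experiment_id": "btn_color_v2",
                       "variant_key": "treatment"})
logger.flush()
```

The request handler only ever calls `log`, which never waits on I/O.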

The critical path returns before the event even hits Kafka. That's the whole point.
The warehouse schema is intentionally simple:
    -- Exposure events (append-only, partitioned by date)
    user_id        VARCHAR
    experiment_id  VARCHAR
    variant_key    VARCHAR
    exposed_at     TIMESTAMPTZ
    platform       VARCHAR
    country        VARCHAR

    -- Metric events (append-only, partitioned by date)
    user_id        VARCHAR
    event_name     VARCHAR
    event_value    DECIMAL
    occurred_at    TIMESTAMPTZ
Analysis joins on user_id: find all users exposed to experiment btn_color_v2, then check whether those users triggered purchase_completed afterward. Columnar warehouse storage and partition pruning handle this join at billion-row scale.
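In a real warehouse this is a SQL join. To make the logic concrete, here is the same computation on toy in-memory rows: exposed users per variant, and how many of them fired the metric event after their first exposure. The rows, integer timestamps, and the `conversion_by_variant` helper are all invented for illustration.

```python
from collections import defaultdict

exposures = [  # (user_id, experiment_id, variant_key, exposed_at)
    ("u1", "btn_color_v2", "control", 100),
    ("u2", "btn_color_v2", "treatment", 105),
    ("u3", "btn_color_v2", "treatment", 110),
    ("u4", "other_exp", "control", 90),
]
metrics = [  # (user_id, event_name, occurred_at)
    ("u1", "purchase_completed", 95),   # before exposure: must not count
    ("u2", "purchase_completed", 130),
    ("u3", "page_view", 120),           # wrong event
    ("u4", "purchase_completed", 140),  # user is in a different experiment
]


def conversion_by_variant(exposures, metrics, experiment_id, event_name):
    first_exposure = {}  # user_id -> (variant, earliest exposed_at)
    for user_id, exp_id, variant, ts in exposures:
        if exp_id != experiment_id:
            continue
        if user_id not in first_exposure or ts < first_exposure[user_id][1]:
            first_exposure[user_id] = (variant, ts)

    converted = set()
    for user_id, name, ts in metrics:
        hit = first_exposure.get(user_id)
        if hit and name == event_name and ts >= hit[1]:
            converted.add(user_id)

    exposed, conversions = defaultdict(int), defaultdict(int)
    for user_id, (variant, _) in first_exposure.items():
        exposed[variant] += 1
        conversions[variant] += user_id in converted
    return {v: conversions[v] / exposed[v] for v in exposed}


rates = conversion_by_variant(exposures, metrics,
                              "btn_color_v2", "purchase_completed")
```

The denominator is exposed users, not all users. That is why exposure logging matters as much as metric logging.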
Add an LRU cache in the assignment service to deduplicate exposure logs. A user assigned to an experiment should generate one exposure log per session, not one per page request. Uber's in-process cache eliminates 80% of redundant writes.
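A minimal sketch of that deduplication, assuming a session ID is available at assignment time. `ExposureDeduper` is a hypothetical helper, not Uber's implementation; the only contract is "log the first time you see this (user, experiment, session), skip the rest".

```python
from collections import OrderedDict


class ExposureDeduper:
    """Bounded LRU set of (user, experiment, session) keys already logged."""

    def __init__(self, capacity: int = 100_000):
        self._seen = OrderedDict()
        self._capacity = capacity

    def should_log(self, user_id: str, experiment_id: str, session_id: str) -> bool:
        key = (user_id, experiment_id, session_id)
        if key in self._seen:
            self._seen.move_to_end(key)  # mark as recently used
            return False
        self._seen[key] = None
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)  # evict the least recently used key
        return True


dedupe = ExposureDeduper(capacity=2)
first = dedupe.should_log("u1", "btn_color_v2", "s1")   # new key
repeat = dedupe.should_log("u1", "btn_color_v2", "s1")  # same session again
```

Eviction means an occasional duplicate slips through. That is fine: the analysis only needs each user's first exposure, so duplicates cost storage, not correctness.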
The ad click aggregator design covers the same Kafka-to-warehouse pipeline under comparable write volumes.
Two Statistics Gotchas That Will Cost You the Offer
Interviewers do not expect you to derive the t-test. They do want to see you understand the two failure modes.
The peeking problem. If you check your p-value repeatedly as data accumulates and stop the experiment when p drops below 0.05, your actual false positive rate inflates to 30-40%. The 5% threshold only holds if you read the result exactly once at a pre-specified sample size. Experimenters always peek. You will too. Everyone does. The fix is sequential testing (mSPRT), which gives always-valid inference regardless of when you stop. Optimizely, Uber, and Netflix all switched to sequential testing for this reason.
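You can watch peeking break things with a small A/A simulation, where there is no real effect and every "significant" result is a false positive. The sketch below compares three stopping rules on the same data: a z-test read once at the end, a z-test checked at every look, and an mSPRT with a Gaussian mixing prior. The assumptions are loud: unit-variance Gaussian metrics with known variance, 20 looks, and a mixing variance `TAU2 = 1.0` picked arbitrarily. The exact rates depend on how often you look, so treat the numbers as directional.

```python
import math
import random

ALPHA = 0.05
Z_CRIT = 1.959964  # two-sided critical value for ALPHA = 0.05
TAU2 = 1.0         # mSPRT mixing variance; an arbitrary choice for this sketch


def run_aa_test(rng, n_per_arm=1_000, peek_every=50):
    """Simulate one A/A test; return (fixed, peeking, msprt) rejection flags."""
    sum_diff = 0.0
    z = 0.0
    peeking = msprt = False
    for n in range(1, n_per_arm + 1):
        sum_diff += rng.gauss(0, 1) - rng.gauss(0, 1)  # treatment - control
        if n % peek_every:
            continue  # only evaluate at scheduled looks
        mean_diff = sum_diff / n
        z = mean_diff / math.sqrt(2 / n)  # variance of the mean difference is 2/n
        peeking = peeking or abs(z) > Z_CRIT
        # Mixture likelihood ratio for a N(0, TAU2) prior on the true effect.
        lam = math.sqrt(2 / (2 + n * TAU2)) * math.exp(
            n * n * TAU2 * mean_diff ** 2 / (4 * (2 + n * TAU2))
        )
        msprt = msprt or lam >= 1 / ALPHA
    fixed = abs(z) > Z_CRIT  # z from the final look only
    return fixed, peeking, msprt


rng = random.Random(7)
results = [run_aa_test(rng) for _ in range(400)]
fixed_fpr, peeking_fpr, msprt_fpr = (
    sum(flags) / len(results) for flags in zip(*results)
)
```

The fixed-horizon rate should hover near 5%, the peeking rate should sit several times higher, and the mSPRT rate should stay at or below 5% no matter how many looks you add.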
Sample ratio mismatch (SRM). Your experiment is configured for a 50/50 split. The observed split, across hundreds of thousands of users, is 52/48. Something broke: a bug in the bucketing code, bot traffic inflating one variant, a redirect losing users in transit. The result is invalid and you should not read it. Check SRM with a chi-squared goodness-of-fit test against the expected proportions, using a threshold of p < 0.0005. Microsoft finds SRM in roughly 6% of their tests. Six percent. At Microsoft. These are people who know what they're doing. It should be the first thing your dashboard checks.
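The SRM check is a few lines of arithmetic. This sketch handles the two-variant case only, where the chi-squared statistic has one degree of freedom and the p-value has a closed form via `math.erfc`. The function names are hypothetical; the 0.0005 threshold is the one quoted above.

```python
import math

SRM_THRESHOLD = 0.0005


def srm_p_value(control: int, treatment: int, expected_control: float = 0.5) -> float:
    """Chi-squared goodness-of-fit p-value for a two-variant split."""
    total = control + treatment
    exp_c = total * expected_control
    exp_t = total - exp_c
    chi2 = (control - exp_c) ** 2 / exp_c + (treatment - exp_t) ** 2 / exp_t
    # With one degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def has_srm(control: int, treatment: int, expected_control: float = 0.5) -> bool:
    return srm_p_value(control, treatment, expected_control) < SRM_THRESHOLD


big_skew = has_srm(52_000, 48_000)  # 52/48 on 100,000 users
small_n = has_srm(52, 48)           # 52/48 on 100 users
```

Same ratio, different verdicts: 52/48 on a hundred users is noise, while 52/48 on a hundred thousand is a bug. The test is what tells them apart, so run it instead of eyeballing the split.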
The Data Model Is Simpler Than You Think
The core tables for the experiment management and results planes:
    experiments (
      experiment_id   UUID primary key,
      name            VARCHAR,
      layer_id        UUID,
      status          ENUM(draft, running, paused, concluded),
      traffic_pct     DECIMAL(5,2),  -- percent of layer traffic this experiment uses
      start_time      TIMESTAMPTZ,
      end_time        TIMESTAMPTZ,
      primary_metric  VARCHAR
    )

    variants (
      variant_id      UUID primary key,
      experiment_id   UUID references experiments,
      key             VARCHAR,       -- "control", "treatment_a"
      is_control      BOOLEAN,
      weight          DECIMAL(5,2),  -- relative traffic weight within experiment
      config          JSONB          -- arbitrary key-value overrides to serve
    )
The assignment service reads from a Redis-cached copy of these tables (TTL 60 seconds), invalidated via pub/sub on every config write. In production, Eppo, LaunchDarkly, and Statsig go further: they push the full ruleset to a CDN so SDKs poll the nearest edge node and never hit the origin for config reads.
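The TTL-plus-invalidation behavior can be sketched as a small in-process snapshot. Everything here is an assumption made for illustration: `ConfigSnapshot` is a hypothetical class, `fetch` stands for whatever loads the config (Redis, a CDN, an HTTP call), and `invalidate` is what a pub/sub handler would call on a config write.

```python
import time
from typing import Callable


class ConfigSnapshot:
    """In-process experiment config, reloaded at most once per TTL."""

    def __init__(self, fetch: Callable[[], dict], ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._config: dict = {}
        self._expires_at = float("-inf")  # forces a load on first read

    def get(self) -> dict:
        now = self._clock()
        if now >= self._expires_at:
            self._config = self._fetch()
            self._expires_at = now + self._ttl
        return self._config

    def invalidate(self) -> None:
        """Force a reload on the next read, e.g. from a pub/sub callback."""
        self._expires_at = float("-inf")


def load_config() -> dict:
    # Stand-in loader; a real one would read from the config service.
    return {"btn_color_v2": {"status": "running", "traffic_pct": 50.0}}


snapshot = ConfigSnapshot(load_config)
config = snapshot.get()
```

Assignment reads `snapshot.get()` on every request and only pays for a fetch once per minute or after an invalidation.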
How to Spend 45 Minutes

Spend the most time on deep dives. That's where interviewers separate strong hire from hire.
Five minutes on requirements. Eight minutes on data model and the key API endpoints (assignment, event ingestion, experiment CRUD, results). Twelve minutes on the high-level architecture. Fifteen minutes on one or two deep dives. Five minutes on tradeoffs.
Two moments separate the candidates who understand this problem from those who are reciting it: why hashing replaces a database for assignment, and what the layer abstraction actually buys you. Nail those. Everything else is execution. The system design interview tips guide covers the general framework.
The Tradeoffs to Raise Unprompted
Client-side vs server-side assignment. A JavaScript SDK evaluates assignments in the browser, which causes a visible flicker: the page renders with the default variant before the SDK loads and swaps it. You can feel it. Users notice. Server-side assignment happens before the response is rendered, so there is no flicker. The cost is that adding a server-side experiment requires a backend deployment. Most serious platforms now default to server-side evaluation.
Holdout groups. A permanently excluded 1-2% of traffic that never receives any new features, maintained across all experiments. The goal is measuring cumulative lift: if you shipped 50 features this quarter and conversion rate grew 10%, holdouts tell you how much of that growth the features actually caused. Without holdouts, the attribution is noise. (The answer is usually "less than you think.") The cost is real: holdout users never see your product improvements.
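Mechanically, a holdout is one more hash check that runs before any experiment logic. A minimal sketch, assuming a 2% holdout: the salt string, `in_holdout`, and `variant_for` are hypothetical names, and returning `None` here means "not enrolled, serve the baseline experience".

```python
import hashlib
from typing import Optional

NUM_BUCKETS = 10_000
HOLDOUT_SALT = "global-holdout"  # hypothetical; rotating it redraws the group
HOLDOUT_BUCKETS = 200            # 2% of traffic


def _bucket(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_BUCKETS


def in_holdout(user_id: str) -> bool:
    """The salt excludes the experiment ID, so membership is the same everywhere."""
    return _bucket(f"{HOLDOUT_SALT}/{user_id}") < HOLDOUT_BUCKETS


def variant_for(user_id: str, experiment_id: str) -> Optional[str]:
    if in_holdout(user_id):
        return None  # never enrolled in anything
    bucket = _bucket(f"{user_id}/{experiment_id}")
    return "control" if bucket < 5_000 else "treatment"


users = [f"user-{i}" for i in range(50_000)]
holdout_share = sum(in_holdout(u) for u in users) / len(users)
```

Cumulative lift is then the metric gap between the holdout and everyone else at the end of the quarter.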
Network effect violations. Standard A/B testing assumes one user's treatment does not affect another user's outcome. This breaks in marketplaces and social networks. If treatment users see lower prices, they draw down supply, making control users' prices higher. Treatment looks better than it is. Your experiment has accidentally run a market. The fixes are cluster randomization (assign entire geographic regions to treatment or control) or switchback experiments (time-based switching where the entire network moves between variants simultaneously). Both sacrifice statistical power for valid causal inference.
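Both fixes change what you hash, not how. A hedged sketch with hypothetical function names and a one-hour switchback window chosen arbitrarily: cluster randomization keys the hash on the region, and a switchback keys it on the time window, so everyone in the unit shares a variant.

```python
import hashlib


def _bucket(key: str, num_buckets: int = 10_000) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % num_buckets


def cluster_variant(region_id: str, experiment_id: str) -> str:
    """Cluster randomization: every user in a region gets the same variant."""
    bucket = _bucket(f"{experiment_id}/region/{region_id}")
    return "treatment" if bucket >= 5_000 else "control"


def switchback_variant(experiment_id: str, epoch_seconds: int,
                       window_seconds: int = 3_600) -> str:
    """Switchback: the whole network shares one variant per time window."""
    window = epoch_seconds // window_seconds
    bucket = _bucket(f"{experiment_id}/window/{window}")
    return "treatment" if bucket >= 5_000 else "control"


region_choice = cluster_variant("sf-bay-area", "surge_pricing_v3")
hour_choice = switchback_variant("surge_pricing_v3", epoch_seconds=7_200)
```

The unit of analysis changes with the unit of randomization: you now have as many samples as regions or windows, not users. That is where the statistical power goes.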
Recap
- Assignment is deterministic hashing, not a database lookup. No storage, no state, no network call.
- Layers are orthogonal traffic partitions that let you run thousands of experiments without statistical interference.
- Separate the hot path (assignment, sub-millisecond) from the cold path (analytics, minutes-to-hours). They have nothing in common.
- Kafka buffers event writes. The warehouse runs the analysis. Direct synchronous writes to a database on the critical path do not scale.
- Check for sample ratio mismatch before reading any result. Use sequential testing to allow early stopping without inflating false positives.
Neither the hashing insight nor the layer abstraction is obvious on first read. Together, they determine whether you walk out with a strong hire.
Further Reading
- Overlapping Experiment Infrastructure: More, Better, Faster Experimentation. The original Google KDD 2010 paper on layers and orthogonal traffic partitions.
- Making Uber's Experiment Evaluation Engine 100x Faster. Local in-process evaluation vs RPC, with real latency numbers.
- Choosing a Sequential Testing Framework. Spotify's comparison of mSPRT vs group sequential tests.
- Diagnosing Sample Ratio Mismatch in A/B Testing. Microsoft Research on SRM detection and root causes.
- How Bucketing Works in Optimizely Feature Experimentation. The 10,000-bucket MurmurHash implementation.
- Reimagining Experimentation Analysis at Netflix. Netflix's warehouse-native approach to analysis at scale.