A/B Testing System Design Interview: The 45-Minute Walkthrough

Everyone walks into the A/B testing system design interview thinking the hard part is statistics. It is not. The hard part is assigning a billion users to experiments without touching a database, and running a thousand simultaneous experiments without them contaminating each other. Get those two right and the rest is plumbing. On the statistics, interviewers mostly just want to see that you've heard of a t-test.

Scope the Interview Before You Draw Anything

Five minutes up front saves fifteen later. A complete A/B testing platform needs four things:

Experiment management: create experiments, define variants, configure traffic splits, target populations
User assignment: given a user and an experiment, return a variant deterministically every time
Event tracking: log exposures and metric events (purchases, clicks, conversions)
Statistical analysis: compute significance, confidence intervals, guardrail metrics

The assignment path needs sub-millisecond latency at 1 million QPS. The analysis pipeline can lag by hours. That gap is the biggest architectural constraint in the system. Everything else flows from it.

Every Platform Has Five Components. Draw All Five.

[Figure: Five-box A/B testing system architecture with hot path (assignment service) and cold path (analytics pipeline) clearly separated]

Assignment and analytics never share infrastructure. That separation is the whole design.

Every A/B testing platform has five pieces:

Experiment config service: stores active experiment definitions. Low write frequency, aggressively cached at the edge.
Assignment service: the hot path. Given a user ID and an experiment, returns a variant. Must be fast enough to call on every request.
Event ingestion pipeline: Kafka queue that buffers exposure and metric events, drains to a data warehouse asynchronously.
Analytics engine: scheduled jobs that run statistical computations against the warehouse and write results to a read-optimized store.
Results API and dashboard: read-heavy, tolerates data that is a few minutes stale.

The key architectural decision is separating the hot path from the cold path. Assignment is synchronous, on the critical path of every user request. Analytics is asynchronous and runs on a schedule. They should not share infrastructure.

Why Your Database Can't Handle Assignment

The naive approach: store each user's variant in a database. On every request, look it up.

This is the wrong answer. A billion users times a thousand experiments is a trillion rows; at 40 bytes per row, that is 40 terabytes before indexes and replication. At 1 million QPS, that database sits under every user-facing request in your system as a bottleneck. Congratulations, you've turned a feature flag lookup into a single point of failure.

The actual approach is deterministic hashing. No storage, no network, no state.

import hashlib

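def assign_variant(user_id: str, experiment_id: str, num_buckets: int = 10000) -> int:
    """Return a stable bucket in [0, num_buckets) for this user and experiment."""
    # Including experiment_id gives each experiment its own shuffle of users.
    key = f"{user_id}/{experiment_id}"
    # MD5 is used here for its uniform spread, not for security.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_buckets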

Map buckets to variants: buckets 0-4999 go to control, 5000-9999 go to treatment (for a 50/50 split). The same user always lands in the same bucket for the same experiment because the hash inputs are deterministic. Stickiness is free.
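The bucket-to-variant step can be sketched as a walk over cumulative bucket counts. This is a minimal illustration, not a reference implementation: `bucket_for`, `variant_for_bucket`, and `FIFTY_FIFTY` are hypothetical names, and the split table is assumed to sum to the bucket total.

```python
import hashlib

def bucket_for(user_id: str, experiment_id: str, num_buckets: int = 10000) -> int:
    """Same hashing scheme as assign_variant above, repeated so this runs alone."""
    key = f"{user_id}/{experiment_id}"
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % num_buckets

def variant_for_bucket(bucket: int, splits: list) -> str:
    """Map a bucket to a variant name.

    splits is a list of (variant_name, bucket_count) pairs (hypothetical shape)
    whose counts are assumed to sum to the total number of buckets.
    """
    upper = 0
    for name, size in splits:
        upper += size
        if bucket < upper:
            return name
    raise ValueError(f"bucket {bucket} is outside the configured splits")

# 50/50 split: buckets 0-4999 are control, 5000-9999 are treatment.
FIFTY_FIFTY = [("control", 5000), ("treatment", 5000)]

variant = variant_for_bucket(bucket_for("user-42", "btn_color_v2"), FIFTY_FIFTY)
```

Changing the split to 90/10 is a config change to the table, not a code change, which is one reason to hash into many fine-grained buckets rather than directly into variants.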

[Figure: Deterministic hashing pipeline: user_id and experiment_id feed into MD5 hash, output is modded by 10000 to produce bucket 6231, which maps to treatment in the variant range table]

Same inputs, same bucket, every time. No writes. No reads. No network.

The experiment_id is critical. Include it in the hash input so the same user lands in different buckets across different experiments. Without it, every experiment assigns the same 50% of your users to treatment. Those users are correlated across every test you run. Your results are garbage and you don't know it yet.
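A quick way to see the correlation problem is to compare treatment groups with and without the experiment in the hash key. The sketch below is illustrative only: the user IDs and the experiment names `exp_a` and `exp_b` are made up, and `in_treatment` is a hypothetical helper.

```python
import hashlib

def in_treatment(user_id: str, salt: str) -> bool:
    """50/50 split: the upper half of 10000 buckets is treatment."""
    digest = hashlib.md5(f"{user_id}/{salt}".encode()).hexdigest()
    return int(digest, 16) % 10000 >= 5000

users = [f"user-{i}" for i in range(20000)]

# Broken: the key ignores the experiment, so both tests pick the same users.
shared_a = {u for u in users if in_treatment(u, "")}
shared_b = {u for u in users if in_treatment(u, "")}

# Correct: each experiment_id changes the key, so each test reshuffles users.
salted_a = {u for u in users if in_treatment(u, "exp_a")}
salted_b = {u for u in users if in_treatment(u, "exp_b")}

# Share of experiment A's treatment group that is also in B's treatment group.
overlap_shared = len(shared_a & shared_b) / len(shared_a)  # identical groups
overlap_salted = len(salted_a & salted_b) / len(salted_a)  # close to one half
```

With independent assignments, about half of A's treatment users should land in B's treatment group, which is what lets the effect of B average out when you analyze A.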

Modern SDKs pull the full experiment config once and evaluate assignments entirely in-process. No RPC. No network hop. Uber moved from an RPC-based assignment service to local in-process evaluation and dropped p99 latency from 10ms to 100 microseconds. A 100x improvement from eliminating one network call. One network call. That's all it was.
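In-process evaluation might look roughly like the following. This is a sketch under assumptions: `ExperimentConfig`, `LocalEvaluator`, and `replace_snapshot` are hypothetical names rather than any vendor's SDK API, and the config is reduced to a single control/treatment split.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class ExperimentConfig:
    experiment_id: str
    control_buckets: int  # buckets [0, control_buckets) are control, out of 10000

class LocalEvaluator:
    """Holds a config snapshot in memory; assignment performs no I/O."""

    def __init__(self, configs):
        self._configs = {c.experiment_id: c for c in configs}

    def replace_snapshot(self, configs):
        # Would be called by a background poller (not shown) when config changes.
        self._configs = {c.experiment_id: c for c in configs}

    def variant(self, user_id: str, experiment_id: str) -> str:
        cfg = self._configs.get(experiment_id)
        if cfg is None:
            return "control"  # unknown experiment: serve the default experience
        key = f"{user_id}/{experiment_id}"
        bucket = int(hashlib.md5(key.encode()).hexdigest(), 16) % 10000
        return "control" if bucket < cfg.control_buckets else "treatment"

evaluator = LocalEvaluator([ExperimentConfig("btn_color_v2", control_buckets=5000)])
choice = evaluator.variant("user-42", "btn_color_v2")
```

The request path only touches a dictionary and a hash function; the network is involved only when the snapshot is refreshed.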

See the consistent hashing guide for the full mechanics.

Running a Thousand Experiments Without Contamination

When 50 experiments all touch conversion rate, they can contaminate each other's results if users overlap across tests. The naive fix is to give each experiment its own exclusive traffic slice, but that runs out fast. A thousand experiments at 1% each needs 10x your total traffic. Physics, unfortunately.

Google's solution, published at KDD 2010: layers.

A layer is an independent traffic partition with its own hash salt. Experiments in different layers are orthogonal: a user can participate in one experiment per layer simultaneously, but the assignments are statistically independent because each layer uses a different salt. Experiments in the same layer are mutually exclusive: a user is in at most one experiment per layer.

[Figure: Three horizontal layers (Search Ranking, UI, Notifications) each spanning 100% of traffic with independent partitioning. A user simultaneously participates in one experiment from each layer with statistically independent assignments.]

One user, three experiments, zero contamination. Independent hash salts do the work.

You organize layers by product surface. The search ranking team gets one layer. The UI team gets another. Notifications gets a third. Teams run experiments in parallel within their layer without affecting any other team's results.
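One way to sketch layers in code: hash the user with the layer's salt to get a layer bucket, then let each experiment claim a range of buckets within that layer. The names `Slot`, `Layer`, and `experiment_in_layer`, and the experiments `rank_v2` and `rank_v3`, are hypothetical; real platforms add targeting rules and variant selection on top.

```python
import hashlib
from dataclasses import dataclass

NUM_BUCKETS = 10000

def layer_bucket(salt: str, user_id: str) -> int:
    """Bucket a user within one layer; the salt makes layers independent."""
    key = f"{salt}/{user_id}"
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_BUCKETS

@dataclass(frozen=True)
class Slot:
    experiment_id: str
    start: int  # inclusive layer bucket
    end: int    # exclusive layer bucket

@dataclass(frozen=True)
class Layer:
    salt: str
    slots: tuple  # non-overlapping Slot ranges

def experiment_in_layer(user_id: str, layer: Layer):
    """Return the one experiment this user falls into in the layer, or None."""
    bucket = layer_bucket(layer.salt, user_id)
    for slot in layer.slots:
        if slot.start <= bucket < slot.end:
            return slot.experiment_id
    return None  # this bucket is not claimed by any experiment

def assignments(user_id: str, layers) -> dict:
    """At most one experiment per layer, each layer hashed independently."""
    return {layer.salt: experiment_in_layer(user_id, layer) for layer in layers}

search = Layer("search_ranking", (Slot("rank_v2", 0, 5000), Slot("rank_v3", 5000, 8000)))
ui = Layer("ui", (Slot("btn_color_v2", 0, 10000),))

result = assignments("user-42", [search, ui])
```

Non-overlapping ranges give mutual exclusion inside a layer; distinct salts give independence across layers.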

LinkedIn runs over 40,000 concurrent tests against 700 million members using this pattern. (Yes, 40,000. Everything on that site is an experiment.)

Don't Block the Request

When a user is assigned to a variant, you want to record an exposure event. When they convert, you want to record a metric event. The obvious implementation writes both synchronously to a database. This adds 5-10ms to every user request. Every. Single. One.

Do not write synchronously. Write to Kafka, return immediately, and let a background consumer drain events to the warehouse.

[Figure: Event ingestion pipeline: SDK assigns variant and fires exposure event to Kafka asynchronously in under 1ms. User conversion fires to a separate metric-events topic. Kafka consumer bulk-loads both into the data warehouse in 1-minute micro-batches.]

The critical path returns before the event even hits Kafka. That's the whole point.
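The fire-and-forget pattern can be sketched with an in-memory queue and a background thread. This is a stand-in, not a Kafka client: `AsyncEventLogger` and its `sink` callable are hypothetical, and a real producer would batch, retry, and handle sink failures.

```python
import queue
import threading

class AsyncEventLogger:
    """In-memory stand-in for an asynchronous producer: enqueue and return."""

    def __init__(self, sink, max_pending: int = 10000):
        self._queue = queue.Queue(maxsize=max_pending)
        self._sink = sink  # callable taking one event dict (hypothetical)
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def log(self, event: dict) -> bool:
        """Never blocks the request; drops the event if the buffer is full."""
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            return False

    def _drain(self):
        # Background loop: the slow write happens here, off the request path.
        while True:
            event = self._queue.get()
            try:
                self._sink(event)
            finally:
                self._queue.task_done()

    def flush(self):
        """Block until everything queued so far has been handed to the sink."""
        self._queue.join()

received = []
logger = AsyncEventLogger(received.append)
logger.log({"type": "exposure", "user_id": "user-42", "experiment_id": "btn_color_v2"})
logger.flush()
```

Dropping events when the buffer is full is a deliberate choice in this sketch: losing a small share of analytics events is usually preferable to adding latency to user requests.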

The warehouse schema is intentionally simple:

-- Exposure events (append-only, partitioned by date)
user_id         VARCHAR
experiment_id   VARCHAR
variant_key     VARCHAR
exposed_at      TIMESTAMPTZ
platform        VARCHAR
country         VARCHAR

-- Metric events (append-only, partitioned by date)
user_id         VARCHAR
event_name      VARCHAR
event_value     DECIMAL
occurred_at     TIMESTAMPTZ

Analysis joins on user_id: find all users exposed to experiment btn_color_v2, then check whether those users triggered purchase_completed afterward. Columnar warehouse storage and partition pruning handle this join at billion-row scale.
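The join logic, written out in plain Python over small in-memory rows, might look like this. It is a sketch of the logic only; at warehouse scale the same thing would be a SQL join. `conversion_by_variant` is a hypothetical name, rows are assumed to be dicts with the column names from the schema above, and timestamps are assumed to be comparable values.

```python
from collections import defaultdict

def conversion_by_variant(exposures, metric_events, experiment_id, event_name):
    """Share of exposed users per variant who fired event_name after exposure."""
    # Keep each user's first exposure to this experiment.
    first_exposure = {}
    for row in exposures:
        if row["experiment_id"] != experiment_id:
            continue
        prev = first_exposure.get(row["user_id"])
        if prev is None or row["exposed_at"] < prev["exposed_at"]:
            first_exposure[row["user_id"]] = row

    # Users who fired the event at or after their first exposure.
    converted = set()
    for event in metric_events:
        exposure = first_exposure.get(event["user_id"])
        if (exposure is not None
                and event["event_name"] == event_name
                and event["occurred_at"] >= exposure["exposed_at"]):
            converted.add(event["user_id"])

    exposed_count = defaultdict(int)
    converted_count = defaultdict(int)
    for user_id, row in first_exposure.items():
        exposed_count[row["variant_key"]] += 1
        converted_count[row["variant_key"]] += int(user_id in converted)
    return {v: converted_count[v] / exposed_count[v] for v in exposed_count}

exposures = [
    {"user_id": "u1", "experiment_id": "btn_color_v2", "variant_key": "control", "exposed_at": 10},
    {"user_id": "u2", "experiment_id": "btn_color_v2", "variant_key": "treatment", "exposed_at": 10},
    {"user_id": "u3", "experiment_id": "btn_color_v2", "variant_key": "treatment", "exposed_at": 10},
]
metric_events = [
    {"user_id": "u1", "event_name": "purchase_completed", "event_value": 20, "occurred_at": 5},
    {"user_id": "u2", "event_name": "purchase_completed", "event_value": 35, "occurred_at": 20},
    {"user_id": "u3", "event_name": "page_view", "event_value": 0, "occurred_at": 20},
]
rates = conversion_by_variant(exposures, metric_events, "btn_color_v2", "purchase_completed")
```

Note that u1's purchase happened before exposure and is not counted; attributing pre-exposure events to a variant is a common source of inflated results.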

Add an LRU cache in the assignment service to deduplicate exposure logs. A user assigned to an experiment should generate one exposure log per session, not one per page request. Uber's in-process cache eliminates 80% of redundant writes.
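A minimal sketch of that deduplication, assuming a session identifier is available to the caller. `ExposureDeduper` and `should_log` are hypothetical names, and the capacity is an arbitrary illustrative default.

```python
from collections import OrderedDict

class ExposureDeduper:
    """Bounded LRU set of (user, experiment, session) keys already logged."""

    def __init__(self, capacity: int = 100_000):
        self._seen = OrderedDict()
        self._capacity = capacity

    def should_log(self, user_id: str, experiment_id: str, session_id: str) -> bool:
        key = (user_id, experiment_id, session_id)
        if key in self._seen:
            self._seen.move_to_end(key)  # mark as recently used
            return False                 # already logged for this session
        self._seen[key] = True
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)  # evict the least recently used key
        return True

deduper = ExposureDeduper(capacity=2)
first = deduper.should_log("user-42", "btn_color_v2", "session-1")
repeat = deduper.should_log("user-42", "btn_color_v2", "session-1")
```

An evicted key produces a duplicate exposure log later, which is harmless as long as the analysis counts each user once per experiment.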

The ad click aggregator design covers the same Kafka-to-warehouse pipeline under comparable write volumes.

Two Statistics Gotchas That Will Cost You the Offer

Interviewers do not expect you to derive the t-test. They do want to see you understand the two failure modes.

The peeking problem. If you check your p-value repeatedly as data accumulates and stop the experiment the first time p drops below 0.05, your actual false positive rate climbs well above 5%, reaching 30-40% if you look often enough. The 5% threshold only holds if you read the result exactly once at a pre-specified sample size. Experimenters always peek. You will too. Everyone does. The fix is sequential testing, such as the mixture sequential probability ratio test (mSPRT), which gives always-valid inference regardless of when you stop. Optimizely, Uber, and Netflix all switched to sequential testing for this reason.
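The inflation is easy to reproduce with an A/A simulation, where there is no real effect and every rejection is a false positive. The sketch below is a simplified model, not a production test: it assumes unit-variance normal observations and a plain z-test, and `false_positive_rates` with its default parameters is a hypothetical setup with 20 evenly spaced looks.

```python
import math
import random

def false_positive_rates(num_sims=1000, n=500, peek_every=25, z_crit=1.96, seed=0):
    """Compare stopping at the first significant peek against one final look."""
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(num_sims):
        total = 0.0
        rejected_at_a_peek = False
        for i in range(1, n + 1):
            # Per-observation treatment-minus-control difference; true mean is 0.
            total += rng.gauss(0.0, 1.0)
            if i % peek_every == 0 and abs(total / math.sqrt(i)) > z_crit:
                rejected_at_a_peek = True  # a peeker would have stopped here
        peeking_hits += int(rejected_at_a_peek)
        # The fixed-horizon analyst looks exactly once, at the final sample size.
        fixed_hits += int(abs(total / math.sqrt(n)) > z_crit)
    return peeking_hits / num_sims, fixed_hits / num_sims

peeking_rate, fixed_rate = false_positive_rates()
```

In this setup the fixed-horizon rate should sit near the nominal 5%, while the peeking rate should come out several times higher; more frequent looks push it higher still.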

Sample ratio mismatch (SRM). Your experiment is configured for a 50/50 split. The actual observed split is 52/48. On a sample of any real size, a gap that large is not chance. Something broke: a bug in the bucketing code, bot traffic inflating one variant, a redirect losing users in transit. The result is invalid and you should not read it. Check SRM with a chi-squared goodness-of-fit test against the expected proportions, using a threshold of p < 0.0005. Microsoft finds SRM in roughly 6% of their tests. Six percent. At Microsoft. These are people who know what they're doing. It should be the first thing your dashboard checks.
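For a two-variant experiment the SRM check fits in a few lines of standard-library Python. This is a sketch for the two-arm case only: `srm_p_value` and `has_srm` are hypothetical names, and the closed form used for the p-value holds only with one degree of freedom.

```python
import math

def srm_p_value(control_count: int, treatment_count: int,
                expected_control_share: float = 0.5) -> float:
    """Chi-squared goodness-of-fit p-value for a two-variant split."""
    total = control_count + treatment_count
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control_count - expected_control) ** 2 / expected_control
            + (treatment_count - expected_treatment) ** 2 / expected_treatment)
    # Two variants means 1 degree of freedom, where the chi-squared
    # survival function reduces to erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

def has_srm(control_count: int, treatment_count: int,
            expected_control_share: float = 0.5, threshold: float = 0.0005) -> bool:
    return srm_p_value(control_count, treatment_count, expected_control_share) < threshold

large_sample_flagged = has_srm(52_000, 48_000)  # 52/48 on 100,000 users
small_sample_flagged = has_srm(52, 48)          # 52/48 on 100 users
```

The same 52/48 ratio is flagged on 100,000 users and not on 100, which is why the check is a statistical test rather than a fixed tolerance on the ratio.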

The Data Model Is Simpler Than You Think

The core tables for the experiment management and results planes:

experiments (
  experiment_id  UUID primary key,
  name           VARCHAR,
  layer_id       UUID,
  status         ENUM(draft, running, paused, concluded),
  traffic_pct    DECIMAL(5,2),   -- percent of layer traffic this experiment uses
  start_time     TIMESTAMPTZ,
  end_time       TIMESTAMPTZ,
  primary_metric VARCHAR
)

variants (
  variant_id     UUID primary key,
  experiment_id  UUID references experiments,
  key            VARCHAR,        -- "control", "treatment_a"
  is_control     BOOLEAN,
  weight         DECIMAL(5,2),   -- relative traffic weight within experiment
  config         JSONB           -- arbitrary key-value overrides to serve
)

The assignment service reads from a Redis-cached copy of this table (TTL 60 seconds), invalidated via pub/sub on every config write. In production, Eppo, LaunchDarkly, and Statsig go further: they push the full ruleset to a CDN so SDKs poll the nearest edge node and never hit the origin for config reads.
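The TTL-plus-invalidation behavior can be sketched without Redis as a small in-process cache. `ConfigCache`, `fetch`, and the injectable `clock` are hypothetical names for illustration; the `invalidate` method is where a pub/sub message handler would hook in.

```python
import time

class ConfigCache:
    """Caches the full experiment config; refetches on expiry or invalidation."""

    def __init__(self, fetch, ttl_seconds: float = 60.0, clock=time.monotonic):
        self._fetch = fetch            # callable returning the config (hypothetical origin read)
        self._ttl = ttl_seconds
        self._clock = clock            # injectable so tests can control time
        self._value = None
        self._expires_at = float("-inf")

    def get(self):
        if self._clock() >= self._expires_at:
            self._value = self._fetch()
            self._expires_at = self._clock() + self._ttl
        return self._value

    def invalidate(self):
        """Call when a config write is announced; the next get() refetches."""
        self._expires_at = float("-inf")

fetch_count = [0]

def fetch_config():
    fetch_count[0] += 1
    return {"version": fetch_count[0]}

fake_now = [0.0]
cache = ConfigCache(fetch_config, ttl_seconds=60.0, clock=lambda: fake_now[0])
config = cache.get()
```

The TTL bounds staleness if an invalidation message is lost; the invalidation keeps the common case fast.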

How to Spend 45 Minutes

[Figure: 45-minute interview timeline divided into five segments: Requirements (0-5), Data model and API (5-13), High-level architecture (13-25), Deep dives (25-40), Tradeoffs (40-45)]

Spend the most time on deep dives. That's where interviewers separate strong hire from hire.

Five minutes on requirements. Eight minutes on data model and the key API endpoints (assignment, event ingestion, experiment CRUD, results). Twelve minutes on the high-level architecture. Fifteen minutes on one or two deep dives. Five minutes on tradeoffs.

Two moments separate the candidates who understand this problem from those who are reciting it: why hashing replaces a database for assignment, and what the layer abstraction actually buys you. Nail those. Everything else is execution. The system design interview tips guide covers the general framework.

The Tradeoffs to Raise Unprompted

Client-side vs server-side assignment. A JavaScript SDK evaluates assignments in the browser, which causes a visible flicker: the page renders with the default variant before the SDK loads and swaps it. You can feel it. Users notice. Server-side assignment happens before the response is rendered, so there is no flicker. The cost is that adding a server-side experiment requires a backend deployment. Most serious platforms now default to server-side evaluation.

Holdout groups. A permanently excluded 1-2% of traffic that never receives any new features, maintained across all experiments. The goal is measuring cumulative lift: if you shipped 50 features this quarter and conversion rate grew 10%, holdouts tell you how much of that growth the features actually caused. Without holdouts, the attribution is noise. (The answer is usually "less than you think.") The cost is real: holdout users never see your product improvements.
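A holdout can reuse the same deterministic hashing, with one salt that stays fixed across all experiments. This is a sketch under assumptions: `HOLDOUT_SALT`, `in_holdout`, and `eligible_for_experiments` are hypothetical names, and the 2% default is taken from the range above.

```python
import hashlib

HOLDOUT_SALT = "global_holdout_v1"  # hypothetical; must stay fixed for the holdout's lifetime

def in_holdout(user_id: str, holdout_pct: float = 2.0, salt: str = HOLDOUT_SALT) -> bool:
    """True for a stable slice of users that never receives new features."""
    bucket = int(hashlib.md5(f"{salt}/{user_id}".encode()).hexdigest(), 16) % 10000
    return bucket < holdout_pct * 100  # 2.0% maps to buckets 0-199

def eligible_for_experiments(user_id: str) -> bool:
    # Check the holdout first, before any layer or experiment assignment.
    return not in_holdout(user_id)

holdout_share = sum(in_holdout(f"user-{i}") for i in range(50000)) / 50000
```

Because the salt does not depend on any experiment, the same users stay held out no matter how many experiments start or finish.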

Network effect violations. Standard A/B testing assumes one user's treatment does not affect another user's outcome. This breaks in marketplaces and social networks. If treatment users see lower prices, they draw down supply, making control users' prices higher. Treatment looks better than it is. Your experiment has accidentally run a market. The fixes are cluster randomization (assign entire geographic regions to treatment or control) or switchback experiments (time-based switching where the entire network moves between variants simultaneously). Both sacrifice statistical power for valid causal inference.
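Both fixes change the unit of randomization from the user to something larger. A minimal sketch, assuming a region identifier and a Unix timestamp are available: `cluster_variant` and `switchback_variant` are hypothetical names, and the one-hour window is an arbitrary illustrative default.

```python
import hashlib

def _coin(key: str) -> str:
    """Deterministic 50/50 choice derived from the key."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 else "control"

def cluster_variant(region_id: str, experiment_id: str) -> str:
    """Randomize by region: every user in a region shares the region's variant."""
    return _coin(f"{experiment_id}/region/{region_id}")

def switchback_variant(epoch_seconds: int, experiment_id: str,
                       window_seconds: int = 3600) -> str:
    """Randomize by time window: the whole network shares one variant per window."""
    window_index = epoch_seconds // window_seconds
    return _coin(f"{experiment_id}/window/{window_index}")

region_choice = cluster_variant("region-17", "dynamic_pricing_v1")
window_choice = switchback_variant(1_700_000_000, "dynamic_pricing_v1")
```

The loss of power follows directly from the code: the effective sample size is now the number of regions or time windows, not the number of users.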

Recap

Assignment is deterministic hashing, not a database lookup. No storage, no state, no network call.
Layers are orthogonal traffic partitions that let you run thousands of experiments without statistical interference.
Separate the hot path (assignment, sub-millisecond) from the cold path (analytics, minutes-to-hours). They have nothing in common.
Kafka buffers event writes. The warehouse runs the analysis. Direct synchronous writes to a database on the critical path do not scale.
Check for sample ratio mismatch before reading any result. Use sequential testing to allow early stopping without inflating false positives.

Neither the hashing insight nor the layer abstraction is obvious on first read. Together, they determine whether you walk out with a strong hire.
