Recommendation System Design Interview: The Complete Walkthrough

May 27, 202612 min read
interview-prepcareersystem-designalgorithms
Recommendation System Design Interview: The Complete Walkthrough
TL;DR
  • Recommendation system design interviews test infrastructure reasoning, not ML: no single model can score 100M items in 200ms.
  • The three-stage funnel (candidate generation, L1 ranking, L2 ranking, re-ranking) is the architecture every major platform uses.
  • Candidate generation runs three parallel sources: two-tower ANN search (HNSW), item-to-item collaborative filtering, and a trending fallback.
  • A feature store prevents training-serving skew, the production failure mode most candidates never mention.
  • Cold start splits into two problems: demographics-based fallback for new users, content-embedding warm start for new items.
  • Re-ranking is where diversity, freshness, policy, and exploration (epsilon-greedy) live; the scoring model should own none of these.
  • Cache pre-generated lists in Redis per user; most requests return in under 5ms and never touch the ranking pipeline.

80% of content watched on Netflix arrives through algorithmic recommendations. Amazon attributes 35% of its revenue to them. The recommendation system design interview shows up at every major tech company and trips up candidates who treat it as an ML question. It is an infrastructure question that happens to involve ML.

Most candidates sketch a single box labeled "model", draw an arrow from "users" to "recommendations", and look at the interviewer expectantly. The interviewer has seen this 40 times this week. What they actually want to see is the three-stage funnel, the feedback loop, and an honest account of where the system breaks.

Start With Numbers, Not Architecture

Before touching the whiteboard, pin down four things:

  • Catalog size. 100 million items (videos, songs, products).
  • User base. 100 million monthly active users.
  • Signal type. Implicit feedback only: watch time, skips, likes. No star ratings.
  • Latency budget. p95 recommendation response under 200ms.

The central problem is filtering 100 million candidates down to 10 personalized results in under 200ms. Write that on the board. Everything else is a consequence of that constraint.

The Math Says You Can't Use One Model

A single model scoring every item for every request at 100M items per user would take seconds, not milliseconds, even with GPU acceleration. Your users would age visibly.

You need to eliminate candidates in stages, each stage cheaper than the last. This is not a clever trick. It is physics.

This is the three-stage funnel. Draw it before explaining any component.

Three-stage recommendation funnel: 100M items filtered down to 10 personalized results in under 200ms

How YouTube does recommendations according to the internet: user gets rickrolled once, algorithm recommends 80s pop for two weeks straight

The reality is a bit more principled than this. Only a bit. (r/ProgrammerHumor)

Every major recommendation system uses this shape. YouTube described it in their 2016 RecSys paper; Pinterest uses it; TikTok uses it. The interview doesn't require you to have read those papers. It requires you to derive this shape from the latency constraint.

Candidate Generation: The Quality Ceiling

If a relevant item never makes it into the candidate set, no downstream model can surface it. Candidate generation is the ceiling on recommendation quality.

Run multiple generators in parallel and merge their outputs. The three you need to name:

Two-tower neural network. Train a model with a user tower and an item tower. Both encode their respective inputs into the same 128-dimensional embedding space. Training pulls user and item embeddings close together when the user interacted with the item. At serving time, precompute and index all item embeddings offline. The user tower runs online, generating a query vector in ~5ms. Then retrieve the 1,000 nearest item embeddings using approximate nearest neighbor (ANN) search.

Item-to-item collaborative filtering. For each item a user recently watched, find the 100 most similar items (precomputed offline using cosine similarity on co-watch patterns). Union those sets. This is the 2003 Linden et al. algorithm Amazon made famous. Fast, interpretable, no model serving required.

Trending/popularity fallback. Items ranked by engagement velocity over the last 24 hours, filtered by user demographics (country, language, device). This is the only generator that works for completely new users.

Two-tower neural network architecture: user tower encodes user features into 128-dim embedding, item tower encodes item metadata offline, ANN search finds nearest items at query time

Both towers output into the same 128-dimensional space. Similarity becomes a dot product. The offline precomputation is what makes it fast.

For ANN search, HNSW (Hierarchical Navigable Small World) is the practical choice. It organizes embeddings into a multilayer proximity graph and navigates from coarse upper layers to dense lower layers at query time. Malkov and Yashunin introduced it in 2016; it is now the default inside FAISS, Pinecone, and Weaviate. At 100M items with 128-dimensional float32 embeddings, the raw index is about 50GB, which fits in RAM across a few machines sharded by item ID range.

# Serving pseudocode: candidate generation user_vec = user_tower.predict(user_features) # ~5ms, runs online candidates = hnsw_index.search(user_vec, k=500) # ~10ms, ANN lookup item_cf_candidates = item_cf_store.get(recent_watches, k=500) trending = trending_cache.get(user.country, k=200) all_candidates = deduplicate(candidates + item_cf_candidates + trending)

Four Tables Drive Everything

users (user_id PK, country, language, device_type, created_at)

items (item_id PK, title, category, duration_seconds, tags[], uploaded_at, content_embedding vector(128))

interactions (user_id, item_id, event_type, watch_fraction float, occurred_at, session_id) Partition key: user_id. Clustering key: occurred_at DESC.

recommendations (user_id, item_id, score float, generated_at, was_clicked bool)

The interactions table is the engine of the system. At 100M users generating 50 events per day, that is about 60,000 writes per second. Use Cassandra and partition by user_id so a user's recent history is a cheap time-range scan. See Key-Value Store System Design for how to think through the storage tradeoffs at this write rate.

The recommendations table is a precomputed cache refreshed every one to four hours by a batch job. Serve it from Redis for sub-millisecond access. The was_clicked field closes the feedback loop into training data.

The Feature Store: Your Model Trains on Lies Without It

The ranking models need features from multiple sources at different freshness levels: static user profile (batch, updated daily), recent interactions (online, updated per session), and item metadata (medium freshness, updated on upload). A feature store gives both the training pipeline and the serving path a single API for these features.

Offline training:   feature_store.get_historical(user_id, cutoff_time)
Online serving:     feature_store.get_online(user_id)   # Redis, <5ms

Without a feature store, you get training-serving skew: the model trains on one computation of a feature and serves on a different one. This is one of the hardest production bugs to diagnose in a recommendation system. You will spend three days staring at metrics that don't move before you figure out what happened. The Distributed Cache System Design article covers the Redis patterns that underpin online feature serving.

Feature store architecture showing offline pipeline and online serving both reading from the same unified API to eliminate training-serving skew

The skew bug is career-defining. In a bad way. One feature store saves months of confused on-call escalations.

Ranking in Two Passes

With ~1,000 candidates from generation, two ranking stages trim the list.

L1 ranking. A gradient-boosted tree or small neural network. Uses 100-200 features: embedding dot product (user-item affinity), item popularity, recency, watch-time statistics. Scores all 1,000 candidates in under 20ms. Cheap enough to run on CPU.

L2 ranking. A deep neural network with cross-features: combinations of user and item attributes that the embedding distance cannot capture. "This user watches long documentaries in Spanish" is a cross-feature. L2 runs only on the top 200 from L1. The YouTube 2016 paper used exactly this two-stage setup, with the L2 model trained to predict expected watch time.

Training labels matter more than model architecture. Watch time predicts satisfaction better than clicks. Clicks are gameable: a misleading thumbnail maximizes clicks and destroys watch time. A weighted combination like 0.7 × watch_fraction + 0.3 × like_binary is a safe starting point.

Re-ranking: Where Business Logic Lives

The L2 score optimizes for a single user's engagement. Re-ranking adds the constraints no model should own directly.

Diversity. Penalize consecutive items from the same creator or category. A ranked list of ten nature documentaries is not a good recommendation even if the user loves nature documentaries.

Freshness. Apply a recency boost to items uploaded in the last 48 hours to prevent a rich-get-richer feedback loop where popular older content crowds out new uploads.

Policy and safety. Remove flagged content. Apply parental controls. Apply geographic licensing restrictions.

Exploration. With probability 0.05, replace one slot with a random candidate from the cold pool to collect training signal on new items. This is epsilon-greedy from reinforcement learning. Without it, the system amplifies its own biases indefinitely.

Re-ranking is deterministic rule execution. It runs in memory in under 5ms.

What the API Looks Like

GET /v1/recommendations
  ?user_id=u123
  &surface=home_feed
  &limit=20
  &cursor=eyJwYWdlIjoxfQ

200 OK
{
  "items": [
    {"item_id": "v456", "score": 0.94, "reason": "because_you_watched"},
    ...
  ],
  "next_cursor": "eyJwYWdlIjoyfQ",
  "generated_at": "2026-05-27T10:00:00Z",
  "cache_ttl_seconds": 1800
}

The surface field matters. Home feed, end-of-video autoplay, and search result injection all have different candidate pools, ranking models, and diversity rules. Model the surface as a routing key, not a filter.

Cache pre-generated recommendation lists in Redis per user with a 30-minute TTL. Invalidate on the user's next watch event. This is what keeps p95 under 200ms: most requests hit Redis and return in under 5ms, never touching the candidate generation or ranking pipeline.

Where the System Actually Breaks

ANN index memory. 100M items at 128 dims, float32 is 51GB. Shard by item ID range across 4-8 machines. Fan out the query to all shards and merge top-k from each. With 8 shards, each machine holds ~6GB and handles the merge locally.

Feature serving. Each ranking request needs features for 1,000 candidates. Issue one Redis MGET with 1,000 keys, not 1,000 individual GETs. The difference is 15ms versus 15,000ms because each network round trip is ~1ms and pipelining collapses them.

Feedback loop lag. The training pipeline consumes interaction events from Kafka, runs feature engineering in Spark, and retrains the model daily. New interaction data takes 24 hours to affect recommendations. For rapidly trending items, add a real-time signal layer: a short-term popularity counter (Redis INCR, key = item_id + hour) that the L1 ranker reads directly, bypassing the batch pipeline.

The Cold Start Problem

New users have no interaction history. New items have no co-watch data. Both break collaborative filtering. Congratulations on your user base of strangers.

New users. Fall back to popularity-based recommendations filtered by country, language, and device. If the product can support it, ask for three explicit preferences during onboarding. Three data points collapse the cold start into a warm start faster than any ML technique.

New items. Run the item through the item tower of the two-tower model using content metadata to generate an embedding. This puts the new item into the ANN index immediately. As real interactions accumulate, the collaborative signal gradually dominates the content-based embedding. This warm-start pattern means new items appear in recommendations from the first hour, not after the next training cycle.

The Honest Tradeoffs

Watch time vs. clicks as training labels. Watch time requires users to finish or partially finish an item before you get a signal. Clicks are available in milliseconds. For a new system, click labels train faster but create a sensationalism bias. Watch time labels train slower but align with user satisfaction. Most production systems use both: train initially on clicks, then shift weight to watch time as data accumulates.

Batch vs. real-time candidate generation. Precomputing all user embeddings daily is cheap and predictable. Generating them on-the-fly at request time is expensive but reflects the user's most recent 30 seconds of behavior. A common middle path: precompute daily, but recompute for users who have watched something in the last 5 minutes. See The Tradeoff Maze for how to frame push-vs-pull latency decisions in interviews.

Diversity vs. engagement. A purely engagement-maximizing ranker will converge on filter bubbles: it learns that you like a thing and then shows you only that thing. Diversity penalties in re-ranking reduce short-term engagement metrics but improve 30-day retention. Frame this in the interview as a long-term versus short-term optimization tradeoff, not a technical limitation of the model.

The 45-Minute Clock

  • 0-5: Clarify. Nail catalog size, user count, signal type, latency budget. Write the constraint on the board.
  • 5-15: Draw the three-stage funnel. Name the three candidate generators. Explain why approximate nearest neighbor is necessary.
  • 15-22: Data model. Four tables. Cassandra for interactions, Redis for the recommendation cache.
  • 22-32: Ranking. Feature store, L1 and L2, training labels, the feedback loop via Kafka. Mention shadow deployment.
  • 32-40: Bottlenecks. ANN sharding, batched feature reads, the real-time trending counter. Cover cold start in two minutes.
  • 40-45: Tradeoffs. Pick one (watch time vs. clicks) and argue it from both sides.

Resist the urge to describe model architectures in detail unless asked. The question is testing infrastructure reasoning, not ML knowledge. If you find yourself explaining attention mechanisms, you have taken a wrong turn.

Before You Leave the Room

  • The three-stage funnel (candidate generation, L1 ranking, L2 ranking) exists because no single model can score 100M items in 200ms.
  • Two-tower networks plus ANN search (HNSW) are the current standard for candidate generation.
  • The feature store prevents training-serving skew, which is the failure mode nobody talks about until they hit it.
  • Cold start requires two separate strategies: demographics-based fallback for new users, content-embedding warm start for new items.
  • Re-ranking is where business logic lives. The scoring model should not be responsible for diversity, freshness, or policy.
  • The feedback loop (interactions to Kafka to Spark to model training) is what turns a static system into one that improves over time.

If you want to practice walking through this end-to-end under time pressure, with an interviewer who asks follow-up questions and gives rubric-based feedback, SpaceComplexity runs voice-based mock system design interviews on exactly this kind of problem.

Further Reading