Design a Search Engine: The System Design Interview Walkthrough

Candidates who nail the search engine system design interview do something counterintuitive: they draw two diagrams before describing a single component. Because a search engine isn't one system. It's two systems that happen to share an index, and most candidates spend 40 minutes on the crawling half and never reach the part that actually makes or breaks the interview: how a query returns results in under 200 milliseconds against 100 billion documents.

The two systems are the offline pipeline (crawl and index) and the online pipeline (query and serve). The only connection between them is the index. Get that separation on the whiteboard in the first ten minutes and the rest of the design writes itself.

Clarify the Scope Before You Draw Anything

The question has at least three very different versions. Figure out which one you're answering before drawing anything.

Domain: General web search (Google-scale, ~100B pages) versus product or enterprise search (Elasticsearch-scale, ~100M documents)?
Freshness: How stale is acceptable? News queries want seconds. Static documentation can tolerate days.
Personalization: Does the same query return different results for different users?
Query model: Keyword search only, or natural language and semantic similarity?

For this walkthrough: general web search, no personalization, freshness in minutes for news and hours for static content, keyword search with basic query understanding.

The Numbers You Need to Anchor the Design

Working numbers for the interview: 100 billion documents indexed, 100K QPS peak, 200ms p99 target, 50 billion pages crawled per day, ~20KB average compressed page. Google's index alone runs to roughly 100 petabytes.

Put these on the whiteboard early. Reaching for back-of-envelope math before components signals that you think in systems.

Two Pipelines, One Index

Draw two horizontal swimlanes before describing any component.

Offline pipeline: crawl the web, parse HTML, build the inverted index, compute PageRank scores. Runs continuously. Feeds into the index.

Online pipeline: receive a query, look up posting lists, score documents, return the top 10 results. Reads from the index on every request.

The decoupling is the key design decision. The offline pipeline can update the index thousands of times per day without pausing the serving tier. The serving tier reads a snapshot; the indexer writes to new segments in parallel.

Two pipelines, one connection: the index. Get this on the whiteboard in the first ten minutes.

Offline: Crawling and Indexing

URL Frontier: Priority queue of URLs awaiting crawl. Priority is a function of estimated PageRank, observed update frequency, and time since last crawl. News sites recrawled every few minutes. Static tutorials: weekly.

Fetcher: Distributed HTTP workers. Respect robots.txt, per-domain crawl delays, cache DNS lookups locally. DNS takes 10-100ms. At this scale, caching it is non-negotiable.

Web crawlers, explained. (via r/ProgrammerHumor)

Deduplication: Before parsing, the fetcher computes an MD5 hash of the page body and checks it against a Bloom filter backed by a distributed key-value store. Near-duplicate pages get dropped.

Parser and Indexer: Extracts plain text, tokenizes, lowercases, removes stop words. Emits (term, doc_id, position) tuples into a distributed batch job that builds inverted index segments. Segments merge periodically into the main index.

The Inverted Index Is the Whole Design

Every other component in a search engine is logistics. The inverted index is the actual search engine.

An inverted index is a hash map from term to posting list. Each posting list is a sorted array of (doc_id, term_frequency, positions) tuples:

"python"    → [(doc_14, tf=3, pos=[12,45,89]), (doc_201, tf=1, pos=[3]), ...]
"interview" → [(doc_14, tf=2, pos=[50,60]),    (doc_419, tf=5, pos=[1,2,3,4,5]), ...]

Posting lists are sorted by doc_id, which lets the query engine intersect two lists in O(n+m) by walking both in lockstep. For a query "python interview," you find doc IDs that appear in both posting lists. That's every result candidate.

Skip pointers every √n entries let you jump ahead when one list outpaces the other. Without them, intersecting a rare term (10K entries) with a common one (10M entries) scans past 9,990,000 entries that can never match.

Posting lists are delta-encoded: store the gap between consecutive doc_ids rather than the raw values. The posting list for "the" has hundreds of millions of entries. Delta-compressed, it fits in a fraction of the raw space.

Skip pointers let you jump √n entries ahead when one list outpaces the other. The "the" posting list, uncompressed, would be enormous.

PageRank runs as a separate batch job on the link graph (50-100 iterations until convergence). The resulting score per document gets stored alongside its posting list entries so the online pipeline can use it without an extra lookup.

The Query Path: From Typed Text to Results

A user types "python interview tips." Here is what happens in the next 200 milliseconds:

1. Query processing: tokenize, lowercase, remove stop words. Apply spell correction ("pyhton" → did you mean "python"?). Optionally expand with synonyms to broaden recall.

2. Fan-out: The coordinator sends the parsed query to all N index shards in parallel. Each shard independently looks up its posting lists, intersects them, and scores its local top-100 candidates.

3. BM25 scoring on each shard: rare terms score higher (IDF), term frequency saturates so the tenth occurrence of "python" adds less than the first, and document length is normalized so long pages don't win by volume.

score(D, Q) = Σ  IDF(qi) * [ tf(qi,D) * (k1+1) / (tf(qi,D) + k1*(1-b+b*|D|/avgdl)) ]

k1 = 1.2 (saturation), b = 0.75 (length normalization).

4. Merge and re-rank: The coordinator collects top-100 from each of, say, 200 shards (20,000 candidates total). The top 200 survivors go to a lightweight ML ranker that applies PageRank, freshness, and click signals. Top 10 go back to the user.

Every query hits every shard. BM25 runs locally. The ML ranker touches only the top 200 survivors, never the full corpus.

Index Sharding: The Fan-out Problem

The full index can't fit on one machine. At 100 billion documents, the index alone runs to tens of petabytes.

Use document partitioning, not term partitioning. With document partitioning, shard 1 holds docs 0-500M, shard 2 holds 500M-1B, and so on. Every query fans out to every shard. With term partitioning, one machine owns all posting lists for "a" through "m." Multi-term queries still hit every shard, and a common term creates a single-machine bottleneck.

Document partitioning has one serious problem: tail latency amplification. With 200 shards each at p99 = 50ms, the coordinator sees the max of 200 independent distributions. The math: 0.99^200 is 13%, so 87% of requests see at least one slow shard.

The fix is hedged requests: after 30ms, fire the same sub-query to a replica and use whichever replies first. This adds ~5% extra load and cuts p99 by 40 to 60%.

With hedging, the slow shard becomes someone else's problem.

Each shard is replicated 3x. The coordinator uses consistent hashing to pick the primary and falls back to secondaries on timeout.

Freshness: Crawl Priority Is a Scheduling Problem

New content on a news site has a freshness requirement measured in minutes. A static Wikipedia article about ancient Rome can wait a week.

The URL Frontier is a multi-tier priority queue. Tier 1 holds news sites, recrawled every few minutes. Tier 2 holds regular blogs and product pages, recrawled hourly. Tier 3 holds static content, recrawled weekly.

The crawler tracks each URL's historical change rate and promotes or demotes it between tiers automatically. Changed 7 of last 10 crawls? Move up. Unchanged for 20 consecutive crawls? Move down. For near-real-time freshness (news, financial data), a fast path bypasses batch and feeds a small in-memory overlay index. The serving tier queries both.

Pages earn crawl budget by changing. Stay stale, get demoted.

Bottlenecks Worth Talking About

Hot terms: The posting list for "the" has hundreds of millions of entries. Routing queries containing it to one machine creates a hotspot. Replicate high-frequency posting lists across multiple machines and load-balance across those replicas. Most implementations just remove stop words, but if your query model keeps them, this is the fix.

Result cache: A Redis cluster caches (query → result list) for popular queries. Cache hit rates above 50% are realistic globally, cutting serving load roughly in half. Invalidate on significant index updates or after a TTL.

Index segment merges: New content gets indexed into small segments. Serving from thousands of tiny segments is expensive. Schedule merges during low-traffic windows and route queries to a hot standby during the merge.

Trade-offs Worth Defending

Decision	Option A	Option B	Winner
Index partitioning	Term partitioning	Document partitioning	Document: simpler updates, even fan-out
Crawl scheduling	Fixed interval	Adaptive queue	Adaptive: focus budget on live content
Freshness	Batch only	Batch + real-time overlay	Hybrid
Ranking	Pure BM25	BM25 + ML re-ranker	Hybrid: BM25 on all, ML on top 200
Tail latency	Accept p99	Hedged requests	Hedged: +5% load, 40-60% p99 reduction

The 45-Minute Clock

0-5: Clarify scope. Web vs. product search, freshness, personalization, query model.
5-15: Draw both pipelines. Call out that the index is the only connection.
15-25: Inverted index structure, posting list compression, skip pointers, PageRank batch job.
25-35: Query path end to end: tokenize, fan-out, BM25 per shard, merge, ML re-rank.
35-42: Sharding, tail latency, hedged requests, freshness tiers, result cache.
42-45: Trade-off table. Open questions: semantic search, personalization, spam detection.

When the interviewer pulls you deeper, connect your answers back to the top-level architecture. Show that you hold the full picture while zooming in.

The Mental Model That Sticks

Two decoupled systems connected only through the index. Say this first.
Inverted index: term maps to compressed, sorted posting list of (doc_id, tf, positions). Intersection is O(n+m). Delta encoding shrinks it dramatically.
Document partitioning: every query fans out to all shards, each shard returns local top-K, coordinator merges, ML ranker finalizes.
p99 with many shards is dangerous. Hedge requests. The math forces it.
Crawl freshness is a scheduling problem. Adaptive priority queue beats fixed interval every time.
BM25 filters all candidates fast. ML re-ranker touches only the top 200 survivors.

Knowing this architecture is step one. Explaining it under time pressure, while an interviewer asks what happens when two shards return conflicting scores, is a different skill. SpaceComplexity runs voice-based mock system design interviews with rubric-based feedback so you can practice that part out loud, on the clock.

Deeper reading: web crawler system design, distributed cache design, and the Bloom filter for the dedup check.