Design a Search Engine: The System Design Interview Walkthrough

May 27, 202610 min read
interview-prepcareerdsaalgorithms
Design a Search Engine: The System Design Interview Walkthrough
TL;DR
  • Two-pipeline architecture: a search engine is an offline crawl-and-index pipeline and an online query-and-serve pipeline, connected only through the inverted index.
  • Inverted index maps each term to a sorted, delta-encoded posting list of (doc_id, tf, positions); intersection of two lists runs in O(n+m) using skip pointers.
  • Document partitioning over term partitioning: every query fans out to all shards, each shard returns its local top-K, and the coordinator merges and re-ranks.
  • BM25 scoring filters candidates fast across all shards; an ML re-ranker applies only to the top 200 survivors before returning results to the user.
  • Hedged requests add roughly 5% extra load and cut p99 tail latency by 40-60% when any one of hundreds of shards runs slow.
  • Adaptive URL Frontier tracks each page's historical change rate and promotes or demotes it between crawl-frequency tiers automatically.

Candidates who nail the search engine system design interview do something counterintuitive: they draw two diagrams before describing a single component. Because a search engine isn't one system. It's two systems that happen to share an index, and most candidates spend 40 minutes on the crawling half and never reach the part that actually makes or breaks the interview: how a query returns results in under 200 milliseconds against 100 billion documents.

The two systems are the offline pipeline (crawl and index) and the online pipeline (query and serve). The only connection between them is the index. Get that separation on the whiteboard in the first ten minutes and the rest of the design writes itself.

Clarify the Scope Before You Draw Anything

The question has at least three very different versions. Figure out which one you're answering before drawing anything.

  • Domain: General web search (Google-scale, ~100B pages) versus product or enterprise search (Elasticsearch-scale, ~100M documents)?
  • Freshness: How stale is acceptable? News queries want seconds. Static documentation can tolerate days.
  • Personalization: Does the same query return different results for different users?
  • Query model: Keyword search only, or natural language and semantic similarity?

For this walkthrough: general web search, no personalization, freshness in minutes for news and hours for static content, keyword search with basic query understanding.

The Numbers You Need to Anchor the Design

Working numbers for the interview: 100 billion documents indexed, 100K QPS peak, 200ms p99 target, 50 billion pages crawled per day, ~20KB average compressed page. Google's index alone runs to roughly 100 petabytes.

Put these on the whiteboard early. Reaching for back-of-envelope math before components signals that you think in systems.

Two Pipelines, One Index

Draw two horizontal swimlanes before describing any component.

Offline pipeline: crawl the web, parse HTML, build the inverted index, compute PageRank scores. Runs continuously. Feeds into the index.

Online pipeline: receive a query, look up posting lists, score documents, return the top 10 results. Reads from the index on every request.

The decoupling is the key design decision. The offline pipeline can update the index thousands of times per day without pausing the serving tier. The serving tier reads a snapshot; the indexer writes to new segments in parallel.

Search engine two-pipeline architecture: offline crawl and index, online query and serve, connected only through the inverted index Two pipelines, one connection: the index. Get this on the whiteboard in the first ten minutes.

Offline: Crawling and Indexing

URL Frontier: Priority queue of URLs awaiting crawl. Priority is a function of estimated PageRank, observed update frequency, and time since last crawl. News sites recrawled every few minutes. Static tutorials: weekly.

Fetcher: Distributed HTTP workers. Respect robots.txt, per-domain crawl delays, cache DNS lookups locally. DNS takes 10-100ms. At this scale, caching it is non-negotiable.

A viral tweet where a guy claims he was laid off from Google after 24 years on the Crawler team because his job was clicking all the links on websites and adding pages to Google, and a reporter asks "was this a serious tweet" Web crawlers, explained. (via r/ProgrammerHumor)

Deduplication: Before parsing, the fetcher computes an MD5 hash of the page body and checks it against a Bloom filter backed by a distributed key-value store. Near-duplicate pages get dropped.

Parser and Indexer: Extracts plain text, tokenizes, lowercases, removes stop words. Emits (term, doc_id, position) tuples into a distributed batch job that builds inverted index segments. Segments merge periodically into the main index.

The Inverted Index Is the Whole Design

Every other component in a search engine is logistics. The inverted index is the actual search engine.

An inverted index is a hash map from term to posting list. Each posting list is a sorted array of (doc_id, term_frequency, positions) tuples:

"python"    → [(doc_14, tf=3, pos=[12,45,89]), (doc_201, tf=1, pos=[3]), ...]
"interview" → [(doc_14, tf=2, pos=[50,60]),    (doc_419, tf=5, pos=[1,2,3,4,5]), ...]

Posting lists are sorted by doc_id, which lets the query engine intersect two lists in O(n+m) by walking both in lockstep. For a query "python interview," you find doc IDs that appear in both posting lists. That's every result candidate.

Skip pointers every √n entries let you jump ahead when one list outpaces the other. Without them, intersecting a rare term (10K entries) with a common one (10M entries) scans past 9,990,000 entries that can never match.

Posting lists are delta-encoded: store the gap between consecutive doc_ids rather than the raw values. The posting list for "the" has hundreds of millions of entries. Delta-compressed, it fits in a fraction of the raw space.

Inverted index structure: term dictionary on the left pointing to sorted delta-encoded posting lists on the right, with skip pointer arcs at √n intervals Skip pointers let you jump √n entries ahead when one list outpaces the other. The "the" posting list, uncompressed, would be enormous.

PageRank runs as a separate batch job on the link graph (50-100 iterations until convergence). The resulting score per document gets stored alongside its posting list entries so the online pipeline can use it without an extra lookup.

The Query Path: From Typed Text to Results

A user types "python interview tips." Here is what happens in the next 200 milliseconds:

1. Query processing: tokenize, lowercase, remove stop words. Apply spell correction ("pyhton" → did you mean "python"?). Optionally expand with synonyms to broaden recall.

2. Fan-out: The coordinator sends the parsed query to all N index shards in parallel. Each shard independently looks up its posting lists, intersects them, and scores its local top-100 candidates.

3. BM25 scoring on each shard: rare terms score higher (IDF), term frequency saturates so the tenth occurrence of "python" adds less than the first, and document length is normalized so long pages don't win by volume.

score(D, Q) = Σ  IDF(qi) * [ tf(qi,D) * (k1+1) / (tf(qi,D) + k1*(1-b+b*|D|/avgdl)) ]

k1 = 1.2 (saturation), b = 0.75 (length normalization).

4. Merge and re-rank: The coordinator collects top-100 from each of, say, 200 shards (20,000 candidates total). The top 200 survivors go to a lightweight ML ranker that applies PageRank, freshness, and click signals. Top 10 go back to the user.

Query fan-out to N index shards running BM25 locally, coordinator merging top-100 from each shard, ML ranker trimming to top 10 results Every query hits every shard. BM25 runs locally. The ML ranker touches only the top 200 survivors, never the full corpus.

Index Sharding: The Fan-out Problem

The full index can't fit on one machine. At 100 billion documents, the index alone runs to tens of petabytes.

Use document partitioning, not term partitioning. With document partitioning, shard 1 holds docs 0-500M, shard 2 holds 500M-1B, and so on. Every query fans out to every shard. With term partitioning, one machine owns all posting lists for "a" through "m." Multi-term queries still hit every shard, and a common term creates a single-machine bottleneck.

Document partitioning has one serious problem: tail latency amplification. With 200 shards each at p99 = 50ms, the coordinator sees the max of 200 independent distributions. The math: 0.99^200 is 13%, so 87% of requests see at least one slow shard.

The fix is hedged requests: after 30ms, fire the same sub-query to a replica and use whichever replies first. This adds ~5% extra load and cuts p99 by 40 to 60%.

Hedged request timeline comparing with and without: primary sent at T+0ms, hedge to replica at T+30ms, replica responds T+45ms, slow primary discarded With hedging, the slow shard becomes someone else's problem.

Each shard is replicated 3x. The coordinator uses consistent hashing to pick the primary and falls back to secondaries on timeout.

Freshness: Crawl Priority Is a Scheduling Problem

New content on a news site has a freshness requirement measured in minutes. A static Wikipedia article about ancient Rome can wait a week.

The URL Frontier is a multi-tier priority queue. Tier 1 holds news sites, recrawled every few minutes. Tier 2 holds regular blogs and product pages, recrawled hourly. Tier 3 holds static content, recrawled weekly.

The crawler tracks each URL's historical change rate and promotes or demotes it between tiers automatically. Changed 7 of last 10 crawls? Move up. Unchanged for 20 consecutive crawls? Move down. For near-real-time freshness (news, financial data), a fast path bypasses batch and feeds a small in-memory overlay index. The serving tier queries both.

URL Frontier adaptive 3-tier priority queue: tier 1 for news at 1-5 minute intervals, tier 2 for blogs at 1-24 hours, tier 3 for static content weekly, with change-rate-driven promotion and demotion Pages earn crawl budget by changing. Stay stale, get demoted.

Bottlenecks Worth Talking About

Hot terms: The posting list for "the" has hundreds of millions of entries. Routing queries containing it to one machine creates a hotspot. Replicate high-frequency posting lists across multiple machines and load-balance across those replicas. Most implementations just remove stop words, but if your query model keeps them, this is the fix.

Result cache: A Redis cluster caches (query → result list) for popular queries. Cache hit rates above 50% are realistic globally, cutting serving load roughly in half. Invalidate on significant index updates or after a TTL.

Index segment merges: New content gets indexed into small segments. Serving from thousands of tiny segments is expensive. Schedule merges during low-traffic windows and route queries to a hot standby during the merge.

Trade-offs Worth Defending

DecisionOption AOption BWinner
Index partitioningTerm partitioningDocument partitioningDocument: simpler updates, even fan-out
Crawl schedulingFixed intervalAdaptive queueAdaptive: focus budget on live content
FreshnessBatch onlyBatch + real-time overlayHybrid
RankingPure BM25BM25 + ML re-rankerHybrid: BM25 on all, ML on top 200
Tail latencyAccept p99Hedged requestsHedged: +5% load, 40-60% p99 reduction

The 45-Minute Clock

  • 0-5: Clarify scope. Web vs. product search, freshness, personalization, query model.
  • 5-15: Draw both pipelines. Call out that the index is the only connection.
  • 15-25: Inverted index structure, posting list compression, skip pointers, PageRank batch job.
  • 25-35: Query path end to end: tokenize, fan-out, BM25 per shard, merge, ML re-rank.
  • 35-42: Sharding, tail latency, hedged requests, freshness tiers, result cache.
  • 42-45: Trade-off table. Open questions: semantic search, personalization, spam detection.

When the interviewer pulls you deeper, connect your answers back to the top-level architecture. Show that you hold the full picture while zooming in.

The Mental Model That Sticks

  • Two decoupled systems connected only through the index. Say this first.
  • Inverted index: term maps to compressed, sorted posting list of (doc_id, tf, positions). Intersection is O(n+m). Delta encoding shrinks it dramatically.
  • Document partitioning: every query fans out to all shards, each shard returns local top-K, coordinator merges, ML ranker finalizes.
  • p99 with many shards is dangerous. Hedge requests. The math forces it.
  • Crawl freshness is a scheduling problem. Adaptive priority queue beats fixed interval every time.
  • BM25 filters all candidates fast. ML re-ranker touches only the top 200 survivors.

Knowing this architecture is step one. Explaining it under time pressure, while an interviewer asks what happens when two shards return conflicting scores, is a different skill. SpaceComplexity runs voice-based mock system design interviews with rubric-based feedback so you can practice that part out loud, on the clock.

Deeper reading: web crawler system design, distributed cache design, and the Bloom filter for the dedup check.

Further Reading