Typeahead Search at Scale: The System Design Interview Walkthrough

- Typeahead generates 5x the QPS of search because autocomplete fires on every keystroke, not just on submission — design for that number.
- Solve hard problems offline, serve trivially online: precompute top-k suggestions per prefix so the online path is a single hash lookup.
- Sorted prefix arrays beat tries for production serving: contiguous memory, cache-friendly, no locking under concurrent index updates.
- CDN-cache the top 1,000 prefixes to reach 80-90% hit rates and sub-20ms latency for global users without touching origin servers.
- Rank with frequency, CTR, and exponential recency decay in the offline pipeline; the online path only sorts by a precomputed score.
- Hybrid updates catch trending terms: a full daily scoring run plus lightweight hourly incremental jobs push diffs instead of rebuilding the index.
- Consistent hash by full prefix string prevents hot-shard problems that crush single-character shards like "a" and "t".
Google suggests completions as you type. This sounds like a tiny feature. It is not a tiny feature.
Google handles about 40,000 searches per second at peak. Autocomplete fires on every keystroke. An average query is five characters long. That means 200,000 typeahead requests per second at peak, five times the search traffic itself. Add a latency budget of 100ms end-to-end and you have a hard real-time serving problem dressed up as a text box.
Here is what a complete typeahead search system design interview answer looks like, from requirements to tradeoffs, in 45 minutes.
Start With Scope. Seriously, Don't Skip This.
Before drawing anything, ask two questions.
What kind of typeahead? Web search suggestions, e-commerce product names, and map addresses have different ranking logic. An interviewer who hears you treat "Apple Music" and "apple pie" identically hasn't been impressed yet.
How fresh do suggestions need to be? For a news site, trending terms need to surface within minutes. For a recipe app, daily batch is fine. The freshness requirement changes your architecture more than any other single decision.
Reasonable assumptions for a Google-scale design: 1 billion daily searches, up to 20 suggestions per request, P99 latency under 300ms end-to-end, multi-region deployment, suggestions refreshed every few hours.
Candidates who skip this step spend the next 30 minutes heroically solving the wrong problem. Don't be that person.
Two Pipelines, Not One
Draw this early. Most candidates never draw this boundary and spend the next 20 minutes explaining an architecture that collapses the moment you do the load math.
The offline pipeline aggregates query logs and builds the suggestion index. MapReduce or Spark jobs read billions of log entries, count frequencies, apply scoring, and write the result to a serving store. This runs every few hours.
The online pipeline handles live user requests. It reads from the prebuilt index and returns results. It does no computation.
The key insight: you solve the hard problem offline so the online path is trivial. By the time a user types "app," the top 20 suggestions for "app" are already computed, scored, and waiting. The serving tier just fetches them.
These two pipelines share nothing at request time. The only artifact they exchange is the index.
The offline pipeline does the heavy lifting every few hours. The online path does exactly one thing: look up the answer that's already been computed.
OFFLINE PIPELINE
Query Logs → Aggregation (Spark) → Scoring → Top-K Selection → Index Write
ONLINE PIPELINE
User Request → CDN Cache? → Load Balancer → Typeahead Server → Index Lookup → Response
Why You Should Pause Before Drawing a Trie
Every candidate reaches for a trie. Every single one. It is the Pavlovian response to the word "prefix." You should pause.
A trie node stores 26 child pointers for lowercase English, a value field, a flag, and padding. That is 40 to 100 bytes per node. A corpus of one million common queries requires tens of millions of nodes, easily several gigabytes of memory for a naive implementation. Tries also need careful locking for concurrent writes during index updates. At 200K QPS that is painful.
Sorted prefix arrays are better. Store suggestions lexicographically: ["apple", "apple music", "apple store", "apricot", ...]. Binary search finds the start of any prefix range in O(log N). Arrays are contiguous in memory, cache-friendly, and trivially parallelizable.
But sorted arrays still require traversal to find the top-k by score. The actual answer goes one step further.
A trie node costs 40-100 bytes of scattered heap memory. The precomputed map uses contiguous blocks the CPU prefetcher actually likes.
For a deeper look at trie mechanics, see Trie Data Structure: How Prefix Trees Make Autocomplete Fast.
Precompute Every Prefix
Store a prebuilt list of top suggestions for every possible prefix.
For "a," store the top 20 globally popular suggestions starting with "a." For "ap," the top 20 for "ap." For "app," the top 20 for "app." All the way down to full terms.
A request arrives. Hash the prefix, look it up in a hash map, return the list. Latency is O(1).
The memory cost is manageable. One million common terms with an average length of 8 characters generates up to 8 million prefix entries. Storing 20 suggestions per prefix at 12 bytes each (8 bytes for a term ID plus 4 for a score) is about 2GB. That fits comfortably in memory on a single machine and in Redis.
This is why the offline pipeline exists. It builds this precomputed map. The online pipeline never touches the raw corpus. You trade storage and offline compute for something priceless: the serving path does zero work.
Every prefix that could ever be typed has its answer ready. The online server is basically a librarian who already organized everything the night before.
Prefix → Precomputed Top-K (sorted by score)
"a" → [about, amazon, apple, android, ...] (score: 0.99, 0.97, 0.96, ...)
"ap" → [apple, app, application, apply, ...]
"app" → [apple, application, apple music, ...]
"appl" → [apple, application, apple music, ...]
The API Is Simple on Purpose
GET /suggest?q=app&limit=10&locale=en-US&context=web
Response:
{ "suggestions": [ {"text": "apple", "type": "entity", "score": 0.98}, {"text": "apple music", "type": "query", "score": 0.92}, {"text": "application", "type": "query", "score": 0.87} ] }
Two client-side optimizations matter, and both are worth saying out loud in an interview.
Client debouncing: fire the request only after 300ms of inactivity. This cuts request volume by 60 to 80 percent without noticeable UX degradation. Your servers thank you every time someone pauses to think.
Request cancellation: if a user types "app" and then "appl" before the first response arrives, cancel the first request. The browser's AbortController handles this on the client. Server-side cancellation tokens let the backend stop processing early and free the thread. This one always gets nods from interviewers because most candidates forget about it entirely.
Ranking Takes More Than Just Frequency
Frequency alone is wrong, and this is a great moment to show some product sense.
"How to tie a tie" is searched millions of times but the answer hasn't changed in decades. "Elon Musk" spikes after news events. A pure frequency model would surface 2012's trending terms forever. That's not a search engine, that's a time capsule.
Production ranking combines multiple signals with a weighted formula:
- Query frequency, log-scaled (raw counts are misleading because of power-law distributions)
- Click-through rate from historical impressions
- Recency weight using exponential decay:
score × e^(-λt), where t is days since last query. A λ of 0.1 gives roughly a 7-day half-life. - Trending multiplier: day-over-day frequency change
All of this runs in the offline pipeline. By the time a prefix hits the serving tier, every suggestion already has a final composite score. The online path just sorts by that score and returns the top k.
Personalization layers on top at request time if required. The serving tier fetches global top-k suggestions and re-ranks by user history. This keeps the index general-purpose while still allowing per-user customization.
Three Scaling Problems You Must Address
The hot prefix problem. Single-letter prefixes like "a" get 10 to 20 times more traffic than the average prefix. The letter "a" accounts for a disproportionate share of humanity's curiosity. Shard by first character and your "a" shard is overwhelmed while "x" and "z" sit idle, philosophically contemplating their underutilization.
Solution: consistent hash on the full prefix string, then distribute across shards. Overprovision replicas for the top 100 prefixes. See Consistent Hashing: The Complete System Design Interview Guide for how the ring distributes load.
CDN caching. The top 1,000 prefixes account for over 70 percent of all traffic. Push their suggestion lists to CDN edge nodes worldwide. A user in Tokyo asking for "th" completions never hits your origin servers. CDN hit rate for hot prefixes runs 80 to 90 percent. This is the single biggest latency win in the design, and it's the thing most candidates forget to mention.
Users hitting popular prefixes get a response from the nearest CDN edge in under 20ms. Cache misses travel to origin and cost 50-80ms. Your hot prefixes should almost never reach origin.
[User in Tokyo] → [CDN Edge, Asia] (cache hit: <20ms) → response
[User in NYC] → [CDN Edge, US-East] (cache hit: <20ms) → response
[Cache miss] → [Origin servers] → [In-memory index lookup] → response (~50-80ms)
Update frequency vs freshness. A pure daily batch misses trending terms. Real-time streaming catches trends but adds infrastructure complexity and a whole new class of things that can go wrong at 3am.
The industry answer is hybrid: run the full scoring pipeline daily, then run a lighter incremental job every hour to catch rapidly trending terms. "Trending" means any term whose query frequency has grown more than 3x since the last full run. Hourly jobs push diffs to the serving tier rather than replacing the full index, which keeps the update cost manageable.
Make Your Typeahead System Design Interview Count
Most candidates run out of time because they go deep on the wrong thing first. This structure works.
Minutes 0 to 5: Clarify requirements. Write down scale numbers (daily searches, latency SLA, freshness requirement). Do not skip this even under time pressure. Especially under time pressure.
Minutes 5 to 10: Draw the two pipelines. Label everything. Say explicitly: "offline builds the index, online serves it, the serving path does no computation." Watch your interviewer nod.
Minutes 10 to 20: Walk through the data model. Precomputed top-k per prefix. Show the memory math. Explain why sorted arrays beat naive tries.
Minutes 20 to 30: Scaling. Hot prefix sharding, CDN caching layer, replication strategy.
Minutes 30 to 40: Ranking and freshness. Walk through the scoring formula, explain the hybrid update pipeline.
Minutes 40 to 45: State your key tradeoff clearly. "We trade storage and offline compute for sub-millisecond serving latency. The precomputed index is the bet." Then stop talking and let the interviewer steer.
If your interviewer signals interest in one section, follow them. That deeper conversation is where you show expertise, and it's where generic candidates get separated from the people who actually built something.
Every Choice Has a Cost. Name It.
| Decision | Choice | Cost |
|---|---|---|
| Data structure | Precomputed hash map per prefix | Large storage, offline compute |
| Update strategy | Hybrid: daily full + hourly incremental | Two pipelines to operate |
| Caching | CDN edge + in-memory origin | Cache invalidation on index updates |
| Sharding | Consistent hash by full prefix string | Rebalancing cost on topology changes |
| Ranking | Offline scoring, not real-time | Minutes to hours of staleness |
None of these choices are obviously wrong. They are tradeoffs. Naming the cost explicitly is what moves an answer from "hire" to "strong hire." An interviewer who hears "we trade X for Y and accept Z as the consequence" knows they are talking to someone who has shipped something before.
Seven Things to Remember
- Typeahead generates five times the QPS of search. Design for that number from the start.
- Solve offline, serve online. The serving path does O(1) lookup, nothing more.
- Precompute top-k suggestions per prefix and store them in memory.
- CDN-cache the top 1,000 prefixes. Eighty percent hit rate.
- Rank with frequency, CTR, and exponential recency decay in the offline pipeline.
- Use hybrid updates: full daily run plus hourly incremental for trending terms.
- Consistent hash by full prefix string to avoid single-letter hot-spot shards.
Typeahead is the system design problem where one concept unlocks everything: precompute offline, serve online. Once you internalize that split, the rest of the design follows. If you want to practice talking through this out loud, with an interviewer who asks follow-up questions and actually scores your tradeoff discussion, SpaceComplexity runs those sessions on demand.
For more on how system design answers are evaluated and what strong answers look like at each level, see System Design Interview: What to Expect, How It's Scored, and How to Stand Out.