Search Autocomplete System Design: The Two Flows Nobody Draws

You type "reacti" into Google and get ten suggestions in under 100ms. Most candidates, when asked to design search autocomplete, draw a trie, slap a cache in front of it, and call it done. The interviewer nods politely. Then: "How do you update the trie when query frequencies change?" Silence. The kind where you can hear the cursor moving toward the notes field.

The trie is the right data structure. But autocomplete is two systems stitched together: a read path that returns suggestions in milliseconds, and a write path that builds the trie offline from real search data. Conflating them is why interviews stall. This walkthrough separates them.

Real autocomplete, ranking by actual query frequency. The system is working exactly as designed.

Scope It Before You Draw It

Spend the first two minutes nailing requirements. Autocomplete (sometimes called typeahead search) is deceptively broad. Two minutes of scoping now saves twenty minutes of backtracking later.

Functional: Return the top N (usually 5-10) completions for any prefix. Rank by popularity. Latency target under 100ms end-to-end. Support basic text queries, no real-time personalization to start.

Non-functional: Read-to-write ratio is roughly 1000:1. Every character typed is a read; a query completion happens once per session. High availability matters, but autocomplete is not on the critical path. Search still works if suggestions are unavailable. Eventual consistency is explicitly acceptable. Suggestions can lag hours behind actual query trends.

Suggested framing for the interview: "I'll scope this for 10M DAU, top-10 suggestions per prefix, p99 latency under 100ms, no personalization. I'll note where personalization would require architectural changes."

That single scoping decision (no personalization) unlocks aggressive CDN caching and dramatically simplifies sharding.

Draw Both Flows First

Most diagrams stop here. They show a box called "autocomplete service" and skip straight to the trie. Wrong level of detail.

Draw both flows before zooming into either. The interviewer is waiting to see if you know they exist.

The read flow is what happens when a user types. Their client debounces the request, it hits a CDN cache, then lands on a sharded trie service, and comes back with ranked suggestions.

The write flow runs in the background. Every search is logged to Kafka. An aggregation pipeline counts query frequencies. A batch job rebuilds the trie periodically, serializes it, and ships it to trie servers for a zero-downtime swap.

The trie is read-only at query time. It is never mutated synchronously when someone searches. Writing to a shared in-memory structure under thousands of concurrent reads would serialize operations and kill latency. Instead, the write path builds a new trie offline and replaces the old one atomically.

The Autocomplete Trie: Why Naive Implementations Break Down

A plain trie stores strings character by character. To get suggestions for prefix "ca", you walk to the "ca" node, then traverse the entire subtree to find the most popular completions. At scale, that subtree can have millions of nodes. A single query becomes an O(n) tree walk. That is not a 100ms system.

The fix: store the top-K suggestions pre-computed at every trie node.

When a user types "ca", the service walks to the "ca" node in O(k) time (k = prefix length) and reads the cached list directly. No traversal. The top-K list at each node is computed at trie-build time, propagated from leaves upward.

Query hits "ca" and reads the pre-computed list in O(prefix length). The subtree never gets traversed.

Each trie node stores:

The character
Children map (26 slots for lowercase ASCII, more for unicode)
is_terminal: bool
top_suggestions: List[(string, score)]

The score is a weighted combination of search frequency (primary signal), recency decay (so last year's viral query fades), and optionally click-through rate.

Memory reality check. A trie for 50 million unique queries at an average of 10 characters each fits in 1-2GB of RAM. That easily lands on a single machine, though you will shard for throughput.

An alternative worth mentioning: a flat prefix hash table. Every prefix is a key, every value is the top-K list. Simpler to build and shard (just hash the prefix string), but uses 2-3x more memory because you cannot share prefix storage across terms. Both are valid; the trie is the canonical answer, and you should know why tries are structured the way they are before walking into this interview.

API Design: One Endpoint

GET /v1/autocomplete?q=reacti&limit=10

Response:

{
  "suggestions": [
    { "text": "reactivity", "score": 9842 },
    { "text": "react interview questions", "score": 7310 },
    { "text": "react native", "score": 6201 }
  ]
}

Keep the surface minimal. limit defaults to 10, q is the raw prefix string. You can add locale for region-specific suggestions and session_id for lightweight personalization later without breaking the contract.

Client-side: debounce at 150-200ms. Fire a request only after the user pauses typing. This alone cuts your backend QPS by roughly 60%. Also maintain a session-level prefix cache: typing "rea", then "reac", then deleting back to "rea" should not trigger a second network call.

Scaling the Read Path

The trie lookup is fast. The bottleneck is network latency and raw QPS.

CDN caching. Prefix queries follow a power law. The top 10,000 prefixes account for roughly 80% of all autocomplete traffic. Cache those responses at CDN edge nodes, close to users. A CDN hit adds 1-5ms and eliminates the backend entirely for the common case. TTL of 30-60 minutes is reasonable given eventual consistency is acceptable.

In-memory cache on the service. LRU cache the last 50-100K prefixes on each trie server. Warm it from a precomputed dump at startup.

Sharding. One trie server cannot handle 500K QPS. Shard by prefix range: one shard handles prefixes starting with "a" through "d", another "e" through "r", and so on.

The wrong way to set boundaries is naive alphabetical splits. "S" alone carries more traffic than "u" through "z" combined. Every "spotify" trend is someone else's pager alert if you got the split wrong.

Traffic bars show the actual distribution. Shard 4 is doing most of the work. Plan for this.

Analyze your actual prefix frequency distribution and set split points to balance request load, not alphabet coverage.

Because the trie is replaced wholesale rather than mutated, replication is straightforward. Push the new serialized trie snapshot to all replicas. No complicated change replication.

Scaling the Write Path

The write path is where candidates run out of time. Most spend 25 minutes on the trie and describe the write path as "a batch job, maybe Kafka, probably some caching." That does not get you through. Describe it in order:

Five numbered steps. Each one has a reason. Name them.

Log every search to Kafka. Do not touch the trie directly. A Kafka topic buffers the write load and decouples logging from trie serving.
Stream aggregation with Spark or Flink counts query frequencies in 15-minute windows. This gives you a near-real-time view of what is trending.
Daily batch job runs over the full query log. This is the source of truth. MapReduce (or Spark) counts total frequency over the last 30 days, applying exponential decay so recent queries rank higher. Queries below a minimum frequency threshold (say, 100 searches) are dropped. This filters out typos, personally identifying queries, and one-off searches.
Trie builder ingests the frequency table, builds a fresh trie with top-K pre-computed at every node, serializes it to a compact binary format, and distributes it to trie servers.
Atomic pointer swap. Each trie server holds two trie instances in memory. When the new snapshot arrives, the server loads it into the second slot and atomically updates the serving pointer. Zero downtime, zero locked reads.

The optional trending overlay: a small Redis store of the top-100 trending terms, updated every 15 minutes, merged on top of trie results at query time. Breaking news gets into suggestions in minutes without touching the rebuild cadence.

The Tradeoffs You Need to Name Out Loud

Name every tradeoff out loud. Interviewers want reasoning, not narration.

Freshness vs. simplicity. Daily batch keeps infrastructure simple and the trie stable. Hourly updates are possible but double operational complexity. Sub-hourly requires a real-time write path that competes with reads. Unless you are building Twitter's search bar, daily is fine and you should say so.

Personalization vs. CDN caching. CDN caching works because everyone typing "spo" sees the same suggestions. Personalization breaks that. A practical middle ground: serve generic CDN-cached results by default, and apply a lightweight reranking pass in the backend using user context if session_id is provided. The first pass is always fast.

Trie vs. prefix hash. The trie is memory-efficient and elegant. The prefix hash table is simpler to reason about and shard. Hash map internals matter here: the prefix hash needs O(1) average lookup but uses more memory and cannot share prefix storage. Name both, pick the trie, explain why.

Content filtering. You cannot return every query starting with "ki". Apply a blocklist at trie-build time, not at query time (latency cost) and not retroactively (incident response lag). A missing blocklist is a design flaw you will get called on. Build it in from day one.

How to Use Your 45 Minutes

The fatal mistake is spending 30 minutes on trie internals and never reaching the write path. The interviewer does not need a lecture on 26-child pointer arrays. They need to know you understand the full system.

0-5: Requirements. Scale, latency, personalization scope. Say: "I'll scope out personalization to keep CDN caching viable."

5-15: High-level diagram. Both flows. Label every box. Name the data store behind the trie.

15-25: Trie with top-K. Show the lookup. Mention prefix hash and the memory tradeoff.

25-35: Write pipeline. Kafka, aggregation, batch job, trie builder, atomic swap. Name the freshness tradeoff.

35-45: Scaling deep-dives. Sharding strategy, CDN TTL, debouncing, trending overlay if time permits.

Follow the interviewer if they redirect. This order ensures you hit every load-bearing decision before time runs out.

Recap

Autocomplete has two separate flows. Read path: fast prefix lookup from an in-memory trie. Write path: offline aggregation and periodic trie rebuild. Never mix them.
Store top-K suggestions at every trie node. Without pre-computation, every query becomes an O(n) subtree traversal.
Queries follow a power law. CDN-cache the top prefixes. A 30-60 minute TTL is fine given eventual consistency.
Shard by prefix range with empirically tuned split points, not naive alphabetical splits.
Write pipeline: Kafka for logging, Spark for aggregation, daily batch for the frequency table, atomic pointer swap for zero-downtime updates.
Name the tradeoffs: freshness vs. simplicity, personalization vs. caching, trie vs. prefix hash, blocklist for content safety.
Debounce at 150-200ms on the client. It is free QPS reduction.

System design is a spoken skill. Writing the diagram is step one. Explaining your reasoning under pressure and defending your choices when the interviewer pivots is what actually gets scored. SpaceComplexity runs voice-based system design mocks with rubric-driven feedback so you practice the full loop, not just the whiteboard.