Web Crawler System Design: Why It's Harder Than a While Loop

You have 45 minutes to design a web crawler. Your brain supplies an answer immediately: fetch a URL, extract links, push them onto a queue, repeat. Ship it. The interviewer smiles and asks, "how do you avoid hammering the same server 500 times a second?" and the while-loop unravels in real time.

The real challenge in web crawler design is politeness. You are a guest on every server you visit. The queue, the fetcher, the deduplication layer, every piece of this design exists to serve two masters: crawl as many pages as possible, and don't abuse the servers doing you a favor by responding.

Ask Before You Draw Anything

The most common mistake is grabbing the marker before you understand the requirements. A general search index, a price-comparison engine, and a news freshness monitor all have radically different recrawl requirements. They are not the same system.

Ask these four questions before touching the whiteboard:

What is this crawling for? Index freshness requirements change everything downstream.
What scale? Millions of pages (a focused crawler) vs. billions (a general index) changes every capacity decision.
How fresh does the data need to be? Hourly recrawls for news vs. weekly for static documentation.
HTML only, or JavaScript-rendered pages? A JavaScript rendering farm is a separate system. Don't conflate it with the crawler, or you'll be explaining Puppeteer for 20 minutes while the architecture sits blank.

For this walkthrough, assume the standard interview framing: 1 billion HTML pages, crawled within 7 days, refreshed periodically. That exposes every interesting design problem.

The Numbers Come First

Do the math before touching the architecture. Interviewers notice candidates who skip this step, and not in a good way.

Throughput needed: 1B pages / 7 days / 86,400 seconds ≈ 1,650 pages/second
Average page size: ~200KB (modern web, HTML only)
Incoming bandwidth: 1,650 × 200KB ≈ 330 MB/s
Raw HTML storage: 1B × 200KB = 200 TB
URL frontier size: 100 bytes × 1B URLs = 100 GB

That last number is the one that changes your architecture. 100GB for the frontier alone is too large to keep in RAM. You need disk-backed storage before you've drawn a single box. This is not a detail you discover later. Doing the math first means you never design your way into a corner.

Five Components, One Loop

Web crawler five-component loop: Seed URLs flow into URL Frontier, then Fetcher Workers, then Parser, then Content Store. New links from Parser feed back through Dedup Filter into the Frontier.

The feedback loop from Parser back to Frontier is where the complexity concentrates.

Five components:

URL Frontier: the structured queue of URLs to crawl next, not a simple FIFO
DNS Cache: a custom async resolver (not your OS's, for reasons we'll get to)
Fetcher Workers: download pages, enforce politeness
Parser: extracts links and content from raw HTML
Content Store + Deduplication: persists pages, filters URLs already seen

The feedback loop from Parser back to Frontier is where the complexity concentrates. Every link the Parser extracts is a candidate for the Frontier. Without deduplication and rate-limit enforcement sitting in that path, you are building a polite DoS attack.

The URL Frontier Is Not a Queue

This is where most candidates lose points. If your answer is "I'd use a queue," the design has a problem.

The frontier has two responsibilities that conflict: prioritize high-value pages and respect per-host rate limits. A single queue can satisfy one or the other, not both.

The classic solution, first described in the Mercator crawler (Heydon and Najork, 1999) and formalized in Manning's Introduction to Information Retrieval, uses two queue tiers.

Two-tier URL Frontier diagram: front queues handle priority scoring, back queues enforce per-host politeness, a min-heap tracks earliest allowed contact time per host, and crawler threads pull work from the back queue whose timer has expired.

Front queues handle priority. Back queues handle politeness. These are independent concerns and mixing them destroys throughput.

Front queues handle priority. The prioritizer scores each incoming URL (by PageRank signal, inbound link count, or observed change frequency) and assigns it to queue F1 through Fk. The selector draws from front queues with probability proportional to tier, so high-priority URLs get crawled more often without completely starving lower tiers.

Back queues handle politeness. Each back queue holds URLs from exactly one host. A min-heap tracks the earliest time each host may be contacted again. A crawler thread picks the back queue whose heap entry has the smallest timestamp, fetches one URL from that queue, and then updates that host's next-allowed time. Mercator recommends keeping roughly three back queues per crawler thread, so threads stay busy while enforcing wait intervals.

The insight: prioritization and politeness are independent concerns. Mixing them in one queue forces you to scan past thousands of waiting-for-politeness URLs to find the next fetchable one, destroying throughput.

Politeness Has Three Layers

Meta/Facebook crawler traffic dashboard showing 11 million requests in 30 days versus 24 million real users, bot traffic nearly matching all real human traffic on the site.

This is what an impolite crawler looks like from the server's side. Meta/Facebook: 11 million requests in 30 days. Real users: 24 million. The site owner paid Vercel for every single one.

Your crawler's job is to not be Meta on that chart. Before a fetcher touches any URL on a domain, it checks three things.

robots.txt. Fetch /robots.txt once per domain and cache it for several hours. Respect both Disallow directives (paths the crawler must skip) and Crawl-delay directives (minimum seconds between requests). Ignoring robots.txt is how crawlers get IP-banned, and a banned crawler is a crawler at 0 pages per second.

Per-host rate limiting. Even without a crawl-delay directive, wait for an interval proportional to the last fetch time. The Mercator heuristic: the wait should be an order of magnitude larger than the previous fetch duration. A page that took 2 seconds to load means wait 20 seconds before the next request to that host.

User-Agent transparency. Send an honest User-Agent string with a contact URL. Impersonating a browser has no upside and causes real damage when something goes wrong.

DNS Is the Hidden Bottleneck Everyone Forgets

You have 1,000 concurrent workers. Each fires a request. Each blocks waiting for DNS resolution. Your concurrent crawler is now serial. This is not hypothetical. Standard-library DNS resolution is synchronous, and if your fetcher fires 1,000 concurrent requests while each blocks on a DNS lookup, you have serialized your entire crawler through a single resolver.

The Mercator paper measured DNS consuming up to 70% of elapsed time per crawler thread before the team built a custom async resolver. That number tends to get people's attention.

The fix is a custom DNS layer that:

Issues lookups asynchronously, unblocking other workers while waiting
Maintains an LRU in-process cache keyed on hostname, with TTL-based expiry
Optionally pre-resolves IPs as URLs enter the frontier, so workers receive addresses instead of hostnames

Mention the async resolver in your interview. It signals production experience in a way that architecture buzzwords do not.

Deduplicate Before You Fetch, and Again After

Two-stage deduplication pipeline: Stage 1 normalizes URLs and checks a Bloom filter before fetching. Stage 2 computes SHA-256 hash and SimHash of response body after fetching, skipping duplicate and near-duplicate content.

Bloom filter at 1% FPR costs 1.2 GB for 1 billion URLs. An exact set would cost 100 GB. The occasional false positive means re-fetching a known URL, which is far cheaper.

You need deduplication at two points in the pipeline.

URL dedup. The same page lives under multiple URLs: HTTP vs. HTTPS, with and without trailing slash, with sorted vs. unsorted query parameters. Normalize first (lowercase scheme and host, strip default ports, canonicalize query strings). Then check the normalized URL against a Bloom filter. Bloom filters give O(1) lookup, accept a small false-positive rate, and never produce false negatives, which is exactly the right tradeoff. At 1% FPR, a Bloom filter costs 9.6 bits per URL, so 1 billion URLs occupy about 1.2 GB.

Content dedup. Different URLs often return identical HTML. Compute a SHA-256 hash of the response body and store it in a hash set. Skip storage if you've seen that hash. For near-duplicates (pages that differ only in navigation, timestamps, or ads), SimHash generates 64-bit fingerprints where similar documents have low Hamming distances. Two pages with Hamming distance 3 or fewer are likely duplicates. At 64 bits per fingerprint, 1 billion pages fit in 8 GB.

Match the Data to the Right Store

Raw HTML goes in an object store (S3 or equivalent). Write-once, read-rarely, 200TB scale. Don't put HTML in a relational database. Nothing good happens when you put HTML in a relational database.

Metadata goes in a relational store or document DB: URL, crawl timestamp, HTTP status code, content hash, next-crawl time, discovered link count. Small rows, high query volume, perfect for PostgreSQL or DynamoDB.

The search index (inverted index, rankings) is a downstream system. The crawler's job is to fetch and store. Draw a clean boundary and don't design the indexer unless asked. Scope creep in a system design interview looks like enthusiasm but reads as an inability to prioritize.

Where Distributed Crawlers Break

A single crawler node tops out at a few hundred pages per second. Reaching 1,650 requires horizontal scaling, and horizontal scaling introduces coordination problems the single-node design never sees.

Double-crawling the same domain. Without coordination, two crawler nodes will independently hit the same server, doubling the load and violating politeness. The fix is consistent hashing on domain name: node = hash(domain) % N. All URLs from example.com always go to node 3. Politeness enforcement works because each domain has exactly one responsible node.

Consistent hash ring showing four crawler nodes. Domains hash to specific nodes: example.com to Node 2, wikipedia.org to Node 1, nytimes.com to Node 4, reddit.com to Node 3. Each node owns all rate-limit state for its assigned domains.

Consistent hashing is deceptively simple to explain and just tricky enough to design correctly. Each domain maps to exactly one node, which owns all politeness state for that domain.

Memory frontier overflow. Popular domains (Wikipedia, Reddit) can accumulate tens of millions of discoverable URLs. Uncapped, they eat a node's entire RAM. Cap per-domain URL counts and use a disk-backed frontier (Redis with persistence, or RocksDB) for overflow.

Parser CPU, not network bandwidth. In 2025, modern pages average around 240KB of HTML and almost all connections are TLS. SSL handshakes and HTML parsing consume more CPU than the network downloading does. A single crawler in a recent 1-billion-page experiment found SSL/TLS consuming 25% of total CPU. Run parsers in separate worker processes from fetchers and scale them independently.

robots.txt at scale. A million domains means a million upfront /robots.txt fetches. Cache them for hours, not minutes.

The Tradeoffs to Have Ready

An interviewer will ask you to compare choices. Don't wait for the prompt.

BFS vs. DFS traversal: BFS gives broader coverage and is more polite by default (you spend time across many domains instead of digging deep into one). DFS is almost never the right choice for a general crawler. A DFS crawler will exhaust one domain completely before moving on, which violates politeness and produces a lopsided index.

Bloom filter vs. exact URL set: Bloom filters are the right call at billion-URL scale. The false-positive rate means occasionally re-fetching a known URL, which is far cheaper than the memory of an exact set.

Fixed recrawl interval vs. change-rate based: Fixed intervals are simpler to operate. Change-rate scheduling (recrawl pages proportional to how often they've changed historically) gives better freshness for volatile content like news. Start with fixed intervals and layer in change-rate logic once you have historical data.

Pacing the 45-Minute Interview

45 minutes is not as long as it sounds. It is especially not as long as it sounds at minute 38.

Minutes 0-5: Clarify scope. Ask the four questions above. State assumptions out loud. Don't skip this; it anchors every decision that follows.

Minutes 5-8: Back-of-envelope math. Write the numbers on the board. This justifies every architecture decision that follows. The 100GB frontier number alone demonstrates you know what you're designing.

Minutes 8-18: Draw the five-component high-level architecture. Name all components. Show the feedback loop. Don't dive deep yet.

Minutes 18-35: Deep-dive the URL frontier. This is where interviewers probe hardest and most candidates are weakest. Cover front queues, back queues, and the heap. Then cover DNS caching, deduplication, storage, and consistent hashing for distribution.

Minutes 35-45: Invite trade-off discussion. End with: "The main design tension here is between crawl velocity and politeness. I've optimized for politeness. If the use case were real-time news indexing, I'd tune the recrawl scheduler to prioritize change-rate signals over uniform freshness."

That closing move shows engineering maturity. Your design was a deliberate choice, not a default.

If you want to practice explaining this out loud under real interview pressure, with a voice-based interviewer asking follow-up questions about your tradeoffs in real time, SpaceComplexity runs system design mocks in exactly that format.

Quick Recap

Clarify use case, scale, and freshness before designing anything
The URL frontier has two tiers: front queues for priority, back queues for politeness per host
A min-heap tracks the earliest-allowed contact time per host
DNS is a hidden bottleneck at scale; build a custom async resolver with an LRU cache
Normalize URLs and use a Bloom filter for URL dedup; SHA-256 plus SimHash for content dedup
Raw HTML belongs in object storage; metadata in a relational DB; the search index is downstream
Use consistent hashing on domain name to distribute crawlers without coordination overhead
In the interview: numbers first, high-level second, URL frontier deep-dive third, tradeoffs last