System Design Interview

Design a Web Crawler

Straightforward to describe, surprisingly complex to scale: a billion-page crawl touches nearly every distributed-systems concept.

L3/L4: Working system · L5: Tradeoff reasoning · L7/L8: Architecture ownership
01

What the interviewer is testing

A web crawler is deceptively approachable: "fetch pages, extract links, repeat" is a description that fits in a tweet. The interview question is a trap for exactly that reason: the system looks trivial until you start asking the hard questions. How do you avoid crawling the same URL twice across a fleet of 500 workers? How do you avoid taking down small websites? How do you prioritise the 1% of URLs that account for 50% of the value?

The question tests three specific things that don't come up in most other system design problems:

Graph traversal at scale. The web is a directed graph with ~50 billion nodes and trillions of edges. Choosing BFS vs DFS isn't academic: it determines whether your crawl discovers important pages in hours or months.

Distributed coordination without a central bottleneck. A single seen-URL set that every worker writes to will become the choke point. Candidates who don't think about this usually hit it around the "scalability" section.

Being a good citizen. A crawler that ignores robots.txt and hammers domains will get IP-blocked and potentially create legal exposure. Politeness isn't a nice-to-have; it's architecturally required.

Level What good looks like
L3/L4 Describe the fetch-parse-enqueue loop, basic deduplication, robots.txt compliance
L5 Defend BFS frontier choice, Bloom filter tradeoffs, politeness queue design, content hashing
L6 Partition the frontier by domain hash, explain DNS caching, handle re-crawl scheduling with TTL
L7/L8 Own the full crawl budget problem: importance scoring, freshness vs coverage tradeoffs, anti-spam, legal exposure
02

Requirements clarification

Functional requirements

Requirement Notes
Discover and fetch web pages starting from a seed URL list Seed set provided externally
Extract hyperlinks and enqueue discovered URLs HTML anchor tags; normalise before enqueue
Store raw page content and metadata HTML, fetch timestamp, HTTP status, content hash
Respect robots.txt and per-domain crawl rate limits Per-domain politeness queue
Avoid fetching the same URL twice within a crawl cycle URL fingerprint deduplication
Re-crawl pages on a refresh schedule TTL-based or change-detection-based

Non-functional requirements

NFR Target Reasoning
Scale ~1B pages/day Google-scale; reasonable FAANG interview scope
Crawl freshness Top pages re-crawled within 24h Ensures index currency for search ranking
Politeness ≤1 req/s per domain by default Avoids overloading small sites; industry norm
Deduplication accuracy <0.1% duplicate fetches in steady state Bloom filter false-positive rate target (~0.1% at 14 bits/element)
Availability 99.9% (crawl continues despite worker failures) Worker crashes must not lose URL queue entries
Storage Petabyte-scale raw content ~100KB avg page × 1B pages/day × 30 days

NFR reasoning

Politeness constraint (≤1 req/s per domain)

One request per second is the de facto industry standard for respectful crawling. Many small sites run on shared hosting with 100–200 requests/s of total capacity, so even modest domains can be overwhelmed if multiple crawl workers independently fetch from the same domain at once. The per-domain rate limit is the most consequential single constraint in this design: it directly forces the partitioned frontier architecture in §4.

What this drives: The frontier must be partitioned by domain so each domain's queue is owned by exactly one worker pool at a time. A single global FIFO queue makes politeness enforcement nearly impossible.
Deduplication accuracy (<0.1% duplicates)

At 1B pages/day, even 0.1% duplicate fetches is 1M wasted requests consuming bandwidth and crawl budget. A Bloom filter with a ~0.1% false-positive rate is achievable at ~14 bits per element; for 50B seen URLs that's roughly 87GB, which fits in memory across a Redis cluster. The false-positive direction is safe: occasionally skipping a new URL is acceptable. False negatives (treating a seen URL as new) cannot happen with a Bloom filter, which is why the exact-match store serves only as a second gate for URLs the filter passes.

What this drives: Two-tier deduplication: distributed Bloom filter for fast reject, then exact Redis HSET check for the small fraction that passes the filter. Drives the URL Frontier Service architecture in §4.
Crawl freshness (top pages re-crawled within 24h)

The web's high-value head (news sites, social platforms, large retailers) changes hourly. A 24-hour freshness SLA for prioritised URLs means the crawl scheduler must distinguish between tier-1 pages (high change frequency, high importance) and tier-2/3 pages (low traffic, rarely updated). A uniform re-crawl schedule wastes crawl budget on static content while letting high-frequency pages go stale. This drives the priority queue and importance scoring discussed in §5.

What this drives: Per-URL TTL and importance score stored in url_records (§7). The re-crawl Scheduler reads these to compute next_crawl_at and push due URLs to the Frontier (§7, §9).
03

Capacity estimation

Before reaching for numbers, consider the shape of each dimension. Write throughput is pure fetch rate: easy to estimate, but it dominates bandwidth and storage planning. Storage compounds quickly: raw HTML at scale means petabytes, not terabytes. URL deduplication state is the unusual one here: the seen-URL set must fit in memory or we lose the ability to do fast deduplication. That constraint deserves its own estimate.

Crawl estimator (worked defaults)

Inputs: 1,000M pages/day, 100 KB average page, 30-day retention.

Metric Value Formula
Fetch QPS 11,574 requests/sec pages/day ÷ 86,400 s
Peak QPS (3×) 34,722 requests/sec fetch QPS × 3 (burst headroom)
Inbound bandwidth ~1.2 GB/s raw HTML fetch QPS × page size
Storage / day 100 TB raw content pages/day × page size (KB) ÷ 10⁹
Total storage 3.0 PB over retention storage/day (TB) × retention (days) ÷ 1,000
URL seen-set size ~53 GB (14 bits/URL) pages/day × retention × 14 bits ÷ 8 ÷ 10⁹
⚠️

The seen-URL Bloom filter memory estimate is the architectural forcing function. At 1B pages/day over 30 days, the crawl window accumulates 30B unique URLs. At 14 bits per URL that's ~53GB, which must live in RAM across an in-memory cluster (Redis). If it spills to disk, dedup latency blows up from microseconds to milliseconds, and at 11,574 QPS that becomes the bottleneck. Every architectural decision around the URL frontier traces back to keeping this set in RAM. (For web-scale estimates using the full 50B-page web graph, the same formula gives ~87GB; you'll see both numbers in the literature.)
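The estimator's arithmetic, recomputed as a sanity check (inputs mirror the defaults above; constant names are illustrative):

```python
# Back-of-envelope crawl estimates from §3, recomputed.
PAGES_PER_DAY = 1_000_000_000
PAGE_KB = 100
RETENTION_DAYS = 30
BITS_PER_URL = 14

fetch_qps = PAGES_PER_DAY / 86_400                   # ≈ 11,574 req/s
peak_qps = fetch_qps * 3                             # ≈ 34,722 req/s (burst headroom)
bandwidth_gb_s = fetch_qps * PAGE_KB / 1_000_000     # ≈ 1.2 GB/s raw HTML inbound
storage_tb_day = PAGES_PER_DAY * PAGE_KB / 10**9     # 100 TB/day raw content
total_pb = storage_tb_day * RETENTION_DAYS / 1_000   # 3.0 PB over retention
bloom_gb = PAGES_PER_DAY * RETENTION_DAYS * BITS_PER_URL / 8 / 10**9  # ≈ 53 GB
```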

04

High-level architecture

High-level architecture: seed URLs feed the frontier, workers fetch and parse, extracted links loop back asynchronously

Component breakdown

URL Frontier Service, almost every scalability decision in this design traces back here. The frontier queues URLs waiting to be crawled, organises them into per-domain priority buckets, and enforces politeness delays so no single domain gets hammered. Unlike a simple FIFO, it must know which domains are rate-limited before dispatching the next URL.

Fetcher Pool, a fleet of HTTP worker processes that pull URLs from the frontier, issue the actual HTTP/S request, and stream raw HTML back for processing. Fetchers maintain their own DNS cache (DNS round-trips at 11k req/s would become the bottleneck) and a per-domain robots.txt cache with a 24h TTL.

Parser / Link Extractor, receives raw HTML, extracts all anchor href values, normalises them (resolve relative paths, strip fragments, canonicalise scheme and host), and computes a SimHash fingerprint of the body for near-duplicate detection. New URLs discovered here loop back to the frontier; the content hash goes to the metadata store.

Dedup Store, guards the frontier against re-enqueuing URLs already crawled. A distributed Bloom filter (backed by an in-memory store like Redis) handles the fast path; an exact hash set backstops the small fraction the filter passes through. Any URL must clear both gates before entering the frontier.

Object Store, holds raw HTML blobs, referenced by URL hash. A distributed blob store (Amazon S3 or Google Cloud Storage) handles the write throughput and durability that a relational database's BLOB columns cannot, raw HTML at 11,000 writes/second and 100KB per page demands it.

Metadata DB, records per-URL state: crawl status, last fetch timestamp, content hash, link count, HTTP response code, ETag, and computed importance score. The re-crawl Scheduler reads it to compute the next fetch time for each URL, making it the control plane for the entire crawl lifecycle.

Architectural rationale

URL Frontier: distributed priority queue, not a single FIFO Core design choice

A single FIFO queue makes politeness enforcement intractable, you'd need to scan ahead to find the next URL from a domain that isn't rate-limited. A per-domain bucket design lets each domain's queue drain at its own rate. A durable message queue (Apache Kafka) with domain-keyed partitions is a natural fit: each partition maps to a domain range, and consumer groups naturally respect the rate limit by pausing consumption on a partition until the cooldown expires.

Tradeoff: Domain-partitioned queues can create hotspots; large domains (Wikipedia, Reddit) will have far more URLs than small ones, causing uneven partition load. Consistent hashing on domain name distributes across workers, but very large domains may still need sub-partitioning by URL path prefix at L7/L8.
Alternatives: RabbitMQ, Redis Sorted Set, SQS + delay queues
Object store for raw content, not a relational DB Storage architecture

Raw HTML blobs average 100KB and arrive at 11,000/second. A relational database storing BLOB columns at this write rate would require extreme vertical scaling and still hit I/O limits quickly. An object store designed for high-throughput sequential writes with 11 nines durability (like S3) stores the blobs cheaply and lets the metadata database remain lean — storing only the object key, not the content.

Tradeoff: Object stores are not queryable by content. Downstream systems (indexers, ML pipelines) must read by key and parse locally. At L5+ you'd discuss how a processing pipeline (Apache Spark or Flink) reads from the object store in batch.
Fetcher DNS caching: local TTL override Performance

At 11,574 req/s, every fetch needs a DNS lookup for the target hostname. Public resolvers have rate limits and introduce 10–100ms latency per lookup. Fetcher workers maintain a local DNS cache with a 5-minute TTL, overriding the authoritative TTL. This reduces external DNS queries by ~99% for popular domains. The tradeoff (DNS changes take up to 5 minutes to propagate) is acceptable for a crawler, where a brief burst of failed fetches is harmless compared to DNS becoming a bottleneck.
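The local DNS cache described above can be sketched as a TTL map. This is a minimal sketch: the class name is an assumption, and `resolve_fn` stands in for a real resolver call (e.g. `socket.getaddrinfo`):

```python
import time

class DnsCache:
    """In-process DNS cache with a fixed TTL override (sketch)."""

    def __init__(self, resolve_fn, ttl_seconds=300, clock=time.monotonic):
        self._resolve = resolve_fn
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}  # hostname -> (ip, expires_at)

    def lookup(self, hostname):
        now = self._clock()
        hit = self._cache.get(hostname)
        if hit and hit[1] > now:   # fresh entry: skip the external resolver
            return hit[0]
        ip = self._resolve(hostname)
        self._cache[hostname] = (ip, now + self._ttl)
        return ip
```

Injecting the clock makes the 5-minute override testable without waiting out real TTLs.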

Real-world comparison

Decision This design Common Crawl Google
Frontier queue Kafka domain partitions Custom priority queues Proprietary distributed queue
Raw content store Object store (S3-compatible) WARC files on S3 Colossus (GFS)
Deduplication Bloom filter + Redis exact hash URL canonicalisation + fingerprint Simhash near-duplicate at index time
Priority scoring PageRank-like importance score Domain whitelist tiers Multi-signal quality score
Politeness Per-domain token bucket Crawl-delay from robots.txt Adaptive rate per site
ℹ️

Common Crawl publishes its architecture and stores WARC-format crawl archives openly on S3, a useful reference for interview discussions about open data pipelines. Google's architecture is proprietary, but its published papers on Caffeine (their indexing pipeline) reveal the general shape.
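The per-domain token bucket from the politeness row above can be sketched as follows. The class name and the defaults (1 req/s, burst of 1) are illustrative assumptions:

```python
class DomainTokenBucket:
    """Per-domain politeness limiter: refill `rate` tokens/sec up to `burst` (sketch)."""

    def __init__(self, rate=1.0, burst=1.0):
        self.rate, self.burst = rate, burst
        self._state = {}  # domain -> (tokens, last_refill_timestamp)

    def try_acquire(self, domain, now):
        tokens, last = self._state.get(domain, (self.burst, now))
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self._state[domain] = (tokens - 1.0, now)
            return True        # OK to fetch from this domain now
        self._state[domain] = (tokens, now)
        return False           # still cooling down: requeue the URL
```

Each domain's bucket is independent, which is exactly the invariant the partitioned frontier enforces at scale.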

05

Crawl strategy & URL deduplication

Two questions dominate this section: in what order should URLs be fetched? and how do you know you've already seen a URL? They're related: the frontier algorithm determines how often URLs arrive at the Dedup Store, and the Dedup Store must keep up. At 11k QPS, that means sub-millisecond lookups; anything that falls through to a 10ms database read becomes the system's throughput ceiling.

① BFS (Breadth-First Search): process all level-N pages before level N+1. ✔ Finds authoritative pages first.
② DFS (Depth-First Search): dive deep into one branch before returning. ✗ Exhausts one domain before exploring others.
BFS discovers broad, high-authority pages early. DFS risks over-crawling one domain while missing others entirely. Our choice: BFS with priority queue.

Our choice for this system: BFS with priority scoring

Standard BFS discovers pages linked by many other pages (high-authority pages) early in the crawl. Adding a priority score, based on estimated page importance, change frequency, and link depth, turns the FIFO BFS queue into a best-first crawl that approximates what search engines actually do. A newly discovered URL gets a base priority; crawl history and inbound link count update it over time.

💡

URL normalisation, a classic L5 follow-up. Before any URL reaches the Dedup Store it must be normalised, or semantically identical pages create duplicate entries. The six standard steps: (1) lowercase the scheme and host (HTTP://Example.COM → http://example.com); (2) strip default ports (:80, :443); (3) remove fragments (#section; fragments are client-side only, the server returns the same page regardless); (4) resolve relative path segments (/a/../foo → /foo); (5) sort query parameters alphabetically so ?b=2&a=1 and ?a=1&b=2 map to the same fingerprint; (6) strip known tracking parameters (utm_source, fbclid). Normalisation happens in the parser before the URL hash is computed; the hash stored in the Dedup Store and url_records is always of the normalised form.
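The six steps can be sketched as a small normaliser. The function name and the tracking-parameter list are illustrative assumptions:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode
import posixpath

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}

def normalise_url(url: str) -> str:
    """Apply the six normalisation steps before hashing (sketch)."""
    scheme, netloc, path, query, _frag = urlsplit(url)  # (3) fragment dropped
    host = netloc.lower()                               # (1) lowercase host
    if scheme == "http" and host.endswith(":80"):       # (2) strip default ports
        host = host[:-3]
    elif scheme == "https" and host.endswith(":443"):
        host = host[:-4]
    path = posixpath.normpath(path) if path else "/"    # (4) resolve ../ segments
    if path == ".":
        path = "/"
    params = [
        (k, v) for k, v in parse_qsl(query, keep_blank_values=True)
        if k not in TRACKING_PARAMS                     # (6) drop tracking params
    ]
    params.sort()                                       # (5) sort query params
    return urlunsplit((scheme, host, path, urlencode(params), ""))
```

Note `urlsplit` already lowercases the scheme; the sketch also collapses trailing slashes via `normpath`, a simplification a production normaliser would make configurable.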

URL deduplication: two-tier approach

Tier Mechanism Latency False positive rate
Tier 1 (fast reject) Distributed Bloom filter (in-memory, Redis cluster) ~0.5ms ~0.1% at 14 bits/element (configurable)
Tier 2 (exact check) Redis HSET with URL SHA-256 fingerprint ~2ms 0% (exact match)

Tier 1 efficiently handles the vast majority of checks. Every URL the frontier wants to enqueue passes through Tier 1 first; if the Bloom filter says "seen", the URL is dropped immediately without hitting Redis. If the filter says "not seen", the URL proceeds to Tier 2 for an exact confirmation. The false positive case, where the Bloom filter incorrectly reports a genuinely new URL as already-seen, causes that URL to be silently skipped. At a ~0.1% false positive rate this is acceptable crawl budget loss (roughly 1M missed URLs per day at 1B pages/day scale). Tier 2 is exact: a Redis HSET lookup on the normalised URL fingerprint, with a 0% false positive rate. A Bloom filter never produces false negatives, so any URL that passes both tiers is definitively new.
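A minimal in-memory sketch of the two-tier flow. The bitmap and set stand in for RedisBloom and a Redis hash; the sizes and the SHA-256-slicing hash scheme are illustrative assumptions:

```python
import hashlib

class TwoTierDedup:
    """Tier-1 Bloom filter + tier-2 exact set, as described in §5 (sketch)."""

    def __init__(self, bits=1 << 20, hashes=10):
        self.bits, self.hashes = bits, hashes
        self.bitmap = bytearray(bits // 8)  # tier 1: probabilistic fast reject
        self.exact = set()                  # tier 2: zero false positives

    def _positions(self, url):
        digest = hashlib.sha256(url.encode()).digest()
        for i in range(self.hashes):        # derive k bit positions from one digest
            yield int.from_bytes(digest[i*3:i*3+3], "big") % self.bits

    def seen_before(self, url):
        """True if the URL should be dropped; otherwise record it as seen."""
        if all(self.bitmap[p // 8] >> (p % 8) & 1 for p in self._positions(url)):
            return True                     # bloom says "seen" (possibly a false positive)
        if url in self.exact:
            return True                     # tier-2 exact confirmation
        for p in self._positions(url):      # mark as seen in both tiers
            self.bitmap[p // 8] |= 1 << (p % 8)
        self.exact.add(url)
        return False
```

In production the two tiers live in Redis (BF.EXISTS, then an HSET lookup); the control flow is the same.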

Content-level deduplication with SimHash Advanced, L5+

URL deduplication catches exact URL repeats, but the web is full of near-duplicate content: mirror sites, printer-friendly versions, paginated archives. SimHash converts a page's token set into a 64-bit fingerprint where similar documents produce fingerprints that differ in only a few bits (Hamming distance ≤ 3). Storing SimHash values in the metadata DB and checking Hamming distance on fetch lets the parser flag near-duplicates without re-indexing their content. Google's Detecting Near-Duplicates for Web Crawling (2007) is the primary reference.

Tradeoff: Hamming-distance search over 50B stored fingerprints is non-trivial. In practice, SimHash bits are partitioned into bands and each band is indexed; only fingerprints sharing at least one band need exact comparison. This reduces the search to a hash lookup rather than a scan.
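A minimal SimHash sketch under simplifying assumptions (unweighted tokens, MD5 as the per-token hash; production implementations weight tokens by frequency):

```python
import hashlib

def simhash64(tokens):
    """64-bit SimHash over a token list (sketch)."""
    weights = [0] * 64
    for tok in tokens:
        h = int.from_bytes(hashlib.md5(tok.encode()).digest()[:8], "big")
        for bit in range(64):
            # each token votes +1/-1 on every bit position
            weights[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if weights[bit] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def near_duplicate(a, b, threshold=3):
    """Pages are near-duplicates if fingerprints differ in <= threshold bits."""
    return hamming(a, b) <= threshold
```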
🔍

One open question: what's the false-positive rate for a Bloom filter over 50B URLs at 14 bits/element? Answer: ~0.1%. The formula is p = (1 − e^(−kn/m))^k, where k = 10 hash functions, n = 50B URLs, m = 700B bits. Remember the direction: a false positive means the filter says "seen" for a URL that hasn't actually been crawled, causing a valid new URL to be silently skipped. That's the acceptable loss. A false negative (saying "not seen" for a URL that has been crawled) cannot happen with a Bloom filter by design. Know both the formula and the direction: interviewers probe both.
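Plugging the numbers into the formula, as a quick check:

```python
import math

def bloom_fp_rate(n_items, m_bits, k_hashes):
    """False-positive probability: p = (1 - e^(-k*n/m))^k."""
    return (1 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes

n = 50_000_000_000        # 50B URLs
m = n * 14                # 14 bits per element -> 700B bits
p = bloom_fp_rate(n, m, 10)
# p ≈ 0.0012, i.e. roughly 0.1%. Pushing down to 0.01% needs
# roughly 20 bits/element with k ≈ 13 optimal hash functions.
```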

05b

API & protocol design

A web crawler has two external-facing protocol surfaces: the HTTP protocol it speaks to target websites, and the internal API that external systems (search indexer, ML pipelines) use to query or drive the crawl. Both matter in an interview.

External: HTTP fetching protocol

Fetchers issue standard HTTP/1.1 or HTTP/2 GET requests. Key headers the crawler must send and respect:

// Outbound request headers (what we send)
GET /page HTTP/1.1
Host: example.com
User-Agent: MyCrawler/1.0 (+https://mysite.com/crawler)
Accept: text/html,application/xhtml+xml
Accept-Encoding: gzip, deflate, br
If-None-Match: "etag-from-last-crawl"       // conditional GET, avoid re-downloading if unchanged
If-Modified-Since: Tue, 01 Apr 2025 12:00:00 GMT

// Response handling
200 OK       → parse and store
301/302      → follow redirect (max 5 hops to prevent loops)
304 Not Modified → skip storage, update last-checked timestamp only
429 Too Many Requests → back off exponentially, respect Retry-After header
5xx          → retry with jitter, increment failure counter
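The 429/5xx handling above can be sketched as a backoff policy. The function name and the base/cap values are illustrative assumptions; a 429's Retry-After header always wins:

```python
import random

def retry_delay_seconds(attempt, retry_after_header=None, base=1.0, cap=300.0):
    """Backoff policy for 429/5xx responses (sketch)."""
    if retry_after_header is not None:
        return float(retry_after_header)       # server-specified cooldown (429)
    ceiling = min(cap, base * (2 ** attempt))  # 1s, 2s, 4s, ... capped
    return random.uniform(0, ceiling)          # full jitter avoids thundering herd
```

Full jitter (uniform over [0, ceiling]) rather than fixed exponential steps prevents a fleet of fetchers from retrying a struggling origin in lockstep.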
🔒

HTTPS / TLS handling. Most of the web now serves HTTPS. Fetchers must decide how to handle certificate errors: an invalid, self-signed, or expired certificate on an otherwise-legitimate site. Most production crawlers accept TLS errors with a flag (--tls-skip-verify equivalent) rather than hard-failing, since many small sites have certificate misconfigurations that aren't a sign of malicious intent. The crawl metadata should record tls_error: bool so downstream systems can downgrade trust scores for these pages. Hard-failing on TLS errors would exclude a significant fraction of the crawlable web.

Internal: Crawl Management API

The system exposes a REST API for operators and downstream consumers. Core endpoints:

POST /crawl/seeds, Submit seed URLs to bootstrap a crawl.

// Request
{
  "urls": ["https://example.com", "https://other.com"],
  "priority": "high",          // "high" | "normal" | "low"
  "crawl_depth": 3,            // max BFS depth from seed, null = unlimited
  "recrawl_ttl_hours": 24
}

// Response: 202 Accepted
{
  "job_id": "job_abc123",
  "queued_urls": 2,
  "estimated_start": "2026-04-01T12:00:05Z"
}

GET /pages/{url_hash}, Retrieve crawl metadata for a specific URL.

// Response: 200 OK
{
  "url": "https://example.com/page",
  "url_hash": "sha256:abc123...",
  "last_fetched_at": "2026-04-01T10:00:00Z",
  "http_status": 200,
  "content_hash": "md5:def456...",
  "simhash": "0xA3F2...",
  "content_length_bytes": 102400,
  "outbound_link_count": 47,
  "storage_key": "s3://crawl-bucket/pages/abc123.html.gz",
  "crawl_status": "indexed"      // "queued" | "fetching" | "indexed" | "failed"
}
Endpoint Level Purpose
POST /crawl/seeds L3/L4 Bootstrap crawl with seed URLs
GET /pages/{url_hash} L3/L4 Fetch metadata for a crawled URL
DELETE /pages/{url_hash} L5 Remove URL from crawl (DMCA, robots.txt change)
GET /crawl/stats L5 Crawl throughput, queue depth, error rates
POST /crawl/recrawl L6 Trigger immediate re-crawl of a URL or domain
GET /pages/{url_hash}/links L7/L8 Retrieve extracted link graph for downstream ranking
06

Core crawl flow

The dominant path is the single-URL fetch cycle. Understanding each decision point, especially what happens on a cache hit, a redirect, or a robots.txt denial, is what separates an L3 answer from an L5 one. The key decision the NFR forces is conditional GET: the crawl freshness requirement (§2) demands we re-visit pages, but the storage NFR says we shouldn't re-store unchanged content. Conditional GET via ETags or If-Modified-Since resolves this.

1. Frontier dequeues a URL.
2. Check robots.txt (cached 24h per domain). If disallowed: mark skipped, done.
3. Politeness scheduler: wait if <1s since the last fetch from this domain.
4. Issue a conditional HTTP GET (If-None-Match / If-Modified-Since).
5. On 304 (unchanged): update the last-checked timestamp only; no storage write.
6. On 200 OK: parse the HTML, extract links, compute the SimHash.
7. Dedup each extracted link: Bloom filter, then Redis exact check.
8. Enqueue new URLs to the frontier; write content to the object store and metadata DB.
9. Done; ready for the next URL.
Core crawl loop. The 304 conditional-GET branch is the key insight: it satisfies the freshness NFR (§2) while avoiding unnecessary storage writes.
💡

Without conditional GET, every re-crawl stores a full copy of every page regardless of whether anything changed. At 1B pages/day re-crawled every 24 hours with a 100KB average page size, that's an extra 100TB of redundant writes per day, doubling the original crawl volume. ETag-based conditional GET turns most of those into 304 responses with no body: a timestamp update in Cassandra and nothing written to the object store. The candidate gotcha: many servers (especially older CMSes and static file servers) don't implement ETags or Last-Modified headers, so the crawler must detect the absence of these headers and fall back to comparing the new response body's MD5 against the stored content hash to decide whether to overwrite.
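The fallback logic in the gotcha can be sketched as a pure decision function (the name `should_store` is an assumption for illustration):

```python
import hashlib

def should_store(status, body, stored_md5):
    """Decide whether a re-crawl response needs a new object-store write (sketch).

    Servers without ETag/Last-Modified always return 200, so on 200 we
    compare the body's MD5 against the stored content hash.
    Returns (write_needed, content_md5_to_record).
    """
    if status == 304:
        return False, stored_md5      # conditional-GET hit: touch timestamp only
    new_md5 = hashlib.md5(body).hexdigest()
    if new_md5 == stored_md5:
        return False, stored_md5      # 200 but unchanged content: skip the write
    return True, new_md5              # content changed: overwrite blob + metadata
```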

07

Data model

There are three distinct entities in the crawl system. The differences in how each is used end up shaping the schema significantly, particularly the choice of key space and the need for certain index patterns.

Entities and access patterns

Operation Entity Frequency Query shape
Check if URL seen URL record 11,000/s Point lookup by URL hash
Write crawl result URL record + content 11,000/s Insert / update by URL hash
Read next URLs to crawl Scheduler table (Redis sorted set) 11,000/s Range scan by next_crawl_at score, served from dedicated scheduler store, not url_records
Retrieve raw HTML Content blob ~1,000/s (indexer) Point read by storage key
Find outbound links from page Link edge ~100/s (ranking) Range scan by source URL hash

Two things stand out. First, dedup lookups and writes run at the same rate as fetches (11,000/s), so the Dedup Store must handle that write rate without becoming the bottleneck. Second, the "next URLs to crawl" access pattern is the tricky one: it requires a range scan ordered by time, which Cassandra cannot serve efficiently from url_records; a WHERE next_crawl_at < now() query across all partitions is a full-cluster scatter-gather, not a range query. This forces a dedicated scheduling data structure, described below.

Schema

url_records, the metadata control plane (wide-column store like Apache Cassandra or AWS DynamoDB, sharded by url_hash):

-- url_records (Cassandra CQL, partitioned by url_hash)
CREATE TABLE url_records (
  url_hash        TEXT PRIMARY KEY,  -- SHA-256 of normalised URL
  url             TEXT,              -- full normalised URL
  domain          TEXT,              -- extracted hostname, used for partition routing
  status          TEXT,              -- queued | fetching | indexed | failed | skipped
  http_status     INT,               -- last HTTP response code
  last_fetched_at TIMESTAMP,
  next_crawl_at   TIMESTAMP,         -- computed from TTL + importance
  content_hash    TEXT,              -- MD5 of body bytes
  simhash         BIGINT,            -- 64-bit near-duplicate fingerprint
  storage_key     TEXT,              -- S3 object key for raw HTML
  importance      FLOAT,             -- crawl priority score [0.0–1.0]
  etag            TEXT,              -- from last response, used in conditional GET
  fail_count      INT                -- consecutive failures; triggers backoff
);

link_edges, the link graph (wide-column store, sharded by source_hash):

-- link_edges (Cassandra CQL, composite primary key)
CREATE TABLE link_edges (
  source_hash   TEXT,
  target_url    TEXT,           -- clustering column: range scan of all links from a page
  anchor_text   TEXT,
  discovered_at TIMESTAMP,
  PRIMARY KEY (source_hash, target_url)
);

scheduler_queue, the re-crawl scheduling index (a Redis sorted set, one per domain tier):

-- Redis sorted set: score = next_crawl_at Unix timestamp, member = url_hash
-- Allows efficient: ZRANGEBYSCORE scheduler:tier1 0 {now} LIMIT 0 500
-- The Scheduler reads a batch of due URLs, pushes them to the Frontier (Kafka), marks status=fetching in url_records
-- Separate sets per tier (tier1 = 1h TTL, tier2 = 24h, tier3 = 7–30 days)
ZADD scheduler:tier1 {next_crawl_at_unix} {url_hash}
ZRANGEBYSCORE scheduler:tier1 0 {now_unix} LIMIT 0 500   -- fetch next batch due
ZREM scheduler:tier1 {url_hash}                          -- remove after enqueue to Kafka

content blobs, raw HTML stored in an object store, keyed by url_hash:

-- Object store key pattern
s3://crawl-raw/{yyyy}/{mm}/{dd}/{url_hash}.html.gz
Why Cassandra for url_records, not PostgreSQL? DB choice rationale

PostgreSQL at 11,000 writes/second per table is achievable with connection pooling, but adding read throughput of the same magnitude pushes a single instance to its limits around 20–30k operations/second on commodity hardware. Horizontal scaling via read replicas helps for reads, but writes still go through one primary. Cassandra's (or DynamoDB's) multi-primary, leaderless replication lets you add nodes linearly as write throughput grows, matching the crawl worker scaling model. The tradeoff is that Cassandra doesn't support arbitrary range queries across partitions; the access pattern table above was specifically designed to avoid the ones we can't support.

Tradeoff: Cassandra provides eventual consistency. A URL could be enqueued by two workers in a small window if the write hasn't propagated. The Dedup Store (Bloom filter + Redis exact-hash) is the primary guard against this; Cassandra is the durable record of truth, not the real-time dedup layer.
08

Caching strategy

Caching in a web crawler works differently from most systems: the crawler is more consumer than server. Most systems design caches to serve many readers from few writes; this system writes constantly and caches to reduce the cost of its own outbound reads. The latency target is sustained crawl throughput of 11,000 req/s, not user-facing page load time. Every cache miss that falls through to a slower tier directly reduces how many pages per day the system can process.

Cache hierarchy anchored to the §4 architecture. DNS and robots.txt caches prevent repeated fetches for the same domain; Bloom filter and URL exact set are the dedup layer.
Cache Backing store TTL Why it exists
DNS lookup cache In-process (per worker) 5 min Public resolvers rate-limit and add 10–100ms latency at 11k req/s
robots.txt cache Shared in-memory store (Redis) 24h robots.txt rarely changes; re-fetching per-URL wastes bandwidth
URL Bloom filter Redis cluster (~53GB at 30-day window) None, entries never expire Fast first-pass dedup; keeps ~99.99% of seen-URL checks under 1ms. Write-only: bits are set but never cleared.
URL exact hash set Redis HSET None, entries never expire Exact dedup confirmation for URLs the Bloom filter passes through. Zero false-positive rate.

Invalidation

DNS and robots.txt caches expire on TTL; there is no active invalidation. When a site updates its robots.txt, the cached version is stale for up to 24 hours; this is a known tradeoff (the alternative is fetching robots.txt on every request). The Bloom filter and URL hash set are write-only; there is no invalidation, by design. If a URL needs to be forcibly re-crawled (e.g. DMCA removal followed by re-addition), operators mark it eligible in Cassandra and the re-crawl Scheduler inserts a fresh entry into the Frontier, bypassing the Dedup Store check.

09

Deep-dive: scalability

The single-machine design breaks at three specific points as load increases: the URL frontier becomes a coordination bottleneck when a single queue must enforce per-domain politeness across hundreds of workers; the seen-URL set outgrows available RAM and falls back to disk; and the re-crawl scheduler needs a data structure that can serve time-ordered range queries at 11k QPS, something a wide-column store's partitioning model can't provide. The production architecture addresses each with a targeted change.

Production-scale architecture: Kafka frontier partitioned by domain, per-partition fetcher pools, Redis Bloom cluster, Cassandra ring, and async distributed batch processing pipeline (Apache Spark)
Frontier partitioning: domain-hash sharding across Kafka partitions L5+

Each Kafka partition handles a contiguous range of domain name hashes. A fetcher consumer group is pinned to a partition range, so all URLs for a given domain are always consumed by the same worker pool. This guarantees the politeness invariant without coordination: only one pool is ever actively fetching from a domain. Rebalancing when workers are added or removed is handled by Kafka consumer-group rebalancing; domains remapped to new workers pause for a politeness interval before resuming.

Tradeoff: Large domains (e.g. Wikipedia with millions of article URLs) can overwhelm a single partition. A secondary sub-partitioning by URL path prefix handles this, but requires a custom partitioner and adds complexity; worth raising at L7/L8 only.
Distributed Bloom filter over Redis cluster L5+

The seen-URL Bloom filter at ~53GB (for a 30-day crawl window of 30B URLs at 14 bits/URL — see §3) must live in RAM to deliver sub-millisecond lookups. A Redis cluster with 2–3 primary nodes provides headroom with room to grow as the retention window widens. The Redis Bloom filter module (RedisBloom / Redis Stack) implements the filter natively with BF.ADD / BF.EXISTS commands. A consistent hash of the URL hash selects which Redis node holds the filter bit range, distributing load evenly. When the filter fills (false-positive rate degrades past threshold), a new filter generation is created and both generations are checked during a transition window.

Re-crawl scheduling: TTL + change frequency model L6+

Re-crawl frequency is a crawl budget problem. Crawling everything at equal intervals wastes crawl budget on static pages while missing dynamic ones. The Scheduler uses a dedicated Redis sorted set (not a Cassandra range query, see §7) where the score is the Unix timestamp of next_crawl_at and the member is the URL hash. A ZRANGEBYSCORE scheduler:tier1 0 {now} call efficiently fetches the next batch of due URLs in O(log N + K) time. Once a URL is dispatched to the Frontier, it is removed from the sorted set with ZREM and its status in url_records is set to fetching.

The Scheduler maintains three separate sorted sets by tier: tier1 (1h TTL, high-importance/high-frequency pages), tier2 (24h TTL), and tier3 (7–30 day TTL, static content). Change frequency is derived from historical content hash diffs; importance score from estimated PageRank. When a crawl completes, the Scheduler re-inserts the URL with a new score: now + TTL(importance, change_freq).
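The TTL function at the end of that loop can be sketched as follows. The tier thresholds and change-frequency cutoffs are illustrative assumptions matching the 1h/24h/7–30-day tiers above:

```python
def next_crawl_at(now_unix, importance, change_freq_per_day):
    """Compute the re-crawl score for the Redis sorted set (sketch)."""
    if importance > 0.8 and change_freq_per_day >= 1.0:
        ttl_hours = 1            # tier 1: important, changes at least daily
    elif change_freq_per_day >= 1 / 7:
        ttl_hours = 24           # tier 2: changes at least weekly
    else:
        ttl_hours = 7 * 24       # tier 3: effectively static (7-day lower bound)
    return now_unix + ttl_hours * 3600
```

The returned value is exactly the score passed to `ZADD scheduler:tierN {score} {url_hash}` when a completed crawl is re-inserted.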

Why not Cassandra for this? A WHERE next_crawl_at < now() query across all Cassandra partitions is a full-cluster scatter-gather, not a range query; Cassandra secondary indexes on high-cardinality timestamp columns fan out to every node. A Redis sorted set delivers O(log N) range queries and handles 11k writes/s on a single node, and the Scheduler's tier split keeps each set under ~1B members at any time.
Geographic distribution for latency reduction L7/L8

A crawler in US-East fetching a Japanese news site incurs 200–300ms RTT per page. Deploying fetcher pools in multiple geographic regions (US, EU, APAC) cuts latency to 20–40ms for regional domains. Domain-to-region assignment is done by TLD and geo-IP of the target server's DNS response. The frontier is replicated across regions with cross-region URL dedup via a globally replicated key-value store (Google Spanner or DynamoDB Global Tables). Content is written to the nearest regional object store with cross-region replication.
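The TLD half of that assignment is a lookup table; a minimal sketch (the table below is illustrative, and a real system would combine it with geo-IP of the resolved server address):

```python
# Hypothetical TLD-to-region table; incomplete by design.
TLD_REGION = {"jp": "apac", "kr": "apac", "de": "eu", "fr": "eu", "uk": "eu"}
DEFAULT_REGION = "us-east"


def region_for(domain: str) -> str:
    """Pick the fetcher region for a domain from its top-level domain."""
    tld = domain.rsplit(".", 1)[-1].lower()
    return TLD_REGION.get(tld, DEFAULT_REGION)
```

For gTLDs like .com, where the TLD says nothing about server location, the geo-IP signal does the real work and this table is only a fast first guess.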

Tradeoff: Cross-region dedup synchronisation adds 100–200ms latency to the dedup check path for URLs first discovered in one region and validated in another. In practice the Bloom filter handles the fast path and the exact hash store the slow path; acceptable given how rare the cross-region dedup scenario is.
10

Failure modes & edge cases

| Scenario | Problem | Solution | Level |
| --- | --- | --- | --- |
| Fetcher worker crashes mid-fetch | URL is dequeued from Kafka but content never stored | Kafka consumer commits offset only after successful write to object store. Crash before commit triggers redelivery from last committed offset. | L3/L4 |
| Target site returns infinite redirect loop | Fetcher follows redirects indefinitely, exhausting threads | Hard cap of 5 redirect hops per fetch; mark URL as failed and blacklist the redirect target if a loop is detected. | L3/L4 |
| Spider trap: dynamically generated infinite URLs | Sites like calendars generate infinite unique URLs (e.g. /cal?date=2099-01-01), exhausting the frontier | Per-domain URL count cap; detect repeated URL structure patterns (query parameter explosion); apply a path-depth limit per domain. | L5 |
| Redis Bloom filter node failure | Bloom filter lookup throws; duplicate fetch rate spikes | Fail open (treat URL as unseen) and fall through to Cassandra exact-hash check. Redis cluster replication recovers within seconds in steady state. | L5 |
| Kafka partition rebalance during politeness window | Domain moves to a new fetcher consumer group; the new consumer immediately fetches, violating the ≤1 req/s rule | On consumer group join, each worker reads the last-fetch timestamp from Cassandra for all domains in its new partition and inserts a politeness delay before resuming. | L5 |
| Content farm: millions of near-duplicate pages | Low-quality content consumes crawl budget; SimHash cluster is overwhelmed | Per-domain SimHash collision rate tracked in metadata. Domains exceeding the threshold have crawl budget capped at 10% of normal allocation and are flagged for manual quality review. | L7/L8 |
| Legal / DMCA takedown of a crawled URL | Crawled content must be removed from storage and future crawls blocked | URL marked with status=blocked in Cassandra; object store blob deleted; URL added to a permanent denylist checked before any frontier enqueue. This is an operator action via the DELETE API (§5b). | L7/L8 |
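The spider-trap mitigation (per-domain caps, query parameter explosion, path-depth limits) can be sketched as a pre-enqueue filter. The limits and class name here are assumptions for illustration:

```python
from collections import Counter
from urllib.parse import urlsplit

MAX_PATH_DEPTH = 8            # assumed per-domain path-depth limit
MAX_URLS_PER_PATTERN = 1000   # assumed cap per structural URL pattern


class TrapDetector:
    """Flags likely spider traps by counting URLs per structural pattern."""

    def __init__(self):
        self.pattern_counts = Counter()

    def allow(self, url: str) -> bool:
        parts = urlsplit(url)
        path_depth = len([p for p in parts.path.split("/") if p])
        if path_depth > MAX_PATH_DEPTH:
            return False
        # Pattern = host + path + sorted query keys (values dropped), so
        # /cal?date=2099-01-01 and /cal?date=2099-01-02 count as one pattern.
        keys = sorted(k.split("=")[0] for k in parts.query.split("&") if k)
        pattern = (parts.netloc, parts.path, tuple(keys))
        self.pattern_counts[pattern] += 1
        return self.pattern_counts[pattern] <= MAX_URLS_PER_PATTERN
```

A calendar trap generates unbounded URLs that differ only in query values, so its pattern count blows past the cap quickly while legitimate pages, whose patterns are diverse, are unaffected.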
11

How to answer by level

L3 / L4 SDE I / SDE II, Can you build a working system?
What good looks like
  • Describe the fetch-parse-enqueue loop clearly
  • Know what robots.txt is and that you must check it
  • Identify a simple dedup mechanism (hash set)
  • Pick a queue for the frontier and explain why
  • Mention object store for raw HTML
What separates L3 from L5
  • L3 uses a global hash set without thinking about size
  • L3 ignores politeness rate limiting entirely
  • L3 treats the frontier as a simple FIFO queue
  • L3 doesn't consider re-crawl freshness
L5 Senior, Do you understand the tradeoffs?
What good looks like
  • Defend BFS with priority queue over DFS
  • Size the Bloom filter (14 bits/URL → ~53GB for a 30-day crawl window) and explain why Redis
  • Design per-domain politeness buckets, not just "add a delay"
  • Explain conditional GET (ETags, 304) for re-crawls
  • Know why Cassandra over PostgreSQL at this write rate
What separates L5 from L6
  • L5 knows Kafka but doesn't partition by domain
  • L5 mentions DNS caching but not that it needs override TTL
  • L5 doesn't handle frontier rebalance + politeness edge case
L6 Staff, Can you own this end-to-end?
What good looks like
  • Own the contract between crawler and downstream indexer: schema versioning, delivery guarantees, object store path conventions
  • Define the crawl freshness SLA end-to-end: not just re-crawl TTL, but how latency in the crawl-to-index pipeline affects search result currency
  • Identify that the scheduler (Redis sorted set) is a separate component from url_records, and explain why Cassandra range queries can't serve this access pattern
  • Address the Kafka rebalance + politeness invariant edge case (§10) without prompting
  • Design operational runbooks: what breaks first under load, and what's the graceful degradation path?
What separates L6 from L7/L8
  • L6 treats crawl budget as a technical problem; L7 frames it as a business ROI problem, which sites are worth more crawl budget and why
  • L6 handles DMCA as a feature; L7 designs the compliance pipeline proactively (denylist, audit trail, legal hold)
  • L6 ignores geographic distribution and cross-region dedup consistency tradeoffs
  • L6 doesn't question whether to build vs. license Common Crawl data
L7 / L8 Principal / Distinguished, Should we build this, and how?
What good looks like
  • Frame crawl budget as a resource allocation problem with ROI tradeoffs
  • Design geographic distribution + cross-region dedup synchronisation
  • Address anti-spam, content farm detection, and quality scoring at system level
  • Know that robots.txt has no legal force in most jurisdictions, it is a convention, not a contract. Violating it has nonetheless led to litigation (hiQ v. LinkedIn, Sandvig v. Barr). The correct framing: respect it by policy, document that policy, and maintain an audit trail
  • Address GDPR: raw HTML stored in the object store may contain EU-resident PII. Data retention policy for crawled content, right-to-erasure pipeline, and regional storage boundaries are real L7/L8 design concerns
  • Question the requirements: does the company need a full crawler or can it license Common Crawl data for far less engineering cost?
Common gaps even at L7
  • Treating all pages as equally valuable (no importance scoring)
  • Not connecting crawl freshness to downstream search ranking quality
  • Ignoring that a broken crawler is a legal liability, not just a performance issue

Classic interview probes

| Probe question | L3/L4 | L5/L6 | L7/L8 |
| --- | --- | --- | --- |
| "How do you avoid crawling the same URL twice?" | In-memory hash set of seen URLs | Bloom filter front-tier + Redis exact-hash back-tier; size the filter at 14 bits/URL | Two-tier + SimHash for near-duplicate content; discuss Bloom filter generation rotation |
| "How do you respect a website's crawl limits?" | Add a sleep() between requests | Per-domain token bucket; read the robots.txt Crawl-delay directive; cache robots.txt for 24h | Adaptive rate per domain based on server response latency; per-domain failure tracking; back off on 429 |
| "How do you keep popular pages fresh?" | Re-crawl everything every N hours | Per-URL TTL derived from change-frequency history; conditional GET via ETags for 304 optimisation | Multi-signal freshness model: change frequency + importance + change impact on ranking; separate high-frequency pipeline for tier-1 domains |
| "What happens when your crawler crashes?" | Restart and re-crawl everything | Kafka offset commit after write; uncommitted URLs redelivered; Cassandra as durable state | Idempotent write design (content-hash dedup prevents duplicate storage); partial crawl checkpointing; graceful drain on shutdown |
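The per-domain token bucket named in the politeness probe is compact enough to sketch in full. A minimal in-memory version, assuming a 1 req/s default rate (a production crawler would keep this state per frontier partition):

```python
import time


class DomainTokenBucket:
    """Per-domain token bucket enforcing at most `rate` requests/second."""

    def __init__(self, rate: float = 1.0, burst: float = 1.0):
        self.rate, self.burst = rate, burst
        self.tokens = {}  # domain -> tokens remaining
        self.last = {}    # domain -> timestamp of last refill

    def try_acquire(self, domain: str, now: float = None) -> bool:
        now = time.monotonic() if now is None else now
        # Refill proportionally to elapsed time, capped at the burst size.
        tokens = self.tokens.get(domain, self.burst)
        tokens = min(self.burst,
                     tokens + (now - self.last.get(domain, now)) * self.rate)
        self.last[domain] = now
        if tokens >= 1.0:
            self.tokens[domain] = tokens - 1.0
            return True
        self.tokens[domain] = tokens
        return False
```

A fetcher calls `try_acquire(domain)` before each request and requeues the URL (or sleeps) on `False`; because buckets are keyed by domain, a slow site never throttles the rest of the fleet.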
How the pieces connect
Tracing each NFR to the architectural decision it forced
1. Politeness NFR (§2) → per-domain rate limits required → frontier must be partitioned by domain (§4) → Kafka with domain-hash partitions (§9)
2. Dedup accuracy NFR (§2) → seen-URL set must stay in RAM for <1ms lookups → ~53GB Bloom filter on Redis cluster (§3, §8) → Redis exact-hash gate confirms URLs the filter passes through (§5)
3. Crawl freshness NFR (§2) → re-crawling without wasted storage writes → conditional GET via ETags (§6) → ETag field in url_records schema (§7)
4. Storage scale (§3 estimate: 3 PB / 30 days) → relational DB BLOB columns ruled out → object store (S3) for raw HTML (§4) → metadata DB stores only the storage_key reference (§7)
5. Write throughput (§3: 11k req/s) → PostgreSQL single-primary write ceiling approached → wide-column store (Cassandra) with linear write scaling (§7) → partition key = url_hash, consistent hashing across a 6-node ring (§9)
6. Availability NFR (§2: 99.9%) → worker crashes must not lose queued URLs → Kafka offset committed only after write success (§10) → idempotent content writes via content-hash dedup prevent duplicate storage on replay