Design a Web Crawler
Straightforward to describe, surprisingly complex to scale, a billion-page crawl touches nearly every distributed systems concept.
What the interviewer is testing
A web crawler is deceptively approachable: "fetch pages, extract links, repeat" is a description short enough to fit in a tweet. The interview question is a trap for exactly that reason: the system looks trivial until you start asking the hard questions. How do you avoid crawling the same URL twice across a fleet of 500 workers? How do you avoid taking down small websites? How do you prioritise the 1% of URLs that account for 50% of the value?
The question tests three specific things that don't come up in most other system design problems:
Graph traversal at scale. The web is a directed graph with ~50 billion nodes and trillions of edges. Choosing BFS vs DFS isn't academic: it determines whether your crawl discovers important pages in hours or months.
Distributed coordination without a central bottleneck. A single seen-URL set that every worker writes to will become the choke point. Candidates who don't think about this usually hit it around the "scalability" section.
Being a good citizen. A crawler that ignores robots.txt and hammers domains will get IP-blocked and potentially create legal exposure. Politeness isn't a nice-to-have; it's architecturally required.
| Level | What good looks like |
|---|---|
| L3/L4 | Describe the fetch-parse-enqueue loop, basic deduplication, robots.txt compliance |
| L5 | Defend BFS frontier choice, Bloom filter tradeoffs, politeness queue design, content hashing |
| L6 | Partition the frontier by domain hash, explain DNS caching, handle re-crawl scheduling with TTL |
| L7/L8 | Own the full crawl budget problem: importance scoring, freshness vs coverage tradeoffs, anti-spam, legal exposure |
Requirements clarification
Functional requirements
| Requirement | Notes |
|---|---|
| Discover and fetch web pages starting from a seed URL list | Seed set provided externally |
| Extract hyperlinks and enqueue discovered URLs | HTML anchor tags; normalise before enqueue |
| Store raw page content and metadata | HTML, fetch timestamp, HTTP status, content hash |
| Respect robots.txt and per-domain crawl rate limits | Per-domain politeness queue |
| Avoid fetching the same URL twice within a crawl cycle | URL fingerprint deduplication |
| Re-crawl pages on a refresh schedule | TTL-based or change-detection-based |
Non-functional requirements
| NFR | Target | Reasoning |
|---|---|---|
| Scale | ~1B pages/day | Google-scale; reasonable FAANG interview scope |
| Crawl freshness | Top pages re-crawled within 24h | Ensures index currency for search ranking |
| Politeness | ≤1 req/s per domain by default | Avoids overloading small sites; industry norm |
| Deduplication accuracy | <0.01% duplicate fetches in steady state | Exact-match second tier keeps duplicates near zero; Bloom-filter false positives cost missed URLs instead (~0.1% at 14 bits/URL) |
| Availability | 99.9% (crawl continues despite worker failures) | Worker crashes must not lose URL queue entries |
| Storage | Petabyte-scale raw content | ~100KB avg page × 1B pages/day × 30 days |
NFR reasoning
Politeness constraint (≤1 req/s per domain) ›
One request per second is the de facto industry standard for respectful crawling. Many small sites run on shared hosting with 100–200 requests/s of total capacity, so even modest domains can be overwhelmed if multiple crawl workers independently select their pages. The per-domain rate limit is the most consequential single constraint in this design: it directly forces the partitioned frontier architecture in §4.
Deduplication accuracy (<0.01% duplicates) ›
At 1B pages/day, even 0.1% duplicate fetches is 1M wasted requests consuming bandwidth and crawl budget. A Bloom filter at ~14 bits per element (with ~10 hash functions) has a false-positive rate of roughly 0.1%; for 50B seen URLs that's roughly 87GB, which fits in memory across a Redis cluster. The false-positive direction is safe: occasionally skipping a new URL is acceptable. The filter itself never produces false negatives, but operational events (a node failing open, a filter generation rotation) can, which is why an exact-match store sits behind it as the second gate.
Crawl freshness (top pages re-crawled within 24h) ›
The web's high-value tail (news sites, social platforms, large retailers) changes hourly. A 24-hour freshness SLA for prioritised URLs means the crawl scheduler must distinguish between tier-1 pages (high change frequency, high importance) and tier-2/3 pages (low traffic, rarely updated). A uniform re-crawl schedule wastes crawl budget on static content while letting high-frequency pages go stale. This drives the priority queue and importance scoring discussed in §5.
The Scheduler tracks each URL's next_crawl_at and pushes due URLs to the Frontier (§7, §9).
Capacity estimation
Before reaching for numbers, consider the shape of each dimension. Write throughput is pure fetch rate: easy to estimate, but it dominates bandwidth and storage planning. Storage compounds quickly: raw HTML at scale means petabytes, not terabytes. URL deduplication state is the unusual one here: the seen-URL set must fit in memory or we lose the ability to do fast deduplication. That constraint deserves its own estimate.
The seen-URL Bloom filter memory estimate is the architectural forcing function. At 1B pages/day over 30 days, the crawl window accumulates 30B unique URLs. At 14 bits per URL that's ~53GB, which must live in RAM across an in-memory cluster (Redis). If it spills to disk, dedup latency blows up from microseconds to milliseconds, and at 11,574 QPS, that becomes the bottleneck. Every architectural decision around the URL frontier traces back to keeping this set in RAM. (For web-scale estimates using the full 50B-page web graph, the same formula gives ~87GB; you'll see both numbers in the literature.)
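A quick back-of-envelope check of these figures, as a minimal Python sketch; the constants simply mirror this section's assumptions:
# Back-of-envelope check of the dedup sizing above; constants mirror this section's assumptions
import math

pages_per_day = 1_000_000_000
window_days = 30
urls = pages_per_day * window_days                    # 30B unique URLs in the crawl window
bits_per_url = 14
k = 10                                                # hash functions, near-optimal for 14 bits/URL

filter_bits = urls * bits_per_url
print(f"Bloom filter size: {filter_bits / 8 / 1e9:.1f} GB")          # ~52.5 GB, i.e. the ~53GB above

p = (1 - math.exp(-k * urls / filter_bits)) ** k                      # false-positive rate
print(f"False-positive rate: {p:.3%}")                                # ~0.12%

print(f"Sustained fetch rate: {pages_per_day / 86_400:,.0f} req/s")   # ~11,574 req/s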
High-level architecture
Component breakdown
URL Frontier Service: almost every scalability decision in this design traces back here. The frontier queues URLs waiting to be crawled, organises them into per-domain priority buckets, and enforces politeness delays so no single domain gets hammered. Unlike a simple FIFO, it must know which domains are rate-limited before dispatching the next URL.
Fetcher Pool: a fleet of HTTP worker processes that pull URLs from the frontier, issue the actual HTTP/S request, and stream raw HTML back for processing. Fetchers maintain their own DNS cache (DNS round-trips at 11k req/s would become a bottleneck) and a per-domain robots.txt cache with a 24h TTL.
Parser / Link Extractor: receives raw HTML, extracts all anchor href values, normalises them (resolving relative paths, stripping fragments, canonicalising scheme and host), and computes a SimHash fingerprint of the body for near-duplicate detection. New URLs discovered here loop back to the frontier; the content hash goes to the metadata store.
Dedup Store: guards the frontier against re-enqueuing URLs already crawled. A distributed Bloom filter (backed by an in-memory store like Redis) handles the fast path; an exact hash set backstops the small fraction the filter passes through. Any URL must clear both gates before entering the frontier.
Object Store: holds raw HTML blobs, referenced by URL hash. A distributed blob store (Amazon S3 or Google Cloud Storage) handles the write throughput and durability that a relational database's BLOB columns cannot; raw HTML at 11,000 writes/second and 100KB per page demands it.
Metadata DB: records per-URL state (crawl status, last fetch timestamp, content hash, link count, HTTP response code, ETag, computed importance score). The re-crawl Scheduler reads it to compute the next fetch time for each URL, making it the control plane for the entire crawl lifecycle.
Architectural rationale
URL Frontier: distributed priority queue, not a single FIFO Core design choice ›
A single FIFO queue makes politeness enforcement intractable: you'd need to scan ahead to find the next URL from a domain that isn't rate-limited. A per-domain bucket design lets each domain's queue drain at its own rate. A durable message queue (Apache Kafka) with domain-keyed partitions is a natural fit: each partition maps to a domain range, and consumer groups naturally respect the rate limit by pausing consumption on a partition until the cooldown expires.
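As a rough illustration of what "pausing until the cooldown expires" means per domain, here is a minimal in-process sketch; the class and method names are illustrative, and a production frontier would keep this state next to the partition consumer rather than in a local map:
# Minimal sketch of per-domain politeness enforcement; class and method names are illustrative
import time
from collections import defaultdict

class PolitenessGate:
    """Allow at most one request per `delay` seconds per domain."""

    def __init__(self, default_delay: float = 1.0):
        self.default_delay = default_delay
        self.next_allowed_at = defaultdict(float)      # domain -> earliest next fetch (monotonic time)

    def wait_time(self, domain: str) -> float:
        """Seconds the caller must wait before fetching this domain (0.0 means go now)."""
        return max(0.0, self.next_allowed_at[domain] - time.monotonic())

    def record_fetch(self, domain: str, crawl_delay: float | None = None) -> None:
        """Call when a fetch is dispatched; honours a robots.txt Crawl-delay override if given."""
        delay = crawl_delay if crawl_delay is not None else self.default_delay
        self.next_allowed_at[domain] = time.monotonic() + delay

gate = PolitenessGate()
if gate.wait_time("example.com") == 0:
    gate.record_fetch("example.com")                   # issue the HTTP request here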
Object store for raw content, not a relational DB Storage architecture ›
Raw HTML blobs average 100KB and arrive at 11,000/second. A relational database storing BLOB columns at this write rate would require extreme vertical scaling and still hit I/O limits quickly. An object store designed for high-throughput sequential writes with 11 nines durability (like S3) stores the blobs cheaply and lets the metadata database remain lean — storing only the object key, not the content.
Fetcher DNS caching: local TTL override Performance ›
At 11,574 req/s, every fetch needs a DNS lookup for the target hostname. Public resolvers have rate limits and introduce 10–100ms latency per lookup. Fetcher workers maintain a local DNS cache with a 5-minute TTL, overriding the authoritative TTL. This reduces external DNS queries by ~99% for popular domains. The tradeoff, DNS changes take 5 minutes to propagate , is acceptable for a crawler where a brief 404 burst is harmless compared to DNS becoming a bottleneck.
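A minimal sketch of the worker-local cache, assuming the OS resolver via the standard library; the helper name and module-level dict are illustrative, but the TTL-override idea is the same:
# Sketch of a worker-local DNS cache with a fixed 5-minute TTL override; helper name is illustrative
import socket
import time

DNS_TTL_SECONDS = 300
_dns_cache: dict[str, tuple[str, float]] = {}          # hostname -> (ip, expires_at)

def resolve(hostname: str) -> str:
    """Return a cached IP, falling through to the OS resolver at most once per 5 minutes per host."""
    now = time.monotonic()
    cached = _dns_cache.get(hostname)
    if cached and cached[1] > now:
        return cached[0]
    ip = socket.gethostbyname(hostname)
    _dns_cache[hostname] = (ip, now + DNS_TTL_SECONDS)
    return ip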
Real-world comparison
| Decision | This design | Common Crawl | Googlebot |
|---|---|---|---|
| Frontier queue | Kafka domain partitions | Custom priority queues | Proprietary distributed queue |
| Raw content store | Object store (S3-compatible) | WARC files on S3 | Colossus (GFS successor) |
| Deduplication | Bloom filter + Redis exact hash | URL canonicalisation + fingerprint | Simhash near-duplicate at index time |
| Priority scoring | PageRank-like importance score | Domain whitelist tiers | Multi-signal quality score |
| Politeness | Per-domain token bucket | Crawl-delay from robots.txt | Adaptive rate per site |
Common Crawl publishes its architecture and stores WARC-format crawl archives openly on S3, a useful reference for interview discussions about open data pipelines. Google's architecture is proprietary, but its published papers on Caffeine (their indexing pipeline) reveal the general shape.
Crawl strategy & URL deduplication
Two questions dominate this section: in what order should URLs be fetched? and how do you know you've already seen a URL? They're related: the frontier algorithm determines how fast URLs arrive at the Dedup Store, and the Dedup Store must keep up. At 11k QPS, that means sub-millisecond lookups; anything that falls through to a 10ms database read becomes the system's throughput ceiling.
Our choice for this system: BFS with priority scoring
Standard BFS discovers pages linked by many other pages (high-authority pages) early in the crawl. Adding a priority score, based on estimated page importance, change frequency, and link depth, turns the FIFO BFS queue into a best-first crawl that approximates what search engines actually do. A newly discovered URL gets a base priority; crawl history and inbound link count update it over time.
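The exact scoring function is an implementation choice; a minimal sketch with illustrative, untuned weights:
# Illustrative frontier priority function; the weights are assumptions, not tuned values
def priority_score(importance: float, change_freq: float, depth: int) -> float:
    """Combine estimated importance [0-1], observed change frequency [0-1],
    and link depth from the seed into a single frontier priority in [0-1]."""
    depth_penalty = 1.0 / (1 + depth)            # deeper pages are usually less valuable
    return 0.6 * importance + 0.3 * change_freq + 0.1 * depth_penalty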
URL normalisation (a classic L5 follow-up). Before any URL reaches the Dedup Store it must be normalised, or semantically identical pages create duplicate entries. The six standard steps: (1) lowercase the scheme and host (HTTP://Example.COM → http://example.com); (2) strip default ports (:80, :443); (3) remove fragments (#section; fragments are client-side only, the server returns the same page regardless); (4) resolve relative paths (/../foo → /foo); (5) sort query parameters alphabetically so ?b=2&a=1 and ?a=1&b=2 map to the same fingerprint; (6) strip known tracking parameters (utm_source, fbclid). Normalisation happens in the parser before the URL hash is computed; the hash stored in the Dedup Store and url_records is always of the normalised form.
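A minimal sketch of the six steps using only the standard library; the tracking-parameter list is illustrative, not exhaustive:
# Sketch of the six normalisation steps; the tracking-parameter list is illustrative
from urllib.parse import urlsplit, urlunsplit, urljoin, parse_qsl, urlencode

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid"}
DEFAULT_PORTS = {"http": 80, "https": 443}

def normalise(url: str) -> str:
    parts = urlsplit(url)
    scheme = parts.scheme.lower()                                     # (1) lowercase scheme and host
    host = (parts.hostname or "").lower()
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):        # (2) strip default ports
        host = f"{host}:{parts.port}"
    resolved = urljoin(f"{scheme}://{host}/", parts.path or "/")       # (4) collapse /../ segments
    path = urlsplit(resolved).path
    query = sorted((k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
                   if k not in TRACKING_PARAMS)                        # (5) sort, (6) drop tracking params
    return urlunsplit((scheme, host, path, urlencode(query), ""))      # (3) fragment dropped here

print(normalise("HTTP://Example.COM:80/a/../b?b=2&a=1&utm_source=x#frag"))
# -> http://example.com/b?a=1&b=2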
URL deduplication: two-tier approach
| Tier | Mechanism | Latency | False positive rate |
|---|---|---|---|
| Tier 1 (fast reject) | Distributed Bloom filter (in-memory, Redis cluster) | ~0.5ms | ~0.1% at 14 bits/URL (configurable via filter sizing) |
| Tier 2 (exact check) | Redis HSET with URL SHA-256 fingerprint | ~2ms | 0% (exact match) |
Tier 1 efficiently handles the vast majority of checks. Every URL the frontier wants to enqueue passes through Tier 1 first: if the Bloom filter says "seen", the URL is dropped immediately without touching the exact store. If the filter says "not seen", the URL proceeds to Tier 2 for exact confirmation. The false-positive case, where the Bloom filter incorrectly reports a genuinely new URL as already seen, causes that URL to be silently skipped. At a ~0.1% false-positive rate this is acceptable crawl-budget loss (roughly 1M missed URLs per day at 1B pages/day scale). Tier 2 is exact: a Redis lookup on the normalised URL fingerprint, with a 0% false-positive rate. A Bloom filter never produces false negatives, so any URL that passes both tiers is definitively new.
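A minimal sketch of the two gates, assuming redis-py against a Redis Stack node with RedisBloom loaded; key names are illustrative, and the exact tier is shown as a plain Redis set (the HSET variant in the table behaves the same way):
# Sketch of the two dedup gates; assumes Redis Stack with RedisBloom, key names are illustrative
import hashlib
import redis

r = redis.Redis(host="localhost", port=6379)

def is_new_url(normalised_url: str) -> bool:
    """True only if the URL clears both tiers."""
    fp = hashlib.sha256(normalised_url.encode()).hexdigest()
    if r.execute_command("BF.EXISTS", "seen_urls_bf", fp):     # Tier 1: fast reject
        return False
    return not r.sismember("seen_urls_exact", fp)              # Tier 2: exact confirmation

def mark_seen(normalised_url: str) -> None:
    fp = hashlib.sha256(normalised_url.encode()).hexdigest()
    r.execute_command("BF.ADD", "seen_urls_bf", fp)
    r.sadd("seen_urls_exact", fp)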
Content-level deduplication with SimHash Advanced, L5+ ›
URL deduplication catches exact URL repeats, but the web is full of near-duplicate content: mirror sites, printer-friendly versions, paginated archives. SimHash converts a page's token set into a 64-bit fingerprint where similar documents produce fingerprints that differ in only a few bits (Hamming distance ≤ 3). Storing SimHash values in the metadata DB and checking Hamming distance on fetch lets the parser flag near-duplicates without re-indexing their content. Google's Detecting Near-Duplicates for Web Crawling (2007) is the primary reference.
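A minimal sketch of the fingerprint itself, assuming naive whitespace tokenisation and uniform token weights (production implementations weight tokens and use a stronger hash):
# Minimal 64-bit SimHash sketch with naive tokenisation and uniform weights
import hashlib

def simhash64(text: str) -> int:
    weights = [0] * 64
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if weights[bit] > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

# Fingerprints within Hamming distance <= 3 are treated as near-duplicates
print(hamming(simhash64("breaking news story today"), simhash64("breaking news story today updated")))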
One open question: what's the false-positive rate for a Bloom filter over 50B URLs at 14 bits/element? Answer: roughly 0.1%. The formula is p = (1 − e^(−kn/m))^k with k = 10 hash functions, n = 50B URLs, and m = 700B bits, which gives p ≈ 0.0012; pushing it down to 0.01% would need roughly 20 bits per element. Remember the direction: a false positive means the filter says "seen" for a URL that hasn't actually been crawled, causing a valid new URL to be silently skipped. That's the acceptable loss. A false negative (saying "not seen" for a URL that has been crawled) cannot happen with a Bloom filter by design. Know both the formula and the direction; interviewers probe both.
API & protocol design
A web crawler has two protocol surfaces: the HTTP protocol it speaks to target websites, and the internal API that downstream systems (search indexer, ML pipelines) use to query or drive the crawl. Both matter in an interview.
External: HTTP fetching protocol
Fetchers issue standard HTTP/1.1 or HTTP/2 GET requests. Key headers the crawler must send and respect:
// Outbound request headers (what we send)
GET /page HTTP/1.1
Host: example.com
User-Agent: MyCrawler/1.0 (+https://mysite.com/crawler)
Accept: text/html,application/xhtml+xml
Accept-Encoding: gzip, deflate, br
If-None-Match: "etag-from-last-crawl" // conditional GET, avoid re-downloading if unchanged
If-Modified-Since: Tue, 01 Apr 2025 12:00:00 GMT
// Response handling
200 OK → parse and store
301/302 → follow redirect (max 5 hops to prevent loops)
304 Not Modified → skip storage, update last-checked timestamp only
429 Too Many Requests → back off exponentially, respect Retry-After header
5xx → retry with jitter, increment failure counter
HTTPS / TLS handling. Most of the web now serves HTTPS. Fetchers must decide how to handle certificate errors: an invalid, self-signed, or expired certificate on an otherwise-legitimate site. Most production crawlers accept TLS errors with a flag (a --tls-skip-verify equivalent) rather than hard-failing, since many small sites have certificate misconfigurations that aren't a sign of malicious intent. The crawl metadata should record tls_error: bool so downstream systems can downgrade trust scores for these pages. Hard-failing on TLS errors would exclude a significant fraction of the crawlable web.
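A minimal sketch of this accept-but-record policy using the requests library; the tuple return shape is illustrative:
# Sketch of the accept-but-record TLS policy; the return shape is illustrative
import requests

def fetch(url: str) -> tuple[requests.Response, bool]:
    """Fetch a URL, retrying without certificate verification on TLS errors.
    Returns the response and a tls_error flag destined for the crawl metadata."""
    try:
        return requests.get(url, timeout=10), False
    except requests.exceptions.SSLError:
        # Misconfigured certificate on an otherwise reachable site: fetch anyway, flag it
        return requests.get(url, timeout=10, verify=False), True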
Internal: Crawl Management API
The system exposes a REST API for operators and downstream consumers. Core endpoints:
POST /crawl/seeds, Submit seed URLs to bootstrap a crawl.
// Request
{
"urls": ["https://example.com", "https://other.com"],
"priority": "high", // "high" | "normal" | "low"
"crawl_depth": 3, // max BFS depth from seed, null = unlimited
"recrawl_ttl_hours": 24
}
// Response: 202 Accepted
{
"job_id": "job_abc123",
"queued_urls": 2,
"estimated_start": "2026-04-01T12:00:05Z"
}
GET /pages/{url_hash}, Retrieve crawl metadata for a specific URL.
// Response: 200 OK
{
"url": "https://example.com/page",
"url_hash": "sha256:abc123...",
"last_fetched_at": "2026-04-01T10:00:00Z",
"http_status": 200,
"content_hash": "md5:def456...",
"simhash": "0xA3F2...",
"content_length_bytes": 102400,
"outbound_link_count": 47,
"storage_key": "s3://crawl-bucket/pages/abc123.html.gz",
"crawl_status": "indexed" // "queued" | "fetching" | "indexed" | "failed"
}
| Endpoint | Level | Purpose |
|---|---|---|
| POST /crawl/seeds | L3/L4 | Bootstrap crawl with seed URLs |
| GET /pages/{url_hash} | L3/L4 | Fetch metadata for a crawled URL |
| DELETE /pages/{url_hash} | L5 | Remove URL from crawl (DMCA, robots.txt change) |
| GET /crawl/stats | L5 | Crawl throughput, queue depth, error rates |
| POST /crawl/recrawl | L6 | Trigger immediate re-crawl of a URL or domain |
| GET /pages/{url_hash}/links | L7/L8 | Retrieve extracted link graph for downstream ranking |
Core crawl flow
The dominant path is the single-URL fetch cycle. Understanding each decision point, especially what happens on a cache hit, a redirect, or a robots.txt denial, is what separates an L3 answer from an L5 one. The key decision the NFR forces is conditional GET: the crawl freshness requirement (§2) demands we re-visit pages, but the storage NFR says we shouldn't re-store unchanged content. Conditional GET via ETags or If-Modified-Since resolves this.
Without conditional GET, every re-crawl stores a full copy of every page regardless of whether anything changed. At 1B pages/day re-crawled every 24 hours with 100KB average page size, that's 100TB of redundant writes per day, as much again as the original crawl volume. ETag-based conditional GET turns most of those into 304 responses with no body: a timestamp update in Cassandra and nothing written to the object store. The candidate gotcha: many servers (especially older CMSes and static file servers) don't implement ETags or Last-Modified headers, so the crawler must detect the absence of these headers and fall back to comparing the new response body's MD5 against the stored content hash to decide whether to overwrite.
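A minimal sketch of that re-crawl decision; field names loosely follow the url_records schema in the data model section, and last_fetched_http_date is a hypothetical field assumed to hold an HTTP-date string:
# Sketch of the re-crawl decision: conditional GET first, content-hash fallback when the server
# sends no validators; field names and return shape are illustrative
import hashlib
import requests

def recrawl(url: str, record: dict) -> dict:
    headers = {}
    if record.get("etag"):
        headers["If-None-Match"] = record["etag"]
    elif record.get("last_fetched_http_date"):               # assumes an HTTP-date string is stored
        headers["If-Modified-Since"] = record["last_fetched_http_date"]

    resp = requests.get(url, headers=headers, timeout=10)
    if resp.status_code == 304:
        return {"changed": False}                             # update last-checked timestamp only

    new_hash = hashlib.md5(resp.content).hexdigest()
    if new_hash == record.get("content_hash"):
        return {"changed": False}                             # server ignored validators; body identical
    return {"changed": True, "content_hash": new_hash, "etag": resp.headers.get("ETag")}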
Data model
There are three distinct entities in the crawl system. The differences in how each is used end up shaping the schema significantly, particularly the choice of key space and the need for certain index patterns.
Entities and access patterns
| Operation | Entity | Frequency | Query shape |
|---|---|---|---|
| Check if URL seen | URL record | 11,000/s | Point lookup by URL hash |
| Write crawl result | URL record + content | 11,000/s | Insert / update by URL hash |
| Read next URLs to crawl | Scheduler table (Redis sorted set) | 11,000/s | Range scan by next_crawl_at score, served from dedicated scheduler store, not url_records |
| Retrieve raw HTML | Content blob | ~1,000/s (indexer) | Point read by storage key |
| Find outbound links from page | Link edge | ~100/s (ranking) | Range scan by source URL hash |
Two things stand out. First, dedup lookups and writes run at the same rate as fetches, 11,000/s, so the Dedup Store must handle that write rate without becoming the bottleneck. Second, the "next URLs to crawl" access pattern is the tricky one: it requires a range scan ordered by time, which Cassandra cannot serve efficiently from url_records; a WHERE next_crawl_at < now() query across all partitions is a full-cluster scatter-gather, not a range query. This forces a dedicated scheduling data structure, described below.
Schema
url_records, the metadata control plane (wide-column store like Apache Cassandra or AWS
DynamoDB, sharded by url_hash):
-- url_records (Cassandra-style)
url_hash TEXT PRIMARY KEY -- SHA-256 of normalised URL
url TEXT -- full normalised URL
domain TEXT -- extracted hostname, used for partition routing
status TEXT -- queued | fetching | indexed | failed | skipped
http_status INT -- last HTTP response code
last_fetched_at TIMESTAMP
next_crawl_at TIMESTAMP -- computed from TTL + importance
content_hash TEXT -- MD5 of body bytes
simhash BIGINT -- 64-bit near-duplicate fingerprint
storage_key TEXT -- S3 object key for raw HTML
importance FLOAT -- crawl priority score [0.0–1.0]
etag TEXT -- from last response, used in conditional GET
fail_count INT -- consecutive failures; triggers backoff
link_edges, the link graph (wide-column store, sharded by source_hash):
-- link_edges (Cassandra CQL, composite primary key)
CREATE TABLE link_edges (
source_hash TEXT,
target_url TEXT, -- clustering column: range scan of all links from a page
anchor_text TEXT,
discovered_at TIMESTAMP,
PRIMARY KEY (source_hash, target_url)
);
scheduler_queue, the re-crawl scheduling index (a Redis sorted set, one per domain tier):
-- Redis sorted set: score = next_crawl_at Unix timestamp, member = url_hash
-- Allows efficient: ZRANGEBYSCORE scheduler:tier1 0 {now} LIMIT 0 500
-- The Scheduler reads a batch of due URLs, pushes them to the Frontier (Kafka), marks status=fetching in url_records
-- Separate sets per tier (tier1 = 1h TTL, tier2 = 24h, tier3 = 7–30 days)
ZADD scheduler:tier1 {next_crawl_at_unix} {url_hash}
ZRANGEBYSCORE scheduler:tier1 0 {now_unix} LIMIT 0 500 -- fetch next batch due
ZREM scheduler:tier1 {url_hash} -- remove after enqueue to Kafka
content blobs, raw HTML stored in an object store, keyed by url_hash:
-- Object store key pattern
s3://crawl-raw/{yyyy}/{mm}/{dd}/{url_hash}.html.gz
Why Cassandra for url_records, not PostgreSQL? DB choice rationale ›
PostgreSQL at 11,000 writes/second per table is achievable with connection pooling, but adding read throughput of the same magnitude pushes a single instance to its limits around 20–30k operations/second on commodity hardware. Horizontal scaling via read replicas helps for reads, but writes still go through one primary. Cassandra's (or DynamoDB's) multi-primary, leaderless replication lets you add nodes linearly as write throughput grows, matching the crawl worker scaling model. The tradeoff is that Cassandra doesn't support arbitrary range queries across partitions; the access pattern table above was specifically designed to avoid the ones we can't support.
Caching strategy
Caching in a web crawler works differently from most systems: the crawler is more consumer than server. Most systems design caches to serve many readers from few writes; this system writes constantly and caches to reduce the cost of its own outbound reads. The latency target is sustained crawl throughput of 11,000 req/s, not user-facing page load time. Every cache miss that falls through to a slower tier directly reduces how many pages per day the system can process.
| Cache | Backing store | TTL | Why it exists |
|---|---|---|---|
| DNS lookup cache | In-process (per worker) | 5 min | Public resolvers rate-limit and add 10–100ms latency at 11k req/s |
| robots.txt cache | Shared in-memory store (Redis) | 24h | robots.txt rarely changes; re-fetching per-URL wastes bandwidth |
| URL Bloom filter | Redis cluster (~53GB at 30-day window) | None, entries never expire | Fast first-pass dedup; keeps ~99.99% of seen-URL checks under 1ms. Write-only: bits are set but never cleared. |
| URL exact hash set | Redis HSET | None, entries never expire | Exact dedup confirmation for URLs the Bloom filter passes through. Zero false-positive rate. |
Invalidation
DNS and robots.txt caches expire on TTL; there is no active invalidation. When a site updates its robots.txt, the cached version is stale for up to 24 hours; this is a known tradeoff (the alternative is fetching robots.txt on every request). The Bloom filter and URL hash set are write-only; there is no invalidation, by design. If a URL needs to be forcibly re-crawled (e.g. DMCA removal followed by re-addition), operators mark it eligible in Cassandra and the re-crawl Scheduler inserts a fresh entry into the Frontier, bypassing the Dedup Store check.
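A minimal sketch of the shared robots.txt gate, assuming redis-py and the standard-library parser; the key naming, user agent, and https-only fetch are simplifications:
# Sketch of the shared robots.txt cache: 24h TTL in Redis, parsed with urllib.robotparser
import urllib.robotparser
from urllib.parse import urlsplit
import redis
import requests

r = redis.Redis()
USER_AGENT = "MyCrawler/1.0"

def is_allowed(url: str) -> bool:
    domain = urlsplit(url).netloc
    key = f"robots:{domain}"
    body = r.get(key)
    if body is None:
        resp = requests.get(f"https://{domain}/robots.txt", timeout=10)
        body = resp.text if resp.status_code == 200 else ""   # missing robots.txt means allow all
        r.setex(key, 24 * 3600, body)                          # 24h TTL, matching the table above
    lines = body.decode().splitlines() if isinstance(body, bytes) else body.splitlines()
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(lines)
    return parser.can_fetch(USER_AGENT, url)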
Deep-dive: scalability
The single-machine design breaks at three specific points as load increases: the URL frontier becomes a coordination bottleneck when a single queue must enforce per-domain politeness across hundreds of workers; the seen-URL set outgrows available RAM and falls back to disk; and the re-crawl scheduler needs a data structure that can serve time-ordered range queries at 11k QPS, something a wide-column store's partitioning model can't provide. The production architecture addresses each with a targeted change.
Frontier partitioning: domain-hash sharding across Kafka partitions L5+ ›
Each Kafka partition handles a contiguous range of domain-name hashes. A fetcher consumer group is pinned to a partition range, so all URLs for a given domain are always consumed by the same worker pool. This guarantees the politeness invariant without coordination: only one pool is ever actively fetching from a domain. Rebalancing when workers are added or removed is handled by Kafka consumer-group rebalancing; domains remapped to new workers pause for a politeness interval before resuming.
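A minimal sketch of the partition assignment; in practice this is just the Kafka message key (Kafka's default partitioner hashes the key), but the invariant is easiest to see spelled out. The partition count and hash choice are illustrative:
# Sketch of domain-keyed partition assignment; partition count and hash choice are illustrative
import hashlib
from urllib.parse import urlsplit

NUM_PARTITIONS = 256

def partition_for(url: str) -> int:
    """Every URL from the same domain lands on the same partition, so exactly one
    consumer pool ever owns that domain's politeness window."""
    domain = urlsplit(url).hostname or ""
    digest = hashlib.md5(domain.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

assert partition_for("https://example.com/a") == partition_for("https://example.com/b")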
Distributed Bloom filter over Redis cluster L5+ ›
The seen-URL Bloom filter at ~53GB (for a 30-day crawl window of 30B URLs at 14 bits/URL —
see §3) must live in RAM to deliver sub-millisecond lookups. A Redis cluster with 2–3
primary nodes provides headroom with room to grow as the retention window widens. The Redis
Bloom filter module (RedisBloom / Redis Stack) implements the filter natively with
BF.ADD / BF.EXISTS commands. A consistent hash of the URL hash
selects which Redis node holds the filter bit range, distributing load evenly. When the
filter fills (false-positive rate degrades past threshold), a new filter generation is
created and both generations are checked during a transition window.
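A minimal sketch of the transition-window check across filter generations; the generation key names are illustrative:
# Sketch of the transition-window check across filter generations; key names are illustrative
import redis

r = redis.Redis()
GENERATIONS = ["seen_urls_bf:gen2", "seen_urls_bf:gen1"]    # newest first; gen1 retired after transition

def seen_in_any_generation(url_fingerprint: str) -> bool:
    return any(r.execute_command("BF.EXISTS", key, url_fingerprint) for key in GENERATIONS)

def mark_seen(url_fingerprint: str) -> None:
    r.execute_command("BF.ADD", GENERATIONS[0], url_fingerprint)   # writes go only to the newest generation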
Re-crawl scheduling: TTL + change frequency model L6+ ›
Re-crawl frequency is a crawl budget problem. Crawling everything at equal intervals wastes
crawl budget on static pages while missing dynamic ones. The Scheduler uses a dedicated
Redis sorted set (not a Cassandra range query, see §7) where the score is
the Unix timestamp of next_crawl_at and the member is the URL hash. A
ZRANGEBYSCORE scheduler:tier1 0 {now} call efficiently fetches the next batch
of due URLs in O(log N + K) time. Once a URL is dispatched to the Frontier, it is removed
from the sorted set with ZREM and its status in url_records is set
to fetching.
The Scheduler maintains three separate sorted sets by tier: tier1 (1h TTL,
high-importance/high-frequency pages), tier2 (24h TTL), and tier3
(7–30 day TTL, static content). Change frequency is derived from historical content hash
diffs; importance score from estimated PageRank. When a crawl completes, the Scheduler
re-inserts the URL with a new score: now + TTL(importance, change_freq).
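A minimal sketch of this dispatch-and-reschedule loop with redis-py; the TTL function and key names are illustrative assumptions, not tuned values from this design:
# Sketch of the Scheduler's dispatch-and-reschedule loop; TTL function and key names are illustrative
import time
import redis

r = redis.Redis()

def dispatch_due_urls(batch_size: int = 500) -> list:
    now = int(time.time())
    due = r.zrangebyscore("scheduler:tier1", 0, now, start=0, num=batch_size)
    for url_hash in due:
        # enqueue url_hash to the Frontier (Kafka) and set status=fetching in url_records here
        r.zrem("scheduler:tier1", url_hash)
    return due

def reschedule(url_hash: str, importance: float, change_freq: float) -> None:
    ttl_hours = max(1, int(24 * (1 - importance) * (1 - change_freq)))   # now + TTL(importance, change_freq)
    r.zadd("scheduler:tier1", {url_hash: int(time.time()) + ttl_hours * 3600})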
A WHERE next_crawl_at < now() query across all Cassandra partitions would be a full-cluster scatter-gather, not a range query: Cassandra secondary indexes on high-cardinality timestamp columns fan out to every node. A Redis sorted set delivers O(log N) range queries and handles 11k writes/s on a single node, with the Scheduler tier split keeping each set under ~1B members at any time.
Geographic distribution for latency reduction L7/L8 ›
A crawler in US-East fetching a Japanese news site incurs 200–300ms RTT per page. Deploying fetcher pools in multiple geographic regions (US, EU, APAC) cuts latency to 20–40ms for regional domains. Domain-to-region assignment is done by TLD and geo-IP of the target server's DNS response. The frontier is replicated across regions with cross-region URL dedup via a globally replicated key-value store (Google Spanner or DynamoDB Global Tables). Content is written to the nearest regional object store with cross-region replication.
Failure modes & edge cases
| Scenario | Problem | Solution | Level |
|---|---|---|---|
| Fetcher worker crashes mid-fetch | URL is dequeued from Kafka but content never stored | Kafka consumer commits offset only after successful write to object store. Crash before commit triggers redelivery from last committed offset. | L3/L4 |
| Target site returns infinite redirect loop | Fetcher follows redirects indefinitely, exhausting threads | Hard cap of 5 redirect hops per fetch; mark URL as failed and blacklist the redirect target if loop detected. | L3/L4 |
| Spider trap, dynamically generated infinite URLs | Sites like calendars generate infinite unique URLs (e.g. /cal?date=2099-01-01), exhausting the frontier | Per-domain URL count cap. Detect repeated URL structure patterns (query parameter explosion). Apply path-depth limit per domain. | L5 |
| Redis Bloom filter node failure | Bloom filter lookup throws; duplicate fetch rate spikes | Fail open (treat URL as unseen) and fall through to Cassandra exact-hash check. Redis cluster replication recovers within seconds in steady state. | L5 |
| Kafka partition rebalance during politeness window | Domain moves to new fetcher consumer group; new consumer immediately fetches, violating the ≤1 req/s rule | On consumer group join, each worker reads the last-fetch timestamp from Cassandra for all domains in its new partition and inserts a politeness delay before resuming. | L5 |
| Content farm, millions of near-duplicate pages | Low-quality content consumes crawl budget; SimHash cluster is overwhelmed | Per-domain SimHash collision rate tracked in metadata. Domains exceeding the threshold have crawl budget capped at 10% of normal allocation and are flagged for manual quality review. | L7/L8 |
| Legal / DMCA takedown of a crawled URL | Crawled content must be removed from storage and future crawls blocked | URL marked with status=blocked in Cassandra; object store blob deleted; URL added to a permanent denylist checked before any frontier enqueue. This is an operator action via the DELETE API (§5b). | L7/L8 |
How to answer by level
L3 / L4 SDE I / SDE II, Can you build a working system? ›
- Describe the fetch-parse-enqueue loop clearly
- Know what robots.txt is and that you must check it
- Identify a simple dedup mechanism (hash set)
- Pick a queue for the frontier and explain why
- Mention object store for raw HTML
- L3 uses a global hash set without thinking about size
- L3 ignores politeness rate limiting entirely
- L3 treats the frontier as a simple FIFO queue
- L3 doesn't consider re-crawl freshness
L5 Senior, Do you understand the tradeoffs? ›
- Defend BFS with priority queue over DFS
- Size the Bloom filter (14 bits/URL → ~53GB for a 30-day crawl window) and explain why Redis
- Design per-domain politeness buckets, not just "add a delay"
- Explain conditional GET (ETags, 304) for re-crawls
- Know why Cassandra over PostgreSQL at this write rate
- L5 knows Kafka but doesn't partition by domain
- L5 mentions DNS caching but not that it needs override TTL
- L5 doesn't handle frontier rebalance + politeness edge case
L6 Staff, Can you own this end-to-end? ›
- Own the contract between crawler and downstream indexer: schema versioning, delivery guarantees, object store path conventions
- Define the crawl freshness SLA end-to-end: not just re-crawl TTL, but how latency in the crawl-to-index pipeline affects search result currency
- Identify that the scheduler (Redis sorted set) is a separate component from url_records, and explain why Cassandra range queries can't serve this access pattern
- Address the Kafka rebalance + politeness invariant edge case (§10) without prompting
- Design operational runbooks: what breaks first under load, and what's the graceful degradation path?
- L6 treats crawl budget as a technical problem; L7 frames it as a business ROI problem, which sites are worth more crawl budget and why
- L6 handles DMCA as a feature; L7 designs the compliance pipeline proactively (denylist, audit trail, legal hold)
- L6 ignores geographic distribution and cross-region dedup consistency tradeoffs
- L6 doesn't question whether to build vs. license Common Crawl data
L7 / L8 Principal / Distinguished, Should we build this, and how? ›
- Frame crawl budget as a resource allocation problem with ROI tradeoffs
- Design geographic distribution + cross-region dedup synchronisation
- Address anti-spam, content farm detection, and quality scoring at system level
- Know that robots.txt has no legal force in most jurisdictions; it is a convention, not a contract. Violating it has nonetheless led to litigation (hiQ v. LinkedIn, Sandvig v. Barr). The correct framing: respect it by policy, document that policy, and maintain an audit trail
- Question the requirements: does the company need a full crawler or can it license Common Crawl data for far less engineering cost?
- Treating all pages as equally valuable (no importance scoring)
- Not connecting crawl freshness to downstream search ranking quality
- Ignoring that a broken crawler is a legal liability, not just a performance issue
Classic interview probes
| Probe question | L3/L4 | L5/L6 | L7/L8 |
|---|---|---|---|
| "How do you avoid crawling the same URL twice?" | In-memory hash set of seen URLs | Bloom filter front-tier + Redis exact hash back-tier; size the filter at 14 bits/URL | Two-tier + SimHash for near-duplicate content; discuss Bloom filter generation rotation |
| "How do you respect a website's crawl limits?" | Add a sleep() between requests | Per-domain token bucket; read robots.txt Crawl-delay directive; cache robots.txt 24h | Adaptive rate per domain based on server response latency; per-domain failure tracking; back off on 429 |
| "How do you keep popular pages fresh?" | Re-crawl everything every N hours | Per-URL TTL derived from change-frequency history; conditional GET via ETags for 304 optimisation | Multi-signal freshness model: change frequency + importance + change impact on ranking; separate high-frequency pipeline for tier-1 domains |
| "What happens when your crawler crashes?" | Restart and re-crawl everything | Kafka offset commit after write; uncommitted URLs redelivered; Cassandra as durable state | Idempotent write design (content hash dedup prevents duplicate storage); partial crawl checkpointing; graceful drain on shutdown |
- Rate Limiter System Design — atomic Redis operations, distributed race conditions, and multi-tier quota enforcement
- URL Shortener System Design — hash-based ID generation, Redis caching strategy, and async analytics pipeline
- Twitter Feed System Design — fan-out write amplification, hybrid push/pull strategy, and celebrity threshold design
- Notification Service System Design — multi-channel delivery, idempotency keys, and priority queues at scale
- Key-Value Store System Design — consistent hashing, eviction policies, replication, and failure modes
- Search Autocomplete
- Chat System (WhatsApp) System Design — WebSocket architecture, message delivery guarantees, and fan-out