Design a Proximity Search System
like Yelp or Google Places
Deceptively simple to describe, surprisingly hard to scale: finding the nearest relevant business in under 200 ms across 200 million listings worldwide.
What the interviewer is testing
"Design Yelp" looks like a CRUD app with a map. It isn't. Beneath the surface are three distinct hard problems, and you need a credible answer to all three — not a deep dive into one.
| Problem | Why it's hard | What changes at scale |
|---|---|---|
| Spatial indexing | SQL WHERE distance < X is a full table scan at 200M rows; you need a purpose-built geo-index | A naive radius query is O(n) no matter the hardware; the geo-index that replaces it must also shard gracefully without hot spots |
| Review ranking | Raw average ratings are easily gamed; a 1-review 5-star business beats a 500-review 4.8-star one | Aggregate recomputation on every review write is O(n) at scale; you need incremental precomputation |
| Read-scale imbalance | Searches vastly outnumber writes — users browse far more than they review | The read:write ratio at Yelp-scale is ~1000:1; the system must be read-optimised without sacrificing write correctness |
Level signal: L3/L4 candidates name geohash but can't explain prefix-length tradeoffs. L5 candidates pick geohash vs quadtree correctly and justify it. L6 candidates reason about hot-cell mitigation and ranking freshness SLAs. L7/L8 candidates frame the geo-index sharding strategy and the ranking pipeline as separate subsystems with different consistency requirements.
Requirements clarification
Start with requirements before touching architecture. Every major design decision in this article traces back to a specific NFR target established here.
Functional requirements
| Capability | In scope | Out of scope (for this interview) |
|---|---|---|
| Proximity search | Find businesses within a radius, filtered by category and min-rating; ranked by composite score | ML personalisation, social graph ("friends also liked") |
| Business listing | Create, update, and display business profiles (name, address, hours, photos, category) | Business owner verification, paid listing tiers |
| Reviews | Authenticated users submit a rating (1–5) and text; review is public immediately | Review moderation, image uploads, business responses |
| Aggregated ratings | Per-business average rating and review count, kept current within 60 s of a new review | Sentiment analysis, review helpfulness voting |
Non-functional requirements
| NFR | Target | Why this level? |
|---|---|---|
| Search latency (p99) | < 200 ms | Users abandon map searches after ~300 ms; 200 ms leaves budget for network |
| Rating freshness | Aggregate reflects new reviews within 60 s | Stale ratings mislead booking decisions; real-time is unnecessary |
| Search availability | 99.99% (52 min/year downtime) | Search is the core revenue path; downtime directly loses bookings |
| Write availability | 99.9% (8.7 hr/year) | Review submission can tolerate brief degradation; users will retry |
| Scalability | 200M businesses, 100M DAU, 1000:1 read:write | Yelp/Google Places 2024 reported scale; drives read-optimised architecture |
| Geo accuracy | Search results within stated radius ± 20% | Geohash cell boundaries introduce edge-of-cell error; ± 20% is imperceptible to users |
NFR reasoning
Search latency < 200 ms Drives §4, §5, §8 ›
A full table scan over 200M rows — even with a B-tree index on latitude — takes seconds. Meeting 200 ms p99 requires a spatial index that can narrow candidates to a small cell in O(1), then apply distance filtering only on that subset. This is what drives geohash or quadtree in §5 and the Elasticsearch geo_distance query in §4.
Rating freshness within 60 s Drives §4, §8, §9 ›
Recomputing the Bayesian average on every search query (100K QPS) by joining reviews would collapse the DB. The 60-second SLA means we can precompute and cache the aggregate, updating it asynchronously via a message queue (Kafka) after each review write. Staleness beyond 60 s would be user-visible on trending businesses.
1000:1 read:write ratio Drives §4, §9 ›
At 100M DAU with ~10 searches per active user per day and ~0.1 reviews per user per week, the ratio works out to roughly 700 searches per review write — call it 1000:1 with headroom. This makes the system almost entirely read-bound — it should be designed around read throughput, not write throughput. Write path correctness matters, but write path latency is not the bottleneck.
Capacity estimation
The load profile is extremely read-heavy. Searches dominate at ~1000× the review write rate. The implication: the architecture must optimise for read throughput and caching, not write throughput.
Key insight: At 100M DAU × 10 searches/day, search QPS averages ~11,600 — peaking near 35,000 during lunch and evening hours. This is the number that makes a relational geo-query impossible without a dedicated spatial index. The review write rate (~16 writes/sec) is trivially manageable by a single relational primary; the challenge is entirely on the read side.
| Dimension | Estimate | Key insight |
|---|---|---|
| Elasticsearch index | ~300 GB | 200M docs × 1.5 KB avg → 6 shards of 50 GB each |
| Review DB storage | ~5.2 TB | 100M DAU × 0.1 reviews/week × 52 weeks × 5 yr ≈ 2.6B reviews × 2 KB |
| Geo-cell hot spot | Top 1% cells get 30–40% of queries | Manhattan/Tokyo/central-London cells need dedicated cache warming |
| Peak multiplier | 3× | Lunch + dinner rush for restaurant searches; system must handle 35K search QPS |
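These estimates are reproducible with a few lines of arithmetic. A minimal back-of-envelope sketch, using only the assumptions already stated in §2 and §3:

```python
# Back-of-envelope capacity check; all inputs are the stated assumptions.
DAU = 100_000_000
SEARCHES_PER_USER_PER_DAY = 10
REVIEWS_PER_USER_PER_WEEK = 0.1
REVIEW_SIZE_BYTES = 2_000          # ~2 KB per review row
SECONDS_PER_DAY = 86_400

search_qps = DAU * SEARCHES_PER_USER_PER_DAY / SECONDS_PER_DAY
peak_search_qps = 3 * search_qps                         # lunch/dinner multiplier
review_wps = DAU * REVIEWS_PER_USER_PER_WEEK / 7 / SECONDS_PER_DAY
reviews_5yr = DAU * REVIEWS_PER_USER_PER_WEEK * 52 * 5
storage_5yr_tb = reviews_5yr * REVIEW_SIZE_BYTES / 1e12

print(f"avg search QPS:      {search_qps:,.0f}")         # ~11,600
print(f"peak search QPS:     {peak_search_qps:,.0f}")    # ~34,700
print(f"review writes/sec:   {review_wps:.1f}")          # ~16.5
print(f"5-yr review storage: {storage_5yr_tb:.1f} TB")   # ~5.2 TB
```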
High-level architecture
The architecture separates the read plane (geo-search, business detail, review listing — latency-critical, eventual consistency acceptable) from the write plane (review submission, business updates — correctness-critical, async propagation acceptable). A message queue connects them without coupling write latency to index update latency.
Component breakdown
CDN sits in front of all traffic and serves business detail pages, photo assets, and search result pages that have been explicitly cached. Static assets never hit origin servers.
API Gateway / Load Balancer terminates TLS, validates auth tokens, and routes requests to the correct downstream service fleet. Routes: /search → Search API, /businesses → Business API, /reviews → Review API.
Search API translates a geo+category+rating query into an Elasticsearch geo_distance query, checks the Redis result cache first, and applies the composite ranking score to the returned candidates before responding. It never touches the relational database on the hot path.
Business API handles CRUD for business profiles. Reads are served from read replicas behind a Redis cache for individual business detail pages. Writes go to the primary PostgreSQL node and trigger an Elasticsearch document upsert.
Review API accepts review submissions from authenticated users, writes to the Review DB (PostgreSQL), and publishes a review.created event to a durable message queue (Kafka). It does not recompute ratings synchronously — that happens downstream.
Elasticsearch (geo_point index) holds one document per business: geo-point coordinates, category, current Bayesian rating, review count, and hours. The Search API queries it exclusively for proximity searches. It is not the source of truth — it's a read-optimised projection of the Business DB, kept in sync by the Business API on write.
Redis Cache serves two distinct caches: (1) search result pages keyed by hash(lat, lon, radius, category, min_rating) with a 30-second TTL, and (2) per-business rating aggregates with a 60-second TTL.
Kafka (review events) decouples the Review API's write latency from the aggregate update latency. The Agg Worker consumes review.created events, recomputes the Bayesian rating for the affected business, writes to the review_aggregates table, and invalidates the Redis rating cache.
Agg Worker is a stateless Kafka consumer. Event delivery relies on the outbox pattern: the Review API writes the event to an outbox table in the same PostgreSQL transaction as the review record, and Debezium publishes to Kafka from the WAL. This guarantees the event is never lost even if Kafka is temporarily unavailable during the write.
Architectural rationale
Why Elasticsearch instead of PostGIS for geo-search? Core tradeoff ›
PostGIS can serve radius queries from a GiST spatial index, but at 200M rows combining that lookup with category + rating filters forces multi-index intersections that are difficult to optimise past ~50 ms. Elasticsearch's geo_distance filter uses a pre-built BKD tree (Block KD-tree) that narrows candidates in O(log n) before applying secondary filters, reliably returning results in <20 ms on a 200M document index. It also scales horizontally via sharding with no schema migrations.
Why async rating updates via Kafka? Decoupling ›
Synchronous rating recomputation inside the review write transaction would require a SELECT AVG(rating) over all reviews for the business — O(n) per write. At 16 review writes/sec with businesses accumulating thousands of reviews, this becomes progressively more expensive. Kafka decouples the write from the aggregate update: the review commits in <10 ms, and the aggregate updates within ~60 s in steady state.
A better approach is to store rating_sum and review_count in review_aggregates and increment them atomically — this gives O(1) aggregate updates without scanning all reviews, at the cost of losing the ability to recalculate from scratch without a full review scan.
Why the outbox pattern + CDC instead of direct Kafka producer? Reliability ›
The Review API could publish directly to Kafka after writing to the database. But if Kafka is temporarily unavailable, the review is committed to PostgreSQL but the event is lost: the Agg Worker never recomputes the rating. The outbox pattern solves this: the Review API writes both the review row and an outbox event row in the same database transaction. Debezium (a CDC connector) reads the PostgreSQL WAL and publishes the outbox event to Kafka. Since the WAL is durable, the event is guaranteed to be published eventually — even if Kafka was down during the original write.
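To make the transactional coupling concrete, here is a minimal sketch of the Review API's write path, assuming psycopg2 and hypothetical table names (reviews, outbox_events); the Debezium connector configuration is elided:

```python
import json
import uuid

import psycopg2


def submit_review(conn, business_id: str, user_id: str, rating: int, text: str) -> str:
    """Write the review and its outbox event in ONE transaction.

    Debezium later reads the outbox row from the WAL and publishes it to
    Kafka, so the event survives even if Kafka is down right now.
    """
    review_id = f"rev_{uuid.uuid4().hex[:12]}"  # placeholder; §9 uses Snowflake IDs
    with conn:  # commits on success, rolls back on any exception
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO reviews (review_id, business_id, user_id, rating, text) "
                "VALUES (%s, %s, %s, %s, %s)",
                (review_id, business_id, user_id, rating, text),
            )
            cur.execute(
                "INSERT INTO outbox_events (aggregate_id, event_type, payload) "
                "VALUES (%s, %s, %s)",
                (business_id, "review.created",
                 json.dumps({"review_id": review_id, "rating": rating})),
            )
    return review_id
```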
Real-world comparison
| Decision | This design | Yelp (reported) | Google Places |
|---|---|---|---|
| Geo index | Elasticsearch geo_point (BKD tree) | Elasticsearch (migrated from Solr) | Proprietary S2 geometry cells + BigTable |
| Review storage | PostgreSQL, sharded by business_id | MySQL → PostgreSQL migration | Spanner for global consistency |
| Rating aggregation | Async via Kafka consumer | Periodic batch + real-time increment | Near-real-time with Bigtable increments |
| Search caching | Redis result cache (30 s TTL) | Memcached + Redis hybrid | Aggressive CDN + edge caching |
Core algorithm — spatial indexing
Before designing search, you must answer one foundational question: how do you represent a location in a way that makes "find all businesses within X km of me" fast? The answer isn't obvious — latitude and longitude are continuous, not naturally indexed. Three approaches dominate in production systems.
Our choice: geohash at precision 6, where each cell covers approximately 1.2 km × 0.6 km. For a 500 m search radius, we query the target cell plus its 8 neighbours (9 cells total) to handle edge-of-cell cases, then filter by exact Haversine distance in the application layer. This gives O(1) cell lookup in the layers keyed by geohash (result caching, shard routing), with the distance filter applied to at most a few thousand candidates; the Elasticsearch query itself uses its internal BKD tree (see the note below).
Implementation: geohash radius query ›
// Elasticsearch geo_distance query (used by Search API)
{
"query": {
"bool": {
"filter": [
{
"geo_distance": {
"distance": "500m",
"location": { "lat": 37.7749, "lon": -122.4194 }
}
},
{ "term": { "category": "restaurant" } },
{ "range": { "bayesian_rating": { "gte": 4.0 } } }
]
}
},
"sort": [
{ "_score": { "order": "desc" } },
{ "bayesian_rating": { "order": "desc" } }
],
"size": 20
}
// Bayesian average formula (computed by Agg Worker, stored in ES)
// bayesian_rating = (C × global_mean + sum_ratings) / (C + review_count)
// C = 25 (prior count), global_mean = 3.8 (platform average)
// A new business with 0 reviews starts at 3.8 (the global mean)
The Elasticsearch geo_distance filter uses an internal BKD tree, not a geohash prefix scan. We pass the coordinates directly — Elasticsearch handles the spatial math. The geohash discussion in §5 applies to Redis-based caching and sharding strategies, not to Elasticsearch's internal index structure.
One open question: What happens when a user on a cell boundary gets results from only one cell? The 9-cell neighbour search handles this in most cases. For very small radii (<100 m) in dense areas, precision 7 cells (152 m × 152 m) may be needed to keep the candidate set from being too large. The precision tradeoff is: larger cells = more false candidates to filter; smaller cells = more neighbour lookups.
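A sketch of the 9-cell candidate lookup plus Haversine post-filter described above, as used in the geohash-keyed layers (caching, shard routing). It assumes the python-geohash package; lookup_cell is a hypothetical per-cell index lookup (e.g. a Redis set per cell):

```python
import math

import geohash  # python-geohash package, assumed installed


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres."""
    r = 6_371_000  # mean Earth radius, metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def candidate_cells(lat, lon, precision=6):
    """Target cell + its 8 neighbours: the 9-cell cover from above."""
    center = geohash.encode(lat, lon, precision)
    return [center] + geohash.neighbors(center)


def radius_search(lat, lon, radius_m, lookup_cell):
    """lookup_cell(cell) -> [(biz_id, lat, lon)] is a hypothetical index
    lookup. Exact distance filtering happens here, in the app layer."""
    hits = []
    for cell in candidate_cells(lat, lon):
        for biz_id, blat, blon in lookup_cell(cell):
            d = haversine_m(lat, lon, blat, blon)
            if d <= radius_m:
                hits.append((biz_id, round(d)))
    return sorted(hits, key=lambda h: h[1])
```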
API design
Two endpoints dominate the hot path: the proximity search and the review submission. Both have non-obvious design choices.
GET /search/businesses
Use GET for search (not POST) because search results are cacheable by URL — CDN and intermediary proxies can serve cached results for identical queries. The query parameters form the cache key.
// Request
GET /search/businesses?lat=37.7749&lon=-122.4194&radius_m=500&category=restaurant&min_rating=4.0&page_token=eyJ...
// Response 200 OK
{
"businesses": [
{
"business_id": "biz_8821",
"name": "Mission Pie",
"category": "restaurant",
"distance_m": 312,
"bayesian_rating": 4.6,
"review_count": 843,
"address": "2901 Mission St, San Francisco",
"is_open_now": true,
"thumbnail_url": "https://cdn.example.com/..."
}
],
"next_page_token": "eyJ...",
"result_count": 47
}
POST /reviews
Use POST (not PUT) — review creation is not idempotent. Include an idempotency key header to prevent duplicate reviews on mobile retry.
// Request — Idempotency-Key header required
POST /reviews
Idempotency-Key: 550e8400-e29b-41d4-a716-446655440000
{
"business_id": "biz_8821",
"rating": 5,
"text": "Excellent pie. Highly recommended."
}
// Response 201 Created (review is public immediately; the aggregate rating updates asynchronously within 60 s)
{
"review_id": "rev_449201",
"status": "published",
"rating_update_eta_s": 60
}
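A sketch of the server-side idempotency check, assuming psycopg2 and a UNIQUE constraint on reviews.idempotency_key (as in §11's failure-mode table); the outbox insert from §4 is elided for brevity:

```python
import psycopg2


def create_review_idempotent(conn, idem_key: str, review_fields: dict):
    """Insert the review; on idempotency-key conflict, return the original.

    ON CONFLICT DO NOTHING makes a mobile retry a no-op: RETURNING yields a
    row only when a new review was actually inserted.
    """
    with conn:
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO reviews (idempotency_key, business_id, user_id, rating, text) "
                "VALUES (%(key)s, %(business_id)s, %(user_id)s, %(rating)s, %(text)s) "
                "ON CONFLICT (idempotency_key) DO NOTHING "
                "RETURNING review_id",
                {"key": idem_key, **review_fields},
            )
            row = cur.fetchone()
    if row:                        # first delivery: new review created
        return row[0], 201
    with conn.cursor() as cur:     # retry: echo the original review
        cur.execute(
            "SELECT review_id FROM reviews WHERE idempotency_key = %s", (idem_key,)
        )
        return cur.fetchone()[0], 200
```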
Optional endpoints by level
| Endpoint | Purpose | Level |
|---|---|---|
| GET /businesses/:id | Full business detail (hours, photos, menu link) | L3/L4 |
| GET /businesses/:id/reviews | Paginated review list, sorted by recency or helpfulness | L3/L4 |
| POST /businesses | Create new business listing (host-side) | L3/L4 |
| GET /search/businesses/count | Map view: cluster count of businesses per geo-cell | L5 |
| GET /businesses/trending | Rising businesses in a region (review velocity spike) | L5 |
| POST /search/businesses/bulk-nearby | Multi-origin batch search (logistics, delivery zone coverage) | L7/L8 |
Core flow — search path
The search path is the hot path. It must complete in <200 ms p99, which means every component adds latency budget. The design enforces a strict cache-first strategy: Redis result cache → Elasticsearch → response. The database is never touched on the search hot path.
Latency budget: Redis HIT: ~2 ms cache read + ~3 ms network = 5 ms. Redis MISS: ~5 ms Redis check + ~20 ms Elasticsearch + ~10 ms ranking + ~3 ms cache write = ~38 ms. With ~150 ms of network/CDN overhead, total p99 stays well under 200 ms. If Elasticsearch degrades to >100 ms, the SLA is at risk — circuit-break to a degraded mode that returns stale cache results with a freshness timestamp.
Data model
Data model decisions follow from access patterns — not from what's convenient to normalise. Here the dominant patterns are: proximity search (never touches relational DB), business detail fetch (single-row read, heavily cached), and review listing (paginated scan by business, sorted by time).
Access patterns
| Operation | Frequency | Query shape |
|---|---|---|
| Proximity search | Very high (11,600 QPS avg) | Elasticsearch geo_distance + filters → never touches PostgreSQL |
| Business detail page | High (every search result tap) | Single-row read by business_id — served from Redis or read replica |
| Review list for business | Moderate | Range scan by business_id, ordered by created_at DESC — paginated cursor |
| Submit review | Low (~16 writes/sec) | INSERT into reviews + outbox — single relational shard |
| Rating aggregate read | Very high (embedded in search results) | Single key Redis lookup (bayesian_rating, review_count) per business |
Two things stand out: the search path never touches the relational database, so the PostgreSQL schema only needs to serve detail pages and write operations. And the rating aggregate must be precomputed — it's read far too frequently to compute on demand.
Why store rating_sum and review_count separately? Storing the sum and count allows O(1) incremental updates: when a new review arrives, rating_sum += new_rating and review_count += 1 — no scan of all reviews needed. The Bayesian average is then computed from these two fields. If you only stored the average, every update would require knowing the previous count to reverse-engineer the sum.
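A minimal sketch of the update the Agg Worker performs, using the C = 25 and global mean = 3.8 constants from §5 (agg stands in for the review_aggregates row):

```python
C = 25            # prior review count (platform-level constant)
GLOBAL_MEAN = 3.8  # platform-wide average rating


def apply_review(agg: dict, new_rating: int) -> dict:
    """O(1) incremental update of the stored (sum, count) pair."""
    agg["rating_sum"] += new_rating
    agg["review_count"] += 1
    return agg


def bayesian_rating(agg: dict) -> float:
    """Formula from §5: (C * m + sum) / (C + n); 0 reviews -> global mean."""
    return (C * GLOBAL_MEAN + agg["rating_sum"]) / (C + agg["review_count"])


agg = {"rating_sum": 0, "review_count": 0}
assert round(bayesian_rating(agg), 2) == 3.8   # new business starts at the prior
for r in (5, 5, 5):
    apply_review(agg, r)
print(round(bayesian_rating(agg), 2))  # 3.93: three 5-stars barely move it
```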
Caching strategy
The 1000:1 read:write ratio means caching is not optional — it's the primary scaling mechanism. Three distinct cache layers handle different parts of the read path, each with different TTL and invalidation strategies.
| Layer | What it caches | TTL | Invalidation | Hit rate target |
|---|---|---|---|---|
| CDN | Business detail HTML pages, photo assets | 5 min (Cache-Control) | Purge on business profile update | >80% |
| Redis (search results) | Search result pages, keyed by full param hash | 30 s | TTL expiry only (eventual consistency acceptable) | 40–60% (diverse queries) |
| Redis (ratings) | Per-business Bayesian rating + review_count | 60 s | Explicit invalidation by Agg Worker on update | >95% (high reuse) |
Hot-cell warming: Popular geo-cells (downtown Manhattan, Shibuya, central London) get query volumes 10–50× higher than the global average. At startup or after cache eviction, these cells cold-start simultaneously — a thundering-herd problem. Mitigate with probabilistic early expiration (cache stampede protection): when a key has <5 s left on its TTL, 5% of requests will fetch fresh data and extend the TTL, while the other 95% still serve the cached value.
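A sketch of that stampede-protection read path with redis-py; recompute is a hypothetical callback that performs the real Elasticsearch/PostgreSQL fetch:

```python
import json
import random

import redis

r = redis.Redis()
EARLY_WINDOW_S = 5   # refresh window before expiry
REFRESH_PROB = 0.05  # 5% of requests volunteer to refresh


def get_with_early_refresh(key: str, ttl_s: int, recompute):
    """Serve from cache, but let ~5% of requests refresh a nearly-expired key."""
    cached = r.get(key)
    remaining = r.ttl(key)  # -2 if key missing, -1 if no TTL set
    nearly_expired = 0 <= remaining < EARLY_WINDOW_S
    if cached is not None and not (nearly_expired and random.random() < REFRESH_PROB):
        return json.loads(cached)
    value = recompute()                       # hit Elasticsearch / PostgreSQL
    r.set(key, json.dumps(value), ex=ttl_s)   # extend the TTL for everyone else
    return value
```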
Cache invalidation
Each cache layer uses a different invalidation strategy, chosen based on the consistency requirement and the write frequency of the underlying data:
Search result cache: TTL-only (no active invalidation) Simplicity over freshness ›
Search results are keyed by hash(lat, lon, radius, category, min_rating). The parameter space is enormous — millions of unique keys. Active invalidation would require tracking which business_ids appear in which cached result pages, creating an inverted index of cache entries per business. This is expensive and fragile. Instead, the 30-second TTL ensures that any stale result is replaced within 30 seconds. For a 1000:1 read:write ratio, a 30-second staleness window is imperceptible to users.
Rating cache: event-driven invalidation + TTL fallback Hybrid approach ›
The Agg Worker explicitly invalidates the rating cache after recomputing the Bayesian average: DEL rating:{business_id} followed by SET rating:{business_id} {json} EX 60. This is write-through: the worker writes the new value to both PostgreSQL and Redis in the same operation. The 60-second TTL acts as a safety net — if the Agg Worker crashes between the DB write and the Redis write, the stale cached value expires within 60 seconds and the next read triggers a cache miss that fetches the correct value from PostgreSQL.
A hardening option is SET ... NX plus a version check: include a monotonically increasing version (review_count) in the cache value and only overwrite if the new version is higher.
CDN cache: purge-on-write for business profile changes Correctness-critical ›
When a business owner updates their profile (hours, address, phone), the Business API triggers an explicit CDN purge for the business detail page URL. The CDN returns stale content for 5–15 seconds during purge propagation across PoPs. For address or coordinate changes, the Business API also triggers a synchronous Elasticsearch document upsert: the search index must reflect the new location immediately, not after a 5-minute CDN TTL expiry.
Cache sizing: Rating cache: 200M businesses × ~200 bytes per entry (business_id, bayesian_avg, review_count, rating_sum) = ~40 GB if fully populated. In practice, only businesses with recent activity are cached — the 60-second TTL evicts inactive entries, so the working set is ~20M active businesses ≈ 4 GB. Search result cache: at 35K peak QPS with a 30-second TTL and ~40% cache hit rate, the working set is ~700K unique cached pages × ~2 KB per page ≈ 1.4 GB. Total Redis memory is ~6 GB, which easily fits in a single Redis node with room for growth. Deploy a 3-node Redis Cluster for availability, not for capacity.
Deep-dive scalability
Three scalability problems surface at Yelp/Google Places scale that don't appear at smaller deployments: Elasticsearch shard hot spots, rating consistency under write bursts, and geographic read skew.
Elasticsearch sharding strategy: avoiding geo hot spots L5 / L6 ›
Naive sharding by document ID distributes businesses uniformly but loses data locality — a geo-distance query must fan out to all shards. Sharding by geohash prefix at precision 2 (cells of roughly 1,250 km × 625 km) concentrates geographically nearby businesses on the same shard, so most radius searches hit one or two shards. At 200M businesses with 6 shards, each shard holds ~33M documents in roughly 50 GB — right at Elasticsearch's recommended shard-size ceiling.
Hot spot risk: a single heavily queried city (New York, Tokyo) lands on one shard. Mitigate with a replica set per shard (2–3 replicas) so search traffic is distributed across replicas. The primary shard still handles all writes, but reads are load-balanced.
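A sketch of prefix-based routing, reusing the python-geohash helper from §5; with real Elasticsearch you would pass the prefix as the routing value and let the cluster map it to a shard, rather than hashing it yourself:

```python
import zlib

import geohash  # python-geohash, as in §5's sketch

NUM_SHARDS = 6


def routing_key(lat: float, lon: float) -> str:
    """Precision-2 geohash prefix, e.g. '9q' for the SF Bay Area."""
    return geohash.encode(lat, lon, 2)


def shard_for(lat: float, lon: float) -> int:
    # crc32 gives a hash that is stable across processes and restarts,
    # unlike Python's salted built-in hash() for strings.
    return zlib.crc32(routing_key(lat, lon).encode()) % NUM_SHARDS
```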
Rating write bursts: viral business problem L6 ›
A restaurant featured on a popular food blog receives 10,000 reviews in an hour. The Kafka consumer processes them sequentially: the Agg Worker recomputes the Bayesian average for the same business 10,000 times. Each compute is cheap (O(1) with sum/count), but the Elasticsearch document update on every recompute causes index write amplification. Solution: coalesce updates in the Agg Worker with a sliding window of 5 seconds — batch all review events for the same business_id arriving within the window into a single Elasticsearch upsert. This reduces index writes from 10,000 to ~720 over an hour.
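A sketch of the coalescing loop; consume_events and upsert_rating are hypothetical stand-ins for the Kafka consumer poll and the Elasticsearch upsert:

```python
import time
from collections import defaultdict

WINDOW_S = 5.0  # coalescing window from the text above


def run_coalescing_consumer(consume_events, upsert_rating):
    """Batch review events per business_id into one ES upsert per 5 s window.

    consume_events(timeout) -> [(business_id, rating)] polls Kafka;
    upsert_rating(business_id, delta) performs the single index write.
    """
    while True:
        deadline = time.monotonic() + WINDOW_S
        pending = defaultdict(lambda: {"sum": 0, "count": 0})
        while time.monotonic() < deadline:
            for business_id, rating in consume_events(timeout=0.5):
                pending[business_id]["sum"] += rating
                pending[business_id]["count"] += 1
        for business_id, delta in pending.items():
            upsert_rating(business_id, delta)  # ONE index write per business
```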
Geographic read skew: multi-region deployment L7 / L8 ›
A single-region deployment from US-East adds 120–200 ms of round-trip latency for users in Tokyo or London — blowing the 200 ms p99 budget before the application has done any work. Multi-region deployment routes users to the nearest region via latency-based DNS (Route 53 or equivalent). Each region hosts its own Elasticsearch cluster seeded with the businesses in that geographic area. Business and review writes go to the closest region's primary DB and are replicated asynchronously to other regions within ~5 seconds.
The consistency implication: a review written in Asia may not appear in the EU region for up to 5 seconds. This is acceptable — cross-region review consistency is not a user-visible correctness requirement.
Distributed ID generation for business_id and review_id L5 / L6 ›
At 200M businesses and ~16 review writes/sec, auto-increment IDs from a single PostgreSQL primary become a bottleneck when the system scales to multiple write primaries (multi-region) or when sharding the Review DB. Three options:
- UUID v4: Globally unique, no coordination. But 128 bits wastes index space, and random UUIDs destroy B-tree locality — INSERT performance degrades as the index fragments. Not recommended for primary keys at this scale.
- UUID v7 (time-ordered): 128-bit UUID with a time-based prefix. Preserves B-tree insertion order. Good default for most systems. Slight overhead from 128-bit key size.
- Snowflake-style IDs: 64-bit ID = 41 bits timestamp + 10 bits machine_id + 13 bits sequence. Compact, time-ordered, globally unique. Requires a machine_id registry. This is our choice — compact keys keep index sizes manageable at 200M+ rows.
Implementation: generate IDs in the application layer (or via pg_sequence with a machine prefix). Each API server instance gets a unique machine_id at startup from a ZooKeeper or etcd lease. At 16 writes/sec, sequence bits never overflow within a millisecond. For multi-region: each region gets a distinct machine_id range, ensuring IDs never collide across regions without cross-region coordination.
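A minimal sketch of the 41/10/13 layout described above; the ZooKeeper/etcd lease that assigns machine_id is elided:

```python
import threading
import time

EPOCH_MS = 1_288_834_974_657  # custom epoch (Twitter's original, for illustration)
# 64-bit layout from above: 41 bits timestamp | 10 bits machine_id | 13 bits sequence


class SnowflakeGenerator:
    def __init__(self, machine_id: int):
        assert 0 <= machine_id < 1024  # must fit in 10 bits
        self.machine_id = machine_id
        self.last_ms = -1
        self.seq = 0
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000) - EPOCH_MS
            if now == self.last_ms:
                self.seq = (self.seq + 1) & 0x1FFF  # 13-bit sequence wraps at 8192
                if self.seq == 0:                   # overflow: spin to the next ms
                    while now <= self.last_ms:
                        now = int(time.time() * 1000) - EPOCH_MS
            else:
                self.seq = 0
            self.last_ms = now
            return (now << 23) | (self.machine_id << 13) | self.seq
```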
PostgreSQL sharding strategy for the Review DB L6 ›
At ~5 TB of review data over the 5-year retention window (§3), a single PostgreSQL primary is under storage pressure, and the dominant access pattern — paginated reviews sorted by created_at for a given business_id — benefits from sharding by business_id to keep all reviews for a business on the same shard. This ensures the pagination query (WHERE business_id = X AND created_at < :cursor ORDER BY created_at DESC LIMIT 20, keyset pagination matching the page_token cursor in §6) hits a single shard with no cross-shard scatter-gather.
Shard count: start with 8 shards using hash-based partitioning (business_id % 8). Each shard holds ~650 GB — well within PostgreSQL's operational range. The shard key choice means that "list all reviews by user X" requires a fan-out query across all 8 shards. This is acceptable: the user-reviews page is low-frequency (users rarely browse their own review history), and the fan-out returns quickly because each shard has an index on (user_id, created_at).
The tempting alternative is to shard by user_id instead. This makes user-reviews pages single-shard, but makes business-reviews pages (the high-frequency query) require fan-out. Wrong tradeoff — always shard by the dominant access pattern.
Operational monitoring: Five metrics to alert on: (1) Elasticsearch p99 query latency >100 ms, (2) Kafka consumer group lag >30 s, (3) Redis cache hit rate dropping below 30% (search results) or 80% (ratings), (4) PostgreSQL replication lag >10 s between primary and read replicas, (5) Review write rate for any single business exceeding 50/hour (fraud signal). Dashboard these in Grafana with a 1-minute scrape interval. The Elasticsearch latency metric is the single best predictor of user-facing search SLA degradation.
Failure modes & edge cases
| Scenario | Problem | Solution | Level |
|---|---|---|---|
| Elasticsearch shard unavailable | Search returns partial results or times out; users see empty results | Circuit-break to degraded mode: serve stale Redis cache with a "results may be outdated" label; Elasticsearch replicas absorb traffic during shard recovery | L3/L4 |
| Redis cache stampede on startup | Cold cache causes all search QPS to hit Elasticsearch simultaneously, overwhelming it | Probabilistic early expiration (5% of requests refresh before TTL expires); pre-warm popular geo-cells with a background job at startup | L5 |
| Review bombing (fake negative reviews) | A competitor submits 1,000 1-star reviews in 10 minutes, tanking a business's rating | Rate-limit reviews per user per business (max 1 review / business / lifetime); velocity check in Review API — flag burst submissions for async spam review before publishing | L5 |
| Kafka consumer lag spike | Agg Worker falls behind; ratings stale beyond 60 s SLA during viral events | Scale consumer group horizontally (one partition per Agg Worker instance); alert on consumer lag >30 s; coalesce updates per business_id to reduce write amplification | L5 |
| Business geo-coordinates updated | Elasticsearch document still points to old location; search returns wrong results for moved businesses | Business API triggers synchronous Elasticsearch upsert on address/coordinate change (not async); accept brief inconsistency window (<2 s) as acceptable for business moves | L5 |
| Geo-cell boundary edge case | Business 100 m from cell edge not returned in radius search because neighbour cell not queried | Always query target cell + 8 neighbours; then filter by exact Haversine distance in Search API application layer; accept ±20% radius boundary tolerance per NFR | L5 |
| Duplicate review on mobile retry | Network timeout causes client to retry; user ends up with two reviews for the same business | Require idempotency key (UUID) in POST /reviews header; store key in reviews.idempotency_key with UNIQUE constraint; return original response on duplicate key conflict | L3/L4 |
| Elasticsearch index corruption / rebuild | Index must be rebuilt from PostgreSQL source of truth; 200M document re-index takes hours | Use index aliasing: build new index in background → swap alias atomically. Keep old index live during rebuild. Elasticsearch index aliases allow zero-downtime re-indexing. | L7/L8 |
Security, rate limiting & compliance
Three security concerns map directly to components already in the architecture: abuse prevention on the review write path, fraudulent business listing manipulation, and regulatory erasure requirements that span every storage tier. L6+ interviewers probe all three explicitly.
Review API rate limiting
Without rate limiting, a single malicious actor can submit thousands of fake reviews to manipulate ratings. The API Gateway enforces two per-user limits on POST /reviews: maximum 1 review per business per user lifetime (enforced via the UNIQUE(user_id, business_id) constraint in the Review DB), plus a fixed-window velocity limit of 10 reviews per hour per user, enforced via a Redis counter (INCR ratelimit:review:{user_id}:{hour_bucket} with a 3600-second TTL). Requests exceeding the limit receive a 429 Too Many Requests response with a Retry-After header.
A separate, higher limit (100 requests/min) applies at the IP level to catch unauthenticated probing before authentication succeeds. This layered approach — IP-level → user-level → business-level — prevents both automated bots and coordinated human review campaigns.
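A sketch of the per-user hourly counter with redis-py, matching the key scheme above; the per-business lifetime limit is enforced by the database constraint, not here:

```python
import time

import redis

r = redis.Redis()


def allow_review(user_id: str, limit: int = 10) -> bool:
    """Fixed-window counter: at most `limit` reviews per user per hour.

    The hour bucket in the key makes windows self-resetting; EXPIRE just
    garbage-collects old buckets.
    """
    hour_bucket = int(time.time() // 3600)
    key = f"ratelimit:review:{user_id}:{hour_bucket}"
    count = r.incr(key)
    if count == 1:               # first hit in this window: set the TTL
        r.expire(key, 3600)
    return count <= limit        # caller returns 429 + Retry-After when False
```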
L5 probe: "How would you prevent a competitor from paying 500 people to leave 1-star reviews on a rival business?" — Per-user-per-business uniqueness constraint prevents multiple reviews. Velocity detection in the Agg Worker flags businesses receiving >50 reviews/hour (10× the 99th percentile) and holds the aggregate update pending async fraud review. The rating cache continues serving the pre-spike value until fraud review completes — preventing the attack from being user-visible during investigation.
Business listing fraud prevention
Fraudulent business listings — fake restaurants created to capture search traffic, or duplicate listings for SEO manipulation — degrade search quality. The Business API enforces three controls:
- Geo-coordinate validation: Reject listings with coordinates in oceans, uninhabited areas, or outside the coverage zone. A simple land-boundary polygon check (pre-computed, ~50 KB dataset) catches the most obvious fakes.
- Duplicate detection: Before creating a business, check for existing listings within 50 m with similar names (Levenshtein distance ≤ 3; see the sketch after this list). Flag duplicates for manual review rather than auto-rejecting — legitimate businesses can share addresses (e.g., food courts).
- Owner verification throttle: Limit unverified accounts to 3 business listings. Verified accounts (via phone, postal mail, or government ID) can create more. This prevents bulk listing spam without blocking legitimate multi-location businesses.
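A sketch of the name-similarity half of the duplicate check; nearby_names would come from a 50 m radius query like the one in §5:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic two-row DP edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def looks_duplicate(new_name: str, nearby_names: list[str]) -> bool:
    """nearby_names: existing listings within 50 m of the new coordinates."""
    canon = new_name.strip().lower()
    return any(levenshtein(canon, n.strip().lower()) <= 3 for n in nearby_names)
```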
GDPR right-to-erasure pipeline
A user deletion request triggers a multi-stage erasure job implemented as a durable workflow (e.g., AWS Step Functions) to survive partial failures. The pipeline must reach every storage tier:
| Tier | Erasure action | SLA |
|---|---|---|
| PostgreSQL (reviews, users) | Hard-delete user row; cascade to reviews table. Update review_aggregates for affected businesses (decrement count, subtract rating). | ≤24 hours |
| Elasticsearch | Reindex affected business documents with updated bayesian_avg after review removal | ≤24 hours |
| Redis (caches) | DEL rating:{biz_id} for each business the user reviewed; invalidate search result cache entries containing those businesses | ≤1 hour |
| Kafka (event log) | Kafka retention is time-windowed (7 days default); review.created events age out automatically | ≤7 days |
| CDN (cached pages) | Purge cached business detail pages that displayed the user's review | ≤30 days (CDN propagation ~15 min after purge) |
Rating recalculation on erasure: Deleting a user's reviews changes the Bayesian average for every business they reviewed. The erasure workflow must trigger an Agg Worker recomputation for each affected business — not just delete the review row. Without this step, the cached rating_sum and review_count in review_aggregates become inconsistent with the actual reviews in the database. For users who reviewed hundreds of businesses, this creates a burst of aggregate updates; batch them with a 10-second coalescing window (same pattern as §9's viral business solution).
L7/L8 probe: "A user deletes their account. They had reviewed 200 businesses. How do you guarantee ratings are correct within 24 hours across all regions?" — The erasure workflow enqueues 200 review.deleted events to Kafka, one per business. The Agg Worker consumes them with coalescing, recomputes aggregates, and updates both PostgreSQL and Elasticsearch. Cross-region replication propagates within ~5 s per region. The 24-hour SLA is easily met. Edge case: if the user's review was the only review for a business, the Bayesian average reverts to the global mean (3.8) — not zero — because of the prior count C=25 in the formula.
How to answer by level
The interviewer calibrates depth to your level. Here's what good looks like at each, and what differentiates it from the level below.
L3 / L4: Can you build a working system? SDE I / SDE II ›
- Identifies that SQL distance queries don't scale; proposes geohash or Elasticsearch
- Draws the basic read/write split: Search API → Elasticsearch; Review API → DB
- Handles the idempotency key for review submission
- Mentions caching search results in Redis with a TTL
- L4 says "use geohash" without explaining precision tradeoff or neighbour cell problem
- L4 proposes synchronous rating recomputation in the write path
- L4 doesn't reason about read:write ratio or its architectural implication
L5: Do you understand the tradeoffs? Senior SDE ›
- Explains geohash precision tradeoff and 9-cell neighbour query
- Designs async rating pipeline via Kafka; uses sum+count not average storage
- Reasons about cache invalidation: TTL-based vs event-driven per cache layer
- Handles review bombing with rate limiting and idempotency
- L5 doesn't address Kafka consumer lag spikes or coalescing updates
- L5 doesn't address hot geo-cell cache stampede on startup
- L5 doesn't discuss Elasticsearch shard strategy by geo prefix
L6: Can you own this end-to-end? Staff SDE ›
- Designs Elasticsearch sharding by geohash prefix to reduce fan-out
- Designs probabilistic cache expiration to prevent stampedes
- Proposes review event coalescing window to handle viral businesses
- Addresses Elasticsearch zero-downtime re-index via alias swapping
- L6 doesn't address multi-region topology and cross-region consistency model
- L6 doesn't frame consistency choices as explicit product decisions
- L6 stays in one region and doesn't consider global latency budget
L7 / L8: Should we build this, and how? Principal / Distinguished ›
- Frames multi-region as AP consistency model with explicit product tradeoff
- Discusses geo-sharding across regions, not just within one Elasticsearch cluster
- Addresses S2 cells vs geohash for global accuracy (no polar distortion)
- Proposes ranking pipeline as a separate subsystem with its own SLA
- L7 proposes multi-region but doesn't frame the consistency model explicitly
- L7 doesn't question whether the Elasticsearch index rebuild SLA is acceptable to the business
- L8 drives the conversation to org-level questions: build vs buy for geo-search infra
Classic probes
| Question | L3/L4 answer | L5/L6 answer | L7/L8 answer |
|---|---|---|---|
| "How would you handle a very popular business with 10k reviews/hour?" | Scale the Review API horizontally | Coalesce Agg Worker updates per business_id in a 5s sliding window to reduce ES write amplification | Separate the viral business path from the normal path; dedicated consumer group + write-coalescing pipeline; explicit SLA relaxation during viral events |
| "A user is right on the boundary of two geohash cells. What happens?" | Use a smaller geohash cell size | Always query target cell + 8 neighbours, then Haversine filter in app layer; precision 6 gives ~1.2 km cells so 9-cell cover handles any 500m radius | For very small radii (<100m) switch to precision 7 dynamically; discuss the precision vs candidate set size tradeoff as a configurable parameter tuned to query patterns |
| "How fresh are search results after a business updates its hours?" | Near-real-time — we update the DB and cache immediately | Business API triggers synchronous ES upsert on profile change; CDN page cache purged; search results reflect change within ~2s | Distinguish between correctness-critical updates (coordinate change, permanent closure) that require synchronous ES update vs non-critical updates (hours, photos) that can propagate via CDC within 60s — different consistency SLAs for different field types |
| "How would you extend this to support real-time business availability (e.g., restaurant current wait time)?" | Add a wait_time field to the business record | Treat wait time as a high-velocity ephemeral signal — store in Redis with a 60s TTL, not in PostgreSQL; read-through from the business's POS system or crowd-sourced check-ins via a separate write path | Model it as a separate "live signals" service with sub-minute write frequency and TTL-based expiry; never persist it to the main data store; discuss the accuracy vs freshness tradeoff for crowd-sourced signals |
How the pieces connect
Every architectural decision traces back to a requirement. Here are the most important chains.
1. Search latency < 200 ms (§2) → SQL geo-queries scan 200M rows → Elasticsearch geo_distance BKD tree (§4, §5) returns candidates in ~20 ms → Redis search result cache (§8) serves repeat popular queries in ~5 ms
2. Rating freshness within 60 s (§2) → synchronous recomputation is O(n) per write → async Kafka consumer (§4, §6) decouples write latency from aggregate update → sum + count storage in review_aggregates (§7) enables O(1) incremental update
3. 1000:1 read:write ratio (§2, §3) → database cannot serve 11,600 QPS → Elasticsearch as primary query engine (§4) — not a relational DB read replica — with three-layer caching (§8) absorbing the majority of traffic
4. Geohash edge-cell problem (§5) → single-cell query misses nearby businesses → 9-cell neighbour query + Haversine post-filter (§5, §6) → search result accuracy within ±20% of stated radius (§2 NFR)
5. Popular geo-cell hot spots (§3, §8) → cache cold-start triggers thundering herd → probabilistic early expiration (§8) prevents stampede → sustained >95% cache hit rate for rating aggregates without coordination overhead
6. Review bombing risk (§10) → raw write path has no fraud signal → idempotency key on POST /reviews (§5b) prevents duplicate submissions + rate limit of 1 review/business/user (§10) limits velocity attacks without blocking legitimate reviews
Related system design articles
- Rate Limiter System Design: atomic Redis operations, distributed race conditions, and multi-tier quota enforcement
- URL Shortener System Design: hash encoding tradeoffs, database sharding strategies, and viral key mitigation
- Web Crawler System Design: Bloom filter deduplication, politeness throttling, and distributed frontier design
- Twitter/X Feed System Design: fan-out write amplification, hybrid push/pull strategy, and celebrity threshold design
- Notification Service System Design: multi-channel delivery, idempotency keys, and priority queues at scale
- Search Autocomplete System Design: Trie data structures, prefix caching, and read-heavy scale strategies
- Key-Value Store System Design: consistent hashing, quorum consensus, and SSTable fundamentals
- Chat System (WhatsApp) System Design: WebSocket management, transient vs persistent storage, and read receipts
- Video Streaming (YouTube) System Design: ABR streaming, CDN distribution, and metadata management
- Distributed Message Queue System Design: Kafka partition tuning, exactly-once delivery, and geo-replication
- File Storage (Dropbox / Google Drive) System Design: chunking, delta sync, conflict resolution, and global deduplication
- Ride-Sharing System Design (Uber / Lyft): geohashing, WebSocket-driven location tracking, and ETA prediction
- Payment Processing System Design: idempotency keys, exactly-once semantics, and append-only ledger models
- Top-K Leaderboard System Design: Redis sorted sets, approximate counting, and stream aggregation
- Airbnb Booking & Reservation System: inventory locks, double-booking prevention, and async Elasticsearch sync
- Photo-Sharing Feed System Design: image pipelines, CDN delivery, and social graph scaling
- Online Judge System Design: secure sandboxing, execution queues, and worker scaling