Design a Proximity Search System
like Yelp or Google Places
Deceptively simple to describe, surprisingly hard to scale: finding the nearest relevant business in under 200 ms across 200 million listings worldwide.
What the interviewer is testing
"Design Yelp" looks like a CRUD app with a map. It isn't. Beneath the surface are three distinct hard problems, and you need a credible answer to all three — not a deep dive into one.
| Problem | Why it's hard | What changes at scale |
|---|---|---|
| Spatial indexing | SQL WHERE distance < X is a full table scan at 200M rows; you need a purpose-built geo-index | A naive radius query is O(n) no matter the hardware; the geo-index that replaces it must also shard gracefully without hot spots |
| Review ranking | Raw average ratings are easily gamed; a 1-review 5-star business beats a 500-review 4.8-star one | Aggregate recomputation on every review write is O(n) at scale; you need incremental precomputation |
| Read-scale imbalance | Searches vastly outnumber writes — users browse far more than they review | The read:write ratio at Yelp-scale is ~1000:1; the system must be read-optimised without sacrificing write correctness |
Level signal: L3/L4 candidates name geohash but can't explain prefix-length tradeoffs. L5 candidates pick geohash vs quadtree correctly and justify it. L6 candidates reason about hot-cell mitigation and ranking freshness SLAs. L7/L8 candidates frame the geo-index sharding strategy and the ranking pipeline as separate subsystems with different consistency requirements.
Requirements clarification
Start with requirements before touching architecture. Every major design decision in this article traces back to a specific NFR target established here.
Functional requirements
| Capability | In scope | Out of scope (for this interview) |
|---|---|---|
| Proximity search | Find businesses within a radius, filtered by category and min-rating; ranked by composite score | ML personalisation, social graph ("friends also liked") |
| Business listing | Create, update, and display business profiles (name, address, hours, photos, category) | Business owner verification, paid listing tiers |
| Reviews | Authenticated users submit a rating (1–5) and text; review is public immediately | Review moderation, image uploads, business responses |
| Aggregated ratings | Per-business average rating and review count, kept current within 60 s of a new review | Sentiment analysis, review helpfulness voting |
Non-functional requirements
| NFR | Target | Why this level? |
|---|---|---|
| Search latency (p99) | < 200 ms | Users abandon map searches after ~300 ms; 200 ms leaves budget for network |
| Rating freshness | Aggregate reflects new reviews within 60 s | Stale ratings mislead booking decisions; real-time is unnecessary |
| Search availability | 99.99% (52 min/year downtime) | Search is the core revenue path; downtime directly loses bookings |
| Write availability | 99.9% (8.7 hr/year) | Review submission can tolerate brief degradation; users will retry |
| Scalability | 200M businesses, 100M DAU, 1000:1 read:write | Yelp/Google Places 2024 reported scale; drives read-optimised architecture |
| Geo accuracy | Search results within stated radius ± 20% | Geohash cell boundaries introduce edge-of-cell error; ± 20% is imperceptible to users |
NFR reasoning
Search latency < 200 ms Drives §4, §5, §8 ›
A full table scan over 200M rows — even with a B-tree index on latitude — takes seconds. Meeting 200 ms p99 requires a spatial index that can narrow candidates to a small cell in O(1), then apply distance filtering only on that subset. This is what drives geohash or quadtree in §5 and the Elasticsearch geo_distance query in §4.
Rating freshness within 60 s Drives §4, §8, §9 ›
Recomputing the Bayesian average on every search query (100K QPS) by joining reviews would collapse the DB. The 60-second SLA means we can precompute and cache the aggregate, updating it asynchronously via a message queue (Kafka) after each review write. Staleness beyond 60 s would be user-visible on trending businesses.
1000:1 read:write ratio Drives §4, §9 ›
At 100M DAU with ~10 searches per active user per day and ~0.1 reviews per user per week, the ratio works out to roughly 700 searches per review write — call it 1000:1 with headroom. This makes the system almost entirely read-bound — it should be designed around read throughput, not write throughput. Write path correctness matters, but write path latency is not the bottleneck.
Capacity estimation
The load profile is extremely read-heavy. Searches dominate at ~1000× the review write rate. The implication: the architecture must optimise for read throughput and caching, not write throughput.
Key insight: At 100M DAU × 10 searches/day, search QPS averages ~11,600 — peaking near 35,000 during lunch and evening hours. This is the number that makes a relational geo-query impossible without a dedicated spatial index. The review write rate (~16 writes/sec) is trivially manageable by a single relational primary; the challenge is entirely on the read side.
| Dimension | Estimate | Key insight |
|---|---|---|
| Elasticsearch index | ~300 GB | 200M docs × 1.5 KB avg → 6 shards of 50 GB each |
| Review DB storage | ~5.2 TB | 100M DAU × 0.1 reviews/week × 52 weeks × 5 yr ≈ 2.6B reviews × 2 KB |
| Geo-cell hot spot | Top 1% cells get 30–40% of queries | Manhattan/Tokyo/central-London cells need dedicated cache warming |
| Peak multiplier | 3× | Lunch + dinner rush for restaurant searches; system must handle 35K search QPS |
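These estimates are reproducible with a few lines of arithmetic. A minimal back-of-envelope sketch, using only the assumptions already stated in §2 and §3:

```python
# Back-of-envelope capacity check; all inputs are the stated assumptions.
DAU = 100_000_000
SEARCHES_PER_USER_PER_DAY = 10
REVIEWS_PER_USER_PER_WEEK = 0.1
REVIEW_SIZE_BYTES = 2_000          # ~2 KB per review row
SECONDS_PER_DAY = 86_400

search_qps = DAU * SEARCHES_PER_USER_PER_DAY / SECONDS_PER_DAY
peak_search_qps = 3 * search_qps                         # lunch/dinner multiplier
review_wps = DAU * REVIEWS_PER_USER_PER_WEEK / 7 / SECONDS_PER_DAY
reviews_5yr = DAU * REVIEWS_PER_USER_PER_WEEK * 52 * 5
storage_5yr_tb = reviews_5yr * REVIEW_SIZE_BYTES / 1e12

print(f"avg search QPS:      {search_qps:,.0f}")         # ~11,600
print(f"peak search QPS:     {peak_search_qps:,.0f}")    # ~34,700
print(f"review writes/sec:   {review_wps:.1f}")          # ~16.5
print(f"5-yr review storage: {storage_5yr_tb:.1f} TB")   # ~5.2 TB
```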
High-level architecture
The architecture separates the read plane (geo-search, business detail, review listing — latency-critical, eventual consistency acceptable) from the write plane (review submission, business updates — correctness-critical, async propagation acceptable). A message queue connects them without coupling write latency to index update latency.
Component breakdown
CDN sits in front of all traffic and serves business detail pages, photo assets, and search result pages that have been explicitly cached. Static assets never hit origin servers.
API Gateway / Load Balancer terminates TLS, validates auth tokens, and routes requests to the correct downstream service fleet. Routes: /search → Search API, /businesses → Business API, /reviews → Review API.
Search API translates a geo+category+rating query into an Elasticsearch geo_distance query, checks the Redis result cache first, and applies the composite ranking score to the returned candidates before responding. It never touches the relational database on the hot path.
Business API handles CRUD for business profiles. Reads are served from read replicas behind a Redis cache for individual business detail pages. Writes go to the primary PostgreSQL node and trigger an Elasticsearch document upsert.
Review API accepts review submissions from authenticated users, writes to the Review DB (PostgreSQL), and publishes a review.created event to a durable message queue (Kafka). It does not recompute ratings synchronously — that happens downstream.
Elasticsearch (geo_point index) holds one document per business: geo-point coordinates, category, current Bayesian rating, review count, and hours. The Search API queries it exclusively for proximity searches. It is not the source of truth — it's a read-optimised projection of the Business DB, kept in sync by the Business API on write.
Redis Cache serves two distinct caches: (1) search result pages keyed by hash(lat, lon, radius, category, min_rating) with a 30-second TTL, and (2) per-business rating aggregates with a 60-second TTL.
Kafka (review events) decouples the Review API's write latency from the aggregate update latency. The Agg Worker consumes review.created events, recomputes the Bayesian rating for the affected business, writes to the review_aggregates table, and invalidates the Redis rating cache.
Agg Worker is a stateless Kafka consumer. Event delivery relies on the outbox pattern: the Review API writes the event to an outbox table in the same PostgreSQL transaction as the review record, and Debezium publishes to Kafka from the WAL. This guarantees the event is never lost even if Kafka is temporarily unavailable during the write.
Architectural rationale
Why Elasticsearch instead of PostGIS for geo-search? Core tradeoff ›
PostGIS can serve radius queries from a GiST spatial index, but at 200M rows combining that lookup with category + rating filters forces multi-index intersections that are difficult to optimise past ~50 ms. Elasticsearch's geo_distance filter uses a pre-built BKD tree (Block KD-tree) that narrows candidates in O(log n) before applying secondary filters, reliably returning results in <20 ms on a 200M document index. It also scales horizontally via sharding with no schema migrations.
Why async rating updates via Kafka? Decoupling ›
Synchronous rating recomputation inside the review write transaction would require a SELECT AVG(rating) over all reviews for the business — O(n) per write. At 16 review writes/sec with businesses accumulating thousands of reviews, this becomes progressively more expensive. Kafka decouples the write from the aggregate update: the review commits in <10 ms, and the aggregate updates within ~60 s in steady state.
A better approach is to store rating_sum and review_count in review_aggregates and increment them atomically — this gives O(1) aggregate updates without scanning all reviews, at the cost of losing the ability to recalculate from scratch without a full review scan.
Why the outbox pattern + CDC instead of direct Kafka producer? Reliability ›
The Review API could publish directly to Kafka after writing to the database. But if Kafka is temporarily unavailable, the review is committed to PostgreSQL but the event is lost: the Agg Worker never recomputes the rating. The outbox pattern solves this: the Review API writes both the review row and an outbox event row in the same database transaction. Debezium (a CDC connector) reads the PostgreSQL WAL and publishes the outbox event to Kafka. Since the WAL is durable, the event is guaranteed to be published eventually — even if Kafka was down during the original write.
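To make the transactional coupling concrete, here is a minimal sketch of the Review API's write path, assuming psycopg2 and hypothetical table names (reviews, outbox_events); the Debezium connector configuration is elided:

```python
import json
import uuid

import psycopg2


def submit_review(conn, business_id: str, user_id: str, rating: int, text: str) -> str:
    """Write the review and its outbox event in ONE transaction.

    Debezium later reads the outbox row from the WAL and publishes it to
    Kafka, so the event survives even if Kafka is down right now.
    """
    review_id = f"rev_{uuid.uuid4().hex[:12]}"  # placeholder; §9 uses Snowflake IDs
    with conn:  # commits on success, rolls back on any exception
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO reviews (review_id, business_id, user_id, rating, text) "
                "VALUES (%s, %s, %s, %s, %s)",
                (review_id, business_id, user_id, rating, text),
            )
            cur.execute(
                "INSERT INTO outbox_events (aggregate_id, event_type, payload) "
                "VALUES (%s, %s, %s)",
                (business_id, "review.created",
                 json.dumps({"review_id": review_id, "rating": rating})),
            )
    return review_id
```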
Real-world comparison
| Decision | This design | Yelp (reported) | Google Places |
|---|---|---|---|
| Geo index | Elasticsearch geo_point (BKD tree) | Elasticsearch (migrated from Solr) | Proprietary S2 geometry cells + BigTable |
| Review storage | PostgreSQL, sharded by business_id | MySQL → PostgreSQL migration | Spanner for global consistency |
| Rating aggregation | Async via Kafka consumer | Periodic batch + real-time increment | Near-real-time with Bigtable increments |
| Search caching | Redis result cache (30 s TTL) | Memcached + Redis hybrid | Aggressive CDN + edge caching |
Core algorithm — spatial indexing
Before designing search, you must answer one foundational question: how do you represent a location in a way that makes "find all businesses within X km of me" fast? The answer isn't obvious — latitude and longitude are continuous, not naturally indexed. Three approaches dominate in production systems.
Our choice: geohash at precision 6, where each cell covers approximately 1.2 km × 0.6 km. For a 500 m search radius, we query the target cell plus its 8 neighbours (9 cells total) to handle edge-of-cell cases, then filter by exact Haversine distance in the application layer. This gives O(1) cell lookup in the layers keyed by geohash (result caching, shard routing), with the distance filter applied to at most a few thousand candidates; the Elasticsearch query itself uses its internal BKD tree (see the note below).
Implementation: geohash radius query ›
// Elasticsearch geo_distance query (used by Search API)
{
"query": {
"bool": {
"filter": [
{
"geo_distance": {
"distance": "500m",
"location": { "lat": 37.7749, "lon": -122.4194 }
}
},
{ "term": { "category": "restaurant" } },
{ "range": { "bayesian_rating": { "gte": 4.0 } } }
]
}
},
"sort": [
{ "_score": { "order": "desc" } },
{ "bayesian_rating": { "order": "desc" } }
],
"size": 20
}
// Bayesian average formula (computed by Agg Worker, stored in ES)
// bayesian_rating = (C × global_mean + sum_ratings) / (C + review_count)
// C = 25 (prior count), global_mean = 3.8 (platform average)
// A new business with 0 reviews starts at 3.8 (the global mean)
The Elasticsearch geo_distance filter uses an internal BKD tree, not a geohash prefix scan. We pass the coordinates directly — Elasticsearch handles the spatial math. The geohash discussion in §5 applies to Redis-based caching and sharding strategies, not to Elasticsearch's internal index structure.
One open question: What happens when a user on a cell boundary gets results from only one cell? The 9-cell neighbour search handles this in most cases. For very small radii (<100 m) in dense areas, precision 7 cells (152 m × 152 m) may be needed to keep the candidate set from being too large. The precision tradeoff is: larger cells = more false candidates to filter; smaller cells = more neighbour lookups.
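A sketch of the 9-cell candidate lookup plus Haversine post-filter described above, as used in the geohash-keyed layers (caching, shard routing). It assumes the python-geohash package; lookup_cell is a hypothetical per-cell index lookup (e.g. a Redis set per cell):

```python
import math

import geohash  # python-geohash package, assumed installed


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres."""
    r = 6_371_000  # mean Earth radius, metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def candidate_cells(lat, lon, precision=6):
    """Target cell + its 8 neighbours: the 9-cell cover from above."""
    center = geohash.encode(lat, lon, precision)
    return [center] + geohash.neighbors(center)


def radius_search(lat, lon, radius_m, lookup_cell):
    """lookup_cell(cell) -> [(biz_id, lat, lon)] is a hypothetical index
    lookup. Exact distance filtering happens here, in the app layer."""
    hits = []
    for cell in candidate_cells(lat, lon):
        for biz_id, blat, blon in lookup_cell(cell):
            d = haversine_m(lat, lon, blat, blon)
            if d <= radius_m:
                hits.append((biz_id, round(d)))
    return sorted(hits, key=lambda h: h[1])
```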
API design
Two endpoints dominate the hot path: the proximity search and the review submission. Both have non-obvious design choices.
GET /search/businesses
Use GET for search (not POST) because search results are cacheable by URL — CDN and intermediary proxies can serve cached results for identical queries. The query parameters form the cache key.
// Request
GET /search/businesses?lat=37.7749&lon=-122.4194&radius_m=500&category=restaurant&min_rating=4.0&page_token=eyJ...
// Response 200 OK
{
"businesses": [
{
"business_id": "biz_8821",
"name": "Mission Pie",
"category": "restaurant",
"distance_m": 312,
"bayesian_rating": 4.6,
"review_count": 843,
"address": "2901 Mission St, San Francisco",
"is_open_now": true,
"thumbnail_url": "https://cdn.example.com/..."
}
],
"next_page_token": "eyJ...",
"result_count": 47
}
POST /reviews
Use POST (not PUT) — review creation is not idempotent. Include an idempotency key header to prevent duplicate reviews on mobile retry.
// Request — Idempotency-Key header required
POST /reviews
Idempotency-Key: 550e8400-e29b-41d4-a716-446655440000
{
"business_id": "biz_8821",
"rating": 5,
"text": "Excellent pie. Highly recommended."
}
// Response 201 Created (review is public immediately; the aggregate rating updates asynchronously within 60 s)
{
"review_id": "rev_449201",
"status": "published",
"rating_update_eta_s": 60
}
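A sketch of the server-side idempotency check, assuming psycopg2 and a UNIQUE constraint on reviews.idempotency_key (as in §11's failure-mode table); the outbox insert from §4 is elided for brevity:

```python
import psycopg2


def create_review_idempotent(conn, idem_key: str, review_fields: dict):
    """Insert the review; on idempotency-key conflict, return the original.

    ON CONFLICT DO NOTHING makes a mobile retry a no-op: RETURNING yields a
    row only when a new review was actually inserted.
    """
    with conn:
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO reviews (idempotency_key, business_id, user_id, rating, text) "
                "VALUES (%(key)s, %(business_id)s, %(user_id)s, %(rating)s, %(text)s) "
                "ON CONFLICT (idempotency_key) DO NOTHING "
                "RETURNING review_id",
                {"key": idem_key, **review_fields},
            )
            row = cur.fetchone()
    if row:                        # first delivery: new review created
        return row[0], 201
    with conn.cursor() as cur:     # retry: echo the original review
        cur.execute(
            "SELECT review_id FROM reviews WHERE idempotency_key = %s", (idem_key,)
        )
        return cur.fetchone()[0], 200
```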
Optional endpoints by level
| Endpoint | Purpose | Level |
|---|---|---|
| GET /businesses/:id | Full business detail (hours, photos, menu link) | L3/L4 |
| GET /businesses/:id/reviews | Paginated review list, sorted by recency or helpfulness | L3/L4 |
| POST /businesses | Create new business listing (host-side) | L3/L4 |
| GET /search/businesses/count | Map view: cluster count of businesses per geo-cell | L5 |
| GET /businesses/trending | Rising businesses in a region (review velocity spike) | L5 |
| POST /search/businesses/bulk-nearby | Multi-origin batch search (logistics, delivery zone coverage) | L7/L8 |
Core flow — search path
The search path is the hot path. It must complete in <200 ms p99, which means every component adds latency budget. The design enforces a strict cache-first strategy: Redis result cache → Elasticsearch → response. The database is never touched on the search hot path.
Latency budget: Redis HIT: ~2 ms cache read + ~3 ms network = 5 ms. Redis MISS: ~5 ms Redis check + ~20 ms Elasticsearch + ~10 ms ranking + ~3 ms cache write = ~38 ms. With ~150 ms of network/CDN overhead, total p99 stays well under 200 ms. If Elasticsearch degrades to >100 ms, the SLA is at risk — circuit-break to a degraded mode that returns stale cache results with a freshness timestamp.
Data model
Data model decisions follow from access patterns — not from what's convenient to normalise. Here the dominant patterns are: proximity search (never touches relational DB), business detail fetch (single-row read, heavily cached), and review listing (paginated scan by business, sorted by time).
Access patterns
| Operation | Frequency | Query shape |
|---|---|---|
| Proximity search | Very high (11,600 QPS avg) | Elasticsearch geo_distance + filters → never touches PostgreSQL |
| Business detail page | High (every search result tap) | Single-row read by business_id — served from Redis or read replica |
| Review list for business | Moderate | Range scan by business_id, ordered by created_at DESC — paginated cursor |
| Submit review | Low (~16 writes/sec) | INSERT into reviews + outbox — single relational shard |
| Rating aggregate read | Very high (embedded in search results) | Single key Redis lookup (bayesian_rating, review_count) per business |
Two things stand out: the search path never touches the relational database, so the PostgreSQL schema only needs to serve detail pages and write operations. And the rating aggregate must be precomputed — it's read far too frequently to compute on demand.
Why store rating_sum and review_count separately? Storing the sum and count allows O(1) incremental updates: when a new review arrives, rating_sum += new_rating and review_count += 1 — no scan of all reviews needed. The Bayesian average is then computed from these two fields. If you only stored the average, every update would require knowing the previous count to reverse-engineer the sum.
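A minimal sketch of the update the Agg Worker performs, using the C = 25 and global mean = 3.8 constants from §5 (agg stands in for the review_aggregates row):

```python
C = 25            # prior review count (platform-level constant)
GLOBAL_MEAN = 3.8  # platform-wide average rating


def apply_review(agg: dict, new_rating: int) -> dict:
    """O(1) incremental update of the stored (sum, count) pair."""
    agg["rating_sum"] += new_rating
    agg["review_count"] += 1
    return agg


def bayesian_rating(agg: dict) -> float:
    """Formula from §5: (C * m + sum) / (C + n); 0 reviews -> global mean."""
    return (C * GLOBAL_MEAN + agg["rating_sum"]) / (C + agg["review_count"])


agg = {"rating_sum": 0, "review_count": 0}
assert round(bayesian_rating(agg), 2) == 3.8   # new business starts at the prior
for r in (5, 5, 5):
    apply_review(agg, r)
print(round(bayesian_rating(agg), 2))  # 3.93: three 5-stars barely move it
```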
Caching strategy
The 1000:1 read:write ratio means caching is not optional — it's the primary scaling mechanism. Three distinct cache layers handle different parts of the read path, each with different TTL and invalidation strategies.
| Layer | What it caches | TTL | Invalidation | Hit rate target |
|---|---|---|---|---|
| CDN | Business detail HTML pages, photo assets | 5 min (Cache-Control) | Purge on business profile update | >80% |
| Redis (search results) | Search result pages, keyed by full param hash | 30 s | TTL expiry only (eventual consistency acceptable) | 40–60% (diverse queries) |
| Redis (ratings) | Per-business Bayesian rating + review_count | 60 s | Explicit invalidation by Agg Worker on update | >95% (high reuse) |
Hot-cell warming: Popular geo-cells (downtown Manhattan, Shibuya, central London) get query volumes 10–50× higher than the global average. At startup or after cache eviction, these cells cold-start simultaneously — a thundering-herd problem. Mitigate with probabilistic early expiration (cache stampede protection): when a key has <5 s left on its TTL, 5% of requests will fetch fresh data and extend the TTL, while the other 95% still serve the cached value.
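A sketch of that stampede-protection read path with redis-py; recompute is a hypothetical callback that performs the real Elasticsearch/PostgreSQL fetch:

```python
import json
import random

import redis

r = redis.Redis()
EARLY_WINDOW_S = 5   # refresh window before expiry
REFRESH_PROB = 0.05  # 5% of requests volunteer to refresh


def get_with_early_refresh(key: str, ttl_s: int, recompute):
    """Serve from cache, but let ~5% of requests refresh a nearly-expired key."""
    cached = r.get(key)
    remaining = r.ttl(key)  # -2 if key missing, -1 if no TTL set
    nearly_expired = 0 <= remaining < EARLY_WINDOW_S
    if cached is not None and not (nearly_expired and random.random() < REFRESH_PROB):
        return json.loads(cached)
    value = recompute()                       # hit Elasticsearch / PostgreSQL
    r.set(key, json.dumps(value), ex=ttl_s)   # extend the TTL for everyone else
    return value
```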
Cache invalidation
Each cache layer uses a different invalidation strategy, chosen based on the consistency requirement and the write frequency of the underlying data:
Search result cache: TTL-only (no active invalidation) Simplicity over freshness ›
Search results are keyed by hash(lat, lon, radius, category, min_rating). The parameter space is enormous — millions of unique keys. Active invalidation would require tracking which business_ids appear in which cached result pages, creating an inverted index of cache entries per business. This is expensive and fragile. Instead, the 30-second TTL ensures that any stale result is replaced within 30 seconds. For a 1000:1 read:write ratio, a 30-second staleness window is imperceptible to users.
Rating cache: event-driven invalidation + TTL fallback Hybrid approach ›
The Agg Worker explicitly invalidates the rating cache after recomputing the Bayesian average: DEL rating:{business_id} followed by SET rating:{business_id} {json} EX 60. This is write-through: the worker writes the new value to both PostgreSQL and Redis in the same operation. The 60-second TTL acts as a safety net — if the Agg Worker crashes between the DB write and the Redis write, the stale cached value expires within 60 seconds and the next read triggers a cache miss that fetches the correct value from PostgreSQL.
A hardening option is SET ... NX plus a version check: include a monotonically increasing version (review_count) in the cache value and only overwrite if the new version is higher.
CDN cache: purge-on-write for business profile changes Correctness-critical ›
When a business owner updates their profile (hours, address, phone), the Business API triggers an explicit CDN purge for the business detail page URL. The CDN returns stale content for 5–15 seconds during purge propagation across PoPs. For address or coordinate changes, the Business API also triggers a synchronous Elasticsearch document upsert: the search index must reflect the new location immediately, not after a 5-minute CDN TTL expiry.
Cache sizing: Rating cache: 200M businesses × ~200 bytes per entry (business_id, bayesian_avg, review_count, rating_sum) = ~40 GB if fully populated. In practice, only businesses with recent activity are cached — the 60-second TTL evicts inactive entries, so the working set is ~20M active businesses ≈ 4 GB. Search result cache: at 35K peak QPS with a 30-second TTL and ~40% cache hit rate, the working set is ~700K unique cached pages × ~2 KB per page ≈ 1.4 GB. Total Redis memory is ~6 GB, which easily fits in a single Redis node with room for growth. Deploy a 3-node Redis Cluster for availability, not for capacity.
Deep-dive scalability
Three scalability problems surface at Yelp/Google Places scale that don't appear at smaller deployments: Elasticsearch shard hot spots, rating consistency under write bursts, and geographic read skew.
Elasticsearch sharding strategy: avoiding geo hot spots L5 / L6 ›
Naive sharding by document ID distributes businesses uniformly but loses data locality — a geo-distance query must fan out to all shards. Sharding by geohash prefix at precision 2 (cells of roughly 1,250 km × 625 km) concentrates geographically nearby businesses on the same shard, so most radius searches hit one or two shards. At 200M businesses with 6 shards, each shard holds ~33M documents in roughly 50 GB — right at Elasticsearch's recommended shard-size ceiling.
Hot spot risk: a single heavily queried city (New York, Tokyo) lands on one shard. Mitigate with a replica set per shard (2–3 replicas) so search traffic is distributed across replicas. The primary shard still handles all writes, but reads are load-balanced.
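A sketch of prefix-based routing, reusing the python-geohash helper from §5; with real Elasticsearch you would pass the prefix as the routing value and let the cluster map it to a shard, rather than hashing it yourself:

```python
import zlib

import geohash  # python-geohash, as in §5's sketch

NUM_SHARDS = 6


def routing_key(lat: float, lon: float) -> str:
    """Precision-2 geohash prefix, e.g. '9q' for the SF Bay Area."""
    return geohash.encode(lat, lon, 2)


def shard_for(lat: float, lon: float) -> int:
    # crc32 gives a hash that is stable across processes and restarts,
    # unlike Python's salted built-in hash() for strings.
    return zlib.crc32(routing_key(lat, lon).encode()) % NUM_SHARDS
```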
Rating write bursts: viral business problem L6 ›
A restaurant featured on a popular food blog receives 10,000 reviews in an hour. The Kafka consumer processes them sequentially: the Agg Worker recomputes the Bayesian average for the same business 10,000 times. Each compute is cheap (O(1) with sum/count), but the Elasticsearch document update on every recompute causes index write amplification. Solution: coalesce updates in the Agg Worker with a sliding window of 5 seconds — batch all review events for the same business_id arriving within the window into a single Elasticsearch upsert. This reduces index writes from 10,000 to ~720 over an hour.
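A sketch of the coalescing loop; consume_events and upsert_rating are hypothetical stand-ins for the Kafka consumer poll and the Elasticsearch upsert:

```python
import time
from collections import defaultdict

WINDOW_S = 5.0  # coalescing window from the text above


def run_coalescing_consumer(consume_events, upsert_rating):
    """Batch review events per business_id into one ES upsert per 5 s window.

    consume_events(timeout) -> [(business_id, rating)] polls Kafka;
    upsert_rating(business_id, delta) performs the single index write.
    """
    while True:
        deadline = time.monotonic() + WINDOW_S
        pending = defaultdict(lambda: {"sum": 0, "count": 0})
        while time.monotonic() < deadline:
            for business_id, rating in consume_events(timeout=0.5):
                pending[business_id]["sum"] += rating
                pending[business_id]["count"] += 1
        for business_id, delta in pending.items():
            upsert_rating(business_id, delta)  # ONE index write per business
```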
Geographic read skew: multi-region deployment L7 / L8 ›
A single-region deployment from US-East adds 120–200 ms of round-trip latency for users in Tokyo or London — blowing the 200 ms p99 budget before the application has done any work. Multi-region deployment routes users to the nearest region via latency-based DNS (Route 53 or equivalent). Each region hosts its own Elasticsearch cluster seeded with the businesses in that geographic area. Business and review writes go to the closest region's primary DB and are replicated asynchronously to other regions within ~5 seconds.
The consistency implication: a review written in Asia may not appear in the EU region for up to 5 seconds. This is acceptable — cross-region review consistency is not a user-visible correctness requirement.
Distributed ID generation for business_id and review_id L5 / L6 ›
At 200M businesses and ~16 review writes/sec, auto-increment IDs from a single PostgreSQL primary become a bottleneck when the system scales to multiple write primaries (multi-region) or when sharding the Review DB. Three options:
- UUID v4: Globally unique, no coordination. But 128 bits wastes index space, and random UUIDs destroy B-tree locality — INSERT performance degrades as the index fragments. Not recommended for primary keys at this scale.
- UUID v7 (time-ordered): 128-bit UUID with a time-based prefix. Preserves B-tree insertion order. Good default for most systems. Slight overhead from 128-bit key size.
- Snowflake-style IDs: 64-bit ID = 41 bits timestamp + 10 bits machine_id + 13 bits sequence. Compact, time-ordered, globally unique. Requires a machine_id registry. This is our choice — compact keys keep index sizes manageable at 200M+ rows.
Implementation: generate IDs in the application layer (or via pg_sequence with a machine prefix). Each API server instance gets a unique machine_id at startup from a ZooKeeper or etcd lease. At 16 writes/sec, sequence bits never overflow within a millisecond. For multi-region: each region gets a distinct machine_id range, ensuring IDs never collide across regions without cross-region coordination.
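A minimal sketch of the 41/10/13 layout described above; the ZooKeeper/etcd lease that assigns machine_id is elided:

```python
import threading
import time

EPOCH_MS = 1_288_834_974_657  # custom epoch (Twitter's original, for illustration)
# 64-bit layout from above: 41 bits timestamp | 10 bits machine_id | 13 bits sequence


class SnowflakeGenerator:
    def __init__(self, machine_id: int):
        assert 0 <= machine_id < 1024  # must fit in 10 bits
        self.machine_id = machine_id
        self.last_ms = -1
        self.seq = 0
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000) - EPOCH_MS
            if now == self.last_ms:
                self.seq = (self.seq + 1) & 0x1FFF  # 13-bit sequence wraps at 8192
                if self.seq == 0:                   # overflow: spin to the next ms
                    while now <= self.last_ms:
                        now = int(time.time() * 1000) - EPOCH_MS
            else:
                self.seq = 0
            self.last_ms = now
            return (now << 23) | (self.machine_id << 13) | self.seq
```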
PostgreSQL sharding strategy for the Review DB L6 ›
At ~5 TB of review data over the 5-year retention window (§3), a single PostgreSQL primary is under storage pressure, and the dominant access pattern — paginated reviews sorted by created_at for a given business_id — benefits from sharding by business_id to keep all reviews for a business on the same shard. This ensures the pagination query (WHERE business_id = X AND created_at < :cursor ORDER BY created_at DESC LIMIT 20, keyset pagination matching the page_token cursor in §6) hits a single shard with no cross-shard scatter-gather.
Shard count: start with 8 shards using hash-based partitioning (business_id % 8). Each shard holds ~650 GB — well within PostgreSQL's operational range. The shard key choice means that "list all reviews by user X" requires a fan-out query across all 8 shards. This is acceptable: the user-reviews page is low-frequency (users rarely browse their own review history), and the fan-out returns quickly because each shard has an index on (user_id, created_at).
The tempting alternative is to shard by user_id instead. This makes user-reviews pages single-shard, but makes business-reviews pages (the high-frequency query) require fan-out. Wrong tradeoff — always shard by the dominant access pattern.
Operational monitoring: Five metrics to alert on: (1) Elasticsearch p99 query latency >100 ms, (2) Kafka consumer group lag >30 s, (3) Redis cache hit rate dropping below 30% (search results) or 80% (ratings), (4) PostgreSQL replication lag >10 s between primary and read replicas, (5) Review write rate for any single business exceeding 50/hour (fraud signal). Dashboard these in Grafana with a 1-minute scrape interval. The Elasticsearch latency metric is the single best predictor of user-facing search SLA degradation.
Failure modes & edge cases
| Scenario | Problem | Solution | Level |
|---|---|---|---|
| Elasticsearch shard unavailable | Search returns partial results or times out; users see empty results | Circuit-break to degraded mode: serve stale Redis cache with a "results may be outdated" label; Elasticsearch replicas absorb traffic during shard recovery | L3/L4 |
| Redis cache stampede on startup | Cold cache causes all search QPS to hit Elasticsearch simultaneously, overwhelming it | Probabilistic early expiration (5% of requests refresh before TTL expires); pre-warm popular geo-cells with a background job at startup | L5 |
| Review bombing (fake negative reviews) | A competitor submits 1,000 1-star reviews in 10 minutes, tanking a business's rating | Rate-limit reviews per user per business (max 1 review / business / lifetime); velocity check in Review API — flag burst submissions for async spam review before publishing | L5 |
| Kafka consumer lag spike | Agg Worker falls behind; ratings stale beyond 60 s SLA during viral events | Scale consumer group horizontally (one partition per Agg Worker instance); alert on consumer lag >30 s; coalesce updates per business_id to reduce write amplification | L5 |
| Business geo-coordinates updated | Elasticsearch document still points to old location; search returns wrong results for moved businesses | Business API triggers synchronous Elasticsearch upsert on address/coordinate change (not async); accept brief inconsistency window (<2 s) as acceptable for business moves | L5 |
| Geo-cell boundary edge case | Business 100 m from cell edge not returned in radius search because neighbour cell not queried | Always query target cell + 8 neighbours; then filter by exact Haversine distance in Search API application layer; accept ±20% radius boundary tolerance per NFR | L5 |
| Duplicate review on mobile retry | Network timeout causes client to retry; user ends up with two reviews for the same business | Require idempotency key (UUID) in POST /reviews header; store key in reviews.idempotency_key with UNIQUE constraint; return original response on duplicate key conflict | L3/L4 |
| Elasticsearch index corruption / rebuild | Index must be rebuilt from PostgreSQL source of truth; 200M document re-index takes hours | Use index aliasing: build new index in background → swap alias atomically. Keep old index live during rebuild. Elasticsearch index aliases allow zero-downtime re-indexing. | L7/L8 |
Security, rate limiting & compliance
Three security concerns map directly to components already in the architecture: abuse prevention on the review write path, fraudulent business listing manipulation, and regulatory erasure requirements that span every storage tier. L6+ interviewers probe all three explicitly.
Review API rate limiting
Without rate limiting, a single malicious actor can submit thousands of fake reviews to manipulate ratings. The API Gateway enforces two per-user limits on POST /reviews: maximum 1 review per business per user lifetime (enforced via the UNIQUE(user_id, business_id) constraint in the Review DB), plus a fixed-window velocity limit of 10 reviews per hour per user, enforced via a Redis counter (INCR ratelimit:review:{user_id}:{hour_bucket} with a 3600-second TTL). Requests exceeding the limit receive a 429 Too Many Requests response with a Retry-After header.
A separate, higher limit (100 requests/min) applies at the IP level to catch unauthenticated probing before authentication succeeds. This layered approach — IP-level → user-level → business-level — prevents both automated bots and coordinated human review campaigns.
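A sketch of the per-user hourly counter with redis-py, matching the key scheme above; the per-business lifetime limit is enforced by the database constraint, not here:

```python
import time

import redis

r = redis.Redis()


def allow_review(user_id: str, limit: int = 10) -> bool:
    """Fixed-window counter: at most `limit` reviews per user per hour.

    The hour bucket in the key makes windows self-resetting; EXPIRE just
    garbage-collects old buckets.
    """
    hour_bucket = int(time.time() // 3600)
    key = f"ratelimit:review:{user_id}:{hour_bucket}"
    count = r.incr(key)
    if count == 1:               # first hit in this window: set the TTL
        r.expire(key, 3600)
    return count <= limit        # caller returns 429 + Retry-After when False
```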
L5 probe: "How would you prevent a competitor from paying 500 people to leave 1-star reviews on a rival business?" — Per-user-per-business uniqueness constraint prevents multiple reviews. Velocity detection in the Agg Worker flags businesses receiving >50 reviews/hour (10× the 99th percentile) and holds the aggregate update pending async fraud review. The rating cache continues serving the pre-spike value until fraud review completes — preventing the attack from being user-visible during investigation.
Business listing fraud prevention
Fraudulent business listings — fake restaurants created to capture search traffic, or duplicate listings for SEO manipulation — degrade search quality. The Business API enforces three controls:
- Geo-coordinate validation: Reject listings with coordinates in oceans, uninhabited areas, or outside the coverage zone. A simple land-boundary polygon check (pre-computed, ~50 KB dataset) catches the most obvious fakes.
- Duplicate detection: Before creating a business, check for existing listings within 50 m with similar names (Levenshtein distance ≤ 3; see the sketch after this list). Flag duplicates for manual review rather than auto-rejecting — legitimate businesses can share addresses (e.g., food courts).
- Owner verification throttle: Limit unverified accounts to 3 business listings. Verified accounts (via phone, postal mail, or government ID) can create more. This prevents bulk listing spam without blocking legitimate multi-location businesses.
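A sketch of the name-similarity half of the duplicate check; nearby_names would come from a 50 m radius query like the one in §5:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic two-row DP edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def looks_duplicate(new_name: str, nearby_names: list[str]) -> bool:
    """nearby_names: existing listings within 50 m of the new coordinates."""
    canon = new_name.strip().lower()
    return any(levenshtein(canon, n.strip().lower()) <= 3 for n in nearby_names)
```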
GDPR right-to-erasure pipeline
A user deletion request triggers a multi-stage erasure job implemented as a durable workflow (e.g., AWS Step Functions) to survive partial failures. The pipeline must reach every storage tier:
| Tier | Erasure action | SLA |
|---|---|---|
| PostgreSQL (reviews, users) | Hard-delete user row; cascade to reviews table. Update review_aggregates for affected businesses (decrement count, subtract rating). | ≤24 hours |
| Elasticsearch | Reindex affected business documents with updated bayesian_avg after review removal | ≤24 hours |
| Redis (caches) | DEL rating:{biz_id} for each business the user reviewed; invalidate search result cache entries containing those businesses | ≤1 hour |
| Kafka (event log) | Kafka retention is time-windowed (7 days default); review.created events age out automatically | ≤7 days |
| CDN (cached pages) | Purge cached business detail pages that displayed the user's review | ≤30 days (CDN propagation ~15 min after purge) |
Rating recalculation on erasure: Deleting a user's reviews changes the Bayesian average for every business they reviewed. The erasure workflow must trigger an Agg Worker recomputation for each affected business — not just delete the review row. Without this step, the cached rating_sum and review_count in review_aggregates become inconsistent with the actual reviews in the database. For users who reviewed hundreds of businesses, this creates a burst of aggregate updates; batch them with a 10-second coalescing window (same pattern as §9's viral business solution).
L7/L8 probe: "A user deletes their account. They had reviewed 200 businesses. How do you guarantee ratings are correct within 24 hours across all regions?" — The erasure workflow enqueues 200 review.deleted events to Kafka, one per business. The Agg Worker consumes them with coalescing, recomputes aggregates, and updates both PostgreSQL and Elasticsearch. Cross-region replication propagates within ~5 s per region. The 24-hour SLA is easily met. Edge case: if the user's review was the only review for a business, the Bayesian average reverts to the global mean (3.8) — not zero — because of the prior count C=25 in the formula.
How to answer by level
The interviewer calibrates depth to your level. Here's what good looks like at each, and what differentiates it from the level below.
L3 / L4: Can you build a working system? SDE I / SDE II ›
- Identifies that SQL distance queries don't scale; proposes geohash or Elasticsearch
- Draws the basic read/write split: Search API → Elasticsearch; Review API → DB
- Handles the idempotency key for review submission
- Mentions caching search results in Redis with a TTL
- L4 says "use geohash" without explaining precision tradeoff or neighbour cell problem
- L4 proposes synchronous rating recomputation in the write path
- L4 doesn't reason about read:write ratio or its architectural implication
L5: Do you understand the tradeoffs? Senior SDE ›
- Explains geohash precision tradeoff and 9-cell neighbour query
- Designs async rating pipeline via Kafka; uses sum+count not average storage
- Reasons about cache invalidation: TTL-based vs event-driven per cache layer
- Handles review bombing with rate limiting and idempotency
- L5 doesn't address Kafka consumer lag spikes or coalescing updates
- L5 doesn't address hot geo-cell cache stampede on startup
- L5 doesn't discuss Elasticsearch shard strategy by geo prefix
L6: Can you own this end-to-end? Staff SDE ›
- Designs Elasticsearch sharding by geohash prefix to reduce fan-out
- Designs probabilistic cache expiration to prevent stampedes
- Proposes review event coalescing window to handle viral businesses
- Addresses Elasticsearch zero-downtime re-index via alias swapping
- L6 doesn't address multi-region topology and cross-region consistency model
- L6 doesn't frame consistency choices as explicit product decisions
- L6 stays in one region and doesn't consider global latency budget
L7 / L8: Should we build this, and how? Principal / Distinguished ›
- Frames multi-region as AP consistency model with explicit product tradeoff
- Discusses geo-sharding across regions, not just within one Elasticsearch cluster
- Addresses S2 cells vs geohash for global accuracy (no polar distortion)
- Proposes ranking pipeline as a separate subsystem with its own SLA
- L7 proposes multi-region but doesn't frame the consistency model explicitly
- L7 doesn't question whether the Elasticsearch index rebuild SLA is acceptable to the business
- L8 drives the conversation to org-level questions: build vs buy for geo-search infra
Classic probes
| Question | L3/L4 answer | L5/L6 answer | L7/L8 answer |
|---|---|---|---|
| "How would you handle a very popular business with 10k reviews/hour?" | Scale the Review API horizontally | Coalesce Agg Worker updates per business_id in a 5s sliding window to reduce ES write amplification | Separate the viral business path from the normal path; dedicated consumer group + write-coalescing pipeline; explicit SLA relaxation during viral events |
| "A user is right on the boundary of two geohash cells. What happens?" | Use a smaller geohash cell size | Always query target cell + 8 neighbours, then Haversine filter in app layer; precision 6 gives ~1.2 km cells so 9-cell cover handles any 500m radius | For very small radii (<100m) switch to precision 7 dynamically; discuss the precision vs candidate set size tradeoff as a configurable parameter tuned to query patterns |
| "How fresh are search results after a business updates its hours?" | Near-real-time — we update the DB and cache immediately | Business API triggers synchronous ES upsert on profile change; CDN page cache purged; search results reflect change within ~2s | Distinguish between correctness-critical updates (coordinate change, permanent closure) that require synchronous ES update vs non-critical updates (hours, photos) that can propagate via CDC within 60s — different consistency SLAs for different field types |
| "How would you extend this to support real-time business availability (e.g., restaurant current wait time)?" | Add a wait_time field to the business record | Treat wait time as a high-velocity ephemeral signal — store in Redis with a 60s TTL, not in PostgreSQL; read-through from the business's POS system or crowd-sourced check-ins via a separate write path | Model it as a separate "live signals" service with sub-minute write frequency and TTL-based expiry; never persist it to the main data store; discuss the accuracy vs freshness tradeoff for crowd-sourced signals |
How the pieces connect
Every architectural decision traces back to a requirement. Here are the most important chains.
1. Search latency < 200 ms (§2) → SQL geo-queries scan 200M rows → Elasticsearch geo_distance BKD tree (§4, §5) returns candidates in ~20 ms → Redis search result cache (§8) serves repeat popular queries in ~5 ms
2. Rating freshness within 60 s (§2) → synchronous recomputation is O(n) per write → async Kafka consumer (§4, §6) decouples write latency from aggregate update → sum + count storage in review_aggregates (§7) enables O(1) incremental update
3. 1000:1 read:write ratio (§2, §3) → database cannot serve 11,600 QPS → Elasticsearch as primary query engine (§4) — not a relational DB read replica — with three-layer caching (§8) absorbing the majority of traffic
4. Geohash edge-cell problem (§5) → single-cell query misses nearby businesses → 9-cell neighbour query + Haversine post-filter (§5, §6) → search result accuracy within ±20% of stated radius (§2 NFR)
5. Popular geo-cell hot spots (§3, §8) → cache cold-start triggers thundering herd → probabilistic early expiration (§8) prevents stampede → sustained >95% cache hit rate for rating aggregates without coordination overhead
6. Review bombing risk (§10) → raw write path has no fraud signal → idempotency key on POST /reviews (§5b) prevents duplicate submissions + rate limit of 1 review/business/user (§10) limits velocity attacks without blocking legitimate reviews
Related system design articles
- Rate Limiter System Design: atomic Redis operations, distributed race conditions, and multi-tier quota enforcement
- URL Shortener System Design: hash encoding tradeoffs, database sharding strategies, and viral key mitigation
- Web Crawler System Design: Bloom filter deduplication, politeness throttling, and distributed frontier design
- Twitter/X Feed System Design: fan-out write amplification, hybrid push/pull strategy, and celebrity threshold design
- Notification Service System Design: multi-channel delivery, idempotency keys, and priority queues at scale
- Search Autocomplete System Design: Trie data structures, prefix caching, and read-heavy scale strategies
- Key-Value Store System Design: consistent hashing, quorum consensus, and SSTable fundamentals
- Chat System (WhatsApp) System Design: WebSocket management, transient vs persistent storage, and read receipts
- Video Streaming (YouTube) System Design: ABR streaming, CDN distribution, and metadata management
- Distributed Message Queue System Design: Kafka partition tuning, exactly-once delivery, and geo-replication
- File Storage (Dropbox / Google Drive) System Design: chunking, delta sync, conflict resolution, and global deduplication
- Ride-Sharing System Design (Uber / Lyft): geohashing, WebSocket-driven location tracking, and ETA prediction
- Payment Processing System Design: idempotency keys, exactly-once semantics, and append-only ledger models
- Top-K Leaderboard System Design: Redis sorted sets, approximate counting, and stream aggregation
- Airbnb Booking & Reservation System: inventory locks, double-booking prevention, and async Elasticsearch sync
- Photo-Sharing Feed System Design: image pipelines, CDN delivery, and social graph scaling
- Online Judge System Design: secure sandboxing, execution queues, and worker scaling