System Design Interview Guide

Design a Proximity Search System
like Yelp or Google Places

Deceptively simple to describe, surprisingly hard to scale: finding the nearest relevant business in under 200 ms across 200 million listings worldwide.

L3 / L4 — Build it · L5 / L6 — Own the tradeoffs · L7 / L8 — Drive the architecture

~30 min read · 11 sections · capacity estimator

01

What the interviewer is testing

"Design Yelp" looks like a CRUD app with a map. It isn't. Beneath the surface are three distinct hard problems, and you need a credible answer to all three — not a deep dive into one.

| Problem | Why it's hard | What changes at scale |
| --- | --- | --- |
| Spatial indexing | SQL WHERE distance < X is a full table scan at 200M rows; you need a purpose-built geo-index | Naive radius queries are O(n) in table size; the geo-index must shard gracefully without hot spots |
| Review ranking | Raw average ratings are easily gamed; a 1-review 5-star business beats a 500-review 4.8-star one | Aggregate recomputation on every review write is O(n) at scale; you need incremental precomputation |
| Read-scale imbalance | Searches vastly outnumber writes — users browse far more than they review | The read:write ratio at Yelp scale is ~1000:1; the system must be read-optimised without sacrificing write correctness |
🎯

Level signal: L3/L4 candidates name geohash but can't explain prefix-length tradeoffs. L5 candidates pick geohash vs quadtree correctly and justify it. L6 candidates reason about hot-cell mitigation and ranking freshness SLAs. L7/L8 candidates frame the geo-index sharding strategy and the ranking pipeline as separate subsystems with different consistency requirements.

02

Requirements clarification

Start with requirements before touching architecture. Every major design decision in this article traces back to a specific NFR target established here.

Functional requirements

| Capability | In scope | Out of scope (for this interview) |
| --- | --- | --- |
| Proximity search | Find businesses within a radius, filtered by category and min-rating; ranked by composite score | ML personalisation, social graph ("friends also liked") |
| Business listing | Create, update, and display business profiles (name, address, hours, photos, category) | Business owner verification, paid listing tiers |
| Reviews | Authenticated users submit a rating (1–5) and text; review is public immediately | Review moderation, image uploads, business responses |
| Aggregated ratings | Per-business average rating and review count, kept current within 60 s of a new review | Sentiment analysis, review helpfulness voting |

Non-functional requirements

| NFR | Target | Why this level? |
| --- | --- | --- |
| Search latency (p99) | < 200 ms | Users abandon map searches after ~300 ms; 200 ms leaves budget for network |
| Rating freshness | Aggregate reflects new reviews within 60 s | Stale ratings mislead booking decisions; real-time is unnecessary |
| Search availability | 99.99% (52 min/year downtime) | Search is the core revenue path; downtime directly loses bookings |
| Write availability | 99.9% (8.7 hr/year) | Review submission can tolerate brief degradation; users will retry |
| Scalability | 200M businesses, 100M DAU, 1000:1 read:write | Yelp/Google Places 2024 reported scale; drives read-optimised architecture |
| Geo accuracy | Search results within stated radius ± 20% | Geohash cell boundaries introduce edge-of-cell error; ± 20% is imperceptible to users |

NFR reasoning

Search latency < 200 ms Drives §4, §5, §8

A radius query over 200M rows — even with a B-tree index on latitude — takes seconds, because a one-dimensional index cannot bound both latitude and longitude at once. Meeting 200 ms p99 requires a spatial index that can narrow candidates to a small cell in O(1), then apply distance filtering only on that subset. This is what drives geohash or quadtree in §5 and the Elasticsearch geo_distance query in §4.

Tradeoff: Tighter latency targets (e.g., < 50 ms) would require in-memory spatial indexes everywhere and eliminate any database fallback path. 200 ms allows one Redis cache miss and one Elasticsearch query in the critical path.
Rating freshness within 60 s Drives §4, §8, §9

Recomputing the Bayesian average on every search query (100K QPS) by joining reviews would collapse the DB. The 60-second SLA means we can precompute and cache the aggregate, updating it asynchronously via a message queue (Kafka) after each review write. Staleness beyond 60 s would be user-visible on trending businesses.

What this drives: A review_aggregates table updated by an async consumer (§4, §7). The aggregate is cached in Redis with a 60-second TTL (§8), not recomputed at query time.
1000:1 read:write ratio Drives §4, §9

At 100M DAU with ~10 searches per active user per day and ~0.1 reviews per user per week, the ratio is roughly 1000 searches per review write. This makes the system almost entirely read-bound — it should be designed around read throughput, not write throughput. Write path correctness matters, but write path latency is not the bottleneck.

What this drives: Multiple read replicas for PostgreSQL, Elasticsearch as the primary query engine (not a relational DB), and aggressive result caching at §8. The write path (review submission) can afford to be async.
03

Capacity estimation

The load profile is extremely read-heavy. Searches dominate at ~1000× the review write rate. The implication: the architecture must optimise for read throughput and caching, not write throughput.

Capacity estimator (worked example with default inputs)

| Input | Value |
| --- | --- |
| Daily active users | 100 M |
| Searches per user per day | 10 |
| Reviews per user per week | 0.1 |
| Businesses indexed | 200 M |

| Output | Value | Basis |
| --- | --- | --- |
| Search QPS | ~11,600 | queries / second (avg) |
| Peak search QPS | ~35,000 | 3× avg (lunch / evening) |
| Review writes | ~16 / s | writes / second |
| Business index size | ~300 GB | Elasticsearch, ≈1.5 KB/doc |
| Review storage | ~5 TB | 5 yr retention, ≈2 KB/review |
| Read:write ratio | ~1000:1 | searches vs review writes |
💡

Key insight: At 100M DAU × 10 searches/day, search QPS averages ~11,600 — peaking near 35,000 during lunch and evening hours. This is the number that makes a relational geo-query impossible without a dedicated spatial index. The review write rate (~16 writes/sec) is trivially manageable by a single relational primary; the challenge is entirely on the read side.
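The same arithmetic as the estimator, as a quick back-of-envelope script (a sketch; the constants are the default inputs above):

# Back-of-envelope capacity estimator (constants = the estimator defaults)
DAU = 100e6                        # daily active users
SEARCHES_PER_USER_PER_DAY = 10
REVIEWS_PER_USER_PER_WEEK = 0.1
BUSINESSES = 200e6
ES_DOC_BYTES = 1_500               # ≈1.5 KB per Elasticsearch document
REVIEW_BYTES = 2_000               # ≈2 KB per stored review
RETENTION_YEARS = 5

search_qps = DAU * SEARCHES_PER_USER_PER_DAY / 86_400            # ≈11,600
peak_search_qps = 3 * search_qps                                 # ≈35,000
review_wps = DAU * REVIEWS_PER_USER_PER_WEEK / (7 * 86_400)      # ≈16.5
index_gb = BUSINESSES * ES_DOC_BYTES / 1e9                       # ≈300 GB
review_tb = (review_wps * 86_400 * 365 * RETENTION_YEARS
             * REVIEW_BYTES / 1e12)                              # ≈5.2 TB
ratio = search_qps / review_wps    # ≈700, i.e. "order 1000:1" as in the text

print(f"avg {search_qps:,.0f} QPS, peak {peak_search_qps:,.0f} QPS, "
      f"{review_wps:.1f} review writes/s, index {index_gb:.0f} GB, "
      f"reviews {review_tb:.1f} TB, read:write ≈ {ratio:,.0f}:1")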

| Dimension | Estimate | Key insight |
| --- | --- | --- |
| Elasticsearch index | ~300 GB | 200M docs × 1.5 KB avg → 6 shards of 50 GB each |
| Review DB storage | ~5 TB | 100M DAU × 0.1 reviews/week × 52 weeks × 5 yr × 2 KB ≈ 5.2 TB |
| Geo-cell hot spots | Top 1% of cells receive 30–40% of queries | Manhattan/Tokyo/central-London cells need dedicated cache warming |
| Peak multiplier | 3× average | Lunch and dinner rush for restaurant searches; the system must handle 35K search QPS |
04

High-level architecture

The architecture separates the read plane (geo-search, business detail, review listing — latency-critical, eventual consistency acceptable) from the write plane (review submission, business updates — correctness-critical, async propagation acceptable). A message queue connects them without coupling write latency to index update latency.

[Figure 1 diagram — Client (web / app) → CDN (static + cached) → API Gateway (LB + auth) → Search API (geo + rank), Business API (CRUD), Review API (submit + list); Elasticsearch geo_point index; Redis cache (results + ratings); Business DB and Review DB (PostgreSQL); Kafka (review events) → Agg Worker (rating recompute). Legend: solid arrows = sync, dashed = async.]
Figure 1 — High-level architecture. Search and Business APIs are read-optimised. Review API writes to the Review DB and publishes to Kafka; an Agg Worker updates rating caches asynchronously. Dashed arrows = async.

Component breakdown

CDN sits in front of all traffic and serves business detail pages, photo assets, and search result pages that have been explicitly cached. Static assets never hit origin servers.

API Gateway / Load Balancer terminates TLS, validates auth tokens, and routes requests to the correct downstream service fleet. Routes: /search → Search API, /businesses → Business API, /reviews → Review API.

Search API translates a geo+category+rating query into an Elasticsearch geo_distance query, checks the Redis result cache first, and applies the composite ranking score to the returned candidates before responding. It never touches the relational database on the hot path.

Business API handles CRUD for business profiles. Reads are served from read replicas behind a Redis cache for individual business detail pages. Writes go to the primary PostgreSQL node and trigger an Elasticsearch document upsert.

Review API accepts review submissions from authenticated users, writes to the Review DB (PostgreSQL), and publishes a review.created event to a durable message queue (Kafka). It does not recompute ratings synchronously — that happens downstream.

Elasticsearch (geo_point index) holds one document per business: geo-point coordinates, category, current Bayesian rating, review count, and hours. The Search API queries it exclusively for proximity searches. It is not the source of truth — it's a read-optimised projection of the Business DB, kept in sync by the Business API on write.

Redis Cache serves two distinct caches: (1) search result pages keyed by hash(lat, lon, radius, category, min_rating) with a 30-second TTL, and (2) per-business rating aggregates with a 60-second TTL.

Kafka (review events) decouples the Review API's write latency from the aggregate update latency. The Agg Worker consumes review.created events, recomputes the Bayesian rating for the affected business, writes to the review_aggregates table, and invalidates the Redis rating cache.

Agg Worker is a stateless Kafka consumer. Events reach it reliably via the outbox pattern: the Review API writes the event to an outbox table in the same PostgreSQL transaction as the review record, and Debezium publishes to Kafka from the WAL. This guarantees the event is never lost even if Kafka is temporarily unavailable during the write.

Architectural rationale

Why Elasticsearch instead of PostGIS for geo-search? Core tradeoff

PostGIS geo-queries against 200M rows must traverse a large spatial (GiST) index, and adding category + rating filters means multi-column index intersections — difficult to optimise past ~50 ms. Elasticsearch's geo_distance filter uses a pre-built BKD tree (Block KD-tree) that narrows candidates in O(log n) before applying secondary filters, reliably returning results in <20 ms on a 200M-document index. It also scales horizontally via sharding with no schema migrations.

Tradeoff: Elasticsearch introduces eventual consistency between the Business DB and the search index. A business update takes up to a few seconds to appear in search results. This is an explicit product decision — users don't need sub-second freshness on business profile edits.
Alternatives: PostGIS (PostgreSQL) · Redis GEOSEARCH · custom quadtree
Why async rating updates via Kafka? Decoupling

Synchronous rating recomputation inside the review write transaction would require a SELECT AVG(rating) over all reviews for the business — O(n) per write. At 16 review writes/sec with businesses accumulating thousands of reviews, this becomes progressively more expensive. Kafka decouples the write from the aggregate update: the review commits in <10 ms, and the aggregate updates within ~60 s in steady state.

Tradeoff: If Kafka consumer lag spikes (e.g., after a restart), ratings may be stale for longer than 60 s. Consumer lag monitoring and alerting is required. An alternative is to store count and sum in review_aggregates and increment them atomically — this gives O(1) aggregate update without scanning all reviews, at the cost of losing the ability to recalculate from scratch without a full review scan.
Why the outbox pattern + CDC instead of direct Kafka producer? Reliability

The Review API could publish directly to Kafka after writing to the database. But if Kafka is temporarily unavailable, the review is committed to PostgreSQL but the event is lost: the Agg Worker never recomputes the rating. The outbox pattern solves this: the Review API writes both the review row and an outbox event row in the same database transaction. Debezium (a CDC connector) reads the PostgreSQL WAL and publishes the outbox event to Kafka. Since the WAL is durable, the event is guaranteed to be published eventually — even if Kafka was down during the original write.

Tradeoff: The outbox pattern adds operational complexity: Debezium must be deployed, monitored, and configured to read the WAL. It also adds ~1–5 seconds of latency between the review commit and the Kafka event. Both are acceptable: Debezium is battle-tested at scale (used by LinkedIn, Uber, Airbnb), and with a 60-second rating freshness SLA the extra latency is negligible.
Alternatives: direct Kafka producer (simpler, less reliable) · polling publisher (query the outbox table periodically) · dual-write with saga (complex, error-prone)
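A minimal sketch of the transactional outbox write described above, assuming psycopg2 and illustrative table names (reviews, outbox_events):

# Outbox pattern sketch: review row + outbox event in ONE transaction.
# Debezium tails the WAL and publishes outbox rows to Kafka; this code
# never talks to Kafka directly.
import json
import uuid
import psycopg2

def submit_review(conn, business_id: int, user_id: int,
                  rating: int, text: str, idempotency_key: str) -> int:
    with conn:                       # commits on success, rolls back on error
        with conn.cursor() as cur:
            cur.execute(
                """INSERT INTO reviews
                       (business_id, user_id, rating, review_text, idempotency_key)
                   VALUES (%s, %s, %s, %s, %s)
                   RETURNING review_id""",
                (business_id, user_id, rating, text, idempotency_key),
            )
            review_id = cur.fetchone()[0]
            cur.execute(
                """INSERT INTO outbox_events (event_id, topic, payload)
                   VALUES (%s, %s, %s)""",
                (str(uuid.uuid4()), "review.created",
                 json.dumps({"review_id": review_id,
                             "business_id": business_id,
                             "rating": rating})),
            )
    return review_id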

Real-world comparison

| Decision | This design | Yelp (reported) | Google Places |
| --- | --- | --- | --- |
| Geo index | Elasticsearch geo_point (BKD tree) | Elasticsearch (migrated from Solr) | Proprietary S2 geometry cells + Bigtable |
| Review storage | PostgreSQL, sharded by business_id | MySQL → PostgreSQL migration | Spanner for global consistency |
| Rating aggregation | Async via Kafka consumer | Periodic batch + real-time increment | Near-real-time with Bigtable increments |
| Search caching | Redis result cache (30 s TTL) | Memcached + Redis hybrid | CDN + aggressive edge caching |
05

Core algorithm — spatial indexing

Before designing search, you must answer one foundational question: how do you represent a location in a way that makes "find all businesses within X km of me" fast? The answer isn't obvious — latitude and longitude are continuous, not naturally indexed. Three approaches dominate in production systems.

[Figure 2 diagram — ① Geohash (base32 string prefix): dp3wjz → 1.2 km cell, dp3wj → 4.9 km cell, dp3w → 39 km cell; O(1) prefix lookup, native Redis/Elasticsearch support, edge cells need neighbours (our choice). ② Quadtree (adaptive tree subdivision): leaf splits at a threshold so dense areas get smaller cells; adaptive density handling, but a custom data structure that is harder to shard in a DB. ③ S2 geometry cells (sphere-projected hierarchy): 64-bit cell IDs, 30 levels (0.3 cm → planet), uniform area with no polar distortion; library dependency (s2geometry), less tooling support, used by Google Maps internally.]
Figure 2 — Three spatial indexing approaches. ① Geohash is our choice: native Redis/Elasticsearch support, simple prefix lookups, manageable edge-cell overhead. ② Quadtree preferred for extreme density variation. ③ S2 cells used by Google internally for global accuracy.
[Figure 2b diagram — 9-cell neighbour query: the user sits near the right edge of target cell dp3wjz, with a 500 m radius crossing into the 8 neighbouring cells (dp3wjw, dp3wjx, dp3wjy, dp3wjr, dp3wjq, dp3wjp, dp3wjn, dp3wjm); query all 9 cells, then filter by exact Haversine distance.]
Figure 2b — Geohash 9-cell neighbour query. The user is near the right edge of cell dp3wjz. A 500 m search radius crosses into adjacent cells. Querying only the target cell would miss nearby businesses in dp3wjq. The 9-cell approach queries the target plus all 8 neighbours, then filters by exact Haversine distance.

Our choice: geohash at precision 6, where each cell covers approximately 1.2 km × 0.6 km. For a 500 m search radius, we query the target cell plus its 8 neighbours (9 cells total) to handle edge-of-cell cases, then filter by exact Haversine distance in the application layer. This gives O(1) cell lookup via an Elasticsearch term query, with the distance filter applied to at most a few thousand candidates.

Implementation: geohash radius query
// Elasticsearch geo_distance query (used by Search API)
{
  "query": {
    "bool": {
      "filter": [
        {
          "geo_distance": {
            "distance": "500m",
            "location": { "lat": 37.7749, "lon": -122.4194 }
          }
        },
        { "term": { "category": "restaurant" } },
        { "range": { "bayesian_rating": { "gte": 4.0 } } }
      ]
    }
  },
  "sort": [
    { "_score": { "order": "desc" } },
    { "bayesian_rating": { "order": "desc" } }
  ],
  "size": 20
}

// Bayesian average formula (computed by Agg Worker, stored in ES)
// bayesian_rating = (C × global_mean + sum_ratings) / (C + review_count)
// C = 25 (prior count), global_mean = 3.8 (platform average)
// A new business with 0 reviews starts at 3.8 (the global mean)

The Elasticsearch geo_distance filter uses an internal BKD tree, not a geohash prefix scan. We pass the coordinates directly — Elasticsearch handles the spatial math. The geohash discussion above applies to Redis-based caching and sharding strategies, not to Elasticsearch's internal index structure.
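For the Redis side, a sketch of the 9-cell lookup plus Haversine post-filter, assuming the python-geohash package (geohash.encode / geohash.neighbors) and a hypothetical businesses_by_cell mapping, e.g. Redis sets keyed by precision-6 geohash:

# 9-cell neighbour query + exact Haversine filter (sketch)
import math
import geohash   # assumed: python-geohash package

EARTH_RADIUS_M = 6_371_000

def haversine_m(lat1, lon1, lat2, lon2):
    """Exact great-circle distance in metres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def nearby(lat, lon, radius_m, businesses_by_cell, precision=6):
    """businesses_by_cell: hypothetical {geohash: [(biz_id, lat, lon), ...]}."""
    target = geohash.encode(lat, lon, precision=precision)
    cells = [target] + geohash.neighbors(target)       # 9 cells total
    hits = []
    for cell in cells:
        for biz_id, blat, blon in businesses_by_cell.get(cell, []):
            d = haversine_m(lat, lon, blat, blon)
            if d <= radius_m:                          # exact post-filter
                hits.append((biz_id, d))
    return sorted(hits, key=lambda h: h[1])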

💬

One open question: What happens when a user on a cell boundary gets results from only one cell? The 9-cell neighbour search handles this in most cases. For very small radii (<100 m) in dense areas, precision 7 cells (152 m × 152 m) may be needed to keep the candidate set from being too large. The precision tradeoff is: larger cells = more false candidates to filter; smaller cells = more neighbour lookups.

05b

API design

Two endpoints dominate the hot path: the proximity search and the review submission. Both have non-obvious design choices.

GET /search/businesses

Use GET for search (not POST) because search results are cacheable by URL — CDN and intermediary proxies can serve cached results for identical queries. The query parameters form the cache key.

// Request
GET /search/businesses?lat=37.7749&lon=-122.4194&radius_m=500&category=restaurant&min_rating=4.0&page_token=eyJ...

// Response 200 OK
{
  "businesses": [
    {
      "business_id": "biz_8821",
      "name": "Mission Pie",
      "category": "restaurant",
      "distance_m": 312,
      "bayesian_rating": 4.6,
      "review_count": 843,
      "address": "2901 Mission St, San Francisco",
      "is_open_now": true,
      "thumbnail_url": "https://cdn.example.com/..."
    }
  ],
  "next_page_token": "eyJ...",
  "result_count": 47
}

POST /reviews

Use POST (not PUT) — review creation is not idempotent. Include an idempotency key header to prevent duplicate reviews on mobile retry.

// Request — Idempotency-Key header required
POST /reviews
Idempotency-Key: 550e8400-e29b-41d4-a716-446655440000
{
  "business_id": "biz_8821",
  "rating": 5,
  "text": "Excellent pie. Highly recommended."
}

// Response 202 Accepted (async — rating will update within 60s)
{
  "review_id": "rev_449201",
  "status": "published",
  "rating_update_eta_s": 60
}

Optional endpoints by level

| Endpoint | Purpose | Level |
| --- | --- | --- |
| GET /businesses/:id | Full business detail (hours, photos, menu link) | L3/L4 |
| GET /businesses/:id/reviews | Paginated review list, sorted by recency or helpfulness | L3/L4 |
| POST /businesses | Create new business listing (host-side) | L3/L4 |
| GET /search/businesses/count | Map view: cluster count of businesses per geo-cell | L5 |
| GET /businesses/trending | Rising businesses in a region (review velocity spike) | L5 |
| POST /search/businesses/bulk-nearby | Multi-origin batch search (logistics, delivery zone coverage) | L7/L8 |
06

Core flow — search path

The search path is the hot path. It must complete in <200 ms p99, which means every component adds latency budget. The design enforces a strict cache-first strategy: Redis result cache → Elasticsearch → response. The database is never touched on the search hot path.

[Figure 3 diagram — Search path: user query (lat, lon, radius, category) → Redis cache, key = hash(lat, lon, radius, category, rating), 30 s TTL; HIT returns in ≈5 ms; MISS → Elasticsearch geo_distance query (geo + category + min_rating filters) → composite ranking score (0.4×distance + 0.4×bayesian_rating + 0.2×recency) → write result to Redis (30 s TTL) → return ranked results. Review write path: user submits review → INSERT + outbox event in the same transaction → Kafka review.created → Agg Worker recomputes Bayesian average ((C × global_mean + sum) / (C + count)) → invalidates Redis cache + updates the ES doc → rating visible in search within ~60 s.]
Figure 3 — Left: search path. Redis HIT returns in ~5 ms. MISS hits Elasticsearch (~20 ms), applies ranking, writes cache, returns. Right: review write path — synchronous DB write, async rating propagation via Kafka. Dashed arrows = async.
⚠️

Latency budget: Redis HIT: ~2 ms cache read + ~3 ms network = 5 ms. Redis MISS: ~5 ms Redis check + ~20 ms Elasticsearch + ~10 ms ranking + ~3 ms cache write = ~38 ms. With ~150 ms of network/CDN overhead, total p99 stays under the 200 ms target. If Elasticsearch degrades to >100 ms, the SLA is at risk — circuit-break to a degraded mode that returns stale cache results with a freshness timestamp.

07

Data model

Data model decisions follow from access patterns — not from what's convenient to normalise. Here the dominant patterns are: proximity search (never touches relational DB), business detail fetch (single-row read, heavily cached), and review listing (paginated scan by business, sorted by time).

Access patterns

| Operation | Frequency | Query shape |
| --- | --- | --- |
| Proximity search | Very high (11,600 QPS avg) | Elasticsearch geo_distance + filters → never touches PostgreSQL |
| Business detail page | High (every search result tap) | Single-row read by business_id — served from Redis or a read replica |
| Review list for business | Moderate | Range scan by business_id, ordered by created_at DESC — cursor-paginated |
| Submit review | Low (~16 writes/sec) | INSERT into reviews + outbox — single relational shard |
| Rating aggregate read | Very high (embedded in search results) | Single-key Redis lookup (bayesian_rating, review_count) per business |

Two things stand out: the search path never touches the relational database, so the PostgreSQL schema only needs to serve detail pages and write operations. And the rating aggregate must be precomputed — it's read far too frequently to compute on demand.

[Figure 4 diagram — schema:]
businesses — business_id BIGINT (PK), name VARCHAR(255), address TEXT, lat/lon DOUBLE, category VARCHAR(64), hours JSONB, owner_id BIGINT (FK), created_at TSTZ, …
reviews — review_id BIGINT (PK), business_id (FK → businesses), user_id (FK → users), rating SMALLINT (1–5), review_text TEXT, created_at TSTZ, idempotency_key UUID with UNIQUE(idempotency_key)
review_aggregates — business_id BIGINT (PK), review_count INT, rating_sum BIGINT, bayesian_avg NUMERIC(3,2); updated by the Agg Worker (async)
Key indexes — reviews (business_id, created_at); reviews (user_id, created_at); reviews UNIQUE (idempotency_key); businesses (category, owner_id); Elasticsearch geo_point on (lat, lon)
Figure 4 — Data model. The review_aggregates table is updated asynchronously by the Agg Worker; it is the authoritative source for bayesian_avg, which is also synced to the Elasticsearch document for search-time ranking.
💡

Why store rating_sum and review_count separately? Storing the sum and count allows O(1) incremental updates: when a new review arrives, rating_sum += new_rating and review_count += 1 — no scan of all reviews needed. The Bayesian average is then computed from these two fields. If you only stored the average, every update would require knowing the previous count to reverse-engineer the sum.
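A worked sketch of the incremental update; C and the global mean follow §5's formula:

# O(1) aggregate update: bump sum and count, recompute Bayesian average.
C = 25             # prior count (pseudo-reviews)
GLOBAL_MEAN = 3.8  # platform-wide average rating

def apply_review(agg: dict, new_rating: int) -> dict:
    """agg mirrors a review_aggregates row: rating_sum, review_count."""
    agg["rating_sum"] += new_rating
    agg["review_count"] += 1
    agg["bayesian_avg"] = ((C * GLOBAL_MEAN + agg["rating_sum"])
                           / (C + agg["review_count"]))
    return agg

# A new business with no reviews sits at the global mean (3.8); one
# 5-star review barely moves it: (25 * 3.8 + 5) / 26 ≈ 3.85.
fresh = {"rating_sum": 0, "review_count": 0}
apply_review(fresh, 5)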

08

Caching strategy

The 1000:1 read:write ratio means caching is not optional — it's the primary scaling mechanism. Three distinct cache layers handle different parts of the read path, each with different TTL and invalidation strategies.

Client browser cache CDN business pages Redis Layer 1 results (30s TTL) Redis Layer 2 ratings (60s TTL) Elasticsearch / PostgreSQL source of truth business detail pages Cache-Control: max-age=300 key: hash(params) invalidate on biz change key: rating:{biz_id} invalidated by Agg Worker
Figure 5 — Three-layer cache hierarchy anchored to the §4 architecture. CDN serves business detail pages. Redis Layer 1 caches search result pages. Redis Layer 2 caches per-business rating aggregates.
| Layer | What it caches | TTL | Invalidation | Hit rate target |
| --- | --- | --- | --- | --- |
| CDN | Business detail HTML pages, photo assets | 5 min (Cache-Control) | Purge on business profile update | >80% |
| Redis (search results) | Search result pages, keyed by full param hash | 30 s | TTL expiry only (eventual consistency acceptable) | 40–60% (diverse queries) |
| Redis (ratings) | Per-business Bayesian rating + review_count | 60 s | Explicit invalidation by Agg Worker on update | >95% (high reuse) |
💡

Hot-cell warming: Popular geo-cells (downtown Manhattan, Shibuya, central London) get query volumes 10–50× higher than the global average. At startup or after cache eviction, these cells cold-start simultaneously — a thundering-herd problem. Mitigate with probabilistic early expiration (cache stampede protection): when a key has <5 s left on its TTL, 5% of requests will fetch fresh data and extend the TTL, while the other 95% still serve the cached value.
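A sketch of probabilistic early expiration with redis-py; recompute is a hypothetical callable (e.g., the Elasticsearch query for that cell):

# Cache stampede protection: near expiry, a 5% minority refreshes early.
import json
import random
import redis

r = redis.Redis()
EARLY_WINDOW_S = 5     # refresh window before TTL expiry
REFRESH_PROB = 0.05    # 5% of requests refresh early

def get_with_stampede_protection(key: str, ttl_s: int, recompute):
    val = r.get(key)
    remaining = r.ttl(key)   # -2 if key missing, -1 if no TTL set
    if val is not None and not (
        0 <= remaining < EARLY_WINDOW_S and random.random() < REFRESH_PROB
    ):
        return json.loads(val)            # 95%+ serve the cached value
    fresh = recompute()                   # only the unlucky few pay
    r.set(key, json.dumps(fresh), ex=ttl_s)   # extend TTL for everyone else
    return fresh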

Cache invalidation

Each cache layer uses a different invalidation strategy, chosen based on the consistency requirement and the write frequency of the underlying data:

Search result cache: TTL-only (no active invalidation) Simplicity over freshness

Search results are keyed by hash(lat, lon, radius, category, min_rating). The parameter space is enormous — millions of unique keys. Active invalidation would require tracking which business_ids appear in which cached result pages, creating an inverted index of cache entries per business. This is expensive and fragile. Instead, the 30-second TTL ensures that any stale result is replaced within 30 seconds. For a 1000:1 read:write ratio, a 30-second staleness window is imperceptible to users.

Tradeoff: A newly opened business won't appear in cached search results for up to 30 seconds. A business that closes permanently will continue appearing for up to 30 seconds. Both are acceptable product tradeoffs — users don't expect sub-second freshness from map search results.
Rating cache: event-driven invalidation + TTL fallback Hybrid approach

The Agg Worker explicitly invalidates the rating cache after recomputing the Bayesian average: DEL rating:{business_id} followed by SET rating:{business_id} {json} EX 60. This is write-through: the worker writes the new value to both PostgreSQL and Redis in the same operation. The 60-second TTL acts as a safety net — if the Agg Worker crashes between the DB write and the Redis write, the stale cached value expires within 60 seconds and the next read triggers a cache miss that fetches the correct value from PostgreSQL.

Race condition: If two Agg Worker instances process reviews for the same business concurrently (e.g., during a write burst), the Redis value may be overwritten by the slower worker with a stale aggregate. Prevent this with a Redis SET ... NX + version check: include a monotonically increasing version (review_count) in the cache value and only overwrite if the new version is higher.
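One way to make the version check atomic is a small Lua script executed via redis-py (a sketch; the cache value carries review_count as its version):

# Versioned cache write: only overwrite if the incoming aggregate is newer.
import json
import redis

r = redis.Redis()

# KEYS[1] = rating:{business_id}; ARGV[1] = new JSON; ARGV[2] = new version.
VERSIONED_SET = r.register_script("""
local cur = redis.call('GET', KEYS[1])
if cur then
  local cur_ver = tonumber(cjson.decode(cur)['review_count'])
  if cur_ver and cur_ver >= tonumber(ARGV[2]) then
    return 0  -- cached value is newer; our write is stale
  end
end
redis.call('SET', KEYS[1], ARGV[1], 'EX', 60)
return 1
""")

def publish_aggregate(business_id: int, agg: dict) -> bool:
    """Agg Worker cache write; agg includes review_count as the version."""
    key = f"rating:{business_id}"
    return bool(VERSIONED_SET(keys=[key],
                              args=[json.dumps(agg), agg["review_count"]]))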
CDN cache: purge-on-write for business profile changes Correctness-critical

When a business owner updates their profile (hours, address, phone), the Business API triggers an explicit CDN purge for the business detail page URL. The CDN returns stale content for 5–15 seconds during purge propagation across PoPs. For address or coordinate changes, the Business API also triggers a synchronous Elasticsearch document upsert: the search index must reflect the new location immediately, not after a 5-minute CDN TTL expiry.

Consistency window: During the 5-minute CDN TTL, a user may see the new address in search results (updated via Elasticsearch) but the old address on the business detail page (still cached by CDN). This is a brief inconsistency window: the CDN purge resolves it within seconds. Acceptable because users rarely compare search result snippets against detail pages side-by-side.
💡

Cache sizing: Rating cache: 200M businesses × ~200 bytes per entry (business_id, bayesian_avg, review_count, rating_sum) = ~40 GB if fully populated. In practice, only businesses with recent activity are cached because the 60-second TTL evicts inactive entries, so the working set is ~20M active businesses ≈ 4 GB. Search result cache: at 35K peak QPS with a 30-second TTL and ~40% cache hit rate, the working set is ~700K unique cached pages × ~2 KB per page ≈ 1.4 GB. Total Redis memory is ~6 GB, which fits easily in a single Redis node with room for growth. Deploy a 3-node Redis Cluster for availability, not for capacity.

09

Deep-dive scalability

Three scalability problems surface at Yelp/Google Places scale that don't appear at smaller deployments: Elasticsearch shard hot spots, rating consistency under write bursts, and geographic read skew.

Elasticsearch sharding strategy: avoiding geo hot spots L5 / L6

Naive sharding by document ID distributes businesses uniformly but loses data locality — a geo-distance query must fan out to all shards. Sharding by geohash prefix at precision 2 (cells of roughly 1,250 km × 625 km) concentrates geographically nearby businesses on the same shard, so most radius searches hit one or two shards. At 200M businesses with 6 shards, each shard holds ~33M documents in roughly 50 GB — manageable and under Elasticsearch's recommended shard ceiling.

Hot spot risk: a single heavily queried city (New York, Tokyo) lands on one shard. Mitigate with a replica set per shard (2–3 replicas) so search traffic is distributed across replicas. The primary shard still handles all writes, but reads are load-balanced.

Alternative: Let Elasticsearch handle distribution via its own round-robin allocation. Simpler operationally but requires cross-shard fan-out on every query. Works fine at <50M documents; degrades at 200M.
Rating write bursts: viral business problem L6

A restaurant featured on a popular food blog receives 10,000 reviews in an hour. The Kafka consumer processes them sequentially: the Agg Worker recomputes the Bayesian average for the same business 10,000 times. Each compute is cheap (O(1) with sum/count), but the Elasticsearch document update on every recompute causes index write amplification. Solution: coalesce updates in the Agg Worker with a sliding window of 5 seconds — batch all review events for the same business_id arriving within the window into a single Elasticsearch upsert. This reduces index writes from 10,000 to ~720 over an hour.

Tradeoff: Coalescing introduces up to 5 seconds of additional lag on top of Kafka consumer lag, so rating freshness is 60 s in steady state and up to ~65 s during write bursts. The NFR should therefore be stated as "p99 within 60 s in steady state, not guaranteed during viral events."
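A sketch of the coalescing loop described above; consumer.poll and recompute_and_upsert are hypothetical stand-ins for the Kafka consumer and the recompute-plus-Elasticsearch-upsert step:

# Coalesce review events per business_id within a sliding window.
import time

def run_coalescing_loop(consumer, recompute_and_upsert, window_s=5.0):
    pending = set()                    # business_ids with buffered events
    window_start = time.monotonic()    # window measured from the last flush
    while True:
        event = consumer.poll(timeout_s=1.0)   # review.created dict or None
        if event is not None:
            pending.add(event["business_id"])
        if pending and time.monotonic() - window_start >= window_s:
            for business_id in pending:
                # One upsert per business no matter how many events arrived:
                # 10,000 events/hour for one business -> at most 720 upserts.
                recompute_and_upsert(business_id)
            pending.clear()
            window_start = time.monotonic()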
Geographic read skew: multi-region deployment L7 / L8

A single-region deployment from US-East adds 120–200 ms of round-trip latency for users in Tokyo or London — blowing the 200 ms p99 budget before the application has done any work. Multi-region deployment routes users to the nearest region via latency-based DNS (Route 53 or equivalent). Each region hosts its own Elasticsearch cluster seeded with the businesses in that geographic area. Business and review writes go to the closest region's primary DB and are replicated asynchronously to other regions within ~5 seconds.

The consistency implication: a review written in Asia may not appear in the EU region for up to 5 seconds. This is acceptable — cross-region review consistency is not a user-visible correctness requirement.

What to say at L7/L8: Frame the multi-region topology as a deliberate consistency model choice: AP (Available + Partition-tolerant) for reads, with best-effort cross-region replication for writes. The alternative — synchronous global replication — would make write latency unacceptably high for mobile users in latency-sensitive regions.
Distributed ID generation for business_id and review_id L5 / L6

At 200M businesses and ~16 review writes/sec, auto-increment IDs from a single PostgreSQL primary become a bottleneck when the system scales to multiple write primaries (multi-region) or when sharding the Review DB. Three options:

  1. UUID v4: Globally unique, no coordination. But 128 bits wastes index space, and random UUIDs destroy B-tree locality — INSERT performance degrades as the index fragments. Not recommended for primary keys at this scale.
  2. UUID v7 (time-ordered): 128-bit UUID with a time-based prefix. Preserves B-tree insertion order. Good default for most systems. Slight overhead from 128-bit key size.
  3. Snowflake-style IDs: 64-bit ID = 1 reserved sign bit + 41 bits timestamp + 10 bits machine_id + 12 bits sequence. Compact, time-ordered, globally unique. Requires a machine_id registry. This is our choice — compact keys keep index sizes manageable at 200M+ rows.
Implementation: Deploy a lightweight Snowflake ID service (or use PostgreSQL sequences with a machine prefix). Each API server instance gets a unique machine_id at startup from a ZooKeeper or etcd lease. At 16 writes/sec, the 12-bit sequence (4,096 IDs per millisecond) never overflows. For multi-region: each region gets a distinct machine_id range, ensuring IDs never collide across regions without cross-region coordination.
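A minimal single-process sketch of the generator; the machine_id would come from the ZooKeeper/etcd lease described above, and the custom epoch is illustrative:

# Snowflake-style 64-bit IDs: 1 sign bit | 41-bit ms timestamp | 10-bit machine | 12-bit sequence
import threading
import time

EPOCH_MS = 1_700_000_000_000   # custom epoch (illustrative)

class SnowflakeGenerator:
    def __init__(self, machine_id: int):
        assert 0 <= machine_id < 1024       # 10 bits
        self.machine_id = machine_id
        self.last_ms = -1
        self.seq = 0
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000) - EPOCH_MS
            if now == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF   # 12-bit sequence
                if self.seq == 0:                   # 4096 IDs this ms: spin
                    while now <= self.last_ms:
                        now = int(time.time() * 1000) - EPOCH_MS
            else:
                self.seq = 0
            self.last_ms = now
            return (now << 22) | (self.machine_id << 12) | self.seq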
PostgreSQL sharding strategy for the Review DB L6

At ~5 TB of review data (5-year retention), a single PostgreSQL primary is already uncomfortable on storage alone. More importantly, the access pattern — paginated reviews sorted by created_at for a given business_id — benefits from sharding by business_id to keep all reviews for a business on the same shard. This ensures the pagination query (WHERE business_id = X AND created_at < :cursor ORDER BY created_at DESC LIMIT 20) hits a single shard with no cross-shard scatter-gather.

Shard count: start with 8 shards using hash-based partitioning (business_id % 8). Each shard holds ~650 GB — large but manageable for PostgreSQL. The shard key choice means that "list all reviews by user X" requires a fan-out query across all 8 shards. This is acceptable: the user-reviews page is low-frequency (users rarely browse their own review history), and the fan-out returns quickly because each shard has an index on (user_id, created_at). A sketch of this routing follows below.

Alternative: Shard by user_id instead. This makes user-reviews pages single-shard, but makes business-reviews pages (the high-frequency query) require fan-out. Wrong tradeoff — always shard by the dominant access pattern.
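A sketch of the routing logic under the chosen shard key; shards is a hypothetical list of per-shard query handles:

# business_id routing (single shard) vs user_id fan-out (all shards)
NUM_SHARDS = 8

def shard_for_business(business_id: int) -> int:
    """Reviews for one business always live on one shard."""
    return business_id % NUM_SHARDS

def reviews_by_user(user_id: int, shards) -> list:
    """Low-frequency fan-out, served by each shard's (user_id, created_at) index."""
    results = []
    for shard in shards:   # hypothetical handles exposing .query(sql, params)
        results.extend(shard.query(
            "SELECT * FROM reviews WHERE user_id = %s "
            "ORDER BY created_at DESC LIMIT 20",
            (user_id,)))
    return sorted(results, key=lambda row: row["created_at"], reverse=True)[:20]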
💡

Operational monitoring: Five metrics to alert on: (1) Elasticsearch p99 query latency >100 ms, (2) Kafka consumer group lag >30 s, (3) Redis cache hit rate dropping below 30% (search results) or 80% (ratings), (4) PostgreSQL replication lag >10 s between primary and read replicas, (5) Review write rate for any single business exceeding 50/hour (fraud signal). Dashboard these in Grafana with a 1-minute scrape interval. The Elasticsearch latency metric is the single best predictor of user-facing search SLA degradation.

10

Failure modes & edge cases

| Scenario | Problem | Solution | Level |
| --- | --- | --- | --- |
| Elasticsearch shard unavailable | Search returns partial results or times out; users see empty results | Circuit-break to degraded mode: serve stale Redis cache with a "results may be outdated" label; Elasticsearch replicas absorb traffic during shard recovery | L3/L4 |
| Redis cache stampede on startup | Cold cache causes all search QPS to hit Elasticsearch simultaneously, overwhelming it | Probabilistic early expiration (5% of requests refresh before TTL expires); pre-warm popular geo-cells with a background job at startup | L5 |
| Review bombing (fake negative reviews) | A competitor submits 1,000 1-star reviews in 10 minutes, tanking a business's rating | Rate-limit reviews per user per business (max 1 review / business / lifetime); velocity check in Review API — flag burst submissions for async spam review before publishing | L5 |
| Kafka consumer lag spike | Agg Worker falls behind; ratings stale beyond 60 s SLA during viral events | Scale consumer group horizontally (one partition per Agg Worker instance); alert on consumer lag >30 s; coalesce updates per business_id to reduce write amplification | L5 |
| Business geo-coordinates updated | Elasticsearch document still points to old location; search returns wrong results for moved businesses | Business API triggers synchronous Elasticsearch upsert on address/coordinate change (not async); accept a brief inconsistency window (<2 s) as acceptable for business moves | L5 |
| Geo-cell boundary edge case | Business 100 m from cell edge not returned in radius search because neighbour cell not queried | Always query target cell + 8 neighbours, then filter by exact Haversine distance in the Search API application layer; accept ±20% radius boundary tolerance per NFR | L5 |
| Duplicate review on mobile retry | Network timeout causes client to retry; user ends up with two reviews for the same business | Require idempotency key (UUID) in POST /reviews header; store key in reviews.idempotency_key with UNIQUE constraint; return original response on duplicate key conflict | L3/L4 |
| Elasticsearch index corruption / rebuild | Index must be rebuilt from PostgreSQL source of truth; 200M-document re-index takes hours | Use index aliasing: build the new index in the background, then swap the alias atomically; keep the old index live during the rebuild for zero-downtime re-indexing | L7/L8 |
10b

Security, rate limiting & compliance

Three security concerns map directly to components already in the architecture: abuse prevention on the review write path, fraudulent business listing manipulation, and regulatory erasure requirements that span every storage tier. L6+ interviewers probe all three explicitly.

Review API rate limiting

Without rate limiting, a single malicious actor can submit thousands of fake reviews to manipulate ratings. The API Gateway enforces layered per-user limits on POST /reviews: maximum 1 review per business per user lifetime (enforced via the UNIQUE(user_id, business_id) constraint in the Review DB), plus a global velocity limit of 10 reviews per hour per user, enforced via a fixed-window Redis counter (INCR ratelimit:review:{user_id}:{hour_bucket} with a 3600-second TTL). Requests exceeding the limit receive a 429 Too Many Requests response with a Retry-After header.

A separate, higher limit (100 requests/min) applies at the IP level to catch unauthenticated probing before authentication succeeds. This layered approach — IP-level → user-level → business-level — prevents both automated bots and coordinated human review campaigns.
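A sketch of the fixed-window counters with redis-py; the review key scheme follows the text, while the IP key scheme is an assumption:

# Layered fixed-window rate limiting: INCR, set TTL on first hit, compare.
import time
import redis

r = redis.Redis()

def allow(key: str, window_s: int, limit: int) -> bool:
    count = r.incr(key)
    if count == 1:
        r.expire(key, window_s)   # first hit in this window sets the TTL
    return count <= limit

def allow_review(user_id: int) -> bool:
    """10 reviews per hour per authenticated user."""
    hour_bucket = int(time.time() // 3600)
    return allow(f"ratelimit:review:{user_id}:{hour_bucket}", 3600, 10)

def allow_ip(ip: str) -> bool:
    """100 requests per minute per IP (assumed key scheme)."""
    minute_bucket = int(time.time() // 60)
    return allow(f"ratelimit:ip:{ip}:{minute_bucket}", 60, 100)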

💬

L5 probe: "How would you prevent a competitor from paying 500 people to leave 1-star reviews on a rival business?" — Per-user-per-business uniqueness constraint prevents multiple reviews. Velocity detection in the Agg Worker flags businesses receiving >50 reviews/hour (10× the 99th percentile) and holds the aggregate update pending async fraud review. The rating cache continues serving the pre-spike value until fraud review completes — preventing the attack from being user-visible during investigation.

Business listing fraud prevention

Fraudulent business listings — fake restaurants created to capture search traffic, or duplicate listings for SEO manipulation — degrade search quality. The Business API enforces three controls:

  1. Geo-coordinate validation: Reject listings with coordinates in oceans, uninhabited areas, or outside the coverage zone. A simple land-boundary polygon check (pre-computed, ~50 KB dataset) catches the most obvious fakes.
  2. Duplicate detection: Before creating a business, check for existing listings within 50 m with similar names (Levenshtein distance ≤ 3). Flag duplicates for manual review rather than auto-rejecting — legitimate businesses can share addresses (e.g., food courts). A sketch of this check follows the list.
  3. Owner verification throttle: Limit unverified accounts to 3 business listings. Verified accounts (via phone, postal mail, or government ID) can create more. This prevents bulk listing spam without blocking legitimate multi-location businesses.
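The duplicate check from step 2, as a sketch with a self-contained edit-distance helper; nearby_listings (candidates from a geo lookup) and the haversine_m helper from the §5 sketch are assumptions:

# Duplicate-listing heuristic: within 50 m AND name edit distance <= 3.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def is_probable_duplicate(name, lat, lon, nearby_listings, haversine_m) -> bool:
    """Flag for manual review, never auto-reject (food courts share addresses)."""
    for other in nearby_listings:
        close = haversine_m(lat, lon, other["lat"], other["lon"]) <= 50
        similar = levenshtein(name.lower(), other["name"].lower()) <= 3
        if close and similar:
            return True
    return False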

GDPR right-to-erasure pipeline

A user deletion request triggers a multi-stage erasure job implemented as a durable workflow (e.g., AWS Step Functions) to survive partial failures. The pipeline must reach every storage tier:

| Tier | Erasure action | SLA |
| --- | --- | --- |
| PostgreSQL (reviews, users) | Hard-delete the user row; cascade to the reviews table; update review_aggregates for affected businesses (decrement count, subtract ratings) | ≤24 hours |
| Elasticsearch | Reindex affected business documents with the updated bayesian_avg after review removal | ≤24 hours |
| Redis (caches) | DEL rating:{biz_id} for each business the user reviewed; search result cache entries age out via their 30 s TTL | ≤1 hour |
| Kafka (event log) | Retention is time-windowed (7 days default); review.created events age out automatically | ≤7 days |
| CDN (cached pages) | Purge cached business detail pages that displayed the user's review | ≤30 days (propagation ~15 min after purge) |
⚠️

Rating recalculation on erasure: Deleting a user's reviews changes the Bayesian average for every business they reviewed. The erasure workflow must trigger an Agg Worker recomputation for each affected business — not just delete the review row. Without this step, the cached rating_sum and review_count in review_aggregates become inconsistent with the actual reviews in the database. For users who reviewed hundreds of businesses, this creates a burst of aggregate updates; batch them with a 10-second coalescing window (same pattern as §9's viral business solution).

💬

L7/L8 probe: "A user deletes their account. They had reviewed 200 businesses. How do you guarantee ratings are correct within 24 hours across all regions?" — The erasure workflow enqueues 200 review.deleted events to Kafka, one per business. The Agg Worker consumes them with coalescing, recomputes aggregates, and updates both PostgreSQL and Elasticsearch. Cross-region replication propagates within ~5 s per region. The 24-hour SLA is easily met. Edge case: if the user's review was the only review for a business, the Bayesian average reverts to the global mean (3.8) — not zero — because of the prior count C=25 in the formula.

11

How to answer by level

The interviewer calibrates depth to your level. Here's what good looks like at each, and what differentiates it from the level below.

L3 / L4: Can you build a working system? SDE I / SDE II
What good looks like
  • Identifies that SQL distance queries don't scale; proposes geohash or Elasticsearch
  • Draws the basic read/write split: Search API → Elasticsearch; Review API → DB
  • Handles the idempotency key for review submission
  • Mentions caching search results in Redis with a TTL
What separates L5 from L4
  • L4 says "use geohash" without explaining precision tradeoff or neighbour cell problem
  • L4 proposes synchronous rating recomputation in the write path
  • L4 doesn't reason about read:write ratio or its architectural implication
L5: Do you understand the tradeoffs? Senior SDE
What good looks like
  • Explains geohash precision tradeoff and 9-cell neighbour query
  • Designs async rating pipeline via Kafka; uses sum+count not average storage
  • Reasons about cache invalidation: TTL-based vs event-driven per cache layer
  • Handles review bombing with rate limiting and idempotency
What separates L6 from L5
  • L5 doesn't address Kafka consumer lag spikes or coalescing updates
  • L5 doesn't address hot geo-cell cache stampede on startup
  • L5 doesn't discuss Elasticsearch shard strategy by geo prefix
L6: Can you own this end-to-end? Staff SDE
What good looks like
  • Designs Elasticsearch sharding by geohash prefix to reduce fan-out
  • Designs probabilistic cache expiration to prevent stampedes
  • Proposes review event coalescing window to handle viral businesses
  • Addresses Elasticsearch zero-downtime re-index via alias swapping
What separates L7 from L6
  • L6 doesn't address multi-region topology and cross-region consistency model
  • L6 doesn't frame consistency choices as explicit product decisions
  • L6 stays in one region and doesn't consider global latency budget
L7 / L8: Should we build this, and how? Principal / Distinguished
What good looks like
  • Frames multi-region as AP consistency model with explicit product tradeoff
  • Discusses geo-sharding across regions, not just within one Elasticsearch cluster
  • Addresses S2 cells vs geohash for global accuracy (no polar distortion)
  • Proposes ranking pipeline as a separate subsystem with its own SLA
What distinguishes L8
  • L7 proposes multi-region but doesn't frame the consistency model explicitly
  • L7 doesn't question whether the Elasticsearch index rebuild SLA is acceptable to the business
  • L8 drives the conversation to org-level questions: build vs buy for geo-search infra

Classic probes

| Question | L3/L4 answer | L5/L6 answer | L7/L8 answer |
| --- | --- | --- | --- |
| "How would you handle a very popular business with 10k reviews/hour?" | Scale the Review API horizontally | Coalesce Agg Worker updates per business_id in a 5 s sliding window to reduce ES write amplification | Separate the viral business path from the normal path; dedicated consumer group + write-coalescing pipeline; explicit SLA relaxation during viral events |
| "A user is right on the boundary of two geohash cells. What happens?" | Use a smaller geohash cell size | Always query target cell + 8 neighbours, then Haversine filter in the app layer; precision 6 gives ~1.2 km cells, so the 9-cell cover handles any 500 m radius | For very small radii (<100 m) switch to precision 7 dynamically; discuss the precision vs candidate set size tradeoff as a configurable parameter tuned to query patterns |
| "How fresh are search results after a business updates its hours?" | Near-real-time — we update the DB and cache immediately | Business API triggers synchronous ES upsert on profile change; CDN page cache purged; search results reflect the change within ~2 s | Distinguish between correctness-critical updates (coordinate change, permanent closure) that require synchronous ES update vs non-critical updates (hours, photos) that can propagate via CDC within 60 s — different consistency SLAs for different field types |
| "How would you extend this to support real-time business availability (e.g., restaurant current wait time)?" | Add a wait_time field to the business record | Treat wait time as a high-velocity ephemeral signal — store in Redis with a 60 s TTL, not in PostgreSQL; read-through from the business's POS system or crowd-sourced check-ins via a separate write path | Model it as a separate "live signals" service with sub-minute write frequency and TTL-based expiry; never persist it to the main data store; discuss the accuracy vs freshness tradeoff for crowd-sourced signals |

How the pieces connect

Every architectural decision traces back to a requirement. Here are the most important chains.

  1. Search latency < 200 ms (§2) → SQL geo-queries scan 200M rows → Elasticsearch geo_distance BKD tree (§4, §5) returns candidates in ~20 ms → Redis search result cache (§8) serves repeat popular queries in ~5 ms
  2. Rating freshness within 60 s (§2) → synchronous recomputation is O(n) per write → async Kafka consumer (§4, §6) decouples write latency from aggregate update → sum + count storage in review_aggregates (§7) enables O(1) incremental update
  3. 1000:1 read:write ratio (§2, §3) → database cannot serve 11,600 QPS → Elasticsearch as primary query engine (§4) — not a relational DB read replica — with three-layer caching (§8) absorbing the majority of traffic
  4. Geohash edge-cell problem (§5) → single-cell query misses nearby businesses → 9-cell neighbour query + Haversine post-filter (§5, §6) → search result accuracy within ±20% of stated radius (§2 NFR)
  5. Popular geo-cell hot spots (§3, §8) → cache cold-start triggers thundering herd → probabilistic early expiration (§8) prevents stampede → sustained >95% cache hit rate for rating aggregates without coordination overhead
  6. Review bombing risk (§10) → raw write path has no fraud signal → idempotency key on POST /reviews (§5b) prevents duplicate submissions + rate limit of 1 review/business/user (§10) limits velocity attacks without blocking legitimate reviews
