System Design Interview Guide

Ride-Sharing System Design Interview Guide

Real-time driver matching at global scale. Geohashing, quadtrees, WebSocket-driven location tracking, ETA prediction, and surge pricing: the core algorithms that make Uber and Lyft work.

L3/L4: basic matching + maps API. L5/L6: geohash + real-time location + surge. L7/L8: global scale + ETA ML pipeline + multi-region.

~22 min read · 11 sections · interactive capacity estimator

01

What the interviewer is testing

The ride-sharing prompt is the highest-signal question in the FAANG system design canon precisely because it hides four intersecting real-time problems inside a deceptively simple description. Continuous GPS ingestion and spatial indexing of moving drivers, sub-second matching between a rider request and the nearest available driver, ETA computation under live traffic, and demand-aware surge pricing per geohash cell — each of these is a hard systems problem on its own. Together, they expose whether you can reason about competing data freshness requirements rather than just naming components.

A candidate who reduces this to "store driver location in a table and query nearby" has ignored the hardest parts. The table below shows what separates answers by level.

Level What good looks like
L3/L4 Designs a rider request API, stores driver GPS in a DB, uses Google Maps for ETA, understands the basic booking state machine (requested → accepted → in-trip → completed).
L5 Implements geohash-based spatial index in Redis for sub-second driver lookup, designs the WebSocket architecture for real-time location streaming, explains surge pricing per cell, quantifies write QPS from GPS heartbeats.
L6 Owns the end-to-end matching algorithm trade-offs (geohash vs. quadtree), reasons about ETA pipeline latency, designs supply/demand forecasting per cell, explains hot-spot handling for driver-dense areas, and addresses driver location privacy.
L7/L8 Addresses cross-region replication for global expansion, ML-driven ETA accuracy, dispatch optimization (batching nearby requests), per-product SKU differentiation (UberX vs. UberXL vs. UberBlack), and the economics of the supply-demand marketplace.
02

Requirements clarification

Scope this before touching any components. "Ride-sharing" covers carpooling, scheduled rides, freight, food delivery, and scooters — each with different matching and pricing models. For a FAANG interview, the canonical scope is on-demand P2P rides in a single city.

Functional requirements

Requirement In scope
Rider requests a ride from location A to location B ✓
System matches rider to nearest available driver ✓
Driver location streams to backend (GPS heartbeat) ✓
Real-time ETA shown to rider before and during trip ✓
Surge pricing based on local supply/demand ratio ✓
Trip lifecycle: requested → accepted → in-trip → completed ✓
Payment processing on trip completion ✓ (stub, not deep-dived)
Driver and rider ratings ✓ (simplified)
Ride scheduling (future pickup) Out of scope
Carpooling / shared rides Out of scope

Non-functional requirements

NFR Target Why
Matching latency (ride request → driver notified) < 5 seconds p99 Riders abandon requests beyond 5–10 seconds; Uber targets < 3s
Location update latency (driver GPS → index) < 2 seconds Stale driver positions produce wrong ETA and missed matches
GPS write QPS ~125K/s baseline (500K active drivers ÷ 4s cadence); up to 500K/s at global Uber scale (1M drivers, 2s cadence) Every driver pings every 4 seconds; write volume scales with active fleet size, not ride count
Availability 99.99% during peak hours Downtime = no rides; direct revenue impact
Consistency (trip state) Strong — one driver per trip Double-booking a driver is a hard correctness failure
Read/write ratio (location store) Write-dominant Drivers write; matching reads; write >> read
Why strong consistency for trip state but eventual for location?

Trip state (driver assignment, trip status) requires strong consistency because two riders matching the same driver is a hard business error — one trip must be rejected. Location data, by contrast, is continuously refreshed. A stale GPS position by 2 seconds is acceptable because the next heartbeat will correct it. This split allows you to use a strongly consistent RDBMS for trip records while using a high-throughput, eventually consistent spatial index (Redis) for location.

Tradeoff: The dual-consistency model requires the matching engine to tolerate briefly stale driver positions. In practice, Uber uses a "supply snapshot", a cache of driver positions refreshed every ~2s, rather than hitting the live location store per match request.
Why 4-second GPS heartbeat cadence specifically?

GPS heartbeat frequency is a battery-vs-accuracy trade-off on the driver's phone. 1 Hz (every second) drains battery significantly and produces location data faster than the spatial index can usefully act on. 10-second intervals make ETA predictions visibly laggy when a driver is approaching. 4 seconds is a common production compromise: fast enough for sub-10-second ETA refresh on the rider app, slow enough to not drain the driver's battery on an 8-hour shift.

What this drives: At 4-second intervals with 1M active drivers, you get ~250K location writes/second at steady state. The location ingestion pipeline must be designed for this constant write load, not bursty batch writes.
⚠️

Rate-limit ride requests and guard the driver fleet as a resource. Without a per-rider rate limit on POST /rides/request, a bad actor can send hundreds of requests per minute to effectively map the live driver fleet (each request triggers a GEOSEARCH and exposes approximate driver density). Apply a hard limit of 5 active ride requests per rider per hour at the API Gateway. Additionally, the GET /pricing/estimate endpoint (which also queries driver density per cell) should be token-bucket rate-limited per IP to prevent scraping of surge heat-maps.

03

Capacity estimation

The dominant cost driver in ride-sharing is the GPS location update stream — not rides, not users, but the continuous position heartbeat from every active driver. Size this first; everything else is smaller by orders of magnitude. The location tier is write-heavy; the ETA tier is read-heavy; they need completely different storage and scaling strategies.

Capacity estimator (baseline inputs: 500K active drivers · 4 s heartbeat interval · 128 B GPS record · 5,000K rides/day · 4 KB trip record · 3 yr retention)

Metric Formula Result
GPS write QPS active drivers ÷ heartbeat interval 125K/s
Location data throughput GPS QPS × record size ~16 MB/s
Ride request QPS rides/day ÷ 86,400 s ~58/s
Location index size active drivers × record size (hot RAM) ~64 MB
Trip DB size (total) rides/day × record size × retention days ~22 TB
Peak GPS QPS (×3) 3× surge on rush-hour / event spikes 375K/s

The key insight: The location index is tiny in storage terms — 500K drivers × 128 bytes each is only ~64 MB — but the write rate is enormous. At 125K writes/second sustained, the bottleneck is ingestion throughput and index update latency, not disk space. This is why GPS data lives in Redis (sub-millisecond in-memory writes), not Postgres. The trip database, by contrast, is large in storage but low-QPS — it writes once per trip and is read infrequently afterward.
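The estimator's formulas reduce to a few lines of arithmetic. This is a quick script over the baseline inputs stated above (all figures are the article's assumptions, not measurements):

```python
# Back-of-envelope capacity math for the baseline scenario.
ACTIVE_DRIVERS   = 500_000
HEARTBEAT_SEC    = 4
GPS_RECORD_BYTES = 128
RIDES_PER_DAY    = 5_000_000
TRIP_RECORD_KB   = 4
RETENTION_YEARS  = 3

gps_write_qps     = ACTIVE_DRIVERS / HEARTBEAT_SEC                # 125,000 writes/s
gps_throughput_mb = gps_write_qps * GPS_RECORD_BYTES / 1e6        # ~16 MB/s ingest
ride_request_qps  = RIDES_PER_DAY / 86_400                        # ~58 requests/s
index_size_mb     = ACTIVE_DRIVERS * GPS_RECORD_BYTES / 1e6       # ~64 MB hot RAM
trip_db_tb        = (RIDES_PER_DAY * TRIP_RECORD_KB * 1024
                     * 365 * RETENTION_YEARS) / 1e12              # ~22 TB over 3 yr
peak_gps_qps      = 3 * gps_write_qps                             # 375,000 writes/s
```

Note the three-orders-of-magnitude gap between GPS write QPS (~125K/s) and ride request QPS (~58/s): the location path and the trip path are different systems.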

04

High-level architecture

[Architecture diagram] Rider App (request ride) and Driver App (GPS stream) connect through an API Gateway / Load Balancer (auth, rate limiting, routing) to four services: Location Service (GPS ingest + index), Dispatch Service (matching + WebSocket), Trip Service (state machine + billing), and ETA & Pricing Service (surge × map routing). Backing infrastructure: Redis GEOADD index (driver positions), WebSocket cluster (driver push channel), PostgreSQL trip DB (ACID + row locking), Redis map cache + surge cells, and a Kafka event bus (location-updates · trip-events · pricing-signals) feeding the analytics / ML pipeline (ETA model training, demand forecasting, surge calibration).

Ride-sharing high-level architecture: GPS ingest → Redis spatial index → Dispatch matching → Trip DB state machine

Component breakdown

Location Service is the hottest service in the system by write volume. Every active driver sends a GPS heartbeat every 4 seconds. The service receives these, validates them (timestamp freshness, coordinate sanity), and upserts into a Redis geospatial index using GEOADD. It also publishes a location-update event to Kafka for the analytics pipeline. Critically, it does not interact with the relational trip database — that path would be far too slow at 125K+ writes/second.

Dispatch Service handles the matching algorithm. When a rider submits a request, the Dispatch Service queries the Redis geospatial index for available drivers within a radius, scores candidates (distance, acceptance rate, vehicle type), picks the best match, and sends a push notification to the driver's active WebSocket connection. It then waits for the driver accept/reject response within a timeout window (typically 10 seconds), cascading to the next candidate if rejected.

📍 What is Redis GEOADD?

Redis geospatial commands in one paragraph. Redis provides GEOADD, GEODIST, and GEOSEARCH, all built on sorted sets that encode latitude/longitude pairs as 52-bit geohash scores. GEOADD key longitude latitude member upserts a named member (e.g. a driver ID) at a geographic position. GEOSEARCH drivers:location:available FROMLONLAT -73.9857 40.7484 BYRADIUS 5 km ASC COUNT 10 returns the 10 nearest members within 5 km, sorted by distance, in a single O(N+log M) operation. Note: use FROMLONLAT with the rider's raw coordinates here — FROMMEMBER would only apply if the rider were already stored as a member of the same sorted set, which they are not. Because all data is in memory and geohash encoding enables fast spatial comparisons, latency is typically under 1 ms — and it's the only data structure that can sustain 125K+ GPS writes per second alongside sub-millisecond range queries.

Trip Service owns the lifecycle of a trip: it creates the trip record, transitions its state (requested → driver_assigned → in_progress → completed → settled), and writes to PostgreSQL under row-level locking to prevent double-assignment. It emits trip events to Kafka for downstream payment, rating, and analytics consumers.

ETA and Pricing Service computes two things: estimated arrival time (a graph routing query against a road network graph, adjusted for live traffic) and the fare estimate (base rate × distance × time × surge multiplier). Surge is computed per geohash cell as the ratio of open ride requests to available drivers, with a configurable multiplier schedule. Both the road graph and surge multipliers are cached aggressively — ETA models run on pre-built routing graphs, not live map data per request.

⚠️

WebSocket stickiness matters for dispatch. A driver's device maintains a persistent WebSocket connection to a specific Dispatch Service node. If the matching engine sends an accept/reject notification to the wrong node, the message is silently dropped. Solutions: consistent hashing (route all messages for driver D to the node holding connection D), or a pub/sub layer (publish to a Redis channel that the holding node subscribes to). The pub/sub approach is more resilient to node failures.

Architectural rationale

Why Redis for driver location, not Postgres? Storage choice

Postgres can do geospatial queries via PostGIS, but 125K upserts/second will saturate its write path and create index bloat as every driver update re-indexes a B-tree entry. Redis GEOADD is a sorted-set operation that completes in microseconds and handles the write rate comfortably. The trade-off is durability: if the Redis primary dies, you lose the last few seconds of position data. This is acceptable — drivers will re-send their GPS within 4 seconds, and the index rebuilds automatically.

Tradeoff: Redis is in-memory; driver location data for 1M active drivers is ~128 MB, trivially small. So the trade-off is not memory vs. disk; it's operational complexity (Redis cluster) vs. write throughput (Postgres).
Alternatives: PostGIS (lower throughput) · H3 (Uber's hex grid library) · Aerospike (SSD-backed, lower latency than Postgres)
Why WebSocket for driver notification, not push (FCM/APNs)? Communication model

Mobile push (FCM/APNs) has two problems for dispatch: latency and reliability. FCM delivery can take 1–30 seconds depending on battery optimization mode, and the platform provides no acknowledgement that the message was received by the app in time. For a 10-second accept window, a 5-second FCM delay is unacceptable. WebSocket connections from the driver app provide sub-100ms delivery and a clear acknowledgement path. The tradeoff is connection overhead: 1M persistent WebSocket connections requires a well-designed connection server (typically a stateful layer separate from the stateless matching logic).

Tradeoff: WebSocket servers need sticky routing: a driver's message must reach the server holding their specific connection. This adds a pub/sub indirection layer (Redis Pub/Sub or a service mesh) that push notifications avoid.
Why PostgreSQL for the Trip DB? Consistency model

The trip assignment is a distributed locking problem: two concurrent rider requests must not claim the same driver. Postgres row-level locking with SELECT ... FOR UPDATE provides this guarantee natively. The trip write QPS is low (rides/second, not hundreds of thousands), so Postgres is not a throughput bottleneck. A NoSQL database would require building a distributed lock on top — adding complexity without benefit.

Tradeoff: Postgres does not shard as elegantly as Cassandra or DynamoDB. As trips scale beyond ~100K/day, shard by city ID or use a managed distributed Postgres (e.g. Citus, AlloyDB, Aurora) to maintain the relational guarantees at scale.
Alternatives: MySQL + InnoDB row locking · CockroachDB (distributed ACID) · DynamoDB + conditional writes (more complex)
05

Geospatial matching, the core algorithm

The matching problem has two parts: proximity search (which drivers are physically nearby?) and driver selection (among those, which one gets the trip?). They have very different characteristics: proximity search must scale to millions of drivers globally while being sub-millisecond, whereas driver selection involves business logic that runs on a small candidate set (typically 5–20 drivers).

[Diagram: geohash grid at precision 6 (≈ 1.2 km × 0.6 km cells), with rider R1 in cell dp3y and drivers D1–D11 scattered across cells dp3q–dp4c; D9 is the nearest candidate.]

Matching algorithm:
① GEOSEARCH — query Redis for drivers within 5 km of rider R1; returns D1..D11 sorted by distance ascending.
② Filter — remove busy, offline, and wrong-vehicle-type drivers → candidate set: D2, D5, D7, D9.
③ Score — score = −w₁·dist + w₂·rating + w₃·acceptance_rate; D9 wins (closest + top rating).
④ Offer + timeout — push the offer to D9 via WebSocket and wait 10 s; on reject or timeout, cascade to D2.

Geohash-based proximity search + weighted scoring for driver selection
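The scoring step runs on the small filtered candidate set, so plain application code suffices. A minimal sketch, with illustrative weight values (the weights here are assumptions for demonstration, not production-tuned numbers):

```python
def score_driver(distance_km: float, rating: float, acceptance_rate: float,
                 w_dist: float = 1.0, w_rating: float = 0.3,
                 w_accept: float = 0.5) -> float:
    """Weighted candidate score; higher is better. Weights are illustrative."""
    return -w_dist * distance_km + w_rating * rating + w_accept * acceptance_rate

def pick_best(candidates):
    """candidates: list of (driver_id, distance_km, rating, acceptance_rate)."""
    return max(candidates, key=lambda c: score_driver(c[1], c[2], c[3]))[0]
```

Distance enters with a negative weight so that, all else equal, the nearest driver wins; rating and acceptance rate act as tie-breakers that let a slightly farther but more reliable driver overtake.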

Geohash vs. quadtree

Geohash: string prefix = geographic proximity Spatial index

Geohash encodes coordinates into a base-32 string. Longer strings = smaller cells. Cells with the same prefix are geographically adjacent. This makes lookups trivially simple in a Redis sorted set — encode both the driver and rider position to the same precision and compare scores. The weakness is at cell boundaries: a rider at the edge of cell A and a driver 50 meters away in cell B won't have a common prefix, so you must also query the 8 neighboring cells. Redis GEOSEARCH handles this automatically via its internal geohash scoring.

Precision: a 6-char geohash ≈ 1.2 km × 0.6 km; 7-char ≈ 153 m × 153 m. For surge pricing, 6 chars is the right granularity. For driver matching, use the Redis GEOSEARCH radius query rather than raw prefix matching.
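For intuition about why "same prefix = nearby", here is a minimal geohash encoder using the standard longitude-first bit-interleaving algorithm. This is a teaching sketch, not the code Redis uses internally:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat: float, lng: float, precision: int = 6) -> str:
    """Encode a lat/lng pair into a base-32 geohash string.

    Bits alternate longitude-first, halving the search interval each step;
    every 5 bits emit one character. Truncating a geohash yields the
    geohash of the enclosing (coarser) cell, which is the prefix property.
    """
    lat_lo, lat_hi = -90.0, 90.0
    lng_lo, lng_hi = -180.0, 180.0
    chars, bits, bit_count = [], 0, 0
    even = True  # even bit positions refine longitude
    while len(chars) < precision:
        if even:
            mid = (lng_lo + lng_hi) / 2
            if lng > mid:
                bits = (bits << 1) | 1
                lng_lo = mid
            else:
                bits <<= 1
                lng_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat > mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        even = not even
        bit_count += 1
        if bit_count == 5:  # 5 bits per base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)
```

The classic test point (57.64911, 10.40744) encodes to u4pruydqqvj at precision 11, and every shorter precision is a prefix of that string.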
Quadtree: adaptive subdivision for uneven density Spatial index

A quadtree recursively subdivides space into four quadrants until each cell contains at most N drivers (e.g. N=10). This means dense cities (Manhattan) get fine-grained cells while rural areas get coarse cells — matching their actual driver density. Uber uses a hexagonal grid (H3) which achieves similar adaptive behavior with better geometric properties (no edge effects at poles). Quadtrees are more complex to implement in Redis; for interview purposes, geohash is the correct first answer.

When to use: A quadtree (or H3) becomes relevant when your surge pricing cells need to be density-adaptive; you don't want a surge cell in NYC with 10,000 drivers treated the same as one in rural Nevada with 3. Uber's H3 grid serves exactly this purpose.
Alternatives: Uber H3 hexagonal grid · PostGIS R-Tree · S2 geometry library (Google)

Edge case: no drivers in the search radius

🔍

Interviewer probe: "What happens if no drivers are within 5 km of the rider?" The naive answer is "expand the radius." The better answer is an exponential backoff search: query 2 km → 5 km → 10 km → 20 km with a timeout at each tier. Additionally, the UI should show real-time driver positions on the map so the rider knows whether to wait. For markets with structural supply shortages, the system should show a predicted pickup time once a driver completes a nearby trip, not just a "no drivers available" error.
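The exponential-backoff search is a short loop. In this sketch, search_fn stands in for the Redis GEOSEARCH call (an assumed interface for illustration); it takes a radius in km and returns the drivers found:

```python
def find_candidates(search_fn, max_radius_km: int = 20):
    """Expanding-radius driver search over 2 -> 5 -> 10 -> 20 km tiers.

    search_fn(radius_km) returns a (possibly empty) list of available
    drivers near the rider. Returns (drivers, radius) on the first tier
    that yields candidates, or ([], None) if every tier comes up empty.
    """
    for radius in (2, 5, 10, 20):
        if radius > max_radius_km:
            break
        drivers = search_fn(radius)
        if drivers:
            return drivers, radius
    return [], None
```

In production each tier would also carry a per-tier timeout, and the ([], None) case triggers the "predicted pickup time" fallback described above rather than a hard error.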

05b

API design

The API surface covers three primary actors: the rider requesting a trip, the driver sending GPS data and responding to offers, and the trip lifecycle mutations.

Endpoint Method Description
POST /rides/request REST Rider submits pickup + dropoff coordinates. Returns ride_id and initial ETA + fare estimate.
GET /rides/{id}/status REST (poll) / SSE Returns trip state, driver position, and ETA. Riders poll this or receive via server-sent events.
POST /rides/{id}/cancel REST Rider or driver cancels the trip (with cancellation fee logic).
PUT /drivers/{id}/location REST / WebSocket frame Driver app pushes GPS coordinates. Upserts into Redis GEOADD.
PUT /drivers/{id}/status REST Driver sets availability: available | busy | offline.
POST /drivers/{id}/offer/accept WebSocket Driver accepts or rejects a ride offer. Bidirectional WebSocket frame.
GET /pricing/estimate REST Returns fare estimate and current surge multiplier for a given origin cell.

Request / response schemas

The two most interview-critical endpoints in detail. These are the shapes candidates are expected to sketch during a whiteboard session.

POST /rides/request: rider requests a trip
// Request
POST /rides/request
Content-Type: application/json
Idempotency-Key: <client-generated UUID>          // ← dedup on double-tap

{
  "pickup":        { "lat": 37.7749, "lng": -122.4194 },
  "dropoff":       { "lat": 37.7863, "lng": -122.4102 },
  "vehicle_type":  "standard",          // standard | xl | black
  "idempotency_key": "550e8400-e29b-41d4-a716-446655440000"
}

// 200 OK — trip created or existing trip returned on duplicate key
{
  "ride_id":              "trip_abc123",
  "status":               "REQUESTED",
  "eta_pickup_seconds":   240,
  "fare_estimate_cents":  1450,
  "surge_multiplier":     1.2,
  "created_at":           "2026-04-19T22:00:00Z"
}

// 429 Too Many Requests — rider rate-limited (> 5 active requests/hour)
// 409 Conflict — returned if an active trip already exists for this rider
Idempotency: The Idempotency-Key header (or body field) is checked against a Redis key with a 60-second TTL before creating the trip. A duplicate submission within the window returns the existing trip ID without creating a second one. This handles the common "double-tap" and "connection retry" cases that would otherwise flood the matching engine with duplicate requests for the same rider.
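The dedup logic reduces to "return the cached trip if the key is live, otherwise create and cache." A minimal in-process sketch (production uses the Redis SET-with-TTL pattern described above; the dict here just shows the control flow):

```python
import time

class IdempotencyCache:
    """In-process stand-in for a Redis key with a 60-second TTL."""

    def __init__(self, ttl_seconds: int = 60):
        self.ttl = ttl_seconds
        self._store = {}  # idempotency_key -> (ride_id, expires_at)

    def get_or_create(self, key: str, create_trip):
        """Return (ride_id, created). Duplicate keys inside the TTL window
        return the existing trip without invoking create_trip again."""
        now = time.monotonic()
        entry = self._store.get(key)
        if entry and entry[1] > now:
            return entry[0], False           # duplicate within window
        ride_id = create_trip()              # e.g. INSERT the trip row
        self._store[key] = (ride_id, now + self.ttl)
        return ride_id, True                 # newly created
```

In the real system the check-and-set must itself be atomic (Redis SET key value NX EX 60), otherwise two racing duplicates can both pass the check.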
PUT /drivers/{id}/location: GPS heartbeat
// WebSocket frame (same connection used for offer receipt)
// Sent every 4 seconds from the driver app
{
  "type":      "location_update",
  "driver_id": "drv_xyz789",
  "lat":       37.7812,
  "lng":       -122.4130,
  "heading":   270,             // degrees, 0=north
  "speed_kmh": 32,
  "timestamp": "2026-04-19T22:00:04Z"
}

// Server ACK (sent back on same WS connection)
{
  "type":   "location_ack",
  "status": "ok"
}

// If driver has been matched to a pending trip, ACK includes the offer:
{
  "type":        "ride_offer",
  "ride_id":     "trip_abc123",
  "pickup":      { "lat": 37.7749, "lng": -122.4194, "address": "Market St & 4th" },
  "dropoff":     { "lat": 37.7863, "lng": -122.4102, "address": "Union Square" },
  "rider_name":  "Alex R.",
  "rider_rating": 4.8,
  "fare_estimate_cents": 1450,
  "expires_at":  "2026-04-19T22:00:14Z"   // 10-second accept window
}
Why multiplex on one connection: Sending GPS heartbeats and receiving ride offers over the same WebSocket connection eliminates the overhead of a separate long-poll channel for offer delivery. It also lets the server detect driver offline status from a single missed heartbeat, without needing a separate keepalive mechanism.
💡

Driver location updates should be sent over the same persistent WebSocket connection used for receiving ride offers. This amortizes connection setup cost across both GPS writes and offer pushes, and allows the server to detect driver disconnection (loss of heartbeat) without a separate polling mechanism.

06

Core flow: requesting a ride

The ride request flow is a cascade of real-time operations that must complete within the matching latency SLA. Each step has a failure mode and recovery path.

[Sequence diagram] Rider App → API Gateway (auth + rate limit) → Trip Service creates the trip (REQUESTED) → Dispatch Service picks up the match job, runs GEOSEARCH 5 km against the Redis GEOADD index, scores candidates, and pushes the speculative offer to D9 over WebSocket → driver accepts (WS frame) → Dispatch Service runs UPDATE trips SET driver = D9 → trip → DRIVER_ASSIGNED → rider app shows the driver ETA. Timeline: t = 0 rider taps; t ≈ 1–3 s offer sent to driver; t ≈ 3–5 s driver accepted and rider sees ETA.

Ride request flow: rider tap → Redis proximity query → speculative offer via WebSocket → driver accepts → trip committed to DB → rider notified. Total wall time: < 5 seconds p99.

Step-by-step request flow
  1. Rider app sends POST /rides/request with pickup/dropoff coordinates and vehicle type preference.
  2. API Gateway authenticates the rider, checks rate limits, and forwards to Trip Service.
  3. Trip Service creates a trip record in Postgres with status REQUESTED and returns the ride_id immediately.
  4. Dispatch Service picks up the request (via queue or direct call), runs GEOSEARCH against the Redis location index to fetch drivers within 5 km, filters by availability/vehicle type, and scores the candidate set.
  5. Dispatch Service pushes the ride offer to D9 via WebSocket. No DB write happens yet: the offer is speculative. The driver has 10 seconds to accept or reject.
  6. Driver accepts. Only now does Dispatch Service write to the Trip DB: UPDATE trips SET driver_id = D9, status = 'driver_assigned' WHERE trip_id = T1 AND driver_id IS NULL. If 0 rows are updated (another request beat it), the driver reverts to available and the system cascades to the next candidate. Simultaneously, the driver's status is flipped to busy in both Postgres and Redis, removing them from the drivers:location:available index via ZREM. Dispatch Service then publishes a trip.assigned event to Kafka.
  7. Rider app, polling GET /rides/{id}/status, receives the driver's name, vehicle plate, and live ETA.
  8. On trip completion, Trip Service transitions state to COMPLETED and emits a trip.completed event with fare_cents, rider_id, and driver_id. The Payment Service consumes this, charges the stored payment method via Stripe, and emits payment.succeeded or payment.failed. On failure the trip enters COMPLETED_UNPAID pending automated retry and, if exhausted, manual support review.
07

Data model

Two fundamentally different stores: a relational database for transactional records (trips, drivers, riders), and an in-memory geospatial index for live driver positions. The split exists because these two datasets have opposite throughput and durability requirements — collapsing them into a single store is the most common L3/L4 design error on this question.

Trip state machine

The trip record is a state machine. Every valid transition is a database write; invalid transitions are rejected. Interviewers frequently ask "what states can a trip be in?" — sketch this before drawing the data model table.

REQUESTED → (match) → DRIVER_ASSIGNED → (pickup) → IN_PROGRESS → (dropoff) → COMPLETED → (payment ok) → SETTLED. Failure paths: REQUESTED / DRIVER_ASSIGNED / IN_PROGRESS → CANCELLED (rider, driver, or timeout); COMPLETED → (payment fails) → COMPLETED_UNPAID.

Trip state machine — solid arrows are happy-path transitions; dashed red arrows are cancellation and payment failure paths

Relational schema (PostgreSQL)

Table Key columns Notes
trips trip_id PK, rider_id FK, driver_id FK (nullable), status, pickup_lat, pickup_lng, dropoff_lat, dropoff_lng, fare_cents, surge_multiplier, created_at, completed_at Double-assignment is prevented by a partial unique index: CREATE UNIQUE INDEX active_trip_per_driver ON trips(driver_id) WHERE status IN ('requested','driver_assigned','in_progress'). The DB rejects any second active trip for the same driver as a constraint violation, regardless of how many concurrent matchers attempt the write. (Note the index must cover only active states — excluding just 'completed' and 'cancelled' would wrongly block drivers whose past trips reached 'settled' or 'completed_unpaid'.)
drivers driver_id PK, status (available|busy|offline), vehicle_type, rating_avg, acceptance_rate_14d Status is write-through to Redis. Rating and acceptance rate are rolling aggregates updated asynchronously.
riders rider_id PK, rating_avg, payment_method_id, created_at Payment method references a Stripe Customer ID — actual processing is delegated to Payment Service.
trip_events event_id PK, trip_id FK, event_type, lat, lng, occurred_at Append-only audit log of GPS waypoints during a trip (driver pings written every 30s by the analytics consumer, not in real-time).

Redis data structures

Key pattern Type Purpose
drivers:location:available Geo sorted set All available drivers, indexed by lng/lat via GEOADD. The primary spatial index for matching. Updated on every GPS heartbeat.
driver:{id}:meta Hash Driver attributes needed by the scoring function: vehicle_type, rating_avg, acceptance_rate. TTL 60s (refreshed periodically from Postgres).
surge:cell:{geohash6} String (float) Current surge multiplier for a geohash-6 cell. Updated by the Pricing Service every 30–60 seconds. TTL 120s.
driver:{id}:ws_node String Which WebSocket server node holds this driver's connection. Used for routing offer messages. TTL matches connection lifetime.
💡

When a driver accepts a trip, they are removed from drivers:location:available via ZREM and their status is set to busy in a Redis Hash key (driver:{id}:status). This keeps the proximity search result set clean in steady state. However, the Location Service's GPS-heartbeat path must check this status key before executing GEOADD — otherwise a heartbeat arriving in the window between ZREM and the driver-status write will silently re-add the busy driver back into the available index. The guard is a single GET driver:{id}:status check; skip the GEOADD if the result is busy.
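The guarded heartbeat path is one conditional. A sketch, where redis_like is any object exposing get/geoadd in the redis-py style (the interface is an assumption for illustration):

```python
def handle_heartbeat(redis_like, driver_id: str, lng: float, lat: float) -> bool:
    """GPS-heartbeat ingest with the busy-status guard described above.

    Skipping GEOADD when status == "busy" prevents a late heartbeat,
    arriving in the window between ZREM and the status write, from
    silently re-adding an assigned driver to the available index.
    Returns True if the position was indexed.
    """
    status = redis_like.get(f"driver:{driver_id}:status")
    if status == "busy":
        return False  # driver is on a trip; do not re-index
    redis_like.geoadd("drivers:location:available", (lng, lat, driver_id))
    return True
```

Note the remaining subtlety: the GET and GEOADD are still two operations, so a fully airtight version would wrap them in a Lua script or MULTI/EXEC block to make the check-and-write atomic.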

08

Caching strategy

What Where TTL Invalidation
Driver live positions Redis GEOADD (primary) Always current (heartbeat overwrites) GEOADD upsert on every GPS ping
Driver attributes (rating, vehicle type) Redis Hash 60 s Time-based; refreshed from Postgres on miss
Surge multiplier per geohash cell Redis String 120 s Written by Pricing Service every 30–60 s
ETA estimates (by route hash) Redis String 30 s Time-based; stale ETAs skew toward over-estimate, safer than under-estimate
Road network graph In-process memory (ETA service) Never (static) Shipped as artifact during deploy; live traffic overlaid at query time
Rider/driver profiles Redis + CDN 5 min Invalidated on profile update event from Kafka
Why is the road graph in-process, not Redis? Cache placement

The road network graph for a city is a large, read-only data structure (typically 200 MB–2 GB as a compressed adjacency graph with travel time weights). Loading it into Redis would create an enormous serialization overhead — every ETA request would require deserializing a graph that is orders of magnitude larger than the query result. Instead, ETA service pods load the graph on startup into process memory. Live traffic data (speed adjustments per road segment) is a much smaller overlay and can be cached in Redis or fetched via an HTTP cache from a traffic API.

Tradeoff: In-process caching means graph updates require a rolling restart of ETA service pods. This is acceptable for a road network that changes slowly (new roads, speed limit changes). Live traffic deltas are applied as overlays, not full graph rebuilds.
How does surge pricing work computationally? Pricing model

Every 30–60 seconds, the Pricing Service scans all active geohash-6 cells in the system. For each cell, it computes the supply/demand ratio: count of available drivers vs. open ride requests in the past 5-minute window. When demand exceeds supply by a threshold (e.g. demand/supply > 1.5), the surge multiplier activates. The multiplier is capped (typically 5×–8× in production) and is applied to the base fare at trip creation time, not at payment time. The computed multiplier is written to surge:cell:{geohash6} in Redis with a 120-second TTL.

Edge case: Cell boundaries can create jarring surge discontinuities: same block, different cell, 1.0× vs. 3.0×. Production systems use a spatial smoothing pass (weighted average with neighboring cells) to reduce boundary artifacts.
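A sketch of the per-cell computation plus the boundary-smoothing pass. The 1.5 activation threshold and the multiplier cap come from the text above; the linear ramp (1 + ratio − threshold) and the 50/50 self-vs-neighbor weighting are illustrative assumptions, since real pricing curves are proprietary:

```python
def raw_surge(open_requests: int, available_drivers: int,
              threshold: float = 1.5, cap: float = 5.0) -> float:
    """Per-cell surge from the 5-minute demand/supply ratio, capped.

    Linear ramp above the threshold is an illustrative functional form.
    """
    if available_drivers == 0:
        return cap  # no supply at all: max surge
    ratio = open_requests / available_drivers
    if ratio <= threshold:
        return 1.0  # supply keeps up; no surge
    return min(cap, 1.0 + (ratio - threshold))

def smoothed_surge(cell: str, multipliers: dict, neighbors,
                   self_weight: float = 0.5) -> float:
    """Weighted average with neighboring cells to soften cell boundaries.

    neighbors(cell) returns the adjacent geohash cells (8 in a full grid).
    """
    neigh = [multipliers.get(n, 1.0) for n in neighbors(cell)]
    if not neigh:
        return multipliers.get(cell, 1.0)
    return (self_weight * multipliers.get(cell, 1.0)
            + (1 - self_weight) * sum(neigh) / len(neigh))
```

The Pricing Service would run raw_surge per active cell every 30–60 s, apply the smoothing pass, and write the results to surge:cell:{geohash6} with the 120 s TTL.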
09

Scalability deep dive

The GPS ingestion pipeline and the matching engine are the two hardest scaling problems here. They look similar on the surface — both handle ride-related data — but they have opposite characteristics and need completely different solutions.

🔒

Driver location privacy is a regulatory requirement, not just a product decision. Driver GPS coordinates are never transmitted to riders at full precision: the Rider App receives a coarse position (randomized within approximately 200 meters) until the driver accepts the trip. After the trip, the exact route is retained for a maximum of 90 days under GDPR and California CCPA. The trip_events table must support right-to-erasure deletions; driver GPS history must be stored separately from trip records and purged on schedule. In markets with stricter regulations (e.g., Germany), location data may not be transmitted to servers outside the EU at all, requiring regional data residency for the location index Redis cluster.

GPS location ingestion

At 125K–500K writes/second (depending on fleet size and heartbeat cadence), the Location Service is the highest-throughput component in the system. The scaling strategy:

Scaling techniques
  • Shard Redis by city/region. A single Redis node handles on the order of 100K writes/second. Partition driver IDs across N clusters by city or geographic shard.
  • Location Service is stateless. Any Location Service pod can write to any Redis shard. Add pods horizontally with the load balancer.
  • Batch micro-writes. Instead of writing each GPS update instantly, buffer 100ms of updates per driver and batch-write them as a pipeline. Reduces Redis RTT overhead without meaningfully degrading freshness.
  • Kafka as a buffer. Route GPS updates through Kafka first; Location Service consumers write to Redis. Kafka absorbs burst spikes and provides replay capability for analytics.
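The micro-batching bullet can be sketched as a small buffer that coalesces pings per driver and flushes once per window. Here flush_fn stands in for a Redis pipeline executing one GEOADD per buffered driver (an assumed interface); the event-driven flush-on-add is a simplification of a real timer-driven flush:

```python
import time

class LocationBatcher:
    """Buffers GPS upserts for up to window_ms, then flushes as one batch.

    Only the latest position per driver is kept, so a driver pinging
    twice inside one window costs a single Redis write.
    """

    def __init__(self, flush_fn, window_ms: int = 100):
        self.flush_fn = flush_fn
        self.window = window_ms / 1000.0
        self.buffer = {}        # driver_id -> (lng, lat)
        self.deadline = None

    def add(self, driver_id: str, lng: float, lat: float, now=None):
        now = time.monotonic() if now is None else now
        if not self.buffer:
            self.deadline = now + self.window  # window opens on first ping
        self.buffer[driver_id] = (lng, lat)    # newer ping overwrites older
        if now >= self.deadline:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(list(self.buffer.items()))  # one pipelined write
            self.buffer.clear()
            self.deadline = None
```

At 125K pings/second, a 100 ms window means each flush carries ~12.5K positions in one pipeline round-trip instead of 12.5K individual RTTs.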
Hotspots to watch
  • Airport queues. 500 drivers in a 1 km radius at SFO means a single geohash cell is extremely hot. Redis GEOADD on a single key from 500 concurrent writers can serialize. Mitigate by cell-level sharding or client-side location aggregation.
  • Event surge. A stadium event ending adds 10K driver GPS pings simultaneously from a small area. Expect write spikes 5–10× normal for that cell. Design Redis capacity for peak, not average.
  • Cascading match failures. If the Redis location index falls behind (stale positions), the Dispatch Service will send offers to drivers who have already moved away — increasing accept-window timeouts and cascading failures.

Matching throughput

The Dispatch Service scales horizontally for normal load. The tricky case is matching under surge: many simultaneous ride requests from the same area competing for the same driver pool. Two techniques:

Lock-free optimistic assignment

Instead of acquiring a row lock on the driver record during matching (which serializes all concurrent matches), use an optimistic update against the driver row: UPDATE drivers SET active_trip_id = T1 WHERE driver_id = D9 AND active_trip_id IS NULL. If two concurrent requests both try to claim D9, only one update will succeed (the other matches 0 rows and retries with the next candidate). This keeps the matching loop lock-free in the common case and serializes only on the rare double-claim.

Trade-off: Optimistic locking increases retry frequency under heavy contention (surge). Add a short jitter (50–200ms) before retrying to reduce thundering-herd behavior.
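A sketch of the optimistic claim loop with retry jitter. An in-memory dict and a lock stand in for the database row and its atomic single-row update; function and variable names are illustrative:

```python
import random
import threading
import time
from typing import Optional

# In-memory stand-in for the drivers table. The production equivalent is a
# single atomic UPDATE, e.g.:
#   UPDATE drivers SET active_trip_id = :trip
#   WHERE driver_id = :driver AND active_trip_id IS NULL
active_trip: dict[str, Optional[str]] = {}
_row_lock = threading.Lock()  # models the row-level atomicity the database provides

def try_claim(driver_id: str, trip_id: str) -> bool:
    """Compare-and-set: succeeds only if the driver has no active trip."""
    with _row_lock:
        if active_trip.get(driver_id) is None:
            active_trip[driver_id] = trip_id
            return True
        return False  # another trip claimed this driver first (0 rows updated)

def match(trip_id: str, candidates: list[str], max_rounds: int = 3) -> Optional[str]:
    """Walk the ranked candidate list; on a full miss, jitter and retry."""
    for _ in range(max_rounds):
        for driver_id in candidates:
            if try_claim(driver_id, trip_id):
                return driver_id
        time.sleep(random.uniform(0.05, 0.2))  # jitter to avoid a thundering herd
    return None
```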
Batch matching for high-density markets

Rather than matching each ride request independently, accumulate requests for 1–2 seconds and match them as a batch against the available driver pool. This allows the system to globally optimize assignments (minimize total wait time across all pending requests) rather than greedily assigning the nearest driver to each request in arrival order. Uber's dispatch engine does this for UberPool and surge scenarios. The trade-off is a 1–2 second latency penalty for the first request in each batch window.

When to use: Batch matching only wins when the number of simultaneously pending requests is significant relative to the driver pool size (i.e. during surge). At normal load, greedy matching is simpler and faster.
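A toy version of batch matching, assuming an eta(trip, driver) estimate is available. Brute force over driver permutations is fine for the handful of requests in a 1–2 second window; a production dispatcher would use the Hungarian algorithm or a min-cost-flow solver for larger batches:

```python
from itertools import permutations
from typing import Callable

def batch_match(
    trips: list[str],
    drivers: list[str],
    eta: Callable[[str, str], float],
) -> dict[str, str]:
    """Min-total-ETA assignment for one batch window.

    If trips outnumber drivers, the surplus trips simply wait for the
    next window (zip truncates to the shorter list).
    """
    k = min(len(trips), len(drivers))
    best: dict[str, str] = {}
    best_cost = float("inf")
    for subset in permutations(drivers, k):
        pairing = dict(zip(trips, subset))
        cost = sum(eta(t, d) for t, d in pairing.items())
        if cost < best_cost:
            best, best_cost = pairing, cost
    return best
```

Note how this can beat greedy matching: if the first-arriving rider grabs the only nearby driver, a later rider may be left with a very distant one, even though swapping the two assignments lowers total wait.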
10

Failure modes and mitigations

Each component in this system has a distinct failure mode. Designing for these is the difference between L5 and L6 answers.

Failure Impact Mitigation
Redis location index primary failure Matching fails; no drivers visible Redis Cluster with replica promotion; at most ~2 seconds of data loss. The Location Service falls back to a Redis replica (read-only) and drops writes until promotion completes.
Driver WebSocket server failure Offers drop for all drivers on that node Driver apps reconnect immediately; Redis driver:{id}:ws_node key expires automatically; Dispatch Service refreshes routing on reconnect. Trips in offer-pending state retry with the next candidate.
Dispatch Service crash during matching Ride request stuck in REQUESTED state Trip record has a last_match_attempt_at timestamp. A background re-queuer finds trips in REQUESTED state older than 15s and re-enqueues them. Idempotent because driver assignment uses optimistic locking.
Driver accepts but Trip Service write fails Driver sees trip; Trip DB has no driver assigned Dispatch Service handles assignment write failures by retrying with exponential backoff. If retries fail, it sends a "system error, please re-accept" message to the driver and reverts the trip to REQUESTED for re-matching.
GPS position dramatically incorrect (GPS spoof/jump) Wrong driver offered; bad ETA Location Service applies a sanity filter: reject updates where the new position implies speed > 200 km/h since the last ping. Flag coordinates for manual review. Apply Kalman filter smoothing to reduce noise from legitimate rapid updates.
Surge pricing service down All prices revert to base rate Pricing Service writes surge multipliers with 120s TTL. If the service is down, the multipliers expire and fare quotes fall back to 1.0×. This is acceptable; it errs toward the rider, not the driver.
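The GPS sanity filter from the table above can be sketched as a speed check between consecutive pings. The 200 km/h cap comes from the mitigation row; the helper names are illustrative:

```python
import math

MAX_SPEED_KMH = 200.0  # threshold from the mitigation table; tune per market

def haversine_km(lat1: float, lng1: float, lat2: float, lng2: float) -> float:
    """Great-circle distance in kilometers."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lng2 - lng1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def plausible(prev: tuple[float, float, float], curr: tuple[float, float, float]) -> bool:
    """Reject a ping whose implied speed since the last ping exceeds the cap.

    prev/curr are (lat, lng, unix_seconds) tuples.
    """
    dist_km = haversine_km(prev[0], prev[1], curr[0], curr[1])
    dt_h = max(curr[2] - prev[2], 1e-3) / 3600.0
    return dist_km / dt_h <= MAX_SPEED_KMH
```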

Common interviewer follow-up: "How do you handle a driver who accepts a ride but never arrives?" This is a product + systems problem. System-side: the trip emits a driver_en_route event; if the driver's GPS position doesn't converge toward the pickup within N minutes, a background job triggers an alert and the rider can cancel penalty-free. Product-side: cancellation policies and fraud detection run on the trip_events stream asynchronously.
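The convergence check can be as simple as requiring net progress toward the pickup over the sampling window. A deliberately simplified heuristic — thresholds and the helper name are illustrative, and real logic would also account for traffic and routing detours:

```python
def is_converging(distances_m: list[float], min_progress_m: float = 150.0) -> bool:
    """No-show heuristic over periodic driver-to-pickup distance samples.

    distances_m: distances sampled over the last N minutes, oldest first.
    Returns False when the driver has made no meaningful net progress.
    """
    if len(distances_m) < 2:
        return True  # not enough signal yet; don't alert
    return distances_m[0] - distances_m[-1] >= min_progress_m
```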

11

How to answer by level

The same prompt — "Design Uber" — is given to L3 candidates and L7 candidates. What separates the answers is the depth of trade-off reasoning, not the number of buzzwords used.

Level Expected depth Common gaps to avoid
L3/L4 Identifies core components (rider app, driver app, server, maps API, payment), defines the trip state machine, and handles basic GPS storage. Google Maps for ETA is fine at this level. Not identifying that GPS writes are the dominant throughput problem. Storing driver location in a relational table with a lat/lng index without noting the write volume.
L5 Designs the Redis geospatial index, explains geohash precision trade-offs, designs WebSocket architecture with session routing, quantifies GPS write QPS and justifies the Redis choice, explains surge pricing mechanics. Skipping the WebSocket routing problem (which Dispatch node holds which driver connection). Describing surge pricing without explaining how supply/demand is measured per cell.
L6 Owns the full failure mode matrix, explains optimistic vs. pessimistic locking trade-offs for driver assignment, designs the ETA pipeline with map graph caching vs. live traffic overlays, discusses GPS spoofing detection, and draws the Kafka event topology for downstream consumers. Missing the Redis cluster sharding strategy for high-density cells (airport queues). Not discussing how a driver's acceptance rate feeds back into the scoring function.
L7/L8 Addresses global multi-region deployment (city-isolated shards with a global control plane), discusses ML pipeline for ETA accuracy improvements (gradient boosting on historical trip time vs. naive map routing), batch dispatch optimization for UberPool, the economics of the supply-demand marketplace (driver incentive programs as a supply-shaping mechanism), and data residency constraints for international markets. Treating the system as purely technical without the business model context. The surge pricing algorithm, driver incentive programs, and matching fairness policies are all coupled to the two-sided marketplace economics.

Interview technique: Start with the data flows, not the components. Say: "There are three data flows here: continuous GPS writes from drivers, ride request and matching, and trip lifecycle management. They have very different throughput and consistency requirements, which is why they need separate components and storage." This framing signals to the interviewer that you understand why the system is complex, not just what components it has.

Real-world comparison

Decision This design Uber Lyft
Spatial index Redis GEOADD (geohash) H3 hexagonal grid + custom dispatch engine S2 geometry library + Redis
Driver comms WebSocket QUIC / long-poll fallback WebSocket
Trip DB PostgreSQL + row locking Schemaless (DocStore) + MySQL MySQL with custom sharding
ETA model Pre-built road graph + live traffic overlay DeepETA (ML, trained on historical trips) Google Maps API + fine-tuning layer
Surge pricing Supply/demand ratio per geohash cell ML-based surge with demand forecasting Heat-map based, similar cell approach
GPS ingestion Kafka → Location Service → Redis Kafka + internal stream processing (Flink) Kafka + Flink

Uber's "DeepETA" replaced their map-routing ETA with an ML model trained on millions of historical trips. The ML model outperforms naive map routing by 26% on mean absolute error by capturing real-world factors (traffic patterns, construction, driver behavior) that static road graphs can't represent. This is an L7/L8 topic — you don't need to propose it unprompted, but if asked "how would you improve ETA accuracy?" it's the right answer.

How the pieces connect
Tracing each NFR to the architectural decision it forced
1
Low GPS latency NFR (<2 sec) + 4s heartbeat rate (§2) → continuous write volume of 125K+ QPS across the active fleet (§3) → relational databases become unviable → decoupled spatial index using Redis GEO or volatile Cassandra (§5)
2
Diverging consistency requirements (§2) → driver location needs high-throughput eventual consistency, but trip booking needs strong ACID guarantees → double-booking creates business incidents → separate stores: Redis for live positions, PostgreSQL for durable trip state (§4, §6)
3
Sub-second matching expectations (§4) → filtering 500K dynamic drivers sequentially is too slow → 2D geometry indexing using S2 or Geohash collapses spatial searches into fast 1D prefix lookups (§5)
4
Driver availability protection (§2) → multiple riders cannot claim the same driver simultaneously → Dispatch Service pushes speculative WebSocket offers and uses an atomic compare-and-set SQL update (the claim succeeds only where no driver is assigned) to prevent double-booking (§6)
