System Design Interview Guide

Ride-Sharing System Design Interview Guide

Real-time driver matching at global scale. Geohashing, quadtrees, WebSocket-driven location tracking, ETA prediction, and surge pricing: the core algorithms that make Uber and Lyft work.

L3/L4: basic matching + maps API. L5/L6: geohash + real-time location + surge. L7/L8: global scale + ETA ML pipeline + multi-region.

~22 min read · 11 sections · interactive capacity estimator

01

What the interviewer is testing

The ride-sharing prompt is the highest-signal question in the FAANG system design canon precisely because it hides four intersecting real-time problems inside a deceptively simple description. Continuous GPS ingestion and spatial indexing of moving drivers, sub-second matching between a rider request and the nearest available driver, ETA computation under live traffic, and demand-aware surge pricing per geohash cell — each of these is a hard systems problem on its own. Together, they expose whether you can reason about competing data freshness requirements rather than just naming components.

A candidate who reduces this to "store driver location in a table and query nearby" has ignored the hardest parts. The table below shows what separates answers by level.

Level What good looks like
L3/L4 Designs a rider request API, stores driver GPS in a DB, uses Google Maps for ETA, understands the basic booking state machine (requested → accepted → in-trip → completed).
L5 Implements geohash-based spatial index in Redis for sub-second driver lookup, designs the WebSocket architecture for real-time location streaming, explains surge pricing per cell, quantifies write QPS from GPS heartbeats.
L6 Owns the end-to-end matching algorithm trade-offs (geohash vs. quadtree), reasons about ETA pipeline latency, designs supply/demand forecasting per cell, explains hot-spot handling for driver-dense areas, and addresses driver location privacy.
L7/L8 Addresses cross-region replication for global expansion, ML-driven ETA accuracy, dispatch optimization (batching nearby requests), per-product SKU differentiation (UberX vs. UberXL vs. UberBlack), and the economics of the supply-demand marketplace.
02

Requirements clarification

Scope this before touching any components. "Ride-sharing" covers carpooling, scheduled rides, freight, food delivery, and scooters — each with different matching and pricing models. For a FAANG interview, the canonical scope is on-demand P2P rides in a single city.

Functional requirements

Requirement In scope
Rider requests a ride from location A to location B ✓
System matches rider to nearest available driver ✓
Driver location streams to backend (GPS heartbeat) ✓
Real-time ETA shown to rider before and during trip ✓
Surge pricing based on local supply/demand ratio ✓
Trip lifecycle: requested → accepted → in-trip → completed ✓
Payment processing on trip completion ✓ (stub, not deep-dived)
Driver and rider ratings ✓ (simplified)
Ride scheduling (future pickup) Out of scope
Carpooling / shared rides Out of scope

Non-functional requirements

NFR Target Why
Matching latency (ride request → driver notified) < 5 seconds p99 Riders abandon requests beyond 5–10 seconds; Uber targets < 3s
Location update latency (driver GPS → index) < 2 seconds Stale driver positions produce wrong ETA and missed matches
GPS write QPS ~125K/s baseline (500K active drivers ÷ 4s cadence); up to 500K/s at global Uber scale (1M drivers, 2s cadence) Every driver pings every 4 seconds; write volume scales with active fleet size, not ride count
Availability 99.99% during peak hours Downtime = no rides; direct revenue impact
Consistency (trip state) Strong — one driver per trip Double-booking a driver is a hard correctness failure
Read/write ratio (location store) Write-dominant Drivers write; matching reads; write >> read
Why strong consistency for trip state but eventual for location?

Trip state (driver assignment, trip status) requires strong consistency because two riders matching the same driver is a hard business error — one trip must be rejected. Location data, by contrast, is continuously refreshed. A stale GPS position by 2 seconds is acceptable because the next heartbeat will correct it. This split allows you to use a strongly consistent RDBMS for trip records while using a high-throughput, eventually consistent spatial index (Redis) for location.

Tradeoff: The dual-consistency model requires the matching engine to tolerate briefly stale driver positions. In practice, Uber uses a "supply snapshot", a cache of driver positions refreshed every ~2s, rather than hitting the live location store per match request.
Why 4-second GPS heartbeat cadence specifically?

GPS heartbeat frequency is a battery-vs-accuracy trade-off on the driver's phone. 1 Hz (every second) drains battery significantly and produces location data faster than the spatial index can usefully act on. 10-second intervals make ETA predictions visibly laggy when a driver is approaching. 4 seconds is a common production compromise: fast enough for sub-10-second ETA refresh on the rider app, slow enough to not drain the driver's battery on an 8-hour shift.

What this drives: At 4-second intervals with 1M active drivers, you get ~250K location writes/second at steady state. The location ingestion pipeline must be designed for this constant write load, not bursty batch writes.
⚠️

Rate-limit ride requests and guard the driver fleet as a resource. Without a per-rider rate limit on POST /rides/request, a bad actor can send hundreds of requests per minute to effectively map the live driver fleet (each request triggers a GEOSEARCH and exposes approximate driver density). Apply a hard limit of 5 active ride requests per rider per hour at the API Gateway. Additionally, the GET /pricing/estimate endpoint (which also queries driver density per cell) should be token-bucket rate-limited per IP to prevent scraping of surge heat-maps.

03

Capacity estimation

The dominant cost driver in ride-sharing is the GPS location update stream — not rides, not users, but the continuous position heartbeat from every active driver. Size this first; everything else is smaller by orders of magnitude. The location tier is write-heavy; the ETA tier is read-heavy; they need completely different storage and scaling strategies.

Capacity estimator (baseline inputs: 500K active drivers · 4 s heartbeat interval · 128 B GPS record · 5,000K rides/day · 4 KB trip record · 3 yr retention)

Metric Formula Result
GPS write QPS active drivers ÷ heartbeat interval 125K/s
Location data throughput GPS QPS × record size ~16 MB/s
Ride request QPS rides/day ÷ 86,400 s ~58/s
Location index size active drivers × record size (hot RAM) ~64 MB
Trip DB size (total) rides/day × record size × retention days ~22 TB
Peak GPS QPS (×3) 3× surge on rush-hour / event spikes 375K/s

The key insight: The location index is tiny in storage terms — 500K drivers × 128 bytes each is only ~64 MB — but the write rate is enormous. At 125K writes/second sustained, the bottleneck is ingestion throughput and index update latency, not disk space. This is why GPS data lives in Redis (sub-millisecond in-memory writes), not Postgres. The trip database, by contrast, is large in storage but low-QPS — it writes once per trip and is read infrequently afterward.
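The estimator's formulas reduce to a few lines of arithmetic. This is a quick script over the baseline inputs stated above (all figures are the article's assumptions, not measurements):

```python
# Back-of-envelope capacity math for the baseline scenario.
ACTIVE_DRIVERS   = 500_000
HEARTBEAT_SEC    = 4
GPS_RECORD_BYTES = 128
RIDES_PER_DAY    = 5_000_000
TRIP_RECORD_KB   = 4
RETENTION_YEARS  = 3

gps_write_qps     = ACTIVE_DRIVERS / HEARTBEAT_SEC                # 125,000 writes/s
gps_throughput_mb = gps_write_qps * GPS_RECORD_BYTES / 1e6        # ~16 MB/s ingest
ride_request_qps  = RIDES_PER_DAY / 86_400                        # ~58 requests/s
index_size_mb     = ACTIVE_DRIVERS * GPS_RECORD_BYTES / 1e6       # ~64 MB hot RAM
trip_db_tb        = (RIDES_PER_DAY * TRIP_RECORD_KB * 1024
                     * 365 * RETENTION_YEARS) / 1e12              # ~22 TB over 3 yr
peak_gps_qps      = 3 * gps_write_qps                             # 375,000 writes/s
```

Note the three-orders-of-magnitude gap between GPS write QPS (~125K/s) and ride request QPS (~58/s): the location path and the trip path are different systems.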

04

High-level architecture

[Architecture diagram] Rider App (request ride) and Driver App (GPS stream) connect through an API Gateway / Load Balancer (auth, rate limiting, routing) to four services: Location Service (GPS ingest + index), Dispatch Service (matching + WebSocket), Trip Service (state machine + billing), and ETA & Pricing Service (surge × map routing). Backing infrastructure: Redis GEOADD index (driver positions), WebSocket cluster (driver push channel), PostgreSQL trip DB (ACID + row locking), Redis map cache + surge cells, and a Kafka event bus (location-updates · trip-events · pricing-signals) feeding the analytics / ML pipeline (ETA model training, demand forecasting, surge calibration).

Ride-sharing high-level architecture: GPS ingest → Redis spatial index → Dispatch matching → Trip DB state machine

Component breakdown

Location Service is the hottest service in the system by write volume. Every active driver sends a GPS heartbeat every 4 seconds. The service receives these, validates them (timestamp freshness, coordinate sanity), and upserts into a Redis geospatial index using GEOADD. It also publishes a location-update event to Kafka for the analytics pipeline. Critically, it does not interact with the relational trip database — that path would be far too slow at 125K+ writes/second.

Dispatch Service handles the matching algorithm. When a rider submits a request, the Dispatch Service queries the Redis geospatial index for available drivers within a radius, scores candidates (distance, acceptance rate, vehicle type), picks the best match, and sends a push notification to the driver's active WebSocket connection. It then waits for the driver accept/reject response within a timeout window (typically 10 seconds), cascading to the next candidate if rejected.

📍 What is Redis GEOADD?

Redis geospatial commands in one paragraph. Redis provides GEOADD, GEODIST, and GEOSEARCH, all built on sorted sets that encode latitude/longitude pairs as 52-bit geohash scores. GEOADD key longitude latitude member upserts a named member (e.g. a driver ID) at a geographic position. GEOSEARCH drivers:location:available FROMLONLAT -73.9857 40.7484 BYRADIUS 5 km ASC COUNT 10 returns the 10 nearest members within 5 km, sorted by distance, in a single O(N+log M) operation. Note: use FROMLONLAT with the rider's raw coordinates here — FROMMEMBER would only apply if the rider were already stored as a member of the same sorted set, which they are not. Because all data is in memory and geohash encoding enables fast spatial comparisons, latency is typically under 1 ms — and it's the only data structure that can sustain 125K+ GPS writes per second alongside sub-millisecond range queries.

Trip Service owns the lifecycle of a trip: it creates the trip record, transitions its state (requested → driver_assigned → in_progress → completed → settled), and writes to PostgreSQL under row-level locking to prevent double-assignment. It emits trip events to Kafka for downstream payment, rating, and analytics consumers.

ETA and Pricing Service computes two things: estimated arrival time (a graph routing query against a road network graph, adjusted for live traffic) and the fare estimate (base rate × distance × time × surge multiplier). Surge is computed per geohash cell as the ratio of open ride requests to available drivers, with a configurable multiplier schedule. Both the road graph and surge multipliers are cached aggressively — ETA models run on pre-built routing graphs, not live map data per request.

⚠️

WebSocket stickiness matters for dispatch. A driver's device maintains a persistent WebSocket connection to a specific Dispatch Service node. If the matching engine sends an accept/reject notification to the wrong node, the message is silently dropped. Solutions: consistent hashing (route all messages for driver D to the node holding connection D), or a pub/sub layer (publish to a Redis channel that the holding node subscribes to). The pub/sub approach is more resilient to node failures.

Architectural rationale

Why Redis for driver location, not Postgres? Storage choice

Postgres can do geospatial queries via PostGIS, but 125K upserts/second will saturate its write path and create index bloat as every driver update re-indexes a B-tree entry. Redis GEOADD is a sorted-set operation that completes in microseconds and handles the write rate comfortably. The trade-off is durability: if the Redis primary dies, you lose the last few seconds of position data. This is acceptable — drivers will re-send their GPS within 4 seconds, and the index rebuilds automatically.

Tradeoff: Redis is in-memory; driver location data for 1M active drivers is ~128 MB, trivially small. So the trade-off is not memory vs. disk; it's operational complexity (Redis cluster) vs. write throughput (Postgres).
Alternatives: PostGIS (lower throughput) · H3 (Uber's hex grid library) · Aerospike (SSD-backed, lower latency than Postgres)
Why WebSocket for driver notification, not push (FCM/APNs)? Communication model

Mobile push (FCM/APNs) has two problems for dispatch: latency and reliability. FCM delivery can take 1–30 seconds depending on battery optimization mode, and the platform provides no acknowledgement that the message was received by the app in time. For a 10-second accept window, a 5-second FCM delay is unacceptable. WebSocket connections from the driver app provide sub-100ms delivery and a clear acknowledgement path. The tradeoff is connection overhead: 1M persistent WebSocket connections requires a well-designed connection server (typically a stateful layer separate from the stateless matching logic).

Tradeoff: WebSocket servers need sticky routing: a driver's message must reach the server holding their specific connection. This adds a pub/sub indirection layer (Redis Pub/Sub or a service mesh) that push notifications avoid.
Why PostgreSQL for the Trip DB? Consistency model

The trip assignment is a distributed locking problem: two concurrent rider requests must not claim the same driver. Postgres row-level locking with SELECT ... FOR UPDATE provides this guarantee natively. The trip write QPS is low (rides/second, not hundreds of thousands), so Postgres is not a throughput bottleneck. A NoSQL database would require building a distributed lock on top — adding complexity without benefit.

Tradeoff: Postgres does not shard as elegantly as Cassandra or DynamoDB. As trips scale beyond ~100K/day, shard by city ID or use a managed distributed Postgres (e.g. Citus, AlloyDB, Aurora) to maintain the relational guarantees at scale.
Alternatives: MySQL + InnoDB row locking · CockroachDB (distributed ACID) · DynamoDB + conditional writes (more complex)
05

Geospatial matching, the core algorithm

The matching problem has two parts: proximity search (which drivers are physically nearby?) and driver selection (among those, which one gets the trip?). They have very different characteristics: proximity search must scale to millions of drivers globally while being sub-millisecond, whereas driver selection involves business logic that runs on a small candidate set (typically 5–20 drivers).

[Diagram: geohash grid at precision 6 (≈ 1.2 km × 0.6 km cells), with rider R1 in cell dp3y and drivers D1–D11 scattered across cells dp3q–dp4c; D9 is the nearest candidate.]

Matching algorithm:
① GEOSEARCH — query Redis for drivers within 5 km of rider R1; returns D1..D11 sorted by distance ascending.
② Filter — remove busy, offline, and wrong-vehicle-type drivers → candidate set: D2, D5, D7, D9.
③ Score — score = −w₁·dist + w₂·rating + w₃·acceptance_rate; D9 wins (closest + top rating).
④ Offer + timeout — push the offer to D9 via WebSocket and wait 10 s; on reject or timeout, cascade to D2.

Geohash-based proximity search + weighted scoring for driver selection
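The scoring step runs on the small filtered candidate set, so plain application code suffices. A minimal sketch, with illustrative weight values (the weights here are assumptions for demonstration, not production-tuned numbers):

```python
def score_driver(distance_km: float, rating: float, acceptance_rate: float,
                 w_dist: float = 1.0, w_rating: float = 0.3,
                 w_accept: float = 0.5) -> float:
    """Weighted candidate score; higher is better. Weights are illustrative."""
    return -w_dist * distance_km + w_rating * rating + w_accept * acceptance_rate

def pick_best(candidates):
    """candidates: list of (driver_id, distance_km, rating, acceptance_rate)."""
    return max(candidates, key=lambda c: score_driver(c[1], c[2], c[3]))[0]
```

Distance enters with a negative weight so that, all else equal, the nearest driver wins; rating and acceptance rate act as tie-breakers that let a slightly farther but more reliable driver overtake.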

Geohash vs. quadtree

Geohash: string prefix = geographic proximity Spatial index

Geohash encodes coordinates into a base-32 string. Longer strings = smaller cells. Cells with the same prefix are geographically adjacent. This makes lookups trivially simple in a Redis sorted set — encode both the driver and rider position to the same precision and compare scores. The weakness is at cell boundaries: a rider at the edge of cell A and a driver 50 meters away in cell B won't have a common prefix, so you must also query the 8 neighboring cells. Redis GEOSEARCH handles this automatically via its internal geohash scoring.

Precision: a 6-char geohash ≈ 1.2 km × 0.6 km; 7-char ≈ 153 m × 153 m. For surge pricing, 6 chars is the right granularity. For driver matching, use the Redis GEOSEARCH radius query rather than raw prefix matching.
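For intuition about why "same prefix = nearby", here is a minimal geohash encoder using the standard longitude-first bit-interleaving algorithm. This is a teaching sketch, not the code Redis uses internally:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat: float, lng: float, precision: int = 6) -> str:
    """Encode a lat/lng pair into a base-32 geohash string.

    Bits alternate longitude-first, halving the search interval each step;
    every 5 bits emit one character. Truncating a geohash yields the
    geohash of the enclosing (coarser) cell, which is the prefix property.
    """
    lat_lo, lat_hi = -90.0, 90.0
    lng_lo, lng_hi = -180.0, 180.0
    chars, bits, bit_count = [], 0, 0
    even = True  # even bit positions refine longitude
    while len(chars) < precision:
        if even:
            mid = (lng_lo + lng_hi) / 2
            if lng > mid:
                bits = (bits << 1) | 1
                lng_lo = mid
            else:
                bits <<= 1
                lng_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat > mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        even = not even
        bit_count += 1
        if bit_count == 5:  # 5 bits per base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)
```

The classic test point (57.64911, 10.40744) encodes to u4pruydqqvj at precision 11, and every shorter precision is a prefix of that string.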
Quadtree: adaptive subdivision for uneven density Spatial index

A quadtree recursively subdivides space into four quadrants until each cell contains at most N drivers (e.g. N=10). This means dense cities (Manhattan) get fine-grained cells while rural areas get coarse cells — matching their actual driver density. Uber uses a hexagonal grid (H3) which achieves similar adaptive behavior with better geometric properties (no edge effects at poles). Quadtrees are more complex to implement in Redis; for interview purposes, geohash is the correct first answer.

When to use: A quadtree (or H3) becomes relevant when your surge pricing cells need to be density-adaptive; you don't want a surge cell in NYC with 10,000 drivers treated the same as one in rural Nevada with 3. Uber's H3 grid serves exactly this purpose.
Alternatives: Uber H3 hexagonal grid · PostGIS R-Tree · S2 geometry library (Google)

Edge case: no drivers in the search radius

🔍

Interviewer probe: "What happens if no drivers are within 5 km of the rider?" The naive answer is "expand the radius." The better answer is an exponential backoff search: query 2 km → 5 km → 10 km → 20 km with a timeout at each tier. Additionally, the UI should show real-time driver positions on the map so the rider knows whether to wait. For markets with structural supply shortages, the system should show a predicted pickup time once a driver completes a nearby trip, not just a "no drivers available" error.
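The exponential-backoff search is a short loop. In this sketch, search_fn stands in for the Redis GEOSEARCH call (an assumed interface for illustration); it takes a radius in km and returns the drivers found:

```python
def find_candidates(search_fn, max_radius_km: int = 20):
    """Expanding-radius driver search over 2 -> 5 -> 10 -> 20 km tiers.

    search_fn(radius_km) returns a (possibly empty) list of available
    drivers near the rider. Returns (drivers, radius) on the first tier
    that yields candidates, or ([], None) if every tier comes up empty.
    """
    for radius in (2, 5, 10, 20):
        if radius > max_radius_km:
            break
        drivers = search_fn(radius)
        if drivers:
            return drivers, radius
    return [], None
```

In production each tier would also carry a per-tier timeout, and the ([], None) case triggers the "predicted pickup time" fallback described above rather than a hard error.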

05b

API design

The API surface covers three primary actors: the rider requesting a trip, the driver sending GPS data and responding to offers, and the trip lifecycle mutations.

Endpoint Method Description
POST /rides/request REST Rider submits pickup + dropoff coordinates. Returns ride_id and initial ETA + fare estimate.
GET /rides/{id}/status REST (poll) / SSE Returns trip state, driver position, and ETA. Riders poll this or receive via server-sent events.
POST /rides/{id}/cancel REST Rider or driver cancels the trip (with cancellation fee logic).
PUT /drivers/{id}/location REST / WebSocket frame Driver app pushes GPS coordinates. Upserts into Redis GEOADD.
PUT /drivers/{id}/status REST Driver sets availability: available | busy | offline.
POST /drivers/{id}/offer/accept WebSocket Driver accepts or rejects a ride offer. Bidirectional WebSocket frame.
GET /pricing/estimate REST Returns fare estimate and current surge multiplier for a given origin cell.

Request / response schemas

The two most interview-critical endpoints in detail. These are the shapes candidates are expected to sketch during a whiteboard session.

POST /rides/request: rider requests a trip
// Request
POST /rides/request
Content-Type: application/json
Idempotency-Key: <client-generated UUID>          // ← dedup on double-tap

{
  "pickup":        { "lat": 37.7749, "lng": -122.4194 },
  "dropoff":       { "lat": 37.7863, "lng": -122.4102 },
  "vehicle_type":  "standard",          // standard | xl | black
  "idempotency_key": "550e8400-e29b-41d4-a716-446655440000"
}

// 200 OK — trip created or existing trip returned on duplicate key
{
  "ride_id":              "trip_abc123",
  "status":               "REQUESTED",
  "eta_pickup_seconds":   240,
  "fare_estimate_cents":  1450,
  "surge_multiplier":     1.2,
  "created_at":           "2026-04-19T22:00:00Z"
}

// 429 Too Many Requests — rider rate-limited (> 5 active requests/hour)
// 409 Conflict — returned if an active trip already exists for this rider
Idempotency: The Idempotency-Key header (or body field) is checked against a Redis key with a 60-second TTL before creating the trip. A duplicate submission within the window returns the existing trip ID without creating a second one. This handles the common "double-tap" and "connection retry" cases that would otherwise flood the matching engine with duplicate requests for the same rider.
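The dedup logic reduces to "return the cached trip if the key is live, otherwise create and cache." A minimal in-process sketch (production uses the Redis SET-with-TTL pattern described above; the dict here just shows the control flow):

```python
import time

class IdempotencyCache:
    """In-process stand-in for a Redis key with a 60-second TTL."""

    def __init__(self, ttl_seconds: int = 60):
        self.ttl = ttl_seconds
        self._store = {}  # idempotency_key -> (ride_id, expires_at)

    def get_or_create(self, key: str, create_trip):
        """Return (ride_id, created). Duplicate keys inside the TTL window
        return the existing trip without invoking create_trip again."""
        now = time.monotonic()
        entry = self._store.get(key)
        if entry and entry[1] > now:
            return entry[0], False           # duplicate within window
        ride_id = create_trip()              # e.g. INSERT the trip row
        self._store[key] = (ride_id, now + self.ttl)
        return ride_id, True                 # newly created
```

In the real system the check-and-set must itself be atomic (Redis SET key value NX EX 60), otherwise two racing duplicates can both pass the check.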
PUT /drivers/{id}/location: GPS heartbeat
// WebSocket frame (same connection used for offer receipt)
// Sent every 4 seconds from the driver app
{
  "type":      "location_update",
  "driver_id": "drv_xyz789",
  "lat":       37.7812,
  "lng":       -122.4130,
  "heading":   270,             // degrees, 0=north
  "speed_kmh": 32,
  "timestamp": "2026-04-19T22:00:04Z"
}

// Server ACK (sent back on same WS connection)
{
  "type":   "location_ack",
  "status": "ok"
}

// If driver has been matched to a pending trip, ACK includes the offer:
{
  "type":        "ride_offer",
  "ride_id":     "trip_abc123",
  "pickup":      { "lat": 37.7749, "lng": -122.4194, "address": "Market St & 4th" },
  "dropoff":     { "lat": 37.7863, "lng": -122.4102, "address": "Union Square" },
  "rider_name":  "Alex R.",
  "rider_rating": 4.8,
  "fare_estimate_cents": 1450,
  "expires_at":  "2026-04-19T22:00:14Z"   // 10-second accept window
}
Why multiplex on one connection: Sending GPS heartbeats and receiving ride offers over the same WebSocket connection eliminates the overhead of a separate long-poll channel for offer delivery. It also lets the server detect driver offline status from a single missed heartbeat, without needing a separate keepalive mechanism.
💡

Driver location updates should be sent over the same persistent WebSocket connection used for receiving ride offers. This amortizes connection setup cost across both GPS writes and offer pushes, and allows the server to detect driver disconnection (loss of heartbeat) without a separate polling mechanism.

06

Core flow: requesting a ride

The ride request flow is a cascade of real-time operations that must complete within the matching latency SLA. Each step has a failure mode and recovery path.

[Sequence diagram] Rider App → API Gateway (auth + rate limit) → Trip Service creates the trip (REQUESTED) → Dispatch Service picks up the match job, runs GEOSEARCH 5 km against the Redis GEOADD index, scores candidates, and pushes the speculative offer to D9 over WebSocket → driver accepts (WS frame) → Dispatch Service runs UPDATE trips SET driver = D9 → trip → DRIVER_ASSIGNED → rider app shows the driver ETA. Timeline: t = 0 rider taps; t ≈ 1–3 s offer sent to driver; t ≈ 3–5 s driver accepted and rider sees ETA.

Ride request flow: rider tap → Redis proximity query → speculative offer via WebSocket → driver accepts → trip committed to DB → rider notified. Total wall time: < 5 seconds p99.

Step-by-step request flow
  1. Rider app sends POST /rides/request with pickup/dropoff coordinates and vehicle type preference.
  2. API Gateway authenticates the rider, checks rate limits, and forwards to Trip Service.
  3. Trip Service creates a trip record in Postgres with status REQUESTED and returns the ride_id immediately.
  4. Dispatch Service picks up the request (via queue or direct call), runs GEOSEARCH against the Redis location index to fetch drivers within 5 km, filters by availability/vehicle type, and scores the candidate set.
  5. Dispatch Service pushes the ride offer to D9 via WebSocket. No DB write happens yet: the offer is speculative. The driver has 10 seconds to accept or reject.
  6. Driver accepts. Only now does Dispatch Service write to the Trip DB: UPDATE trips SET driver_id = D9, status = 'driver_assigned' WHERE trip_id = T1 AND driver_id IS NULL. If 0 rows are updated (another request beat it), the driver reverts to available and the system cascades to the next candidate. Simultaneously, the driver's status is flipped to busy in both Postgres and Redis, removing them from the drivers:location:available index via ZREM. Dispatch Service then publishes a trip.assigned event to Kafka.
  7. Rider app, polling GET /rides/{id}/status, receives the driver's name, vehicle plate, and live ETA.
  8. On trip completion, Trip Service transitions state to COMPLETED and emits a trip.completed event with fare_cents, rider_id, and driver_id. The Payment Service consumes this, charges the stored payment method via Stripe, and emits payment.succeeded or payment.failed. On failure the trip enters COMPLETED_UNPAID pending automated retry and, if exhausted, manual support review.
07

Data model

Two fundamentally different stores: a relational database for transactional records (trips, drivers, riders), and an in-memory geospatial index for live driver positions. The split exists because these two datasets have opposite throughput and durability requirements — collapsing them into a single store is the most common L3/L4 design error on this question.

Trip state machine

The trip record is a state machine. Every valid transition is a database write; invalid transitions are rejected. Interviewers frequently ask "what states can a trip be in?" — sketch this before drawing the data model table.

REQUESTED → (match) → DRIVER_ASSIGNED → (pickup) → IN_PROGRESS → (dropoff) → COMPLETED → (payment ok) → SETTLED. Failure paths: REQUESTED / DRIVER_ASSIGNED / IN_PROGRESS → CANCELLED (rider, driver, or timeout); COMPLETED → (payment fails) → COMPLETED_UNPAID.

Trip state machine — solid arrows are happy-path transitions; dashed red arrows are cancellation and payment failure paths

Relational schema (PostgreSQL)

Table Key columns Notes
trips trip_id PK, rider_id FK, driver_id FK (nullable), status, pickup_lat, pickup_lng, dropoff_lat, dropoff_lng, fare_cents, surge_multiplier, created_at, completed_at Double-assignment is prevented by a partial unique index: CREATE UNIQUE INDEX active_trip_per_driver ON trips(driver_id) WHERE status IN ('requested','driver_assigned','in_progress'). The DB rejects any second active trip for the same driver as a constraint violation, regardless of how many concurrent matchers attempt the write. (Note the index must cover only active states — excluding just 'completed' and 'cancelled' would wrongly block drivers whose past trips reached 'settled' or 'completed_unpaid'.)
drivers driver_id PK, status (available|busy|offline), vehicle_type, rating_avg, acceptance_rate_14d Status is write-through to Redis. Rating and acceptance rate are rolling aggregates updated asynchronously.
riders rider_id PK, rating_avg, payment_method_id, created_at Payment method references a Stripe Customer ID — actual processing is delegated to Payment Service.
trip_events event_id PK, trip_id FK, event_type, lat, lng, occurred_at Append-only audit log of GPS waypoints during a trip (driver pings written every 30s by the analytics consumer, not in real-time).

Redis data structures

Key pattern Type Purpose
drivers:location:available Geo sorted set All available drivers, indexed by lng/lat via GEOADD. The primary spatial index for matching. Updated on every GPS heartbeat.
driver:{id}:meta Hash Driver attributes needed by the scoring function: vehicle_type, rating_avg, acceptance_rate. TTL 60s (refreshed periodically from Postgres).
surge:cell:{geohash6} String (float) Current surge multiplier for a geohash-6 cell. Updated by the Pricing Service every 30–60 seconds. TTL 120s.
driver:{id}:ws_node String Which WebSocket server node holds this driver's connection. Used for routing offer messages. TTL matches connection lifetime.
💡

When a driver accepts a trip, they are removed from drivers:location:available via ZREM and their status is set to busy in a Redis Hash key (driver:{id}:status). This keeps the proximity search result set clean in steady state. However, the Location Service's GPS-heartbeat path must check this status key before executing GEOADD — otherwise a heartbeat arriving in the window between ZREM and the driver-status write will silently re-add the busy driver back into the available index. The guard is a single GET driver:{id}:status check; skip the GEOADD if the result is busy.
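The guarded heartbeat path is one conditional. A sketch, where redis_like is any object exposing get/geoadd in the redis-py style (the interface is an assumption for illustration):

```python
def handle_heartbeat(redis_like, driver_id: str, lng: float, lat: float) -> bool:
    """GPS-heartbeat ingest with the busy-status guard described above.

    Skipping GEOADD when status == "busy" prevents a late heartbeat,
    arriving in the window between ZREM and the status write, from
    silently re-adding an assigned driver to the available index.
    Returns True if the position was indexed.
    """
    status = redis_like.get(f"driver:{driver_id}:status")
    if status == "busy":
        return False  # driver is on a trip; do not re-index
    redis_like.geoadd("drivers:location:available", (lng, lat, driver_id))
    return True
```

Note the remaining subtlety: the GET and GEOADD are still two operations, so a fully airtight version would wrap them in a Lua script or MULTI/EXEC block to make the check-and-write atomic.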

08

Caching strategy

What Where TTL Invalidation
Driver live positions Redis GEOADD (primary) Always current (heartbeat overwrites) GEOADD upsert on every GPS ping
Driver attributes (rating, vehicle type) Redis Hash 60 s Time-based; refreshed from Postgres on miss
Surge multiplier per geohash cell Redis String 120 s Written by Pricing Service every 30–60 s
ETA estimates (by route hash) Redis String 30 s Time-based; stale ETAs skew toward over-estimate, safer than under-estimate
Road network graph In-process memory (ETA service) Never (static) Shipped as artifact during deploy; live traffic overlaid at query time
Rider/driver profiles Redis + CDN 5 min Invalidated on profile update event from Kafka
Why is the road graph in-process, not Redis? Cache placement

The road network graph for a city is a large, read-only data structure (typically 200 MB–2 GB as a compressed adjacency graph with travel time weights). Loading it into Redis would create an enormous serialization overhead — every ETA request would require deserializing a graph that is orders of magnitude larger than the query result. Instead, ETA service pods load the graph on startup into process memory. Live traffic data (speed adjustments per road segment) is a much smaller overlay and can be cached in Redis or fetched via an HTTP cache from a traffic API.

Tradeoff: In-process caching means graph updates require a rolling restart of ETA service pods. This is acceptable for a road network that changes slowly (new roads, speed limit changes). Live traffic deltas are applied as overlays, not full graph rebuilds.
How does surge pricing work computationally? Pricing model

Every 30–60 seconds, the Pricing Service scans all active geohash-6 cells in the system. For each cell, it computes the supply/demand ratio: count of available drivers vs. open ride requests in the past 5-minute window. When demand exceeds supply by a threshold (e.g. demand/supply > 1.5), the surge multiplier activates. The multiplier is capped (typically 5×–8× in production) and is applied to the base fare at trip creation time, not at payment time. The computed multiplier is written to surge:cell:{geohash6} in Redis with a 120-second TTL.

Edge case: Cell boundaries can create jarring surge discontinuities: same block, different cell, 1.0× vs. 3.0×. Production systems use a spatial smoothing pass (weighted average with neighboring cells) to reduce boundary artifacts.
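A sketch of the per-cell computation plus the boundary-smoothing pass. The 1.5 activation threshold and the multiplier cap come from the text above; the linear ramp (1 + ratio − threshold) and the 50/50 self-vs-neighbor weighting are illustrative assumptions, since real pricing curves are proprietary:

```python
def raw_surge(open_requests: int, available_drivers: int,
              threshold: float = 1.5, cap: float = 5.0) -> float:
    """Per-cell surge from the 5-minute demand/supply ratio, capped.

    Linear ramp above the threshold is an illustrative functional form.
    """
    if available_drivers == 0:
        return cap  # no supply at all: max surge
    ratio = open_requests / available_drivers
    if ratio <= threshold:
        return 1.0  # supply keeps up; no surge
    return min(cap, 1.0 + (ratio - threshold))

def smoothed_surge(cell: str, multipliers: dict, neighbors,
                   self_weight: float = 0.5) -> float:
    """Weighted average with neighboring cells to soften cell boundaries.

    neighbors(cell) returns the adjacent geohash cells (8 in a full grid).
    """
    neigh = [multipliers.get(n, 1.0) for n in neighbors(cell)]
    if not neigh:
        return multipliers.get(cell, 1.0)
    return (self_weight * multipliers.get(cell, 1.0)
            + (1 - self_weight) * sum(neigh) / len(neigh))
```

The Pricing Service would run raw_surge per active cell every 30–60 s, apply the smoothing pass, and write the results to surge:cell:{geohash6} with the 120 s TTL.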
09

Scalability deep dive

The GPS ingestion pipeline and the matching engine are the two hardest scaling problems here. They look similar on the surface — both handle ride-related data — but they have opposite characteristics and need completely different solutions.

🔒

Driver location privacy is a regulatory requirement, not just a product decision. Driver GPS coordinates are never transmitted to riders at full precision: the Rider App receives a coarse position (randomized within approximately 200 meters) until the driver accepts the trip. After the trip, the exact route is retained for a maximum of 90 days under GDPR and California CCPA. The trip_events table must support right-to-erasure deletions; driver GPS history must be stored separately from trip records and purged on schedule. In markets with stricter regulations (e.g., Germany), location data may not be transmitted to servers outside the EU at all, requiring regional data residency for the location index Redis cluster.

GPS location ingestion

At 125K–500K writes/second (depending on fleet size and heartbeat cadence), the Location Service is the highest-throughput component in the system. The scaling strategy:

Scaling techniques
  • Shard Redis by city/region. A single Redis node handles on the order of 100K writes/second. Partition driver IDs across N clusters by city or geographic shard.
  • Location Service is stateless. Any Location Service pod can write to any Redis shard. Add pods horizontally with the load balancer.
  • Batch micro-writes. Instead of writing each GPS update instantly, buffer 100ms of updates per driver and batch-write them as a pipeline. Reduces Redis RTT overhead without meaningfully degrading freshness.
  • Kafka as a buffer. Route GPS updates through Kafka first; Location Service consumers write to Redis. Kafka absorbs burst spikes and provides replay capability for analytics.
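The micro-batching bullet can be sketched as a small buffer that coalesces pings per driver and flushes once per window. Here flush_fn stands in for a Redis pipeline executing one GEOADD per buffered driver (an assumed interface); the event-driven flush-on-add is a simplification of a real timer-driven flush:

```python
import time

class LocationBatcher:
    """Buffers GPS upserts for up to window_ms, then flushes as one batch.

    Only the latest position per driver is kept, so a driver pinging
    twice inside one window costs a single Redis write.
    """

    def __init__(self, flush_fn, window_ms: int = 100):
        self.flush_fn = flush_fn
        self.window = window_ms / 1000.0
        self.buffer = {}        # driver_id -> (lng, lat)
        self.deadline = None

    def add(self, driver_id: str, lng: float, lat: float, now=None):
        now = time.monotonic() if now is None else now
        if not self.buffer:
            self.deadline = now + self.window  # window opens on first ping
        self.buffer[driver_id] = (lng, lat)    # newer ping overwrites older
        if now >= self.deadline:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(list(self.buffer.items()))  # one pipelined write
            self.buffer.clear()
            self.deadline = None
```

At 125K pings/second, a 100 ms window means each flush carries ~12.5K positions in one pipeline round-trip instead of 12.5K individual RTTs.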
Hotspots to watch
  • Airport queues. 500 drivers in a 1 km radius at SFO means a single geohash cell is extremely hot. Redis GEOADD on a single key from 500 concurrent writers can serialize. Mitigate by cell-level sharding or client-side location aggregation.
  • Event surge. A stadium event ending adds 10K driver GPS pings simultaneously from a small area. Expect write spikes 5–10× normal for that cell. Design Redis capacity for peak, not average.
  • Cascading match failures. If the Redis location index falls behind (stale positions), the Dispatch Service will send offers to drivers who have already moved away — increasing accept-window timeouts and cascading failures.

Matching throughput

The Dispatch Service scales horizontally for normal load. The tricky case is matching under surge: many simultaneous ride requests from the same area competing for the same driver pool. Two techniques:

Lock-free optimistic assignment

Instead of acquiring a row lock on the driver record during matching (which serializes all concurrent matches), use an optimistic update against the driver row: UPDATE drivers SET active_trip_id = T1 WHERE driver_id = D9 AND active_trip_id IS NULL. If two concurrent requests both try to claim D9, only one update will succeed (the other matches 0 rows and retries with the next candidate). This keeps the matching loop lock-free in the common case and serializes only on the rare double-claim.

Trade-off: Optimistic locking increases retry frequency under heavy contention (surge). Add a short jitter (50–200ms) before retrying to reduce thundering-herd behavior.
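A sketch of the optimistic claim loop with retry jitter. An in-memory dict and a lock stand in for the database row and its atomic single-row update; function and variable names are illustrative:

```python
import random
import threading
import time
from typing import Optional

# In-memory stand-in for the drivers table. The production equivalent is a
# single atomic UPDATE, e.g.:
#   UPDATE drivers SET active_trip_id = :trip
#   WHERE driver_id = :driver AND active_trip_id IS NULL
active_trip: dict[str, Optional[str]] = {}
_row_lock = threading.Lock()  # models the row-level atomicity the database provides

def try_claim(driver_id: str, trip_id: str) -> bool:
    """Compare-and-set: succeeds only if the driver has no active trip."""
    with _row_lock:
        if active_trip.get(driver_id) is None:
            active_trip[driver_id] = trip_id
            return True
        return False  # another trip claimed this driver first (0 rows updated)

def match(trip_id: str, candidates: list[str], max_rounds: int = 3) -> Optional[str]:
    """Walk the ranked candidate list; on a full miss, jitter and retry."""
    for _ in range(max_rounds):
        for driver_id in candidates:
            if try_claim(driver_id, trip_id):
                return driver_id
        time.sleep(random.uniform(0.05, 0.2))  # jitter to avoid a thundering herd
    return None
```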
Batch matching for high-density markets

Rather than matching each ride request independently, accumulate requests for 1–2 seconds and match them as a batch against the available driver pool. This allows the system to globally optimize assignments (minimize total wait time across all pending requests) rather than greedily assigning the nearest driver to each request in arrival order. Uber's dispatch engine does this for UberPool and surge scenarios. The trade-off is a 1–2 second latency penalty for the first request in each batch window.

When to use: Batch matching only wins when the number of simultaneously pending requests is significant relative to the driver pool size (i.e. during surge). At normal load, greedy matching is simpler and faster.
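A toy version of batch matching, assuming an eta(trip, driver) estimate is available. Brute force over driver permutations is fine for the handful of requests in a 1–2 second window; a production dispatcher would use the Hungarian algorithm or a min-cost-flow solver for larger batches:

```python
from itertools import permutations
from typing import Callable

def batch_match(
    trips: list[str],
    drivers: list[str],
    eta: Callable[[str, str], float],
) -> dict[str, str]:
    """Min-total-ETA assignment for one batch window.

    If trips outnumber drivers, the surplus trips simply wait for the
    next window (zip truncates to the shorter list).
    """
    k = min(len(trips), len(drivers))
    best: dict[str, str] = {}
    best_cost = float("inf")
    for subset in permutations(drivers, k):
        pairing = dict(zip(trips, subset))
        cost = sum(eta(t, d) for t, d in pairing.items())
        if cost < best_cost:
            best, best_cost = pairing, cost
    return best
```

Note how this can beat greedy matching: if the first-arriving rider grabs the only nearby driver, a later rider may be left with a very distant one, even though swapping the two assignments lowers total wait.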
10

Failure modes and mitigations

Each component in this system has a distinct failure mode. Designing for these is the difference between L5 and L6 answers.

Failure Impact Mitigation
Redis location index primary failure Matching fails; no drivers visible Redis Cluster with replica promotion; at most ~2 seconds of data loss. The Location Service falls back to a Redis replica (read-only) and drops writes until promotion completes.
Driver WebSocket server failure Offers drop for all drivers on that node Driver apps reconnect immediately; Redis driver:{id}:ws_node key expires automatically; Dispatch Service refreshes routing on reconnect. Trips in offer-pending state retry with the next candidate.
Dispatch Service crash during matching Ride request stuck in REQUESTED state Trip record has a last_match_attempt_at timestamp. A background re-queuer finds trips in REQUESTED state older than 15s and re-enqueues them. Idempotent because driver assignment uses optimistic locking.
Driver accepts but Trip Service write fails Driver sees trip; Trip DB has no driver assigned Dispatch Service handles assignment write failures by retrying with exponential backoff. If retries fail, it sends a "system error, please re-accept" message to the driver and reverts the trip to REQUESTED for re-matching.
GPS position dramatically incorrect (GPS spoof/jump) Wrong driver offered; bad ETA Location Service applies a sanity filter: reject updates where the new position implies speed > 200 km/h since the last ping. Flag coordinates for manual review. Apply Kalman filter smoothing to reduce noise from legitimate rapid updates.
Surge pricing service down All prices revert to base rate Pricing Service writes surge multipliers with 120s TTL. If the service is down, the multipliers expire and fare quotes fall back to 1.0×. This is acceptable; it errs toward the rider, not the driver.
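The GPS sanity filter from the table above can be sketched as a speed check between consecutive pings. The 200 km/h cap comes from the mitigation row; the helper names are illustrative:

```python
import math

MAX_SPEED_KMH = 200.0  # threshold from the mitigation table; tune per market

def haversine_km(lat1: float, lng1: float, lat2: float, lng2: float) -> float:
    """Great-circle distance in kilometers."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lng2 - lng1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def plausible(prev: tuple[float, float, float], curr: tuple[float, float, float]) -> bool:
    """Reject a ping whose implied speed since the last ping exceeds the cap.

    prev/curr are (lat, lng, unix_seconds) tuples.
    """
    dist_km = haversine_km(prev[0], prev[1], curr[0], curr[1])
    dt_h = max(curr[2] - prev[2], 1e-3) / 3600.0
    return dist_km / dt_h <= MAX_SPEED_KMH
```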

Common interviewer follow-up: "How do you handle a driver who accepts a ride but never arrives?" This is a product + systems problem. System-side: the trip emits a driver_en_route event; if the driver's GPS position doesn't converge toward the pickup within N minutes, a background job triggers an alert and the rider can cancel penalty-free. Product-side: cancellation policies and fraud detection run on the trip_events stream asynchronously.
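The convergence check can be as simple as requiring net progress toward the pickup over the sampling window. A deliberately simplified heuristic — thresholds and the helper name are illustrative, and real logic would also account for traffic and routing detours:

```python
def is_converging(distances_m: list[float], min_progress_m: float = 150.0) -> bool:
    """No-show heuristic over periodic driver-to-pickup distance samples.

    distances_m: distances sampled over the last N minutes, oldest first.
    Returns False when the driver has made no meaningful net progress.
    """
    if len(distances_m) < 2:
        return True  # not enough signal yet; don't alert
    return distances_m[0] - distances_m[-1] >= min_progress_m
```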

11

How to answer by level

The same prompt — "Design Uber" — is given to L3 candidates and L7 candidates. What separates the answers is the depth of trade-off reasoning, not the number of buzzwords used.

Level Expected depth Common gaps to avoid
L3/L4 Identifies core components (rider app, driver app, server, maps API, payment), defines the trip state machine, and handles basic GPS storage. Google Maps for ETA is fine at this level. Not identifying that GPS writes are the dominant throughput problem. Storing driver location in a relational table with a lat/lng index without noting the write volume.
L5 Designs the Redis geospatial index, explains geohash precision trade-offs, designs WebSocket architecture with session routing, quantifies GPS write QPS and justifies the Redis choice, explains surge pricing mechanics. Skipping the WebSocket routing problem (which Dispatch node holds which driver connection). Describing surge pricing without explaining how supply/demand is measured per cell.
L6 Owns the full failure mode matrix, explains optimistic vs. pessimistic locking trade-offs for driver assignment, designs the ETA pipeline with map graph caching vs. live traffic overlays, discusses GPS spoofing detection, and draws the Kafka event topology for downstream consumers. Missing the Redis cluster sharding strategy for high-density cells (airport queues). Not discussing how a driver's acceptance rate feeds back into the scoring function.
L7/L8 Addresses global multi-region deployment (city-isolated shards with a global control plane), discusses ML pipeline for ETA accuracy improvements (gradient boosting on historical trip time vs. naive map routing), batch dispatch optimization for UberPool, the economics of the supply-demand marketplace (driver incentive programs as a supply-shaping mechanism), and data residency constraints for international markets. Treating the system as purely technical without the business model context. The surge pricing algorithm, driver incentive programs, and matching fairness policies are all coupled to the two-sided marketplace economics.

Interview technique: Start with the data flows, not the components. Say: "There are three data flows here: continuous GPS writes from drivers, ride request and matching, and trip lifecycle management. They have very different throughput and consistency requirements, which is why they need separate components and storage." This framing signals to the interviewer that you understand why the system is complex, not just what components it has.

Real-world comparison

Decision This design Uber Lyft
Spatial index Redis GEOADD (geohash) H3 hexagonal grid + custom dispatch engine S2 geometry library + Redis
Driver comms WebSocket QUIC / long-poll fallback WebSocket
Trip DB PostgreSQL + row locking Schemaless (DocStore) + MySQL MySQL with custom sharding
ETA model Pre-built road graph + live traffic overlay DeepETA (ML, trained on historical trips) Google Maps API + fine-tuning layer
Surge pricing Supply/demand ratio per geohash cell ML-based surge with demand forecasting Heat-map based, similar cell approach
GPS ingestion Kafka → Location Service → Redis Kafka + internal stream processing (Flink) Kafka + Flink

Uber's "DeepETA" replaced their map-routing ETA with an ML model trained on millions of historical trips. The ML model outperforms naive map routing by 26% on mean absolute error by capturing real-world factors (traffic patterns, construction, driver behavior) that static road graphs can't represent. This is an L7/L8 topic — you don't need to propose it unprompted, but if asked "how would you improve ETA accuracy?" it's the right answer.

How the pieces connect
Tracing each NFR to the architectural decision it forced
1
Low GPS latency NFR (<2 sec) + 4s heartbeat rate (§2) → continuous write volume of 125K+ QPS across the active fleet (§3) → relational databases become unviable → decoupled spatial index using Redis GEO or volatile Cassandra (§5)
2
Diverging consistency requirements (§2) → driver location needs high-throughput eventual consistency, but trip booking needs strong ACID guarantees → double-booking creates business incidents → separate stores: Redis for live positions, PostgreSQL for durable trip state (§4, §6)
3
Sub-second matching expectations (§4) → filtering 500K dynamic drivers sequentially is too slow → 2D geometry indexing using S2 or Geohash collapses spatial searches into fast 1D prefix lookups (§5)
4
Driver availability protection (§2) → multiple riders cannot claim the same driver simultaneously → Dispatch Service pushes speculative WebSocket offers and uses an atomic compare-and-set SQL update (the claim succeeds only where no driver is assigned) to prevent double-booking (§6)
