What the interviewer is testing
The ride-sharing prompt is the highest-signal question in the FAANG system design canon precisely because it hides four intersecting real-time problems inside a deceptively simple description. Continuous GPS ingestion and spatial indexing of moving drivers, sub-second matching between a rider request and the nearest available driver, ETA computation under live traffic, and demand-aware surge pricing per geohash cell — each of these is a hard systems problem on its own. Together, they expose whether you can reason about competing data freshness requirements rather than just naming components.
A candidate who reduces this to "store driver location in a table and query nearby" has ignored the hardest parts. The table below shows what separates answers by level.
| Level | What good looks like |
|---|---|
| L3/L4 | Designs a rider request API, stores driver GPS in a DB, uses Google Maps for ETA, understands the basic booking state machine (requested → accepted → in-trip → completed). |
| L5 | Implements geohash-based spatial index in Redis for sub-second driver lookup, designs the WebSocket architecture for real-time location streaming, explains surge pricing per cell, quantifies write QPS from GPS heartbeats. |
| L6 | Owns the end-to-end matching algorithm trade-offs (geohash vs. quadtree), reasons about ETA pipeline latency, designs supply/demand forecasting per cell, explains hot-spot handling for driver-dense areas, and addresses driver location privacy. |
| L7/L8 | Addresses cross-region replication for global expansion, ML-driven ETA accuracy, dispatch optimization (batching nearby requests), per-product SKU differentiation (UberX vs. UberXL vs. UberBlack), and the economics of the supply-demand marketplace. |
Requirements clarification
Scope this before touching any components. "Ride-sharing" covers carpooling, scheduled rides, freight, food delivery, and scooters — each with different matching and pricing models. For a FAANG interview, the canonical scope is on-demand P2P rides in a single city.
Functional requirements
| Requirement | In scope |
|---|---|
| Rider requests a ride from location A to location B | ✓ |
| System matches rider to nearest available driver | ✓ |
| Driver location streams to backend (GPS heartbeat) | ✓ |
| Real-time ETA shown to rider before and during trip | ✓ |
| Surge pricing based on local supply/demand ratio | ✓ |
| Trip lifecycle: requested → accepted → in-trip → completed | ✓ |
| Payment processing on trip completion | ✓ (stub — not deep-dived) |
| Driver and rider ratings | ✓ (simplified) |
| Ride scheduling (future pickup) | Out of scope |
| Carpooling / shared rides | Out of scope |
Non-functional requirements
| NFR | Target | Why |
|---|---|---|
| Matching latency (ride request → driver notified) | < 5 seconds p99 | Riders abandon requests beyond 5–10 seconds; Uber targets < 3s |
| Location update latency (driver GPS → index) | < 2 seconds | Stale driver positions produce wrong ETA and missed matches |
| GPS write QPS | ~125K/s baseline (500K active drivers ÷ 4s cadence); up to 500K/s at global Uber scale (1M drivers, 2s cadence) | Every driver pings every 4 seconds; write volume scales with active fleet size, not ride count |
| Availability | 99.99% during peak hours | Downtime = no rides; direct revenue impact |
| Consistency (trip state) | Strong — one driver per trip | Double-booking a driver is a hard correctness failure |
| Read/write ratio (location store) | Write-dominant | Drivers write; matching reads; write >> read |
Why strong consistency for trip state but eventual for location? ›
Trip state (driver assignment, trip status) requires strong consistency because two riders matching the same driver is a hard business error — one trip must be rejected. Location data, by contrast, is continuously refreshed. A stale GPS position by 2 seconds is acceptable because the next heartbeat will correct it. This split allows you to use a strongly consistent RDBMS for trip records while using a high-throughput, eventually consistent spatial index (Redis) for location.
Why 4-second GPS heartbeat cadence specifically? ›
GPS heartbeat frequency is a battery-vs-accuracy trade-off on the driver's phone. 1 Hz (every second) drains battery significantly and produces location data faster than the spatial index can usefully act on. 10-second intervals make ETA predictions visibly laggy when a driver is approaching. 4 seconds is a common production compromise: fast enough for sub-10-second ETA refresh on the rider app, slow enough to not drain the driver's battery on an 8-hour shift.
Rate-limit ride requests and guard the driver fleet as a resource.
Without a per-rider rate limit on POST /rides/request, a bad actor can
send hundreds of requests per minute to effectively map the live driver fleet
(each request triggers a GEOSEARCH and exposes approximate driver density). Apply
a hard limit of 5 active ride requests per rider per hour at the API Gateway.
Additionally, the GET /pricing/estimate endpoint
(which also queries driver density per cell) should be token-bucket rate-limited
per IP to prevent scraping of surge heat-maps.
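A minimal sketch of how the gateway-side limit could be enforced with Redis, assuming a fixed-window counter keyed by rider ID (key names and the redis-py client are illustrative; a production gateway would more likely use a token bucket or sliding window):

```python
import redis

r = redis.Redis()

def allow_ride_request(rider_id: str, limit: int = 5, window_s: int = 3600) -> bool:
    # Fixed-window counter: at most `limit` ride requests per rider per hour.
    key = f"ratelimit:rides:{rider_id}"
    count = r.incr(key)          # atomic; creates the key at 1 if missing
    if count == 1:
        r.expire(key, window_s)  # start the window on the first request
    return count <= limit
```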
Capacity estimation
The dominant cost driver in ride-sharing is the GPS location update stream — not rides, not users, but the continuous position heartbeat from every active driver. Size this first; everything else is smaller by orders of magnitude. The location tier is write-heavy; the ETA tier is read-heavy; they need completely different storage and scaling strategies.
The key insight: The location index is tiny in storage terms — 500K drivers × 128 bytes each is only ~64 MB — but the write rate is enormous. At 125K writes/second sustained, the bottleneck is ingestion throughput and index update latency, not disk space. This is why GPS data lives in Redis (sub-millisecond in-memory writes), not Postgres. The trip database, by contrast, is large in storage but low-QPS — it writes once per trip and is read infrequently afterward.
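A back-of-envelope check of those numbers (the bytes-per-entry and per-event sizes are assumptions, not measurements):

```python
active_drivers = 500_000
heartbeat_s = 4
bytes_per_index_entry = 128   # member id + 52-bit geohash score + overhead (assumed)

write_qps = active_drivers / heartbeat_s                        # 125,000 writes/s
index_size_mb = active_drivers * bytes_per_index_entry / 1e6    # ~64 MB

gps_events_per_day = write_qps * 86_400                         # ~10.8 billion events/day
kafka_raw_tb_per_day = gps_events_per_day * 200 / 1e12          # ~2.2 TB/day at ~200 B/event

print(write_qps, index_size_mb, gps_events_per_day, kafka_raw_tb_per_day)
```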
High-level architecture
A ride-sharing platform has three core data flows: continuous driver GPS ingestion (high-write, spatially indexed), rider request matching (low-latency geospatial query + driver notification), and trip lifecycle management (strongly consistent state machine). The architecture separates these flows into a Location Service that maintains a real-time geohash index in Redis, a Dispatch Service that runs the matching algorithm and communicates with drivers via WebSocket, and a Trip Service that manages the booking state machine using a relational database with optimistic locking to prevent double-assignment.
Ride-sharing high-level architecture: GPS ingest → Redis spatial index → Dispatch matching → Trip DB state machine
Component breakdown
Location Service is the hottest service in the system by write volume. Every
active driver sends a GPS heartbeat every 4 seconds. The service receives these, validates them
(timestamp freshness, coordinate sanity), and upserts into a Redis geospatial index using
GEOADD. It also publishes a location-update event to Kafka for the analytics
pipeline. Critically, it does not interact with the relational trip database — that path
would be far too slow at 125K+ writes/second.
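A sketch of that hot path, assuming redis-py (4.x geoadd signature) and kafka-python; the field names, topic name, and validation thresholds are illustrative:

```python
import json
import time

import redis
from kafka import KafkaProducer

r = redis.Redis()
producer = KafkaProducer(value_serializer=lambda v: json.dumps(v).encode())

def handle_heartbeat(update: dict) -> None:
    # 1. Sanity checks: coordinates in range, timestamp fresh.
    #    (A speed-vs-last-ping check also belongs here; see the failure modes table.)
    if not (-90 <= update["lat"] <= 90 and -180 <= update["lng"] <= 180):
        return
    if time.time() - update["epoch_ts"] > 30:
        return  # stale ping, drop it

    # 2. Upsert into the spatial index: GEOADD takes (lng, lat, member).
    r.geoadd("drivers:location:available",
             (update["lng"], update["lat"], update["driver_id"]))

    # 3. Fan out to the analytics pipeline; the relational trip DB is never touched.
    producer.send("driver.location.updates", update)
```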
Dispatch Service handles the matching algorithm. When a rider submits a request, the Dispatch Service queries the Redis geospatial index for available drivers within a radius, scores candidates (distance, acceptance rate, vehicle type), picks the best match, and sends a push notification to the driver's active WebSocket connection. It then waits for the driver accept/reject response within a timeout window (typically 10 seconds), cascading to the next candidate if rejected.
What is Redis GEOADD? ›
Redis geospatial commands in one paragraph. Redis provides
GEOADD, GEODIST, and GEOSEARCH, all built
on sorted sets that encode latitude/longitude pairs as 52-bit geohash scores.
GEOADD key longitude latitude member upserts a named member (e.g.
a driver ID) at a geographic position.
GEOSEARCH drivers:location:available FROMLONLAT -73.9857 40.7484 BYRADIUS
5 km ASC COUNT 10 returns the 10 nearest members within 5 km, sorted by
distance, in a single O(N+log M) operation. Note: use FROMLONLAT
with the rider's raw coordinates here — FROMMEMBER would only apply
if the rider were already stored as a member of the same sorted set, which
they are not. Because all data is in memory and geohash encoding enables fast
spatial comparisons, latency is typically under 1 ms — and it's the only
data structure that can sustain 125K+ GPS writes per second alongside
sub-millisecond range queries.
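For reference, the same query through redis-py (the geosearch method wraps the GEOSEARCH command shown above; the key name matches the data model section):

```python
import redis

r = redis.Redis(decode_responses=True)

def nearby_drivers(lng: float, lat: float, radius_km: float = 5.0, limit: int = 10):
    # Nearest available drivers with their distance in km, closest first.
    return r.geosearch(
        "drivers:location:available",
        longitude=lng, latitude=lat,   # FROMLONLAT
        radius=radius_km, unit="km",   # BYRADIUS 5 km
        sort="ASC", count=limit,       # nearest first, COUNT 10
        withdist=True,
    )

candidates = nearby_drivers(-73.9857, 40.7484)
# e.g. [['drv_123', 0.42], ['drv_456', 0.97], ...]
```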
Trip Service owns the lifecycle of a trip: it creates the trip record, transitions its state (requested → driver_assigned → in_progress → completed → settled), and writes to PostgreSQL under row-level locking to prevent double-assignment. It emits trip events to Kafka for downstream payment, rating, and analytics consumers.
ETA and Pricing Service computes two things: estimated arrival time (a graph routing query against a road network graph, adjusted for live traffic) and the fare estimate (base rate × distance × time × surge multiplier). Surge is computed per geohash cell as the ratio of open ride requests to available drivers, with a configurable multiplier schedule. Both the road graph and surge multipliers are cached aggressively — ETA models run on pre-built routing graphs, not live map data per request.
WebSocket stickiness matters for dispatch. A driver's device maintains a persistent WebSocket connection to a specific Dispatch Service node. If the matching engine sends an accept/reject notification to the wrong node, the message is silently dropped. Solutions: consistent hashing (route all messages for driver D to the node holding connection D), or a pub/sub layer (publish to a Redis channel that the holding node subscribes to). The pub/sub approach is more resilient to node failures.
Architectural rationale
Why Redis for driver location, not Postgres? Storage choice ›
Postgres can do geospatial queries via PostGIS, but 125K upserts/second will saturate its write path and create index bloat as every driver update re-indexes a B-tree entry. Redis GEOADD is a sorted-set operation that completes in microseconds and handles the write rate comfortably. The trade-off is durability: if the Redis primary dies, you lose the last few seconds of position data. This is acceptable — drivers will re-send their GPS within 4 seconds, and the index rebuilds automatically.
Why WebSocket for driver notification, not push (FCM/APNs)? Communication model ›
Mobile push (FCM/APNs) has two problems for dispatch: latency and reliability. FCM delivery can take 1–30 seconds depending on battery optimization mode, and the platform provides no acknowledgement that the message was received by the app in time. For a 10-second accept window, a 5-second FCM delay is unacceptable. WebSocket connections from the driver app provide sub-100ms delivery and a clear acknowledgement path. The trade-off is connection overhead: 1M persistent WebSocket connections require a well-designed connection server (typically a stateful layer separate from the stateless matching logic).
Why PostgreSQL for the Trip DB? Consistency model ›
The trip assignment is a distributed locking problem: two concurrent rider requests
must not claim the same driver. Postgres row-level locking with
SELECT ... FOR UPDATE provides this guarantee natively.
The trip write QPS is low (rides/second, not hundreds of thousands), so Postgres is
not a throughput bottleneck. A NoSQL database would require building a distributed
lock on top — adding complexity without benefit.
Geospatial matching, the core algorithm
Geohash-based spatial indexing is the backbone of real-time ride matching. A geohash encodes arbitrary latitude/longitude coordinates into a compact alphanumeric string where common string prefixes denote geographic proximity. By indexing drivers by geohash cell in Redis (using GEOADD), the Dispatch Service can find all available drivers within a radius using a single GEOSEARCH command, then score them by distance, driver rating, and acceptance rate to pick the best match.
The matching problem has two parts: proximity search (which drivers are physically nearby?) and driver selection (among those, which one gets the trip?). They have very different characteristics: proximity search must scale to millions of drivers globally while being sub-millisecond, whereas driver selection involves business logic that runs on a small candidate set (typically 5–20 drivers).
Geohash-based proximity search + weighted scoring for driver selection
Geohash vs. quadtree
Geohash: string prefix = geographic proximity Spatial index ›
Geohash encodes coordinates into a base-32 string. Longer strings = smaller cells. Cells with the same prefix are geographically adjacent. This makes lookups trivially simple in a Redis sorted set — encode both the driver and rider position to the same precision and compare scores. The weakness is at cell boundaries: a rider at the edge of cell A and a driver 50 meters away in cell B won't have a common prefix, so you must also query the 8 neighboring cells. Redis GEOSEARCH handles this automatically via its internal geohash scoring.
Quadtree: adaptive subdivision for uneven density Spatial index ›
A quadtree recursively subdivides space into four quadrants until each cell contains at most N drivers (e.g. N=10). This means dense cities (Manhattan) get fine-grained cells while rural areas get coarse cells — matching their actual driver density. Uber uses a hexagonal grid (H3), which achieves similar adaptive behavior through its resolution levels and has better geometric properties (every cell has six equidistant neighbors, so there are no corner-adjacency ambiguities). Quadtrees are more complex to implement in Redis; for interview purposes, geohash is the correct first answer.
Edge case: no drivers in the search radius
Interviewer probe: "What happens if no drivers are within 5 km of the rider?" The naive answer is "expand the radius." The better answer is an exponential backoff search: query 2 km → 5 km → 10 km → 20 km with a timeout at each tier. Additionally, the UI should show real-time driver positions on the map so the rider knows whether to wait. For markets with structural supply shortages, the system should show a predicted pickup time once a driver completes a nearby trip, not just a "no drivers available" error.
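A sketch of the tiered expansion (search_fn stands in for the proximity query above; the tier values follow the text):

```python
RADIUS_TIERS_KM = [2, 5, 10, 20]

def find_candidates(lng: float, lat: float, search_fn, min_candidates: int = 3):
    # Widen the radius tier by tier until enough candidates are found.
    for radius_km in RADIUS_TIERS_KM:
        candidates = search_fn(lng, lat, radius_km)
        if len(candidates) >= min_candidates:
            return candidates
    return []  # structural shortage: fall back to the "predicted pickup time" UX
```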
API design
The API surface covers three primary actors: the rider requesting a trip, the driver sending GPS data and responding to offers, and the trip lifecycle mutations.
| Endpoint | Method | Description |
|---|---|---|
| POST /rides/request | REST | Rider submits pickup + dropoff coordinates. Returns ride_id and initial ETA + fare estimate. |
| GET /rides/{id}/status | REST (poll) / SSE | Returns trip state, driver position, and ETA. Riders poll this or receive via server-sent events. |
| POST /rides/{id}/cancel | REST | Rider or driver cancels the trip (with cancellation fee logic). |
| PUT /drivers/{id}/location | REST / WebSocket frame | Driver app pushes GPS coordinates. Upserted into Redis via GEOADD. |
| PUT /drivers/{id}/status | REST | Driver sets availability: available / busy / offline. |
| POST /drivers/{id}/offer/accept | WebSocket | Driver accepts or rejects a ride offer. Bidirectional WebSocket frame. |
| GET /pricing/estimate | REST | Returns fare estimate and current surge multiplier for a given origin cell. |
Request / response schemas
The two most interview-critical endpoints in detail. These are the shapes candidates are expected to sketch during a whiteboard session.
POST /rides/request: rider requests a trip ›
// Request
POST /rides/request
Content-Type: application/json
Idempotency-Key: <client-generated UUID> // ← dedup on double-tap
{
"pickup": { "lat": 37.7749, "lng": -122.4194 },
"dropoff": { "lat": 37.7863, "lng": -122.4102 },
"vehicle_type": "standard", // standard | xl | black
"idempotency_key": "550e8400-e29b-41d4-a716-446655440000"
}
// 200 OK — trip created or existing trip returned on duplicate key
{
"ride_id": "trip_abc123",
"status": "REQUESTED",
"eta_pickup_seconds": 240,
"fare_estimate_cents": 1450,
"surge_multiplier": 1.2,
"created_at": "2026-04-19T22:00:00Z"
}
// 429 Too Many Requests — rider rate-limited (> 5 active requests/hour)
// 409 Conflict — returned if an active trip already exists for this rider
Idempotency-Key header (or body field) is checked against a Redis key
with a 60-second TTL before creating the trip. A duplicate submission within the
window returns the existing trip ID without creating a second one. This handles the
common "double-tap" and "connection retry" cases that would otherwise flood the
matching engine with duplicate requests for the same rider.
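A sketch of that check, relying on the atomicity of SET NX EX (the key layout and the generate-the-ID-first flow are assumptions):

```python
import uuid

import redis

r = redis.Redis(decode_responses=True)

def create_or_reuse_trip(idempotency_key: str, ttl_s: int = 60) -> str:
    ride_id = f"trip_{uuid.uuid4().hex[:8]}"   # generated before the claim
    # SET NX EX is atomic: only the first request in the window wins the claim.
    if r.set(f"idem:{idempotency_key}", ride_id, nx=True, ex=ttl_s):
        # ... insert the trip row with this ride_id, enqueue matching ...
        return ride_id
    return r.get(f"idem:{idempotency_key}")    # duplicate: return the existing trip
```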
PUT /drivers/{id}/location: GPS heartbeat ›
// WebSocket frame (same connection used for offer receipt)
// Sent every 4 seconds from the driver app
{
"type": "location_update",
"driver_id": "drv_xyz789",
"lat": 37.7812,
"lng": -122.4130,
"heading": 270, // degrees, 0=north
"speed_kmh": 32,
"timestamp": "2026-04-19T22:00:04Z"
}
// Server ACK (sent back on same WS connection)
{
"type": "location_ack",
"status": "ok"
}
// If driver has been matched to a pending trip, ACK includes the offer:
{
"type": "ride_offer",
"ride_id": "trip_abc123",
"pickup": { "lat": 37.7749, "lng": -122.4194, "address": "Market St & 4th" },
"dropoff": { "lat": 37.7863, "lng": -122.4102, "address": "Union Square" },
"rider_name": "Alex R.",
"rider_rating": 4.8,
"fare_estimate_cents": 1450,
"expires_at": "2026-04-19T22:00:14Z" // 10-second accept window
}
Driver location updates should be sent over the same persistent WebSocket connection used for receiving ride offers. This amortizes connection setup cost across both GPS writes and offer pushes, and allows the server to detect driver disconnection (loss of heartbeat) without a separate polling mechanism.
Core flow: requesting a ride
The ride request flow is a cascade of real-time operations that must complete within the matching latency SLA. Each step has a failure mode and recovery path.
Ride request flow: rider tap → Redis proximity query → speculative offer via WebSocket → driver accepts → trip committed to DB → rider notified. Total wall time: < 5 seconds p99.
1. Rider app sends POST /rides/request with pickup/dropoff coordinates and vehicle type preference.
2. API Gateway authenticates the rider, checks rate limits, and forwards to Trip Service.
3. Trip Service creates a trip record in Postgres with status REQUESTED and returns the ride_id immediately.
4. Dispatch Service picks up the request (via queue or direct call), runs GEOSEARCH against the Redis location index to fetch drivers within 5 km, filters by availability/vehicle type, and scores the candidate set.
5. Dispatch Service pushes the ride offer to the top-scored driver (call them D9) via WebSocket. No DB write happens yet: the offer is speculative. The driver has 10 seconds to accept or reject.
6. Driver accepts. Only now does Dispatch Service write to the Trip DB: UPDATE trips SET driver_id = D9, status = 'driver_assigned' WHERE trip_id = T1 AND driver_id IS NULL. If 0 rows are updated (another request beat it), the driver reverts to available and the system cascades to the next candidate. Simultaneously, the driver's status is flipped to busy in both Postgres and Redis, removing them from the drivers:location:available index via ZREM. Dispatch Service then publishes a trip.assigned event to Kafka.
7. Rider app, polling GET /rides/{id}/status, receives the driver's name, vehicle plate, and live ETA.
8. On trip completion, Trip Service transitions state to COMPLETED and emits a trip.completed event with fare_cents, rider_id, and driver_id. The Payment Service consumes this, charges the stored payment method via Stripe, and emits payment.succeeded or payment.failed. On failure the trip enters COMPLETED_UNPAID pending automated retry and, if exhausted, manual support review.
Data model
Two fundamentally different stores: a relational database for transactional records (trips, drivers, riders), and an in-memory geospatial index for live driver positions. The split exists because these two datasets have opposite throughput and durability requirements — collapsing them into a single store is the most common L3/L4 design error on this question.
Trip state machine
The trip record is a state machine. Every valid transition is a database write; invalid transitions are rejected. Interviewers frequently ask "what states can a trip be in?" — sketch this before drawing the data model table.
Trip state machine — solid arrows are happy-path transitions; dashed red arrows are cancellation and payment failure paths
Relational schema (PostgreSQL)
| Table | Key columns | Notes |
|---|---|---|
| trips | trip_id PK, rider_id FK, driver_id FK (nullable), status, pickup_lat, pickup_lng, dropoff_lat, dropoff_lng, fare_cents, surge_multiplier, created_at, completed_at | Double-assignment is prevented by a partial unique index — CREATE UNIQUE INDEX active_trip_per_driver ON trips(driver_id) WHERE status NOT IN ('completed','cancelled') — so the DB rejects any second active trip for the same driver as a constraint violation, regardless of how many concurrent matchers attempt the write. |
| drivers | driver_id PK, status (available / busy / offline), vehicle_type, rating_avg, acceptance_rate_14d | Status is write-through to Redis. Rating and acceptance rate are rolling aggregates updated asynchronously. |
| riders | rider_id PK, rating_avg, payment_method_id, created_at | Payment method references a Stripe Customer ID — actual processing is delegated to Payment Service. |
| trip_events | event_id PK, trip_id FK, event_type, lat, lng, occurred_at | Append-only audit log of GPS waypoints during a trip (driver pings written every 30s by the analytics consumer, not in real time). |
Redis data structures
| Key pattern | Type | Purpose |
|---|---|---|
| drivers:location:available | Geo sorted set | All available drivers, indexed by lng/lat via GEOADD. The primary spatial index for matching. Updated on every GPS heartbeat. |
| driver:{id}:meta | Hash | Driver attributes needed by the scoring function: vehicle_type, rating_avg, acceptance_rate. TTL 60 s (refreshed periodically from Postgres). |
| surge:cell:{geohash6} | String (float) | Current surge multiplier for a geohash-6 cell. Updated by the Pricing Service every 30–60 seconds. TTL 120 s. |
| driver:{id}:ws_node | String | Which WebSocket server node holds this driver's connection. Used for routing offer messages. TTL matches connection lifetime. |
When a driver accepts a trip, they are removed from
drivers:location:available via ZREM and their status is
set to busy in a plain Redis String key
(driver:{id}:status). This keeps the proximity search result set clean
in steady state. However, the Location Service's GPS-heartbeat path must check this
status key before executing GEOADD — otherwise a heartbeat
arriving in the window between ZREM and the driver-status write will
silently re-add the busy driver back into the available index. The guard is a
single GET driver:{id}:status check; skip the GEOADD if the result is
busy.
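One way to make that guard atomic is a small Lua script, so a heartbeat cannot slip in between the status read and the GEOADD (a sketch; the key layout follows the paragraph above):

```python
import redis

r = redis.Redis()

# Skip the GEOADD entirely if the driver is marked busy.
GUARDED_GEOADD = r.register_script("""
if redis.call('GET', KEYS[1]) == 'busy' then
  return 0
end
return redis.call('GEOADD', KEYS[2], ARGV[1], ARGV[2], ARGV[3])
""")

def guarded_location_update(driver_id: str, lng: float, lat: float) -> int:
    return GUARDED_GEOADD(
        keys=[f"driver:{driver_id}:status", "drivers:location:available"],
        args=[lng, lat, driver_id],
    )
```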
Caching strategy
In a ride-sharing system, caching operates at three distinct layers with different TTLs and consistency requirements: the driver location index (in-memory, no TTL, continuously overwritten by GPS heartbeats), the driver attribute cache (1-minute TTL, refreshed from Postgres), and the surge multiplier cache (120-second TTL, refreshed by the pricing service every 30–60 seconds). The ETA result cache is the most nuanced — identical origin-destination pairs within the same geohash cell can share a cached ETA for 30 seconds without meaningfully degrading accuracy.
| What | Where | TTL | Invalidation |
|---|---|---|---|
| Driver live positions | Redis GEOADD (primary) | Always current (heartbeat overwrites) | GEOADD upsert on every GPS ping |
| Driver attributes (rating, vehicle type) | Redis Hash | 60 s | Time-based; refreshed from Postgres on miss |
| Surge multiplier per geohash cell | Redis String | 120 s | Written by Pricing Service every 30–60 s |
| ETA estimates (by route hash) | Redis String | 30 s | Time-based; stale ETAs skew toward over-estimate, safer than under-estimate |
| Road network graph | In-process memory (ETA service) | Never (static) | Shipped as artifact during deploy; live traffic overlaid at query time |
| Rider/driver profiles | Redis + CDN | 5 min | Invalidated on profile update event from Kafka |
Why is the road graph in-process, not Redis? Cache placement ›
The road network graph for a city is a large, read-only data structure (typically 200 MB–2 GB as a compressed adjacency graph with travel time weights). Loading it into Redis would create an enormous serialization overhead — every ETA request would require deserializing a graph that is orders of magnitude larger than the query result. Instead, ETA service pods load the graph on startup into process memory. Live traffic data (speed adjustments per road segment) is a much smaller overlay and can be cached in Redis or fetched via an HTTP cache from a traffic API.
How does surge pricing work computationally? Pricing model ›
Every 30–60 seconds, the Pricing Service scans all active geohash-6 cells in the system.
For each cell, it computes the supply/demand ratio: count of available drivers
vs. open ride requests in the past 5-minute window. When demand exceeds supply
by a threshold (e.g. demand/supply > 1.5), the surge multiplier activates.
The multiplier is capped (typically 5×–8× in production) and is applied to the
base fare at trip creation time, not at payment time. The computed multiplier is
written to surge:cell:{geohash6} in Redis with a 120-second TTL.
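A sketch of that sweep; the threshold, the linear ramp, and the way the demand/supply counts are obtained are all assumptions:

```python
import redis

r = redis.Redis()

def surge_multiplier(open_requests: int, available_drivers: int,
                     threshold: float = 1.5, cap: float = 5.0) -> float:
    supply = max(available_drivers, 1)          # avoid divide-by-zero in dead cells
    ratio = open_requests / supply
    if ratio <= threshold:
        return 1.0
    return min(cap, 1.0 + (ratio - threshold))  # linear ramp above the threshold

def sweep(active_cells: list[str], demand: dict, supply: dict) -> None:
    # demand/supply: per-cell counts from the last 5-minute window (source assumed)
    for cell in active_cells:
        m = surge_multiplier(demand.get(cell, 0), supply.get(cell, 0))
        r.set(f"surge:cell:{cell}", m, ex=120)  # 120 s TTL, as in the cache table
```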
Scalability deep dive
The GPS ingestion pipeline and the matching engine are the two hardest scaling problems here. They look similar on the surface — both handle ride-related data — but they have opposite characteristics and need completely different solutions.
Driver location privacy is a regulatory requirement, not just a product decision.
Driver GPS coordinates are never transmitted to riders at full precision: the Rider App
receives a coarse position (randomized within approximately 200 meters) until the driver
accepts the trip. After the trip, the exact route is retained for a maximum of 90 days
under GDPR and California CCPA. The trip_events table must support
right-to-erasure deletions; driver GPS history must be stored separately from trip records
and purged on schedule. In markets with stricter regulations (e.g., Germany), location data
may not be transmitted to servers outside the EU at all, requiring regional data
residency for the location index Redis cluster.
GPS location ingestion
At 125K–500K writes/second (depending on fleet size and heartbeat cadence), the Location Service is the highest-throughput component in the system. The scaling strategy:
- Shard Redis by city/region. A single Redis cluster can handle ~100K writes/second. Partition driver IDs across N clusters by city or geographic shard.
- Location Service is stateless. Any Location Service pod can write to any Redis shard. Add pods horizontally with the load balancer.
- Batch micro-writes. Instead of writing each GPS update instantly, buffer ~100 ms of incoming updates and batch-write them as a single Redis pipeline (see the sketch after this list). Reduces Redis RTT overhead without meaningfully degrading freshness.
- Kafka as a buffer. Route GPS updates through Kafka first; Location Service consumers write to Redis. Kafka absorbs burst spikes and provides replay capability for analytics.
- Airport queues. 500 drivers in a 1 km radius at SFO means a single geohash cell is extremely hot. Redis GEOADD on a single key from 500 concurrent writers can serialize. Mitigate by cell-level sharding or client-side location aggregation.
- Event surge. A stadium event ending adds 10K driver GPS pings simultaneously from a small area. Expect write spikes 5–10× normal for that cell. Design Redis capacity for peak, not average.
- Cascading match failures. If the Redis location index falls behind (stale positions), the Dispatch Service will send offers to drivers who have already moved away — increasing accept-window timeouts and cascading failures.
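A sketch of the micro-batching idea from the list above, assuming the flush is also driven by a ~100 ms timer that is omitted here:

```python
import redis

r = redis.Redis()

BATCH: list[tuple] = []   # (lng, lat, driver_id) accumulated for up to ~100 ms

def buffer_update(lng: float, lat: float, driver_id: str, max_batch: int = 500) -> None:
    BATCH.append((lng, lat, driver_id))
    if len(BATCH) >= max_batch:
        flush()

def flush() -> None:
    pipe = r.pipeline(transaction=False)   # plain pipeline, no MULTI/EXEC needed
    for lng, lat, driver_id in BATCH:
        pipe.geoadd("drivers:location:available", (lng, lat, driver_id))
    pipe.execute()                         # one network round trip for the whole batch
    BATCH.clear()
```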
Matching throughput
The Dispatch Service scales horizontally for normal load. The tricky case is matching under surge: many simultaneous ride requests from the same area competing for the same driver pool. Two techniques:
Lock-free optimistic assignment ›
Instead of acquiring a row lock on the driver record during matching (which serializes
all concurrent matches), use an optimistic update:
UPDATE trips SET driver_id=D9 WHERE trip_id=T1 AND driver_id IS NULL.
If two concurrent ride requests both try to claim D9, only one commits; the loser is rejected by the partial unique index on trips(driver_id) (see the data model) and retries with the next candidate, while the driver_id IS NULL predicate separately prevents the same trip from being assigned twice. This keeps the matching loop lock-free in the common case and only serializes on the rare double-claim.
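A sketch of the cascade around that statement, assuming psycopg and the trips schema above (the partial unique index is what actually rejects a double-claimed driver):

```python
import psycopg
from psycopg.errors import UniqueViolation

def assign_driver(conn, trip_id: str, candidates: list[str]) -> str | None:
    for driver_id in candidates:                # candidates already sorted by score
        try:
            with conn.cursor() as cur:
                cur.execute(
                    "UPDATE trips SET driver_id = %s, status = 'driver_assigned' "
                    "WHERE trip_id = %s AND driver_id IS NULL",
                    (driver_id, trip_id),
                )
                if cur.rowcount == 0:
                    conn.rollback()
                    return None                 # this trip was already assigned elsewhere
            conn.commit()
            return driver_id                    # claim committed
        except UniqueViolation:
            conn.rollback()                     # driver already on another active trip:
                                                # fall through to the next candidate
    return None
```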
Batch matching for high-density markets ›
Rather than matching each ride request independently, accumulate requests for 1–2 seconds and match them as a batch against the available driver pool. This allows the system to globally optimize assignments (minimize total wait time across all pending requests) rather than greedily assigning the nearest driver to each request in arrival order. Uber's dispatch engine does this for UberPool and surge scenarios. The trade-off is a 1–2 second latency penalty for the first request in each batch window.
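A sketch of the batch variant using a standard assignment solver (scipy's linear_sum_assignment); the ETA callback and the 1-2 second accumulation window are assumed to exist elsewhere:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def batch_match(requests: list, drivers: list, eta_seconds) -> list[tuple]:
    # eta_seconds(request, driver) -> estimated pickup time in seconds.
    # Rectangular matrices are fine: only min(len(requests), len(drivers)) pairs are made.
    cost = np.array([[eta_seconds(req, drv) for drv in drivers] for req in requests])
    req_idx, drv_idx = linear_sum_assignment(cost)   # minimizes total pickup wait
    return [(requests[i], drivers[j]) for i, j in zip(req_idx, drv_idx)]
```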
Failure modes and mitigations
Each component in this system has a distinct failure mode. Designing for these is the difference between L5 and L6 answers.
| Failure | Impact | Mitigation |
|---|---|---|
| Redis location index primary failure | Matching fails; no drivers visible | Redis Cluster with replica promotion. At most ~2 seconds of data loss. Location Service falls back to the Redis replica (read-only) and drops writes until promotion completes. |
| Driver WebSocket server failure | Offers drop for all drivers on that node | Driver apps reconnect immediately; the Redis driver:{id}:ws_node key expires automatically; Dispatch Service refreshes routing on reconnect. Trips in offer-pending state retry with the next candidate. |
| Dispatch Service crash during matching | Ride request stuck in REQUESTED state | Trip record has a last_match_attempt_at timestamp. A background re-queuer finds trips in REQUESTED state older than 15 s and re-enqueues them. Idempotent because driver assignment uses optimistic locking. |
| Driver accepts but Trip Service write fails | Driver sees trip; Trip DB has no driver assigned | Dispatch Service retries the assignment write with exponential backoff. If retries fail, it sends a "system error, please re-accept" message to the driver and reverts the trip to REQUESTED for re-matching. |
| GPS position dramatically incorrect (GPS spoof/jump) | Wrong driver offered; bad ETA | Location Service applies a sanity filter: reject updates where the new position implies speed > 200 km/h since the last ping. Flag coordinates for manual review. Apply Kalman filter smoothing to reduce noise from legitimate rapid updates. |
| Surge pricing service down | All prices revert to base rate | Pricing Service writes surge multipliers with a 120 s TTL. If the service is down, multipliers expire and the ETA/Pricing Service falls back to 1.0×. This is acceptable; it errs toward the rider, not the driver. |
Common interviewer follow-up: "How do you handle a driver who accepts a ride
but never arrives?" This is a product + systems problem. System-side: the trip emits a
driver_en_route event; if the driver's GPS position doesn't converge toward
the pickup within N minutes, a background job triggers an alert and the rider can cancel
penalty-free. Product-side: cancellation policies and fraud detection run on the
trip_events stream asynchronously.
How to answer by level
The same prompt — "Design Uber" — is given to L3 candidates and L7 candidates. What separates the answers is the depth of trade-off reasoning, not the number of buzzwords used.
| Level | Expected depth | Common gaps to avoid |
|---|---|---|
| L3/L4 | Identifies core components (rider app, driver app, server, maps API, payment), defines the trip state machine, and handles basic GPS storage. Google Maps for ETA is fine at this level. | Not identifying that GPS writes are the dominant throughput problem. Storing driver location in a relational table with a lat/lng index without noting the write volume. |
| L5 | Designs the Redis geospatial index, explains geohash precision trade-offs, designs WebSocket architecture with session routing, quantifies GPS write QPS and justifies the Redis choice, explains surge pricing mechanics. | Skipping the WebSocket routing problem (which Dispatch node holds which driver connection). Describing surge pricing without explaining how supply/demand is measured per cell. |
| L6 | Owns the full failure mode matrix, explains optimistic vs. pessimistic locking trade-offs for driver assignment, designs the ETA pipeline with map graph caching vs. live traffic overlays, discusses GPS spoofing detection, and draws the Kafka event topology for downstream consumers. | Missing the Redis cluster sharding strategy for high-density cells (airport queues). Not discussing how driver's acceptance rate feeds back into the scoring function. |
| L7/L8 | Addresses global multi-region deployment (city-isolated shards with a global control plane), discusses ML pipeline for ETA accuracy improvements (gradient boosting on historical trip time vs. naive map routing), batch dispatch optimization for UberPool, the economics of the supply-demand marketplace (driver incentive programs as a supply-shaping mechanism), and data residency constraints for international markets. | Treating the system as purely technical without the business model context. The surge pricing algorithm, driver incentive programs, and matching fairness policies are all coupled to the two-sided marketplace economics. |
Interview technique: Start with the data flows, not the components. Say: "There are three data flows here: continuous GPS writes from drivers, ride request and matching, and trip lifecycle management. They have very different throughput and consistency requirements, which is why they need separate components and storage." This framing signals to the interviewer that you understand why the system is complex, not just what components it has.
Real-world comparison
| Decision | This design | Uber | Lyft |
|---|---|---|---|
| Spatial index | Redis GEOADD (geohash) | H3 hexagonal grid + custom dispatch engine | S2 geometry library + Redis |
| Driver comms | WebSocket | QUIC / long-poll fallback | WebSocket |
| Trip DB | PostgreSQL + row locking | Schemaless (DocStore) + MySQL | MySQL with custom sharding |
| ETA model | Pre-built road graph + live traffic overlay | DeepETA (ML, trained on historical trips) | Google Maps API + fine-tuning layer |
| Surge pricing | Supply/demand ratio per geohash cell | ML-based surge with demand forecasting | Heat-map based, similar cell approach |
| GPS ingestion | Kafka → Location Service → Redis | Kafka + internal stream processing (Flink) | Kafka + Flink |
Uber's "DeepETA" replaced their map-routing ETA with an ML model trained on millions of historical trips. The ML model outperforms naive map routing by 26% on mean absolute error by capturing real-world factors (traffic patterns, construction, driver behavior) that static road graphs can't represent. This is an L7/L8 topic — you don't need to propose it unprompted, but if asked "how would you improve ETA accuracy?" it's the right answer.