System Design Interview Guide

Design a Booking & Reservation System like Airbnb

Simple to describe, hard to scale: availability search across millions of unique listings without ever double-booking a single night.

L3/L4: Build it · L5/L6: Own the tradeoffs · L7/L8: Drive the architecture

~35 min read · 11 sections

01

What the interviewer is testing

"Design Airbnb" sits in every FAANG loop for a reason: it compresses three distinct hard problems into a single question. You need a workable answer to all three, not a deep dive into just one.

Problem | Why it's hard | What changes at scale
Availability search | Filter millions of unique listings by date range + geo + attributes simultaneously | Search index must be decoupled from the booking DB; eventual consistency is expected
Double-booking prevention | Two guests book the same night concurrently; the window is milliseconds wide | Distributed locking + DB constraints; correctness must survive replica lag and partial failures
Consistency vs. availability | Listings, pricing, and calendars change constantly; reads must be fast but must not serve stale state that causes phantom bookings | Cache invalidation strategy; search snapshot freshness SLA
🎯

Level signal: L3/L4 candidates often describe the booking flow correctly but skip the reservation race condition entirely. L5 candidates identify the conflict window and propose one solution. L6 candidates reason about why that solution still has a failure window and add a second line of defence. L7/L8 candidates frame the tradeoff between lock duration, user experience, and inventory efficiency.

02

Requirements clarification

Start here before touching architecture. The NFR targets below drive every significant design decision in the article.

Functional requirements

Capability | In scope | Out of scope (for this interview)
Search | Geo-bounding-box, check-in/check-out dates, guest count, price range, basic amenities | ML ranking, personalisation, saved searches
Listing management | Hosts create/update listings, set availability calendar, set pricing | Smart pricing ML, dynamic weekend uplift
Booking | Guest reserves dates → host accepts (or instant-book) → confirmation + notification | Payments, payouts, tax handling
Cancellations | Guest or host cancels; calendar opens back up | Refund workflows, dispute resolution
Reviews | Post-stay review | Review fraud detection, host response

Non-functional requirements

NFR | Target | Why this level?
Search latency (p99) | < 300 ms | Users abandon search after ~400 ms; maps must feel live
Booking correctness | Zero structural double-bookings | A double-booked listing destroys trust for both parties; even a 1-in-a-million rate is unacceptable at Airbnb's volume
Availability freshness | Search reflects calendar changes within 60 s | Near-real-time is sufficient; guests expect a moment of lag between a host blocking dates and it appearing in search
Booking confirmation latency | < 2 s end-to-end (instant-book); async for request-to-book | Instant-book must feel synchronous; request-to-book can be async
Availability | 99.99% (search and booking critical paths) | Downtime during peak travel seasons (holidays, summer) is disproportionately costly
Scalability | ~150 M active listings; ~10 M bookings/day peak | Airbnb's reported scale as of 2024

NFR reasoning

Search latency < 300 ms (drives §4, §8, §9)

300 ms p99 on a filtered geo-query across 150 M listings is not achievable with a single relational DB scan. The only way to meet this is to pre-index availability data into a dedicated search index (Elasticsearch) that can answer geo+date+attribute queries in <50 ms, then layer a CDN and results cache on top.

Tradeoff: Decoupling search from the booking DB means search results may be slightly stale (up to 60 s). This is an explicit business decision: a listing appearing available in search but failing at booking 0.1% of the time is acceptable; >300 ms search latency is not.
Zero structural double-bookings (drives §5, §6, §7)

Unlike inventory systems that can tolerate a small oversell window (e.g., airline overbooking), a vacation rental double-booking requires real humans to be turned away from their accommodation. The business cost is severe: rehousing, lost trust, regulatory risk in some markets. The system must provide a structural guarantee, meaning it must be impossible for the data model to represent a double-booking, not just improbable.

What this drives: A unique constraint on (listing_id, date) in the booking database provides the structural guarantee. All application-level locking is defence-in-depth, not the primary protection.
Availability freshness within 60 s (drives §4, §8)

Strong consistency between the booking DB and the search index would require synchronous writes to two systems inside the booking transaction — adding latency and creating a failure coupling. Eventual consistency with a 60-second SLA is achievable via a CDC (change-data-capture) stream: booking events publish to a durable message queue (Kafka), and an indexer consumer updates Elasticsearch within seconds in steady state.

Tradeoff: If the indexer falls behind (e.g. after a restart), stale listings may appear available for longer than 60 s. Monitoring lag and alerting is the operational mitigation.
03

Capacity estimation

Airbnb's load profile is unusual: search traffic is enormous and spiky (every travel-season booking surge), but actual booking write volume is modest. The read:write ratio is extreme — roughly 500:1. This has a direct architectural implication: the system should be almost entirely read-optimised.

Capacity estimator (worked example; defaults: 10 M bookings/day, 500× read:write ratio, 1 KB/booking record, 5-year retention)

Metric | Estimate | Basis
Booking writes | ~115 writes/s (avg) | 10 M bookings ÷ 86,400 s
Peak booking writes | ~350 writes/s | 3× average
Search QPS | ~57,500 queries/s | 500× the booking write rate
Total bookings stored | ~18.25 B records | 10 M/day × 365 × 5 yr retention
Booking DB size | ~18 TB raw | 18.25 B records × 1 KB
Hot cache target | ~3.7 TB | recent 30 days ≈ 20% of DB
💡

Key insight: At 10 M bookings/day, the booking insert write rate is ~115/s — but each booking also updates N calendar_availability rows (one per night booked, typically 2–7). Actual DB write pressure is ~350–800 writes/second: still very manageable for a single well-tuned relational primary. The real challenge is the search side: 57,500 QPS of geo-filtered date-range queries. That's the number that demands a dedicated search cluster, not the booking write path.
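The arithmetic behind these numbers can be sketched in a few lines. This is a minimal back-of-envelope helper using the estimator's default inputs; `nights_per_booking=4` is an assumed midpoint of the 2–7 range above, and the unrounded search QPS comes out at ~57,900 (the article rounds to 57,500).

```python
SECONDS_PER_DAY = 86_400

def estimate(bookings_per_day=10_000_000, read_ratio=500,
             record_kb=1, retention_years=5, nights_per_booking=4):
    """Back-of-envelope capacity numbers from the estimator's default inputs."""
    booking_writes_s = bookings_per_day / SECONDS_PER_DAY        # ~115.7/s avg
    return {
        "booking_writes_s": round(booking_writes_s),             # ~116
        "peak_writes_s": round(booking_writes_s * 3),            # 3x average
        "calendar_writes_s": round(booking_writes_s * nights_per_booking),
        "search_qps": round(booking_writes_s * read_ratio),      # ~57,900
        "total_records_b": bookings_per_day * 365 * retention_years / 1e9,
        "db_size_tb": bookings_per_day * 365 * retention_years * record_kb / 1e9,
    }
```

Calendar writes land in the ~350–800/s band once each booking touches one row per night, which is the point of the insight above.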

Dimension | Estimate | Key insight
Search latency budget | 300 ms p99 | 50 ms Elasticsearch + 30 ms enrichment + 220 ms CDN/network
Availability calendar rows | ~55 B rows | 150 M listings × 365 days = one row per (listing, date)
Listing search index | ~150 M documents | ~2 KB/doc avg → ~300 GB Elasticsearch index (sharded)
Peak multiplier | 3–5× | Holiday travel season; system must handle 5× steady state
04

High-level architecture

The architecture splits into two clearly separated planes: the search plane (read-heavy, latency-critical, eventual consistency acceptable) and the booking plane (write-important, correctness-critical, synchronous). A change-data-capture stream connects them.

[Diagram: Client (web/app) → API Gateway (LB + auth) → Search API (geo + date filter), Booking API (reserve + confirm), Listing Service (host CRUD). Search plane: Elasticsearch listing index + Redis search-results cache. Booking plane: Listing DB (PostgreSQL) + Booking DB (PostgreSQL, sharded). Kafka CDC event stream → Indexer (ES sync consumer). Legend: solid = synchronous, dashed = async/CDC.]
Figure 1 — High-level architecture. Search plane (top) and booking plane (bottom) are decoupled; Kafka CDC keeps Elasticsearch in sync within ~60 s.

Component breakdown

API Gateway / Load Balancer is the single entry point for all client traffic. It handles TLS termination, authentication token validation, and routes requests to the appropriate downstream service. Because search and booking have very different latency characteristics, routing them to separate service fleets allows independent scaling.

Search API translates guest queries (bounding box, date range, guest count, filters) into Elasticsearch queries. It checks the Redis results cache first on high-traffic queries (popular destination + date combinations) before hitting Elasticsearch. Results come back as listing IDs with scores; the service then enriches them with real-time pricing from a pricing service.

Booking API is the correctness-critical core. It handles the reserve → lock → write sequence, enforces the double-booking constraint, triggers notifications, and communicates with a payment service (out of scope here). It talks only to the Booking database — never to the search index.

Listing Service manages host-facing CRUD operations: creating listings, setting availability calendars, updating photos and descriptions. Changes propagate to both the Listing DB and, via CDC, to Elasticsearch.

Elasticsearch (search index) holds a document per listing: geo-point, available date ranges (encoded as a bitset or range list), price, amenity flags. Both the Search API and the Indexer consumer interact with it. The index is not the source of truth — it's a read-optimised projection of the Listing and Booking databases.

Redis (search results cache) caches rendered search result pages for popular origin+date combinations (e.g., "Paris, 3–5 guests, July 4–10"). Cache keys include all filter parameters; TTL is short (~30 s) to bound staleness.

Booking DB (PostgreSQL, sharded by listing_id) is the source of truth for all reservations. It holds the booking records and the availability calendar table. The unique constraint on (listing_id, date) lives here. Sharding by listing_id co-locates all bookings for a listing on the same shard, making availability queries and locking efficient.

Listing DB (PostgreSQL) holds listing metadata: title, description, amenities, photos, host details, pricing tiers. Reads are served from read replicas; writes go to the primary.

Kafka event pipeline is implemented using the outbox pattern: the Booking API writes a row to an outbox table inside the same PostgreSQL transaction as the booking record. A Debezium connector reads from the outbox table via WAL and publishes typed semantic events (booking.confirmed, booking.expired, listing.updated) to Kafka. This is important: a raw Debezium CDC tap on the main tables produces low-level row-mutation events — not domain events. The outbox pattern bridges the gap, ensuring typed application events are delivered reliably without coupling the booking transaction to Kafka availability. The Indexer consumer subscribes and drives Elasticsearch updates. Other consumers (notification service, analytics) can subscribe independently without coupling to the booking write path.

Indexer is a stateless consumer that reads from Kafka and writes to Elasticsearch. It handles partial failures gracefully: if it crashes, it replays from its last committed Kafka offset. This is why Kafka's durable log is essential — it decouples the indexer's liveness from the booking transaction.

Notification Service is a separate Kafka consumer on the same booking event stream. On a booking.confirmed event it fans out a push notification to the host (mobile push via FCM/APNs, with email fallback). On a booking.expired event it notifies the guest. Delivery is at-least-once: if the consumer crashes after processing but before ACKing the Kafka offset, it re-processes the event on restart — notifications suppress duplicates using a notification_id keyed on (booking_id, event_type) stored in Redis with a 7-day TTL (matching Kafka's default retention; a 24 h TTL would allow duplicates if a consumer replays events older than one day).
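The dedup logic described above can be sketched as follows; an in-memory dict stands in for Redis, and the key format and TTL are the ones named in the text (the class and method names are illustrative):

```python
import time

class NotificationDeduper:
    """Suppresses duplicate sends under at-least-once Kafka delivery.
    A dict stands in for the Redis store described above."""
    def __init__(self, ttl_s=7 * 24 * 3600, clock=time.time):
        self._seen = {}            # notification_id -> expiry timestamp
        self._ttl = ttl_s          # 7 days, matching Kafka retention
        self._clock = clock

    def should_send(self, booking_id, event_type):
        key = f"{booking_id}:{event_type}"        # notification_id
        now = self._clock()
        expiry = self._seen.get(key)
        if expiry is not None and expiry > now:   # replayed event within TTL
            return False
        self._seen[key] = now + self._ttl         # first delivery: record it
        return True
```

A replayed `booking.confirmed` for the same booking is dropped, while a different event type for the same booking still goes out.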

Architectural rationale

Why separate the search plane from the booking plane? (Core tradeoff)

Search runs at ~500× the write rate and needs <300 ms latency across 150 M listings. A relational database scan cannot provide this. Elasticsearch's geo-spatial indexing and inverted indexes on availability ranges reduce search time to ~50 ms. But running searches against the live booking DB would block write transactions and vice versa: the workloads are incompatible.

Separation means the booking transaction only writes to a single relational shard — a simple, fast, ACID operation. The search index updates asynchronously within 60 s. The cost is eventual consistency in search results: a listing that was just booked may still appear available in search for up to 60 s. This is an explicit product decision.

Tradeoff: If the Indexer consumer falls behind (e.g., Kafka consumer lag spikes), search freshness degrades beyond 60 s. Operational monitoring on consumer lag is required.
Alternatives considered: synchronous dual-write; polling-based sync.
Why shard the Booking DB by listing_id? (Sharding key)

All writes and reads for a listing's availability are scoped to that listing. Sharding by listing_id co-locates them on one shard: the availability calendar range scan, the conflict check, and the booking insert all hit the same shard — no cross-shard transactions needed.

Hot spot risk: A very popular listing (e.g., a Paris apartment during Fashion Week) gets all its booking attempts on one shard. The distributed lock in §5/§6 serialises these: the DB is not overwhelmed, but the lock becomes a bottleneck. Mitigated by short lock duration (~200 ms) and exponential backoff retry.
Alternatives considered: shard by user_id; shard by region.
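The exponential-backoff retry mentioned as the hot-spot mitigation can be sketched as below. The base, cap, and "full jitter" strategy are illustrative choices, not values from the article:

```python
import random

def backoff_delays(base_ms=50, cap_ms=2000, attempts=5, rng=random.random):
    """Exponential backoff with full jitter for retrying a contended lock.
    Jitter spreads retries so competing clients don't thunder back in sync."""
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap_ms, base_ms * (2 ** attempt))  # 50, 100, 200, ...
        delays.append(rng() * ceiling)                   # uniform in [0, ceiling)
    return delays
```

The cap keeps a guest's worst-case wait bounded; without jitter, all rejected contenders for a hot listing would retry at the same instant and recreate the burst.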
Why Kafka for CDC rather than direct DB-to-Elasticsearch sync? (Decoupling)

Direct sync (the Indexer writes to ES inside the booking transaction) would mean a slow or failed ES write blocks the booking response. Kafka decouples the two: the booking transaction commits to the DB alone (fast, local), then publishes an event (sub-millisecond). The Indexer consumes independently — at any rate, and with retry on failure, without affecting booking correctness.

Alternatives considered: outbox pattern + polling; direct ES write in txn.

Real-world comparison

Decision | This design | Airbnb (reported) | Booking.com
Search index | Elasticsearch (geo + date-range) | Custom Solr → Elasticsearch migration | Elasticsearch-based
Availability store | PostgreSQL (one row per listing-day) | Service-oriented; MySQL-based at launch | Relational, denormalised per room type
Conflict prevention | DB unique constraint + distributed lock (Redis) | DB-level constraints; application-level locking | Optimistic concurrency + DB constraint
Search ↔ booking sync | Kafka CDC → Elasticsearch consumer | Event-driven; eventual consistency accepted | Near-real-time sync; stricter SLA on heavily booked hotels
Inventory model | Binary (one listing, one booking per date) | Binary per unit | Count-based per room type
🌐

There is no universally correct inventory model: the binary-per-listing model is right for unique vacation rentals; a count-based model is right for hotel chains where 50 identical rooms exist. The right choice follows directly from the uniqueness of the listing.

05

Core algorithm — availability encoding

Before designing the booking flow, you have to answer a deceptively simple question: how do you represent whether a listing is available on a given date? The answer shapes the conflict detection, the search index document structure, and the calendar update logic.

The encoding choice is constrained by two conflicting requirements established in §2 and §4: the booking DB needs atomic row-level locking (rules out bitset compare-and-swap for multi-day ranges), and the search index needs O(1) bitwise date-range evaluation (rules out scanning one relational row per listing per day at 150 M listing scale). No single encoding satisfies both, which is why the final design uses two different representations in two different systems.

There are three main approaches. All eventually appear in the discussion — but they're not equivalent, and the right choice depends on which operation needs to be fastest.

[Diagram: ① Row-per-day — one row per (listing_id, date); range scan O(n days), atomic update O(1), unique constraint ✓ (our choice for the booking DB). ② Booked ranges — one row per (listing_id, start, end); fewer rows and O(log n) overlap lookup, but atomic updates and overlap checks are complex. ③ Bitset — one bit per day-of-year per listing (365 bits ≈ 46 bytes/yr); O(1) bitwise range scan, search-index friendly, but atomic updates require CAS.]
Figure 2 — Three availability encoding approaches. ① Row-per-day is our choice for the booking DB; ③ Bitset is used in the Elasticsearch index for fast range queries.

Our choice for this system: Use row-per-day (①) in the booking database and bitset (③) in the Elasticsearch search index. They serve different purposes. The relational row-per-day model gives us the unique constraint enforcement and simple atomic UPDATE that make double-booking structurally impossible. The bitset in Elasticsearch lets the search index quickly evaluate "are all nights of this 7-night trip available?" with a bitwise AND, without scanning multiple rows per listing per query.
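The bitwise-AND check the search index performs can be sketched in a few lines of Python (arbitrary-precision ints stand in for the ES-side bitset; the day-of-year bit mapping and a single 365-bit year are illustrative simplifications that ignore leap years and stays crossing year boundaries):

```python
def bitset_from_booked(booked_days):
    """365-bit availability bitset: bit i set = day-of-year i is free."""
    bits = (1 << 365) - 1          # start fully available
    for d in booked_days:
        bits &= ~(1 << d)          # clear each booked day
    return bits

def all_nights_free(bits, start_day, nights):
    """Evaluate 'are all nights of this trip free?' with one bitwise AND."""
    mask = ((1 << nights) - 1) << start_day   # contiguous run of wanted nights
    return bits & mask == mask
```

One AND plus one comparison replaces a per-night row scan, which is what makes the encoding viable inside a per-document search filter.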

Implementation sketch: availability check + claim
-- Booking DB schema (one row per listing-day)
CREATE TABLE calendar_availability (
  listing_id  BIGINT      NOT NULL,
  stay_date   DATE        NOT NULL,
  available   BOOLEAN     NOT NULL DEFAULT TRUE,
  booking_id  BIGINT,
  PRIMARY KEY (listing_id, stay_date)   -- unique constraint
);

-- Prerequisite: calendar_availability is pre-seeded with one row per date
-- per listing when the host activates the listing (available = TRUE for each date).

-- Atomic availability claim (all-or-nothing for the date range)
BEGIN;
  -- Step 1: Lock AVAILABLE rows for the date range
  -- ($2 = check-in, $3 = last night, i.e. check_out - 1 day).
  -- PostgreSQL rejects FOR UPDATE combined with an aggregate,
  -- so lock the rows in a subquery and count them outside it.
  SELECT count(*) FROM (
    SELECT 1 FROM calendar_availability
    WHERE listing_id = $1
      AND stay_date BETWEEN $2 AND $3
      AND available = TRUE
    FOR UPDATE
  ) AS locked;
  -- If count < requested nights → one or more dates unavailable; ROLLBACK + 409

  -- Step 2: Mark dates as booked
  UPDATE calendar_availability
  SET available = FALSE, booking_id = $4   -- $4 = the new booking's id
  WHERE listing_id = $1
    AND stay_date BETWEEN $2 AND $3;

  -- Step 3: Insert booking record (with outbox entry for Kafka)
  INSERT INTO bookings (...) VALUES (...);
  INSERT INTO outbox (event_type, payload) VALUES ('booking.confirmed', ...);
COMMIT;

The SELECT FOR UPDATE acquires row-level locks on the available = TRUE calendar rows for the date range, serialising concurrent requests for the same listing. If the row count is less than the number of requested nights, at least one date is already booked and the transaction rolls back. Combined with the PRIMARY KEY unique constraint, this is structurally double-booking-proof. Note that this requires the calendar to be pre-seeded: when a host activates a listing, one row per date is inserted into calendar_availability with available = TRUE. Without pre-seeding, missing rows would return an empty result set and the conflict check would silently pass.
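The all-or-nothing semantics, including the pre-seeding requirement, can be modelled in-memory. This is a single-threaded sketch (the class and return shapes are illustrative); in production the atomicity comes from the DB transaction above, not from application code:

```python
from datetime import date, timedelta

class Calendar:
    """In-memory model of the all-or-nothing claim in the SQL above."""
    def __init__(self, listing_id, start, days):
        # Pre-seed one entry per (listing, date); None means available.
        self.rows = {start + timedelta(i): None for i in range(days)}

    def claim(self, booking_id, check_in, check_out):
        nights = [check_in + timedelta(i)
                  for i in range((check_out - check_in).days)]
        # Conflict check: every night must exist (pre-seeded) and be free.
        # A missing row counts as a conflict -- mirroring why pre-seeding matters.
        conflicts = [d for d in nights
                     if self.rows.get(d, "missing") is not None]
        if conflicts:
            return ("conflict", conflicts)        # maps to HTTP 409
        for d in nights:                          # mark booked (atomic in the DB)
            self.rows[d] = booking_id
        return ("confirmed", [])
```

A second guest overlapping even one night gets back exactly the conflicting dates, matching the 409 payload shown in the API section.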

💬

One open question: What about partial availability? If a guest requests 7 nights and nights 3 and 4 are blocked, the system returns a conflict. Should it suggest the nearest available alternate window? This is a product feature (smart date suggestions), not a correctness concern, and is typically handled in the search layer pre-booking — presenting only windows where all nights are available.

05b

API design

The two most critical endpoints are search and booking creation. Both are guest-facing and sit on the hot path.

POST /search/listings

Use POST (not GET) for search because the request body can be large (complex filters, bounding-box polygons), and encoding it in a GET query string is unwieldy and runs into practical URL-length limits in proxies and CDNs. Note: the response can still be cached aggressively by the application layer using the request body as a cache key: the Redis search cache in §8 keys on a SHA-256 hash of the full request body, which is why the CDN's inability to cache POST responses is not a problem here.
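One subtlety the cache-key approach needs is canonicalisation: two bodies with the same fields in different order should hash to the same key. A minimal sketch (the `search:` prefix and sort-keys choice are illustrative assumptions):

```python
import hashlib
import json

def search_cache_key(request_body: dict) -> str:
    """Canonicalise the POST body, then SHA-256 it, as described above.
    sort_keys + compact separators make the key order-insensitive."""
    canonical = json.dumps(request_body, sort_keys=True, separators=(",", ":"))
    return "search:" + hashlib.sha256(canonical.encode()).hexdigest()
```

Without canonicalisation, identical searches from different clients would miss the cache whenever their JSON serialisers ordered fields differently.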

// Request
{
  "geo": {
    "type": "bounding_box",
    "top_left": { "lat": 48.92, "lon": 2.25 },
    "bottom_right": { "lat": 48.81, "lon": 2.42 }
  },
  "check_in": "2024-07-04",
  "check_out": "2024-07-11",
  "guests": 3,
  "filters": {
    "max_price_per_night": 250,
    "amenities": ["wifi", "kitchen"],
    "property_type": ["entire_place"]
  },
  "page_token": "eyJ..."  // cursor-based pagination
}

// Response 200 OK
{
  "results": [
    {
      "listing_id": "L-8823771",
      "title": "Cosy studio near the Marais",
      "price_per_night": 189,
      "total_price": 1323,
      "rating": 4.87,
      "review_count": 142,
      "location": { "lat": 48.858, "lon": 2.361 },
      "thumbnail_url": "https://cdn.example.com/...",
      "availability_snapshot_age_s": 34  // freshness signal
    }
  ],
  "next_page_token": "eyJ...",
  "total_count": 847
}

POST /bookings

Idempotency key is mandatory — mobile clients retry on network failures, and a missing idempotency key would create duplicate bookings. The key should be client-generated (UUID) and stored server-side with a 24-hour TTL. Repeated calls with the same key return the original response.
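The server-side behaviour can be sketched as below; a dict stands in for the 24-hour TTL store, and the names are illustrative. The essential property is that the booking logic runs at most once per key:

```python
class IdempotencyStore:
    """Replays the original response for retried requests with the same key."""
    def __init__(self):
        self._responses = {}       # idempotency_key -> first response

    def execute(self, key, create_booking):
        if key in self._responses:            # retry: replay, don't re-book
            return self._responses[key]
        response = create_booking()           # first attempt only
        self._responses[key] = response
        return response
```

A mobile client that times out and retries gets the original 201 body back instead of a second booking (or a spurious 409 against its own reservation).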

// Request — Idempotency-Key header required
// POST /bookings
// Idempotency-Key: 550e8400-e29b-41d4-a716-446655440000
{
  "listing_id": "L-8823771",
  "check_in": "2024-07-04",
  "check_out": "2024-07-11",
  "guests": 3,
  "guest_message": "Looking forward to our trip!",
  "price_quote_id": "PQ-991234"  // locks in the price shown at search time
}

// Response 201 Created (instant-book) / 202 Accepted (request-to-book)
{
  "booking_id": "BK-44123890",
  "status": "confirmed",  // or "pending_host_approval"
  "listing_id": "L-8823771",
  "check_in": "2024-07-04",
  "check_out": "2024-07-11",
  "total_price": 1323,
  "confirmation_code": "HMAB4C"
}

// Conflict: 409 Conflict
{
  "error": "dates_unavailable",
  "message": "One or more requested dates are no longer available.",
  "conflicting_dates": ["2024-07-06", "2024-07-07"]
}

Optional endpoints by level

Endpoint | Purpose | Level
GET /listings/:id/availability | Full calendar view for a listing (host or guest) | L3/L4
PUT /listings/:id/calendar | Host blocks / unblocks dates in bulk | L3/L4
DELETE /bookings/:id | Cancel a booking; rolls back calendar availability | L3/L4
GET /bookings/:id/status (polling) | Request-to-book status polling | L5
POST /search/listings/count | Map view: count of available listings per geo-cell | L5
POST /bookings/:id/extend | Extend a stay in progress (partial overlap variant) | L7/L8
POST /external/ical-sync | Trigger an iCal pull from an external OTA calendar URL; updates availability for cross-listed properties. iCal is polled every 15–60 min (platform-configurable). Conflict on overlap: Airbnb booking wins, external event triggers host notification. | L5/L6
06

Core flow — search & book

The system has two distinct critical paths. The search path must be fast (the <300 ms NFR from §2 means caching is mandatory, not optional). The booking path must be correct (the double-booking-prevention NFR from §2 means the distributed lock and DB transaction are mandatory, not optional).

[Diagram: Search path — guest submits search (geo + dates + filters) → Redis lookup, key = hash(geo + dates + filters); HIT → return cached; MISS → Elasticsearch query (geo_bounding_box + date bitset filter) → enrich with pricing + ratings (join listing metadata) → write to Redis (TTL 30 s) → return results. Booking path — guest selects listing + dates → acquire distributed lock (Redis SETNX, key = (listing + dates), TTL 200 ms; FAIL → 409) → BEGIN transaction → SELECT FOR UPDATE on calendar rows → availability check (any unavailable date → ROLLBACK + 409) → UPDATE calendar + INSERT booking → COMMIT (DB constraint enforces uniqueness) → release lock → publish event to Kafka (async: notify host, trigger indexer sync) → return booking_id + confirmation.]
Figure 3 — Left: search path with Redis cache hit/miss branch. Right: booking path with two-layer conflict prevention (distributed lock + DB transaction). Dashed arrows = async.
⚠️

The lock-then-check pattern matters: the distributed lock in Redis is acquired before the database transaction starts. This prevents the thundering-herd scenario where 50 concurrent requests for the same popular listing all enter database transactions simultaneously, causing heavy lock contention on the same rows. The Redis lock filters out all but one concurrent request cheaply. The DB transaction is the safety net — structurally impossible to fail due to a concurrent booking for the same dates.
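The SETNX-with-TTL semantics can be sketched in-memory (an injectable clock stands in for Redis key expiry; names and the 200 ms default are from the text, the class shape is illustrative):

```python
import time

class TTLLock:
    """SETNX-style lock: first writer wins, expiry frees a stuck key."""
    def __init__(self, clock=time.monotonic):
        self._locks = {}           # key -> expiry time
        self._clock = clock

    def acquire(self, key, ttl_s=0.2):
        now = self._clock()
        expiry = self._locks.get(key)
        if expiry is not None and expiry > now:
            return False                        # held: caller gets 409 / retries
        self._locks[key] = now + ttl_s          # SETNX semantics
        return True

    def release(self, key):
        self._locks.pop(key, None)
```

The TTL is the crash-safety valve: if the holder dies mid-transaction, the key expires and the DB constraint, not the lock, remains the correctness guarantee.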

📌

Request-to-book uses a hard hold: when a booking is created in pending_host_approval state, the calendar rows are marked available = FALSE immediately — in the same atomic transaction as the booking record insert. This prevents other guests from booking the same dates while waiting for host approval. The hold is released transactionally on host rejection or expiry (see §10). The alternative — a soft hold where the calendar stays available during the pending window — is simpler but creates a poor guest experience when the host accepts a request for dates the guest has already reserved elsewhere.

Tradeoff: distributed lock duration vs. booking latency

The lock TTL of 200 ms is tight by design. If the lock is held longer (e.g., 2 s), competing requests are blocked for 2 s — visible as high booking latency. If the lock TTL is too short, the lock may expire before the DB transaction commits — but the DB's SELECT FOR UPDATE still serialises the second writer: it blocks until the first transaction commits or rolls back, and the unique constraint on (listing_id, stay_date) catches any race that slips through. The Redis lock is defence-in-depth, not the primary guard; the DB is the last line of defence.

07

Data model

The data model is shaped by access patterns — not by what's convenient to normalise. Access patterns come first.

Access patterns

Operation | Frequency | Query shape
Search by geo + dates + filters | Very high (500× booking rate) | Multi-dimensional filter → Elasticsearch, not relational DB
Check listing availability for date range | High (every booking attempt) | Range scan on (listing_id, date) — clustered index
Claim availability (book) | Moderate | Range UPDATE — must be atomic with conflict check
Host views own calendar | Low–moderate | Scan all rows for a listing; returned as a month view
Guest views booking history | Low | Lookup by user_id; small result set
Display listing detail page | High | Single listing fetch — heavily cached at CDN

Two things stand out. First, the availability check is the hottest write-path query and needs to be both fast (range scan) and atomic (must combine check + claim in one transaction). This drives the row-per-day calendar model. Second, search never touches these tables directly — it reads from Elasticsearch. Those two planes must stay separate.

[Diagram: core tables]
users: user_id (PK), email, name, role ENUM, created_at TIMESTAMPTZ
listings: listing_id (PK), host_id (FK → users), title, lat/lon FLOAT, price_usd INT (cents), amenities JSONB, …
calendar_availability: listing_id (PK, shard key), stay_date (PK), available BOOL DEFAULT TRUE, booking_id (FK); PK (listing_id, stay_date) = structural double-book guard
bookings: booking_id (PK), listing_id (FK, shard key), guest_id (FK → users), check_in DATE, check_out DATE, guests SMALLINT, status ENUM (confirmed, …), total_usd INT (cents), idempotency_key UUID (unique)
Key indexes: bookings (guest_id, check_in); bookings (listing_id, status); bookings UNIQUE (idempotency_key); calendar partial index on available = TRUE; listings GIN index on amenities JSONB
Figure 4 — Core data model. The PK on calendar_availability(listing_id, stay_date) is the structural double-booking guard. Bookings and calendar rows are co-sharded by listing_id.
📋

Schema note — required timestamp columns on bookings: the diagram omits three columns that are required in practice: created_at TIMESTAMPTZ NOT NULL DEFAULT NOW() (used by the §10 expiry job query), updated_at TIMESTAMPTZ NOT NULL (audit trail for GDPR and financial records), and expires_at TIMESTAMPTZ (the explicit expiry deadline for pending_host_approval bookings — cleaner than computing created_at + INTERVAL '24 hours' at query time). These are omitted from the diagram for space, but must be present in the physical schema.

Why store price as INT (cents) not DECIMAL?

Integer arithmetic for money avoids floating-point rounding errors entirely. Storing cents as BIGINT (or INT for sub-$21M values) is the standard practice at any fintech-adjacent system. The application layer converts to dollars for display. This also makes the column trivially indexable and sortable without cast overhead.

Why store amenities as JSONB rather than a normalised junction table?

Amenity sets are append-only and rarely queried relationally (you never need "give me all listings that added wifi after a specific date"). A GIN index on a JSONB column supports containment queries (amenities @> '["wifi","kitchen"]') with good performance. Adding a new amenity type requires no schema migration. The tradeoff is that multi-amenity filter queries are slower than indexed columns for high-cardinality attributes, which is why heavy amenity filtering is pushed to Elasticsearch, not the Listing DB.

08

Caching strategy

The caching hierarchy is not one-size-fits-all. Search results have different staleness budgets and access patterns than listing detail pages or availability calendars. Each cache layer exists for a specific reason, anchored to the architecture in §4.

[Diagram: Client → CDN (static assets, listing images, TTL 24 h) → Redis search cache (rendered result pages, key = hash(query params), TTL 30 s) → Redis listing cache (listing detail + price, key = listing_id, TTL 5 min, invalidated on update) → Elasticsearch (availability index, eventual ≤60 s; a projection, not a true cache) → Booking DB (source of truth).]
Figure 5 — Cache hierarchy anchored to the §4 architecture. Each layer caches a different artifact with a different freshness SLA.

Cache layer breakdown

Layer | What it caches | TTL | Why it exists | Invalidation
CDN (e.g., Cloudflare) | Listing photos, static assets | 24 h | Images are the dominant bandwidth — CDN cache-hit rates >95% cut bandwidth costs dramatically | Versioned URLs on photo upload (cache-busting)
Redis — search results | Rendered search result pages (list of listing IDs + metadata) | 30 s | Popular destination+date combos are requested continuously; running the same Elasticsearch query 1000×/min is wasteful | TTL expiry only (too granular to invalidate on booking)
Redis — listing detail | Full listing document (title, amenities, price, photos) | 5 min | Listing detail page load is the hottest read; DB read replicas can't handle 50k QPS alone | On-write invalidation when host edits listing
Elasticsearch (search index) | Availability + geo + attribute snapshot | Eventual (≤60 s) | A read-optimised projection, not a traditional cache — it exists because the booking DB can't answer geo+date queries at this scale | CDC → Kafka → Indexer consumer writes new document
⚠️

Don't cache availability for booking decisions. The Redis listing cache is for display purposes only — showing a listing's details on the detail page. Never use a cached availability result as the basis for a booking decision. Always read from the Booking DB with a SELECT FOR UPDATE for the actual reservation attempt. A stale cache could show "available" when the listing is already booked.

09

Deep-dive scalability

At peak scale (holidays, summer surge), traffic multiplies 3–5× across all planes simultaneously. The architecture must handle this without reconfiguration.

[Diagram: CDN PoPs → global LB (BGP anycast) → Search API fleet (N stateless pods, auto-scaled by CPU/QPS) and Booking API fleet (N stateless pods, auto-scaled by QPS). ES cluster: 12 shards × 3 replicas, geo-distributed. Redis cluster: consistent hashing, 6 shards × 2 replicas. Booking DB: 256 shards by listing_id, each primary + 2 sync replicas. Kafka cluster: 32 partitions keyed listing_id % 32, 7-day retention. Indexer fleet: 32 consumers (one per partition), idempotent ES upserts.]
Figure 6 — Production-scale architecture. All stateless tiers auto-scale horizontally. Booking DB sharded by listing_id. Dashed = async CDC pipeline.
Geo-sharding for global coverage (L5+)

Airbnb operates globally. A single-region deployment would add 200–300 ms round-trip latency for users in Asia or Europe. The solution is to deploy read replicas of the Listing DB and Elasticsearch clusters in each major region (US, EU, APAC). Write operations for bookings still go to a primary region (chosen by listing location) but search reads are served from the nearest replica.

Tradeoff: Cross-region replication adds replica lag on top of the existing 60-second availability freshness SLA: roughly 60–150 ms in steady state, but tens of seconds during peak load. The effect is additive, so total freshness in non-primary regions can reach 90–210 s at peak.
Calendar pre-generation for popular listings (L5+)

For the top-1% most-viewed listings, proactively generate and cache the full 12-month availability calendar as a pre-rendered JSON blob in Redis. This eliminates the Elasticsearch query for the "does this listing have any availability in my date range?" check on the listing detail page, the most common guest interaction before booking. TTL: 5 minutes, invalidated on any calendar change event via the Kafka pipeline. Note that invalidation arrives asynchronously through the same CDC pipeline, so this pre-generated cache has the same ≤60 s freshness budget as the search index — not a stronger guarantee. It is a read-path latency optimisation only and must never be used for booking decisions.
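A minimal in-process sketch of this cache behaviour, assuming a simple put/get interface (production would use Redis SETEX plus a DEL issued by the Kafka consumer; all names here are illustrative):

```python
import time

class CalendarBlobCache:
    """Sketch of the pre-generated calendar cache (production: Redis).

    Entries expire after ttl seconds and are also evicted when a
    calendar-change event arrives via the CDC pipeline, whichever
    happens first. Names are illustrative, not a real API.
    """

    def __init__(self, ttl: float = 300.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock          # injectable for testing
        self._store = {}            # listing_id -> (expires_at, blob)

    def put(self, listing_id: int, blob: str) -> None:
        self._store[listing_id] = (self.clock() + self.ttl, blob)

    def get(self, listing_id: int):
        entry = self._store.get(listing_id)
        if entry is None:
            return None
        expires_at, blob = entry
        if self.clock() >= expires_at:
            del self._store[listing_id]   # TTL expired
            return None
        return blob

    def on_calendar_event(self, listing_id: int) -> None:
        """Invoked by the Kafka consumer on any calendar change."""
        self._store.pop(listing_id, None)
```

Note that the event-driven eviction only narrows staleness; it cannot tighten the guarantee below the CDC pipeline's own lag, which is why the cache stays display-only.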

Request queuing for hot listings (L6)

A listing that goes viral (celebrity home, destination peak season) may receive thousands of concurrent booking attempts in seconds. The distributed lock serialises these, but the rejected requests all immediately retry — amplifying load. A virtual queue (backed by a Redis sorted set, ordered by arrival timestamp) absorbs the burst: each request gets a queue position and polls for its turn. Only the front-of-queue request attempts the actual booking transaction.

Tradeoff: Adds complexity and a new failure mode (queue starvation). Worth implementing only for listings with demonstrably high contention — detected by monitoring lock contention rates per listing_id.
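A sketch of the queue mechanics, using an in-memory structure in place of the Redis sorted set (method names and the load-shedding threshold are illustrative assumptions):

```python
import itertools

class VirtualQueue:
    """Per-listing virtual queue sketch (production: a Redis sorted set,
    score = arrival timestamp). Method names are illustrative."""

    def __init__(self, max_depth: int = 1000):
        self.max_depth = max_depth
        self._arrival = itertools.count()   # stands in for arrival timestamps
        self._members = {}                  # request_id -> arrival score

    def enqueue(self, request_id: str):
        """Return 0-based queue position, or None to shed load (429 + Retry-After)."""
        if request_id not in self._members:
            if len(self._members) >= self.max_depth:
                return None
            self._members[request_id] = next(self._arrival)
        return self.position(request_id)

    def position(self, request_id: str) -> int:
        score = self._members[request_id]
        return sum(1 for s in self._members.values() if s < score)

    def pop_front(self):
        """Only this request may attempt the actual booking transaction."""
        if not self._members:
            return None
        front = min(self._members, key=self._members.get)
        del self._members[front]
        return front
```

Clients poll position() until they reach the front; everyone else waits instead of hammering the lock, which is exactly the retry-storm amplification the queue exists to absorb.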
Distributed booking ID generation (L6)

With 256 DB shards, auto-increment IDs are not globally unique. Use a Snowflake-style distributed ID: 1 sign bit (always 0) + 41 bits timestamp + 10 bits worker/machine ID + 12 bits sequence = 64 bits total, giving ~4096 IDs/ms/worker. Note: the 10-bit field identifies the application server instance, not the DB shard number. The DB shard is determined separately by hash(listing_id) % 256 — this is the routing key for calendar and booking writes. The booking_id is used for lookup; the shard for a given booking_id is found via the listing_id it carries as a foreign key.

Alternatives: UUID v7 · central ID service · ULID
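A sketch of the bit layout described above (the epoch constant and the worker-ID wiring are assumptions; a production generator also needs proper clock-skew handling, which is only crudely approximated here):

```python
import threading
import time

EPOCH_MS = 1_288_834_974_657  # assumed custom epoch (Twitter's); any fixed epoch works

class SnowflakeGenerator:
    """64-bit ID: 1 sign bit (0) | 41-bit ms timestamp | 10-bit worker | 12-bit sequence.
    A sketch of the scheme described above, not a production implementation."""

    def __init__(self, worker_id: int):
        if not 0 <= worker_id < 1024:         # must fit in 10 bits
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000) - EPOCH_MS
            if now <= self.last_ms:
                # Same millisecond (or clock skew): bump the 12-bit sequence.
                now = self.last_ms
                self.sequence = (self.sequence + 1) & 0xFFF   # 4096 IDs/ms/worker
                if self.sequence == 0:
                    now += 1                  # sequence exhausted: borrow the next ms
            else:
                self.sequence = 0
            self.last_ms = now
            return (now << 22) | (self.worker_id << 12) | self.sequence
```

The worker ID can be recovered from an ID with (id >> 12) & 0x3FF, which is occasionally useful for debugging; routing, as noted above, still goes through hash(listing_id) % 256, not through the ID itself.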
Search index warm-up on cold start (L7/L8)

When a new Elasticsearch node joins the cluster (e.g., during scale-out), it needs to receive its share of data via shard rebalancing — a process that takes minutes and degrades search latency during the transfer. L7 candidates recognise this and propose: (1) green/blue ES cluster deployments where a new cluster is fully loaded before traffic is shifted, and (2) dedicated "hot" shards for the most-queried geo regions to ensure those are always on the fastest nodes.

10

Failure modes & edge cases

Scenario · Problem · Solution · Level
Redis lock TTL expires mid-transaction · Two concurrent booking requests both enter the DB transaction after the lock expires; DB sees two writers · SELECT FOR UPDATE at the DB level serialises them even without the lock. The unique constraint on (listing_id, date) ensures only one commits; the second gets a DB conflict exception → 409 to the user. The lock is defence-in-depth, not the primary guard. · L3/L4
Kafka consumer lag · Indexer falls behind; search results show listings as available when they are actually booked (stale for >60 s) · Monitor consumer lag (alert at >10 s lag). On restart, the consumer replays from the last committed offset — no data loss, just a catch-up period. Guests who book a stale listing see a conflict at checkout — design the booking flow to expect >0 search-to-book conflicts. · L5
Idempotency key collision · Client generates the same idempotency key for two different bookings (e.g., a bug reusing a key) · Server stores idempotency_key in the bookings table with a UNIQUE constraint. A repeated key returns the original booking response. Client-side: keys should be UUID v4, generated per booking attempt, not per session. · L5
Host cancels while guest is booking · Host deletes a listing or cancels availability; guest is mid-checkout with a price quote · The booking transaction checks listing status = 'active' as part of the same SELECT FOR UPDATE. If the listing is deactivated, the transaction rolls back → 409 with a specific error code. Notification to guest. · L5
Booking DB shard failure · One of 256 shards goes down; all bookings for listings on that shard fail · Each shard has 2 synchronous replicas. On primary failure, automatic failover promotes a replica (<30 s with Patroni or similar). Bookings in flight during failover get a transient 503; clients retry with exponential backoff. Writes are not rebalanced to other shards — availability for affected listings is degraded, not global. · L5
Price staleness between search and checkout · Prices shown in search reflect a cache snapshot; host updates the price during the guest's checkout flow · The booking request includes a price_quote_id generated at checkout initiation. The Booking API validates that the quote is <15 minutes old and the price hasn't changed since quote generation. If the price changed, a 409 with a price_changed error code is returned; the guest sees the new price and must confirm before proceeding. The price quote record stores the full per-night price vector (not just the total) to support partial-cancellation refunds and proration on early checkout. · L5
Thundering herd on viral listing · A celebrity or trending listing receives tens of thousands of concurrent booking requests; Redis lock contention causes cascading retry storms · Implement a per-listing virtual queue (Redis sorted set). On excessive contention (detected by lock acquisition failure rate), route new booking requests to the queue and issue position tokens; requests poll for their turn. Shed load with 429 + Retry-After once queue depth exceeds a threshold. · L7/L8
Elasticsearch full cluster failure · All search queries fail; guests can't discover listings · Degrade gracefully: fall back to a pre-generated static snapshot of popular listings per geo-region (updated hourly, stored in object storage). Serve a reduced search experience ("showing popular listings in Paris") while ES recovers. Geographic failover to a secondary ES cluster in another region. · L7/L8
Host doesn't respond to request-to-book · Guest is blocked in pending_host_approval state; calendar dates are held but not confirmed; guest has no resolution path after 24 hours · A scheduled expiry job (runs every minute) queries bookings WHERE status = 'pending_host_approval' AND created_at < NOW() - INTERVAL '24 hours'. For each expired booking it atomically sets status = 'expired' and runs UPDATE calendar_availability SET available = TRUE, booking_id = NULL WHERE booking_id = $id in the same transaction, then publishes a booking.expired Kafka event → guest notification. The calendar is then open for new bookings. · L3/L4
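The idempotency-key defence from the table above can be sketched with SQLite standing in for the bookings table (production: PostgreSQL; schema and names are illustrative):

```python
import sqlite3

# Stand-in for the bookings table's idempotency guard.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE bookings (
        booking_id      INTEGER PRIMARY KEY,
        idempotency_key TEXT UNIQUE NOT NULL,   -- one key per booking attempt
        listing_id      INTEGER NOT NULL,
        status          TEXT NOT NULL
    )
""")

def create_booking(idempotency_key: str, listing_id: int) -> dict:
    """A retried request with the same key gets the original response back."""
    try:
        with conn:
            cur = conn.execute(
                "INSERT INTO bookings (idempotency_key, listing_id, status) "
                "VALUES (?, ?, 'confirmed')",
                (idempotency_key, listing_id),
            )
        booking_id = cur.lastrowid
    except sqlite3.IntegrityError:
        # Key already seen: return the stored booking instead of creating one.
        booking_id = conn.execute(
            "SELECT booking_id FROM bookings WHERE idempotency_key = ?",
            (idempotency_key,),
        ).fetchone()[0]
    return {"booking_id": booking_id, "status": "confirmed"}
```

This is why the client must mint a fresh key per booking attempt: reusing a key across two genuinely different bookings would silently return the first booking instead of creating the second.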

Security & compliance

Concern · Requirement · Implementation · Level
Write-path rate limiting · A guest (or bot) hammering POST /bookings can exhaust Redis lock slots and DB write capacity · Rate-limit POST /bookings per authenticated user_id: 10 attempts/minute in steady state, 3/minute after two consecutive 409 conflicts. Implemented at the API Gateway layer (token bucket per user, stored in Redis). An exceeded limit returns 429 Too Many Requests with a Retry-After header. · L5
PII minimisation in booking records · The bookings table must not duplicate guest PII (name, email, phone); it must hold only a guest_id foreign key into the users table · Guest identity data lives exclusively in the users table; the bookings table stores only guest_id (a BIGINT FK). A GDPR erasure request can therefore anonymise the users row (null out name/email/phone) without touching booking records, preserving the financial audit trail required by tax and accounting regulations. Booking records themselves are retained per applicable financial record-keeping laws (typically 7 years). · L5
GDPR data residency · Listing and booking data for EU hosts must physically reside in EU shards; shard placement cannot be purely hash-based · Maintain a region attribute on each listing (derived from geo-coordinates at creation time). Use a composite shard key, region_prefix + listing_id % N, to guarantee EU data stays in EU-hosted PostgreSQL instances. Elasticsearch clusters are region-partitioned on the same boundary. Cross-region replication of EU data to non-EU nodes is prohibited, which affects backup and DR strategy (an EU-only DR site is required). · L6/L7
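The write-path rate limit described above can be sketched as a per-user token bucket (production keeps the counters in Redis at the API Gateway; parameters and names here are illustrative, and the stricter post-conflict tier is omitted for brevity):

```python
import time

class TokenBucket:
    """Per-user token bucket for POST /bookings. A sketch, not the
    gateway's actual implementation."""

    def __init__(self, rate_per_min: float = 10.0, burst: int = 10,
                 clock=time.monotonic):
        self.rate = rate_per_min / 60.0     # tokens refilled per second
        self.burst = burst
        self.clock = clock                  # injectable for testing
        self._state = {}                    # user_id -> (tokens, last_refill)

    def allow(self, user_id: str) -> bool:
        now = self.clock()
        tokens, last = self._state.get(user_id, (float(self.burst), now))
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self._state[user_id] = (tokens - 1.0, now)
            return True
        self._state[user_id] = (tokens, now)
        return False                        # caller returns 429 + Retry-After
```

At 10/minute the bucket refills one token every 6 seconds, so a well-behaved client that backs off after a 429 is never visibly throttled.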
⚠️

Interview signal: Mentioning GDPR data residency as a constraint on the sharding key is a strong L6+ signal. Most candidates shard by listing_id % N and never consider that regulatory requirements can override pure throughput-optimised key selection.

11

How to answer by level

L3 / L4 SDE I / SDE II — Can you build a working system?
What good looks like
  • Correct data model: listings, bookings, availability calendar with one row per listing-day
  • Understand that availability check and booking insert must be atomic
  • Identify the double-booking problem without prompting
  • Propose a DB unique constraint as the structural safeguard
  • Understand search needs a different read path than booking writes
What separates L5 from L3
  • Knowing the lock is defence-in-depth, not the primary guard
  • Recognising search at 150 M listings requires Elasticsearch, not a DB scan
  • Proposing CDC + Kafka rather than synchronous dual-write
  • Understanding why search and booking must be decoupled planes
L5 Senior SDE — Do you understand the tradeoffs?
What good looks like
  • Distributed lock + DB transaction two-layer approach with failure analysis
  • Explains why lock TTL must be short and what happens if it expires early
  • Designs the CDC pipeline: Kafka partition key = listing_id for ordering
  • Idempotency key requirement for the booking endpoint
  • Price quote validation at booking time
What separates L6 from L5
  • Proactively discusses geo-sharding strategy for global scale
  • Identifies and addresses the thundering herd on popular listings
  • Designs for Elasticsearch failure graceful degradation
  • Discusses Snowflake ID generation for distributed shards
L6 Staff SDE — Can you own this end-to-end?
What good looks like
  • Virtual queue for high-contention listings with load shedding
  • Blue/green Elasticsearch deployment for zero-downtime index migrations
  • Per-listing monitoring: lock failure rate, booking conflict rate as operational signals
  • Consumer lag SLA enforcement and alerting
  • Cross-region replication and read routing strategy
What separates L7 from L6
  • Reasons about multi-region consistency tradeoffs for global deployments
  • Addresses regulatory requirements (GDPR data residency) affecting shard placement
  • Proposes capacity planning and cost modelling for the Elasticsearch cluster
L7 / L8 Principal / Distinguished — Should we build this, and how?
What good looks like
  • Frames the binary-per-listing vs. count-based inventory model tradeoff for different business models
  • Addresses GDPR data residency: listing data for EU hosts must stay in EU shards — affects shard architecture
  • Proposes cost-optimised tiered storage: recent bookings in hot tier, historical in cold tier (e.g., S3 + Athena)
  • Thinks about the dual-sided marketplace: host tooling (dynamic pricing, calendar sync with iCal/OTA platforms) as extensions to the core booking architecture
Signals that stand out
  • Proactively discuss OTA (Online Travel Agency) channel management: Airbnb listings also appear on Expedia/Booking.com → need to synchronise availability across external systems
  • Distinguish "availability" from "pricing" as separate services with different consistency requirements
  • Identify that the search plane and booking plane should be separate services with separate deployment and on-call teams
  • Propose a concrete SLA for OTA conflict resolution: the channel manager must detect and roll back an external double-booking within ≤5 s of receiving the OTA webhook. Reason through the CAP tradeoff: when Airbnb and an OTA both confirm a booking for the same night simultaneously, first-write-wins to the DB unique constraint determines the winner; the loser triggers an automated cancellation + compensation flow to the guest on that platform.

Classic probes

Question · L3/L4 · L5/L6 · L7/L8
How do you prevent double-booking? · DB unique constraint on (listing_id, date) · Unique constraint + distributed lock (Redis); explains the lock is defence-in-depth · Analyses the lock-expiry failure window; proposes a virtual queue for high-contention listings; quantifies the false-conflict rate
How does search work at 150 M listings? · Use a database with geo-indexing · Elasticsearch with geo_bounding_box + availability bitset filter; decoupled from the booking DB via CDC · Green/blue index deployment; ES sharding strategy by geo region; graceful degradation to a static snapshot on cluster failure
How fresh are search results? · "Near real-time" · Eventual consistency via Kafka CDC; ≤60 s SLA in steady state; consumer lag monitoring · Multi-region replica lag pushes total freshness to 90–210 s in non-primary regions; a business decision between freshness SLA and search latency budget; implications for the search-to-book conflict rate
How do you handle a host who lists their property on both Airbnb and Booking.com? · Not usually considered · Mentions iCal sync as a common pattern; double-booking risk from external OTAs · Designs a channel manager service: subscribes to external OTA webhooks, updates the availability calendar as an event source, and handles conflict resolution when two channels book simultaneously (first-write-wins with automatic rollback notification)
How the pieces connect
NFR → decision chains traceable across sections
1. Search latency <300 ms (§2) → search cannot scan the booking DB → dedicated Elasticsearch index (§4) → search and booking planes must be independently scaled (§9)
2. Zero structural double-bookings (§2) → unique constraint on (listing_id, date) is the structural guarantee (§5, §7) → distributed lock is defence-in-depth, not the primary guard (§6)
3. Availability freshness ≤60 s (§2) → synchronous dual-write couples booking latency to ES write speed → Kafka CDC + Indexer consumer (§4) achieves ≤60 s without coupling (§8)
4. Shard by listing_id (§4, §7) → all calendar rows + booking records for a listing co-located → single-shard transaction for conflict check + write → no cross-shard coordination needed (§6)
5. 500:1 read:write ratio (§3) → write QPS (~115/s) is trivial; search QPS (~57k/s) is the challenge → caching hierarchy with 30 s Redis TTL for search results eliminates most Elasticsearch calls (§8)
6. Idempotency key (§5b) → mobile clients retry on network failure → without idempotency, retries create duplicate bookings → unique constraint on idempotency_key in the bookings table (§7) prevents duplicates
