Design a Chat System
like WhatsApp
Simple to describe, brutally hard to scale reliably. 100 billion messages a day delivered in under a second, to 2 billion users, across flaky mobile networks.
What the interviewer is testing
A chat system sits at an interesting intersection: it combines the real-time push requirements of a streaming system, the durability requirements of a transactional store, and the fan-out requirements of a social network. Asking you to "design WhatsApp" is a way of probing whether you understand all three, and how they conflict.
The core tension the interviewer wants to see you navigate is reliability vs latency. Every time a message is sent, you must decide: do you acknowledge to the sender before or after you durably persist the message? Before or after the recipient's device confirms receipt? These are not the same moment, and mixing them up is the single most common mistake in this interview.
| Level | What they're listening for |
|---|---|
| L3/L4 | Can you explain the basic send-receive flow using WebSockets? Do you understand why polling doesn't scale? |
| L5 | Can you reason about delivery guarantees, message ordering, and offline delivery queuing? Do you know the tradeoffs between fan-out on write vs read? |
| L6 | Can you own the full design end-to-end, including multi-device sync, media storage, and group messaging at scale? |
| L7/L8 | Do you think about multi-region active-active, end-to-end encryption implications on server-side fan-out, and the cost vs latency tradeoffs at 100B msg/day? |
Requirements clarification
Before drawing any boxes, establish what you're building. A chat system can mean a WhatsApp-style mobile messenger, a Slack-style workspace tool, or a Twitch-style live chat. The architecture diverges significantly. These are the questions that matter most.
Functional requirements
| Requirement | Decision |
|---|---|
| 1-to-1 messaging | ✅ Core requirement |
| Group messaging | ✅ Up to 500 members per group |
| Online presence / last seen | ✅ Required |
| Message delivery receipts (sent / delivered / read) | ✅ Three-state: ✓ ✓✓ ✓✓ (blue) |
| Media messages (images, video, audio) | ✅ Via CDN, separate upload flow |
| Message history / sync across devices | ✅ Multi-device support |
| Push notifications (offline users) | ✅ Via APNs / FCM |
| End-to-end encryption | ⚠️ Mention at L5+; deep-dive at L7/L8 |
| Voice / video calls | ❌ Out of scope (different protocol entirely, WebRTC) |
Non-functional requirements
| NFR | Target | Why this number |
|---|---|---|
| Message delivery latency (p99) | < 500 ms | Human perception threshold for "instant" conversation |
| Availability | 99.99% (52 min/year downtime) | Messaging is real-time; even brief outages disrupt live conversations |
| Durability | No message loss after server ACK (in steady state) | Users expect sent messages to arrive, data loss destroys trust |
| Message ordering | Causal order within a conversation | Out-of-order messages make conversations incomprehensible |
| Scale | 2B users, 100B messages/day | WhatsApp scale, drives the entire horizontal scaling story |
| Offline delivery | Messages queue until recipient reconnects | Mobile users are frequently offline; messages must survive |
NFR reasoning
Sub-500ms p99 delivery latency
500ms is the rough threshold below which a conversation feels live. Above it, users notice the lag and conversations feel sluggish. The p99 (not average) matters because a consistently fast average with a long tail still breaks the experience for 1 in 100 messages.
Durability: no message loss after server ACK
The ACK boundary matters enormously. If the server acknowledges receipt to the sender's device before persisting the message, a server crash between those two moments causes silent data loss: the sender thinks the message was sent, but it's gone. The server must persist durably before sending the ACK. Note that "no message loss after server ACK" is a steady-state guarantee: a simultaneous failure of two or more Cassandra replicas within the write commit window is treated as a known, rare failure mode, not accepted as a design shortcut.
Causal ordering within a conversation
Strict global ordering across all conversations is impossibly expensive at this scale (it would require a distributed consensus round for every message). Causal ordering within a single conversation (if message A happens before B, every device sees A before B) is sufficient and much cheaper to achieve.
Capacity estimation
Chat systems have an asymmetric read/write ratio compared to most other systems. The write QPS is high (every message sent is a write), but the read QPS is even higher because every message delivered to a recipient is also a read, and in group chats, one write can fan out to many reads.
The other unusual dimension is storage. Messages are mostly small text payloads, but retention is long and the total volume compounds fast. Use the estimator to build intuition before a live interview.
Interactive capacity estimator
Key architectural implication: At 500M DAU with 10% peak concurrency, there are ~50M concurrent WebSocket connections. At ~50–100k connections per chat server, that requires 500–1,000 chat servers, scaling to 2,000–4,000 at 2B DAU. This is why chat servers must be stateful connection holders; they cannot be replaced with stateless HTTP servers on the hot path. The connection registry (which server holds which device's socket) becomes critical infrastructure.
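The back-of-envelope arithmetic behind these counts is worth being able to reproduce on the spot. A quick sketch — the concurrency and per-server figures are the stated assumptions above, not measured values:

```python
# Back-of-envelope check for the connection and server counts quoted above.
# Assumptions (illustrative, not from a real deployment): 10% peak
# concurrency, 50k-100k sockets per chat server.

def servers_needed(dau: int, peak_concurrency: float, conns_per_server: int) -> int:
    """Number of chat servers needed to hold peak WebSocket connections."""
    concurrent = int(dau * peak_concurrency)
    # Round up: a partially full server is still a server.
    return -(-concurrent // conns_per_server)

low = servers_needed(500_000_000, 0.10, 100_000)   # optimistic: 100k conns/server
high = servers_needed(500_000_000, 0.10, 50_000)   # conservative: 50k conns/server
print(low, high)   # 500 1000
```

At 2B DAU the same arithmetic gives 2,000–4,000 servers before adding redundancy headroom.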
High-level architecture
The overall shape of a chat system is a message broker with persistent WebSocket connections on both ends. Every other component exists to support the three core guarantees: low-latency delivery, durability, and offline queueing.
Component breakdown
Load Balancer routes incoming WebSocket upgrade requests to a chat server. Unlike HTTP load balancers that distribute each request independently, WebSocket connections are long-lived: once established, all subsequent frames from a client go to the same server. This means the load balancer only participates in connection establishment, not in ongoing message delivery.
Chat Servers are the core of the system, each maintains thousands of open WebSocket connections. When a message arrives on a sender's socket, the server must (a) persist it, (b) route it to the recipient. Chat servers are inherently stateful because they hold open sockets. This is the primary scalability challenge: you can't freely scale them horizontally without a way to route messages across servers.
Session Cache (Redis) is an in-memory key-value store that maps each device ID
to whichever chat server holds that device's active WebSocket connection. This is a device-level
mapping, not a user-level one: a user with a phone, tablet, and laptop has three separate routing
entries. A companion set per user (user:devices:{user_id}) tracks all of a user's active
device IDs so fan-out can find them in a single lookup. This routing table is what lets chat servers
talk to each other: when delivering to Bob, the sender's server looks up Bob's device set, resolves each
device to its server, and routes accordingly.
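The lookup sequence can be sketched with plain dicts standing in for the two Redis structures (user:devices:{user_id} and session:device:{device_id}); all IDs and server names here are illustrative:

```python
# Sketch of the fan-out routing lookup described above. In production these
# dicts are Redis structures: SMEMBERS on the device set, then a GET per
# device key. All names are illustrative.

user_devices = {"bob": {"bob-phone", "bob-laptop", "bob-tablet"}}
device_server = {"bob-phone": "chat-server-042", "bob-laptop": "chat-server-107"}
# bob-tablet has no entry: its session key expired, i.e. the device is offline.

def route_to_user(user_id: str) -> dict:
    """Resolve each of a user's devices to the chat server holding its socket.

    A value of None means no live connection: that device gets a push
    notification instead of a WebSocket frame.
    """
    return {dev: device_server.get(dev) for dev in user_devices.get(user_id, set())}

routes = route_to_user("bob")
# Two devices resolve to servers; the tablet resolves to None (offline path).
```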
Message Queue (Kafka), a durable distributed log, decouples the send path from the delivery path. When a message is received, the chat server publishes it to a topic in the queue. This gives us two things: (1) durability even if a chat server crashes before delivery, and (2) a natural fan-out mechanism for group messages: one write to the queue, and multiple consumer groups pick it up.
Message Store (Cassandra), a wide-column store, persists all messages durably. Messages are partitioned by conversation ID and clustered by message sequence number, which gives fast range scans over a conversation's history. Cassandra's write-optimized storage engine (an LSM tree) handles the high write QPS naturally.
Presence Service tracks which users are online, their last seen time, and typing indicators. It is built on a pub/sub in-memory store (Redis Pub/Sub) and intentionally separated from the message path because presence updates are extremely high frequency (heartbeats every 5–10 seconds from every connected client) and have different durability requirements: a stale presence indicator is annoying, but it is not data loss.
Push Notification Service handles offline delivery. If a recipient has no active WebSocket connection, the system routes a notification through third-party mobile push platforms (Apple Push Notification service for iOS, Google's Firebase Cloud Messaging for Android). The notification wakes the client, which then establishes a WebSocket and fetches missed messages.
Object Store (S3 + CDN), a cloud object store fronted by a content delivery network, handles media. Images, videos, and audio files are uploaded directly to S3 via a pre-signed URL; they never pass through the chat servers. The chat message itself only contains a reference (URL) to the media object.
Architectural rationale
Why stateful chat servers instead of stateless HTTP handlers? (Connectivity model)
WebSocket connections require the server to maintain open file descriptors for the lifetime of the connection, sometimes hours. A stateless request-response model (HTTP) would require the client to poll every few hundred milliseconds, which at 500M DAU creates an enormous polling QPS and introduces the latency of a full polling cycle for every message.
Why Kafka between chat servers and the message store? (Decoupling)
Without a message queue, the chat server would write synchronously to Cassandra and then
attempt delivery. If the chat server crashes after writing but before delivery, the
recipient never gets the message (or gets it only after fetching history). Once a message is
in Kafka, it will reach a consumer at least once, even if a chat server crashes partway
through processing. The client_msg_id idempotency key at the application layer
converts that at-least-once guarantee into exactly-once delivery at the recipient's device.
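A minimal sketch of that dedup step, with a dict standing in for what would be a TTL'd Redis key in production (all names are illustrative):

```python
# Idempotent-consumer sketch: Kafka may hand the same record to the consumer
# more than once, so delivery is keyed on client_msg_id. The seen-set is a
# dict here; production would use a store with a TTL (e.g. Redis SET NX EX).

delivered = {}        # client_msg_id -> server_msg_id already processed
delivery_log = []     # stand-in for frames actually pushed to the recipient

def deliver_once(client_msg_id: str, server_msg_id: int, body: str) -> bool:
    """Deliver unless this client_msg_id was already processed."""
    if client_msg_id in delivered:
        return False                  # duplicate redelivery: drop silently
    delivered[client_msg_id] = server_msg_id
    delivery_log.append(body)         # stand-in for the WebSocket push
    return True

assert deliver_once("uuid-abc123", 1000042, "Hey!") is True
assert deliver_once("uuid-abc123", 1000042, "Hey!") is False  # Kafka redelivery
assert delivery_log == ["Hey!"]       # recipient saw the message exactly once
```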
Why Cassandra for message storage? (Data store choice)
Messages have a very clear access pattern: you almost always read a conversation's recent messages in order. This maps perfectly to Cassandra's wide-column model: partition key = conversation_id, clustering key = message_id (time-ordered). A range scan retrieves the last N messages in a single sequential read. Cassandra also handles high write throughput naturally; its LSM-tree storage engine is write-optimized.
Real-world comparison
| Decision | This design | WhatsApp | Slack |
|---|---|---|---|
| Connection protocol | WebSocket | Custom XMPP over WebSocket | WebSocket |
| Message store | Cassandra | Mnesia → Cassandra | MySQL (sharded) |
| Server language | Any | Erlang (massive concurrency) | Go / Java |
| Group fan-out | Kafka consumer groups | Server-side fan-out, 256 cap | Fan-out + channel subscriptions |
| Media delivery | S3 + CDN pre-signed URLs | Proprietary CDN | AWS S3 + CloudFront |
WhatsApp's choice of Erlang (and later Elixir) is architectural: the BEAM virtual machine supports millions of lightweight processes, each maintaining a user's connection state. This lets a single server hold far more concurrent WebSocket connections than a thread-per-connection model. The lesson: the right tool for your concurrency model matters as much as the right tool for your storage model.
Connection protocol: WebSockets vs polling
The choice of connection protocol is the first fundamental design decision. Every other latency and throughput number in this system depends on it. There are three main options worth understanding.
Our choice: WebSockets. For a bidirectional chat system, WebSockets are the right model. They establish a single persistent TCP connection through which both client and server can send frames at any time, with minimal per-frame overhead (a few bytes of framing, versus full headers on every HTTP request). The initial handshake adds a small one-time cost; every message after that travels nearly overhead-free.
Code: WebSocket connection lifecycle on the client
```javascript
// Client establishes WebSocket connection
const ws = new WebSocket('wss://chat.example.com/ws');

ws.onopen = () => {
  // Authenticate immediately after connection
  ws.send(JSON.stringify({ type: 'auth', token: jwtToken }));
};

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  if (msg.type === 'message') {
    displayMessage(msg);
    // ACK delivery back to server
    ws.send(JSON.stringify({ type: 'ack', msgId: msg.id }));
  }
};

ws.onclose = (event) => {
  // Exponential backoff reconnect
  scheduleReconnect(event.code);
};
```
Common probe: "What happens when a WebSocket connection drops?" The client must reconnect with exponential backoff. On reconnect, it sends its last-seen message sequence number so the server can replay any missed messages from the message store. This is why sequence numbers per conversation are critical: they define exactly where the client's view of the conversation ends.
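The replay step is a simple high-water-mark filter. A sketch with an in-memory list standing in for the Cassandra partition (IDs and bodies are illustrative):

```python
# Replay-on-reconnect sketch: the client presents the last server_msg_id it
# has seen, and the server range-scans the conversation for anything newer
# (the same query shape the message store serves).

conversation_log = [  # stand-in for the messages partition, ordered by id
    {"server_msg_id": 1000040, "body": "on my way"},
    {"server_msg_id": 1000041, "body": "be there in 10"},
    {"server_msg_id": 1000042, "body": "Hey, are you free tonight?"},
]

def replay_since(last_seen: int) -> list:
    """Return every message the client missed while disconnected."""
    return [m for m in conversation_log if m["server_msg_id"] > last_seen]

missed = replay_since(1000040)   # the client reconnects knowing up to 1000040
# missed contains the two messages with ids 1000041 and 1000042
```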
API design
Chat systems have two distinct API layers: the WebSocket protocol (real-time in-session) and a REST API (account management, history fetch, media upload). The WebSocket protocol handles the hot path; REST handles everything else.
WebSocket message protocol (hot path)
```
// Client → Server: Send message
{
  "type": "message.send",
  "client_msg_id": "uuid-abc123",    // idempotency key, client-generated
  "conversation_id": "conv-xyz",
  "content_type": "text",            // or "image", "audio", "video"
  "body": "Hey, are you free tonight?",
  "timestamp_client": 1712000000000  // ordering hint only, not authoritative
}

// Server → Client: ACK (sent receipt, one tick ✓)
{
  "type": "message.ack",
  "client_msg_id": "uuid-abc123",
  "server_msg_id": 1000042,          // authoritative sequence number
  "status": "sent"
}

// Server → Recipient: Message delivery (two ticks ✓✓ on delivery)
{
  "type": "message.receive",
  "server_msg_id": 1000042,
  "conversation_id": "conv-xyz",
  "sender_id": "user-alice",
  "body": "Hey, are you free tonight?",
  "timestamp_server": 1712000001234
}

// Recipient → Server: Delivery receipt (triggers two-tick on sender)
{ "type": "message.delivered", "server_msg_id": 1000042 }

// Recipient → Server: Read receipt (triggers blue ticks)
{ "type": "message.read", "conversation_id": "conv-xyz", "up_to_msg_id": 1000042 }
```
REST API (account management & history)
| Endpoint | Method | Purpose | Level |
|---|---|---|---|
| /messages/media/upload-url | POST | Get pre-signed S3 URL for media upload | L3/L4 |
| /conversations/{id}/messages | GET | Fetch history with cursor pagination | L3/L4 |
| /conversations | POST | Create 1:1 or group conversation | L3/L4 |
| /users/{id}/presence | GET | Get online status and last seen | L5 |
| /devices/register | POST | Register new device for multi-device delivery | L5 |
| /messages/sync?since={seq} | GET | Catch up on missed messages after reconnect | L6 |
The client_msg_id field is the idempotency key. If a client sends a message and the
WebSocket drops before receiving the ACK, the client retransmits with the same
client_msg_id. The server deduplicates on this key, so the message appears exactly once
in the conversation even if transmitted multiple times.
Security & rate limiting (L5+)
A chat system has three distinct abuse surfaces, each requiring a different mitigation. Interviewers at L6+ frequently probe on at least one of these after the core design is established.
WebSocket authentication on upgrade (Who can connect?)
WebSocket connections are long-lived, so authentication cannot rely on per-request session
cookies the way HTTP does. Sending a token as a URL query parameter
(?token=...) is a common but dangerous pattern: query parameters appear in
nginx access logs, CDN edge logs, and browser history, making the token semi-public. The
correct production pattern is the first-message auth handshake: the
WebSocket connection is established without credentials, and the client sends its JWT as the
very first message frame before any application traffic. The server holds the connection in
an authenticating state for up to 5 seconds; if no valid auth frame arrives, it
closes the connection. This keeps tokens out of all log surfaces.
The JWT itself should be short-lived (15-minute expiry). Before it expires, the client
fetches a new JWT via a REST call and sends it over the open WebSocket as a
token.refresh frame, no reconnect needed. The server validates and updates the
session's expiry in-place.
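The handshake can be modeled as a small state machine. A sketch — the 5-second window comes from above; validate_jwt is a placeholder for real signature and expiry verification:

```python
# First-message auth handshake sketch: the connection sits in an
# "authenticating" state and is closed if the first frame is not a valid
# auth frame, or if the window elapses.

AUTH_WINDOW_SECONDS = 5.0

def validate_jwt(token: str) -> bool:
    # Placeholder: real code verifies signature, issuer, and expiry.
    return token == "valid-jwt"

class Connection:
    def __init__(self, now: float):
        self.state = "authenticating"
        self.deadline = now + AUTH_WINDOW_SECONDS

    def on_frame(self, frame: dict, now: float) -> str:
        if now > self.deadline:           # auth window elapsed
            self.state = "closed"
        elif self.state == "authenticating":
            if frame.get("type") == "auth" and validate_jwt(frame.get("token", "")):
                self.state = "open"       # application traffic may now flow
            else:
                self.state = "closed"     # first frame must be a valid auth frame
        return self.state

conn = Connection(now=0.0)
assert conn.on_frame({"type": "auth", "token": "valid-jwt"}, now=1.0) == "open"

late = Connection(now=0.0)
assert late.on_frame({"type": "auth", "token": "valid-jwt"}, now=6.0) == "closed"
```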
Per-user message rate limiting (Spam prevention)
Without rate limiting, a compromised account or bot can flood conversations with thousands of
messages per second, degrading the system for everyone. Enforce a per-user message rate
limit at the chat server using a token bucket stored in an in-memory store (Redis): each
user has a bucket with a capacity of N messages per minute (e.g. 60), refilled at 1/second.
If the bucket is empty, the server responds with a rate_limited WebSocket frame
and does not persist or forward the message.
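A minimal in-memory version of that token bucket; production would keep the bucket state in Redis, typically updated via a Lua script for atomicity:

```python
# Token-bucket sketch matching the numbers above: capacity 60, refill 1/s.
# Only the accounting is shown; the caller sends a rate_limited frame when
# allow() returns False.

class TokenBucket:
    def __init__(self, capacity: int = 60, refill_per_sec: float = 1.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill based on elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False    # bucket empty: reject, do not persist or forward

bucket = TokenBucket()
burst = [bucket.allow(now=0.0) for _ in range(61)]
assert burst.count(True) == 60 and burst[-1] is False  # burst capped at 60
assert bucket.allow(now=1.0) is True                   # one token refilled after 1s
```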
Media upload abuse prevention (Storage cost)
Pre-signed S3 URLs allow clients to upload directly to object storage, which means the
server must control what it signs for. Two mitigations: (1) enforce a max file size on the
pre-signed URL (S3 supports content-length-range in the policy), so a client
cannot upload a 10 GB file using a URL issued for a 25 MB limit; (2) limit pre-signed URL
issuance per user per hour to prevent a bot from generating thousands of upload slots. Both
checks happen in the REST handler before issuing the URL, and cost nothing in S3 bandwidth.
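A sketch of the issuing handler's two checks. The conditions list mirrors the shape of an S3 POST policy's content-length-range entry; the quota numbers and counter store are illustrative (production would use a Redis counter with a one-hour expiry):

```python
# Pre-sign guard sketch: quota check, then build the size-limited policy
# that would be handed to the object store's pre-sign call.

MAX_UPLOAD_BYTES = 25 * 1024 * 1024   # 25 MB cap baked into the signed policy
MAX_URLS_PER_HOUR = 20                # illustrative per-user issuance quota

issued_this_hour = {}                 # user_id -> count; Redis + TTL in production

def presign_request(user_id: str, key: str):
    """Return the upload policy to sign, or None if the user is over quota."""
    count = issued_this_hour.get(user_id, 0)
    if count >= MAX_URLS_PER_HOUR:
        return None                   # REST handler translates this to a 429
    issued_this_hour[user_id] = count + 1
    return {
        "key": key,
        # The signed policy itself rejects oversized uploads at the object
        # store, so the limit holds even if the client ignores it.
        "conditions": [["content-length-range", 1, MAX_UPLOAD_BYTES]],
    }

req = presign_request("user-alice", "media/abc.jpg")
assert req is not None and req["conditions"][0][2] == MAX_UPLOAD_BYTES
```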
A post-upload scan can additionally reject a non-compliant object after the fact, deleting it and sending a media.rejected event back to the conversation.
Core message flow
The critical design question in the message flow is: when do we acknowledge to the sender? This drives everything else. The durability NFR from §2 says we must persist before ACKing, so the key architectural choice is persisting fast enough to stay within the 500ms p99 budget.
Two-path write design: the chat server has two sequential jobs when a message arrives, persist it durably to Cassandra (step 2), then publish it to Kafka for fan-out delivery (step 4). Kafka is on the delivery path, not the persistence path. Durability comes from Cassandra; Kafka provides the async delivery mechanism to all recipient devices.
The key structural decision is that steps 1–3 form the synchronous critical path. The sender gets their ACK only after the message is durably persisted (step 2), satisfying our durability NFR from §2. Everything after step 3 is async: the delivery to the recipient, the two-tick notification, and the read receipt.
Common mistake: ACKing the sender before persisting. If the server crashes between ACK and persist, the sender thinks the message was delivered but it's gone. The durability NFR from §2 must not be violated structurally: a simultaneous multi-replica failure in the commit window can still cause loss in an extreme scenario, but loss should never be an intentional design shortcut. The latency cost of synchronous persist is worth it; Cassandra quorum writes at RF=3 typically complete in 5–30ms depending on node locality, well within the 500ms p99 budget.
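The required ordering is easy to encode and test. A sketch with stand-in functions for the three steps:

```python
# Ordering sketch for the send path: durable persist first, then ACK to the
# sender, then async publish for delivery fan-out. The three callables are
# stand-ins; the point is that ack() can never precede persist().

events = []

def persist(msg): events.append("persist")   # quorum write, ~5-30 ms
def ack(msg):     events.append("ack")       # one tick on the sender's screen
def publish(msg): events.append("publish")   # queue publish: fan-out begins

def handle_send(msg: dict) -> None:
    persist(msg)   # durability first: a crash before this loses nothing
    ack(msg)       # only now may the sender believe the message is safe
    publish(msg)   # everything after the ACK is async delivery work

handle_send({"client_msg_id": "uuid-abc123", "body": "hi"})
assert events.index("persist") < events.index("ack") < events.index("publish")
```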
Data model
Before writing any schema, identify the entities and how they'll be accessed. Chat systems have a small number of entity types but very different access patterns across them, which is why a single database rarely serves all of them well.
Entities and access patterns
| Entity | Primary access pattern | Query shape | Best store |
|---|---|---|---|
| Messages | Fetch recent N messages in a conversation; fetch messages after seq X (sync) | Range scan by (conversation_id, msg_id DESC) | Cassandra |
| Conversations (inbox) | List a user's conversations sorted by most-recent message; update sort position on every new message | Redis sorted set ZREVRANGE (score = last_msg_ts); per-conversation metadata in Cassandra | Redis + Cassandra* |
| Users | Lookup by user_id or phone; auth; profile | Point lookup + join with contacts | PostgreSQL |
| Group Membership | Who is in group X? Which groups does user Y belong to? | Two directions: by group_id, by user_id | PostgreSQL |
| Presence | Is user X online right now? Last seen? | Point lookup, very high write frequency | Redis |
| Session / routing | Which chat server holds user X's WebSocket? | Point lookup on hot path | Redis |
Two things jump out from this table. First, messages are the dominant entity by volume and by read/write frequency, and their access pattern (range scan by conversation + time order) is a perfect fit for Cassandra's partition + clustering key model. Second, presence and session data need sub-millisecond reads on the hot path, making in-memory storage (Redis) the only viable choice.
The store assignments follow directly from access pattern constraints. Messages and receipts go to Cassandra: immutable writes, fast range reads. The inbox list goes to Redis: it needs atomic score updates on every new message without generating tombstones. User and group metadata goes to PostgreSQL: it needs referential integrity and joins. Presence and session state go to Redis too, but for a different reason: sub-millisecond point lookups on the write hot path, not just read performance. The schemas below follow from these choices.
Message store schema (Cassandra)
```sql
-- Core message record: immutable after write, no mutable status field
-- Delivery/read state is tracked separately in message_receipts
CREATE TABLE messages (
    conversation_id UUID,
    message_id      BIGINT,     -- Snowflake ID: sortable + time-embedded
    sender_id       UUID,
    content_type    TEXT,       -- 'text' | 'image' | 'audio' | 'video'
    body            TEXT,       -- Encrypted ciphertext at L7/L8
    media_url       TEXT,       -- S3 URL if content_type != 'text'
    created_at      TIMESTAMP,
    PRIMARY KEY ((conversation_id), message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
```
Why Snowflake IDs instead of UUIDs for message_id?
Version-4 UUIDs are random; they don't sort by time. Cassandra sorts rows within a partition by the clustering key, so if message_id were a random UUID, messages would come out in arbitrary order and need sorting in application code. Snowflake IDs (64-bit integers with a millisecond timestamp in the high bits) sort by creation time natively: range scans return messages in chronological order with no application-side sort. They're also half the size of a UUID (8 bytes vs 16), which matters at 100 billion messages. (Cassandra's timeuuid is the time-sortable exception, but it keeps the 16-byte cost.)
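A minimal Snowflake-style generator, assuming an illustrative 41/10/12 bit layout and a custom epoch (not WhatsApp's actual scheme):

```python
import threading
import time

# Minimal Snowflake-style ID generator matching the description above:
# 41 timestamp bits | 10 worker bits | 12 sequence bits. Layout and epoch
# are illustrative assumptions.

EPOCH_MS = 1_600_000_000_000   # custom epoch keeps the 41-bit field roomy

class Snowflake:
    def __init__(self, worker_id: int):
        assert 0 <= worker_id < 1024          # must fit in 10 bits
        self.worker_id = worker_id
        self.seq = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000) - EPOCH_MS
            if now == self.last_ms:
                # Same millisecond: bump the 12-bit sequence. A production
                # generator waits for the next ms when the sequence wraps,
                # and handles clock skew; both are omitted here.
                self.seq = (self.seq + 1) & 0xFFF
            else:
                self.seq = 0
                self.last_ms = now
            return (now << 22) | (self.worker_id << 12) | self.seq

gen = Snowflake(worker_id=42)
a, b = gen.next_id(), gen.next_id()
assert a < b                      # later IDs always sort after earlier ones
assert (a >> 12) & 0x3FF == 42    # worker id recoverable from the middle bits
```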
```sql
-- Delivery and read receipts: one row per (message, recipient)
-- Works for both 1:1 (1 recipient) and group (N recipients)
CREATE TABLE message_receipts (
    conversation_id UUID,
    message_id      BIGINT,
    recipient_id    UUID,       -- user_id of the recipient
    status          TEXT,       -- 'sent' | 'delivered' | 'read'
    updated_at      TIMESTAMP,
    PRIMARY KEY ((conversation_id, message_id), recipient_id)
);

-- Access pattern A: all receipts for one message (sender's tick state)
--   SELECT * FROM message_receipts WHERE conversation_id=? AND message_id=?
--   → iterate rows: "Delivered" if any status='delivered'; "Read" if any status='read'
-- Access pattern B: point lookup (did this specific user receive this message?)
--   SELECT * FROM message_receipts WHERE conversation_id=? AND message_id=? AND recipient_id=?
-- Cross-conversation unread count is NOT queried from here.
-- It is derived from last_read_msg_id in user_conversation_meta (see below).

-- Inbox (conversation list sorted by recency) is NOT a Cassandra table.
-- Reason: using last_message_at as a clustering key means updating it on every
-- new message requires a delete + insert (Cassandra can't mutate clustering keys).
-- At 100B messages/day this generates a continuous tombstone storm. Use Redis instead:
--   Redis sorted set: inbox per user, score = last_message_unix_ms, member = conv_id
--   ZADD inbox:{user_id} {last_msg_ts_ms} {conversation_id}
--   ZREVRANGE inbox:{user_id} 0 19 WITHSCORES → top 20 convs by recency (O(log N))
--   Updating recency on new message: ZADD overwrites the score atomically, no tombstones

-- Per-conversation metadata (name, type, last_read cursor) stored separately:
CREATE TABLE user_conversation_meta (
    user_id          UUID,
    conversation_id  UUID,
    last_read_msg_id BIGINT,    -- cursor: unread = msgs with id > this value
    joined_at        TIMESTAMP,
    PRIMARY KEY ((user_id), conversation_id)
);
-- This table is append-write-only for the cursor (UPSERT on read event); no sort mutation.
```
Why a separate message_receipts table?
A single status column on the messages table fails for two reasons.
First, in a group of 500 members, "delivered" is per-recipient, one column can't represent
500 independent delivery states. You'd need a row per recipient anyway, which is exactly
what message_receipts provides.
Second, repeatedly updating a non-clustering column in Cassandra writes a new cell version on every update (last-write-wins reconciliation, not an in-place mutation). At 100B messages/day with delivery + read events per message, that overwrite churn continuously burdens compaction and reads. The receipt table avoids it entirely: each delivery event is an upsert of a row keyed by recipient, written once per state transition, with no hot, repeatedly-overwritten cells.
Reading tick state back out: SELECT * FROM message_receipts WHERE conversation_id=? AND message_id=?, then iterate rows, "delivered" if any status='delivered'. For a per-device check, add AND recipient_id=? for a point lookup. The cross-conversation unread count comes from the last_read_msg_id cursor in user_conversation_meta, not from this table.
Why Redis for the inbox, not Cassandra?
The inbox is a list of conversations sorted by most-recent-message time. Every time a new message arrives in any conversation, that conversation needs to move to the top of the list.
In Cassandra, sort order is baked into the clustering key; you can't update it in place. Moving a conversation up the list means deleting the old row and inserting a new one with the updated timestamp. At 100B messages/day this is a continuous delete + insert churn, generating a tombstone on every operation. Tombstones don't get cleaned up immediately; they accumulate until compaction runs, and a heavy tombstone load degrades read performance across the partition.
A Redis sorted set handles this natively. ZADD inbox:{user_id} {ts} {conv_id}
overwrites the score atomically if the key already exists, no delete, no tombstone.
ZREVRANGE returns the top N conversations by recency in O(log N). The sorted
list lives in Redis; per-conversation metadata and the read cursor
(last_read_msg_id) stay in Cassandra where they belong.
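The two properties the inbox relies on — overwrite-in-place ZADD and descending ZREVRANGE — can be shown with a tiny in-memory stand-in for the sorted set:

```python
# In-memory stand-in for the Redis sorted set behavior the inbox relies on:
# ZADD on an existing member just overwrites its score (no delete, no
# tombstone), and ZREVRANGE reads members descending by score.

inbox = {}   # member (conv_id) -> score (last_msg_ts_ms)

def zadd(member: str, score: int) -> None:
    inbox[member] = score   # overwrite-in-place, like Redis ZADD

def zrevrange(n: int) -> list:
    return [m for m, _ in sorted(inbox.items(), key=lambda kv: -kv[1])][:n]

zadd("conv-work", 1_712_000_000_000)
zadd("conv-family", 1_712_000_100_000)
assert zrevrange(2) == ["conv-family", "conv-work"]

# A new message in conv-work moves it to the top with one score overwrite:
zadd("conv-work", 1_712_000_200_000)
assert zrevrange(2) == ["conv-work", "conv-family"]
```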
If the Redis inbox is ever lost, it can be rebuilt by scanning user_conversation_meta for the user's conversations and reseeding the sorted set. This is an infrequent operation and acceptable.
User and group schema (PostgreSQL)
```sql
CREATE TABLE users (
    user_id      UUID PRIMARY KEY,
    phone        TEXT UNIQUE NOT NULL,
    display_name TEXT,
    avatar_url   TEXT,
    created_at   TIMESTAMPTZ
);

CREATE TABLE conversations (
    conversation_id UUID PRIMARY KEY,
    type            TEXT,   -- 'direct' | 'group'
    name            TEXT,   -- null for direct messages
    created_by      UUID REFERENCES users(user_id),
    created_at      TIMESTAMPTZ
);

CREATE TABLE conversation_members (
    conversation_id UUID REFERENCES conversations(conversation_id),
    user_id         UUID REFERENCES users(user_id),
    role            TEXT,   -- 'admin' | 'member'
    joined_at       TIMESTAMPTZ,
    PRIMARY KEY (conversation_id, user_id)
);
```
Session and presence schema (Redis)
```
-- Per-device routing: device_id → chat server (atomic SET with TTL)
-- Using SET with EX is atomic, no separate EXPIRE needed, avoiding key leak on crash
SET session:device:{device_id} "chat-server-042" EX 30
-- Refreshed every ~10s by client heartbeat; expires naturally on disconnect

-- User's active device set: used to fan out to all a user's connected devices
-- SADD alone does NOT reset the TTL, must use a Lua script to keep it atomic:
--   redis.call("SADD", "user:devices:"..userId, deviceId)
--   redis.call("EXPIRE", "user:devices:"..userId, 86400)
-- This ensures the 24h window resets on every device reconnect, not just the first.
SADD user:devices:{user_id} {device_id}
EXPIRE user:devices:{user_id} 86400   -- run atomically via Lua (see above)

-- Cleanup stale device IDs: enable Redis keyspace notifications (notify-keyspace-events KEA)
-- When session:device:{id} expires, the notification triggers SREM user:devices:{uid} {id}
-- Without this, disconnected device IDs accumulate in the set indefinitely,
-- causing spurious SMEMBERS results and wasted delivery attempts.

-- To deliver to Bob: SMEMBERS user:devices:{bob_id} → list of device_ids
-- For each device_id: GET session:device:{device_id} → server address or nil (offline)

-- Presence: user-level status (not per-device; visible to contacts)
HSET presence:{user_id} status "online"
     last_seen "2026-04-02T14:00:00Z"
EXPIRE presence:{user_id} 60   -- refreshed by any active device's heartbeat
```
Why device-level, not user-level? A user with phone + laptop has two active
WebSocket connections, potentially on two different chat servers. A single
session:user:{id} key would be overwritten by whichever device connected last, making
the earlier device's connection unreachable. Routing at device granularity is the only correct model
for multi-device support.
Extensions: search, reactions, threading
These features are usually addressed as follow-ups rather than in the first design pass, but naming them and sketching their schema impact is a clear L5+ signal.
Message search: Cassandra supports range scans by conversation but not full-text search. For message history search the answer is a separate inverted index: messages are dual-written via a Kafka consumer into a search engine (Elasticsearch), partitioned by user_id so each user can only search their own conversations. This is architecturally distinct from the main message store; name it as an extension rather than designing it in the first pass.
Reactions & threaded replies: both change the data model. Reactions are a new entity: a Cassandra table partitioned by (conversation_id, message_id) with one row per (user_id, emoji). Threaded replies add a parent_message_id column to the messages table; threads are fetched as a range scan filtered to that parent. Neither is complex to sketch; naming them and their schema impact shows completeness at L5+.
Caching strategy
The latency NFR of <500ms p99 from §2 requires that the hot path (message routing) never touches a slow store. Three caching layers serve the system, but they sit on different request paths: the session and presence caches are consulted on every message write; the history cache is consulted only on reads. Conflating them into one diagram is a common mistake that makes the design look simpler than it is.
| Cache layer | What it stores | TTL | Invalidation | Why it exists |
|---|---|---|---|---|
| Session cache | Two structures: session:device:{id} → server address (string); user:devices:{uid} → set of active device IDs | 30s per device key (atomic SET with EX, refreshed on heartbeat); 24h set TTL reset via Lua on reconnect | Device key expires naturally on disconnect; stale device IDs removed via keyspace notification → SREM | Routing lookups happen on every message delivery and must be <1ms. Device-level granularity is required to fan out correctly to all of a user's active devices. |
| Recent message cache | Last 100 messages per conversation | 24 hours | Write-through on new message; evict on TTL | When a user opens a conversation, they see the last N messages. Serving from cache avoids a Cassandra read on every conversation open, the dominant read pattern. |
| Presence cache | user_id → {online, last_seen} | 60s (refresh on heartbeat) | Updated by presence service on heartbeat; expires naturally when the client disconnects and stops sending heartbeats | Presence is read on every conversation-list render but has relaxed accuracy requirements: a user appearing online for up to 60s after disconnecting is an acceptable UX tradeoff (WhatsApp shows "last seen", not a real-time indicator) and avoids fanning out every disconnect event to all contacts. |
Cache invalidation: how do receipt ticks stay consistent?
The messages table is fully immutable; there is no status field to update. Receipt state lives in message_receipts (Cassandra) and is reflected to the sender's UI via two lightweight Redis cursors, one per user per conversation:
-- Last message ID the server knows was delivered to at least one device
SET receipt:delivered:{user_id}:{conv_id} {msg_id}
-- Last message ID the server knows was read on at least one device
SET receipt:read:{user_id}:{conv_id} {msg_id}
When a delivery or read ACK arrives, the server atomically updates the cursor in Redis and writes
to message_receipts in Cassandra. The sender's client polls these cursors (or
receives them as a push frame) to render the correct tick state. This is a single Redis write
per ACK event, not a per-message cache invalidation, regardless of how many messages are in
the recent history cache.
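One subtlety the prose glosses over: ACKs can arrive out of order, so the cursor must only ever move forward. A minimal Python sketch of the compare-and-advance logic (a plain dict stands in for Redis here; in production the same check-then-SET would run in a small Lua script so it stays atomic):

```python
class ReceiptCursors:
    """Per-user, per-conversation receipt cursors (hypothetical sketch).

    A dict stands in for Redis; key names follow the schema above.
    """

    def __init__(self):
        self.store = {}

    def advance(self, kind, user_id, conv_id, msg_id):
        # kind is "delivered" or "read"; msg_id is a monotonically
        # increasing Snowflake-style ID, so numeric comparison gives ordering.
        key = f"receipt:{kind}:{user_id}:{conv_id}"
        current = self.store.get(key)
        if current is None or msg_id > current:
            self.store[key] = msg_id  # cursor only moves forward
            return True
        return False  # late / out-of-order ACK: ignore it
```

A late-arriving delivery ACK for an older message is simply dropped, which is exactly the behavior the tick UI wants.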
The recent message cache (Layer 2) stores message bodies only, which are immutable. It never needs to be invalidated for receipt changes. It is invalidated only when a new message is written (append to the cached list) or on TTL expiry.
For full per-device receipt detail (which specific device delivered or read a message), the client can query message_receipts directly rather than using a single cursor.
Deep-dive: scalability & advanced topics
At WhatsApp scale, each component from the high-level architecture hits a constraint. The deep dives below address the most important ones, the topics that separate L5 from L7 answers.
Scaling stateful chat servers (L5+)
At 500M DAU with 10% concurrently online, that's 50M active WebSocket connections. A single server can hold 50k–100k connections (limited by file descriptor limits and memory), so you need 500–1,000 chat servers minimum, and many more for redundancy and geographic distribution.
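The server-count arithmetic is worth writing down explicitly. A back-of-envelope sketch using the figures above (the per-server capacities are the assumed bounds, not measured numbers):

```python
dau = 500_000_000
concurrency = 0.10                    # assume 10% of DAU online at once
connections = int(dau * concurrency)  # 50M open WebSockets

# Conservative vs optimistic per-server connection capacity.
servers_optimistic = connections // 100_000   # 500 servers
servers_conservative = connections // 50_000  # 1,000 servers
```

This is the floor before redundancy and geographic distribution; real fleets run well above it.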
Scaling out is straightforward: add more chat servers and update the load balancer. The complexity is in scaling in (and in rolling deploys). You can't kill a chat server without disrupting all of its active connections. The solution is graceful draining: the server stops accepting new connections, lets existing ones time out or explicitly closes them, and the clients automatically reconnect to surviving servers. This means your load balancer must support connection draining, and your client must handle reconnect with exponential backoff.
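The client's reconnect backoff can be sketched in a few lines. This uses "full jitter" (delay drawn uniformly from zero up to the exponential ceiling), which spreads a reconnect storm over time instead of synchronizing it; the parameter values are illustrative, not prescribed:

```python
import random

def backoff_delays(base=1.0, cap=60.0, attempts=8):
    """Yield reconnect delays: exponential growth, capped, with full jitter.

    base: first-attempt ceiling in seconds; cap: maximum ceiling;
    attempts: how many retries before giving up.
    """
    for n in range(attempts):
        ceiling = min(cap, base * (2 ** n))
        yield random.uniform(0, ceiling)  # full jitter: uniform in [0, ceiling]
```

A client would sleep for each yielded delay before retrying the WebSocket handshake, resetting the generator once a connection succeeds.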
On reconnect, the client's new chat server writes a session:device:{device_id} key pointing to itself (with EX 30). Any Kafka messages for that device that land on the new server will now find an active socket. The old server's key has already expired or been overwritten, so routing is immediately correct after reconnect.
Group messaging: fan-out on write vs read (L5+)
When Alice sends a message to a group of 100 people, the system must deliver it to 100 recipients. There are two fundamental strategies.
Fan-out on write: when the message is sent, immediately write a copy to each member's inbox (or enqueue a delivery event per member). Each member's delivery is handled independently. This is simple and keeps the read path fast (each member just reads their own inbox), but it's expensive for large groups: one message creates N writes and N delivery events.
Fan-out on read: store the message once, and let each member's client fetch it when they open the group conversation. Much cheaper on write, but it requires a shared cursor per member (which messages have I seen?) and a more complex read path. It also means group members who open the conversation at the same time all hit the same message record: a hot-partition risk in Cassandra.
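The fan-out-on-write path can be sketched as one delivery event per recipient, with the 500-member group cap acting as the guard that keeps the strategy viable (the `enqueue` callback is a stand-in for a Kafka producer; names are illustrative):

```python
FANOUT_WRITE_LIMIT = 500  # group size cap from the functional requirements

def deliver_group_message(group_members, sender_id, msg, enqueue):
    """Fan-out on write: enqueue one delivery event per recipient.

    Beyond FANOUT_WRITE_LIMIT a real system would switch to fan-out
    on read (store once, let clients pull against a per-member cursor).
    Returns the number of delivery events produced.
    """
    recipients = [m for m in group_members if m != sender_id]
    if len(recipients) > FANOUT_WRITE_LIMIT:
        raise ValueError("group too large for fan-out on write")
    for member_id in recipients:
        enqueue({"to": member_id, "msg": msg})  # N writes for one send
    return len(recipients)
```

The write amplification is visible directly: one send becomes `len(recipients)` queued events.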
Multi-device delivery (L6+)
WhatsApp supports multiple linked devices (phone + web + desktop). Each device needs its own delivery state: a message delivered to your phone shouldn't automatically count as delivered to your laptop. The routing abstraction therefore operates at device granularity,
not user granularity: each device registers its own session key
(session:device:{device_id}) pointing to its chat server, and a companion set
(user:devices:{user_id}) lists all of a user's currently active device IDs.
This is the design already reflected in the §7 Redis schema.
When a message arrives for Bob, the system reads SMEMBERS user:devices:{bob_id}
to get all his active device IDs, then for each device looks up its server address and
routes the message there. Each device independently sends a delivery ACK. The "two-tick"
indicator on the sender's side lights up when at least one of the recipient's
devices acknowledges delivery. The "blue tick" lights up when any device marks it read; that read is recorded as a new row in the message_receipts table (§7).
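The device-level routing loop described above can be sketched directly against the two Redis structures from §7 (a dict stands in for Redis, so SMEMBERS/GET collapse to plain lookups; `send` is a stand-in for the Kafka publish to the chat server holding that device's socket):

```python
def route_to_devices(kv, user_id, msg, send):
    """Fan a message out to every active device of `user_id`.

    Returns device IDs with no live session entry; those are offline
    and will pick the message up from Cassandra on their next sync.
    """
    offline = []
    for device_id in kv.get(f"user:devices:{user_id}", set()):
        server = kv.get(f"session:device:{device_id}")
        if server is None:  # session key expired: device is offline
            offline.append(device_id)
        else:
            send(server, device_id, msg)
    return offline
```

Note that "offline" is decided per device, which is exactly why a message can be delivered to your phone while your laptop still shows it unsynced.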
For offline sync: when a device comes online after being offline, it sends its last-seen
message sequence number (last_read_msg_id from the
user_conversations table) to the server. The server fetches all messages in the
user's conversations with msg_id > last_read_msg_id from Cassandra and
pushes them to the reconnecting device. This is the
GET /messages/sync?since={seq} endpoint from §5b.
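The sync endpoint's server-side logic reduces to a cursor-filtered range read per conversation. A sketch, assuming monotonically increasing message IDs so `>` alone expresses "newer than" (`fetch_messages` is a stand-in for the Cassandra range read `WHERE msg_id > after`):

```python
def sync_since(conversations, last_read, fetch_messages):
    """Collect everything a reconnecting device has missed.

    conversations: conv IDs the user belongs to.
    last_read: conv_id -> last_read_msg_id (from user_conversations).
    Returns {conv_id: [messages]} for conversations with new messages.
    """
    missed = {}
    for conv_id in conversations:
        after = last_read.get(conv_id, 0)  # unseen conversation: fetch all
        msgs = fetch_messages(conv_id, after)
        if msgs:
            missed[conv_id] = msgs
    return missed
```

The response is then pushed to the reconnecting device, and its cursors advance as it ACKs.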
Cassandra hot partition problem (L6+)
Cassandra partitions by conversation_id. A conversation with many active members, say, a group of 500 people who are all actively messaging, generates enormous write and read traffic to a single partition on a single Cassandra node. This is the "hot partition" problem: most traffic concentrates on one node, defeating the point of distributing across a cluster.
For most conversations this isn't a problem: traffic is distributed across millions of conversation_ids. But for extremely popular group conversations (100k+ member broadcast channels, which are a different beast from WhatsApp groups), you need a different partition strategy.
Multi-region active-active (L7/L8)
A single-region deployment adds hundreds of milliseconds of latency for users far from the data center, and a regional outage takes down all users globally. The L7/L8 answer requires multi-region deployment.
The key challenge: message ordering across regions. If Alice is in US-East and Bob is in EU-West, and both send messages to the same group conversation simultaneously, which one appears first? Without coordination, each region might apply them in a different order, and users see different conversation histories.
Solutions involve either routing all writes for a conversation to a single "home region" (the conversation leader), which adds cross-region latency for one party, or using a conflict-free replicated data type (CRDT) or vector clock scheme that allows concurrent writes and merges them deterministically. WhatsApp routes connections to the nearest region but uses a globally replicated database (built on top of custom Erlang infrastructure) to resolve ordering.
Contact discovery without leaking non-user data (L6+)
When a new user installs WhatsApp, the app identifies which of their phone contacts are already registered, without revealing to the server which contacts are not registered. Revealing non-matches would expose the user's full address book, including people who chose not to sign up.
The mechanism: the client hashes all contact phone numbers (e.g. SHA-256 of the E.164 normalised number) and sends only the hashes. The server checks each hash against a hash index of all registered phone numbers, and returns only the matching hashes. The client maps those back to names from its own contacts. The server never sees plaintext phone numbers for non-users; it only confirms existence for those in its own registry.
At scale, the hash lookup must be fast: a Bloom filter pre-screens obvious misses before hitting the full index. Be aware that plain hashing is weak privacy on its own, since the phone-number space is small enough to brute-force; production systems look toward private set intersection, but the hash + Bloom filter pattern remains the standard interview answer. Interviewers at L6+ sometimes ask this as a standalone question ("how do you implement contact discovery while preserving privacy?"), so knowing the pattern is valuable on its own.
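A compact sketch of the hash + Bloom filter pattern follows. The Bloom filter here is a deliberately tiny hand-rolled one for illustration (sizes and hash counts are arbitrary), and the "full index" is just a set; false positives from the filter are caught by the exact membership check:

```python
import hashlib

def contact_hash(number: str) -> str:
    """SHA-256 of an E.164-normalised number (sketch only; see the
    brute-force caveat in the text)."""
    return hashlib.sha256(number.encode()).hexdigest()

class BloomFilter:
    """Minimal Bloom filter to pre-screen obvious misses."""

    def __init__(self, bits=1 << 16, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)

    def _positions(self, item):
        for i in range(self.hashes):
            d = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(d[:8], "big") % self.bits

    def add(self, item):
        for p in self._positions(item):
            self.array[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        return all(self.array[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

def discover(contact_hashes, bloom, registered):
    # Bloom screens out most non-users cheaply; the exact index
    # (a set here, a real hash index in production) confirms matches.
    return [c for c in contact_hashes
            if bloom.might_contain(c) and c in registered]
```

The server returns only matching hashes; the client maps them back to names from its own address book.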
End-to-end encryption (E2EE) architecture (L7/L8)
WhatsApp uses the Signal Protocol for E2EE. The fundamental property: the server can see that a message was sent from A to B, but cannot read its contents. The server holds encrypted ciphertext only.
This changes the server-side architecture significantly. The message body stored in Cassandra is ciphertext. The server cannot index, moderate, or search message content. Group fan-out must account for encrypting the message for each recipient's device key separately: the sender's client does this, generating N encrypted payloads for N recipient devices. The server delivers the right ciphertext to the right device without understanding any of it.
The critical infrastructure piece is the key distribution server: each device registers its public key (identity key, signed prekeys, one-time prekeys) with the server. When a sender wants to message a new recipient for the first time, they fetch the recipient's prekeys from the server and establish a shared secret. The server never sees the shared secret; it only distributes public keys.
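The key server's bookkeeping can be modeled without any real cryptography. This sketch models only the structure (keys are opaque bytes; real implementations use X25519 keys per the Signal specification), including the one property worth calling out: one-time prekeys are consumed, each handed out at most once:

```python
from dataclasses import dataclass, field

@dataclass
class PrekeyBundle:
    """What a device publishes to the key distribution server."""
    identity_key: bytes
    signed_prekey: bytes
    one_time_prekeys: list = field(default_factory=list)

class KeyServer:
    """Distributes public key material; never holds any secret."""

    def __init__(self):
        self.bundles = {}

    def register(self, device_id, bundle):
        self.bundles[device_id] = bundle

    def fetch(self, device_id):
        """Return (identity key, signed prekey, one one-time prekey).

        One-time prekeys are popped so each is used for at most one
        session setup; when exhausted, senders fall back to the
        signed prekey alone.
        """
        b = self.bundles[device_id]
        otp = b.one_time_prekeys.pop(0) if b.one_time_prekeys else None
        return b.identity_key, b.signed_prekey, otp
```

Devices periodically replenish their one-time prekey pool; the server only ever sees and serves public halves.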
Failure modes & edge cases
| Scenario | Problem | Solution | Level |
|---|---|---|---|
| Chat server crashes mid-send | Message in memory, not yet persisted or published to Kafka, potential loss | Always persist to Cassandra before ACKing sender. Kafka publish happens after persist, if the server crashes after persist but before Kafka publish, a recovery process reads the message store and re-publishes undelivered messages on restart. | L3/L4 |
| Duplicate message delivery | Network retry or server restart causes the same message to be processed twice | Use client_msg_id as an idempotency key (from §5b API design). Server deduplicates on this key with a short-TTL dedup cache (Redis SETNX). Recipient gets at-most-once display even if delivered twice at the protocol level. | L3/L4 |
| Recipient's chat server goes down during delivery | Message was published to Kafka but the consumer (Chat Server B) is dead, message undelivered | Kafka consumer groups automatically rebalance, another chat server picks up the partition. That server checks the session cache: if the user is now connected to it, delivers immediately. If not, the message stays in Cassandra until the recipient reconnects. | L5 |
| Redis session cache is down | Can't look up which server holds the recipient's socket, can't route messages | Fall back to publish-to-all-servers via broadcast (expensive) or use the message store as the delivery mechanism: write to Cassandra, trigger push notification, let the client poll on reconnect. Route through Kafka and let consumer groups deliver when the client next connects. | L5 |
| Message ordering in group chat | Two members send messages simultaneously, different servers assign different sequence numbers, members see different orderings | Assign sequence numbers from a single authoritative source per conversation (Snowflake ID service or a conversation-level counter in a coordination service like ZooKeeper). Messages with the same timestamp are ordered by server_msg_id. This gives causal ordering, not strict global ordering, which is sufficient for UX. | L5 |
| WebSocket connection storm on reconnect | Regional network event causes millions of clients to reconnect simultaneously, server overload | Clients use exponential backoff with jitter (random delay). The load balancer is configured with connection rate limits per server. New chat servers spin up via auto-scaling to absorb the reconnect wave. Kafka consumer lag is monitored, if it grows, delivery still occurs once connections stabilize. | L7/L8 |
| Cassandra hot partition | Very active group conversation concentrates all reads/writes on one Cassandra node, which becomes a bottleneck | Add a time-bucket dimension to the partition key: PRIMARY KEY ((conversation_id, time_bucket), message_id) where time_bucket = YYYYMMDD_HH. Spreads a hot conversation across 24 partitions per day (likely landing on different nodes). Adds complexity to range queries (must query multiple buckets). Acceptable for conversations flagged as high-traffic. | L7/L8 |
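The Snowflake-style IDs that the ordering row relies on are simple enough to sketch end to end. This is an illustrative layout (41-bit millisecond timestamp, 10-bit worker ID, 12-bit sequence, custom epoch chosen arbitrarily), not any particular production implementation; the property that matters is that IDs from one worker are strictly increasing:

```python
import threading
import time

EPOCH_MS = 1_600_000_000_000  # arbitrary custom epoch for this sketch

class SnowflakeIds:
    """41-bit ms timestamp | 10-bit worker | 12-bit per-ms sequence."""

    def __init__(self, worker_id):
        assert 0 <= worker_id < 1024
        self.worker_id = worker_id
        self.last_ms = -1
        self.seq = 0
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = int(time.time() * 1000) - EPOCH_MS
            if now == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF
                if self.seq == 0:
                    # 4096 IDs issued this millisecond: spin to the next one
                    while now <= self.last_ms:
                        now = int(time.time() * 1000) - EPOCH_MS
            else:
                self.seq = 0
            self.last_ms = now
            return (now << 22) | (self.worker_id << 12) | self.seq
```

Because the timestamp occupies the high bits, sorting by ID sorts by creation time, which is what makes `msg_id > last_read_msg_id` range reads and receipt-cursor comparisons work.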
How to answer by level
The same system design question plays out very differently across levels. The difference isn't just depth; it's the quality of reasoning about tradeoffs and the ability to drive the conversation rather than just respond to it.
L3 / L4: Working system
- Explains the WebSocket connection model clearly; knows why polling doesn't work at scale
- Describes the basic send → persist → deliver flow with delivery receipts
- Identifies the need for a message store and a user database
- Handles offline delivery via push notifications
- Doesn't think about what happens when the server crashes between steps
- Acknowledges sender before persisting (durability mistake)
- Treats group and 1:1 messaging identically without considering fan-out cost
- No awareness of the stateful server routing problem
L5: Tradeoffs and guarantees
- Explains at-least-once vs at-most-once delivery and why idempotency keys matter
- Designs the session cache (Redis) for routing messages across stateful servers
- Distinguishes fan-out on write vs read; picks the right one for group size
- Designs the offline delivery path: messages persist in Cassandra, inbox recency in Redis, push notification wakes offline clients
- Mentions Snowflake IDs for causal ordering
- Doesn't consider multi-device delivery state tracking
- Doesn't think about what happens to in-flight messages during a Redis outage
- Can't describe the full connection lifecycle: establish → heartbeat → drain → reconnect
L6: End-to-end ownership
- Owns the full design including multi-device delivery, device-level session routing, and sync
- Addresses media upload separately (pre-signed S3 URLs) and explains why media must bypass chat servers: routing gigabyte video files through WebSocket servers would saturate their memory and CPU
- Explains the rolling deploy problem unprompted: killing a chat server drops all its active sockets; describes the drain sequence (stop accepting, signal clients, wait for close/timeout) and what happens without it
- Identifies the Cassandra hot partition risk for high-traffic groups and names the time-bucketed partition key as the mitigation
- Mentions E2EE at an architectural level: server holds ciphertext only, key distribution server needed, per-device key pairs
- Treats writes as single-region; doesn't address ordering conflicts when two regions write concurrently to the same conversation
- Doesn't reason about E2EE's implications for group fan-out cost
- Can't discuss the cost model or what to optimize first at 100B msg/day
L7 / L8: Architecture and strategy
- Drives the conversation: defines the problem constraints before the interviewer does
- Designs multi-region with explicit reasoning about the ordering tradeoff (home region vs full active-active)
- Discusses E2EE implications for fan-out, prekey distribution, and group size limits
- Reasons about cost: Kafka egress, Cassandra storage, CDN bandwidth, and what to optimize first
- Anticipates operational concerns: monitoring message delivery lag, Kafka consumer lag SLOs, WebSocket connection churn rate
- Needs prompting for any major component
- Can't reason about the tradeoff without thinking through it in the interview
- Focuses on technology choices without connecting them to requirements
Classic interview probes
| Question | L3/L4 | L5/L6 | L7/L8 |
|---|---|---|---|
| "How do you ensure a message is never lost?" | Persist to a database before responding to sender | Explain the ACK boundary precisely: persist → ACK sender → publish → deliver. Describe what happens on crash at each step. | Full delivery guarantee chain with Kafka consumer group offsets, idempotency keys, and the monitoring strategy to detect lost messages in production. |
| "A user is connected to Server 1. Their friend is on Server 2. How does the message get delivered?" | Mentions some kind of "routing" but doesn't have a concrete mechanism | Session cache (Redis): Server 1 reads SMEMBERS user:devices:{friend_id} to get all active device IDs. For each device it does GET session:device:{id} to find the target server. It publishes a delivery event to Kafka; each target server's Kafka consumer finds the open socket and pushes the frame. If a device has no session entry it is offline; the message stays in Cassandra until that device reconnects and syncs. | Full lifecycle including what happens when the session cache is stale (Server 2 holds an expired entry), how to detect and recover, and the consistency model of the routing table. |
| "How would you scale this to 1 billion users?" | Add more servers and a load balancer | Horizontal scaling of each tier independently; Cassandra's peer-to-peer scaling model; Kafka partition scaling; Redis cluster mode. | Multi-region topology, cost modeling across tiers, which tier hits its limit first at 1B DAU (hint: WebSocket connection count), and the resulting architectural changes. |
| "How would you implement typing indicators?" | Send a "typing" event message through the normal message flow | Typing events are ephemeral, they should NOT go through the durable message store. Route through the presence service (Redis pub/sub) with a short TTL. The recipient's client stops showing "typing..." after 5–10s if no new event arrives. | At scale, typing indicators from large groups create significant pub/sub traffic. Discuss throttling (only broadcast typing indicator once per 2–3 seconds per user), and the decision to not persist them at all vs persist with very short TTL. |
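The throttling answer in the last row can be made concrete in a few lines. A sketch of a per-(user, conversation) throttle (the 2.5s interval is an assumed value; the clock is injected so the logic is testable):

```python
import time

class TypingThrottle:
    """Suppress repeat typing broadcasts: at most one per `interval`
    seconds per (user, conversation) pair."""

    def __init__(self, interval=2.5, clock=time.monotonic):
        self.interval = interval
        self.clock = clock
        self.last_sent = {}

    def should_broadcast(self, user_id, conv_id):
        key = (user_id, conv_id)
        now = self.clock()
        if now - self.last_sent.get(key, float("-inf")) >= self.interval:
            self.last_sent[key] = now
            return True  # forward this typing event to the pub/sub channel
        return False     # drop it; recipients still show "typing..."
```

Dropped events are harmless because the recipient's UI times out the indicator on its own after 5–10s of silence.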
How the pieces connect
Every major architectural decision traces back to a requirement or capacity number established earlier in the article.
1. Durability NFR (§2), "no message loss after server ACK" → synchronous Cassandra write before ACKing sender (§6 core flow) → Snowflake IDs for deterministic ordering (§7 data model)
2. Sub-500ms latency NFR (§2) → WebSocket persistent connections (§5 protocol) → Redis session cache on the hot routing path (§8 caching) → stateful chat server pool that cannot be replaced with stateless HTTP handlers (§4 architecture)
3. 50M concurrent connections at default DAU (§3 estimator) → 500–1,000 chat servers minimum at 50k–100k connections each (adjusts dynamically with the estimator slider) → rolling deploys cannot simply kill servers, since each holds active sockets → connection draining sequence required (§9 scalability)
4. Group messaging up to 500 members (§2 functional requirements) → fan-out on write is viable at this scale → Kafka consumer groups for fan-out (§4, §9) → Cassandra hot partition risk for very active groups requires time-bucketed partitioning (§10 failure modes)
5. Offline delivery requirement (§2) → messages persist durably in Cassandra; Redis sorted set tracks conversation recency for the inbox → push notification (APNs/FCM) to wake offline clients (§4 architecture) → sync endpoint returning messages since last-seen sequence number (§5b API)
6. Multi-device support (§2) → device-level routing (session:device:{id} keys in Redis §7) instead of user-level → fan-out to all active devices on message receive (§9 deep-dive) → E2EE requires per-device key pairs → key distribution server architecture → group size limits reflect encryption fan-out cost; receipt state tracked per-recipient in message_receipts table (§7 data model)
- Rate Limiter System Design — atomic Redis operations, distributed race conditions, and multi-tier quota enforcement
- URL Shortener System Design — hash-based ID generation, Redis caching strategy, and async analytics pipeline
- Web Crawler System Design — Bloom filter deduplication, politeness throttling, and distributed frontier design
- Twitter Feed System Design — fan-out write amplification, hybrid push/pull strategy, and celebrity threshold design
- Notification Service System Design — multi-channel delivery, idempotency keys, and priority queues at scale
- Key-Value Store System Design — consistent hashing, eviction policies, replication, and failure modes
- Search Autocomplete