Design a Chat System
like WhatsApp
Simple to describe, brutally hard to scale reliably. 100 billion messages a day delivered in under a second, to 2 billion users, across flaky mobile networks.
What the interviewer is testing
A chat system sits at an interesting intersection: it combines the real-time push requirements of a streaming system, the durability requirements of a transactional store, and the fan-out requirements of a social network. Asking you to "design WhatsApp" is a way of probing whether you understand all three, and how they conflict.
The core tension the interviewer wants to see you navigate is reliability vs latency. Every time a message is sent, you must decide: do you acknowledge to the sender before or after you durably persist the message? Before or after the recipient's device confirms receipt? These are not the same moment, and mixing them up is the single most common mistake in this interview.
| Level | What they're listening for |
|---|---|
| L3/L4 | Can you explain the basic send-receive flow using WebSockets? Do you understand why polling doesn't scale? |
| L5 | Can you reason about delivery guarantees, message ordering, and offline delivery queuing? Do you know the tradeoffs between fan-out on write vs read? |
| L6 | Can you own the full design end-to-end, including multi-device sync, media storage, and group messaging at scale? |
| L7/L8 | Do you think about multi-region active-active, end-to-end encryption implications on server-side fan-out, and the cost vs latency tradeoffs at 100B msg/day? |
Requirements clarification
Before drawing any boxes, establish what you're building. A chat system can mean a WhatsApp-style mobile messenger, a Slack-style workspace tool, or a Twitch-style live chat. The architecture diverges significantly. These are the questions that matter most.
Functional requirements
| Requirement | Decision |
|---|---|
| 1-to-1 messaging | ✅ Core requirement |
| Group messaging | ✅ Up to 500 members per group |
| Online presence / last seen | ✅ Required |
| Message delivery receipts (sent / delivered / read) | ✅ Three-state: ✓ ✓✓ ✓✓ (blue) |
| Media messages (images, video, audio) | ✅ Via CDN, separate upload flow |
| Message history / sync across devices | ✅ Multi-device support |
| Push notifications (offline users) | ✅ Via APNs / FCM |
| End-to-end encryption | ⚠️ Mention at L5+; deep-dive at L7/L8 |
| Voice / video calls | ❌ Out of scope (different protocol entirely, WebRTC) |
Non-functional requirements
| NFR | Target | Why this number |
|---|---|---|
| Message delivery latency (p99) | < 500 ms | Human perception threshold for "instant" conversation |
| Availability | 99.99% (52 min/year downtime) | Messaging is real-time; even brief outages disrupt live conversations |
| Durability | No message loss after server ACK (in steady state) | Users expect sent messages to arrive, data loss destroys trust |
| Message ordering | Causal order within a conversation | Out-of-order messages make conversations incomprehensible |
| Scale | 2B users, 100B messages/day | WhatsApp scale, drives the entire horizontal scaling story |
| Offline delivery | Messages queue until recipient reconnects | Mobile users are frequently offline; messages must survive |
NFR reasoning
Sub-500ms p99 delivery latency
500ms is the rough threshold below which a conversation feels live. Above it, users notice the lag and conversations feel sluggish. The p99 (not average) matters because a consistently fast average with a long tail still breaks the experience for 1 in 100 messages.
Durability: no message loss after server ACK
The ACK boundary matters enormously. If the server acknowledges receipt to the sender's device before persisting the message, a server crash between those two moments causes silent data loss: the sender thinks the message was sent, but it's gone. The server must persist durably before sending the ACK. Note that "no message loss after server ACK" is a steady-state guarantee: a simultaneous failure of two or more Cassandra replicas within the write commit window is treated as a known, rare failure mode, not accepted as a design shortcut.
Causal ordering within a conversation
Strict global ordering across all conversations is impossibly expensive at this scale (it would require a distributed consensus round for every message). Causal ordering within a single conversation (if message A happens before B, every device sees A before B) is sufficient and much cheaper to achieve.
Capacity estimation
Chat systems have an asymmetric read/write ratio compared to most other systems. The write QPS is high (every message sent is a write), but the read QPS is even higher because every message delivered to a recipient is also a read, and in group chats, one write can fan out to many reads.
The other unusual dimension is storage. Messages are mostly small text payloads, but retention is long and the total volume compounds fast. Use the estimator to build intuition before a live interview.
Interactive capacity estimator
Key architectural implication: At 500M DAU with 10% peak concurrency, there are ~50M concurrent WebSocket connections. At ~50–100k connections per chat server, that requires 500–1,000 chat servers, scaling to 2,000–4,000 at 2B DAU. This is why chat servers must be stateful connection holders; they cannot be replaced with stateless HTTP servers on the hot path. The connection registry (which server holds which device's socket) becomes critical infrastructure.
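The back-of-envelope arithmetic behind these counts is worth being able to reproduce on the spot. A quick sketch — the concurrency and per-server figures are the stated assumptions above, not measured values:

```python
# Back-of-envelope check for the connection and server counts quoted above.
# Assumptions (illustrative, not from a real deployment): 10% peak
# concurrency, 50k-100k sockets per chat server.

def servers_needed(dau: int, peak_concurrency: float, conns_per_server: int) -> int:
    """Number of chat servers needed to hold peak WebSocket connections."""
    concurrent = int(dau * peak_concurrency)
    # Round up: a partially full server is still a server.
    return -(-concurrent // conns_per_server)

low = servers_needed(500_000_000, 0.10, 100_000)   # optimistic: 100k conns/server
high = servers_needed(500_000_000, 0.10, 50_000)   # conservative: 50k conns/server
print(low, high)   # 500 1000
```

At 2B DAU the same arithmetic gives 2,000–4,000 servers before adding redundancy headroom.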
High-level architecture
The overall shape of a chat system is a message broker with persistent WebSocket connections on both ends. Every other component exists to support the three core guarantees: low-latency delivery, durability, and offline queueing.
Component breakdown
Load Balancer routes incoming WebSocket upgrade requests to a chat server. Unlike HTTP load balancers that distribute each request independently, WebSocket connections are long-lived: once established, all subsequent frames from a client go to the same server. This means the load balancer only participates in connection establishment, not in ongoing message delivery.
Chat Servers are the core of the system, each maintains thousands of open WebSocket connections. When a message arrives on a sender's socket, the server must (a) persist it, (b) route it to the recipient. Chat servers are inherently stateful because they hold open sockets. This is the primary scalability challenge: you can't freely scale them horizontally without a way to route messages across servers.
Session Cache (Redis) is an in-memory key-value store that maps each device ID
to whichever chat server holds that device's active WebSocket connection. This is a device-level
mapping, not a user-level one: a user with a phone, tablet, and laptop has three separate routing
entries. A companion set per user (user:devices:{user_id}) tracks all of a user's active
device IDs so fan-out can find them in a single lookup. This routing table is what lets chat servers
talk to each other: when delivering to Bob, the sender's server looks up Bob's device set, resolves each
device to its server, and routes accordingly.
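The lookup sequence can be sketched with plain dicts standing in for the two Redis structures (user:devices:{user_id} and session:device:{device_id}); all IDs and server names here are illustrative:

```python
# Sketch of the fan-out routing lookup described above. In production these
# dicts are Redis structures: SMEMBERS on the device set, then a GET per
# device key. All names are illustrative.

user_devices = {"bob": {"bob-phone", "bob-laptop", "bob-tablet"}}
device_server = {"bob-phone": "chat-server-042", "bob-laptop": "chat-server-107"}
# bob-tablet has no entry: its session key expired, i.e. the device is offline.

def route_to_user(user_id: str) -> dict:
    """Resolve each of a user's devices to the chat server holding its socket.

    A value of None means no live connection: that device gets a push
    notification instead of a WebSocket frame.
    """
    return {dev: device_server.get(dev) for dev in user_devices.get(user_id, set())}

routes = route_to_user("bob")
# Two devices resolve to servers; the tablet resolves to None (offline path).
```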
Message Queue (Kafka), a durable distributed log, decouples the send path from the delivery path. When a message is received, the chat server publishes it to a topic in the queue. This gives us two things: (1) durability even if a chat server crashes before delivery, and (2) a natural fan-out mechanism for group messages: one write to the queue, and multiple consumer groups pick it up.
Message Store (Cassandra), a wide-column store, persists all messages durably. Messages are partitioned by conversation ID and clustered by message sequence number, which gives fast range scans over a conversation's history. Cassandra's write-optimized storage engine (an LSM tree) handles the high write QPS naturally.
Presence Service tracks which users are online, their last seen time, and typing indicators. It is built on a pub/sub in-memory store (Redis Pub/Sub) and intentionally separated from the message path because presence updates are extremely high frequency (heartbeats every 5–10 seconds from every connected client) and have different durability requirements: a stale presence indicator is annoying, but it is not data loss.
Push Notification Service handles offline delivery. If a recipient has no active WebSocket connection, the system routes a notification through third-party mobile push platforms (Apple Push Notification service for iOS, Google's Firebase Cloud Messaging for Android). The notification wakes the client, which then establishes a WebSocket and fetches missed messages.
Object Store (S3 + CDN), a cloud object store fronted by a content delivery network, handles media. Images, videos, and audio files are uploaded directly to S3 via a pre-signed URL; they never pass through the chat servers. The chat message itself only contains a reference (URL) to the media object.
Architectural rationale
Why stateful chat servers instead of stateless HTTP handlers? (Connectivity model)
WebSocket connections require the server to maintain open file descriptors for the lifetime of the connection, sometimes hours. A stateless request-response model (HTTP) would require the client to poll every few hundred milliseconds, which at 500M DAU creates an enormous polling QPS and introduces the latency of a full polling cycle for every message.
Why Kafka between chat servers and the message store? (Decoupling)
Without a message queue, the chat server would write synchronously to Cassandra and then
attempt delivery. If the chat server crashes after writing but before delivery, the
recipient never gets the message (or gets it only after fetching history). Once a message is
in Kafka, it will reach a consumer at least once, even if a chat server crashes partway
through processing. The client_msg_id idempotency key at the application layer
converts that at-least-once guarantee into exactly-once delivery at the recipient's device.
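A minimal sketch of that dedup step, with a dict standing in for what would be a TTL'd Redis key in production (all names are illustrative):

```python
# Idempotent-consumer sketch: Kafka may hand the same record to the consumer
# more than once, so delivery is keyed on client_msg_id. The seen-set is a
# dict here; production would use a store with a TTL (e.g. Redis SET NX EX).

delivered = {}        # client_msg_id -> server_msg_id already processed
delivery_log = []     # stand-in for frames actually pushed to the recipient

def deliver_once(client_msg_id: str, server_msg_id: int, body: str) -> bool:
    """Deliver unless this client_msg_id was already processed."""
    if client_msg_id in delivered:
        return False                  # duplicate redelivery: drop silently
    delivered[client_msg_id] = server_msg_id
    delivery_log.append(body)         # stand-in for the WebSocket push
    return True

assert deliver_once("uuid-abc123", 1000042, "Hey!") is True
assert deliver_once("uuid-abc123", 1000042, "Hey!") is False  # Kafka redelivery
assert delivery_log == ["Hey!"]       # recipient saw the message exactly once
```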
Why Cassandra for message storage? (Data store choice)
Messages have a very clear access pattern: you almost always read a conversation's recent messages in order. This maps perfectly to Cassandra's wide-column model: partition key = conversation_id, clustering key = message_id (time-ordered). A range scan retrieves the last N messages in a single sequential read. Cassandra also handles high write throughput naturally; its LSM-tree storage engine is write-optimized.
Real-world comparison
| Decision | This design | WhatsApp | Slack |
|---|---|---|---|
| Connection protocol | WebSocket | Custom XMPP over WebSocket | WebSocket |
| Message store | Cassandra | Mnesia → Cassandra | MySQL (sharded) |
| Server language | Any | Erlang (massive concurrency) | Go / Java |
| Group fan-out | Kafka consumer groups | Server-side fan-out, 256 cap | Fan-out + channel subscriptions |
| Media delivery | S3 + CDN pre-signed URLs | Proprietary CDN | AWS S3 + CloudFront |
WhatsApp's choice of Erlang (and later Elixir) is architectural: the BEAM virtual machine supports millions of lightweight processes, each maintaining a user's connection state. This lets a single server hold far more concurrent WebSocket connections than a thread-per-connection model. The lesson: the right tool for your concurrency model matters as much as the right tool for your storage model.
Connection protocol: WebSockets vs polling
The choice of connection protocol is the first fundamental design decision. Every other latency and throughput number in this system depends on it. There are three main options worth understanding.
Our choice: WebSockets. For a bidirectional chat system, WebSockets are the right model. They establish a single persistent TCP connection through which both client and server can send frames at any time, with minimal per-frame overhead (a few bytes of framing, versus full headers on every HTTP request). The initial handshake adds a small one-time cost; every message after that travels nearly overhead-free.
Code: WebSocket connection lifecycle on the client
```javascript
// Client establishes WebSocket connection
const ws = new WebSocket('wss://chat.example.com/ws');

ws.onopen = () => {
  // Authenticate immediately after connection
  ws.send(JSON.stringify({ type: 'auth', token: jwtToken }));
};

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  if (msg.type === 'message') {
    displayMessage(msg);
    // ACK delivery back to server
    ws.send(JSON.stringify({ type: 'ack', msgId: msg.id }));
  }
};

ws.onclose = (event) => {
  // Exponential backoff reconnect
  scheduleReconnect(event.code);
};
```
Common probe: "What happens when a WebSocket connection drops?" The client must reconnect with exponential backoff. On reconnect, it sends its last-seen message sequence number so the server can replay any missed messages from the message store. This is why sequence numbers per conversation are critical: they define exactly where the client's view of the conversation ends.
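The replay step is a simple high-water-mark filter. A sketch with an in-memory list standing in for the Cassandra partition (IDs and bodies are illustrative):

```python
# Replay-on-reconnect sketch: the client presents the last server_msg_id it
# has seen, and the server range-scans the conversation for anything newer
# (the same query shape the message store serves).

conversation_log = [  # stand-in for the messages partition, ordered by id
    {"server_msg_id": 1000040, "body": "on my way"},
    {"server_msg_id": 1000041, "body": "be there in 10"},
    {"server_msg_id": 1000042, "body": "Hey, are you free tonight?"},
]

def replay_since(last_seen: int) -> list:
    """Return every message the client missed while disconnected."""
    return [m for m in conversation_log if m["server_msg_id"] > last_seen]

missed = replay_since(1000040)   # the client reconnects knowing up to 1000040
# missed contains the two messages with ids 1000041 and 1000042
```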
API design
Chat systems have two distinct API layers: the WebSocket protocol (real-time in-session) and a REST API (account management, history fetch, media upload). The WebSocket protocol handles the hot path; REST handles everything else.
WebSocket message protocol (hot path)
```
// Client → Server: Send message
{
  "type": "message.send",
  "client_msg_id": "uuid-abc123",    // idempotency key, client-generated
  "conversation_id": "conv-xyz",
  "content_type": "text",            // or "image", "audio", "video"
  "body": "Hey, are you free tonight?",
  "timestamp_client": 1712000000000  // ordering hint only, not authoritative
}

// Server → Client: ACK (sent receipt, one tick ✓)
{
  "type": "message.ack",
  "client_msg_id": "uuid-abc123",
  "server_msg_id": 1000042,          // authoritative sequence number
  "status": "sent"
}

// Server → Recipient: Message delivery (two ticks ✓✓ on delivery)
{
  "type": "message.receive",
  "server_msg_id": 1000042,
  "conversation_id": "conv-xyz",
  "sender_id": "user-alice",
  "body": "Hey, are you free tonight?",
  "timestamp_server": 1712000001234
}

// Recipient → Server: Delivery receipt (triggers two-tick on sender)
{ "type": "message.delivered", "server_msg_id": 1000042 }

// Recipient → Server: Read receipt (triggers blue ticks)
{ "type": "message.read", "conversation_id": "conv-xyz", "up_to_msg_id": 1000042 }
```
REST API (account management & history)
| Endpoint | Method | Purpose | Level |
|---|---|---|---|
| /messages/media/upload-url | POST | Get pre-signed S3 URL for media upload | L3/L4 |
| /conversations/{id}/messages | GET | Fetch history with cursor pagination | L3/L4 |
| /conversations | POST | Create 1:1 or group conversation | L3/L4 |
| /users/{id}/presence | GET | Get online status and last seen | L5 |
| /devices/register | POST | Register new device for multi-device delivery | L5 |
| /messages/sync?since={seq} | GET | Catch up on missed messages after reconnect | L6 |
The client_msg_id field is the idempotency key. If a client sends a message and the
WebSocket drops before receiving the ACK, the client retransmits with the same
client_msg_id. The server deduplicates on this key, so the message appears exactly once
in the conversation even if transmitted multiple times.
Security & rate limiting (L5+)
A chat system has three distinct abuse surfaces, each requiring a different mitigation. Interviewers at L6+ frequently probe on at least one of these after the core design is established.
WebSocket authentication on upgrade (Who can connect?)
WebSocket connections are long-lived, so authentication cannot rely on per-request session
cookies the way HTTP does. Sending a token as a URL query parameter
(?token=...) is a common but dangerous pattern: query parameters appear in
nginx access logs, CDN edge logs, and browser history, making the token semi-public. The
correct production pattern is the first-message auth handshake: the
WebSocket connection is established without credentials, and the client sends its JWT as the
very first message frame before any application traffic. The server holds the connection in
an authenticating state for up to 5 seconds; if no valid auth frame arrives, it
closes the connection. This keeps tokens out of all log surfaces.
The JWT itself should be short-lived (15-minute expiry). Before it expires, the client
fetches a new JWT via a REST call and sends it over the open WebSocket as a
token.refresh frame, no reconnect needed. The server validates and updates the
session's expiry in-place.
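The handshake can be modeled as a small state machine. A sketch — the 5-second window comes from above; validate_jwt is a placeholder for real signature and expiry verification:

```python
# First-message auth handshake sketch: the connection sits in an
# "authenticating" state and is closed if the first frame is not a valid
# auth frame, or if the window elapses.

AUTH_WINDOW_SECONDS = 5.0

def validate_jwt(token: str) -> bool:
    # Placeholder: real code verifies signature, issuer, and expiry.
    return token == "valid-jwt"

class Connection:
    def __init__(self, now: float):
        self.state = "authenticating"
        self.deadline = now + AUTH_WINDOW_SECONDS

    def on_frame(self, frame: dict, now: float) -> str:
        if now > self.deadline:           # auth window elapsed
            self.state = "closed"
        elif self.state == "authenticating":
            if frame.get("type") == "auth" and validate_jwt(frame.get("token", "")):
                self.state = "open"       # application traffic may now flow
            else:
                self.state = "closed"     # first frame must be a valid auth frame
        return self.state

conn = Connection(now=0.0)
assert conn.on_frame({"type": "auth", "token": "valid-jwt"}, now=1.0) == "open"

late = Connection(now=0.0)
assert late.on_frame({"type": "auth", "token": "valid-jwt"}, now=6.0) == "closed"
```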
Per-user message rate limiting (Spam prevention)
Without rate limiting, a compromised account or bot can flood conversations with thousands of
messages per second, degrading the system for everyone. Enforce a per-user message rate
limit at the chat server using a token bucket stored in an in-memory store (Redis): each
user has a bucket with a capacity of N messages per minute (e.g. 60), refilled at 1/second.
If the bucket is empty, the server responds with a rate_limited WebSocket frame
and does not persist or forward the message.
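A minimal in-memory version of that token bucket; production would keep the bucket state in Redis, typically updated via a Lua script for atomicity:

```python
# Token-bucket sketch matching the numbers above: capacity 60, refill 1/s.
# Only the accounting is shown; the caller sends a rate_limited frame when
# allow() returns False.

class TokenBucket:
    def __init__(self, capacity: int = 60, refill_per_sec: float = 1.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill based on elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False    # bucket empty: reject, do not persist or forward

bucket = TokenBucket()
burst = [bucket.allow(now=0.0) for _ in range(61)]
assert burst.count(True) == 60 and burst[-1] is False  # burst capped at 60
assert bucket.allow(now=1.0) is True                   # one token refilled after 1s
```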
Media upload abuse prevention (Storage cost)
Pre-signed S3 URLs allow clients to upload directly to object storage, which means the
server must control what it signs for. Two mitigations: (1) enforce a max file size on the
pre-signed URL (S3 supports content-length-range in the policy), so a client
cannot upload a 10 GB file using a URL issued for a 25 MB limit; (2) limit pre-signed URL
issuance per user per hour to prevent a bot from generating thousands of upload slots. Both
checks happen in the REST handler before issuing the URL, and cost nothing in S3 bandwidth.
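A sketch of the issuing handler's two checks. The conditions list mirrors the shape of an S3 POST policy's content-length-range entry; the quota numbers and counter store are illustrative (production would use a Redis counter with a one-hour expiry):

```python
# Pre-sign guard sketch: quota check, then build the size-limited policy
# that would be handed to the object store's pre-sign call.

MAX_UPLOAD_BYTES = 25 * 1024 * 1024   # 25 MB cap baked into the signed policy
MAX_URLS_PER_HOUR = 20                # illustrative per-user issuance quota

issued_this_hour = {}                 # user_id -> count; Redis + TTL in production

def presign_request(user_id: str, key: str):
    """Return the upload policy to sign, or None if the user is over quota."""
    count = issued_this_hour.get(user_id, 0)
    if count >= MAX_URLS_PER_HOUR:
        return None                   # REST handler translates this to a 429
    issued_this_hour[user_id] = count + 1
    return {
        "key": key,
        # The signed policy itself rejects oversized uploads at the object
        # store, so the limit holds even if the client ignores it.
        "conditions": [["content-length-range", 1, MAX_UPLOAD_BYTES]],
    }

req = presign_request("user-alice", "media/abc.jpg")
assert req is not None and req["conditions"][0][2] == MAX_UPLOAD_BYTES
```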
A post-upload scan can additionally reject a non-compliant object after the fact, deleting it and sending a media.rejected event back to the conversation.
Core message flow
The critical design question in the message flow is: when do we acknowledge to the sender? This drives everything else. The durability NFR from §2 says we must persist before ACKing, so the key architectural choice is persisting fast enough to stay within the 500ms p99 budget.
Two-path write design: the chat server has two sequential jobs when a message arrives, persist it durably to Cassandra (step 2), then publish it to Kafka for fan-out delivery (step 4). Kafka is on the delivery path, not the persistence path. Durability comes from Cassandra; Kafka provides the async delivery mechanism to all recipient devices.
The key structural decision is that steps 1–3 form the synchronous critical path. The sender gets their ACK only after the message is durably persisted (step 2), satisfying our durability NFR from §2. Everything after step 3 is async: the delivery to the recipient, the two-tick notification, and the read receipt.
Common mistake: ACKing the sender before persisting. If the server crashes between ACK and persist, the sender thinks the message was delivered but it's gone. The durability NFR from §2 must not be violated structurally: a simultaneous multi-replica failure in the commit window can still cause loss in an extreme scenario, but loss should never be an intentional design shortcut. The latency cost of synchronous persist is worth it; Cassandra quorum writes at RF=3 typically complete in 5–30ms depending on node locality, well within the 500ms p99 budget.
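The required ordering is easy to encode and test. A sketch with stand-in functions for the three steps:

```python
# Ordering sketch for the send path: durable persist first, then ACK to the
# sender, then async publish for delivery fan-out. The three callables are
# stand-ins; the point is that ack() can never precede persist().

events = []

def persist(msg): events.append("persist")   # quorum write, ~5-30 ms
def ack(msg):     events.append("ack")       # one tick on the sender's screen
def publish(msg): events.append("publish")   # queue publish: fan-out begins

def handle_send(msg: dict) -> None:
    persist(msg)   # durability first: a crash before this loses nothing
    ack(msg)       # only now may the sender believe the message is safe
    publish(msg)   # everything after the ACK is async delivery work

handle_send({"client_msg_id": "uuid-abc123", "body": "hi"})
assert events.index("persist") < events.index("ack") < events.index("publish")
```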
Data model
Before writing any schema, identify the entities and how they'll be accessed. Chat systems have a small number of entity types but very different access patterns across them, which is why a single database rarely serves all of them well.
Entities and access patterns
| Entity | Primary access pattern | Query shape | Best store |
|---|---|---|---|
| Messages | Fetch recent N messages in a conversation; fetch messages after seq X (sync) | Range scan by (conversation_id, msg_id DESC) | Cassandra |
| Conversations (inbox) | List a user's conversations sorted by most-recent message; update sort position on every new message | Redis sorted set ZREVRANGE (score = last_msg_ts); per-conversation metadata in Cassandra | Redis + Cassandra* |
| Users | Lookup by user_id or phone; auth; profile | Point lookup + join with contacts | PostgreSQL |
| Group Membership | Who is in group X? Which groups does user Y belong to? | Two directions: by group_id, by user_id | PostgreSQL |
| Presence | Is user X online right now? Last seen? | Point lookup, very high write frequency | Redis |
| Session / routing | Which chat server holds user X's WebSocket? | Point lookup on hot path | Redis |
Two things jump out from this table. First, messages are the dominant entity by volume and by read/write frequency, and their access pattern (range scan by conversation + time order) is a perfect fit for Cassandra's partition + clustering key model. Second, presence and session data need sub-millisecond reads on the hot path, making in-memory storage (Redis) the only viable choice.
The store assignments follow directly from access pattern constraints. Messages and receipts go to Cassandra: immutable writes, fast range reads. The inbox list goes to Redis: it needs atomic score updates on every new message without generating tombstones. User and group metadata goes to PostgreSQL: it needs referential integrity and joins. Presence and session state go to Redis too, but for a different reason: sub-millisecond point lookups on the write hot path, not just read performance. The schemas below follow from these choices.
Message store schema (Cassandra)
```sql
-- Core message record: immutable after write, no mutable status field
-- Delivery/read state is tracked separately in message_receipts
CREATE TABLE messages (
    conversation_id UUID,
    message_id      BIGINT,     -- Snowflake ID: sortable + time-embedded
    sender_id       UUID,
    content_type    TEXT,       -- 'text' | 'image' | 'audio' | 'video'
    body            TEXT,       -- Encrypted ciphertext at L7/L8
    media_url       TEXT,       -- S3 URL if content_type != 'text'
    created_at      TIMESTAMP,
    PRIMARY KEY ((conversation_id), message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
```
Why Snowflake IDs instead of UUIDs for message_id?
Version-4 UUIDs are random; they don't sort by time. Cassandra sorts rows within a partition by the clustering key, so if message_id were a random UUID, messages would come out in arbitrary order and need sorting in application code. Snowflake IDs (64-bit integers with a millisecond timestamp in the high bits) sort by creation time natively: range scans return messages in chronological order with no application-side sort. They're also half the size of a UUID (8 bytes vs 16), which matters at 100 billion messages. (Cassandra's timeuuid is the time-sortable exception, but it keeps the 16-byte cost.)
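A minimal Snowflake-style generator, assuming an illustrative 41/10/12 bit layout and a custom epoch (not WhatsApp's actual scheme):

```python
import threading
import time

# Minimal Snowflake-style ID generator matching the description above:
# 41 timestamp bits | 10 worker bits | 12 sequence bits. Layout and epoch
# are illustrative assumptions.

EPOCH_MS = 1_600_000_000_000   # custom epoch keeps the 41-bit field roomy

class Snowflake:
    def __init__(self, worker_id: int):
        assert 0 <= worker_id < 1024          # must fit in 10 bits
        self.worker_id = worker_id
        self.seq = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000) - EPOCH_MS
            if now == self.last_ms:
                # Same millisecond: bump the 12-bit sequence. A production
                # generator waits for the next ms when the sequence wraps,
                # and handles clock skew; both are omitted here.
                self.seq = (self.seq + 1) & 0xFFF
            else:
                self.seq = 0
                self.last_ms = now
            return (now << 22) | (self.worker_id << 12) | self.seq

gen = Snowflake(worker_id=42)
a, b = gen.next_id(), gen.next_id()
assert a < b                      # later IDs always sort after earlier ones
assert (a >> 12) & 0x3FF == 42    # worker id recoverable from the middle bits
```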
```sql
-- Delivery and read receipts: one row per (message, recipient)
-- Works for both 1:1 (1 recipient) and group (N recipients)
CREATE TABLE message_receipts (
    conversation_id UUID,
    message_id      BIGINT,
    recipient_id    UUID,       -- user_id of the recipient
    status          TEXT,       -- 'sent' | 'delivered' | 'read'
    updated_at      TIMESTAMP,
    PRIMARY KEY ((conversation_id, message_id), recipient_id)
);

-- Access pattern A: all receipts for one message (sender's tick state)
--   SELECT * FROM message_receipts WHERE conversation_id=? AND message_id=?
--   → iterate rows: "Delivered" if any status='delivered'; "Read" if any status='read'
-- Access pattern B: point lookup (did this specific user receive this message?)
--   SELECT * FROM message_receipts WHERE conversation_id=? AND message_id=? AND recipient_id=?
-- Cross-conversation unread count is NOT queried from here.
-- It is derived from last_read_msg_id in user_conversation_meta (see below).

-- Inbox (conversation list sorted by recency) is NOT a Cassandra table.
-- Reason: using last_message_at as a clustering key means updating it on every
-- new message requires a delete + insert (Cassandra can't mutate clustering keys).
-- At 100B messages/day this generates a continuous tombstone storm. Use Redis instead:
--   Redis sorted set: inbox per user, score = last_message_unix_ms, member = conv_id
--   ZADD inbox:{user_id} {last_msg_ts_ms} {conversation_id}
--   ZREVRANGE inbox:{user_id} 0 19 WITHSCORES → top 20 convs by recency (O(log N))
--   Updating recency on new message: ZADD overwrites the score atomically, no tombstones

-- Per-conversation metadata (name, type, last_read cursor) stored separately:
CREATE TABLE user_conversation_meta (
    user_id          UUID,
    conversation_id  UUID,
    last_read_msg_id BIGINT,    -- cursor: unread = msgs with id > this value
    joined_at        TIMESTAMP,
    PRIMARY KEY ((user_id), conversation_id)
);
-- This table is append-write-only for the cursor (UPSERT on read event); no sort mutation.
```
Why a separate message_receipts table?
A single status column on the messages table fails for two reasons.
First, in a group of 500 members, "delivered" is per-recipient, one column can't represent
500 independent delivery states. You'd need a row per recipient anyway, which is exactly
what message_receipts provides.
Second, repeatedly updating a non-clustering column in Cassandra writes a new cell version on every update (last-write-wins reconciliation, not an in-place mutation). At 100B messages/day with delivery + read events per message, that overwrite churn continuously burdens compaction and reads. The receipt table avoids it entirely: each delivery event is an upsert of a row keyed by recipient, written once per state transition, with no hot, repeatedly-overwritten cells.
Reading tick state back out: SELECT * FROM message_receipts WHERE conversation_id=? AND message_id=?, then iterate rows, "delivered" if any status='delivered'. For a per-device check, add AND recipient_id=? for a point lookup. The cross-conversation unread count comes from the last_read_msg_id cursor in user_conversation_meta, not from this table.
Why Redis for the inbox, not Cassandra?
The inbox is a list of conversations sorted by most-recent-message time. Every time a new message arrives in any conversation, that conversation needs to move to the top of the list.
In Cassandra, sort order is baked into the clustering key; you can't update it in place. Moving a conversation up the list means deleting the old row and inserting a new one with the updated timestamp. At 100B messages/day this is a continuous delete + insert churn, generating a tombstone on every operation. Tombstones don't get cleaned up immediately; they accumulate until compaction runs, and a heavy tombstone load degrades read performance across the partition.
A Redis sorted set handles this natively. ZADD inbox:{user_id} {ts} {conv_id}
overwrites the score atomically if the key already exists, no delete, no tombstone.
ZREVRANGE returns the top N conversations by recency in O(log N). The sorted
list lives in Redis; per-conversation metadata and the read cursor
(last_read_msg_id) stay in Cassandra where they belong.
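The two properties the inbox relies on — overwrite-in-place ZADD and descending ZREVRANGE — can be shown with a tiny in-memory stand-in for the sorted set:

```python
# In-memory stand-in for the Redis sorted set behavior the inbox relies on:
# ZADD on an existing member just overwrites its score (no delete, no
# tombstone), and ZREVRANGE reads members descending by score.

inbox = {}   # member (conv_id) -> score (last_msg_ts_ms)

def zadd(member: str, score: int) -> None:
    inbox[member] = score   # overwrite-in-place, like Redis ZADD

def zrevrange(n: int) -> list:
    return [m for m, _ in sorted(inbox.items(), key=lambda kv: -kv[1])][:n]

zadd("conv-work", 1_712_000_000_000)
zadd("conv-family", 1_712_000_100_000)
assert zrevrange(2) == ["conv-family", "conv-work"]

# A new message in conv-work moves it to the top with one score overwrite:
zadd("conv-work", 1_712_000_200_000)
assert zrevrange(2) == ["conv-work", "conv-family"]
```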
If the Redis inbox is ever lost, it can be rebuilt by scanning user_conversation_meta for the user's conversations and reseeding the sorted set. This is an infrequent operation and acceptable.
User and group schema (PostgreSQL)
```sql
CREATE TABLE users (
    user_id      UUID PRIMARY KEY,
    phone        TEXT UNIQUE NOT NULL,
    display_name TEXT,
    avatar_url   TEXT,
    created_at   TIMESTAMPTZ
);

CREATE TABLE conversations (
    conversation_id UUID PRIMARY KEY,
    type            TEXT,   -- 'direct' | 'group'
    name            TEXT,   -- null for direct messages
    created_by      UUID REFERENCES users(user_id),
    created_at      TIMESTAMPTZ
);

CREATE TABLE conversation_members (
    conversation_id UUID REFERENCES conversations(conversation_id),
    user_id         UUID REFERENCES users(user_id),
    role            TEXT,   -- 'admin' | 'member'
    joined_at       TIMESTAMPTZ,
    PRIMARY KEY (conversation_id, user_id)
);
```
Session and presence schema (Redis)
```
-- Per-device routing: device_id → chat server (atomic SET with TTL)
-- Using SET with EX is atomic, no separate EXPIRE needed, avoiding key leak on crash
SET session:device:{device_id} "chat-server-042" EX 30
-- Refreshed every ~10s by client heartbeat; expires naturally on disconnect

-- User's active device set: used to fan out to all a user's connected devices
-- SADD alone does NOT reset the TTL, must use a Lua script to keep it atomic:
--   redis.call("SADD", "user:devices:"..userId, deviceId)
--   redis.call("EXPIRE", "user:devices:"..userId, 86400)
-- This ensures the 24h window resets on every device reconnect, not just the first.
SADD user:devices:{user_id} {device_id}
EXPIRE user:devices:{user_id} 86400   -- run atomically via Lua (see above)

-- Cleanup stale device IDs: enable Redis keyspace notifications (notify-keyspace-events KEA)
-- When session:device:{id} expires, the notification triggers SREM user:devices:{uid} {id}
-- Without this, disconnected device IDs accumulate in the set indefinitely,
-- causing spurious SMEMBERS results and wasted delivery attempts.

-- To deliver to Bob: SMEMBERS user:devices:{bob_id} → list of device_ids
-- For each device_id: GET session:device:{device_id} → server address or nil (offline)

-- Presence: user-level status (not per-device; visible to contacts)
HSET presence:{user_id} status "online"
     last_seen "2026-04-02T14:00:00Z"
EXPIRE presence:{user_id} 60   -- refreshed by any active device's heartbeat
```
Why device-level, not user-level? A user with phone + laptop has two active
WebSocket connections, potentially on two different chat servers. A single
session:user:{id} key would be overwritten by whichever device connected last, making
the earlier device's connection unreachable. Routing at device granularity is the only correct model
for multi-device support.
Extensions: search, reactions, threading
These features are usually addressed as follow-ups rather than in the first design pass, but naming them and sketching their schema impact is a clear L5+ signal.
Message search: Cassandra supports range scans by conversation but not full-text search. For message history search the answer is a separate inverted index: messages are dual-written via a Kafka consumer into a search engine (Elasticsearch), partitioned by user_id so each user can only search their own conversations. This is architecturally distinct from the main message store; name it as an extension rather than designing it in the first pass.
Reactions & threaded replies: both change the data model. Reactions are a new entity: a Cassandra table partitioned by (conversation_id, message_id) with one row per (user_id, emoji). Threaded replies add a parent_message_id column to the messages table; threads are fetched as a range scan filtered to that parent. Neither is complex to sketch; naming them and their schema impact shows completeness at L5+.
Caching strategy
The latency NFR of <500ms p99 from §2 requires that the hot path (message routing) never touches a slow store. Three caching layers serve the system, but they sit on different request paths: the session and presence caches are consulted on every message write; the history cache is consulted only on reads. Conflating them into one diagram is a common mistake that makes the design look simpler than it is.
| Cache layer | What it stores | TTL | Invalidation | Why it exists |
|---|---|---|---|---|
| Session cache | Two structures: session:device:{id} → server address (string); user:devices:{uid} → set of active device IDs | 30s per device key (atomic SET with EX, refreshed on heartbeat); 24h set TTL reset via Lua on reconnect | Device key expires naturally on disconnect; stale device IDs removed via keyspace notification → SREM | Routing lookups happen on every message delivery and must be <1ms. Device-level granularity is required to fan out correctly to all of a user's active devices. |
| Recent message cache | Last 100 messages per conversation | 24 hours | Write-through on new message; evict on TTL | When a user opens a conversation, they see the last N messages. Serving from cache avoids a Cassandra read on every conversation open, the dominant read pattern. |
| Presence cache | user_id → {online, last_seen} | 60s (refresh on heartbeat) | Updated by presence service on heartbeat; expires naturally when the client disconnects and stops sending heartbeats | Presence is read on every conversation-list render but has relaxed accuracy requirements: a user appearing online for up to 60s after disconnecting is an acceptable UX tradeoff (WhatsApp shows "last seen", not a real-time indicator) and avoids fanning out every disconnect event to all contacts. |
Cache invalidation: how do receipt ticks stay consistent?
The messages table is fully immutable; there is no status field to update. Receipt state lives in message_receipts (Cassandra) and is reflected to the sender's UI via two lightweight Redis cursors, one per user per conversation:
-- Last message ID the server knows was delivered to at least one device
SET receipt:delivered:{user_id}:{conv_id} {msg_id}
-- Last message ID the server knows was read on at least one device
SET receipt:read:{user_id}:{conv_id} {msg_id}
When a delivery or read ACK arrives, the server atomically updates the cursor in Redis and writes
to message_receipts in Cassandra. The sender's client polls these cursors (or
receives them as a push frame) to render the correct tick state. This is a single Redis write
per ACK event, not a per-message cache invalidation, regardless of how many messages are in
the recent history cache.
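One subtlety the prose glosses over: ACKs can arrive out of order, so the cursor must only ever move forward. A minimal Python sketch of the compare-and-advance logic (a plain dict stands in for Redis here; in production the same check-then-SET would run in a small Lua script so it stays atomic):

```python
class ReceiptCursors:
    """Per-user, per-conversation receipt cursors (hypothetical sketch).

    A dict stands in for Redis; key names follow the schema above.
    """

    def __init__(self):
        self.store = {}

    def advance(self, kind, user_id, conv_id, msg_id):
        # kind is "delivered" or "read"; msg_id is a monotonically
        # increasing Snowflake-style ID, so numeric comparison gives ordering.
        key = f"receipt:{kind}:{user_id}:{conv_id}"
        current = self.store.get(key)
        if current is None or msg_id > current:
            self.store[key] = msg_id  # cursor only moves forward
            return True
        return False  # late / out-of-order ACK: ignore it
```

A late-arriving delivery ACK for an older message is simply dropped, which is exactly the behavior the tick UI wants.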
The recent message cache (Layer 2) stores message bodies only, which are immutable. It never needs to be invalidated for receipt changes. It is invalidated only when a new message is written (append to the cached list) or on TTL expiry.
For full per-device receipt detail (which specific device delivered or read a message), the client can query message_receipts directly rather than using a single cursor.
Deep-dive: scalability & advanced topics
At WhatsApp scale, each component from the high-level architecture hits a constraint. The deep dives below address the most important ones, the topics that separate L5 from L7 answers.
Scaling stateful chat servers (L5+)
At 500M DAU with 10% concurrently online, that's 50M active WebSocket connections. A single server can hold 50k–100k connections (limited by file descriptor limits and memory), so you need 500–1,000 chat servers minimum, and many more for redundancy and geographic distribution.
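The server-count arithmetic is worth writing down explicitly. A back-of-envelope sketch using the figures above (the per-server capacities are the assumed bounds, not measured numbers):

```python
dau = 500_000_000
concurrency = 0.10                    # assume 10% of DAU online at once
connections = int(dau * concurrency)  # 50M open WebSockets

# Conservative vs optimistic per-server connection capacity.
servers_optimistic = connections // 100_000   # 500 servers
servers_conservative = connections // 50_000  # 1,000 servers
```

This is the floor before redundancy and geographic distribution; real fleets run well above it.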
Scaling out is straightforward: add more chat servers and update the load balancer. The complexity is in scaling in (and in rolling deploys). You can't kill a chat server without disrupting all of its active connections. The solution is graceful draining: the server stops accepting new connections, lets existing ones time out or explicitly closes them, and the clients automatically reconnect to surviving servers. This means your load balancer must support connection draining, and your client must handle reconnect with exponential backoff.
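The client's reconnect backoff can be sketched in a few lines. This uses "full jitter" (delay drawn uniformly from zero up to the exponential ceiling), which spreads a reconnect storm over time instead of synchronizing it; the parameter values are illustrative, not prescribed:

```python
import random

def backoff_delays(base=1.0, cap=60.0, attempts=8):
    """Yield reconnect delays: exponential growth, capped, with full jitter.

    base: first-attempt ceiling in seconds; cap: maximum ceiling;
    attempts: how many retries before giving up.
    """
    for n in range(attempts):
        ceiling = min(cap, base * (2 ** n))
        yield random.uniform(0, ceiling)  # full jitter: uniform in [0, ceiling]
```

A client would sleep for each yielded delay before retrying the WebSocket handshake, resetting the generator once a connection succeeds.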
On reconnect, the client's new chat server writes a session:device:{device_id} key pointing to itself (with EX 30). Any Kafka messages for that device that land on the new server will now find an active socket. The old server's key has already expired or been overwritten, so routing is immediately correct after reconnect.
Group messaging: fan-out on write vs read (L5+)
When Alice sends a message to a group of 100 people, the system must deliver it to 100 recipients. There are two fundamental strategies.
Fan-out on write: when the message is sent, immediately write a copy to each member's inbox (or enqueue a delivery event per member). Each member's delivery is handled independently. This is simple and keeps the read path fast (each member just reads their own inbox), but it's expensive for large groups: one message creates N writes and N delivery events.
Fan-out on read: store the message once, and let each member's client fetch it when they open the group conversation. Much cheaper on write, but it requires a shared cursor per member (which messages have I seen?) and a more complex read path. It also means group members who open the conversation at the same time all hit the same message record: a hot-partition risk in Cassandra.
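The fan-out-on-write path can be sketched as one delivery event per recipient, with the 500-member group cap acting as the guard that keeps the strategy viable (the `enqueue` callback is a stand-in for a Kafka producer; names are illustrative):

```python
FANOUT_WRITE_LIMIT = 500  # group size cap from the functional requirements

def deliver_group_message(group_members, sender_id, msg, enqueue):
    """Fan-out on write: enqueue one delivery event per recipient.

    Beyond FANOUT_WRITE_LIMIT a real system would switch to fan-out
    on read (store once, let clients pull against a per-member cursor).
    Returns the number of delivery events produced.
    """
    recipients = [m for m in group_members if m != sender_id]
    if len(recipients) > FANOUT_WRITE_LIMIT:
        raise ValueError("group too large for fan-out on write")
    for member_id in recipients:
        enqueue({"to": member_id, "msg": msg})  # N writes for one send
    return len(recipients)
```

The write amplification is visible directly: one send becomes `len(recipients)` queued events.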
Multi-device delivery (L6+)
WhatsApp supports multiple linked devices (phone + web + desktop). Each device needs its own delivery state: a message delivered to your phone shouldn't automatically count as delivered to your laptop. The routing abstraction therefore operates at device granularity,
not user granularity: each device registers its own session key
(session:device:{device_id}) pointing to its chat server, and a companion set
(user:devices:{user_id}) lists all of a user's currently active device IDs.
This is the design already reflected in the §7 Redis schema.
When a message arrives for Bob, the system reads SMEMBERS user:devices:{bob_id}
to get all his active device IDs, then for each device looks up its server address and
routes the message there. Each device independently sends a delivery ACK. The "two-tick"
indicator on the sender's side lights up when at least one of the recipient's
devices acknowledges delivery. The "blue tick" lights up when any device marks it read; that read is recorded as a new row in the message_receipts table (§7).
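The device-level routing loop described above can be sketched directly against the two Redis structures from §7 (a dict stands in for Redis, so SMEMBERS/GET collapse to plain lookups; `send` is a stand-in for the Kafka publish to the chat server holding that device's socket):

```python
def route_to_devices(kv, user_id, msg, send):
    """Fan a message out to every active device of `user_id`.

    Returns device IDs with no live session entry; those are offline
    and will pick the message up from Cassandra on their next sync.
    """
    offline = []
    for device_id in kv.get(f"user:devices:{user_id}", set()):
        server = kv.get(f"session:device:{device_id}")
        if server is None:  # session key expired: device is offline
            offline.append(device_id)
        else:
            send(server, device_id, msg)
    return offline
```

Note that "offline" is decided per device, which is exactly why a message can be delivered to your phone while your laptop still shows it unsynced.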
For offline sync: when a device comes online after being offline, it sends its last-seen
message sequence number (last_read_msg_id from the
user_conversations table) to the server. The server fetches all messages in the
user's conversations with msg_id > last_read_msg_id from Cassandra and
pushes them to the reconnecting device. This is the
GET /messages/sync?since={seq} endpoint from §5b.
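The sync endpoint's server-side logic reduces to a cursor-filtered range read per conversation. A sketch, assuming monotonically increasing message IDs so `>` alone expresses "newer than" (`fetch_messages` is a stand-in for the Cassandra range read `WHERE msg_id > after`):

```python
def sync_since(conversations, last_read, fetch_messages):
    """Collect everything a reconnecting device has missed.

    conversations: conv IDs the user belongs to.
    last_read: conv_id -> last_read_msg_id (from user_conversations).
    Returns {conv_id: [messages]} for conversations with new messages.
    """
    missed = {}
    for conv_id in conversations:
        after = last_read.get(conv_id, 0)  # unseen conversation: fetch all
        msgs = fetch_messages(conv_id, after)
        if msgs:
            missed[conv_id] = msgs
    return missed
```

The response is then pushed to the reconnecting device, and its cursors advance as it ACKs.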
Cassandra hot partition problem (L6+)
Cassandra partitions by conversation_id. A conversation with many active members, say, a group of 500 people who are all actively messaging, generates enormous write and read traffic to a single partition on a single Cassandra node. This is the "hot partition" problem: most traffic concentrates on one node, defeating the point of distributing across a cluster.
For most conversations this isn't a problem: traffic is distributed across millions of conversation_ids. But for extremely popular group conversations (100k+ member broadcast channels, which are a different beast from WhatsApp groups), you need a different partition strategy.
Multi-region active-active (L7/L8)
A single-region deployment adds hundreds of milliseconds of latency for users far from the data center, and a regional outage takes down all users globally. The L7/L8 answer requires multi-region deployment.
The key challenge: message ordering across regions. If Alice is in US-East and Bob is in EU-West, and both send messages to the same group conversation simultaneously, which one appears first? Without coordination, each region might apply them in a different order, and users see different conversation histories.
Solutions involve either routing all writes for a conversation to a single "home region" (the conversation leader), which adds cross-region latency for one party, or using a conflict-free replicated data type (CRDT) or vector clock scheme that allows concurrent writes and merges them deterministically. WhatsApp routes connections to the nearest region but uses a globally replicated database (built on top of custom Erlang infrastructure) to resolve ordering.
Contact discovery without leaking non-user data (L6+)
When a new user installs WhatsApp, the app identifies which of their phone contacts are already registered, without revealing to the server which contacts are not registered. Revealing non-matches would expose the user's full address book, including people who chose not to sign up.
The mechanism: the client hashes all contact phone numbers (e.g. SHA-256 of the E.164 normalised number) and sends only the hashes. The server checks each hash against a hash index of all registered phone numbers, and returns only the matching hashes. The client maps those back to names from its own contacts. The server never sees plaintext phone numbers for non-users; it only confirms existence for those in its own registry.
At scale, the hash lookup must be fast: a Bloom filter pre-screens obvious misses before hitting the full index. Be aware that plain hashing is weak privacy on its own, since the phone-number space is small enough to brute-force; production systems look toward private set intersection, but the hash + Bloom filter pattern remains the standard interview answer. Interviewers at L6+ sometimes ask this as a standalone question ("how do you implement contact discovery while preserving privacy?"), so knowing the pattern is valuable on its own.
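A compact sketch of the hash + Bloom filter pattern follows. The Bloom filter here is a deliberately tiny hand-rolled one for illustration (sizes and hash counts are arbitrary), and the "full index" is just a set; false positives from the filter are caught by the exact membership check:

```python
import hashlib

def contact_hash(number: str) -> str:
    """SHA-256 of an E.164-normalised number (sketch only; see the
    brute-force caveat in the text)."""
    return hashlib.sha256(number.encode()).hexdigest()

class BloomFilter:
    """Minimal Bloom filter to pre-screen obvious misses."""

    def __init__(self, bits=1 << 16, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)

    def _positions(self, item):
        for i in range(self.hashes):
            d = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(d[:8], "big") % self.bits

    def add(self, item):
        for p in self._positions(item):
            self.array[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        return all(self.array[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

def discover(contact_hashes, bloom, registered):
    # Bloom screens out most non-users cheaply; the exact index
    # (a set here, a real hash index in production) confirms matches.
    return [c for c in contact_hashes
            if bloom.might_contain(c) and c in registered]
```

The server returns only matching hashes; the client maps them back to names from its own address book.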
End-to-end encryption (E2EE) architecture (L7/L8)
WhatsApp uses the Signal Protocol for E2EE. The fundamental property: the server can see that a message was sent from A to B, but cannot read its contents. The server holds encrypted ciphertext only.
This changes the server-side architecture significantly. The message body stored in Cassandra is ciphertext. The server cannot index, moderate, or search message content. Group fan-out must account for encrypting the message for each recipient's device key separately: the sender's client does this, generating N encrypted payloads for N recipient devices. The server delivers the right ciphertext to the right device without understanding any of it.
The critical infrastructure piece is the key distribution server: each device registers its public key (identity key, signed prekeys, one-time prekeys) with the server. When a sender wants to message a new recipient for the first time, they fetch the recipient's prekeys from the server and establish a shared secret. The server never sees the shared secret; it only distributes public keys.
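The key server's bookkeeping can be modeled without any real cryptography. This sketch models only the structure (keys are opaque bytes; real implementations use X25519 keys per the Signal specification), including the one property worth calling out: one-time prekeys are consumed, each handed out at most once:

```python
from dataclasses import dataclass, field

@dataclass
class PrekeyBundle:
    """What a device publishes to the key distribution server."""
    identity_key: bytes
    signed_prekey: bytes
    one_time_prekeys: list = field(default_factory=list)

class KeyServer:
    """Distributes public key material; never holds any secret."""

    def __init__(self):
        self.bundles = {}

    def register(self, device_id, bundle):
        self.bundles[device_id] = bundle

    def fetch(self, device_id):
        """Return (identity key, signed prekey, one one-time prekey).

        One-time prekeys are popped so each is used for at most one
        session setup; when exhausted, senders fall back to the
        signed prekey alone.
        """
        b = self.bundles[device_id]
        otp = b.one_time_prekeys.pop(0) if b.one_time_prekeys else None
        return b.identity_key, b.signed_prekey, otp
```

Devices periodically replenish their one-time prekey pool; the server only ever sees and serves public halves.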
Failure modes & edge cases
| Scenario | Problem | Solution | Level |
|---|---|---|---|
| Chat server crashes mid-send | Message in memory, not yet persisted or published to Kafka, potential loss | Always persist to Cassandra before ACKing sender. Kafka publish happens after persist, if the server crashes after persist but before Kafka publish, a recovery process reads the message store and re-publishes undelivered messages on restart. | L3/L4 |
| Duplicate message delivery | Network retry or server restart causes the same message to be processed twice | Use client_msg_id as an idempotency key (from §5b API design). Server deduplicates on this key with a short-TTL dedup cache (Redis SETNX). Recipient gets at-most-once display even if delivered twice at the protocol level. | L3/L4 |
| Recipient's chat server goes down during delivery | Message was published to Kafka but the consumer (Chat Server B) is dead, message undelivered | Kafka consumer groups automatically rebalance, another chat server picks up the partition. That server checks the session cache: if the user is now connected to it, delivers immediately. If not, the message stays in Cassandra until the recipient reconnects. | L5 |
| Redis session cache is down | Can't look up which server holds the recipient's socket, can't route messages | Fall back to publish-to-all-servers via broadcast (expensive) or use the message store as the delivery mechanism: write to Cassandra, trigger push notification, let the client poll on reconnect. Route through Kafka and let consumer groups deliver when the client next connects. | L5 |
| Message ordering in group chat | Two members send messages simultaneously, different servers assign different sequence numbers, members see different orderings | Assign sequence numbers from a single authoritative source per conversation (Snowflake ID service or a conversation-level counter in a coordination service like ZooKeeper). Messages with the same timestamp are ordered by server_msg_id. This gives causal ordering, not strict global ordering, which is sufficient for UX. | L5 |
| WebSocket connection storm on reconnect | Regional network event causes millions of clients to reconnect simultaneously, server overload | Clients use exponential backoff with jitter (random delay). The load balancer is configured with connection rate limits per server. New chat servers spin up via auto-scaling to absorb the reconnect wave. Kafka consumer lag is monitored, if it grows, delivery still occurs once connections stabilize. | L7/L8 |
| Cassandra hot partition | Very active group conversation concentrates all reads/writes on one Cassandra node, which becomes a bottleneck | Add a time-bucket dimension to the partition key: PRIMARY KEY ((conversation_id, time_bucket), message_id) where time_bucket = YYYYMMDD_HH. Spreads a hot conversation across 24 partitions per day (likely landing on different nodes). Adds complexity to range queries (must query multiple buckets). Acceptable for conversations flagged as high-traffic. | L7/L8 |
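The Snowflake-style IDs that the ordering row relies on are simple enough to sketch end to end. This is an illustrative layout (41-bit millisecond timestamp, 10-bit worker ID, 12-bit sequence, custom epoch chosen arbitrarily), not any particular production implementation; the property that matters is that IDs from one worker are strictly increasing:

```python
import threading
import time

EPOCH_MS = 1_600_000_000_000  # arbitrary custom epoch for this sketch

class SnowflakeIds:
    """41-bit ms timestamp | 10-bit worker | 12-bit per-ms sequence."""

    def __init__(self, worker_id):
        assert 0 <= worker_id < 1024
        self.worker_id = worker_id
        self.last_ms = -1
        self.seq = 0
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = int(time.time() * 1000) - EPOCH_MS
            if now == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF
                if self.seq == 0:
                    # 4096 IDs issued this millisecond: spin to the next one
                    while now <= self.last_ms:
                        now = int(time.time() * 1000) - EPOCH_MS
            else:
                self.seq = 0
            self.last_ms = now
            return (now << 22) | (self.worker_id << 12) | self.seq
```

Because the timestamp occupies the high bits, sorting by ID sorts by creation time, which is what makes `msg_id > last_read_msg_id` range reads and receipt-cursor comparisons work.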
How to answer by level
The same system design question plays out very differently across levels. The difference isn't just depth; it's the quality of reasoning about tradeoffs and the ability to drive the conversation rather than just respond to it.
L3 / L4: Working system
- Explains the WebSocket connection model clearly; knows why polling doesn't work at scale
- Describes the basic send → persist → deliver flow with delivery receipts
- Identifies the need for a message store and a user database
- Handles offline delivery via push notifications
- Doesn't think about what happens when the server crashes between steps
- Acknowledges sender before persisting (durability mistake)
- Treats group and 1:1 messaging identically without considering fan-out cost
- No awareness of the stateful server routing problem
L5: Tradeoffs and guarantees
- Explains at-least-once vs at-most-once delivery and why idempotency keys matter
- Designs the session cache (Redis) for routing messages across stateful servers
- Distinguishes fan-out on write vs read; picks the right one for group size
- Designs the offline delivery path: messages persist in Cassandra, inbox recency in Redis, push notification wakes offline clients
- Mentions Snowflake IDs for causal ordering
- Doesn't consider multi-device delivery state tracking
- Doesn't think about what happens to in-flight messages during a Redis outage
- Can't describe the full connection lifecycle: establish → heartbeat → drain → reconnect
L6: End-to-end ownership
- Owns the full design including multi-device delivery, device-level session routing, and sync
- Addresses media upload separately (pre-signed S3 URLs) and explains why media must bypass chat servers: routing gigabyte video files through WebSocket servers would saturate their memory and CPU
- Explains the rolling deploy problem unprompted: killing a chat server drops all its active sockets; describes the drain sequence (stop accepting, signal clients, wait for close/timeout) and what happens without it
- Identifies the Cassandra hot partition risk for high-traffic groups and names the time-bucketed partition key as the mitigation
- Mentions E2EE at an architectural level: server holds ciphertext only, key distribution server needed, per-device key pairs
- Treats writes as single-region; doesn't address ordering conflicts when two regions write concurrently to the same conversation
- Doesn't reason about E2EE's implications for group fan-out cost
- Can't discuss the cost model or what to optimize first at 100B msg/day
L7 / L8: Architecture and strategy
- Drives the conversation: defines the problem constraints before the interviewer does
- Designs multi-region with explicit reasoning about the ordering tradeoff (home region vs full active-active)
- Discusses E2EE implications for fan-out, prekey distribution, and group size limits
- Reasons about cost: Kafka egress, Cassandra storage, CDN bandwidth, and what to optimize first
- Anticipates operational concerns: monitoring message delivery lag, Kafka consumer lag SLOs, WebSocket connection churn rate
- Needs prompting for any major component
- Can't reason about the tradeoff without thinking through it in the interview
- Focuses on technology choices without connecting them to requirements
Classic interview probes
| Question | L3/L4 | L5/L6 | L7/L8 |
|---|---|---|---|
| "How do you ensure a message is never lost?" | Persist to a database before responding to sender | Explain the ACK boundary precisely: persist → ACK sender → publish → deliver. Describe what happens on crash at each step. | Full delivery guarantee chain with Kafka consumer group offsets, idempotency keys, and the monitoring strategy to detect lost messages in production. |
| "A user is connected to Server 1. Their friend is on Server 2. How does the message get delivered?" | Mentions some kind of "routing" but doesn't have a concrete mechanism | Session cache (Redis): Server 1 reads SMEMBERS user:devices:{friend_id} to get all active device IDs. For each device it does GET session:device:{id} to find the target server. It publishes a delivery event to Kafka; each target server's Kafka consumer finds the open socket and pushes the frame. If a device has no session entry it is offline; the message stays in Cassandra until that device reconnects and syncs. | Full lifecycle including what happens when the session cache is stale (Server 2 holds an expired entry), how to detect and recover, and the consistency model of the routing table. |
| "How would you scale this to 1 billion users?" | Add more servers and a load balancer | Horizontal scaling of each tier independently; Cassandra's peer-to-peer scaling model; Kafka partition scaling; Redis cluster mode. | Multi-region topology, cost modeling across tiers, which tier hits its limit first at 1B DAU (hint: WebSocket connection count), and the resulting architectural changes. |
| "How would you implement typing indicators?" | Send a "typing" event message through the normal message flow | Typing events are ephemeral, they should NOT go through the durable message store. Route through the presence service (Redis pub/sub) with a short TTL. The recipient's client stops showing "typing..." after 5–10s if no new event arrives. | At scale, typing indicators from large groups create significant pub/sub traffic. Discuss throttling (only broadcast typing indicator once per 2–3 seconds per user), and the decision to not persist them at all vs persist with very short TTL. |
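The throttling answer in the last row can be made concrete in a few lines. A sketch of a per-(user, conversation) throttle (the 2.5s interval is an assumed value; the clock is injected so the logic is testable):

```python
import time

class TypingThrottle:
    """Suppress repeat typing broadcasts: at most one per `interval`
    seconds per (user, conversation) pair."""

    def __init__(self, interval=2.5, clock=time.monotonic):
        self.interval = interval
        self.clock = clock
        self.last_sent = {}

    def should_broadcast(self, user_id, conv_id):
        key = (user_id, conv_id)
        now = self.clock()
        if now - self.last_sent.get(key, float("-inf")) >= self.interval:
            self.last_sent[key] = now
            return True  # forward this typing event to the pub/sub channel
        return False     # drop it; recipients still show "typing..."
```

Dropped events are harmless because the recipient's UI times out the indicator on its own after 5–10s of silence.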
How the pieces connect
Every major architectural decision traces back to a requirement or capacity number established earlier in the article.
1. Durability NFR (§2), "no message loss after server ACK" → synchronous Cassandra write before ACKing sender (§6 core flow) → Snowflake IDs for deterministic ordering (§7 data model)
2. Sub-500ms latency NFR (§2) → WebSocket persistent connections (§5 protocol) → Redis session cache on the hot routing path (§8 caching) → stateful chat server pool that cannot be replaced with stateless HTTP handlers (§4 architecture)
3. 50M concurrent connections at default DAU (§3 estimator) → 500–1,000 chat servers minimum at 50k–100k connections each (adjusts dynamically with the estimator slider) → rolling deploys cannot simply kill servers, since each holds active sockets → connection draining sequence required (§9 scalability)
4. Group messaging up to 500 members (§2 functional requirements) → fan-out on write is viable at this scale → Kafka consumer groups for fan-out (§4, §9) → Cassandra hot partition risk for very active groups requires time-bucketed partitioning (§10 failure modes)
5. Offline delivery requirement (§2) → messages persist durably in Cassandra; Redis sorted set tracks conversation recency for the inbox → push notification (APNs/FCM) to wake offline clients (§4 architecture) → sync endpoint returning messages since last-seen sequence number (§5b API)
6. Multi-device support (§2) → device-level routing (session:device:{id} keys in Redis §7) instead of user-level → fan-out to all active devices on message receive (§9 deep-dive) → E2EE requires per-device key pairs → key distribution server architecture → group size limits reflect encryption fan-out cost; receipt state tracked per-recipient in message_receipts table (§7 data model)
- Rate Limiter System Design — atomic Redis operations, distributed race conditions, and multi-tier quota enforcement
- URL Shortener System Design — hash-based ID generation, Redis caching strategy, and async analytics pipeline
- Web Crawler System Design — Bloom filter deduplication, politeness throttling, and distributed frontier design
- Twitter Feed System Design — fan-out write amplification, hybrid push/pull strategy, and celebrity threshold design
- Notification Service System Design — multi-channel delivery, idempotency keys, and priority queues at scale
- Key-Value Store System Design — consistent hashing, eviction policies, replication, and failure modes
- Search Autocomplete