System Design Interview Guide

Design a Distributed Message Queue

The interface is two operations: write a message, read it back. The hard part is doing that reliably at a million messages per second, without losing a single one.

L3/L4: working system · L5/L6: tradeoffs & durability · L7/L8: architecture ownership

~40 min read  ·  12 sections

[Figure: architecture diagram of a distributed message queue showing producers, partition router, broker cluster with leader and follower replicas, and consumer groups]
01

What the interviewer is testing

A distributed message queue, such as Apache Kafka, Amazon Kinesis, or Google Pub/Sub, sits between services that produce events and services that consume them. The question looks simple: write here, read there. But the moment you add durability, ordering, and millions of events per second, the design space explodes.

Interviewers use this question to probe whether you understand distributed systems fundamentals, not just one vendor's API. The core tensions are: write throughput vs. durability guarantees, strict ordering vs. horizontal scalability, and push-based vs. pull-based consumption. These tensions play out at every level:

| Level | Primary focus | Hard question at this level |
|---|---|---|
| L3/L4 | Working producer/consumer model with a single broker | How do consumers know where to read next? |
| L5 | Replication, durability, consumer group coordination | What happens when a broker crashes mid-write? |
| L6 | Exactly-once delivery, partition rebalancing, back-pressure | How do you eliminate duplicates without sacrificing throughput? |
| L7/L8 | Cross-region replication, tiered storage, cost at scale | How do you serve petabyte-scale history without running out of disk? |
🎯

The single most differentiating question across all levels is offset management. How are offsets stored, who advances them, and what happens if a consumer crashes between processing a message and committing its offset? Nail this and the rest of the design falls into place.

02

Requirements clarification

The right design depends entirely on scope. A message queue for an internal microservices event bus has very different requirements from a shared platform ingesting click-stream events from 200 million mobile users. Throughput targets, retention windows, and multi-tenancy constraints all change dramatically based on that scope.

Functional requirements

| Requirement | Description |
|---|---|
| Publish messages | Producers write messages to named topics with an optional partition key |
| Subscribe and consume | Consumer groups pull messages from topics starting at any stored offset |
| Replay | Consumers can rewind to any offset within the retention window |
| Multi-tenant topics | Multiple independent consumer groups can read the same topic independently |
| Message retention | Messages are retained for a configurable period (hours to years) |
| Log compaction (opt-in) | Retain only the latest value per key for CDC and event-sourcing topics via cleanup.policy=compact. Note: the __consumer_offsets topic itself uses compaction to track the current offset without growing unboundedly |

Non-functional requirements

| NFR | Target | Notes |
|---|---|---|
| Write throughput | 1 M messages/s sustained | Across the whole cluster; single partition ~100 MB/s |
| Write latency (p99) | < 20 ms (at-least-once), < 100 ms (exactly-once) | Durability and latency are in direct tension here |
| Read latency (p99) | < 5 ms for warm consumers | Zero-copy sendfile to consumer |
| Availability | 99.99% (52 min/year downtime) | Requires multi-broker replication |
| Durability | No message loss once acked | Requires flushed replicas before ack |
| Ordering | Strict within partition; best-effort across partitions | Total ordering requires a single partition, which kills throughput |
| Delivery semantics | At-least-once default; exactly-once opt-in | Exactly-once requires idempotent producers + transactional consumers |
| Retention | 7 days hot; years via tiered storage | Tiered storage offloads cold segments to an object store (S3) |

NFR reasoning

Write latency target: 20 ms (at-least-once) · Durability tradeoff

20 ms (inclusive of a linger.ms=10 producer batching window; with linger.ms=0, typical p99 is 3–10 ms at the cost of lower throughput) gives us room for acks=1 as the default: the leader broker receives the message into the OS page cache and acknowledges immediately, without waiting for any follower to confirm or for a disk fsync. Kafka does not fsync on every write by default. Durability against a leader power failure comes from replication, not from disk writes alone. The risk is a narrow loss window: if the leader crashes after acknowledging but before any follower has replicated the message, that message is gone. For most workloads (analytics, click-stream, metrics) this window is acceptable because the data has low value relative to its volume.

Tradeoff: Setting acks=all with min.insync.replicas=2 requires the leader and at least one follower to confirm before acking. This eliminates the loss window, but adds replication round-trip latency: ~10–30 ms within a single AZ, or ~50–100 ms for cross-AZ replication (round-trip to a follower in a different availability zone). These are two independent settings: acks controls what the producer waits for; min.insync.replicas controls the minimum ISR size below which acks=all writes are rejected (503). Use both together for financial or audit-critical events.
Drives: acks=1 default mode · leader-flush-then-ack flow in §6 · ISR and HWM design in §9
Availability: 99.99% (52 min/year) · Replication factor

99.99% requires that a single broker failure causes no visible downtime. This means partition leaders must be re-elected automatically and consumer clients must reconnect transparently within that 52-minute budget per year. In practice, leader election takes ~5–30 seconds with a metadata service (ZooKeeper/KRaft), so even weekly broker restarts fit the SLA.

Tradeoff: Crossing into 99.999% would require eliminating planned maintenance windows and using active-active multi-region routing, which multiplies operational cost by roughly 3–5×.
Drives: replication factor = 3 in §9 · controller re-election in §10
Ordering: within-partition only · Throughput enabler

Total global ordering across all partitions and all consumers would require a single write path, eliminating parallelism. The practical solution is key-based routing: all events belonging to the same entity (same user ID, same order ID) hash to the same partition, guaranteeing that the consumer for that partition sees those events in write order.

Tradeoff: If a hot key (a viral user generating 100× the traffic of others) maps to one partition, that partition becomes a bottleneck. Mitigation: add a random salt suffix to the partition key for overloaded entities and handle out-of-order events in the consumer.
Drives: murmur2 hash partitioning in §5 · hot-key section in §10
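The salting mitigation can be sketched in a few lines. The hot-key set, the `#` separator, and the salt count of 8 are illustrative assumptions, not values from the text; in practice hot keys would be discovered by traffic monitoring.

```java
import java.util.Set;
import java.util.concurrent.ThreadLocalRandom;

// Sketch of hot-key salting: keys known to be hot are fanned out across
// N salted sub-keys, trading per-key ordering for even partition load.
public class SaltedKeyRouter {
    private final Set<String> hotKeys; // discovered via traffic monitoring (assumed)
    private final int saltCount;       // e.g. 8 salted sub-keys per hot key

    public SaltedKeyRouter(Set<String> hotKeys, int saltCount) {
        this.hotKeys = hotKeys;
        this.saltCount = saltCount;
    }

    // Returns the partition key to hash: hot keys get a random salt suffix,
    // so their traffic spreads over up to saltCount partitions instead of one.
    public String partitionKey(String key) {
        if (!hotKeys.contains(key)) return key;
        int salt = ThreadLocalRandom.current().nextInt(saltCount);
        return key + "#" + salt;
    }
}
```

Consumers of a salted key must tolerate out-of-order events for that entity, as the tradeoff above notes.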
03

Capacity estimation

Message queues are fundamentally I/O-bound systems. The interesting estimation dimensions are write throughput and storage, not compute. Think in terms of bytes per second first, then convert to broker count and disk size.

Capacity estimator (worked example)

Assumptions: 500 M messages/day · 1,000 B per message · replication factor 3 · 7-day hot retention.

| Metric | Value | Notes |
|---|---|---|
| Write QPS | ~5,787 messages/s | 500 M / 86,400 s |
| Peak write QPS | ~17,361 messages/s | peak = write × 3× (assumed) |
| Write bandwidth | ~5.5 MB/s sustained | |
| Storage (hot) | ~3.3 TB raw | ×3 for replicas |
| Consumer bandwidth | ~16.5 MB/s | total fan-out |
| Min brokers (est.) | 2 | at 200 MB/s per broker (×RF=3); ceil(peak_bw × RF / 200 MB/s) |
💡

Key insight: At replication factor 3, every byte you write consumes 3× the storage on brokers. With tiered storage (cold segments offloaded to an object store like S3), you keep only the last N days hot on brokers and let the object store handle the long tail, typically at 10–100× cost reduction for data older than 7 days.
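The estimator's arithmetic is easy to reproduce. The sketch below uses the same assumptions (500 M messages/day, 1,000 B/message, 3× peak factor, 7-day retention) but sticks to decimal units, so bandwidth and storage come out slightly above the estimator's figures (~5.8 MB/s vs. 5.5, ~3.5 TB vs. 3.3, which reflect binary-unit rounding).

```java
// Back-of-envelope capacity math for a message queue, decimal units throughout.
public class CapacityEstimate {
    static long writeQps(long messagesPerDay)        { return messagesPerDay / 86_400; }
    static long peakQps(long qps, int peakFactor)    { return qps * peakFactor; }
    static double writeMBps(long qps, long msgBytes) { return qps * msgBytes / 1e6; }
    static double hotStorageTB(long messagesPerDay, long msgBytes, int retentionDays) {
        return messagesPerDay * (double) msgBytes * retentionDays / 1e12;
    }

    public static void main(String[] args) {
        long qps = writeQps(500_000_000L); // ~5,787 msg/s sustained
        double hot = hotStorageTB(500_000_000L, 1_000, 7); // ~3.5 TB before replication
        System.out.printf("qps=%d peak=%d bw=%.1f MB/s hot=%.1f TB (x3 replicated=%.1f TB)%n",
                qps, peakQps(qps, 3), writeMBps(qps, 1_000), hot, hot * 3);
    }
}
```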

04

High-level architecture

A distributed message queue is a durable, ordered log that decouples the services that produce events from the services that consume them. Producers write messages to named topics; consumers pull messages at their own pace using a stored offset. Unlike traditional queues, messages are not deleted after delivery; they are retained for a configurable window, allowing multiple independent consumer groups to replay the same data independently.

The core insight of a log-based message queue is that a durable, ordered, append-only log is both the storage layer and the transport layer. Every message is written once to disk and read many times by many consumers, with zero network copies between storage and consumer in the happy path (a technique called zero-copy, via the OS sendfile() syscall).

[Figure 1 diagram: producers (services/apps, mobile/web) → partition router (hash(key) → partition) → broker cluster (leaders for P0/P1 plus follower replicas; append-only log segments on NVMe; tiered storage of cold segments to S3/GCS) → consumer groups A (analytics) and B (search); coordinated by the metadata service (ZooKeeper/KRaft controller) and the offset store (__consumer_offsets). Legend: synchronous vs. async/internal paths.]
Figure 1: High-level architecture. Producers route to the partition leader broker via a hash; brokers replicate asynchronously to followers; consumers pull from brokers using stored offsets; cold log segments spill to object storage.

Component breakdown

Partition Router (load balancer layer): receives producer requests and routes each message to the correct partition leader. It holds a cached partition-to-broker mapping from the metadata service; on a cache miss or leader change, it refreshes. This layer can be embedded in the producer client library rather than a separate server hop.

Broker cluster: the core of the system. Each broker owns the leadership for a set of partitions and runs the append-to-log cycle. Brokers are symmetric: any broker can be a leader for some partitions and a follower for others. Leadership is assigned by the controller and balances automatically across the cluster.

Metadata service (ZooKeeper / KRaft): tracks which broker is the leader for each partition, which brokers are alive, and what the current ISR (in-sync replica) set is. With KRaft (Kafka's built-in Raft consensus mode), this component is no longer external to Kafka, reducing operational complexity at the cost of requiring a dedicated controller quorum within the cluster.

Append-only log: the actual message store. Each partition is a sequence of segment files on disk. Writes are always sequential appends; reads are random-access by offset but typically sequential as consumers advance through the log. The OS page cache acts as a read buffer, making recent messages essentially free to serve without touching disk.

Offset store (__consumer_offsets): an internal topic that tracks the last committed offset for each consumer group and partition. This is itself a distributed log, giving offset commits the same durability guarantees as regular messages. It also provides the coordination point for consumer rebalancing.

Tiered storage: closed log segments are uploaded to an object store (a durable blob service like S3 or GCS). Brokers only retain the last N days of hot segments locally; consumers requesting older offsets are transparently redirected to fetch from the object store. This is what makes multi-year retention economically viable.
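The offset store's compaction behavior is worth internalizing: commits are appended like any other message, but only the latest value per (group, topic, partition) key needs to survive. A toy in-memory model of that contract (not the real protocol):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the __consumer_offsets topic: an append-only stream of
// commits, compacted so only the latest offset per key survives.
public class OffsetStore {
    // Compacted view: "group|topic|partition" -> last committed offset.
    private final Map<String, Long> compacted = new HashMap<>();

    public void commit(String group, String topic, int partition, long offset) {
        compacted.put(group + "|" + topic + "|" + partition, offset);
    }

    // What a consumer reads back on startup or after a rebalance;
    // -1 means "no committed offset yet" (start per reset policy).
    public long fetch(String group, String topic, int partition) {
        return compacted.getOrDefault(group + "|" + topic + "|" + partition, -1L);
    }
}
```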

Architectural rationale

Why append-only logs instead of a traditional message queue? · Core model

Traditional message queues (like AMQP brokers, such as RabbitMQ and ActiveMQ) delete messages after delivery. This makes replay impossible without external storage. The log-based model keeps messages until the retention window expires, treating the queue as a persistent, ordered journal. Any consumer can read at any point in the log, enabling replay, backfill, and independent consumer groups with no coordination between them.

Tradeoff: Log retention costs disk proportional to time, not to unconsumed message count. A fast consumer incurs the same storage cost as a slow consumer: the opposite of a traditional queue, where cost depends on lag depth.
Alternatives: RabbitMQ · ActiveMQ Artemis · Amazon SQS
Why embed the router in the producer client rather than a separate layer? · Topology

A separate routing proxy adds a network hop and a failure domain. By embedding the partition metadata cache in the producer client library, producers route directly to the partition leader, requiring only a single network round-trip per produce request. The client refreshes its metadata cache when it receives a NOT_LEADER_FOR_PARTITION error, which is the signal that leadership has changed.

TradeoffFat clients mean the routing logic is spread across every producer language runtime (Java, Python, Go, .NET). Protocol-level bugs must be fixed across all SDKs simultaneously.
Alternative: sidecar proxy (Envoy)
Why KRaft instead of ZooKeeper for metadata? · Controller

A distributed coordination service (ZooKeeper) was an external dependency requiring its own cluster, its own monitoring, and its own expertise. ZooKeeper had a hard limit around 200,000 partitions before metadata propagation became slow. KRaft uses Raft consensus within the Kafka broker process, removing the external dependency and scaling to millions of partitions. In an interview, mention ZooKeeper as the historical baseline and KRaft as the current standard.

Alternatives: ZooKeeper mode (legacy, still supported) · etcd (not used in Kafka)

Real-world comparison

| Decision | This design (Kafka-like) | Amazon Kinesis | Google Pub/Sub |
|---|---|---|---|
| Storage model | Append-only log segments on broker disk + tiered to S3 | Shard-based log, managed by AWS (no local disk) | Fully managed; no user-visible disk |
| Partitioning | Explicit partitions, hash-based routing | Shards with configurable shard count | Automatic; no user-visible partitioning |
| Consumer model | Pull-based, offset committed by consumer | Pull-based, sequence number tracked per shard | Push-based (Pub/Sub pushes to subscriber endpoint) |
| Retention | Configurable; unlimited with tiered storage | Up to 7 days (standard) / 365 days (extended) | Up to 7 days; no tiered storage |
| Ordering | Within partition, strict | Within shard, strict | Best-effort; no guarantee across messages |
ℹ️

The right choice follows from requirements. Kinesis wins if you want zero operational overhead and your team is already on AWS. Pub/Sub wins for push-based delivery to HTTP endpoints. Kafka wins when you need replay, exactly-once semantics, or tight control over partition placement and retention policy.

05

Partitioning strategy & log design

The central algorithmic question in this design is: given a message, which partition does it go to, and how is that partition physically structured on disk? These two decisions determine throughput, ordering guarantees, and disk efficiency.

[Figure 2 diagram. ① Hash-based partitioning: partition = murmur2(key) % numPartitions; both messages with key=42 land on the same partition, msg with key=99 lands elsewhere (ordered per key; even load). ② Round-robin (no key): partition = counter++ % numPartitions; messages spread across partitions (max throughput; no ordering).]
Figure 2: Hash-based routing (left) guarantees ordering per key. Round-robin routing (right) maximises throughput at the cost of any per-key ordering. Our choice: hash-based with a hot-key salt escape hatch.

Our choice for this design is hash-based partitioning using murmur2 (the same hash Kafka uses). Producers specify an optional partition key; if omitted, the client falls back to round-robin for maximum throughput, useful for workloads where ordering genuinely does not matter.
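The routing rule can be sketched directly. The murmur2 implementation below follows the widely published Kafka Java client variant (seed 0x9747b28c); treat it as an assumed faithful port and check against the client source before relying on cross-language partition agreement.

```java
import java.nio.charset.StandardCharsets;

// Hash-based partition routing: same key -> same partition, always.
public class Partitioner {
    // murmur2 as used by the Kafka Java client (assumed faithful port).
    static int murmur2(byte[] data) {
        int length = data.length;
        int seed = 0x9747b28c;
        final int m = 0x5bd1e995;
        final int r = 24;
        int h = seed ^ length;
        int length4 = length / 4;
        for (int i = 0; i < length4; i++) {
            int i4 = i * 4;
            int k = (data[i4] & 0xff) + ((data[i4 + 1] & 0xff) << 8)
                  + ((data[i4 + 2] & 0xff) << 16) + ((data[i4 + 3] & 0xff) << 24);
            k *= m; k ^= k >>> r; k *= m;
            h *= m; h ^= k;
        }
        switch (length % 4) {        // mix in the trailing 1-3 bytes
            case 3: h ^= (data[(length & ~3) + 2] & 0xff) << 16; // fall through
            case 2: h ^= (data[(length & ~3) + 1] & 0xff) << 8;  // fall through
            case 1: h ^= data[length & ~3] & 0xff; h *= m;
        }
        h ^= h >>> 13; h *= m; h ^= h >>> 15;
        return h;
    }

    // Mask the sign bit before the modulo so the result is non-negative.
    static int partitionFor(String key, int numPartitions) {
        return (murmur2(key.getBytes(StandardCharsets.UTF_8)) & 0x7fffffff) % numPartitions;
    }
}
```

Note the masking step: a raw `% numPartitions` on a negative hash would yield a negative partition index in Java.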

Log segment structure

Each partition is a sequence of segment files on disk. A segment has a fixed maximum size (default 1 GB) and a maximum age (default 7 days). When a segment is rolled, the old file is closed and becomes immutable, enabling safe concurrent reads and background upload to tiered storage. The active segment is always the last file in the sequence.
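The roll decision itself is a simple predicate over the active segment's size and age, using the defaults above (1 GB, 7 days):

```java
// Decide whether the active segment should be rolled (closed and made
// immutable) before appending the next batch.
public class SegmentRoll {
    static final long MAX_SEGMENT_BYTES = 1L << 30;               // 1 GB default
    static final long MAX_SEGMENT_AGE_MS = 7L * 24 * 3600 * 1000; // 7 days default

    static boolean shouldRoll(long segmentBytes, long segmentAgeMs, int nextBatchBytes) {
        return segmentBytes + nextBatchBytes > MAX_SEGMENT_BYTES  // would overflow size cap
            || segmentAgeMs >= MAX_SEGMENT_AGE_MS;                // hit the age cap
    }
}
```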

Each segment has a companion index file mapping message offsets to byte positions, allowing O(log n) random seeks. The index is sparse, so not every offset is indexed: a seek involves a binary search in the index, then a linear scan of at most a few KB of the log file.
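That seek maps directly onto a floor lookup: find the greatest indexed offset at or below the target, then scan forward in the log file. A sketch using an in-memory TreeMap in place of the on-disk .index file:

```java
import java.util.Map;
import java.util.TreeMap;

// Sparse offset index: only every Nth offset is indexed. A seek does a
// floor lookup (O(log n)), then the caller scans forward in the .log file
// from the returned byte position until it reaches the target offset.
public class SparseIndex {
    private final TreeMap<Long, Long> offsetToPosition = new TreeMap<>();

    void add(long offset, long bytePosition) {
        offsetToPosition.put(offset, bytePosition);
    }

    // Byte position to start the linear scan from for the target offset.
    long seek(long targetOffset) {
        Map.Entry<Long, Long> floor = offsetToPosition.floorEntry(targetOffset);
        return floor == null ? 0L : floor.getValue();
    }
}
```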

Producer batching: the key to high throughput · Performance

Disk writes are expensive when they're small and frequent. Producers buffer messages locally for up to linger.ms (default 0, recommended 5–100 ms) and then flush as a batch. A single disk write containing 1,000 messages costs roughly the same I/O as a single-message write, so batching is a nearly free throughput multiplier. Batch sizes of 16 KB to 1 MB are typical in production.

// Producer config: batching + compression
props.put("linger.ms", "10");           // wait 10ms to fill buffer
props.put("batch.size", "65536");     // 64 KB batch
props.put("compression.type", "lz4"); // 3-4x compression on text
props.put("acks", "1");               // leader ack only (at-least-once)
Latency tradeoff: Setting linger.ms > 0 adds up to that value of producer-side latency. For real-time systems requiring sub-5 ms write latency, keep linger.ms=0 and accept the lower throughput. For throughput-focused pipelines, 10–20 ms is usually acceptable and yields a 5–10× throughput improvement.
Message serialization and schema evolution · L6 probe

The value field in every message is raw bytes. The broker is format-agnostic. The format contract lives entirely outside the broker. Three dominant choices exist: JSON (human-readable, 2–5× larger, no schema enforcement), Protocol Buffers (compact binary, strongly typed, Google-originated), and Avro (compact binary, schema-embedded, native to the Confluent ecosystem). In practice, Avro + Confluent Schema Registry is the most common production pattern for shared message queues with multiple producer/consumer teams.

A schema registry is a central service that stores versioned schemas and assigns each schema a numeric ID. Producers embed only the schema ID (4 bytes) at the start of the message value, not the full schema. Consumers look up the schema by ID on first encounter, then cache it locally. This keeps messages compact while ensuring every consumer can deserialize any message it receives.
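The wire framing this describes, a marker byte plus a 4-byte schema ID ahead of the payload, can be sketched with a ByteBuffer. The magic byte value 0x0 follows the common Confluent convention; treat it as an assumption here.

```java
import java.nio.ByteBuffer;

// Frame a payload with a 1-byte magic marker and a 4-byte schema ID,
// as a schema-registry-aware serializer would.
public class SchemaFraming {
    static final byte MAGIC = 0x0; // assumed convention

    static byte[] encode(int schemaId, byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(5 + payload.length);
        buf.put(MAGIC).putInt(schemaId).put(payload);
        return buf.array();
    }

    // Consumer side: read the ID, look up the schema in the registry once,
    // then cache it locally for all subsequent messages.
    static int schemaIdOf(byte[] framed) {
        ByteBuffer buf = ByteBuffer.wrap(framed);
        if (buf.get() != MAGIC) throw new IllegalArgumentException("unknown framing");
        return buf.getInt();
    }
}
```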

Schema evolution is controlled by compatibility modes: BACKWARD (new schema can read old messages, safe default), FORWARD (old consumers can read new messages), FULL (both directions). Adding an optional field with a default is BACKWARD compatible. Removing a required field or changing a field type is not.

L6 probe: "How do you evolve a message schema without breaking running consumers?" Use BACKWARD compatibility, add fields with defaults, never rename fields. Deploy consumers first (they can read both old and new), then deploy producers. If a breaking change is necessary, create a new topic version (e.g. orders.v2) and run dual-write for a migration window.
Alternatives: Protobuf (proto3) · JSON Schema · Apache Thrift
05b

API design

The public API has two core operations: publishing a message and fetching messages. Everything else (offset management, consumer group coordination, partition reassignment) is handled by internal protocols or admin APIs that are unlikely to be the focus of an interview.

POST /topics/{topic}/messages: publish

// Request
POST /topics/orders/messages
Content-Type: application/json

{
  "messages": [
    {
      "key": "user-42",          // optional; determines partition
      "value": "eyJvcmRlcl...",  // base64-encoded payload
      "headers": {              // optional metadata
        "content-type": "application/json",
        "trace-id": "abc-123"
      }
    }
  ],
  "acks": "1"               // "0", "1", or "all"
}

// Response: 200 OK (acks=1 or all) or 202 Accepted (acks=0)
{
  "offsets": [
    { "partition": 3, "offset": 104920 }
  ]
}

// Validation errors
400 Bad Request: message exceeds max.message.bytes (default 1 MB)
429 Too Many Requests: producer quota exceeded (bytes/sec per client)
503 Service Unavailable: not enough in-sync replicas (acks=all, ISR < min.insync.replicas)

GET /topics/{topic}/messages: consume (fetch)

// Request: consumer pulls next batch
GET /topics/orders/messages?partition=3&offset=104921&max_bytes=1048576&wait_ms=500
Authorization: Bearer <consumer-group-token>

// Response: 200 OK
{
  "partition": 3,
  "messages": [
    {
      "offset": 104921,
      "timestamp": 1712700000000,
      "key": "user-42",
      "value": "eyJvcmRlcl...",
      "headers": { ... }
    }
  ],
  "high_watermark": 105000  // latest committed offset in partition
}

// Notable responses
204 No Content: long-poll timeout, no new messages (wait_ms elapsed)
416 Range Not Satisfiable: offset out of retention range (too old)

Optional endpoints by level

| Endpoint | Description | Level |
|---|---|---|
| POST /consumers/{group}/offsets | Commit consumer group offset | L3/L4 |
| GET /consumers/{group}/offsets | Fetch current offset per partition | L3/L4 |
| POST /consumers/{group}/seek | Reset offset to timestamp or beginning/end | L5 |
| GET /topics/{topic}/metadata | List partitions, leaders, ISR state | L5 |
| POST /transactions/init | Begin exactly-once transaction | L7/L8 |
06

Core flow: produce and consume

The durability guarantee in §2 (no loss once acked) drives every decision in the produce flow. The question is: exactly when does the broker send the ack, and how many replicas must see the message before that happens? The answer creates the central tension between latency and durability.

[Figure 3 diagram. Produce flow (acks=1): ① producer sends to the leader broker; ② leader appends to the partition log (page cache); ③ leader acks; ④ followers replicate asynchronously to their replica logs. Consume flow (pull model): ① consumer fetches at an offset; the leader checks the HWM and either serves from log/cache via ② sendfile(), or holds the connection (long-poll) until the HWM advances; ③ the consumer commits its offset to the __consumer_offsets topic.]
Figure 3: Produce flow (top) appends to the leader log and acks as soon as the leader has the message (acks=1), with replication to followers happening asynchronously. Consume flow (bottom) pulls from the leader using a stored offset; if the offset equals the HWM, the connection is held open (long-poll) until new messages arrive; after processing, the consumer commits its new offset to the offset topic.
⚠️

The gap between processing and committing the offset is where at-least-once delivery lives. If a consumer processes a message and then crashes before committing the offset, on restart it will re-read and re-process the same message. Consumers must be designed for idempotency: processing the same message twice must produce the same result as processing it once.
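A common way to achieve that idempotency is to key every side effect by (partition, offset) and skip work already recorded. A minimal sketch, where an in-memory set stands in for a database unique-key constraint:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// At-least-once consumer made idempotent: the effect of each message is
// recorded under its (partition, offset) key, so a message redelivered
// after a crash-before-commit is detected and skipped.
public class IdempotentConsumer {
    private final Set<String> applied = new HashSet<>(); // stand-in for a DB unique key
    private final Map<String, Long> balances = new HashMap<>();

    // Returns true if the message was applied, false if it was a duplicate.
    boolean process(int partition, long offset, String account, long delta) {
        String effectKey = partition + ":" + offset;
        if (!applied.add(effectKey)) return false; // already processed: skip
        balances.merge(account, delta, Long::sum); // the actual side effect
        return true;
    }

    long balance(String account) { return balances.getOrDefault(account, 0L); }
}
```

In production the dedup record and the side effect must land in the same transaction, otherwise the same gap reappears one level down.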

acks=1 vs acks=all: the key tradeoff

This decision maps directly back to the write latency NFR from §2. With acks=1, the leader receives the message into the OS page cache and acks immediately. A leader crash after ack but before followers replicate loses that message. With acks=all, the leader waits until all in-sync replicas (ISR) have acknowledged before returning, making loss structurally impossible during the ack window, at the cost of latency proportional to the replication round-trip (same AZ: ~10–30 ms; cross-AZ: ~50–100 ms).

ℹ️

Consumer reads from the nearest replica (KIP-392, Kafka 2.4+): by default, consumers fetch from the partition leader. Since Kafka 2.4, consumers can be configured to fetch from the nearest follower replica in the same AZ via client.rack + replica.selector.class=RackAwareReplicaSelector. This eliminates cross-AZ read traffic and reduces consumer latency for geographically distributed clusters, an L6+ optimization worth mentioning when cross-AZ egress cost is a concern.

07

Data model

There are three categories of entities in this system: messages themselves (the payload), the log structure (how messages are stored and indexed), and the metadata about topics and consumer state. Each is used very differently, and those differences shape where and how each entity is stored.

Access patterns

OperationFrequencyQuery shape
Append message to partitionVery high (sustained M/s)Sequential write to end of file
Fetch messages at offsetVery high (per consumer poll cycle)Random seek by offset, then sequential scan
Read partition metadata (leader, ISR)High (per produce/consume)Key lookup by topic+partition
Commit consumer offsetMedium (every N messages or T seconds)Write by group+topic+partition
Fetch consumer offset on startupLow (per consumer startup / rebalance)Key lookup by group+topic+partition

Two things stand out from this table. First, the message append and fetch patterns are purely sequential I/O: this rules out any key-value store or database as the primary storage layer and points directly to a file-based log. Second, metadata reads happen on every produce and consume request, so they must be served from memory, not from disk.

Message schema (binary, per partition)

Message batch (on-disk layout per segment) base_offset int64 (8B) batch_length int32 (4B) timestamp int64 (8B) producer_id int64 (exactly-once) sequence_no int32 (dedup) key bytes (var) value bytes (var) headers key-val pairs Sparse index (.index file) relative_offset (int32) Offset delta from base_offset position (int32) Byte offset in .log file maps to
Figure 4: On-disk message batch layout. producer_id and sequence_no enable exactly-once deduplication. The sparse index maps relative offsets to byte positions in the log file, enabling O(log n) seeks.

Metadata schema (in-memory, replicated via KRaft)

Topic and partition metadata · In-memory
// In-memory partition metadata (per broker)
{
  "topic": "orders",
  "partition": 3,
  "leader_id": 5,           // broker ID
  "isr": [5, 2, 7],         // in-sync replica broker IDs
  "replicas": [5, 2, 7],    // all assigned replicas
  "high_watermark": 104999, // last offset replicated to all ISR
  "log_start_offset": 0,    // first available offset (after retention)
  "epoch": 12               // leader epoch for fencing
}

The leader epoch is critical for preventing split-brain: a follower that receives a write from a broker claiming to be the leader must reject it if the epoch is stale. This is how Kafka prevents duplicate appends when a deposed leader comes back online and tries to continue writing.
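The fencing rule itself is a one-line comparison on the follower side; everything else is bookkeeping. A sketch:

```java
// Leader-epoch fencing on the follower side: replication requests from a
// deposed leader carry a stale epoch and are rejected.
public class EpochFence {
    private int currentEpoch;

    EpochFence(int epoch) { this.currentEpoch = epoch; }

    // Called for each replication append; returns whether it is accepted.
    boolean accept(int claimedEpoch) {
        if (claimedEpoch < currentEpoch) return false; // stale leader: fenced off
        currentEpoch = claimedEpoch;                   // adopt the newer epoch
        return true;
    }
}
```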

08

Caching strategy

A Kafka broker does not maintain its own message cache. Instead, it relies entirely on the OS page cache, the kernel's LRU-managed RAM buffer, to serve recent messages. When a consumer fetches messages that are still in the page cache, the broker uses the OS sendfile() syscall to transfer bytes directly from the page cache to the network socket, bypassing the broker process entirely. This is called zero-copy and is why a single broker thread can sustain hundreds of MB/s of consumer throughput at sub-5 ms latency.

A message queue that reads from disk on every consumer fetch would be unusable at scale. The sub-5 ms read latency NFR from §2 is only achievable because recent messages almost never touch disk: the OS page cache absorbs them. Understanding where each cache layer sits in the request path is what separates a good answer from a great one.

[Figure diagram: consumer fetch(offset) → broker process → Layer 2 (process heap: index files, metadata) → Layer 1 (OS page cache: recent log segments in RAM, ~1–3 ms read latency) → on MISS, NVMe disk (log segment files, ~5–10 ms cold read); warm reads reach the Layer 3 client fetch buffer via sendfile() zero copy. Synchronous fetch path vs. async/fallback path.]
Figure 5: Three-layer cache hierarchy. Layer 1 (OS page cache) serves the vast majority of warm reads at ~1–3 ms. Layer 2 (broker heap) holds index files and metadata. Layer 3 (consumer fetch buffer) pre-fetches ahead of the current offset. A disk read only occurs on a page cache MISS, typically for cold segments or consumers with significant lag.

Cache hierarchy

| Layer | Location | What it caches | Invalidation | Latency when hit |
|---|---|---|---|---|
| Layer 1: OS page cache | Kernel RAM on each broker host | Recent log segment bytes (LRU eviction by kernel) | Automatic: kernel evicts cold pages when RAM pressure rises; a segment roll does not invalidate, rolled segments stay warm until evicted | ~1–3 ms |
| Layer 2: Broker process heap | JVM heap in the broker process | Partition metadata, sparse index files, ISR state, producer ID dedup state | On leader election change or segment roll; garbage collected automatically | < 1 ms (in-process) |
| Layer 3: Consumer fetch buffer | Consumer client memory | Pre-fetched message batches ahead of the current processing offset | On consumer group rebalance or explicit seek; sized via fetch.min.bytes and max.partition.fetch.bytes | ~0 ms (already in RAM) |

Why the OS page cache is the right choice here

Kafka deliberately maintains no in-process message cache, relying entirely on the OS page cache. The reasoning: the JVM heap is bounded and subject to garbage-collection pauses, while the page cache spans all otherwise-free RAM and is managed by the kernel's LRU algorithm, which is well tuned for sequential access patterns. A broker restart that loses its JVM heap state recovers instantly, because the OS page cache persists across process restarts (until the host reboots).

The consequence is that broker sizing is dominated by RAM, not CPU. A broker with 256 GB RAM and a 48 TB NVMe array will serve warm reads from the page cache at ~2 ms. A broker with 32 GB RAM on the same hardware will see frequent page cache evictions under load, forcing disk reads at ~8–10 ms, a 4–5× latency regression. This is the first thing to check when diagnosing slow consumer reads in production.

💡

Zero-copy reads via sendfile(): When a consumer requests messages that are in the page cache, the broker uses the OS sendfile() syscall to transfer bytes directly from the page cache to the network socket, without copying them through userspace. This eliminates the broker process entirely from the data path for warm reads, and is why a single broker thread can serve hundreds of MB/s of consumer traffic without saturating its CPU.
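In Java, the sendfile() path is exposed as FileChannel.transferTo, which hands the copy to the kernel when the target is a socket channel. A runnable sketch, with an in-memory channel standing in for the real socket so it can run anywhere:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// transferTo() is the JVM surface of sendfile(): with a SocketChannel target
// on Linux, bytes move page cache -> NIC without entering userspace.
public class ZeroCopyServe {
    // Serve `count` bytes of a segment file starting at byte `position`.
    static String serve(Path segment, long position, long count) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stands in for the socket
        try (FileChannel log = FileChannel.open(segment, StandardOpenOption.READ);
             WritableByteChannel out = Channels.newChannel(sink)) {
            log.transferTo(position, count, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toString();
    }

    static Path tempSegment(String content) {
        try {
            Path seg = Files.createTempFile("segment-", ".log");
            Files.write(seg, content.getBytes());
            return seg;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The zero-copy win only materializes with a real socket target; copying into a byte array, as here, still crosses into userspace.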

09

Replication and durability

Kafka replicates each partition to a configurable number of brokers (replication factor). The broker currently acting as the partition leader accepts all writes; follower brokers continuously fetch and replay the leader's log. The High-Watermark (HWM) is the highest offset that has been confirmed by all in-sync replicas (ISR). Consumers can only read up to the HWM. Messages above it exist on the leader but have not been replicated to all ISR members yet. This guarantee means a consumer never reads a message that could disappear after a leader failover.

Durability in this system is a property of the replication flow, not of disk writes alone. A single disk write survives a process crash but not a disk failure. Replication across brokers (ideally across availability zones) is what makes data durable against hardware failures.
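In code terms the HWM rule is a minimum: it advances to the lowest log-end-offset across the ISR. A sketch, under the simplification that ISR membership is given:

```java
import java.util.Map;

// HWM = min(LEO over all in-sync replicas). Consumers read up to it;
// offsets above it exist on the leader but are not yet safe to expose,
// because they could be truncated after a leader failover.
public class HighWatermark {
    static long compute(Map<Integer, Long> isrLogEndOffsets) { // brokerId -> LEO
        return isrLogEndOffsets.values().stream()
                .mapToLong(Long::longValue).min().orElse(0L);
    }
}
```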

[Figure diagram: brokers in AZ-1/2/3. Broker 1 leads partition 0 with HWM = 99 and LEO = 100 (offsets 0–100); follower brokers 2 and 3 hold replicas at LEO = 99, one behind the leader; replication flows to AZ-3.]

High-Watermark (HWM): the highest offset replicated to ALL in-sync replicas. Consumers may only read up to the HWM, not beyond.

In-Sync Replica (ISR) set: a follower is removed from the ISR if it falls more than replica.lag.time.max.ms behind the leader (default 30 s). When the ISR shrinks below min.insync.replicas, acks=all writes are rejected (503) until the ISR recovers.
Figure 6: Three brokers across three AZs. Replication factor 3 tolerates simultaneous failure of any two brokers. HWM is the safe read boundary: messages above HWM exist on the leader but have not been confirmed by all ISR members yet.

Leader election and log truncation

When a new leader is elected, it issues a LeaderEpochRequest to all follower replicas. Any follower whose log-end-offset (LEO) exceeds the new leader's committed high-watermark for that epoch must truncate the excess entries. This log truncation step is what enforces the HWM contract: no consumer ever reads a message that was written above the HWM, because followers discard those writes when they learn of the new leader's epoch. Without this truncation, a deposed leader that comes back online and a newly-elected leader could have diverged logs, potentially creating duplicate or inconsistent consumer reads.

The leader epoch (the epoch field in the metadata schema in §7) is a monotonically increasing counter that prevents split-brain: followers reject writes from any broker whose epoch is lower than the controller's current assignment. An old leader that recovers from a network partition detects the stale epoch and stops accepting writes before rejoining as a follower.
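Epoch fencing can be sketched as a tiny state machine — an illustrative model of the rule, not Kafka's code:

```python
# Sketch of leader-epoch fencing: a replica rejects any write stamped
# with an epoch older than the newest epoch it has observed.
# Illustrative model only.

class Replica:
    def __init__(self):
        self.current_epoch = 0
        self.log: list[str] = []

    def observe_epoch(self, epoch: int) -> None:
        # Learned of a new leader election (e.g. from the controller).
        self.current_epoch = max(self.current_epoch, epoch)

    def append(self, epoch: int, record: str) -> bool:
        if epoch < self.current_epoch:
            return False  # fence the deposed leader's write
        self.current_epoch = epoch
        self.log.append(record)
        return True

r = Replica()
assert r.append(epoch=1, record="a")      # original leader, epoch 1
r.observe_epoch(2)                        # new leader elected at epoch 2
assert not r.append(epoch=1, record="b")  # stale leader is fenced
assert r.append(epoch=2, record="c")      # new leader's write accepted
```

Because the epoch only ever increases, a recovered old leader can never sneak a write past a replica that has seen the newer epoch.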

🎯

Classic L5 probe: “What happens to messages between the HWM and the leader’s LEO when a new leader is elected?” Answer: they are truncated when the replicas reconcile with the new leader, and the new leader’s LEO becomes the cluster’s HWM baseline for that epoch. A producer using acks=all has not yet received acks for those messages, so its retry (safe with idempotence enabled) re-delivers them. A producer using acks=1 may already have received acks for them — those messages are silently lost and will never be retried, which is exactly why acks=all is the only setting that upholds a no-loss guarantee.

Exactly-once delivery: how it actually works

Idempotent producer: deduplicating retried messages Producer side

When a producer retries after a timeout (because it never received an ack), it may deliver the same message twice: the original write that succeeded silently, plus the retry. The broker deduplicates by tracking the (producer_id, partition, sequence_number) triple and silently discarding any sequence number it has already persisted. This converts at-least-once into exactly-once on the delivery path at negligible cost, because the sequence number already travels in the batch header.

The limit of 5 in-flight requests per connection exists because the broker’s deduplication window only covers a bounded sequence range (roughly the last five batches per producer and partition). With more in-flight batches than that, a retried batch carrying a higher sequence number could arrive out of order in a way the broker cannot detect, allowing a duplicate to slip through as a genuine new message.
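The dedup rule can be sketched as follows — an illustrative model of the broker-side check, not Kafka's implementation:

```python
# Sketch: the broker accepts a batch only if its sequence number is
# exactly one past the last sequence it persisted for that
# (producer_id, partition) pair. Illustrative model only.

class PartitionLog:
    def __init__(self):
        self.last_seq: dict[int, int] = {}  # producer_id -> last sequence
        self.records: list[str] = []

    def append(self, producer_id: int, seq: int, record: str) -> str:
        last = self.last_seq.get(producer_id, -1)
        if seq <= last:
            return "DUPLICATE"        # a retry of an already-persisted batch
        if seq > last + 1:
            return "OUT_OF_SEQUENCE"  # gap detected: producer must resync
        self.last_seq[producer_id] = seq
        self.records.append(record)
        return "OK"

log = PartitionLog()
assert log.append(producer_id=7, seq=0, record="m0") == "OK"
assert log.append(producer_id=7, seq=1, record="m1") == "OK"
assert log.append(producer_id=7, seq=1, record="m1") == "DUPLICATE"  # retry
assert log.records == ["m0", "m1"]  # the retry never reached the log
```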

Required config: enable.idempotence=true, acks=all, max.in.flight.requests.per.connection ≤ 5
Transactional API: atomic produce + offset commit Consumer side

End-to-end exactly-once (read → process → write → commit) requires that the final offset commit and any downstream produce happen atomically. Kafka's transaction API wraps these operations in a two-phase commit: the producer registers with a transaction coordinator, writes its messages and offset commits as part of the open transaction, and on commit the coordinator writes commit markers to every partition the transaction touched. Consumers configured with isolation.level=read_committed will not see any of the writes until the commit marker arrives.

Tradeoff: Transaction overhead adds ~10–20 ms per transaction plus one extra log write. For pipelines processing 10,000+ messages per second, batch your transactions (commit every N messages, not per message) or the overhead dominates.
Cost: +10–20 ms latency; transaction coordinator overhead
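The batching advice can be sketched with a stub producer — the method names mirror Kafka's transactional API, but this is a self-contained model, not a real client:

```python
# Sketch: amortize the per-transaction commit cost by committing every
# N messages. KafkaLikeProducer is a stub; in a real pipeline it would
# be a transactional Kafka producer. Illustrative only.

BATCH_SIZE = 3  # tiny for the demo; use hundreds+ in practice

class KafkaLikeProducer:
    def __init__(self):
        self.commits = 0
        self.pending: list = []
        self.committed: list = []

    def begin_transaction(self):
        self.pending = []

    def send(self, topic: str, value: str):
        self.pending.append((topic, value))

    def send_offsets_to_transaction(self, offsets: dict):
        self.pending.append(("__offsets__", offsets))

    def commit_transaction(self):
        self.committed.extend(self.pending)
        self.commits += 1

def run(messages: list[str], producer: KafkaLikeProducer) -> None:
    producer.begin_transaction()
    n = 0
    for i, msg in enumerate(messages):
        producer.send("output-topic", msg.upper())  # the "process" step
        n += 1
        if n == BATCH_SIZE:
            # Offsets commit inside the same transaction, so output
            # writes and input progress become atomic.
            producer.send_offsets_to_transaction({"input-0": i + 1})
            producer.commit_transaction()
            producer.begin_transaction()
            n = 0
    if n:  # flush the final partial batch
        producer.send_offsets_to_transaction({"input-0": len(messages)})
        producer.commit_transaction()

p = KafkaLikeProducer()
run(["a", "b", "c", "d", "e"], p)
print(p.commits)  # 2 commits for 5 messages instead of 5
```

The same loop with BATCH_SIZE = 1 would pay the coordinator round-trip on every single message, which is where the overhead starts to dominate throughput.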
10

Deep-dive scalability

Scaling a message queue is fundamentally about scaling partitions and scaling disk. Compute is rarely the bottleneck: network bandwidth and disk I/O are. The diagram below shows a production-grade multi-region setup.

[Figure 6 diagram: US-East (primary) hosts the producers, a 3-node KRaft controller quorum, and a broker pool (each broker: 12 × 4 TB NVMe = 48 TB raw, ~200 MB/s write bandwidth) with cold segments tiered to S3; consumer groups A, B, C read from it. MirrorMaker 2 asynchronously replicates topics (with configurable lag) to EU-West, whose broker pool follows the US topics for low-latency EU reads and DR failover, backed by its own S3 tier.]
Figure 6: Production multi-region deployment. US-East is the primary cluster; a cross-cluster replication tool (MirrorMaker 2) continuously mirrors topic data to EU-West for both low-latency EU reads and disaster recovery failover.
How many partitions should you choose? L5 deep-dive

Partitions are the unit of parallelism for both producers and consumers. A topic with P partitions can be consumed by at most P consumers in a group simultaneously. A useful starting heuristic: partitions = max(target_consumers, ceil(peak_throughput_MB_s / 100)), where 100 MB/s is a conservative single-partition write ceiling on NVMe. Add 20–50% headroom for growth.
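The heuristic can be packaged as a small calculator; the 100 MB/s per-partition ceiling and the 30% headroom used here are assumptions to tune for your hardware:

```python
import math

# Sketch of the partition-count heuristic above. The per-partition
# write ceiling (100 MB/s) and headroom (30%) are assumed defaults.

def partition_count(peak_mb_s: float, target_consumers: int,
                    per_partition_mb_s: float = 100.0,
                    headroom: float = 0.3) -> int:
    base = max(target_consumers, math.ceil(peak_mb_s / per_partition_mb_s))
    return math.ceil(base * (1 + headroom))

# 1 GB/s peak needs ceil(1000/100) = 10 partitions for throughput,
# but 16 consumers dominate; with 30% headroom -> 21 partitions.
print(partition_count(peak_mb_s=1000, target_consumers=16))  # 21
```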

Caution: Partitions can be added to a topic but never removed without data migration. Key-based producers will start routing keys to the new partitions immediately, breaking ordering for any entity whose hash now maps to a different partition than it did historically. For topics with strict ordering requirements, over-provision from the start (e.g. 200–1000 partitions) and plan never to change the count.
Consumer rebalancing: the thundering herd L6 deep-dive

When a consumer joins or leaves a group, Kafka triggers a rebalance: all partitions are temporarily unassigned and reassigned across living consumers. During a rebalance, consumption for the entire group stops. With large consumer groups (100+ consumers), rebalances triggered by routine deployments can cause 30–60 second pauses.

Mitigation: use static group membership (group.instance.id). A consumer with a static ID rejoins after a restart without triggering a full rebalance, reclaiming its previously assigned partitions as long as it returns within session.timeout.ms. For Kubernetes pod restarts, this typically reduces rebalance events by 80%.

Also helps: incremental cooperative rebalancing (KIP-429); a longer session.timeout.ms
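A consumer configuration combining both mitigations might look like this — the keys follow the Java client's config names, while the instance ID, timeout value, and assignor choice are illustrative:

```python
# Sketch of a consumer config for static membership + cooperative
# rebalancing. Keys are Kafka Java-client config names; the concrete
# values here are illustrative choices, not recommendations.

consumer_config = {
    "group.id": "order-processors",
    # Stable per-pod identity (e.g. a StatefulSet ordinal): a restart
    # that returns within session.timeout.ms reclaims its previous
    # partitions without triggering a full rebalance.
    "group.instance.id": "order-processor-3",
    # Give restarted pods time to come back before the group rebalances.
    "session.timeout.ms": 120_000,
    # KIP-429 incremental cooperative rebalancing: only partitions that
    # actually move pause consumption, not the whole group.
    "partition.assignment.strategy":
        "org.apache.kafka.clients.consumer.CooperativeStickyAssignor",
}
```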
Tiered storage: petabyte-scale retention L7 deep-dive

Storing years of data on broker disks becomes prohibitively expensive past a few weeks. Tiered storage moves closed log segments to an object store (S3/GCS) automatically. Brokers only retain the last N days locally; the object store provides the rest. Consumers reading historical offsets are transparently redirected to fetch from the object store via a fetch proxy layer. Cost comparison: NVMe storage ~$0.20/GB-month vs. S3 ~$0.023/GB-month, roughly 8× savings for cold data.
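Working that cost arithmetic for 1 PB of cold data, at the per-GB-month prices quoted above (illustrative list prices, not a quote):

```python
# Sketch: monthly cost of 1 PB of cold data on NVMe broker disks vs.
# S3 tiered storage, at the prices cited in the text.

COLD_GB = 1_000_000                     # 1 PB
NVME_GB_MONTH, S3_GB_MONTH = 0.20, 0.023

nvme = COLD_GB * NVME_GB_MONTH          # $200,000 / month
s3 = COLD_GB * S3_GB_MONTH              # $23,000 / month
print(f"monthly savings: ${nvme - s3:,.0f}, ratio: {nvme / s3:.1f}x")
# monthly savings: $177,000, ratio: 8.7x
```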

Tradeoff: Cold reads from S3 have higher latency (50–200 ms vs. 1–5 ms from local disk). For consumers that regularly seek to old offsets (e.g. re-processing), this adds significant latency. Solutions: pre-warm a local cache for anticipated replay windows, or accept the latency as acceptable for batch reprocessing jobs.
Back-pressure and producer quotas L6 deep-dive

A shared Kafka cluster serving multiple teams needs hard per-producer quotas to prevent a single runaway producer from saturating disk or network. Quota enforcement works via a token-bucket model per client ID: the broker measures bytes per second produced by each client, and when a client exceeds its quota, the broker throttles it by delaying its produce response (effectively back-pressuring the producer without dropping messages).

# Set a 10 MB/s produce quota for client "order-service" via the Admin CLI
kafka-configs --bootstrap-server broker:9092 --alter \
  --add-config 'producer_byte_rate=10485760' \
  --entity-type clients --entity-name order-service
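The token-bucket mechanism itself can be sketched in a few lines — an illustrative model, not broker code:

```python
# Sketch of per-client token-bucket throttling: tokens refill at the
# quota rate; a request that overdraws the bucket is delayed (not
# rejected) until the deficit would be refilled. Illustrative only.

class TokenBucket:
    def __init__(self, rate_bytes_s: float, burst_bytes: float):
        self.rate, self.capacity = rate_bytes_s, burst_bytes
        self.tokens, self.last = burst_bytes, 0.0

    def throttle_ms(self, now_s: float, request_bytes: int) -> float:
        """Return how long to delay the response for this request."""
        self.tokens = min(self.capacity,
                          self.tokens + (now_s - self.last) * self.rate)
        self.last = now_s
        self.tokens -= request_bytes
        if self.tokens >= 0:
            return 0.0  # within quota: respond immediately
        # Over quota: delay until the deficit would be refilled.
        return -self.tokens / self.rate * 1000

# 10 MB/s quota with a 10 MB burst allowance.
bucket = TokenBucket(rate_bytes_s=10 * 1024 * 1024,
                     burst_bytes=10 * 1024 * 1024)
assert bucket.throttle_ms(0.0, 5 * 1024 * 1024) == 0.0  # within quota
delay = bucket.throttle_ms(0.1, 10 * 1024 * 1024)       # burst exceeded
assert delay > 0  # this produce response is delayed, not dropped
```

Delaying the response rather than returning an error is the key design choice: the producer's in-flight window fills up and it slows down naturally, with no message loss.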
11

Failure modes & edge cases

| Scenario | Problem | Solution | Level |
| --- | --- | --- | --- |
| Broker crash (any) | Partitions whose leader was on the crashed broker are unavailable | Controller detects the failure via heartbeat timeout (~30 s), triggers leader election, and promotes a follower from the ISR. Producers see LEADER_NOT_AVAILABLE and retry | L3/L4 |
| Producer retry → duplicate message | A timeout causes the producer to resend; the broker writes the same message twice | Enable the idempotent producer (enable.idempotence=true). The broker deduplicates by (producer_id, partition, sequence_no) | L3/L4 |
| Consumer crash mid-processing | Consumer processed a message but crashed before committing its offset; it reprocesses on restart | Make consumers idempotent (e.g. upsert to a DB keyed on message ID), or use exactly-once transactions for atomic process+commit | L5 |
| ISR drops to 1 (under-replication) | With acks=all and min.insync.replicas=2, writes fail (503) | Alert immediately; reduce min.insync.replicas as an emergency measure (accepting durability risk), or let the follower catch up before continuing | L5 |
| Hot partition (imbalanced key distribution) | One partition gets 10× the traffic of the others; its broker saturates | For genuinely hot keys, append a random salt suffix to the partition key and have consumers merge and re-order. Alternatively, maintain a hot-key exception list and route those keys to dedicated topics with more partitions | L5 |
| Split-brain after network partition | An old leader (believing it is still leader) accepts writes after a new leader is elected | The leader epoch prevents this: the new leader gets a higher epoch, followers reject writes from the old leader once they know the current epoch, and the old leader detects the mismatch and steps down | L7/L8 |
| Consumer group lag explosion | A consumer falls behind; log retention expires before it catches up, and its offsets fall below log_start_offset | Monitor consumer lag in real time (JMX metric: records-lag) and alert before lag reaches the retention boundary. Once offsets expire, reset to earliest (some messages re-read) or skip to latest (some messages lost) | L5 |
| Cross-region message ordering | MirrorMaker 2 replicates with async lag; an EU consumer reads messages in a different order than a US consumer | Accept that cross-region ordering is best-effort, not guaranteed. For strict ordering, route all reads to the primary region at the cost of cross-region read latency | L7/L8 |
| Poison-pill / dead-letter | A malformed message that crashes the consumer on every attempt permanently blocks the partition: no subsequent messages in that partition are processed | Circuit-break on the consumer side: skip the message after N retries and publish it to a dead-letter topic (DLQ) for manual inspection. Schema validation at publish time via a schema registry is the proactive defense: reject malformed messages at the source. Never advance past a deserialization failure without a DLQ escape hatch | L5 |
12

How to answer by level

L3/L4: Junior / SDE I-II Working system
What good looks like
  • Clear producer → broker → consumer flow
  • Topics, partitions, offsets: defined clearly
  • Consumer pulls (not push), advances offset after processing
  • Messages are not deleted after consumption; multiple groups read the same topic independently
  • Replication exists; explains why (durability, not just availability)
What separates L5 from L3/L4
  • Unprompted discussion of acks=1 vs acks=all and the latency/durability tradeoff
  • Proactively asks: what are the message ordering requirements?
  • Explains the high-watermark concept
  • Knows that consumer offset commit creates the at-least-once window
L5: Senior SDE Tradeoffs
What good looks like
  • Explains ISR, HWM, and the ISR-shrink → 503 scenario
  • Traces exactly-once delivery end-to-end (idempotent producer + transactional consumer)
  • Explains consumer group rebalancing and session.timeout.ms tradeoff
  • Can calculate partition count from throughput requirements
  • Knows hot partition is the Achilles heel of key-based routing
What separates L6 from L5
  • Static group membership and cooperative rebalancing (KIP-429)
  • Discusses tiered storage economics and cold read latency tradeoff
  • Designs producer quota enforcement and back-pressure
  • Knows the difference between KRaft and ZooKeeper operationally
L6: Staff SDE End-to-end ownership
What good looks like
  • Designs for operational concerns: multi-team quotas, partition rebalancing for cluster expansion, rack-aware replica assignment
  • Articulates when to use Kafka vs. SQS vs. Pub/Sub based on concrete requirements
  • Builds tiered storage into the capacity estimate from the start
  • Designs monitoring: consumer lag SLO, ISR shrink alert, producer throttle metrics
What separates L7 from L6
  • Initiates cross-region MirrorMaker design unprompted
  • Discusses MirrorMaker failover: how do you promote EU to primary without data loss?
  • Proposes schema registry to enforce compatibility without coordination overhead
L7/L8: Principal / Distinguished Architecture ownership
What good looks like
  • Questions whether Kafka is the right tool: stream processing? (consider Flink/Spark on top); simple task queues? (SQS is cheaper); event sourcing? (partition count strategy changes fundamentally)
  • Multi-tenant platform design: per-team quota enforcement, partition isolation, cost attribution
  • Schema evolution strategy: how do you break compatibility without downtime?
  • Disaster recovery plan for full region failure: RPO/RTO targets, MirrorMaker lag monitoring, automated promotion runbook
What makes this memorable
  • Initiates build vs. buy: "If I'm at a company where Confluent Cloud or MSK fits, why would I operate this myself?"
  • Brings cost model: "At 1 PB/year retained, object storage saves $X vs. broker disks"
  • Mentions Kafka's known limitations (partition count ceiling per cluster before KRaft, JVM GC pauses, in-order exactly-once is expensive)

Classic probes table

| Question | L3/L4 | L5/L6 | L7/L8 |
| --- | --- | --- | --- |
| "How do consumers know where to start reading?" | Stored offset per partition; resets to earliest or latest on first start | Explains offset commit atomicity, manual vs. auto commit tradeoffs, seek-to-timestamp for replay | Schema-versioned replay strategy; offset reset during data migration; impact on downstream aggregations |
| "What happens if a broker crashes immediately after acknowledging a write?" | "Another broker takes over", vague on mechanism | Explains acks=1 vs. acks=all, ISR, HWM truncation on leader re-election, why LEO can be ahead of HWM | Leader epoch fencing, the log truncation protocol, the idempotent producer dedup window, epoch monotonicity as the safety invariant |
| "How would you add a new topic for a different team without affecting existing consumers?" | "Create a new topic", no cross-team concern | Quota per client ID, rack-aware partition assignment to avoid hotspots on shared brokers, partition count sizing | Tenant isolation model: dedicated vs. shared cluster economics, partition rent / cost attribution, compatibility testing in a shadow cluster before production |
| "You need five years of event history accessible within 100 ms. How?" | "Store everything on brokers", cost-blind | Tiered storage to S3 for cold segments; hot segments on NVMe; tradeoff: cold reads are 50–200 ms, not 1–5 ms | Hybrid: tiered storage for durability plus a separate columnar store (ClickHouse / BigQuery) for fast analytical queries over old data; replay via Spark/Flink from S3 for backfill |
How the pieces connect
  • 1 M msg/s throughput requirement (§2) → hash-based partitioning across 100+ partitions (§5) → consumer groups with one consumer per partition for maximum read parallelism (§6)
  • "No loss once acked" durability NFR (§2) → replication factor 3 across AZs, min.insync.replicas=2, acks=all for critical topics (§9) → ISR-shrink → 503 failure mode as the expected behavior when durability can't be guaranteed (§11)
  • 7-day hot retention with years of history (§2) → tiered storage: NVMe for the 7-day window, S3 for cold segments (§4) → ~8× cost reduction for cold data vs. keeping everything on broker disks (§9)
  • Strict within-partition ordering NFR (§2) → murmur2 partition-key hashing ensures the same entity always routes to the same partition (§5) → hot-key imbalance is the primary throughput failure mode when key cardinality is low (§10)
  • At-least-once delivery as the default (§2) → offset commit after processing creates the reprocessing window (§6) → idempotent producer + transactional API as the upgrade path to exactly-once, traced fully in §9
  • 99.99% availability SLA (§2) → multi-broker cluster with KRaft controller quorum enables automatic leader election (§4, §8) → cross-region MirrorMaker 2 covers region-level failures for DR (§9)
