System Design Interview Guide

Design a Distributed Message Queue

The interface is two operations: write a message, read it back. The hard part is doing that reliably at a million messages per second, without losing a single one.

L3/L4: working system · L5/L6: tradeoffs & durability · L7/L8: architecture ownership

~40 min read  ·  12 sections

[Figure: architecture diagram of a distributed message queue showing producers, partition router, broker cluster with leader and follower replicas, and consumer groups]
01

What the interviewer is testing

A distributed message queue, such as Apache Kafka, Amazon Kinesis, or Google Pub/Sub, sits between services that produce events and services that consume them. The question looks simple: write here, read there. But the moment you add durability, ordering, and millions of events per second, the design space explodes.

Interviewers use this question to probe whether you understand distributed systems fundamentals, not just one vendor's API. The core tensions are: write throughput vs. durability guarantees, strict ordering vs. horizontal scalability, and push-based vs. pull-based consumption. These tensions play out at every level:

| Level | Primary focus | Hard question at this level |
|---|---|---|
| L3/L4 | Working producer/consumer model with a single broker | How do consumers know where to read next? |
| L5 | Replication, durability, consumer group coordination | What happens when a broker crashes mid-write? |
| L6 | Exactly-once delivery, partition rebalancing, back-pressure | How do you eliminate duplicates without sacrificing throughput? |
| L7/L8 | Cross-region replication, tiered storage, cost at scale | How do you serve petabyte-scale history without running out of disk? |
🎯

The single most differentiating question across all levels is offset management. How are offsets stored, who advances them, and what happens if a consumer crashes between processing a message and committing its offset? Nail this and the rest of the design falls into place.

02

Requirements clarification

The right design depends entirely on scope. A message queue for an internal microservices event bus has very different requirements from a shared platform ingesting click-stream events from 200 million mobile users. Throughput targets, retention windows, and multi-tenancy constraints all change dramatically based on that scope.

Functional requirements

| Requirement | Description |
|---|---|
| Publish messages | Producers write messages to named topics with an optional partition key |
| Subscribe and consume | Consumer groups pull messages from topics starting at any stored offset |
| Replay | Consumers can rewind to any offset within the retention window |
| Multi-tenant topics | Multiple independent consumer groups can read the same topic independently |
| Message retention | Messages are retained for a configurable period (hours to years) |
| Log compaction (opt-in) | Retain only the latest value per key for CDC and event-sourcing topics via cleanup.policy=compact. Note: the __consumer_offsets topic itself uses compaction to track the current offset without growing unboundedly |

Non-functional requirements

| NFR | Target | Notes |
|---|---|---|
| Write throughput | 1 M messages/s sustained | Across the whole cluster; single partition ~100 MB/s |
| Write latency (p99) | < 20 ms (at-least-once), < 100 ms (exactly-once) | Durability and latency are in direct tension here |
| Read latency (p99) | < 5 ms for warm consumers | Zero-copy sendfile to consumer |
| Availability | 99.99% (52 min/year downtime) | Requires multi-broker replication |
| Durability | No message loss once acked | Requires flushed replicas before ack |
| Ordering | Strict within partition; best-effort across partitions | Total ordering requires a single partition, which kills throughput |
| Delivery semantics | At-least-once default; exactly-once opt-in | Exactly-once requires idempotent producers + transactional consumers |
| Retention | 7 days hot; years via tiered storage | Tiered storage offloads cold segments to an object store (S3) |

NFR reasoning

Write latency target: 20 ms (at-least-once) · Durability tradeoff

20 ms (inclusive of a linger.ms=10 producer batching window; with linger.ms=0, typical p99 is 3–10 ms at the cost of lower throughput) gives us room for acks=1 as the default: the leader broker receives the message into the OS page cache and acknowledges immediately, without waiting for any follower to confirm or for a disk fsync. Kafka does not fsync on every write by default. Durability against a leader power failure comes from replication, not from disk writes alone. The risk is a narrow loss window: if the leader crashes after acknowledging but before any follower has replicated the message, that message is gone. For most workloads (analytics, click-stream, metrics) this window is acceptable because the data has low value relative to its volume.

Tradeoff: Setting acks=all with min.insync.replicas=2 requires the leader and at least one follower to confirm before acking. This eliminates the loss window, but adds replication round-trip latency: ~10–30 ms within a single AZ, or ~50–100 ms for cross-AZ replication (round-trip to a follower in a different availability zone). These are two independent settings: acks controls what the producer waits for; min.insync.replicas controls the minimum ISR size below which acks=all writes are rejected (503). Use both together for financial or audit-critical events.
Drives: acks=1 default mode · leader-flush-then-ack flow in §6 · ISR and HWM design in §9
Availability: 99.99% (52 min/year) · Replication factor

99.99% requires that a single broker failure causes no visible downtime. This means partition leaders must be re-elected automatically and consumer clients must reconnect transparently within that 52-minute budget per year. In practice, leader election takes ~5–30 seconds with a metadata service (ZooKeeper/KRaft), so even weekly broker restarts fit the SLA.

Tradeoff: Crossing into 99.999% would require eliminating planned maintenance windows and using active-active multi-region routing, which multiplies operational cost by roughly 3–5×.
Drives: replication factor = 3 in §9 · controller re-election in §10
Ordering: within-partition only · Throughput enabler

Total global ordering across all partitions and all consumers would require a single write path, eliminating parallelism. The practical solution is key-based routing: all events belonging to the same entity (same user ID, same order ID) hash to the same partition, guaranteeing that the consumer for that partition sees those events in write order.

Tradeoff: If a hot key (a viral user generating 100× the traffic of others) maps to one partition, that partition becomes a bottleneck. Mitigation: add a random salt suffix to the partition key for overloaded entities and handle out-of-order events in the consumer.
Drives: murmur2 hash partitioning in §5 · hot-key section in §10
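The salting mitigation can be sketched in a few lines. The hot-key set, the `#` separator, and the salt count of 8 are illustrative assumptions, not values from the text; in practice hot keys would be discovered by traffic monitoring.

```java
import java.util.Set;
import java.util.concurrent.ThreadLocalRandom;

// Sketch of hot-key salting: keys known to be hot are fanned out across
// N salted sub-keys, trading per-key ordering for even partition load.
public class SaltedKeyRouter {
    private final Set<String> hotKeys; // discovered via traffic monitoring (assumed)
    private final int saltCount;       // e.g. 8 salted sub-keys per hot key

    public SaltedKeyRouter(Set<String> hotKeys, int saltCount) {
        this.hotKeys = hotKeys;
        this.saltCount = saltCount;
    }

    // Returns the partition key to hash: hot keys get a random salt suffix,
    // so their traffic spreads over up to saltCount partitions instead of one.
    public String partitionKey(String key) {
        if (!hotKeys.contains(key)) return key;
        int salt = ThreadLocalRandom.current().nextInt(saltCount);
        return key + "#" + salt;
    }
}
```

Consumers of a salted key must tolerate out-of-order events for that entity, as the tradeoff above notes.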
03

Capacity estimation

Message queues are fundamentally I/O-bound systems. The interesting estimation dimensions are write throughput and storage, not compute. Think in terms of bytes per second first, then convert to broker count and disk size.

Capacity estimator (worked example)

Assumptions: 500 M messages/day · 1,000 B per message · replication factor 3 · 7-day hot retention.

| Metric | Value | Notes |
|---|---|---|
| Write QPS | ~5,787 messages/s | 500 M / 86,400 s |
| Peak write QPS | ~17,361 messages/s | peak = write × 3× (assumed) |
| Write bandwidth | ~5.5 MB/s sustained | |
| Storage (hot) | ~3.3 TB raw | ×3 for replicas |
| Consumer bandwidth | ~16.5 MB/s | total fan-out |
| Min brokers (est.) | 2 | at 200 MB/s per broker (×RF=3); ceil(peak_bw × RF / 200 MB/s) |
💡

Key insight: At replication factor 3, every byte you write consumes 3× the storage on brokers. With tiered storage (cold segments offloaded to an object store like S3), you keep only the last N days hot on brokers and let the object store handle the long tail, typically at 10–100× cost reduction for data older than 7 days.
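The estimator's arithmetic is easy to reproduce. The sketch below uses the same assumptions (500 M messages/day, 1,000 B/message, 3× peak factor, 7-day retention) but sticks to decimal units, so bandwidth and storage come out slightly above the estimator's figures (~5.8 MB/s vs. 5.5, ~3.5 TB vs. 3.3, which reflect binary-unit rounding).

```java
// Back-of-envelope capacity math for a message queue, decimal units throughout.
public class CapacityEstimate {
    static long writeQps(long messagesPerDay)        { return messagesPerDay / 86_400; }
    static long peakQps(long qps, int peakFactor)    { return qps * peakFactor; }
    static double writeMBps(long qps, long msgBytes) { return qps * msgBytes / 1e6; }
    static double hotStorageTB(long messagesPerDay, long msgBytes, int retentionDays) {
        return messagesPerDay * (double) msgBytes * retentionDays / 1e12;
    }

    public static void main(String[] args) {
        long qps = writeQps(500_000_000L); // ~5,787 msg/s sustained
        double hot = hotStorageTB(500_000_000L, 1_000, 7); // ~3.5 TB before replication
        System.out.printf("qps=%d peak=%d bw=%.1f MB/s hot=%.1f TB (x3 replicated=%.1f TB)%n",
                qps, peakQps(qps, 3), writeMBps(qps, 1_000), hot, hot * 3);
    }
}
```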

04

High-level architecture

A distributed message queue is a durable, ordered log that decouples the services that produce events from the services that consume them. Producers write messages to named topics; consumers pull messages at their own pace using a stored offset. Unlike traditional queues, messages are not deleted after delivery; they are retained for a configurable window, allowing multiple independent consumer groups to replay the same data independently.

The core insight of a log-based message queue is that a durable, ordered, append-only log is both the storage layer and the transport layer. Every message is written once to disk and read many times by many consumers, with zero network copies between storage and consumer in the happy path (a technique called zero-copy, via the OS sendfile() syscall).

[Figure 1 diagram: producers (services/apps, mobile/web) → partition router (hash(key) → partition) → broker cluster (leaders for P0/P1 plus follower replicas; append-only log segments on NVMe; tiered storage of cold segments to S3/GCS) → consumer groups A (analytics) and B (search); coordinated by the metadata service (ZooKeeper/KRaft controller) and the offset store (__consumer_offsets). Legend: synchronous vs. async/internal paths.]
Figure 1: High-level architecture. Producers route to the partition leader broker via a hash; brokers replicate asynchronously to followers; consumers pull from brokers using stored offsets; cold log segments spill to object storage.

Component breakdown

Partition Router (load balancer layer): receives producer requests and routes each message to the correct partition leader. It holds a cached partition-to-broker mapping from the metadata service; on a cache miss or leader change, it refreshes. This layer can be embedded in the producer client library rather than a separate server hop.

Broker cluster: the core of the system. Each broker owns the leadership for a set of partitions and runs the append-to-log cycle. Brokers are symmetric: any broker can be a leader for some partitions and a follower for others. Leadership is assigned by the controller and balances automatically across the cluster.

Metadata service (ZooKeeper / KRaft): tracks which broker is the leader for each partition, which brokers are alive, and what the current ISR (in-sync replica) set is. With KRaft (Kafka's built-in Raft consensus mode), this component is no longer external to Kafka, reducing operational complexity at the cost of requiring a dedicated controller quorum within the cluster.

Append-only log: the actual message store. Each partition is a sequence of segment files on disk. Writes are always sequential appends; reads are random-access by offset but typically sequential as consumers advance through the log. The OS page cache acts as a read buffer, making recent messages essentially free to serve without touching disk.

Offset store (__consumer_offsets): an internal topic that tracks the last committed offset for each consumer group and partition. This is itself a distributed log, giving offset commits the same durability guarantees as regular messages. It also provides the coordination point for consumer rebalancing.

Tiered storage: closed log segments are uploaded to an object store (a durable blob service like S3 or GCS). Brokers only retain the last N days of hot segments locally; consumers requesting older offsets are transparently redirected to fetch from the object store. This is what makes multi-year retention economically viable.
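The offset store's compaction behavior is worth internalizing: commits are appended like any other message, but only the latest value per (group, topic, partition) key needs to survive. A toy in-memory model of that contract (not the real protocol):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the __consumer_offsets topic: an append-only stream of
// commits, compacted so only the latest offset per key survives.
public class OffsetStore {
    // Compacted view: "group|topic|partition" -> last committed offset.
    private final Map<String, Long> compacted = new HashMap<>();

    public void commit(String group, String topic, int partition, long offset) {
        compacted.put(group + "|" + topic + "|" + partition, offset);
    }

    // What a consumer reads back on startup or after a rebalance;
    // -1 means "no committed offset yet" (start per reset policy).
    public long fetch(String group, String topic, int partition) {
        return compacted.getOrDefault(group + "|" + topic + "|" + partition, -1L);
    }
}
```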

Architectural rationale

Why append-only logs instead of a traditional message queue? · Core model

Traditional message queues (like AMQP brokers, such as RabbitMQ and ActiveMQ) delete messages after delivery. This makes replay impossible without external storage. The log-based model keeps messages until the retention window expires, treating the queue as a persistent, ordered journal. Any consumer can read at any point in the log, enabling replay, backfill, and independent consumer groups with no coordination between them.

Tradeoff: Log retention costs disk proportional to time, not to unconsumed message count. A fast consumer incurs the same storage cost as a slow consumer: the opposite of a traditional queue, where cost depends on lag depth.
Alternatives: RabbitMQ · ActiveMQ Artemis · Amazon SQS
Why embed the router in the producer client rather than a separate layer? · Topology

A separate routing proxy adds a network hop and a failure domain. By embedding the partition metadata cache in the producer client library, producers route directly to the partition leader, requiring only a single network round-trip per produce request. The client refreshes its metadata cache when it receives a NOT_LEADER_FOR_PARTITION error, which is the signal that leadership has changed.

TradeoffFat clients mean the routing logic is spread across every producer language runtime (Java, Python, Go, .NET). Protocol-level bugs must be fixed across all SDKs simultaneously.
Alternative: sidecar proxy (Envoy)
Why KRaft instead of ZooKeeper for metadata? · Controller

A distributed coordination service (ZooKeeper) was an external dependency requiring its own cluster, its own monitoring, and its own expertise. ZooKeeper had a hard limit around 200,000 partitions before metadata propagation became slow. KRaft uses Raft consensus within the Kafka broker process, removing the external dependency and scaling to millions of partitions. In an interview, mention ZooKeeper as the historical baseline and KRaft as the current standard.

Alternatives: ZooKeeper mode (legacy, still supported) · etcd (not used in Kafka)

Real-world comparison

| Decision | This design (Kafka-like) | Amazon Kinesis | Google Pub/Sub |
|---|---|---|---|
| Storage model | Append-only log segments on broker disk + tiered to S3 | Shard-based log, managed by AWS (no local disk) | Fully managed; no user-visible disk |
| Partitioning | Explicit partitions, hash-based routing | Shards with configurable shard count | Automatic; no user-visible partitioning |
| Consumer model | Pull-based, offset committed by consumer | Pull-based, sequence number tracked per shard | Push-based (Pub/Sub pushes to subscriber endpoint) |
| Retention | Configurable; unlimited with tiered storage | Up to 7 days (standard) / 365 days (extended) | Up to 7 days; no tiered storage |
| Ordering | Within partition, strict | Within shard, strict | Best-effort; no guarantee across messages |
ℹ️

The right choice follows from requirements. Kinesis wins if you want zero operational overhead and your team is already on AWS. Pub/Sub wins for push-based delivery to HTTP endpoints. Kafka wins when you need replay, exactly-once semantics, or tight control over partition placement and retention policy.

05

Partitioning strategy & log design

The central algorithmic question in this design is: given a message, which partition does it go to, and how is that partition physically structured on disk? These two decisions determine throughput, ordering guarantees, and disk efficiency.

[Figure 2 diagram. ① Hash-based partitioning: partition = murmur2(key) % numPartitions; both messages with key=42 land on the same partition, msg with key=99 lands elsewhere (ordered per key; even load). ② Round-robin (no key): partition = counter++ % numPartitions; messages spread across partitions (max throughput; no ordering).]
Figure 2: Hash-based routing (left) guarantees ordering per key. Round-robin routing (right) maximises throughput at the cost of any per-key ordering. Our choice: hash-based with a hot-key salt escape hatch.

Our choice for this design is hash-based partitioning using murmur2 (the same hash Kafka uses). Producers specify an optional partition key; if omitted, the client falls back to round-robin for maximum throughput, useful for workloads where ordering genuinely does not matter.
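The routing rule can be sketched directly. The murmur2 implementation below follows the widely published Kafka Java client variant (seed 0x9747b28c); treat it as an assumed faithful port and check against the client source before relying on cross-language partition agreement.

```java
import java.nio.charset.StandardCharsets;

// Hash-based partition routing: same key -> same partition, always.
public class Partitioner {
    // murmur2 as used by the Kafka Java client (assumed faithful port).
    static int murmur2(byte[] data) {
        int length = data.length;
        int seed = 0x9747b28c;
        final int m = 0x5bd1e995;
        final int r = 24;
        int h = seed ^ length;
        int length4 = length / 4;
        for (int i = 0; i < length4; i++) {
            int i4 = i * 4;
            int k = (data[i4] & 0xff) + ((data[i4 + 1] & 0xff) << 8)
                  + ((data[i4 + 2] & 0xff) << 16) + ((data[i4 + 3] & 0xff) << 24);
            k *= m; k ^= k >>> r; k *= m;
            h *= m; h ^= k;
        }
        switch (length % 4) {        // mix in the trailing 1-3 bytes
            case 3: h ^= (data[(length & ~3) + 2] & 0xff) << 16; // fall through
            case 2: h ^= (data[(length & ~3) + 1] & 0xff) << 8;  // fall through
            case 1: h ^= data[length & ~3] & 0xff; h *= m;
        }
        h ^= h >>> 13; h *= m; h ^= h >>> 15;
        return h;
    }

    // Mask the sign bit before the modulo so the result is non-negative.
    static int partitionFor(String key, int numPartitions) {
        return (murmur2(key.getBytes(StandardCharsets.UTF_8)) & 0x7fffffff) % numPartitions;
    }
}
```

Note the masking step: a raw `% numPartitions` on a negative hash would yield a negative partition index in Java.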

Log segment structure

Each partition is a sequence of segment files on disk. A segment has a fixed maximum size (default 1 GB) and a maximum age (default 7 days). When a segment is rolled, the old file is closed and becomes immutable, enabling safe concurrent reads and background upload to tiered storage. The active segment is always the last file in the sequence.
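The roll decision itself is a simple predicate over the active segment's size and age, using the defaults above (1 GB, 7 days):

```java
// Decide whether the active segment should be rolled (closed and made
// immutable) before appending the next batch.
public class SegmentRoll {
    static final long MAX_SEGMENT_BYTES = 1L << 30;               // 1 GB default
    static final long MAX_SEGMENT_AGE_MS = 7L * 24 * 3600 * 1000; // 7 days default

    static boolean shouldRoll(long segmentBytes, long segmentAgeMs, int nextBatchBytes) {
        return segmentBytes + nextBatchBytes > MAX_SEGMENT_BYTES  // would overflow size cap
            || segmentAgeMs >= MAX_SEGMENT_AGE_MS;                // hit the age cap
    }
}
```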

Each segment has a companion index file mapping message offsets to byte positions, allowing O(log n) random seeks. The index is sparse, so not every offset is indexed: a seek involves a binary search in the index, then a linear scan of at most a few KB of the log file.
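That seek maps directly onto a floor lookup: find the greatest indexed offset at or below the target, then scan forward in the log file. A sketch using an in-memory TreeMap in place of the on-disk .index file:

```java
import java.util.Map;
import java.util.TreeMap;

// Sparse offset index: only every Nth offset is indexed. A seek does a
// floor lookup (O(log n)), then the caller scans forward in the .log file
// from the returned byte position until it reaches the target offset.
public class SparseIndex {
    private final TreeMap<Long, Long> offsetToPosition = new TreeMap<>();

    void add(long offset, long bytePosition) {
        offsetToPosition.put(offset, bytePosition);
    }

    // Byte position to start the linear scan from for the target offset.
    long seek(long targetOffset) {
        Map.Entry<Long, Long> floor = offsetToPosition.floorEntry(targetOffset);
        return floor == null ? 0L : floor.getValue();
    }
}
```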

Producer batching: the key to high throughput · Performance

Disk writes are expensive when they're small and frequent. Producers buffer messages locally for up to linger.ms (default 0, recommended 5–100 ms) and then flush as a batch. A single disk write containing 1,000 messages costs roughly the same I/O as a single-message write, so batching is a nearly free throughput multiplier. Batch sizes of 16 KB to 1 MB are typical in production.

// Producer config: batching + compression
props.put("linger.ms", "10");           // wait 10ms to fill buffer
props.put("batch.size", "65536");     // 64 KB batch
props.put("compression.type", "lz4"); // 3-4x compression on text
props.put("acks", "1");               // leader ack only (at-least-once)
Latency tradeoff: Setting linger.ms > 0 adds up to that value of producer-side latency. For real-time systems requiring sub-5 ms write latency, keep linger.ms=0 and accept the lower throughput. For throughput-focused pipelines, 10–20 ms is usually acceptable and yields a 5–10× throughput improvement.
Message serialization and schema evolution · L6 probe

The value field in every message is raw bytes. The broker is format-agnostic. The format contract lives entirely outside the broker. Three dominant choices exist: JSON (human-readable, 2–5× larger, no schema enforcement), Protocol Buffers (compact binary, strongly typed, Google-originated), and Avro (compact binary, schema-embedded, native to the Confluent ecosystem). In practice, Avro + Confluent Schema Registry is the most common production pattern for shared message queues with multiple producer/consumer teams.

A schema registry is a central service that stores versioned schemas and assigns each schema a numeric ID. Producers embed only the schema ID (4 bytes) at the start of the message value, not the full schema. Consumers look up the schema by ID on first encounter, then cache it locally. This keeps messages compact while ensuring every consumer can deserialize any message it receives.
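The wire framing this describes, a marker byte plus a 4-byte schema ID ahead of the payload, can be sketched with a ByteBuffer. The magic byte value 0x0 follows the common Confluent convention; treat it as an assumption here.

```java
import java.nio.ByteBuffer;

// Frame a payload with a 1-byte magic marker and a 4-byte schema ID,
// as a schema-registry-aware serializer would.
public class SchemaFraming {
    static final byte MAGIC = 0x0; // assumed convention

    static byte[] encode(int schemaId, byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(5 + payload.length);
        buf.put(MAGIC).putInt(schemaId).put(payload);
        return buf.array();
    }

    // Consumer side: read the ID, look up the schema in the registry once,
    // then cache it locally for all subsequent messages.
    static int schemaIdOf(byte[] framed) {
        ByteBuffer buf = ByteBuffer.wrap(framed);
        if (buf.get() != MAGIC) throw new IllegalArgumentException("unknown framing");
        return buf.getInt();
    }
}
```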

Schema evolution is controlled by compatibility modes: BACKWARD (new schema can read old messages, safe default), FORWARD (old consumers can read new messages), FULL (both directions). Adding an optional field with a default is BACKWARD compatible. Removing a required field or changing a field type is not.

L6 probe: "How do you evolve a message schema without breaking running consumers?" Use BACKWARD compatibility, add fields with defaults, never rename fields. Deploy consumers first (they can read both old and new), then deploy producers. If a breaking change is necessary, create a new topic version (e.g. orders.v2) and run dual-write for a migration window.
Alternatives: Protobuf (proto3) · JSON Schema · Apache Thrift
05b

API design

The public API has two core operations: publishing a message and fetching messages. Everything else (offset management, consumer group coordination, partition reassignment) is handled by internal protocols or admin APIs that are unlikely to be the focus of an interview.

POST /topics/{topic}/messages: publish

// Request
POST /topics/orders/messages
Content-Type: application/json

{
  "messages": [
    {
      "key": "user-42",          // optional; determines partition
      "value": "eyJvcmRlcl...",  // base64-encoded payload
      "headers": {              // optional metadata
        "content-type": "application/json",
        "trace-id": "abc-123"
      }
    }
  ],
  "acks": "1"               // "0", "1", or "all"
}

// Response: 200 OK (acks=1 or all) or 202 Accepted (acks=0)
{
  "offsets": [
    { "partition": 3, "offset": 104920 }
  ]
}

// Validation errors
400 Bad Request: message exceeds max.message.bytes (default 1 MB)
429 Too Many Requests: producer quota exceeded (bytes/sec per client)
503 Service Unavailable: not enough in-sync replicas (acks=all, ISR < min.insync.replicas)

GET /topics/{topic}/messages: consume (fetch)

// Request: consumer pulls next batch
GET /topics/orders/messages?partition=3&offset=104921&max_bytes=1048576&wait_ms=500
Authorization: Bearer <consumer-group-token>

// Response: 200 OK
{
  "partition": 3,
  "messages": [
    {
      "offset": 104921,
      "timestamp": 1712700000000,
      "key": "user-42",
      "value": "eyJvcmRlcl...",
      "headers": { ... }
    }
  ],
  "high_watermark": 105000  // latest committed offset in partition
}

// Notable responses
204 No Content: long-poll timeout, no new messages (wait_ms elapsed)
416 Range Not Satisfiable: offset out of retention range (too old)

Optional endpoints by level

| Endpoint | Description | Level |
|---|---|---|
| POST /consumers/{group}/offsets | Commit consumer group offset | L3/L4 |
| GET /consumers/{group}/offsets | Fetch current offset per partition | L3/L4 |
| POST /consumers/{group}/seek | Reset offset to timestamp or beginning/end | L5 |
| GET /topics/{topic}/metadata | List partitions, leaders, ISR state | L5 |
| POST /transactions/init | Begin exactly-once transaction | L7/L8 |
06

Core flow: produce and consume

The durability guarantee in §2 (no loss once acked) drives every decision in the produce flow. The question is: exactly when does the broker send the ack, and how many replicas must see the message before that happens? The answer creates the central tension between latency and durability.

[Figure 3 diagram. Produce flow (acks=1): ① producer sends to the leader broker; ② leader appends to the partition log (page cache); ③ leader acks; ④ followers replicate asynchronously to their replica logs. Consume flow (pull model): ① consumer fetches at an offset; the leader checks the HWM and either serves from log/cache via ② sendfile(), or holds the connection (long-poll) until the HWM advances; ③ the consumer commits its offset to the __consumer_offsets topic.]
Figure 3: Produce flow (top) appends to the leader log and acks as soon as the leader has the message (acks=1), with replication to followers happening asynchronously. Consume flow (bottom) pulls from the leader using a stored offset; if the offset equals the HWM, the connection is held open (long-poll) until new messages arrive; after processing, the consumer commits its new offset to the offset topic.
⚠️

The gap between processing and committing the offset is where at-least-once delivery lives. If a consumer processes a message and then crashes before committing the offset, on restart it will re-read and re-process the same message. Consumers must be designed for idempotency: processing the same message twice must produce the same result as processing it once.
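A common way to achieve that idempotency is to key every side effect by (partition, offset) and skip work already recorded. A minimal sketch, where an in-memory set stands in for a database unique-key constraint:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// At-least-once consumer made idempotent: the effect of each message is
// recorded under its (partition, offset) key, so a message redelivered
// after a crash-before-commit is detected and skipped.
public class IdempotentConsumer {
    private final Set<String> applied = new HashSet<>(); // stand-in for a DB unique key
    private final Map<String, Long> balances = new HashMap<>();

    // Returns true if the message was applied, false if it was a duplicate.
    boolean process(int partition, long offset, String account, long delta) {
        String effectKey = partition + ":" + offset;
        if (!applied.add(effectKey)) return false; // already processed: skip
        balances.merge(account, delta, Long::sum); // the actual side effect
        return true;
    }

    long balance(String account) { return balances.getOrDefault(account, 0L); }
}
```

In production the dedup record and the side effect must land in the same transaction, otherwise the same gap reappears one level down.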

acks=1 vs acks=all: the key tradeoff

This decision maps directly back to the write latency NFR from §2. With acks=1, the leader receives the message into the OS page cache and acks immediately. A leader crash after ack but before followers replicate loses that message. With acks=all, the leader waits until all in-sync replicas (ISR) have acknowledged before returning, making loss structurally impossible during the ack window, at the cost of latency proportional to the replication round-trip (same AZ: ~10–30 ms; cross-AZ: ~50–100 ms).

ℹ️

Consumer reads from the nearest replica (KIP-392, Kafka 2.4+): by default, consumers fetch from the partition leader. Since Kafka 2.4, consumers can be configured to fetch from the nearest follower replica in the same AZ via client.rack + replica.selector.class=RackAwareReplicaSelector. This eliminates cross-AZ read traffic and reduces consumer latency for geographically distributed clusters, an L6+ optimization worth mentioning when cross-AZ egress cost is a concern.

07

Data model

There are three categories of entities in this system: messages themselves (the payload), the log structure (how messages are stored and indexed), and the metadata about topics and consumer state. Each is used very differently, and those differences shape where and how each entity is stored.

Access patterns

OperationFrequencyQuery shape
Append message to partitionVery high (sustained M/s)Sequential write to end of file
Fetch messages at offsetVery high (per consumer poll cycle)Random seek by offset, then sequential scan
Read partition metadata (leader, ISR)High (per produce/consume)Key lookup by topic+partition
Commit consumer offsetMedium (every N messages or T seconds)Write by group+topic+partition
Fetch consumer offset on startupLow (per consumer startup / rebalance)Key lookup by group+topic+partition

Two things stand out from this table. First, the message append and fetch patterns are purely sequential I/O: this rules out any key-value store or database as the primary storage layer and points directly to a file-based log. Second, metadata reads happen on every produce and consume request, so they must be served from memory, not from disk.

Message schema (binary, per partition)

Message batch (on-disk layout per segment) base_offset int64 (8B) batch_length int32 (4B) timestamp int64 (8B) producer_id int64 (exactly-once) sequence_no int32 (dedup) key bytes (var) value bytes (var) headers key-val pairs Sparse index (.index file) relative_offset (int32) Offset delta from base_offset position (int32) Byte offset in .log file maps to
Figure 4: On-disk message batch layout. producer_id and sequence_no enable exactly-once deduplication. The sparse index maps relative offsets to byte positions in the log file, enabling O(log n) seeks.

Metadata schema (in-memory, replicated via KRaft)

Topic and partition metadata · In-memory
// In-memory partition metadata (per broker)
{
  "topic": "orders",
  "partition": 3,
  "leader_id": 5,           // broker ID
  "isr": [5, 2, 7],         // in-sync replica broker IDs
  "replicas": [5, 2, 7],    // all assigned replicas
  "high_watermark": 104999, // last offset replicated to all ISR
  "log_start_offset": 0,    // first available offset (after retention)
  "epoch": 12               // leader epoch for fencing
}

The leader epoch is critical for preventing split-brain: a follower that receives a write from a broker claiming to be the leader must reject it if the epoch is stale. This is how Kafka prevents duplicate appends when a deposed leader comes back online and tries to continue writing.
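The fencing rule itself is a one-line comparison on the follower side; everything else is bookkeeping. A sketch:

```java
// Leader-epoch fencing on the follower side: replication requests from a
// deposed leader carry a stale epoch and are rejected.
public class EpochFence {
    private int currentEpoch;

    EpochFence(int epoch) { this.currentEpoch = epoch; }

    // Called for each replication append; returns whether it is accepted.
    boolean accept(int claimedEpoch) {
        if (claimedEpoch < currentEpoch) return false; // stale leader: fenced off
        currentEpoch = claimedEpoch;                   // adopt the newer epoch
        return true;
    }
}
```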

08

Caching strategy

A Kafka broker does not maintain its own message cache. Instead, it relies entirely on the OS page cache, the kernel's LRU-managed RAM buffer, to serve recent messages. When a consumer fetches messages that are still in the page cache, the broker uses the OS sendfile() syscall to transfer bytes directly from the page cache to the network socket, bypassing the broker process entirely. This is called zero-copy and is why a single broker thread can sustain hundreds of MB/s of consumer throughput at sub-5 ms latency.

A message queue that reads from disk on every consumer fetch would be unusable at scale. The sub-5 ms read latency NFR from §2 is only achievable because recent messages almost never touch disk: the OS page cache absorbs them. Understanding where each cache layer sits in the request path is what separates a good answer from a great one.

[Figure diagram: consumer fetch(offset) → broker process → Layer 2 (process heap: index files, metadata) → Layer 1 (OS page cache: recent log segments in RAM, ~1–3 ms read latency) → on MISS, NVMe disk (log segment files, ~5–10 ms cold read); warm reads reach the Layer 3 client fetch buffer via sendfile() zero copy. Synchronous fetch path vs. async/fallback path.]
Figure 5: Three-layer cache hierarchy. Layer 1 (OS page cache) serves the vast majority of warm reads at ~1–3 ms. Layer 2 (broker heap) holds index files and metadata. Layer 3 (consumer fetch buffer) pre-fetches ahead of the current offset. A disk read only occurs on a page cache MISS, typically for cold segments or consumers with significant lag.

Cache hierarchy

| Layer | Location | What it caches | Invalidation | Latency when hit |
|---|---|---|---|---|
| Layer 1: OS page cache | Kernel RAM on each broker host | Recent log segment bytes (LRU eviction by kernel) | Automatic: kernel evicts cold pages when RAM pressure rises; a segment roll does not invalidate, rolled segments stay warm until evicted | ~1–3 ms |
| Layer 2: Broker process heap | JVM heap in the broker process | Partition metadata, sparse index files, ISR state, producer ID dedup state | On leader election change or segment roll; garbage collected automatically | < 1 ms (in-process) |
| Layer 3: Consumer fetch buffer | Consumer client memory | Pre-fetched message batches ahead of the current processing offset | On consumer group rebalance or explicit seek; sized via fetch.min.bytes and max.partition.fetch.bytes | ~0 ms (already in RAM) |

Why the OS page cache is the right choice here

Kafka deliberately maintains no in-process message cache, relying entirely on the OS page cache. The reasoning: the JVM heap is bounded and subject to garbage-collection pauses, while the page cache spans all otherwise-free RAM and is managed by the kernel's LRU algorithm, which is well tuned for sequential access patterns. A broker restart that loses its JVM heap state recovers instantly, because the OS page cache persists across process restarts (until the host reboots).

The consequence is that broker sizing is dominated by RAM, not CPU. A broker with 256 GB RAM and a 48 TB NVMe array will serve warm reads from the page cache at ~2 ms. A broker with 32 GB RAM on the same hardware will see frequent page cache evictions under load, forcing disk reads at ~8–10 ms, a 4–5× latency regression. This is the first thing to check when diagnosing slow consumer reads in production.

💡

Zero-copy reads via sendfile(): When a consumer requests messages that are in the page cache, the broker uses the OS sendfile() syscall to transfer bytes directly from the page cache to the network socket, without copying them through userspace. This eliminates the broker process entirely from the data path for warm reads, and is why a single broker thread can serve hundreds of MB/s of consumer traffic without saturating its CPU.
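In Java, the sendfile() path is exposed as FileChannel.transferTo, which hands the copy to the kernel when the target is a socket channel. A runnable sketch, with an in-memory channel standing in for the real socket so it can run anywhere:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// transferTo() is the JVM surface of sendfile(): with a SocketChannel target
// on Linux, bytes move page cache -> NIC without entering userspace.
public class ZeroCopyServe {
    // Serve `count` bytes of a segment file starting at byte `position`.
    static String serve(Path segment, long position, long count) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stands in for the socket
        try (FileChannel log = FileChannel.open(segment, StandardOpenOption.READ);
             WritableByteChannel out = Channels.newChannel(sink)) {
            log.transferTo(position, count, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toString();
    }

    static Path tempSegment(String content) {
        try {
            Path seg = Files.createTempFile("segment-", ".log");
            Files.write(seg, content.getBytes());
            return seg;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The zero-copy win only materializes with a real socket target; copying into a byte array, as here, still crosses into userspace.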

09

Replication and durability

Kafka replicates each partition to a configurable number of brokers (replication factor). The broker currently acting as the partition leader accepts all writes; follower brokers continuously fetch and replay the leader's log. The High-Watermark (HWM) is the highest offset that has been confirmed by all in-sync replicas (ISR). Consumers can only read up to the HWM. Messages above it exist on the leader but have not been replicated to all ISR members yet. This guarantee means a consumer never reads a message that could disappear after a leader failover.

Durability in this system is a property of the replication flow, not of disk writes alone. A single disk write survives a process crash but not a disk failure. Replication across brokers (ideally across availability zones) is what makes data durable against hardware failures.
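In code terms the HWM rule is a minimum: it advances to the lowest log-end-offset across the ISR. A sketch, under the simplification that ISR membership is given:

```java
import java.util.Map;

// HWM = min(LEO over all in-sync replicas). Consumers read up to it;
// offsets above it exist on the leader but are not yet safe to expose,
// because they could be truncated after a leader failover.
public class HighWatermark {
    static long compute(Map<Integer, Long> isrLogEndOffsets) { // brokerId -> LEO
        return isrLogEndOffsets.values().stream()
                .mapToLong(Long::longValue).min().orElse(0L);
    }
}
```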

[Figure diagram: brokers in AZ-1/2/3. Broker 1 leads partition 0 with HWM = 99 and LEO = 100 (offsets 0–100); follower brokers 2 and 3 hold replicas at LEO = 99, one behind the leader; replication flows to AZ-3.]

High-Watermark (HWM): the highest offset replicated to ALL in-sync replicas. Consumers may only read up to the HWM, not beyond.

In-Sync Replica (ISR) set: a follower is removed from the ISR if it falls more than replica.lag.time.max.ms behind the leader (default 30 s). When the ISR shrinks below min.insync.replicas, acks=all writes are rejected (503) until the ISR recovers.
Figure 6: Three brokers across three AZs. Replication factor 3 tolerates simultaneous failure of any two brokers. HWM is the safe read boundary: messages above HWM exist on the leader but have not been confirmed by all ISR members yet.

Leader election and log truncation

When a new leader is elected, it issues a LeaderEpochRequest to all follower replicas. Any follower whose log-end-offset (LEO) exceeds the new leader's committed high-watermark for that epoch must truncate the excess entries. This log truncation step is what enforces the HWM contract: no consumer ever reads a message that was written above the HWM, because followers discard those writes when they learn of the new leader's epoch. Without this truncation, a deposed leader that comes back online and a newly-elected leader could have diverged logs, potentially creating duplicate or inconsistent consumer reads.

The leader epoch (the epoch field in the metadata schema in §7) is a monotonically increasing counter that prevents split-brain: followers reject writes from any broker whose epoch is lower than the controller's current assignment. An old leader that recovers from a network partition detects the stale epoch and stops accepting writes before rejoining as a follower.
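Epoch fencing can be sketched as a tiny state machine — an illustrative model of the rule, not Kafka's code:

```python
# Sketch of leader-epoch fencing: a replica rejects any write stamped
# with an epoch older than the newest epoch it has observed.
# Illustrative model only.

class Replica:
    def __init__(self):
        self.current_epoch = 0
        self.log: list[str] = []

    def observe_epoch(self, epoch: int) -> None:
        # Learned of a new leader election (e.g. from the controller).
        self.current_epoch = max(self.current_epoch, epoch)

    def append(self, epoch: int, record: str) -> bool:
        if epoch < self.current_epoch:
            return False  # fence the deposed leader's write
        self.current_epoch = epoch
        self.log.append(record)
        return True

r = Replica()
assert r.append(epoch=1, record="a")      # original leader, epoch 1
r.observe_epoch(2)                        # new leader elected at epoch 2
assert not r.append(epoch=1, record="b")  # stale leader is fenced
assert r.append(epoch=2, record="c")      # new leader's write accepted
```

Because the epoch only ever increases, a recovered old leader can never sneak a write past a replica that has seen the newer epoch.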

🎯

Classic L5 probe: “What happens to messages between the HWM and the leader’s LEO when a new leader is elected?” Answer: they are truncated when the replicas reconcile with the new leader, and the new leader’s LEO becomes the cluster’s HWM baseline for that epoch. A producer using acks=all has not yet received acks for those messages, so its retry (safe with idempotence enabled) re-delivers them. A producer using acks=1 may already have received acks for them — those messages are silently lost and will never be retried, which is exactly why acks=all is the only setting that upholds a no-loss guarantee.

Exactly-once delivery: how it actually works

Idempotent producer: deduplicating retried messages Producer side

When a producer retries after a timeout (because it never received an ack), it may deliver the same message twice: the original write that succeeded silently, plus the retry. The broker deduplicates by tracking the (producer_id, partition, sequence_number) triple and silently discarding any sequence number it has already persisted. This converts at-least-once into exactly-once on the delivery path at negligible cost, because the sequence number already travels in the batch header.

The limit of 5 in-flight requests per connection exists because the broker’s deduplication window only covers a bounded sequence range (roughly the last five batches per producer and partition). With more in-flight batches than that, a retried batch carrying a higher sequence number could arrive out of order in a way the broker cannot detect, allowing a duplicate to slip through as a genuine new message.
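The dedup rule can be sketched as follows — an illustrative model of the broker-side check, not Kafka's implementation:

```python
# Sketch: the broker accepts a batch only if its sequence number is
# exactly one past the last sequence it persisted for that
# (producer_id, partition) pair. Illustrative model only.

class PartitionLog:
    def __init__(self):
        self.last_seq: dict[int, int] = {}  # producer_id -> last sequence
        self.records: list[str] = []

    def append(self, producer_id: int, seq: int, record: str) -> str:
        last = self.last_seq.get(producer_id, -1)
        if seq <= last:
            return "DUPLICATE"        # a retry of an already-persisted batch
        if seq > last + 1:
            return "OUT_OF_SEQUENCE"  # gap detected: producer must resync
        self.last_seq[producer_id] = seq
        self.records.append(record)
        return "OK"

log = PartitionLog()
assert log.append(producer_id=7, seq=0, record="m0") == "OK"
assert log.append(producer_id=7, seq=1, record="m1") == "OK"
assert log.append(producer_id=7, seq=1, record="m1") == "DUPLICATE"  # retry
assert log.records == ["m0", "m1"]  # the retry never reached the log
```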

Required config: enable.idempotence=true, acks=all, max.in.flight.requests.per.connection ≤ 5
Transactional API: atomic produce + offset commit Consumer side

End-to-end exactly-once (read → process → write → commit) requires that the final offset commit and any downstream produce happen atomically. Kafka's transaction API wraps these operations in a two-phase commit: the producer registers with a transaction coordinator, writes its messages and offset commits as part of the open transaction, and on commit the coordinator writes commit markers to every partition the transaction touched. Consumers configured with isolation.level=read_committed will not see any of the writes until the commit marker arrives.

Tradeoff: Transaction overhead adds ~10–20 ms per transaction plus one extra log write. For pipelines processing 10,000+ messages per second, batch your transactions (commit every N messages, not per message) or the overhead dominates.
Cost: +10–20 ms latency; transaction coordinator overhead
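The batching advice can be sketched with a stub producer — the method names mirror Kafka's transactional API, but this is a self-contained model, not a real client:

```python
# Sketch: amortize the per-transaction commit cost by committing every
# N messages. KafkaLikeProducer is a stub; in a real pipeline it would
# be a transactional Kafka producer. Illustrative only.

BATCH_SIZE = 3  # tiny for the demo; use hundreds+ in practice

class KafkaLikeProducer:
    def __init__(self):
        self.commits = 0
        self.pending: list = []
        self.committed: list = []

    def begin_transaction(self):
        self.pending = []

    def send(self, topic: str, value: str):
        self.pending.append((topic, value))

    def send_offsets_to_transaction(self, offsets: dict):
        self.pending.append(("__offsets__", offsets))

    def commit_transaction(self):
        self.committed.extend(self.pending)
        self.commits += 1

def run(messages: list[str], producer: KafkaLikeProducer) -> None:
    producer.begin_transaction()
    n = 0
    for i, msg in enumerate(messages):
        producer.send("output-topic", msg.upper())  # the "process" step
        n += 1
        if n == BATCH_SIZE:
            # Offsets commit inside the same transaction, so output
            # writes and input progress become atomic.
            producer.send_offsets_to_transaction({"input-0": i + 1})
            producer.commit_transaction()
            producer.begin_transaction()
            n = 0
    if n:  # flush the final partial batch
        producer.send_offsets_to_transaction({"input-0": len(messages)})
        producer.commit_transaction()

p = KafkaLikeProducer()
run(["a", "b", "c", "d", "e"], p)
print(p.commits)  # 2 commits for 5 messages instead of 5
```

The same loop with BATCH_SIZE = 1 would pay the coordinator round-trip on every single message, which is where the overhead starts to dominate throughput.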
10

Deep-dive scalability

Scaling a message queue is fundamentally about scaling partitions and scaling disk. Compute is rarely the bottleneck: network bandwidth and disk I/O are. The diagram below shows a production-grade multi-region setup.

[Figure 6 diagram: US-East (primary) hosts the producers, a 3-node KRaft controller quorum, and a broker pool (each broker: 12 × 4 TB NVMe = 48 TB raw, ~200 MB/s write bandwidth) with cold segments tiered to S3; consumer groups A, B, C read from it. MirrorMaker 2 asynchronously replicates topics (with configurable lag) to EU-West, whose broker pool follows the US topics for low-latency EU reads and DR failover, backed by its own S3 tier.]
Figure 6: Production multi-region deployment. US-East is the primary cluster; a cross-cluster replication tool (MirrorMaker 2) continuously mirrors topic data to EU-West for both low-latency EU reads and disaster recovery failover.
How many partitions should you choose? L5 deep-dive

Partitions are the unit of parallelism for both producers and consumers. A topic with P partitions can be consumed by at most P consumers in a group simultaneously. A useful starting heuristic: partitions = max(target_consumers, ceil(peak_throughput_MB_s / 100)), where 100 MB/s is a conservative single-partition write ceiling on NVMe. Add 20–50% headroom for growth.
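The heuristic can be packaged as a small calculator; the 100 MB/s per-partition ceiling and the 30% headroom used here are assumptions to tune for your hardware:

```python
import math

# Sketch of the partition-count heuristic above. The per-partition
# write ceiling (100 MB/s) and headroom (30%) are assumed defaults.

def partition_count(peak_mb_s: float, target_consumers: int,
                    per_partition_mb_s: float = 100.0,
                    headroom: float = 0.3) -> int:
    base = max(target_consumers, math.ceil(peak_mb_s / per_partition_mb_s))
    return math.ceil(base * (1 + headroom))

# 1 GB/s peak needs ceil(1000/100) = 10 partitions for throughput,
# but 16 consumers dominate; with 30% headroom -> 21 partitions.
print(partition_count(peak_mb_s=1000, target_consumers=16))  # 21
```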

Caution: Partitions can be added to a topic but never removed without data migration. Key-based producers will start routing keys to the new partitions immediately, breaking ordering for any entity whose hash now maps to a different partition than it did historically. For topics with strict ordering requirements, over-provision from the start (e.g. 200–1000 partitions) and plan never to change the count.
Consumer rebalancing: the thundering herd L6 deep-dive

When a consumer joins or leaves a group, Kafka triggers a rebalance: all partitions are temporarily unassigned and reassigned across living consumers. During a rebalance, consumption for the entire group stops. With large consumer groups (100+ consumers), rebalances triggered by routine deployments can cause 30–60 second pauses.

Mitigation: use static group membership (group.instance.id). A consumer with a static ID rejoins after a restart without triggering a full rebalance, reclaiming its previously assigned partitions as long as it returns within session.timeout.ms. For Kubernetes pod restarts, this typically reduces rebalance events by 80%.

Also helps: incremental cooperative rebalancing (KIP-429); a longer session.timeout.ms
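A consumer configuration combining both mitigations might look like this — the keys follow the Java client's config names, while the instance ID, timeout value, and assignor choice are illustrative:

```python
# Sketch of a consumer config for static membership + cooperative
# rebalancing. Keys are Kafka Java-client config names; the concrete
# values here are illustrative choices, not recommendations.

consumer_config = {
    "group.id": "order-processors",
    # Stable per-pod identity (e.g. a StatefulSet ordinal): a restart
    # that returns within session.timeout.ms reclaims its previous
    # partitions without triggering a full rebalance.
    "group.instance.id": "order-processor-3",
    # Give restarted pods time to come back before the group rebalances.
    "session.timeout.ms": 120_000,
    # KIP-429 incremental cooperative rebalancing: only partitions that
    # actually move pause consumption, not the whole group.
    "partition.assignment.strategy":
        "org.apache.kafka.clients.consumer.CooperativeStickyAssignor",
}
```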
Tiered storage: petabyte-scale retention L7 deep-dive

Storing years of data on broker disks becomes prohibitively expensive past a few weeks. Tiered storage moves closed log segments to an object store (S3/GCS) automatically. Brokers only retain the last N days locally; the object store provides the rest. Consumers reading historical offsets are transparently redirected to fetch from the object store via a fetch proxy layer. Cost comparison: NVMe storage ~$0.20/GB-month vs. S3 ~$0.023/GB-month, roughly 8× savings for cold data.
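Working that cost arithmetic for 1 PB of cold data, at the per-GB-month prices quoted above (illustrative list prices, not a quote):

```python
# Sketch: monthly cost of 1 PB of cold data on NVMe broker disks vs.
# S3 tiered storage, at the prices cited in the text.

COLD_GB = 1_000_000                     # 1 PB
NVME_GB_MONTH, S3_GB_MONTH = 0.20, 0.023

nvme = COLD_GB * NVME_GB_MONTH          # $200,000 / month
s3 = COLD_GB * S3_GB_MONTH              # $23,000 / month
print(f"monthly savings: ${nvme - s3:,.0f}, ratio: {nvme / s3:.1f}x")
# monthly savings: $177,000, ratio: 8.7x
```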

Tradeoff: Cold reads from S3 have higher latency (50–200 ms vs. 1–5 ms from local disk). For consumers that regularly seek to old offsets (e.g. re-processing), this adds significant latency. Solutions: pre-warm a local cache for anticipated replay windows, or accept the latency as acceptable for batch reprocessing jobs.
Back-pressure and producer quotas L6 deep-dive

A shared Kafka cluster serving multiple teams needs hard per-producer quotas to prevent a single runaway producer from saturating disk or network. Quota enforcement works via a token-bucket model per client ID: the broker measures bytes per second produced by each client, and when a client exceeds its quota, the broker throttles it by delaying its produce response (effectively back-pressuring the producer without dropping messages).

# Set a 10 MB/s produce quota for client "order-service" via the Admin CLI
kafka-configs --bootstrap-server broker:9092 --alter \
  --add-config 'producer_byte_rate=10485760' \
  --entity-type clients --entity-name order-service
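The token-bucket mechanism itself can be sketched in a few lines — an illustrative model, not broker code:

```python
# Sketch of per-client token-bucket throttling: tokens refill at the
# quota rate; a request that overdraws the bucket is delayed (not
# rejected) until the deficit would be refilled. Illustrative only.

class TokenBucket:
    def __init__(self, rate_bytes_s: float, burst_bytes: float):
        self.rate, self.capacity = rate_bytes_s, burst_bytes
        self.tokens, self.last = burst_bytes, 0.0

    def throttle_ms(self, now_s: float, request_bytes: int) -> float:
        """Return how long to delay the response for this request."""
        self.tokens = min(self.capacity,
                          self.tokens + (now_s - self.last) * self.rate)
        self.last = now_s
        self.tokens -= request_bytes
        if self.tokens >= 0:
            return 0.0  # within quota: respond immediately
        # Over quota: delay until the deficit would be refilled.
        return -self.tokens / self.rate * 1000

# 10 MB/s quota with a 10 MB burst allowance.
bucket = TokenBucket(rate_bytes_s=10 * 1024 * 1024,
                     burst_bytes=10 * 1024 * 1024)
assert bucket.throttle_ms(0.0, 5 * 1024 * 1024) == 0.0  # within quota
delay = bucket.throttle_ms(0.1, 10 * 1024 * 1024)       # burst exceeded
assert delay > 0  # this produce response is delayed, not dropped
```

Delaying the response rather than returning an error is the key design choice: the producer's in-flight window fills up and it slows down naturally, with no message loss.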
11

Failure modes & edge cases

| Scenario | Problem | Solution | Level |
| --- | --- | --- | --- |
| Broker crash (any) | Partitions whose leader was on the crashed broker are unavailable | Controller detects the failure via heartbeat timeout (~30 s), triggers leader election, and promotes a follower from the ISR. Producers see LEADER_NOT_AVAILABLE and retry | L3/L4 |
| Producer retry → duplicate message | A timeout causes the producer to resend; the broker writes the same message twice | Enable the idempotent producer (enable.idempotence=true). The broker deduplicates by (producer_id, partition, sequence_no) | L3/L4 |
| Consumer crash mid-processing | Consumer processed a message but crashed before committing its offset; it reprocesses on restart | Make consumers idempotent (e.g. upsert to a DB keyed on message ID), or use exactly-once transactions for atomic process+commit | L5 |
| ISR drops to 1 (under-replication) | With acks=all and min.insync.replicas=2, writes fail (503) | Alert immediately; reduce min.insync.replicas as an emergency measure (accepting durability risk), or let the follower catch up before continuing | L5 |
| Hot partition (imbalanced key distribution) | One partition gets 10× the traffic of the others; its broker saturates | For genuinely hot keys, append a random salt suffix to the partition key and have consumers merge and re-order. Alternatively, maintain a hot-key exception list and route those keys to dedicated topics with more partitions | L5 |
| Split-brain after network partition | An old leader (believing it is still leader) accepts writes after a new leader is elected | The leader epoch prevents this: the new leader gets a higher epoch, followers reject writes from the old leader once they know the current epoch, and the old leader detects the mismatch and steps down | L7/L8 |
| Consumer group lag explosion | A consumer falls behind; log retention expires before it catches up, and its offsets fall below log_start_offset | Monitor consumer lag in real time (JMX metric: records-lag) and alert before lag reaches the retention boundary. Once offsets expire, reset to earliest (some messages re-read) or skip to latest (some messages lost) | L5 |
| Cross-region message ordering | MirrorMaker 2 replicates with async lag; an EU consumer reads messages in a different order than a US consumer | Accept that cross-region ordering is best-effort, not guaranteed. For strict ordering, route all reads to the primary region at the cost of cross-region read latency | L7/L8 |
| Poison-pill / dead-letter | A malformed message that crashes the consumer on every attempt permanently blocks the partition: no subsequent messages in that partition are processed | Circuit-break on the consumer side: skip the message after N retries and publish it to a dead-letter topic (DLQ) for manual inspection. Schema validation at publish time via a schema registry is the proactive defense: reject malformed messages at the source. Never advance past a deserialization failure without a DLQ escape hatch | L5 |
12

How to answer by level

L3/L4: Junior / SDE I-II Working system
What good looks like
  • Clear producer → broker → consumer flow
  • Topics, partitions, offsets: defined clearly
  • Consumer pulls (not push), advances offset after processing
  • Messages are not deleted after consumption; multiple groups read the same topic independently
  • Replication exists; explains why (durability, not just availability)
What separates L5 from L3/L4
  • Unprompted discussion of acks=1 vs acks=all and the latency/durability tradeoff
  • Proactively asks: what are the message ordering requirements?
  • Explains the high-watermark concept
  • Knows that consumer offset commit creates the at-least-once window
L5: Senior SDE Tradeoffs
What good looks like
  • Explains ISR, HWM, and the ISR-shrink → 503 scenario
  • Traces exactly-once delivery end-to-end (idempotent producer + transactional consumer)
  • Explains consumer group rebalancing and session.timeout.ms tradeoff
  • Can calculate partition count from throughput requirements
  • Knows hot partition is the Achilles heel of key-based routing
What separates L6 from L5
  • Static group membership and cooperative rebalancing (KIP-429)
  • Discusses tiered storage economics and cold read latency tradeoff
  • Designs producer quota enforcement and back-pressure
  • Knows the difference between KRaft and ZooKeeper operationally
L6: Staff SDE End-to-end ownership
What good looks like
  • Designs for operational concerns: multi-team quotas, partition rebalancing for cluster expansion, rack-aware replica assignment
  • Articulates when to use Kafka vs. SQS vs. Pub/Sub based on concrete requirements
  • Builds tiered storage into the capacity estimate from the start
  • Designs monitoring: consumer lag SLO, ISR shrink alert, producer throttle metrics
What separates L7 from L6
  • Initiates cross-region MirrorMaker design unprompted
  • Discusses MirrorMaker failover: how do you promote EU to primary without data loss?
  • Proposes schema registry to enforce compatibility without coordination overhead
L7/L8: Principal / Distinguished Architecture ownership
What good looks like
  • Questions whether Kafka is the right tool: stream processing? (consider Flink/Spark on top); simple task queues? (SQS is cheaper); event sourcing? (partition count strategy changes fundamentally)
  • Multi-tenant platform design: per-team quota enforcement, partition isolation, cost attribution
  • Schema evolution strategy: how do you break compatibility without downtime?
  • Disaster recovery plan for full region failure: RPO/RTO targets, MirrorMaker lag monitoring, automated promotion runbook
What makes this memorable
  • Initiates build vs. buy: "If I'm at a company where Confluent Cloud or MSK fits, why would I operate this myself?"
  • Brings cost model: "At 1 PB/year retained, object storage saves $X vs. broker disks"
  • Mentions Kafka's known limitations (partition count ceiling per cluster before KRaft, JVM GC pauses, in-order exactly-once is expensive)

Classic probes table

| Question | L3/L4 | L5/L6 | L7/L8 |
| --- | --- | --- | --- |
| "How do consumers know where to start reading?" | Stored offset per partition; resets to earliest or latest on first start | Explains offset commit atomicity, manual vs. auto commit tradeoffs, seek-to-timestamp for replay | Schema-versioned replay strategy; offset reset during data migration; impact on downstream aggregations |
| "What happens if a broker crashes immediately after acknowledging a write?" | "Another broker takes over", vague on mechanism | Explains acks=1 vs. acks=all, ISR, HWM truncation on leader re-election, why LEO can be ahead of HWM | Leader epoch fencing, the log truncation protocol, the idempotent producer dedup window, epoch monotonicity as the safety invariant |
| "How would you add a new topic for a different team without affecting existing consumers?" | "Create a new topic", no cross-team concern | Quota per client ID, rack-aware partition assignment to avoid hotspots on shared brokers, partition count sizing | Tenant isolation model: dedicated vs. shared cluster economics, partition rent / cost attribution, compatibility testing in a shadow cluster before production |
| "You need five years of event history accessible within 100 ms. How?" | "Store everything on brokers", cost-blind | Tiered storage to S3 for cold segments; hot segments on NVMe; tradeoff: cold reads are 50–200 ms, not 1–5 ms | Hybrid: tiered storage for durability plus a separate columnar store (ClickHouse / BigQuery) for fast analytical queries over old data; replay via Spark/Flink from S3 for backfill |
How the pieces connect
  • 1 M msg/s throughput requirement (§2) → hash-based partitioning across 100+ partitions (§5) → consumer groups with one consumer per partition for maximum read parallelism (§6)
  • "No loss once acked" durability NFR (§2) → replication factor 3 across AZs, min.insync.replicas=2, acks=all for critical topics (§9) → ISR-shrink → 503 failure mode as the expected behavior when durability can't be guaranteed (§11)
  • 7-day hot retention with years of history (§2) → tiered storage: NVMe for the 7-day window, S3 for cold segments (§4) → ~8× cost reduction for cold data vs. keeping everything on broker disks (§9)
  • Strict within-partition ordering NFR (§2) → murmur2 partition-key hashing ensures the same entity always routes to the same partition (§5) → hot-key imbalance is the primary throughput failure mode when key cardinality is low (§10)
  • At-least-once delivery as the default (§2) → offset commit after processing creates the reprocessing window (§6) → idempotent producer + transactional API as the upgrade path to exactly-once, traced fully in §9
  • 99.99% availability SLA (§2) → multi-broker cluster with KRaft controller quorum enables automatic leader election (§4, §8) → cross-region MirrorMaker 2 covers region-level failures for DR (§9)
