Design a Distributed Message Queue
The interface is two operations: write a message, read it back. The hard part is doing that reliably at a million messages per second, without losing a single one.
~40 min read · 12 sections
What the interviewer is testing
A distributed message queue, such as Apache Kafka, Amazon Kinesis, or Google Pub/Sub, sits between services that produce events and services that consume them. The question looks simple: write here, read there. But the moment you add durability, ordering, and millions of events per second, the design space explodes.
Interviewers use this question to probe whether you understand distributed systems fundamentals, not just one vendor's API. The core tensions are: write throughput vs. durability guarantees, strict ordering vs. horizontal scalability, and push-based vs. pull-based consumption. These tensions play out at every level:
| Level | Primary focus | Hard question at this level |
|---|---|---|
| L3/L4 | Working producer/consumer model with a single broker | How do consumers know where to read next? |
| L5 | Replication, durability, consumer group coordination | What happens when a broker crashes mid-write? |
| L6 | Exactly-once delivery, partition rebalancing, back-pressure | How do you eliminate duplicates without sacrificing throughput? |
| L7/L8 | Cross-region replication, tiered storage, cost at scale | How do you serve petabyte-scale history without running out of disk? |
The single most differentiating question across all levels is offset management. How are offsets stored, who advances them, and what happens if a consumer crashes between processing a message and committing its offset? Nail this and the rest of the design falls into place.
Requirements clarification
The right design depends entirely on scope. A message queue for an internal microservices event bus has very different requirements from a shared platform ingesting click-stream events from 200 million mobile users. Throughput targets, retention windows, and multi-tenancy constraints all change dramatically based on that scope.
Functional requirements
| Requirement | Description |
|---|---|
| Publish messages | Producers write messages to named topics with an optional partition key |
| Subscribe and consume | Consumer groups pull messages from topics starting at any stored offset |
| Replay | Consumers can rewind to any offset within the retention window |
| Multi-tenant topics | Multiple independent consumer groups can read the same topic independently |
| Message retention | Messages are retained for a configurable period (hours to years) |
| Log compaction (opt-in) | Retain only the latest value per key for CDC and event-sourcing topics via cleanup.policy=compact. Note: the __consumer_offsets topic itself uses compaction to track the current offset without growing unboundedly |
Non-functional requirements
| NFR | Target | Notes |
|---|---|---|
| Write throughput | 1 M messages/s sustained | Across the whole cluster; single partition ~100 MB/s |
| Write latency (p99) | < 20 ms (at-least-once), < 100 ms (exactly-once) | Durability and latency are in direct tension here |
| Read latency (p99) | < 5 ms for warm consumers | Zero-copy sendfile to consumer |
| Availability | 99.99% (52 min/year downtime) | Requires multi-broker replication |
| Durability | No message loss once acked | Requires flushed replicas before ack |
| Ordering | Strict within partition; best-effort across partitions | Total ordering requires single partition, kills throughput |
| Delivery semantics | At-least-once default; exactly-once opt-in | Exactly-once requires idempotent producers + transactional consumers |
| Retention | 7 days hot; years via tiered storage | Tiered storage offloads cold segments to object store (S3) |
NFR reasoning
Write latency target: 20 ms (at-least-once) Durability tradeoff ›
20 ms (inclusive of a linger.ms=10 producer batching window; with linger.ms=0, typical p99 is 3–10 ms at the cost of lower throughput) gives us room for acks=1 as the default: the leader broker receives the message into the OS page cache and acknowledges immediately, without waiting for any follower to confirm or for a disk fsync. Kafka does not fsync on every write by default. Durability against a leader power failure comes from replication, not from disk writes alone. The risk is a narrow loss window: if the leader crashes after acknowledging but before any follower has replicated the message, that message is gone. For most workloads (analytics, click-stream, metrics) this window is acceptable because the data has low value relative to its volume.
acks=all with min.insync.replicas=2 requires the leader and at least one follower to confirm before acking. This eliminates the loss window, but adds replication round-trip latency: ~10–30 ms within a single AZ, or ~50–100 ms for cross-AZ replication (round-trip to a follower in a different availability zone). These are two independent settings: acks controls what the producer waits for; min.insync.replicas controls the minimum ISR size below which acks=all writes are rejected (503). Use both together for financial or audit-critical events.
Availability: 99.99% (52 min/year) Replication factor ›
99.99% requires that a single broker failure causes no visible downtime. This means partition leaders must be re-elected automatically and consumer clients must reconnect transparently within that 52-minute budget per year. In practice, leader election takes ~5–30 seconds with a metadata service (ZooKeeper/KRaft), so even weekly broker restarts fit the SLA.
Ordering: within-partition only Throughput enabler ›
Total global ordering across all partitions and all consumers would require a single write path, eliminating parallelism. The practical solution is key-based routing: all events belonging to the same entity (same user ID, same order ID) hash to the same partition, guaranteeing that the consumer for that partition sees those events in write order.
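A minimal router sketch makes that guarantee concrete: the same key always maps to the same partition, while key-less messages spread round-robin. This is illustrative only; the stand-in hash below replaces murmur2, and Kafka's real partitioner differs in detail.
import java.util.concurrent.atomic.AtomicInteger;
// Illustrative key-based partition router (not Kafka's exact partitioner)
class PartitionRouter {
    private final AtomicInteger roundRobin = new AtomicInteger();
    int choosePartition(byte[] key, int numPartitions) {
        if (key == null) {
            // No key: round-robin for throughput; no per-entity ordering guarantee
            return Math.floorMod(roundRobin.getAndIncrement(), numPartitions);
        }
        // Same key -> same partition -> the consumer sees that entity's events in write order
        return Math.floorMod(hash(key), numPartitions);
    }
    // Stand-in for murmur2; any stable hash preserves the routing property
    private int hash(byte[] data) {
        return java.util.Arrays.hashCode(data);
    }
}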
Capacity estimation
Message queues are fundamentally I/O-bound systems. The interesting estimation dimensions are write throughput and storage, not compute. Think in terms of bytes per second first, then convert to broker count and disk size.
Key insight: At replication factor 3, every byte you write consumes 3× the storage on brokers. With tiered storage (cold segments offloaded to an object store like S3), you keep only the last N days hot on brokers and let the object store handle the long tail, typically at 10–100× cost reduction for data older than 7 days.
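A quick back-of-envelope pass, assuming an average message size of ~1 KB and ~40 TB of usable NVMe per broker (both are assumptions for illustration, not requirements from §2):
// Ingest:        1,000,000 msg/s x 1 KB                 ~= 1 GB/s
// Broker writes: 1 GB/s x replication factor 3           = 3 GB/s cluster-wide
// Hot storage:   1 GB/s x 86,400 s x 7 days x 3         ~= 1.8 PB on broker disks
// Partitions:    1 GB/s / 100 MB/s per partition         = 10 minimum; provision 100+ for consumer parallelism
// Brokers:       1.8 PB / ~40 TB usable NVMe per broker ~= 45 brokers for the hot tier, before growth headroom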
High-level architecture
A distributed message queue is a durable, ordered log that decouples the services that produce events from the services that consume them. Producers write messages to named topics; consumers pull messages at their own pace using a stored offset. Unlike traditional queues, messages are not deleted after delivery; they are retained for a configurable window, allowing multiple independent consumer groups to replay the same data independently.
The core insight of a log-based message queue is that a durable, ordered, append-only log is both the storage layer and the transport layer. Every message is written once to disk and read many times by many consumers, with no user-space copies between the page cache and the network socket in the happy path (a technique called zero-copy, via the OS sendfile() syscall).
Component breakdown
Partition Router (load balancer layer): receives producer requests and routes each message to the correct partition leader. It holds a cached partition-to-broker mapping from the metadata service; on a cache miss or leader change, it refreshes. This layer can be embedded in the producer client library rather than a separate server hop.
Broker cluster: the core of the system. Each broker owns the leadership for a set of partitions and runs the append-to-log cycle. Brokers are symmetric: any broker can be a leader for some partitions and a follower for others. Leadership is assigned by the controller and balances automatically across the cluster.
Metadata service (ZooKeeper / KRaft): tracks which broker is the leader for each partition, which brokers are alive, and what the current ISR (in-sync replica) set is. With KRaft (Kafka's built-in Raft consensus mode), this component is no longer external to Kafka, reducing operational complexity at the cost of requiring a dedicated controller quorum within the cluster.
Append-only log: the actual message store. Each partition is a sequence of segment files on disk. Writes are always sequential appends; reads are random-access by offset but typically sequential as consumers advance through the log. The OS page cache acts as a read buffer, making recent messages essentially free to serve without touching disk.
Offset store (__consumer_offsets): an internal topic that tracks the last committed offset for each consumer group and partition. This is itself a distributed log, giving offset commits the same durability guarantees as regular messages. It also provides the coordination point for consumer rebalancing.
Tiered storage: closed log segments are uploaded to an object store (a durable blob service like S3 or GCS). Brokers only retain the last N days of hot segments locally; consumers requesting older offsets are transparently redirected to fetch from the object store. This is what makes multi-year retention economically viable.
Architectural rationale
Why append-only logs instead of a traditional message queue? Core model ›
Traditional message queues (like AMQP brokers, such as RabbitMQ and ActiveMQ) delete messages after delivery. This makes replay impossible without external storage. The log-based model keeps messages until the retention window expires, treating the queue as a persistent, ordered journal. Any consumer can read at any point in the log, enabling replay, backfill, and independent consumer groups with no coordination between them.
Why embed the router in the producer client rather than a separate layer? Topology ›
A separate routing proxy adds a network hop and a failure domain. By embedding the partition metadata cache in the producer client library, producers route directly to the partition leader, requiring only a single network round-trip per produce request. The client refreshes its metadata cache when it receives a NOT_LEADER_FOR_PARTITION error, which is the signal that leadership has changed.
Why KRaft instead of ZooKeeper for metadata? Controller ›
A distributed coordination service (ZooKeeper) was an external dependency requiring its own cluster, its own monitoring, and its own expertise. ZooKeeper-based clusters also hit a practical ceiling around 200,000 partitions before metadata propagation became slow. KRaft uses Raft consensus within the Kafka broker process, removing the external dependency and scaling to millions of partitions. In an interview, mention ZooKeeper as the historical baseline and KRaft as the current standard.
Real-world comparison
| Decision | This design (Kafka-like) | Amazon Kinesis | Google Pub/Sub |
|---|---|---|---|
| Storage model | Append-only log segments on broker disk + tiered to S3 | Shard-based log, managed by AWS (no local disk) | Fully managed; no user-visible disk |
| Partitioning | Explicit partitions, hash-based routing | Shards with configurable shard count | Automatic; no user-visible partitioning |
| Consumer model | Pull-based, offset committed by consumer | Pull-based, sequence number tracked per shard | Push-based (Pub/Sub pushes to subscriber endpoint) |
| Retention | Configurable; unlimited with tiered storage | Up to 7 days (standard) / 365 days (extended) | Up to 7 days; no tiered storage |
| Ordering | Within partition, strict | Within shard, strict | Best-effort; no guarantee across messages |
The right choice follows from requirements. Kinesis wins if you want zero operational overhead and your team is already on AWS. Pub/Sub wins for push-based delivery to HTTP endpoints. Kafka wins when you need replay, exactly-once semantics, or tight control over partition placement and retention policy.
Partitioning strategy & log design
The central algorithmic question in this design is: given a message, which partition does it go to, and how is that partition physically structured on disk? These two decisions determine throughput, ordering guarantees, and disk efficiency.
Our choice for this design is hash-based partitioning using murmur2 (the same hash Kafka uses). Producers specify an optional partition key; if omitted, the client falls back to round-robin for maximum throughput, useful for workloads where ordering genuinely does not matter.
Log segment structure
Each partition is a sequence of segment files on disk. A segment has a fixed maximum size (default 1 GB) and a maximum age (default 7 days). When a segment is rolled, the old file is closed and becomes immutable, enabling safe concurrent reads and background upload to tiered storage. The active segment is always the last file in the sequence.
Each segment has a companion index file mapping message offsets to byte positions, allowing O(log n) random seeks. The index is sparse, so not every offset is indexed: a seek involves a binary search in the index, then a linear scan of at most a few KB of the log file.
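The seek itself is simple: binary-search the sparse index for the largest indexed offset at or below the target, then scan forward in the segment file. A minimal sketch, with the index structure and names being illustrative rather than Kafka's actual index format:
import java.util.Map;
import java.util.NavigableMap;
// Sparse-index seek (sketch): O(log n) floor lookup, then a short linear scan from the returned position
class SegmentIndex {
    private final NavigableMap<Long, Long> offsetToBytePos;   // sparse: roughly one entry per few KB of log
    SegmentIndex(NavigableMap<Long, Long> offsetToBytePos) { this.offsetToBytePos = offsetToBytePos; }
    long scanStartFor(long targetOffset) {
        Map.Entry<Long, Long> floor = offsetToBytePos.floorEntry(targetOffset);  // binary search
        return (floor != null) ? floor.getValue() : 0L;  // caller scans forward from here to the exact offset
    }
}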
Producer batching: the key to high throughput Performance ›
Disk writes are expensive when they're small and frequent. Producers buffer messages locally for up to linger.ms (default 0, recommended 5–100 ms) and then flush as a batch. A single disk write containing 1,000 messages costs roughly the same I/O as a single-message write, so batching is a nearly free throughput multiplier. Batch sizes of 16 KB to 1 MB are typical in production.
// Producer config: batching + compression
props.put("linger.ms", "10"); // wait 10ms to fill buffer
props.put("batch.size", "65536"); // 64 KB batch
props.put("compression.type", "lz4"); // 3-4x compression on text
props.put("acks", "1"); // leader ack only (at-least-once)
linger.ms > 0 adds up to that value of producer-side latency. For real-time systems requiring sub-5ms write latency, keep linger.ms=0 and accept lower throughput. For throughput-focused pipelines, 10–20 ms is usually acceptable and yields 5–10× throughput improvement.
Message serialization and schema evolution L6 probe ›
The value field in every message is raw bytes; the broker is format-agnostic, so the format contract lives entirely outside the broker. Three dominant choices exist: JSON (human-readable, 2–5× larger, no schema enforcement), Protocol Buffers (compact binary, strongly typed, Google-originated), and Avro (compact binary, built around explicit schemas, the default in the Confluent ecosystem). In practice, Avro + Confluent Schema Registry is the most common production pattern for shared message queues with multiple producer/consumer teams.
A schema registry is a central service that stores versioned schemas and assigns each schema a numeric ID. Producers embed only the schema ID (4 bytes) at the start of the message value, not the full schema. Consumers look up the schema by ID on first encounter, then cache it locally. This keeps messages compact while ensuring every consumer can deserialize any message it receives.
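As a concrete example, the producer side looks roughly like this with the Confluent Avro serializer (the registry URL and host name are illustrative, and the snippet assumes the Confluent serializer library is on the classpath):
// Producer config: Avro values + schema registry
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
props.put("schema.registry.url", "http://schema-registry:8081");
// The serializer registers/looks up the schema once, then prepends its 4-byte schema ID to each value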
Schema evolution is controlled by compatibility modes: BACKWARD (new schema can read old messages, safe default), FORWARD (old consumers can read new messages), FULL (both directions). Adding an optional field with a default is BACKWARD compatible. Removing a required field or changing a field type is not.
For changes that break compatibility, create a new topic version (e.g., orders.v2) and run dual-write for a migration window.
API design
The public API has two core operations: publishing a message and fetching messages. Everything else (offset management, consumer group coordination, partition reassignment) is handled by internal protocols or admin APIs that are unlikely to be the focus of an interview.
POST /topics/{topic}/messages: publish
// Request
POST /topics/orders/messages
Content-Type: application/json
{
"messages": [
{
"key": "user-42", // optional; determines partition
"value": "eyJvcmRlcl...", // base64-encoded payload
"headers": { // optional metadata
"content-type": "application/json",
"trace-id": "abc-123"
}
}
],
"acks": "1" // "0", "1", or "all"
}
// Response: 200 OK (acks=1 or all) or 202 Accepted (acks=0)
{
"offsets": [
{ "partition": 3, "offset": 104920 }
]
}
// Validation errors
400 Bad Request: message exceeds max.message.bytes (default 1 MB)
429 Too Many Requests: producer quota exceeded (bytes/sec per client)
503 Service Unavailable: not enough in-sync replicas (acks=all, ISR < min.insync.replicas)
GET /topics/{topic}/messages: consume (fetch)
// Request: consumer pulls next batch
GET /topics/orders/messages?partition=3&offset=104921&max_bytes=1048576&wait_ms=500
Authorization: Bearer <consumer-group-token>
// Response: 200 OK
{
"partition": 3,
"messages": [
{
"offset": 104921,
"timestamp": 1712700000000,
"key": "user-42",
"value": "eyJvcmRlcl...",
"headers": { ... }
}
],
"high_watermark": 105000 // latest committed offset in partition
}
// Notable responses
204 No Content: long-poll timeout, no new messages (wait_ms elapsed)
416 Range Not Satisfiable: offset out of retention range (too old)
Optional endpoints by level
| Endpoint | Description | Level |
|---|---|---|
| POST /consumers/{group}/offsets | Commit consumer group offset | L3/L4 |
| GET /consumers/{group}/offsets | Fetch current offset per partition | L3/L4 |
| POST /consumers/{group}/seek | Reset offset to timestamp or beginning/end | L5 |
| GET /topics/{topic}/metadata | List partitions, leaders, ISR state | L5 |
| POST /transactions/init | Begin exactly-once transaction | L7/L8 |
Core flow: produce and consume
The durability guarantee in §2 (no loss once acked) drives every decision in the produce flow. The question is: exactly when does the broker send the ack, and how many replicas must see the message before that happens? The answer creates the central tension between latency and durability.
The gap between processing and committing the offset is where at-least-once delivery lives. If a consumer processes a message and then crashes before committing the offset, on restart it will re-read and re-process the same message. Consumers must be designed for idempotency: processing the same message twice must produce the same result as processing it once.
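A minimal consumer loop shows where that window sits: the offset commit happens only after processing, so a crash replays the uncommitted batch rather than losing it. This is a sketch; process() stands in for the application's own idempotent handler.
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
// At-least-once consumer loop: the offset advances only after processing succeeds
class OrderConsumerLoop {
    void run(Properties props) {
        props.put("enable.auto.commit", "false");                  // we control when the offset advances
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record);                               // hypothetical idempotent handler
                }
                consumer.commitSync();                             // crash before this line => batch is reprocessed
            }
        }
    }
    void process(ConsumerRecord<String, String> record) { /* e.g. upsert keyed on a message ID */ }
}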
acks=1 vs acks=all: the key tradeoff
This decision maps directly back to the write latency NFR from §2. With acks=1, the leader receives the message into the OS page cache and acks immediately. A leader crash after ack but before followers replicate loses that message. With acks=all, the leader waits until all in-sync replicas (ISR) have acknowledged before returning, making loss structurally impossible during the ack window, at the cost of latency proportional to the replication round-trip (same AZ: ~10–30 ms; cross-AZ: ~50–100 ms).
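For topics where the loss window is unacceptable, the two settings live in different places: acks on the producer, min.insync.replicas on the topic. A sketch, with the topic name and partition count being illustrative:
// Producer side: do not ack until all in-sync replicas have the message
props.put("acks", "all");
props.put("enable.idempotence", "true");           // makes retries under acks=all safe
// Topic side: reject acks=all writes if fewer than 2 replicas are in sync
kafka-topics --bootstrap-server broker:9092 --create --topic payments \
  --partitions 12 --replication-factor 3 \
  --config min.insync.replicas=2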
Consumer reads from the nearest replica (KIP-392, Kafka 2.4+): by default, consumers fetch from the partition leader. Since Kafka 2.4, consumers can be configured to fetch from the nearest follower replica in the same AZ via client.rack + replica.selector.class=RackAwareReplicaSelector. This eliminates cross-AZ read traffic and reduces consumer latency for geographically distributed clusters, an L6+ optimization worth mentioning when cross-AZ egress cost is a concern.
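The configuration for this is small, one setting on the brokers and one on each consumer (the AZ name below is illustrative):
// Broker config (server.properties): enable rack-aware replica selection
replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector
broker.rack=us-east-1a
// Consumer config: declare which rack/AZ this client runs in
props.put("client.rack", "us-east-1a");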
Data model
There are three categories of entities in this system: messages themselves (the payload), the log structure (how messages are stored and indexed), and the metadata about topics and consumer state. Each is used very differently, and those differences shape where and how each entity is stored.
Access patterns
| Operation | Frequency | Query shape |
|---|---|---|
| Append message to partition | Very high (sustained M/s) | Sequential write to end of file |
| Fetch messages at offset | Very high (per consumer poll cycle) | Random seek by offset, then sequential scan |
| Read partition metadata (leader, ISR) | High (per produce/consume) | Key lookup by topic+partition |
| Commit consumer offset | Medium (every N messages or T seconds) | Write by group+topic+partition |
| Fetch consumer offset on startup | Low (per consumer startup / rebalance) | Key lookup by group+topic+partition |
Two things stand out from this table. First, the message append and fetch patterns are purely sequential I/O: this rules out any key-value store or database as the primary storage layer and points directly to a file-based log. Second, metadata reads happen on every produce and consume request, so they must be served from memory, not from disk.
Message schema (binary, per partition)
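A simplified view of the fields each stored record carries (field names follow the API examples in §6; this is a logical view, not the exact binary wire format):
// One stored message, logical view
{
  "offset": 104921,            // assigned by the partition leader at append time
  "timestamp": 1712700000000,  // producer- or broker-assigned, ms since epoch
  "key": "user-42",            // optional; drives partition routing and log compaction
  "value": "<opaque bytes>",   // the broker never inspects the payload
  "headers": { "trace-id": "abc-123" }
}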
Metadata schema (in-memory, replicated via KRaft)
Topic and partition metadata In-memory ›
// In-memory partition metadata (per broker)
{
"topic": "orders",
"partition": 3,
"leader_id": 5, // broker ID
"isr": [5, 2, 7], // in-sync replica broker IDs
"replicas": [5, 2, 7], // all assigned replicas
"high_watermark": 104999, // last offset replicated to all ISR
"log_start_offset": 0, // first available offset (after retention)
"epoch": 12 // leader epoch for fencing
}
The leader epoch is critical for preventing split-brain: a follower that receives a write from a broker claiming to be the leader must reject it if the epoch is stale. This is how Kafka prevents duplicate appends when a deposed leader comes back online and tries to continue writing.
Caching strategy
A Kafka broker does not maintain its own message cache. Instead, it relies entirely on the OS page cache, the kernel's LRU-managed RAM buffer, to serve recent messages. When a consumer fetches messages that are still in the page cache, the broker uses the OS sendfile() syscall to transfer bytes directly from the page cache to the network socket, bypassing the broker process entirely. This is called zero-copy and is why a single broker thread can sustain hundreds of MB/s of consumer throughput at sub-5 ms latency.
A message queue that reads from disk on every consumer fetch would be unusable at scale. The sub-5 ms read latency NFR from §2 is only achievable because recent messages almost never touch disk: the OS page cache absorbs them. Understanding where each cache layer sits in the request path is what separates a good answer from a great one.
Cache hierarchy
| Layer | Location | What it caches | Invalidation | Latency when hit |
|---|---|---|---|---|
| Layer 1: OS page cache | Kernel RAM on each broker host | Recent log segment bytes (LRU eviction by kernel) | Automatic: kernel evicts cold pages when RAM pressure rises; segment roll does not invalidate, rolled segments stay warm until evicted | ~1–3 ms |
| Layer 2: Broker process heap | JVM heap on broker process | Partition metadata, sparse index files, ISR state, producer ID dedup state | On leader election change or segment roll; automatically garbage collected | <1 ms (in-process) |
| Layer 3: Consumer fetch buffer | Consumer client memory | Pre-fetched message batches ahead of current processing offset | On consumer group rebalance or explicit seek; configurable via fetch.min.bytes and max.partition.fetch.bytes | ~0 ms (already in RAM) |
Why the OS page cache is the right choice here
Kafka does not maintain its own in-process message cache; it relies entirely on the OS page cache. This is a deliberate architectural decision: the JVM heap is bounded and subject to garbage collection pauses, while the OS page cache uses all otherwise-free RAM and is managed by the kernel's LRU algorithm, which is well tuned for sequential access patterns. A broker restart that loses its JVM heap state recovers its warm read path almost immediately, because the OS page cache survives process restarts (it is lost only when the host reboots).
The consequence is that broker sizing is dominated by RAM, not CPU. A broker with 256 GB RAM and a 48 TB NVMe array will serve warm reads from the page cache at ~2 ms. A broker with 32 GB RAM on the same hardware will see frequent page cache evictions under load, forcing disk reads at ~8–10 ms, a 4–5× latency regression. This is the first thing to check when diagnosing slow consumer reads in production.
Zero-copy reads via sendfile(): When a consumer requests messages that are in the page cache, the broker uses the OS sendfile() syscall to transfer bytes directly from the page cache to the network socket, without copying them through userspace. This eliminates the broker process entirely from the data path for warm reads, and is why a single broker thread can serve hundreds of MB/s of consumer traffic without saturating its CPU.
Replication and durability
Kafka replicates each partition to a configurable number of brokers (replication factor). The broker currently acting as the partition leader accepts all writes; follower brokers continuously fetch and replay the leader's log. The High-Watermark (HWM) is the highest offset that has been confirmed by all in-sync replicas (ISR). Consumers can only read up to the HWM. Messages above it exist on the leader but have not been replicated to all ISR members yet. This guarantee means a consumer never reads a message that could disappear after a leader failover.
Durability in this system is a property of the replication flow, not of disk writes alone. A single disk write survives a process crash but not a disk failure. Replication across brokers (ideally across availability zones) is what makes data durable against hardware failures.
Leader election and log truncation
When a new leader is elected, it issues a LeaderEpochRequest to all follower replicas. Any follower whose log-end-offset (LEO) exceeds the new leader's committed high-watermark for that epoch must truncate the excess entries. This log truncation step is what enforces the HWM contract: no consumer ever reads a message that was written above the HWM, because followers discard those writes when they learn of the new leader's epoch. Without this truncation, a deposed leader that comes back online and a newly-elected leader could have diverged logs, potentially creating duplicate or inconsistent consumer reads.
The leader epoch (the epoch field in the metadata schema in §7) is a monotonically increasing counter that prevents split-brain: followers reject writes from any broker whose epoch is lower than the controller's current assignment. An old leader that recovers from a network partition detects the stale epoch and stops accepting writes before rejoining as a follower.
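A minimal sketch of that fencing check on the follower side (class and field names are illustrative, not broker source code):
import java.util.ArrayList;
import java.util.List;
// Follower-side epoch fencing (sketch)
class FollowerLog {
    private int currentLeaderEpoch = 12;            // learned from the controller
    private final List<byte[]> batches = new ArrayList<>();
    void appendFromLeader(int claimedEpoch, byte[] batch) {
        if (claimedEpoch < currentLeaderEpoch) {
            // A deposed leader is still writing: reject, so its un-replicated tail gets truncated
            throw new IllegalStateException("fenced: stale leader epoch " + claimedEpoch);
        }
        currentLeaderEpoch = claimedEpoch;           // epochs only move forward
        batches.add(batch);
    }
}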
Classic L5 probe: “What happens to messages between the leader’s LEO and HWM when a new leader is elected?” Answer: they are truncated by all followers. The new leader’s LEO becomes the cluster’s HWM baseline for that epoch. Under acks=1, producers that already received acks for those messages have lost them for good: the ack was returned, so the client will not retry them. In-flight, un-acked produce requests fail with a LEADER_NOT_AVAILABLE error and are retried, which is why idempotent producers matter even under acks=1.
Exactly-once delivery: how it actually works
Idempotent producer: deduplicating retried messages Producer side ›
When a producer retries after a timeout (because it didn't receive an ack), it may send the same message twice: the original write that succeeded silently, plus the retry that the broker would otherwise append again. The broker deduplicates by tracking the (producer_id, partition, sequence_number) triple: any sequence number it has already seen is silently discarded. This converts "at-least-once" into "exactly-once on the delivery path" at negligible cost, because the sequence number is already carried in the message header.
The limit of 5 in-flight requests per partition exists because the broker’s deduplication window only covers a bounded sequence range. With more than 5 in-flight batches, a retried batch (carrying a higher sequence number) could arrive out of order in a way the broker cannot detect, allowing a duplicate to slip through as a genuine new message.
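Enabling it is a producer-side config change; a sketch of the relevant settings:
// Idempotent producer: broker deduplicates retried batches by (producer_id, partition, sequence_no)
props.put("enable.idempotence", "true");
props.put("max.in.flight.requests.per.connection", "5");   // must stay <= 5 to keep the dedup window valid
props.put("retries", String.valueOf(Integer.MAX_VALUE));   // retry freely; duplicates are silently discarded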
Transactional API: atomic produce + offset commit Consumer side ›
End-to-end exactly-once (read → process → write → commit) requires that the final offset commit and any downstream produce happen atomically. Kafka's transaction API wraps these operations in a two-phase commit: the producer sends a BeginTransaction marker, writes messages and/or offset commits to a transaction log, then sends a CommitTransaction marker. Consumers configured with isolation.level=read_committed will not see any of the writes until the commit marker arrives.
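A sketch of one consume-transform-produce pass using the Java transactional API. It assumes the producer was created with a transactional.id and producer.initTransactions() was called once at startup; enrich(), records, and offsetsToCommit are placeholders for the application's own logic.
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;
// One transactional pass: consume -> transform -> produce -> commit offsets, atomically
class ExactlyOnceEnricher {
    void processBatch(KafkaProducer<String, String> producer,
                      KafkaConsumer<String, String> consumer,
                      ConsumerRecords<String, String> records,
                      Map<TopicPartition, OffsetAndMetadata> offsetsToCommit) {
        producer.beginTransaction();
        try {
            for (ConsumerRecord<String, String> record : records) {
                producer.send(new ProducerRecord<>("orders-enriched", record.key(), enrich(record.value())));
            }
            // The consumed offsets commit atomically with the produced messages
            producer.sendOffsetsToTransaction(offsetsToCommit, consumer.groupMetadata());
            producer.commitTransaction();
        } catch (Exception e) {
            producer.abortTransaction();   // read_committed consumers never see the aborted writes
        }
    }
    String enrich(String value) { return value; }   // hypothetical transformation
}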
Deep-dive scalability
Scaling a message queue is fundamentally about scaling partitions and scaling disk. Compute is rarely the bottleneck: network bandwidth and disk I/O are.
How many partitions should you choose? L5 deep-dive ›
Partitions are the unit of parallelism for both producers and consumers. A topic with P partitions can be consumed by at most P consumers in a group simultaneously. A useful starting heuristic: partitions = max(target_consumers, ceil(peak_throughput_MB_s / 100)), where 100 MB/s is a conservative single-partition write ceiling on NVMe. Add 20–50% headroom for growth.
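A worked pass of that heuristic, with inputs chosen to match the §2/§3 targets (they are assumptions, not fixed requirements):
// peak_throughput = 1 GB/s, target_consumers = 150, single-partition ceiling = 100 MB/s
// ceil(1,000 MB/s / 100 MB/s)      = 10 partitions (throughput floor)
// max(150 target consumers, 10)    = 150 partitions (parallelism floor)
// + ~33% growth headroom          -> provision ~200 partitions for the topic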
Consumer rebalancing: the thundering herd L6 deep-dive ›
When a consumer joins or leaves a group, Kafka triggers a rebalance: all partitions are temporarily unassigned and reassigned across living consumers. During a rebalance, consumption for the entire group stops. With large consumer groups (100+ consumers), rebalances triggered by routine deployments can cause 30–60 second pauses.
Mitigation: use static group membership (group.instance.id). A consumer with a static ID rejoins after a restart without triggering a full rebalance, reclaiming its previously assigned partitions as long as it returns within session.timeout.ms. For Kubernetes pod restarts, this typically reduces rebalance events by 80%.
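The consumer-side settings are small; a sketch (the instance id is illustrative, and in Kubernetes it is typically derived from the pod's stable identity):
// Static group membership: restarts reclaim previously owned partitions without a full rebalance
props.put("group.instance.id", "order-consumer-pod-7");   // stable per-instance id
props.put("session.timeout.ms", "45000");                 // must comfortably outlive a routine restart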
Tiered storage: petabyte-scale retention L7 deep-dive ›
Storing years of data on broker disks becomes prohibitively expensive past a few weeks. Tiered storage moves closed log segments to an object store (S3/GCS) automatically. Brokers only retain the last N days locally; the object store provides the rest. Consumers reading historical offsets are transparently redirected to fetch from the object store via a fetch proxy layer. Cost comparison: NVMe storage ~$0.20/GB-month vs. S3 ~$0.023/GB-month, roughly 8× savings for cold data.
Back-pressure and producer quotas L6 deep-dive ›
A shared Kafka cluster serving multiple teams needs hard per-producer quotas to prevent a single runaway producer from saturating disk or network. Quota enforcement works via a token bucket model per client ID: the broker measures bytes-per-second produced by each client, and when a client exceeds its quota, the broker throttles it by holding that client's produce responses for the computed throttle time (effectively back-pressuring the producer without dropping messages).
// Set a 10 MB/s produce quota (producer_byte_rate=10485760 bytes/s) for client "order-service"
kafka-configs --bootstrap-server broker:9092 --alter \
  --add-config 'producer_byte_rate=10485760' \
  --entity-type clients --entity-name order-service
Failure modes & edge cases
| Scenario | Problem | Solution | Level |
|---|---|---|---|
| Broker crash (any) | Partitions whose leader was on the crashed broker are unavailable | Controller detects via heartbeat timeout (~30s), triggers leader election, assigns a follower from the ISR as new leader. Producers see LEADER_NOT_AVAILABLE and retry | L3/L4 |
| Producer retry → duplicate message | Timeout causes producer to resend; broker writes same message twice | Enable idempotent producer (enable.idempotence=true). Broker deduplicates by (producer_id, partition, sequence_no) | L3/L4 |
| Consumer crash mid-processing | Consumer processed message but crashed before committing offset; reprocesses on restart | Make consumers idempotent (e.g. upsert to DB keyed on message ID). Or use exactly-once transactions for atomic process+commit | L5 |
| ISR drops to 1 (under-replication) | With acks=all and min.insync.replicas=2, writes fail (503) | Alert immediately; reduce min.insync.replicas as emergency measure (accepting durability risk) or allow the follower to catch up before continuing | L5 |
| Hot partition (imbalanced key distribution) | One partition gets 10x the traffic of others; its broker saturates | For genuinely hot keys: append a random salt suffix to the partition key of overloaded entities so their traffic spreads across several partitions; consumers then merge (and re-order if needed) across the salted sub-keys, as sketched after this table. Alternatively, pre-list hot-key exceptions and route them to dedicated topics with more partitions | L5 |
| Split-brain after network partition | Old leader (thinking it's still leader) accepts writes after a new leader is elected | Leader epoch prevents this: the new leader gets a higher epoch; followers reject writes from the old leader once they know the current epoch. Old leader detects epoch mismatch and stops accepting writes | L7/L8 |
| Consumer group lag explosion | Consumer falls behind; log retention expires before it catches up; offsets fall below log_start_offset | Monitor consumer lag in real-time (JMX metric: records-lag). Alert before lag reaches retention boundary. When lag expires: reset to earliest (some messages are re-read) or skip to latest (some messages are lost) | L5 |
| Cross-region message ordering | MirrorMaker 2 replicates with async lag; EU consumer reads messages in a different order than US | Accept that cross-region ordering is best-effort (not guaranteed). For strict ordering requirements, route all reads to the primary region at the cost of cross-region read latency | L7/L8 |
| Poison-pill / dead-letter | A malformed message that crashes the consumer on every attempt permanently blocks the partition: no subsequent messages in that partition are processed | Circuit-break on the consumer side: skip the message after N retries and publish it to a dead-letter topic (DLQ) for manual inspection. Schema validation at publish time via a schema registry is the proactive defense: reject malformed messages at the source before they reach consumers. Never advance past a deserialization failure without a DLQ escape hatch | L5 |
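A sketch of the salting approach from the hot-partition row above (the salt count and key format are illustrative; consumers strip the suffix before merging per entity):
// Key salting for a known hot entity (sketch)
class HotKeySalter {
    // Spread one hot key's traffic across up to `salts` sub-keys, and therefore up to `salts` partitions
    String saltedKey(String hotKey, int salts) {
        int salt = java.util.concurrent.ThreadLocalRandom.current().nextInt(salts);  // e.g. salts = 8
        return hotKey + "#" + salt;   // consumers strip "#n" before merging results for the entity
    }
}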
How to answer by level
L3/L4: Junior / SDE I-II Working system ›
- Clear producer → broker → consumer flow
- Topics, partitions, offsets: defined clearly
- Consumer pulls (not push), advances offset after processing
- Messages are not deleted after consumption; multiple groups read the same topic independently
- Replication exists; explains why (durability, not just availability)
- Unprompted discussion of acks=1 vs acks=all and the latency/durability tradeoff
- Proactively asks: what are the message ordering requirements?
- Explains the high-watermark concept
- Knows that consumer offset commit creates the at-least-once window
L5: Senior SDE Tradeoffs ›
- Explains ISR, HWM, and the ISR-shrink → 503 scenario
- Traces exactly-once delivery end-to-end (idempotent producer + transactional consumer)
- Explains consumer group rebalancing and session.timeout.ms tradeoff
- Can calculate partition count from throughput requirements
- Knows hot partition is the Achilles heel of key-based routing
- Static group membership and cooperative rebalancing (KIP-429)
- Discusses tiered storage economics and cold read latency tradeoff
- Designs producer quota enforcement and back-pressure
- Knows the difference between KRaft and ZooKeeper operationally
L6: Staff SDE End-to-end ownership ›
- Designs for operational concerns: multi-team quotas, partition rebalancing for cluster expansion, rack-aware replica assignment
- Articulates when to use Kafka vs. SQS vs. Pub/Sub based on concrete requirements
- Builds tiered storage into the capacity estimate from the start
- Designs monitoring: consumer lag SLO, ISR shrink alert, producer throttle metrics
- Initiates cross-region MirrorMaker design unprompted
- Discusses MirrorMaker failover: how do you promote EU to primary without data loss?
- Proposes schema registry to enforce compatibility without coordination overhead
L7/L8: Principal / Distinguished Architecture ownership ›
- Questions whether Kafka is the right tool: stream processing? (consider Flink/Spark on top); simple task queues? (SQS is cheaper); event sourcing? (partition count strategy changes fundamentally)
- Multi-tenant platform design: per-team quota enforcements, partition isolation, cost attribution
- Schema evolution strategy: how do you break compatibility without downtime?
- Disaster recovery plan for full region failure: RPO/RTO targets, MirrorMaker lag monitoring, automated promotion runbook
- Initiates build vs. buy: "If I'm at a company where Confluent Cloud or MSK fits, why would I operate this myself?"
- Brings cost model: "At 1 PB/year retained, object storage saves $X vs. broker disks"
- Mentions Kafka's known limitations (partition count ceiling per cluster before KRaft, JVM GC pauses, in-order exactly-once is expensive)
Classic probes table
| Question | L3/L4 | L5/L6 | L7/L8 |
|---|---|---|---|
| "How do consumers know where to start reading?" | Stored offset per partition; resets to earliest or latest on first start | Explains offset commit atomicity, manual vs. auto commit tradeoffs, seek-to-timestamp for replay | Schema-versioned replay strategy; offset reset during data migration; impact on downstream aggregations |
| "What happens if a broker crashes immediately after acknowledging a write?" | "Another broker takes over", vague on mechanism | Explains acks=1 vs acks=all, ISR, the HWM truncation on leader re-election, why LEO can be ahead of HWM | Leader epoch fencing, the log truncation protocol, idempotent producer dedup window, epoch monotonicity as the safety invariant |
| "How would you add a new topic for a different team without affecting existing consumers?" | "Create a new topic", no cross-team concern | Quota per client ID, rack-aware partition assignment to avoid hotspots on shared brokers, partition count sizing | Tenant isolation model: dedicated vs. shared cluster economics, partition rent / cost attribution, compatibility testing in a shadow cluster before production |
| "You need five years of event history accessible within 100ms. How?" | "Store everything on brokers", cost-blind | Tiered storage to S3 for cold segments; hot segments on NVMe; tradeoff: cold reads are 50–200ms not 1–5ms | Builds hybrid: tiered storage for durability, a separate columnar store (ClickHouse / BigQuery) for fast analytical queries over old data; replay via Spark/Flink from S3 for backfill |
- 1 M msg/s throughput requirement (§2) → hash-based partitioning across 100+ partitions (§5) → consumer groups with one consumer per partition for maximum read parallelism (§6)
- "No loss once acked" durability NFR (§2) → replication factor 3 across AZs, min.insync.replicas=2, acks=all for critical topics (§9) → ISR-shrink → 503 failure mode as the expected behavior when durability can't be guaranteed (§11)
- 7-day hot retention with years of history (§2) → tiered storage: NVMe for 7-day window, S3 for cold segments (§4) → ~8x cost reduction for cold data vs. keeping everything on broker disks (§9)
- Strict within-partition ordering NFR (§2) → murmur2 partition key hashing ensures same entity always routes to same partition (§5) → hot-key imbalance is the primary throughput failure mode when key cardinality is low (§10)
- At-least-once delivery as default (§2) → offset commit after processing creates the reprocessing window (§6) → idempotent producer + transactional API is the upgrade path to exactly-once, traced fully in §9
- 99.99% availability SLA (§2) → multi-broker cluster with KRaft controller quorum enables automatic leader election (§4, §8) → cross-region MirrorMaker 2 covers region-level failures for DR (§9)