System Design Interview

Design a Key-Value Store
& Distributed Cache

Simple to describe, surprisingly hard to scale: the system behind every high-traffic product you use.

L3/L4: basic single node · L5/L6: distributed cluster · L7/L8: multi-region, hot keys, SLAs
01

What the interviewer is testing

A key-value store question arrives in two flavours: "design a cache" (think Redis or Memcached: speed is everything, durability is optional) and "design a persistent key-value store" (think DynamoDB, Cassandra, RocksDB: durability is non-negotiable). The first question you should ask is which one the interviewer wants. Both share the same distribution primitives, but their tradeoff trees diverge immediately.

What makes this question deceptively hard is that the core operation, put(key, value) / get(key), is trivially solved with a single hash table. The interview is really asking: what happens when one machine isn't enough? How do you partition 10 TB of data across 50 nodes? How do you handle node failure without service disruption? How do you deal with one key receiving 100× the traffic of any other?

Level Core question Differentiator
L3/L4 Can you build a working single-node in-memory store? LRU eviction, TTL, basic API
L5 Can you distribute it across N nodes with consistent hashing? Replication, quorum reads/writes, hot key detection & sharding, node failure handling
L6 Can you own the end-to-end reliability story? Full hot key lifecycle (monitoring thresholds, auto-promotion/demotion, write-fan-out cost), rebalancing, monitoring SLAs
L7/L8 Should we build this or use an existing system? Multi-region active-active, cross-DC consistency, cost/latency tradeoffs
02

Requirements clarification

Before drawing anything, nail down whether this is a cache (latency-first, data loss acceptable) or a persistent store (durability-first). These two constraints drive almost every subsequent decision.

Functional requirements

Requirement Scope
get(key) → value or null Core
put(key, value) → OK Core
delete(key) → OK Core
TTL / expiry per key Cache variant
Atomic increment / decrement Rate limiting, counters
Range scans (e.g. keys with prefix) Persistent store variant

Non-functional requirements

The numbers below are working assumptions for this design: 10 TB of total data and 1M peak QPS. The estimator in §3 walks through the same methodology on smaller, adjustable defaults.

NFR Target Drives
Read latency (p99) < 1 ms (cache) / < 10 ms (store) In-memory tier, avoid disk on hot path
Write latency (p99) < 5 ms Async replication vs synchronous
Availability 99.99% Replication, automatic failover
Consistency Eventual (cache) / configurable (store) Quorum parameters
Durability Not required (cache) / required (store) WAL, persistence strategy
Scale 10 TB data, 1M QPS peak Partitioning strategy
Latency < 1 ms: what it means & what it drives

Sub-millisecond p99 means all hot-path reads must be served from RAM; no disk I/O is allowed on the critical path. Even an NVMe SSD random read adds ~0.1 ms, a tenth of the budget before any network or queueing cost, and a spinning-disk seek adds 4–10 ms. Either way, disk I/O on the hot path is incompatible with a < 1 ms p99 target.

This forces the design toward an in-memory primary with optional disk-backed persistence (RDB snapshots or an append-only file), rather than an LSM-tree-on-disk architecture.

Drives: In-memory storage engine; a local cache tier on each API server to avoid network round-trips to the cache cluster.
Eventual consistency (cache variant)

For a cache, serving slightly stale data is usually acceptable: the source of truth is the backing database. The cache is an optimisation layer, not the system of record.

This permits asynchronous replication to replicas, which eliminates the write-latency penalty of synchronous quorum acknowledgements. Stale reads are tolerable both in steady state (replication lag between primary and replicas) and during failure modes (the failover window while a replica is promoted).

Drives: Async replication, configurable TTLs for cache entries, and explicit cache invalidation on write paths.
99.99% availability (52 min downtime / year)

99.99% (four nines) permits roughly 52 minutes of downtime per year, or about 60 seconds per week. This rules out manual failover (humans take minutes to respond) and mandates automatic leader election via a coordination service (Zookeeper, etcd).

Drives: Replication factor ≥ 3, automatic sentinel-based failover, health-check endpoints, client-side retry with exponential backoff.
03

Capacity estimation

Caches are almost entirely read-dominated. The write rate matters because it determines replication pressure and invalidation fan-out. The storage number tells us how many nodes we need and whether in-memory storage is economically viable. Cache sizing is derived from the working set, not total data.

Interactive estimator

Defaults: 10M write-generating requests/day · 20 : 1 read-to-write ratio · 500 B average value · 50M total keys · 90% target hit rate

Output Value Basis
Write QPS 116 writes/sec (10M / 86,400 s)
Read QPS 2.3K reads/sec (write QPS × 20)
Peak Read QPS 4.6K reads/sec (×2 peak factor)
Total Storage 25 GB all keys × value size
Cache RAM needed 7.9 GB √(1−hitRate) × total storage
Cache nodes (32 GB) 1 node at hit target
💡

Key insight: At a 90% cache hit rate, the database sees only 10% of reads: a 10× load reduction. Raising the hit rate from 90% to 99% drops DB load by another 10×. This is why the most impactful knob in the system is cache sizing, not DB hardware.
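The arithmetic behind the estimator is worth being able to reproduce live. A sketch using the defaults above (the variable names and the interpretation of "10M" as write-generating requests per day are assumptions mirroring the widget's inputs):

```python
import math

# Estimator defaults (illustrative; not the 10 TB / 1M QPS NFR scale)
requests_per_day = 10_000_000      # assumed: write-generating requests/day
read_write_ratio = 20              # 20 reads for every write
value_size_bytes = 500
total_keys = 50_000_000
hit_rate = 0.90

write_qps = requests_per_day / 86_400        # ~116 writes/sec
read_qps = write_qps * read_write_ratio      # ~2.3K reads/sec
peak_read_qps = read_qps * 2                 # ~4.6K at a x2 peak factor

total_storage_gb = total_keys * value_size_bytes / 1e9     # 25 GB
# Working-set rule of thumb: cache RAM ~ sqrt(1 - hit_rate) * total storage
cache_ram_gb = math.sqrt(1 - hit_rate) * total_storage_gb  # ~7.9 GB
cache_nodes = math.ceil(cache_ram_gb / 32)                 # 32 GB/node -> 1

print(f"{write_qps:.0f} w/s, {read_qps:.0f} r/s (peak {peak_read_qps:.0f}), "
      f"{total_storage_gb:.0f} GB total, {cache_ram_gb:.1f} GB cache RAM")
```

The √(1−hitRate) working-set heuristic is a rule of thumb, not a law; measure your real key-popularity curve before sizing production RAM.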

04

High-level architecture

[Architecture diagram: Clients → Load Balancer → API Servers (each with an L1 cache) → Cache Cluster (in-memory KV store, Nodes 1–N) → Persistent Store (RocksDB / Cassandra). Cache lookup and DB fallback on miss are synchronous; a Coordinator (Zookeeper / etcd) communicates with the cluster asynchronously.]

High-level architecture: L1 in-process caches on API servers, backed by a distributed in-memory cache cluster (L2), with a persistent store as the source of truth. A coordinator manages cluster membership and leader election.

Clients are services, web browsers, or mobile apps: anything that issues get/put calls. They connect through a load balancer that distributes requests evenly across API server instances.

API servers hold a small in-process (L1) cache per server. This is the first stop: a read that hits here costs zero network round-trips. The L1 is intentionally small (a few hundred MB) and short-lived, acting as a micro-cache for the hottest keys.

Cache cluster (in-memory key-value store, e.g. Redis or Memcached) is the L2 layer. All L1 misses land here. It holds the working set of the data: the 10–20% of keys receiving 80–90% of traffic. Nodes are partitioned using consistent hashing so that adding or removing a node remaps only a small fraction of keys.

Persistent store (e.g. RocksDB, Cassandra, or a relational database) is the source of truth. Cache misses fall through to this layer. For the pure cache variant, this is the upstream database; for a persistent KV store, this is the storage engine itself.

Coordinator (e.g. Zookeeper, etcd) manages cluster membership, leader election, and failure detection. Nodes heartbeat to the coordinator; when a node goes silent, the coordinator triggers automatic failover to a replica.

Architectural rationale

L1 in-process cache per API server: why this layer exists

A network call to the L2 cache cluster costs ~0.1–0.5 ms. At 100k QPS per server, that's meaningful. L1 caches hot keys in process memory, eliminating the network hop entirely: reads from L1 cost nanoseconds.

Tradeoff: L1 caches introduce temporary inconsistency: a write invalidates the L2 but cannot immediately reach all L1 caches. Either accept stale-for-TTL or push invalidation events to all API servers. Most production systems accept short-lived staleness (seconds) and use small TTLs.
Alternatives: No L1 (simpler, slightly higher latency) · Shared L1 via sidecar proxy
Separate cache cluster from persistent store: layered architecture

Keeping the cache cluster and persistent store separate allows them to be scaled and operated independently. The cache cluster is sized for the working set (RAM), while the persistent store is sized for the full dataset (disk). Mixing them in one layer forces you to pay disk prices for hot data.

Tradeoff: The two-tier architecture adds operational complexity: cache misses incur two hops (cache cluster → backing store). But this is the cost of achieving sub-millisecond p99 for the 90%+ of requests that hit cache.
Alternatives: Single-tier with tiered storage (hot: RAM, warm: NVMe) · Read replicas only
External coordinator (Zookeeper / etcd): why not gossip?

Some systems (Cassandra, Riak) use gossip protocols for cluster membership: each node shares state with its neighbours, eventually converging. This is decentralised and resilient, but slower to converge and harder to reason about during partitions.

A dedicated coordinator (Zookeeper, etcd) provides a single consistent view of cluster membership and enforces linearisable leader election. This is the right choice when failover speed and simplicity matter more than eliminating the coordinator as a dependency.

Tradeoff: The coordinator itself becomes a reliability dependency: if it fails, nodes can still serve traffic but cannot safely elect a new leader. Run at least 3 coordinator nodes in a Raft quorum for resilience.
Alternatives: Gossip (Cassandra-style) · Redis Sentinel · Raft embedded in data nodes

Real-world comparison

Decision This design Redis Cluster Amazon DynamoDB
Partitioning Consistent hashing (virtual nodes) Hash slots (16384 fixed) Consistent hashing + automatic rebalancing
Primary datastore In-memory with optional persistence In-memory + RDB/AOF snapshots SSD-backed, multi-AZ replication
Consistency model Configurable (quorum R/W) Eventual (async replication to replicas) Eventually consistent reads by default; strongly consistent opt-in
Failover External coordinator (etcd) Cluster bus gossip + election Fully managed, transparent
Hot key handling Key sharding + L1 cache Application-level sharding or proxy Adaptive capacity + DAX accelerator
ℹ️

Neither Redis Cluster's fixed-slot approach nor DynamoDB's managed service is universally better; the right choice follows from operational ownership (managed vs self-hosted), consistency requirements, and cost model. This interview question is asking you to reason through those tradeoffs, not to recreate Redis.

05

Partitioning & consistent hashing

Once the dataset outgrows a single machine, you need a rule for deciding which node holds which key. The rule must be stable enough that adding or removing a node doesn't remap everything, fast enough to compute on every request, and balanced enough that no single node carries a disproportionate share of the load.

[Diagram ① Naive modulo, hash(key) % N: adding a 4th node remaps ~75% of keys because hash(key) % 3 ≠ hash(key) % 4, causing a cache stampede as all remapped keys miss a cold cache. ② Consistent hashing (hash ring): adding node E remaps only ~1/N of keys, those between E and its predecessor.]

Naive modulo hashing remaps ~75% of keys when adding a 4th node. Consistent hashing remaps only ~1/N of keys, just those that fall between the new node and its predecessor on the ring.

The core idea of consistent hashing is placing both nodes and keys on a circular hash ring (0 to 2³²). Each key is assigned to the first node clockwise from its position on the ring. When a node is added, only the keys between the new node and its predecessor need to move. When a node is removed, only its keys move to its successor.

Virtual nodes solve the load imbalance problem. With a small number of physical nodes, the ring partitions unevenly: one node might hold 40% of the keyspace and another 10%. By mapping each physical node to hundreds of virtual positions on the ring (e.g., hash("nodeA:0"), hash("nodeA:1"), ...), the load distributes much more evenly. It also makes rebalancing smoother: a new node with 150 virtual positions takes load from every existing node proportionally.
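The ring-plus-virtual-nodes structure fits in a few lines. A minimal sketch (class name, vnode count, and MD5 as the ring hash are illustrative choices, not a spec):

```python
import bisect
import hashlib

class HashRing:
    """Consistent hashing with virtual nodes; lookup is O(log V) via bisect."""

    def __init__(self, vnodes=150):
        self.vnodes = vnodes
        self._hashes = []   # sorted ring positions
        self._owners = []   # physical node owning each position

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.vnodes):
            h = self._hash(f"{node}:{i}")          # one vnode position
            idx = bisect.bisect(self._hashes, h)
            self._hashes.insert(idx, h)
            self._owners.insert(idx, node)

    def get_node(self, key):
        # First vnode clockwise from the key's position (modulo wraps the ring)
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._hashes)
        return self._owners[idx]

# Demo: adding a fourth node remaps only ~1/4 of keys (vs ~75% for modulo)
ring = HashRing()
for n in ("A", "B", "C"):
    ring.add_node(n)
before = {f"key{i}": ring.get_node(f"key{i}") for i in range(1000)}
ring.add_node("D")
moved = sum(1 for k, owner in before.items() if ring.get_node(k) != owner)
```

Note that every key that moves must move to the new node: existing assignments are disturbed only where a new vnode landed between a key and its old successor.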

🎯

Common probe: "How many virtual nodes per physical node should you use?" There's no universal answer: more virtual nodes mean smoother balance but more memory for the routing table. 100–200 virtual nodes per physical node is a common range (Redis Cluster uses a fixed 16384 hash slots as a deterministic alternative).

Rendezvous hashing, an alternative (advanced)

Rendezvous (highest random weight) hashing works differently: for each key, compute a score for every node using hash(key, nodeID), and assign the key to the node with the highest score. No ring data structure needed, just iterate over nodes. Adding or removing a node remaps only the keys assigned to that node, like consistent hashing, but the algorithm is simpler to implement correctly.

Tradeoff: O(N) lookup cost per key (must score all nodes). For small clusters (<30 nodes), this is negligible. For large clusters, consistent hashing with a sorted ring lookup is O(log N) and preferred.
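Rendezvous hashing is even shorter to sketch (the function name and the use of MD5 over a `key@node` string are illustrative):

```python
import hashlib

def rendezvous_owner(key, nodes):
    """Highest-random-weight: score every node for this key, pick the max."""
    def score(node):
        return int(hashlib.md5(f"{key}@{node}".encode()).hexdigest(), 16)
    return max(nodes, key=score)
```

Removing a node remaps only that node's keys: for every other key, the maximum-scoring node is still present, so its owner is unchanged.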
05b

API design

The interface surface area for a KV store is intentionally small. Each operation needs a clear ownership of the key, explicit TTL semantics, and a defined behaviour for non-existent keys.

Core endpoints

GET /v1/keys/{key}

// Request (no body, key in path)
GET /v1/keys/user:1234:session HTTP/1.1

// Response, hit
HTTP/1.1 200 OK
X-Cache: HIT
X-TTL-Remaining: 3542
{
  "key": "user:1234:session",
  "value": "eyJhbGciOiJIUzI1NiJ9...",
  "expires_at": "2026-03-29T12:00:00Z"
}

// Response, miss
HTTP/1.1 404 Not Found
{ "error": "key_not_found" }

PUT /v1/keys/{key}

// Request
PUT /v1/keys/user:1234:session HTTP/1.1
Content-Type: application/json
{
  "value": "eyJhbGciOiJIUzI1NiJ9...",
  "ttl_seconds": 3600,          // optional; 0 = no expiry
  "if_not_exists": false         // atomic set-if-absent
}

// Response
HTTP/1.1 200 OK
{ "status": "ok", "version": 42 }

// Response, conflict (if_not_exists: true, key already exists)
HTTP/1.1 409 Conflict
{ "error": "key_exists", "current_version": 38 }

DELETE /v1/keys/{key}

HTTP/1.1 204 No Content  // success (idempotent)
HTTP/1.1 404 Not Found   // key didn't exist (acceptable, idempotent)

Optional endpoints by level

Endpoint Purpose Level
POST /v1/keys/{key}/incr Atomic increment (rate limiting, counters) L5
POST /v1/keys/batch/get Multi-key fetch in one round-trip L5
POST /v1/keys/batch/put Bulk write for seeding or migration L6
GET /v1/keys?prefix=user:1234: Range scan / prefix listing L7/L8
GET /v1/admin/stats Hit rate, eviction rate, memory usage per node L7/L8
⚠️

Key naming is an API contract. Establish key namespacing conventions upfront (e.g., entity:id:attribute). Unstructured key names make prefix scanning, monitoring, and access control nearly impossible at scale. The API should document and enforce naming conventions from day one.

Conditional writes & Compare-and-Swap (CAS)

The version field in the entry schema (§7) enables optimistic locking. Two clients reading the same key both receive "version": 42. If both try to write, only the first succeeds; the second gets a 409 Conflict with the current version and must re-read and retry.

// Conditional write: only update if version matches
PUT /v1/keys/user:42:balance HTTP/1.1
{
  "value": "150",
  "if_version": 42        // fails with 409 if current version ≠ 42
}

// Response on conflict
HTTP/1.1 409 Conflict
{ "error": "version_mismatch", "current_version": 43, "hint": "re-read and retry" }

CAS is the standard answer to "how do you prevent lost updates when two clients write the same key concurrently?" without requiring distributed transactions. Redis offers two separate atomicity mechanisms for this: a Lua script (executes atomically, blocking other commands during its run, no WATCH needed or allowed) or the WATCH + MULTI + EXEC transaction pattern (optimistic locking, aborts the transaction if the watched key changed between WATCH and EXEC). These are mutually exclusive; do not combine them. DynamoDB implements the same guarantee natively via ConditionExpression.
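The server-side check itself is small. A single-node sketch of the if_version semantics (class and exception names are illustrative, and the lock stands in for the atomicity that a Lua script or DynamoDB ConditionExpression provides):

```python
import threading

class VersionConflict(Exception):
    """Raised when if_version doesn't match; maps to HTTP 409."""
    def __init__(self, current_version):
        super().__init__(f"version_mismatch, current_version={current_version}")
        self.current_version = current_version

class VersionedStore:
    def __init__(self):
        self._lock = threading.Lock()   # stands in for server-side atomicity
        self._data = {}                 # key -> (value, version)

    def get(self, key):
        return self._data.get(key)      # (value, version) or None

    def put(self, key, value, if_version=None):
        with self._lock:
            _, version = self._data.get(key, (None, 0))
            if if_version is not None and if_version != version:
                raise VersionConflict(version)   # client must re-read and retry
            self._data[key] = (value, version + 1)
            return version + 1
```

Two clients that both read version 1 race on the conditional put: the first bumps the key to version 2, the second raises VersionConflict and retries.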

Security & multi-tenancy

For a public or multi-tenant cache, three controls are non-negotiable at L6+. First, namespace isolation: prefix all keys with a tenant or service identifier (e.g., tenant_id:entity:id) and enforce this at the API layer, no cross-namespace reads should be possible at the key level. Second, authentication: Redis has historically shipped with no auth by default, leading to numerous high-profile exposure incidents; always enable Redis ACLs (since Redis 6.0) with per-service credentials. Third, value size limits: reject writes above a configured max (e.g., 1 MB) at the API tier to prevent a single key from monopolising a cache node's memory and triggering eviction avalanches.

06

Core read/write flow

The most important question in the read path is: what happens on a cache miss? The answer determines latency for cold or low-frequency keys, and also whether the system can defend itself against a cache stampede, where a popular key expires and thousands of requests simultaneously hit the backing store.

[Flowchart: GET request → L1 cache (hit: return, ~0.01 ms) → L2 cache (hit: populate L1 and return, ~0.3 ms) → miss: acquire per-key mutex → the winner fetches from the DB, fills L2 and L1, and returns (~2–10 ms); waiters retry L2 on wake.]

Read path: L1 (in-process) → L2 (cache cluster) → DB. The mutex prevents cache stampede by serialising concurrent misses for the same key, only one goroutine fetches from the DB; others wait and retry L2.

Cache stampede prevention

When a popular key expires, every concurrent request sees a miss simultaneously and races to query the database: a thundering herd. This is more destructive than it sounds: because the database is provisioned for only 10% of traffic (the 90% hit-rate target from §3), a stampede that doubles or triples miss volume pushes it past its p99 latency SLO immediately. The tier we deliberately sized down becomes the failure point.

The standard mitigations are: (1) distributed mutex: use an atomic compare-and-set in the cache cluster to let only one request fetch from the database while others wait; (2) probabilistic early recompute (a.k.a. "XFetch"): before a key expires, randomly decide to recompute it based on how close the TTL is, spreading recomputations over time; (3) stale-while-revalidate: return the stale cached value immediately while triggering an asynchronous background refresh.
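The mutex approach reduces to a per-key "single flight". An in-process sketch (a distributed deployment would take the lock with an atomic set-if-absent in the cache cluster instead; names here are illustrative):

```python
import threading

class StampedeGuard:
    """Per-key mutex on the miss path: one caller fetches, peers wait
    and re-check the cache when they wake."""

    def __init__(self):
        self._cache = {}
        self._locks = {}
        self._guard = threading.Lock()   # protects the lock table itself

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def get(self, key, fetch_from_db):
        value = self._cache.get(key)
        if value is not None:
            return value                     # fast path: cache hit
        with self._lock_for(key):            # only one miss per key proceeds
            value = self._cache.get(key)     # re-check: a peer may have filled it
            if value is None:
                value = fetch_from_db(key)   # exactly one DB fetch per expiry
                self._cache[key] = value
            return value
```

The double-check after acquiring the lock is the essential detail: waiters that lose the race must re-read the cache rather than fetch again.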

Write path

The write path is simpler, but the tradeoff is between write latency and consistency. Write-through updates the cache and database synchronously on every write: the cache is always fresh, but every write pays a database round-trip. Write-behind acknowledges the write after updating the cache, then asynchronously flushes to the database: lower write latency, but a risk of data loss if the cache fails before the flush. Cache-aside leaves cache population to read time: the simplest model, but every cold start incurs a cache miss.

💡

The choice between write-through and write-behind maps directly back to the durability NFR from §2. For a pure cache, write-behind is acceptable. For a persistent key-value store, write-through or a WAL-backed write is required.

07

Data model & storage engine

A key-value store is deceptively simple at the surface: a map from strings to blobs. The interesting design decisions live in what metadata accompanies each entry, how that metadata is stored efficiently, and what storage engine sits underneath.

Access patterns

Operation Frequency Query shape
Point read (get by key) Very high (90%+ of operations) Exact key lookup → O(1) hash table
Point write (put by key) High Exact key update + metadata update
TTL expiry scan Background, continuous Scan entries with expires_at ≤ now
Prefix scan Low (admin / migration) Range scan over sorted key space (requires sorted storage)
LRU eviction Background, on memory pressure Find least-recently-used entry across all keys

Two things jump out from this table. First, point reads dominate overwhelmingly: O(1) hash map access is the right primary structure for an in-memory store, not a B-tree or skip list. Second, TTL expiry and LRU eviction both require efficient ordering by time, which a pure hash map doesn't support. The typical solution is to maintain a hash map for O(1) lookup alongside a doubly-linked list ordered by recency (for LRU) or by expiry time (for TTL), with the map storing pointers into the list.
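A minimal sketch of that pairing (Python's OrderedDict is itself a hash map whose entries form a doubly-linked list, so it stands in for the map-plus-list structure; a production entry would also carry the expires_at and version fields from the schema below):

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()          # least-recently-used entry first

    def get(self, key):
        if key not in self._data:
            return None                     # miss
        self._data.move_to_end(key)         # O(1) pointer splice to MRU end
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the LRU tail
```

Both operations are O(1): the hash map finds the node, and the linked list reorders or evicts it with pointer updates only.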

Entry (in-memory node)
key: string (KEY)
value: bytes (opaque)
size_bytes: uint32
expires_at: int64 (unix ms, 0 = none)
last_accessed: int64 (unix ms)
version: uint64 (CAS token)
prev / next: *Entry (LRU list pointers)

Persistent log entry (WAL/AOF)
seq_no: uint64 (PK)
op: enum (SET, DEL, EXPIRE)
key: string
value: bytes (null for DEL)
ttl_seconds: uint32
timestamp: int64 (unix ms)
crc32: uint32 (integrity check)

Storage engine options

In-memory hash map + LRU list (cache variant, our choice)

For the cache use case, all data lives in RAM. The data structure is a hash map for O(1) lookup paired with a doubly-linked list ordered by last-access time. On a get, the accessed node moves to the front of the list. On memory pressure, the tail of the list is evicted.

Persistence is optional: an append-only log (WAL) records all writes. On restart, replay the log to reconstruct state. This is how Redis AOF (Append-Only File) mode works, though full-dataset snapshots (RDB) are also taken periodically to bound recovery time.

Tradeoff: Recovery time is proportional to WAL size. Redis caps this with periodic RDB snapshots and AOF rewriting (compacting the log). Without snapshotting, a restarted node must replay millions of operations.
LSM-tree (Log-Structured Merge-tree): persistent store variant

LSM-trees (used by RocksDB, Cassandra, LevelDB) convert random writes into sequential disk I/O by first buffering writes in a memory table (MemTable), then flushing sorted runs to disk (SSTables). Background compaction merges and garbage-collects stale versions.

This is the right engine when write throughput is high and the dataset exceeds available RAM. Reads may need to check multiple levels (one bloom filter per level reduces unnecessary disk I/O), so read amplification is higher than a B-tree, but the write amplification and throughput characteristics are far better for write-heavy workloads.

Tradeoff: Read amplification: a point read may check the MemTable plus multiple SSTable levels. Bloom filters at each level reduce this to near-O(1) for non-existent keys but add memory overhead per level.
Alternatives: B-tree (PostgreSQL, MySQL; read-optimised) · Bitcask (simple log with in-memory key index)
08

Eviction, TTL & caching strategy

When memory fills up, the cache must decide what to throw away. The right eviction policy depends on the access pattern of the workload. Equally important is how the cache is populated: the caching strategy determines whether writes proactively fill the cache or leave it to reads.

[Diagram: L1 in-process cache on the API server (LRU, 256 MB, TTL 30 s, ~0.01 ms access) → on miss → L2 cache cluster (in-memory KV store, LRU/LFU eviction, per-key TTL, ~10 GB RAM, ~0.3 ms access) → on miss → L3 persistent store (source of truth: no eviction, disk-backed and replicated, ~2–10 ms access).]

Each cache layer is anchored to its physical location in the §4 architecture. L1 lives inside the API server process; L2 is the cluster-level in-memory tier; L3 is the durable backing store.

Eviction policies

Policy How it works Best for Weakness
LRU Evict the key not accessed for the longest time Workloads with temporal locality (session stores, recent feeds) Doesn't account for frequency: a key accessed 1000× a day can be evicted after an idle hour
LFU Evict the key accessed the least total times Workloads with stable hot keys (product catalog, user profiles) Frequency counts decay slowly: a once-viral item stays in cache long after interest drops. Mitigated by decaying frequency counters (Redis LFU uses a logarithmic counter with configurable decay via lfu-decay-time)
ARC Adaptive blend of LRU and LFU, self-tuning Mixed or unpredictable workloads (used in ZFS) More complex to implement; not available in all cache systems
TTL expiry Each key carries a deadline; expired keys are removed lazily or eagerly Data with natural staleness windows (tokens, rate limits, OTPs) Not a memory-pressure strategy: expired keys still occupy memory until checked
FIFO Evict the oldest-inserted key Simple queues where insertion order matches usefulness Ignores access pattern completely; rarely optimal
ℹ️

In practice, Redis uses approximate LRU: on eviction, it samples a configurable number of random keys and evicts the least recently used among them. This avoids the overhead of maintaining a full doubly-linked list across all keys (no per-entry LRU pointers are needed) at the cost of a small accuracy penalty. Redis stores LRU clock data in 24 bits of each object's metadata rather than in list node pointers.

Caching strategies

Cache-aside (lazy population): most common, our baseline

Use this when reads dominate and occasional staleness is acceptable; it is the right default for most read-heavy services. The application manages the cache explicitly: on a miss, read from the DB, write to the cache, and return; on a write, update the DB then invalidate or update the cache key. Because the cache is only populated when data is actually read, rarely-accessed keys never waste cache space.

Tradeoff: Cold start: after a cache restart or key expiry, the first read for each key hits the DB. If many keys expire simultaneously (e.g., after a deploy), this becomes a cache stampede. Use jittered TTLs (add ±10–20% randomness to TTL values) to spread expiry times.
Code pattern: get → miss → db.fetch → cache.set → return
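That code pattern, with the jittered-TTL mitigation folded in, sketched against assumed cache.get/set and db.fetch interfaces (these are not a real client API):

```python
import random

def jittered_ttl(base_ttl_seconds, jitter=0.15):
    """Spread expiries by +/-15% so batch-seeded keys don't all expire at once."""
    return base_ttl_seconds * random.uniform(1 - jitter, 1 + jitter)

def get_with_cache_aside(key, cache, db, base_ttl_seconds=3600):
    value = cache.get(key)
    if value is not None:
        return value                         # hit: DB is never touched
    value = db.fetch(key)                    # miss: fall through to source of truth
    cache.set(key, value, ttl=jittered_ttl(base_ttl_seconds))
    return value
```

Note the jitter is applied at write time, per key, so a batch job seeding a million keys produces a spread of expiry times rather than one synchronized cliff.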
Write-through: strong consistency

Use this when reads must be strongly consistent immediately after writes: financial balances, inventory counts, anything where a stale read has user-visible consequences. Every write goes to both the primary cache node and the DB synchronously before returning success. The primary cache and DB are kept in sync; replica reads may lag slightly until async replication completes.

Tradeoff: Write latency includes the DB round-trip on every put. The cache is also populated with data that may never be read (write-heavy items that aren't frequently read waste cache space). For write-heavy, read-light workloads, cache-aside is usually more efficient.
Write-behind (write-back): high write throughput

Use this for write coalescing: when many writes target the same key and only the final value matters (view counters, rate-limit windows, leaderboard scores). Writes are acknowledged after updating the cache; the DB write is deferred to an asynchronous flush queue, eliminating the DB round-trip from the critical path.

Tradeoff: If the cache fails before the DB flush completes, those writes are lost permanently. Acceptable for counters, view counts, and rate-limit windows; unacceptable for financial transactions, user-generated content, or any data where loss is visible to users.
09

Deep-dive: scalability

[Diagram: two active regions (US-WEST and US-EAST), each with clients → LB/DNS → API fleet (L1 cache, hot-key shards, hot-key sampling) → cache cluster (two primaries, two replicas; reads → replicas, writes → primaries) → sharded, replicated persistent store (RocksDB / Cassandra). The regions replicate to each other asynchronously.]

Production-scale: two active regions each with their own cache cluster and persistent store. Async cross-region replication propagates writes with seconds of lag. Local DNS routing sends users to the nearest region.

Hot key problem & mitigation (L5/L6 depth)

In any large-scale cache, a small number of keys receive orders-of-magnitude more traffic than average. A viral post, a globally cached config key, a top-seller product ID: these "hot keys" can overwhelm a single cache node even while the rest of the cluster sits idle.

Key sharding: Instead of storing post:12345 on one node, store it as post:12345#0 through post:12345#N across N nodes. Reads are randomly routed among these N copies. Writes must update all N copies, which is acceptable if writes are infrequent and the data is read-optimised (like a viral post body).
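The read/write asymmetry of key sharding is easy to sketch (the shard count and key format here are illustrative choices):

```python
import random

N_SHARDS = 8   # copies of each hot key; each alias hashes to a different node

def read_alias(hot_key):
    """Reads pick one copy at random, spreading load across N_SHARDS nodes."""
    return f"{hot_key}#{random.randrange(N_SHARDS)}"

def write_aliases(hot_key):
    """Writes fan out to every copy: the N x write cost of this technique."""
    return [f"{hot_key}#{i}" for i in range(N_SHARDS)]
```

Each alias is then stored and routed like any ordinary key, so the consistent-hashing layer from §5 spreads the copies across distinct nodes with no special casing.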

L1 absorption: Hot keys are detected in production via request sampling at the API layer. When a key exceeds a QPS threshold (e.g., 10 req/sec on a single server), it gets promoted to the L1 in-process cache. Netflix's EVCache is one example of this pattern; hot-key detection and L1 promotion logic can live in a client library, a monitoring sidecar, or a proxy layer, depending on where the system owns the routing decision.

Tradeoff: Key sharding increases write fan-out from 1 to N. For keys that are read-heavy with infrequent writes (content, config), this is acceptable. For keys with frequent writes (counters, rate limits), sharding is not appropriate: use probabilistic counting or CRDTs instead.
Replication & quorum: consistency under failure

Each cache shard (primary node) has one or more replicas. Replicas serve reads, offloading traffic from the primary and providing redundancy. Write path: the client writes to the primary, which acknowledges once the write is durable in its own memory/WAL. Replication to replicas is asynchronous; replicas may lag by milliseconds to seconds. Failover is orchestrated by a coordination service (Zookeeper, etcd) as described in §4.

For the persistent store variant, quorum reads and writes are configurable. With a replication factor of 3, W=2, R=2 (quorum writes and reads) gives strong consistency: any read quorum overlaps at least one node from the latest write quorum, guaranteeing the latest version is returned. Dropping to W=1, R=1 maximises availability and throughput at the cost of possible stale reads.

Tradeoff: Higher quorum = stronger consistency but higher latency (must wait for 2 of 3 nodes to confirm). A single slow node can delay the entire request. Consider "consistent core" architectures where a small quorum of nodes handles consistency-sensitive operations.
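The quorum arithmetic is worth internalising; a sketch of the two overlap conditions (function and field names are illustrative):

```python
def quorum_properties(n, w, r):
    """Classify an (N, W, R) replication configuration.
    - r + w > n: every read quorum overlaps every write quorum, so a read
      always includes at least one replica holding the latest acked write.
    - w > n / 2: any two write quorums overlap, preventing two concurrent
      writes from both being acknowledged on disjoint node sets."""
    return {
        "reads_see_latest_write": r + w > n,
        "no_conflicting_writes": w > n / 2,
    }
```

With N=3: (W=2, R=2) satisfies both properties; (W=1, R=1) trades both away for availability and throughput, which is the stale-read configuration described above.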
Node rebalancing when adding capacity (operations)

Adding a new node to a consistent hashing ring remaps only ~1/N of keys. But those keys must be migrated from existing nodes to the new one without service interruption. The standard approach is dual-read: during the migration window, read from both the old and new node, accepting a cache miss if the new node doesn't have the key yet. Writes go to the new node immediately.

Redis Cluster handles this with hash slot migration: slots move one at a time, with an ASK redirect for any key that's mid-migration. Clients follow the redirect transparently.

Tradeoff: During rebalancing, the new node has a cold cache for migrated keys. If the rebalancing is triggered by a node failure (node went down unexpectedly), the remaining nodes serve increased traffic until the replacement is warm: plan capacity for N-1 nodes handling full load.
Multi-region active-active (L7/L8 depth)

In active-active multi-region setups, both regions accept writes, and changes are replicated asynchronously across the WAN. The unavoidable trade-off: cross-region replication lag (typically 50–200 ms) means a write in US-West isn't immediately visible in US-East.

For a cache, this is usually acceptable: caches are not the system of record, and a slightly stale cache read is fine. For a persistent KV store used as a primary data store, conflict resolution becomes critical: if two regions write to the same key simultaneously, which value wins? Last-write-wins (LWW) by timestamp is common but can discard valid writes if clocks are skewed. CRDT-based resolution (Conflict-free Replicated Data Types, e.g., OR-Set, Counter) is conflict-free: any two nodes can apply operations in any order and arrive at the same state, but it restricts the value types the store can support.

Tradeoff: Active-active maximises availability and reduces write latency by routing to the nearest region. But it requires a conflict resolution strategy and careful consideration of which operations can tolerate eventual consistency vs which require global coordination.
Alternatives: Active-passive (simpler, one region handles writes); geo-partitioning (user data pinned to a region)
10

Failure modes & edge cases

Scenario Problem Solution Level
Single cache node crashes All keys on that node are unavailable; cache miss rate spikes; DB receives 10–20× normal load Automatic failover to replica (coordinator promotes replica to primary within seconds). Clients retry with exponential backoff. L3/L4
Cache stampede (thundering herd) Popular key expires; all concurrent requests miss simultaneously; DB overwhelmed Distributed mutex on miss path; probabilistic early recompute (XFetch); stale-while-revalidate; jittered TTLs L5
Hot key Single key receives 100×–1000× average traffic; one cache node becomes a bottleneck Key sharding (key#0 through key#N across N nodes); L1 in-process absorption via hot-key detection L5
Network partition (split brain) Cache nodes cannot reach each other; nodes diverge and accept conflicting writes Quorum-based writes (W ≥ N/2 + 1); fencing tokens; coordinator arbitrates leader election; prefer CP over AP for persistent store variant L6
Dirty cache after DB rollback DB transaction rolled back, but cache already holds the committed (now invalid) value Invalidate cache keys inside the same transaction (or immediately after rollback). Use transactional outbox pattern to enqueue invalidations atomically with DB writes. L6
Cache avalanche (mass expiry) Many keys share the same TTL (e.g., seeded in a batch job); all expire at once; DB overwhelmed for seconds Add ±10–30% random jitter to all TTL values at write time; use sliding TTLs that refresh on read access L6
Large value (cache pollution) A single value (e.g., 10 MB serialised object) fills a large fraction of one node's memory; evicts many small hot keys Enforce a max value size policy (e.g., 1 MB). Large objects should be stored in an object store (S3) with only the reference in cache. Alert on keys exceeding threshold. L7/L8
Cascading failure across services Cache tier fails; every upstream service retries aggressively; DB collapses; entire system goes down Circuit breaker on cache client; request rate limiting at API tier; graceful degradation (serve stale data from a secondary read-only replica or CDN edge); load shedding L7/L8
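The distributed-mutex row above has a well-known single-process analogue: miss coalescing, the idea behind Go's golang.org/x/sync/singleflight. A hand-rolled sketch so the block is self-contained (illustrative, not production code):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// call tracks one in-flight fetch; waiters block on wg and read val after.
type call struct {
	wg  sync.WaitGroup
	val string
}

// Group coalesces concurrent fetches for the same key into one call.
type Group struct {
	mu sync.Mutex
	m  map[string]*call
}

func (g *Group) Do(key string, fetch func() string) string {
	g.mu.Lock()
	if g.m == nil {
		g.m = make(map[string]*call)
	}
	if c, ok := g.m[key]; ok { // a fetch for this key is already running
		g.mu.Unlock()
		c.wg.Wait()
		return c.val
	}
	c := &call{}
	c.wg.Add(1)
	g.m[key] = c
	g.mu.Unlock()

	c.val = fetch() // only one caller reaches the backing store
	c.wg.Done()

	g.mu.Lock()
	delete(g.m, key)
	g.mu.Unlock()
	return c.val
}

func main() {
	var g Group
	var mu sync.Mutex
	dbCalls := 0
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ { // 100 concurrent misses for one popular key
		wg.Add(1)
		go func() {
			defer wg.Done()
			g.Do("popular-key", func() string {
				mu.Lock()
				dbCalls++
				mu.Unlock()
				time.Sleep(10 * time.Millisecond) // simulated slow DB read
				return "value"
			})
		}()
	}
	wg.Wait()
	fmt.Println("db calls:", dbCalls) // typically 1, never anywhere near 100
}
```

A distributed cache applies the same shape with a Redis lock (SET NX with a TTL) instead of an in-process mutex.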
11

How to answer by level

🎯

The interview differentiator: Every candidate describes a Redis-in-front-of-MySQL architecture. What separates levels is whether you can name the failure modes of that architecture and reason from requirements (§2) to the solution without being prompted.

L3/L4, SDE I / SDE II (Entry level)
What good looks like
  • Clear API: get/put/delete with TTL
  • Correct LRU implementation (hash map + doubly linked list)
  • Single-node in-memory design with WAL for durability
  • Identifies that cache misses must fall through to DB
  • Explains cache-aside pattern correctly
What separates from L5
  • Cannot explain what happens when one node isn't enough
  • No mention of cache stampede or how to prevent it
  • Treats the design as complete without discussing failure
  • Cannot name a partitioning strategy beyond "split the data"
L5, Senior SDE (Senior)
What good looks like
  • Consistent hashing with virtual nodes, explains rebalancing impact
  • Names cache stampede, applies mutex or XFetch proactively
  • Discusses replication factor, async vs sync replication tradeoffs
  • Compares LRU vs LFU for the given workload type
  • Explains write-through vs cache-aside and when each applies
  • Proactively names hot key problem; proposes sharding + L1 absorption
  • Explains CAS / optimistic locking for concurrent write safety
What separates from L6
  • Cannot reason about quorum (R/W = 1 vs N/2+1)
  • No plan for rebalancing when adding nodes
  • Hot key: detects and shards, but cannot quantify write fan-out cost or auto-adapt thresholds
  • Single-region only; multi-region not considered
L6, Staff SDE (Staff)
What good looks like
  • Owns full hot key lifecycle: detection thresholds, auto-promotion/demotion, write fan-out tradeoff analysis
  • Cascading failure analysis; circuit breaker, graceful degradation
  • Cache avalanche: jittered TTLs, sliding windows
  • Dirty cache after DB rollback, transactional invalidation
  • Capacity planning tied to capacity estimation numbers
  • Monitoring: hit rate, eviction rate, node memory, latency SLOs, error budgets
  • Security: key namespace isolation, Redis ACLs, value size enforcement
What separates from L7/L8
  • Multi-region is "an idea" but no conflict resolution strategy
  • Cannot reason about CRDT vs LWW tradeoffs
  • No make-vs-buy analysis (Redis vs DynamoDB vs building from scratch)
L7/L8, Principal / Distinguished (Principal+)
What good looks like
  • Leads with: "Should we build this or use Redis / DynamoDB?"
  • Multi-region active-active: names conflict resolution strategy upfront (LWW with fencing, or CRDT-based)
  • Defines SLO targets and reverse-engineers the architecture from them
  • Discusses operational cost: node count, RAM cost per GB, cost of data transfer in multi-region
  • Aware of tail latency (p99.9) and identifies which failure modes cause it
  • Proposes a migration plan from single-node to distributed
L7/L8 is differentiated by
  • Business framing, cost, operational ownership, and team capability are first-class constraints
  • Forward planning: what breaks at 10× current load? 100×?
  • Naming real systems (DynamoDB DAX, ElastiCache, Netflix EVCache) and knowing their actual limitations

Classic probes table

Probe L3/L4 answer L5/L6 answer L7/L8 answer
"What happens when a cache node crashes?" Cache misses spike; DB handles more load Replica is promoted; coordinator triggers automatic failover; client retries with backoff Defines RTO target; questions whether replicas are pre-warmed or cold; discusses cascade risk if DB can't absorb the spike; proposes load-shedding
"How do you handle hot keys?" "Put frequently accessed data on faster nodes" L5: Key sharding across N nodes; detect via request sampling; L1 in-process cache promotes detected hot keys. L6: Owns the full lifecycle, configures detection thresholds, automatic promotion/demotion, and quantifies write fan-out cost of sharding. Proposes adaptive sharding factor tied to per-node QPS ceiling; distinguishes read-heavy hot keys (sharding or replication works) from write-heavy hot keys (read-side sharding doesn't help; use sharded counters with asynchronous aggregation, or write batching); frames as an SLO problem: "what's the per-node latency budget at peak?"
"How do you ensure consistency between cache and DB?" "Clear the cache when the DB changes" Discusses write-through vs cache-aside; identifies dirty-read window; mentions transactional invalidation for rollbacks Names the fundamental impossibility of perfect consistency in distributed systems under failure; proposes version tokens (optimistic locking); discusses read-your-writes consistency via session affinity
"Why not just make the cache bigger to solve all problems?" "RAM is expensive" Working set is bounded by Zipf distribution, top 10% of keys receive 90% of reads; cache beyond the working set has diminishing returns; hot key problem persists regardless of cache size Derives the economic argument: RAM cost per GB vs. DB query cost per million; discusses the inflection point where cache size increase is no longer cost-effective; proposes tiered storage (hot: RAM, warm: NVMe SSD)
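The Zipf claim in the last row can be sanity-checked numerically; a rough sketch, assuming a pure Zipf distribution with exponent s = 1 (real workloads vary, and heavier tails with s > 1 push the share past 90%):

```go
package main

import (
	"fmt"
	"math"
)

// zipfShare returns the fraction of total traffic that goes to the
// top `topFrac` of n keys, under a Zipf distribution with exponent s
// (key i receives weight proportional to 1/i^s).
func zipfShare(n int, topFrac float64, s float64) float64 {
	top, total := 0.0, 0.0
	cut := int(float64(n) * topFrac)
	for i := 1; i <= n; i++ {
		w := 1 / math.Pow(float64(i), s)
		total += w
		if i <= cut {
			top += w
		}
	}
	return top / total
}

func main() {
	share := zipfShare(1_000_000, 0.1, 1.0)
	fmt.Printf("top 10%% of 1M keys take %.0f%% of reads\n", 100*share)
	// with s = 1 this lands in the low-to-mid 80s; heavier tails exceed 90%
}
```

The exact number depends on the exponent and key count, but the shape of the argument survives: beyond the working set, each extra gigabyte of cache buys progressively fewer hits.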

Going deeper, reference sections

Interactive: LRU cache simulator

LRU (Least Recently Used) is the eviction policy asked about most in interviews. The data structure is a hash map for O(1) lookup, wired to a doubly-linked list ordered by recency. On every access, the node moves to the head. On eviction, the tail is removed. The challenge is doing this in O(1) for all three operations, get, put, and evict.

LRU Cache, step-by-step visualization

[Interactive widget: applies get/put operations to a 4-slot cache, visualizing the hash map and the doubly linked list from HEAD (most recent) to TAIL (least recent, evicted first).]

Why O(1) requires both structures

Hash map alone
  • O(1) get/put ✓
  • Cannot find LRU key without scanning all entries, O(N) eviction ✗
Linked list alone
  • O(1) evict from tail ✓
  • O(N) lookup by key, must scan the list ✗
💡

The combination: hash map stores key → node pointer (O(1) lookup). Doubly linked list stores nodes ordered by recency (O(1) move-to-front and O(1) evict-from-tail). Every operation, get, put, evict, is O(1). Java's LinkedHashMap uses exactly this structure. Python's functools.lru_cache applies the same principle, implemented as a circular doubly linked list internally.

Code: LRU in Go (interview-ready)
type Node struct {
    key, val    string
    prev, next  *Node
}

type LRUCache struct {
    cap        int
    m          map[string]*Node
    head, tail *Node  // sentinel nodes (never evicted)
}

func NewLRU(cap int) *LRUCache {
    h, t := &Node{}, &Node{}
    h.next, t.prev = t, h
    return &LRUCache{cap: cap, m: make(map[string]*Node), head: h, tail: t}
}

func (c *LRUCache) Get(key string) (string, bool) {
    n, ok := c.m[key]
    if !ok { return "", false }
    c.moveToFront(n)     // O(1), just pointer rewiring
    return n.val, true
}

func (c *LRUCache) Put(key, val string) {
    if n, ok := c.m[key]; ok {
        n.val = val
        c.moveToFront(n)
        return
    }
    if len(c.m) == c.cap {
        lru := c.tail.prev       // O(1), tail sentinel's prev is LRU node
        c.remove(lru)
        delete(c.m, lru.key)
    }
    n := &Node{key: key, val: val}
    c.m[key] = n
    c.insertFront(n)
}

func (c *LRUCache) remove(n *Node) {
    n.prev.next, n.next.prev = n.next, n.prev
}
func (c *LRUCache) insertFront(n *Node) {
    n.next, n.prev = c.head.next, c.head
    c.head.next.prev, c.head.next = n, n
}
func (c *LRUCache) moveToFront(n *Node) { c.remove(n); c.insertFront(n) }

Bloom filters: eliminating wasteful cache misses

Bloom filters have two complementary uses in a key-value system. Their primary role is inside storage engines: RocksDB, Cassandra, and LevelDB attach a Bloom filter to every SSTable level so that a point read can skip a disk seek if the filter says the key definitely isn't there, eliminating most unnecessary I/O in LSM-tree reads. Their secondary role is at the API layer: a miss that falls through to a database for a key that simply doesn't exist is pure waste, and a filter placed in front of the cache can short-circuit it immediately. This matters especially in public APIs where misbehaving or malicious clients probe arbitrary keys.

A Bloom filter is a probabilistic data structure: it can definitively say a key does not exist (no false negatives), but it may occasionally say a key might exist when it doesn't (false positives). The false positive rate is configurable and depends on the bit array size and number of hash functions.

[Diagram: Bloom filter insert and lookup over a 12-bit array (m = 12) with three hash functions. A false positive occurs only when all k probed bits happen to have been set by other keys; the rate is tunable, more bits = fewer false positives.]

Left: inserting "user:42" sets bits at positions 2, 5, 9 (one per hash function). Right: querying "user:99", bit 1 is 0, so the key definitely doesn't exist. No DB lookup needed.

False positive rate formula

For a Bloom filter with m bits, n inserted elements, and k hash functions, the false positive probability is approximately:

p ≈ (1 − e^(−kn/m))^k

The optimal number of hash functions for a given m/n ratio is k = (m/n) × ln(2). In practice, using 10 bits per element with 7 hash functions achieves a false positive rate of roughly 1%. For a set of 100 million keys, that's 125 MB, orders of magnitude smaller than storing the keys themselves.

Bits per element (m/n) Optimal k False positive rate Memory (100M keys)
6 4 ~5.6% 75 MB
10 7 ~0.8% 125 MB
14 10 ~0.1% 175 MB
20 14 ~0.007% 250 MB
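One row of the table can be reproduced directly from the formula; a quick numerical check (the helper name fpRate is not from the original):

```go
package main

import (
	"fmt"
	"math"
)

// fpRate evaluates p ≈ (1 − e^(−kn/m))^k, where bitsPerElem = m/n.
func fpRate(bitsPerElem, k float64) float64 {
	return math.Pow(1-math.Exp(-k/bitsPerElem), k)
}

func main() {
	fmt.Printf("m/n=10, k=7 → p ≈ %.2f%%\n", 100*fpRate(10, 7)) // ≈ 0.82%
}
```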
⚠️

Bloom filters cannot support deletion. Clearing a bit when removing a key would corrupt entries from other keys that share that bit. If you need deletions, use a counting Bloom filter (store counts instead of bits, decrement on delete) or a Cuckoo filter (supports deletion with similar memory efficiency). Redis includes a Bloom filter module via RedisBloom.

Where Bloom filters sit in the architecture (Integration pattern)

The Bloom filter runs inside the API server, checked before any cache or DB lookup. On each PUT, the new key is added to the filter. On each GET, if the filter returns "definitely not present," the request short-circuits immediately with a 404, no network hop at all.

In LSM-tree storage engines (RocksDB, LevelDB), each SSTable level has its own Bloom filter. Before checking a level for a key, the filter is consulted. Since most levels won't contain the key, the filter eliminates ~99% of disk reads for non-existent keys, which is why point read performance doesn't degrade catastrophically as LSM levels accumulate.

Production use: Google Bigtable, Apache Cassandra, RocksDB, and InfluxDB all use per-SSTable Bloom filters to avoid expensive disk seeks.

Redis vs Memcached, when to choose which

Interviewers frequently follow up "design a cache" with "would you use Redis or Memcached?" Both are in-memory key-value stores targeting sub-millisecond latency, but they are built for different workload profiles. The answer should come from requirements, not from brand familiarity.

Dimension Redis Memcached
Data structures Strings, lists, sets, sorted sets, hashes, streams, HyperLogLog, geospatial Strings only (arbitrary byte blobs)
Persistence RDB snapshots + AOF (append-only file); configurable durability None, purely volatile; restart = data loss
Replication Primary–replica replication built in; Redis Sentinel / Redis Cluster for HA No native replication; must be handled by client or proxy layer
Clustering Redis Cluster (16384 hash slots, automatic sharding) Client-side sharding (consistent hashing in the client library)
Throughput ~500K–1M+ ops/sec per node (threaded I/O since Redis 6.0; command execution remains single-threaded) ~500K–1M+ ops/sec per node (multi-threaded)
Memory efficiency Higher per-key overhead (~60–80 bytes per key metadata) Lower overhead, simpler internal representation
Pub/Sub & Streams Built-in pub/sub, consumer groups (Redis Streams) Not available
Lua scripting Full Lua scripting for atomic multi-key operations Not available
Operational model Managed: ElastiCache for Redis, Redis Cloud, Upstash Managed: ElastiCache for Memcached
Choose Redis when…
  • You need persistence or replication
  • You're storing lists, sorted sets, or counters (leaderboards, rate limiters)
  • You want pub/sub for event broadcasting
  • You need atomic operations across multiple keys (Lua)
  • You're using Redis as a primary data store, not just a cache
Choose Memcached when…
  • Pure caching, maximum throughput for simple get/set
  • Multi-threaded performance is the dominant constraint
  • You're caching large blobs (rendered HTML, serialised objects)
  • Operational simplicity matters, no cluster topology to manage
  • You're already using client-side consistent hashing
🎯

Interview move: Don't just list features. Say: "For this system, we need TTL-based expiry, atomic increment for rate limiting, and replication for availability, that's three features Memcached doesn't support natively. Redis is the right choice. If we were caching rendered HTML pages at maximum throughput with no durability requirements, Memcached's multi-threaded model would be worth considering." Requirements → decision.

Rate limiting with Redis atomic increment

One of the most common Redis-specific interview follow-ups is implementing a rate limiter. The simplest approach is a fixed-window counter using INCR + EXPIRE, the right first step to explain before graduating to sliding window or token bucket approaches.

Fixed window rate limiter in RedisCode + tradeoffs
-- Lua script (atomic execution in Redis)
-- Key pattern: rl:{user_id}:{window_start}
-- window_start = floor(current_unix_sec / window_size)

local key    = KEYS[1]         -- e.g. "rl:user:42:1743200"
local limit  = tonumber(ARGV[1]) -- e.g. 100 (requests per window)
local window = tonumber(ARGV[2]) -- e.g. 60 (seconds)

local count = redis.call('INCR', key)
if count == 1 then
    redis.call('EXPIRE', key, window)  -- set TTL only on first request
end

if count > limit then
    return 0  -- rate limited
end
return 1      -- allowed
Fixed window weakness: A user can send limit requests at 11:59:59 and another limit requests at 12:00:01, effectively 2× the limit in a 2-second window straddling the boundary. Sliding window (using a sorted set with ZADD + ZREMRANGEBYSCORE) fixes this but is more expensive. Token bucket (pre-computed tokens with atomic decrement) balances burst handling and accuracy.

Observability & production operations

L6+ candidates are expected to own the system end-to-end. That includes knowing which signals to monitor, how to diagnose degradation, and how to change the system safely in production. The six signals to instrument on any cache cluster follow directly from the NFRs in §2.

Key metrics to monitor

Metric Target Alert if… Implication
Cache hit rate > 90% Hit rate drops below 85% Working set is growing faster than cache capacity; or TTLs are too short; or a new access pattern emerged
Eviction rate Near zero in steady state Evictions spike unexpectedly Memory pressure, cache is full; either add nodes or investigate memory leak (growing value sizes)
p99 read latency < 1 ms p99 > 5 ms Hot key bottleneck, network congestion, or node under CPU pressure from active expiry scanning
Replication lag < 100 ms Lag > 1 s continuously Primary is overloaded or replica can't keep up; reads from replica may return stale data beyond TTL window
Memory fragmentation ratio 1.0–1.5 Fragmentation > 2.0 Memory allocator fragmentation, keys were frequently deleted/resized; restart replica or trigger MEMORY PURGE in Redis
Connection count Below max_clients Connections spike or hold steady at max Connection pool exhaustion at client side; increase pool size or investigate connection leaks

Safe operations playbook

Rolling restart (zero-downtime upgrade) (Operations)

Restart replicas first, one at a time. Wait for each replica to fully sync with the primary before proceeding. Then perform a planned failover: elect a replica as the new primary, then restart the old primary. At no point is the cluster operating without a primary for a given shard.

After a replica restart, it will be cold, its L2 cache is empty. Monitor hit rate on that node and allow 10–30 minutes for the working set to warm up before triggering another restart.

Risk: During the warm-up period, the recently restarted node sends more misses to the backing store. If the backing store is already near capacity, this warm-up window can cause cascading latency. Pre-warm by replaying recent access logs if possible.
Cache poisoning incident response (Incident playbook)

Cache poisoning occurs when a bug writes incorrect values into the cache, a stale or corrupt DB read gets cached, and subsequent requests serve the bad value for the duration of its TTL.

Response: (1) Identify the affected key pattern using the key naming convention. (2) Flush the affected keys with a targeted DEL command or prefix scan. (3) If the blast radius is large (many keys affected), consider flushing the entire cache namespace and accepting a temporary hit-rate collapse while the cache repopulates. (4) Verify the DB source is healthy before allowing the cache to repopulate.

Prevention: Checksum or version-stamp values at write time. On read, validate the checksum. A mismatch triggers a DB re-fetch and re-population. This adds a small CPU cost but catches corruption before it propagates.
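A minimal sketch of the checksum-stamping idea, assuming CRC32 is adequate for accidental corruption (a keyed hash would be needed against deliberate tampering):

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// stamped pairs a cached value with a checksum computed at write time.
type stamped struct {
	val string
	sum uint32
}

func stamp(val string) stamped {
	return stamped{val: val, sum: crc32.ChecksumIEEE([]byte(val))}
}

// validate recomputes the checksum on read; a mismatch means the cached
// value was corrupted and the caller should re-fetch from the DB.
func validate(s stamped) bool {
	return crc32.ChecksumIEEE([]byte(s.val)) == s.sum
}

func main() {
	s := stamp("user:42|alice")
	fmt.Println(validate(s)) // prints true
	s.val = "user:42|mallory" // simulated corruption after write
	fmt.Println(validate(s)) // prints false → re-fetch from DB and re-populate
}
```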
Capacity scaling (adding nodes) (Scaling ops)

When memory utilisation on cache nodes exceeds 75–80%, it's time to add capacity. The consistent hashing ring means adding a node remaps only ~1/N of keys, but those remapped keys will initially miss on the new node.

Safe scaling procedure: (1) Add the new node to the ring in shadow mode (it receives keys but forwards to the existing node for the first hour). (2) Once the new node's hit rate stabilises, promote it to full active participation. (3) Monitor overall cluster hit rate, it should dip 2–5% during rebalancing and recover within 15–30 minutes.

Scale-down caution: Removing a node is riskier than adding one. The keys from the removed node all move to their successors, which may experience a temporary memory spike. Ensure successor nodes have < 60% memory utilisation before removing a peer.
💡

The L7/L8 framing: An L7 candidate treats monitoring not as an afterthought but as a design constraint. They ask: "How do we detect that the cache is degrading before users feel it?" This is a capacity planning and SLO question, not an alerting question. The answer involves defining error budgets, at 99.9% hit rate, we have a budget of 0.1% misses per time window before we're in SLO breach territory.

How the pieces connect

Every architectural decision in this article traces back to a small number of observations established early on. The chain is worth assembling in one place:

  • 1. NFR: read latency < 1 ms (§2) → all hot-path reads must avoid disk I/O → in-memory primary store with optional WAL persistence (§7) → L1 + L2 cache layers in the architecture (§4)
  • 2. Scale: 1M QPS, 10 TB data (§2, §3) → no single node can hold the working set → consistent hashing with virtual nodes for partitioning (§5) → rebalancing strategy when nodes are added (§9)
  • 3. Cache miss rate drives DB load (§3 insight: 90% hit rate = 10× DB reduction) → miss handling becomes critical → cache stampede prevention via mutex + jittered TTLs (§6, §8) → hot key detection and L1 promotion (§9)
  • 4. NFR: availability 99.99% (§2) → human failover is too slow → automatic leader election via coordinator (etcd/ZooKeeper) (§4) → replication factor ≥ 3 with async replica promotion (§9)
  • 5. Durability NFR differs by variant (§2) → cache: write-behind acceptable; persistent store: write-through or WAL mandatory → caching strategy choice drives the write latency vs durability tradeoff (§8)
  • 6. Single-region design (§4) + global user base → multi-region active-active with async replication (§9) → conflict resolution strategy required (LWW vs CRDT) (§11)