Design a Key-Value Store
& Distributed Cache
Simple to describe, surprisingly hard to scale, the system behind every high-traffic product you use.
What the interviewer is testing
A key-value store question arrives in two flavours: "design a cache" (think Redis, Memcached, speed is everything, durability is optional) and "design a persistent key-value store" (think DynamoDB, Cassandra, RocksDB, durability is non-negotiable). The first question you should ask is which one the interviewer wants. Both share the same distribution primitives, but their tradeoff trees diverge immediately.
What makes this question deceptively hard is that the core operation, put(key, value) / get(key), is trivially solved with a single hash table. The interview is really asking: what happens when one machine isn't enough? How do you partition 10 TB of data across 50 nodes? How do you handle node failure without service disruption? How do you deal with one key receiving 100× the traffic of any other?
| Level | Core question | Differentiator |
|---|---|---|
| L3/L4 | Can you build a working single-node in-memory store? | LRU eviction, TTL, basic API |
| L5 | Can you distribute it across N nodes with consistent hashing? | Replication, quorum reads/writes, hot key detection & sharding, node failure handling |
| L6 | Can you own the end-to-end reliability story? | Full hot key lifecycle (monitoring thresholds, auto-promotion/demotion, write-fan-out cost), rebalancing, monitoring SLAs |
| L7/L8 | Should we build this or use an existing system? | Multi-region active-active, cross-DC consistency, cost/latency tradeoffs |
Requirements clarification
Before drawing anything, nail down whether this is a cache (latency-first, data loss acceptable) or a persistent store (durability-first). These two constraints drive almost every subsequent decision.
Functional requirements
| Requirement | Scope |
|---|---|
| get(key) → value or null | Core |
| put(key, value) → OK | Core |
| delete(key) → OK | Core |
| TTL / expiry per key | Cache variant |
| Atomic increment / decrement | Rate limiting, counters |
| Range scans (e.g. keys with prefix) | Persistent store variant |
Non-functional requirements
The numbers below are working assumptions for this design, 10 TB of total data and 1M peak QPS. The estimator in §3 is calibrated to these defaults.
| NFR | Target | Drives |
|---|---|---|
| Read latency (p99) | < 1 ms (cache) / < 10 ms (store) | In-memory tier, avoid disk on hot path |
| Write latency (p99) | < 5 ms | Async replication vs synchronous |
| Availability | 99.99% | Replication, automatic failover |
| Consistency | Eventual (cache) / configurable (store) | Quorum parameters |
| Durability | Not required (cache) / required (store) | WAL, persistence strategy |
| Scale | 10 TB data, 1M QPS peak | Partitioning strategy |
Latency < 1 ms: what it means & what it drives
Sub-millisecond p99 means all hot-path reads must be served from RAM, no disk I/O allowed on the critical path. Even an NVMe SSD random read adds ~0.1 ms, which alone consumes the entire latency budget; a spinning-disk seek adds 4–10 ms. Either way, disk I/O is incompatible with a <1 ms p99 requirement.
This forces the design toward an in-memory primary with optional disk-backed persistence (RDB snapshots or an append-only file), rather than an LSM-tree-on-disk architecture.
Eventual consistency (cache variant)
For a cache, serving slightly stale data is usually acceptable, the source of truth is the backing database. The cache is an optimisation layer, not the system of record.
This permits asynchronous replication to replicas, which eliminates the write-latency penalty of synchronous quorum acknowledgements. Stale reads are tolerable in both steady state (replication lag between primary and replicas) and during failure modes (the failover window when a replica is promoted).
99.99% availability: 52 min downtime / year
99.99% (four nines) permits roughly 52 minutes of downtime per year, or about 60 seconds per week. This rules out manual failover (humans take minutes to respond) and mandates automatic leader election via a coordination service (Zookeeper, etcd).
Capacity estimation
Caches are almost entirely read-dominated. The write rate matters because it determines replication pressure and invalidation fan-out. The storage number tells us how many nodes we need and whether in-memory storage is economically viable. Cache sizing is derived from the working set, not total data.
Capacity estimator
Key insight: At a 90% cache hit rate, the database only sees 10% of reads, a 10× amplification reduction. Raising hit rate from 90% to 99% drops DB load by another 10×. This is why the most impactful knob in the system is cache sizing, not DB hardware.
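The sizing arithmetic behind these numbers fits in a few lines. A back-of-envelope sketch, assuming a 20% working set, 64 GB of usable RAM per cache node, and 2 copies of each key (all of these are tunable assumptions, calibrated to the §2 defaults of 10 TB and 1M QPS):

```python
import math

def estimate(total_data_tb=10, peak_qps=1_000_000, hit_rate=0.90,
             working_set_frac=0.20, node_ram_gb=64, replicas=2):
    """Back-of-envelope cache-tier sizing. All inputs are assumptions."""
    working_set_gb = total_data_tb * 1024 * working_set_frac   # hot slice of the data
    cache_nodes = math.ceil(round(working_set_gb) * replicas / node_ram_gb)
    db_qps = round(peak_qps * (1 - hit_rate))                  # only misses reach the DB
    return cache_nodes, db_qps

nodes, db_qps = estimate()
# 10 TB x 20% = 2048 GB working set, x2 replicas = 4096 GB -> 64 nodes of 64 GB;
# at a 90% hit rate the DB sees 100k of the 1M QPS
```

Raising `hit_rate` to 0.99 in this sketch drops `db_qps` to 10k, the 10× reduction the key insight above describes.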
High-level architecture
High-level architecture: L1 in-process caches on API servers, backed by a distributed in-memory cache cluster (L2), with a persistent store as the source of truth. A coordinator manages cluster membership and leader election.
Clients are services, web browsers, or mobile apps, anything that issues get/put calls. They connect through a load balancer that distributes requests evenly across API server instances.
API servers hold a small in-process (L1) cache per server. This is the first stop, a read that hits here costs zero network round-trips. The L1 is intentionally small (a few hundred MB) and short-lived, acting as a micro-cache for the hottest keys.
Cache cluster (in-memory key-value store, e.g. Redis or Memcached) is the L2 layer. All L1 misses land here. It holds the working set of the data, the 10–20% of keys receiving 80–90% of traffic. Nodes are partitioned using consistent hashing so that adding or removing a node remaps only a small fraction of keys.
Persistent store (e.g. RocksDB, Cassandra, or a relational database) is the source of truth. Cache misses fall through to this layer. For the pure cache variant, this is the upstream database; for a persistent KV store, this is the storage engine itself.
Coordinator (e.g. Zookeeper, etcd) manages cluster membership, leader election, and failure detection. Nodes heartbeat to the coordinator; when a node goes silent, the coordinator triggers automatic failover to a replica.
Architectural rationale
L1 in-process cache per API server: why this layer exists
A network call to the L2 cache cluster costs ~0.1–0.5 ms. At 100k QPS per server, that's meaningful. L1 caches hot keys in process memory, eliminating the network hop entirely, reads from L1 cost nanoseconds.
Separate cache cluster from persistent store: layered architecture
Keeping the cache cluster and persistent store separate allows them to be scaled and operated independently. The cache cluster is sized for the working set (RAM), while the persistent store is sized for the full dataset (disk). Mixing them in one layer forces you to pay disk prices for hot data.
External coordinator (Zookeeper / etcd): why not gossip?
Some systems (Cassandra, Riak) use gossip protocols for cluster membership, each node shares state with neighbours, eventually converging. This is decentralised and resilient but slower to converge and harder to reason about during partitions.
A dedicated coordinator (Zookeeper, etcd) provides a single consistent view of cluster membership and enforces linearisable leader election. This is the right choice when failover speed and simplicity matter more than eliminating the coordinator as a dependency.
Real-world comparison
| Decision | This design | Redis Cluster | Amazon DynamoDB |
|---|---|---|---|
| Partitioning | Consistent hashing (virtual nodes) | Hash slots (16384 fixed) | Consistent hashing + automatic rebalancing |
| Primary datastore | In-memory with optional persistence | In-memory + RDB/AOF snapshots | SSD-backed, multi-AZ replication |
| Consistency model | Configurable (quorum R/W) | Eventual (async replication to replicas) | Eventually consistent reads by default; strongly consistent opt-in |
| Failover | External coordinator (etcd) | Cluster bus gossip + election | Fully managed, transparent |
| Hot key handling | Key sharding + L1 cache | Application-level sharding or proxy | Adaptive capacity + DAX accelerator |
Neither Redis Cluster's fixed-slot approach nor DynamoDB's managed service is universally better, the right choice follows from operational ownership (managed vs self-hosted), consistency requirements, and cost model. This interview question is asking you to reason through those tradeoffs, not to recreate Redis.
Partitioning & consistent hashing
Once the dataset outgrows a single machine, you need a rule for deciding which node holds which key. The rule must be stable enough that adding or removing a node doesn't remap everything, fast enough to compute on every request, and balanced enough that no single node carries a disproportionate share of the load.
Naive modulo hashing remaps ~75% of keys when adding a 4th node. Consistent hashing remaps only ~1/N of keys, just those that fall between the new node and its predecessor on the ring.
The core idea of consistent hashing is placing both nodes and keys on a circular hash ring (0 to 2³²). Each key is assigned to the first node clockwise from its position on the ring. When a node is added, only the keys between the new node and its predecessor need to move. When a node is removed, only its keys move to its successor.
Virtual nodes solve the load imbalance problem. With a small number of physical nodes,
the ring partitions unevenly, one node might hold 40% of the keyspace and another 10%. By mapping each
physical node to hundreds of virtual positions on the ring (e.g., hash("nodeA:0"),
hash("nodeA:1"), ...), the load distributes much more evenly. It also makes rebalancing
smoother: adding a node with 150 virtual nodes redistributes load from every existing node
proportionally.
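The ring-with-virtual-nodes idea can be sketched in a few dozen lines. This is an illustrative implementation, not production code; the 150-vnode default and the 32-bit MD5-prefix hash are assumptions:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Consistent hashing with virtual nodes (sketch)."""

    def __init__(self, nodes=(), vnodes=150):
        self.vnodes = vnodes
        self._ring = []                    # sorted list of (position, node)
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key):
        # 32-bit position on the ring, matching the 0..2^32 ring in the text
        return int(hashlib.md5(key.encode()).hexdigest()[:8], 16)

    def add_node(self, node):
        # each physical node occupies `vnodes` positions on the ring
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}:{i}"), node))

    def remove_node(self, node):
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def get_node(self, key):
        # first node clockwise from the key's position, wrapping at the end
        i = bisect.bisect(self._ring, (self._hash(key), chr(0x10FFFF)))
        return self._ring[i % len(self._ring)][1]
```

Adding a node to a populated ring moves only the keys that now fall between the new node's virtual positions and their predecessors, roughly 1/N of the keyspace.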
Common probe: "How many virtual nodes per physical node should you use?" There's no universal answer, more virtual nodes = smoother balance but more memory for the routing table. 100–200 virtual nodes per physical node is a common range (Redis Cluster uses a fixed 16384 hash slots as a deterministic alternative).
Rendezvous hashing, an alternative (advanced)
Rendezvous (highest random weight) hashing works differently: for each key, compute a score for
every node using hash(key, nodeID), and assign the key to the node with the highest
score. No ring data structure needed, just iterate over nodes. Adding or removing a node remaps
only the keys assigned to that node, like consistent hashing, but the algorithm is simpler to
implement correctly.
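The whole algorithm is a one-liner over the node list, which is exactly its appeal. A minimal sketch (MD5 as the scoring hash is an assumption):

```python
import hashlib

def rendezvous(key, nodes):
    """Highest-random-weight hashing: score every node for this key,
    assign the key to the highest scorer. No ring structure needed."""
    def score(node):
        return hashlib.md5(f"{key}:{node}".encode()).digest()
    return max(nodes, key=score)
```

Removing a node only remaps the keys that node owned: every other key's highest-scoring node is unchanged.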
API design
The interface surface area for a KV store is intentionally small. Each operation needs a clear ownership of the key, explicit TTL semantics, and a defined behaviour for non-existent keys.
Core endpoints
GET /v1/keys/{key}
// Request (no body, key in path)
GET /v1/keys/user:1234:session HTTP/1.1
// Response, hit
HTTP/1.1 200 OK
X-Cache: HIT
X-TTL-Remaining: 3542
{
"key": "user:1234:session",
"value": "eyJhbGciOiJIUzI1NiJ9...",
"expires_at": "2026-03-29T12:00:00Z"
}
// Response, miss
HTTP/1.1 404 Not Found
{ "error": "key_not_found" }
PUT /v1/keys/{key}
// Request
PUT /v1/keys/user:1234:session HTTP/1.1
Content-Type: application/json
{
"value": "eyJhbGciOiJIUzI1NiJ9...",
"ttl_seconds": 3600, // optional; 0 = no expiry
"if_not_exists": false // atomic set-if-absent
}
// Response
HTTP/1.1 200 OK
{ "status": "ok", "version": 42 }
// Response, conflict (if_not_exists: true, key already exists)
HTTP/1.1 409 Conflict
{ "error": "key_exists", "current_version": 38 }
DELETE /v1/keys/{key}
HTTP/1.1 204 No Content // success (idempotent)
HTTP/1.1 404 Not Found // key didn't exist (acceptable, idempotent)
Optional endpoints by level
| Endpoint | Purpose | Level |
|---|---|---|
| POST /v1/keys/{key}/incr | Atomic increment (rate limiting, counters) | L5 |
| POST /v1/keys/batch/get | Multi-key fetch in one round-trip | L5 |
| POST /v1/keys/batch/put | Bulk write for seeding or migration | L6 |
| GET /v1/keys?prefix=user:1234: | Range scan / prefix listing | L7/L8 |
| GET /v1/admin/stats | Hit rate, eviction rate, memory usage per node | L7/L8 |
Key naming is an API contract. Establish key namespacing conventions upfront (e.g.,
entity:id:attribute). Unstructured key names make prefix scanning, monitoring, and
access control nearly impossible at scale. The API should document and enforce naming conventions
from day one.
Conditional writes & Compare-and-Swap (CAS)
The version field in the entry schema (§7) enables optimistic locking. Two clients reading
the same key both receive "version": 42. If both try to write, only the first succeeds, the second gets a 409 Conflict with the current version and must re-read and retry.
// Conditional write: only update if version matches
PUT /v1/keys/user:42:balance HTTP/1.1
{
"value": "150",
"if_version": 42 // fails with 409 if current version ≠ 42
}
// Response on conflict
HTTP/1.1 409 Conflict
{ "error": "version_mismatch", "current_version": 43, "hint": "re-read and retry" }
CAS is the standard answer to "how do you prevent lost updates when two clients write the same key
concurrently?" without requiring distributed transactions. Redis offers two separate atomicity
mechanisms for this: a Lua script (executes atomically, blocking other commands during
its run, no WATCH needed or allowed) or the WATCH +
MULTI + EXEC transaction pattern (optimistic locking, aborts the
transaction if the watched key changed between WATCH and EXEC). These are
mutually exclusive; do not combine them. DynamoDB implements the same guarantee natively via
ConditionExpression.
Security & multi-tenancy
For a public or multi-tenant cache, three controls are non-negotiable at L6+. First, namespace
isolation: prefix all keys with a tenant or service identifier (e.g.,
tenant_id:entity:id) and enforce this at the API layer, no cross-namespace reads should be
possible at the key level. Second, authentication: Redis has historically shipped with
no auth by default, leading to numerous high-profile exposure incidents; always enable Redis ACLs (since
Redis 6.0) with per-service credentials. Third, value size limits: reject writes above
a configured max (e.g., 1 MB) at the API tier to prevent a single key from monopolising a cache node's
memory and triggering eviction avalanches.
Core read/write flow
The most important question in the read path is: what happens on a cache miss? The answer determines latency for cold or low-frequency keys, and also whether the system can defend itself against a cache stampede, where a popular key expires and thousands of requests simultaneously hit the backing store.
Read path: L1 (in-process) → L2 (cache cluster) → DB. The mutex prevents cache stampede by serialising concurrent misses for the same key, only one request fetches from the DB; the others wait and retry L2.
Cache stampede prevention
When a popular key expires, every concurrent request sees a miss simultaneously and races to query the database, a thundering herd. This is more destructive than it sounds: because the database is provisioned for only 10% of traffic (the 90% hit rate target from §3), a stampede that doubles or triples miss volume pushes it past its p99 latency SLO immediately. The tier we deliberately sized down becomes the failure point.
The standard mitigations are: (1) distributed mutex, use an atomic compare-and-set in the cache cluster to let only one request fetch from the database while others wait; (2) probabilistic early recompute (a.k.a. "XFetch"), before a key expires, randomly decide to recompute it based on how close the TTL is, spreading recomputations over time; (3) stale-while-revalidate, return the stale cached value immediately while triggering an asynchronous background refresh.
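Mitigation (1) can be sketched as a per-key "single-flight" guard. This toy version uses in-process locks; the class name is illustrative, and a real deployment would use an atomic compare-and-set in the L2 cluster instead so the guard works across API servers:

```python
import threading

class SingleFlightCache:
    """On a miss, only one caller runs the expensive loader; concurrent
    callers for the same key block and reuse its result (sketch)."""

    def __init__(self):
        self._data = {}
        self._locks = {}                 # one lock per in-flight key
        self._mu = threading.Lock()      # guards the lock table itself

    def get(self, key, loader):
        if key in self._data:            # fast path: cache hit
            return self._data[key]
        with self._mu:                   # get-or-create the per-key lock
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:                       # serialise concurrent misses
            if key not in self._data:    # re-check: another caller may have filled it
                self._data[key] = loader(key)   # only the first caller hits the DB
            return self._data[key]
```

Ten concurrent requests for an expired hot key translate into one database query instead of ten.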
Write path
The write path is simpler but the tradeoff is between write latency and consistency. Write-through updates the cache and database synchronously on every write, cache is always fresh but every write pays a database round-trip. Write-behind acknowledges the write after updating the cache, then asynchronously flushes to the database, lower write latency but risk of data loss if the cache fails before the flush. Cache-aside leaves the cache population to read time, the simplest model but means every cold-start incurs a cache miss.
The choice between write-through and write-behind maps directly back to the durability NFR from §2. For a pure cache, write-behind is acceptable. For a persistent key-value store, write-through or a WAL-backed write is required.
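The cache-aside pattern described above is short enough to show in full. A sketch, with a plain dict standing in for the backing database:

```python
class CacheAside:
    """Cache-aside: reads populate the cache lazily; writes update the
    source of truth and invalidate the cached copy (sketch; `db` is a
    stand-in dict for the backing database)."""

    def __init__(self, db):
        self.db = db
        self.cache = {}

    def get(self, key):
        if key in self.cache:
            return self.cache[key]       # hit: no DB round-trip
        value = self.db.get(key)         # miss: fall through to the DB
        if value is not None:
            self.cache[key] = value      # populate on the way back
        return value

    def put(self, key, value):
        self.db[key] = value             # write the source of truth first...
        self.cache.pop(key, None)        # ...then invalidate rather than update
```

Invalidating (rather than updating) the cache on write sidesteps the race where two concurrent writers leave the cache holding the older of the two values.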
Data model & storage engine
A key-value store is deceptively simple at the surface, a map from strings to blobs. The interesting design decisions live in what metadata accompanies each entry, how that metadata is stored efficiently, and what storage engine sits underneath.
Access patterns
| Operation | Frequency | Query shape |
|---|---|---|
| Point read (get by key) | Very high (90%+ of operations) | Exact key lookup → O(1) hash table |
| Point write (put by key) | High | Exact key update + metadata update |
| TTL expiry scan | Background, continuous | Scan entries with expires_at ≤ now |
| Prefix scan | Low (admin / migration) | Range scan over sorted key space (requires sorted storage) |
| LRU eviction | Background, on memory pressure | Find least-recently-used entry across all keys |
Two things jump out from this table. First, point reads dominate overwhelmingly, O(1) hash map access is the right primary structure for an in-memory store, not a B-tree or skip list. Second, TTL expiry and LRU eviction both require efficient ordering by time, which a pure hash map doesn't support. The typical solution is to maintain a hash map for O(1) lookup alongside a doubly-linked list ordered by recency (for LRU) or by expiry time (for TTL), with the map storing pointers into the list.
Storage engine options
In-memory hash map + LRU list (cache variant, our choice)
For the cache use case, all data lives in RAM. The data structure is a hash map for O(1) lookup paired with a doubly-linked list ordered by last-access time. On a get, the accessed node moves to the front of the list. On memory pressure, the tail of the list is evicted.
Persistence is optional: an append-only log (WAL) records all writes. On restart, replay the log to reconstruct state. This is how Redis AOF (Append-Only File) mode works, though full-dataset snapshots (RDB) are also taken periodically to bound recovery time.
LSM-tree (Log-Structured Merge-tree): persistent store variant
LSM-trees (used by RocksDB, Cassandra, LevelDB) convert random writes into sequential disk I/O by first buffering writes in a memory table (MemTable), then flushing sorted runs to disk (SSTables). Background compaction merges and garbage-collects stale versions.
This is the right engine when write throughput is high and the dataset exceeds available RAM. Reads may need to check multiple levels (one bloom filter per level reduces unnecessary disk I/O), so read amplification is higher than a B-tree, but the write amplification and throughput characteristics are far better for write-heavy workloads.
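The MemTable-flush-SSTable cycle can be illustrated with a toy in-memory model. This sketch keeps "SSTables" as sorted lists instead of disk files and omits compaction and bloom filters; it exists only to show the write-buffering and newest-run-first read path:

```python
import bisect

class TinyLSM:
    """Toy LSM tree: writes buffer in a memtable; when full, the memtable
    is flushed as an immutable sorted run. Reads check the memtable, then
    runs newest-to-oldest (the source of read amplification)."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.runs = []                   # flushed sorted runs ("SSTables")
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            # flush: one sorted, sequential "disk" write instead of random I/O
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):  # newest run holds the latest version
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None
```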
Eviction, TTL & caching strategy
When memory fills up, the cache must decide what to throw away. The right eviction policy depends on the access pattern of the workload. Equally important is how the cache is populated, the caching strategy determines whether writes proactively fill the cache or leave it to reads.
Each cache layer is anchored to its physical location in the §4 architecture. L1 lives inside the API server process; L2 is the cluster-level in-memory tier; L3 is the durable backing store.
Eviction policies
| Policy | How it works | Best for | Weakness |
|---|---|---|---|
| LRU | Evict the key not accessed for the longest time | Workloads with temporal locality (session stores, recent feeds) | Doesn't account for frequency, a key accessed 1000× a day can be evicted after an idle hour |
| LFU | Evict the key accessed the least total times | Workloads with stable hot keys (product catalog, user profiles) | Frequency counts decay slowly, a once-viral item stays in cache long after interest drops. Mitigated by decaying frequency counters (Redis LFU uses a logarithmic counter with configurable decay via lfu-decay-time) |
| ARC | Adaptive blend of LRU and LFU, self-tuning | Mixed or unpredictable workloads (used in ZFS) | More complex to implement; not available in all cache systems |
| TTL expiry | Each key carries a deadline; expired keys are removed lazily or eagerly | Data with natural staleness windows (tokens, rate limits, OTPs) | Not a memory pressure strategy, expired keys still occupy memory until checked |
| FIFO | Evict the oldest-inserted key | Simple queues where insertion order matches usefulness | Ignores access pattern completely; rarely optimal |
In practice, Redis uses approximate LRU, on eviction, it samples a configurable number of random keys and evicts the least recently used among them. This avoids the overhead of maintaining a full doubly linked list across all keys, no per-entry LRU pointers are needed, at the cost of a small accuracy penalty. Redis stores LRU clock data in 24 bits of each object's metadata rather than in list node pointers.
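The sampling trick is simple to express. A sketch of the idea (not Redis's actual implementation): `entries` maps each key to its last-access timestamp, and eviction picks the oldest of a small random sample rather than the global minimum:

```python
import random

def approx_lru_evict(entries, sample_size=5):
    """Approximate LRU in the style Redis uses: sample a few random keys
    and evict the least recently used among them. `entries` maps
    key -> last_access_time (sketch)."""
    sample = random.sample(list(entries), min(sample_size, len(entries)))
    victim = min(sample, key=entries.get)   # oldest access time in the sample
    del entries[victim]
    return victim
```

Larger `sample_size` approaches true LRU at the cost of more work per eviction, the same knob Redis exposes as `maxmemory-samples`.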
Caching strategies
Cache-aside (lazy population): most common, our baseline
Use this when reads dominate and occasional staleness is acceptable, it is the right default for most read-heavy services. The application manages the cache explicitly: on a miss, read from DB, write to cache, and return; on a write, update DB then invalidate or update the cache key. Because the cache is only populated when data is actually read, rarely-accessed keys never waste cache space.
Write-through: strong consistency
Use this when reads must be strongly consistent immediately after writes, financial balances, inventory counts, anything where a stale read has user-visible consequences. Every write goes to both the primary cache node and the DB synchronously before returning success. The primary cache and DB are kept in sync; replica reads may lag slightly until async replication completes.
Write-behind (write-back): high write throughput
Use this for write-coalescing, when many writes target the same key and only the final value matters (view counters, rate-limit windows, leaderboard scores). Writes are acknowledged after updating the cache; the DB write is deferred to an asynchronous flush queue, eliminating the DB round-trip from the critical path.
Deep-dive: scalability
Production-scale: two active regions each with their own cache cluster and persistent store. Async cross-region replication propagates writes with seconds of lag. Local DNS routing sends users to the nearest region.
Hot key problem & mitigation (L5/L6 depth)
In any large-scale cache, a small number of keys receive orders-of-magnitude more traffic than average. A viral post, a globally cached config key, a top-seller product ID, these "hot keys" can overwhelm a single cache node even while the rest of the cluster sits idle.
Key sharding: Instead of storing post:12345 on one node, store
it as post:12345#0 through post:12345#N across N nodes. Reads are
randomly routed among these N copies. Writes must update all N copies, acceptable if writes
are infrequent and the data is read-optimised (like a viral post body).
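The read/write asymmetry of key sharding is easy to see in code. A sketch, with a dict standing in for the cache cluster and a fixed sharding factor (in practice the factor would be tuned per hot key):

```python
import random

N_SHARDS = 8   # sharding factor; an assumption, tuned per key in practice

def read_hot_key(cache, key):
    """Reads pick one shard copy at random, spreading load over N nodes."""
    return cache.get(f"{key}#{random.randrange(N_SHARDS)}")

def write_hot_key(cache, key, value):
    """Writes must fan out to every shard copy: the cost of this scheme,
    and why it suits read-heavy, write-rare data."""
    for i in range(N_SHARDS):
        cache[f"{key}#{i}"] = value
```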
L1 absorption: Detecting hot keys in production is done via request sampling at the API layer. When a key exceeds a QPS threshold (e.g., 10 req/sec on a single server), it gets promoted to the L1 in-process cache. Netflix's EVCache is one example of this pattern, hot-key detection and L1 promotion logic can live in a client library, a monitoring sidecar, or a proxy layer depending on where the system owns the routing decision.
Replication & quorum: consistency under failure
Each cache shard (primary node) has one or more replicas. Replicas serve reads, offloading traffic from the primary and providing redundancy. Write path: the client writes to the primary, which acknowledges once the write is durable in its own memory/WAL. Replication to replicas is asynchronous, replicas may lag by milliseconds to seconds. Failover is orchestrated by a coordination service (Zookeeper, etcd) as described in §4.
For the persistent store variant, quorum reads and writes are configurable. With a replication factor of 3, W=2, R=2 (quorum writes and reads) gives strong consistency, any read will overlap with at least one write, guaranteeing the latest version is returned. Dropping to W=1, R=1 maximises availability and throughput at the cost of possible stale reads.
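The overlap guarantee is just arithmetic: any read quorum intersects any write quorum when R + W > N, and W > N/2 prevents two concurrent writes from both reaching a majority. A one-function sketch of the check:

```python
def is_strongly_consistent(n, w, r):
    """Quorum intersection rule: a read of R replicas is guaranteed to see
    the latest write of W replicas iff R + W > N, with W > N/2 so that
    concurrent writes cannot both claim a quorum."""
    return r + w > n and w > n / 2
```

With N=3: W=2, R=2 passes; W=1, R=1 fails (fast, but reads can be stale), matching the configurations discussed above.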
Node rebalancing when adding capacity (operations)
Adding a new node to a consistent hashing ring remaps only ~1/N of keys. But those keys must be migrated from existing nodes to the new one without service interruption. The standard approach is dual-read: during the migration window, read from both the old and new node, accepting a cache miss if the new node doesn't have the key yet. Writes go to the new node immediately.
Redis Cluster handles this with hash slot migration, slots move one at a time, with an
ASK redirect for any key that's mid-migration. Clients follow the redirect
transparently.
Multi-region active-active (L7/L8 depth)
In active-active multi-region setups, both regions accept writes, and changes are replicated asynchronously across the WAN. The unavoidable trade-off: cross-region replication lag (typically 50–200 ms) means a write in US-West isn't immediately visible in US-East.
For a cache, this is usually acceptable, caches are not the system of record, and a slightly stale cache read is fine. For a persistent KV store used as a primary data store, conflict resolution becomes critical: if two regions write to the same key simultaneously, which value wins? Last-write-wins (LWW) by timestamp is common but can discard valid writes if clocks are skewed. CRDT-based resolution (Conflict-free Replicated Data Types, e.g., OR-Set, Counter) is conflict-free, any two nodes can apply operations in any order and arrive at the same state, but restricts the value types the store can support.
Failure modes & edge cases
| Scenario | Problem | Solution | Level |
|---|---|---|---|
| Single cache node crashes | All keys on that node are unavailable; cache miss rate spikes; DB receives 10–20× normal load | Automatic failover to replica (coordinator promotes replica to primary within seconds). Clients retry with exponential backoff. | L3/L4 |
| Cache stampede (thundering herd) | Popular key expires; all concurrent requests miss simultaneously; DB overwhelmed | Distributed mutex on miss path; probabilistic early recompute (XFetch); stale-while-revalidate; jittered TTLs | L5 |
| Hot key | Single key receives 100×–1000× average traffic; one cache node becomes a bottleneck | Key sharding (key#0 through key#N across N nodes); L1 in-process absorption via hot-key detection | L5 |
| Network partition (split brain) | Cache nodes cannot reach each other; nodes diverge and accept conflicting writes | Quorum-based writes (W ≥ N/2 + 1); fencing tokens; coordinator arbitrates leader election; prefer CP over AP for persistent store variant | L6 |
| Dirty cache after DB rollback | DB transaction rolled back, but cache already holds the committed (now invalid) value | Invalidate cache keys inside the same transaction (or immediately after rollback). Use transactional outbox pattern to enqueue invalidations atomically with DB writes. | L6 |
| Cache avalanche (mass expiry) | Many keys share the same TTL (e.g., seeded in a batch job); all expire at once; DB overwhelmed for seconds | Add ±10–30% random jitter to all TTL values at write time; use sliding TTLs that refresh on read access | L6 |
| Large value (cache pollution) | A single value (e.g., 10 MB serialised object) fills a large fraction of one node's memory; evicts many small hot keys | Enforce a max value size policy (e.g., 1 MB). Large objects should be stored in an object store (S3) with only the reference in cache. Alert on keys exceeding threshold. | L7/L8 |
| Cascading failure across services | Cache tier fails; every upstream service retries aggressively; DB collapses; entire system goes down | Circuit breaker on cache client; request rate limiting at API tier; graceful degradation (serve stale data from a secondary read-only replica or CDN edge); load shedding | L7/L8 |
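The jittered-TTL mitigation from the avalanche row above is tiny in code and worth having on hand in an interview. A sketch, assuming a ±20% jitter band:

```python
import random

def jittered_ttl(base_ttl_seconds, jitter_frac=0.20):
    """Spread expiries by +/- 20% (an assumed band, within the 10-30%
    range suggested above) so keys seeded in the same batch job don't
    all expire, and stampede the DB, at the same instant."""
    delta = base_ttl_seconds * jitter_frac
    return base_ttl_seconds + random.uniform(-delta, delta)
```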
How to answer by level
The interview differentiator: Every candidate describes a Redis-in-front-of-MySQL architecture. What separates levels is whether you can name the failure of that architecture, and reason from requirements (§2) to the solution without being prompted.
L3/L4, SDE I / SDE II (entry level)
- Clear API: get/put/delete with TTL
- Correct LRU implementation (hash map + doubly linked list)
- Single-node in-memory design with WAL for durability
- Identifies that cache misses must fall through to DB
- Explains cache-aside pattern correctly
- Cannot explain what happens when one node isn't enough
- No mention of cache stampede or how to prevent it
- Treats the design as complete without discussing failure
- Cannot name a partitioning strategy beyond "split the data"
L5, Senior SDE
- Consistent hashing with virtual nodes, explains rebalancing impact
- Names cache stampede, applies mutex or XFetch proactively
- Discusses replication factor, async vs sync replication tradeoffs
- Compares LRU vs LFU for the given workload type
- Explains write-through vs cache-aside and when each applies
- Proactively names hot key problem; proposes sharding + L1 absorption
- Explains CAS / optimistic locking for concurrent write safety
- Cannot reason about quorum (R/W = 1 vs N/2+1)
- No plan for rebalancing when adding nodes
- Hot key: detects and shards, but cannot quantify write fan-out cost or auto-adapt thresholds
- Single-region only; multi-region not considered
L6, Staff SDE
- Owns full hot key lifecycle: detection thresholds, auto-promotion/demotion, write fan-out tradeoff analysis
- Cascading failure analysis; circuit breaker, graceful degradation
- Cache avalanche: jittered TTLs, sliding windows
- Dirty cache after DB rollback, transactional invalidation
- Capacity planning tied to capacity estimation numbers
- Monitoring: hit rate, eviction rate, node memory, latency SLOs, error budgets
- Security: key namespace isolation, Redis ACLs, value size enforcement
- Multi-region is "an idea" but no conflict resolution strategy
- Cannot reason about CRDT vs LWW tradeoffs
- No make-vs-buy analysis (Redis vs DynamoDB vs building from scratch)
L7/L8, Principal / Distinguished
- Leads with: "Should we build this or use Redis / DynamoDB?"
- Multi-region active-active: names conflict resolution strategy upfront (LWW with fencing, or CRDT-based)
- Defines SLO targets and reverse-engineers the architecture from them
- Discusses operational cost: node count, RAM cost per GB, cost of data transfer in multi-region
- Aware of tail latency (p99.9) and identifies which failure modes cause it
- Proposes a migration plan from single-node to distributed
- Business framing, cost, operational ownership, and team capability are first-class constraints
- Forward planning: what breaks at 10× current load? 100×?
- Naming real systems (DynamoDB DAX, ElastiCache, Netflix EVCache) and knowing their actual limitations
Classic probes table
| Probe | L3/L4 answer | L5/L6 answer | L7/L8 answer |
|---|---|---|---|
| "What happens when a cache node crashes?" | Cache misses spike; DB handles more load | Replica is promoted; coordinator triggers automatic failover; client retries with backoff | Defines RTO target; questions whether replicas are pre-warmed or cold; discusses cascade risk if DB can't absorb the spike; proposes load-shedding |
| "How do you handle hot keys?" | "Put frequently accessed data on faster nodes" | L5: Key sharding across N nodes; detect via request sampling; L1 in-process cache promotes detected hot keys. L6: Owns the full lifecycle, configures detection thresholds, automatic promotion/demotion, and quantifies write fan-out cost of sharding. | Proposes adaptive sharding factor tied to per-node QPS ceiling; distinguishes read-heavy keys (sharding fine) from write-heavy keys (sharding inappropriate, use probabilistic counting instead); frames as an SLO problem: "what's the per-node latency budget at peak?" |
| "How do you ensure consistency between cache and DB?" | "Clear the cache when the DB changes" | Discusses write-through vs cache-aside; identifies dirty-read window; mentions transactional invalidation for rollbacks | Names the fundamental impossibility of perfect consistency in distributed systems under failure; proposes version tokens (optimistic locking); discusses read-your-writes consistency via session affinity |
| "Why not just make the cache bigger to solve all problems?" | "RAM is expensive" | Working set is bounded by Zipf distribution, top 10% of keys receive 90% of reads; cache beyond the working set has diminishing returns; hot key problem persists regardless of cache size | Derives the economic argument: RAM cost per GB vs. DB query cost per million; discusses the inflection point where cache size increase is no longer cost-effective; proposes tiered storage (hot: RAM, warm: NVMe SSD) |
Going deeper, reference sections
LRU eviction, step-by-step
LRU (Least Recently Used) is the eviction policy asked about most in interviews. The data structure is a hash map for O(1) lookup, wired to a doubly-linked list ordered by recency. On every access, the node moves to the head. On eviction, the tail is removed. The challenge is doing this in O(1) for all three operations, get, put, and evict.
Why O(1) requires both structures
- Hash map alone: O(1) get/put ✓, but O(N) eviction ✗ (cannot find the LRU key without scanning all entries)
- Doubly linked list alone: O(1) evict from tail ✓, but O(N) lookup ✗ (must scan the list to find a key)
The combination: the hash map stores key → node pointer (O(1) lookup), and the doubly linked list stores nodes ordered by recency (O(1) move-to-front and O(1) evict-from-tail). Every operation, get, put, evict, is O(1). Java's LinkedHashMap uses exactly this structure. Python's functools.lru_cache applies the same principle with a circular doubly linked list internally, using compact list-based links rather than full node objects in the pure-Python implementation.
Code: LRU in Go (interview-ready)
type Node struct {
key, val string
prev, next *Node
}
type LRUCache struct {
cap int
m map[string]*Node
head, tail *Node // sentinel nodes (never evicted)
}
func NewLRU(cap int) *LRUCache {
h, t := &Node{}, &Node{}
h.next, t.prev = t, h
return &LRUCache{cap: cap, m: make(map[string]*Node), head: h, tail: t}
}
func (c *LRUCache) Get(key string) (string, bool) {
n, ok := c.m[key]
if !ok { return "", false }
c.moveToFront(n) // O(1), just pointer rewiring
return n.val, true
}
func (c *LRUCache) Put(key, val string) {
if n, ok := c.m[key]; ok {
n.val = val
c.moveToFront(n)
return
}
if len(c.m) == c.cap {
lru := c.tail.prev // O(1), tail sentinel's prev is LRU node
c.remove(lru)
delete(c.m, lru.key)
}
n := &Node{key: key, val: val}
c.m[key] = n
c.insertFront(n)
}
func (c *LRUCache) remove(n *Node) {
n.prev.next, n.next.prev = n.next, n.prev
}
func (c *LRUCache) insertFront(n *Node) {
n.next, n.prev = c.head.next, c.head
c.head.next.prev, c.head.next = n, n
}
func (c *LRUCache) moveToFront(n *Node) { c.remove(n); c.insertFront(n) }
Bloom filters: eliminating wasteful cache misses
Bloom filters have two complementary uses in a key-value system. Their primary role is inside storage engines: RocksDB, Cassandra, and LevelDB attach a Bloom filter to every SSTable level so that a point read can skip a disk seek if the filter says the key definitely isn't there, eliminating most unnecessary I/O in LSM-tree reads. Their secondary role is at the API layer: a miss that falls through to a database for a key that simply doesn't exist is pure waste, and a filter placed in front of the cache can short-circuit it immediately. This matters especially in public APIs where misbehaving or malicious clients probe arbitrary keys.
A Bloom filter is a probabilistic data structure: it can definitively say a key does not exist (no false negatives), but it may occasionally say a key might exist when it doesn't (false positives). The false positive rate is configurable and depends on the bit array size and number of hash functions.
Example: inserting "user:42" sets bits at positions 2, 5, and 9 (one per hash function). Querying "user:99": bit 1 is 0, so the key definitely doesn't exist, and no DB lookup is needed.
False positive rate formula
For a Bloom filter with m bits, n inserted elements, and k hash functions, the false positive probability is approximately:
p ≈ (1 − e^(−kn/m))^k
The optimal number of hash functions for a given m/n ratio is k = (m/n) × ln(2). In practice, using 10 bits per element with 7 hash functions achieves a false positive rate of roughly 0.8%. For a set of 100 million keys, that's 125 MB, orders of magnitude smaller than storing the keys themselves.
| Bits per element (m/n) | Optimal k | False positive rate | Memory (100M keys) |
|---|---|---|---|
| 6 | 4 | ~5.6% | 75 MB |
| 10 | 7 | ~0.8% | 125 MB |
| 14 | 10 | ~0.1% | 175 MB |
| 20 | 14 | ~0.007% | 250 MB |
Bloom filters cannot support deletion. Clearing a bit when removing a key would corrupt entries from other keys that share that bit. If you need deletions, use a counting Bloom filter (store counts instead of bits, decrement on delete) or a Cuckoo filter (supports deletion with similar memory efficiency). Redis includes a Bloom filter module via RedisBloom.
Where Bloom filters sit in the architecture (integration pattern)
The Bloom filter runs inside the API server, checked before any cache or DB lookup. On each PUT, the new key is added to the filter. On each GET, if the filter returns "definitely not present," the request short-circuits immediately with a 404, no network hop at all.
In LSM-tree storage engines (RocksDB, LevelDB), each SSTable level has its own Bloom filter. Before checking a level for a key, the filter is consulted. Since most levels won't contain the key, the filter eliminates ~99% of disk reads for non-existent keys, which is why point read performance doesn't degrade catastrophically as LSM levels accumulate.
Redis vs Memcached, when to choose which
Interviewers frequently follow up "design a cache" with "would you use Redis or Memcached?" Both are in-memory key-value stores targeting sub-millisecond latency, but they are built for different workload profiles. The answer should come from requirements, not from brand familiarity.
| Dimension | Redis | Memcached |
|---|---|---|
| Data structures | Strings, lists, sets, sorted sets, hashes, streams, HyperLogLog, geospatial | Strings only (arbitrary byte blobs) |
| Persistence | RDB snapshots + AOF (append-only file); configurable durability | None, purely volatile; restart = data loss |
| Replication | Primary–replica replication built in; Redis Sentinel / Redis Cluster for HA | No native replication; must be handled by client or proxy layer |
| Clustering | Redis Cluster (16384 hash slots, automatic sharding) | Client-side sharding (consistent hashing in the client library) |
| Throughput | ~500K–1M+ ops/sec per node (threaded I/O since Redis 6.0; command execution remains single-threaded) | ~500K–1M+ ops/sec per node (multi-threaded) |
| Memory efficiency | Higher per-key overhead (~60–80 bytes per key metadata) | Lower overhead, simpler internal representation |
| Pub/Sub & Streams | Built-in pub/sub, consumer groups (Redis Streams) | Not available |
| Lua scripting | Full Lua scripting for atomic multi-key operations | Not available |
| Operational model | Managed: ElastiCache for Redis, Redis Cloud, Upstash | Managed: ElastiCache for Memcached |
Choose Redis when:
- You need persistence or replication
- You're storing lists, sorted sets, or counters (leaderboards, rate limiters)
- You want pub/sub for event broadcasting
- You need atomic operations across multiple keys (Lua)
- You're using Redis as a primary data store, not just a cache
Choose Memcached when:
- Pure caching, maximum throughput for simple get/set
- Multi-threaded performance is the dominant constraint
- You're caching large blobs (rendered HTML, serialised objects)
- Operational simplicity matters, no cluster topology to manage
- You're already using client-side consistent hashing
Interview move: Don't just list features. Say: "For this system, we need TTL-based expiry, atomic increment for rate limiting, and replication for availability, that's three features Memcached doesn't support natively. Redis is the right choice. If we were caching rendered HTML pages at maximum throughput with no durability requirements, Memcached's multi-threaded model would be worth considering." Requirements → decision.
Rate limiting with Redis atomic increment
One of the most common Redis-specific interview follow-ups is implementing a rate limiter. The simplest
approach is a fixed-window counter using INCR + EXPIRE, the right first step
to explain before graduating to sliding window or token bucket approaches.
Fixed window rate limiter in Redis (code + tradeoffs)
-- Lua script (atomic execution in Redis)
-- Key pattern: rl:{user_id}:{window_start}
-- window_start = floor(current_unix_sec / window_size)
local key = KEYS[1] -- e.g. "rl:user:42:1743200"
local limit = tonumber(ARGV[1]) -- e.g. 100 (requests per window)
local window = tonumber(ARGV[2]) -- e.g. 60 (seconds)
local count = redis.call('INCR', key)
if count == 1 then
redis.call('EXPIRE', key, window) -- set TTL only on first request
end
if count > limit then
return 0 -- rate limited
end
return 1 -- allowed
The fixed window permits bursts of up to 2× the limit at window boundaries (a client can spend one window's full quota just before the boundary and the next window's just after). A sliding window log (ZADD + ZREMRANGEBYSCORE) fixes this but is more expensive. Token bucket (pre-computed tokens with atomic decrement) balances burst handling and accuracy.
Observability & production operations
L6+ candidates are expected to own the system end-to-end. That includes knowing which signals to monitor, how to diagnose degradation, and how to change the system safely in production. The six signals to instrument on any cache cluster follow directly from the NFRs in §2.
Key metrics to monitor
| Metric | Target | Alert if… | Implication |
|---|---|---|---|
| Cache hit rate | > 90% | Hit rate drops below 85% | Working set is growing faster than cache capacity; or TTLs are too short; or a new access pattern emerged |
| Eviction rate | Near zero in steady state | Evictions spike unexpectedly | Memory pressure, cache is full; either add nodes or investigate memory leak (growing value sizes) |
| p99 read latency | < 1 ms | p99 > 5 ms | Hot key bottleneck, network congestion, or node under CPU pressure from active expiry scanning |
| Replication lag | < 100 ms | Lag > 1 s continuously | Primary is overloaded or replica can't keep up; reads from replica may return stale data beyond TTL window |
| Memory fragmentation ratio | 1.0–1.5 | Fragmentation > 2.0 | Memory allocator fragmentation, keys were frequently deleted/resized; restart replica or trigger MEMORY PURGE in Redis |
| Connection count | Below max_clients | Connections spike or hold steady at max | Connection pool exhaustion at client side; increase pool size or investigate connection leaks |
Safe operations playbook
Rolling restart (zero-downtime upgrade)
Restart replicas first, one at a time. Wait for each replica to fully sync with the primary before proceeding. Then perform a planned failover, elect a replica as the new primary, then restart the old primary. At no point is the cluster operating without a primary for a given shard.
After a replica restart, it will be cold, its L2 cache is empty. Monitor hit rate on that node and allow 10–30 minutes for the working set to warm up before triggering another restart.
Cache poisoning incident response (playbook)
Cache poisoning occurs when a bug writes incorrect values into the cache (for example, a stale or corrupt DB read gets cached), and subsequent requests serve the bad value for the duration of its TTL.
Response: (1) Identify the affected key pattern using the key naming convention. (2) Flush the affected keys with a targeted DEL command or prefix scan. (3) If the blast radius is large (many keys affected), consider flushing the entire cache namespace and accepting a temporary hit-rate collapse while the cache repopulates. (4) Verify the DB source is healthy before allowing the cache to repopulate.
Capacity scaling (adding nodes)
When memory utilisation on cache nodes exceeds 75–80%, it's time to add capacity. The consistent hashing ring means adding a node remaps only ~1/N of keys, but those remapped keys will initially miss on the new node.
Safe scaling procedure: (1) Add the new node to the ring in shadow mode (it receives keys but forwards to the existing node for the first hour). (2) Once the new node's hit rate stabilises, promote it to full active participation. (3) Monitor overall cluster hit rate, it should dip 2–5% during rebalancing and recover within 15–30 minutes.
The L7/L8 framing: An L7 candidate treats monitoring not as an afterthought but as a design constraint. They ask: "How do we detect that the cache is degrading before users feel it?" This is a capacity planning and SLO question, not an alerting question. The answer involves defining error budgets: at a 99.9% hit-rate SLO, we have a budget of 0.1% misses per time window before we're in breach territory.
How the pieces connect
Every architectural decision in this article traces back to a small number of observations established early on. The chain is worth assembling in one place:
1. NFR: read latency < 1 ms (§2) → all hot-path reads must avoid disk I/O → in-memory primary store with optional WAL persistence (§7) → L1 + L2 cache layers in the architecture (§4)
2. Scale: 1M QPS, 10 TB data (§2, §3) → no single node can hold the working set → consistent hashing with virtual nodes for partitioning (§5) → rebalancing strategy when nodes are added (§9)
3. Cache miss rate drives DB load (§3 insight: 90% hit rate = 10× DB reduction) → miss handling becomes critical → cache stampede prevention via mutex + jittered TTLs (§6, §8) → hot key detection and L1 promotion (§9)
4. NFR: availability 99.99% (§2) → human failover is too slow → automatic leader election via coordinator (etcd/Zookeeper) (§4) → replication factor ≥ 3 with async replica promotion (§9)
5. Durability NFR differs by variant (§2) → cache: write-behind acceptable → persistent store: write-through or WAL mandatory → caching strategy choice drives write latency vs durability tradeoff (§8)
6. Single-region design (§4) → global user base → multi-region active-active with async replication (§9) → conflict resolution strategy required (LWW vs CRDT) (§11)
- Rate Limiter System Design — atomic Redis operations, distributed race conditions, and multi-tier quota enforcement
- URL Shortener System Design — hash-based ID generation, Redis caching strategy, and async analytics pipeline
- Web Crawler System Design — Bloom filter deduplication, politeness throttling, and distributed frontier design
- Twitter Feed System Design — fan-out write amplification, hybrid push/pull strategy, and celebrity threshold design
- Notification Service System Design — multi-channel delivery, idempotency keys, and priority queues at scale
- Search Autocomplete
- Chat System (WhatsApp) System Design — WebSocket architecture, message delivery guarantees, and fan-out