Design a Top-K Leaderboard
Simple to query, hard to keep fresh — ranking millions of users in real time is a deceptively precise engineering problem.
What the interviewer is testing
The leaderboard question looks narrow at first glance: fetch the top K users by score. The constraint that makes it interesting is freshness: the score window is live, updates arrive in bursts, and the top K is queried by thousands of concurrent viewers. The question probes whether you understand why a naïve database query falls apart under concurrent writes, and what it takes to return a ranking in under 50 ms while those writes keep landing.
Interviewers use this question to probe three distinct capabilities that map almost perfectly onto engineering level:
| Level | Core probe | What success looks like |
|---|---|---|
| L3/L4 | Data structure choice | Reaches for Redis sorted sets without prompting; understands ZINCRBY + ZREVRANGE and why a SQL ORDER BY score DESC LIMIT 100 doesn't scale for sub-second reads under high write concurrency |
| L5 | Write-path design | Separates event ingestion from state mutation; introduces a message queue (Kafka) as a buffer, explains micro-batch aggregation and why it dramatically lowers peak write pressure on Redis |
| L6 | Time-windowed rankings | Designs daily/weekly/all-time windows without full rebuilds; explains the sliding window vs. bucket approach and handles the boundary condition correctly |
| L7/L8 | Accuracy vs. scale tradeoff | Introduces Count-Min Sketch for billion-user scale where a full sorted set exceeds RAM budget; quantifies the error bound, demonstrates when approximate is acceptable, and discusses consistency guarantees at the event bus level |
Note: L6 is not a separate architecture from L5; it's the multi-tenant, multi-window extension of the same design. An L6 candidate can articulate when to introduce complexity, not just how.
The question forks early on scope: "Do you need exact ranks, or is approximate top-K acceptable?" An L7/L8 candidate introduces that fork themselves, before the interviewer asks. At L3/L4, answering "exact is fine" is correct: the question is whether you know how to achieve it efficiently.
Requirements clarification
Functional requirements
| Requirement | Detail |
|---|---|
| Score updates | Users earn points from discrete events (game completion, purchase, ad click). Each event increments a user's score by a variable amount. |
| Top-K read | Return the top K users by score for a given leaderboard (e.g., "global", "weekly", "country:US"). K is typically 10–100; the system must support K up to 1,000. |
| User rank lookup | Given a user ID, return their current rank and score. This is a frequent call — every user checks their own position. |
| Time-windowed boards | Separate leaderboards for today, this week, this month, all-time. A score event contributes to all active windows simultaneously. |
| Multiple leaderboards | The platform runs O(100) independent leaderboards (per-game, per-region, tournament). Each is independent. |
| Privacy | Public leaderboards display anonymised or pseudonymous display names only; full user data (email, real name) is never exposed. An authenticated rank lookup may return the requesting user's own full data. L6+ |
Non-functional requirements
| NFR | Target | Why |
|---|---|---|
| Read latency (top-K fetch) | p99 < 50 ms | Leaderboard panels are rendered on page load and in live-updating widgets; human perception of "instant" is roughly 100 ms |
| Write throughput | 50,000 events/s sustained; 200,000 events/s peak (game-end burst) | Score events cluster at round boundaries (end of match, daily reset); the system must absorb the burst without dropping updates |
| Rank freshness | Board reflects events within 5 seconds (eventual consistency) | Real-time exact consistency adds ~10× write overhead; 5-second lag is imperceptible to users but reduces infrastructure cost significantly |
| Availability | 99.9% (43.8 min downtime/month) | Leaderboard outage is degraded UX, not data loss; 99.99% is over-engineered for this feature |
| Durability | Zero score loss | Score events are financial or competitive signals; losing them causes trust breakdowns |
| Scale | 500 M active users; up to 2 B total registered | RAM budget for a full sorted set is bounded; this scale forces the approximate top-K discussion |
NFR reasoning
50 ms p99 read latency §4, §8 ›
50 ms p99 is an aggressive target for a cross-datacenter service (typical inter-AZ RTT is 1–5 ms; cross-region is 60–120 ms). It requires the hot path to be served entirely from memory — no disk reads on the critical path. This rules out a pure database query for top-K, even with optimal indexing, at high concurrency.
5-second rank freshness (eventual consistency) §6 ›
Strong consistency — every event synchronously visible in the sorted set before the write returns — is achievable, but it limits write throughput to roughly 100k–200k synchronous writes/s, the ceiling imposed by per-request network RTT rather than by Redis's internal command throughput. With a 200k events/s peak burst, that margin disappears. Five seconds of lag is set by the product team as the maximum tolerable staleness; lower is better, not required.
Zero score loss (durability) §6, §10 ›
Score durability is stricter than ranking freshness: the score event must survive a service restart even if the rank update is delayed. This means the durable write (to a relational database or Kafka with replication) must happen before the event is acknowledged to the client. The Redis update can be eventually consistent; the Kafka record cannot be lost.
Capacity estimation
Leaderboard estimation has an unusual asymmetry: reads are heavy but cheap (sorted set range queries are O(log N + K)), while writes are moderate but bursty and cluster at predictable intervals. Storage has an unusual constraint: the primary serving store is fully in-memory, so RAM — not disk — is the binding capacity variable. The main question to resolve is whether the user population fits entirely in Redis memory.
The Redis memory cliff. A single sorted set entry occupies ~64 bytes when the member is stored as a Redis integer (8B member + 8B score + skip-list overhead). However, string user IDs like "u_7f3a9c" increase this to 100–120 bytes/entry due to the additional robj + sdshdr allocation. At 500 M active users with string IDs that's roughly 50–60 GB — still single-node viable on a 128 GB Redis instance, but the headroom is tighter than the integer case. At 2 B registered users it reaches 200–240 GB, firmly requiring a cluster or approximate counting. This is the architectural fork that separates L5 from L7/L8 answers: knowing exactly when sharding or approximation becomes necessary, and why.
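A quick back-of-envelope version of that arithmetic; the per-entry byte costs are the assumed figures above, not measured values:

PER_ENTRY_BYTES = {"integer member": 64, "string member": 110}   # ~100-120 B for string IDs

for users in (500_000_000, 2_000_000_000):
    for kind, per_entry in PER_ENTRY_BYTES.items():
        gb = users * per_entry / 1e9
        print(f"{users // 1_000_000} M users, {kind}: ~{gb:.0f} GB")

# 500 M string-member entries ≈ 55 GB (fits a 128 GB node with headroom);
# 2 B string-member entries ≈ 220 GB (forces a cluster or approximation)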
High-level architecture
Before discussing components, the dominant observation from §3 shapes the entire architecture: reads massively outnumber writes, but writes cluster in bursts that can overwhelm a naïve synchronous update path. The architecture separates the write ingestion path from the read serving path, with a stream as the decoupling boundary.
Component breakdown
Load Balancer routes client traffic to two separate service pools (Score API for writes, Read API for reads), enabling each pool to scale independently. Both pools are stateless; all state lives in Redis and Postgres.
Score API is the write-path gateway. It validates the score event, publishes it to a durable message queue (Kafka) with acks=all, and returns 202 Accepted immediately. It never writes directly to Redis — that keeps the write path's latency independent of Redis's single-threaded command execution.
Kafka (score events topic) is the durability boundary. Once a score event is committed to Kafka (replication factor 3, acks=all), it is guaranteed not to be lost even if every downstream service crashes. This decouples durability from availability of the consumer.
Aggregator Consumer is a Kafka consumer that reads score events in micro-batches (every 1–2 seconds), aggregates increments per user per leaderboard within the batch, and issues a single ZINCRBY call per user to Redis rather than one per event. When many players finish the same match simultaneously, a 1–2 second batch collapses 200k events into 20k–100k unique users — cutting ZINCRBY calls by 2–10× without any loss of accuracy.
Redis Sorted Set (ZSET) is the primary serving layer for all reads. ZINCRBY is O(log N) for writes; ZREVRANGE is O(log N + K) for top-K reads; ZREVRANK is O(log N) for user rank lookups. All operations are sub-millisecond even at 100M-user scale.
Postgres (score of record) stores the canonical per-user score, used for recovery (rebuilding the Redis ZSET after a full flush), auditing, and time-windowed queries that don't fit in Redis. It is not on the hot read path.
Architectural rationale
Why Kafka instead of writing directly to Redis? durability ›
Redis is a primary-replica system: writes go to a single primary and replicate asynchronously. Under a burst of 200k events/s, direct Redis writes add latency to the Score API response (Redis processes commands single-threaded). If Redis restarts, in-flight events are lost. Kafka provides a durable, replayable event log: the Score API returns as soon as Kafka acknowledges the write, and the aggregator consumer can replay events from any offset if Redis needs reconstruction.
Why separate Score API and Read API into distinct service pools? scalability ›
Score events and leaderboard reads have completely different traffic profiles: writes spike at game-end boundaries; reads are steady with a multiplier during tournament broadcasts. Separate pools allow each to autoscale independently — a game-end burst scales the Score API, not the Read API; a livestream audience scales the Read API, not the Score API. They also have different SLAs: writes are fire-and-forget (202 Accepted); reads need sub-50ms p99.
Why Postgres as a secondary store if Redis is the hot path? durability + recovery ›
Redis sorted sets are durable with AOF/RDB snapshots, but reconstructing a sorted set from scratch after a large node failure is slow (minutes to hours at 100M-user scale). Postgres stores the latest materialised score per user and acts as the source for a fast COPY-based bulk reload. It also enables historical queries (score at a given timestamp) and audit trails that Redis doesn't support natively.
Real-world comparison
| Decision | Our design | Riot Games | Steam |
|---|---|---|---|
| Primary serving store | Redis ZSET | Redis Cluster (ZSET) | Custom in-memory rank engine |
| Write path buffering | Kafka → aggregator consumer | Internal event bus → batch writer | Direct write with coalescing |
| Rank freshness | ~5s eventual | ~10s eventual (League ladder) | Real-time for top-1000, hourly for rest |
| Scale tier | 100M–500M users | ~150M accounts (LoL) | ~120M MAU |
| Approximate counting | Count-Min Sketch at 2B+ users | Not published | Not published |
Riot's published architecture uses Redis for ranked ladder serving with a custom fan-out system for LP (League Points) changes. The key lesson: even at large scale, a Redis sorted set serves as the canonical ranking layer; the question is how writes are managed upstream of it.
Core algorithm — ranking approaches
Ranking K users out of N is a well-studied problem with a wide spectrum of tradeoffs between exactness, memory, and latency. The right approach is determined by the total user population versus available RAM — a question that only materialises at 500M+ user scale, but one that distinguishes senior from staff-level answers.
Our choice for this system: Redis sorted sets for the active user population (up to ~500M), with Count-Min Sketch as a discussed extension once the population exceeds the RAM budget. Sharded sorted sets appear in the scalability deep-dive as an exact alternative if approximate ranks are unacceptable at 2B+ users.
Count-Min Sketch implementation sketch advanced — L7/L8 ›
A Count-Min Sketch maintains d hash functions and a w × d counter matrix. For each score event, increment d counters (one per hash function row). The estimated count for a user is the minimum across all d rows — this minimum bounds over-count error to at most ε × (total event volume) with probability 1 − δ, where w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉. For 1% error at 99% confidence on 2B events: w ≈ 272, d ≈ 5 — a 272×5 = 1360 counter matrix, a few KB regardless of user population. A min-heap of K candidates is maintained separately, evicting the lowest-score entry when a new candidate exceeds the heap minimum.
The error bound is larger than it looks. The ε in Count-Min Sketch is a fraction of total events processed, not of any one user's score. At ε = 1% with 2B total events in the system, a user's estimated score can be overcounted by up to 20M points — enough to push a borderline user to rank #1. For leaderboards, choose ε relative to a meaningful score gap: ε = 1 / (K × max_delta_per_event) ensures the maximum overcount is smaller than the score difference that would shift rank K.
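A minimal Python sketch of the structure described above, under simplifying assumptions: hashing uses blake2b with a per-row salt, candidate eviction is a simple minimum scan rather than a true heap, and the parameter defaults mirror the numbers quoted earlier.

import hashlib
import heapq
import math

class TopKCountMin:
    """Approximate top-K tracker: a Count-Min Sketch for score estimates plus
    a bounded candidate set for the current top-K."""
    def __init__(self, epsilon=0.01, delta=0.01, k=100):
        self.w = math.ceil(math.e / epsilon)      # columns: overcount <= epsilon * total volume
        self.d = math.ceil(math.log(1 / delta))   # rows: holds with probability 1 - delta
        self.counters = [[0] * self.w for _ in range(self.d)]
        self.k = k
        self.candidates = {}                      # user_id -> estimated score

    def _cells(self, user_id):
        for row in range(self.d):
            digest = hashlib.blake2b(f"{row}:{user_id}".encode(), digest_size=8).digest()
            yield row, int.from_bytes(digest, "big") % self.w

    def add(self, user_id, delta):
        cells = list(self._cells(user_id))
        for row, col in cells:
            self.counters[row][col] += delta
        estimate = min(self.counters[row][col] for row, col in cells)
        self.candidates[user_id] = estimate
        if len(self.candidates) > self.k:         # evict the weakest candidate
            weakest = min(self.candidates, key=self.candidates.get)
            del self.candidates[weakest]

    def top_k(self):
        return heapq.nlargest(self.k, self.candidates.items(), key=lambda kv: kv[1])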
Redis ZSET operations: code reference implementation detail ›
# A runnable sketch assuming redis-py; batch_scores, today, k, and user_id are
# supplied by the surrounding aggregator / API code.
import redis

r = redis.Redis(decode_responses=True)

# Increment user scores on an event (aggregated batch flush)
# pipe.zincrby is O(log N); pipelining cuts round trips for bulk updates
with r.pipeline() as pipe:
    for user_id, delta in batch_scores.items():
        pipe.zincrby("lb:global", delta, user_id)
        pipe.zincrby(f"lb:daily:{today}", delta, user_id)
    pipe.execute()

# Top-K read: ZREVRANGE returns members sorted high-to-low
# withscores=True returns (member, score) tuples
top_k = r.zrevrange("lb:global", 0, k - 1, withscores=True)

# User rank lookup: ZREVRANK is 0-indexed, so +1 for display
rank = r.zrevrank("lb:global", user_id)   # None if the user is not in the set
score = r.zscore("lb:global", user_id)
One open question: "How do you handle ties at rank K?" A Redis sorted set orders members with equal scores lexicographically, and ZREVRANGE reverses that order, so of two users with the same score, the one with the lexicographically greater user ID ranks higher. This is deterministic but arbitrary; interviewers often ask about it, and the answer matters for tournament fairness. A secondary sort key (e.g., the timestamp of achieving the score) can be encoded into the score as a fractional offset.
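One workable variant of that tiebreaker, sketched below. A fractional offset fine enough to separate millisecond timestamps falls below double precision once integer scores reach the millions, so this sketch packs the tiebreaker into the integer part of the score instead and writes it with ZADD as an absolute value (a packed composite does not survive plain ZINCRBY deltas). The bit split is an assumption: minute resolution over ~127 years leaves roughly 134 M as the maximum representable points value.

# Packs (points, achieved_at_minute) into one integer below 2**53, so the ZSET
# double stores it exactly. achieved_at_minute counts minutes since a fixed
# product epoch and must stay below TIME_SLOTS.
TIME_SLOTS = 2 ** 26                                 # ~127 years of minutes

def composite_score(points: int, achieved_at_minute: int) -> int:
    tiebreak = TIME_SLOTS - 1 - achieved_at_minute   # earlier achievement wins the tie
    return points * TIME_SLOTS + tiebreak

def display_points(composite: int) -> int:
    return composite // TIME_SLOTS                   # strip the tiebreak for display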
API design
The leaderboard API has two core write endpoints and two core read endpoints. The write side is fire-and-forget (202 Accepted); the read side is synchronous with strict latency requirements.
POST /score-events — submit a score event
// Request
POST /score-events
Content-Type: application/json
Authorization: Bearer <api_key>
{
"leaderboard_id": "global",
"user_id": "u_7f3a9c",
"delta": 150, // score increment (must be positive)
"event_type": "match_win",
"idempotency_key": "match_42819_u_7f3a9c" // caller-provided
}
// Response — 202 Accepted (event durably queued, not yet ranked)
{
"event_id": "ev_9kLm2nQ",
"status": "accepted",
"expected_visible_in_ms": 5000
}
Input validation: delta must be a positive integer (no score decrements via this endpoint — score resets are a separate admin operation); leaderboard_id must exist; idempotency_key is used to deduplicate retried events at the Kafka consumer layer — writing the same key twice results in one score increment, not two.
GET /leaderboard/:id/top — fetch top-K entries
// Request
GET /leaderboard/global/top?k=100&offset=0
Authorization: Bearer <api_key>
// Response — 200 OK
{
"leaderboard_id": "global",
"generated_at": "2026-04-21T09:55:00Z",
"total_entries": 100000000,
"entries": [
{ "rank": 1, "user_id": "u_4a9f1b", "score": 98420, "display_name": "Nyx" },
{ "rank": 2, "user_id": "u_7f3a9c", "score": 97815, "display_name": "Zephyr" }
]
}
GET /leaderboard/:id/users/:user_id — user rank lookup
// Response
{
"user_id": "u_7f3a9c",
"leaderboard_id": "global",
"rank": 2,
"score": 97815,
"percentile": 99.9998 // computed: (total - rank + 1) / total * 100
}
Optional endpoints by level
| Endpoint | Purpose | Level |
|---|---|---|
| GET /leaderboard/:id/top?window=daily | Time-windowed top-K | L5 |
| GET /leaderboard/:id/users/:id/neighbours | Users ranked N±5 around the querying user (context view). Implementation: two Redis calls — ZREVRANK to get the caller's 0-indexed rank r, then ZREVRANGE(r−5, r+5). Total cost O(log N); returns up to 11 entries. | L5 |
| DELETE /leaderboard/:id/users/:id | Remove a user (ban/account deletion) | L6 |
| POST /leaderboard/:id/reset | Wipe period window for scheduled resets | L6 |
| GET /leaderboard/:id/stats | Score distribution, p50/p95/p99 score, total entries | L7/L8 |
| PATCH /leaderboard/:id/users/:id/score | Admin score correction — sets score to an absolute value when a game server bug misbills points. Requires elevated auth; every call emits an audit event to Kafka. Implementation: pause the aggregator consumer for this user (or use a Lua script to atomically compare the current score), then ZADD with XX flag (update only if member exists) + Postgres upsert + Kafka correction event. Without the pause, an in-flight ZINCRBY from the aggregator can overwrite the corrected value. | L5/L6 |
Core flow: the score update path
The durability NFR from §2 — zero score loss — drives the key design decision: the Score API must acknowledge only after the event is written to a durable store. The freshness NFR — 5-second lag — means Redis does not need to be updated synchronously. This produces the async write path shown below.
Why 202 Accepted, not 200 OK?
Returning 202 explicitly signals to the caller that the event has been durably accepted but the rank effect is not yet visible. This aligns with the eventual consistency model: a client that polls the leaderboard immediately after a 202 may not see the new rank, and that is correct behaviour, not a bug. A 200 OK would imply synchronous processing — semantically incorrect if the rank update is asynchronous.
Idempotency key mechanics
On first receipt, the Score API atomically stores the idempotency key in Redis (SET NX with a 24h TTL). If the key already exists, this is a retry: the API returns the original 202 response immediately and skips the Kafka publish entirely.
The SET NX atomicity matters: a naïve read-then-write pattern would allow two concurrent retries to both see a missing key, both publish to Kafka, and produce a double score increment. The key is also forwarded in the Kafka message header as a secondary guard, so the aggregator can drop duplicates within a batch window even if an event somehow slips through.
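A sketch of that guard on the Score API side, assuming redis-py and confluent-kafka; the handler shape and response payload are illustrative, not the article's API contract.

import json
import redis
from confluent_kafka import Producer

r = redis.Redis()
producer = Producer({"bootstrap.servers": "kafka:9092", "acks": "all"})

def accept_score_event(event: dict) -> dict:
    idem_key = f"idem:{event['idempotency_key']}"
    # SET NX is the atomic claim: only the first request for this key wins.
    first_claim = r.set(idem_key, "accepted", nx=True, ex=86_400)   # 24 h TTL
    if not first_claim:
        return {"status": "accepted", "duplicate": True}            # retry: no re-publish
    producer.produce(
        "score-events",
        key=event["leaderboard_id"].encode(),
        value=json.dumps(event).encode(),
        headers=[("idempotency_key", event["idempotency_key"].encode())],
    )
    producer.flush()          # block until the broker acks: the durability boundary
    return {"status": "accepted", "duplicate": False}

Calling flush() per request is the simplest way to honour the durability-before-ack rule; a real service would rely on delivery callbacks so the producer is not serialised on every request.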
The aggregator's batching strategy doubles as deduplication: if the same user appears 10 times in a 1-second batch, their scores are summed and a single ZINCRBY is issued. This is only safe because score events are additive increments — if events were absolute score assignments, batching would require a max() merge strategy instead.
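A sketch of the aggregator's micro-batch loop, assuming confluent-kafka and redis-py; the topic name, batch window, and key layout are illustrative. Offset commits are deferred until after the Redis flush, so a crash in between replays the batch (at-least-once); a production system would additionally check the idempotency keys in the headers against a durable store to make the replay safe.

import json
import time
from collections import defaultdict

import redis
from confluent_kafka import Consumer

r = redis.Redis()
consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "leaderboard-aggregator",
    "enable.auto.commit": False,             # commit only after the Redis flush succeeds
})
consumer.subscribe(["score-events"])

BATCH_WINDOW_S = 1.0

while True:
    deadline = time.monotonic() + BATCH_WINDOW_S
    batch = defaultdict(int)                 # (zset_key, user_id) -> summed delta
    seen_keys = set()                        # idempotency keys seen in this batch
    while time.monotonic() < deadline:
        msg = consumer.poll(timeout=0.1)
        if msg is None or msg.error():
            continue
        headers = dict(msg.headers() or [])
        idem = headers.get("idempotency_key")
        if idem is not None:                 # secondary dedup guard within the batch
            if idem in seen_keys:
                continue
            seen_keys.add(idem)
        event = json.loads(msg.value())
        day = time.strftime("%Y-%m-%d")
        for zset_key in (f"lb:{event['leaderboard_id']}",
                         f"lb:{event['leaderboard_id']}:daily:{day}"):
            batch[(zset_key, event["user_id"])] += event["delta"]
    if not batch:
        continue
    with r.pipeline() as pipe:               # one ZINCRBY per (board, user), not per event
        for (zset_key, user_id), delta in batch.items():
            pipe.zincrby(zset_key, delta, user_id)
        pipe.execute()
    consumer.commit()                        # a crash before this line replays the batch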
Data model
The leaderboard involves three distinct entities: score events (ephemeral, streaming), user scores (the materialised aggregate), and leaderboard metadata (configuration). Each entity is used very differently, and those differences shape how we store them.
Access patterns
| Operation | Frequency | Query shape |
|---|---|---|
| Append score event | 50k–200k/s | Insert single row; no read |
| Fetch top-K users | High (30× writes) | Range scan by score desc, LIMIT K |
| Lookup user rank + score | High (per-user on load) | Point lookup by user_id in sorted structure |
| Bulk score rebuild (recovery) | Rare | Full table scan of user_scores, sorted by score |
| Time-windowed top-K | Medium | Range scan restricted to score events in a time window |
Two structural constraints follow from these access patterns. First, the top-K and user rank operations both need a sorted structure — a B-tree index on score supports ORDER BY but degrades under concurrent increments (lock contention). Redis sorted sets handle both operations natively. Second, score events are write-once in Kafka and never updated; their only consumer is the aggregator, which drains them in order. Querying them for time-windowed boards requires a separate materialised view rather than scanning the raw event log.
Schema
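A plausible minimal sketch of the two Postgres tables referenced elsewhere in this design (leaderboards + user_scores); column names and types are assumptions, not canonical DDL.

CREATE TABLE leaderboards (
    leaderboard_id  text PRIMARY KEY,                      -- 'global', 'country:US', ...
    display_name    text NOT NULL,
    window_type     text NOT NULL DEFAULT 'all_time',      -- 'daily' | 'weekly' | 'all_time'
    created_at      timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE user_scores (
    leaderboard_id  text   NOT NULL REFERENCES leaderboards (leaderboard_id),
    user_id         text   NOT NULL,
    score           bigint NOT NULL DEFAULT 0,             -- integer deltas: no float drift
    updated_at      timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (leaderboard_id, user_id)
);

-- Supports the rare bulk-rebuild scan; it does not serve the hot top-K read path.
CREATE INDEX user_scores_by_score ON user_scores (leaderboard_id, score DESC);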
Why bigint for score instead of float? schema detail ›
Redis ZSET scores are 64-bit IEEE 754 doubles, which can represent integers exactly up to 2⁵³ (~9 quadrillion). Storing game scores as integer deltas (100, 250, 500) keeps sums exact with no floating-point drift. Only if a secondary sort criterion needs to be packed as a fractional offset (e.g., earlier achievement time encoded as 0.0000001 × unix_ts) does float precision become relevant — and that's a deliberate encoding choice, not accidental.
Time-windowed key strategy in Redis L5+ design ›
A daily leaderboard is a separate ZSET key: lb:global:daily:2026-04-21. Score events contribute to both the all-time ZSET and the relevant daily/weekly ZSET in the same pipeline batch. At midnight, the previous day's key is snapshotted (ZRANGEBYSCORE → archive) and then deleted. This keeps RAM usage bounded: only the current window's ZSET is live.
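A sketch of that rollover, reusing the r client from the earlier snippet and assuming a canonical alias key that readers query (an alternative to computing date-stamped keys client-side); COPY requires Redis 6.2+, and the staging key is assumed to have been pre-seeded before midnight.

import datetime

today = datetime.date.today()
yesterday = today - datetime.timedelta(days=1)
canonical = "lb:global:daily:current"                       # the key readers query
staging = f"lb:global:daily:{today.isoformat()}:staging"    # pre-seeded before midnight
archive = f"lb:global:daily:{yesterday.isoformat()}:final"  # frozen copy of the closing window

r.copy(canonical, archive, replace=True)   # snapshot the old board for archival / settlement
r.rename(staging, canonical)               # atomic swap: readers never observe an empty board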
Caching strategy
Redis sorted sets are already in-memory, which makes this system unusual: the "cache" and the "database" for ranking are the same store. The caching question is not whether to cache, but what to layer on top of Redis to absorb read QPS beyond what a single primary can handle.
| Layer | What it caches | TTL | Why it exists | Invalidation |
|---|---|---|---|---|
| CDN / edge cache | Top-100 snapshot for the n most popular leaderboards | 5s | Livestream events can send 10k+ clients polling simultaneously. CDN absorbs this burst; Redis never sees it. | Passive TTL expiry. No explicit purge — 5s lag is acceptable per freshness NFR. |
| In-process (Read API) | ZREVRANGE result for top-K per leaderboard | 1s | Each Read API pod can receive thousands of req/s. A 1-second in-process cache collapses concurrent requests to one Redis call per pod per second. | TTL only. Short enough that stale reads are bounded to 1s above the 5s freshness SLA. |
| Redis ZSET (primary) | All user scores in the active window | No TTL (managed by window reset) | This is the hot serving store, not a cache — it's the source of truth for ranking. All top-K reads and rank lookups are served here. | Writes via ZINCRBY (aggregator). Window resets delete the key and re-seed from Postgres. |
| Redis read replicas | Same ZSET data, read-only | N/A | Routes read QPS off the primary; replication lag is typically <100ms, well within the 5s freshness SLA. | Sync'd from primary automatically. |
Read-path flow
The 50 ms p99 latency NFR from §2 is what forces layers 1–2 to exist: Redis alone would still meet the latency target at high read QPS, but the absolute Redis command count becomes the ceiling. The read path works through each cache layer in sequence, short-circuiting on the first hit.
Cache stampede at window reset. When a daily leaderboard resets at midnight, the cached top-100 in the CDN and in-process layers expires simultaneously. If 50k clients poll at reset time, they all miss the cache and fan out to Redis. Mitigations: stagger TTL with a small jitter (+0–500ms), pre-warm the ZSET from Postgres before invalidating the old window's cache, and use probabilistic early expiration (PER) in the CDN to avoid a hard cliff.
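A minimal sketch of the in-process layer with jittered TTLs, reusing the r client from earlier; single-flight collapsing of concurrent misses is omitted for brevity.

import random
import time

_cache = {}   # (board, k) -> (expires_at, entries)

def top_k_cached(board: str, k: int):
    now = time.monotonic()
    hit = _cache.get((board, k))
    if hit and hit[0] > now:
        return hit[1]
    entries = r.zrevrange(f"lb:{board}", 0, k - 1, withscores=True)
    ttl = 1.0 + random.uniform(0.0, 0.5)      # jitter so pods don't all expire together
    _cache[(board, k)] = (now + ttl, entries)
    return entries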
Scalability deep-dive
The baseline architecture scales comfortably to ~500M active users on a single Redis primary with read replicas. Beyond that, three bottlenecks emerge: Redis memory, write throughput, and merge complexity for global rankings across geographic shards. Each is addressed separately.
Redis Cluster sharding for 2B+ users L5+ ›
Redis Cluster distributes 16,384 hash slots across N primaries. For a global leaderboard, all users hash into one logical ZSET — but a single ZSET cannot be split across cluster nodes (a cross-slot ZSET doesn't exist). Instead, range-partition the user population: shard 0 owns users 0–499M, shard 1 owns 500M–999M, etc. Each shard maintains its own ZSET. Top-K reads fan out to all shards, each returns its local top-K, and a merge layer heaps the results.
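A sketch of that fan-out and merge, where shard_clients is an assumed list of per-shard Redis clients.

import heapq

def global_top_k(board: str, k: int, shard_clients):
    # Each shard returns its own local top-K; the global top-K lies within their union.
    local = [c.zrevrange(f"lb:{board}", 0, k - 1, withscores=True) for c in shard_clients]
    return heapq.nlargest(k, (entry for shard in local for entry in shard),
                          key=lambda entry: entry[1])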
Kafka partition strategy for write throughput L5 ›
Kafka partitions are the parallelism unit. Partitioning by leaderboard_id ensures events for a given leaderboard always go to the same partition: the aggregator consumer for that partition accumulates increments without cross-partition coordination. At 200k events/s across 100 leaderboards, each partition sees ~2k events/s — well within Kafka's per-partition throughput ceiling (~100k events/s).
Geo-distributed leaderboards L7/L8 ›
A global leaderboard across multiple regions has three options: (1) single-region primary with cross-region reads (simple, adds latency for non-primary regions); (2) region-local ZSETs with a periodic global merge scheduled job (approximate; global rank may lag by minutes); (3) CRDTs — Redis CRDT sorted sets available in Redis Enterprise propagate increments asynchronously across regions and converge without central coordination. Option (3) is the principled answer at global scale but requires Redis Enterprise, not open-source Redis.
Approximate top-K with Count-Min Sketch at 2B+ users L7/L8 ›
At 2B users × 64–120 bytes per entry, a single leaderboard window needs 128–240 GB of RAM: realistic but expensive on a single node. A Count-Min Sketch (described in §5) replaces the sorted set for the score estimation layer, while a min-heap of K candidates tracks the current approximate top-K. The sketch fits in a few KB of RAM regardless of user count. A secondary exact-count layer (sorted set or Postgres index) can be applied only to the top-K candidate set (~1,000 users), giving exact ordering within the approximate top-K.
Failure modes & edge cases
| Scenario | Problem | Solution | Level |
|---|---|---|---|
| Redis primary restart | In-memory ZSET lost if persistence is disabled or snapshot is stale. Rank reads return empty. | Enable AOF persistence (appendfsync everysec). On restart, the aggregator detects an empty ZSET via a sentinel key and triggers a bulk reload from Postgres (COPY → pipelined ZADD). During the rebuild window (~30–60s), reads are served from Redis read replicas (which survive a primary crash) or from the stale in-process and CDN cache; both remain within the 5s freshness SLA. Falling back directly to Postgres for top-K reads is not viable: an ORDER BY score DESC LIMIT K on a 500M-row table takes seconds, violating the 50ms p99 NFR. Postgres is only consulted if all Redis nodes are lost simultaneously, a scenario that warrants a full incident response rather than automatic read fallback. | L3/L4 |
| Aggregator consumer lag | Kafka consumer falls behind during a burst; ranks lag beyond the 5-second SLA. | Monitor Kafka consumer group lag. Set alerting threshold at 30 seconds behind. Auto-scale the aggregator consumer group horizontally (add pods). Partition count must be ≥ max consumer count. | L5 |
| Duplicate score events | Network retry causes the same event to be delivered twice; user score inflated. | Caller provides idempotency key. Score API stores key in Redis with 24h TTL; duplicates short-circuit with the original 202 instead of a second publish. Aggregator deduplicates on idempotency key within the batch window. | L5 |
| Hot leaderboard write skew | A viral tournament sends 95% of writes to one Kafka partition. Single aggregator consumer becomes a bottleneck. | Detect hot partitions via lag monitoring. Sub-shard the Kafka key for hot leaderboards (lb_id + user_id % N sub-shards). Aggregator consumers merge sub-shard results before flushing to Redis. | L6 |
| Window reset atomicity | At midnight, the daily ZSET is deleted and re-seeded. A reader between delete and re-seed sees an empty leaderboard. | Use a two-phase reset: (1) pre-populate the new window's ZSET key in the background before the reset time; (2) atomically rename it into the canonical key (Redis RENAME is atomic). Old key is deleted after rename. Readers always see either old or new, never empty. | L6 |
| Cache stampede at reset | All CDN caches expire simultaneously at the reset boundary. 50k concurrent clients miss cache and hit Redis. | Use probabilistic early expiration (PER): re-compute cached top-100 while the old cache is still valid, using a random-jitter check. Alternatively, pre-warm CDN by fetching and caching the new leaderboard 10 seconds before reset. | L7/L8 |
| Score manipulation / cheating | A compromised game client sends fraudulent score events inflating a user's score. | Rate-limit score events per user per time window (e.g., max 1,000 events/minute). Flag statistical outliers (score variance beyond 5σ of historical delta distribution). Events above the threshold are quarantined in a separate Kafka topic for manual audit before being applied. | L7/L8 |
| Game server mis-awards score | A bug in the game server awards too many (or too few) points before the error is caught. The incorrect score is now live in Redis and Postgres. | Admin calls PATCH /leaderboard/:id/users/:id/score with the correct absolute score. The aggregator consumer is quiesced for this user during the correction window (or a Lua script is used for an atomic compare-and-set) to prevent an in-flight ZINCRBY from overwriting the corrected value. Implementation: Redis ZADD with XX flag + Postgres upsert + Kafka correction event. A diff record (old score, new score, reason, admin ID) is written to an immutable audit log before any mutation is applied. | L5/L6 |
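A sketch of the Postgres-to-Redis rebuild from the first row above, assuming psycopg2 and the user_scores table sketched in the data-model section; the COPY-based export mentioned there is approximated here with a server-side cursor, which streams rows without loading the table into memory.

import psycopg2
import redis

r = redis.Redis()
conn = psycopg2.connect("dbname=leaderboard")

with conn, conn.cursor(name="lb_rebuild") as cur:     # named cursor: streams, no full fetch
    cur.itersize = 50_000
    cur.execute("SELECT user_id, score FROM user_scores WHERE leaderboard_id = %s",
                ("global",))
    pipe = r.pipeline(transaction=False)
    for i, (user_id, score) in enumerate(cur, start=1):
        pipe.zadd("lb:global", {user_id: score})
        if i % 10_000 == 0:
            pipe.execute()                             # flush in chunks to bound memory
            pipe = r.pipeline(transaction=False)
    pipe.execute()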
How to answer by level
L3/L4 SDE I / SDE II ›
What good looks like
- Reaches for Redis sorted sets immediately; names ZINCRBY and ZREVRANGE without prompting
- Explains why SQL ORDER BY doesn't satisfy the latency NFR under write concurrency
- Designs a correct two-table schema (user_scores + leaderboards) in Postgres
- Mentions read replicas to scale read QPS off the primary
What separates L5 from L3/L4
- L3/L4 writes directly to Redis on the score event; L5 introduces Kafka as a buffer and explains why (burst absorption + durability)
- L3/L4 says "Redis is fast enough for the writes"; L5 quantifies write QPS and shows when a single Redis primary saturates
- L3/L4 doesn't consider time-windowed boards or asks the interviewer to skip them
- L3/L4 rarely considers idempotency unprompted; L5 designs the full end-to-end deduplication flow (API key → Kafka header → aggregator batch)
L5 Senior SDE ›
What good looks like
- Designs the Kafka → aggregator → Redis pipeline and explains the batch-aggregation optimisation (20k ZINCRBY calls instead of 200k)
- Handles time-windowed boards with separate ZSET keys per window and an atomic RENAME reset
- Designs the idempotency key flow end-to-end (API → Redis check → Kafka header → aggregator dedup)
- Handles the Redis restart recovery path: sentinel key + Postgres bulk reload
- Proposes in-process caching + CDN to absorb read bursts without touching Redis
What separates L6 from L5
- L5 designs one leaderboard; L6 designs a multi-tenant leaderboard platform with 100 independent boards, per-board Kafka partitioning, and hot-partition detection
- L5 uses a single Redis primary; L6 knows exactly when and how to shard it
- L5 notes the cache stampede problem; L6 implements PER or a pre-warm pipeline and explains why each is better in which scenario
L6 Staff SDE ›
What good looks like
- Owns the multi-tenant platform design: per-board partitioning, SLA differentiation between tournament (real-time) and casual (5s lag) boards
- Designs a score fraud detection pipeline as a Kafka Streams job that runs alongside the aggregator without blocking the write path
- Handles the hot leaderboard problem with sub-sharding and aggregator-side merge
- Proposes a monitoring strategy: Kafka consumer lag, Redis memory utilisation, p99 ZINCRBY latency
- Raises approximate counting unprompted for the billion-user scenario
What separates L7/L8 from L6
- L6 mentions Count-Min Sketch; L7/L8 derives the error bound and explains when the approximation error becomes product-acceptable vs. not
- L6 solves the geo-distribution problem with region-local boards; L7/L8 discusses CRDTs and the consistency model they imply
- L6 designs the system; L7/L8 questions whether the product requirement (global ranking at 2B users) is the right problem to be solving
L7/L8 Principal / Distinguished ›
What good looks like
- Introduces the exactness vs. approximation split without being prompted; quantifies the Count-Min Sketch parameters for the given scale
- Raises the CRDT approach for geo-distributed consistency and its operational implications (Redis Enterprise vs. self-managed conflict resolution)
- Challenges the "global ranking" requirement: proposes region-local or game-mode-local as a product alternative that simplifies the system by an order of magnitude
- Designs the fraud detection pipeline with statistical anomaly detection, quarantine flow, and audit replay
- Explicitly models the build-vs-buy decision for Redis Enterprise vs. Apache Flink vs. a custom streaming aggregator
Common L7/L8 failure modes
- Over-engineers from the start — proposes Count-Min Sketch and CRDTs for a 10M-user system that fits in 640 MB of Redis
- Can't simplify on request: when the interviewer says "ignore geo-distribution", keeps defending the complex path instead of pivoting to the simpler design
- Treats approximate ranking as always acceptable without asking about the product context (financial leaderboard vs. engagement leaderboard)
Classic probes table
| Question | L3/L4 | L5/L6 | L7/L8 |
|---|---|---|---|
| "How do you handle 200k score events/second at game-end?" | Redis is fast enough; batches writes per request | Kafka buffer + aggregator consumer; batch 1-second windows; 10× reduction in ZINCRBY calls | Same, plus hot-partition detection, dynamic sub-sharding, and a streaming aggregation framework discussion (Kafka Streams, Flink) |
| "How do you design a 'last 24 hours' sliding-window leaderboard?" | Separate daily ZSET key | Explains that a true sliding window requires event timestamps; proposes a calendar-aligned daily bucket as a pragmatic alternative and why product teams accept it | Designs the sliding window with Redis ZADD (member = event_id, score = epoch_ms), ZREMRANGEBYSCORE to expire old events, then notes this stores events not aggregated user scores — extracting the top-K requires a ZRANGEBYSCORE scan followed by a group-by-user aggregation pass in memory (heap-select top-K from the grouped sums). This is O(events-in-window) not O(users) and breaks at high event volume; proposes a Count-Min Sketch or calendar-aligned bucket as approximate alternatives |
| "Your Redis node runs out of memory. Now what?" | Add more RAM / upgrade node | Redis Cluster with range-partitioned sorted sets + merge layer; explains merge cost is O(shards × K log K) | Evaluates all three options (larger node, sharded sorted sets, Count-Min Sketch), derives the RAM breakeven for each, and recommends based on exactness requirements and operational budget |
| "Two users have identical scores. How do you break the tie?" | "Redis handles it" (not wrong, but not precise) | Explains lexicographic tie-breaking by member ID; proposes encoding a tiebreaker into the score as a fractional offset (score + 1e-9 × inverse_timestamp) | Raises the product question: is first-to-achieve-the-score more fair, or is tie at the same rank acceptable? Designs both options and names the correctness guarantee of each (exact tie-time capture requires atomic compare-and-set, not ZINCRBY) |
How the pieces connect
Every architectural decision in this article traces back to a stated requirement or capacity number.
- Rate Limiter System Design, atomic Redis operations, distributed race conditions, and multi-tier quota enforcement
- URL Shortener System Design, hash encoding tradeoffs, database sharding strategies, and viral key mitigation
- Web Crawler System Design, Bloom filter deduplication, politeness throttling, and distributed frontier design
- Twitter/X Feed System Design, fan-out write amplification, hybrid push/pull strategy, and celebrity threshold design
- Notification Service System Design, multi-channel delivery, idempotency keys, and priority queues at scale
- Search Autocomplete System Design, Trie data structures, prefix caching, and read-heavy scale strategies
- Key-Value Store System Design, Consistent hashing, quorum consensus, and SSTable fundamentals
- Chat System (WhatsApp) System Design, WebSocket management, transient vs persistent storage, and read receipts
- Video Streaming (YouTube) System Design, ABR streaming, CDN distribution, and metadata management
- Distributed Message Queue System Design, Kafka partition tuning, exactly-once delivery, and geo-replication
- File Storage (Dropbox / Google Drive) System Design, chunking, delta sync, conflict resolution, and global deduplication
- Ride-Sharing System Design (Uber / Lyft) — geohashing, WebSocket-driven location tracking, and ETA prediction
- Payment Processing System Design — idempotency keys, exactly-once semantics, and append-only ledger models
- Airbnb Booking & Reservation System — inventory locks, double-booking prevention, and async elasticsearch sync
- Photo-Sharing Feed System Design — image pipelines, CDN delivery, and social graph scaling
- Proximity Search System Design (Yelp / Google Places) — geohash indexing, quadtree partitioning, and Bayesian review ranking
- Online Judge System Design — secure sandboxing, execution queues, and worker scaling