Design a Top-K Leaderboard
Simple to query, hard to keep fresh — ranking millions of users in real time is a deceptively precise engineering problem.
What the interviewer is testing
The leaderboard question looks narrow at first glance: fetch the top K users by score. The constraint that makes it interesting is freshness: the score window is live, updates arrive in bursts, and the top K is queried by thousands of concurrent viewers. The question probes whether you understand why a naïve database query falls apart under concurrent writes, and what it takes to return a ranking in under 50 ms while those writes keep landing.
Interviewers use this question to probe three distinct capabilities that map almost perfectly onto engineering level:
| Level | Core probe | What success looks like |
|---|---|---|
| L3/L4 | Data structure choice | Reaches for Redis sorted sets without prompting; understands ZINCRBY + ZREVRANGE and why a SQL ORDER BY score DESC LIMIT 100 doesn't scale for sub-second reads under high write concurrency |
| L5 | Write-path design | Separates event ingestion from state mutation; introduces a message queue (Kafka) as a buffer, explains micro-batch aggregation and why it dramatically lowers peak write pressure on Redis |
| L6 | Time-windowed rankings | Designs daily/weekly/all-time windows without full rebuilds; explains the sliding window vs. bucket approach and handles the boundary condition correctly |
| L7/L8 | Accuracy vs. scale tradeoff | Introduces Count-Min Sketch for billion-user scale where a full sorted set exceeds RAM budget; quantifies the error bound, demonstrates when approximate is acceptable, and discusses consistency guarantees at the event bus level |
Note: L6 is not a separate architecture from L5; it's the multi-tenant, multi-window extension of the same design. An L6 candidate can articulate when to introduce complexity, not just how.
The question forks early on scope: "Do you need exact ranks, or is approximate top-K acceptable?" An L7/L8 candidate introduces that fork themselves, before the interviewer asks. At L3/L4, answering "exact is fine" is correct: the question is whether you know how to achieve it efficiently.
Requirements clarification
Functional requirements
| Requirement | Detail |
|---|---|
| Score updates | Users earn points from discrete events (game completion, purchase, ad click). Each event increments a user's score by a variable amount. |
| Top-K read | Return the top K users by score for a given leaderboard (e.g., "global", "weekly", "country:US"). K is typically 10–100; the system must support K up to 1,000. |
| User rank lookup | Given a user ID, return their current rank and score. This is a frequent call — every user checks their own position. |
| Time-windowed boards | Separate leaderboards for today, this week, this month, all-time. A score event contributes to all active windows simultaneously. |
| Multiple leaderboards | The platform runs O(100) independent leaderboards (per-game, per-region, tournament). Each is independent. |
| Privacy | Public leaderboards display anonymised or pseudonymous display names only; full user data (email, real name) is never exposed. An authenticated rank lookup may return the requesting user's own full data. L6+ |
Non-functional requirements
| NFR | Target | Why |
|---|---|---|
| Read latency (top-K fetch) | p99 < 50 ms | Leaderboard panels are rendered on page load and in live-updating widgets; human perception of "instant" is roughly 100 ms |
| Write throughput | 50,000 events/s sustained; 200,000 events/s peak (game-end burst) | Score events cluster at round boundaries (end of match, daily reset); the system must absorb the burst without dropping updates |
| Rank freshness | Board reflects events within 5 seconds (eventual consistency) | Real-time exact consistency adds ~10× write overhead; 5-second lag is imperceptible to users but reduces infrastructure cost significantly |
| Availability | 99.9% (43.8 min downtime/month) | Leaderboard outage is degraded UX, not data loss; 99.99% is over-engineered for this feature |
| Durability | Zero score loss | Score events are financial or competitive signals; losing them causes trust breakdowns |
| Scale | 500 M active users; up to 2 B total registered | RAM budget for a full sorted set is bounded; this scale forces the approximate top-K discussion |
NFR reasoning
50 ms p99 read latency §4, §8 ›
50 ms p99 is an aggressive target for a cross-datacenter service (typical inter-AZ RTT is 1–5 ms; cross-region is 60–120 ms). It requires the hot path to be served entirely from memory — no disk reads on the critical path. This rules out a pure database query for top-K, even with optimal indexing, at high concurrency.
5-second rank freshness (eventual consistency) §6 ›
Strong consistency — every event synchronously visible in the sorted set before the write returns — is achievable, but it limits write throughput to roughly 100k–200k synchronous writes/s, the ceiling imposed by per-request network RTT rather than by Redis's internal command throughput. With a 200k events/s peak burst, that margin disappears. Five seconds of lag is set by the product team as the maximum tolerable staleness; lower is better, not required.
Zero score loss (durability) §6, §10 ›
Score durability is stricter than ranking freshness: the score event must survive a service restart even if the rank update is delayed. This means the durable write (to a relational database or Kafka with replication) must happen before the event is acknowledged to the client. The Redis update can be eventually consistent; the Kafka record cannot be lost.
Capacity estimation
Leaderboard estimation has an unusual asymmetry: reads are heavy but cheap (sorted set range queries are O(log N + K)), while writes are moderate but bursty and cluster at predictable intervals. Storage has an unusual constraint: the primary serving store is fully in-memory, so RAM — not disk — is the binding capacity variable. The main question to resolve is whether the user population fits entirely in Redis memory.
The Redis memory cliff. A single sorted set entry occupies ~64 bytes when the member is stored as a Redis integer (8B member + 8B score + skip-list overhead). However, string user IDs like "u_7f3a9c" increase this to 100–120 bytes/entry due to the additional robj + sdshdr allocation. At 500 M active users with string IDs that's roughly 50–60 GB — still single-node viable on a 128 GB Redis instance, but the headroom is tighter than the integer case. At 2 B registered users it reaches 200–240 GB, firmly requiring a cluster or approximate counting. This is the architectural fork that separates L5 from L7/L8 answers: knowing exactly when sharding or approximation becomes necessary, and why.
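A quick back-of-envelope version of that arithmetic; the per-entry byte costs are the assumed figures above, not measured values:

PER_ENTRY_BYTES = {"integer member": 64, "string member": 110}   # ~100-120 B for string IDs

for users in (500_000_000, 2_000_000_000):
    for kind, per_entry in PER_ENTRY_BYTES.items():
        gb = users * per_entry / 1e9
        print(f"{users // 1_000_000} M users, {kind}: ~{gb:.0f} GB")

# 500 M string-member entries ≈ 55 GB (fits a 128 GB node with headroom);
# 2 B string-member entries ≈ 220 GB (forces a cluster or approximation)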
High-level architecture
Before discussing components, the dominant observation from §3 shapes the entire architecture: reads massively outnumber writes, but writes cluster in bursts that can overwhelm a naïve synchronous update path. The architecture separates the write ingestion path from the read serving path, with a stream as the decoupling boundary.
Component breakdown
Load Balancer routes client traffic to two separate service pools (Score API for writes, Read API for reads), enabling each pool to scale independently. Both pools are stateless; all state lives in Redis and Postgres.
Score API is the write-path gateway. It validates the score event, publishes it to a durable message queue (Kafka) with acks=all, and returns 202 Accepted immediately. It never writes directly to Redis — that keeps the write path's latency independent of Redis's single-threaded command execution.
Kafka (score events topic) is the durability boundary. Once a score event is committed to Kafka (replication factor 3, acks=all), it is guaranteed not to be lost even if every downstream service crashes. This decouples durability from availability of the consumer.
Aggregator Consumer is a Kafka consumer that reads score events in micro-batches (every 1–2 seconds), aggregates increments per user per leaderboard within the batch, and issues a single ZINCRBY call per user to Redis rather than one per event. When many players finish the same match simultaneously, a 1–2 second batch collapses 200k events into 20k–100k unique users — cutting ZINCRBY calls by 2–10× without any loss of accuracy.
Redis Sorted Set (ZSET) is the primary serving layer for all reads. ZINCRBY is O(log N) for writes; ZREVRANGE is O(log N + K) for top-K reads; ZREVRANK is O(log N) for user rank lookups. All operations are sub-millisecond even at 100M-user scale.
Postgres (score of record) stores the canonical per-user score, used for recovery (rebuilding the Redis ZSET after a full flush), auditing, and time-windowed queries that don't fit in Redis. It is not on the hot read path.
Architectural rationale
Why Kafka instead of writing directly to Redis? durability ›
Redis is a primary-replica system: writes go to a single primary and replicate asynchronously. Under a burst of 200k events/s, direct Redis writes add latency to the Score API response (Redis processes commands single-threaded). If Redis restarts, in-flight events are lost. Kafka provides a durable, replayable event log: the Score API returns as soon as Kafka acknowledges the write, and the aggregator consumer can replay events from any offset if Redis needs reconstruction.
Why separate Score API and Read API into distinct service pools? scalability ›
Score events and leaderboard reads have completely different traffic profiles: writes spike at game-end boundaries; reads are steady with a multiplier during tournament broadcasts. Separate pools allow each to autoscale independently — a game-end burst scales the Score API, not the Read API; a livestream audience scales the Read API, not the Score API. They also have different SLAs: writes are fire-and-forget (202 Accepted); reads need sub-50ms p99.
Why Postgres as a secondary store if Redis is the hot path? durability + recovery ›
Redis sorted sets are durable with AOF/RDB snapshots, but reconstructing a sorted set from scratch after a large node failure is slow (minutes to hours at 100M-user scale). Postgres stores the latest materialised score per user and acts as the source for a fast COPY-based bulk reload. It also enables historical queries (score at a given timestamp) and audit trails that Redis doesn't support natively.
Real-world comparison
| Decision | Our design | Riot Games | Steam |
|---|---|---|---|
| Primary serving store | Redis ZSET | Redis Cluster (ZSET) | Custom in-memory rank engine |
| Write path buffering | Kafka → aggregator consumer | Internal event bus → batch writer | Direct write with coalescing |
| Rank freshness | ~5s eventual | ~10s eventual (League ladder) | Real-time for top-1000, hourly for rest |
| Scale tier | 100M–500M users | ~150M accounts (LoL) | ~120M MAU |
| Approximate counting | Count-Min Sketch at 2B+ users | Not published | Not published |
Riot's published architecture uses Redis for ranked ladder serving with a custom fan-out system for LP (League Points) changes. The key lesson: even at large scale, a Redis sorted set serves as the canonical ranking layer; the question is how writes are managed upstream of it.
Core algorithm — ranking approaches
Ranking K users out of N is a well-studied problem with a wide spectrum of tradeoffs between exactness, memory, and latency. The right approach is determined by the total user population versus available RAM — a question that only materialises at 500M+ user scale, but one that distinguishes senior from staff-level answers.
Our choice for this system: Redis sorted sets for the active user population (up to ~500M), with Count-Min Sketch as a discussed extension once the population exceeds the RAM budget. Sharded sorted sets appear in the scalability deep-dive as an exact alternative if approximate ranks are unacceptable at 2B+ users.
Count-Min Sketch implementation sketch advanced — L7/L8 ›
A Count-Min Sketch maintains d hash functions and a w × d counter matrix. For each score event, increment d counters (one per hash function row). The estimated count for a user is the minimum across all d rows — this minimum bounds over-count error to at most ε × (total event volume) with probability 1 − δ, where w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉. For 1% error at 99% confidence on 2B events: w ≈ 272, d ≈ 5 — a 272×5 = 1360 counter matrix, a few KB regardless of user population. A min-heap of K candidates is maintained separately, evicting the lowest-score entry when a new candidate exceeds the heap minimum.
The error bound is larger than it looks. The ε in Count-Min Sketch is a fraction of total events processed, not of any one user's score. At ε = 1% with 2B total events in the system, a user's estimated score can be overcounted by up to 20M points — enough to push a borderline user to rank #1. For leaderboards, choose ε relative to a meaningful score gap: ε = 1 / (K × max_delta_per_event) ensures the maximum overcount is smaller than the score difference that would shift rank K.
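A minimal Python sketch of the structure described above, under simplifying assumptions: hashing uses blake2b with a per-row salt, candidate eviction is a simple minimum scan rather than a true heap, and the parameter defaults mirror the numbers quoted earlier.

import hashlib
import heapq
import math

class TopKCountMin:
    """Approximate top-K tracker: a Count-Min Sketch for score estimates plus
    a bounded candidate set for the current top-K."""
    def __init__(self, epsilon=0.01, delta=0.01, k=100):
        self.w = math.ceil(math.e / epsilon)      # columns: overcount <= epsilon * total volume
        self.d = math.ceil(math.log(1 / delta))   # rows: holds with probability 1 - delta
        self.counters = [[0] * self.w for _ in range(self.d)]
        self.k = k
        self.candidates = {}                      # user_id -> estimated score

    def _cells(self, user_id):
        for row in range(self.d):
            digest = hashlib.blake2b(f"{row}:{user_id}".encode(), digest_size=8).digest()
            yield row, int.from_bytes(digest, "big") % self.w

    def add(self, user_id, delta):
        cells = list(self._cells(user_id))
        for row, col in cells:
            self.counters[row][col] += delta
        estimate = min(self.counters[row][col] for row, col in cells)
        self.candidates[user_id] = estimate
        if len(self.candidates) > self.k:         # evict the weakest candidate
            weakest = min(self.candidates, key=self.candidates.get)
            del self.candidates[weakest]

    def top_k(self):
        return heapq.nlargest(self.k, self.candidates.items(), key=lambda kv: kv[1])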
Redis ZSET operations: code reference implementation detail ›
# A runnable sketch assuming redis-py; batch_scores, today, k, and user_id are
# supplied by the surrounding aggregator / API code.
import redis

r = redis.Redis(decode_responses=True)

# Increment user scores on an event (aggregated batch flush)
# pipe.zincrby is O(log N); pipelining cuts round trips for bulk updates
with r.pipeline() as pipe:
    for user_id, delta in batch_scores.items():
        pipe.zincrby("lb:global", delta, user_id)
        pipe.zincrby(f"lb:daily:{today}", delta, user_id)
    pipe.execute()

# Top-K read: ZREVRANGE returns members sorted high-to-low
# withscores=True returns (member, score) tuples
top_k = r.zrevrange("lb:global", 0, k - 1, withscores=True)

# User rank lookup: ZREVRANK is 0-indexed, so +1 for display
rank = r.zrevrank("lb:global", user_id)   # None if the user is not in the set
score = r.zscore("lb:global", user_id)
One open question: "How do you handle ties at rank K?" A Redis sorted set orders members with equal scores lexicographically, and ZREVRANGE reverses that order, so of two users with the same score, the one with the lexicographically greater user ID ranks higher. This is deterministic but arbitrary; interviewers often ask about it, and the answer matters for tournament fairness. A secondary sort key (e.g., the timestamp of achieving the score) can be encoded into the score as a fractional offset.
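One workable variant of that tiebreaker, sketched below. A fractional offset fine enough to separate millisecond timestamps falls below double precision once integer scores reach the millions, so this sketch packs the tiebreaker into the integer part of the score instead and writes it with ZADD as an absolute value (a packed composite does not survive plain ZINCRBY deltas). The bit split is an assumption: minute resolution over ~127 years leaves roughly 134 M as the maximum representable points value.

# Packs (points, achieved_at_minute) into one integer below 2**53, so the ZSET
# double stores it exactly. achieved_at_minute counts minutes since a fixed
# product epoch and must stay below TIME_SLOTS.
TIME_SLOTS = 2 ** 26                                 # ~127 years of minutes

def composite_score(points: int, achieved_at_minute: int) -> int:
    tiebreak = TIME_SLOTS - 1 - achieved_at_minute   # earlier achievement wins the tie
    return points * TIME_SLOTS + tiebreak

def display_points(composite: int) -> int:
    return composite // TIME_SLOTS                   # strip the tiebreak for display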
API design
The leaderboard API has two core write endpoints and two core read endpoints. The write side is fire-and-forget (202 Accepted); the read side is synchronous with strict latency requirements.
POST /score-events — submit a score event
// Request
POST /score-events
Content-Type: application/json
Authorization: Bearer <api_key>
{
"leaderboard_id": "global",
"user_id": "u_7f3a9c",
"delta": 150, // score increment (must be positive)
"event_type": "match_win",
"idempotency_key": "match_42819_u_7f3a9c" // caller-provided
}
// Response — 202 Accepted (event durably queued, not yet ranked)
{
"event_id": "ev_9kLm2nQ",
"status": "accepted",
"expected_visible_in_ms": 5000
}
Input validation: delta must be a positive integer (no score decrements via this endpoint — score resets are a separate admin operation); leaderboard_id must exist; idempotency_key is used to deduplicate retried events at the Kafka consumer layer — writing the same key twice results in one score increment, not two.
GET /leaderboard/:id/top — fetch top-K entries
// Request
GET /leaderboard/global/top?k=100&offset=0
Authorization: Bearer <api_key>
// Response — 200 OK
{
"leaderboard_id": "global",
"generated_at": "2026-04-21T09:55:00Z",
"total_entries": 100000000,
"entries": [
{ "rank": 1, "user_id": "u_4a9f1b", "score": 98420, "display_name": "Nyx" },
{ "rank": 2, "user_id": "u_7f3a9c", "score": 97815, "display_name": "Zephyr" }
]
}
GET /leaderboard/:id/users/:user_id — user rank lookup
// Response
{
"user_id": "u_7f3a9c",
"leaderboard_id": "global",
"rank": 2,
"score": 97815,
"percentile": 99.9998 // computed: (total - rank + 1) / total * 100
}
Optional endpoints by level
| Endpoint | Purpose | Level |
|---|---|---|
| GET /leaderboard/:id/top?window=daily | Time-windowed top-K | L5 |
| GET /leaderboard/:id/users/:id/neighbours | Users ranked N±5 around the querying user (context view). Implementation: two Redis calls — ZREVRANK to get the caller's 0-indexed rank r, then ZREVRANGE(r−5, r+5). Total cost O(log N); returns up to 11 entries. | L5 |
| DELETE /leaderboard/:id/users/:id | Remove a user (ban/account deletion) | L6 |
| POST /leaderboard/:id/reset | Wipe period window for scheduled resets | L6 |
| GET /leaderboard/:id/stats | Score distribution, p50/p95/p99 score, total entries | L7/L8 |
| PATCH /leaderboard/:id/users/:id/score | Admin score correction — sets score to an absolute value when a game server bug misbills points. Requires elevated auth; every call emits an audit event to Kafka. Implementation: pause the aggregator consumer for this user (or use a Lua script to atomically compare the current score), then ZADD with XX flag (update only if member exists) + Postgres upsert + Kafka correction event. Without the pause, an in-flight ZINCRBY from the aggregator can overwrite the corrected value. | L5/L6 |
Core flow: the score update path
The durability NFR from §2 — zero score loss — drives the key design decision: the Score API must acknowledge only after the event is written to a durable store. The freshness NFR — 5-second lag — means Redis does not need to be updated synchronously. This produces the async write path shown below.
Why 202 Accepted, not 200 OK?
Returning 202 explicitly signals to the caller that the event has been durably accepted but the rank effect is not yet visible. This aligns with the eventual consistency model: a client that polls the leaderboard immediately after a 202 may not see the new rank, and that is correct behaviour, not a bug. A 200 OK would imply synchronous processing — semantically incorrect if the rank update is asynchronous.
Idempotency key mechanics
On first receipt, the Score API atomically stores the idempotency key in Redis (SET NX with a 24h TTL). If the key already exists, this is a retry: the API returns the original 202 response immediately and skips the Kafka publish entirely.
The SET NX atomicity matters: a naïve read-then-write pattern would allow two concurrent retries to both see a missing key, both publish to Kafka, and produce a double score increment. The key is also forwarded in the Kafka message header as a secondary guard, so the aggregator can drop duplicates within a batch window even if an event somehow slips through.
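A sketch of that guard on the Score API side, assuming redis-py and confluent-kafka; the handler shape and response payload are illustrative, not the article's API contract.

import json
import redis
from confluent_kafka import Producer

r = redis.Redis()
producer = Producer({"bootstrap.servers": "kafka:9092", "acks": "all"})

def accept_score_event(event: dict) -> dict:
    idem_key = f"idem:{event['idempotency_key']}"
    # SET NX is the atomic claim: only the first request for this key wins.
    first_claim = r.set(idem_key, "accepted", nx=True, ex=86_400)   # 24 h TTL
    if not first_claim:
        return {"status": "accepted", "duplicate": True}            # retry: no re-publish
    producer.produce(
        "score-events",
        key=event["leaderboard_id"].encode(),
        value=json.dumps(event).encode(),
        headers=[("idempotency_key", event["idempotency_key"].encode())],
    )
    producer.flush()          # block until the broker acks: the durability boundary
    return {"status": "accepted", "duplicate": False}

Calling flush() per request is the simplest way to honour the durability-before-ack rule; a real service would rely on delivery callbacks so the producer is not serialised on every request.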
The aggregator's batching strategy doubles as deduplication: if the same user appears 10 times in a 1-second batch, their scores are summed and a single ZINCRBY is issued. This is only safe because score events are additive increments — if events were absolute score assignments, batching would require a max() merge strategy instead.
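A sketch of the aggregator's micro-batch loop, assuming confluent-kafka and redis-py; the topic name, batch window, and key layout are illustrative. Offset commits are deferred until after the Redis flush, so a crash in between replays the batch (at-least-once); a production system would additionally check the idempotency keys in the headers against a durable store to make the replay safe.

import json
import time
from collections import defaultdict

import redis
from confluent_kafka import Consumer

r = redis.Redis()
consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "leaderboard-aggregator",
    "enable.auto.commit": False,             # commit only after the Redis flush succeeds
})
consumer.subscribe(["score-events"])

BATCH_WINDOW_S = 1.0

while True:
    deadline = time.monotonic() + BATCH_WINDOW_S
    batch = defaultdict(int)                 # (zset_key, user_id) -> summed delta
    seen_keys = set()                        # idempotency keys seen in this batch
    while time.monotonic() < deadline:
        msg = consumer.poll(timeout=0.1)
        if msg is None or msg.error():
            continue
        headers = dict(msg.headers() or [])
        idem = headers.get("idempotency_key")
        if idem is not None:                 # secondary dedup guard within the batch
            if idem in seen_keys:
                continue
            seen_keys.add(idem)
        event = json.loads(msg.value())
        day = time.strftime("%Y-%m-%d")
        for zset_key in (f"lb:{event['leaderboard_id']}",
                         f"lb:{event['leaderboard_id']}:daily:{day}"):
            batch[(zset_key, event["user_id"])] += event["delta"]
    if not batch:
        continue
    with r.pipeline() as pipe:               # one ZINCRBY per (board, user), not per event
        for (zset_key, user_id), delta in batch.items():
            pipe.zincrby(zset_key, delta, user_id)
        pipe.execute()
    consumer.commit()                        # a crash before this line replays the batch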
Data model
The leaderboard involves three distinct entities: score events (ephemeral, streaming), user scores (the materialised aggregate), and leaderboard metadata (configuration). Each entity is used very differently, and those differences shape how we store them.
Access patterns
| Operation | Frequency | Query shape |
|---|---|---|
| Append score event | 50k–200k/s | Insert single row; no read |
| Fetch top-K users | High (30× writes) | Range scan by score desc, LIMIT K |
| Lookup user rank + score | High (per-user on load) | Point lookup by user_id in sorted structure |
| Bulk score rebuild (recovery) | Rare | Full table scan of user_scores, sorted by score |
| Time-windowed top-K | Medium | Range scan restricted to score events in a time window |
Two structural constraints follow from these access patterns. First, the top-K and user rank operations both need a sorted structure — a B-tree index on score supports ORDER BY but degrades under concurrent increments (lock contention). Redis sorted sets handle both operations natively. Second, score events are write-once in Kafka and never updated; their only consumer is the aggregator, which drains them in order. Querying them for time-windowed boards requires a separate materialised view rather than scanning the raw event log.
Schema
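A plausible minimal sketch of the two Postgres tables referenced elsewhere in this design (leaderboards + user_scores); column names and types are assumptions, not canonical DDL.

CREATE TABLE leaderboards (
    leaderboard_id  text PRIMARY KEY,                      -- 'global', 'country:US', ...
    display_name    text NOT NULL,
    window_type     text NOT NULL DEFAULT 'all_time',      -- 'daily' | 'weekly' | 'all_time'
    created_at      timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE user_scores (
    leaderboard_id  text   NOT NULL REFERENCES leaderboards (leaderboard_id),
    user_id         text   NOT NULL,
    score           bigint NOT NULL DEFAULT 0,             -- integer deltas: no float drift
    updated_at      timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (leaderboard_id, user_id)
);

-- Supports the rare bulk-rebuild scan; it does not serve the hot top-K read path.
CREATE INDEX user_scores_by_score ON user_scores (leaderboard_id, score DESC);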
Why bigint for score instead of float? schema detail ›
Redis ZSET scores are 64-bit IEEE 754 doubles, which can represent integers exactly up to 2⁵³ (~9 quadrillion). Storing game scores as integer deltas (100, 250, 500) keeps sums exact with no floating-point drift. Only if a secondary sort criterion needs to be packed as a fractional offset (e.g., earlier achievement time encoded as 0.0000001 × unix_ts) does float precision become relevant — and that's a deliberate encoding choice, not accidental.
Time-windowed key strategy in Redis L5+ design ›
A daily leaderboard is a separate ZSET key: lb:global:daily:2026-04-21. Score events contribute to both the all-time ZSET and the relevant daily/weekly ZSET in the same pipeline batch. At midnight, the previous day's key is snapshotted (ZRANGEBYSCORE → archive) and then deleted. This keeps RAM usage bounded: only the current window's ZSET is live.
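A sketch of that rollover, reusing the r client from the earlier snippet and assuming a canonical alias key that readers query (an alternative to computing date-stamped keys client-side); COPY requires Redis 6.2+, and the staging key is assumed to have been pre-seeded before midnight.

import datetime

today = datetime.date.today()
yesterday = today - datetime.timedelta(days=1)
canonical = "lb:global:daily:current"                       # the key readers query
staging = f"lb:global:daily:{today.isoformat()}:staging"    # pre-seeded before midnight
archive = f"lb:global:daily:{yesterday.isoformat()}:final"  # frozen copy of the closing window

r.copy(canonical, archive, replace=True)   # snapshot the old board for archival / settlement
r.rename(staging, canonical)               # atomic swap: readers never observe an empty board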
Caching strategy
Redis sorted sets are already in-memory, which makes this system unusual: the "cache" and the "database" for ranking are the same store. The caching question is not whether to cache, but what to layer on top of Redis to absorb read QPS beyond what a single primary can handle.
| Layer | What it caches | TTL | Why it exists | Invalidation |
|---|---|---|---|---|
| CDN / edge cache | Top-100 snapshot for the n most popular leaderboards | 5s | Livestream events can send 10k+ clients polling simultaneously. CDN absorbs this burst; Redis never sees it. | Passive TTL expiry. No explicit purge — 5s lag is acceptable per freshness NFR. |
| In-process (Read API) | ZREVRANGE result for top-K per leaderboard | 1s | Each Read API pod can receive thousands of req/s. A 1-second in-process cache collapses concurrent requests to one Redis call per pod per second. | TTL only. Short enough that stale reads are bounded to 1s above the 5s freshness SLA. |
| Redis ZSET (primary) | All user scores in the active window | No TTL (managed by window reset) | This is the hot serving store, not a cache — it's the source of truth for ranking. All top-K reads and rank lookups are served here. | Writes via ZINCRBY (aggregator). Window resets delete the key and re-seed from Postgres. |
| Redis read replicas | Same ZSET data, read-only | N/A | Routes read QPS off the primary; replication lag is typically <100ms, well within the 5s freshness SLA. | Sync'd from primary automatically. |
Read-path flow
The 50 ms p99 latency NFR from §2 is what forces layers 1–2 to exist: Redis alone would still meet the latency target at high read QPS, but the absolute Redis command count becomes the ceiling. The read path works through each cache layer in sequence, short-circuiting on the first hit.
Cache stampede at window reset. When a daily leaderboard resets at midnight, the cached top-100 in the CDN and in-process layers expires simultaneously. If 50k clients poll at reset time, they all miss the cache and fan out to Redis. Mitigations: stagger TTL with a small jitter (+0–500ms), pre-warm the ZSET from Postgres before invalidating the old window's cache, and use probabilistic early expiration (PER) in the CDN to avoid a hard cliff.
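A minimal sketch of the in-process layer with jittered TTLs, reusing the r client from earlier; single-flight collapsing of concurrent misses is omitted for brevity.

import random
import time

_cache = {}   # (board, k) -> (expires_at, entries)

def top_k_cached(board: str, k: int):
    now = time.monotonic()
    hit = _cache.get((board, k))
    if hit and hit[0] > now:
        return hit[1]
    entries = r.zrevrange(f"lb:{board}", 0, k - 1, withscores=True)
    ttl = 1.0 + random.uniform(0.0, 0.5)      # jitter so pods don't all expire together
    _cache[(board, k)] = (now + ttl, entries)
    return entries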
Scalability deep-dive
The baseline architecture scales comfortably to ~500M active users on a single Redis primary with read replicas. Beyond that, three bottlenecks emerge: Redis memory, write throughput, and merge complexity for global rankings across geographic shards. Each is addressed separately.
Redis Cluster sharding for 2B+ users L5+ ›
Redis Cluster distributes 16,384 hash slots across N primaries. For a global leaderboard, all users hash into one logical ZSET — but a single ZSET cannot be split across cluster nodes (a cross-slot ZSET doesn't exist). Instead, range-partition the user population: shard 0 owns users 0–499M, shard 1 owns 500M–999M, etc. Each shard maintains its own ZSET. Top-K reads fan out to all shards, each returns its local top-K, and a merge layer heaps the results.
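A sketch of that fan-out and merge, where shard_clients is an assumed list of per-shard Redis clients.

import heapq

def global_top_k(board: str, k: int, shard_clients):
    # Each shard returns its own local top-K; the global top-K lies within their union.
    local = [c.zrevrange(f"lb:{board}", 0, k - 1, withscores=True) for c in shard_clients]
    return heapq.nlargest(k, (entry for shard in local for entry in shard),
                          key=lambda entry: entry[1])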
Kafka partition strategy for write throughput L5 ›
Kafka partitions are the parallelism unit. Partitioning by leaderboard_id ensures events for a given leaderboard always go to the same partition: the aggregator consumer for that partition accumulates increments without cross-partition coordination. At 200k events/s across 100 leaderboards, each partition sees ~2k events/s — well within Kafka's per-partition throughput ceiling (~100k events/s).
Geo-distributed leaderboards L7/L8 ›
A global leaderboard across multiple regions has three options: (1) single-region primary with cross-region reads (simple, adds latency for non-primary regions); (2) region-local ZSETs with a periodic global merge scheduled job (approximate; global rank may lag by minutes); (3) CRDTs — Redis CRDT sorted sets available in Redis Enterprise propagate increments asynchronously across regions and converge without central coordination. Option (3) is the principled answer at global scale but requires Redis Enterprise, not open-source Redis.
Approximate top-K with Count-Min Sketch at 2B+ users L7/L8 ›
At 2B users × 64–120 bytes per entry, a single leaderboard window needs 128–240 GB of RAM: realistic but expensive on a single node. A Count-Min Sketch (described in §5) replaces the sorted set for the score estimation layer, while a min-heap of K candidates tracks the current approximate top-K. The sketch fits in a few KB of RAM regardless of user count. A secondary exact-count layer (sorted set or Postgres index) can be applied only to the top-K candidate set (~1,000 users), giving exact ordering within the approximate top-K.
Failure modes & edge cases
| Scenario | Problem | Solution | Level |
|---|---|---|---|
| Redis primary restart | In-memory ZSET lost if persistence is disabled or snapshot is stale. Rank reads return empty. | Enable AOF persistence (appendfsync everysec). On restart, the aggregator detects an empty ZSET via a sentinel key and triggers a bulk reload from Postgres (COPY → pipelined ZADD). During the rebuild window (~30–60s), reads are served from Redis read replicas (which survive a primary crash) or from the stale in-process and CDN cache; both remain within the 5s freshness SLA. Falling back directly to Postgres for top-K reads is not viable: an ORDER BY score DESC LIMIT K on a 500M-row table takes seconds, violating the 50ms p99 NFR. Postgres is only consulted if all Redis nodes are lost simultaneously, a scenario that warrants a full incident response rather than automatic read fallback. | L3/L4 |
| Aggregator consumer lag | Kafka consumer falls behind during a burst; ranks lag beyond the 5-second SLA. | Monitor Kafka consumer group lag. Set alerting threshold at 30 seconds behind. Auto-scale the aggregator consumer group horizontally (add pods). Partition count must be ≥ max consumer count. | L5 |
| Duplicate score events | Network retry causes the same event to be delivered twice; user score inflated. | Caller provides idempotency key. Score API stores key in Redis with 24h TTL; duplicates short-circuit with the original 202 instead of a second publish. Aggregator deduplicates on idempotency key within the batch window. | L5 |
| Hot leaderboard write skew | A viral tournament sends 95% of writes to one Kafka partition. Single aggregator consumer becomes a bottleneck. | Detect hot partitions via lag monitoring. Sub-shard the Kafka key for hot leaderboards (lb_id + user_id % N sub-shards). Aggregator consumers merge sub-shard results before flushing to Redis. | L6 |
| Window reset atomicity | At midnight, the daily ZSET is deleted and re-seeded. A reader between delete and re-seed sees an empty leaderboard. | Use a two-phase reset: (1) pre-populate the new window's ZSET key in the background before the reset time; (2) atomically rename it into the canonical key (Redis RENAME is atomic). Old key is deleted after rename. Readers always see either old or new, never empty. | L6 |
| Cache stampede at reset | All CDN caches expire simultaneously at the reset boundary. 50k concurrent clients miss cache and hit Redis. | Use probabilistic early expiration (PER): re-compute cached top-100 while the old cache is still valid, using a random-jitter check. Alternatively, pre-warm CDN by fetching and caching the new leaderboard 10 seconds before reset. | L7/L8 |
| Score manipulation / cheating | A compromised game client sends fraudulent score events inflating a user's score. | Rate-limit score events per user per time window (e.g., max 1,000 events/minute). Flag statistical outliers (score variance beyond 5σ of historical delta distribution). Events above the threshold are quarantined in a separate Kafka topic for manual audit before being applied. | L7/L8 |
| Game server mis-awards score | A bug in the game server awards too many (or too few) points before the error is caught. The incorrect score is now live in Redis and Postgres. | Admin calls PATCH /leaderboard/:id/users/:id/score with the correct absolute score. The aggregator consumer is quiesced for this user during the correction window (or a Lua script is used for an atomic compare-and-set) to prevent an in-flight ZINCRBY from overwriting the corrected value. Implementation: Redis ZADD with XX flag + Postgres upsert + Kafka correction event. A diff record (old score, new score, reason, admin ID) is written to an immutable audit log before any mutation is applied. | L5/L6 |
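A sketch of the Postgres-to-Redis rebuild from the first row above, assuming psycopg2 and the user_scores table sketched in the data-model section; the COPY-based export mentioned there is approximated here with a server-side cursor, which streams rows without loading the table into memory.

import psycopg2
import redis

r = redis.Redis()
conn = psycopg2.connect("dbname=leaderboard")

with conn, conn.cursor(name="lb_rebuild") as cur:     # named cursor: streams, no full fetch
    cur.itersize = 50_000
    cur.execute("SELECT user_id, score FROM user_scores WHERE leaderboard_id = %s",
                ("global",))
    pipe = r.pipeline(transaction=False)
    for i, (user_id, score) in enumerate(cur, start=1):
        pipe.zadd("lb:global", {user_id: score})
        if i % 10_000 == 0:
            pipe.execute()                             # flush in chunks to bound memory
            pipe = r.pipeline(transaction=False)
    pipe.execute()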
How to answer by level
L3/L4 SDE I / SDE II ›
What good looks like
- Reaches for Redis sorted sets immediately; names ZINCRBY and ZREVRANGE without prompting
- Explains why SQL ORDER BY doesn't satisfy the latency NFR under write concurrency
- Designs a correct two-table schema (user_scores + leaderboards) in Postgres
- Mentions read replicas to scale read QPS off the primary
What separates L5 from L3/L4
- L3/L4 writes directly to Redis on the score event; L5 introduces Kafka as a buffer and explains why (burst absorption + durability)
- L3/L4 says "Redis is fast enough for the writes"; L5 quantifies write QPS and shows when a single Redis primary saturates
- L3/L4 doesn't consider time-windowed boards or asks the interviewer to skip them
- L3/L4 rarely considers idempotency unprompted; L5 designs the full end-to-end deduplication flow (API key → Kafka header → aggregator batch)
L5 Senior SDE ›
What good looks like
- Designs the Kafka → aggregator → Redis pipeline and explains the batch-aggregation optimisation (20k ZINCRBY calls instead of 200k)
- Handles time-windowed boards with separate ZSET keys per window and an atomic RENAME reset
- Designs the idempotency key flow end-to-end (API → Redis check → Kafka header → aggregator dedup)
- Handles the Redis restart recovery path: sentinel key + Postgres bulk reload
- Proposes in-process caching + CDN to absorb read bursts without touching Redis
What separates L6 from L5
- L5 designs one leaderboard; L6 designs a multi-tenant leaderboard platform with 100 independent boards, per-board Kafka partitioning, and hot-partition detection
- L5 uses a single Redis primary; L6 knows exactly when and how to shard it
- L5 notes the cache stampede problem; L6 implements PER or a pre-warm pipeline and explains why each is better in which scenario
L6 Staff SDE ›
What good looks like
- Owns the multi-tenant platform design: per-board partitioning, SLA differentiation between tournament (real-time) and casual (5s lag) boards
- Designs a score fraud detection pipeline as a Kafka Streams job that runs alongside the aggregator without blocking the write path
- Handles the hot leaderboard problem with sub-sharding and aggregator-side merge
- Proposes a monitoring strategy: Kafka consumer lag, Redis memory utilisation, p99 ZINCRBY latency
- Raises approximate counting unprompted for the billion-user scenario
What separates L7/L8 from L6
- L6 mentions Count-Min Sketch; L7/L8 derives the error bound and explains when the approximation error becomes product-acceptable vs. not
- L6 solves the geo-distribution problem with region-local boards; L7/L8 discusses CRDTs and the consistency model they imply
- L6 designs the system; L7/L8 questions whether the product requirement (global ranking at 2B users) is the right problem to be solving
L7/L8 Principal / Distinguished ›
What good looks like
- Introduces the exactness vs. approximation split without being prompted; quantifies the Count-Min Sketch parameters for the given scale
- Raises the CRDT approach for geo-distributed consistency and its operational implications (Redis Enterprise vs. self-managed conflict resolution)
- Challenges the "global ranking" requirement: proposes region-local or game-mode-local as a product alternative that simplifies the system by an order of magnitude
- Designs the fraud detection pipeline with statistical anomaly detection, quarantine flow, and audit replay
- Explicitly models the build-vs-buy decision for Redis Enterprise vs. Apache Flink vs. a custom streaming aggregator
Common L7/L8 failure modes
- Over-engineers from the start — proposes Count-Min Sketch and CRDTs for a 10M-user system that fits in 640 MB of Redis
- Can't simplify on request: when the interviewer says "ignore geo-distribution", keeps defending the complex path instead of pivoting to the simpler design
- Treats approximate ranking as always acceptable without asking about the product context (financial leaderboard vs. engagement leaderboard)
Classic probes table
| Question | L3/L4 | L5/L6 | L7/L8 |
|---|---|---|---|
| "How do you handle 200k score events/second at game-end?" | Redis is fast enough; batches writes per request | Kafka buffer + aggregator consumer; batch 1-second windows; 10× reduction in ZINCRBY calls | Same, plus hot-partition detection, dynamic sub-sharding, and a streaming aggregation framework discussion (Kafka Streams, Flink) |
| "How do you design a 'last 24 hours' sliding-window leaderboard?" | Separate daily ZSET key | Explains that a true sliding window requires event timestamps; proposes a calendar-aligned daily bucket as a pragmatic alternative and why product teams accept it | Designs the sliding window with Redis ZADD (member = event_id, score = epoch_ms), ZREMRANGEBYSCORE to expire old events, then notes this stores events not aggregated user scores — extracting the top-K requires a ZRANGEBYSCORE scan followed by a group-by-user aggregation pass in memory (heap-select top-K from the grouped sums). This is O(events-in-window) not O(users) and breaks at high event volume; proposes a Count-Min Sketch or calendar-aligned bucket as approximate alternatives |
| "Your Redis node runs out of memory. Now what?" | Add more RAM / upgrade node | Redis Cluster with range-partitioned sorted sets + merge layer; explains merge cost is O(shards × K log K) | Evaluates all three options (larger node, sharded sorted sets, Count-Min Sketch), derives the RAM breakeven for each, and recommends based on exactness requirements and operational budget |
| "Two users have identical scores. How do you break the tie?" | "Redis handles it" (not wrong, but not precise) | Explains lexicographic tie-breaking by member ID; proposes encoding a tiebreaker into the score as a fractional offset (score + 1e-9 × inverse_timestamp) | Raises the product question: is first-to-achieve-the-score more fair, or is tie at the same rank acceptable? Designs both options and names the correctness guarantee of each (exact tie-time capture requires atomic compare-and-set, not ZINCRBY) |
How the pieces connect
Every architectural decision in this article traces back to a stated requirement or capacity number.
- Rate Limiter System Design, atomic Redis operations, distributed race conditions, and multi-tier quota enforcement
- URL Shortener System Design, hash encoding tradeoffs, database sharding strategies, and viral key mitigation
- Web Crawler System Design, Bloom filter deduplication, politeness throttling, and distributed frontier design
- Twitter/X Feed System Design, fan-out write amplification, hybrid push/pull strategy, and celebrity threshold design
- Notification Service System Design, multi-channel delivery, idempotency keys, and priority queues at scale
- Search Autocomplete System Design, Trie data structures, prefix caching, and read-heavy scale strategies
- Key-Value Store System Design, Consistent hashing, quorum consensus, and SSTable fundamentals
- Chat System (WhatsApp) System Design, WebSocket management, transient vs persistent storage, and read receipts
- Video Streaming (YouTube) System Design, ABR streaming, CDN distribution, and metadata management
- Distributed Message Queue System Design, Kafka partition tuning, exactly-once delivery, and geo-replication
- File Storage (Dropbox / Google Drive) System Design, chunking, delta sync, conflict resolution, and global deduplication
- Ride-Sharing System Design (Uber / Lyft) — geohashing, WebSocket-driven location tracking, and ETA prediction
- Payment Processing System Design — idempotency keys, exactly-once semantics, and append-only ledger models
- Airbnb Booking & Reservation System — inventory locks, double-booking prevention, and async elasticsearch sync
- Photo-Sharing Feed System Design — image pipelines, CDN delivery, and social graph scaling
- Proximity Search System Design (Yelp / Google Places) — geohash indexing, quadtree partitioning, and Bayesian review ranking
- Online Judge System Design — secure sandboxing, execution queues, and worker scaling