intervu.dev, System Design Series
Design a Rate Limiter, System Design Interview Guide
Simple to explain, surprisingly hard to get right at scale. The gap between a correct counter and a correct distributed counter is one atomic operation, and everything else follows from that.
What the interviewer is testing
A rate limiter question looks deceptively narrow. Most candidates sketch a counter and call it done. The interviewer is watching for something else: whether you can hold low latency, high availability, and fairness in tension at the same time without sacrificing one for another.
The question also acts as a level filter. At L3/L4, the bar is a working single-node design. At L5, you need to identify the distributed race condition and solve it atomically. At L6 and above, the interviewer wants to see you reason about what "correct" even means when nodes disagree, and when it's acceptable to over-count.
| Level | Core expectation | Where candidates fall short |
|---|---|---|
| L3/L4 | Working token bucket or fixed window, correct HTTP 429 response, basic Redis key structure | Forgetting the race condition, no expiry on Redis keys |
| L5 | Atomic Lua script, sliding window algorithm, per-user + per-IP dimensions | Treating Redis as always available, no fallback plan |
| L6 | Multi-tier limits (user, tenant, global), quota inheritance across tiers, async audit log | No thought for quota inheritance or bulk API consumers |
| L7/L8 | Approximate counting tradeoffs, gossip-based quota sync, cell-rate algorithm for bursting SLAs | Optimizing for the happy path only; no model for correctness under partition |
The earliest signal: the question "What counts as a request?" Does a retry count? Does a request that errors before reaching the database count? Candidates who think to ask this tend to do well.
Requirements clarification
Before touching algorithms, establish what the rate limiter needs to protect and what "fair" means. The answers change the data model and the consistency model entirely.
Functional requirements
| Question | Typical answer | Why it matters |
|---|---|---|
| What dimensions to limit on? | Per user, per IP, per API key | Each dimension needs a separate Redis key namespace |
| What endpoints need limiting? | All, via gateway middleware | Centralising at the gateway avoids per-service duplication |
| Hard limit or soft? | Hard, reject at threshold | Soft limits need a separate overflow queue design |
| Should limits be configurable per tenant? | Yes, tiers (Free / Pro / Enterprise) | Drives a config-lookup layer, not just static constants |
| Should retries count? | Yes, same as a new request | Simplifies the counter; retry logic is client responsibility |
| Do you support idempotency keys? | No, out of scope for this design | If yes, idempotent retries must not consume a token; requires a pre-check lookup and changes the Redis key structure (see §10 edge cases) |
Non-functional requirements
| NFR | Target |
|---|---|
| Added latency per check | < 5 ms p99 |
| Availability | 99.99%, fail open on Redis outage |
| Accuracy | At most 0.1% over-count; under-count minimised |
| Scale | 1M unique users, 100k requests/sec peak |
| Auditability | All 429s logged async within 30 s |
Latency < 5 ms p99 Drives the Redis choice ›
Every API call incurs an extra network hop for the rate limit check. At 5 ms the check is invisible to humans but measurable by services with tight SLAs. This target effectively rules out any approach that requires a synchronous DB read; Redis at sub-millisecond latency is the only practical option.
Fail open on Redis outage Drives availability design ›
99.99% availability for the rate limiter must not lower the availability of the services it protects. If the Redis cluster is unreachable, the right default for most consumer APIs is to let traffic through rather than block everything. This is called "fail open."
The alternative, "fail closed", is appropriate when the rate limiter protects a payment endpoint or prevents fraud. Clarify with the interviewer which mode fits the use case.
Per-tenant configurable limits Drives config layer ›
Static constants hardcoded in the limiter only work for a single-tier product. Once you have Free, Pro, and Enterprise tiers, each with different request quotas, you need a config lookup: given a user ID, what is their limit? This lookup happens on every request, so it must itself be cached; a short TTL (60 s) in a local in-process cache is the standard approach.
Accuracy: over-count ≤ 0.1%, under-count minimised Drives algorithm choice ›
The asymmetry between over-count and under-count is the key constraint. An over-count means a legitimate request gets a 429, a UX failure. An under-count means a request that should be blocked gets through, a security failure. These are not equally bad for most systems, which is why the NFR prioritises minimising under-counting while tolerating a small over-count rate.
In practice, brief under-counts are unavoidable at partition boundaries, during crash windows, and when using approximate counting modes (§09). The NFR means the steady-state design must not structurally under-count, not that zero under-count events are possible.
The 0.1% over-count tolerance is what makes the sliding window counter approximation viable as an alternative to the exact sliding window log. The approximation's error at the window boundary is at most prev_count × (elapsed/window), a fraction of the previous window's count, not the full limit, well within 0.1% under realistic traffic distributions. If this tolerance were tighter (e.g. 0%), you'd be forced into the sliding window log with its O(requests) memory cost.
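The approximation and its boundary behaviour are easy to sanity-check numerically. A minimal sketch of the weighted estimate (the function name is illustrative, not from any library):

```python
def sliding_window_estimate(prev_count, curr_count, elapsed, window):
    # Weight the previous window by how much of it still overlaps
    # the sliding window, assuming its traffic was uniformly spread.
    return prev_count * (1 - elapsed / window) + curr_count

# Previous 60 s window saw 1000 requests; 15 s into the current
# window we have 200 so far: estimate = 1000 * 0.75 + 200 = 950.
estimate = sliding_window_estimate(1000, 200, 15, 60)   # 950.0
```

At elapsed = 0 the previous window is fully weighted; at elapsed = window it contributes nothing. That smooth decay is exactly what bounds the boundary error to a fraction of prev_count rather than the full limit.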
Scale: 1M users, 100k req/s peak Drives Redis cluster design ›
100k requests/sec means 100k Redis Lua script executions per second at peak. A single Redis node handles ~200–400k Lua executions/sec (see §9), so a single node is sufficient at this scale, but only if key distribution is even. If all traffic concentrates on a small number of hot users, one shard becomes a bottleneck.
The 1M unique users figure matters for memory sizing (see §3, it's small) and for config cache design: 1M users × 60 s TTL means the in-process cache on each API server holds at most a few thousand live entries at any moment, not 1M.
Auditability: all 429s logged within 30 s Drives async pipeline ›
Rate limit decisions must be debuggable ("why was user X blocked?") and auditable for abuse detection. Logging synchronously in the request path is ruled out immediately: even a 1 ms write to a log store adds to the p99 latency budget and would push the limiter past its 5 ms target at volume.
The 30 s bound is generous enough to allow an async pipeline: the middleware emits a 429 event to a local buffer, a background thread flushes to a durable message queue (Kafka), and a stream processor writes to a columnar analytics store (ClickHouse) for querying. This path never touches the hot request thread.
Capacity estimation
The capacity picture for a rate limiter differs from most systems: storage is tiny, latency is everything, and write volume dominates. Every request is a write, a counter increment. There are no reads that aren't also writes.
The key insight: Redis memory for the counters is almost always trivially small; a million active 1-minute windows cost under 100 MB. The binding constraint is Redis write throughput at peak QPS, not memory. That's why the design question is really about atomic writes, not storage. The active key count assumes 10% of users are concurrently active at peak (typical for consumer APIs); adjust upward for social/messaging workloads or downward for B2B.
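The arithmetic behind that insight fits in a few lines. A sketch, with the per-key byte cost as a stated assumption (~100 bytes to cover the key name, two hash fields, and Redis overhead for a token bucket hash):

```python
total_users = 1_000_000
active_fraction = 0.10      # assumption from the text: 10% concurrently active
bytes_per_key = 100         # assumption: key + two hash fields + Redis overhead

active_keys = int(total_users * active_fraction)         # 100,000 live counters
counter_memory_mb = active_keys * bytes_per_key / 1e6    # ~10 MB: trivially small

peak_qps = 100_000
lua_ceiling_per_node = 200_000   # conservative end of the 200-400k/s range
throughput_headroom = lua_ceiling_per_node / peak_qps    # 2x on a single node
```

Even scaling the active fraction to 100% of users only reaches ~100 MB, which is why memory never drives the design; the Lua-execution ceiling does.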
High-level architecture
The rate limiter lives in the request path, after load balancing and before the business logic. This placement is deliberate: it must intercept every request regardless of which API server processes it, and it must do so without adding significant latency.
Figure 1: High-level rate limiter architecture. The middleware intercepts every request, consults Redis atomically, and either forwards the request or returns a 429. Config rules are cached locally for 60 s.
Component breakdown
Load balancer. Distributes traffic across stateless API servers. Does not enforce rate limits itself; shared state would make it a bottleneck. If limiting by IP, it must preserve the original client IP in X-Forwarded-For rather than replacing it with its own address.
Rate limiter middleware. The core enforcement point. Can be deployed as an API gateway plugin (Kong, Nginx Lua), a sidecar proxy (Envoy), or in-process middleware. Does three things on every request: looks up the user's quota (from local cache), atomically checks and decrements the counter in Redis via Lua script, and either forwards the request or returns a 429.
Redis (counter store). The shared state layer. Every API server writes to the same Redis cluster, so the count is accurate regardless of which server handles a given request. Keys are namespaced by user ID and time window with a TTL set to the window size. Atomicity is achieved via a Lua script, a single round trip that reads, decrements, sets expiry, and returns the result.
Config store. Stores quota rules per tenant (Free: 100 req/min, Pro: 1,000 req/min). Read on every request but cached locally per API server with a 60-second TTL, so the lookup never adds a second Redis round-trip to the hot path. Plan changes propagate within 60 s; push invalidation is needed for immediate downgrade enforcement.
Audit log (async). Every 429 is published to a durable message queue (Kafka) and consumed by a stream processor that writes to a columnar analytics store (ClickHouse) for abuse analytics and querying. Serves two purposes: debugging ("why was user X blocked?") and abuse detection ("user X hit the limit 500 times in 5 minutes"). Writing is fire-and-forget from the request path, no latency added.
Architectural rationale
① Load balancer Routing layer ›
The load balancer's job is distribution, not enforcement. Enforcing rate limits at the LB would require shared state between LB nodes, turning a stateless routing layer into a stateful bottleneck. Keeping limits in the middleware layer behind the LB means the LB stays simple and horizontally scalable.
One non-obvious requirement: if you rate limit by IP, the LB must forward the original client IP in X-Forwarded-For. Many default configurations replace the source IP with the LB's own address, meaning all traffic appears to come from one IP, and IP-based limits collapse into a global limit.
Trust X-Forwarded-For only when it arrives from known LB CIDR ranges. Clients can spoof this header to bypass IP-based limits if the middleware trusts it unconditionally.
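A middleware sketch of that rule: honour X-Forwarded-For only when the direct TCP peer is one of our own load balancers, and take the rightmost entry (the hop the LB itself appended). The CIDR range and function name are illustrative placeholders.

```python
import ipaddress

# Hypothetical: the CIDR block our own load balancers live in.
TRUSTED_LB_NETS = [ipaddress.ip_network("10.0.0.0/8")]

def client_ip(peer_ip, xff_header):
    """Return the IP to rate limit on. X-Forwarded-For is honoured
    only when the TCP peer is a trusted load balancer."""
    peer = ipaddress.ip_address(peer_ip)
    if xff_header and any(peer in net for net in TRUSTED_LB_NETS):
        # The rightmost entry is the client as seen by the LB; entries
        # to its left are client-supplied and spoofable.
        return xff_header.split(",")[-1].strip()
    return peer_ip   # no trusted LB in front: use the socket peer address
```

Taking the leftmost entry instead, a common mistake, hands the attacker control of the rate limit key.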
② Rate limiter middleware Core component ›
Deployment options differ meaningfully. An API gateway plugin (Kong, Nginx Lua) is the simplest: one place to configure, one place to update. But it shares the gateway's blast radius; a middleware bug can take down all routing. A sidecar proxy (Envoy) isolates the rate limiter's failure domain at the cost of a network hop per request. In-process middleware (Express.js, FastAPI) is fastest but per-service, making language-agnostic enforcement impossible.
③ Redis (counter store) State layer ›
Redis is the only practical choice for the counter store given the <5 ms p99 latency target. A traditional DB adds 5–15 ms per write under load; Redis at sub-millisecond latency is the only store that keeps the check invisible in the request path.
The Lua script is the correctness mechanism. Two separate commands (GET then INCR) leave a TOCTOU window: two concurrent requests can both read the same count, both decide "allowed," and both write, admitting two requests when one should be blocked. The Lua script runs atomically on the Redis thread, eliminating this race entirely without distributed locks.
④ Config store Quota rules ›
Quota rules can't be hardcoded constants once you have per-tenant pricing tiers. The config store holds the mapping from user ID (or plan) to their limit, window size, and algorithm. Reads happen on every request, so a raw DB lookup would add another round-trip; the local in-process cache with a 60-second TTL prevents this.
The 60-second TTL is a deliberate product decision: upgrades propagate within 60 s (acceptable), but downgrades also take 60 s to enforce. For fraud-prevention use cases where immediate downgrade is required, push invalidation via a pub/sub event from the config service is the correct extension.
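A minimal sketch of that config-lookup layer: a per-process TTL cache with an explicit invalidate hook for the pub/sub extension. `fetch_fn` stands in for the Config DB query; all names are illustrative, not from any library.

```python
import time

class TTLConfigCache:
    """Per-process quota-config cache with a short TTL (sketch)."""
    def __init__(self, fetch_fn, ttl_seconds=60, clock=time.monotonic):
        self.fetch_fn = fetch_fn      # stands in for the Config DB lookup
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}            # user_id -> (expires_at, config)

    def get(self, user_id):
        now = self.clock()
        hit = self._entries.get(user_id)
        if hit and hit[0] > now:
            return hit[1]             # fresh: no DB round-trip
        config = self.fetch_fn(user_id)   # cold miss: one DB round-trip
        self._entries[user_id] = (now + self.ttl, config)
        return config

    def invalidate(self, user_id):
        # Called from a pub/sub listener for immediate downgrade enforcement.
        self._entries.pop(user_id, None)
```

The injected `clock` keeps the TTL logic unit-testable; in production the default monotonic clock is what you want.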
⑤ Audit log Observability ›
Synchronous 429 logging is ruled out by the latency budget: even a 1 ms log write adds to p99 and compounds at 100k QPS. The async pipeline (middleware → message queue → analytics store) decouples observability from the request path entirely: the middleware emits a fire-and-forget event and never waits for it.
The three stages each have a specific job. A durable message queue (Kafka) sits between the middleware and the analytics store; it absorbs traffic spikes without backpressure and retains events for replay if a consumer has a bug. A stream processor (Flink or a simple Kafka consumer) aggregates raw events and writes to a columnar analytics store (ClickHouse) that powers abuse dashboards and per-user 429 queries. Raw events can additionally be archived to object storage (S3) for long-term retention and ad hoc queries; this is optional at this scale but standard practice for compliance-sensitive deployments.
Real-world comparison
| Decision | This design | Stripe | GitHub API |
|---|---|---|---|
| Algorithm | Token bucket (configurable) | Token bucket | Sliding window (1-hour) |
| Counter store | Redis + Lua | Redis | Redis |
| Limit dimensions | User + API key + IP | API key (per secret) | Authenticated user or IP |
| Headers returned | X-RateLimit-* + Retry-After | X-RateLimit-Limit/Remaining/Reset | X-RateLimit-Limit/Remaining/Reset/Used |
| On Redis outage | Fail open (configurable) | Fail open | Fail open |
Every system above uses Redis. The choice of algorithm (token bucket vs sliding window) is what differentiates them, and that choice flows directly from the burst tolerance NFR you set in §2.
Core algorithm, the bucket decision
Four algorithms compete in interviews. The right choice isn't fixed; it follows from your NFRs, specifically the ones set in §2: whether bursts are tolerable, what over-count accuracy is acceptable, and whether output must be strictly smoothed.
Figure 2: All four rate-limiting algorithms side by side. Token bucket (①) is the recommended default. Sliding window counter (③) is the best alternative when boundary smoothness matters. Fixed window (②) is simplest but has a 2× spike risk. Leaky bucket (④) is reserved for strict output smoothing.
① Token Bucket, deep dive Recommended default ›
A token bucket holds up to max tokens. Tokens refill at a fixed rate (e.g. 10/sec). Each request consumes one token. If the bucket is empty, the request is rejected with a 429. This maps naturally to user intuition: you have a quota that refills over time, and bursting is allowed as long as you have tokens saved up.
Redis representation: a hash with two fields, tokens (float) and last_refill (unix ms). On each request the Lua script computes tokens earned since the last call, adds them up to the max, then attempts to consume one. The entire read-modify-write is atomic.
Why it handles bursts well: if a user makes no requests for 30 seconds, they accumulate up to max tokens. Their next burst of requests drains the bucket instantly, which is intentional. The downstream system must be designed to handle these bursts, but for most APIs this is preferable to queuing.
If bursts are unacceptable downstream, the mitigations are setting burst_multiplier = 1 (bucket capacity equals the sustained limit, no burst headroom) or switching to the sliding window counter, which smooths the rate without queuing.
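The refill-and-consume logic is compact enough to show in full. A single-process sketch, with time passed in explicitly to keep it testable; the distributed design moves exactly this read-modify-write into the atomic Redis Lua script shown later:

```python
class TokenBucket:
    """Single-process token bucket (sketch, names illustrative)."""
    def __init__(self, max_tokens, refill_rate):
        self.max = max_tokens
        self.rate = refill_rate        # tokens per second
        self.tokens = max_tokens       # start full
        self.last = 0.0                # timestamp of last check

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at the max.
        elapsed = now - self.last
        self.tokens = min(self.max, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens < 1:
            return False               # empty bucket -> 429
        self.tokens -= 1               # consume one token
        return True
```

With `max_tokens = 5` and `refill_rate = 1`, a client idle for 10 s can burst 5 requests instantly, then sustains 1 req/s, the behaviour the text describes as intentional.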
② Fixed Window Counter, deep dive Simplest, boundary risk ›
Divide time into fixed buckets (e.g. every 60 seconds). Each bucket holds a counter, keyed by rl:{userId}:{windowTs} where windowTs = floor(now / window) × window. Increment on every request; reject if the count exceeds the limit. The key expires automatically at the end of the window.
Redis representation: a single integer key. INCR rl:u123:1711584000 followed by EXPIRE on first write. Two commands, but the race between them is acceptable here: unlike the token bucket, there is no read-before-write, so the TOCTOU window is narrow (only the EXPIRE can be lost, not the count).
The boundary problem: a user can legally make 100 requests at 11:59:59 and another 100 at 12:00:00, both within their respective windows, but 200 requests in 2 seconds. Any downstream service sized for 100 req/min sees a 200-request spike. This is the seam problem, and it's why fixed window is not recommended as a default.
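A single-process sketch of the fixed window, with the key shape mirroring the rl:{userId}:{windowTs} scheme; the last line of the usage below is the 2× boundary spike in action:

```python
from collections import defaultdict

class FixedWindowCounter:
    """Single-process fixed window (sketch). In Redis this is INCR
    on rl:{userId}:{windowTs} with an EXPIRE set on first write."""
    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.counts = defaultdict(int)   # (user, window_ts) -> count

    def allow(self, user, now):
        window_ts = (now // self.window) * self.window   # floor to bucket
        self.counts[(user, window_ts)] += 1
        return self.counts[(user, window_ts)] <= self.limit
```

Usage: with limit 2 per 60 s, a user denied at t = 59 is allowed again at t = 60, because t = 60 floors into a brand-new bucket. That seam is the spike the text warns about.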
③ Sliding Window Counter, deep dive Best accuracy/memory balance ›
A hybrid that approximates a true sliding window using only two fixed-window counters. The formula: count = prev_count × (1 − elapsed/window) + curr_count. The previous window's count is weighted by how much of it overlaps the current sliding window. As elapsed time grows, the previous window contributes less.
Redis representation: two integer keys, the current window bucket and the previous window bucket. The Lua script reads both, applies the formula, and checks against the limit. Still O(1) memory per user regardless of request rate.
Error bound: the approximation assumes request traffic within the previous window was uniformly distributed. Under this assumption, the maximum error at the boundary instant is prev_count × (elapsed / window), a fraction of the previous window's count, not the full limit. As elapsed time grows from 0 toward the window size, the previous window's contribution shrinks to zero. In practice this means the error is well under 1% of the limit for realistic traffic distributions. Cloudflare uses this algorithm at global scale.
Why this over token bucket? Token bucket allows front-loaded bursts. Sliding window counter smooths the rate across the window boundary without queuing. If your NFR says "no more than N requests per minute, measured continuously", not "N tokens that refill at N/60 per second", the sliding window counter is the more accurate implementation.
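Putting the two-bucket scheme and the weighting formula together, a single-process sketch (names illustrative):

```python
class SlidingWindowCounter:
    """Two fixed buckets per user plus the weighting formula (sketch)."""
    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.buckets = {}      # (user, window_ts) -> count

    def allow(self, user, now):
        w = self.window
        curr_ts = int(now // w) * w
        prev = self.buckets.get((user, curr_ts - w), 0)
        curr = self.buckets.get((user, curr_ts), 0)
        elapsed = now - curr_ts
        # count = prev * (1 - elapsed/window) + curr
        estimate = prev * (1 - elapsed / w) + curr
        if estimate >= self.limit:
            return False
        self.buckets[(user, curr_ts)] = curr + 1
        return True
```

Note the boundary behaviour: a user who exhausted the previous window is still denied at the instant a new window opens, because the previous bucket carries full weight. That is precisely the smoothing the fixed window lacks.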
④ Leaky Bucket, deep dive Strict output smoothing ›
Requests enter a FIFO queue and are processed at a fixed drain rate (e.g. 10 req/s). Bursts are absorbed into the queue rather than rejected immediately. If the queue is full, new arrivals are dropped. The output rate is perfectly smooth, downstream services see a constant, predictable load regardless of input burstiness.
Redis representation: a list used as a queue, with a background worker draining it at the fixed rate: LPUSH on arrival, RPOP on each drain tick. A simpler stateless variant stores only the next allowed timestamp per user: on each request, check if now >= next_allowed; if yes, allow and set next_allowed = now + (1 / rate); if no, reject with a Retry-After of next_allowed - now. This avoids the queue entirely while preserving the fixed-rate output guarantee, at the cost of rejecting bursts immediately rather than queuing them.
Why it's rarely the right choice for APIs: queuing adds latency. A burst of 100 requests at 10 req/s drain rate means the 100th request waits 10 seconds before being processed. For an interactive API with a <5 ms p99 target, this is catastrophic. The user experiences a long hang instead of a clean 429.
When it is the right choice: backend processing pipelines where throughput matters more than latency, and where a 429 is worse than a delay, for example, payment processing, batch job submission, or webhook delivery where out-of-order drops are worse than queuing.
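The stateless variant described above fits in a few lines; it keeps the fixed-rate output guarantee without a queue (sketch, names illustrative):

```python
class LeakyBucketStateless:
    """Stateless leaky-bucket variant: store only the next-allowed
    timestamp per user. No queue; bursts are rejected, not delayed."""
    def __init__(self, rate_per_sec):
        self.interval = 1.0 / rate_per_sec   # min spacing between requests
        self.next_allowed = {}               # user -> timestamp

    def allow(self, user, now):
        nxt = self.next_allowed.get(user, 0.0)
        if now >= nxt:
            self.next_allowed[user] = now + self.interval
            return True, 0.0
        return False, nxt - now    # (denied, Retry-After in seconds)
```

At 10 req/s the spacing is 100 ms: a request arriving 50 ms early is rejected with Retry-After ≈ 0.05 s, rather than sitting in a queue as the classic leaky bucket would have it.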
Sliding Window Log Exact count, advanced alternative ›
A fifth option not shown in the diagram above. Instead of counters, store a sorted set of request timestamps. To check the limit, prune entries older than the window and count what remains. This is the only algorithm that gives a perfectly exact sliding window count with no approximation error at window boundaries.
The cost is memory: O(requests) per user rather than O(1). A user making 1,000 requests per window requires 1,000 timestamp entries in Redis. At scale this becomes significant. The Redis operation ZREMRANGEBYSCORE to prune old entries also adds latency proportional to the number of expired entries removed.
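A single-process sketch of the log approach, using a sorted list where Redis would use a sorted set (ZADD / ZREMRANGEBYSCORE / ZCARD); the prune step is where the O(expired) cost shows up:

```python
import bisect

class SlidingWindowLog:
    """Exact sliding window log (sketch, names illustrative)."""
    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.logs = {}          # user -> sorted list of timestamps

    def allow(self, user, now):
        log = self.logs.setdefault(user, [])
        # Prune entries older than the window (ZREMRANGEBYSCORE in Redis).
        cutoff = bisect.bisect_left(log, now - self.window)
        del log[:cutoff]
        if len(log) >= self.limit:
            return False
        bisect.insort(log, now)  # record this request (ZADD in Redis)
        return True
```

Unlike the counter variants, this denies or allows with zero approximation error: a request is admitted exactly when fewer than `limit` requests fall inside the trailing window.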
Our choice for this system: Token Bucket via Redis Lua
Token bucket with a Lua script is the right default for this system. Our NFR in §2 explicitly permits short bursts; the smoother output of a leaky bucket would add unnecessary queuing latency for a user-facing API; and the higher memory cost of a sliding window log is unjustified when the 0.1% over-count tolerance means the sliding window counter approximation is already good enough. Token bucket is the choice the §2 requirements drive you to, not a default picked before reading them.
Redis Lua script, token bucket implementation ›
-- Redis Lua script: token bucket check-and-refill
-- KEYS[1] = "rl:{userId}" -- persistent per-user key (no window: state carries across windows)
-- ARGV[1] = max tokens, ARGV[2] = refill rate/sec, ARGV[3] = now (unix ms)
local key = KEYS[1]
local max = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2]) -- tokens per second
local now = tonumber(ARGV[3]) -- milliseconds
local data = redis.call("HMGET", key, "tokens", "last_refill")
local tokens = tonumber(data[1]) or max
local last = tonumber(data[2]) or now
-- Refill: add tokens proportional to elapsed time
local elapsed = (now - last) / 1000.0
local new_tokens = math.min(max, tokens + elapsed * refill_rate)
if new_tokens < 1 then
return {0, math.ceil((1 - new_tokens) / refill_rate * 1000)} -- {denied, retry_ms}
end
-- Consume one token (HSET replaces deprecated HMSET, supported since Redis 4.0)
redis.call("HSET", key, "tokens", new_tokens - 1, "last_refill", now)
redis.call("PEXPIRE", key, math.ceil(max / refill_rate * 1000)) -- expire after full refill time
return {1, 0} -- {allowed, 0}
One open question: what happens under clock skew? If API servers have different system clocks, the now value passed to the Lua script will vary by caller. The fix is to use Redis's own server time via redis.call("TIME") inside the script; this ensures the reference clock is always the Redis node, not the calling server.
API design
The rate limiter exposes no external API of its own, it's middleware. But it modifies every response, and it exposes an internal management API for setting quotas. Here's what that looks like concretely.
Response headers added to every request
{
"X-RateLimit-Limit": 100, // the ceiling for this user's plan
"X-RateLimit-Remaining": 43, // tokens left in current window
"X-RateLimit-Reset": 1711584000, // Unix ts when window resets / bucket refills
"X-RateLimit-Policy": "100;w=60" // IETF draft: 100 requests per 60-second window
}
// On 429, additionally:
{
"Retry-After": 17, // seconds until the client should retry
"X-RateLimit-Remaining": 0
}
Management API, quota configuration
// Request body
{
"limit": 1000, // sustained rate: max requests per window
"window_seconds": 60, // window size
"algorithm": "token_bucket", // or "sliding_window_counter"
"burst_multiplier": 2 // bucket capacity = limit × burst_multiplier (L5+)
// e.g. limit=1000, burst_multiplier=2 → max_tokens=2000
// a user can burst up to 2000 req instantly if bucket is full
}
// Response 200
{
"userId": "u_8f2a",
"effective_at": "2026-03-28T10:00:00Z",
"propagation_lag_seconds": 60, // cache TTL
"derived": {
"max_tokens": 2000, // limit × burst_multiplier → passed as ARGV[1] to Lua
"refill_rate_per_sec": 16.67 // limit / window_seconds → passed as ARGV[2] to Lua
}
}
Optional endpoints
| Endpoint | Purpose | Level |
|---|---|---|
| GET /admin/v1/quotas/{userId} | Inspect current quota + remaining tokens | L5 |
| DELETE /admin/v1/counters/{userId} | Reset a user's counter (support action) | L5 |
| GET /admin/v1/429-log?userId=… | Audit log of 429 events per user | L6 |
| POST /admin/v1/quotas/bulk | Batch quota update for tenant migration | L7/L8 |
Core flow
Every request through the system follows the same path. The critical decision, allow or deny, happens inside a single atomic Redis operation, which is why the latency stays under 5 ms even at 100k QPS.
Figure 3: Request flow. The Lua script executes in a single Redis round-trip (~0.5 ms LAN). The decision branches immediately after; audit logging is always async.
The key tradeoff in this flow is why the Lua script is one round-trip, not two (GET then SET). Two operations introduce a TOCTOU window where two concurrent requests can both read count=99, both decide "allowed," and both write count=100, admitting two requests when one should be blocked. The Lua script eliminates this entirely.
Classic L5 probe: "What if the Redis write succeeds but the network drops before returning to the middleware, do you count the request or not?" Answer: the token is already decremented, so the request is counted as consumed. This is the right behavior; the alternative (retrying the Lua script) risks double-counting.
Data model
The data model is split between Redis (hot counter state) and a relational database (quota configuration). They serve opposite purposes: Redis is updated on every single request; the config DB is read rarely and written almost never.
Access patterns first
| Operation | Frequency | Query shape |
|---|---|---|
| Check + increment counter | Every request, 100k/s peak | Read-modify-write by key: rl:{userId} (token bucket) or rl:{userId}:{windowTs} (fixed/sliding window) |
| Lookup quota config | Once per 60 s per user (cache TTL) | Point lookup by userId |
| Admin: update quota | Rare, plan changes, support actions | Update by userId |
| Admin: query 429 log | Occasional, debugging, abuse reports | Range scan by userId + timestamp |
Two things stand out: the counter access is hot, write-heavy, and requires atomic read-modify-write; Redis is the only viable store. The config access is cold, read-heavy, and needs no atomicity; a relational DB with local caching is perfect.
Figure 4: Data model. Redis keys are namespaced and auto-expire. Config lives in a relational DB, cached locally on each API server for 60 seconds.
Why TTL on every Redis key? Memory hygiene ›
Without a TTL, inactive users accumulate dead counter keys forever. With a TTL set to the window size, a user who stops making requests will have their key automatically evicted when the window closes. For a 1-minute window, keys live at most 1 minute after the last request.
This is what keeps Redis memory bounded (see §3, even at 1M users, active keys are tiny) and means you never need a background cleanup job.
Caching strategy
The rate limiter has an inverted caching profile compared to most systems: Redis is the hot state store, not a cache in front of something slower. What you're actually caching is everything that sits in front of Redis, specifically, quota config lookups that would otherwise add a second Redis round-trip to every request. Understanding this inversion matters because it changes what "cache invalidation" means: invalidating a counter is never necessary (TTLs handle it), but invalidating a quota config entry on plan change is a real operational concern.
Placing the layers against the §04 architecture makes the picture concrete. Every request passes through three potential cache interactions before a decision is made:
Figure 6: Where caching fits in the request path. The in-process cache inside the middleware serves quota config without touching Redis. Redis is the authoritative counter store, every request reads and writes it. The Config DB is only reached on a cache miss.
| What | Lives in | Holds | TTL | Why |
|---|---|---|---|---|
| In-process config cache | Rate limiter middleware (per instance) | User quota config (limit, window, algorithm) | 60 s | Avoids a Redis round-trip for config on every request; cold miss falls through to Config DB |
| In-process circuit breaker | Rate limiter middleware (per instance) | Redis outage flag | 10 s | Prevents thundering herd of retries to a failing Redis; resets automatically after 10 s |
| Redis counter store | Shared Redis cluster | Counter state (tokens, last_refill) | = window size | Authoritative rate limit state, not a cache. Written atomically on every request via Lua script. |
Cache invalidation trap: When a user's plan is downgraded, the local config cache will still show the old (higher) limit for up to 60 seconds. This means a just-downgraded user can send 60 more seconds of Pro-tier traffic. For most products this is acceptable. If it isn't, you need push invalidation, the config service sends a pub/sub event to all API servers to evict the specific key immediately.
Deep-dive scalability L5+
At 100k QPS with a single Redis node, you're likely fine: Redis handles ~1M simple ops/sec (GET, SET, INCR). However, Lua scripts are significantly heavier than raw commands: a realistic ceiling for Lua executions is 200–400k/sec on typical hardware, because each script blocks Redis's single thread for longer than a primitive op. At 100k QPS you have headroom; at 500k+ QPS, Redis Cluster becomes necessary. Production systems also have to think about what happens when demand grows, when Redis is partitioned, and when consistency guarantees need to be loosened for throughput.
Figure 5: Production-scale architecture. Redis cluster shards counters across 3 primaries. 429 events stream to Kafka and into ClickHouse for abuse analytics.
Redis Cluster sharding L5, throughput ›
Redis Cluster distributes keys across 16,384 hash slots assigned to N primary nodes. For rate limiting, each user's counter key hashes to exactly one shard, so all operations for a given user are atomic on that shard. No cross-shard transactions are needed.
A 3-primary cluster with each primary handling ~200–300k Lua executions/sec (consistent with the ceiling established in §09) gives headroom to ~600k–900k QPS before horizontal scaling is needed. Add shards by resharding; Redis Cluster handles this online.
Local counter + periodic sync (approximate) L6, extreme throughput ›
Above ~5M QPS, even Redis Cluster becomes a bottleneck. The alternative: each API server maintains an in-process counter and periodically (every 100 ms) syncs with Redis via INCRBY. Between syncs, the distributed count can be off by at most servers × local_count. This is an explicit trade: approximate counting for linear write scalability.
This design is used by Cloudflare and is acceptable when admitting up to ~5% of traffic over the limit is a tolerable SLA. It's unacceptable for security-critical limits (authentication attempts, payment API).
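A sketch of the local-counter pattern, with a plain dict standing in for the shared Redis INCRBY target; the lag between flushes is the source of the bounded approximation error (all names illustrative):

```python
class ApproximateCounter:
    """Local counter + periodic sync (sketch). Each API server counts
    locally and flushes to the shared store every sync_interval; between
    flushes the global view lags by the sum of unflushed local deltas."""
    def __init__(self, shared, key, sync_interval):
        self.shared = shared            # stands in for Redis (INCRBY target)
        self.key = key
        self.sync_interval = sync_interval
        self.local = 0                  # unflushed increments on this server
        self.last_sync = 0.0

    def record(self, now):
        self.local += 1
        if now - self.last_sync >= self.sync_interval:
            self.flush(now)

    def flush(self, now):
        self.shared[self.key] = self.shared.get(self.key, 0) + self.local
        self.local = 0
        self.last_sync = now

    def global_estimate(self):
        # Shared total plus this server's unflushed delta; other servers'
        # unflushed deltas are invisible, which is the bounded error.
        return self.shared.get(self.key, 0) + self.local
```

Every write is now a local memory increment; the shared store sees one INCRBY per server per interval instead of one per request, which is where the linear write scalability comes from.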
Multi-tier rate limits L6, tenant design ›
Enterprise products often have nested limits: per-user, per-team, and per-organization, all enforced simultaneously. A request must pass all three checks. The implementation: the Lua script is called once per dimension, in order. If any check fails, the 429 is returned immediately with the header naming which dimension was exhausted: X-RateLimit-Violated: org.
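The per-dimension loop is straightforward; a sketch with a stub limiter (in production each allow call is one Lua-script invocation). One subtlety the sequential approach inherits: dimensions checked before the failing one have already consumed a token, which a production implementation may want to refund.

```python
class StubLimiter:
    """Stand-in for a real per-dimension limiter (token bucket, etc.)."""
    def __init__(self, remaining):
        self.remaining = remaining

    def allow(self, now):
        if self.remaining <= 0:
            return False
        self.remaining -= 1
        return True

def check_all_dimensions(limiters, now):
    """Check nested limits in order; first exhausted dimension wins."""
    for dimension, limiter in limiters.items():   # e.g. user -> team -> org
        if not limiter.allow(now):
            return False, dimension   # -> 429 + X-RateLimit-Violated header
    return True, None
```

Returning the failing dimension name is what lets the middleware emit X-RateLimit-Violated: org (or team, or user) in the 429 response.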
Geo-distributed rate limiting L7/L8, global consistency ›
If a user can send requests to two different regions simultaneously, per-region counters will each allow the full quota, effectively doubling the limit. True global rate limiting requires cross-region counter sync. Options: (1) route all requests for a user to their "home" region (sticky routing, single point of failure); (2) use a globally replicated database (Google Spanner or DynamoDB Global Tables) that offers low-latency strongly-consistent reads across regions; (3) accept approximate counting with gossip-based sync.
Most systems choose option 3: accept a known over-admission window at region boundaries (requests beyond the limit slip through while counters converge), monitor for abuse via the audit log, and address egregious violations reactively.
How gossip-based sync works: each region periodically (e.g. every 100–500 ms) broadcasts its local counter delta to peer regions. Each region maintains a per-region counter vector and sums them to get the global count. This is a form of CRDT (Conflict-free Replicated Data Type), specifically a G-Counter: each region owns one slot in the vector and only increments it, and merging is simply taking the max of each slot across all received vectors. Convergence time equals the gossip interval plus network RTT between regions (typically 100–300 ms cross-continent), giving a bounded over-admission window of at most limit × (gossip_interval / window_size).
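The G-Counter mechanics fit in a few lines (sketch): each region increments only its own slot, and the gossip merge is an element-wise max, which is commutative, associative, and idempotent, so regions converge regardless of message ordering or duplication.

```python
class GCounter:
    """Grow-only counter CRDT, one slot per region (sketch)."""
    def __init__(self, region, regions):
        self.region = region
        self.slots = {r: 0 for r in regions}

    def increment(self, n=1):
        self.slots[self.region] += n   # a region only bumps its own slot

    def merge(self, remote_slots):
        # Element-wise max: safe under reordered or duplicated gossip.
        for r, v in remote_slots.items():
            self.slots[r] = max(self.slots.get(r, 0), v)

    def value(self):
        return sum(self.slots.values())   # the global request count
```

Because slots only grow and merge only takes maxima, receiving the same gossip message twice, or out of order, never corrupts the count; it can only be temporarily stale.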
Failure modes & edge cases
| Scenario | Problem | Solution | Level |
|---|---|---|---|
| Redis node goes down | All counter checks fail; traffic either stops or goes unchecked | Circuit breaker: fail open for non-auth endpoints. Alert on-call (e.g. PagerDuty). Redis Sentinel (the built-in high-availability manager) detects failure within ~30 s (default down-after-milliseconds); total failover including election and promotion typically takes 30–60 s end-to-end. | L3/L4 |
| Window boundary spike | Fixed-window allows 2× limit at boundaries, downstream overload | Switch to token bucket or sliding window counter algorithm, which smooth across boundaries | L3/L4 |
| Lua script race condition | Two servers read the same count with GET, both decide "allowed," both write, admitting one request too many (TOCTOU on GET+SET, not INCR itself) | All state mutation in a single Lua script, atomic on the Redis thread, eliminating the read-modify-write race without distributed locks | L5 |
| Plan upgrade cache lag | User upgrades plan; still sees old lower limit for 60 s | Pub/sub invalidation: config service sends evict event to all API servers on plan change | L5 |
| Clock drift between API servers | Window timestamps differ across nodes, same user gets different windows | Use redis.call("TIME") inside Lua script to use Redis server clock, not application clock | L5 |
| Header spoofing (X-Forwarded-For) | Attacker spoofs IP header to bypass IP-based limits | Trust X-Forwarded-For only from known load balancer CIDR ranges, never from arbitrary clients. Prefer X-Real-IP set by the LB itself, which clients cannot inject. For security-critical limits (auth endpoints, payment APIs), rate limit by authenticated user ID as the primary key; treat IP-based limits as a last-resort signal, not a primary defence. | L6 |
| Thundering herd on Redis restart | All API servers simultaneously miss local cache after Redis restart, 100k req/s hit cold Redis | Jittered local cache TTL (60 s ± 10 s random) + request coalescing: one inflight config fetch per key | L7/L8 |
| Distributed counter drift at geo boundaries | User sends 90 req to US-EAST and 90 req to EU-WEST; both regions allow; effective limit is 2× | Accept approximate over-count. Use audit log anomaly detection to flag and throttle after the fact. For auth endpoints: use sticky geo routing. | L7/L8 |
| Idempotency key on retry | A POST fails mid-flight and is retried with the same idempotency key: should the retry consume a token? | For APIs that support idempotency keys (Stripe-style), do not count idempotent retries against the quota; the client is replaying a request that already consumed a token. Store the idempotency key with the original response in a short-TTL Redis key (24 h); on retry, return the cached response without re-entering the rate limiter. This must be established as a requirement in §2; it changes the key structure and adds a pre-check lookup step to the core flow. | L6 |
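The idempotency pre-check from the last row can be sketched with a dict standing in for the Redis response cache and a crude stub in place of the real limiter (all names here are illustrative, not from any specific SDK):

```python
def handle_request(cache, limiter, idem_key, execute):
    """Idempotency pre-check runs BEFORE the rate limiter.

    A retry with a known key returns the cached response and never
    consumes a second token. In Redis, `cache` would be a key with a
    24 h TTL (SET ... EX 86400).
    """
    if idem_key in cache:          # replay: original already consumed a token
        return cache[idem_key]
    if not limiter():
        return ("429", None)
    response = execute()
    cache[idem_key] = response
    return response

cache = {}
tokens = [2]  # stub standing in for the real rate limiter's budget

def limiter():
    if tokens[0] <= 0:
        return False
    tokens[0] -= 1
    return True

first = handle_request(cache, limiter, "idem-abc", lambda: ("200", "created"))
retry = handle_request(cache, limiter, "idem-abc", lambda: ("200", "created"))
assert first == retry == ("200", "created")
assert tokens[0] == 1  # the retry did not consume a second token
```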
How to answer by level
L3/L4 SDE I / SDE II, Can you build a working system? ›
- Pick token bucket or fixed window, justify the choice
- Use Redis with INCR + EXPIRE
- Return 429 with Retry-After header
- Identify (but don't need to fully solve) the race condition
- Mention per-user key scoping
- No thought for distributed atomicity
- Doesn't ask about burst tolerance vs smoothing
- Missing TTL reasoning on Redis keys
- No mention of fail-open vs fail-closed
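The L3/L4 baseline above fits in a few lines. A minimal sketch with a dict standing in for Redis (names are illustrative; the comments mark where the real INCR and EXPIRE calls would go, and why TTL reasoning matters):

```python
import time

def fixed_window_allow(store, user_id, limit, window=60, now=None):
    """Fixed-window check, mirroring Redis INCR + EXPIRE.

    The key embeds the window start (rate:{user_id}:{window_start}),
    so in Redis each key can simply EXPIRE with its window and no
    manual cleanup is ever needed, which is the TTL reasoning the
    rubric asks for.
    """
    now = time.time() if now is None else now
    window_start = int(now // window) * window
    key = f"rate:{user_id}:{window_start}"
    store[key] = store.get(key, 0) + 1   # Redis: INCR key; EXPIRE key window
    return store[key] <= limit           # False -> respond 429 + Retry-After

store = {}
results = [fixed_window_allow(store, "u1", limit=3, now=1000.0) for _ in range(4)]
assert results == [True, True, True, False]
```

Note that as two separate Redis commands, INCR and EXPIRE are not atomic; naming that gap (and its Lua-script fix) is exactly what separates an L5 answer from this one.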
L5 Senior SDE, Do you understand the tradeoffs? ›
- Lua script for atomicity, explains why two separate commands aren't safe
- Chooses algorithm based on burst tolerance NFR, not by default
- Designs per-user + per-IP dimensions, explains the key structure
- Addresses clock drift via Redis TIME
- Proactively raises fail-open semantics
- Single-tier limits only (no tenant/org hierarchy)
- No plan for global/cross-region consistency
- Doesn't think about approximate counting as a deliberate tradeoff
L6 Staff SDE, Can you own this end-to-end? ›
- Multi-tier limits (user → team → org), Lua script runs all checks
- Push invalidation for plan changes
- Sliding window counter approximation vs exact, chooses based on use case
- Audit log pipeline to ClickHouse for abuse analytics
- Addresses header spoofing and IP trust chain
- Doesn't model the correctness guarantees under network partition
- No framework for deciding when approximate counting is "good enough"
- Doesn't reason about multi-region topology as a first-class constraint
At L7, you're expected to define an explicit correctness model, e.g., "we accept at most 5% over-count during partition windows lasting under 30 seconds", rather than just naming the problem. The difference is quantified tolerance with a rationale, not just awareness that drift exists.
L7/L8 Principal / Distinguished, Should we build this, and how? ›
- Frames approximate vs exact counting as a deliberate correctness model with quantified error bounds
- Designs for geo-distributed consistency: sticky routing vs gossip sync vs Spanner
- Considers cell-rate algorithm (GCRA) for strict burst SLAs
- Thinks about what the rate limiter protects against vs what it can't (state exhaustion, amplification attacks)
- Proactively surfaces build vs buy decision (Kong, Envoy, AWS API GW)
- Shows the reasoning behind every tradeoff, not just the conclusion
- Identifies failure modes the interviewer hasn't raised
- Can quantify the error bound of approximate counting and defend when it's acceptable
- Drives the conversation, doesn't just respond to prompts
Classic probes, level-differentiated answers
| Probe | L3/L4 | L5/L6 | L7/L8 |
|---|---|---|---|
| "Two servers increment the same counter simultaneously, what happens?" | Identifies the race. Suggests INCR + EXPIRE (not fully atomic). | Solves with Lua script. Explains why WATCH/MULTI is worse (retries, latency). | Discusses whether this race even matters given approximate counting mode. Quantifies the overcounting risk. |
| "Redis goes down. What does your rate limiter do?" | Mentions fail-open vs fail-closed. Picks one. | Designs circuit breaker. Differentiates behavior by endpoint sensitivity. Redis Sentinel failover timeline. | Models the blast radius: how many requests leak during a 30–60 s failover? Is that acceptable given the SLA? Proposes multi-AZ Redis for lower RTO. |
| "How do you rate limit a user who is hitting two different regions?" | Doesn't raise this. Assumes single-region. | Identifies that per-region counters allow 2× limit. Proposes sticky geo routing as simplest fix. | Compares: sticky routing (simple, single PoF) vs globally replicated DB (Spanner, strongly consistent, expensive) vs gossip sync (eventual consistency, tunable error bound). Chooses based on use case sensitivity. |
| "Token bucket vs sliding window, which do you pick and why?" | Picks token bucket. Can describe it. Doesn't connect choice to requirements. | Derives the choice from burst tolerance NFR. If burst is acceptable, token bucket. If exact, sliding window counter. If output smoothing needed, leaky bucket. | Adds: sliding window log is the only exact algorithm; others are approximations. Discusses GCRA (Generic Cell Rate Algorithm) as the rigorous foundation for token bucket. Notes that "burst" itself needs a formal definition in the SLA. |
How the pieces connect
Every architectural decision in this article traces back to a requirement set in §2 or a number established in §3.
- 1 Latency < 5 ms NFR (§2) → rules out DB reads on every request → mandates Redis as counter store (§4) → config cached locally at 60 s TTL (§8)
- 2 99.99% availability NFR (§2) → Redis outage is a real failure scenario → fail-open circuit breaker (§4, §10) → behavior differs by endpoint sensitivity (§11)
- 3 Race condition risk (§5) → two-command INCR+EXPIRE is not atomic → Lua script encapsulates all state mutation in one round-trip (§5, §6) → eliminates TOCTOU without distributed locks
- 4 Configurable per-tenant limits (§2) → quota config lookup on every request → in-process cache with 60 s TTL (§8) → push invalidation needed for plan downgrades (§10)
- 5 Audit / observability NFR (§2) → every 429 must be logged → async Kafka pipeline (§4, §9) → ClickHouse for abuse analytics without adding request-path latency
- 6 Burst tolerance choice (§5 algorithm decision) → token bucket allows burst, leaky bucket smooths output → drives whether downstream services must themselves be burst-tolerant (§4 component rationale)
- URL Shortener System Design — hash-based ID generation, Redis caching strategy, and async analytics pipeline
- Web Crawler System Design — Bloom filter deduplication, politeness throttling, and distributed frontier design
- Twitter Feed System Design — fan-out write amplification, hybrid push/pull strategy, and celebrity threshold design
- Notification Service System Design — multi-channel delivery, idempotency keys, and priority queues at scale
- Key-Value Store System Design — consistent hashing, eviction policies, replication, and failure modes
- Search Autocomplete
- Chat System (WhatsApp) System Design — WebSocket architecture, message delivery guarantees, and fan-out