intervu.dev, System Design Series

Design a Rate Limiter, System Design Interview Guide

Simple to explain, surprisingly hard to get right at scale. The gap between a correct counter and a correct distributed counter is one atomic operation, and everything else follows from that.


~25 min read  ·  All levels  ·  Updated March 2026

A bouncer turning away API requests at the rate limiter nightclub
01

What the interviewer is testing

A rate limiter question looks deceptively narrow. Most candidates sketch a counter and call it done. The interviewer is watching for something else: whether you can hold multiple constraints in tension at the same time — low latency, high availability, and fairness — without sacrificing one for another. (No, wait: no dashes.) The interviewer is watching for something else: whether you can hold low latency, high availability, and fairness in tension at the same time, without sacrificing one for another.

The question also acts as a level filter. At L3/L4, the bar is a working single-node design. At L5, you need to identify the distributed race condition and solve it atomically. At L6 and above, the interviewer wants to see you reason about what "correct" even means when nodes disagree, and when it's acceptable to over-count.

| Level | Core expectation | Where candidates fall short |
| --- | --- | --- |
| L3/L4 | Working token bucket or fixed window, correct HTTP 429 response, basic Redis key structure | Forgetting the race condition; no expiry on Redis keys |
| L5 | Atomic Lua script, sliding window algorithm, per-user + per-IP dimensions | Treating Redis as always available; no fallback plan |
| L6 | Multi-tier limits (user, tenant, global), quota inheritance across tiers, async audit log | No thought for quota inheritance or bulk API consumers |
| L7/L8 | Approximate counting tradeoffs, gossip-based quota sync, cell-rate algorithm for bursting SLAs | Optimizing for the happy path only; no model for correctness under partition |
💬

The earliest signal is the question "What counts as a request?" Does a retry count? Does a request that errors before reaching the database count? Candidates who think to ask this tend to do well.

02

Requirements clarification

Before touching algorithms, establish what the rate limiter needs to protect and what "fair" means. The answers change the data model and the consistency model entirely.

Functional requirements

| Question | Typical answer | Why it matters |
| --- | --- | --- |
| What dimensions to limit on? | Per user, per IP, per API key | Each dimension needs a separate Redis key namespace |
| What endpoints need limiting? | All, via gateway middleware | Centralising at the gateway avoids per-service duplication |
| Hard limit or soft? | Hard, reject at threshold | Soft limits need a separate overflow queue design |
| Should limits be configurable per tenant? | Yes, tiers (Free / Pro / Enterprise) | Drives a config-lookup layer, not just static constants |
| Should retries count? | Yes, same as a new request | Simplifies the counter; retry logic is client responsibility |
| Do you support idempotency keys? | No, out of scope for this design | If yes, idempotent retries must not consume a token; requires a pre-check lookup and changes the Redis key structure (see §10 edge cases) |

Non-functional requirements

| NFR | Target |
| --- | --- |
| Added latency per check | < 5 ms p99 |
| Availability | 99.99%, fail open on Redis outage |
| Accuracy | At most 0.1% over-count; under-count minimised |
| Scale | 1M unique users, 100k requests/sec peak |
| Auditability | All 429s logged async within 30 s |
Latency < 5 ms p99 · Drives the Redis choice

Every API call incurs an extra network hop for the rate limit check. At 5 ms the check is invisible to humans but measurable by services with tight SLAs. This target effectively rules out any approach that requires a synchronous DB read; Redis at sub-millisecond latency is the only practical option.

⚖ Tradeoff Moving the counter into the API server's memory (in-process) gets latency to < 0.1 ms, but then you can't share counts across servers. You'd need periodic sync, which opens a window for over-counting.
Alternatives: in-process counter (fastest, not distributed); Memcached (similar latency to Redis, no Lua atomicity)
Fail open on Redis outage · Drives availability design

A 99.99% availability target means the rate limiter must not lower the availability of the services it protects. If the Redis cluster is unreachable, the right default for most consumer APIs is to let traffic through rather than block everything. This is called "fail open."

The alternative, "fail closed", is appropriate when the rate limiter protects a payment endpoint or prevents fraud. Clarify with the interviewer which mode fits the use case.

⚖ Tradeoff Failing open means an attacker can exploit Redis downtime. A brief circuit breaker window (e.g. 10 s of fail-open before alerting) balances safety and availability.
Per-tenant configurable limits · Drives config layer

Static constants hardcoded in the limiter only work for a single-tier product. Once you have Free, Pro, and Enterprise tiers, each with different request quotas, you need a config lookup: given a user ID, what is their limit? This lookup happens on every request, so it must itself be cached; a short TTL (60 s) in a local in-process cache is the standard approach.
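That local cache is a few lines of code. A minimal Python sketch, assuming an illustrative `QuotaCache` class and a `fetch_quota` callback standing in for the config DB lookup (neither is a real library API):

```python
import time

class QuotaCache:
    """In-process quota cache with a short TTL (illustrative names)."""

    def __init__(self, fetch_quota, ttl_seconds=60, clock=time.monotonic):
        self._fetch = fetch_quota        # fallback lookup: config DB / service
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}               # user_id -> (quota, expires_at)

    def get(self, user_id):
        now = self._clock()
        hit = self._entries.get(user_id)
        if hit and hit[1] > now:
            return hit[0]                # fresh hit: no extra round-trip
        quota = self._fetch(user_id)     # miss or stale: one config lookup
        self._entries[user_id] = (quota, now + self._ttl)
        return quota

# Demo with a stub fetcher and a controllable clock:
calls = []
def fetch_quota(user_id):
    calls.append(user_id)                # count real lookups
    return {"limit": 100, "window_seconds": 60}

clock = [0.0]
cache = QuotaCache(fetch_quota, ttl_seconds=60, clock=lambda: clock[0])
cache.get("u1"); cache.get("u1")         # second call is a cache hit
clock[0] = 61.0                          # TTL expired
cache.get("u1")                          # triggers a refetch
```

The 60 s TTL is exactly the propagation lag for plan changes discussed above: a downgrade is enforced at most one TTL after the config write.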

Alternatives: JWT claims with the limit embedded (avoids the lookup, stale on plan change); config service with push invalidation (fresher, more complex)
Accuracy: over-count ≤ 0.1%, under-count minimised · Drives algorithm choice

The asymmetry between over-count and under-count is the key constraint. An over-count means a legitimate request gets a 429, a UX failure. An under-count means a request that should be blocked gets through, a security failure. These are not equally bad for most systems, which is why the NFR prioritises minimising under-counting while tolerating a small over-count rate.

In practice, brief under-counts are unavoidable at partition boundaries, during crash windows, and when using approximate counting modes (§09). The NFR means the steady-state design must not structurally under-count, not that zero under-count events are possible.

The 0.1% over-count tolerance is what makes the sliding window counter approximation viable as an alternative to the exact sliding window log. The approximation's error is bounded by a fraction of the previous window's count, not by the full limit, which keeps it well within 0.1% under realistic traffic distributions. If this tolerance were tighter (e.g. 0%), you'd be forced into the sliding window log with its O(requests) memory cost.

⚖ Tradeoff Token bucket allows burst, which means the instantaneous request rate can briefly exceed the configured limit. Whether this counts as "over-count" depends on how you define the window; align with the interviewer before choosing an algorithm.
Scale: 1M users, 100k req/s peak · Drives Redis cluster design

100k requests/sec means 100k Redis Lua script executions per second at peak. A single Redis node handles ~200–400k Lua executions/sec (see §9), so a single node is sufficient at this scale, but only if key distribution is even. If all traffic concentrates on a small number of hot users, one shard becomes a bottleneck.

The 1M unique users figure matters for memory sizing (see §3; it's small) and for config cache design: with a 60 s TTL, each API server's in-process cache holds only the users that server has served in the last minute, at most a few thousand live entries at any moment, not 1M.

⚖ Tradeoff Designing for 100k req/s now means the architecture must be horizontally scalable without a rewrite. Redis Cluster sharding (§9) is the natural extension path; make sure your key structure is compatible with cluster hash slots from day one.
Auditability: all 429s logged within 30 s · Drives async pipeline

Rate limit decisions must be debuggable ("why was user X blocked?") and auditable for abuse detection. Logging synchronously in the request path is ruled out immediately: even a 1 ms write to a log store adds to the p99 latency budget and would push the limiter past its 5 ms target at volume.

The 30 s bound is generous enough to allow an async pipeline: the middleware emits a 429 event to a local buffer, a background thread flushes to a durable message queue (Kafka), and a stream processor writes to a columnar analytics store (ClickHouse) for querying. This path never touches the hot request thread.
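The first stage of that pipeline, the local buffer with a background flusher, can be sketched in Python. This is an illustrative sketch, not a real client library; the `sink` callable stands in for a Kafka producer's batched send:

```python
import queue
import threading
import time

class AuditBuffer:
    """Fire-and-forget 429 event buffer (illustrative sketch).

    emit() is O(1) and never blocks the request thread; a daemon thread
    batches events and hands them to the sink (a Kafka producer stand-in).
    """

    def __init__(self, sink, batch_size=100, flush_interval=1.0):
        self._q = queue.Queue()
        self._sink = sink                 # callable taking a list of events
        self._batch = batch_size
        self._interval = flush_interval
        threading.Thread(target=self._run, daemon=True).start()

    def emit(self, event):                # hot path: enqueue and return
        self._q.put_nowait(event)

    def _run(self):
        while True:
            events = [self._q.get()]      # block until at least one event
            while len(events) < self._batch:
                try:
                    events.append(self._q.get(timeout=self._interval))
                except queue.Empty:
                    break                 # partial batch after a quiet interval
            self._sink(events)            # one batched write, off the hot path

# Demo: a list stands in for the durable queue.
events_out = []
audit = AuditBuffer(events_out.extend, batch_size=2, flush_interval=0.05)
audit.emit({"user_id": "u1", "status": 429})
audit.emit({"user_id": "u2", "status": 429})
time.sleep(0.5)                           # give the background thread time to flush
```

The crash window named in the tradeoff below is visible here: events sitting in `self._q` when the process dies are gone.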

⚖ Tradeoff Async logging means a crash window exists where 429 events are lost. For most abuse-detection use cases this is acceptable; missing a few events doesn't invalidate a pattern. For compliance use cases (PCI, SOC 2) you may need at-least-once delivery guarantees from Kafka, which adds operational complexity.
03

Capacity estimation

The capacity picture for a rate limiter differs from most systems: storage is tiny, latency is everything, and write volume dominates. Every request is a write, a counter increment. There are no reads that aren't also writes.

Interactive estimator

[Interactive estimator fields: avg and peak write QPS (requests/sec); active keys in a 1-minute window (keys in Redis); Redis memory for active counters (MB); 429 log events/day (~2% of total); API servers needed (@ 5k req/s each).]
💡

The key insight: Redis memory for the counters is almost always trivially small; a million active 1-minute windows cost under 100 MB. The binding constraint is Redis write throughput at peak QPS, not memory. That's why the design question is really about atomic writes, not storage. The active key count assumes 10% of users are concurrently active at peak, typical for consumer APIs; adjust upward for social/messaging workloads or downward for B2B.
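The estimator's arithmetic is easy to reproduce. A sketch under this section's assumptions; the ~100 bytes per counter key is an assumed Redis overhead (key name + small hash + metadata), not a measured figure:

```python
# Back-of-envelope capacity numbers for the rate limiter.
users = 1_000_000
active_fraction = 0.10           # share of users active in any 1-minute window
bytes_per_key = 100              # assumed per-key Redis overhead
peak_qps = 100_000
reject_rate = 0.02               # ~2% of requests end in a 429

active_keys = int(users * active_fraction)            # live counter keys
redis_memory_mb = active_keys * bytes_per_key / 1e6   # counters are tiny
log_events_per_day = int(peak_qps * reject_rate * 86_400)  # upper bound at peak
api_servers = peak_qps // 5_000                       # at 5k req/s per server
```

This yields 100k active keys and ~10 MB of counter memory, which is why the binding constraint is write throughput, not storage.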

04

High-level architecture

The rate limiter lives in the request path, after load balancing and before the business logic. This placement is deliberate: it must intercept every request regardless of which API server processes it, and it must do so without adding significant latency.

[Figure: Client (mobile / web) → Load Balancer → Rate Limiter Middleware (API gateway / sidecar) → on allow, API Server (business logic); on deny, 429 back to the client. The middleware reads and writes Redis counters (INCR / Lua), consults quota rules from the Config store (cached 60 s), and emits 429 events to the Audit Log asynchronously. Legend: synchronous vs async paths.]

Figure 1: High-level rate limiter architecture. The middleware intercepts every request, consults Redis atomically, and either forwards the request or returns a 429. Config rules are cached locally for 60 s.

Component breakdown

Load balancer. Distributes traffic across stateless API servers. Does not enforce rate limits itself: shared state would make it a bottleneck. If limiting by IP, it must preserve the original client IP in X-Forwarded-For rather than replacing it with its own address.

Rate limiter middleware. The core enforcement point. Can be deployed as an API gateway plugin (Kong, Nginx Lua), a sidecar proxy (Envoy), or in-process middleware. Does three things on every request: looks up the user's quota (from local cache), atomically checks and decrements the counter in Redis via Lua script, and either forwards the request or returns a 429.

Redis (counter store). The shared state layer. Every API server writes to the same Redis cluster, so the count is accurate regardless of which server handles a given request. Keys are namespaced by user ID and time window with a TTL set to the window size. Atomicity is achieved via a Lua script, a single round trip that reads, decrements, sets expiry, and returns the result.

Config store. Stores quota rules per tenant (Free: 100 req/min, Pro: 1,000 req/min). Read on every request but cached locally per API server with a 60-second TTL, so the lookup never adds a second Redis round-trip to the hot path. Plan changes propagate within 60 s; push invalidation is needed for immediate downgrade enforcement.

Audit log (async). Every 429 is published to a durable message queue (Kafka) and consumed by a stream processor that writes to a columnar analytics store (ClickHouse) for abuse analytics and querying. Serves two purposes: debugging ("why was user X blocked?") and abuse detection ("user X hit the limit 500 times in 5 minutes"). Writing is fire-and-forget from the request path, no latency added.

Architectural rationale

① Load balancer · Routing layer

The load balancer's job is distribution, not enforcement. Enforcing rate limits at the LB would require shared state between LB nodes, turning a stateless routing layer into a stateful bottleneck. Keeping limits in the middleware layer behind the LB means the LB stays simple and horizontally scalable.

One non-obvious requirement: if you rate limit by IP, the LB must forward the original client IP in X-Forwarded-For. Many default configurations replace the source IP with the LB's own address, meaning all traffic appears to come from one IP, and IP-based limits collapse into a global limit.

⚖ Tradeoff Only trust X-Forwarded-For from known LB CIDR ranges. Clients can spoof this header to bypass IP-based limits if the middleware trusts it unconditionally.
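That trust rule is easy to get wrong in code. A hedged sketch using Python's stdlib `ipaddress` module; the CIDR range is an illustrative assumption, and in production you would walk the chain of all your trusted proxies:

```python
import ipaddress

# Assumed internal LB range, for illustration only; use your real LB CIDRs.
TRUSTED_LB_CIDRS = [ipaddress.ip_network("10.0.0.0/8")]

def client_ip(peer_ip, x_forwarded_for=None):
    """Resolve the IP to rate limit on: trust X-Forwarded-For only when
    the direct peer is a known load balancer."""
    peer = ipaddress.ip_address(peer_ip)
    if x_forwarded_for and any(peer in net for net in TRUSTED_LB_CIDRS):
        # The rightmost entry was appended by our own LB; entries to its
        # left are client-supplied and spoofable.
        return x_forwarded_for.split(",")[-1].strip()
    return peer_ip   # untrusted peer: limit on the connection's own IP
```

A client connecting directly and sending a forged X-Forwarded-For gets limited on its real address, closing the bypass described above.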
② Rate limiter middleware · Core component

Deployment options differ meaningfully. An API gateway plugin (Kong, Nginx Lua) is the simplest: one place to configure, one place to update. But it shares the gateway's blast radius: a middleware bug can take down all routing. A sidecar proxy (Envoy) isolates the rate limiter's failure domain at the cost of a network hop per request. In-process middleware (Express.js, FastAPI) is fastest but per-service, making language-agnostic enforcement impossible.

⚖ Tradeoff Gateway plugin vs sidecar: plugin is operationally simpler and has lower latency; sidecar adds ~0.5 ms per hop but isolates failures. For most APIs, the gateway plugin is the right default.
Alternatives: Kong / Nginx Lua gateway plugin (simplest operational model); Envoy sidecar (best failure isolation); in-process (lowest latency, per-language implementation)
③ Redis (counter store) · State layer

Redis is the only practical choice for the counter store given the <5 ms p99 latency target. A traditional DB adds 5–15 ms per write under load; Redis at sub-millisecond latency is the only store that keeps the check invisible in the request path.

The Lua script is the correctness mechanism. Two separate commands (GET then INCR) leave a TOCTOU window: two concurrent requests can both read the same count, both decide "allowed," and both write, admitting two requests when one should be blocked. The Lua script runs atomically on the Redis thread, eliminating this race entirely without distributed locks.
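The race is easy to make concrete. A deterministic sketch of the bad interleaving, with a plain dict standing in for Redis and the two "requests" executed sequentially in exactly the order a real race would produce:

```python
# Two "concurrent" requests against a non-atomic GET-then-SET counter.
limit = 100
store = {"count": 99}            # one request short of the limit

# Step 1: both requests read before either writes (the TOCTOU window).
read_a = store["count"]          # request A sees 99 -> decides "allowed"
read_b = store["count"]          # request B also sees 99 -> decides "allowed"

# Step 2: both write back read + 1.
store["count"] = read_a + 1      # A writes 100
store["count"] = read_b + 1      # B also writes 100, clobbering A's update

# Two requests were admitted, but the counter only moved by one:
over_admitted = 2 - (store["count"] - 99)
```

The Lua script removes the window entirely: the read, the decision, and the write all happen inside one atomic execution on the Redis thread.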

⚖ Tradeoff Redis is an in-memory store. Counter state is lost on a cold restart. This is acceptable: counters are ephemeral by design (TTL-bound), and a brief over-allowance after a restart is far better than blocking all traffic while the store recovers.
Alternatives: Memcached (similar latency, but no Lua scripting for an atomic check-and-decrement); in-process counter (fastest, but no shared state across servers)
④ Config store · Quota rules

Quota rules can't be hardcoded constants once you have per-tenant pricing tiers. The config store holds the mapping from user ID (or plan) to their limit, window size, and algorithm. Reads happen on every request, so a raw DB lookup would add another round-trip; the local in-process cache with a 60-second TTL prevents this.

The 60-second TTL is a deliberate product decision: upgrades propagate within 60 s (acceptable), but downgrades also take 60 s to enforce. For fraud-prevention use cases where immediate downgrade is required, push invalidation via a pub/sub event from the config service is the correct extension.

Alternatives: JWT claims with the limit embedded (avoids the lookup entirely, stale on plan change); config service with push invalidation (fresher, more operationally complex)
⑤ Audit log · Observability

Synchronous 429 logging is ruled out by the latency budget: even a 1 ms log write adds to p99 and compounds at 100k QPS. The async pipeline (middleware → message queue → analytics store) decouples observability from the request path entirely: the middleware emits a fire-and-forget event and never waits for it.

The three stages each have a specific job. A durable message queue (Kafka) sits between the middleware and the analytics store: it absorbs traffic spikes without backpressure and retains events for replay if a consumer has a bug. A stream processor (Flink or a simple Kafka consumer) aggregates raw events and writes to a columnar analytics store (ClickHouse) that powers abuse dashboards and per-user 429 queries. Raw events can additionally be archived to object storage (S3) for long-term retention and ad hoc queries, optional at this scale but standard practice for compliance-sensitive deployments.

⚖ Tradeoff A server crash between buffering a 429 event locally and flushing it to Kafka can drop events. For compliance use cases (PCI, SOC 2) this is unacceptable; at-least-once delivery guarantees from Kafka are required, which in turn means duplicates and consumer idempotency handling.

Real-world comparison

| Decision | This design | Stripe | GitHub API |
| --- | --- | --- | --- |
| Algorithm | Token bucket (configurable) | Token bucket | Sliding window (1-hour) |
| Counter store | Redis + Lua | Redis | Redis |
| Limit dimensions | User + API key + IP | API key (per secret) | Authenticated user or IP |
| Headers returned | X-RateLimit-* + Retry-After | X-RateLimit-Limit/Remaining/Reset | X-RateLimit-Limit/Remaining/Reset/Used |
| On Redis outage | Fail open (configurable) | Fail open | Fail open |

Every system above uses Redis. The choice of algorithm (token bucket vs sliding window) is what differentiates them, and that choice flows directly from the burst tolerance NFR you set in §2.

05

Core algorithm: the bucket decision

Four algorithms compete in interviews. The right choice isn't fixed; it follows from your NFRs, specifically the ones set in §2: whether bursts are tolerable, what over-count accuracy is acceptable, and whether output must be strictly smoothed.

[Figure 2 panels, summarised:
① Token Bucket (state: {tokens, last_refill_ts}; refills at a fixed rate, e.g. +1 token / 6 s): allows bursts, O(1) memory, but bursts can spike downstream. Recommended for most APIs.
② Fixed Window Counter (state: {count, window_ts}; e.g. 98/100 in window N, reset in window N+1): simplest implementation, O(1) memory, but up to a 2× spike at the window boundary. Use only if the boundary spike is acceptable.
③ Sliding Window Counter (prev × (1 − elapsed / window) + curr): smooth, no boundary spike, O(1) memory (two counters), but approximate (≤ 1 req error). Best accuracy/memory balance.
④ Leaky Bucket (state: FIFO queue drained at a fixed rate): perfectly smooth output, but queuing adds latency and a full queue drops requests. Use for strict output rate smoothing only.]

Figure 2: All four rate-limiting algorithms side by side. Token bucket (①) is the recommended default. Sliding window counter (③) is the best alternative when boundary smoothness matters. Fixed window (②) is simplest but has a 2× spike risk. Leaky bucket (④) is reserved for strict output smoothing.

① Token Bucket, deep dive · Recommended default

A token bucket holds up to max tokens. Tokens refill at a fixed rate (e.g. 10/sec). Each request consumes one token. If the bucket is empty, the request is rejected with a 429. This maps naturally to user intuition: you have a quota that refills over time, and bursting is allowed as long as you have tokens saved up.

Redis representation: a hash with two fields, tokens (float) and last_refill (unix ms). On each request the Lua script computes tokens earned since the last call, adds them up to the max, then attempts to consume one. The entire read-modify-write is atomic.

Why it handles bursts well: if a user makes no requests for 30 seconds, they accumulate up to max tokens. Their next burst of requests drains the bucket instantly, which is intentional. The downstream system must be designed to handle these bursts, but for most APIs this is preferable to queuing.

⚖ Tradeoff The burst allowance can be a problem for downstream services with tight capacity. If a user's first 100 requests all arrive in 1 second after a quiet period, the downstream DB or service sees a spike it may not be sized for. Mitigate by setting burst_multiplier = 1 (bucket capacity equals the sustained limit, no burst headroom) or switching to sliding window counter, which smooths the rate without queuing.
When to choose: user-facing APIs where occasional bursts are acceptable; any system where the §2 NFR explicitly permits short bursts; the default when unsure, most flexible and well-understood
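The refill-then-consume logic from the Lua script can be expressed as a single-threaded Python sketch. This is illustrative only: precisely because it is not atomic across servers, it is also a reminder of why the Redis script exists.

```python
class TokenBucket:
    """Single-node token bucket sketch mirroring the script's state:
    {tokens, last_refill}. Timestamps are caller-supplied seconds."""

    def __init__(self, max_tokens, refill_rate_per_sec):
        self.max = float(max_tokens)
        self.rate = float(refill_rate_per_sec)
        self.tokens = float(max_tokens)   # start full
        self.last = 0.0                   # last refill timestamp

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at bucket capacity.
        self.tokens = min(self.max, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0            # consume one token
            return True
        return False                      # empty bucket -> 429

# Demo: 2-token burst capacity, refilling 1 token/sec.
b = TokenBucket(max_tokens=2, refill_rate_per_sec=1.0)
burst = [b.allow(0.0), b.allow(0.0), b.allow(0.0)]   # third is rejected
later = b.allow(1.0)                                  # one token refilled
```

The burst behaviour discussed above is visible in the demo: a full bucket admits two instant requests, then rejects until time passes.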
② Fixed Window Counter, deep dive · Simplest, boundary risk

Divide time into fixed buckets (e.g. every 60 seconds). Each bucket holds a counter, keyed by rl:{userId}:{windowTs} where windowTs = floor(now / window) × window. Increment on every request; reject if the count exceeds the limit. The key expires automatically at the end of the window.

Redis representation: a single integer key. INCR rl:u123:1711584000 followed by EXPIRE on first write. Two commands, but the race between them is acceptable here: unlike token bucket, there is no read-before-write, so the TOCTOU window is narrow (only the EXPIRE can be lost, not the count).

The boundary problem: a user can legally make 100 requests at 11:59:59 and another 100 at 12:00:00, both within their respective windows, but 200 requests in 2 seconds. Any downstream service sized for 100 req/min sees a 200-request spike. This is the seam problem, and it's why fixed window is not recommended as a default.

⚖ Tradeoff The simplicity is real: one INCR + one EXPIRE, no float arithmetic, no sorted sets. At very high QPS where Lua script overhead matters, fixed window is measurably cheaper. Accept the boundary spike only if your downstream services can absorb it or if you're enforcing at a coarse granularity (hourly limits) where the spike is proportionally small.
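An in-memory sketch of the fixed window check, using the same key shape as the Redis keys above; the dict stands in for INCR, and TTL-based expiry is omitted for brevity:

```python
import math

class FixedWindowLimiter:
    """Fixed window counter sketch keyed as rl:{userId}:{windowTs}."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counts = {}                       # key -> count (INCR stand-in)

    def allow(self, user_id, now):
        window_ts = math.floor(now / self.window) * self.window
        key = f"rl:{user_id}:{window_ts}"      # same shape as the Redis key
        count = self.counts.get(key, 0) + 1    # increment unconditionally
        self.counts[key] = count
        return count <= self.limit

# Demo: 2 requests per 60 s window.
lim = FixedWindowLimiter(limit=2, window_seconds=60)
in_window = [lim.allow("u1", 0), lim.allow("u1", 30), lim.allow("u1", 59)]
next_window = lim.allow("u1", 60)              # counter resets at the boundary
```

The boundary problem is visible here too: requests at t=30 and t=60 are both allowed even though they are 30 seconds apart, because they land in different windows.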
③ Sliding Window Counter, deep dive · Best accuracy/memory balance

A hybrid that approximates a true sliding window using only two fixed-window counters. The formula: count = prev_count × (1 − elapsed/window) + curr_count. The previous window's count is weighted by how much of it overlaps the current sliding window. As elapsed time grows, the previous window contributes less.

Redis representation: two integer keys, the current window bucket and the previous window bucket. The Lua script reads both, applies the formula, and checks against the limit. Still O(1) memory per user regardless of request rate.

Error bound: the approximation assumes request traffic within the previous window was uniformly distributed. Under that assumption the formula is exact; when traffic is not uniform, the error is bounded by the mis-weighted share of the previous window's count, a fraction of prev_count rather than the full limit. As elapsed time grows from 0 toward the window size, the previous window's contribution, and with it the possible error, shrinks to zero. In practice the error is well under 1% of the limit for realistic traffic distributions. Cloudflare uses this algorithm at global scale.

Why this over token bucket? Token bucket allows front-loaded bursts. Sliding window counter smooths the rate across the window boundary without queuing. If your NFR says "no more than N requests per minute, measured continuously", not "N tokens that refill at N/60 per second", the sliding window counter is the more accurate implementation.

⚖ Tradeoff The uniform-distribution assumption breaks down for bursty traffic. A user who sends all 100 requests in the first second of a window, then nothing, will have their count overstated by the formula during the next window. This means they may be blocked earlier than strictly necessary. For most APIs this is an acceptable false positive.
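The weighted-count formula is one line of arithmetic. A sketch with a worked example; the numbers are illustrative:

```python
def sliding_count(prev_count, curr_count, elapsed, window):
    """Approximate request count over the last `window` seconds,
    `elapsed` seconds into the current fixed window: the previous
    window's count is weighted by its remaining overlap."""
    return prev_count * (1 - elapsed / window) + curr_count

# Worked example: limit 100/min, 15 s into the current minute.
# Previous minute saw 84 requests, current minute has seen 36 so far:
# 84 * (1 - 15/60) + 36 = 63 + 36 = 99 -> still under the limit.
estimate = sliding_count(prev_count=84, curr_count=36, elapsed=15, window=60)
```

One more request in the current window pushes the estimate to 100 and trips the limit, without any boundary spike.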
④ Leaky Bucket, deep dive · Strict output smoothing

Requests enter a FIFO queue and are processed at a fixed drain rate (e.g. 10 req/s). Bursts are absorbed into the queue rather than rejected immediately. If the queue is full, new arrivals are dropped. The output rate is perfectly smooth, downstream services see a constant, predictable load regardless of input burstiness.

Redis representation: a list used as a queue, with a background worker draining it at the fixed rate: LPUSH on arrival, RPOP on each drain tick. A simpler stateless variant stores only the next allowed timestamp per user: on each request, check if now >= next_allowed; if yes, allow and set next_allowed = now + (1 / rate); if no, reject with a Retry-After of next_allowed - now. This avoids the queue entirely while preserving the fixed-rate output guarantee, at the cost of rejecting bursts immediately rather than queuing them.

Why it's rarely the right choice for APIs: queuing adds latency. A burst of 100 requests at 10 req/s drain rate means the 100th request waits 10 seconds before being processed. For an interactive API with a <5 ms p99 target, this is catastrophic. The user experiences a long hang instead of a clean 429.

When it is the right choice: backend processing pipelines where throughput matters more than latency, and where a 429 is worse than a delay, for example, payment processing, batch job submission, or webhook delivery where out-of-order drops are worse than queuing.

⚖ Tradeoff Leaky bucket shifts the failure mode from "client gets a 429" to "client waits a long time." Depending on the API contract and client timeout settings, a long wait can be more damaging than an explicit rejection with a Retry-After header. Always prefer token bucket or sliding window for synchronous request-response APIs.
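The stateless variant described above fits in a few lines. An illustrative sketch (class and method names are assumptions):

```python
class LeakyBucketStateless:
    """Stateless leaky-bucket variant: store only the next allowed
    timestamp per user. Fixed output rate, no queue; bursts are
    rejected immediately with a computed Retry-After."""

    def __init__(self, rate_per_sec):
        self.interval = 1.0 / rate_per_sec   # seconds between allowed requests
        self.next_allowed = {}               # user_id -> timestamp

    def check(self, user_id, now):
        nxt = self.next_allowed.get(user_id, 0.0)
        if now >= nxt:
            self.next_allowed[user_id] = now + self.interval
            return (True, 0.0)               # allowed
        return (False, nxt - now)            # rejected; Retry-After in seconds

# Demo: 2 req/s means one request every 0.5 s.
lb = LeakyBucketStateless(rate_per_sec=2)
first = lb.check("u1", 0.0)
second = lb.check("u1", 0.1)                 # too soon: rejected with wait time
third = lb.check("u1", 0.5)                  # exactly on schedule: allowed
```

Unlike the queued form, this never makes a request wait; it converts the wait into an explicit Retry-After, which suits synchronous APIs better.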
Sliding Window Log · Exact count, advanced alternative

A fifth option not shown in the diagram above. Instead of counters, store a sorted set of request timestamps. To check the limit, prune entries older than the window and count what remains. This is the only algorithm that gives a perfectly exact sliding window count with no approximation error at window boundaries.

The cost is memory: O(requests) per user rather than O(1). A user making 1,000 requests per window requires 1,000 timestamp entries in Redis. At scale this becomes significant. The Redis operation ZREMRANGEBYSCORE to prune old entries also adds latency proportional to the number of expired entries removed.

⚖ Tradeoff Use the sliding window log only when the 0.1% over-count tolerance from §2 cannot be relaxed, for example, authentication endpoints where even one extra allowed request is a security failure. For general-purpose APIs, the sliding window counter (③) gives indistinguishable accuracy at O(1) cost.
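An in-memory sketch of the log approach: a deque per user stands in for the Redis sorted set, with the prune step playing the role of ZREMRANGEBYSCORE.

```python
from collections import deque

class SlidingWindowLog:
    """Exact sliding window: store every request timestamp, prune what
    fell out of the window, count what remains. O(requests) memory."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.log = {}                        # user_id -> deque of timestamps

    def allow(self, user_id, now):
        ts = self.log.setdefault(user_id, deque())
        while ts and ts[0] <= now - self.window:
            ts.popleft()                     # prune expired (ZREMRANGEBYSCORE)
        if len(ts) < self.limit:
            ts.append(now)                   # record this request (ZADD)
            return True
        return False                         # exact: no boundary approximation

# Demo: 2 requests per 60 s, measured continuously.
log = SlidingWindowLog(limit=2, window_seconds=60)
a = log.allow("u1", 0)
b = log.allow("u1", 59)
c = log.allow("u1", 59.5)                    # still two requests in the window
d = log.allow("u1", 61)                      # the t=0 request has aged out
```

The memory cost named above is visible in the data structure: the deque grows with the request rate, where the counter algorithms stay at O(1).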

Our choice for this system: Token Bucket via Redis Lua

Token bucket with a Lua script is the right default for this system: our NFR in §2 explicitly permits short bursts, the smoother output of a leaky bucket would add unnecessary queuing latency for a user-facing API, and the higher memory cost of a sliding window log is unjustified when the 0.1% over-count tolerance means the sliding window counter approximation is already good enough. Token bucket is the choice that the §2 requirements drive you to, not a default you pick before reading them.

Redis Lua script: token bucket implementation
-- Redis Lua script: token bucket check-and-refill
-- KEYS[1] = "rl:{userId}"  -- persistent per-user key (no window: state carries across windows)
-- ARGV[1] = max tokens, ARGV[2] = refill rate/sec, ARGV[3] = now (unix ms)

local key = KEYS[1]
local max = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])   -- tokens per second
local now = tonumber(ARGV[3])           -- milliseconds

local data = redis.call("HMGET", key, "tokens", "last_refill")
local tokens = tonumber(data[1]) or max
local last = tonumber(data[2]) or now

-- Refill: add tokens proportional to elapsed time
local elapsed = (now - last) / 1000.0
local new_tokens = math.min(max, tokens + elapsed * refill_rate)

if new_tokens < 1 then
  return {0, math.ceil((1 - new_tokens) / refill_rate * 1000)}  -- {denied, retry_ms}
end

-- Consume one token (HSET replaces deprecated HMSET, supported since Redis 4.0)
redis.call("HSET", key, "tokens", new_tokens - 1, "last_refill", now)
redis.call("PEXPIRE", key, math.ceil(max / refill_rate * 1000))  -- expire after full refill time
return {1, 0}  -- {allowed, 0}

One open question: what happens under clock skew? If API servers have different system clocks, the now value passed to the Lua script will vary between callers, so tokens can appear to refill too fast or even move backwards. The fix is to use Redis's own server time via redis.call("TIME") inside the script; this ensures the reference clock is always the Redis node, not the calling server. (On Redis 5+, scripts replicate by effects, so calling the non-deterministic TIME command before a write is safe by default.)

05b

API design

The rate limiter exposes no external API of its own; it's middleware. But it modifies every response, and it exposes an internal management API for setting quotas. Here's what that looks like concretely.

Response headers added to every request

RESPONSE · Headers on every request (200 or 429)
{
  "X-RateLimit-Limit": 100,           // the ceiling for this user's plan
  "X-RateLimit-Remaining": 43,        // tokens left in current window
  "X-RateLimit-Reset": 1711584000,    // Unix ts when window resets / bucket refills
  "X-RateLimit-Policy": "100;w=60"   // IETF draft: 100 requests per 60-second window
}

// On 429, additionally:
{
  "Retry-After": 17,                  // seconds until the client should retry
  "X-RateLimit-Remaining": 0
}
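A small helper sketching how the middleware could assemble these headers from the limiter's decision. The function and its signature are assumptions for illustration, not a standard API; the header names follow the convention shown above:

```python
def rate_limit_headers(limit, remaining, reset_ts, window_seconds,
                       allowed, retry_after_seconds=0):
    """Build the response headers for one rate-limit decision."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_ts),
        # IETF draft syntax: "<limit>;w=<window seconds>"
        "X-RateLimit-Policy": f"{limit};w={window_seconds}",
    }
    if not allowed:
        headers["Retry-After"] = str(retry_after_seconds)  # only on 429
    return headers

# Demo: an allowed request and a rejected one.
ok = rate_limit_headers(100, 43, 1711584000, 60, allowed=True)
rejected = rate_limit_headers(100, 0, 1711584000, 60,
                              allowed=False, retry_after_seconds=17)
```

Clamping `remaining` at zero matters: token-bucket arithmetic can briefly produce a fractional or negative value, and clients expect a non-negative integer.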

Management API, quota configuration

PUT /admin/v1/quotas/{userId}
// Request body
{
  "limit": 1000,         // sustained rate: max requests per window
  "window_seconds": 60,  // window size
  "algorithm": "token_bucket",  // or "sliding_window_counter"
  "burst_multiplier": 2  // bucket capacity = limit × burst_multiplier (L5+)
                         // e.g. limit=1000, burst_multiplier=2 → max_tokens=2000
                         // a user can burst up to 2000 req instantly if bucket is full
}

// Response 200
{
  "userId": "u_8f2a",
  "effective_at": "2026-03-28T10:00:00Z",
  "propagation_lag_seconds": 60,  // cache TTL
  "derived": {
    "max_tokens": 2000,           // limit × burst_multiplier → passed as ARGV[1] to Lua
    "refill_rate_per_sec": 16.67  // limit / window_seconds → passed as ARGV[2] to Lua
  }
}
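The `derived` block is plain arithmetic mapping the request body onto the Lua script's ARGV values. A sketch (the function name is illustrative):

```python
def derive_bucket_params(limit, window_seconds, burst_multiplier=1):
    """Map a quota config onto token-bucket parameters:
    bucket capacity (ARGV[1]) and refill rate (ARGV[2])."""
    max_tokens = limit * burst_multiplier      # bucket capacity
    refill_rate = limit / window_seconds       # sustained tokens/sec
    return {"max_tokens": max_tokens,
            "refill_rate_per_sec": round(refill_rate, 2)}

# Demo: the Pro-tier example from the response body above.
derived = derive_bucket_params(limit=1000, window_seconds=60,
                               burst_multiplier=2)
```

With burst_multiplier = 1, capacity equals the sustained limit and the burst headroom disappears, which is the mitigation mentioned in the token-bucket tradeoff.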

Optional endpoints

| Endpoint | Purpose | Level |
| --- | --- | --- |
| GET /admin/v1/quotas/{userId} | Inspect current quota + remaining tokens | L5 |
| DELETE /admin/v1/counters/{userId} | Reset a user's counter (support action) | L5 |
| GET /admin/v1/429-log?userId=… | Audit log of 429 events per user | L6 |
| POST /admin/v1/quotas/bulk | Batch quota update for tenant migration | L7/L8 |
06

Core flow

Every request through the system follows the same path. The critical decision, allow or deny, happens inside a single atomic Redis operation, which is why the latency stays under 5 ms even at 100k QPS.

[Figure: request flow. Request arrives → look up the user's limit (local cache; miss → Config DB) → build the Redis key (rl:{userId} for the token bucket; windowed keys for fixed/sliding window) → execute the Redis Lua script (atomic: check tokens → decrement → set expiry) → Lua returns {1, 0} (allowed) or {0, retry_ms} (denied). Allowed: forward to business logic (200). Denied: return 429 with a Retry-After header. Either way, add X-RateLimit-* headers to the response; 429 events are pushed to the audit Kafka topic asynchronously.]

Figure 3: Request flow. The Lua script executes in a single Redis round-trip (~0.5 ms LAN). The decision branches immediately after; audit logging is always async.

💡

The key tradeoff in this flow is why the Lua script is one round-trip, not two (GET then SET). Two operations introduce a TOCTOU window where two concurrent requests can both read count=99, both decide "allowed," and both write count=100, admitting two requests when one should be blocked. The Lua script eliminates this entirely.

💬

Classic L5 probe: "What if the Redis write succeeds but the network drops before returning to the middleware, do you count the request or not?" Answer: the token has already been decremented, so the request counts as consumed. This is the right behavior; the alternative (retrying the Lua script) risks double-counting.

07

Data model

The data model is split between Redis (hot counter state) and a relational database (quota configuration). They serve opposite purposes: Redis is updated on every single request; the config DB is read rarely and written almost never.

Access patterns first

| Operation | Frequency | Query shape |
| --- | --- | --- |
| Check + increment counter | Every request, 100k/s peak | Read-modify-write by key: rl:{userId} (token bucket) or rl:{userId}:{windowTs} (fixed/sliding window) |
| Lookup quota config | Once per 60 s per user (cache TTL) | Point lookup by userId |
| Admin: update quota | Rare; plan changes, support actions | Update by userId |
| Admin: query 429 log | Occasional; debugging, abuse reports | Range scan by userId + timestamp |

Two things stand out: counter access is hot, write-heavy, and requires an atomic read-modify-write, so Redis is the only viable store; config access is cold, read-heavy, and needs no atomicity, so a relational DB with local caching is a perfect fit.

Figure 4 (text summary):
  • Redis, counter keys — rl:{userId}:{windowTs}. Token bucket: hash fields tokens (FLOAT, remaining tokens) and last_refill (INT, unix ms). Or sliding-window log: sorted set with score = request_ts (unix ms) and member = unique request ID. TTL = window_size (auto-expire).
  • Relational DB, quota config — user_quotas: userId (VARCHAR, PK), limit (INT, max requests), window_sec (INT, window size), algorithm (ENUM, bucket|sliding), plan (VARCHAR, free|pro|enterprise).
  • audit_429_log (Kafka → ClickHouse) — userId, ts, endpoint, remaining_tokens.

Figure 4: Data model. Redis keys are namespaced and auto-expire. Config lives in a relational DB, cached locally on each API server for 60 seconds.

Why TTL on every Redis key? Memory hygiene

Without a TTL, inactive users accumulate dead counter keys forever. With a TTL set to the window size, a user who stops making requests will have their key automatically evicted when the window closes. For a 1-minute window, keys live at most 1 minute after the last request.

This is what keeps Redis memory bounded (see §3: even at 1M users, the active key set is tiny) and means you never need a background cleanup job.

08

Caching strategy

The rate limiter has an inverted caching profile compared to most systems: Redis is the hot state store, not a cache in front of something slower. What you're actually caching is everything that sits in front of Redis, specifically, quota config lookups that would otherwise add a second Redis round-trip to every request. Understanding this inversion matters because it changes what "cache invalidation" means: invalidating a counter is never necessary (TTLs handle it), but invalidating a quota config entry on plan change is a real operational concern.

Placing the layers against the §04 architecture makes the picture concrete. Every request passes through three potential cache interactions before a decision is made:

Figure 6 (text summary): ① config lookup — the rate limiter middleware's in-process L1 cache holds quota config (TTL 60 s) and the circuit-breaker flag (TTL 10 s), so config is served without touching Redis; ② counter read-write — the Lua script always hits the Redis counter store (tokens, last_refill, written on every request, TTL = window size), which is authoritative state, not a cache of anything; ③ config source — the Config DB, source of truth for quota rules, is reached only on a cache miss (~every 60 s per user per instance).

Figure 6: Where caching fits in the request path. The in-process cache inside the middleware serves quota config without touching Redis. Redis is the authoritative counter store, every request reads and writes it. The Config DB is only reached on a cache miss.

| What | Lives in | Holds | TTL | Why |
| --- | --- | --- | --- | --- |
| In-process config cache | Rate limiter middleware (per instance) | User quota config (limit, window, algorithm) | 60 s | Avoids a Redis round-trip for config on every request; a cold miss falls through to the Config DB |
| In-process circuit breaker | Rate limiter middleware (per instance) | Redis outage flag | 10 s | Prevents a thundering herd of retries against a failing Redis; resets automatically after 10 s |
| Redis counter store | Shared Redis cluster | Counter state (tokens, last_refill) | = window size | Authoritative rate-limit state, not a cache. Written atomically on every request via the Lua script. |

Cache invalidation trap: When a user's plan is downgraded, the local config cache will still show the old (higher) limit for up to 60 seconds, so a just-downgraded user can send 60 more seconds of Pro-tier traffic. For most products this is acceptable. If it isn't, you need push invalidation: the config service publishes a pub/sub event to all API servers to evict the specific key immediately.
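A minimal sketch of such an in-process config cache, assuming a `fetch` callable that stands in for the Config DB lookup; `invalidate` is what a pub/sub "plan changed" handler would call (all names here are illustrative):

```python
import time

class ConfigCache:
    """In-process quota-config cache with a TTL (lazy expiry) and an
    explicit evict hook for push invalidation on plan changes."""

    def __init__(self, fetch, ttl_sec: float = 60.0):
        self._fetch = fetch  # fallthrough to the Config DB
        self._ttl = ttl_sec
        self._entries: dict[str, tuple[float, object]] = {}

    def get(self, user_id: str):
        hit = self._entries.get(user_id)
        if hit is not None and time.monotonic() - hit[0] < self._ttl:
            return hit[1]  # fresh: no DB round-trip
        value = self._fetch(user_id)  # cold or expired: hit the Config DB
        self._entries[user_id] = (time.monotonic(), value)
        return value

    def invalidate(self, user_id: str) -> None:
        """Called by the pub/sub evict handler: the next get() falls
        through to the DB instead of serving the stale limit."""
        self._entries.pop(user_id, None)
```

Without the `invalidate` hook, a downgraded user is served the stale (higher) limit until the TTL expires, which is exactly the 60-second window described above.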

09

Deep-dive scalability L5+

At 100k QPS with a single Redis node, you're likely fine: Redis handles ~1M simple ops/sec (GET, SET, INCR). Lua scripts, however, are significantly heavier than raw commands: a realistic ceiling is 200–400k script executions/sec on typical hardware, because each script blocks Redis's single thread for longer than a primitive op. At 100k QPS you have headroom; at 500k+ QPS, Redis Cluster becomes necessary. Production systems also have to plan for what happens when demand grows, when Redis is partitioned, and when consistency guarantees need to be loosened for throughput.

Figure 5 (text summary): clients → global LB/CDN → API servers 1…N → Redis Cluster (3 primary shards, 3 read replicas). API servers read quota rules from the Config DB and stream 429 events to Kafka → ClickHouse for abuse analytics, with anomaly alerts routed to PagerDuty/OpsGenie.

Figure 5: Production-scale architecture. Redis cluster shards counters across 3 primaries. 429 events stream to Kafka and into ClickHouse for abuse analytics.

Redis Cluster sharding L5, throughput

Redis Cluster distributes keys across 16,384 hash slots assigned to N primary nodes. For rate limiting, each user's counter key hashes to exactly one shard, so all operations for a given user are atomic on that shard. No cross-shard transactions are needed.

A 3-primary cluster, with each primary handling ~200–300k Lua executions/sec (consistent with the ceiling established above), gives headroom to ~600k–900k QPS before further horizontal scaling is needed. Add shards by resharding; Redis Cluster handles this online.

⚖ Tradeoff Resharding migrates key slots between nodes; during this window, some Lua scripts may fail with MOVED errors. The middleware must handle retries with backoff.
Local counter + periodic sync (approximate) L6, extreme throughput

Above ~5M QPS, even Redis Cluster becomes a bottleneck. The alternative: each API server maintains an in-process counter and periodically (every 100 ms) syncs with Redis via INCRBY. Between syncs, the distributed count can be off by at most (number of servers) × (per-server unsynced count). This is an explicit trade: approximate counting in exchange for linear write scalability.

This design is used by Cloudflare and is acceptable when "at most 5% over-count" is a tolerable SLA. It's unacceptable for security-critical limits (authentication attempts, payment API).

⚖ Tradeoff On server crash, the unsynced local count is lost: those requests were admitted but never counted in Redis, so the system under-counts for the crash window.
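The local-counter design can be simulated in a few lines. This is a single-threaded sketch with hypothetical names: `shared` stands in for the Redis counter and `sync` for the periodic INCRBY-and-reread.

```python
class ApproximateLimiter:
    """Cluster-wide view of the local-counter + periodic-sync design."""

    def __init__(self, limit: int):
        self.limit = limit
        self.shared = 0  # stands in for the Redis counter

    def make_server(self):
        return _Server(self)

class _Server:
    """One API server: admits against a stale view of the shared count
    and batches its own increments until the next sync."""

    def __init__(self, cluster):
        self.cluster = cluster
        self.local = 0  # unsynced increments
        self.view = 0   # shared count as of last sync

    def allow(self) -> bool:
        if self.view + self.local >= self.cluster.limit:
            return False
        self.local += 1
        return True

    def sync(self) -> None:
        """Equivalent of INCRBY shared local, then re-reading the total."""
        self.cluster.shared += self.local
        self.local = 0
        self.view = self.cluster.shared
```

The worst case falls out directly: with N servers and no syncs, each server's stale view admits up to the full limit, so the over-count is bounded by the unsynced increments across all servers, exactly the servers × local_count bound stated above.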
Multi-tier rate limits L6, tenant design

Enterprise products often have nested limits: per-user, per-team, and per-organization, all enforced simultaneously. A request must pass all three checks. The implementation: the Lua script is called once per dimension, in order. If any check fails, the 429 is returned immediately with the header naming which dimension was exhausted: X-RateLimit-Violated: org.

Geo-distributed rate limiting L7/L8, global consistency

If a user can send requests to two different regions simultaneously, per-region counters will each allow the full quota, effectively doubling the limit. True global rate limiting requires cross-region counter sync. Options: (1) route all requests for a user to their "home" region (sticky routing, a single point of failure); (2) use a globally replicated database with strongly consistent cross-region reads, such as Google Spanner (note that DynamoDB Global Tables replicate across regions with eventual consistency by default, which puts them closer to option 3); (3) accept approximate counting with gossip-based sync.

Most systems choose option 3: allow a known over-count window at region boundaries, monitor for abuse via the audit log, and address egregious violations reactively.

How gossip-based sync works: each region periodically (e.g. every 100–500 ms) broadcasts its local counter delta to peer regions. Each region maintains a per-region counter vector and sums them to get the global count. This is a form of CRDT (Conflict-free Replicated Data Type), specifically a G-Counter, where each region owns one slot in the vector and only increments it. Merging is simply taking the max of each slot across all received vectors. Convergence time equals the gossip interval plus network RTT between regions (typically 100–300 ms cross-continent), giving a bounded over-count window of at most limit × (gossip_interval / window_size).
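The G-Counter itself is only a few lines; here is a sketch with one slot per region (names illustrative):

```python
class GCounter:
    """G-Counter CRDT: each region owns one slot and only ever
    increments it. Merge takes the element-wise max, which makes merges
    commutative, associative, and idempotent, so gossip messages can
    arrive in any order, repeatedly, without corrupting the count."""

    def __init__(self, region: str, regions: list[str]):
        self.region = region
        self.slots = {r: 0 for r in regions}

    def increment(self, n: int = 1) -> None:
        self.slots[self.region] += n  # only our own slot, ever

    def merge(self, gossiped: dict[str, int]) -> None:
        for r, count in gossiped.items():
            self.slots[r] = max(self.slots[r], count)

    def global_count(self) -> int:
        return sum(self.slots.values())
```

Until a merge happens, each region only sees its own slot, which is exactly the bounded over-count window: a user who sends 90 requests to each of two regions is globally at 180, but each region locally believes 90 until the next gossip round.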

10

Failure modes & edge cases

| Scenario | Problem | Solution | Level |
| --- | --- | --- | --- |
| Redis node goes down | All counter checks fail; traffic either stops or goes unchecked | Circuit breaker: fail open for non-auth endpoints. Alert on-call (e.g. PagerDuty). Redis Sentinel (the built-in high-availability manager) detects failure within ~30 s (default down-after-milliseconds); total failover including election and promotion typically takes 30–60 s end-to-end. | L3/L4 |
| Window boundary spike | Fixed-window allows 2× the limit at window boundaries, overloading downstream | Switch to a token bucket or sliding-window counter algorithm; both smooth traffic across boundaries | L3/L4 |
| Counter race condition (GET then SET) | Two servers read the same count with GET, both decide "allowed," both write, admitting one request too many (TOCTOU on GET+SET, not on INCR itself) | Put all state mutation in a single Lua script, atomic on the Redis thread, eliminating the read-modify-write race without distributed locks | L5 |
| Plan upgrade cache lag | User upgrades plan but still sees the old lower limit for up to 60 s | Pub/sub invalidation: the config service sends an evict event to all API servers on plan change | L5 |
| Clock drift between API servers | Window timestamps differ across nodes; the same user gets different windows | Use redis.call("TIME") inside the Lua script so windows use the Redis server clock, not the application clock | L5 |
| Header spoofing (X-Forwarded-For) | Attacker spoofs the IP header to bypass IP-based limits | Trust X-Forwarded-For only from known load-balancer CIDR ranges, never from arbitrary clients; prefer X-Real-IP set by the LB itself, which clients cannot inject. For any security-critical limit (auth endpoints, payment APIs), use the authenticated user ID as the primary key; IP-based limits are a last-resort signal, not a primary defence. | L6 |
| Thundering herd on Redis restart | All API servers miss their local cache simultaneously after a Redis restart; 100k req/s hit cold Redis | Jittered local cache TTL (60 s ± 10 s random) plus request coalescing: one in-flight config fetch per key | L7/L8 |
| Distributed counter drift at geo boundaries | A user sends 90 req to US-EAST and 90 req to EU-WEST; both regions allow; the effective limit is 2× | Accept the approximate over-count; use audit-log anomaly detection to flag and throttle after the fact. For auth endpoints, use sticky geo routing. | L7/L8 |
| Idempotency key on retry | A POST fails mid-flight and is retried with the same idempotency key; should the retry consume a token? | For APIs that support idempotency keys (Stripe-style), do not count idempotent retries against the quota: the client is replaying a request that already consumed a token. Store the idempotency key with the original response in a short-TTL Redis key (24 h); on retry, return the cached response without re-entering the rate limiter. This must be established as a requirement in §2: it changes the key structure and adds a pre-check lookup step to the core flow. | L6 |
11

How to answer by level

L3/L4 SDE I / SDE II, Can you build a working system?
What good looks like
  • Pick token bucket or fixed window, justify the choice
  • Use Redis with INCR + EXPIRE
  • Return 429 with Retry-After header
  • Identify (but don't need to fully solve) the race condition
  • Mention per-user key scoping
What separates this from L5
  • No thought for distributed atomicity
  • Doesn't ask about burst tolerance vs smoothing
  • Missing TTL reasoning on Redis keys
  • No mention of fail-open vs fail-closed
L5 Senior SDE, Do you understand the tradeoffs?
What good looks like
  • Lua script for atomicity, explains why two separate commands aren't safe
  • Chooses algorithm based on burst tolerance NFR, not by default
  • Designs per-user + per-IP dimensions, explains the key structure
  • Addresses clock drift via Redis TIME
  • Proactively raises fail-open semantics
What separates this from L6
  • Single-tier limits only (no tenant/org hierarchy)
  • No plan for global/cross-region consistency
  • Doesn't think about approximate counting as a deliberate tradeoff
L6 Staff SDE, Can you own this end-to-end?
What good looks like
  • Multi-tier limits (user → team → org), Lua script runs all checks
  • Push invalidation for plan changes
  • Sliding window counter approximation vs exact, chooses based on use case
  • Audit log pipeline to ClickHouse for abuse analytics
  • Addresses header spoofing and IP trust chain
What separates this from L7
  • Doesn't model the correctness guarantees under network partition
  • No framework for deciding when approximate counting is "good enough"
  • Doesn't reason about multi-region topology as a first-class constraint

At L7, you're expected to define an explicit correctness model, e.g., "we accept at most 5% over-count during partition windows lasting under 30 seconds", rather than just naming the problem. The difference is quantified tolerance with a rationale, not just awareness that drift exists.

L7/L8 Principal / Distinguished, Should we build this, and how?
What good looks like
  • Frames approximate vs exact counting as a deliberate correctness model with quantified error bounds
  • Designs for geo-distributed consistency: sticky routing vs gossip sync vs Spanner
  • Considers cell-rate algorithm (GCRA) for strict burst SLAs
  • Thinks about what the rate limiter protects against vs what it can't (state exhaustion, amplification attacks)
  • Proactively surfaces build vs buy decision (Kong, Envoy, AWS API GW)
The bar
  • Shows the reasoning behind every tradeoff, not just the conclusion
  • Identifies failure modes the interviewer hasn't raised
  • Can quantify the error bound of approximate counting and defend when it's acceptable
  • Drives the conversation, doesn't just respond to prompts

Classic probes, level-differentiated answers

| Probe | L3/L4 | L5/L6 | L7/L8 |
| --- | --- | --- | --- |
| "Two servers increment the same counter simultaneously, what happens?" | Identifies the race. Suggests INCR + EXPIRE (not fully atomic). | Solves it with a Lua script. Explains why WATCH/MULTI is worse (retries, latency). | Discusses whether this race even matters in approximate-counting mode. Quantifies the over-count risk. |
| "Redis goes down. What does your rate limiter do?" | Mentions fail-open vs fail-closed. Picks one. | Designs a circuit breaker. Differentiates behavior by endpoint sensitivity. Knows the Redis Sentinel failover timeline. | Models the blast radius: how many requests leak during a 30–60 s failover, and is that acceptable given the SLA? Proposes multi-AZ Redis for lower RTO. |
| "How do you rate limit a user who is hitting two different regions?" | Doesn't raise this. Assumes single-region. | Identifies that per-region counters allow 2× the limit. Proposes sticky geo routing as the simplest fix. | Compares sticky routing (simple, single PoF) vs a globally replicated DB (Spanner: strongly consistent, expensive) vs gossip sync (eventually consistent, tunable error bound). Chooses based on use-case sensitivity. |
| "Token bucket vs sliding window, which do you pick and why?" | Picks token bucket. Can describe it. Doesn't connect the choice to requirements. | Derives the choice from the burst-tolerance NFR: if burst is acceptable, token bucket; if exact counting is needed, sliding-window counter; if output smoothing is needed, leaky bucket. | Adds that the sliding-window log is the only exact algorithm; the others are approximations. Discusses GCRA (Generic Cell Rate Algorithm) as the rigorous foundation for token bucket. Notes that "burst" itself needs a formal definition in the SLA. |
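For reference, GCRA can be sketched around a single stored value per key, the theoretical arrival time (TAT). This is an illustrative single-key version under stated assumptions (tolerance = (burst − 1) × interval), not a distributed implementation:

```python
class GCRA:
    """Generic Cell Rate Algorithm sketch. Tracks one value, the
    Theoretical Arrival Time (TAT): the earliest time the next request
    is 'scheduled' for. A request conforms if it does not arrive more
    than the burst tolerance ahead of that schedule."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.interval = 1.0 / rate_per_sec           # emission interval T
        self.tolerance = (burst - 1) * self.interval  # tau: allows `burst` back-to-back
        self.tat = 0.0

    def allow(self, now: float) -> bool:
        tat = max(self.tat, now)          # if idle, schedule from now
        if tat - now > self.tolerance:    # too far ahead of schedule
            return False                  # non-conforming; TAT unchanged
        self.tat = tat + self.interval    # book the next slot
        return True
```

From an idle start, exactly `burst` back-to-back requests conform; after that, capacity returns at one request per interval, which is why GCRA gives a precise, SLA-friendly definition of "burst."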

How the pieces connect

Every architectural decision in this article traces back to a requirement set in §2 or a number established in §3.

  • 1 Latency < 5 ms NFR (§2) → rules out DB reads on every request → mandates Redis as counter store (§4) → config cached locally at 60 s TTL (§8)
  • 2 99.99% availability NFR (§2) → Redis outage is a real failure scenario → fail-open circuit breaker (§4, §10) → behavior differs by endpoint sensitivity (§11)
  • 3 Race condition risk (§5) → two-command INCR+EXPIRE is not atomic → Lua script encapsulates all state mutation in one round-trip (§5, §6) → eliminates TOCTOU without distributed locks
  • 4 Configurable per-tenant limits (§2) → quota config lookup on every request → in-process cache with 60 s TTL (§8) → push invalidation needed for plan downgrades (§10)
  • 5 Audit / observability NFR (§2) → every 429 must be logged → async Kafka pipeline (§4, §9) → ClickHouse for abuse analytics without adding request-path latency
  • 6 Burst tolerance choice (§5 algorithm decision) → token bucket allows burst, leaky bucket smooths output → drives whether downstream services must themselves be burst-tolerant (§4 component rationale)