intervu.dev, System Design Series

Design a Rate Limiter, System Design Interview Guide

Simple to explain, surprisingly hard to get right at scale. The gap between a correct counter and a correct distributed counter is one atomic operation, and everything else follows from that.


~25 min read  ·  All levels  ·  Updated March 2026

A bouncer turning away API requests at the rate limiter nightclub
01

What the interviewer is testing

A rate limiter question looks deceptively narrow. Most candidates sketch a counter and call it done. The interviewer is watching for something else: whether you can hold multiple constraints in tension at the same time — low latency, high availability, and fairness — without sacrificing one for another. (No, wait: no dashes.) The interviewer is watching for something else: whether you can hold low latency, high availability, and fairness in tension at the same time, without sacrificing one for another.

The question also acts as a level filter. At L3/L4, the bar is a working single-node design. At L5, you need to identify the distributed race condition and solve it atomically. At L6 and above, the interviewer wants to see you reason about what "correct" even means when nodes disagree, and when it's acceptable to over-count.

| Level | Core expectation | Where candidates fall short |
| --- | --- | --- |
| L3/L4 | Working token bucket or fixed window, correct HTTP 429 response, basic Redis key structure | Forgetting the race condition; no expiry on Redis keys |
| L5 | Atomic Lua script, sliding window algorithm, per-user + per-IP dimensions | Treating Redis as always available; no fallback plan |
| L6 | Multi-tier limits (user, tenant, global), quota inheritance across tiers, async audit log | No thought for quota inheritance or bulk API consumers |
| L7/L8 | Approximate counting tradeoffs, gossip-based quota sync, cell-rate algorithm for bursting SLAs | Optimizing for the happy path only; no model for correctness under partition |
💬

The earliest signal is the question "What counts as a request?" Does a retry count? Does a request that errors before reaching the database count? Candidates who think to ask this tend to do well.

02

Requirements clarification

Before touching algorithms, establish what the rate limiter needs to protect and what "fair" means. The answers change the data model and the consistency model entirely.

Functional requirements

| Question | Typical answer | Why it matters |
| --- | --- | --- |
| What dimensions to limit on? | Per user, per IP, per API key | Each dimension needs a separate Redis key namespace |
| What endpoints need limiting? | All, via gateway middleware | Centralising at the gateway avoids per-service duplication |
| Hard limit or soft? | Hard, reject at threshold | Soft limits need a separate overflow queue design |
| Should limits be configurable per tenant? | Yes, tiers (Free / Pro / Enterprise) | Drives a config-lookup layer, not just static constants |
| Should retries count? | Yes, same as a new request | Simplifies the counter; retry logic is client responsibility |
| Do you support idempotency keys? | No, out of scope for this design | If yes, idempotent retries must not consume a token; requires a pre-check lookup and changes the Redis key structure (see §10 edge cases) |

Non-functional requirements

| NFR | Target |
| --- | --- |
| Added latency per check | < 5 ms p99 |
| Availability | 99.99%, fail open on Redis outage |
| Accuracy | At most 0.1% over-count; under-count minimised |
| Scale | 1M unique users, 100k requests/sec peak |
| Auditability | All 429s logged async within 30 s |
Latency < 5 ms p99 · Drives the Redis choice

Every API call incurs an extra network hop for the rate limit check. At 5 ms the check is invisible to humans but measurable by services with tight SLAs. This target effectively rules out any approach that requires a synchronous DB read; Redis at sub-millisecond latency is the only practical option.

⚖ Tradeoff Moving the counter into the API server's memory (in-process) gets latency to < 0.1 ms, but then you can't share counts across servers. You'd need periodic sync, which opens a window for over-counting.
Alternatives: in-process counter (fastest, not distributed); Memcached (similar latency to Redis, no Lua atomicity)
Fail open on Redis outage · Drives availability design

A 99.99% availability target means the rate limiter must not lower the availability of the services it protects. If the Redis cluster is unreachable, the right default for most consumer APIs is to let traffic through rather than block everything. This is called "fail open."

The alternative, "fail closed", is appropriate when the rate limiter protects a payment endpoint or prevents fraud. Clarify with the interviewer which mode fits the use case.

⚖ Tradeoff Failing open means an attacker can exploit Redis downtime. A brief circuit breaker window (e.g. 10 s of fail-open before alerting) balances safety and availability.
Per-tenant configurable limits · Drives config layer

Static constants hardcoded in the limiter only work for a single-tier product. Once you have Free, Pro, and Enterprise tiers, each with different request quotas, you need a config lookup: given a user ID, what is their limit? This lookup happens on every request, so it must itself be cached; a short TTL (60 s) in a local in-process cache is the standard approach.
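That local cache is a few lines of code. A minimal Python sketch, assuming an illustrative `QuotaCache` class and a `fetch_quota` callback standing in for the config DB lookup (neither is a real library API):

```python
import time

class QuotaCache:
    """In-process quota cache with a short TTL (illustrative names)."""

    def __init__(self, fetch_quota, ttl_seconds=60, clock=time.monotonic):
        self._fetch = fetch_quota        # fallback lookup: config DB / service
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}               # user_id -> (quota, expires_at)

    def get(self, user_id):
        now = self._clock()
        hit = self._entries.get(user_id)
        if hit and hit[1] > now:
            return hit[0]                # fresh hit: no extra round-trip
        quota = self._fetch(user_id)     # miss or stale: one config lookup
        self._entries[user_id] = (quota, now + self._ttl)
        return quota

# Demo with a stub fetcher and a controllable clock:
calls = []
def fetch_quota(user_id):
    calls.append(user_id)                # count real lookups
    return {"limit": 100, "window_seconds": 60}

clock = [0.0]
cache = QuotaCache(fetch_quota, ttl_seconds=60, clock=lambda: clock[0])
cache.get("u1"); cache.get("u1")         # second call is a cache hit
clock[0] = 61.0                          # TTL expired
cache.get("u1")                          # triggers a refetch
```

The 60 s TTL is exactly the propagation lag for plan changes discussed above: a downgrade is enforced at most one TTL after the config write.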

Alternatives: JWT claims with the limit embedded (avoids the lookup, stale on plan change); config service with push invalidation (fresher, more complex)
Accuracy: over-count ≤ 0.1%, under-count minimised · Drives algorithm choice

The asymmetry between over-count and under-count is the key constraint. An over-count means a legitimate request gets a 429, a UX failure. An under-count means a request that should be blocked gets through, a security failure. These are not equally bad for most systems, which is why the NFR prioritises minimising under-counting while tolerating a small over-count rate.

In practice, brief under-counts are unavoidable at partition boundaries, during crash windows, and when using approximate counting modes (§09). The NFR means the steady-state design must not structurally under-count, not that zero under-count events are possible.

The 0.1% over-count tolerance is what makes the sliding window counter approximation viable as an alternative to the exact sliding window log. The approximation's error is bounded by a fraction of the previous window's count, not by the full limit, which keeps it well within 0.1% under realistic traffic distributions. If this tolerance were tighter (e.g. 0%), you'd be forced into the sliding window log with its O(requests) memory cost.

⚖ Tradeoff Token bucket allows burst, which means the instantaneous request rate can briefly exceed the configured limit. Whether this counts as "over-count" depends on how you define the window; align with the interviewer before choosing an algorithm.
Scale: 1M users, 100k req/s peak · Drives Redis cluster design

100k requests/sec means 100k Redis Lua script executions per second at peak. A single Redis node handles ~200–400k Lua executions/sec (see §9), so a single node is sufficient at this scale, but only if key distribution is even. If all traffic concentrates on a small number of hot users, one shard becomes a bottleneck.

The 1M unique users figure matters for memory sizing (see §3; it's small) and for config cache design: with a 60 s TTL, each API server's in-process cache holds only the users that server has served in the last minute, at most a few thousand live entries at any moment, not 1M.

⚖ Tradeoff Designing for 100k req/s now means the architecture must be horizontally scalable without a rewrite. Redis Cluster sharding (§9) is the natural extension path; make sure your key structure is compatible with cluster hash slots from day one.
Auditability: all 429s logged within 30 s · Drives async pipeline

Rate limit decisions must be debuggable ("why was user X blocked?") and auditable for abuse detection. Logging synchronously in the request path is ruled out immediately: even a 1 ms write to a log store adds to the p99 latency budget and would push the limiter past its 5 ms target at volume.

The 30 s bound is generous enough to allow an async pipeline: the middleware emits a 429 event to a local buffer, a background thread flushes to a durable message queue (Kafka), and a stream processor writes to a columnar analytics store (ClickHouse) for querying. This path never touches the hot request thread.
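The first stage of that pipeline, the local buffer with a background flusher, can be sketched in Python. This is an illustrative sketch, not a real client library; the `sink` callable stands in for a Kafka producer's batched send:

```python
import queue
import threading
import time

class AuditBuffer:
    """Fire-and-forget 429 event buffer (illustrative sketch).

    emit() is O(1) and never blocks the request thread; a daemon thread
    batches events and hands them to the sink (a Kafka producer stand-in).
    """

    def __init__(self, sink, batch_size=100, flush_interval=1.0):
        self._q = queue.Queue()
        self._sink = sink                 # callable taking a list of events
        self._batch = batch_size
        self._interval = flush_interval
        threading.Thread(target=self._run, daemon=True).start()

    def emit(self, event):                # hot path: enqueue and return
        self._q.put_nowait(event)

    def _run(self):
        while True:
            events = [self._q.get()]      # block until at least one event
            while len(events) < self._batch:
                try:
                    events.append(self._q.get(timeout=self._interval))
                except queue.Empty:
                    break                 # partial batch after a quiet interval
            self._sink(events)            # one batched write, off the hot path

# Demo: a list stands in for the durable queue.
events_out = []
audit = AuditBuffer(events_out.extend, batch_size=2, flush_interval=0.05)
audit.emit({"user_id": "u1", "status": 429})
audit.emit({"user_id": "u2", "status": 429})
time.sleep(0.5)                           # give the background thread time to flush
```

The crash window named in the tradeoff below is visible here: events sitting in `self._q` when the process dies are gone.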

⚖ Tradeoff Async logging means a crash window exists where 429 events are lost. For most abuse-detection use cases this is acceptable; missing a few events doesn't invalidate a pattern. For compliance use cases (PCI, SOC 2) you may need at-least-once delivery guarantees from Kafka, which adds operational complexity.
03

Capacity estimation

The capacity picture for a rate limiter differs from most systems: storage is tiny, latency is everything, and write volume dominates. Every request is a write, a counter increment. There are no reads that aren't also writes.

Interactive estimator

[Interactive estimator fields: avg and peak write QPS (requests/sec); active keys in a 1-minute window (keys in Redis); Redis memory for active counters (MB); 429 log events/day (~2% of total); API servers needed (@ 5k req/s each).]
💡

The key insight: Redis memory for the counters is almost always trivially small; a million active 1-minute windows cost under 100 MB. The binding constraint is Redis write throughput at peak QPS, not memory. That's why the design question is really about atomic writes, not storage. The active key count assumes 10% of users are concurrently active at peak, typical for consumer APIs; adjust upward for social/messaging workloads or downward for B2B.
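The estimator's arithmetic is easy to reproduce. A sketch under this section's assumptions; the ~100 bytes per counter key is an assumed Redis overhead (key name + small hash + metadata), not a measured figure:

```python
# Back-of-envelope capacity numbers for the rate limiter.
users = 1_000_000
active_fraction = 0.10           # share of users active in any 1-minute window
bytes_per_key = 100              # assumed per-key Redis overhead
peak_qps = 100_000
reject_rate = 0.02               # ~2% of requests end in a 429

active_keys = int(users * active_fraction)            # live counter keys
redis_memory_mb = active_keys * bytes_per_key / 1e6   # counters are tiny
log_events_per_day = int(peak_qps * reject_rate * 86_400)  # upper bound at peak
api_servers = peak_qps // 5_000                       # at 5k req/s per server
```

This yields 100k active keys and ~10 MB of counter memory, which is why the binding constraint is write throughput, not storage.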

04

High-level architecture

The rate limiter lives in the request path, after load balancing and before the business logic. This placement is deliberate: it must intercept every request regardless of which API server processes it, and it must do so without adding significant latency.

[Figure: Client (mobile / web) → Load Balancer → Rate Limiter Middleware (API gateway / sidecar) → on allow, API Server (business logic); on deny, 429 back to the client. The middleware reads and writes Redis counters (INCR / Lua), consults quota rules from the Config store (cached 60 s), and emits 429 events to the Audit Log asynchronously. Legend: synchronous vs async paths.]

Figure 1: High-level rate limiter architecture. The middleware intercepts every request, consults Redis atomically, and either forwards the request or returns a 429. Config rules are cached locally for 60 s.

Component breakdown

Load balancer. Distributes traffic across stateless API servers. Does not enforce rate limits itself: shared state would make it a bottleneck. If limiting by IP, it must preserve the original client IP in X-Forwarded-For rather than replacing it with its own address.

Rate limiter middleware. The core enforcement point. Can be deployed as an API gateway plugin (Kong, Nginx Lua), a sidecar proxy (Envoy), or in-process middleware. Does three things on every request: looks up the user's quota (from local cache), atomically checks and decrements the counter in Redis via Lua script, and either forwards the request or returns a 429.

Redis (counter store). The shared state layer. Every API server writes to the same Redis cluster, so the count is accurate regardless of which server handles a given request. Keys are namespaced by user ID and time window with a TTL set to the window size. Atomicity is achieved via a Lua script, a single round trip that reads, decrements, sets expiry, and returns the result.

Config store. Stores quota rules per tenant (Free: 100 req/min, Pro: 1,000 req/min). Read on every request but cached locally per API server with a 60-second TTL, so the lookup never adds a second Redis round-trip to the hot path. Plan changes propagate within 60 s; push invalidation is needed for immediate downgrade enforcement.

Audit log (async). Every 429 is published to a durable message queue (Kafka) and consumed by a stream processor that writes to a columnar analytics store (ClickHouse) for abuse analytics and querying. Serves two purposes: debugging ("why was user X blocked?") and abuse detection ("user X hit the limit 500 times in 5 minutes"). Writing is fire-and-forget from the request path, no latency added.

Architectural rationale

① Load balancer · Routing layer

The load balancer's job is distribution, not enforcement. Enforcing rate limits at the LB would require shared state between LB nodes, turning a stateless routing layer into a stateful bottleneck. Keeping limits in the middleware layer behind the LB means the LB stays simple and horizontally scalable.

One non-obvious requirement: if you rate limit by IP, the LB must forward the original client IP in X-Forwarded-For. Many default configurations replace the source IP with the LB's own address, meaning all traffic appears to come from one IP, and IP-based limits collapse into a global limit.

⚖ Tradeoff Only trust X-Forwarded-For from known LB CIDR ranges. Clients can spoof this header to bypass IP-based limits if the middleware trusts it unconditionally.
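That trust rule is easy to get wrong in code. A hedged sketch using Python's stdlib `ipaddress` module; the CIDR range is an illustrative assumption, and in production you would walk the chain of all your trusted proxies:

```python
import ipaddress

# Assumed internal LB range, for illustration only; use your real LB CIDRs.
TRUSTED_LB_CIDRS = [ipaddress.ip_network("10.0.0.0/8")]

def client_ip(peer_ip, x_forwarded_for=None):
    """Resolve the IP to rate limit on: trust X-Forwarded-For only when
    the direct peer is a known load balancer."""
    peer = ipaddress.ip_address(peer_ip)
    if x_forwarded_for and any(peer in net for net in TRUSTED_LB_CIDRS):
        # The rightmost entry was appended by our own LB; entries to its
        # left are client-supplied and spoofable.
        return x_forwarded_for.split(",")[-1].strip()
    return peer_ip   # untrusted peer: limit on the connection's own IP
```

A client connecting directly and sending a forged X-Forwarded-For gets limited on its real address, closing the bypass described above.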
② Rate limiter middleware · Core component

Deployment options differ meaningfully. An API gateway plugin (Kong, Nginx Lua) is the simplest: one place to configure, one place to update. But it shares the gateway's blast radius: a middleware bug can take down all routing. A sidecar proxy (Envoy) isolates the rate limiter's failure domain at the cost of a network hop per request. In-process middleware (Express.js, FastAPI) is fastest but per-service, making language-agnostic enforcement impossible.

⚖ Tradeoff Gateway plugin vs sidecar: plugin is operationally simpler and has lower latency; sidecar adds ~0.5 ms per hop but isolates failures. For most APIs, the gateway plugin is the right default.
Alternatives: Kong / Nginx Lua gateway plugin (simplest operational model); Envoy sidecar (best failure isolation); in-process (lowest latency, per-language implementation)
③ Redis (counter store) · State layer

Redis is the only practical choice for the counter store given the <5 ms p99 latency target. A traditional DB adds 5–15 ms per write under load; Redis at sub-millisecond latency is the only store that keeps the check invisible in the request path.

The Lua script is the correctness mechanism. Two separate commands (GET then INCR) leave a TOCTOU window: two concurrent requests can both read the same count, both decide "allowed," and both write, admitting two requests when one should be blocked. The Lua script runs atomically on the Redis thread, eliminating this race entirely without distributed locks.
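The race is easy to make concrete. A deterministic sketch of the bad interleaving, with a plain dict standing in for Redis and the two "requests" executed sequentially in exactly the order a real race would produce:

```python
# Two "concurrent" requests against a non-atomic GET-then-SET counter.
limit = 100
store = {"count": 99}            # one request short of the limit

# Step 1: both requests read before either writes (the TOCTOU window).
read_a = store["count"]          # request A sees 99 -> decides "allowed"
read_b = store["count"]          # request B also sees 99 -> decides "allowed"

# Step 2: both write back read + 1.
store["count"] = read_a + 1      # A writes 100
store["count"] = read_b + 1      # B also writes 100, clobbering A's update

# Two requests were admitted, but the counter only moved by one:
over_admitted = 2 - (store["count"] - 99)
```

The Lua script removes the window entirely: the read, the decision, and the write all happen inside one atomic execution on the Redis thread.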

⚖ Tradeoff Redis is an in-memory store. Counter state is lost on a cold restart. This is acceptable: counters are ephemeral by design (TTL-bound), and a brief over-allowance after a restart is far better than blocking all traffic while the store recovers.
Alternatives: Memcached (similar latency, but no Lua scripting for an atomic check-and-decrement); in-process counter (fastest, but no shared state across servers)
④ Config store · Quota rules

Quota rules can't be hardcoded constants once you have per-tenant pricing tiers. The config store holds the mapping from user ID (or plan) to their limit, window size, and algorithm. Reads happen on every request, so a raw DB lookup would add another round-trip; the local in-process cache with a 60-second TTL prevents this.

The 60-second TTL is a deliberate product decision: upgrades propagate within 60 s (acceptable), but downgrades also take 60 s to enforce. For fraud-prevention use cases where immediate downgrade is required, push invalidation via a pub/sub event from the config service is the correct extension.

Alternatives: JWT claims with the limit embedded (avoids the lookup entirely, stale on plan change); config service with push invalidation (fresher, more operationally complex)
⑤ Audit log · Observability

Synchronous 429 logging is ruled out by the latency budget: even a 1 ms log write adds to p99 and compounds at 100k QPS. The async pipeline (middleware → message queue → analytics store) decouples observability from the request path entirely: the middleware emits a fire-and-forget event and never waits for it.

The three stages each have a specific job. A durable message queue (Kafka) sits between the middleware and the analytics store: it absorbs traffic spikes without backpressure and retains events for replay if a consumer has a bug. A stream processor (Flink or a simple Kafka consumer) aggregates raw events and writes to a columnar analytics store (ClickHouse) that powers abuse dashboards and per-user 429 queries. Raw events can additionally be archived to object storage (S3) for long-term retention and ad hoc queries, optional at this scale but standard practice for compliance-sensitive deployments.

⚖ Tradeoff A server crash between buffering a 429 event locally and flushing it to Kafka can drop events. For compliance use cases (PCI, SOC 2) this is unacceptable; at-least-once delivery guarantees from Kafka are required, which in turn means duplicates and consumer idempotency handling.

Real-world comparison

| Decision | This design | Stripe | GitHub API |
| --- | --- | --- | --- |
| Algorithm | Token bucket (configurable) | Token bucket | Sliding window (1-hour) |
| Counter store | Redis + Lua | Redis | Redis |
| Limit dimensions | User + API key + IP | API key (per secret) | Authenticated user or IP |
| Headers returned | X-RateLimit-* + Retry-After | X-RateLimit-Limit/Remaining/Reset | X-RateLimit-Limit/Remaining/Reset/Used |
| On Redis outage | Fail open (configurable) | Fail open | Fail open |

Every system above uses Redis. The choice of algorithm (token bucket vs sliding window) is what differentiates them, and that choice flows directly from the burst tolerance NFR you set in §2.

05

Core algorithm: the bucket decision

Four algorithms compete in interviews. The right choice isn't fixed; it follows from your NFRs, specifically the ones set in §2: whether bursts are tolerable, what over-count accuracy is acceptable, and whether output must be strictly smoothed.

[Figure 2 panels, summarised:
① Token Bucket (state: {tokens, last_refill_ts}; refills at a fixed rate, e.g. +1 token / 6 s): allows bursts, O(1) memory, but bursts can spike downstream. Recommended for most APIs.
② Fixed Window Counter (state: {count, window_ts}; e.g. 98/100 in window N, reset in window N+1): simplest implementation, O(1) memory, but up to a 2× spike at the window boundary. Use only if the boundary spike is acceptable.
③ Sliding Window Counter (prev × (1 − elapsed / window) + curr): smooth, no boundary spike, O(1) memory (two counters), but approximate (≤ 1 req error). Best accuracy/memory balance.
④ Leaky Bucket (state: FIFO queue drained at a fixed rate): perfectly smooth output, but queuing adds latency and a full queue drops requests. Use for strict output rate smoothing only.]

Figure 2: All four rate-limiting algorithms side by side. Token bucket (①) is the recommended default. Sliding window counter (③) is the best alternative when boundary smoothness matters. Fixed window (②) is simplest but has a 2× spike risk. Leaky bucket (④) is reserved for strict output smoothing.

① Token Bucket, deep dive · Recommended default

A token bucket holds up to max tokens. Tokens refill at a fixed rate (e.g. 10/sec). Each request consumes one token. If the bucket is empty, the request is rejected with a 429. This maps naturally to user intuition: you have a quota that refills over time, and bursting is allowed as long as you have tokens saved up.

Redis representation: a hash with two fields, tokens (float) and last_refill (unix ms). On each request the Lua script computes tokens earned since the last call, adds them up to the max, then attempts to consume one. The entire read-modify-write is atomic.

Why it handles bursts well: if a user makes no requests for 30 seconds, they accumulate up to max tokens. Their next burst of requests drains the bucket instantly, which is intentional. The downstream system must be designed to handle these bursts, but for most APIs this is preferable to queuing.

⚖ Tradeoff The burst allowance can be a problem for downstream services with tight capacity. If a user's first 100 requests all arrive in 1 second after a quiet period, the downstream DB or service sees a spike it may not be sized for. Mitigate by setting burst_multiplier = 1 (bucket capacity equals the sustained limit, no burst headroom) or switching to sliding window counter, which smooths the rate without queuing.
When to choose: user-facing APIs where occasional bursts are acceptable; any system where the §2 NFR explicitly permits short bursts; the default when unsure, most flexible and well-understood
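The refill-then-consume logic from the Lua script can be expressed as a single-threaded Python sketch. This is illustrative only: precisely because it is not atomic across servers, it is also a reminder of why the Redis script exists.

```python
class TokenBucket:
    """Single-node token bucket sketch mirroring the script's state:
    {tokens, last_refill}. Timestamps are caller-supplied seconds."""

    def __init__(self, max_tokens, refill_rate_per_sec):
        self.max = float(max_tokens)
        self.rate = float(refill_rate_per_sec)
        self.tokens = float(max_tokens)   # start full
        self.last = 0.0                   # last refill timestamp

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at bucket capacity.
        self.tokens = min(self.max, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0            # consume one token
            return True
        return False                      # empty bucket -> 429

# Demo: 2-token burst capacity, refilling 1 token/sec.
b = TokenBucket(max_tokens=2, refill_rate_per_sec=1.0)
burst = [b.allow(0.0), b.allow(0.0), b.allow(0.0)]   # third is rejected
later = b.allow(1.0)                                  # one token refilled
```

The burst behaviour discussed above is visible in the demo: a full bucket admits two instant requests, then rejects until time passes.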
② Fixed Window Counter, deep dive · Simplest, boundary risk

Divide time into fixed buckets (e.g. every 60 seconds). Each bucket holds a counter, keyed by rl:{userId}:{windowTs} where windowTs = floor(now / window) × window. Increment on every request; reject if the count exceeds the limit. The key expires automatically at the end of the window.

Redis representation: a single integer key. INCR rl:u123:1711584000 followed by EXPIRE on first write. Two commands, but the race between them is acceptable here: unlike token bucket, there is no read-before-write, so the TOCTOU window is narrow (only the EXPIRE can be lost, not the count).

The boundary problem: a user can legally make 100 requests at 11:59:59 and another 100 at 12:00:00, both within their respective windows, but 200 requests in 2 seconds. Any downstream service sized for 100 req/min sees a 200-request spike. This is the seam problem, and it's why fixed window is not recommended as a default.

⚖ Tradeoff The simplicity is real: one INCR + one EXPIRE, no float arithmetic, no sorted sets. At very high QPS where Lua script overhead matters, fixed window is measurably cheaper. Accept the boundary spike only if your downstream services can absorb it or if you're enforcing at a coarse granularity (hourly limits) where the spike is proportionally small.
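An in-memory sketch of the fixed window check, using the same key shape as the Redis keys above; the dict stands in for INCR, and TTL-based expiry is omitted for brevity:

```python
import math

class FixedWindowLimiter:
    """Fixed window counter sketch keyed as rl:{userId}:{windowTs}."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counts = {}                       # key -> count (INCR stand-in)

    def allow(self, user_id, now):
        window_ts = math.floor(now / self.window) * self.window
        key = f"rl:{user_id}:{window_ts}"      # same shape as the Redis key
        count = self.counts.get(key, 0) + 1    # increment unconditionally
        self.counts[key] = count
        return count <= self.limit

# Demo: 2 requests per 60 s window.
lim = FixedWindowLimiter(limit=2, window_seconds=60)
in_window = [lim.allow("u1", 0), lim.allow("u1", 30), lim.allow("u1", 59)]
next_window = lim.allow("u1", 60)              # counter resets at the boundary
```

The boundary problem is visible here too: requests at t=30 and t=60 are both allowed even though they are 30 seconds apart, because they land in different windows.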
③ Sliding Window Counter, deep dive · Best accuracy/memory balance

A hybrid that approximates a true sliding window using only two fixed-window counters. The formula: count = prev_count × (1 − elapsed/window) + curr_count. The previous window's count is weighted by how much of it overlaps the current sliding window. As elapsed time grows, the previous window contributes less.

Redis representation: two integer keys, the current window bucket and the previous window bucket. The Lua script reads both, applies the formula, and checks against the limit. Still O(1) memory per user regardless of request rate.

Error bound: the approximation assumes request traffic within the previous window was uniformly distributed. Under that assumption the formula is exact; when traffic is not uniform, the error is bounded by the mis-weighted share of the previous window's count, a fraction of prev_count rather than the full limit. As elapsed time grows from 0 toward the window size, the previous window's contribution, and with it the possible error, shrinks to zero. In practice the error is well under 1% of the limit for realistic traffic distributions. Cloudflare uses this algorithm at global scale.

Why this over token bucket? Token bucket allows front-loaded bursts. Sliding window counter smooths the rate across the window boundary without queuing. If your NFR says "no more than N requests per minute, measured continuously", not "N tokens that refill at N/60 per second", the sliding window counter is the more accurate implementation.

⚖ Tradeoff The uniform-distribution assumption breaks down for bursty traffic. A user who sends all 100 requests in the first second of a window, then nothing, will have their count overstated by the formula during the next window. This means they may be blocked earlier than strictly necessary. For most APIs this is an acceptable false positive.
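The weighted-count formula is one line of arithmetic. A sketch with a worked example; the numbers are illustrative:

```python
def sliding_count(prev_count, curr_count, elapsed, window):
    """Approximate request count over the last `window` seconds,
    `elapsed` seconds into the current fixed window: the previous
    window's count is weighted by its remaining overlap."""
    return prev_count * (1 - elapsed / window) + curr_count

# Worked example: limit 100/min, 15 s into the current minute.
# Previous minute saw 84 requests, current minute has seen 36 so far:
# 84 * (1 - 15/60) + 36 = 63 + 36 = 99 -> still under the limit.
estimate = sliding_count(prev_count=84, curr_count=36, elapsed=15, window=60)
```

One more request in the current window pushes the estimate to 100 and trips the limit, without any boundary spike.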
④ Leaky Bucket, deep dive · Strict output smoothing

Requests enter a FIFO queue and are processed at a fixed drain rate (e.g. 10 req/s). Bursts are absorbed into the queue rather than rejected immediately. If the queue is full, new arrivals are dropped. The output rate is perfectly smooth, downstream services see a constant, predictable load regardless of input burstiness.

Redis representation: a list used as a queue, with a background worker draining it at the fixed rate: LPUSH on arrival, RPOP on each drain tick. A simpler stateless variant stores only the next allowed timestamp per user: on each request, check if now >= next_allowed; if yes, allow and set next_allowed = now + (1 / rate); if no, reject with a Retry-After of next_allowed - now. This avoids the queue entirely while preserving the fixed-rate output guarantee, at the cost of rejecting bursts immediately rather than queuing them.

Why it's rarely the right choice for APIs: queuing adds latency. A burst of 100 requests at 10 req/s drain rate means the 100th request waits 10 seconds before being processed. For an interactive API with a <5 ms p99 target, this is catastrophic. The user experiences a long hang instead of a clean 429.

When it is the right choice: backend processing pipelines where throughput matters more than latency, and where a 429 is worse than a delay, for example, payment processing, batch job submission, or webhook delivery where out-of-order drops are worse than queuing.

⚖ Tradeoff Leaky bucket shifts the failure mode from "client gets a 429" to "client waits a long time." Depending on the API contract and client timeout settings, a long wait can be more damaging than an explicit rejection with a Retry-After header. Always prefer token bucket or sliding window for synchronous request-response APIs.
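The stateless variant described above fits in a few lines. An illustrative sketch (class and method names are assumptions):

```python
class LeakyBucketStateless:
    """Stateless leaky-bucket variant: store only the next allowed
    timestamp per user. Fixed output rate, no queue; bursts are
    rejected immediately with a computed Retry-After."""

    def __init__(self, rate_per_sec):
        self.interval = 1.0 / rate_per_sec   # seconds between allowed requests
        self.next_allowed = {}               # user_id -> timestamp

    def check(self, user_id, now):
        nxt = self.next_allowed.get(user_id, 0.0)
        if now >= nxt:
            self.next_allowed[user_id] = now + self.interval
            return (True, 0.0)               # allowed
        return (False, nxt - now)            # rejected; Retry-After in seconds

# Demo: 2 req/s means one request every 0.5 s.
lb = LeakyBucketStateless(rate_per_sec=2)
first = lb.check("u1", 0.0)
second = lb.check("u1", 0.1)                 # too soon: rejected with wait time
third = lb.check("u1", 0.5)                  # exactly on schedule: allowed
```

Unlike the queued form, this never makes a request wait; it converts the wait into an explicit Retry-After, which suits synchronous APIs better.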
Sliding Window Log · Exact count, advanced alternative

A fifth option not shown in the diagram above. Instead of counters, store a sorted set of request timestamps. To check the limit, prune entries older than the window and count what remains. This is the only algorithm that gives a perfectly exact sliding window count with no approximation error at window boundaries.

The cost is memory: O(requests) per user rather than O(1). A user making 1,000 requests per window requires 1,000 timestamp entries in Redis. At scale this becomes significant. The Redis operation ZREMRANGEBYSCORE to prune old entries also adds latency proportional to the number of expired entries removed.

⚖ Tradeoff Use the sliding window log only when the 0.1% over-count tolerance from §2 cannot be relaxed, for example, authentication endpoints where even one extra allowed request is a security failure. For general-purpose APIs, the sliding window counter (③) gives indistinguishable accuracy at O(1) cost.
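An in-memory sketch of the log approach: a deque per user stands in for the Redis sorted set, with the prune step playing the role of ZREMRANGEBYSCORE.

```python
from collections import deque

class SlidingWindowLog:
    """Exact sliding window: store every request timestamp, prune what
    fell out of the window, count what remains. O(requests) memory."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.log = {}                        # user_id -> deque of timestamps

    def allow(self, user_id, now):
        ts = self.log.setdefault(user_id, deque())
        while ts and ts[0] <= now - self.window:
            ts.popleft()                     # prune expired (ZREMRANGEBYSCORE)
        if len(ts) < self.limit:
            ts.append(now)                   # record this request (ZADD)
            return True
        return False                         # exact: no boundary approximation

# Demo: 2 requests per 60 s, measured continuously.
log = SlidingWindowLog(limit=2, window_seconds=60)
a = log.allow("u1", 0)
b = log.allow("u1", 59)
c = log.allow("u1", 59.5)                    # still two requests in the window
d = log.allow("u1", 61)                      # the t=0 request has aged out
```

The memory cost named above is visible in the data structure: the deque grows with the request rate, where the counter algorithms stay at O(1).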

Our choice for this system: Token Bucket via Redis Lua

Token bucket with a Lua script is the right default for this system: our NFR in §2 explicitly permits short bursts, the smoother output of a leaky bucket would add unnecessary queuing latency for a user-facing API, and the higher memory cost of a sliding window log is unjustified when the 0.1% over-count tolerance means the sliding window counter approximation is already good enough. Token bucket is the choice that the §2 requirements drive you to, not a default you pick before reading them.

Redis Lua script: token bucket implementation
-- Redis Lua script: token bucket check-and-refill
-- KEYS[1] = "rl:{userId}"  -- persistent per-user key (no window: state carries across windows)
-- ARGV[1] = max tokens, ARGV[2] = refill rate/sec, ARGV[3] = now (unix ms)

local key = KEYS[1]
local max = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])   -- tokens per second
local now = tonumber(ARGV[3])           -- milliseconds

local data = redis.call("HMGET", key, "tokens", "last_refill")
local tokens = tonumber(data[1]) or max
local last = tonumber(data[2]) or now

-- Refill: add tokens proportional to elapsed time
local elapsed = (now - last) / 1000.0
local new_tokens = math.min(max, tokens + elapsed * refill_rate)

if new_tokens < 1 then
  return {0, math.ceil((1 - new_tokens) / refill_rate * 1000)}  -- {denied, retry_ms}
end

-- Consume one token (HSET replaces deprecated HMSET, supported since Redis 4.0)
redis.call("HSET", key, "tokens", new_tokens - 1, "last_refill", now)
redis.call("PEXPIRE", key, math.ceil(max / refill_rate * 1000))  -- expire after full refill time
return {1, 0}  -- {allowed, 0}

One open question: what happens under clock skew? If API servers have different system clocks, the now value passed to the Lua script will vary between callers, so tokens can appear to refill too fast or even move backwards. The fix is to use Redis's own server time via redis.call("TIME") inside the script; this ensures the reference clock is always the Redis node, not the calling server. (On Redis 5+, scripts replicate by effects, so calling the non-deterministic TIME command before a write is safe by default.)

05b

API design

The rate limiter exposes no external API of its own; it's middleware. But it modifies every response, and it exposes an internal management API for setting quotas. Here's what that looks like concretely.

Response headers added to every request

RESPONSE · Headers on every request (200 or 429)
{
  "X-RateLimit-Limit": 100,           // the ceiling for this user's plan
  "X-RateLimit-Remaining": 43,        // tokens left in current window
  "X-RateLimit-Reset": 1711584000,    // Unix ts when window resets / bucket refills
  "X-RateLimit-Policy": "100;w=60"   // IETF draft: 100 requests per 60-second window
}

// On 429, additionally:
{
  "Retry-After": 17,                  // seconds until the client should retry
  "X-RateLimit-Remaining": 0
}
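A small helper sketching how the middleware could assemble these headers from the limiter's decision. The function and its signature are assumptions for illustration, not a standard API; the header names follow the convention shown above:

```python
def rate_limit_headers(limit, remaining, reset_ts, window_seconds,
                       allowed, retry_after_seconds=0):
    """Build the response headers for one rate-limit decision."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_ts),
        # IETF draft syntax: "<limit>;w=<window seconds>"
        "X-RateLimit-Policy": f"{limit};w={window_seconds}",
    }
    if not allowed:
        headers["Retry-After"] = str(retry_after_seconds)  # only on 429
    return headers

# Demo: an allowed request and a rejected one.
ok = rate_limit_headers(100, 43, 1711584000, 60, allowed=True)
rejected = rate_limit_headers(100, 0, 1711584000, 60,
                              allowed=False, retry_after_seconds=17)
```

Clamping `remaining` at zero matters: token-bucket arithmetic can briefly produce a fractional or negative value, and clients expect a non-negative integer.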

Management API, quota configuration

PUT /admin/v1/quotas/{userId}
// Request body
{
  "limit": 1000,         // sustained rate: max requests per window
  "window_seconds": 60,  // window size
  "algorithm": "token_bucket",  // or "sliding_window_counter"
  "burst_multiplier": 2  // bucket capacity = limit × burst_multiplier (L5+)
                         // e.g. limit=1000, burst_multiplier=2 → max_tokens=2000
                         // a user can burst up to 2000 req instantly if bucket is full
}

// Response 200
{
  "userId": "u_8f2a",
  "effective_at": "2026-03-28T10:00:00Z",
  "propagation_lag_seconds": 60,  // cache TTL
  "derived": {
    "max_tokens": 2000,           // limit × burst_multiplier → passed as ARGV[1] to Lua
    "refill_rate_per_sec": 16.67  // limit / window_seconds → passed as ARGV[2] to Lua
  }
}
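The `derived` block is plain arithmetic mapping the request body onto the Lua script's ARGV values. A sketch (the function name is illustrative):

```python
def derive_bucket_params(limit, window_seconds, burst_multiplier=1):
    """Map a quota config onto token-bucket parameters:
    bucket capacity (ARGV[1]) and refill rate (ARGV[2])."""
    max_tokens = limit * burst_multiplier      # bucket capacity
    refill_rate = limit / window_seconds       # sustained tokens/sec
    return {"max_tokens": max_tokens,
            "refill_rate_per_sec": round(refill_rate, 2)}

# Demo: the Pro-tier example from the response body above.
derived = derive_bucket_params(limit=1000, window_seconds=60,
                               burst_multiplier=2)
```

With burst_multiplier = 1, capacity equals the sustained limit and the burst headroom disappears, which is the mitigation mentioned in the token-bucket tradeoff.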

Optional endpoints

| Endpoint | Purpose | Level |
| --- | --- | --- |
| GET /admin/v1/quotas/{userId} | Inspect current quota + remaining tokens | L5 |
| DELETE /admin/v1/counters/{userId} | Reset a user's counter (support action) | L5 |
| GET /admin/v1/429-log?userId=… | Audit log of 429 events per user | L6 |
| POST /admin/v1/quotas/bulk | Batch quota update for tenant migration | L7/L8 |
06

Core flow

Every request through the system follows the same path. The critical decision, allow or deny, happens inside a single atomic Redis operation, which is why the latency stays under 5 ms even at 100k QPS.

[Figure: request flow. Request arrives → look up the user's limit (local cache; miss → Config DB) → build the Redis key (rl:{userId} for the token bucket; windowed keys for fixed/sliding window) → execute the Redis Lua script (atomic: check tokens → decrement → set expiry) → Lua returns {1, 0} (allowed) or {0, retry_ms} (denied). Allowed: forward to business logic (200). Denied: return 429 with a Retry-After header. Either way, add X-RateLimit-* headers to the response; 429 events are pushed to the audit Kafka topic asynchronously.]

Figure 3: Request flow. The Lua script executes in a single Redis round-trip (~0.5 ms LAN). The decision branches immediately after; audit logging is always async.

💡

The key tradeoff in this flow is why the Lua script is one round-trip, not two (GET then SET). Two operations introduce a TOCTOU window where two concurrent requests can both read count=99, both decide "allowed," and both write count=100, admitting two requests when one should be blocked. The Lua script eliminates this entirely.

💬

Classic L5 probe: "What if the Redis write succeeds but the network drops before returning to the middleware, do you count the request or not?" Answer: the token has already been decremented, so the request counts as consumed. This is the right behavior; the alternative (retrying the Lua script) risks double-counting.

07

Data model

The data model is split between Redis (hot counter state) and a relational database (quota configuration). They serve opposite purposes: Redis is updated on every single request; the config DB is read rarely and written almost never.

Access patterns first

| Operation | Frequency | Query shape |
| --- | --- | --- |
| Check + increment counter | Every request, 100k/s peak | Read-modify-write by key: rl:{userId} (token bucket) or rl:{userId}:{windowTs} (fixed/sliding window) |
| Lookup quota config | Once per 60 s per user (cache TTL) | Point lookup by userId |
| Admin: update quota | Rare; plan changes, support actions | Update by userId |
| Admin: query 429 log | Occasional; debugging, abuse reports | Range scan by userId + timestamp |

Two things stand out: counter access is hot, write-heavy, and requires an atomic read-modify-write, so Redis is the only viable store; config access is cold, read-heavy, and needs no atomicity, so a relational DB with local caching is a perfect fit.

Figure 4 (text summary):
  • Redis, counter keys — rl:{userId}:{windowTs}. Token bucket: hash fields tokens (FLOAT, remaining tokens) and last_refill (INT, unix ms). Or sliding-window log: sorted set with score = request_ts (unix ms) and member = unique request ID. TTL = window_size (auto-expire).
  • Relational DB, quota config — user_quotas: userId (VARCHAR, PK), limit (INT, max requests), window_sec (INT, window size), algorithm (ENUM, bucket|sliding), plan (VARCHAR, free|pro|enterprise).
  • audit_429_log (Kafka → ClickHouse) — userId, ts, endpoint, remaining_tokens.

Figure 4: Data model. Redis keys are namespaced and auto-expire. Config lives in a relational DB, cached locally on each API server for 60 seconds.

Why TTL on every Redis key? Memory hygiene

Without a TTL, inactive users accumulate dead counter keys forever. With a TTL set to the window size, a user who stops making requests will have their key automatically evicted when the window closes. For a 1-minute window, keys live at most 1 minute after the last request.

This is what keeps Redis memory bounded (see §3: even at 1M users, the active key set is tiny) and means you never need a background cleanup job.

08

Caching strategy

The rate limiter has an inverted caching profile compared to most systems: Redis is the hot state store, not a cache in front of something slower. What you're actually caching is everything that sits in front of Redis, specifically, quota config lookups that would otherwise add a second Redis round-trip to every request. Understanding this inversion matters because it changes what "cache invalidation" means: invalidating a counter is never necessary (TTLs handle it), but invalidating a quota config entry on plan change is a real operational concern.

Placing the layers against the §04 architecture makes the picture concrete. Every request passes through three potential cache interactions before a decision is made:

Figure 6 (text summary): ① config lookup — the rate limiter middleware's in-process L1 cache holds quota config (TTL 60 s) and the circuit-breaker flag (TTL 10 s), so config is served without touching Redis; ② counter read-write — the Lua script always hits the Redis counter store (tokens, last_refill, written on every request, TTL = window size), which is authoritative state, not a cache of anything; ③ config source — the Config DB, source of truth for quota rules, is reached only on a cache miss (~every 60 s per user per instance).

Figure 6: Where caching fits in the request path. The in-process cache inside the middleware serves quota config without touching Redis. Redis is the authoritative counter store, every request reads and writes it. The Config DB is only reached on a cache miss.

| What | Lives in | Holds | TTL | Why |
| --- | --- | --- | --- | --- |
| In-process config cache | Rate limiter middleware (per instance) | User quota config (limit, window, algorithm) | 60 s | Avoids a Redis round-trip for config on every request; a cold miss falls through to the Config DB |
| In-process circuit breaker | Rate limiter middleware (per instance) | Redis outage flag | 10 s | Prevents a thundering herd of retries against a failing Redis; resets automatically after 10 s |
| Redis counter store | Shared Redis cluster | Counter state (tokens, last_refill) | = window size | Authoritative rate-limit state, not a cache. Written atomically on every request via the Lua script. |

Cache invalidation trap: When a user's plan is downgraded, the local config cache will still show the old (higher) limit for up to 60 seconds, so a just-downgraded user can send 60 more seconds of Pro-tier traffic. For most products this is acceptable. If it isn't, you need push invalidation: the config service publishes a pub/sub event to all API servers to evict the specific key immediately.
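A minimal sketch of such an in-process config cache, assuming a `fetch` callable that stands in for the Config DB lookup; `invalidate` is what a pub/sub "plan changed" handler would call (all names here are illustrative):

```python
import time

class ConfigCache:
    """In-process quota-config cache with a TTL (lazy expiry) and an
    explicit evict hook for push invalidation on plan changes."""

    def __init__(self, fetch, ttl_sec: float = 60.0):
        self._fetch = fetch  # fallthrough to the Config DB
        self._ttl = ttl_sec
        self._entries: dict[str, tuple[float, object]] = {}

    def get(self, user_id: str):
        hit = self._entries.get(user_id)
        if hit is not None and time.monotonic() - hit[0] < self._ttl:
            return hit[1]  # fresh: no DB round-trip
        value = self._fetch(user_id)  # cold or expired: hit the Config DB
        self._entries[user_id] = (time.monotonic(), value)
        return value

    def invalidate(self, user_id: str) -> None:
        """Called by the pub/sub evict handler: the next get() falls
        through to the DB instead of serving the stale limit."""
        self._entries.pop(user_id, None)
```

Without the `invalidate` hook, a downgraded user is served the stale (higher) limit until the TTL expires, which is exactly the 60-second window described above.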

09

Deep-dive scalability L5+

At 100k QPS with a single Redis node, you're likely fine: Redis handles ~1M simple ops/sec (GET, SET, INCR). Lua scripts, however, are significantly heavier than raw commands: a realistic ceiling is 200–400k script executions/sec on typical hardware, because each script blocks Redis's single thread for longer than a primitive op. At 100k QPS you have headroom; at 500k+ QPS, Redis Cluster becomes necessary. Production systems also have to plan for what happens when demand grows, when Redis is partitioned, and when consistency guarantees need to be loosened for throughput.

Figure 5 (text summary): clients → global LB/CDN → API servers 1…N → Redis Cluster (3 primary shards, 3 read replicas). API servers read quota rules from the Config DB and stream 429 events to Kafka → ClickHouse for abuse analytics, with anomaly alerts routed to PagerDuty/OpsGenie.

Figure 5: Production-scale architecture. Redis cluster shards counters across 3 primaries. 429 events stream to Kafka and into ClickHouse for abuse analytics.

Redis Cluster sharding L5, throughput

Redis Cluster distributes keys across 16,384 hash slots assigned to N primary nodes. For rate limiting, each user's counter key hashes to exactly one shard, so all operations for a given user are atomic on that shard. No cross-shard transactions are needed.

A 3-primary cluster, with each primary handling ~200–300k Lua executions/sec (consistent with the ceiling established above), gives headroom to ~600k–900k QPS before further horizontal scaling is needed. Add shards by resharding; Redis Cluster handles this online.

⚖ Tradeoff Resharding migrates key slots between nodes; during this window, some Lua scripts may fail with MOVED errors. The middleware must handle retries with backoff.
Local counter + periodic sync (approximate) L6, extreme throughput

Above ~5M QPS, even Redis Cluster becomes a bottleneck. The alternative: each API server maintains an in-process counter and periodically (every 100 ms) syncs with Redis via INCRBY. Between syncs, the distributed count can be off by at most (number of servers) × (per-server unsynced count). This is an explicit trade: approximate counting in exchange for linear write scalability.

This design is used by Cloudflare and is acceptable when "at most 5% over-count" is a tolerable SLA. It's unacceptable for security-critical limits (authentication attempts, payment API).

⚖ Tradeoff On server crash, the unsynced local count is lost: those requests were admitted but never counted in Redis, so the system under-counts for the crash window.
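The local-counter design can be simulated in a few lines. This is a single-threaded sketch with hypothetical names: `shared` stands in for the Redis counter and `sync` for the periodic INCRBY-and-reread.

```python
class ApproximateLimiter:
    """Cluster-wide view of the local-counter + periodic-sync design."""

    def __init__(self, limit: int):
        self.limit = limit
        self.shared = 0  # stands in for the Redis counter

    def make_server(self):
        return _Server(self)

class _Server:
    """One API server: admits against a stale view of the shared count
    and batches its own increments until the next sync."""

    def __init__(self, cluster):
        self.cluster = cluster
        self.local = 0  # unsynced increments
        self.view = 0   # shared count as of last sync

    def allow(self) -> bool:
        if self.view + self.local >= self.cluster.limit:
            return False
        self.local += 1
        return True

    def sync(self) -> None:
        """Equivalent of INCRBY shared local, then re-reading the total."""
        self.cluster.shared += self.local
        self.local = 0
        self.view = self.cluster.shared
```

The worst case falls out directly: with N servers and no syncs, each server's stale view admits up to the full limit, so the over-count is bounded by the unsynced increments across all servers, exactly the servers × local_count bound stated above.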
Multi-tier rate limits L6, tenant design

Enterprise products often have nested limits: per-user, per-team, and per-organization, all enforced simultaneously. A request must pass all three checks. The implementation: the Lua script is called once per dimension, in order. If any check fails, the 429 is returned immediately with the header naming which dimension was exhausted: X-RateLimit-Violated: org.

Geo-distributed rate limiting L7/L8, global consistency

If a user can send requests to two different regions simultaneously, per-region counters will each allow the full quota, effectively doubling the limit. True global rate limiting requires cross-region counter sync. Options: (1) route all requests for a user to their "home" region (sticky routing, a single point of failure); (2) use a globally replicated database with strongly consistent cross-region reads, such as Google Spanner (note that DynamoDB Global Tables replicate across regions with eventual consistency by default, which puts them closer to option 3); (3) accept approximate counting with gossip-based sync.

Most systems choose option 3: allow a known over-count window at region boundaries, monitor for abuse via the audit log, and address egregious violations reactively.

How gossip-based sync works: each region periodically (e.g. every 100–500 ms) broadcasts its local counter delta to peer regions. Each region maintains a per-region counter vector and sums them to get the global count. This is a form of CRDT (Conflict-free Replicated Data Type), specifically a G-Counter, where each region owns one slot in the vector and only increments it. Merging is simply taking the max of each slot across all received vectors. Convergence time equals the gossip interval plus network RTT between regions (typically 100–300 ms cross-continent), giving a bounded over-count window of at most limit × (gossip_interval / window_size).
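The G-Counter itself is only a few lines; here is a sketch with one slot per region (names illustrative):

```python
class GCounter:
    """G-Counter CRDT: each region owns one slot and only ever
    increments it. Merge takes the element-wise max, which makes merges
    commutative, associative, and idempotent, so gossip messages can
    arrive in any order, repeatedly, without corrupting the count."""

    def __init__(self, region: str, regions: list[str]):
        self.region = region
        self.slots = {r: 0 for r in regions}

    def increment(self, n: int = 1) -> None:
        self.slots[self.region] += n  # only our own slot, ever

    def merge(self, gossiped: dict[str, int]) -> None:
        for r, count in gossiped.items():
            self.slots[r] = max(self.slots[r], count)

    def global_count(self) -> int:
        return sum(self.slots.values())
```

Until a merge happens, each region only sees its own slot, which is exactly the bounded over-count window: a user who sends 90 requests to each of two regions is globally at 180, but each region locally believes 90 until the next gossip round.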

10

Failure modes & edge cases

| Scenario | Problem | Solution | Level |
| --- | --- | --- | --- |
| Redis node goes down | All counter checks fail; traffic either stops or goes unchecked | Circuit breaker: fail open for non-auth endpoints. Alert on-call (e.g. PagerDuty). Redis Sentinel (the built-in high-availability manager) detects failure within ~30 s (default down-after-milliseconds); total failover including election and promotion typically takes 30–60 s end-to-end. | L3/L4 |
| Window boundary spike | Fixed-window allows 2× the limit at window boundaries, overloading downstream | Switch to a token bucket or sliding-window counter algorithm; both smooth traffic across boundaries | L3/L4 |
| Counter race condition (GET then SET) | Two servers read the same count with GET, both decide "allowed," both write, admitting one request too many (TOCTOU on GET+SET, not on INCR itself) | Put all state mutation in a single Lua script, atomic on the Redis thread, eliminating the read-modify-write race without distributed locks | L5 |
| Plan upgrade cache lag | User upgrades plan but still sees the old lower limit for up to 60 s | Pub/sub invalidation: the config service sends an evict event to all API servers on plan change | L5 |
| Clock drift between API servers | Window timestamps differ across nodes; the same user gets different windows | Use redis.call("TIME") inside the Lua script so windows use the Redis server clock, not the application clock | L5 |
| Header spoofing (X-Forwarded-For) | Attacker spoofs the IP header to bypass IP-based limits | Trust X-Forwarded-For only from known load-balancer CIDR ranges, never from arbitrary clients; prefer X-Real-IP set by the LB itself, which clients cannot inject. For any security-critical limit (auth endpoints, payment APIs), use the authenticated user ID as the primary key; IP-based limits are a last-resort signal, not a primary defence. | L6 |
| Thundering herd on Redis restart | All API servers miss their local cache simultaneously after a Redis restart; 100k req/s hit cold Redis | Jittered local cache TTL (60 s ± 10 s random) plus request coalescing: one in-flight config fetch per key | L7/L8 |
| Distributed counter drift at geo boundaries | A user sends 90 req to US-EAST and 90 req to EU-WEST; both regions allow; the effective limit is 2× | Accept the approximate over-count; use audit-log anomaly detection to flag and throttle after the fact. For auth endpoints, use sticky geo routing. | L7/L8 |
| Idempotency key on retry | A POST fails mid-flight and is retried with the same idempotency key; should the retry consume a token? | For APIs that support idempotency keys (Stripe-style), do not count idempotent retries against the quota: the client is replaying a request that already consumed a token. Store the idempotency key with the original response in a short-TTL Redis key (24 h); on retry, return the cached response without re-entering the rate limiter. This must be established as a requirement in §2: it changes the key structure and adds a pre-check lookup step to the core flow. | L6 |
11

How to answer by level

L3/L4 SDE I / SDE II, Can you build a working system?
What good looks like
  • Pick token bucket or fixed window, justify the choice
  • Use Redis with INCR + EXPIRE
  • Return 429 with Retry-After header
  • Identify (but don't need to fully solve) the race condition
  • Mention per-user key scoping
What separates this from L5
  • No thought for distributed atomicity
  • Doesn't ask about burst tolerance vs smoothing
  • Missing TTL reasoning on Redis keys
  • No mention of fail-open vs fail-closed
L5 Senior SDE, Do you understand the tradeoffs?
What good looks like
  • Lua script for atomicity, explains why two separate commands aren't safe
  • Chooses algorithm based on burst tolerance NFR, not by default
  • Designs per-user + per-IP dimensions, explains the key structure
  • Addresses clock drift via Redis TIME
  • Proactively raises fail-open semantics
What separates this from L6
  • Single-tier limits only (no tenant/org hierarchy)
  • No plan for global/cross-region consistency
  • Doesn't think about approximate counting as a deliberate tradeoff
L6 Staff SDE, Can you own this end-to-end?
What good looks like
  • Multi-tier limits (user → team → org), Lua script runs all checks
  • Push invalidation for plan changes
  • Sliding window counter approximation vs exact, chooses based on use case
  • Audit log pipeline to ClickHouse for abuse analytics
  • Addresses header spoofing and IP trust chain
What separates this from L7
  • Doesn't model the correctness guarantees under network partition
  • No framework for deciding when approximate counting is "good enough"
  • Doesn't reason about multi-region topology as a first-class constraint

At L7, you're expected to define an explicit correctness model, e.g., "we accept at most 5% over-count during partition windows lasting under 30 seconds", rather than just naming the problem. The difference is quantified tolerance with a rationale, not just awareness that drift exists.

L7/L8 Principal / Distinguished, Should we build this, and how?
What good looks like
  • Frames approximate vs exact counting as a deliberate correctness model with quantified error bounds
  • Designs for geo-distributed consistency: sticky routing vs gossip sync vs Spanner
  • Considers cell-rate algorithm (GCRA) for strict burst SLAs
  • Thinks about what the rate limiter protects against vs what it can't (state exhaustion, amplification attacks)
  • Proactively surfaces build vs buy decision (Kong, Envoy, AWS API GW)
The bar
  • Shows the reasoning behind every tradeoff, not just the conclusion
  • Identifies failure modes the interviewer hasn't raised
  • Can quantify the error bound of approximate counting and defend when it's acceptable
  • Drives the conversation, doesn't just respond to prompts

Classic probes, level-differentiated answers

| Probe | L3/L4 | L5/L6 | L7/L8 |
| --- | --- | --- | --- |
| "Two servers increment the same counter simultaneously, what happens?" | Identifies the race. Suggests INCR + EXPIRE (not fully atomic). | Solves it with a Lua script. Explains why WATCH/MULTI is worse (retries, latency). | Discusses whether this race even matters in approximate-counting mode. Quantifies the over-count risk. |
| "Redis goes down. What does your rate limiter do?" | Mentions fail-open vs fail-closed. Picks one. | Designs a circuit breaker. Differentiates behavior by endpoint sensitivity. Knows the Redis Sentinel failover timeline. | Models the blast radius: how many requests leak during a 30–60 s failover, and is that acceptable given the SLA? Proposes multi-AZ Redis for lower RTO. |
| "How do you rate limit a user who is hitting two different regions?" | Doesn't raise this. Assumes single-region. | Identifies that per-region counters allow 2× the limit. Proposes sticky geo routing as the simplest fix. | Compares sticky routing (simple, single PoF) vs a globally replicated DB (Spanner: strongly consistent, expensive) vs gossip sync (eventually consistent, tunable error bound). Chooses based on use-case sensitivity. |
| "Token bucket vs sliding window, which do you pick and why?" | Picks token bucket. Can describe it. Doesn't connect the choice to requirements. | Derives the choice from the burst-tolerance NFR: if burst is acceptable, token bucket; if exact counting is needed, sliding-window counter; if output smoothing is needed, leaky bucket. | Adds that the sliding-window log is the only exact algorithm; the others are approximations. Discusses GCRA (Generic Cell Rate Algorithm) as the rigorous foundation for token bucket. Notes that "burst" itself needs a formal definition in the SLA. |
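For reference, GCRA can be sketched around a single stored value per key, the theoretical arrival time (TAT). This is an illustrative single-key version under stated assumptions (tolerance = (burst − 1) × interval), not a distributed implementation:

```python
class GCRA:
    """Generic Cell Rate Algorithm sketch. Tracks one value, the
    Theoretical Arrival Time (TAT): the earliest time the next request
    is 'scheduled' for. A request conforms if it does not arrive more
    than the burst tolerance ahead of that schedule."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.interval = 1.0 / rate_per_sec           # emission interval T
        self.tolerance = (burst - 1) * self.interval  # tau: allows `burst` back-to-back
        self.tat = 0.0

    def allow(self, now: float) -> bool:
        tat = max(self.tat, now)          # if idle, schedule from now
        if tat - now > self.tolerance:    # too far ahead of schedule
            return False                  # non-conforming; TAT unchanged
        self.tat = tat + self.interval    # book the next slot
        return True
```

From an idle start, exactly `burst` back-to-back requests conform; after that, capacity returns at one request per interval, which is why GCRA gives a precise, SLA-friendly definition of "burst."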

How the pieces connect

Every architectural decision in this article traces back to a requirement set in §2 or a number established in §3.

  • 1 Latency < 5 ms NFR (§2) → rules out DB reads on every request → mandates Redis as counter store (§4) → config cached locally at 60 s TTL (§8)
  • 2 99.99% availability NFR (§2) → Redis outage is a real failure scenario → fail-open circuit breaker (§4, §10) → behavior differs by endpoint sensitivity (§11)
  • 3 Race condition risk (§5) → two-command INCR+EXPIRE is not atomic → Lua script encapsulates all state mutation in one round-trip (§5, §6) → eliminates TOCTOU without distributed locks
  • 4 Configurable per-tenant limits (§2) → quota config lookup on every request → in-process cache with 60 s TTL (§8) → push invalidation needed for plan downgrades (§10)
  • 5 Audit / observability NFR (§2) → every 429 must be logged → async Kafka pipeline (§4, §9) → ClickHouse for abuse analytics without adding request-path latency
  • 6 Burst tolerance choice (§5 algorithm decision) → token bucket allows burst, leaky bucket smooths output → drives whether downstream services must themselves be burst-tolerant (§4 component rationale)