System Design Interview Guide

Design a Notification Service

Simple to trigger, hard to deliver. Fan-out at scale, priority queues, delivery guarantees, and multi-channel routing — the real complexity of push, email, and SMS.

L3/L4 — Basic fan-out + channels
L5/L6 — Priority queues + guarantees
L7/L8 — Global delivery + observability
01

What the interviewer is testing

The notification service prompt is deceptively broad. "Send notifications" sounds like a simple pub-sub problem. What makes it hard is the combination of heterogeneous delivery channels (push, email, SMS, in-app), wildly different latency tolerances across notification types, and the need to avoid both silent failures and duplicate deliveries at the same time.

The question is a proxy for four distinct skills: understanding how fan-out complexity grows with follower counts; reasoning about delivery guarantees and idempotency; designing for channel-specific rate limits imposed by third-party providers; and prioritising notification types so a marketing blast doesn't delay a password reset.

Level | What good looks like
L3/L4 | Identifies push, email, SMS channels. Routes through a queue. Understands retry on failure. Knows what APNs and FCM are.
L5 | Designs priority queues per notification type. Justifies fan-out strategy with trade-offs. Reasons about at-least-once vs exactly-once. Handles user preferences and quiet hours.
L6 | Owns the end-to-end schema, delivery log, deduplication layer, and third-party rate limiting. Designs for observability, delivery funnels, drop-off tracking, alert thresholds.
L7/L8 | Addresses cross-region latency, provider failover, data residency constraints, cost modeling across channels, and the make-vs-buy decision for the notification infrastructure.
02

Requirements clarification

Notification systems span a huge design space. The first thing to do is nail down which of the four channel types are in scope, what notification categories exist, and whether "delivery" means at-most-once, at-least-once, or exactly-once. The answers change every major component.

Functional requirements

Requirement | In scope
Push notifications (iOS via APNs, Android via FCM) | ✓
Email notifications (transactional + marketing) | ✓
SMS notifications (OTP, alerts) | ✓
In-app notification inbox (read/unread state) | ✓
User preferences (opt-in/out per channel, quiet hours) | ✓
Priority tiers: transactional, social, marketing | ✓
Scheduled / delayed notifications | ✓ (L5+)
Rich push (images, actions, deep links) | Partial

Before the NFRs, it's worth understanding how push delivery actually works; it shapes several of the constraints below.


APNs & FCM: the push gatekeepers. Apple Push Notification service (APNs) and Firebase Cloud Messaging (FCM) are Apple's and Google's official push delivery networks. Your server never opens a direct socket to a user's phone; instead it sends the notification payload to APNs or FCM, which maintain a persistent encrypted connection to every registered device. Both services impose per-app rate limits, require a registered device token per installation, and provide feedback channels to detect uninstalled apps. You cannot bypass them for iOS; Android allows alternatives, but FCM is the standard.

Non-functional requirements

NFR | Target | Why
Transactional delivery latency (p99) | < 5 seconds end-to-end | OTP, password reset; delay kills usability
Social/engagement delivery latency (p99) | < 30 seconds | Likes, comments; freshness matters but isn't critical
Marketing delivery latency | Minutes to hours acceptable | Bulk campaigns are best-effort; don't compete with transactional
At-least-once delivery guarantee | Required for all channels | Silent failure (missing OTP) is worse than a duplicate
Duplicate notification rate | Minimised in steady state (< 0.1%); brief windows after crashes acceptable | Duplicates are annoying but survivable; structured retry + idempotency keys keep this low in practice
Availability | 99.9%+ | Notification pipeline is non-critical-path but high visibility
Scalability | 10M notifications/day at baseline, 100M peak (marketing blast) | Events (Black Friday, elections) cause sudden 10× spikes
Why at-least-once, not exactly-once?

Exactly-once delivery across distributed systems and third-party providers (APNs, Twilio) is effectively impossible to guarantee end-to-end: you can get close, but you cannot control what happens inside the provider's delivery layer. The practical answer is at-least-once delivery at the infrastructure level, with idempotency keys in the payload so the receiving app or device can deduplicate. This shifts a hard distributed-systems problem into an application-level one.

Tradeoff: Idempotency keys require a deduplication store (a short-lived distributed cache with TTL). This adds a lookup on every delivery attempt. The alternative, accepting duplicates silently, is simpler but creates poor user experience at any non-trivial failure rate.
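A minimal sketch of that dedup lookup, using an in-memory dict as a stand-in for the Redis store; class and method names are hypothetical, and in production the check collapses to a single atomic Redis call.

```python
import time

DEDUP_TTL_SECONDS = 24 * 3600  # 24 h dedup window

class DedupCache:
    """In-memory stand-in for the Redis dedup store."""

    def __init__(self):
        self._seen = {}  # idempotency_key -> expiry timestamp

    def first_send(self, idempotency_key, now=None):
        """True if this key has not been seen inside the TTL window.

        Mirrors Redis `SET key 1 NX EX ttl`: the first caller claims the
        key; a retried worker gets False and suppresses the duplicate.
        """
        now = now if now is not None else time.time()
        expiry = self._seen.get(idempotency_key)
        if expiry is not None and expiry > now:
            return False  # duplicate within the window: drop, log as suppressed
        self._seen[idempotency_key] = now + DEDUP_TTL_SECONDS
        return True
```

With a real Redis client the whole check is one atomic call, e.g. `r.set(key, 1, nx=True, ex=ttl)`, which returns None when the key already exists.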
Why separate latency SLAs for notification types?

Without separate priority queues, a marketing blast of 50M emails queued on Monday morning will delay the OTP a user needs to log in. The latency targets are not just SLOs; they justify the entire queue topology. If all notifications had the same tolerance, a single queue would be fine. The tight 5-second window for transactional also drives the auto-scaling policy: transactional workers must scale in seconds, not minutes.

What this drives: Three separate queue topics (transactional, social, marketing) with independent worker pools, independent scaling policies, and independent alerting thresholds.
03

Capacity estimation

Notification systems have an unusual write profile: write volume (notification sends) can be many orders of magnitude larger than the number of triggering events, because a single social event (a post from a celebrity with 10M followers) fans out to millions of push notifications. Storage is dominated by the delivery log, not the notification metadata itself.

Worked example

Assumed inputs: 1M events/day × 10 avg recipients, 500 B per log record, 90-day (≈3-month) retention, 10M registered users, ×10 peak multiplier.

Notifications/day = events/day × avg recipients = 10M
Write QPS (avg) = notifications/day ÷ 86,400 s ≈ 116
Write QPS (peak) = avg QPS × peak multiplier ≈ 1,160
Delivery log rows = notifications/day × retention days = 900M
Log storage = rows × record size ≈ 450 GB
Cache (preferences) = registered users × 1 KB/user ≈ 10 GB

The key insight: the fan-out multiplier is the dominant lever. Average QPS scales linearly with both events/day and recipients/event, but a marketing blast or celebrity post pushes it to the peak multiple. Choose a peak multiplier that matches your workload: social products spike on viral events, B2B products on business-hours campaigns. This is why fan-out must be async: the queue absorbs the burst, and workers drain it at their own rate, bounded by provider rate limits.
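These back-of-envelope formulas are easy to sanity-check in a few lines of Python; the input values below are illustrative, matching the 10M/day baseline from the NFRs.

```python
def capacity(events_per_day, avg_recipients, peak_multiplier,
             record_bytes, retention_days):
    """Back-of-envelope numbers for the estimator above (inputs assumed)."""
    notifs_per_day = events_per_day * avg_recipients
    avg_qps = notifs_per_day / 86_400  # seconds per day
    return {
        "notifs_per_day": notifs_per_day,
        "avg_qps": avg_qps,
        "peak_qps": avg_qps * peak_multiplier,
        "log_rows": notifs_per_day * retention_days,
        "log_bytes": notifs_per_day * retention_days * record_bytes,
    }

# 1M events/day × 10 recipients -> 10M notifications/day,
# avg ≈ 116 QPS, peak ≈ 1,160 QPS, log ≈ 900M rows ≈ 450 GB over 90 days
est = capacity(1_000_000, 10, 10, 500, 90)
```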

04

High-level architecture

[Architecture] Caller services (Auth · Social · Orders) → Notification API (validate · enrich · route · rate-limit) → Kafka priority topics (⚡ Transactional: OTP · password reset | 💬 Social: likes · comments · follows | 📣 Marketing: campaigns · promotions) → Fan-out workers (resolve recipients · check preferences & quiet hours · route per channel; backed by Prefs DB: PostgreSQL + Redis cache) → per-channel dispatchers (Push: APNs · FCM | Email: SendGrid · SES | SMS: Twilio · Vonage, each behind a circuit breaker) → Notification Log (Cassandra: notification_log · delivery_attempt · TTL 90 days) ← In-App Inbox API reads from the log.

Notification service, three-tier async fan-out, per-channel dispatchers, and append-only delivery log

Component breakdown

Notification API is the single ingress point for all notification events. It validates the payload (notification type, target users, template ID), enforces per-caller rate limits, enriches the event with metadata (sending timestamp, idempotency key), and publishes to the appropriate priority queue topic. It does not fan out; that is the worker's job.

Priority queues are the architectural keystone: three separate topics in a durable log (Apache Kafka, or separate AWS SQS queues at smaller scale): transactional, social, and marketing. Each has independent consumer groups and independent scaling policies. This isolation ensures a 50M-message marketing blast cannot starve an OTP delivery, which directly satisfies the latency NFR from §2.

📬 What is Apache Kafka?

Apache Kafka in one paragraph. Kafka is a distributed commit log: producers append messages to named topics, and consumers read them by tracking an offset (a position in the log). Unlike traditional queues that delete messages on consumption, Kafka retains messages for a configurable period, enabling multiple independent consumer groups to read the same topic at their own pace (e.g. the delivery workers and the analytics pipeline both consume the same topic without interfering). If a consumer crashes, it restarts from its last committed offset; no message is lost. This replay-on-crash property is what makes Kafka the foundation of at-least-once delivery guarantees.

Fan-out workers consume from the priority queues. For each event, they look up the target user list, check preferences and quiet hours, select the right channel(s), and enqueue one send task per recipient per channel into the downstream dispatcher queue. They also write the initial record to the notification log.

Channel dispatchers are thin, stateless workers that handle one channel each: push (APNs + FCM), email (SendGrid or SES), and SMS (Twilio or Vonage). They implement channel-specific retry logic, third-party rate limit handling (respecting 429 responses), and payload formatting. Keeping them separated means a third-party provider outage only affects one channel, not the others.
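A sketch of that channel-specific retry logic under a few stated assumptions: 429 honours the provider's Retry-After hint, 5xx backs off exponentially with jitter, and any other 4xx is treated as permanent. The function name and (status, retry_after) response shape are illustrative, not any provider's actual API.

```python
import random

def plan_retries(responses, base_delay=1.0, max_attempts=5):
    """Walk a sequence of provider responses; return (outcome, delays slept).

    responses: list of (http_status, retry_after_seconds_or_None) tuples,
    one per attempt, simulating what the provider returned each time.
    """
    delays = []
    for attempt, (status, retry_after) in enumerate(responses[:max_attempts]):
        if 200 <= status < 300:
            return "delivered", delays
        if status == 429 and retry_after is not None:
            delays.append(retry_after)  # respect the provider's hint exactly
        elif status >= 500:
            # exponential backoff with up to 10% jitter to avoid thundering herd
            delays.append(base_delay * 2 ** attempt * (1 + random.random() * 0.1))
        else:
            return "dropped", delays    # permanent 4xx: do not retry
    return "failed", delays             # retry budget exhausted
```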

Preferences DB stores per-user opt-in/out state, quiet hours, preferred channel order, and per-category preferences. It is a relational store (PostgreSQL) with a read-through distributed cache (Redis) in front of it, since preferences are read on every single notification send but change rarely.

What is Redis?

Redis in one paragraph. Redis is an in-memory data structure store used as a cache, message broker, and ephemeral database. Because all data lives in RAM, reads and writes complete in under a millisecond, orders of magnitude faster than hitting a disk-backed database. Redis supports expiry (TTL) natively on any key, making it ideal for short-lived data like cache entries, rate limit counters, and deduplication keys. The trade-off is memory cost and volatility: Redis data can be lost on restart unless persistence is configured. In this design, Redis is used for three things: the preference cache (5-min TTL), the deduplication key store (24h TTL), and the per-caller rate limit token buckets, all cases where sub-millisecond lookup and natural expiry outweigh the need for durable storage.

Notification log is backed by a wide-column store (Apache Cassandra) split into two tables: notification_log is partitioned by user_id and clustered by sent_at, optimised for the inbox read ("give me all notifications for user X, newest first"). delivery_attempt is partitioned by notification_id and clustered by attempt_at, optimised for retry and support lookups ("what happened to notification Y?"). Splitting them lets each table's partition key match its dominant query, the core Cassandra data-modelling principle.

⚠️

Email deliverability requires DNS pre-configuration, not just an API key swap. DKIM signing keys and DMARC policy alignment must be consistently configured across all email providers before failover. Switching from SendGrid to AWS SES during an outage without pre-configured DNS records (DKIM CNAME entries, DMARC TXT record) will cause emails to land in spam or be rejected outright. Treat email provider failover as an infrastructure change, not an application-level switch.

🗄️ What is Apache Cassandra?

Apache Cassandra in one paragraph. Cassandra is a wide-column NoSQL database built for write-heavy, append-dominant workloads at scale. Data is distributed across nodes by a partition key; all rows sharing the same partition key live on the same node, making single-partition reads fast and predictable. Within a partition, rows are sorted by a clustering key, enabling efficient time-ordered range scans. Cassandra trades strong consistency for availability and partition tolerance: reads are eventually consistent by default, and there is no single point of failure. It also supports native TTL (time-to-live) per row, making retention-based expiry a configuration value rather than a background job. These properties (fast appends, user-partitioned inbox reads, and automatic TTL expiry) make it the standard choice for notification and activity-feed storage at scale.

Architectural rationale

Why Kafka for the priority queues? Queue choice

A durable log like Kafka gives replay capability: if a worker crashes mid-fan-out, the unconsumed messages remain on the topic and a replacement worker picks up exactly where the offset left off. This is the foundation of the at-least-once guarantee. It also allows independent consumer groups for analytics without adding load to the main delivery path.

Tradeoff: Kafka adds operational complexity. For smaller scale, a managed queue like AWS SQS with separate queues per priority tier achieves most of the same isolation with far less overhead. The choice between them is a scale and operational-maturity decision.
Alternatives: AWS SQS · RabbitMQ · Google Pub/Sub
Why separate fan-out workers from channel dispatchers? Separation of concerns

Fan-out is CPU-bound and scales with the number of recipients. Channel dispatch is I/O-bound and scales with provider throughput. Separating them means you can scale each independently: a celebrity fan-out event needs many fan-out workers briefly, but APNs imposes a fixed rate limit that no amount of additional dispatchers can exceed. If they were combined, you'd be forced to provision for the worst case on both dimensions simultaneously.

Tradeoff: Two separate worker pools mean two separate queues between them, adding another async hop. This adds latency (seconds, not milliseconds), which is acceptable for social and marketing but must be weighed against the 5-second transactional target.
Why Cassandra for the notification log? Storage choice

The notification log has three dominant access patterns: write a new delivery attempt (extremely frequent), read all notifications for a user sorted by time (the in-app inbox query), and expire old records after the retention period. Cassandra handles all three: writes are fast and append-only, reads by partition key (user_id) with time-ordered clustering are efficient, and native TTL handles expiry without a separate cleanup job.

Tradeoff: Cassandra provides eventual consistency. The in-app inbox may briefly show stale state after a write. This is acceptable for a notification feed. If strong read-your-writes is required (e.g. the user taps "mark all read" and immediately refreshes), route those reads through a read-repair path or accept a brief inconsistency window.
Alternatives: DynamoDB · ScyllaDB · PostgreSQL + partitioning (smaller scale)

Real-world comparison

Decision | This design | Meta | Airbnb
Fan-out model | Async workers, hybrid write/read | Fan-out on write (push model) | Fan-out on write + lazy reads for inbox
Queue system | Kafka (priority topics) | Scribe + internal queues | Apache Kafka
Notification log | Cassandra, TTL-based expiry | TAO (graph store) | MySQL + DynamoDB
Preference store | PostgreSQL + Redis cache | ZippyDB (key-value) | MySQL + Memcached
Push dispatch | APNs + FCM via dispatcher workers | In-house push infra | AWS SNS + internal wrapper
💡

Real production systems vary significantly based on scale and legacy. Meta's in-house push infrastructure exists because APNs and FCM rate limits become a bottleneck at their scale. For most companies, managed providers (SendGrid, Twilio, FCM) are the right call: they abstract away reliability, compliance (DKIM, DMARC), and global delivery infrastructure.

05

Fan-out strategy, the core algorithm

Fan-out is the defining algorithmic question for notification systems. When a user posts something, the system must decide: do we push that notification into each follower's inbox immediately (fan-out on write), or do we compute the notification list when each follower asks (fan-out on read)? Both have significant consequences across the entire design.

① Fan-out on write (precompute at event time): a new post event is written into each follower's inbox immediately. ✓ Fast reads (inbox pre-built) · ✓ Simple read path · ✗ Slow writes for celebrities · ✗ Storage amplification. Best for average users (<10K followers).

② Fan-out on read (compute at query time): store one post record; the follower graph is looked up on each read. ✓ No write amplification · ✓ Storage efficient · ✗ Slow reads (graph traversal). Best for celebrities (>1M followers).

Fan-out on write vs fan-out on read, both approaches have fatal flaws at the extremes. A hybrid addresses both.

The right answer for production systems is a hybrid approach: fan-out on write for users with follower counts below a threshold (e.g. 10,000), and fan-out on read for high-follower accounts. The threshold is a configuration value, not a constant. Twitter's "celebrity problem" made this pattern famous, and it remains the standard solution.

🔍

Open question: What is the right fan-out threshold? It depends on write throughput, worker capacity, and storage costs. If a worker can produce 10,000 notification records/second and the SLA allows 5 seconds, a user with 50,000 followers can be fanned out synchronously. Beyond that, switch to lazy read. Make the threshold configurable and instrument it.

Implementation sketch, hybrid fan-out worker
async def fan_out(event: NotificationEvent):
    sender = get_user(event.sender_id)

    if sender.follower_count <= FANOUT_WRITE_THRESHOLD:
        # Fan-out on write: push to each follower's inbox
        follower_ids = get_follower_ids(event.sender_id)
        await batch_write_inbox(follower_ids, event)
    else:
        # Fan-out on read: store one record, resolve at query time
        await write_single_event(event)
        # Followers will see this via graph traversal on inbox load

    await log_delivery_attempt(event.notification_id, "fan_out_complete")
05b

API design

POST /v1/notifications

Triggers a notification. Returns immediately after enqueueing; it does not wait for delivery.

Request
Response · 202 Accepted
Error responses
// 400  invalid template_id or type mismatch
// 409  idempotency_key already used within 24 h (duplicate blocked)
// 429  rate limit exceeded, Retry-After header included

GET /v1/users/{user_id}/notifications

Returns the in-app inbox for a user, newest first. Cursor-paginated.

Query params
// ?limit=20           max items per page (default 20, max 100)
// ?cursor=<token>     opaque cursor from previous response
// ?unread_only=true   filter to unread notifications only
Response · 200 OK

Additional endpoints

Endpoint | Purpose | Level
PATCH /v1/notifications/{id}/read | Mark a single notification read | L3/L4
POST /v1/notifications/read-all | Mark all as read for a user | L3/L4
GET /v1/notifications/{id}/status | Delivery status (queued/sent/failed) | L5/L6
PUT /v1/users/{id}/preferences | Update channel preferences + quiet hours | L5/L6
POST /v1/notifications/bulk | Bulk marketing campaigns with targeting | L6

Input validation that matters: validate that template_id exists and matches the notification type; check the idempotency key against a recent-send cache (Redis with 24 h TTL) before accepting; verify recipient_ids are valid and opted in; enforce a max-recipients cap per API call (e.g. 1,000; bulk sends go through the bulk endpoint).

Per-caller rate limiting: rate limit by caller API key using a token bucket in a distributed cache (Redis), keyed as ratelimit:api:{api_key}. Standard callers get 1,000 requests/minute; pre-approved bulk senders (e.g. a marketing service) get a higher limit negotiated at onboarding. Return 429 Too Many Requests with a Retry-After header when the bucket is exhausted. This prevents a single misconfigured caller from saturating the transactional queue and delaying OTP delivery for everyone else.
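A single-node sketch of that token bucket; in production the bucket state lives in Redis (keyed `ratelimit:api:{api_key}`) and is updated atomically so all API nodes share one bucket per key, but the refill arithmetic is identical. Names and defaults are illustrative.

```python
import time

class TokenBucket:
    """Per-API-key token bucket: refill at rate_per_minute, burst capacity."""

    def __init__(self, rate_per_minute, burst, now=None):
        self.rate = rate_per_minute / 60.0   # tokens added per second
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.updated = now if now is not None else time.monotonic()

    def allow(self, now=None):
        """Consume one token if available; False means respond 429 + Retry-After."""
        now = now if now is not None else time.monotonic()
        # refill based on elapsed time, capped at burst capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```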

06

Core flow, the write path

The dominant path in a notification system is the write path: an event arrives, fans out to recipients, checks their preferences, and dispatches per channel. The key architectural decision is where in the fan-out step to check preferences: in the fan-out worker before enqueuing per-channel tasks (saves provider API calls for opted-out users, but requires preference data in the worker), or in the dispatcher after dequeuing (cleaner separation, but wastes queue capacity on sends that will be suppressed).

1. Caller triggers event (auth service · social service · orders)
2. API validates & enqueues (idempotency check · priority routing)
3. Priority queue topic (transactional · social · marketing)
4. Fan-out worker: resolve recipients · check preferences. Opted out or in quiet hours? Skip and log as suppressed; otherwise continue.
5. Channel dispatch (APNs · FCM · SendGrid · Twilio); on provider failure, retry with backoff
6. Log delivery outcome (sent · failed · bounced · suppressed)

Notification write path, preference check before dispatch prevents unnecessary provider API calls

The preference check happens before channel dispatch, not after. This is the key decision point, driven directly by the NFR to respect quiet hours and opt-outs. Checking preferences in the fan-out worker (step 4) rather than the dispatcher (step 5) means we avoid making any provider API calls for opted-out users, which saves money and avoids consuming rate limit budget. The cost is that the fan-out worker must read preference data, which is why the preference cache (Redis, §8) is critical.

⚠️

Quiet hours edge case: When a notification is suppressed due to quiet hours, it shouldn't be silently dropped; it should be re-queued with a delayed send time aligned to the end of the user's quiet-hours window. Otherwise, time-sensitive-but-not-transactional notifications (e.g. a friend request at 11pm) are permanently lost. A scheduled notification queue (a priority queue with delayed visibility) handles this cleanly.
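A sketch of the re-queue time computation, assuming the timestamp has already been converted to the user's local timezone (see §8 on timezone handling); it handles quiet windows that cross midnight, and the function name is hypothetical.

```python
from datetime import datetime, time as dtime, timedelta

def next_allowed_send(now, quiet_start, quiet_end):
    """Return `now` if outside the quiet window, else the end of the window.

    `now` is a datetime in the user's local timezone; quiet_start/quiet_end
    are datetime.time values. Windows like 22:00-07:00 cross midnight.
    """
    t = now.time()
    if quiet_start <= quiet_end:                      # same-day window
        in_quiet = quiet_start <= t < quiet_end
        end_day = now.date()
    else:                                             # window crosses midnight
        in_quiet = t >= quiet_start or t < quiet_end
        end_day = now.date() + timedelta(days=1) if t >= quiet_start else now.date()
    if not in_quiet:
        return now
    return datetime.combine(end_day, quiet_end, tzinfo=now.tzinfo)
```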

07

Data model

The notification system has three distinct entities with very different access patterns and storage characteristics. Understanding how each entity is queried determines every schema decision.

Access patterns

Operation | Frequency | Query shape
Write delivery attempt | Extremely high (every send) | Insert row keyed on notification_id + user_id
Read user inbox | High (every app open) | Filter by user_id, order by time DESC, paginate
Read user preferences | Very high (every send) | Lookup by user_id, always full record
Update preference | Low (user-initiated) | Update by user_id
Delivery status query | Low (support / analytics) | Lookup by notification_id
Expire old notifications | Low (background job) | TTL-based, time-ordered

Two things stand out: preference reads happen on every single send and must be fast (driving the Redis cache), and the delivery log is almost pure-append with user-partitioned reads (driving the Cassandra choice).

notification_log (Cassandra): user_id (PK, uuid) · sent_at (CK, timestamp) · notification_id (uuid) · channel (text) · status (text) · title / body (text) · is_read (bool). TTL: 90 days (configurable).

user_preferences (PostgreSQL): user_id (PK, uuid) · push_enabled (bool) · email_enabled (bool) · sms_enabled (bool) · quiet_start / quiet_end (time) · category_overrides (jsonb) · timezone (text).

delivery_attempt (Cassandra): notification_id (PK, uuid) · attempt_at (CK, timestamp) · channel (text) · outcome (text) · provider_resp_code (int) · error_message (text).

PK = Partition Key · CK = Clustering Key (sort order)

Data model, three entities across two storage engines, chosen by their dominant access patterns

device_tokens, PostgreSQL

Device tokens are the source-of-truth for push delivery. They are relational (one user can have many devices, each with one token per platform), low-volume compared to the notification log, and must survive Redis eviction. They live in PostgreSQL alongside preferences, with the Redis cache (§8) providing the 1-hour TTL read layer.

Field | Type | Notes
device_id (PK) | uuid | Stable device identifier, generated on first app launch
user_id (FK) | uuid | Indexed, supports "give me all devices for user X"
platform | text | ios | android | web
token | text | APNs device token or FCM registration token; rotate on provider feedback
registered_at | timestamp | When the token was first registered
last_seen_at | timestamp | Updated on every app open; used to identify stale tokens for cleanup
is_active | bool | Set to false on APNs Feedback / FCM canonical ID response; stops sends immediately without deleting the row
💡

Token rotation: APNs returns APNS_UNREGISTERED when a device token is stale (app uninstalled or re-installed). FCM returns a canonical registration ID when a token has been refreshed. Both signals must be processed via a feedback consumer that marks the old token as is_active = false and invalidates the Redis cache entry — preventing continued sends to dead tokens that consume rate limit budget.
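A sketch of such a feedback consumer, with plain dicts standing in for the device_tokens table and the Redis token cache; the event shapes (`kind`, `device_id`, `new_token`) are illustrative, not the providers' exact payloads.

```python
def handle_provider_feedback(event, tokens, cache):
    """Apply one APNs/FCM feedback event.

    tokens: device_id -> row dict (stand-in for the device_tokens table)
    cache:  user_id -> cached token list (stand-in for the Redis cache)
    """
    row = tokens[event["device_id"]]
    if event["kind"] == "unregistered":        # APNs: token is dead (app uninstalled)
        row["is_active"] = False               # soft-disable, keep the row
        cache.pop(row["user_id"], None)        # invalidate stale cache entry
    elif event["kind"] == "canonical":         # FCM: token was refreshed
        row["token"] = event["new_token"]
        cache.pop(row["user_id"], None)
```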

Why category_overrides as JSONB?

Notification categories are product-defined and change over time. A user might disable push for "marketing" but keep it on for "social" and all transactional types. Storing this as a JSONB column avoids schema migrations every time a new category is introduced. The trade-off is that querying across users by category preference is harder, but that query pattern doesn't exist in steady state (preferences are always fetched per-user).

Why store user_id as PK in notification_log (not notification_id)?

In Cassandra, the partition key determines which node stores the data. Partitioning by user_id means all notifications for a user are co-located on the same node, making the inbox query (give me all notifications for user X, sorted by time) a single-partition scan, fast and predictable. If we partitioned by notification_id, an inbox query would scatter across the entire cluster.

Hot partition risk: A very active user accumulates many rows in one partition. Use a time-bucketed partition key (user_id + month) to cap partition size. Inbox queries spanning multiple months then merge results from several partition scans, a manageable trade-off.
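A sketch of the month-bucketed key, plus the helper an inbox query would use to know which partitions to merge; function names are hypothetical.

```python
from datetime import datetime

def inbox_partition_key(user_id, sent_at):
    """Bucket the notification_log partition by month so one very active
    user cannot grow a single unbounded partition."""
    return f"{user_id}:{sent_at:%Y-%m}"

def partitions_for_range(user_id, start, end):
    """All partition keys an inbox query must scan for [start, end]."""
    keys, y, m = [], start.year, start.month
    while (y, m) <= (end.year, end.month):
        keys.append(f"{user_id}:{y:04d}-{m:02d}")
        m += 1
        if m == 13:
            y, m = y + 1, 1
    return keys
```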
08

Caching strategy

The notification service's caching needs fall into two tiers. The essential tier covers every single send regardless of channel: both caches are hit on every notification that goes through the fan-out worker. The extended tier covers push-specific details that only become relevant once you go deeper on channel dispatch.

Essential, hit on every send

The fan-out worker checks both caches:

Dedup cache (Redis · 24 h TTL · idempotency_key → sent_at):
HIT → drop (no-op): already sent; log as suppressed, do not dispatch to provider.
MISS → proceed to send: write the key to the dedup cache first, then dispatch to the channel.

Preference cache (Redis · 5 min TTL · user_id → prefs object):
MISS → read PostgreSQL (source of truth) and populate the cache on read.

On preference update: write to PostgreSQL → invalidate the Redis key (write-invalidate, not write-through).

Cache flows, dedup HIT drops the send; MISS proceeds to dispatch. Preference MISS falls through to PostgreSQL.

Cache layer | Store | TTL | Strategy | Why
Deduplication keys | Redis | 24 hours | Write on first send; check before every send | Prevents duplicate sends on worker retry; idempotency_key maps to notification_id
User preferences | Redis | 5 minutes | Read-through; invalidate on update | Preferences are read on every send and change very rarely; cache hit rate >99%

Extended, push-specific L5+

Cache layer | Store | TTL | Strategy | Why
User device tokens | Redis | 1 hour | Read-through; invalidate on APNs/FCM feedback | Device tokens don't change often but are read on every push send; caching avoids a DB lookup per dispatch. Must be invalidated immediately when a provider returns a stale-token signal.

Cache invalidation

When a user updates their preferences via the API, the system writes to PostgreSQL and immediately invalidates the Redis key (write-invalidate, not write-through). Write-invalidate is preferred here because the next read will always populate the cache with fresh data, and double-writes (to DB and cache) risk inconsistency if one fails. The 5-minute TTL provides a backstop for any invalidation misses.
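The read-through + write-invalidate flow in miniature, with plain dicts standing in for PostgreSQL and Redis (the real cache entry would also carry the 5-minute TTL); the class name is hypothetical.

```python
class PreferenceStore:
    """Read-through cache with write-invalidate, as described above."""

    def __init__(self, db, cache):
        self.db = db          # stand-in for PostgreSQL (source of truth)
        self.cache = cache    # stand-in for Redis

    def get(self, user_id):
        prefs = self.cache.get(user_id)
        if prefs is None:                    # miss -> fall through to the DB
            prefs = self.db[user_id]
            self.cache[user_id] = prefs      # populate on read
        return prefs

    def update(self, user_id, prefs):
        self.db[user_id] = prefs             # 1. write to the source of truth
        self.cache.pop(user_id, None)        # 2. invalidate, don't write-through
```

The next `get` after an `update` repopulates the cache with fresh data, which is why the double-write inconsistency risk never arises.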

⚠️

Quiet hours and timezones: Quiet hours must be evaluated in the user's local timezone, not UTC. Cache the computed quiet-hours window per user per day (not just the raw start/end times) to avoid repeated timezone arithmetic on every notification send. Invalidate at midnight local time.

09

Deep-dive: scalability

At L5 and above, the interviewer expects you to go beyond the happy path and address the stress cases: what happens during a marketing blast, a celebrity event, or a third-party provider outage? The subsections below address the most common follow-up questions at each level; check the tag on each to see where it's expected.

[Architecture] Load Balancer (ingress) → API Cluster (×N, auto-scaled: rate-limit · validate) → Kafka (3 independent topics: ⚡ transactional · 💬 social · 📣 marketing) → Fan-out Workers (consumer groups per topic, auto-scaled) → Dispatchers (Push: APNs/FCM · Email: SendGrid · SMS: Twilio, each behind a circuit breaker). Supporting stores: Redis Cluster (dedup keys · prefs cache · device tokens), Cassandra Cluster (notification_log · delivery_attempt, TTL 90d), Analytics Store (ClickHouse / Druid, delivery funnel metrics, fed by a separate Kafka consumer group).

Production notification pipeline, independent scaling, circuit breakers per provider, separate analytics consumer group on Kafka

Handling celebrity / high-follower fan-out without write amplification L5

A celebrity with 10M followers triggering a fan-out on write would produce 10M queue messages in a single event. This can saturate workers and delay the entire topic. The hybrid fan-out approach (§5) avoids this by storing a single event record for high-follower users and resolving the recipient list at read time. The threshold (e.g. 10,000 followers) is a system configuration value, tuned based on observed write throughput and worker headroom.

Tradeoff: Fan-out on read for celebrities means in-app inbox loads are slower (graph traversal per page load). This is typically hidden behind a short client-side TTL cache.
Third-party provider rate limiting and circuit breakers L5

APNs, FCM, SendGrid, and Twilio all impose rate limits and have independent reliability SLAs. When a provider returns 429 (rate limited) or 503 (unavailable), the dispatcher should not immediately retry; that amplifies load on an already-struggling provider. The right pattern is a circuit breaker per provider: after N consecutive failures, open the circuit (stop sending), wait a fixed backoff interval, then probe with a small volume before fully reopening.

What is the circuit breaker pattern?

Circuit breaker pattern. Named after an electrical circuit breaker, this pattern has three states: Closed (normal, requests flow through), Open (tripped, requests are immediately rejected without hitting the downstream service), and Half-open (probing, a small number of test requests are allowed through to check if the service has recovered). When failures exceed a threshold, the breaker trips to Open, preventing a degraded provider from consuming all retry budget and blocking healthy sends. Libraries like Resilience4j (JVM) and pybreaker (Python) implement this out of the box.
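A minimal sketch of the three states; the thresholds and the single-probe policy are illustrative, and a production system would reach for a library like pybreaker rather than hand-rolling this.

```python
class CircuitBreaker:
    """Per-provider breaker: Closed -> Open after `threshold` consecutive
    failures; Half-open after `cooldown` seconds; one probe success closes it."""

    def __init__(self, threshold=5, cooldown=30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None        # None means Closed

    def allow_request(self, now):
        if self.opened_at is None:
            return True                               # Closed: requests flow
        return now - self.opened_at >= self.cooldown  # Half-open: allow a probe

    def record(self, success, now):
        if success:
            self.failures = 0
            self.opened_at = None                     # probe succeeded -> Closed
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = now                  # trip (or re-trip) to Open
```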

Provider failover: For push specifically, consider registering dual providers (primary + secondary). If APNs is degraded, fall back to a direct data push mechanism or queue the messages for delivery once APNs recovers. Email is more forgiving: SendGrid and AWS SES can run in parallel with identical DKIM/DMARC config and switch via DNS or weighted routing.
Marketing blast isolation, preventing priority inversion L5/L6

A bulk marketing campaign of 100M emails cannot share queue bandwidth with OTP sends. The three-queue topology (§4) solves this structurally: marketing messages are on a separate topic in the durable message queue (Kafka) with a separate consumer group. Marketing workers are allocated a fixed fraction of total throughput capacity and are not permitted to "borrow" transactional capacity. This must be enforced at the Kafka consumer configuration level (max.poll.records, concurrency settings) — not just as a convention.

Scheduled sends: Marketing campaigns often specify a send_time window. A scheduler service reads the campaign config and emits individual user-level notifications to the marketing topic at the appropriate rate, rather than queuing all 100M records at once. This smooths out the burst.
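The scheduler's rate smoothing can be sketched as a batch generator. Function name, batch size, and rate are illustrative; a real scheduler would persist its cursor so a crash doesn't restart the campaign.

```python
import itertools

def schedule_batches(recipient_ids, per_second, batch_size=1000):
    """Yield (delay_seconds, batch) pairs that drip a campaign onto the
    marketing topic at `per_second` instead of enqueuing it all at once."""
    seconds_per_batch = batch_size / per_second
    it = iter(recipient_ids)
    for i in itertools.count():
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return  # campaign fully scheduled
        yield i * seconds_per_batch, batch
```

A 100M-recipient campaign at 50K sends/second becomes a ~33-minute stream of batches instead of a single 100M-message burst on the topic.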
Geo-distribution and data residency (L7/L8) L7/L8

Notification systems in regulated markets (GDPR in EU, data localisation in India/Russia) must ensure that user notification data never transits through or is stored in non-compliant regions. This requires a regional deployment topology where EU user notifications are processed entirely within EU infrastructure, including the Kafka topics, fan-out workers, and Cassandra cluster. Cross-region forwarding is only allowed for aggregated, non-PII analytics.

Latency benefit: As a side effect, regional processing reduces end-to-end delivery latency by avoiding cross-continental hops in the fan-out path. A user in Tokyo processed by an APAC cluster reaches APNs servers faster than one routed through a US cluster.
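Residency-aware routing reduces to a strict lookup at the edge. A minimal sketch, assuming hypothetical cluster endpoints; the important design choice is failing loudly rather than silently falling back to a non-compliant region.

```python
# Hypothetical region-to-cluster table; in production the API gateway
# resolves this from the user's residency record.
REGIONAL_CLUSTERS = {
    "EU":   {"kafka": "kafka.eu.internal", "cassandra": "cass.eu.internal"},
    "US":   {"kafka": "kafka.us.internal", "cassandra": "cass.us.internal"},
    "APAC": {"kafka": "kafka.ap.internal", "cassandra": "cass.ap.internal"},
}

def route(user_region):
    """All PII-bearing processing stays in the user's home region;
    an unknown region is an error, never a cross-region fallback."""
    try:
        return REGIONAL_CLUSTERS[user_region]
    except KeyError:
        raise ValueError(f"no compliant cluster for region {user_region!r}")
```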
Observability, delivery funnels and alert thresholds L6+

A notification service is invisible when it works and very visible when it fails. Good observability means tracking the delivery funnel: events accepted → queued → fanned-out → preference check (suppressed vs passed) → dispatched → provider acknowledged → delivered (device confirmed). Any stage where the count drops unexpectedly signals a bug. An alert on transactional queue depth (growth beyond X seconds of work) triggers an immediate PagerDuty page; that queue is the OTP pipeline.

Key metrics: Queue consumer lag per topic tier, provider error rate per channel (429s, 503s), suppression rate (opt-outs and quiet hours), notification open rate (requires device-side callback), and delivery p50/p95/p99 per notification type.
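Detecting an unexpected drop in the funnel is a simple stage-to-stage comparison. A sketch, assuming illustrative stage names and expected-retention numbers; a production system would compute these over a sliding window per notification type.

```python
def funnel_drop_alerts(stage_counts, expected_retention, tolerance=0.05):
    """Compare stage-to-stage conversion against expected retention and
    return the stages that fell more than `tolerance` below it.
    `stage_counts` is an ordered list of (stage_name, count)."""
    alerts = []
    for (_, prev), (name, count) in zip(stage_counts, stage_counts[1:]):
        if prev == 0:
            continue  # nothing flowed into this stage; nothing to compare
        conversion = count / prev
        if conversion < expected_retention.get(name, 1.0) - tolerance:
            alerts.append((name, round(conversion, 3)))
    return alerts
```

Note that some drops are expected and healthy (the preference-check stage suppresses opted-out users), which is why each stage carries its own expected retention rather than a single global threshold.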
10

Failure modes & edge cases

Scenario Problem Solution Level
Worker crashes mid-fan-out Some recipients get the notification, others don't. Reprocessing duplicates those already sent. Idempotency key per (notification_id, recipient_id, channel). Dedup cache (Redis, 24h TTL) checked before every dispatch. At-least-once delivery is acceptable; duplicates are minimised in steady state via dedup. L3/L4
Third-party provider outage (APNs down) Push notifications fail silently or generate loud retry storms. Circuit breaker per provider. On open circuit, messages re-queue with backoff. Alert on circuit open state. Fall back to in-app notification delivery for time-insensitive types. L5
Marketing blast causes transactional delay 100M marketing emails saturate workers, delaying OTP delivery by minutes. Strict priority queue separation. Marketing topic has a hard cap on worker concurrency and cannot borrow transactional capacity. Autoscale transactional workers independently with aggressive scaling triggers. L5
User device token expired / invalid APNs returns APNS_UNREGISTERED. Continuing to send wastes rate limit budget. Process provider feedback channels (APNs Feedback Service, FCM canonical ID responses) to detect and remove stale device tokens. Update device token table, stop sending to that token immediately. L5
Quiet hours race condition Notification passes quiet-hours check, but user enters quiet hours before dispatch arrives at provider. Accept the small race window; the notification was valid at check time. The alternative (re-checking at dispatch time) adds DB reads on every send. For marketing sends only, implement send-window enforcement at the scheduler level. L5/L6
Preference cache stale after user opt-out User opts out, but cached preferences show opted-in. Notifications sent for up to 5 minutes after opt-out. Write-invalidate on preference update ensures next read is fresh. 5-minute TTL is the backstop. For GDPR/right-to-be-forgotten, implement immediate invalidation + hard-stop flag checked outside cache. L6
Fan-out creates hot Cassandra partitions A very active user accumulates millions of rows in one partition, causing slow scans and GC pressure. Composite partition key: (user_id, month_bucket). Inbox queries spanning multiple months merge two partition reads. TTL and compaction handle old data automatically. L6
Cross-region data residency violation EU user notification data processed or logged in US region, violating GDPR data localisation requirements. Regional routing at the API gateway: EU user IDs resolve to EU Kafka cluster, EU fan-out workers, EU Cassandra cluster. Cross-region traffic only for anonymised analytics aggregates. L7/L8
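The dedup check from the first row of the table above can be sketched with an in-memory stand-in for Redis. The production analogue is a single atomic `SET key NX EX 86400`; the injectable clock is an illustrative convenience for testing.

```python
import time

class DedupCache:
    """In-memory stand-in for the Redis dedup cache. The idempotency
    key is (notification_id, recipient_id, channel)."""

    def __init__(self, ttl_seconds=86_400, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._seen = {}  # key -> expiry timestamp

    def claim(self, notification_id, recipient_id, channel):
        """Return True exactly once per key within the TTL window."""
        key = (notification_id, recipient_id, channel)
        now = self.clock()
        expiry = self._seen.get(key)
        if expiry is not None and expiry > now:
            return False  # duplicate: this send was already dispatched
        self._seen[key] = now + self.ttl
        return True
```

A worker recovering from a mid-fan-out crash calls `claim()` before every dispatch; recipients already sent return False and are skipped, which is what makes at-least-once delivery tolerable.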
11

How to answer by level

L3 / L4, SDE I / SDE II L3/L4
What good looks like
  • Identifies push, email, SMS as separate channels
  • Puts a queue between the API and delivery to decouple
  • Mentions retry on provider failure
  • Knows what APNs and FCM are and that they require device tokens
  • Designs a basic user preferences table
  • Designs a basic in-app inbox (read/unread state, paginated by time, backed by the notification log)
What separates L5 from L3/L4
  • Priority queues: L3 uses one queue for everything
  • Fan-out strategy: L3 doesn't address write amplification
  • Quiet hours: L3 may miss the timezone complexity
  • At-least-once vs exactly-once: L3 doesn't reason about delivery semantics
L5, Senior SDE L5
What good looks like
  • Three-tier priority queue topology with justification
  • Hybrid fan-out: write for regular users, read for celebrities
  • Idempotency keys + dedup cache for at-least-once
  • Preference cache + write-invalidate strategy
  • Circuit breakers per provider
  • Cassandra for notification log with partition key rationale
What separates L6 from L5
  • Delivery funnel observability and alerting design
  • Cross-region / data residency awareness
  • Cost modelling across channels (SMS is expensive)
  • Make vs buy analysis for notification infrastructure
L6, Staff SDE L6
What good looks like
  • Full delivery funnel design with stage-by-stage metrics
  • Alert thresholds tied to SLA tiers per notification type
  • Delivery log schema with hot partition mitigation
  • Provider feedback loop (APNs Feedback, FCM canonical IDs)
  • Scheduled send architecture for marketing campaigns
  • GDPR compliance design: opt-out hard stops, data retention
What separates L7/L8 from L6
  • Global multi-region topology with data residency routing
  • Capacity planning and cost architecture at 10B notifications/day
  • Build vs buy decision with vendor risk analysis
  • Governance: notification fatigue, business impact of over-sending
L7 / L8, Principal / Distinguished L7/L8
What good looks like
  • Global topology: regional clusters, cross-region analytics aggregation only
  • Cost architecture: SMS cost per user, push free via provider, email at scale
  • Make vs buy: when does custom push infra beat APNs/FCM at scale?
  • Notification fatigue governance: ML-based frequency capping
  • Regulatory roadmap: GDPR, CCPA, India PDPB compliance strategy
  • Provider SLA negotiation and multi-vendor redundancy contracts
Distinguishing signals
  • Frames the design as an organisational capability, not just a service
  • Reasons about the business model impact of notification quality
  • Proactively identifies second-order effects (notification spam → churn)
  • Connects infrastructure decisions to product metrics (DAU, retention)

Classic probes

Probe question L3/L4 L5/L6 L7/L8
"A celebrity posts and has 50M followers. What happens?" Fan-out to all 50M in one batch, doesn't identify the problem L5: Hybrid fan-out: threshold check, fan-out on read for high-follower accounts. L6 adds: instruments the threshold in production, owns the runbook for when fan-out workers saturate, and designs the alerting that detects a celebrity burst before it delays transactional sends. Dynamic threshold tuning based on real-time worker headroom; pre-computed fan-out lists for known celebrities; capacity model for worst-case simultaneous celebrity events
"How do you make sure an OTP isn't delayed by a marketing blast?" Maybe mentions separate queue, may not know why L5: Three-tier priority queues with independent workers and autoscaling policies per tier. L6 adds: defines the SLO for transactional queue depth, owns the alerting threshold, and enforces the isolation at the Kafka consumer config level, not just as a convention. SLA-based capacity reservation, burst capacity contracts with cloud provider, latency SLO enforcement via automated load shedding on marketing tier
"A user opts out, but gets a notification 2 minutes later. Why?" Doesn't know, guesses "bug" L5: Preference cache TTL window (5 min); write-invalidate reduces it. L6 adds: distinguishes GDPR "forget me" hard-stop from marketing opt-out, defines the hard-stop flag checked outside the cache, and owns the compliance audit trail for opt-out processing. Quantifies the window, proposes sub-second invalidation via pub/sub eviction (Redis keyspace events to worker process), and designs the data retention deletion pipeline for GDPR right-to-erasure
"APNs is down for 20 minutes. What does your system do?" Retries until it works, no throttling or awareness of downstream impact L5: Circuit breaker opens, messages requeue with exponential backoff, alert fires. L6 adds: designs the observability that distinguishes "APNs degraded" from "our dispatcher misconfigured", owns the incident runbook, and ensures in-app inbox delivery continues unaffected. Pre-negotiated SLA with backup provider, automatic DNS/routing failover, customer-facing status page update, post-incident review of notification debt (messages delivered late vs lost)

How the pieces connect

Every architectural decision traces back to a requirement or observation.

1
OTP delivery latency NFR of <5 seconds (§2) → transactional notifications cannot share queue capacity with marketing → three separate priority queue topics with independent autoscaling worker pools (§4)
2
At-least-once delivery NFR (§2) → workers must retry failed sends → retrying without deduplication creates duplicates → idempotency key stored in Redis dedup cache (§8) checked before every dispatch
3
Celebrity fan-out estimation: 1 event × 10M followers = 10M queue messages (§3) → synchronous fan-out would hold the API response for minutes → hybrid fan-out: write for regular users, read for high-follower accounts (§5)
4
Preference check on every notification send → at 10K notifications/second, PostgreSQL can't absorb the read load → Redis read-through cache for preferences with write-invalidate strategy (§8)
5
Notification log access pattern: write-heavy, read by user sorted by time, auto-expire after retention period (§7) → Cassandra with user_id partition key + sent_at clustering key + native TTL (§4, §7)
6
Third-party providers (APNs, FCM, Twilio) have independent SLAs and impose rate limits → a single provider failure must not halt the entire pipeline → separate dispatchers per channel with circuit breakers and per-provider retry queues (§9)
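The preference cache in connection 4 can be sketched as a read-through cache with write-invalidate. This is a minimal in-memory illustration; in production the cache is Redis with a 5-minute TTL as the backstop, and `db` is PostgreSQL.

```python
class PreferenceCache:
    """Read-through cache with write-invalidate for user preferences."""

    def __init__(self, db):
        self.db = db          # mapping-like store standing in for PostgreSQL
        self.cache = {}
        self.db_reads = 0     # instrumented to show the read-load savings

    def get(self, user_id):
        if user_id not in self.cache:
            self.db_reads += 1
            self.cache[user_id] = self.db[user_id]  # read-through on miss
        return self.cache[user_id]

    def update(self, user_id, prefs):
        self.db[user_id] = prefs
        self.cache.pop(user_id, None)  # write-invalidate: next read is fresh
```

Invalidating on write (rather than updating the cache in place) avoids the race where a concurrent stale write overwrites a fresh one; the next `get()` simply re-reads the source of truth.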