System Design Interview Guide

Design a Notification Service

Simple to trigger, hard to deliver. Fan-out at scale, priority queues, delivery guarantees, and multi-channel routing — the real complexity of push, email, and SMS.

L3/L4 — Basic fan-out + channels
L5/L6 — Priority queues + guarantees
L7/L8 — Global delivery + observability
01

What the interviewer is testing

The notification service prompt is deceptively broad. "Send notifications" sounds like a simple pub-sub problem. What makes it hard is the combination of heterogeneous delivery channels (push, email, SMS, in-app), wildly different latency tolerances across notification types, and the need to avoid both silent failures and duplicate deliveries at the same time.

The question is a proxy for four distinct skills: understanding how fan-out complexity grows with follower counts; reasoning about delivery guarantees and idempotency; designing for channel-specific rate limits imposed by third-party providers; and prioritising notification types so a marketing blast doesn't delay a password reset.

Level | What good looks like
L3/L4 | Identifies push, email, SMS channels. Routes through a queue. Understands retry on failure. Knows what APNs and FCM are.
L5 | Designs priority queues per notification type. Justifies fan-out strategy with trade-offs. Reasons about at-least-once vs exactly-once. Handles user preferences and quiet hours.
L6 | Owns the end-to-end schema, delivery log, deduplication layer, and third-party rate limiting. Designs for observability, delivery funnels, drop-off tracking, alert thresholds.
L7/L8 | Addresses cross-region latency, provider failover, data residency constraints, cost modeling across channels, and the make-vs-buy decision for the notification infrastructure.
02

Requirements clarification

Notification systems span a huge design space. The first thing to do is nail down which of the four channel types are in scope, what notification categories exist, and whether "delivery" means at-most-once, at-least-once, or exactly-once. The answers change every major component.

Functional requirements

Requirement | In scope
Push notifications (iOS via APNs, Android via FCM) | ✓
Email notifications (transactional + marketing) | ✓
SMS notifications (OTP, alerts) | ✓
In-app notification inbox (read/unread state) | ✓
User preferences (opt-in/out per channel, quiet hours) | ✓
Priority tiers: transactional, social, marketing | ✓
Scheduled / delayed notifications | ✓ (L5+)
Rich push (images, actions, deep links) | Partial

Before the NFRs, it's worth understanding how push delivery actually works; it shapes several of the constraints below.


APNs & FCM: the push gatekeepers. Apple Push Notification service (APNs) and Firebase Cloud Messaging (FCM) are Apple's and Google's official push delivery networks. Your server never opens a direct socket to a user's phone; instead it sends the notification payload to APNs or FCM, which maintain a persistent encrypted connection to every registered device. Both services impose per-app rate limits, require a registered device token per installation, and provide feedback channels to detect uninstalled apps. You cannot bypass them for iOS; Android allows alternatives, but FCM is the standard.

Non-functional requirements

NFR | Target | Why
Transactional delivery latency (p99) | < 5 seconds end-to-end | OTP, password reset; delay kills usability
Social/engagement delivery latency (p99) | < 30 seconds | Likes, comments; freshness matters but isn't critical
Marketing delivery latency | Minutes to hours acceptable | Bulk campaigns are best-effort; don't compete with transactional
At-least-once delivery guarantee | Required for all channels | Silent failure (missing OTP) is worse than a duplicate
Duplicate notification rate | Minimised in steady state (< 0.1%); brief windows after crashes acceptable | Duplicates are annoying but survivable; structured retry + idempotency keys keep this low in practice
Availability | 99.9%+ | Notification pipeline is non-critical-path but high visibility
Scalability | 10M notifications/day at baseline, 100M peak (marketing blast) | Events (Black Friday, elections) cause sudden 10× spikes
Why at-least-once, not exactly-once?

Exactly-once delivery across distributed systems and third-party providers (APNs, Twilio) is effectively impossible to guarantee end-to-end: you can get close, but you cannot control what happens inside the provider's delivery layer. The practical answer is at-least-once delivery at the infrastructure level, with idempotency keys in the payload so the receiving app or device can deduplicate. This shifts a hard distributed-systems problem into an application-level one.

Tradeoff: Idempotency keys require a deduplication store (a short-lived distributed cache with TTL). This adds a lookup on every delivery attempt. The alternative, accepting duplicates silently, is simpler but creates poor user experience at any non-trivial failure rate.
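A minimal sketch of that dedup lookup, using an in-memory dict as a stand-in for the Redis store; class and method names are hypothetical, and in production the check collapses to a single atomic Redis call.

```python
import time

DEDUP_TTL_SECONDS = 24 * 3600  # 24 h dedup window

class DedupCache:
    """In-memory stand-in for the Redis dedup store."""

    def __init__(self):
        self._seen = {}  # idempotency_key -> expiry timestamp

    def first_send(self, idempotency_key, now=None):
        """True if this key has not been seen inside the TTL window.

        Mirrors Redis `SET key 1 NX EX ttl`: the first caller claims the
        key; a retried worker gets False and suppresses the duplicate.
        """
        now = now if now is not None else time.time()
        expiry = self._seen.get(idempotency_key)
        if expiry is not None and expiry > now:
            return False  # duplicate within the window: drop, log as suppressed
        self._seen[idempotency_key] = now + DEDUP_TTL_SECONDS
        return True
```

With a real Redis client the whole check is one atomic call, e.g. `r.set(key, 1, nx=True, ex=ttl)`, which returns None when the key already exists.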
Why separate latency SLAs for notification types?

Without separate priority queues, a marketing blast of 50M emails queued on Monday morning will delay the OTP a user needs to log in. The latency targets are not just SLOs; they justify the entire queue topology. If all notifications had the same tolerance, a single queue would be fine. The tight 5-second window for transactional also drives the auto-scaling policy: transactional workers must scale in seconds, not minutes.

What this drives: Three separate queue topics (transactional, social, marketing) with independent worker pools, independent scaling policies, and independent alerting thresholds.
03

Capacity estimation

Notification systems have an unusual write profile: write volume (notification sends) can be many orders of magnitude larger than the number of triggering events, because a single social event (a post from a celebrity with 10M followers) fans out to millions of push notifications. Storage is dominated by the delivery log, not the notification metadata itself.

Worked example

Assumed inputs: 1M events/day × 10 avg recipients, 500 B per log record, 90-day (≈3-month) retention, 10M registered users, ×10 peak multiplier.

Notifications/day = events/day × avg recipients = 10M
Write QPS (avg) = notifications/day ÷ 86,400 s ≈ 116
Write QPS (peak) = avg QPS × peak multiplier ≈ 1,160
Delivery log rows = notifications/day × retention days = 900M
Log storage = rows × record size ≈ 450 GB
Cache (preferences) = registered users × 1 KB/user ≈ 10 GB

The key insight: the fan-out multiplier is the dominant lever. Average QPS scales linearly with both events/day and recipients/event, but a marketing blast or celebrity post pushes it to the peak multiple. Choose a peak multiplier that matches your workload: social products spike on viral events, B2B products on business-hours campaigns. This is why fan-out must be async: the queue absorbs the burst, and workers drain it at their own rate, bounded by provider rate limits.
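These back-of-envelope formulas are easy to sanity-check in a few lines of Python; the input values below are illustrative, matching the 10M/day baseline from the NFRs.

```python
def capacity(events_per_day, avg_recipients, peak_multiplier,
             record_bytes, retention_days):
    """Back-of-envelope numbers for the estimator above (inputs assumed)."""
    notifs_per_day = events_per_day * avg_recipients
    avg_qps = notifs_per_day / 86_400  # seconds per day
    return {
        "notifs_per_day": notifs_per_day,
        "avg_qps": avg_qps,
        "peak_qps": avg_qps * peak_multiplier,
        "log_rows": notifs_per_day * retention_days,
        "log_bytes": notifs_per_day * retention_days * record_bytes,
    }

# 1M events/day × 10 recipients -> 10M notifications/day,
# avg ≈ 116 QPS, peak ≈ 1,160 QPS, log ≈ 900M rows ≈ 450 GB over 90 days
est = capacity(1_000_000, 10, 10, 500, 90)
```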

04

High-level architecture

[Architecture] Caller services (Auth · Social · Orders) → Notification API (validate · enrich · route · rate-limit) → Kafka priority topics (⚡ Transactional: OTP · password reset | 💬 Social: likes · comments · follows | 📣 Marketing: campaigns · promotions) → Fan-out workers (resolve recipients · check preferences & quiet hours · route per channel; backed by Prefs DB: PostgreSQL + Redis cache) → per-channel dispatchers (Push: APNs · FCM | Email: SendGrid · SES | SMS: Twilio · Vonage, each behind a circuit breaker) → Notification Log (Cassandra: notification_log · delivery_attempt · TTL 90 days) ← In-App Inbox API reads from the log.

Notification service, three-tier async fan-out, per-channel dispatchers, and append-only delivery log

Component breakdown

Notification API is the single ingress point for all notification events. It validates the payload (notification type, target users, template ID), enforces per-caller rate limits, enriches the event with metadata (sending timestamp, idempotency key), and publishes to the appropriate priority queue topic. It does not fan out; that is the worker's job.

Priority queues are the architectural keystone: three separate topics in a durable log (Apache Kafka, or separate AWS SQS queues at smaller scale): transactional, social, and marketing. Each has independent consumer groups and independent scaling policies. This isolation ensures a 50M-message marketing blast cannot starve an OTP delivery, which directly satisfies the latency NFR from §2.

📬 What is Apache Kafka?

Apache Kafka in one paragraph. Kafka is a distributed commit log: producers append messages to named topics, and consumers read them by tracking an offset (a position in the log). Unlike traditional queues that delete messages on consumption, Kafka retains messages for a configurable period, enabling multiple independent consumer groups to read the same topic at their own pace (e.g. the delivery workers and the analytics pipeline both consume the same topic without interfering). If a consumer crashes, it restarts from its last committed offset; no message is lost. This replay-on-crash property is what makes Kafka the foundation of at-least-once delivery guarantees.

Fan-out workers consume from the priority queues. For each event, they look up the target user list, check preferences and quiet hours, select the right channel(s), and enqueue one send task per recipient per channel into the downstream dispatcher queue. They also write the initial record to the notification log.

Channel dispatchers are thin, stateless workers that handle one channel each: push (APNs + FCM), email (SendGrid or SES), and SMS (Twilio or Vonage). They implement channel-specific retry logic, third-party rate limit handling (respecting 429 responses), and payload formatting. Keeping them separated means a third-party provider outage only affects one channel, not the others.
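A sketch of that channel-specific retry logic under a few stated assumptions: 429 honours the provider's Retry-After hint, 5xx backs off exponentially with jitter, and any other 4xx is treated as permanent. The function name and (status, retry_after) response shape are illustrative, not any provider's actual API.

```python
import random

def plan_retries(responses, base_delay=1.0, max_attempts=5):
    """Walk a sequence of provider responses; return (outcome, delays slept).

    responses: list of (http_status, retry_after_seconds_or_None) tuples,
    one per attempt, simulating what the provider returned each time.
    """
    delays = []
    for attempt, (status, retry_after) in enumerate(responses[:max_attempts]):
        if 200 <= status < 300:
            return "delivered", delays
        if status == 429 and retry_after is not None:
            delays.append(retry_after)  # respect the provider's hint exactly
        elif status >= 500:
            # exponential backoff with up to 10% jitter to avoid thundering herd
            delays.append(base_delay * 2 ** attempt * (1 + random.random() * 0.1))
        else:
            return "dropped", delays    # permanent 4xx: do not retry
    return "failed", delays             # retry budget exhausted
```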

Preferences DB stores per-user opt-in/out state, quiet hours, preferred channel order, and per-category preferences. It is a relational store (PostgreSQL) with a read-through distributed cache (Redis) in front of it, since preferences are read on every single notification send but change rarely.

What is Redis?

Redis in one paragraph. Redis is an in-memory data structure store used as a cache, message broker, and ephemeral database. Because all data lives in RAM, reads and writes complete in under a millisecond, orders of magnitude faster than hitting a disk-backed database. Redis supports expiry (TTL) natively on any key, making it ideal for short-lived data like cache entries, rate limit counters, and deduplication keys. The trade-off is memory cost and volatility: Redis data can be lost on restart unless persistence is configured. In this design, Redis is used for three things: the preference cache (5-min TTL), the deduplication key store (24h TTL), and the per-caller rate limit token buckets, all cases where sub-millisecond lookup and natural expiry outweigh the need for durable storage.

Notification log is backed by a wide-column store (Apache Cassandra) split into two tables: notification_log is partitioned by user_id and clustered by sent_at, optimised for the inbox read ("give me all notifications for user X, newest first"). delivery_attempt is partitioned by notification_id and clustered by attempt_at, optimised for retry and support lookups ("what happened to notification Y?"). Splitting them lets each table's partition key match its dominant query, the core Cassandra data-modelling principle.

⚠️

Email deliverability requires DNS pre-configuration, not just an API key swap. DKIM signing keys and DMARC policy alignment must be consistently configured across all email providers before failover. Switching from SendGrid to AWS SES during an outage without pre-configured DNS records (DKIM CNAME entries, DMARC TXT record) will cause emails to land in spam or be rejected outright. Treat email provider failover as an infrastructure change, not an application-level switch.

🗄️ What is Apache Cassandra?

Apache Cassandra in one paragraph. Cassandra is a wide-column NoSQL database built for write-heavy, append-dominant workloads at scale. Data is distributed across nodes by a partition key; all rows sharing the same partition key live on the same node, making single-partition reads fast and predictable. Within a partition, rows are sorted by a clustering key, enabling efficient time-ordered range scans. Cassandra trades strong consistency for availability and partition tolerance: reads are eventually consistent by default, and there is no single point of failure. It also supports native TTL (time-to-live) per row, making retention-based expiry a configuration value rather than a background job. These properties (fast appends, user-partitioned inbox reads, and automatic TTL expiry) make it the standard choice for notification and activity-feed storage at scale.

Architectural rationale

Why Kafka for the priority queues? Queue choice

A durable log like Kafka gives replay capability: if a worker crashes mid-fan-out, the unconsumed messages remain on the topic and a replacement worker picks up exactly where the offset left off. This is the foundation of the at-least-once guarantee. It also allows independent consumer groups for analytics without adding load to the main delivery path.

Tradeoff: Kafka adds operational complexity. For smaller scale, a managed queue like AWS SQS with separate queues per priority tier achieves most of the same isolation with far less overhead. The choice between them is a scale and operational-maturity decision.
Alternatives: AWS SQS · RabbitMQ · Google Pub/Sub
Why separate fan-out workers from channel dispatchers? Separation of concerns

Fan-out is CPU-bound and scales with the number of recipients. Channel dispatch is I/O-bound and scales with provider throughput. Separating them means you can scale each independently: a celebrity fan-out event needs many fan-out workers briefly, but APNs imposes a fixed rate limit that no amount of additional dispatchers can exceed. If they were combined, you'd be forced to provision for the worst case on both dimensions simultaneously.

Tradeoff: Two separate worker pools mean two separate queues between them, adding another async hop. This adds latency (seconds, not milliseconds), which is acceptable for social and marketing but must be weighed against the 5-second transactional target.
Why Cassandra for the notification log? Storage choice

The notification log has three dominant access patterns: write a new delivery attempt (extremely frequent), read all notifications for a user sorted by time (the in-app inbox query), and expire old records after the retention period. Cassandra handles all three: writes are fast and append-only, reads by partition key (user_id) with time-ordered clustering are efficient, and native TTL handles expiry without a separate cleanup job.

Tradeoff: Cassandra provides eventual consistency. The in-app inbox may briefly show stale state after a write. This is acceptable for a notification feed. If strong read-your-writes is required (e.g. the user taps "mark all read" and immediately refreshes), route those reads through a read-repair path or accept a brief inconsistency window.
Alternatives: DynamoDB · ScyllaDB · PostgreSQL + partitioning (smaller scale)

Real-world comparison

Decision | This design | Meta | Airbnb
Fan-out model | Async workers, hybrid write/read | Fan-out on write (push model) | Fan-out on write + lazy reads for inbox
Queue system | Kafka (priority topics) | Scribe + internal queues | Apache Kafka
Notification log | Cassandra, TTL-based expiry | TAO (graph store) | MySQL + DynamoDB
Preference store | PostgreSQL + Redis cache | ZippyDB (key-value) | MySQL + Memcached
Push dispatch | APNs + FCM via dispatcher workers | In-house push infra | AWS SNS + internal wrapper
💡

Real production systems vary significantly based on scale and legacy. Meta's in-house push infrastructure exists because APNs and FCM rate limits become a bottleneck at their scale. For most companies, managed providers (SendGrid, Twilio, FCM) are the right call: they abstract away reliability, compliance (DKIM, DMARC), and global delivery infrastructure.

05

Fan-out strategy, the core algorithm

Fan-out is the defining algorithmic question for notification systems. When a user posts something, the system must decide: do we push that notification into each follower's inbox immediately (fan-out on write), or do we compute the notification list when each follower asks (fan-out on read)? Both have significant consequences across the entire design.

① Fan-out on write (precompute at event time): a new post event is written into each follower's inbox immediately. ✓ Fast reads (inbox pre-built) · ✓ Simple read path · ✗ Slow writes for celebrities · ✗ Storage amplification. Best for average users (<10K followers).

② Fan-out on read (compute at query time): store one post record; the follower graph is looked up on each read. ✓ No write amplification · ✓ Storage efficient · ✗ Slow reads (graph traversal). Best for celebrities (>1M followers).

Fan-out on write vs fan-out on read, both approaches have fatal flaws at the extremes. A hybrid addresses both.

The right answer for production systems is a hybrid approach: fan-out on write for users with follower counts below a threshold (e.g. 10,000), and fan-out on read for high-follower accounts. The threshold is a configuration value, not a constant. Twitter's "celebrity problem" made this pattern famous, and it remains the standard solution.

🔍

Open question: What is the right fan-out threshold? It depends on write throughput, worker capacity, and storage costs. If a worker can produce 10,000 notification records/second and the SLA allows 5 seconds, a user with 50,000 followers can be fanned out synchronously. Beyond that, switch to lazy read. Make the threshold configurable and instrument it.

Implementation sketch, hybrid fan-out worker
async def fan_out(event: NotificationEvent):
    sender = get_user(event.sender_id)

    if sender.follower_count <= FANOUT_WRITE_THRESHOLD:
        # Fan-out on write: push to each follower's inbox
        follower_ids = get_follower_ids(event.sender_id)
        await batch_write_inbox(follower_ids, event)
    else:
        # Fan-out on read: store one record, resolve at query time
        await write_single_event(event)
        # Followers will see this via graph traversal on inbox load

    await log_delivery_attempt(event.notification_id, "fan_out_complete")
05b

API design

POST /v1/notifications

Triggers a notification. Returns immediately after enqueueing; it does not wait for delivery.

Request
Response · 202 Accepted
Error responses
// 400  invalid template_id or type mismatch
// 409  idempotency_key already used within 24 h (duplicate blocked)
// 429  rate limit exceeded, Retry-After header included

GET /v1/users/{user_id}/notifications

Returns the in-app inbox for a user, newest first. Cursor-paginated.

Query params
// ?limit=20           max items per page (default 20, max 100)
// ?cursor=<token>     opaque cursor from previous response
// ?unread_only=true   filter to unread notifications only
Response · 200 OK

Additional endpoints

Endpoint | Purpose | Level
PATCH /v1/notifications/{id}/read | Mark a single notification read | L3/L4
POST /v1/notifications/read-all | Mark all as read for a user | L3/L4
GET /v1/notifications/{id}/status | Delivery status (queued/sent/failed) | L5/L6
PUT /v1/users/{id}/preferences | Update channel preferences + quiet hours | L5/L6
POST /v1/notifications/bulk | Bulk marketing campaigns with targeting | L6

Input validation that matters: validate that template_id exists and matches the notification type; check the idempotency key against a recent-send cache (Redis with 24 h TTL) before accepting; verify recipient_ids are valid and opted in; enforce a max-recipients cap per API call (e.g. 1,000; bulk sends go through the bulk endpoint).

Per-caller rate limiting: rate limit by caller API key using a token bucket in a distributed cache (Redis), keyed as ratelimit:api:{api_key}. Standard callers get 1,000 requests/minute; pre-approved bulk senders (e.g. a marketing service) get a higher limit negotiated at onboarding. Return 429 Too Many Requests with a Retry-After header when the bucket is exhausted. This prevents a single misconfigured caller from saturating the transactional queue and delaying OTP delivery for everyone else.
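A single-node sketch of that token bucket; in production the bucket state lives in Redis (keyed `ratelimit:api:{api_key}`) and is updated atomically so all API nodes share one bucket per key, but the refill arithmetic is identical. Names and defaults are illustrative.

```python
import time

class TokenBucket:
    """Per-API-key token bucket: refill at rate_per_minute, burst capacity."""

    def __init__(self, rate_per_minute, burst, now=None):
        self.rate = rate_per_minute / 60.0   # tokens added per second
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.updated = now if now is not None else time.monotonic()

    def allow(self, now=None):
        """Consume one token if available; False means respond 429 + Retry-After."""
        now = now if now is not None else time.monotonic()
        # refill based on elapsed time, capped at burst capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```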

06

Core flow, the write path

The dominant path in a notification system is the write path: an event arrives, fans out to recipients, checks their preferences, and dispatches per channel. The key architectural decision is where in the fan-out step to check preferences: in the fan-out worker before enqueuing per-channel tasks (saves provider API calls for opted-out users, but requires preference data in the worker), or in the dispatcher after dequeuing (cleaner separation, but wastes queue capacity on sends that will be suppressed).

1. Caller triggers event (auth service · social service · orders)
2. API validates & enqueues (idempotency check · priority routing)
3. Priority queue topic (transactional · social · marketing)
4. Fan-out worker: resolve recipients · check preferences. Opted out or in quiet hours? Skip and log as suppressed; otherwise continue.
5. Channel dispatch (APNs · FCM · SendGrid · Twilio); on provider failure, retry with backoff
6. Log delivery outcome (sent · failed · bounced · suppressed)

Notification write path, preference check before dispatch prevents unnecessary provider API calls

The preference check happens before channel dispatch, not after. This is the key decision point, driven directly by the NFR to respect quiet hours and opt-outs. Checking preferences in the fan-out worker (step 4) rather than the dispatcher (step 5) means we avoid making any provider API calls for opted-out users, which saves money and avoids consuming rate limit budget. The cost is that the fan-out worker must read preference data, which is why the preference cache (Redis, §8) is critical.

⚠️

Quiet hours edge case: When a notification is suppressed due to quiet hours, it shouldn't be silently dropped; it should be re-queued with a delayed send time aligned to the end of the user's quiet-hours window. Otherwise, time-sensitive-but-not-transactional notifications (e.g. a friend request at 11pm) are permanently lost. A scheduled notification queue (a priority queue with delayed visibility) handles this cleanly.
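A sketch of the re-queue time computation, assuming the timestamp has already been converted to the user's local timezone (see §8 on timezone handling); it handles quiet windows that cross midnight, and the function name is hypothetical.

```python
from datetime import datetime, time as dtime, timedelta

def next_allowed_send(now, quiet_start, quiet_end):
    """Return `now` if outside the quiet window, else the end of the window.

    `now` is a datetime in the user's local timezone; quiet_start/quiet_end
    are datetime.time values. Windows like 22:00-07:00 cross midnight.
    """
    t = now.time()
    if quiet_start <= quiet_end:                      # same-day window
        in_quiet = quiet_start <= t < quiet_end
        end_day = now.date()
    else:                                             # window crosses midnight
        in_quiet = t >= quiet_start or t < quiet_end
        end_day = now.date() + timedelta(days=1) if t >= quiet_start else now.date()
    if not in_quiet:
        return now
    return datetime.combine(end_day, quiet_end, tzinfo=now.tzinfo)
```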

07

Data model

The notification system has three distinct entities with very different access patterns and storage characteristics. Understanding how each entity is queried determines every schema decision.

Access patterns

Operation | Frequency | Query shape
Write delivery attempt | Extremely high (every send) | Insert row keyed on notification_id + user_id
Read user inbox | High (every app open) | Filter by user_id, order by time DESC, paginate
Read user preferences | Very high (every send) | Lookup by user_id, always full record
Update preference | Low (user-initiated) | Update by user_id
Delivery status query | Low (support / analytics) | Lookup by notification_id
Expire old notifications | Low (background job) | TTL-based, time-ordered

Two things stand out: preference reads happen on every single send and must be fast (driving the Redis cache), and the delivery log is almost pure-append with user-partitioned reads (driving the Cassandra choice).

notification_log (Cassandra): user_id (PK, uuid) · sent_at (CK, timestamp) · notification_id (uuid) · channel (text) · status (text) · title / body (text) · is_read (bool). TTL: 90 days (configurable).

user_preferences (PostgreSQL): user_id (PK, uuid) · push_enabled (bool) · email_enabled (bool) · sms_enabled (bool) · quiet_start / quiet_end (time) · category_overrides (jsonb) · timezone (text).

delivery_attempt (Cassandra): notification_id (PK, uuid) · attempt_at (CK, timestamp) · channel (text) · outcome (text) · provider_resp_code (int) · error_message (text).

PK = Partition Key · CK = Clustering Key (sort order)

Data model, three entities across two storage engines, chosen by their dominant access patterns

device_tokens, PostgreSQL

Device tokens are the source-of-truth for push delivery. They are relational (one user can have many devices, each with one token per platform), low-volume compared to the notification log, and must survive Redis eviction. They live in PostgreSQL alongside preferences, with the Redis cache (§8) providing the 1-hour TTL read layer.

Field | Type | Notes
device_id (PK) | uuid | Stable device identifier, generated on first app launch
user_id (FK) | uuid | Indexed, supports "give me all devices for user X"
platform | text | ios | android | web
token | text | APNs device token or FCM registration token; rotate on provider feedback
registered_at | timestamp | When the token was first registered
last_seen_at | timestamp | Updated on every app open; used to identify stale tokens for cleanup
is_active | bool | Set to false on APNs Feedback / FCM canonical ID response; stops sends immediately without deleting the row
💡

Token rotation: APNs returns APNS_UNREGISTERED when a device token is stale (app uninstalled or re-installed). FCM returns a canonical registration ID when a token has been refreshed. Both signals must be processed via a feedback consumer that marks the old token as is_active = false and invalidates the Redis cache entry — preventing continued sends to dead tokens that consume rate limit budget.
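A sketch of such a feedback consumer, with plain dicts standing in for the device_tokens table and the Redis token cache; the event shapes (`kind`, `device_id`, `new_token`) are illustrative, not the providers' exact payloads.

```python
def handle_provider_feedback(event, tokens, cache):
    """Apply one APNs/FCM feedback event.

    tokens: device_id -> row dict (stand-in for the device_tokens table)
    cache:  user_id -> cached token list (stand-in for the Redis cache)
    """
    row = tokens[event["device_id"]]
    if event["kind"] == "unregistered":        # APNs: token is dead (app uninstalled)
        row["is_active"] = False               # soft-disable, keep the row
        cache.pop(row["user_id"], None)        # invalidate stale cache entry
    elif event["kind"] == "canonical":         # FCM: token was refreshed
        row["token"] = event["new_token"]
        cache.pop(row["user_id"], None)
```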

Why category_overrides as JSONB?

Notification categories are product-defined and change over time. A user might disable push for "marketing" but keep it on for "social" and all transactional types. Storing this as a JSONB column avoids schema migrations every time a new category is introduced. The trade-off is that querying across users by category preference is harder, but that query pattern doesn't exist in steady state (preferences are always fetched per-user).

Why store user_id as PK in notification_log (not notification_id)?

In Cassandra, the partition key determines which node stores the data. Partitioning by user_id means all notifications for a user are co-located on the same node, making the inbox query (give me all notifications for user X, sorted by time) a single-partition scan, fast and predictable. If we partitioned by notification_id, an inbox query would scatter across the entire cluster.

Hot partition risk: A very active user accumulates many rows in one partition. Use a time-bucketed partition key (user_id + month) to cap partition size. Inbox queries spanning multiple months then merge results from several partition scans, a manageable trade-off.
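A sketch of the month-bucketed key, plus the helper an inbox query would use to know which partitions to merge; function names are hypothetical.

```python
from datetime import datetime

def inbox_partition_key(user_id, sent_at):
    """Bucket the notification_log partition by month so one very active
    user cannot grow a single unbounded partition."""
    return f"{user_id}:{sent_at:%Y-%m}"

def partitions_for_range(user_id, start, end):
    """All partition keys an inbox query must scan for [start, end]."""
    keys, y, m = [], start.year, start.month
    while (y, m) <= (end.year, end.month):
        keys.append(f"{user_id}:{y:04d}-{m:02d}")
        m += 1
        if m == 13:
            y, m = y + 1, 1
    return keys
```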
08

Caching strategy

The notification service's caching needs fall into two tiers. The essential tier covers every single send regardless of channel: both caches are hit on every notification that goes through the fan-out worker. The extended tier covers push-specific details that only become relevant once you go deeper on channel dispatch.

Essential, hit on every send

The fan-out worker checks both caches:

Dedup cache (Redis · 24 h TTL · idempotency_key → sent_at):
HIT → drop (no-op): already sent; log as suppressed, do not dispatch to provider.
MISS → proceed to send: write the key to the dedup cache first, then dispatch to the channel.

Preference cache (Redis · 5 min TTL · user_id → prefs object):
MISS → read PostgreSQL (source of truth) and populate the cache on read.

On preference update: write to PostgreSQL → invalidate the Redis key (write-invalidate, not write-through).

Cache flows, dedup HIT drops the send; MISS proceeds to dispatch. Preference MISS falls through to PostgreSQL.

Cache layer | Store | TTL | Strategy | Why
Deduplication keys | Redis | 24 hours | Write on first send; check before every send | Prevents duplicate sends on worker retry; idempotency_key maps to notification_id
User preferences | Redis | 5 minutes | Read-through; invalidate on update | Preferences are read on every send and change very rarely; cache hit rate >99%

Extended, push-specific L5+

Cache layer | Store | TTL | Strategy | Why
User device tokens | Redis | 1 hour | Read-through; invalidate on APNs/FCM feedback | Device tokens don't change often but are read on every push send; caching avoids a DB lookup per dispatch. Must be invalidated immediately when a provider returns a stale-token signal.

Cache invalidation

When a user updates their preferences via the API, the system writes to PostgreSQL and immediately invalidates the Redis key (write-invalidate, not write-through). Write-invalidate is preferred here because the next read will always populate the cache with fresh data, and double-writes (to DB and cache) risk inconsistency if one fails. The 5-minute TTL provides a backstop for any invalidation misses.
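The read-through + write-invalidate flow in miniature, with plain dicts standing in for PostgreSQL and Redis (the real cache entry would also carry the 5-minute TTL); the class name is hypothetical.

```python
class PreferenceStore:
    """Read-through cache with write-invalidate, as described above."""

    def __init__(self, db, cache):
        self.db = db          # stand-in for PostgreSQL (source of truth)
        self.cache = cache    # stand-in for Redis

    def get(self, user_id):
        prefs = self.cache.get(user_id)
        if prefs is None:                    # miss -> fall through to the DB
            prefs = self.db[user_id]
            self.cache[user_id] = prefs      # populate on read
        return prefs

    def update(self, user_id, prefs):
        self.db[user_id] = prefs             # 1. write to the source of truth
        self.cache.pop(user_id, None)        # 2. invalidate, don't write-through
```

The next `get` after an `update` repopulates the cache with fresh data, which is why the double-write inconsistency risk never arises.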

⚠️

Quiet hours and timezones: Quiet hours must be evaluated in the user's local timezone, not UTC. Cache the computed quiet-hours window per user per day (not just the raw start/end times) to avoid repeated timezone arithmetic on every notification send. Invalidate at midnight local time.

09

Deep-dive: scalability

At L5 and above, the interviewer expects you to go beyond the happy path and address the stress cases: what happens during a marketing blast, a celebrity event, or a third-party provider outage? The subsections below address the most common follow-up questions at each level; check the tag on each to see where it's expected.

[Architecture] Load Balancer (ingress) → API Cluster (×N, auto-scaled: rate-limit · validate) → Kafka (3 independent topics: ⚡ transactional · 💬 social · 📣 marketing) → Fan-out Workers (consumer groups per topic, auto-scaled) → Dispatchers (Push: APNs/FCM · Email: SendGrid · SMS: Twilio, each behind a circuit breaker). Supporting stores: Redis Cluster (dedup keys · prefs cache · device tokens), Cassandra Cluster (notification_log · delivery_attempt, TTL 90d), Analytics Store (ClickHouse / Druid, delivery funnel metrics, fed by a separate Kafka consumer group).

Production notification pipeline, independent scaling, circuit breakers per provider, separate analytics consumer group on Kafka

Handling celebrity / high-follower fan-out without write amplification L5

A celebrity with 10M followers triggering a fan-out on write would produce 10M queue messages in a single event. This can saturate workers and delay the entire topic. The hybrid fan-out approach (§5) avoids this by storing a single event record for high-follower users and resolving the recipient list at read time. The threshold (e.g. 10,000 followers) is a system configuration value, tuned based on observed write throughput and worker headroom.

Tradeoff: Fan-out on read for celebrities means in-app inbox loads are slower (graph traversal per page load). This is typically hidden behind a short client-side TTL cache.
Third-party provider rate limiting and circuit breakers L5

APNs, FCM, SendGrid, and Twilio all impose rate limits and have independent reliability SLAs. When a provider returns 429 (rate limited) or 503 (unavailable), the dispatcher should not immediately retry; that amplifies load on an already-struggling provider. The right pattern is a circuit breaker per provider: after N consecutive failures, open the circuit (stop sending), wait a fixed backoff interval, then probe with a small volume before fully reopening.

What is the circuit breaker pattern?

Circuit breaker pattern. Named after an electrical circuit breaker, this pattern has three states: Closed (normal, requests flow through), Open (tripped, requests are immediately rejected without hitting the downstream service), and Half-open (probing, a small number of test requests are allowed through to check if the service has recovered). When failures exceed a threshold, the breaker trips to Open, preventing a degraded provider from consuming all retry budget and blocking healthy sends. Libraries like Resilience4j (JVM) and pybreaker (Python) implement this out of the box.
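A minimal sketch of the three states; the thresholds and the single-probe policy are illustrative, and a production system would reach for a library like pybreaker rather than hand-rolling this.

```python
class CircuitBreaker:
    """Per-provider breaker: Closed -> Open after `threshold` consecutive
    failures; Half-open after `cooldown` seconds; one probe success closes it."""

    def __init__(self, threshold=5, cooldown=30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None        # None means Closed

    def allow_request(self, now):
        if self.opened_at is None:
            return True                               # Closed: requests flow
        return now - self.opened_at >= self.cooldown  # Half-open: allow a probe

    def record(self, success, now):
        if success:
            self.failures = 0
            self.opened_at = None                     # probe succeeded -> Closed
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = now                  # trip (or re-trip) to Open
```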

Provider failover: For push specifically, consider registering dual providers (primary + secondary). If APNs is degraded, fall back to a direct data push mechanism or queue the messages for delivery once APNs recovers. Email is more forgiving: SendGrid and AWS SES can run in parallel with identical DKIM/DMARC config and switch via DNS or weighted routing.
Marketing blast isolation, preventing priority inversion L5/L6

A bulk marketing campaign of 100M emails cannot share queue bandwidth with OTP sends. The three-queue topology (§4) solves this structurally: marketing messages are on a separate topic in the durable message queue (Kafka) with a separate consumer group. Marketing workers are allocated a fixed fraction of total throughput capacity and are not permitted to "borrow" transactional capacity. This must be enforced at the Kafka consumer configuration level (max.poll.records, concurrency settings) — not just as a convention.

Scheduled sends: Marketing campaigns often specify a send_time window. A scheduler service reads the campaign config and emits individual user-level notifications to the marketing topic at the appropriate rate, rather than queuing all 100M records at once. This smooths out the burst.
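The scheduler's rate smoothing can be sketched as a batch generator. Function name, batch size, and rate are illustrative; a real scheduler would persist its cursor so a crash doesn't restart the campaign.

```python
import itertools

def schedule_batches(recipient_ids, per_second, batch_size=1000):
    """Yield (delay_seconds, batch) pairs that drip a campaign onto the
    marketing topic at `per_second` instead of enqueuing it all at once."""
    seconds_per_batch = batch_size / per_second
    it = iter(recipient_ids)
    for i in itertools.count():
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return  # campaign fully scheduled
        yield i * seconds_per_batch, batch
```

A 100M-recipient campaign at 50K sends/second becomes a ~33-minute stream of batches instead of a single 100M-message burst on the topic.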
Geo-distribution and data residency (L7/L8) L7/L8

Notification systems in regulated markets (GDPR in EU, data localisation in India/Russia) must ensure that user notification data never transits through or is stored in non-compliant regions. This requires a regional deployment topology where EU user notifications are processed entirely within EU infrastructure, including the Kafka topics, fan-out workers, and Cassandra cluster. Cross-region forwarding is only allowed for aggregated, non-PII analytics.

Latency benefit: As a side effect, regional processing reduces end-to-end delivery latency by avoiding cross-continental hops in the fan-out path. A user in Tokyo processed by an APAC cluster reaches APNs servers faster than one routed through a US cluster.
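Residency-aware routing reduces to a strict lookup at the edge. A minimal sketch, assuming hypothetical cluster endpoints; the important design choice is failing loudly rather than silently falling back to a non-compliant region.

```python
# Hypothetical region-to-cluster table; in production the API gateway
# resolves this from the user's residency record.
REGIONAL_CLUSTERS = {
    "EU":   {"kafka": "kafka.eu.internal", "cassandra": "cass.eu.internal"},
    "US":   {"kafka": "kafka.us.internal", "cassandra": "cass.us.internal"},
    "APAC": {"kafka": "kafka.ap.internal", "cassandra": "cass.ap.internal"},
}

def route(user_region):
    """All PII-bearing processing stays in the user's home region;
    an unknown region is an error, never a cross-region fallback."""
    try:
        return REGIONAL_CLUSTERS[user_region]
    except KeyError:
        raise ValueError(f"no compliant cluster for region {user_region!r}")
```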
Observability, delivery funnels and alert thresholds L6+

A notification service is invisible when it works and very visible when it fails. Good observability means tracking the delivery funnel: events accepted → queued → fanned-out → preference check (suppressed vs passed) → dispatched → provider acknowledged → delivered (device confirmed). Any stage where the count drops unexpectedly signals a bug. An alert on transactional queue depth (growth beyond X seconds of work) triggers an immediate PagerDuty page; that queue is the OTP pipeline.

Key metrics: Queue consumer lag per topic tier, provider error rate per channel (429s, 503s), suppression rate (opt-outs and quiet hours), notification open rate (requires device-side callback), and delivery p50/p95/p99 per notification type.
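Detecting an unexpected drop in the funnel is a simple stage-to-stage comparison. A sketch, assuming illustrative stage names and expected-retention numbers; a production system would compute these over a sliding window per notification type.

```python
def funnel_drop_alerts(stage_counts, expected_retention, tolerance=0.05):
    """Compare stage-to-stage conversion against expected retention and
    return the stages that fell more than `tolerance` below it.
    `stage_counts` is an ordered list of (stage_name, count)."""
    alerts = []
    for (_, prev), (name, count) in zip(stage_counts, stage_counts[1:]):
        if prev == 0:
            continue  # nothing flowed into this stage; nothing to compare
        conversion = count / prev
        if conversion < expected_retention.get(name, 1.0) - tolerance:
            alerts.append((name, round(conversion, 3)))
    return alerts
```

Note that some drops are expected and healthy (the preference-check stage suppresses opted-out users), which is why each stage carries its own expected retention rather than a single global threshold.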
10

Failure modes & edge cases

Scenario Problem Solution Level
Worker crashes mid-fan-out Some recipients get the notification, others don't. Reprocessing duplicates those already sent. Idempotency key per (notification_id, recipient_id, channel). Dedup cache (Redis, 24h TTL) checked before every dispatch. At-least-once delivery is acceptable; duplicates are minimised in steady state via dedup. L3/L4
Third-party provider outage (APNs down) Push notifications fail silently or generate loud retry storms. Circuit breaker per provider. On open circuit, messages re-queue with backoff. Alert on circuit open state. Fall back to in-app notification delivery for time-insensitive types. L5
Marketing blast causes transactional delay 100M marketing emails saturate workers, delaying OTP delivery by minutes. Strict priority queue separation. Marketing topic has a hard cap on worker concurrency and cannot borrow transactional capacity. Autoscale transactional workers independently with aggressive scaling triggers. L5
User device token expired / invalid APNs returns APNS_UNREGISTERED. Continuing to send wastes rate limit budget. Process provider feedback channels (APNs Feedback Service, FCM canonical ID responses) to detect and remove stale device tokens. Update device token table, stop sending to that token immediately. L5
Quiet hours race condition Notification passes quiet-hours check, but user enters quiet hours before dispatch arrives at provider. Accept the small race window; the notification was valid at check time. The alternative (re-checking at dispatch time) adds DB reads on every send. For marketing sends only, implement send-window enforcement at the scheduler level. L5/L6
Preference cache stale after user opt-out User opts out, but cached preferences show opted-in. Notifications sent for up to 5 minutes after opt-out. Write-invalidate on preference update ensures next read is fresh. 5-minute TTL is the backstop. For GDPR/right-to-be-forgotten, implement immediate invalidation + hard-stop flag checked outside cache. L6
Fan-out creates hot Cassandra partitions A very active user accumulates millions of rows in one partition, causing slow scans and GC pressure. Composite partition key: (user_id, month_bucket). Inbox queries spanning multiple months merge two partition reads. TTL and compaction handle old data automatically. L6
Cross-region data residency violation EU user notification data processed or logged in US region, violating GDPR data localisation requirements. Regional routing at the API gateway: EU user IDs resolve to EU Kafka cluster, EU fan-out workers, EU Cassandra cluster. Cross-region traffic only for anonymised analytics aggregates. L7/L8
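The dedup check from the first row of the table above can be sketched with an in-memory stand-in for Redis. The production analogue is a single atomic `SET key NX EX 86400`; the injectable clock is an illustrative convenience for testing.

```python
import time

class DedupCache:
    """In-memory stand-in for the Redis dedup cache. The idempotency
    key is (notification_id, recipient_id, channel)."""

    def __init__(self, ttl_seconds=86_400, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._seen = {}  # key -> expiry timestamp

    def claim(self, notification_id, recipient_id, channel):
        """Return True exactly once per key within the TTL window."""
        key = (notification_id, recipient_id, channel)
        now = self.clock()
        expiry = self._seen.get(key)
        if expiry is not None and expiry > now:
            return False  # duplicate: this send was already dispatched
        self._seen[key] = now + self.ttl
        return True
```

A worker recovering from a mid-fan-out crash calls `claim()` before every dispatch; recipients already sent return False and are skipped, which is what makes at-least-once delivery tolerable.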
11

How to answer by level

L3 / L4, SDE I / SDE II L3/L4
What good looks like
  • Identifies push, email, SMS as separate channels
  • Puts a queue between the API and delivery to decouple
  • Mentions retry on provider failure
  • Knows what APNs and FCM are and that they require device tokens
  • Designs a basic user preferences table
  • Designs a basic in-app inbox (read/unread state, paginated by time, backed by the notification log)
What separates L5 from L3/L4
  • Priority queues: L3 uses one queue for everything
  • Fan-out strategy: L3 doesn't address write amplification
  • Quiet hours: L3 may miss the timezone complexity
  • At-least-once vs exactly-once: L3 doesn't reason about delivery semantics
L5, Senior SDE L5
What good looks like
  • Three-tier priority queue topology with justification
  • Hybrid fan-out: write for regular users, read for celebrities
  • Idempotency keys + dedup cache for at-least-once
  • Preference cache + write-invalidate strategy
  • Circuit breakers per provider
  • Cassandra for notification log with partition key rationale
What separates L6 from L5
  • Delivery funnel observability and alerting design
  • Cross-region / data residency awareness
  • Cost modelling across channels (SMS is expensive)
  • Make vs buy analysis for notification infrastructure
L6, Staff SDE L6
What good looks like
  • Full delivery funnel design with stage-by-stage metrics
  • Alert thresholds tied to SLA tiers per notification type
  • Delivery log schema with hot partition mitigation
  • Provider feedback loop (APNs Feedback, FCM canonical IDs)
  • Scheduled send architecture for marketing campaigns
  • GDPR compliance design: opt-out hard stops, data retention
What separates L7/L8 from L6
  • Global multi-region topology with data residency routing
  • Capacity planning and cost architecture at 10B notifications/day
  • Build vs buy decision with vendor risk analysis
  • Governance: notification fatigue, business impact of over-sending
L7 / L8, Principal / Distinguished L7/L8
What good looks like
  • Global topology: regional clusters, cross-region analytics aggregation only
  • Cost architecture: SMS cost per user, push free via provider, email at scale
  • Make vs buy: when does custom push infra beat APNs/FCM at scale?
  • Notification fatigue governance: ML-based frequency capping
  • Regulatory roadmap: GDPR, CCPA, India PDPB compliance strategy
  • Provider SLA negotiation and multi-vendor redundancy contracts
Distinguishing signals
  • Frames the design as an organisational capability, not just a service
  • Reasons about the business model impact of notification quality
  • Proactively identifies second-order effects (notification spam → churn)
  • Connects infrastructure decisions to product metrics (DAU, retention)

Classic probes

Probe question L3/L4 L5/L6 L7/L8
"A celebrity posts and has 50M followers. What happens?" Fan-out to all 50M in one batch, doesn't identify the problem L5: Hybrid fan-out: threshold check, fan-out on read for high-follower accounts. L6 adds: instruments the threshold in production, owns the runbook for when fan-out workers saturate, and designs the alerting that detects a celebrity burst before it delays transactional sends. Dynamic threshold tuning based on real-time worker headroom; pre-computed fan-out lists for known celebrities; capacity model for worst-case simultaneous celebrity events
"How do you make sure an OTP isn't delayed by a marketing blast?" Maybe mentions separate queue, may not know why L5: Three-tier priority queues with independent workers and autoscaling policies per tier. L6 adds: defines the SLO for transactional queue depth, owns the alerting threshold, and enforces the isolation at the Kafka consumer config level, not just as a convention. SLA-based capacity reservation, burst capacity contracts with cloud provider, latency SLO enforcement via automated load shedding on marketing tier
"A user opts out, but gets a notification 2 minutes later. Why?" Doesn't know, guesses "bug" L5: Preference cache TTL window (5 min); write-invalidate reduces it. L6 adds: distinguishes GDPR "forget me" hard-stop from marketing opt-out, defines the hard-stop flag checked outside the cache, and owns the compliance audit trail for opt-out processing. Quantifies the window, proposes sub-second invalidation via pub/sub eviction (Redis keyspace events to worker process), and designs the data retention deletion pipeline for GDPR right-to-erasure
"APNs is down for 20 minutes. What does your system do?" Retries until it works, no throttling or awareness of downstream impact L5: Circuit breaker opens, messages requeue with exponential backoff, alert fires. L6 adds: designs the observability that distinguishes "APNs degraded" from "our dispatcher misconfigured", owns the incident runbook, and ensures in-app inbox delivery continues unaffected. Pre-negotiated SLA with backup provider, automatic DNS/routing failover, customer-facing status page update, post-incident review of notification debt (messages delivered late vs lost)

How the pieces connect

Every architectural decision traces back to a requirement or observation.

1
OTP delivery latency NFR of <5 seconds (§2) → transactional notifications cannot share queue capacity with marketing → three separate priority queue topics with independent autoscaling worker pools (§4)
2
At-least-once delivery NFR (§2) → workers must retry failed sends → retrying without deduplication creates duplicates → idempotency key stored in Redis dedup cache (§8) checked before every dispatch
3
Celebrity fan-out estimation: 1 event × 10M followers = 10M queue messages (§3) → synchronous fan-out would hold the API response for minutes → hybrid fan-out: write for regular users, read for high-follower accounts (§5)
4
Preference check on every notification send → at 10K notifications/second, PostgreSQL can't absorb the read load → Redis read-through cache for preferences with write-invalidate strategy (§8)
5
Notification log access pattern: write-heavy, read by user sorted by time, auto-expire after retention period (§7) → Cassandra with user_id partition key + sent_at clustering key + native TTL (§4, §7)
6
Third-party providers (APNs, FCM, Twilio) have independent SLAs and impose rate limits → a single provider failure must not halt the entire pipeline → separate dispatchers per channel with circuit breakers and per-provider retry queues (§9)
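The preference cache in connection 4 can be sketched as a read-through cache with write-invalidate. This is a minimal in-memory illustration; in production the cache is Redis with a 5-minute TTL as the backstop, and `db` is PostgreSQL.

```python
class PreferenceCache:
    """Read-through cache with write-invalidate for user preferences."""

    def __init__(self, db):
        self.db = db          # mapping-like store standing in for PostgreSQL
        self.cache = {}
        self.db_reads = 0     # instrumented to show the read-load savings

    def get(self, user_id):
        if user_id not in self.cache:
            self.db_reads += 1
            self.cache[user_id] = self.db[user_id]  # read-through on miss
        return self.cache[user_id]

    def update(self, user_id, prefs):
        self.db[user_id] = prefs
        self.cache.pop(user_id, None)  # write-invalidate: next read is fresh
```

Invalidating on write (rather than updating the cache in place) avoids the race where a concurrent stale write overwrites a fresh one; the next `get()` simply re-reads the source of truth.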