What the interviewer is testing
The notification service prompt is deceptively broad. "Send notifications" sounds like a simple pub-sub problem. What makes it hard is the combination of heterogeneous delivery channels (push, email, SMS, in-app), wildly different latency tolerances across notification types, and the need to avoid both silent failures and duplicate deliveries.
The question is a proxy for four distinct skills: understanding how fan-out complexity grows with follower counts; reasoning about delivery guarantees and idempotency; designing for channel-specific rate limits imposed by third-party providers; and prioritising notification types so a marketing blast doesn't delay a password reset.
| Level | What good looks like |
|---|---|
| L3/L4 | Identifies push, email, SMS channels. Routes through a queue. Understands retry on failure. Knows what APNs and FCM are. |
| L5 | Designs priority queues per notification type. Justifies fan-out strategy with trade-offs. Reasons about at-least-once vs exactly-once. Handles user preferences and quiet hours. |
| L6 | Owns the end-to-end schema, delivery log, deduplication layer, and third-party rate limiting. Designs for observability, delivery funnels, drop-off tracking, alert thresholds. |
| L7/L8 | Addresses cross-region latency, provider failover, data residency constraints, cost modeling across channels, and the make-vs-buy decision for the notification infrastructure. |
Requirements clarification
Notification systems span a huge design space. The first thing to do is nail down which of the four channel types are in scope, what notification categories exist, and whether "delivery" means at-most-once, at-least-once, or exactly-once. The answers change every major component.
Functional requirements
| Requirement | In scope |
|---|---|
| Push notifications (iOS via APNs, Android via FCM) | ✓ |
| Email notifications (transactional + marketing) | ✓ |
| SMS notifications (OTP, alerts) | ✓ |
| In-app notification inbox (read/unread state) | ✓ |
| User preferences (opt-in/out per channel, quiet hours) | ✓ |
| Priority tiers: transactional, social, marketing | ✓ |
| Scheduled / delayed notifications | ✓ (L5+) |
| Rich push (images, actions, deep links) | Partial |
Before the NFRs, it's worth understanding how push delivery actually works; it shapes several of the constraints below.
APNs & FCM, the push gatekeepers. Apple Push Notification service (APNs) and Firebase Cloud Messaging (FCM) are Apple's and Google's official push delivery networks. Your server never opens a direct socket to a user's phone; instead it sends the notification payload to APNs or FCM, which maintains a persistent encrypted connection to every registered device. Both services impose per-app rate limits, require a registered device token per installation, and provide feedback channels to detect uninstalled apps. You cannot bypass them for iOS; Android allows alternatives, but FCM is the standard.
Non-functional requirements
| NFR | Target | Why |
|---|---|---|
| Transactional delivery latency (p99) | < 5 seconds end-to-end | OTP and password reset; delay kills usability |
| Social/engagement delivery latency (p99) | < 30 seconds | Likes and comments; freshness matters but isn't critical |
| Marketing delivery latency | Minutes to hours acceptable | Bulk campaigns are best-effort; don't compete with transactional |
| At-least-once delivery guarantee | Required for all channels | Silent failure (missing OTP) is worse than duplicate |
| Duplicate notification rate | Minimised in steady state (< 0.1%); brief windows after crashes acceptable | Duplicates are annoying but survivable; structured retry + idempotency keys keep this low in practice |
| Availability | 99.9%+ | Notification pipeline is non-critical-path but high visibility |
| Scalability | 10M notifications/day at baseline, 100M peak (marketing blast) | Events (Black Friday, elections) cause sudden 10× spikes |
Why at-least-once, not exactly-once? ›
Exactly-once delivery across distributed systems and third-party providers (APNs, Twilio) is effectively impossible to guarantee end-to-end: you can get close, but you cannot control what happens inside the provider's delivery layer. The practical answer is at-least-once delivery at the infrastructure level, with idempotency keys in the payload so the receiving app or device can deduplicate. This shifts a hard distributed systems problem into an application-level one.
Why separate latency SLAs for notification types? ›
Without separated priority queues, a marketing blast of 50M emails queued on Monday morning will delay the OTP that a user needs to log in. The latency targets are not just SLOs; they justify the entire queue topology. If all notifications had the same tolerance, a single queue would be fine. The tight 5-second window for transactional also drives the auto-scaling policy: transactional workers must scale in seconds, not minutes.
Capacity estimation
Notification systems have an unusual write profile: write volume (notification sends) can be many orders of magnitude larger than the number of triggering events, because a single social event (a post from a celebrity with 10M followers) fans out to millions of push notifications. Storage is dominated by the delivery log, not the notification metadata itself.
Interactive estimator
The key insight: the fan-out multiplier is the dominant lever. The average QPS shown above scales linearly with both events/day and recipients/event, but a marketing blast or celebrity post pushes this to the peak multiple you set. Adjust the peak multiplier to match your workload: social products spike on viral events, B2B products spike on business-hours campaigns. This is why fan-out must be async: the queue absorbs the burst, and workers drain it at their own rate, bounded by provider rate limits.
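The scaling relationship can be written as back-of-envelope arithmetic. The event and recipient counts below are illustrative assumptions chosen to match the 10M sends/day baseline from the NFR table:

```python
# Back-of-envelope capacity math for the fan-out write profile described above.
SECONDS_PER_DAY = 86_400

def notification_qps(events_per_day: int, recipients_per_event: int, peak_multiplier: float = 1.0) -> float:
    """Notification sends per second after fan-out (average, or peak when multiplied)."""
    sends_per_day = events_per_day * recipients_per_event
    return sends_per_day * peak_multiplier / SECONDS_PER_DAY

# Assumed baseline: 1M triggering events/day, 10 recipients each -> 10M sends/day
baseline = notification_qps(1_000_000, 10)                   # roughly 116 sends/s
peak = notification_qps(1_000_000, 10, peak_multiplier=10)   # a 10x spike
```

Note how the recipients-per-event term dominates: a single celebrity event with 10M recipients contributes as much load as the entire baseline day.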
High-level architecture
Notification service, three-tier async fan-out, per-channel dispatchers, and append-only delivery log
Component breakdown
Notification API is the single ingress point for all notification events. It validates the payload (notification type, target users, template ID), enforces per-caller rate limits, enriches the event with metadata (sending timestamp, idempotency key), and publishes to the appropriate priority queue topic. It does not fan out; that is the worker's job.
Priority queues are the architectural keystone. Three separate topics in a durable log (Apache Kafka or AWS SQS): transactional, social, and marketing. Each has independent consumer groups and independent scaling policies. This isolation ensures a 50M-message marketing blast cannot starve an OTP delivery, which directly satisfies the latency NFR from §2.
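A sketch of how the API layer might route a notification category to its priority topic; the category names, mapping table, and `topic_for` helper are illustrative assumptions, and a real system would load the mapping from config:

```python
from enum import Enum

class Priority(Enum):
    TRANSACTIONAL = "transactional"
    SOCIAL = "social"
    MARKETING = "marketing"

# Hypothetical category -> priority mapping
CATEGORY_PRIORITY = {
    "otp": Priority.TRANSACTIONAL,
    "password_reset": Priority.TRANSACTIONAL,
    "comment": Priority.SOCIAL,
    "like": Priority.SOCIAL,
    "campaign": Priority.MARKETING,
}

def topic_for(category: str) -> str:
    """Route a category to its priority topic. Unknown categories default to
    the marketing tier so a misconfigured caller can never delay transactional sends."""
    priority = CATEGORY_PRIORITY.get(category, Priority.MARKETING)
    return f"notifications.{priority.value}"
```

The fail-safe default is the important design choice: an unrecognised category degrades to best-effort delivery rather than competing with OTPs.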
What is Apache Kafka? tap to expand ›
Apache Kafka in one paragraph. Kafka is a distributed commit log: producers append messages to named topics, and consumers read them by tracking an offset (a position in the log). Unlike traditional queues that delete messages on consumption, Kafka retains messages for a configurable period, enabling multiple independent consumer groups to read the same topic at their own pace (e.g. the delivery workers and the analytics pipeline both consume from the same topic without interfering). If a consumer crashes, it restarts from its last committed offset, so no message is lost. This replay-on-crash property is what makes Kafka the foundation of at-least-once delivery guarantees.
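The replay-on-crash property can be illustrated with a toy in-memory model of a commit log. This is a teaching sketch of the concept, not the Kafka client API:

```python
class CommitLog:
    """Toy model of a Kafka-style topic: append-only messages, per-consumer-group
    committed offsets, replay from the last commit after a restart."""
    def __init__(self):
        self.messages: list[str] = []
        self.committed: dict[str, int] = {}  # consumer group -> next offset to read

    def append(self, msg: str) -> None:
        self.messages.append(msg)

    def poll(self, group: str, max_records: int = 10) -> list[tuple[int, str]]:
        start = self.committed.get(group, 0)
        return list(enumerate(self.messages[start:start + max_records], start))

    def commit(self, group: str, offset: int) -> None:
        self.committed[group] = offset + 1  # next read resumes after this offset

log = CommitLog()
for m in ("a", "b", "c"):
    log.append(m)

batch = log.poll("delivery-workers")          # reads a, b, c from offset 0
log.commit("delivery-workers", batch[0][0])   # commit only "a", then "crash"
replay = log.poll("delivery-workers")         # on restart: resumes at b, c
```

Only the committed offset moves the read position, so an uncommitted message is re-read after a crash (at-least-once), and a second group like "analytics" reads the same topic independently.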
Fan-out workers consume from the priority queues. For each event, they look up the target user list, check preferences and quiet hours, select the right channel(s), and enqueue one send task per recipient per channel into the downstream dispatcher queue. They also write the initial record to the notification log.
Channel dispatchers are thin, stateless workers that handle one channel each: push (APNs + FCM), email (SendGrid or SES), and SMS (Twilio or Vonage). They implement channel-specific retry logic, third-party rate limit handling (respecting 429 responses), and payload formatting. Keeping them separated means a third-party provider outage only affects one channel, not the others.
Preferences DB stores per-user opt-in/out state, quiet hours, preferred channel order, and per-category preferences. It is a relational store (PostgreSQL) with a read-through distributed cache (Redis) in front of it, since preferences are read on every single notification send but change rarely.
What is Redis? tap to expand ›
Redis in one paragraph. Redis is an in-memory data structure store used as a cache, message broker, and ephemeral database. Because all data lives in RAM, reads and writes complete in under a millisecond, orders of magnitude faster than hitting a disk-backed database. Redis supports expiry (TTL) natively on any key, making it ideal for short-lived data like cache entries, rate limit counters, and deduplication keys. The trade-off is memory cost and volatility: Redis data can be lost on restart unless persistence is configured. In this design, Redis is used for three things: the preference cache (5-min TTL), the deduplication key store (24h TTL), and the per-caller rate limit token buckets, all cases where sub-millisecond lookup and natural expiry outweigh the need for durable storage.
Notification log is backed by a wide-column store (Apache Cassandra) split into two tables: notification_log is partitioned by user_id and clustered by sent_at, optimised for the inbox read ("give me all notifications for user X, newest first"). delivery_attempt is partitioned by notification_id and clustered by attempt_at, optimised for retry and support lookups ("what happened to notification Y?"). Splitting them lets each table's partition key match its dominant query, which is the core Cassandra data-modelling principle.
Email deliverability requires DNS pre-configuration, not just an API key swap. DKIM signing keys and DMARC policy alignment must be consistently configured across all email providers before failover. Switching from SendGrid to AWS SES during an outage without pre-configured DNS records (DKIM CNAME entries, DMARC TXT record) will cause emails to land in spam or be rejected outright. Treat email provider failover as an infrastructure change, not an application-level switch.
What is Apache Cassandra? tap to expand ›
Apache Cassandra in one paragraph. Cassandra is a wide-column NoSQL database built for write-heavy, append-dominant workloads at scale. Data is distributed across nodes by a partition key, all rows sharing the same partition key live on the same node, making single-partition reads fast and predictable. Within a partition, rows are sorted by a clustering key, enabling efficient time-ordered range scans. Cassandra trades strong consistency for availability and partition tolerance: reads are eventually consistent by default, and there is no single point of failure. It also supports native TTL (time-to-live) per row, making retention-based expiry a configuration value rather than a background job. These properties, fast appends, user-partitioned inbox reads, and automatic TTL expiry, make it the standard choice for notification and activity feed storage at scale.
Architectural rationale
Why Kafka for the priority queues? Queue choice ›
A durable log like Kafka gives replay capability: if a worker crashes mid-fan-out, the unconsumed messages remain on the topic and a replacement worker picks up exactly where the offset left off. This is the foundation of the at-least-once guarantee. It also allows independent consumer groups for analytics without adding load to the main delivery path.
Why separate fan-out workers from channel dispatchers? Separation of concerns ›
Fan-out is CPU-bound and scales with the number of recipients. Channel dispatch is I/O-bound and scales with provider throughput. Separating them means you can scale each independently: a celebrity fan-out event needs many fan-out workers briefly, but APNs imposes a fixed rate limit that no amount of additional dispatchers can exceed. If they were combined, you'd be forced to provision for the worst case on both dimensions simultaneously.
Why Cassandra for the notification log? Storage choice ›
The notification log has three dominant access patterns: write a new delivery attempt (extremely frequent), read all notifications for a user sorted by time (the in-app inbox query), and expire old records after the retention period. Cassandra handles all three: writes are fast and append-only, reads by partition key (user_id) with time-ordered clustering are efficient, and native TTL handles expiry without a separate cleanup job.
Real-world comparison
| Decision | This design | Meta | Airbnb |
|---|---|---|---|
| Fan-out model | Async workers, hybrid write/read | Fan-out on write (push model) | Fan-out on write + lazy reads for inbox |
| Queue system | Kafka (priority topics) | Scribe + internal queues | Apache Kafka |
| Notification log | Cassandra, TTL-based expiry | TAO (graph store) | MySQL + DynamoDB |
| Preference store | PostgreSQL + Redis cache | ZippyDB (key-value) | MySQL + Memcached |
| Push dispatch | APNs + FCM via dispatcher workers | In-house push infra | AWS SNS + internal wrapper |
Real production systems vary significantly based on scale and legacy. Meta's in-house push infrastructure exists because APNs and FCM rate limits become a bottleneck at their scale. For most companies, managed providers (SendGrid, Twilio, FCM) are the right call: they abstract reliability, compliance (DKIM, DMARC), and global delivery infrastructure.
Fan-out strategy, the core algorithm
Fan-out is the defining algorithmic question for notification systems. When a user posts something, the system must decide: do we push that notification into each follower's inbox immediately (fan-out on write), or do we compute the notification list when each follower asks (fan-out on read)? Both have significant consequences across the entire design.
Fan-out on write vs fan-out on read: both approaches have fatal flaws at the extremes. A hybrid addresses both.
The right answer for production systems is a hybrid approach: fan-out on write for users with follower counts below a threshold (e.g. 10,000), and fan-out on read for high-follower accounts. The threshold is a configuration value, not a constant. Twitter's "celebrity problem" made this pattern famous, and it remains the standard solution.
Open question: What is the right fan-out threshold? It depends on write throughput, worker capacity, and storage costs. If a worker can produce 10,000 notification records/second and the SLA allows 5 seconds, a user with 50,000 followers can be fanned out synchronously. Beyond that, switch to lazy read. Make the threshold configurable and instrument it.
Implementation sketch, hybrid fan-out worker ›
```python
async def fan_out(event: NotificationEvent):
    sender = get_user(event.sender_id)
    if sender.follower_count <= FANOUT_WRITE_THRESHOLD:
        # Fan-out on write: push to each follower's inbox
        follower_ids = get_follower_ids(event.sender_id)
        await batch_write_inbox(follower_ids, event)
    else:
        # Fan-out on read: store one record, resolve at query time;
        # followers will see this via graph traversal on inbox load
        await write_single_event(event)
    await log_delivery_attempt(event.notification_id, "fan_out_complete")
```
API design
POST /v1/notifications
Triggers a notification. Returns immediately after enqueueing; it does not wait for delivery.
Full schema, all fields & annotations POST /v1/notifications ›
// 400 invalid template_id or type mismatch
// 409 idempotency_key already used within 24 h (duplicate blocked)
// 429 rate limit exceeded, Retry-After header included
GET /v1/users/{user_id}/notifications
Returns the in-app inbox for a user, newest first. Cursor-paginated.
Full schema, all fields & annotations GET /v1/users/{id}/notifications ›
// ?limit=20          max items per page (default 20, max 100)
// ?cursor=<token>    opaque cursor from previous response
// ?unread_only=true  filter to unread notifications only
Additional endpoints
| Endpoint | Purpose | Level |
|---|---|---|
| PATCH /v1/notifications/{id}/read | Mark a single notification read | L3/L4 |
| POST /v1/notifications/read-all | Mark all as read for a user | L3/L4 |
| GET /v1/notifications/{id}/status | Delivery status (queued/sent/failed) | L5/L6 |
| PUT /v1/users/{id}/preferences | Update channel preferences + quiet hours | L5/L6 |
| POST /v1/notifications/bulk | Bulk marketing campaigns with targeting | L6 |
Input validation that matters: validate that template_id exists and matches the notification type; check the idempotency key against a recent-send cache (Redis with 24h TTL) before accepting; verify recipient_ids are valid and opted in; enforce a max recipients cap on a single API call (e.g. 1,000; bulk sends go through the bulk endpoint).
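A sketch of that validation gate; `TEMPLATES`, `MAX_RECIPIENTS`, and `ValidationError` are illustrative names for this example, not part of any real framework:

```python
MAX_RECIPIENTS = 1_000
# Hypothetical template registry: template_id -> notification type
TEMPLATES = {"otp_v2": "transactional", "friend_request": "social"}

class ValidationError(ValueError):
    """Maps to a 400 response at the API layer."""

def validate_request(template_id: str, notification_type: str, recipient_ids: list[str]) -> None:
    if template_id not in TEMPLATES:
        raise ValidationError(f"unknown template_id: {template_id}")
    if TEMPLATES[template_id] != notification_type:
        raise ValidationError("template_id does not match notification type")
    if len(recipient_ids) > MAX_RECIPIENTS:
        raise ValidationError("recipient cap exceeded; use the bulk endpoint")
```

Idempotency-key and opt-in checks are omitted here since they hit Redis and PostgreSQL; the point is that cheap structural checks run first and reject before anything is enqueued.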
Per-caller rate limiting: rate limit by caller API key using a token bucket in a distributed cache (Redis), keyed as ratelimit:api:{api_key}. Standard callers get 1,000 requests/minute; pre-approved bulk senders (e.g. a marketing service) get a higher limit negotiated at onboarding. Return 429 Too Many Requests with a Retry-After header when the bucket is exhausted. This prevents a single misconfigured caller from saturating the transactional queue and delaying OTP delivery for everyone else.
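A minimal token bucket in plain Python. A production version would keep the (tokens, last_refill) state in Redis under the ratelimit:api:{api_key} key and run the refill-and-take step atomically (e.g. in a Lua script); the class here is an illustrative single-process sketch:

```python
class TokenBucket:
    """Token bucket with lazy refill: tokens accrue proportionally to elapsed
    time, capped at capacity; each allowed request consumes one token."""
    def __init__(self, capacity: int, refill_per_second: float):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)
        self.last_refill = 0.0

    def allow(self, now: float) -> bool:
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller gets 429 with a Retry-After header

# 1,000 requests/minute ~= 16.7 tokens/second
bucket = TokenBucket(capacity=1000, refill_per_second=1000 / 60)
```

The burst behaviour is the reason for choosing token bucket over a fixed window: a well-behaved caller can briefly exceed the steady rate up to the bucket capacity, while a runaway caller is throttled to the refill rate.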
Core flow, the write path
The dominant path in a notification system is the write path: an event arrives, fans out to recipients, checks their preferences, and dispatches per channel. The key architectural decision is where within the fan-out step to check preferences: in the fan-out worker before enqueuing per-channel tasks (saves provider API calls for opted-out users, but requires preference data in the worker), or in the dispatcher after enqueuing (cleaner separation, but wastes queue capacity on sends that will be suppressed).
Notification write path, preference check before dispatch prevents unnecessary provider API calls
The preference check happens before channel dispatch, not after. This is the key decision point, driven directly by the NFR to respect quiet hours and opt-outs. Checking preferences in the fan-out worker (step 4) rather than the dispatcher (step 5) means we avoid making any provider API calls for opted-out users, which saves money and avoids consuming rate limit budget. The cost is that the fan-out worker must read preference data, which is why the preference cache (Redis, §8) is critical.
Quiet hours edge case: When a notification is suppressed due to quiet hours, it shouldn't be silently dropped; it should be re-queued with a delayed send time aligned to the end of the user's quiet hours window. Otherwise, time-sensitive-but-not-transactional notifications (e.g. a friend request at 11pm) are permanently lost. A scheduled notification queue (a priority queue with delayed visibility) handles this cleanly.
Data model
The notification system has three distinct entities with very different access patterns and storage characteristics. Understanding how each entity is queried determines every schema decision.
Access patterns
| Operation | Frequency | Query shape |
|---|---|---|
| Write delivery attempt | Extremely high (every send) | Insert row keyed on notification_id + user_id |
| Read user inbox | High (every app open) | Filter by user_id, order by time DESC, paginate |
| Read user preferences | Very high (every send) | Lookup by user_id, always full record |
| Update preference | Low (user-initiated) | Update by user_id |
| Delivery status query | Low (support / analytics) | Lookup by notification_id |
| Expire old notifications | Low (background job) | TTL-based, time-ordered |
Two things stand out: preference reads happen on every single send and must be fast (driving the Redis cache), and the delivery log is almost pure-append with user-partitioned reads (driving the Cassandra choice).
Data model, three entities across two storage engines, chosen by their dominant access patterns
device_tokens, PostgreSQL
Device tokens are the source of truth for push delivery. They are relational (one user can have many devices, each with one token per platform), low-volume compared to the notification log, and must survive Redis eviction. They live in PostgreSQL alongside preferences, with the Redis cache (§8) providing the 1-hour TTL read layer.
| Field | Type | Notes |
|---|---|---|
| device_id (PK) | uuid | Stable device identifier, generated on first app launch |
| user_id (FK) | uuid | Indexed; supports "give me all devices for user X" |
| platform | text | ios \| android \| web |
| token | text | APNs device token or FCM registration token; rotate on provider feedback |
| registered_at | timestamp | When the token was first registered |
| last_seen_at | timestamp | Updated on every app open; used to identify stale tokens for cleanup |
| is_active | bool | Set to false on APNs feedback / FCM canonical ID response; stops sends immediately without deleting the row |
Token rotation: APNs returns APNS_UNREGISTERED when a device token is stale (app uninstalled or re-installed). FCM returns a canonical registration ID when a token has been refreshed. Both signals must be processed via a feedback consumer that marks the old token as is_active = false and invalidates the Redis cache entry, preventing continued sends to dead tokens that consume rate limit budget.
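A sketch of that feedback consumer, with plain dicts standing in for the PostgreSQL table and the Redis cache. The signal labels are simplified stand-ins for this example, not exact provider response codes:

```python
# Simplified stale-token signals (illustrative labels, not literal provider payloads)
STALE_SIGNALS = {"APNS_UNREGISTERED", "FCM_CANONICAL_ID"}

def handle_provider_feedback(db: dict, cache: dict, token: str, signal: str) -> bool:
    """Mark a dead token inactive and evict its cache entry; returns True if acted."""
    if signal not in STALE_SIGNALS:
        return False
    row = db.get(token)
    if row is not None:
        row["is_active"] = False              # keep the row for auditing; just stop sends
    cache.pop(f"device_token:{token}", None)  # invalidate the read-through cache entry
    return True
```

Flipping is_active rather than deleting preserves the registration history, and the cache eviction is what makes the stop immediate rather than waiting out the 1-hour TTL.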
Why category_overrides as JSONB? ›
Notification categories are product-defined and change over time. A user might disable push for "marketing" but keep it on for "social" and all transactional types. Storing this as a JSONB column avoids schema migrations every time a new category is introduced. The trade-off is that querying across users by category preference is harder, but that query pattern doesn't exist in steady state (preferences are always fetched per-user).
Why store user_id as PK in notification_log (not notification_id)? ›
In Cassandra, the partition key determines which node stores the data. Partitioning by user_id means all notifications for a user are co-located on the same node, making the inbox query (give me all notifications for user X, sorted by time) a single-partition scan, fast and predictable. If we partitioned by notification_id, an inbox query would scatter across the entire cluster.
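When a single user's partition grows hot, the failure-mode table later in this chapter suggests a composite key of (user_id, month_bucket); the bucketing logic is small. An illustrative sketch:

```python
from datetime import datetime

def partition_key(user_id: str, sent_at: datetime) -> tuple[str, str]:
    """Composite partition key (user_id, month_bucket). Bucketing by month bounds
    partition size for very active users while keeping inbox reads cheap."""
    return (user_id, sent_at.strftime("%Y-%m"))

def inbox_partitions(user_id: str, newest: datetime, months_back: int = 1) -> list[tuple[str, str]]:
    """Partitions to read for an inbox page, newest month first. A page that
    spans a month boundary merges at most months_back + 1 partition reads."""
    year, month = newest.year, newest.month
    keys = []
    for _ in range(months_back + 1):
        keys.append((user_id, f"{year:04d}-{month:02d}"))
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return keys
```

The trade-off is explicit in the second function: bounded partitions cost one extra partition read per month boundary crossed, which is cheap compared to scanning a multi-million-row partition.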
Caching strategy
The notification service's caching needs fall into two tiers. The essential tier covers every single send regardless of channel; both caches are hit on every notification that goes through the fan-out worker. The extended tier covers push-specific details that only become relevant once you go deeper on channel dispatch.
Essential, hit on every send
Cache flows, dedup HIT drops the send; MISS proceeds to dispatch. Preference MISS falls through to PostgreSQL.
| Cache layer | Store | TTL | Strategy | Why |
|---|---|---|---|---|
| Deduplication keys | Redis | 24 hours | Write on first send; check before every send | Prevents duplicate sends on worker retry; idempotency_key maps to notification_id |
| User preferences | Redis | 5 minutes | Read-through; invalidate on update | Preferences are read on every send and change very rarely; cache hit rate >99% |
Extended, push-specific L5+
| Cache layer | Store | TTL | Strategy | Why |
|---|---|---|---|---|
| User device tokens | Redis | 1 hour | Read-through; invalidate on APNs/FCM feedback | Device tokens don't change often but are read on every push send; caching avoids a DB lookup per dispatch. Must be invalidated immediately when a provider returns a stale-token signal. |
Cache invalidation
When a user updates their preferences via the API, the system writes to PostgreSQL and immediately invalidates the Redis key (write-invalidate, not write-through). Write-invalidate is preferred here because the next read will always populate the cache with fresh data, and double-writes (to DB and cache) risk inconsistency if one fails. The 5-minute TTL provides a backstop for any invalidation misses.
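The read-through plus write-invalidate flow in miniature, with dicts standing in for PostgreSQL and Redis (the TTL backstop is omitted for brevity; `PreferenceCache` is an illustrative name):

```python
class PreferenceCache:
    """Read-through cache with write-invalidate, as described above."""
    def __init__(self, db: dict):
        self.db = db                          # stand-in for PostgreSQL
        self.cache: dict[str, dict] = {}      # stand-in for Redis

    def get(self, user_id: str) -> dict:
        if user_id in self.cache:
            return self.cache[user_id]        # hit: no DB read
        prefs = self.db[user_id]              # miss: fall through to the DB
        self.cache[user_id] = prefs
        return prefs

    def update(self, user_id: str, prefs: dict) -> None:
        self.db[user_id] = prefs              # 1) write to the source of truth
        self.cache.pop(user_id, None)         # 2) invalidate; don't double-write
```

The update path never writes the cache directly, so the next read always repopulates from the DB, which is exactly the inconsistency-avoidance argument made above.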
Quiet hours and timezones: Quiet hours must be evaluated in the user's local timezone, not UTC. Cache the computed quiet-hours window per user per day (not just the raw start/end times) to avoid repeated timezone arithmetic on every notification send. Invalidate at midnight local time.
Deep-dive: scalability
At L5 and above, the interviewer expects you to go beyond the happy path and address the stress cases: what happens during a marketing blast, a celebrity event, or a third-party provider outage? The accordions below address the most common follow-up questions at each level; check the tag on each to see where it's expected.
Production notification pipeline, independent scaling, circuit breakers per provider, separate analytics consumer group on Kafka
Handling celebrity / high-follower fan-out without write amplification L5 ›
A celebrity with 10M followers triggering a fan-out on write would produce 10M queue messages in a single event. This can saturate workers and delay the entire topic. The hybrid fan-out approach (§5) avoids this by storing a single event record for high-follower users and resolving the recipient list at read time. The threshold (e.g. 10,000 followers) is a system configuration value, tuned based on observed write throughput and worker headroom.
Third-party provider rate limiting and circuit breakers L5 ›
APNs, FCM, SendGrid, and Twilio all impose rate limits and have independent reliability SLAs. When a provider returns 429 (rate limited) or 503 (unavailable), the dispatcher should not immediately retry; that amplifies load on an already-struggling provider. The right pattern is a circuit breaker per provider: after N consecutive failures, open the circuit (stop sending), wait a fixed backoff interval, then probe with a small volume before fully reopening.
What is the circuit breaker pattern? tap to expand ›
Circuit breaker pattern. Named after an electrical circuit breaker, this pattern has three states: Closed (normal, requests flow through), Open (tripped, requests are immediately rejected without hitting the downstream service), and Half-open (probing, a small number of test requests are allowed through to check if the service has recovered). When failures exceed a threshold, the breaker trips to Open, preventing a degraded provider from consuming all retry budget and blocking healthy sends. Libraries like Resilience4j (JVM) and pybreaker (Python) implement this out of the box.
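A minimal version of the three-state breaker, driven by an explicit clock so it is easy to test. Real libraries add probe budgets and failure-rate windows; this sketch tracks only consecutive failures:

```python
class CircuitBreaker:
    """Minimal per-provider circuit breaker: Closed -> Open after N consecutive
    failures; Half-open probe after a fixed cooldown."""
    def __init__(self, failure_threshold: int = 5, cooldown: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at: float | None = None   # None means Closed

    def allow_request(self, now: float) -> bool:
        if self.opened_at is None:
            return True                        # Closed: normal flow
        if now - self.opened_at >= self.cooldown:
            return True                        # Half-open: let a probe through
        return False                           # Open: reject without hitting the provider

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None                  # probe succeeded: close the circuit

    def record_failure(self, now: float) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now               # trip to Open
```

In the dispatcher, a rejected request is re-queued with backoff rather than dropped, which preserves the at-least-once guarantee while the provider recovers.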
Marketing blast isolation, preventing priority inversion L5/L6 ›
A bulk marketing campaign of 100M emails cannot share queue bandwidth with OTP sends. The three-queue topology (§4) solves this structurally: marketing messages are on a separate topic in the durable message queue (Kafka) with a separate consumer group. Marketing workers are allocated a fixed fraction of total throughput capacity and are not permitted to "borrow" transactional capacity. This must be enforced at the Kafka consumer configuration level (max.poll.records, concurrency settings), not just as a convention.
Geo-distribution and data residency (L7/L8) L7/L8 ›
Notification systems in regulated markets (GDPR in EU, data localisation in India/Russia) must ensure that user notification data never transits through or is stored in non-compliant regions. This requires a regional deployment topology where EU user notifications are processed entirely within EU infrastructure, including the Kafka topics, fan-out workers, and Cassandra cluster. Cross-region forwarding is only allowed for aggregated, non-PII analytics.
Observability, delivery funnels and alert thresholds L6+ ›
A notification service is invisible when it works and very visible when it fails. Good observability means tracking the delivery funnel: events accepted → queued → fanned-out → preference check (suppressed vs passed) → dispatched → provider acknowledged → delivered (device confirmed). Any stage where the count drops unexpectedly signals a bug. Alerts on transactional queue depth (if it grows beyond X seconds of work) trigger an immediate PagerDuty page; that's the OTP pipeline.
Failure modes & edge cases
| Scenario | Problem | Solution | Level |
|---|---|---|---|
| Worker crashes mid-fan-out | Some recipients get the notification, others don't. Reprocessing duplicates those already sent. | Idempotency key per (notification_id, recipient_id, channel). Dedup cache (Redis, 24h TTL) checked before every dispatch. At-least-once delivery is acceptable; duplicates are minimised in steady state via dedup. | L3/L4 |
| Third-party provider outage (APNs down) | Push notifications fail silently or generate loud retry storms. | Circuit breaker per provider. On open circuit, messages re-queue with backoff. Alert on circuit open state. Fall back to in-app notification delivery for time-insensitive types. | L5 |
| Marketing blast causes transactional delay | 100M marketing emails saturate workers, delaying OTP delivery by minutes. | Strict priority queue separation. Marketing topic has a hard cap on worker concurrency and cannot borrow transactional capacity. Autoscale transactional workers independently with aggressive scaling triggers. | L5 |
| User device token expired / invalid | APNs returns APNS_UNREGISTERED. Continuing to send wastes rate limit budget. | Process provider feedback channels (APNs Feedback Service, FCM canonical ID responses) to detect and remove stale device tokens. Update device token table, stop sending to that token immediately. | L5 |
| Quiet hours race condition | Notification passes quiet-hours check, but user enters quiet hours before dispatch arrives at provider. | Accept the small race window, the notification was valid at check time. The alternative (re-checking at dispatch time) adds DB reads on every send. For marketing sends only, implement a send-window enforcement at the scheduler level. | L5/L6 |
| Preference cache stale after user opt-out | User opts out, but cached preferences show opted-in. Notifications sent for up to 5 minutes after opt-out. | Write-invalidate on preference update ensures next read is fresh. 5-minute TTL is the backstop. For GDPR/right-to-be-forgotten, implement immediate invalidation + hard-stop flag checked outside cache. | L6 |
| Fan-out creates hot Cassandra partitions | A very active user accumulates millions of rows in one partition, causing slow scans and GC pressure. | Composite partition key: (user_id, month_bucket). Inbox queries spanning multiple months merge two partition reads. TTL and compaction handle old data automatically. | L6 |
| Cross-region data residency violation | EU user notification data processed or logged in US region, violating GDPR data localisation requirements. | Regional routing at the API gateway: EU user IDs resolve to EU Kafka cluster, EU fan-out workers, EU Cassandra cluster. Cross-region traffic only for anonymised analytics aggregates. | L7/L8 |
How to answer by level
L3 / L4, SDE I / SDE II L3/L4 ›
- Identifies push, email, SMS as separate channels
- Puts a queue between the API and delivery to decouple
- Mentions retry on provider failure
- Knows what APNs and FCM are and that they require device tokens
- Designs a basic user preferences table
- Designs a basic in-app inbox (read/unread state, paginated by time, backed by the notification log)
- Priority queues: an L3 typically uses one queue for everything
- Fan-out strategy: an L3 doesn't address write amplification
- Quiet hours: an L3 may miss the timezone complexity
- At-least-once vs exactly-once: an L3 doesn't reason about delivery semantics
L5, Senior SDE L5 ›
- Three-tier priority queue topology with justification
- Hybrid fan-out: write for regular users, read for celebrities
- Idempotency keys + dedup cache for at-least-once
- Preference cache + write-invalidate strategy
- Circuit breakers per provider
- Cassandra for notification log with partition key rationale
- Delivery funnel observability and alerting design
- Cross-region / data residency awareness
- Cost modelling across channels (SMS is expensive)
- Make vs buy analysis for notification infrastructure
L6, Staff SDE L6 ›
- Full delivery funnel design with stage-by-stage metrics
- Alert thresholds tied to SLA tiers per notification type
- Delivery log schema with hot partition mitigation
- Provider feedback loop (APNs Feedback, FCM canonical IDs)
- Scheduled send architecture for marketing campaigns
- GDPR compliance design: opt-out hard stops, data retention
- Global multi-region topology with data residency routing
- Capacity planning and cost architecture at 10B notifications/day
- Build vs buy decision with vendor risk analysis
- Governance: notification fatigue, business impact of over-sending
L7 / L8, Principal / Distinguished L7/L8 ›
- Global topology: regional clusters, cross-region analytics aggregation only
- Cost architecture: SMS cost per user, push free via provider, email at scale
- Make vs buy: when does custom push infra beat APNs/FCM at scale?
- Notification fatigue governance: ML-based frequency capping
- Regulatory roadmap: GDPR, CCPA, India PDPB compliance strategy
- Provider SLA negotiation and multi-vendor redundancy contracts
- Frames the design as an organisational capability, not just a service
- Reasons about the business model impact of notification quality
- Proactively identifies second-order effects (notification spam → churn)
- Connects infrastructure decisions to product metrics (DAU, retention)
Classic probes
| Probe question | L3/L4 | L5/L6 | L7/L8 |
|---|---|---|---|
| "A celebrity posts and has 50M followers. What happens?" | Fan-out to all 50M in one batch, doesn't identify the problem | L5: Hybrid fan-out: threshold check, fan-out on read for high-follower accounts. L6 adds: instruments the threshold in production, owns the runbook for when fan-out workers saturate, and designs the alerting that detects a celebrity burst before it delays transactional sends. | Dynamic threshold tuning based on real-time worker headroom; pre-computed fan-out lists for known celebrities; capacity model for worst-case simultaneous celebrity events |
| "How do you make sure an OTP isn't delayed by a marketing blast?" | Maybe mentions separate queue, may not know why | L5: Three-tier priority queues with independent workers and autoscaling policies per tier. L6 adds: defines the SLO for transactional queue depth, owns the alerting threshold, and enforces the isolation at the Kafka consumer config level, not just as a convention. | SLA-based capacity reservation, burst capacity contracts with cloud provider, latency SLO enforcement via automated load shedding on marketing tier |
| "A user opts out, but gets a notification 2 minutes later. Why?" | Doesn't know, guesses "bug" | L5: Preference cache TTL window (5 min); write-invalidate reduces it. L6 adds: distinguishes GDPR "forget me" hard-stop from marketing opt-out, defines the hard-stop flag checked outside the cache, and owns the compliance audit trail for opt-out processing. | Quantifies the window, proposes sub-second invalidation via pub/sub eviction (Redis keyspace events to worker process), and designs the data retention deletion pipeline for GDPR right-to-erasure |
| "APNs is down for 20 minutes. What does your system do?" | Retries until it works, no throttling or awareness of downstream impact | L5: Circuit breaker opens, messages requeue with exponential backoff, alert fires. L6 adds: designs the observability that distinguishes "APNs degraded" from "our dispatcher misconfigured", owns the incident runbook, and ensures in-app inbox delivery continues unaffected. | Pre-negotiated SLA with backup provider, automatic DNS/routing failover, customer-facing status page update, post-incident review of notification debt (messages delivered late vs lost) |