Design a Photo-Sharing Feed
like Instagram
Simple to describe, hard to scale: serving personalized feeds of images to two billion users without melting the database or the CDN budget.
~35 min read · 11 sections · interactive estimator
What the interviewer is testing
All levels · "Design Instagram" is a canonical FAANG question because it compresses three fundamentally different hard problems into one: a high-throughput binary data pipeline (images), a social graph traversal problem (who follows whom), and a personalized feed assembly problem (what to show, in what order). Solving any one of them is straightforward. Solving all three while keeping latency under 200 ms is the interview.
| Problem | Why it's hard | What changes at scale |
|---|---|---|
| Image pipeline | Binary objects are orders of magnitude larger than text; uploading, transcoding, and delivering at scale requires a dedicated infrastructure path | Direct-to-object-store uploads bypass app servers; CDN absorbs nearly all read traffic |
| Social graph traversal | Fetching follower lists at write time (fan-out) or post lists at read time (fan-in) both become expensive at celebrity scales | Hybrid strategy: push for regular users, pull for celebrities |
| Feed ranking | Chronological feeds don't scale (too noisy). Ranked feeds require ML inference on the assembled candidate set | Feed candidates pre-assembled; ranking applied as a lightweight pass |
Level signal: L3/L4 candidates describe the upload and read paths correctly but treat all users identically (always fan-out-on-write). L5 candidates identify the celebrity problem and propose the hybrid strategy. L6 candidates reason about the consistency window — how stale is a feed acceptable — and tie it back to a specific NFR. L7/L8 candidates frame the ranking vs. freshness tradeoff as a business decision with product impact, not just a technical constraint.
Requirements clarification
All levels · The question "design Instagram" is deliberately vague. Scoping it correctly — and explaining why you scoped it this way — is itself part of the evaluation. Start here before touching architecture.
Functional requirements
| Capability | In scope | Out of scope (for this interview) |
|---|---|---|
| Photo upload | Upload a photo with caption and tags; multi-format support (JPEG, PNG, HEIC) | Video (Reels), Stories (ephemeral content), carousel posts |
| Follow graph | Follow / unfollow users; public accounts only | Private accounts, approval workflows, close friends lists |
| Home feed | Paginated feed of recent posts from followed users | Explore tab, Reels feed, algorithmic ranking ML model internals |
| Interactions | Like a photo; comment on a photo | Saves, shares, DMs, polls, stickers |
| User profile | User's own post grid, follower/following counts | Highlights, bio link shortening, verified badge |
| Notifications | Like and follow notifications (async) | Real-time push infrastructure design |
Non-functional requirements
| NFR | Target | Why this level? |
|---|---|---|
| Feed load latency (p99) | < 200 ms (API response, excl. image download) | Users scroll continuously; any pause above ~300ms registers as lag |
| Photo upload latency | < 5 s end-to-end (processed + CDN URL returned) | Upload completion is a hard perception threshold; beyond 5s users abandon |
| Feed freshness | New posts appear in followers' feeds within 30 s | Near-real-time is the product expectation; batch updates are not acceptable |
| Like/comment consistency | Eventual consistency; counts accurate within ~5 s | Exact like counts are not safety-critical; slight lag is imperceptible |
| Availability | 99.99% (feed read path); 99.9% (upload path) | Feed reads are the core product; upload is less failure-sensitive (user can retry) |
| Scalability | 2 B MAU; 500 M DAU; 100 M photos/day uploaded | Instagram's reported 2024 scale |
| Storage durability | 11-nines (99.999999999%) for photo objects | Losing a user's photo is an irreversible trust failure |
NFR reasoning
Feed load latency < 200 ms Drives §4, §8, §9 ›
200 ms p99 for a paginated feed response means the API must return metadata (post IDs, image URLs, like counts) without waiting for image downloads. Image bytes flow separately through the CDN. The 200 ms budget is split roughly: 10 ms feed cache hit → 80 ms feed assembly from Redis → 110 ms network. Hitting this with a database query that joins posts, users, and likes across millions of rows per user is not feasible: the feed must be pre-assembled.
Feed freshness within 30 seconds Drives §5, §6 ›
30-second freshness means fan-out to follower feed caches must complete within 30 seconds of a post being uploaded. For a user with 1,000 followers, this is trivially achievable in a single async batch. For a celebrity with 100 million followers, fanning to 100M Redis keys within 30s is technically possible (50–100 workers consuming parallel Kafka partitions can sustain ~3–10M writes/sec), but it is economically infeasible — provisioning that burst capacity for a single post event wastes massive infrastructure that sits idle 99.99% of the time. The celebrity exception is therefore a cost-efficiency decision, not a physical impossibility, and it forces the hybrid fan-out strategy in §5.
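A quick sanity check on that throughput claim. The per-worker rate below (50 k pipelined Redis writes/sec) is an illustrative assumption, not a measured figure; the point is that the required worker count lands inside the 50–100 range quoted above:

```python
# Back-of-envelope: celebrity fan-out within the freshness budget.
FOLLOWERS = 100_000_000          # celebrity follower count
FRESHNESS_BUDGET_S = 30          # feed freshness NFR
WRITES_PER_WORKER_S = 50_000     # assumed pipelined Redis write rate per worker

required_rate = FOLLOWERS / FRESHNESS_BUDGET_S        # ZADDs per second needed (~3.3M/s)
workers_needed = required_rate / WRITES_PER_WORKER_S  # burst worker pool size

print(f"required write rate: {required_rate:,.0f}/s")
print(f"workers needed: {workers_needed:.0f}")
```

Technically feasible, as the text says; the objection is that this burst capacity sits idle between celebrity posts.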
11-nines durability for photos Drives §4, §7 ›
Photos are unique user-generated content — they cannot be regenerated. Unlike text posts that could potentially be re-entered, a lost photo is permanently lost. 11-nines durability (≈ 0.000000001% annual loss probability per object) means storing photos in object storage (Amazon S3 or equivalent) that replicates across multiple availability zones by default. The application layer never stores binary image data; it only stores CDN URLs pointing to the object store.
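The durability target translates into a concrete expected-loss figure. A back-of-envelope sketch using this design's own upload volume:

```python
# Expected annual photo loss at 11-nines durability (illustrative arithmetic).
UPLOADS_PER_DAY = 100_000_000
YEARS = 10
ANNUAL_LOSS_PROB = 1e-11          # 11 nines: 99.999999999% durable per object per year

objects = UPLOADS_PER_DAY * 365 * YEARS            # ~365 billion photos after 10 years
expected_lost_per_year = objects * ANNUAL_LOSS_PROB

print(f"{objects:.2e} objects -> ~{expected_lost_per_year:.1f} expected losses/year")
```

Even at hundreds of billions of objects, the expected loss is a handful of photos per year — which is why the target is 11 nines and not, say, 6.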
Capacity estimation
L3/L4+ · Instagram's load profile is dramatically read-heavy. For every photo uploaded, it's viewed hundreds to thousands of times. The write path (uploads) is a relatively thin stream; the read path (feed loads, profile views, photo fetches) is the scale challenge to design for. Image data adds a critical second dimension: storage and bandwidth numbers dwarf what you'd see in a text-only system.
Interactive capacity estimator
⚠️ Calibration note: The feed read QPS multiplier is calibrated so that 200× at 100 M uploads/day produces 500 M DAU × 40 feed loads/day ÷ 86,400 s ≈ 231 K req/sec. In practice, feed read QPS is a function of DAU and session behaviour — it does not scale proportionally with upload volume. If you adjust the upload count, re-calibrate this multiplier inversely to preserve the DAU-anchored read rate.
Key insight: At 100 M uploads/day, photo storage grows by ~20 TB/day (100M × 200 KB). Over 10 years that's ~73 PB of raw storage — before size variants. With three size variants (thumbnail at ~10%, feed at ~40%, full at 100% of original), the primary storage footprint is ~1.5× raw = ~109 PB. The estimator's 200–220 PB figure accounts for S3 cross-region replication (two geographic copies for multi-region availability), bringing the total logical storage to ~109 PB × 2 ≈ 220 PB. Note: S3 handles within-AZ redundancy transparently; the 2× factor here is cross-region replication, which you'd configure explicitly for a global platform. This number immediately tells you that object storage with CDN offloading is not optional: it's the only cost-feasible architecture. At typical S3 egress rates (several cents per GB, versus $0.023/GB for storage at rest), serving this read volume directly from origin without a CDN would cost thousands of dollars per minute in egress alone.
| Dimension | Estimate | Key insight |
|---|---|---|
| Upload write rate | ~1,160 uploads/sec (avg); ~3,500/sec peak | Manageable; the upload path is not the bottleneck |
| Feed read rate | ~230,000 req/sec | This is the dominant load — requires pre-computed feed caches and CDN |
| Photo storage (10 yr) | ~220 PB (with 3 size variants) | Forces object store; no block storage or filesystem can sustainably operate at this scale |
| CDN bandwidth | ~40–80 Tbps theoretical peak | Only achievable via global CDN PoPs with cache hit rates ≥ 90% |
| Feed metadata DB | ~500 GB/day new rows | Post metadata (caption, user_id, timestamp, CDN URLs) is tiny compared to images |
| Peak multiplier | 3–5× steady state | Major events (New Year's Eve, World Cup) spike uploads and feed reads simultaneously |
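The table's headline numbers can be reproduced in a few lines (decimal units, matching the estimator; the 40 feed loads/day per DAU figure is the calibration assumption from the note above):

```python
# Reproduce the capacity-estimation headline numbers (decimal units: 1 PB = 1e15 bytes).
UPLOADS_PER_DAY    = 100_000_000
AVG_PHOTO_BYTES    = 200_000            # ~200 KB stored original
DAU                = 500_000_000
FEED_LOADS_PER_DAY = 40                 # calibration assumption
SECONDS_PER_DAY    = 86_400

upload_qps = UPLOADS_PER_DAY / SECONDS_PER_DAY               # ~1,160/s
feed_qps   = DAU * FEED_LOADS_PER_DAY / SECONDS_PER_DAY      # ~231K/s

raw_10yr_pb   = UPLOADS_PER_DAY * AVG_PHOTO_BYTES * 365 * 10 / 1e15   # ~73 PB
with_variants = raw_10yr_pb * 1.5       # thumbnail (10%) + feed (40%) + full (100%)
replicated    = with_variants * 2       # cross-region copy

print(f"upload QPS ~{upload_qps:,.0f}, feed QPS ~{feed_qps:,.0f}")
print(f"storage: raw {raw_10yr_pb:.0f} PB, variants {with_variants:.0f} PB, replicated {replicated:.0f} PB")
```

Being able to rebuild these figures from scratch at the whiteboard is worth more than memorizing them.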
High-level architecture
L3/L4+ · The architecture splits cleanly into three planes. The upload plane handles inbound binary data: client → object store → processing pipeline → metadata write. The read plane handles feed and photo delivery: client → CDN → feed cache → API servers → database for cache misses. The social graph plane manages follow relationships and drives the feed fan-out process that connects uploads to reads. A durable message queue (Kafka) decouples all three.
Component breakdown
CDN (content delivery network) sits in front of all photo delivery. It caches the three image variants (thumbnail, feed-resolution, full-resolution) at edge nodes globally. Once a photo is cached at a PoP, subsequent requests for that URL cost zero origin bandwidth. For a platform with billions of daily photo fetches, CDN cache hit rates ≥90% are the difference between a viable and an unviable bandwidth bill.
API Gateway / Load Balancer is the single entry point for all non-image traffic: authentication token validation, rate limiting (per-user and per-IP), and routing to the three service clusters (Upload API, Feed API, Social API). It does not handle image bytes — those go directly from the client to object storage via pre-signed URL.
Upload API handles the upload initiation flow. On a POST /photos/upload request, it validates the request, generates a pre-signed URL pointing at object storage (S3), and returns it to the client. The client then uploads the binary data directly to S3 without touching the Upload API server again. After the S3 upload completes, the client notifies the Upload API, which writes a pending record to the Post DB and enqueues a processing job. This architecture keeps multi-hundred-KB binary data off the application tier entirely.
Image Processor is a stateless worker pool that consumes from an object-store event trigger (S3 event notification or SQS queue). For each uploaded photo, it generates three resolution variants — thumbnail (150×150px), feed resolution (640×640px), and full resolution (1080×1080px) — and writes them back to object storage with stable, CDN-friendly URLs. Once processing completes, it writes the final CDN URLs to the Post DB and publishes a post.published event to Kafka.
Feed API assembles the home feed for a requesting user. It first checks the user's pre-computed feed list in the Redis Feed Cache. On a cache hit, it hydrates post metadata (cached separately) and returns the response. On a miss (new user, cache evicted, or first request), it falls back to a database fan-in query: fetch the user's followed list from Graph DB, query Post DB for recent posts from each followee, merge and sort. Cache misses are expensive but rare in steady state. Hydration avoids N+1: metadata for all ~20 posts in a page is fetched in a single Redis pipeline (MGET post:{id1} post:{id2} … or pipelined HGETALL calls) rather than 20 serial round-trips. Cache misses within the batch trigger a parallel async read from Cassandra, with results written back to Redis before the response returns.
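The batched hydration pattern can be sketched in a few lines. Plain dicts stand in for Redis and Cassandra here; the point is one batched read per page with write-back on misses, not the specific client API:

```python
# Batched feed hydration (sketch): `cache` and `db` are in-memory stand-ins.
def hydrate(post_ids, cache, db):
    cached = {pid: cache.get(pid) for pid in post_ids}   # one MGET round-trip in real Redis
    misses = [pid for pid in post_ids if cached[pid] is None]
    if misses:
        fetched = {pid: db[pid] for pid in misses}       # parallel async Cassandra reads
        cache.update(fetched)                            # write back before responding
        cached.update(fetched)
    return [cached[pid] for pid in post_ids]

cache = {"p1": {"caption": "hi"}}
db = {"p1": {"caption": "hi"}, "p2": {"caption": "sunset"}}
page = hydrate(["p1", "p2"], cache, db)                  # p2 is backfilled from the DB
```

After the call, `p2` sits in the cache, so the next reader gets a full batch hit.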
Social API manages the follow graph: creating and deleting follow relationships, returning follower/following counts, and handling the Graph DB writes that trigger fan-out. When a user follows someone, the Graph DB record is written synchronously; feed cache updates happen asynchronously.
Post DB (wide-column store, Cassandra) stores post metadata: photo ID, user ID, CDN URLs for all three variants, caption, timestamp, like count (approximate), comment count. Cassandra's append-friendly write path handles the high upload write rate without locking. Reads are by (user_id, time_range), which maps cleanly to Cassandra's partition + clustering key model.
Graph DB (relational, PostgreSQL or sharded MySQL) stores the follow graph as a follows table with (follower_id, followee_id, created_at). Both directions are indexed because the system needs to answer both "who do I follow?" (write fan-out) and "who follows me?" (notification). At very large scale (billions of edges), this migrates to a dedicated graph service (Meta's TAO), but for interview purposes a sharded relational store is the correct starting point.
Feed Cache (Redis sorted sets) holds one per-user sorted set of post IDs, ordered by timestamp. Each post_id maps to a post metadata hash. The Fan-out Worker appends new post IDs to followers' sorted sets; Feed API pops the top N. A TTL handles stale-feed eviction: active users' caches are refreshed frequently, while inactive users' caches expire and are rebuilt on next login.
Kafka event pipeline receives post.published events from the Image Processor and follow.created events from the Social API. The Fan-out Worker subscribes and executes the feed propagation logic. The Notification Service subscribes to the same stream for independent like and follow notifications. This decoupling means a slow fan-out for a celebrity post doesn't delay photo processing or the API response.
Notification Service is a separate stateless consumer group on Kafka that processes post.published, like.created, and follow.created events. For each event, it resolves the target user's push token (APNs for iOS, FCM for Android) from a Device Registry store and enqueues a push notification via the respective platform gateway. Two reliability constraints matter here: (1) at-least-once delivery: Kafka consumer group commit-after-ack ensures no notification is silently dropped on worker crash; (2) deduplication — if a post triggers fan-out to 10,000 followers and the worker retries, the same notification must not be sent twice. A Redis set (notif_sent:{event_id}:{user_id}, TTL 24 h) acts as an idempotency gate before the push is dispatched. In-app notifications are written to a notifications table and fetched on app open — no real-time socket needed for the baseline design. (Scale note: at 500 M DAU a Cassandra table — notifications_by_user, partitioned by recipient_user_id, clustered by created_at DESC — handles concurrent per-user appends more efficiently than a PostgreSQL table, which requires explicit partitioning by user_id to avoid hot-row contention at this write rate.)
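The idempotency gate reduces to a check-and-mark on the event/user pair. A sketch with a plain set standing in for the Redis key (real code would use an atomic SET NX EX or SADD + EXPIRE rather than this non-atomic check-then-set):

```python
# Idempotency gate before push dispatch (sketch; in-memory stand-in for Redis).
def dispatch_once(event_id, user_id, seen, send):
    key = f"notif_sent:{event_id}:{user_id}"
    if key in seen:            # retried event: notification already delivered
        return False
    seen.add(key)              # in Redis: set with a 24 h TTL, atomically
    send(user_id)
    return True

sent, seen = [], set()
dispatch_once("post_123", "u42", seen, sent.append)
dispatch_once("post_123", "u42", seen, sent.append)   # Kafka redelivery: suppressed
```

The gate runs after the Kafka consumer receives the event but before the platform push, so at-least-once delivery upstream degrades gracefully to exactly-once user experience.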
Fan-out Worker is a horizontally scalable consumer pool that receives post.published events and decides whether to push (regular user) or skip (celebrity — readers will pull). For push-eligible posts, it reads the poster's follower list from Graph DB in paginated batches and writes the post ID to each follower's Redis sorted set via a pipeline command. The worker itself is stateless; Kafka consumer group semantics handle parallel processing across workers.
Hashtag and search index: The post.published Kafka event carries the caption and tags array. A dedicated Search Consumer (separate Kafka consumer group) reads this event and writes an index document to Elasticsearch: { post_id, user_id, caption, tags[], location, published_at }. Hashtag queries (GET /hashtag/{tag}) hit the Elasticsearch index directly — Cassandra is never queried for this. The Elasticsearch cluster is sized independently of the post write path because index writes are asynchronous and latency-tolerant. At interview depth, naming this consumer pattern and the Kafka-to-ES fanout is an L5 signal; discussing index mapping strategy (keyword vs. text analyzer for caption search) is L6.
Architectural rationale
Why upload directly to object storage and not through the app servers? Upload path ›
At 100 M uploads/day (~1,160/sec average, 3,500/sec peak), routing binary payloads (avg 5–10 MB pre-compression) through application servers would require massive server fleets for network I/O alone — completely separate from any compute cost. A pre-signed URL allows the client to push binary data directly to S3 over a parallel TCP connection while the app server does nothing. S3's direct-upload throughput is effectively unbounded per account. Application servers stay small, stateless, and CPU-bound rather than network-I/O-bound.
Why Cassandra for Post DB rather than PostgreSQL? Storage choice ›
Post writes — 100 M/day, or ~1,200 writes/sec sustained — are append-only inserts with no updates (except like/comment count increments, handled separately). Reads are almost always time-range queries by user_id: "give me the 20 most recent posts from this user." This maps directly to Cassandra's data model: partition key = user_id, clustering key = created_at (descending). Range scans are a single partition read — extremely fast with no cross-node joins. PostgreSQL can serve this workload, but requires careful sharding and index tuning to avoid fan-out queries across shards. Cassandra's write path is also cheaper: LSM-tree with commit log, no locking.
Why Kafka rather than directly writing to Redis from the Image Processor? Decoupling ›
Fan-out for a celebrity's post may touch 100 million Redis keys. This cannot complete synchronously within the upload response time budget. Kafka buffers the event; the Fan-out Worker scales horizontally to process the fan-out asynchronously. The Image Processor commits once (fast) and is done. Kafka also enables independent consumers: the Notification Service and Analytics pipeline consume the same post.published event without coupling to the fan-out timeline.
Real-world comparison
| Decision | This design | Instagram (reported) | Twitter/X |
|---|---|---|---|
| Image storage | S3 + CDN (CloudFront) | S3 + Facebook CDN | S3 + Fastly CDN |
| Feed generation | Hybrid fan-out (push regular, pull celebrity) | Hybrid; celebrity threshold ~10k followers | Hybrid; "power users" pulled on read |
| Post metadata store | Cassandra (append-friendly) | PostgreSQL historically; migrated to distributed stores | MySQL historically; migrated to Manhattan (distributed KV store) |
| Social graph | Sharded relational (follows table) | TAO (Facebook graph service) | Custom sharded MySQL (FlockDB historically) |
| Feed cache | Redis sorted sets per user | Redis at massive scale | Timeline cache in Redis |
There is no universally correct fan-out threshold — Instagram, Twitter, and Pinterest have all settled on different celebrity follower thresholds (ranging from 10k to 1M) based on their specific follower distribution curves and Redis write cost constraints. The right threshold follows from profiling your actual dataset, not from first principles.
Core algorithm — feed generation strategy
L5+ · Before designing the feed read path, you have to answer a fundamental question: when a user posts a photo, when does that photo enter their followers' feeds — at write time or at read time? This single decision drives the fan-out architecture, the feed latency design, the Redis data model, and the special handling for celebrity accounts.
The tension is simple: making feeds fast to read (pre-computed) makes posts expensive to write (must update millions of feed caches). Making posts cheap to write (store once) makes feed reads expensive (must dynamically assemble from scratch). Neither extreme works at Instagram scale. Two pure approaches and one hybrid dominate the space.
Our choice for this system: Use the hybrid fan-out (③). Posts from users with fewer than ~10,000 followers are pushed to all follower feed caches asynchronously within seconds. Posts from celebrity accounts (above threshold) are stored once in the Post DB. When reading the feed, the Feed API merges the user's pre-computed push cache with a fresh pull of the latest N posts from each celebrity they follow, then sorts by timestamp before applying ranking. The merge is a simple N-way merge of sorted lists — efficient even for users who follow dozens of celebrities.
Fan-out worker implementation: determining celebrity status ›
# Fan-out worker (pseudocode)
def handle_post_published(event):
    post_id = event['post_id']
    poster_id = event['user_id']
    # Check if poster is a celebrity (follower count cached in Redis)
    follower_count = get_cached_follower_count(poster_id)
    if follower_count < CELEBRITY_THRESHOLD:  # e.g. 10,000
        # Fan-out-on-write: push to all followers in batches
        fan_out_to_followers(poster_id, post_id, event['ts'])
    # else: celebrity — no fan-out; followers pull at read time

def fan_out_to_followers(poster_id, post_id, ts):
    page_token = None
    while True:
        followers, next_token = get_followers_page(poster_id, page_token)
        # Redis pipeline: ZADD feed:{follower_id} score=timestamp post_id
        with redis.pipeline() as pipe:
            for fid in followers:
                pipe.zadd(f'feed:{fid}', {post_id: ts})
                # Trim in the same pipeline (no extra round-trip); no-op if set < MAX_FEED_SIZE
                pipe.zremrangebyrank(f'feed:{fid}', 0, -(MAX_FEED_SIZE + 1))
            pipe.execute()
        if not next_token:
            break
        page_token = next_token

# Feed read: merge push cache + celebrity pulls
def get_home_feed(user_id, page_size=20, cursor_ts=None):
    # Step 1: get the pre-computed feed from Redis (cursor-aware).
    # Use ZREVRANGEBYSCORE with the cursor timestamp as upper-bound score — not
    # ZREVRANGE by rank: rank offsets shift as new posts are prepended, causing
    # duplicated/skipped items on page 2+.
    max_score = f'({cursor_ts}' if cursor_ts else '+inf'  # '(' makes the bound exclusive
    pushed = redis.zrevrangebyscore(f'feed:{user_id}', max_score, '-inf',
                                    start=0, num=page_size * 3)
    # Step 2: pull latest posts from followed celebrities
    celebrities = get_followed_celebrities(user_id)
    pulled = [get_recent_posts(c_id, n=10) for c_id in celebrities]
    # Step 3: merge + sort + rank + truncate
    candidates = merge_sorted(pushed, *pulled)
    return rank_and_truncate(candidates, page_size)
The ZREMRANGEBYRANK call trims each feed cache to a maximum size (e.g., 3,000 entries) to bound memory consumption. Users who haven't opened the app in weeks will simply get a longer cold-start query that rebuilds from the database — acceptable because this is not a hot path.
One open question: Where exactly is the celebrity threshold? It's not a fixed architectural constant — it's a tuned operational parameter based on the actual follower distribution and Redis write throughput limits. At Instagram scale (~1B+ accounts, with ~0.001% having over 1M followers), nearly all write load is handled by the push path. The threshold should be set low enough that the pull path covers only accounts where fan-out clearly exceeds write budget, and high enough that the merge overhead at read time stays negligible.
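The merge_sorted step in the feed-read pseudocode above can be implemented with the standard library's heapq.merge. A minimal sketch, assuming each source is a list of (timestamp, post_id) pairs already sorted newest-first:

```python
import heapq
from itertools import islice

def merge_sorted(*sources, limit=None):
    """N-way merge of newest-first (timestamp, post_id) streams; O(total log N),
    consuming each source lazily."""
    merged = heapq.merge(*sources, key=lambda item: item[0], reverse=True)
    return list(islice(merged, limit)) if limit is not None else list(merged)

pushed = [(1745531892, "p9"), (1745531000, "p7")]   # from the Redis push cache
pulled = [(1745531500, "c3"), (1745530000, "c1")]   # one celebrity pull
feed = merge_sorted(pushed, pulled)                 # p9, c3, p7, c1 by descending ts
```

Because each input is already sorted, the merge is linear in the candidate count regardless of how many celebrities the user follows.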
API design
L3/L4+ · Two endpoints dominate the hot path: the upload initiation endpoint and the feed read endpoint. Both are called from mobile clients where latency is compounded by cellular network variability.
POST /photos — initiate upload
Returns a pre-signed URL for direct-to-S3 upload. The actual binary data never flows through this endpoint. The client uploads with PUT (not POST) because S3 pre-signed URLs for object creation use PUT semantics. The client is responsible for following up with POST /photos/{id}/complete after the S3 upload finishes.
// Request
{
"filename": "IMG_2047.HEIC",
"content_type": "image/heic",
"size_bytes": 4721038,
"caption": "Sunset at Big Sur 🌅",
"location": { "lat": 36.27, "lon": -121.81 },
"tags": ["#sunset", "#california"]
}
// Response 200 OK (pre-signed URL for direct S3 upload)
{
"photo_id": "ph_8Xk2mNqLp9vW",
"upload_url": "https://s3.amazonaws.com/ig-uploads/ph_8Xk2mNqLp9vW?X-Amz-Signature=...",
"upload_method": "PUT",
"upload_headers": {
"Content-Type": "image/heic",
"Content-Length": "4721038"
},
"expires_at": "2026-04-24T21:15:00Z" // 15-min window
}
POST /photos/{id}/complete — finalize after upload
// Request (no body required — server validates S3 object existence)
// Response 202 Accepted (processing async)
{
"photo_id": "ph_8Xk2mNqLp9vW",
"status": "processing",
"estimated_ready_ms": 4000,
"poll_url": "/photos/ph_8Xk2mNqLp9vW/status"
}
// GET /photos/{id}/status — once processing completes:
{
"photo_id": "ph_8Xk2mNqLp9vW",
"status": "published",
"urls": {
"thumbnail": "https://cdn.intervu.dev/ph_8Xk2mNqLp9vW/t.jpg",
"feed": "https://cdn.intervu.dev/ph_8Xk2mNqLp9vW/f.jpg",
"full": "https://cdn.intervu.dev/ph_8Xk2mNqLp9vW/o.jpg"
},
"published_at": "2026-04-24T20:58:34Z"
}
GET /feed — home feed
Cursor-based pagination (no offset). The cursor encodes the last-seen timestamp so subsequent pages are stable even as new posts arrive. Returns metadata only — image bytes are fetched separately by the client via CDN URLs. Request validates authentication via bearer token in the Authorization header.
// Request: GET /feed?limit=20&cursor=eyJ0cyI6MTc0NTUzMTg5Mn0
// Response 200 OK
{
"posts": [
{
"post_id": "ph_8Xk2mNqLp9vW",
"user": {
"user_id": "u_4Rm9pXzL",
"username": "juanita.m",
"avatar_url": "https://cdn.intervu.dev/avatars/u_4Rm9pXzL.jpg"
},
"caption": "Sunset at Big Sur 🌅",
"image_urls": {
"thumbnail": "https://cdn.intervu.dev/ph_8Xk2mNqLp9vW/t.jpg",
"feed": "https://cdn.intervu.dev/ph_8Xk2mNqLp9vW/f.jpg"
},
"like_count": 1842,
"comment_count": 37,
"liked_by_viewer": false,
"created_at": "2026-04-24T20:58:34Z"
}
],
"next_cursor": "eyJ0cyI6MTc0NTQ4NjIxMH0",
"has_more": true
}
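The opaque cursor is nothing more than base64url-encoded JSON carrying the last-seen timestamp; the cursors in the example above round-trip through this sketch exactly:

```python
import base64
import json

def encode_cursor(ts: int) -> str:
    """Pack the last-seen timestamp into an opaque, URL-safe cursor."""
    raw = json.dumps({"ts": ts}, separators=(",", ":")).encode()
    return base64.urlsafe_b64encode(raw).decode().rstrip("=")

def decode_cursor(cursor: str) -> int:
    padded = cursor + "=" * (-len(cursor) % 4)   # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(padded))["ts"]

encode_cursor(1745531892)   # -> "eyJ0cyI6MTc0NTUzMTg5Mn0", the request cursor above
```

Keeping the cursor opaque lets the server later add fields (e.g., a tiebreaker post ID) without breaking clients.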
| Endpoint | Method | Level | Notes |
|---|---|---|---|
| POST /photos | POST | L3/L4 | Upload initiation; returns pre-signed S3 URL |
| POST /photos/{id}/complete | POST | L3/L4 | Trigger async processing pipeline |
| GET /feed | GET | L3/L4 | Cursor-paginated home feed |
| POST /likes/{post_id} | POST | L3/L4 | Like a post; idempotent |
| POST /follow/{user_id} | POST | L5 | Follow a user; triggers fan-out recalculation if threshold crossed |
| GET /photos/{id}/status | GET | L5 | Poll-based processing status; L5 discusses webhook alternative. Clients must implement exponential backoff (1 s → 2 s → 4 s → 8 s cap) — aggressive polling at 100 M uploads/day constitutes a self-inflicted DDoS on the status tier. |
| GET /users/{id}/feed | GET | L5 | Profile grid; different query pattern from home feed |
| GET /explore | GET | L7/L8 | Ranked non-following content; requires separate ML recommendation pipeline |
Large-file resumability — S3 Multipart Upload: For files larger than 5 MB (which includes all original HEIC uploads from modern iPhones, typically 4–8 MB), the pre-signed URL flow is extended to use S3 Multipart Upload. The Upload API creates a Multipart Upload session and returns pre-signed part URLs for each 5 MB chunk. The client uploads parts in parallel or sequentially; if connectivity drops, it resumes from the last successfully uploaded part using its ETag. After all parts are uploaded, the client calls CompleteMultipartUpload (or the Upload API does it on /photos/{id}/complete). Incomplete multipart uploads older than 7 days are cleaned up by an S3 lifecycle policy to avoid orphaned storage costs. This is an L5 probe: "What happens if the user's phone loses connection mid-upload of a 20 MB raw photo?"
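The part-planning step is simple arithmetic; a sketch (5 MB is S3's minimum part size for all parts except the last):

```python
# Plan multipart upload part ranges (sketch).
PART_SIZE = 5 * 1024 * 1024   # 5 MB, the S3 minimum part size

def plan_parts(size_bytes):
    """Return (part_number, start, end_exclusive) tuples; parts are 1-indexed."""
    parts = []
    for i, start in enumerate(range(0, size_bytes, PART_SIZE), start=1):
        parts.append((i, start, min(start + PART_SIZE, size_bytes)))
    return parts

# A 20 MB raw photo splits into four parts; after a dropped connection the client
# re-requests pre-signed URLs only for parts whose ETags it has not yet recorded.
parts = plan_parts(20 * 1024 * 1024)
```

Each part is uploaded against its own pre-signed URL, so resumption is a matter of diffing recorded ETags against the plan.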
Pre-signed URL security — enforcing content constraints: A pre-signed URL grants any bearer the right to PUT to that S3 key. Without additional controls, a client could upload a 10 GB binary or a PDF instead of a JPEG. Two enforcement layers prevent this: (1) Signed request constraints: the Upload API locks the Content-Type and Content-Length headers into the signature (the upload_headers values returned by POST /photos), so S3 rejects any PUT whose headers deviate from the declared MIME type and size; the presigned-POST variant achieves the same with a content-length-range policy condition (e.g., 1 byte to 50 MB). (2) Image Processor validation: before generating any variant, the processor validates the file magic bytes against the declared MIME type and rejects non-image uploads. The S3 object is tombstoned and the job fails cleanly. This is a common L5 probe: "What stops a client from uploading an executable to your S3 bucket using the pre-signed URL?"
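The magic-byte check is a few prefix comparisons. A minimal sketch covering the three accepted formats (the HEIC branch checks the ISO-BMFF 'ftyp' box and a couple of common HEIF brands; a production validator would use a full image library):

```python
# Magic-byte sniffing in the Image Processor (sketch).
def sniff_image_type(header: bytes):
    if header.startswith(b"\xff\xd8\xff"):                  # JPEG SOI marker
        return "image/jpeg"
    if header.startswith(b"\x89PNG\r\n\x1a\n"):             # PNG signature
        return "image/png"
    # HEIC: ISO-BMFF container with an 'ftyp' box at offset 4 and an HEIF brand
    if len(header) >= 12 and header[4:8] == b"ftyp" and header[8:12] in (b"heic", b"heix", b"mif1"):
        return "image/heic"
    return None   # reject: declared Content-Type cannot be trusted

sniff_image_type(b"MZ\x90\x00" + b"\x00" * 8)   # Windows executable -> None, job fails
```

If the sniffed type disagrees with the declared MIME type, the processor tombstones the object rather than transcoding it.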
Core flow — photo upload & feed read
L3/L4+ · Two flows dominate the system. The upload flow is write-critical and must preserve every photo; the feed read flow is latency-critical and must return in under 200 ms. They interact through Kafka: the upload flow publishes events; the feed read flow consumes pre-built caches that the fan-out worker produced from those events.
Upload flow
Key decision — why write metadata after processing (step ⑥), not before? This is driven by the latency NFR from §2. If metadata is written before processing, the feed would contain post entries with no CDN URLs — resulting in broken images for the 3–5 seconds the processor takes. The tradeoff: if the processor crashes between the S3 write and the DB metadata write, the photo object is orphaned in S3 and the user sees an upload error they can retry. A reconciliation job (< 1 min scan) detects orphaned objects and retries or notifies. This is a far better failure mode than showing broken images.
Feed read flow
Data model
L5+ · The data model has three distinct domains — posts, users/graph, and engagement (likes, comments) — each with very different access patterns. Those differences end up shaping which store each entity lives in and how it's keyed.
Access patterns
| Operation | Frequency | Query shape |
|---|---|---|
| Insert new post | ~1,200/sec avg | Write by (user_id, post_id); no joins |
| Fetch user's recent posts (profile grid) | High (every profile visit) | Range scan: WHERE user_id = X ORDER BY ts DESC LIMIT 12 |
| Fetch post by ID (feed hydration) | Very high (every feed load) | Point read: WHERE post_id = X |
| Get followers of user X | High (fan-out trigger) | Index scan: WHERE followee_id = X |
| Get users followed by X | High (feed miss fallback) | Index scan: WHERE follower_id = X |
| Like count for post X | Very high (every feed render) | Point read — must be precomputed |
| Has viewer liked post X? | High (per feed item) | Point read: (post_id, user_id) |
Two things jump out from this table. First, the post read path is predominantly by (user_id, time) for profile grids and by (post_id) for feed hydration — two different primary access patterns that suggest two indexes. Second, like counts and "has viewer liked" are read on every feed render; they must be precomputed, not computed on-the-fly from a likes table at read time.
Schema
Cassandra COUNTER constraint: COUNTER columns cannot coexist with regular columns in Cassandra — they require a dedicated table. Like and comment counts are persisted in a separate post_counters(post_id, like_count COUNTER, comment_count COUNTER) Cassandra table, used only as the async flush target from Redis (the hot-path read layer). Follower/following counts and the celebrity flag are Redis hash fields maintained on every follow.created event — not denormalized columns in PostgreSQL. The users table stores immutable profile data only (username, bio, avatar_url). This separation avoids row-level locking on concurrent follow/unfollow writes in PostgreSQL.
Why store like counts as Redis sets rather than a SQL likes table? ›
Like counts and "has the current viewer liked this post?" are read on every single feed render — potentially 20 posts × 230,000 req/sec = 4.6 million point reads per second. A SQL COUNT(*) WHERE post_id = X aggregation query at this rate would require a massive database fleet. A Redis set per post (likes:{post_id}) stores user_ids of likers; SCARD returns the count in O(1) and SISMEMBER checks viewer membership in O(1). The total memory per popular post (10k likers × 8 bytes/user_id) is ~80 KB — trivially small. The set is asynchronously flushed to Cassandra's COUNTER column for persistence; Redis is the hot path read layer.
Scaling limit for viral posts: A post with 10 M likers stores 10M × 8 bytes = 80 MB in a single Redis key — a concern for replication and eviction. At celebrity scale, switch to a HyperLogLog (PFADD / PFCOUNT) for approximate like counts (~12 KB per post, <1% error) and a Bloom filter for the "has viewer liked?" membership check. Exact historical counts are reconciled from Cassandra asynchronously. This is an L6 probe: "What happens to your Redis SET design when a post gets 50 million likes?"
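The hot-path operations reduce to plain set operations. An in-memory sketch (a real deployment issues Redis SADD / SCARD / SISMEMBER against likes:{post_id} keys):

```python
# Like storage as sets (sketch; dict-of-sets stands in for Redis).
likes = {}

def like(post_id, user_id):
    likes.setdefault(post_id, set()).add(user_id)    # SADD likes:{post_id} user_id

def like_count(post_id):
    return len(likes.get(post_id, set()))            # SCARD: O(1)

def viewer_liked(post_id, user_id):
    return user_id in likes.get(post_id, set())      # SISMEMBER: O(1)

like("p9", "u1"); like("p9", "u2"); like("p9", "u1")   # re-like is naturally idempotent
```

Set semantics give idempotent likes for free, which is why the POST /likes endpoint can be safely retried.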
Caching strategy
L5+The caching architecture for Instagram has four distinct layers, each positioned at a different tier of the §4 architecture. Without each layer, a different latency or cost ceiling is hit. The layers are not redundant — they address different bottlenecks.
| Layer | What it caches | Technology | TTL | Why it exists |
|---|---|---|---|---|
| 1. CDN (edge) | Photo objects (all 3 variants) | CloudFront / Fastly PoPs | 30 days (immutable URLs) | Without CDN, S3 egress at ~40 Tbps would be economically and physically impossible |
| 2. Feed Cache | Per-user sorted post ID list | Redis cluster (sorted set) | 7 days; LRU at 3000 entries | Enables <200 ms feed reads; without this, every feed load is an N-way fan-in database scan |
| 3. Post Metadata | Recent post JSON (caption, URLs, counts) | Redis cluster (hash) | 1 hr for hot posts; TTL-on-access | Feed hydration needs full metadata for ~20 posts per request at 230k req/sec — Cassandra cannot absorb all point reads |
| 4. Engagement Counters | Like sets, follower counts, celebrity flags | Redis (sets, counters) | None (async flush to Cassandra) | Like count + viewer-liked check on every post render; SQL COUNT at feed QPS is infeasible |
Cache invalidation
CDN invalidation: how do you update a photo once published? Immutability pattern ›
Photo objects are immutable: once a variant is written to S3 with a stable URL (e.g., /ph_8Xk2mNqLp9vW/f.jpg), it never changes. If the user edits their photo or post (which Instagram supports for captions, not image content), only the metadata changes. The CDN URL is stable for the original binary. This "immutable URL" pattern means CDN TTLs can be long (30 days) without invalidation risk. Deleting a photo requires an explicit CDN purge API call, which propagates to PoPs within minutes — acceptable, since a deleted photo isn't a latency-sensitive operation.
Feed cache invalidation on unfollow Consistency edge case ›
When user A unfollows user B, posts from B should stop appearing in A's feed. Purging B's post IDs from A's sorted set retroactively is expensive — you'd have to scan the entire set. The practical approach: on unfollow, simply mark the relationship as deleted in the Graph DB. The Feed API filters out posts from non-followed users before returning results. The feed cache may contain stale post IDs from B, but the hydration step checks current follow state and silently drops them. The feed cache self-heals as entries age out (7-day TTL). For immediate strictness, a background job can scrub the cache on unfollow — but this creates a thundering herd if mass-unfollows are triggered.
Avoiding Graph DB reads on every render: "Checking current follow state" during hydration cannot mean a Graph DB query per post — that defeats the purpose of the cache. Instead, the Feed API maintains a short-lived cached follow set per user in Redis (following:{user_id}, a SET of followee IDs, TTL 5 min, invalidated on follow/unfollow events via the Social API). Post IDs in the feed sorted set are filtered against this cached set in O(1) per post — no Graph DB round-trip needed during normal feed renders. The 5-min TTL bounds the maximum staleness window for any unfollow to 5 minutes, which is well within acceptable UX.
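The hydration-time filter reduces to a set-membership pass. In this sketch, `feed` is assumed to carry (post_id, author_id) pairs from the Redis sorted set, and `following` is the cached `following:{user_id}` set; the shapes are illustrative:

```python
def filter_feed(feed, following):
    """Drop cached post IDs whose author the viewer no longer follows.

    feed: list of (post_id, author_id) pairs from the feed sorted set.
    following: the viewer's cached follow set (5-min TTL in Redis).
    Stale entries from unfollowed accounts are silently dropped here,
    so the sorted set never needs a retroactive scrub.
    """
    return [post_id for post_id, author_id in feed if author_id in following]
```

Each check is an O(1) set lookup, so a 20-post render costs 20 memory operations and zero Graph DB round-trips.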
Deep-dive scalability
L5+At 500 M DAU, the production Instagram architecture must handle several failure modes and scale pressures that don't appear at smaller scale. This section traces through the five key scalability dimensions that L5+ candidates are expected to address.
Sharding strategy for Cassandra Post DB L5 · Data sharding ›
Cassandra partitions by user_id. This is the natural sharding key because the dominant access pattern — "get user X's recent posts" — is a single-partition scan. Cross-user queries (e.g., "get posts from all N followees") are parallelised across N partitions by the Feed API. Hot partitions arise for hyper-active users who post extremely frequently. Mitigation: add a shard_id secondary component to the partition key (e.g., user_id_mod_4) to spread a single user's data across 4 partitions, then union at read time. This is rarely needed below 1 B users.
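A sketch of the salting scheme, under the assumption that the shard suffix is derived from the post ID modulo a fixed shard count (function names and the key format are illustrative):

```python
NUM_SHARDS = 4  # the "mod 4" salting factor described above

def salted_partition_key(user_id: int, post_id: int) -> str:
    # Writes: route each post to one of NUM_SHARDS sub-partitions so a
    # single hyper-active user's rows spread across 4 Cassandra partitions.
    return f"{user_id}_{post_id % NUM_SHARDS}"

def read_partition_keys(user_id: int) -> list[str]:
    # Reads: fan out to all shards for this user, then union and re-sort
    # the results by the time-ordered clustering key.
    return [f"{user_id}_{shard}" for shard in range(NUM_SHARDS)]
```

The cost is a 4-way read union for salted users, which is why the pattern is activated only for monitored hot partitions rather than globally.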
Distributed ID generation for post_id L5 · ID generation ›
post_ids must be globally unique, roughly time-ordered (for sorted set scoring), and generated at ~1,200/sec without a central coordinator becoming a bottleneck. The standard solution is a Snowflake-like 64-bit ID: 41 bits timestamp (ms since epoch), 10 bits machine ID, 12 bits sequence. Each Upload API server generates IDs locally using its assigned machine_id — no network round-trip. The timestamp-prefix ensures IDs are sortable by time, which the Cassandra clustering key exploits.
Capacity note: 10 bits = 1,024 unique machine IDs. At Instagram's size the Upload API fleet can grow past 1,024 instances. The production fix is a lightweight machine-ID lease service (a Redis INCR counter, ZooKeeper node, or etcd lease) that issues numeric IDs from the 10-bit namespace with short-TTL leases, recycled when instances shut down. Modern stacks favour etcd over ZooKeeper for this pattern due to simpler operation and Raft-based strong consistency. Alternatively, use 12 bits for machine ID (4,096 nodes) and reduce the sequence to 10 bits (1,024 IDs/ms per node); throughput stays well above 1,200/sec at any realistic fleet size.
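A minimal Python sketch of the 41/10/12 bit layout described above, assuming a well-behaved (non-rewinding) clock; the epoch constant is Twitter's original Snowflake epoch and purely illustrative:

```python
import threading
import time

EPOCH_MS = 1288834974657  # illustrative custom epoch (Twitter's Snowflake epoch)

class SnowflakeGenerator:
    """64-bit IDs: 41 bits timestamp | 10 bits machine ID | 12 bits sequence."""

    def __init__(self, machine_id: int):
        assert 0 <= machine_id < 1024  # must fit the 10-bit machine namespace
        self.machine_id = machine_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                # Same millisecond: bump the 12-bit sequence (4,096 IDs/ms).
                self.sequence = (self.sequence + 1) & 0xFFF
                if self.sequence == 0:
                    # Sequence exhausted this ms: spin until the next ms.
                    while now <= self.last_ms:
                        now = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now
            return ((now - EPOCH_MS) << 22) | (self.machine_id << 12) | self.sequence
```

Because the timestamp occupies the high bits, IDs from one generator are monotonically increasing, which is exactly what the Cassandra clustering key and the sorted-set score rely on.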
Fan-out worker scaling: the celebrity post problem L5/L6 · Thundering herd ›
Kylie Jenner has ~400 million followers. If she posts and we try to fan out to all 400 M feed caches within 30 seconds, that requires ~13 million Redis writes per second sustained for 30 seconds — from the fan-out workers alone. This is achievable with a large horizontal fan-out fleet (hundreds of workers consuming from parallel Kafka partitions), but at enormous infrastructure cost for a single event. The design caps fan-out via the celebrity threshold. For true celebrities (400 M followers), the post is stored once; followers pull it lazily at feed read time and merge it into the pre-built push cache. The thundering herd from fan-out is fully eliminated.
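The write-rate arithmetic generalises into a small estimator. The 50,000 writes/sec per-worker throughput used as the default here is an assumed figure for illustration, not a measured one:

```python
import math

def fanout_rate(followers: int, sla_seconds: int) -> float:
    """Sustained Redis writes/sec needed to reach every follower within the SLA."""
    return followers / sla_seconds

def workers_needed(followers: int, sla_seconds: int,
                   writes_per_worker: int = 50_000) -> int:
    """Fan-out fleet size at an assumed per-worker write throughput."""
    return math.ceil(fanout_rate(followers, sla_seconds) / writes_per_worker)
```

Plugging in the celebrity case (400 M followers, 30 s SLA) reproduces the ~13 M writes/sec figure above, and makes the cost of a single post event concrete: hundreds of workers doing nothing but cache writes.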
Geo-distribution and multi-region L6 · Global availability ›
Instagram serves users across North America, Europe, Latin America, Southeast Asia, and India. A single-region architecture would add 100–300 ms transcontinental latency to every API call. The production design deploys independent regional clusters (US-East, EU-West, AP-Southeast) with global S3 replication for photo objects and inter-region replication for the social graph. Feed caches are regional — users primarily interact with accounts in their geographic cluster. Post data replicates asynchronously across regions via Kafka cross-cluster replication. The social graph (follows table) is the hardest to multi-home because a follow write must eventually reach all regions — typically done via async replication with eventual consistency acceptable at cross-region propagation timescales (<5s).
Feed ranking: when and where is it applied? L6/L7 · ML integration ›
Instagram's feed is not purely chronological — it's ranked by an ML model incorporating engagement signals (likes, comments, profile visits), content signals (caption, hashtags), and behavioral signals (user viewing history, time of day). The ranking model cannot run inline at 230,000 feed requests/sec. The practical architecture: the Feed API assembles a candidate set of ~100–200 post IDs from the Redis cache and celebrity pulls, sends it to a lightweight ranking service (ONNX model inference, <20 ms for 200 candidates), and the ranked top-N posts are returned to the client. The ranking service is stateless and horizontally scalable, adding ~20 ms to the latency budget — which fits inside the 200 ms p99 NFR.
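A sketch of the two-stage read path, with a placeholder scoring function standing in for the ONNX model (all names are illustrative):

```python
def rank_feed(push_candidates, pull_candidates, score, n=20):
    """Merge push-cache and celebrity-pull candidates, rank, return top N.

    push_candidates / pull_candidates: post IDs from the Redis sorted set
    and from the lazy celebrity pull, respectively.
    score: stand-in for the ranking service call (ONNX inference over the
    full candidate set, <20 ms for ~200 candidates).
    """
    # Dedupe while preserving arrival order (a post can appear in both paths).
    candidates = list(dict.fromkeys(list(push_candidates) + list(pull_candidates)))
    return sorted(candidates, key=score, reverse=True)[:n]
```

The key property is that ranking is a bounded post-pass over a pre-assembled candidate set, never a scan over the full follow graph.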
Failure modes & edge cases
L5+| Scenario | Problem | Solution | Level |
|---|---|---|---|
| Image processor crashes mid-processing | Photo uploaded to S3, metadata never written to Post DB. User sees a spinner indefinitely. | Processor uses idempotent job IDs from SQS with visibility timeout. If the job isn't ACKed within 5 min (DLQ threshold), SQS re-delivers to another worker. S3 object already present — processor regenerates variants and completes the write. Client polls /photos/{id}/status. | L3/L4 |
| Redis cluster failure | Feed caches unavailable. All feed reads fall back to database fan-in. Cassandra and Graph DB get hit at full feed read QPS (~230,000 req/sec). DB fleet overwhelmed. | Load shedding: Feed API activates circuit breaker after N consecutive Redis timeouts, returns degraded response (recent 5 posts from cache replica or empty feed with "Loading..."). Redis is deployed as a cluster with replica failover (<15s switchover). Read replicas absorb fan-in during failover window. | L5 |
| Kafka consumer lag spike (fan-out delayed) | Fan-out workers fall behind. Posts appear in followers' feeds 5+ minutes late, violating the 30 s SLA. | Fan-out workers auto-scale on Kafka consumer lag metric (Kubernetes HPA on consumer group lag). If lag exceeds threshold, additional workers spin up. Kafka partition count is set 2× the expected worker count to allow headroom. Monitoring alerts at >60 s lag. | L5 |
| User with 0 followers posts; feed cache never populated | No followers to fan out to. Post is orphaned — nobody sees it in feeds. Profile page still works (directly reads from Cassandra by user_id). | This is correct behavior. The fan-out worker correctly does zero writes for 0 followers. Post is always visible on the poster's own profile page. If the user gains followers later, subsequent fan-outs populate follower feeds from then — historical posts are visible via profile grid but not retroactively injected into feed caches. | L3/L4 |
| Celebrity crosses the threshold mid-fan-out | User has 9,999 followers; posts; fan-out starts. 1 second later they gain 10,001 followers (viral moment). Fan-out continues but now exceeds budget. | Fan-out worker checks celebrity status once at job start using the Redis-cached follower count (60s TTL). If the follower count was below threshold at job start, the full fan-out completes. There is a short race window where a newly-over-threshold account gets one over-budget fan-out — acceptable, as it happens at most once per account crossing the threshold. A background observer on Kafka's follow.created stream detects the threshold crossing and atomically updates a Redis hash field (celebrity:{user_id} = 1): the flag the fan-out worker reads on all future post events. All subsequent posts from that user use the pull path. | L5 L6 |
| Celebrity flag crosses threshold downward (mass unfollow / account cleanup) | A celebrity account loses followers — dropping below the threshold. Future posts should theoretically resume fan-out-on-write, but the celebrity:{user_id} flag was set when the threshold was crossed upward and is never cleared. | The flag is treated as a one-way gate for production simplicity: once an account is flagged as a celebrity, it stays on the pull path even if follower count later drops below the threshold. Re-enabling fan-out requires a manual operational step (clear the Redis flag). The tradeoff: a de-celeb'd account's posts involve a slightly heavier read path for its followers, but this affects a negligible fraction of accounts. Auto-detecting and migrating pull→push introduces race conditions between the flag update and in-flight fan-out workers that aren't worth the complexity. | L5 L6 |
| CDN cache poisoning / wrong variant served | Image processor has a bug and generates a corrupt thumbnail. CDN caches the corrupt file. All subsequent requests get the corrupt image for 30 days. | Pre-signed upload URLs include a content hash suffix. If the processor regenerates a correct variant, it writes to a different URL (with a new hash suffix) rather than overwriting. The Post DB metadata is updated to point to the new URL, and an explicit CDN purge is issued for the old URL. Immutable-by-URL prevents accidental cache overwrites. | L5 |
| Hot partition in Cassandra (viral user posts 1000 photos/day) | A single user generates extreme write load on one Cassandra partition, causing that node's disk and CPU to spike disproportionately. | Apply partition-key salting: append a shard suffix (0–3) to user_id to spread the user's data across 4 partitions. Reads union across all 4 partitions for that user. Only activate for monitored hot users — normal users use unsalted partition key. | L7/L8 |
| Fan-out worker: poison-pill Kafka event | A malformed post.published event (unknown poster_id, corrupt payload, or missing Graph DB record) causes the fan-out worker to fail on every retry attempt. After max retries the event lands in the DLQ; followers never see the post in their feeds. | Two-branch error handling: (1) transient errors (Redis timeout, Graph DB unavailable) — rely on Kafka at-least-once retry with exponential backoff; (2) unrecoverable errors (unknown user_id, schema mismatch) — log, emit a metric, and commit the Kafka offset without retrying to unblock the partition. A DLQ consumer alerts on-call and replays structurally valid events after root-cause resolution. Max retry count is configured per consumer group (typically 3 before DLQ). | L5 |
Security, rate limiting & compliance
Three security concerns map directly to components already in the architecture: abuse prevention on the upload path, illegal content detection in the image pipeline, and regulatory erasure requirements that span every storage tier. L6+ interviewers will probe all three explicitly.
Upload rate limiting
Without rate limiting, a single malicious client can exhaust S3 pre-signed URL generation capacity and drive disproportionate storage write cost. The API Gateway enforces a per-user rate limit on POST /photos — maximum 10 uploads per minute per authenticated user — implemented as a fixed-window Redis counter (INCR ratelimit:{user_id}:{minute_bucket} with a 60-second TTL), a cheap approximation of a token bucket. Requests exceeding the limit receive a 429 Too Many Requests response with a Retry-After header. A separate, higher limit (100/min) applies at the IP level to catch unauthenticated probing before authentication succeeds.
L5 probe: "How would you prevent a bot from uploading 1 M photos per hour?" — Rate limit at the API Gateway keyed on user_id, backed by the per-minute Redis counter. Secondary: S3 pre-signed URL expiry (15 min) limits replay attacks. Tertiary: content-hash deduplication at the Image Processor rejects bit-exact duplicates before any variant is stored.
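A runnable sketch of the per-user, per-minute counter; a plain dict stands in for Redis INCR with a 60-second TTL so the logic can be exercised locally (class and parameter names are illustrative):

```python
import time

class RateLimiter:
    """Fixed-window counter: at most `limit` requests per user per minute.

    In production the counter lives in Redis:
    INCR ratelimit:{user_id}:{minute_bucket}, with a 60 s TTL so stale
    buckets expire on their own. Here a dict models the same semantics.
    """

    def __init__(self, limit: int = 10):
        self.limit = limit
        self.counters = {}  # (user_id, minute_bucket) -> request count

    def allow(self, user_id: int, now=None) -> bool:
        ts = now if now is not None else time.time()
        bucket = int(ts // 60)              # minute window
        key = (user_id, bucket)
        count = self.counters.get(key, 0) + 1  # the INCR
        self.counters[key] = count
        return count <= self.limit          # False -> caller returns 429
```

The `now` parameter exists only so the window boundary can be tested deterministically; a real deployment uses the server clock and lets Redis TTLs garbage-collect old buckets.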
Perceptual hash deduplication and content moderation
The Image Processor runs two checks before writing variants to S3:
- Exact-hash deduplication: SHA-256 of the raw upload is checked against a store of previously seen object hashes (a disk-backed Bloom filter at full scale, or an exact Redis SET at smaller scale). Bit-exact duplicates skip variant generation and reuse existing CDN URLs. Important: a Bloom filter has no false negatives but can produce false positives — a false positive would silently deduplicate a genuinely unique image against someone else's photo. The implementation must include a fallback verification step: before reusing a CDN URL, confirm the candidate S3 object exists and its SHA-256 matches the new upload's hash. At 100 M uploads/day over 10 years (~365 B cumulative hashes), a Bloom filter targeting a ≤0.001% false-positive rate needs roughly 24 bits per element, on the order of 1 TB in total, which is why it must be disk-backed rather than held in RAM.
- Perceptual hash (pHash / PhotoDNA): The processor computes a perceptual hash of the image and checks it against a blocklist of known-illegal content hashes (CSAM, jurisdiction-required blocklists). A match aborts the job, tombstones the S3 object, and triggers a moderation queue entry. This check happens before any variant is written publicly: the image never reaches CDN. The blocklist is a small in-memory set (≤100 MB) loaded at worker startup and refreshed hourly.
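The standard Bloom filter sizing formulas — optimal bit count m = -n·ln(p)/(ln 2)² and hash count k = (m/n)·ln 2 — can be checked numerically with a short sketch:

```python
import math

def bloom_size_bytes(n: int, p: float) -> int:
    """Optimal Bloom filter size in bytes for n elements at false-positive rate p."""
    m_bits = -n * math.log(p) / (math.log(2) ** 2)
    return int(m_bits / 8)

def bloom_hashes(n: int, p: float) -> int:
    """Optimal number of hash functions for the same (n, p)."""
    m_bits = -n * math.log(p) / (math.log(2) ** 2)
    return round(m_bits / n * math.log(2))
```

At n = 365 billion and p = 10⁻⁵ this works out to roughly 24 bits per element (about 1.1 TB) and 17 hash functions, which is why the dedup filter has to live on disk rather than in memory.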
GDPR right-to-erasure pipeline
A user deletion request triggers a multi-stage erasure job implemented as a durable workflow (e.g., AWS Step Functions) to survive partial failures. The pipeline must reach every storage tier:
| Tier | Erasure action | SLA |
|---|---|---|
| PostgreSQL (users, follows) | Hard-delete user row; cascade to follows table | ≤24 hours |
| Cassandra (posts, post_counters) | Delete all rows WHERE user_id = X (posts partitioned by user_id). For post_counters (partitioned by post_id): first fetch the user's post IDs, then delete counter rows individually. Tombstones physically compact after gc_grace_seconds (default 10 days) — zero out COUNTER columns instead of deleting if a strict ≤7-day physical erasure SLA is required, and set gc_grace_seconds = 86400 on the post_counters table. | ≤7 days (logical); ≤10 days physical without config change |
| Redis (feed, likes, counts) | DEL feed:{user_id}; scan and DEL all likes:{post_id} keys for the user's posts | ≤1 hour |
| S3 (photo objects) | Batch DELETE all objects under the user's prefix; mark for permanent destruction | ≤7 days |
| CDN (edge caches) | Issue explicit purge API call (e.g., CloudFront Invalidation API) for all CDN URLs associated with the user's posts | ≤30 days (CDN propagation ~15 min after purge) |
| Kafka (event log) | Kafka retention is time-windowed (7 days default); post.published events age out automatically | ≤7 days |
CDN erasure gap: the most common L6 miss: CDN edge caches may continue serving a deleted photo for up to 30 days even after the S3 object is deleted, if no explicit CDN purge is issued. The erasure workflow must call the CDN purge API — not just delete the S3 object. Note: S3 returns 404 Not Found for deleted objects, not 410. For jurisdictions requiring erasure within 72 hours, a Lambda@Edge function intercepts requests for tombstoned photo IDs and returns 410 Gone explicitly — signaling to CDN nodes not to cache the response and surfacing the intentional removal to clients.
L7/L8 probe: "A user deletes their account. How do you guarantee their photos stop being served within 30 days across all 300+ CDN PoPs globally?" — Explicit CDN Invalidation API call in the erasure workflow, not TTL expiry. The purge propagates to all PoPs within ~15 minutes. For 72-hour jurisdictions: redesign the CDN URL to include a user-controlled token; on deletion, revoke the token at an origin-shield layer that returns 410 Gone — CDN nodes revalidate and stop serving the stale cached response.
How to answer by level
L3 / L4 SDE I / SDE II — Build a working system ›
- Identify the three domains: upload, social graph, feed
- Propose pre-signed URL upload pattern (binary off app servers)
- Describe fan-out-on-write as the baseline feed strategy
- Choose object store + CDN for image delivery
- Sketch a basic Post DB schema with user_id, timestamp, CDN URLs
- Mention Redis for feed caching
- Recognizing that uniform fan-out fails for celebrities
- Explaining why Cassandra fits the post write pattern better than PostgreSQL
- Tracing the latency budget: where does 200 ms go?
- Discussing image processing as async (not blocking the upload response)
L5 Senior SDE — Understand the tradeoffs ›
- Propose hybrid fan-out and explain the celebrity threshold
- Describe how the feed merge works (N-way sort of push + pull)
- Tie Redis feed cache to the 200 ms latency NFR explicitly
- Discuss Kafka for decoupling image processing, fan-out, and notifications
- Explain the access pattern table before the schema (§7 pattern)
- Address the processor failure mode and retry strategy
- Designing the full four-layer cache hierarchy with TTL reasoning
- Addressing CDN invalidation for photo deletion
- Explaining why random UUIDs (v4) are suboptimal for time-sorted Cassandra clustering keys (random distribution defeats range-scan locality), and proposing Snowflake-style 64-bit time-prefixed IDs instead
- Discussing multi-region deployment and which stores replicate synchronously vs async
L6 Staff SDE — Own the system end-to-end ›
- Full cache invalidation strategy per layer (CDN, feed, metadata, counts)
- Geo-distribution plan: which stores replicate where and at what consistency level
- Feed ranking integration: candidate set size, ranking model placement, latency budget
- Enumerate failure modes and propose circuit breakers and load shedding
- Cassandra hot partition mitigation via partition-key salting
- Framing the celebrity threshold as a tunable business parameter, not a fixed constant
- Proposing a feedback loop to adjust fan-out strategy dynamically based on observed write cost
- Discussing CDN cost model and how cache hit rate directly drives infrastructure spend
L7 / L8 Principal / Distinguished — Drive the architecture ›
- Frame every decision as an explicit tradeoff with business impact (celebrity fan-out cost vs latency improvement)
- Propose adaptive celebrity threshold: dynamically adjusted based on real-time write cost monitoring
- Discuss the Explore tab as a separate architecture problem (collaborative filtering, not follow graph)
- Address regulatory requirements: GDPR right-to-erasure pipeline across S3, Cassandra, Redis, CDN
- Propose self-healing cache rebuild after Redis failure without overwhelming downstream DBs
- Spending too much time on the upload path (solved problem) vs feed ranking
- Proposing synchronous cross-region consistency for the social graph (impossible at this scale)
- Ignoring CDN cost as an architectural constraint
- Not discussing the difference between the home feed (follow graph) and the Explore feed (recommendation)
Classic probes
| Question | L3/L4 | L5/L6 | L7/L8 |
|---|---|---|---|
| "How would you handle Beyoncé posting a photo?" | Fan-out to all followers (missing the cost problem) | Celebrity threshold → store once, pull on read → merge at feed read time | Adaptive threshold, cost model, monitoring, and when NOT to fan out even below threshold (trending moments) |
| "How do you deliver images to a user in Tokyo?" | Store in S3, return URL | CDN PoP in Tokyo; immutable URLs; 90%+ cache hit rate; origin is S3 multi-region | CDN cost model, cache hit rate sensitivity analysis, purge latency SLA on deletion, GDPR erasure timelines across CDN PoPs |
| "What happens if Redis goes down?" | "Fall back to the database" | Circuit breaker + load shedding; replica failover (<15s); DB fan-in for degraded feed during failover | Graduated degradation: replica → degraded feed → cached empty state; replay-from-Kafka feed rebuild without overwhelming Cassandra; impact analysis (X% of users see degraded feed for Y minutes) |
| "How would you add a ranked feed instead of chronological?" | "Sort by likes" (doesn't describe the pipeline) | Candidate set (200 post IDs from cache) → ranking service (ONNX/TF, <20ms for 200 candidates) → top-N returned; offline model training on engagement signals | Full ML pipeline: feature store, online serving latency budget, A/B testing infrastructure, ranking model cold start for new users, privacy constraints on signal collection |
- Rate Limiter System Design, atomic Redis operations, distributed race conditions, and multi-tier quota enforcement
- URL Shortener System Design, hash encoding tradeoffs, database sharding strategies, and viral key mitigation
- Web Crawler System Design, Bloom filter deduplication, politeness throttling, and distributed frontier design
- Twitter/X Feed System Design, fan-out write amplification, hybrid push/pull strategy, and celebrity threshold design
- Notification Service System Design, multi-channel delivery, idempotency keys, and priority queues at scale
- Search Autocomplete System Design, Trie data structures, prefix caching, and read-heavy scale strategies
- Key-Value Store System Design, Consistent hashing, quorum consensus, and SSTable fundamentals
- Chat System (WhatsApp) System Design, WebSocket management, transient vs persistent storage, and read receipts
- Video Streaming (YouTube) System Design, ABR streaming, CDN distribution, and metadata management
- Distributed Message Queue System Design, Kafka partition tuning, exactly-once delivery, and geo-replication
- File Storage (Dropbox / Google Drive) System Design, chunking, delta sync, conflict resolution, and global deduplication
- Ride-Sharing System Design (Uber / Lyft) — geohashing, WebSocket-driven location tracking, and ETA prediction
- Payment Processing System Design — idempotency keys, exactly-once semantics, and append-only ledger models
- Top-K Leaderboard System Design — Redis sorted sets, approximate counting, and stream aggregation
- Airbnb Booking & Reservation System — inventory locks, double-booking prevention, and async elasticsearch sync
- Proximity Search System Design (Yelp / Google Places) — geohash indexing, quadtree partitioning, and Bayesian review ranking
- Online Judge System Design — secure sandboxing, execution queues, and worker scaling