System Design Interview Guide

Design a Photo-Sharing Feed
like Instagram

Simple to describe, hard to scale: serving personalized feeds of images to two billion users without melting the database or the CDN budget.

L3/L4 — Build it · L5/L6 — Own the tradeoffs · L7/L8 — Drive the architecture

~35 min read · 11 sections · interactive estimator

01

What the interviewer is testing

All levels

"Design Instagram" is a canonical FAANG question because it compresses three fundamentally different hard problems into one: a high-throughput binary data pipeline (images), a social graph traversal problem (who follows whom), and a personalized feed assembly problem (what to show, in what order). Solving any one of them is straightforward. Solving all three while keeping latency under 200ms is the interview.

Problem | Why it's hard | What changes at scale
Image pipeline | Binary objects are orders of magnitude larger than text; uploading, transcoding, and delivering at scale requires a dedicated infrastructure path | Direct-to-object-store uploads bypass app servers; CDN absorbs nearly all read traffic
Social graph traversal | Fetching follower lists at write time (fan-out) or post lists at read time (fan-in) both become expensive at celebrity scale | Hybrid strategy: push for regular users, pull for celebrities
Feed ranking | Chronological feeds don't scale (too noisy); ranked feeds require ML inference on the assembled candidate set | Feed candidates pre-assembled; ranking applied as a lightweight pass
🎯

Level signal: L3/L4 candidates describe the upload and read paths correctly but treat all users identically (always fan-out-on-write). L5 candidates identify the celebrity problem and propose the hybrid strategy. L6 candidates reason about the consistency window — how stale is a feed acceptable — and tie it back to a specific NFR. L7/L8 candidates frame the ranking vs. freshness tradeoff as a business decision with product impact, not just a technical constraint.

02

Requirements clarification

All levels

The question "design Instagram" is deliberately vague. Scoping it correctly — and explaining why you scoped it this way — is itself part of the evaluation. Start here before touching architecture.

Functional requirements

Capability | In scope | Out of scope (for this interview)
Photo upload | Upload a photo with caption and tags; multi-format support (JPEG, PNG, HEIC) | Video (Reels), Stories (ephemeral content), carousel posts
Follow graph | Follow / unfollow users; public accounts only | Private accounts, approval workflows, close friends lists
Home feed | Paginated feed of recent posts from followed users | Explore tab, Reels feed, algorithmic ranking ML model internals
Interactions | Like a photo; comment on a photo | Saves, shares, DMs, polls, stickers
User profile | User's own post grid, follower/following counts | Highlights, bio link shortening, verified badge
Notifications | Like and follow notifications (async) | Real-time push infrastructure design

Non-functional requirements

NFR | Target | Why this level?
Feed load latency (p99) | < 200 ms (API response, excl. image download) | Users scroll continuously; any pause above ~300 ms registers as lag
Photo upload latency | < 5 s end-to-end (processed + CDN URL returned) | Upload completion is a hard perception threshold; beyond 5 s users abandon
Feed freshness | New posts appear in followers' feeds within 30 s | Near-real-time is the product expectation; batch updates are not acceptable
Like/comment consistency | Eventual consistency; counts accurate within ~5 s | Exact like counts are not safety-critical; slight lag is imperceptible
Availability | 99.99% (feed read path); 99.9% (upload path) | Feed reads are the core product; upload is less failure-sensitive (user can retry)
Scalability | 2 B MAU; 500 M DAU; 100 M photos/day uploaded | Instagram's reported 2024 scale
Storage durability | 11 nines (99.999999999%) for photo objects | Losing a user's photo is an irreversible trust failure

NFR reasoning

Feed load latency < 200 ms · Drives §4, §8, §9

200 ms p99 for a paginated feed response means the API must return metadata (post IDs, image URLs, like counts) without waiting for image downloads. Image bytes flow separately through the CDN. The 200 ms budget is split roughly: 10 ms feed cache hit → 80 ms feed assembly from Redis → 110 ms network. Hitting this with a database query that joins posts, users, and likes across millions of rows per user is not feasible: the feed must be pre-assembled.

What this drives: Pre-computed feed caches in an in-memory store (Redis) are required. This is not an optimisation — it's a structural prerequisite for the latency target.
Feed freshness within 30 seconds · Drives §5, §6

30-second freshness means fan-out to follower feed caches must complete within 30 seconds of a post being uploaded. For a user with 1,000 followers, this is trivially achievable in a single async batch. For a celebrity with 100 million followers, fanning to 100M Redis keys within 30s is technically possible (50–100 workers consuming parallel Kafka partitions can sustain ~3–10M writes/sec), but it is economically infeasible — provisioning that burst capacity for a single post event wastes massive infrastructure that sits idle 99.99% of the time. The celebrity exception is therefore a cost-efficiency decision, not a physical impossibility, and it forces the hybrid fan-out strategy in §5.

Tradeoff: Celebrity posts may appear in follower feeds with a slight additional delay (seconds, not minutes) due to the pull-on-read path. This is an accepted product tradeoff — Instagram users don't notice a 15-second difference in Beyoncé's post appearing. The celebrity threshold should be set where provisioned fan-out cost exceeds pull-on-read overhead — not at a fixed follower count.
11-nines durability for photos · Drives §4, §7

Photos are unique user-generated content — they cannot be regenerated. Unlike text posts that could potentially be re-entered, a lost photo is permanently lost. 11-nines durability (≈ 0.000000001% annual loss probability) means storing photos in object storage (Amazon S3 or equivalent) that replicates across multiple availability zones by default. The application layer never stores binary image data; it only stores CDN URLs pointing to the object store.

What this drives: Application servers must never be in the upload data path for binary data. Clients upload directly to object storage using pre-signed URLs to bypass app tier bottlenecks and achieve the necessary durability guarantees.
03

Capacity estimation

L3/L4+

Instagram's load profile is dramatically read-heavy. For every photo uploaded, it's viewed hundreds to thousands of times. The write path (uploads) is a relatively thin stream; the read path (feed loads, profile views, photo fetches) is the scale challenge to design for. Image data adds a critical second dimension: storage and bandwidth numbers dwarf what you'd see in a text-only system.

Interactive capacity estimator

[Interactive estimator — default inputs: 100 M photo uploads/day · 200× read multiplier · 200 KB average photo · 10-year horizon. Computed outputs: upload write QPS (avg uploads/sec), peak upload QPS (3× avg), feed read QPS (feed requests/sec), raw image storage added per day, total photo storage including 2× cross-region replication, and CDN outbound bandwidth per day.]

⚠️ Calibration note: The feed read QPS multiplier is calibrated so that 200× at 100 M uploads/day produces 500 M DAU × 40 feed loads/day ÷ 86,400 s ≈ 231 K req/sec. In practice, feed read QPS is a function of DAU and session behaviour — it does not scale proportionally with upload volume. If you adjust the upload count, re-calibrate this multiplier inversely to preserve the DAU-anchored read rate.

💡

Key insight: At 100 M uploads/day, photo storage grows by ~20 TB/day (100M × 200 KB). Over 10 years that's ~73 PB of raw storage — before size variants. With three size variants (thumbnail at ~10%, feed at ~40%, full at 100% of original), the primary storage footprint is ~1.5× raw = ~109 PB. The estimator's 200–220 PB figure accounts for S3 cross-region replication (two geographic copies for multi-region availability), bringing the total logical storage to ~109 PB × 2 ≈ 220 PB. Note: S3 handles within-AZ redundancy transparently; the 2× factor here is cross-region replication, which you'd configure explicitly for a global platform. This number immediately tells you that object storage with CDN offloading is not optional: it's the only cost-feasible architecture. Storing 220 PB at ~$0.023/GB-month is the manageable part; serving the read traffic directly from origin without a CDN, at internet egress rates of roughly $0.05–0.09/GB, would cost thousands of dollars per minute in bandwidth alone.
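
A quick back-of-the-envelope script reproduces these figures (a minimal sketch; the constants are the estimator defaults and the variant/replication factors stated above):

# Capacity back-of-the-envelope (constants mirror the estimator defaults above)
UPLOADS_PER_DAY    = 100e6      # photos/day
AVG_PHOTO_BYTES    = 200e3      # 200 KB average original
DAU                = 500e6
FEED_LOADS_PER_DAU = 40
VARIANT_FACTOR     = 1.5        # thumbnail (~10%) + feed (~40%) + full (100%)
REPLICATION        = 2          # cross-region copies
YEARS              = 10

upload_qps  = UPLOADS_PER_DAY / 86_400                      # ~1,160/sec average
peak_qps    = 3 * upload_qps                                # ~3,500/sec peak
feed_qps    = DAU * FEED_LOADS_PER_DAU / 86_400             # ~231,000 req/sec
daily_bytes = UPLOADS_PER_DAY * AVG_PHOTO_BYTES             # ~20 TB/day raw
total_bytes = daily_bytes * 365 * YEARS * VARIANT_FACTOR * REPLICATION

print(f"upload QPS avg / peak : {upload_qps:,.0f} / {peak_qps:,.0f}")
print(f"feed read QPS         : {feed_qps:,.0f}")
print(f"storage added per day : {daily_bytes / 1e12:.0f} TB")
print(f"10-yr photo storage   : {total_bytes / 1e15:.0f} PB (variants + replication)")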

Dimension | Estimate | Key insight
Upload write rate | ~1,160 uploads/sec (avg); ~3,500/sec peak | Manageable; the upload path is not the bottleneck
Feed read rate | ~230,000 req/sec | This is the dominant load — requires pre-computed feed caches and CDN
Photo storage (10 yr) | ~220 PB (with 3 size variants) | Forces object store; no block storage or filesystem can sustainably operate at this scale
CDN bandwidth | ~40–80 Tbps theoretical peak | Only achievable via global CDN PoPs with cache hit rates ≥ 90%
Feed metadata DB | ~500 GB/day new rows | Post metadata (caption, user_id, timestamp, CDN URLs) is tiny compared to images
Peak multiplier | 3–5× steady state | Major events (New Year's Eve, World Cup) spike uploads and feed reads simultaneously
04

High-level architecture

L3/L4+

The architecture splits cleanly into three planes. The upload plane handles inbound binary data: client → object store → processing pipeline → metadata write. The read plane handles feed and photo delivery: client → CDN → feed cache → API servers → database for cache misses. The social graph plane manages follow relationships and drives the feed fan-out process that connects uploads to reads. A durable message queue (Kafka) decouples all three.

[Figure 1 diagram — Client (web/app) → CDN (photo delivery) and API Gateway (LB + auth) → Upload API (pre-signed URL), Feed API (assemble + rank), Social API (follow/graph) → Object Store (S3, 11-nines durability), Image Processor (resize → 3 variants), Post DB (Cassandra, writes), Feed Cache (Redis feed lists), Graph DB (follows table), Kafka (post.published), Fan-out Worker (push to caches). Legend distinguishes synchronous from asynchronous paths.]
Figure 1 — High-level architecture. Upload plane (top), feed read plane (middle), social graph plane (bottom). Kafka decouples image processing from fan-out; CDN absorbs photo delivery traffic.

Component breakdown

CDN (content delivery network) sits in front of all photo delivery. It caches the three image variants (thumbnail, feed-resolution, full-resolution) at edge nodes globally. Once a photo is cached at a PoP, subsequent requests for that URL cost zero origin bandwidth. For a platform with billions of daily photo fetches, CDN cache hit rates ≥90% are the difference between a viable and an unviable bandwidth bill.

API Gateway / Load Balancer is the single entry point for all non-image traffic: authentication token validation, rate limiting (per-user and per-IP), and routing to the three service clusters (Upload API, Feed API, Social API). It does not handle image bytes — those go directly from the client to object storage via pre-signed URL.

Upload API handles the upload initiation flow. On a POST /photos/upload request, it validates the request, generates a pre-signed URL pointing at object storage (S3), and returns it to the client. The client then uploads the binary data directly to S3 without touching the Upload API server again. After the S3 upload completes, the client notifies the Upload API, which writes a pending record to the Post DB and enqueues a processing job. This architecture keeps multi-hundred-KB binary data off the application tier entirely.
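
A minimal sketch of the pre-signed URL step with boto3 (the bucket name, key scheme, and expiry are illustrative assumptions; error handling is omitted):

import boto3

s3 = boto3.client("s3")

def create_upload_url(photo_id: str, content_type: str) -> dict:
    # The client PUTs the bytes to this URL directly, so the Upload API
    # never handles image binary data.
    url = s3.generate_presigned_url(
        "put_object",
        Params={
            "Bucket": "ig-uploads",            # assumed bucket name
            "Key": f"originals/{photo_id}",    # assumed key scheme
            "ContentType": content_type,       # client must send the same Content-Type header
        },
        ExpiresIn=15 * 60,                     # 15-minute window, matching the API example below
    )
    return {"photo_id": photo_id, "upload_url": url, "upload_method": "PUT"}

The signature binds the URL to that key, HTTP method, and content type, so the client must send the matching Content-Type header when it PUTs the bytes.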

Image Processor is a stateless worker pool that consumes from an object-store event trigger (S3 event notification or SQS queue). For each uploaded photo, it generates three resolution variants — thumbnail (150×150px), feed resolution (640×640px), and full resolution (1080×1080px) — and writes them back to object storage with stable, CDN-friendly URLs. Once processing completes, it writes the final CDN URLs to the Post DB and publishes a post.published event to Kafka.

Feed API assembles the home feed for a requesting user. It first checks the user's pre-computed feed list in the Redis Feed Cache. On a cache hit, it hydrates post metadata (cached separately) and returns the response. On a miss (new user, cache evicted, or first request), it falls back to a database fan-in query: fetch the user's followed list from Graph DB, query Post DB for recent posts from each followee, merge and sort. Cache misses are expensive but rare in steady state. Hydration avoids N+1: metadata for all ~20 posts in a page is fetched in a single Redis pipeline (MGET post:{id1} post:{id2} … or pipelined HGETALL calls) rather than 20 serial round-trips. Cache misses within the batch trigger a parallel async read from Cassandra, with results written back to Redis before the response returns.
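
A sketch of that hydration step, assuming post metadata lives in Redis hashes keyed post:{id} (the Cassandra fallback helper is hypothetical):

import redis

r = redis.Redis(decode_responses=True)

def hydrate_posts(post_ids: list[str]) -> list[dict]:
    # One pipelined round-trip for the whole page instead of ~20 serial HGETALLs.
    pipe = r.pipeline(transaction=False)
    for pid in post_ids:
        pipe.hgetall(f"post:{pid}")
    rows = pipe.execute()

    hydrated, misses = [], []
    for pid, row in zip(post_ids, rows):
        if row:
            hydrated.append(row)
        else:
            misses.append(pid)
    # Misses are fetched from Cassandra in parallel and written back to Redis:
    # hydrated += fetch_from_cassandra_and_backfill(misses)   # hypothetical helper
    return hydrated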

Social API manages the follow graph: creating and deleting follow relationships, returning follower/following counts, and handling the Graph DB writes that trigger fan-out. When a user follows someone, the Graph DB record is written synchronously; feed cache updates happen asynchronously.

Post DB (wide-column store, Cassandra) stores post metadata: photo ID, user ID, CDN URLs for all three variants, caption, timestamp, like count (approximate), comment count. Cassandra's append-friendly write path handles the high upload write rate without locking. Reads are by (user_id, time_range), which maps cleanly to Cassandra's partition + clustering key model.

Graph DB (relational, PostgreSQL or sharded MySQL) stores the follow graph as a follows table with (follower_id, followee_id, created_at). Both directions are indexed because the system needs to answer both "who do I follow?" (write fan-out) and "who follows me?" (notification). At very large scale (billions of edges), this migrates to a dedicated graph service (Meta's TAO), but for interview purposes a sharded relational store is the correct starting point.

Feed Cache (Redis sorted sets) holds one per-user sorted set of post IDs, ordered by timestamp. Each post_id maps to a post metadata hash. The Fan-out Worker appends new post IDs to followers' sorted sets; Feed API pops the top N. A TTL evicts stale feeds: active users' caches are refreshed continuously, while inactive users' caches expire and are rebuilt on next login.

Kafka event pipeline receives post.published events from the Image Processor and follow.created events from the Social API. The Fan-out Worker subscribes and executes the feed propagation logic. The Notification Service subscribes to the same stream for independent like and follow notifications. This decoupling means a slow fan-out for a celebrity post doesn't delay photo processing or the API response.

Notification Service is a separate stateless consumer group on Kafka that processes post.published, like.created, and follow.created events. For each event, it resolves the target user's push token (APNs for iOS, FCM for Android) from a Device Registry store and enqueues a push notification via the respective platform gateway. Two reliability constraints matter here: (1) at-least-once delivery: Kafka consumer group commit-after-ack ensures no notification is silently dropped on worker crash; (2) deduplication — if a post triggers fan-out to 10,000 followers and the worker retries, the same notification must not be sent twice. A Redis set (notif_sent:{event_id}:{user_id}, TTL 24 h) acts as an idempotency gate before the push is dispatched. In-app notifications are written to a notifications table and fetched on app open — no real-time socket needed for the baseline design. (Scale note: at 500 M DAU a Cassandra table — notifications_by_user, partitioned by recipient_user_id, clustered by created_at DESC — handles concurrent per-user appends more efficiently than a PostgreSQL table, which requires explicit partitioning by user_id to avoid hot-row contention at this write rate.)
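
A sketch of that idempotency gate; the key layout follows the notif_sent:{event_id}:{user_id} convention above, and the push-gateway call is a placeholder:

import redis

r = redis.Redis()
DEDUP_TTL_S = 24 * 3600          # 24 h, matching the TTL above

def maybe_send_push(event_id: str, user_id: str, payload: dict) -> bool:
    # SET NX succeeds only for the first worker claiming this (event, user) pair;
    # redeliveries and retries see the existing key and skip the dispatch.
    first_claim = r.set(f"notif_sent:{event_id}:{user_id}", 1, nx=True, ex=DEDUP_TTL_S)
    if not first_claim:
        return False
    dispatch_push(user_id, payload)   # placeholder for the APNs/FCM gateway call
    return True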

Fan-out Worker is a horizontally scalable consumer pool that receives post.published events and decides whether to push (regular user) or skip (celebrity — readers will pull). For push-eligible posts, it reads the poster's follower list from Graph DB in paginated batches and writes the post ID to each follower's Redis sorted set via a pipeline command. The worker itself is stateless; Kafka consumer group semantics handle parallel processing across workers.

🔍

Hashtag and search index: The post.published Kafka event carries the caption and tags array. A dedicated Search Consumer (separate Kafka consumer group) reads this event and writes an index document to Elasticsearch: { post_id, user_id, caption, tags[], location, published_at }. Hashtag queries (GET /hashtag/{tag}) hit the Elasticsearch index directly — Cassandra is never queried for this. The Elasticsearch cluster is sized independently of the post write path because index writes are asynchronous and latency-tolerant. At interview depth, naming this consumer pattern and the Kafka-to-ES fanout is an L5 signal; discussing index mapping strategy (keyword vs. text analyzer for caption search) is L6.

Architectural rationale

Why upload directly to object storage and not through the app servers? Upload path

At 100 M uploads/day (~1,160/sec average, 3,500/sec peak), routing binary payloads (avg 5–10 MB pre-compression) through application servers would require massive server fleets for network I/O alone — completely separate from any compute cost. A pre-signed URL allows the client to push binary data directly to S3 over a parallel TCP connection while the app server does nothing. S3's direct-upload throughput is effectively unbounded per account. Application servers stay small, stateless, and CPU-bound rather than network-I/O-bound.

Tradeoff: Pre-signed URL flow adds a round-trip (client → API server → client, then client → S3). On slow mobile connections this can increase perceived upload start latency. Mitigated by generating URLs speculatively during photo selection, before the user taps "Share".
Alternatives: Multipart upload via app servers · Chunked upload with resumability
Why Cassandra for Post DB rather than PostgreSQL? Storage choice

Post writes — 100 M/day, or ~1,200 writes/sec sustained — are append-only inserts with no updates (except like/comment count increments, handled separately). Reads are almost always time-range queries by user_id: "give me the 20 most recent posts from this user." This maps directly to Cassandra's data model: partition key = user_id, clustering key = created_at (descending). Range scans are a single partition read — extremely fast with no cross-node joins. PostgreSQL can serve this workload, but requires careful sharding and index tuning to avoid fan-out queries across shards. Cassandra's write path is also cheaper: LSM-tree with commit log, no locking.

Tradeoff: Cassandra's secondary indexes are limited and poorly suited to ad-hoc, cross-partition queries. Searching posts by hashtag or location requires a separate search index (Elasticsearch). Cassandra is chosen for the hot write+read path; other query patterns are served from other stores.
Alternatives: PostgreSQL (sharded) · DynamoDB · Bigtable
Why Kafka rather than directly writing to Redis from the Image Processor? Decoupling

Fan-out for a celebrity's post may touch 100 million Redis keys. This cannot complete synchronously within the upload response time budget. Kafka buffers the event; the Fan-out Worker scales horizontally to process the fan-out asynchronously. The Image Processor commits once (fast) and is done. Kafka also enables independent consumers: the Notification Service and Analytics pipeline consume the same post.published event without coupling to the fan-out timeline.

Alternatives: SQS + Lambda fan-out · Direct Redis write in processor

Real-world comparison

Decision | This design | Instagram (reported) | Twitter/X
Image storage | S3 + CDN (CloudFront) | S3 + Facebook CDN | S3 + Fastly CDN
Feed generation | Hybrid fan-out (push regular, pull celebrity) | Hybrid; celebrity threshold ~10k followers | Hybrid; "power users" pulled on read
Post metadata store | Cassandra (append-friendly) | PostgreSQL historically; migrated to distributed stores | MySQL historically; migrated to Manhattan (distributed)
Social graph | Sharded relational (follows table) | TAO (Facebook graph service) | Custom sharded MySQL (FlockDB historically)
Feed cache | Redis sorted sets per user | Redis at massive scale | Timeline cache in Redis
🌐

There is no universally correct fan-out threshold — Instagram, Twitter, and Pinterest have all settled on different celebrity follower thresholds (ranging from 10k to 1M) based on their specific follower distribution curves and Redis write cost constraints. The right threshold follows from profiling your actual dataset, not from first principles.

05

Core algorithm — feed generation strategy

L5+

Before designing the feed read path, you have to answer a fundamental question: when a user posts a photo, when does that photo enter their followers' feeds — at write time or at read time? This single decision drives the fan-out architecture, the feed latency design, the Redis data model, and the special handling for celebrity accounts.

The tension is simple: making feeds fast to read (pre-computed) makes posts expensive to write (must update millions of feed caches). Making posts cheap to write (store once) makes feed reads expensive (must dynamically assemble from scratch). Neither extreme works at Instagram scale. Two pure approaches and one hybrid dominate the space.

[Figure 2 panels]
① Fan-out-on-write — push the post to each follower's feed at post time. Feed read: O(1) Redis range. Write: O(followers) — expensive; a celebrity means 100 M writes per post. Best for posters with fewer than ~10k followers.
② Fan-out-on-read — store the post once; pull from each followee at read time. Write: O(1) — very cheap. Read: O(followees) — expensive, especially for users who follow many accounts. Best for high-follower posters.
③ Hybrid (our choice) — push for regular users, pull for celebrities: if poster.followers < threshold → fan-out-on-write; else store once and pull on read, merging at read time. Write is O(followers) only below the threshold; read is an O(1) push-cache fetch plus an O(celebrity_count) pull and merge. Handles 99.9% of users via the push path and eliminates the celebrity thundering herd.
Figure 2 — Feed generation strategies. Fan-out-on-write (①) and fan-out-on-read (②) both fail at the extremes. The hybrid (③) is our choice: push for regular users, pull and merge for celebrities.

Our choice for this system: Use the hybrid fan-out (③). Posts from users with fewer than ~10,000 followers are pushed to all follower feed caches asynchronously within seconds. Posts from celebrity accounts (above threshold) are stored once in the Post DB. When reading the feed, the Feed API merges the user's pre-computed push cache with a fresh pull of the latest N posts from each celebrity they follow, then sorts by timestamp before applying ranking. The merge is a simple N-way merge of sorted lists — efficient even for users who follow dozens of celebrities.

Fan-out worker implementation: determining celebrity status
# Fan-out worker (pseudocode)
def handle_post_published(event):
    post_id   = event['post_id']
    poster_id = event['user_id']

    # Check if poster is celebrity (cached in Redis)
    follower_count = get_cached_follower_count(poster_id)

    if follower_count <= CELEBRITY_THRESHOLD:  # e.g. 10,000
        # Fan-out-on-write: push to all followers in batches
        fan_out_to_followers(poster_id, post_id, event['ts'])
    # else: celebrity — no fan-out; pulled at read time

def fan_out_to_followers(poster_id, post_id, ts):
    page_token = None
    while True:
        followers, next_token = get_followers_page(poster_id, page_token)
        # Redis pipeline: ZADD feed:{follower_id} score=timestamp post_id
        with redis.pipeline() as pipe:
            for fid in followers:
                pipe.zadd(f'feed:{fid}', {post_id: ts})
                # Trim batched with ZADD in same pipeline (single round-trip); no-op if set < MAX_FEED_SIZE
                pipe.zremrangebyrank(f'feed:{fid}', 0, -(MAX_FEED_SIZE+1))
            pipe.execute()
        if not next_token: break
        page_token = next_token

# Feed read: merge push cache + celebrity pulls
def get_home_feed(user_id, page_size=20, cursor_ts=None):
    # Step 1: get pre-computed feed from Redis (cursor-aware)
    # Use ZREVRANGEBYSCORE with the cursor timestamp as upper-bound score — not ZREVRANGE by rank.
    # ZREVRANGE offsets shift as new posts are prepended, causing duplicate/skipped items on page 2+.
    max_score = f'({cursor_ts}' if cursor_ts else '+inf'  # '(' = exclusive bound: don't repeat the cursor post
    pushed = redis.zrevrangebyscore(f'feed:{user_id}', max_score, '-inf', start=0, num=page_size * 3)

    # Step 2: pull latest posts from followed celebrities
    celebrities = get_followed_celebrities(user_id)
    pulled = [get_recent_posts(c_id, n=10) for c_id in celebrities]

    # Step 3: merge + sort + rank + truncate
    candidates = merge_sorted(pushed, *pulled)
    return rank_and_truncate(candidates, page_size)

The ZREMRANGEBYRANK call trims each feed cache to a maximum size (e.g., 3,000 entries) to bound memory consumption. Users who haven't opened the app in weeks will simply get a longer cold-start query that rebuilds from the database — acceptable because this is not a hot path.
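
One way to implement the merge_sorted helper called in the pseudocode above, assuming each candidate is a (timestamp_ms, post_id) pair — e.g., the Redis read done with withscores=True:

import heapq

def merge_sorted(*candidate_lists):
    """N-way merge of newest-first lists of (timestamp_ms, post_id) tuples."""
    # heapq.merge with reverse=True expects each input already sorted descending,
    # which holds for both the Redis push cache and the per-celebrity pulls.
    return list(heapq.merge(*candidate_lists, key=lambda item: item[0], reverse=True))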

💬

One open question: Where exactly is the celebrity threshold? It's not a fixed architectural constant — it's a tuned operational parameter based on the actual follower distribution and Redis write throughput limits. At Instagram scale (~1B+ accounts, with ~0.001% having over 1M followers), nearly all write load is handled by the push path. The threshold should be set low enough that the pull path covers only accounts where fan-out clearly exceeds write budget, and high enough that the merge overhead at read time stays negligible.

05b

API design

L3/L4+

Two endpoints dominate the hot path: the upload initiation endpoint and the feed read endpoint. Both are called from mobile clients where latency is compounded by cellular network variability.

POST /photos — initiate upload

Returns a pre-signed URL for direct-to-S3 upload. The actual binary data never flows through this endpoint. The upload itself uses PUT (not POST) against S3, because S3 pre-signed URLs for object creation use PUT semantics. The client is responsible for following up with POST /photos/{id}/complete after the S3 upload finishes.

// Request
{
  "filename": "IMG_2047.HEIC",
  "content_type": "image/heic",
  "size_bytes": 4721038,
  "caption": "Sunset at Big Sur 🌅",
  "location": { "lat": 36.27, "lon": -121.81 },
  "tags": ["#sunset", "#california"]
}

// Response 200 OK (pre-signed URL for direct S3 upload)
{
  "photo_id": "ph_8Xk2mNqLp9vW",
  "upload_url": "https://s3.amazonaws.com/ig-uploads/ph_8Xk2mNqLp9vW?X-Amz-Signature=...",
  "upload_method": "PUT",
  "upload_headers": {
    "Content-Type": "image/heic",
    "Content-Length": "4721038"
  },
  "expires_at": "2026-04-24T21:15:00Z"  // 15-min window
}

POST /photos/{id}/complete — finalize after upload

// Request (no body required — server validates S3 object existence)

// Response 202 Accepted (processing async)
{
  "photo_id": "ph_8Xk2mNqLp9vW",
  "status": "processing",
  "estimated_ready_ms": 4000,
  "poll_url": "/photos/ph_8Xk2mNqLp9vW/status"
}

// GET /photos/{id}/status — once processing completes:
{
  "photo_id": "ph_8Xk2mNqLp9vW",
  "status": "published",
  "urls": {
    "thumbnail": "https://cdn.intervu.dev/ph_8Xk2mNqLp9vW/t.jpg",
    "feed": "https://cdn.intervu.dev/ph_8Xk2mNqLp9vW/f.jpg",
    "full": "https://cdn.intervu.dev/ph_8Xk2mNqLp9vW/o.jpg"
  },
  "published_at": "2026-04-24T20:58:34Z"
}

GET /feed — home feed

Cursor-based pagination (no offset). The cursor encodes the last-seen timestamp so subsequent pages are stable even as new posts arrive. Returns metadata only — image bytes are fetched separately by the client via CDN URLs. Request validates authentication via bearer token in the Authorization header.

// Request: GET /feed?limit=20&cursor=eyJ0cyI6MTc0NTUzMTg5Mn0

// Response 200 OK
{
  "posts": [
    {
      "post_id": "ph_8Xk2mNqLp9vW",
      "user": {
        "user_id": "u_4Rm9pXzL",
        "username": "juanita.m",
        "avatar_url": "https://cdn.intervu.dev/avatars/u_4Rm9pXzL.jpg"
      },
      "caption": "Sunset at Big Sur 🌅",
      "image_urls": {
        "thumbnail": "https://cdn.intervu.dev/ph_8Xk2mNqLp9vW/t.jpg",
        "feed": "https://cdn.intervu.dev/ph_8Xk2mNqLp9vW/f.jpg"
      },
      "like_count": 1842,
      "comment_count": 37,
      "liked_by_viewer": false,
      "created_at": "2026-04-24T20:58:34Z"
    }
  ],
  "next_cursor": "eyJ0cyI6MTc0NTQ4NjIxMH0",
  "has_more": true
}
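
The opaque cursor in these examples is just a base64url-encoded JSON object carrying the last-seen timestamp. A sketch of the encode/decode pair (the ts field name is inferred from the example value):

import base64
import json

def encode_cursor(last_seen_ts: int) -> str:
    raw = json.dumps({"ts": last_seen_ts}, separators=(",", ":")).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

def decode_cursor(cursor: str) -> int:
    padded = cursor + "=" * (-len(cursor) % 4)      # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(padded))["ts"]

# encode_cursor(1745531892)                -> "eyJ0cyI6MTc0NTUzMTg5Mn0"
# decode_cursor("eyJ0cyI6MTc0NTUzMTg5Mn0") -> 1745531892
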
Endpoint | Method | Level | Notes
POST /photos | POST | L3/L4 | Upload initiation; returns pre-signed S3 URL
POST /photos/{id}/complete | POST | L3/L4 | Trigger async processing pipeline
GET /feed | GET | L3/L4 | Cursor-paginated home feed
POST /likes/{post_id} | POST | L3/L4 | Like a post; idempotent
POST /follow/{user_id} | POST | L5 | Follow a user; triggers fan-out recalculation if threshold crossed
GET /photos/{id}/status | GET | L5 | Poll-based processing status; L5 discusses webhook alternative. Clients must implement exponential backoff (1 s → 2 s → 4 s → 8 s cap) — aggressive polling at 100 M uploads/day constitutes a self-inflicted DDoS on the status tier.
GET /users/{id}/feed | GET | L5 | Profile grid; different query pattern from home feed
GET /explore | GET | L7/L8 | Ranked non-following content; requires separate ML recommendation pipeline
📱

Large-file resumability — S3 Multipart Upload: For files larger than 5 MB (which includes all original HEIC uploads from modern iPhones, typically 4–8 MB), the pre-signed URL flow is extended to use S3 Multipart Upload. The Upload API creates a Multipart Upload session and returns pre-signed part URLs for each 5 MB chunk. The client uploads parts in parallel or sequentially; if connectivity drops, it resumes from the last successfully uploaded part using its ETag. After all parts are uploaded, the client calls CompleteMultipartUpload (or the Upload API does it on /photos/{id}/complete). Incomplete multipart uploads older than 7 days are cleaned up by an S3 lifecycle policy to avoid orphaned storage costs. This is an L5 probe: "What happens if the user's phone loses connection mid-upload of a 20 MB raw photo?"
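
A sketch of the multipart initiation with boto3 (bucket, key scheme, part size, and expiry are illustrative assumptions; error handling and the completion call are omitted):

import boto3

s3 = boto3.client("s3")
PART_SIZE = 5 * 1024 * 1024          # S3 minimum part size (all parts except the last)

def start_multipart_upload(photo_id: str, content_type: str, size_bytes: int) -> dict:
    key = f"originals/{photo_id}"                          # assumed key scheme
    mpu = s3.create_multipart_upload(
        Bucket="ig-uploads", Key=key, ContentType=content_type)
    part_count = -(-size_bytes // PART_SIZE)               # ceiling division

    # One pre-signed URL per part; the client uploads parts in parallel and can
    # resume any part independently, keeping the returned ETags for completion.
    part_urls = [
        s3.generate_presigned_url(
            "upload_part",
            Params={"Bucket": "ig-uploads", "Key": key,
                    "UploadId": mpu["UploadId"], "PartNumber": n},
            ExpiresIn=3600,
        )
        for n in range(1, part_count + 1)
    ]
    return {"photo_id": photo_id, "upload_id": mpu["UploadId"], "part_urls": part_urls}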

🔒

Pre-signed URL security — enforcing content constraints: A pre-signed URL grants any bearer the right to PUT to that S3 key. Without additional controls, a client could upload a 10 GB binary or a PDF instead of a JPEG. Two enforcement layers prevent this: (1) Upload policy conditions: the Upload API embeds a content-length-range condition (e.g., 1 byte to 50 MB) and locks the signed Content-Type header to the declared MIME type (e.g., image/heic). S3 rejects PUTs that violate either condition at the bucket level. (2) Image Processor validation: before generating any variant, the processor validates the file magic bytes against the declared MIME type and rejects non-image uploads. The S3 object is tombstoned and the job fails cleanly. This is a common L5 probe: "What stops a client from uploading an executable to your S3 bucket using the pre-signed URL?"

06

Core flow — photo upload & feed read

L3/L4+

Two flows dominate the system. The upload flow is write-critical and must preserve every photo; the feed read flow is latency-critical and must return in under 200ms. They interact through Kafka: the upload flow publishes events; the feed read flow consumes pre-built caches that the fan-out worker produced from those events.

Upload flow

[Figure 3 steps] ① Client: POST /photos (caption, content_type, size) → ② Upload API returns a pre-signed URL (15-min expiry window) → ③ Client PUTs the binary directly to S3, bypassing app servers entirely → ④ Client: POST /photos/{id}/complete to signal the upload finished → ⑤ Image Processor generates 3 variants (150 px / 640 px / 1080 px) back to S3 → ⑥ metadata written to Post DB (CDN URLs, caption, user_id, ts) → ⑦ post.published published to Kafka; the fan-out worker picks it up.
Figure 3 — Upload flow. Steps ①–④ are synchronous; ⑤–⑦ are async. The client app can show the photo immediately using the local file while processing completes.
🔑

Key decision — why write metadata after processing (step ⑥), not before? This is driven by the latency NFR from §2. If metadata is written before processing, the feed would contain post entries with no CDN URLs — resulting in broken images for the 3–5 seconds the processor takes. The tradeoff: if the processor crashes between the S3 write and the DB metadata write, the photo object is orphaned in S3 and the user sees an upload error rather than a broken post. A reconciliation job (< 1 min scan) detects orphaned objects and retries or notifies. This is a far better failure mode than showing broken images.

Feed read flow

[Figure 4 steps] ① Client: GET /feed?cursor=… → ② Feed API checks the Redis feed cache (ZREVRANGE feed:{user_id}). HIT: skip to step ⑥, ~10 ms total. MISS (new or inactive user): ③ query Post DB for followee posts (fan-in: N queries, parallelized) → ④ backfill the Redis cache from the results so the next request is a HIT → ⑤ pull celebrity posts (if any), merge with the push cache (N-way sort) → ⑥ return the feed JSON to the client.
Figure 4 — Feed read flow. Cache hit path (top-left branch) returns in ~10 ms. Cache miss path (vertical) falls back to database fan-in and backfills cache. Celebrity posts are always pulled fresh and merged.
07

Data model

L5+

The data model has three distinct domains — posts, users/graph, and engagement (likes, comments) — each with very different access patterns. Those differences end up shaping which store each entity lives in and how it's keyed.

Access patterns

Operation | Frequency | Query shape
Insert new post | ~1,200/sec avg | Write by (user_id, post_id); no joins
Fetch user's recent posts (profile grid) | High (every profile visit) | Range scan: WHERE user_id = X ORDER BY ts DESC LIMIT 12
Fetch post by ID (feed hydration) | Very high (every feed load) | Point read: WHERE post_id = X
Get followers of user X | High (fan-out trigger) | Index scan: WHERE followee_id = X
Get users followed by X | High (feed miss fallback) | Index scan: WHERE follower_id = X
Like count for post X | Very high (every feed render) | Point read — must be precomputed
Has viewer liked post X? | High (per feed item) | Point read: (post_id, user_id)

Two things jump out from this table. First, the post read path is predominantly by (user_id, time) for profile grids and by (post_id) for feed hydration — two different primary access patterns that suggest two indexes. Second, like counts and "has viewer liked" are read on every feed render; they must be precomputed, not computed on-the-fly from a likes table at read time.

Schema

[Figure 5 schema detail]
posts (Cassandra) — partition key: user_id; clustering keys: created_at DESC, post_id; columns: url_thumb TEXT, url_feed TEXT, url_full TEXT, caption TEXT, location GEO, tags LIST; counts live in the post_counters table.
users (PostgreSQL) — pk: user_id BIGINT; username TEXT UNIQUE; display_name TEXT; avatar_url TEXT; bio TEXT; counts and celebrity flag live in Redis.
follows (PostgreSQL) — PK (follower_id, followee_id); follower_id BIGINT FK; followee_id BIGINT FK; created_at TIMESTAMPTZ; indexed on (follower_id) and on (followee_id).
likes (Redis SET) — key: likes:{post_id}; SADD for like, SREM for unlike; SISMEMBER viewer_id → O(1); SCARD → like count O(1); asynchronously flushed to a Cassandra counter.
feed (Redis sorted set) — key: feed:{user_id}; score: published_at (unix ms); member: post_id; max 3,000 entries; TTL 7 days.
Notes: Cassandra partition key = user_id; created_at DESC gives newest-first reads; follows is indexed in both directions; counts live in Redis (hot path) plus a Cassandra counter table.
Figure 5 — Data model. Posts in Cassandra (append-optimized), users and graph in PostgreSQL, engagement (likes, feed) in Redis with async flush to Cassandra counters.
⚠️

Cassandra COUNTER constraint: COUNTER columns cannot coexist with regular columns in Cassandra — they require a dedicated table. Like and comment counts are persisted in a separate post_counters(post_id, like_count COUNTER, comment_count COUNTER) Cassandra table, used only as the async flush target from Redis (the hot-path read layer). Follower/following counts and the celebrity flag are Redis hash fields maintained on every follow.created event — not denormalized columns in PostgreSQL. The users table stores immutable profile data only (username, bio, avatar_url). This separation avoids row-level locking on concurrent follow/unfollow writes in PostgreSQL.

Why store like counts as Redis sets rather than a SQL likes table?

Like counts and "has the current viewer liked this post?" are read on every single feed render — potentially 20 posts × 230,000 req/sec = 4.6 million point reads per second. A SQL COUNT(*) WHERE post_id = X aggregation query at this rate would require a massive database fleet. A Redis set per post (likes:{post_id}) stores user_ids of likers; SCARD returns the count in O(1) and SISMEMBER checks viewer membership in O(1). The total memory per popular post (10k likers × 8 bytes/user_id) is ~80 KB — trivially small. The set is asynchronously flushed to Cassandra's COUNTER column for persistence; Redis is the hot path read layer.
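
The hot-path operations on that per-post set are one Redis command each (a sketch using the likes:{post_id} key convention above):

import redis

r = redis.Redis()

def like(post_id: str, user_id: str) -> None:
    r.sadd(f"likes:{post_id}", user_id)        # idempotent: re-liking is a no-op

def unlike(post_id: str, user_id: str) -> None:
    r.srem(f"likes:{post_id}", user_id)

def like_count(post_id: str) -> int:
    return r.scard(f"likes:{post_id}")         # O(1) cardinality

def viewer_has_liked(post_id: str, viewer_id: str) -> bool:
    return bool(r.sismember(f"likes:{post_id}", viewer_id))   # O(1) membership check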

⚠️

Scaling limit for viral posts: A post with 10 M likers stores 10M × 8 bytes = 80 MB in a single Redis key — a concern for replication and eviction. At celebrity scale, switch to a HyperLogLog (PFADD / PFCOUNT) for approximate like counts (~12 KB per post, <1% error) and a Bloom filter for the "has viewer liked?" membership check. Exact historical counts are reconciled from Cassandra asynchronously. This is an L6 probe: "What happens to your Redis SET design when a post gets 50 million likes?"

Tradeoff: If Redis restarts, like counts must be rebuilt from Cassandra. The rebuild can fall behind for a few seconds, meaning like counts may temporarily read lower than reality. This is acceptable — it matches the "eventual consistency within 5s" NFR from §2. HyperLogLog cardinality estimates survive restarts only if Redis persistence (AOF/RDB) is enabled.
08

Caching strategy

L5+

The caching architecture for Instagram has four distinct layers, each positioned at a different tier of the §4 architecture. Without each layer, a different latency or cost ceiling is hit. The layers are not redundant — they address different bottlenecks.

[Figure 6 — four cache layers between the client and the origin (Cassandra Post DB + S3 object store): CDN photo object cache, Redis feed cache, Redis post-metadata cache, and Redis engagement counters; misses hydrate from origin. Details in the table below.]
Figure 6 — Four-layer cache hierarchy anchored to the §4 architecture. Each layer addresses a distinct bottleneck.
Layer | What it caches | Technology | TTL | Why it exists
1. CDN (edge) | Photo objects (all 3 variants) | CloudFront / Fastly PoPs | 30 days (immutable URLs) | Without CDN, S3 egress at ~40 Tbps would be economically and physically impossible
2. Feed Cache | Per-user sorted post ID list | Redis cluster (sorted set) | 7 days; LRU at 3000 entries | Enables < 200 ms feed reads; without this, every feed load is an N-way fan-in database scan
3. Post Metadata | Recent post JSON (caption, URLs, counts) | Redis cluster (hash) | 1 hr for hot posts; TTL-on-access | Feed hydration needs full metadata for ~20 posts per request at 230k req/sec — Cassandra cannot absorb all point reads
4. Engagement Counters | Like sets, follower counts, celebrity flags | Redis (sets, counters) | None (async flush to Cassandra) | Like count + viewer-liked check on every post render; SQL COUNT at feed QPS is infeasible

Cache invalidation

CDN invalidation: how do you update a photo once published? Immutability pattern

Photo objects are immutable: once a variant is written to S3 with a stable URL (e.g., /ph_8Xk2mNqLp9vW/f.jpg), it never changes. If the user edits their photo or post (which Instagram supports for captions, not image content), only the metadata changes. The CDN URL is stable for the original binary. This "immutable URL" pattern means CDN TTLs can be long (30 days) without invalidation risk. Deleting a photo requires an explicit CDN purge API call, which propagates to PoPs within minutes — acceptable, since a deleted photo isn't a latency-sensitive operation.

Tradeoff: Long TTLs mean deleted photos may persist in edge caches for up to 30 days unless a purge is issued. Instagram issues explicit CDN purges for deletions. Note: S3 itself returns 404 Not Found for deleted objects — not 410 Gone. Serving a semantic 410 ("intentionally removed") requires a Lambda@Edge or origin-shield function that intercepts requests for tombstoned photo IDs and returns the correct status code explicitly.
Feed cache invalidation on unfollow Consistency edge case

When user A unfollows user B, posts from B should stop appearing in A's feed. Purging B's post IDs from A's sorted set retroactively is expensive — you'd have to scan the entire set. The practical approach: on unfollow, simply mark the relationship as deleted in the Graph DB. The Feed API filters out posts from non-followed users before returning results. The feed cache may contain stale post IDs from B, but the hydration step checks current follow state and silently drops them. The feed cache self-heals as entries age out (7-day TTL). For immediate strictness, a background job can scrub the cache on unfollow — but this creates a thundering herd if mass-unfollows are triggered.

Avoiding Graph DB reads on every render: "Checking current follow state" during hydration cannot mean a Graph DB query per post — that defeats the purpose of the cache. Instead, the Feed API maintains a short-lived cached follow set per user in Redis (following:{user_id}, a SET of followee IDs, TTL 5 min, invalidated on follow/unfollow events via the Social API). Post IDs in the feed sorted set are filtered against this cached set in O(1) per post — no Graph DB round-trip needed during normal feed renders. The 5-min TTL bounds the maximum staleness window for any unfollow to 5 minutes, which is well within acceptable UX.
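
A sketch of that hydration-time filter; key names and the 5-minute TTL follow the text above, and the Graph DB loader is a hypothetical helper:

import redis

r = redis.Redis(decode_responses=True)
FOLLOW_SET_TTL_S = 300          # 5-minute staleness bound

def get_follow_set(user_id: str) -> set[str]:
    key = f"following:{user_id}"
    members = r.smembers(key)
    if not members:
        members = set(load_followees_from_graph_db(user_id))   # hypothetical Graph DB read
        if members:
            r.sadd(key, *members)
            r.expire(key, FOLLOW_SET_TTL_S)
    return members

def filter_feed_entries(user_id: str, candidate_posts: list[dict]) -> list[dict]:
    # Drop stale entries from unfollowed accounts without a Graph DB hit per post.
    following = get_follow_set(user_id)
    return [p for p in candidate_posts if p["user_id"] in following]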

09

Deep-dive scalability

L5+

At 500 M DAU, the production Instagram architecture must handle several failure modes and scale pressures that don't appear at smaller scale. This section traces through the five key scalability dimensions that L5+ candidates are expected to address.

[Figure 7 — Clients → CDN PoPs (300+ locations, ≥90% hit rate, ~40 Tbps capacity) → load balancers (geo-aware, rate-limited) → stateless API fleet (Feed / Upload / Social, hundreds of instances, auto-scaled) → Redis cluster (feed caches, post metadata, like sets), Cassandra (post metadata, sharded by user_id), S3 object store (multi-region, 11-nines durability), Kafka + workers (fan-out workers, image processors), Graph DB (follows, sharded by user_id). Sync and async paths as in Figure 1.]
Figure 7 — Full production scalability architecture. Each tier is independently scalable and stateless except the data layer.
Sharding strategy for Cassandra Post DB L5 · Data sharding

Cassandra partitions by user_id. This is the natural sharding key because the dominant access pattern — "get user X's recent posts" — is a single-partition scan. Cross-user queries (e.g., "get posts from all N followees") are parallelised across N partitions by the Feed API. Hot partitions arise for hyper-active users who post extremely frequently. Mitigation: add a shard_id secondary component to the partition key (e.g., user_id_mod_4) to spread a single user's data across 4 partitions, then union at read time. This is rarely needed below 1 B users.

Tradeoff: Sharding by user_id makes cross-user aggregation queries (trending posts, global analytics) unavailable from Cassandra. A separate analytics pipeline (Spark on HDFS or ClickHouse) handles aggregate queries.
Distributed ID generation for post_id L5 · ID generation

post_ids must be globally unique, roughly time-ordered (for sorted set scoring), and generated at ~1,200/sec without a central coordinator becoming a bottleneck. The standard solution is a Snowflake-like 64-bit ID: 41 bits timestamp (ms since epoch), 10 bits machine ID, 12 bits sequence. Each Upload API server generates IDs locally using its assigned machine_id — no network round-trip. The timestamp-prefix ensures IDs are sortable by time, which the Cassandra clustering key exploits.
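
A minimal single-process sketch of the 41/10/12 bit layout (the custom epoch and machine-ID assignment are assumptions; a production generator also needs clock-rollback handling):

import threading
import time

EPOCH_MS = 1577836800000          # assumed custom epoch: 2020-01-01T00:00:00Z

class SnowflakeGenerator:
    """64-bit IDs: 41-bit ms timestamp | 10-bit machine id | 12-bit per-ms sequence."""

    def __init__(self, machine_id: int):
        assert 0 <= machine_id < 1024          # 10-bit machine-ID namespace
        self.machine_id = machine_id
        self.last_ms = -1
        self.sequence = 0
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF    # 12-bit wrap
                if self.sequence == 0:                         # 4,096 IDs this ms: spin to the next ms
                    while now <= self.last_ms:
                        now = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now
            return ((now - EPOCH_MS) << 22) | (self.machine_id << 12) | self.sequence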

Capacity note: 10 bits = 1,024 unique machine IDs. The Upload API fleet can scale beyond this limit at Instagram's size. The production fix is a lightweight machine ID lease service (a Redis INCR counter, ZooKeeper node, or etcd lease) that issues numeric IDs from the 10-bit namespace with short-TTL leases — recycled when instances shut down. Modern stacks favour etcd over ZooKeeper for this pattern due to simpler operation and Raft-based strong consistency. Alternatively, use 12 bits for machine ID (4,096 nodes) and reduce the sequence to 10 bits (1,024 IDs/ms per node); throughput stays well above 1,200/sec at any realistic fleet size.

Alternatives: UUID v4 (random, not sortable) · ULID · Database sequence (bottleneck)
Fan-out worker scaling: the celebrity post problem L5/L6 · Thundering herd

Kylie Jenner has ~400 million followers. If she posts and we try to fan-out to all 400 M feed caches within 30 seconds, that requires writing ~13 million Redis operations per second sustained for 30 seconds — from the fan-out workers alone. This is achievable with a large horizontal fan-out fleet (hundreds of workers consuming from parallel Kafka partitions), but at enormous infrastructure cost for a single event. The design caps fan-out via the celebrity threshold. For true celebrities (400 M followers), the post is stored once; followers pull it lazily at feed read time, merged into the pre-built push cache. The thundering herd from fan-out is fully eliminated.

Tradeoff: Users who follow many celebrities experience marginally higher feed read latency (extra N pull queries + merge). In practice, most users follow fewer than 5 celebrities, and their posts are fetched in parallel — adding <20 ms to the merge step.
Geo-distribution and multi-region L6 · Global availability

Instagram serves users across North America, Europe, Latin America, Southeast Asia, and India. A single-region architecture would add 100–300 ms transcontinental latency to every API call. The production design deploys independent regional clusters (US-East, EU-West, AP-Southeast) with global S3 replication for photo objects and inter-region replication for the social graph. Feed caches are regional — users primarily interact with accounts in their geographic cluster. Post data replicates asynchronously across regions via Kafka cross-cluster replication. The social graph (follows table) is the hardest to multi-home because a follow write must eventually reach all regions — typically done via async replication with eventual consistency acceptable at cross-region propagation timescales (<5s).

Feed ranking: when and where is it applied? L6/L7 · ML integration

Instagram's feed is not purely chronological — it's ranked by an ML model incorporating engagement signals (likes, comments, profile visits), content signals (caption, hashtags), and behavioral signals (user viewing history, time of day). The ranking model cannot run at 230,000 feed requests/sec inline. The practical architecture: the Feed API assembles a candidate set of ~100–200 post IDs from the Redis cache and celebrity pulls, sends this candidate set to a lightweight ranking service (ONNX model inference, <20 ms for 200 candidates), and the ranked top N posts are returned to the client. The ranking service is stateless and horizontally scalable, adding ~20 ms to the latency budget — which fits inside the 200 ms p99 NFR.

What this drives: The feed cache needs to hold more candidates (200 IDs, not just 20) so the ranking model has a large enough pool to reorder. This increases Redis memory footprint by ~10× for the feed sorted set, but post IDs are only 8 bytes each — 200 IDs per user × 500 M active users = 800 GB RAM theoretical maximum. In practice, only DAU-active users maintain warm caches; the 7-day TTL evicts inactive users, reducing the hot working set to ~150–250 GB. A production Redis cluster (with read replicas and hash-slot overhead) budgets ~400–600 GB total — well within a clustered fleet's capacity.
10

Failure modes & edge cases

L5+
Scenario | Problem | Solution | Level
Image processor crashes mid-processing | Photo uploaded to S3, metadata never written to Post DB. User sees a spinner indefinitely. | Processor uses idempotent job IDs from SQS with visibility timeout. If the job isn't ACKed within 5 min (DLQ threshold), SQS re-delivers to another worker. The S3 object is already present — the processor regenerates variants and completes the write. Client polls /photos/{id}/status. | L3/L4
Redis cluster failure | Feed caches unavailable. All feed reads fall back to database fan-in. Cassandra and Graph DB get hit at full feed read QPS (~230,000 req/sec). DB fleet overwhelmed. | Load shedding: Feed API activates a circuit breaker after N consecutive Redis timeouts and returns a degraded response (recent 5 posts from a cache replica, or an empty feed with "Loading..."). Redis is deployed as a cluster with replica failover (<15 s switchover). Read replicas absorb fan-in during the failover window. | L5
Kafka consumer lag spike (fan-out delayed) | Fan-out workers fall behind. Posts appear in followers' feeds 5+ minutes late, violating the 30 s SLA. | Fan-out workers auto-scale on the Kafka consumer lag metric (Kubernetes HPA on consumer group lag). If lag exceeds threshold, additional workers spin up. Kafka partition count is set 2× the expected worker count to allow headroom. Monitoring alerts at >60 s lag. | L5
User with 0 followers posts; feed cache never populated | No followers to fan out to. Post is orphaned — nobody sees it in feeds. Profile page still works (directly reads from Cassandra by user_id). | This is correct behavior. The fan-out worker correctly does zero writes for 0 followers. The post is always visible on the poster's own profile page. If the user gains followers later, subsequent fan-outs populate follower feeds from then on — historical posts remain visible via the profile grid but are not retroactively injected into feed caches. | L3/L4
Celebrity crosses the threshold mid-fan-out | User has 9,999 followers; posts; fan-out starts. One second later the follower count crosses 10,001 (viral moment). Fan-out continues but now exceeds budget. | Fan-out worker checks celebrity status once at job start using the Redis-cached follower count (60 s TTL). If the follower count was below threshold at job start, the full fan-out completes. There is a short race window where a newly-over-threshold account gets one over-budget fan-out — acceptable, as it happens at most once per account crossing the threshold. A background observer on Kafka's follow.created stream detects the threshold crossing and atomically updates a Redis hash field (celebrity:{user_id} = 1): the flag the fan-out worker reads on all future post events. All subsequent posts from that user use the pull path. | L5/L6
Celebrity flag crosses threshold downward (mass unfollow / account cleanup) | A celebrity account loses followers — dropping below the threshold. Future posts should theoretically resume fan-out-on-write, but the celebrity:{user_id} flag was set when the threshold was crossed upward and is never cleared. | The flag is treated as a one-way gate for production simplicity: once an account is flagged as a celebrity, it stays on the pull path even if follower count later drops below the threshold. Re-enabling fan-out requires a manual operational step (clear the Redis flag). The tradeoff: a de-celeb'd account's posts involve a slightly heavier read path for its followers, but this affects a negligible fraction of accounts. Auto-detecting and migrating pull→push introduces race conditions between the flag update and in-flight fan-out workers that aren't worth the complexity. | L5/L6
CDN cache poisoning / wrong variant served | Image processor has a bug and generates a corrupt thumbnail. CDN caches the corrupt file. All subsequent requests get the corrupt image for 30 days. | Variant URLs include a content-hash suffix. If the processor regenerates a correct variant, it writes to a different URL (with a new hash suffix) rather than overwriting. The Post DB metadata is updated to point to the new URL, and an explicit CDN purge is issued for the old URL. Immutable-by-URL prevents accidental cache overwrites. | L5
Hot partition in Cassandra (viral user posts 1,000 photos/day) | A single user generates extreme write load on one Cassandra partition, causing that node's disk and CPU to spike disproportionately. | Apply partition-key salting: append a shard suffix (0–3) to user_id to spread the user's data across 4 partitions. Reads union across all 4 partitions for that user. Only activate for monitored hot users — normal users keep the unsalted partition key. | L7/L8
Fan-out worker: poison-pill Kafka event | A malformed post.published event (unknown poster_id, corrupt payload, or missing Graph DB record) causes the fan-out worker to fail on every retry attempt. After max retries the event lands in the DLQ; followers never see the post in their feeds. | Two-branch error handling: (1) transient errors (Redis timeout, Graph DB unavailable) — rely on Kafka at-least-once retry with exponential backoff; (2) unrecoverable errors (unknown user_id, schema mismatch) — log, emit a metric, and commit the Kafka offset without retrying to unblock the partition. A DLQ consumer alerts on-call and replays structurally valid events after root-cause resolution. Max retry count is configured per consumer group (typically 3 before DLQ). | L5
10b

Security, rate limiting & compliance

Three security concerns map directly to components already in the architecture: abuse prevention on the upload path, illegal content detection in the image pipeline, and regulatory erasure requirements that span every storage tier. L6+ interviewers will probe all three explicitly.

Upload rate limiting

Without rate limiting, a single malicious client can exhaust S3 pre-signed URL generation capacity and drive disproportionate storage write cost. The API Gateway enforces a per-user token-bucket rate limit on POST /photos: maximum 10 uploads per minute per authenticated user, enforced via a Redis counter (INCR ratelimit:{user_id}:{minute_bucket} with a 60-second TTL). Requests exceeding the limit receive a 429 Too Many Requests response with a Retry-After header. A separate, higher limit (100/min) applies at the IP level to catch unauthenticated probing before authentication succeeds.
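
A sketch of the per-user counter check (the fixed-window variant of the limit described above, using the ratelimit:{user_id}:{minute_bucket} key layout):

import time
import redis

r = redis.Redis()
UPLOADS_PER_MINUTE = 10

def allow_upload(user_id: str) -> bool:
    bucket = int(time.time() // 60)                    # current one-minute window
    key = f"ratelimit:{user_id}:{bucket}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, 60)                              # window self-expires after 60 s
    return count <= UPLOADS_PER_MINUTE                 # False -> respond 429 with Retry-After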

💬

L5 probe: "How would you prevent a bot from uploading 1 M photos per hour?" — Token bucket at the API Gateway keyed on user_id, backed by a Redis counter with a sliding window. Secondary: S3 pre-signed URL expiry (15 min) limits replay attacks. Tertiary: content-hash deduplication at the Image Processor rejects bit-exact duplicates before any variant is stored.

Perceptual hash deduplication and content moderation

The Image Processor runs two checks before writing variants to S3:

  1. Exact-hash deduplication: SHA-256 of the raw upload is checked against a probabilistic hash store (e.g., a disk-backed bloom filter or a Redis SET of known hashes) of previously stored object hashes. Bit-exact duplicates skip variant generation and reuse existing CDN URLs. Important: a bloom filter has no false negatives but can produce false positives — a false positive would silently deduplicate a genuinely unique image against someone else's photo. The implementation must include a fallback verification step: before reusing a CDN URL, confirm the candidate S3 object exists and its SHA-256 matches the new upload's hash. At 100 M uploads/day over 10 years (~365 B cumulative hashes), a bloom filter targeting a ≤0.001% false-positive rate needs roughly 24 bits per element — on the order of 1 TB of filter space, which is why the store is disk-backed or sharded rather than held in a single process's RAM (see the sizing sketch after this list).
  2. Perceptual hash (pHash / PhotoDNA): The processor computes a perceptual hash of the image and checks it against a blocklist of known-illegal content hashes (CSAM, jurisdiction-required blocklists). A match aborts the job, tombstones the S3 object, and triggers a moderation queue entry. This check happens before any variant is written publicly: the image never reaches CDN. The blocklist is a small in-memory set (≤100 MB) loaded at worker startup and refreshed hourly.
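
The standard Bloom-filter sizing formulas make that memory/accuracy trade explicit; a worked check of the figures in point 1 above:

import math

def bloom_filter_size(n_items: float, fp_rate: float) -> tuple[float, int]:
    """Return (filter size in bytes, optimal number of hash functions)."""
    bits = -n_items * math.log(fp_rate) / (math.log(2) ** 2)
    hashes = round((bits / n_items) * math.log(2))
    return bits / 8, hashes

size_bytes, k = bloom_filter_size(n_items=365e9, fp_rate=1e-5)   # 10 years of uploads, 0.001% FP
print(f"~{size_bytes / 1e12:.1f} TB of filter bits, {k} hash functions")   # ~1.1 TB, 17 hashes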

GDPR right-to-erasure pipeline

A user deletion request triggers a multi-stage erasure job implemented as a durable workflow (e.g., AWS Step Functions) to survive partial failures. The pipeline must reach every storage tier:

Tier | Erasure action | SLA
PostgreSQL (users, follows) | Hard-delete the user row; cascade to the follows table | ≤24 hours
Cassandra (posts, post_counters) | Delete all rows WHERE user_id = X (posts are partitioned by user_id). For post_counters (partitioned by post_id): first fetch the user's post IDs, then delete counter rows individually. Tombstones compact physically only after gc_grace_seconds (default 10 days); if a strict ≤7-day physical erasure SLA is required, zero out COUNTER columns instead of deleting and set gc_grace_seconds = 86400 on the post_counters table. | ≤7 days (logical); ≤10 days physical without config change
Redis (feed, likes, counts) | DEL feed:{user_id}; scan and DEL all likes:{post_id} keys for the user's posts | ≤1 hour
S3 (photo objects) | Batch DELETE all objects under the user's prefix; mark for permanent destruction | ≤7 days
CDN (edge caches) | Issue an explicit purge API call (e.g., CloudFront Invalidation API) for all CDN URLs associated with the user's posts | ≤30 days (propagation ~15 min after purge)
Kafka (event log) | Retention is time-windowed (7 days default); post.published events age out automatically | ≤7 days
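A minimal sketch of three of the tiers above, written as idempotent steps that a durable workflow engine (e.g., Step Functions) would invoke and retry independently. Key patterns, the bucket name, and the distribution ID are assumptions for illustration.

```python
# Sketch only: idempotent erasure steps for the Redis, S3, and CDN tiers.
import time
import boto3
import redis

r = redis.Redis()
s3 = boto3.client("s3")
cloudfront = boto3.client("cloudfront")

def erase_redis(user_id: str, post_ids: list[str]) -> None:
    r.delete(f"feed:{user_id}")                 # the pre-computed feed
    for post_id in post_ids:
        r.delete(f"likes:{post_id}")            # like sets for each of the user's posts

def erase_s3(user_id: str, bucket: str = "photos-originals") -> None:
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=f"{user_id}/"):
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            s3.delete_objects(Bucket=bucket, Delete={"Objects": keys})

def purge_cdn(paths: list[str], distribution_id: str = "EXAMPLE_DIST_ID") -> None:
    # Explicit invalidation, not TTL expiry: propagates to all PoPs in roughly 15 minutes.
    cloudfront.create_invalidation(
        DistributionId=distribution_id,
        InvalidationBatch={
            "Paths": {"Quantity": len(paths), "Items": paths},   # e.g. ["/photos/abc123*"]
            "CallerReference": f"erasure-{int(time.time())}",    # must be unique per request
        },
    )
```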
⚠️

CDN erasure gap (the most common L6 miss): CDN edge caches may continue serving a deleted photo for up to 30 days after the S3 object is deleted if no explicit CDN purge is issued. The erasure workflow must call the CDN purge API, not just delete the S3 object. Note: S3 returns 404 Not Found for deleted objects, not 410. For jurisdictions requiring erasure within 72 hours, a Lambda@Edge function intercepts requests for tombstoned photo IDs and returns 410 Gone explicitly, signaling to CDN nodes not to cache the response and surfacing the intentional removal to clients.
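A minimal sketch of that Lambda@Edge viewer-request handler, assuming variant URLs embed the photo ID and that is_tombstoned() consults a small replicated tombstone set; both are assumptions, not part of the design above.

```python
# Sketch only: viewer-request handler that short-circuits tombstoned photos with 410 Gone.
def is_tombstoned(photo_id: str) -> bool:
    ...  # hypothetical lookup against a replicated tombstone set

def handler(event, context):
    request = event["Records"][0]["cf"]["request"]
    # Assumed URI shape: /photos/{photo_id}_{variant}.jpg
    photo_id = request["uri"].rsplit("/", 1)[-1].split("_")[0]

    if is_tombstoned(photo_id):
        # Generating the response here bypasses the edge cache and the origin entirely;
        # 410 plus no-store tells CDN nodes and clients the removal is intentional.
        return {
            "status": "410",
            "statusDescription": "Gone",
            "headers": {
                "cache-control": [{"key": "Cache-Control", "value": "no-store"}],
            },
            "body": "This content has been removed.",
        }
    return request   # not tombstoned: continue to cache/origin as usual
```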

💬

L7/L8 probe: "A user deletes their account. How do you guarantee their photos stop being served within 30 days across all 300+ CDN PoPs globally?" — Explicit CDN Invalidation API call in the erasure workflow, not TTL expiry. The purge propagates to all PoPs within ~15 minutes. For 72-hour jurisdictions: redesign the CDN URL to include a user-controlled token; on deletion, revoke the token at an origin-shield layer that returns 410 Gone — CDN nodes revalidate and stop serving the stale cached response.

11

How to answer by level

L3 / L4 SDE I / SDE II — Build a working system
What good looks like
  • Identify the three domains: upload, social graph, feed
  • Propose pre-signed URL upload pattern (binary off app servers)
  • Describe fan-out-on-write as the baseline feed strategy
  • Choose object store + CDN for image delivery
  • Sketch a basic Post DB schema with user_id, timestamp, CDN URLs
  • Mention Redis for feed caching
What separates L5 from L4
  • Recognizing that uniform fan-out fails for celebrities
  • Explaining why Cassandra fits the post write pattern better than PostgreSQL
  • Tracing the latency budget: where does 200 ms go?
  • Discussing image processing as async (not blocking the upload response)
L5 Senior SDE — Understand the tradeoffs
What good looks like
  • Propose hybrid fan-out and explain the celebrity threshold
  • Describe how the feed merge works (N-way sort of push + pull)
  • Tie Redis feed cache to the 200 ms latency NFR explicitly
  • Discuss Kafka for decoupling image processing, fan-out, and notifications
  • Explain the access pattern table before the schema (§7 pattern)
  • Address the processor failure mode and retry strategy
What separates L6 from L5
  • Designing the full four-layer cache hierarchy with TTL reasoning
  • Addressing CDN invalidation for photo deletion
  • Explaining why random UUIDs (v4) are suboptimal for time-sorted Cassandra clustering keys (random distribution defeats range-scan locality), and proposing Snowflake-style 64-bit time-prefixed IDs instead
  • Discussing multi-region deployment and which stores replicate synchronously vs async
L6 Staff SDE — Own the system end-to-end
What good looks like
  • Full cache invalidation strategy per layer (CDN, feed, metadata, counts)
  • Geo-distribution plan: which stores replicate where and at what consistency level
  • Feed ranking integration: candidate set size, ranking model placement, latency budget
  • Enumerate failure modes and propose circuit breakers and load shedding
  • Cassandra hot partition mitigation via partition-key salting
What separates L7 from L6
  • Framing the celebrity threshold as a tunable business parameter, not a fixed constant
  • Proposing a feedback loop to adjust fan-out strategy dynamically based on observed write cost
  • Discussing CDN cost model and how cache hit rate directly drives infrastructure spend
L7 / L8 Principal / Distinguished — Drive the architecture
What good looks like
  • Frame every decision as an explicit tradeoff with business impact (celebrity fan-out cost vs latency improvement)
  • Propose adaptive celebrity threshold: dynamically adjusted based on real-time write cost monitoring
  • Discuss the Explore tab as a separate architecture problem (collaborative filtering, not follow graph)
  • Address regulatory requirements: GDPR right-to-erasure pipeline across S3, Cassandra, Redis, CDN
  • Propose self-healing cache rebuild after Redis failure without overwhelming downstream DBs
Common L7 misses
  • Spending too much time on the upload path (solved problem) vs feed ranking
  • Proposing synchronous cross-region consistency for the social graph (impossible at this scale)
  • Ignoring CDN cost as an architectural constraint
  • Not discussing the difference between the home feed (follow graph) and the Explore feed (recommendation)

Classic probes

Question | L3/L4 | L5/L6 | L7/L8
"How would you handle Beyoncé posting a photo?" | Fan-out to all followers (missing the cost problem) | Celebrity threshold → store once, pull on read → merge at feed read time | Adaptive threshold, cost model, monitoring, and when NOT to fan out even below the threshold (trending moments)
"How do you deliver images to a user in Tokyo?" | Store in S3, return URL | CDN PoP in Tokyo; immutable URLs; 90%+ cache hit rate; origin is S3 multi-region | CDN cost model, cache-hit-rate sensitivity analysis, purge latency SLA on deletion, GDPR erasure timelines across CDN PoPs
"What happens if Redis goes down?" | "Fall back to the database" | Circuit breaker + load shedding; replica failover (<15 s); DB fan-in for a degraded feed during failover | Graduated degradation: replica → degraded feed → cached empty state; replay-from-Kafka feed rebuild without overwhelming Cassandra; impact analysis (X% of users see a degraded feed for Y minutes)
"How would you add a ranked feed instead of chronological?" | "Sort by likes" (doesn't describe the pipeline) | Candidate set (200 post IDs from cache) → ranking service (ONNX/TF, <20 ms for 200 candidates) → top-N returned; offline model training on engagement signals | Full ML pipeline: feature store, online serving latency budget, A/B testing infrastructure, ranking-model cold start for new users, privacy constraints on signal collection
How the pieces connect
Every major decision traces back to a stated NFR or capacity number. These are the threads.
  1. 11-nines durability NFR (§2) → photo bytes must never touch app servers → pre-signed direct-to-S3 upload pattern (§4, §6), which also eliminates the app-tier network I/O bottleneck as a side benefit.
  2. 200 ms feed latency NFR (§2) → real-time database fan-in at 230k req/sec is impossible → pre-computed Redis feed cache with a 7-day TTL (§8) as a structural prerequisite → which requires the Kafka fan-out architecture (§4) to populate it.
  3. 30 s feed freshness NFR (§2) → fan-out-on-write to all followers is required → but a 100 M-follower fan-out takes >30 s → hybrid strategy: push for regular users, pull for celebrities (§5), with a merge at read time (§6).
  4. Storage at 220 PB (§3) and 40+ Tbps read bandwidth → CDN as the Layer 1 cache (§8) with immutable URLs and a 30-day TTL, making photo delivery economically viable by achieving ≥90% hit rates at the edge.
  5. Post access pattern: range scan by (user_id, time) (§7) → Cassandra with user_id as the partition key and created_at as the clustering key → single-partition range scan ≤10 ms → enables the feed-miss fallback without overwhelming the database (§9).
  6. Like count on every feed render (§7 access patterns) → SQL COUNT at 230k req/sec × 20 posts per request is infeasible → Redis SADD/SCARD for O(1) like counts and membership (§7, §8 Layer 4), asynchronously flushed to a Cassandra COUNTER for persistence.
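A minimal sketch of the Redis side of thread 6, assuming redis-py; key names follow the likes:{post_id} convention used above, and the async flush to the Cassandra COUNTER is omitted.

```python
# Sketch only: likes as a Redis SET per post, read back in one pipelined round trip per feed page.
import redis

r = redis.Redis()

def like(user_id: str, post_id: str) -> None:
    r.sadd(f"likes:{post_id}", user_id)    # idempotent: double-tapping twice is still one like

def unlike(user_id: str, post_id: str) -> None:
    r.srem(f"likes:{post_id}", user_id)

def counts_for_feed_page(viewer_id: str, post_ids: list[str]) -> list[dict]:
    # SCARD gives the like count, SISMEMBER gives the "viewer already liked this" flag,
    # both O(1), batched for the ~20 posts on a feed page.
    pipe = r.pipeline()
    for post_id in post_ids:
        pipe.scard(f"likes:{post_id}")
        pipe.sismember(f"likes:{post_id}", viewer_id)
    results = pipe.execute()
    return [
        {"post_id": pid, "like_count": results[2 * i], "viewer_liked": bool(results[2 * i + 1])}
        for i, pid in enumerate(post_ids)
    ]
```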
