Design a File Storage System
The interface is a folder. Behind it: chunked uploads, content-addressed deduplication, delta sync across devices, and conflict resolution that can't lose a byte.
What the interviewer is testing
A file storage and sync system — Dropbox, Google Drive — looks like a managed folder. Underneath it spans chunked uploads, content-addressed deduplication, delta sync, offline-first conflict resolution, and a metadata layer that must track billions of file versions with sub-second query times.
Interviewers use this question to probe whether you think about data at rest (block storage, dedup) separately from data in motion (the sync protocol), and whether you understand how those two planes interact under failure conditions. Three tensions drive the core tradeoffs: bandwidth efficiency vs. simplicity, strong consistency vs. availability during partitions, and storage cost vs. query flexibility.
| Level | Primary focus | Hard question at this level |
|---|---|---|
| L3/L4 | Working upload/download with metadata tracking | How does a client know what changed while it was offline? |
| L5 | Chunked uploads, content addressing, delta sync | How does chunking interact with deduplication across users? |
| L6 | Conflict resolution, versioning, resumable uploads | Two devices edit the same file simultaneously — what happens? |
| L7/L8 | Cross-region consistency, CDN strategy, cost at petabyte scale | How do you cut storage cost by 60% without degrading read latency? |
The single most differentiating question across all levels is the sync protocol. How does a client discover what changed, fetch only what it needs, and handle conflicts, without polling constantly or losing data? A candidate who handles the sync protocol clearly tends to derive the remaining design decisions correctly: the sync model constrains the metadata schema, the notification architecture, and the conflict resolution approach.
Requirements clarification
Two designs hide under this question: a consumer sync tool (Dropbox-like, mobile + desktop, strong sync) and an enterprise document platform (Google Drive-like, collaborative editing, sharing, search). The scope changes which problems are hard. Confirm scale (DAU, average file size, total storage) before assuming either.
Functional requirements
| Requirement | Description |
|---|---|
| Upload files | Clients upload files of any type and size (up to 5 GB per file) |
| Download files | Clients download files on any device associated with their account |
| Sync | Changes on one device propagate automatically to all other devices for the same account |
| Version history | Users can view and restore previous versions of any file (minimum 30 days) |
| File sharing | Files and folders can be shared with other users or via a public link |
| Offline access | Files marked for offline access are available without a network connection |
| Conflict resolution | When two devices edit the same file concurrently, both versions are preserved (no silent data loss) |
Non-functional requirements
| NFR | Target | Notes |
|---|---|---|
| Availability | 99.99% (52 min/year) | Reads must succeed even during write outages |
| Durability | 99.999999999% (11 nines) | No data loss after upload acknowledgement |
| Sync latency | < 5 s for files < 1 MB | From upload-complete on device A to available on device B |
| Upload bandwidth | Delta uploads (only changed chunks) | A 1-byte edit in a 100 MB file must not re-upload 100 MB |
| Read latency (p99) | < 200 ms for metadata; < 2 s TTFB for large files | Metadata served from cache; files from nearest CDN edge |
| Storage efficiency | Global deduplication across all users | A file uploaded by 10,000 users is stored only once |
| Consistency | Eventual (sync events); strong (metadata writes per file) | Upload completion is strongly consistent; cross-device propagation is eventual |
NFR reasoning
Durability: 11 nines (99.999999999%) Storage architecture driver ›
11 nines means roughly one file lost per 100 billion stored per year. This is the standard SLA for consumer file storage and matches what AWS S3 and Google Cloud Storage guarantee. Achieving it requires erasure coding (or triple replication) across multiple availability zones, plus periodic integrity verification (bit-rot scrubbing). It is not achievable with single-region storage alone.
Upload bandwidth: delta-only (changed chunks) Sync protocol driver ›
Mobile and home network upload bandwidth is 2–20 Mbps for most users. Re-uploading an entire large file on every save would make the product unusable for video editors or anyone working with large documents. Chunking solves this: only the changed chunks are uploaded. If a 1 GB video has only its last 10 MB modified, the upload is ~10 MB, not 1 GB — a 100× bandwidth reduction.
Storage efficiency: global deduplication Cost driver ›
Studies of real cloud storage systems show 20–50% of stored data is identical across different user accounts. A popular PowerPoint template uploaded by 10,000 users would consume 1 TB at naïve storage, 100 MB with deduplication. At petabyte scale, global dedup reduces raw storage costs by 30–50%, which can make the difference between a profitable and unprofitable storage business at consumer pricing.
Capacity estimation
File storage is a storage-bound problem, not a compute-bound one. Three numbers drive the cost model: total raw storage accumulated over time, the deduplication ratio that compresses it, and peak inbound bandwidth during business hours.
Interactive capacity estimator
Key insight: The metadata database (file records, version history, sharing permissions) grows independently of the chunk store — and is far more expensive per byte. At 50M DAU uploading 3 files/day for 5 years, you accumulate ~270 billion file records. A relational row at ~500 bytes means ~135 TB of metadata alone, which is why file metadata is typically sharded by user ID and stored in a distributed SQL or document database, separate from the actual file content.
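These figures fall out of simple arithmetic; a short script makes them reproducible. The inputs (50M DAU, 3 uploads/day, 5 years, ~500 bytes per metadata row) are the assumptions stated above, not measurements.

```python
# Back-of-the-envelope metadata capacity model, using the assumptions from this section.
DAU = 50_000_000          # daily active users (assumption)
UPLOADS_PER_DAY = 3       # files uploaded per user per day (assumption)
YEARS = 5
ROW_BYTES = 500           # average metadata row size (assumption)

file_records = DAU * UPLOADS_PER_DAY * 365 * YEARS
metadata_bytes = file_records * ROW_BYTES

print(f"file records:  {file_records / 1e9:.0f} billion")   # ~274 billion
print(f"metadata size: {metadata_bytes / 1e12:.0f} TB")      # ~137 TB
```

The rounded ~270 billion records and ~135 TB figures quoted above come from this same calculation.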
High-level architecture
A file storage system has two fundamentally different data planes: the metadata plane, which tracks what files exist, who owns them, and what version each device has seen; and the block plane, which stores the actual file content as immutable, content-addressed chunks. Keeping these planes separate is what makes global deduplication and delta sync possible: metadata can be strongly consistent without imposing that cost on the block store, and blocks can be immutable and content-addressed without requiring the metadata layer to share those constraints.
The system splits into four functional clusters: a sync API tier that clients talk to, a metadata service backed by a sharded relational database, a block service backed by a content-addressable blob store, and a notification bus that pushes change events to subscribed client devices.
Component breakdown
CDN (content delivery network): caches file chunks at edge nodes globally. File reads are by far the highest-volume operations. Serving them from the nearest edge node reduces latency from hundreds of milliseconds (origin round-trip) to single-digit milliseconds for most users. The CDN stores chunks by their content hash: the same hash always resolves to the same bytes, so cached chunks never require invalidation.
API Gateway + Sync API servers: the stateless tier that authenticates requests, enforces rate limits, and routes to the correct downstream service. The Sync API is split from the data path: clients GET files via CDN-signed URLs and PUT chunks directly to the Block Service's upload endpoints. The Sync API orchestrates the metadata transaction that records each completed upload.
Metadata Service: tracks every file record, folder tree, sharing permission, and version history entry. This is the most query-intensive service; every sync operation begins here to discover what changed. It runs on a sharded relational database — MySQL via Vitess (the sharding layer Dropbox itself used), or Google Cloud Spanner for globally replicated consistency — partitioned by user ID so that all queries for a given user's file tree hit the same shard. The relational model is chosen over document or wide-column stores because uploading a file must atomically create a file_versions row and update files.current_ver — a multi-row ACID transaction that eventually-consistent document stores cannot provide without application-level compensating logic.
Block Service: receives raw chunk uploads, computes SHA-256 fingerprints, checks the global dedup table, and either confirms the chunk already exists or writes it to the Chunk Store. This service is stateless and owns no data: all state lives in the hash lookup table and the Chunk Store.
Chunk Store: a durable, content-addressed object store (a managed object storage service like S3 or GCS) where each chunk is keyed by its SHA-256 hash. Objects are immutable once written. Because chunks are identified by content hash, two identical files uploaded by different users share the same physical chunk objects — deduplication is free at the storage layer.
Notification Bus: when a file upload completes, the Sync API publishes a change event to a durable message queue (such as Apache Kafka). A notification service fans out that event to all other devices for the same account connected via WebSocket or HTTP long-poll. Devices receive the event and know to pull the updated file manifest from the Metadata Service.
Cache: an in-memory store (Redis) sitting in front of the Metadata Service for hot read paths — active sessions, recently accessed file metadata, sharing permission checks. Because the Metadata Service owns all writes, it can push invalidation events after each commit without coordinating with other writers — there is no distributed invalidation problem.
Architectural rationale
Why separate the metadata plane from the block plane? Core architecture ›
Metadata (file names, folder trees, version history) has very different access characteristics from file content: it is small (hundreds of bytes per record), queried constantly (every sync event), and needs strong consistency so clients agree on the current version. File content is large (MB to GB per file), mostly read-heavy after initial upload, and benefits from content-addressed immutability. Mixing them in one service forces both to use the same consistency and latency model, which is wrong for both.
Why clients GET files via CDN-signed URLs, not through API servers? Bandwidth cost ›
Routing file downloads through API servers would require each server to proxy gigabytes of data — pulling from object storage and streaming to the client. This consumes server CPU, memory, and egress bandwidth at API server cost. Generating a time-limited pre-signed URL (configurable TTL — commonly 15 minutes for download URLs; the AWS SDK default is 3600 s, tune downward for sensitive files) allows the client to download directly from the CDN or object store, bypassing the API tier entirely. The API server's job in the download path is only to authorise the request and generate the URL.
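As a sketch of that authorise-and-sign step, here is how a time-limited download URL could be minted with boto3 against an S3-compatible chunk store. The bucket name and key layout are placeholders; the 900 s TTL is the 15-minute figure mentioned above.

```python
import boto3

s3 = boto3.client("s3")

def make_download_url(chunk_hash: str, ttl_seconds: int = 900) -> str:
    """Return a time-limited URL the client fetches directly from the object store/CDN.

    The API server only authorises the request and signs the URL; no file bytes
    ever pass through it. Bucket and key layout are illustrative.
    """
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "chunk-store", "Key": f"chunks/{chunk_hash}"},
        ExpiresIn=ttl_seconds,
    )
```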
Why WebSocket / long-poll for sync notifications instead of polling? Sync latency ›
With 50M active devices polling every 30 s, that's ~1.7M req/s of overhead regardless of what changed: the system pays that cost at all times. Push-based notifications (WebSocket subscriptions or HTTP long-poll) flip this: the server only sends a message when something actually changed. For most devices in a typical hour, that is zero messages: the connection is idle.
Real-world comparison
| Decision | This design | Dropbox | Google Drive |
|---|---|---|---|
| Chunk size strategy | Content-defined (rolling hash) | Fixed 4 MB (historically); rolling hash in modern client | Fixed block size for binary files; byte-level diff for Docs |
| Metadata storage | Sharded RDBMS (sharded by user ID) | MySQL with custom sharding (Edgestore) | Google Spanner (globally replicated) |
| Block storage | S3-compatible object store | Amazon S3 + custom Magic Pocket | Google Colossus (distributed file system) |
| Deduplication scope | Global (all users) | Global block-level dedup | Per-user (privacy constraints limit cross-user dedup) |
| Sync protocol | Event-driven (push change events) | LongPoll + delta sync | Push via APNs/FCM + polling fallback |
The right architectural choices follow from requirements, not from brand preference. Dropbox's decision to build Magic Pocket (its own object storage) came from S3 egress costs at petabyte scale. Google Drive's Spanner choice reflects the need for strong consistency across global regions — a tradeoff that makes sense when the product also powers Google Workspace collaboration.
Chunking & deduplication
The central algorithmic question in this design is: how do you split a file into chunks such that the same logical content produces the same chunks, across different file sizes and edit positions? The answer determines how effective delta sync and cross-user deduplication are in practice.
Our choice for this system is content-defined chunking (CDC) using a Rabin fingerprint sliding window. The algorithm slides a 48-byte window over the file's bytes and cuts a chunk boundary whenever the 32-bit rolling hash matches a target pattern. As a concrete example: requiring the lowest 13 bits to be zero yields an expected chunk size of 2^13 = 8 KB. For a 4 MB target, require the lowest 22 bits to match (2^22 ≈ 4 MB). Minimum and maximum chunk size bounds prevent degenerate tiny or huge chunks.
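A minimal sketch of content-defined chunking, using a simpler Gear-style rolling hash as a stand-in for a full Rabin fingerprint. The 8 KB target (lowest 13 bits zero) matches the example above; the 2 KB minimum and 64 KB maximum bounds are illustrative.

```python
import hashlib
import random

random.seed(0)
GEAR = [random.getrandbits(64) for _ in range(256)]  # fixed random byte-to-int table

MIN_CHUNK, MAX_CHUNK = 2 * 1024, 64 * 1024
BOUNDARY_MASK = (1 << 13) - 1   # lowest 13 bits zero -> ~8 KB expected chunk size

def chunk_file(data: bytes):
    """Yield (sha256_hex, chunk_bytes) pairs with content-defined boundaries."""
    start, h = 0, 0
    for i in range(len(data)):
        h = ((h << 1) + GEAR[data[i]]) & 0xFFFFFFFFFFFFFFFF   # rolling hash over recent bytes
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & BOUNDARY_MASK) == 0) or length >= MAX_CHUNK:
            piece = data[start:i + 1]
            yield hashlib.sha256(piece).hexdigest(), piece
            start, h = i + 1, 0
    if start < len(data):                                      # trailing partial chunk
        piece = data[start:]
        yield hashlib.sha256(piece).hexdigest(), piece
```

Because boundaries depend only on nearby bytes, an edit perturbs at most the chunk containing it (and occasionally its neighbour); later boundaries re-synchronise, so the remaining chunks keep their hashes and are skipped by deduplication.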
Deduplication via content addressing
After chunking, each chunk is identified by its SHA-256 fingerprint. Before uploading, the client sends the Block Service a list of chunk hashes. The Block Service checks each hash against the global chunk index. The Block Service skips any chunk whose hash is already in the index and notifies the client of which chunks to omit. Only the remaining chunks are uploaded. This achieves both intra-user dedup (re-uploading a file you already have) and inter-user dedup (uploading a file that someone else already uploaded).
Hash collision risk: is SHA-256 safe? Security · L5 probe ›
SHA-256 produces a 256-bit fingerprint. The birthday-paradox collision probability for N chunks is approximately N²/2^256. At 10^15 chunks (roughly an exabyte of data even at a 1 KB average chunk size), that works out to ~10^30/10^77 ≈ 10^-47 — astronomically small. In practice, accidental collision is not a meaningful risk.
The real threat is intentional collision: a malicious user crafting a chunk that has the same hash as a chunk belonging to another user, enabling them to "own" that chunk without uploading it. SHA-256 is currently collision-resistant against known attacks, but the system should also perform byte-level verification on download if tampering is a concern.
Cross-user dedup and privacy: the "hash proof" pattern Privacy · L6 probe ›
Cross-user dedup implies that if you know the hash of a file you don't have access to, you could "own" a copy of it for free (by just proving you know its hash). This is the "Dropbox hash theft" attack discovered in 2011. The mitigation is requiring the uploader to prove they have the content, not just the hash — either by uploading a fragment (a proof-of-possession scheme) or by using HMAC with a per-user secret key so hashes are not comparable across users. Google Drive disables cross-user dedup for sensitive content categories for this reason.
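A sketch of the keyed-hash mitigation: fingerprinting chunks with HMAC-SHA-256 under a per-user secret makes the identifiers incomparable across accounts, closing the hash-theft hole at the cost of cross-user dedup. Key storage and rotation are omitted and the helper names are illustrative.

```python
import hashlib
import hmac

def global_fingerprint(chunk: bytes) -> str:
    """Plain SHA-256: identical chunks match across users, enabling global dedup
    but also the hash-theft attack described above."""
    return hashlib.sha256(chunk).hexdigest()

def per_user_fingerprint(chunk: bytes, user_key: bytes) -> str:
    """HMAC with a per-user key: the same chunk maps to different identifiers for
    different users, so knowing someone else's hash proves nothing; cross-user dedup is lost."""
    return hmac.new(user_key, chunk, hashlib.sha256).hexdigest()
```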
Encryption at rest and key management Security · L6+ probe ›
For consumer tiers, server-side encryption (SSE) with managed keys is standard: chunks are encrypted at rest using AES-256 with a per-user data key, itself wrapped by a master key in a KMS (AWS KMS, Google Cloud KMS). This protects against physical storage compromise without affecting the dedup model: the server decrypts on read before serving a presigned URL. For enterprise and HIPAA-regulated tiers, client-side encryption (CSE) is required: the client encrypts each chunk before upload using a key that never leaves the user's device. The tradeoff is explicit: CSE eliminates all cross-user dedup (the same plaintext file produces unique ciphertext per user, so chunk hashes never match across users), and must be stated as such when an interviewer asks about the privacy-vs-efficiency axis.
One open question: What's the target chunk size? Smaller chunks improve dedup granularity and delta sync precision, but increase the number of chunk hash lookups and metadata rows. Larger chunks reduce lookup overhead but mean more re-upload if a large chunk contains a small edit. For workloads that include large binary files, 4–8 MB chunks reduce hash lookup overhead at the cost of coarser delta granularity. For document and text-heavy workloads, 64–512 KB chunks improve dedup precision and delta sync resolution, at the cost of more hash lookups per upload.
API design
The public API has two core operations: initiating a file upload (which returns upload URLs for missing chunks) and triggering a sync to discover what changed. Everything else — version restore, sharing, folder management — is important but does not change the core architecture.
POST /files/upload/init — initiate upload
The API gateway enforces a per-user token bucket on this endpoint: max 100 upload-init requests per minute, 10 concurrent upload sessions. Clients exceeding the limit receive 429 Too Many Requests with a Retry-After header.
// Request: client sends file metadata + list of chunk hashes
POST /files/upload/init
Authorization: Bearer <token>
Content-Type: application/json
{
"path": "/My Documents/report.pdf",
"size_bytes": 52428800, // 50 MB
"modified_at": "2026-04-10T14:32:00Z",
"content_hash": "sha256:aef392...", // whole-file hash
"chunks": [
{ "index": 0, "hash": "sha256:c1a...", "size": 4194304 },
{ "index": 1, "hash": "sha256:d2b...", "size": 4194304 },
{ "index": 2, "hash": "sha256:e3c...", "size": 1048576 }
]
}
// Response: 200 OK — server identifies which chunks are missing
{
"upload_session_id": "sess_abc123",
"missing_chunks": [1, 2], // chunk 0 already exists
"upload_urls": {
"1": "https://storage.example.com/chunks/d2b...?sig=...",
"2": "https://storage.example.com/chunks/e3c...?sig=..."
}
}
// Validation errors
400 Bad Request: chunk count > 10,000 or file > 5 GB
409 Conflict: path exists with same content_hash (client dedups early)
POST /files/upload/complete — commit upload
// Request: after uploading all missing chunks, commit the transaction
POST /files/upload/complete
Content-Type: application/json
{
"upload_session_id": "sess_abc123"
}
// Response: 201 Created
{
"file_id": "file_xyz789",
"version_id": "ver_001",
"url": "https://cdn.example.com/files/file_xyz789?v=ver_001&sig=..."
}
// Error responses
404 Not Found: upload_session_id expired or unknown
422 Unprocessable: one or more missing chunks not yet uploaded
GET /sync/delta — pull changes since a cursor
// Request: client provides its last sync cursor
GET /sync/delta?cursor=eyJhY2N0IjoiMTIzIiwiY3Vyc29yIjoiMTcxMjcwMDAwMCJ9
// Response: list of changes since that cursor
{
"changes": [
{
"type": "modified",
"file_id": "file_xyz789",
"path": "/My Documents/report.pdf",
"version_id": "ver_002",
"size_bytes": 52457600,
"chunks": [
{ "index": 0, "hash": "sha256:c1a..." }, // unchanged
{ "index": 1, "hash": "sha256:f4d..." }, // new chunk
{ "index": 2, "hash": "sha256:e3c..." } // unchanged
]
}
],
"new_cursor": "eyJhY2N0IjoiMTIzIiwiY3Vyc29yIjoiMTcxMjcxMDAwMCJ9",
"has_more": false
}
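The cursor is an opaque token to the client, but the sample value in the request above decodes to a small JSON payload: {"acct":"123","cursor":"1712700000"} (account ID plus a position). A sketch of that encoding, assuming exactly this payload shape:

```python
import base64
import json

def encode_cursor(account_id: str, position: int) -> str:
    """Pack the sync position into an opaque, URL-safe token the client echoes back."""
    payload = json.dumps({"acct": account_id, "cursor": str(position)}, separators=(",", ":"))
    return base64.urlsafe_b64encode(payload.encode()).rstrip(b"=").decode()

def decode_cursor(token: str) -> dict:
    """Recover account and position; base64 padding is restored before decoding."""
    padded = token + "=" * (-len(token) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))

print(encode_cursor("123", 1712700000))
# eyJhY2N0IjoiMTIzIiwiY3Vyc29yIjoiMTcxMjcwMDAwMCJ9  (matches the request example above)
```

The server treats the token as opaque so the encoding can change without breaking clients.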
Optional endpoints by level
| Endpoint | Description | Level |
|---|---|---|
| GET /files/{id}/versions | List version history for a file | L3/L4 |
| POST /files/{id}/restore | Restore a file to a previous version | L3/L4 |
| POST /files/{id}/share | Generate a shareable link with optional expiry | L5 |
| DELETE /files/{id} | Soft-delete a file (moves to trash, retains versions) | L5 |
| GET /files/search | Full-text search across file names and content | L7/L8 |
| POST /files/upload/resumable/init | TUS-compatible resumable upload for large files | L7/L8 |
POST /files/{id}/share — sharing endpoint
// Request: specify permission level, optional grantee, optional expiry
POST /files/{id}/share
Authorization: Bearer <token>
Content-Type: application/json
{
"permission": "view" | "edit", // access level granted
"grantee_id": "usr_abc123", // omit for a public link
"expiry_seconds": 604800 // optional; 0 = no expiry
}
// Response
HTTP 201 Created
{
"link_token": "tok_xK9mPq2r",
"share_url": "https://app.example.com/s/tok_xK9mPq2r",
"expires_at": "2026-04-18T12:00:00Z", // null if no expiry
"permission": "view"
}
The link_token is stored in the file_shares table (§8). On access, the API verifies the token is not expired, increments an access counter for audit, and returns a presigned download URL. For grantee_id shares (direct user grants), the notification fan-out (§10) adds the grantee to the affected file's change events so they receive real-time sync updates for the shared file.
Search design note
File-name search can be served without a dedicated search index: a WHERE user_id = ? AND name LIKE 'prefix%' query on a (user_id, name) composite index in the metadata DB handles autocomplete efficiently at L3/L4. Full-text content search — searching inside document bodies — requires an async pipeline: upload completion triggers an event → text-extraction worker pulls the chunk bytes, extracts plaintext, and pushes to Elasticsearch → the index maps terms to file IDs. Clients query Elasticsearch directly via the API gateway. This pipeline is an L7/L8 concern; state the distinction clearly when the interviewer asks about GET /files/search.
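A sketch of that autocomplete path, modelled here with in-memory SQLite: the left-anchored LIKE keeps the (user_id, name) composite index usable. Table contents and names are illustrative.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE files (file_id TEXT, user_id TEXT, name TEXT)")
db.execute("CREATE INDEX idx_files_user_name ON files (user_id, name)")
db.executemany("INSERT INTO files VALUES (?, ?, ?)",
               [("f1", "u1", "report.pdf"), ("f2", "u1", "resume.docx"), ("f3", "u2", "report.pdf")])

def autocomplete(user_id: str, prefix: str, limit: int = 20):
    # Left-anchored LIKE keeps the composite index usable; '%' or '_' characters
    # in the user-supplied prefix would need escaping in a real implementation.
    return db.execute(
        "SELECT file_id, name FROM files WHERE user_id = ? AND name LIKE ? ORDER BY name LIMIT ?",
        (user_id, prefix + "%", limit),
    ).fetchall()

print(autocomplete("u1", "re"))   # [('f1', 'report.pdf'), ('f2', 'resume.docx')]
```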
Core flow: upload & sync
The durability NFR (11 nines) and the upload bandwidth NFR (delta-only) from §2 drive every step in the upload flow, along with the quota check added in §8. The key question is: at what point does the server commit a file version, and what happens on device B when device A completes an upload?
Client-side early exit: before calling POST /upload/init, the client compares the whole-file content hash against its local version database. If the hash matches the last synced version, the upload is skipped entirely — zero network calls. This is the first and cheapest deduplication check, operating entirely on-device.
Conflict handling: what if Device A and Device B both edit the same file while offline? The server detects a conflict when Device B tries to commit a version whose parent version does not match the current head. Rather than silently overwriting, the system creates a conflict copy — both versions are preserved as separate files (e.g., "report.pdf (Device B's conflicted copy 2026-04-10)"). This matches Dropbox's actual conflict policy. At L5+, interviewers expect this approach: both versions are preserved, no data is lost, and the user resolves the conflict manually, which means the system's job is detection, not merge.
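A sketch of that commit-time check as an optimistic-concurrency UPDATE; column names follow the §8 schema, but transaction handling and the creation of the sibling conflict-copy record are omitted.

```python
import datetime

def commit_version(db, file_id, parent_version_id, new_version_id):
    """Accept the commit only if the client's parent version is still the current head."""
    # The WHERE clause re-checks the head inside the UPDATE itself,
    # so two racing commits cannot both succeed.
    updated = db.execute(
        "UPDATE files SET current_ver = ? WHERE file_id = ? AND current_ver = ?",
        (new_version_id, file_id, parent_version_id),
    ).rowcount
    if updated == 1:
        return {"status": "committed", "version_id": new_version_id}

    # Head moved since this device last synced: preserve both versions, never overwrite.
    stamp = datetime.date.today().isoformat()
    return {
        "status": "conflict",
        "conflict_copy_suffix": f"(conflicted copy {stamp})",
        "version_id": new_version_id,
    }
```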
Quota check in the upload path: before the server performs the missing-chunk lookup (step ③), the Sync API does an optimistic read of quota_used_bytes as an early exit — if the user is clearly over quota, return 507 Insufficient Storage immediately. However, this read-then-check alone has a race condition: two concurrent upload-init requests from the same user can both pass the quota check simultaneously, each within quota individually but together exceeding it. The authoritative enforcement happens at commit time using an atomic compare-and-swap: UPDATE user_quotas SET quota_used_bytes = quota_used_bytes + ? WHERE user_id = ? AND quota_used_bytes + ? <= quota_limit_bytes. If 0 rows are updated, the commit is rejected with 507 and any uploaded chunks are abandoned. This guarantees quota is never exceeded regardless of concurrency. The quota is decremented on soft-delete and reclaimed by the GC pipeline when a version is hard-deleted.
Resumable uploads for large files
For files larger than ~100 MB, network interruptions mid-upload are likely. The system implements the TUS resumable upload protocol (or an equivalent): the client first issues an upload initialization request, then uploads chunks in sequence. Each chunk upload is acknowledged. If the connection drops, the client queries the server for which offset was last committed and resumes from there. This directly addresses the upload bandwidth NFR from §2 — a 2 GB file with a dropped connection at 1.9 GB only re-uploads the last chunk, not the whole file.
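A client-side sketch of the resume handshake in the TUS style: ask the server how many bytes it has durably committed (HEAD, Upload-Offset header), then continue from that offset (PATCH). The upload URL is a placeholder and error handling is minimal.

```python
import requests

TUS_HEADERS = {"Tus-Resumable": "1.0.0"}
NETWORK_CHUNK = 4 * 1024 * 1024  # 4 MB per PATCH request (illustrative)

def resume_upload(upload_url: str, path: str) -> None:
    """Query the last committed offset, then stream only the remainder of the file."""
    offset = int(requests.head(upload_url, headers=TUS_HEADERS).headers["Upload-Offset"])
    with open(path, "rb") as f:
        f.seek(offset)
        while piece := f.read(NETWORK_CHUNK):
            resp = requests.patch(
                upload_url,
                data=piece,
                headers={
                    **TUS_HEADERS,
                    "Upload-Offset": str(offset),
                    "Content-Type": "application/offset+octet-stream",
                },
            )
            resp.raise_for_status()
            # Trust the server's committed offset rather than a local byte counter.
            offset = int(resp.headers["Upload-Offset"])
```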
Async thumbnail & preview generation
Thumbnail generation is a background pipeline, not a synchronous upload step. File browsers display 64×64 icons and hover previews independently of the full file — downloading the complete file just to render the list view would be impractical.
When the Sync API publishes the upload-complete event to the Notification Bus (step ⑨ above), a Thumbnail Worker subscribes as a second consumer on that same topic. For each image or document file, the worker:
- Downloads the first chunk (or the full file for images) from the Block Store using the version's chunk list.
- Generates two thumbnail variants: 64×64 px (file-list icon) and 256×256 px (hover preview).
- Writes both variants to object storage under deterministic keys: thumbs/{file_id}/{version_id}/64.webp and thumbs/{file_id}/{version_id}/256.webp.
- Updates the file_versions row with thumb_url_64 and thumb_url_256 columns, which the metadata API then returns to clients.
Because the thumbnail worker is an async consumer, it does not block the upload acknowledgement: the client gets its file_id + version_id immediately, and thumbnails appear in the UI once the worker finishes (typically within 1–3 seconds for images). For file types without a thumbnail (e.g., .zip), the worker is a no-op and the client falls back to a generic file-type icon.
Common L4+ follow-up: "How does the file browser show a preview without downloading the whole file?" — Answer: the upload-complete event triggers the async thumbnail pipeline above. The client receives thumb_url_64 in the metadata response and fetches the thumbnail from the CDN directly; the full file content is never downloaded just to render the list view.
§7.5 — Offline access: the local sync daemon L5+
The "offline access" functional requirement (§2) is satisfied by a persistent background daemon on the client device, not by the server. The daemon has three responsibilities: watching for local file changes, maintaining a local state database, and flushing queued work on reconnect.
- File system watcher: the daemon registers with the OS-native change API — FSEvents on macOS, inotify on Linux, ReadDirectoryChangesW on Windows. When a watched file changes, the watcher fires an event within milliseconds. The daemon debounces rapid successive changes (e.g., an editor saving every keystroke) with a short idle delay (~500 ms) before treating the file as stable.
- Local state database: a lightweight SQLite database tracks the last-synced version_id, content hash, and sync status (synced | pending_upload | pending_download) for every file in the sync folder. This is the ground truth for what the device believes is current. On app startup or reconnect, the daemon compares the local state DB against the server's GET /sync/delta response to identify any divergence.
- Reconnect flush: when the device comes back online, the daemon walks its local state DB for any rows with pending_upload status — these are edits made while offline. Each is processed through the normal upload flow (§7). Offline edits discovered on reconnect are treated identically to concurrent device edits: the parent_version_id check at commit time detects conflicts, and the conflict-copy mechanism in §7 handles them — no special offline-conflict code path is needed.
Selective sync (marking only certain folders for offline access) works by limiting which paths the daemon watches and which entries it downloads to the local state DB. Files not marked for offline access are stubs: their metadata is local but their chunks are fetched on-demand when opened.
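A minimal sketch of the daemon's local state database and the reconnect flush, using SQLite as named above; column names and the upload callback are illustrative.

```python
import sqlite3

db = sqlite3.connect("sync_state.db")
db.execute("""
CREATE TABLE IF NOT EXISTS local_files (
    path         TEXT PRIMARY KEY,
    version_id   TEXT,                            -- last version this device synced
    content_hash TEXT,                            -- whole-file SHA-256 of the local copy
    status       TEXT NOT NULL DEFAULT 'synced'   -- synced | pending_upload | pending_download
)""")

def mark_dirty(path: str, new_hash: str) -> None:
    """Called by the file-system watcher after the debounce window expires."""
    db.execute("UPDATE local_files SET content_hash = ?, status = 'pending_upload' WHERE path = ?",
               (new_hash, path))
    db.commit()

def flush_on_reconnect(upload) -> None:
    """Walk everything edited while offline and push it through the normal §7 upload flow."""
    rows = db.execute("SELECT path, version_id FROM local_files WHERE status = 'pending_upload'")
    for path, parent_version in rows.fetchall():
        upload(path, parent_version_id=parent_version)   # conflict detection happens server-side
        db.execute("UPDATE local_files SET status = 'synced' WHERE path = ?", (path,))
    db.commit()
```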
Data model
There are four categories of entities in this system: users + accounts, file metadata, chunk index, and sharing + permissions. Each has different access patterns, and those differences determine where each entity lives and how it is indexed.
Access patterns
| Operation | Frequency | Query shape |
|---|---|---|
| Fetch file tree for a user | Very high (on sync / app open) | Range scan by user_id + parent_folder_id |
| Look up chunk by hash | Very high (per chunk on every upload) | Point lookup by SHA-256 hash |
| List version history for a file | Medium (on user request) | Range scan by file_id ordered by created_at DESC |
| Fetch sharing permissions | High (on every file access) | Point lookup by file_id + requesting user_id |
| Write new file version | High (every upload) | INSERT + UPDATE file record (transactional) |
| Compute sync delta (cursor) | High (per sync event) | Ordered scan by account_id + updated_at > cursor timestamp |
All user-facing queries begin with a user_id filter, making user_id the natural sharding key: all records for a given user co-locate on one shard. The chunk hash lookup has no user context at all, which makes it a candidate for a separate, independently-scalable hash index.
Schema
Quota tracking
Storage quotas require two additions to the schema. A user_quotas table (co-located on the same shard as the user's files via user_id) tracks the current and limit values:
-- co-located with files shard via user_id
user_quotas (
user_id UUID PK,                             -- shard key
quota_limit_bytes BIGINT NOT NULL,           -- e.g. 15 GB for free tier
quota_used_bytes BIGINT NOT NULL DEFAULT 0,
updated_at TIMESTAMPTZ
)
quota_used_bytes is incremented atomically when a file version is committed and decremented on soft-delete. The Sync API reads this row at upload-init time and rejects the request with 507 Insufficient Storage if the upload would exceed the limit (see §7). The files table additionally carries deleted_at TIMESTAMPTZ and purge_after TIMESTAMPTZ columns to support the trash/GC pipeline: a daily background job hard-deletes rows where purge_after < NOW(), decrements quota_used_bytes, and decrements chunk_index.ref_count for orphaned chunks.
Version retention and GC lifecycle
The §2 NFR specifies a minimum 30-day version history. On each version creation, purge_after is set to created_at + retention_days (30 days for free tier, configurable per plan). The daily GC job runs DELETE FROM file_versions WHERE purge_after < NOW() and, for each removed version, atomically decrements chunk_index.ref_count for each chunk in the version's chunk list: UPDATE chunk_index SET ref_count = ref_count - 1 WHERE chunk_hash = ?. If the updated ref_count reaches 0, the GC then deletes the blob from the object store by its hash key (an S3/GCS DeleteObject call), and finally removes the chunk_index row. The object store has no ref_count column — it is a content-addressed blob store; all reference tracking lives in chunk_index. This two-step sequence (decrement in index → delete from object store if zero → delete index row) prevents races where a concurrent upload re-references a chunk between the decrement and the delete. This is distinct from the orphaned-chunk GC path (§11) which handles failed uploads — version-expiry GC is the planned lifecycle path for fully committed versions whose retention window has closed.
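A sketch of that decrement-then-delete sequence; delete_blob stands in for the object-store DeleteObject call, and in production the decrement and zero-check would run inside one transaction (or use a conditional delete) to close the race described above.

```python
def gc_expired_version(db, delete_blob, chunk_hashes):
    """Release one expired version's chunks: decrement first, delete the blob only at zero."""
    for chunk_hash in chunk_hashes:
        # Step 1: atomic decrement in the chunk index.
        db.execute("UPDATE chunk_index SET ref_count = ref_count - 1 WHERE chunk_hash = ?",
                   (chunk_hash,))
        (refs,) = db.execute("SELECT ref_count FROM chunk_index WHERE chunk_hash = ?",
                             (chunk_hash,)).fetchone()
        if refs == 0:
            # Step 2: no committed version references this chunk any more, so remove the blob...
            delete_blob(chunk_hash)
            # Step 3: ...then drop the index row. Keeping the index row until the blob is gone
            # prevents a concurrent upload from "deduplicating" against a blob being deleted.
            db.execute("DELETE FROM chunk_index WHERE chunk_hash = ?", (chunk_hash,))
    db.commit()
```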
Why denormalize user_id into file_versions? Sharding strategy ›
In a sharded database, cross-shard joins are expensive or impossible. If a query needs file_versions for a file owned by a user on shard 3, the version data must also be on shard 3. Including user_id in file_versions (even though it duplicates the files table relationship) ensures the shard router can keep all a user's related records co-located. This denormalization is intentional, not a mistake.
Caching strategy
The caching architecture follows from two NFRs: metadata reads must complete in under 200 ms (§2), and file downloads must start within 2 seconds from the nearest edge (§2). These are satisfied by different cache layers operating at different points in the request path established in §4.
| Cache layer | What's cached | TTL | Why it exists |
|---|---|---|---|
| Client local disk | Full file chunks for recently accessed files | LRU until storage limit | Offline access; zero latency for recently seen files |
| CDN edge nodes | Chunk objects by SHA-256 hash | Indefinite (content-addressed) | File content never changes for a given hash; cache hits are always valid |
| Redis cluster | File metadata, session tokens, permission checks | 60–300 s per entry | Metadata reads happen on every sync event; DB round-trip would bottleneck the API |
| API server in-process | User account info, routing config | 30–60 s | Avoid Redis hop for extremely hot data (logged-in user's account) |
Cache invalidation
Chunk objects are immutable by design (content-addressed), so the CDN never needs invalidation for chunks: the same hash always returns the same bytes. Every metadata write triggers a cache invalidation event: after each successful commit, the Sync API publishes to a durable Kafka topic (the same cluster already in the design), and consumers clear the relevant Redis keys. Using a durable topic rather than Redis pub/sub ensures that a temporarily disconnected consumer does not permanently miss invalidation events. Redis pub/sub is fire-and-forget: it delivers nothing to subscribers that restart or lag. This keeps cache coherence bounded to the event propagation latency, typically under 100 ms within a region.
Content-addressing eliminates the hardest part of CDN caching: because every chunk is keyed by its SHA-256 hash, CDN caching is trivially correct. There is no cache poisoning risk, no stale-data risk, and no need for cache-busting URLs. A chunk served from a Berlin CDN edge node 2 years after it was first uploaded is guaranteed to be byte-identical to what was stored at upload time.
Deep-dive: scalability
At 50M+ DAU the system faces four distinct scaling challenges: the chunk hash lookup table becomes a hot global bottleneck; metadata shards become unevenly loaded for power users; the sync notification layer must hold millions of persistent connections; and cross-region replication introduces consistency trade-offs.
Scaling the chunk hash index (global dedup table) L5 deep-dive ›
The chunk hash index is the single hottest table in the system. Every upload — by every user — must query it before uploading a chunk. At 50M DAU uploading 3 files/day averaging 10 chunks each, that's ~17,000 hash lookups per second sustained, with 5× peak during business hours (~85K/s).
A sharded key-value store (such as Apache Cassandra or Amazon DynamoDB) handles this naturally: the partition key is the SHA-256 hash (first 8 bytes), distributing evenly across nodes. Reads are single-key lookups with no fan-out. The table schema is simple: chunk_hash → {storage_path, size_bytes, ref_count}. At 10^11 unique chunks with 512 bytes per row, total table size is ~50 TB — comfortably within DynamoDB or Cassandra's capacity model.
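A sketch of the dedup lookup against such a key-value index, modelled here with DynamoDB via boto3; the table and attribute names mirror the schema above but are assumptions.

```python
import boto3

table = boto3.resource("dynamodb").Table("chunk_index")

def missing_chunks(chunk_hashes: list[str]) -> list[str]:
    """Return the subset of hashes the client still needs to upload.

    Single-key point reads: the SHA-256 hash itself is the partition key,
    so load spreads uniformly across partitions with no hot keys.
    """
    missing = []
    for h in chunk_hashes:
        resp = table.get_item(Key={"chunk_hash": h}, ProjectionExpression="chunk_hash")
        if "Item" not in resp:
            missing.append(h)
    return missing
```

In practice a batched read (batch_get_item) would cut round-trips for multi-chunk uploads; the per-key loop above keeps the access pattern obvious.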
Metadata shard hot-spots and large-account handling L6 deep-dive ›
Sharding by user_id distributes load evenly for average users. But enterprise customers with shared drives accessed by thousands of employees create a different problem: all write traffic for that shared namespace hits one shard. The shared drive is a logical account, and all editors write to the same shard concurrently.
The solution is hierarchical sharding: a shared drive gets its own synthetic "owner ID" distinct from any individual user ID. Traffic targeting that drive is further sub-sharded by folder ID within the drive, distributing concurrent write load across multiple shards. The metadata service maintains a routing table mapping drive IDs to their shard assignments.
Notification at 15M concurrent connections L6 deep-dive ›
A WebSocket server handling 300K connections per node (using epoll-based event loop I/O — Go, Node.js, C++) requires ~50 nodes per region for 15M concurrent connections. When a file changes, the Sync API publishes a change event to a durable message queue (Kafka topic = account_id). A notification service fan-out layer reads partitions by account_id and writes to the correct WebSocket server, which knows which client connections belong to that account. For files in shared folders, this fan-out must also look up the file_shares table for the affected file_id and publish change events to each grantee_id's notification topic — otherwise collaborators never receive push notifications for changes to files they have been granted access to. This means the fan-out requires a metadata lookup step and cannot be a simple per-account Kafka partition route.
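A sketch of that fan-out step with the queue client abstracted away; lookup_grantees and publish stand in for the file_shares query and the per-account topic write, and the topic naming is illustrative.

```python
def fan_out_change_event(event: dict, lookup_grantees, publish) -> None:
    """Deliver one file-change event to the owner's devices and to every collaborator.

    `event` carries at least file_id and the owner's account_id; `lookup_grantees`
    queries the file_shares table; `publish` writes to a per-account notification topic.
    """
    recipients = {event["account_id"]}                     # the owner's own devices
    recipients.update(lookup_grantees(event["file_id"]))   # users the file is shared with
    for account_id in recipients:
        publish(topic=f"notify.{account_id}", message=event)
```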
WebSocket servers must be stateful (they hold open connections) but must not be in the write path. If the WebSocket server goes down, clients fall back to polling the /sync/delta endpoint with their cursor. The cursor-based sync model means no sync events are permanently lost — a client that was offline simply polls on reconnect and catches up.
Cross-region consistency and the sync cursor L7 deep-dive ›
Metadata is written to the primary region and replicated asynchronously to replica regions with typical lag of 50–200 ms. A client in EU may read from the EU replica and see a slightly stale file tree. This is acceptable for the sync use case — eventual consistency bounded by replication lag is the explicit consistency model from §2. The sync cursor encodes a timestamp; a client reading stale data will receive the correct state on the next poll once replication catches up.
For writes, all clients route metadata mutations to the primary region. The cross-region latency adds 80–120 ms to write operations for EU clients writing to US-East primary. At L7, the discussion is whether to implement active-active multi-primary (each region accepts writes, with conflict resolution via vector clocks (logical timestamps that track causal ordering across concurrent writes)): the same problem Google Docs and CRDTs (Conflict-free Replicated Data Types — data structures where any two concurrent edits merge deterministically) solve, but for file metadata rather than document content.
Erasure coding for storage durability L7 deep-dive ›
Triple replication (3 full copies across 3 AZs) provides 11-nines durability but at 3× storage cost. Reed-Solomon erasure coding (RS 6+3 is common) splits each chunk into 6 data shards and 3 parity shards. Any 6 of the 9 shards can reconstruct the original chunk — tolerating 3 simultaneous shard failures. Storage overhead is 9/6 = 1.5×, vs 3× for triple replication. At petabyte scale, this is a 50% reduction in raw storage cost: the economic case is compelling.
The tradeoff: reading an erasure-coded chunk requires reading at least 6 shards in parallel and decoding via Galois field arithmetic (polynomial interpolation over GF(2⁸)) — not simple XOR, which only applies to single-parity schemes like RAID-5. This adds CPU overhead and small latency. Hot chunks in the CDN cache bypass this — they are served as whole objects. Only CDN-cold chunks require reconstruction.
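The durability comparison can be made concrete with a toy model: assume each shard independently fails with some probability before it can be repaired, and a chunk is lost only when more shards fail than the scheme tolerates. The 10^-3 per-repair-window failure probability below is purely illustrative; real durability math must also model repair time, correlated failures, and scrubbing.

```python
from math import comb

def chunk_loss_probability(n_total: int, n_data: int, p: float) -> float:
    """Probability a chunk is unrecoverable, given each shard independently fails
    with probability p before repair. Loss requires more than (n_total - n_data) failures."""
    tolerated = n_total - n_data
    return sum(comb(n_total, k) * p**k * (1 - p)**(n_total - k)
               for k in range(tolerated + 1, n_total + 1))

p = 1e-3  # illustrative: chance a shard is lost within one repair window

print(f"3x replication: overhead 3.0x, P(loss) ~ {chunk_loss_probability(3, 1, p):.1e}")
print(f"RS 6+3:         overhead 1.5x, P(loss) ~ {chunk_loss_probability(9, 6, p):.1e}")
```

Under these assumptions the 9-shard scheme tolerates three concurrent failures at half the storage overhead of three full copies, which is the economic argument made above.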
Failure modes & edge cases
| Scenario | Problem | Solution | Level |
|---|---|---|---|
| Upload exceeds user quota | User attempts to upload a file that would push their quota_used_bytes over quota_limit_bytes | Sync API does an optimistic quota read at upload-init for a fast early exit, but enforces quota atomically at commit time via UPDATE user_quotas SET quota_used_bytes = quota_used_bytes + ? WHERE user_id = ? AND quota_used_bytes + ? <= quota_limit_bytes. If 0 rows are updated, the commit is rejected with 507 Insufficient Storage. This prevents the race condition where two concurrent uploads both pass the init-time check but together exceed quota. Client surfaces "storage full" UI and optionally links to upgrade flow. | L3/L4 |
| Upload interrupted mid-stream | Client uploads 80% of chunks, then loses connection; server has partial data | TUS-protocol resumable uploads: client queries upload cursor on reconnect, re-uploads only from last committed chunk. Upload sessions expire after 24 hours; orphaned chunks are cleaned by background GC. | L3/L4 |
| Chunk store unavailable | Object storage returns 503; uploads fail | Return 503 to client immediately (don't buffer chunks in API servers). Client retries with exponential backoff. The system uses an object store with 99.99% availability SLA — outages are rare but must be handled gracefully by the client library. | L3/L4 |
| Two devices save the same file concurrently | Device A and Device B both modify a file while offline; first commit wins, second overwrites data | The server rejects the second commit if its parent_version_id doesn't match the current head. The Sync API creates a "conflict copy" (Device B's conflicted copy) instead of overwriting. No data is lost. User resolves manually. | L5 |
| Metadata DB shard fails | A shard failure makes all files for affected users inaccessible | Each shard has a primary + 2 synchronous replicas. On primary failure, the replica promotes automatically (typically in 10–30 s). During promotion, writes are rejected (503); reads may serve from a slightly stale replica. Clients retry with backoff. | L5 |
| Chunk hash collision (deliberate) | Attacker crafts a chunk whose SHA-256 matches an existing chunk they don't own, "stealing" access | Require proof-of-possession: client must upload the first 4 KB of any chunk it claims to already own. Server validates hash of the fragment matches. This makes hash-stealing attacks require actually having the content. | L5 |
| Notification delivery failure | WebSocket connection drops; change event not delivered to Device B | Events are also persisted in Kafka with a 7-day retention window. On reconnect, Device B polls /sync/delta with its last cursor. The cursor-based protocol ensures all events are catchable on poll regardless of push reliability. | L5 |
| Chunk hash index unavailable | The global chunk hash index (DynamoDB/Cassandra) is a synchronous dependency in every upload: the Block Service must perform the dedup lookup (step ③) before any chunk can be accepted or skipped. If the index is unavailable, all upload-init requests fail at the dedup check. | Add a circuit breaker on the hash index client. On open circuit, degrade to always-upload mode: skip the dedup lookup, accept all chunks unconditionally, and reconcile orphaned duplicates in the next GC run. Return a response header indicating dedup was bypassed so clients do not retry unnecessarily. Restore dedup checks once the circuit closes. Alternatively, return 503 with a Retry-After header if degraded-mode storage cost is unacceptable. | L5 |
| Orphaned chunks (upload abort after block write) | Block is written to chunk store but metadata commit never happens; storage leak accumulates | Upload sessions have a 24-hour expiry. A background garbage collector runs daily, scanning upload_sessions for expired/uncommitted sessions and deleting associated chunk_store objects whose ref_count has never been incremented. GC uses a two-phase check (mark then delete with a 1-hour delay) to guard against slow commits. Critically, the final delete must re-verify ref_count = 0 at deletion time: re-read chunk_index.ref_count atomically before issuing the object-store DeleteObject call, because a new upload could have incremented ref_count to 1 between the mark and delete phases. Only delete the object store blob and chunk_index row if ref_count is still 0 at that moment. | L7/L8 |
| Region-wide outage | US-East primary region loses connectivity; EU users can't write files | EU read replicas serve reads from the last consistent snapshot. Writes are queued in client (up to 24 hours of local changes). On region recovery, clients flush their local write queue. Active-active would eliminate this window but adds significant complexity (see §10 deep-dive). | L7/L8 |
How to answer by level
The file storage question is calibrated by how deep you go on the sync protocol, chunking algorithm, and consistency model. An L3 answer covers the mechanics; an L7 answer owns the cost structure and consistency trade-offs.
L3/L4 What good looks like at this level ›
- Identifies two core services: file metadata store and file content store
- Proposes chunked uploads to handle large files and network failures
- Designs a simple sync mechanism: client polls for changes, downloads new versions
- Handles the basic upload flow end-to-end: chunk → upload → commit metadata
- Mentions version history as a list of file records with timestamps
files (
file_id UUID PK,
user_id UUID,
path TEXT,
s3_url TEXT,
version INT,
updated_at TIMESTAMPTZ
)
No sharding concerns at this level. One table is enough to answer "what changed?" via a WHERE user_id = ? AND updated_at > ? scan.
- L3/L4 uses polling for sync; L5 introduces event-driven push
- L3/L4 may re-upload entire files; L5 proposes delta / chunk-diff sync
- L3/L4 doesn't bring up deduplication as a storage strategy
- L3/L4 uses a single flat file store; L5 separates metadata and blocks explicitly
L5 What good looks like at this level ›
- Explains content-defined chunking (CDC) and why it's superior to fixed-size for delta sync
- Proposes SHA-256 content addressing for deduplication; identifies the global hash index as a separate component
- Designs the "/upload/init → upload chunks → /upload/complete" two-phase commit
- Proposes WebSocket or long-poll for sync notifications instead of polling
- Identifies conflict resolution as a distinct problem and proposes conflict copies
- L5 proposes dedup; L6 reasons about privacy implications and proof-of-possession
- L5 designs for a single region; L6 reasons about cross-region read/write separation
- L5 describes conflict copies; L6 probes further (CRDTs, version vectors, shared drive conflicts with multiple owners)
- L5 caches metadata; L6 designs invalidation and discusses cache consistency under concurrent writes
L6 What good looks like at this level ›
- Owns the chunk index scaling problem end-to-end: partitioning strategy, concurrent upload race condition, GC for orphaned chunks
- Addresses the enterprise shared-drive shard hot-spot and proposes hierarchical sub-sharding
- Designs the notification connection server tier separately from the API tier, with cursor-based fallback
- Discusses cross-region metadata replication lag and its effect on sync cursors
- Proposes erasure coding (RS 6+3) vs. triple replication and reasons about cost vs. latency trade-off
- L6 designs for the current scale; L7 asks "what's the cheapest this system can get while meeting SLAs?"
- L6 accepts active-passive multi-region; L7 analyzes active-active trade-offs, vector clocks, and CRDT approaches
- L7 frames build vs. buy decisions at petabyte scale (e.g., Dropbox's decision to build Magic Pocket vs. staying on S3)
L7/L8 What good looks like at this level ›
- Frames the cost structure: storage dominates; dedup ratio and erasure coding are the two levers. Example at 50M DAU, 5 GB avg stored per user, $0.023/GB/month:
  | Scenario | Retention factor | Stored (PB) | Annual cost |
  |---|---|---|---|
  | Baseline dedup | 0.70× | 175 PB | ~$48 M/yr |
  | Improved dedup | 0.50× | 125 PB | ~$34 M/yr |

  Delta: ~$14 M/yr saved from a 0.20 improvement in the retention factor. Consistent with the interactive estimator defaults.
- Designs multi-primary active-active metadata with causal consistency (version vectors) and probes the consistency boundaries
- Identifies which parts of the system are genuinely novel engineering problems vs. solved infrastructure
- Over-engineering: proposing global strong consistency when eventual suffices
- Ignoring cost modeling — L7 must reason about money, not just correctness
- Not probing failure modes of their own proposed solutions (e.g., what breaks in active-active under network partition)
Classic probes table
| Question | L3/L4 answer | L5/L6 answer | L7/L8 answer |
|---|---|---|---|
| How does Device B know a file changed on Device A? | Polling every N seconds | Push notification via WebSocket / long-poll from Notification Bus; cursor-based delta pull on reconnect | Same, plus reasons about mobile constraints (APNs/FCM) and the economics of 15M persistent connections per region |
| How do you limit the upload bandwidth for a 5 GB video? | Split the file into chunks and upload in sequence | CDC chunking → hash → only upload missing chunks; TUS resumable protocol for interruption safety | Same, plus client-side adaptive bitrate for upload (detect congestion, reduce chunk rate), and server-side per-user upload quota enforcement at the load balancer layer |
| What happens when two people edit the same file simultaneously? | Last write wins (data loss) | Conflict copy: both versions preserved; parent_version_id mismatch triggers conflict creation; user resolves manually | Conflict copy for binary files; CRDT or OT for text files; CRDTs eliminate conflicts structurally but require intent-preserving merge semantics |
| How do you achieve 11-nines durability? | Backup the data frequently | Multi-AZ replication with erasure coding (RS 6+3), synchronous write to 2+ shards before ACK, bit-rot scrubbing | Same, plus cross-region replication for disaster recovery, integrity audit pipelines (checksumming at rest), and the math: at RS 6+3 with independent disk failure rates, you can compute the annual probability of data loss explicitly |
How the pieces connect
Trace a single 1-byte edit end to end
Device A makes a 1-byte change inside a 100 MB file. Here is every system touched, in order:
- CDC re-chunks the file (§5): only the chunk whose boundary shifted changes; the other ~6,000 chunks produce identical hashes. One new chunk is identified.
- Hash lookup in the chunk index (§4, §8): the client sends the new chunk's SHA-256 to POST /upload/init. The chunk index confirms the hash is not already stored; the server returns one presigned upload URL.
- PUT to Block Store (§4): the client uploads one chunk directly to the Block Store, bypassing the Sync API. The only bytes that cross the network are the one changed chunk: the upload bandwidth NFR (§2) is satisfied.
- Metadata commit (§7, §8): POST /upload/complete writes a new file_versions row pointing to the updated chunk list. The parent_version_id check ensures no concurrent edit from another device has been silently overwritten.
- Kafka event published (§4): the Sync API emits a change event async to the Notification Bus (step ⑨ in Fig. 3). This decouples the upload acknowledgement from the sync fan-out.
- Device B wakes via WebSocket (§10): the WebSocket gateway, subscribed to the Kafka topic for this user, pushes "file changed" to all of Device B's open connections within the sync latency NFR window (< 5 s, §2).
- Device B pulls delta (§7): Device B calls GET /sync/delta?cursor=…. The server returns the new chunk list diff: one new chunk hash and its presigned download URL. Device B downloads one chunk. The file is reconstructed from the unchanged cached chunks plus the one new chunk — no other data moves.
NFR chain traces
- Upload bandwidth NFR (§2) → fixed-size chunking causes boundary shift on insertion → CDC chunking chosen (§5) → only modified chunks re-uploaded → sync latency NFR also satisfied (§7)
- Storage efficiency NFR (§2) → global deduplication required → chunks keyed by SHA-256 → global chunk index table (§4, §8) → the index becomes the single hottest component and must be independently scaled (§10)
- Sync latency < 5 s (§2) → polling is too slow at scale → event-driven notification bus (§4) → Kafka fan-out to WebSocket gateway → cursor-based /sync/delta as durable fallback (§7)
- Durability 11 nines (§2) → triple replication alone costs 3× storage → erasure coding RS 6+3 (§10) reduces overhead to 1.5× at comparable durability → storage cost halved at petabyte scale
- Concurrent device edits (§11) → last-write-wins causes data loss → parent_version_id check + conflict copies (§7) → no data lost, user resolves → connects forward to CRDT trade-off discussion at L7 (§12)
- Content-addressable chunks (§5) → same hash = same content, always valid → CDN caching never needs invalidation for chunk content (§9) → cache TTL can be indefinite → file read latency hits CDN p99 (< 20 ms) rather than origin p99 (< 200 ms)