Design a File Storage System
The interface is a folder. Behind it: chunked uploads, content-addressed deduplication, delta sync across devices, and conflict resolution that can't lose a byte.
What the interviewer is testing
A file storage and sync system — Dropbox, Google Drive — looks like a managed folder. Underneath it spans chunked uploads, content-addressed deduplication, delta sync, offline-first conflict resolution, and a metadata layer that must track billions of file versions with sub-second query times.
Interviewers use this question to probe whether you think about data at rest (block storage, dedup) separately from data in motion (the sync protocol), and whether you understand how those two planes interact under failure conditions. Three tensions drive the core tradeoffs: bandwidth efficiency vs. simplicity, strong consistency vs. availability during partitions, and storage cost vs. query flexibility.
| Level | Primary focus | Hard question at this level |
|---|---|---|
| L3/L4 | Working upload/download with metadata tracking | How does a client know what changed while it was offline? |
| L5 | Chunked uploads, content addressing, delta sync | How does chunking interact with deduplication across users? |
| L6 | Conflict resolution, versioning, resumable uploads | Two devices edit the same file simultaneously — what happens? |
| L7/L8 | Cross-region consistency, CDN strategy, cost at petabyte scale | How do you cut storage cost by 60% without degrading read latency? |
The single most differentiating question across all levels is the sync protocol. How does a client discover what changed, fetch only what it needs, and handle conflicts, without polling constantly or losing data? A candidate who handles the sync protocol clearly tends to derive the remaining design decisions correctly: the sync model constrains the metadata schema, the notification architecture, and the conflict resolution approach.
Requirements clarification
Two designs hide under this question: a consumer sync tool (Dropbox-like, mobile + desktop, strong sync) and an enterprise document platform (Google Drive-like, collaborative editing, sharing, search). The scope changes which problems are hard. Confirm scale (DAU, average file size, total storage) before assuming either.
Functional requirements
| Requirement | Description |
|---|---|
| Upload files | Clients upload files of any type and size (up to 5 GB per file) |
| Download files | Clients download files on any device associated with their account |
| Sync | Changes on one device propagate automatically to all other devices for the same account |
| Version history | Users can view and restore previous versions of any file (minimum 30 days) |
| File sharing | Files and folders can be shared with other users or via a public link |
| Offline access | Files marked for offline access are available without a network connection |
| Conflict resolution | When two devices edit the same file concurrently, both versions are preserved (no silent data loss) |
Non-functional requirements
| NFR | Target | Notes |
|---|---|---|
| Availability | 99.99% (52 min/year) | Reads must succeed even during write outages |
| Durability | 99.999999999% (11 nines) | No data loss after upload acknowledgement |
| Sync latency | < 5 s for files < 1 MB | From upload-complete on device A to available on device B |
| Upload bandwidth | Delta uploads (only changed chunks) | A 1-byte edit in a 100 MB file must not re-upload 100 MB |
| Read latency (p99) | < 200 ms for metadata; < 2 s TTFB for large files | Metadata served from cache; files from nearest CDN edge |
| Storage efficiency | Global deduplication across all users | A file uploaded by 10,000 users is stored only once |
| Consistency | Eventual (sync events); strong (metadata writes per file) | Upload completion is strongly consistent; cross-device propagation is eventual |
NFR reasoning
Durability: 11 nines (99.999999999%) Storage architecture driver ›
11 nines means roughly one file lost per 100 billion stored per year. This is the standard SLA for consumer file storage and matches what AWS S3 and Google Cloud Storage guarantee. Achieving it requires erasure coding (or triple replication) across multiple availability zones, plus periodic integrity verification (bit-rot scrubbing). It is not achievable with single-region storage alone.
Upload bandwidth: delta-only (changed chunks) Sync protocol driver ›
Mobile and home network upload bandwidth is 2–20 Mbps for most users. Re-uploading an entire large file on every save would make the product unusable for video editors or anyone working with large documents. Chunking solves this: only the changed chunks are uploaded. If a 1 GB video has only its last 10 MB modified, the upload is ~10 MB, not 1 GB — a 100× bandwidth reduction.
Storage efficiency: global deduplication Cost driver ›
Studies of real cloud storage systems show 20–50% of stored data is identical across different user accounts. A popular PowerPoint template uploaded by 10,000 users would consume 1 TB at naïve storage, 100 MB with deduplication. At petabyte scale, global dedup reduces raw storage costs by 30–50%, which can make the difference between a profitable and unprofitable storage business at consumer pricing.
Capacity estimation
File storage is a storage-bound problem, not a compute-bound one. Three numbers drive the cost model: total raw storage accumulated over time, the deduplication ratio that compresses it, and peak inbound bandwidth during business hours.
Interactive capacity estimator
Key insight: The metadata database (file records, version history, sharing permissions) grows independently of the chunk store — and is far more expensive per byte. At 50M DAU uploading 3 files/day for 5 years, you accumulate ~270 billion file records. A relational row at ~500 bytes means ~135 TB of metadata alone, which is why file metadata is typically sharded by user ID and stored in a distributed SQL or document database, separate from the actual file content.
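These figures fall out of simple arithmetic; a short script makes them reproducible. The inputs (50M DAU, 3 uploads/day, 5 years, ~500 bytes per metadata row) are the assumptions stated above, not measurements.

```python
# Back-of-the-envelope metadata capacity model, using the assumptions from this section.
DAU = 50_000_000          # daily active users (assumption)
UPLOADS_PER_DAY = 3       # files uploaded per user per day (assumption)
YEARS = 5
ROW_BYTES = 500           # average metadata row size (assumption)

file_records = DAU * UPLOADS_PER_DAY * 365 * YEARS
metadata_bytes = file_records * ROW_BYTES

print(f"file records:  {file_records / 1e9:.0f} billion")   # ~274 billion
print(f"metadata size: {metadata_bytes / 1e12:.0f} TB")      # ~137 TB
```

The rounded ~270 billion records and ~135 TB figures quoted above come from this same calculation.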
High-level architecture
A file storage system has two fundamentally different data planes: the metadata plane, which tracks what files exist, who owns them, and what version each device has seen; and the block plane, which stores the actual file content as immutable, content-addressed chunks. Keeping these planes separate is what makes global deduplication and delta sync possible: metadata can be strongly consistent without imposing that cost on the block store, and blocks can be immutable and content-addressed without requiring the metadata layer to share those constraints.
The system splits into four functional clusters: a sync API tier that clients talk to, a metadata service backed by a sharded relational database, a block service backed by a content-addressable blob store, and a notification bus that pushes change events to subscribed client devices.
Component breakdown
CDN (content delivery network): caches file chunks at edge nodes globally. File reads are by far the highest-volume operations. Serving them from the nearest edge node reduces latency from hundreds of milliseconds (origin round-trip) to single-digit milliseconds for most users. The CDN stores chunks by their content hash: the same hash always resolves to the same bytes, so cached chunks never require invalidation.
API Gateway + Sync API servers: the stateless tier that authenticates requests, enforces rate limits, and routes to the correct downstream service. The Sync API is split from the data path: clients GET files via CDN-signed URLs and PUT chunks directly to the Block Service's upload endpoints. The Sync API orchestrates the metadata transaction that records each completed upload.
Metadata Service: tracks every file record, folder tree, sharing permission, and version history entry. This is the most query-intensive service; every sync operation begins here to discover what changed. It runs on a sharded relational database — MySQL via Vitess (the sharding layer Dropbox itself used), or Google Cloud Spanner for globally replicated consistency — partitioned by user ID so that all queries for a given user's file tree hit the same shard. The relational model is chosen over document or wide-column stores because uploading a file must atomically create a file_versions row and update files.current_ver — a multi-row ACID transaction that eventually-consistent document stores cannot provide without application-level compensating logic.
Block Service: receives raw chunk uploads, computes SHA-256 fingerprints, checks the global dedup table, and either confirms the chunk already exists or writes it to the Chunk Store. This service is stateless and owns no data: all state lives in the hash lookup table and the Chunk Store.
Chunk Store: a durable, content-addressed object store (a managed object storage service like S3 or GCS) where each chunk is keyed by its SHA-256 hash. Objects are immutable once written. Because chunks are identified by content hash, two identical files uploaded by different users share the same physical chunk objects — deduplication is free at the storage layer.
Notification Bus: when a file upload completes, the Sync API publishes a change event to a durable message queue (such as Apache Kafka). A notification service fans out that event to all other devices for the same account connected via WebSocket or HTTP long-poll. Devices receive the event and know to pull the updated file manifest from the Metadata Service.
Cache: an in-memory store (Redis) sitting in front of the Metadata Service for hot read paths — active sessions, recently accessed file metadata, sharing permission checks. Because the Metadata Service owns all writes, it can push invalidation events after each commit without coordinating with other writers — there is no distributed invalidation problem.
Architectural rationale
Why separate the metadata plane from the block plane? Core architecture ›
Metadata (file names, folder trees, version history) has very different access characteristics from file content: it is small (hundreds of bytes per record), queried constantly (every sync event), and needs strong consistency so clients agree on the current version. File content is large (MB to GB per file), mostly read-heavy after initial upload, and benefits from content-addressed immutability. Mixing them in one service forces both to use the same consistency and latency model, which is wrong for both.
Why clients GET files via CDN-signed URLs, not through API servers? Bandwidth cost ›
Routing file downloads through API servers would require each server to proxy gigabytes of data — pulling from object storage and streaming to the client. This consumes server CPU, memory, and egress bandwidth at API server cost. Generating a time-limited pre-signed URL (configurable TTL — commonly 15 minutes for download URLs; the AWS SDK default is 3600 s, tune downward for sensitive files) allows the client to download directly from the CDN or object store, bypassing the API tier entirely. The API server's job in the download path is only to authorise the request and generate the URL.
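As a sketch of that authorise-and-sign step, here is how a time-limited download URL could be minted with boto3 against an S3-compatible chunk store. The bucket name and key layout are placeholders; the 900 s TTL is the 15-minute figure mentioned above.

```python
import boto3

s3 = boto3.client("s3")

def make_download_url(chunk_hash: str, ttl_seconds: int = 900) -> str:
    """Return a time-limited URL the client fetches directly from the object store/CDN.

    The API server only authorises the request and signs the URL; no file bytes
    ever pass through it. Bucket and key layout are illustrative.
    """
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "chunk-store", "Key": f"chunks/{chunk_hash}"},
        ExpiresIn=ttl_seconds,
    )
```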
Why WebSocket / long-poll for sync notifications instead of polling? Sync latency ›
With 50M active devices polling every 30 s, that's ~1.7M req/s of overhead regardless of what changed: the system pays that cost at all times. Push-based notifications (WebSocket subscriptions or HTTP long-poll) flip this: the server only sends a message when something actually changed. For most devices in a typical hour, that is zero messages: the connection is idle.
Real-world comparison
| Decision | This design | Dropbox | Google Drive |
|---|---|---|---|
| Chunk size strategy | Content-defined (rolling hash) | Fixed 4 MB (historically); rolling hash in modern client | Fixed block size for binary files; byte-level diff for Docs |
| Metadata storage | Sharded RDBMS (sharded by user ID) | MySQL with custom sharding (Edgestore) | Google Spanner (globally replicated) |
| Block storage | S3-compatible object store | Amazon S3 + custom Magic Pocket | Google Colossus (distributed file system) |
| Deduplication scope | Global (all users) | Global block-level dedup | Per-user (privacy constraints limit cross-user dedup) |
| Sync protocol | Event-driven (push change events) | LongPoll + delta sync | Push via APNs/FCM + polling fallback |
The right architectural choices follow from requirements, not from brand preference. Dropbox's decision to build Magic Pocket (its own object storage) came from S3 egress costs at petabyte scale. Google Drive's Spanner choice reflects the need for strong consistency across global regions — a tradeoff that makes sense when the product also powers Google Workspace collaboration.
Chunking & deduplication
The central algorithmic question in this design is: how do you split a file into chunks such that the same logical content produces the same chunks, across different file sizes and edit positions? The answer determines how effective delta sync and cross-user deduplication are in practice.
Our choice for this system is content-defined chunking (CDC) using a Rabin fingerprint sliding window. The algorithm slides a 48-byte window over the file's bytes and cuts a chunk boundary whenever the 32-bit rolling hash matches a target pattern. As a concrete example: requiring the lowest 13 bits to be zero yields an expected chunk size of 2^13 = 8 KB. For a 4 MB target, require the lowest 22 bits to match (2^22 ≈ 4 MB). Minimum and maximum chunk size bounds prevent degenerate tiny or huge chunks.
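A minimal sketch of content-defined chunking, using a simpler Gear-style rolling hash as a stand-in for a full Rabin fingerprint. The 8 KB target (lowest 13 bits zero) matches the example above; the 2 KB minimum and 64 KB maximum bounds are illustrative.

```python
import hashlib
import random

random.seed(0)
GEAR = [random.getrandbits(64) for _ in range(256)]  # fixed random byte-to-int table

MIN_CHUNK, MAX_CHUNK = 2 * 1024, 64 * 1024
BOUNDARY_MASK = (1 << 13) - 1   # lowest 13 bits zero -> ~8 KB expected chunk size

def chunk_file(data: bytes):
    """Yield (sha256_hex, chunk_bytes) pairs with content-defined boundaries."""
    start, h = 0, 0
    for i in range(len(data)):
        h = ((h << 1) + GEAR[data[i]]) & 0xFFFFFFFFFFFFFFFF   # rolling hash over recent bytes
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & BOUNDARY_MASK) == 0) or length >= MAX_CHUNK:
            piece = data[start:i + 1]
            yield hashlib.sha256(piece).hexdigest(), piece
            start, h = i + 1, 0
    if start < len(data):                                      # trailing partial chunk
        piece = data[start:]
        yield hashlib.sha256(piece).hexdigest(), piece
```

Because boundaries depend only on nearby bytes, an edit perturbs at most the chunk containing it (and occasionally its neighbour); later boundaries re-synchronise, so the remaining chunks keep their hashes and are skipped by deduplication.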
Deduplication via content addressing
After chunking, each chunk is identified by its SHA-256 fingerprint. Before uploading, the client sends the Block Service a list of chunk hashes. The Block Service checks each hash against the global chunk index. The Block Service skips any chunk whose hash is already in the index and notifies the client of which chunks to omit. Only the remaining chunks are uploaded. This achieves both intra-user dedup (re-uploading a file you already have) and inter-user dedup (uploading a file that someone else already uploaded).
Hash collision risk: is SHA-256 safe? Security · L5 probe ›
SHA-256 produces a 256-bit fingerprint. The birthday-paradox collision probability for N chunks is approximately N²/2^256. At 10^15 chunks (roughly an exabyte of data even at a 1 KB average chunk size), that works out to ~10^30/10^77 ≈ 10^-47 — astronomically small. In practice, accidental collision is not a meaningful risk.
The real threat is intentional collision: a malicious user crafting a chunk that has the same hash as a chunk belonging to another user, enabling them to "own" that chunk without uploading it. SHA-256 is currently collision-resistant against known attacks, but the system should also perform byte-level verification on download if tampering is a concern.
Cross-user dedup and privacy: the "hash proof" pattern Privacy · L6 probe ›
Cross-user dedup implies that if you know the hash of a file you don't have access to, you could "own" a copy of it for free (by just proving you know its hash). This is the "Dropbox hash theft" attack discovered in 2011. The mitigation is requiring the uploader to prove they have the content, not just the hash — either by uploading a fragment (a proof-of-possession scheme) or by using HMAC with a per-user secret key so hashes are not comparable across users. Google Drive disables cross-user dedup for sensitive content categories for this reason.
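A sketch of the keyed-hash mitigation: fingerprinting chunks with HMAC-SHA-256 under a per-user secret makes the identifiers incomparable across accounts, closing the hash-theft hole at the cost of cross-user dedup. Key storage and rotation are omitted and the helper names are illustrative.

```python
import hashlib
import hmac

def global_fingerprint(chunk: bytes) -> str:
    """Plain SHA-256: identical chunks match across users, enabling global dedup
    but also the hash-theft attack described above."""
    return hashlib.sha256(chunk).hexdigest()

def per_user_fingerprint(chunk: bytes, user_key: bytes) -> str:
    """HMAC with a per-user key: the same chunk maps to different identifiers for
    different users, so knowing someone else's hash proves nothing; cross-user dedup is lost."""
    return hmac.new(user_key, chunk, hashlib.sha256).hexdigest()
```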
Encryption at rest and key management Security · L6+ probe ›
For consumer tiers, server-side encryption (SSE) with managed keys is standard: chunks are encrypted at rest using AES-256 with a per-user data key, itself wrapped by a master key in a KMS (AWS KMS, Google Cloud KMS). This protects against physical storage compromise without affecting the dedup model: the server decrypts on read before serving a presigned URL. For enterprise and HIPAA-regulated tiers, client-side encryption (CSE) is required: the client encrypts each chunk before upload using a key that never leaves the user's device. The tradeoff is explicit: CSE eliminates all cross-user dedup (the same plaintext file produces unique ciphertext per user, so chunk hashes never match across users), and must be stated as such when an interviewer asks about the privacy-vs-efficiency axis.
One open question: What's the target chunk size? Smaller chunks improve dedup granularity and delta sync precision, but increase the number of chunk hash lookups and metadata rows. Larger chunks reduce lookup overhead but mean more re-upload if a large chunk contains a small edit. For workloads that include large binary files, 4–8 MB chunks reduce hash lookup overhead at the cost of coarser delta granularity. For document and text-heavy workloads, 64–512 KB chunks improve dedup precision and delta sync resolution, at the cost of more hash lookups per upload.
API design
The public API has two core operations: initiating a file upload (which returns upload URLs for missing chunks) and triggering a sync to discover what changed. Everything else — version restore, sharing, folder management — is important but does not change the core architecture.
POST /files/upload/init — initiate upload
The API gateway enforces a per-user token bucket on this endpoint: max 100 upload-init requests per minute, 10 concurrent upload sessions. Clients exceeding the limit receive 429 Too Many Requests with a Retry-After header.
// Request: client sends file metadata + list of chunk hashes
POST /files/upload/init
Authorization: Bearer <token>
Content-Type: application/json
{
"path": "/My Documents/report.pdf",
"size_bytes": 52428800, // 50 MB
"modified_at": "2026-04-10T14:32:00Z",
"content_hash": "sha256:aef392...", // whole-file hash
"chunks": [
{ "index": 0, "hash": "sha256:c1a...", "size": 4194304 },
{ "index": 1, "hash": "sha256:d2b...", "size": 4194304 },
{ "index": 2, "hash": "sha256:e3c...", "size": 1048576 }
]
}
// Response: 200 OK — server identifies which chunks are missing
{
"upload_session_id": "sess_abc123",
"missing_chunks": [1, 2], // chunk 0 already exists
"upload_urls": {
"1": "https://storage.example.com/chunks/d2b...?sig=...",
"2": "https://storage.example.com/chunks/e3c...?sig=..."
}
}
// Validation errors
400 Bad Request: chunk count > 10,000 or file > 5 GB
409 Conflict: path exists with same content_hash (client dedups early)
POST /files/upload/complete — commit upload
// Request: after uploading all missing chunks, commit the transaction
POST /files/upload/complete
Content-Type: application/json
{
"upload_session_id": "sess_abc123"
}
// Response: 201 Created
{
"file_id": "file_xyz789",
"version_id": "ver_001",
"url": "https://cdn.example.com/files/file_xyz789?v=ver_001&sig=..."
}
// Error responses
404 Not Found: upload_session_id expired or unknown
422 Unprocessable: one or more missing chunks not yet uploaded
GET /sync/delta — pull changes since a cursor
// Request: client provides its last sync cursor
GET /sync/delta?cursor=eyJhY2N0IjoiMTIzIiwiY3Vyc29yIjoiMTcxMjcwMDAwMCJ9
// Response: list of changes since that cursor
{
"changes": [
{
"type": "modified",
"file_id": "file_xyz789",
"path": "/My Documents/report.pdf",
"version_id": "ver_002",
"size_bytes": 52457600,
"chunks": [
{ "index": 0, "hash": "sha256:c1a..." }, // unchanged
{ "index": 1, "hash": "sha256:f4d..." }, // new chunk
{ "index": 2, "hash": "sha256:e3c..." } // unchanged
]
}
],
"new_cursor": "eyJhY2N0IjoiMTIzIiwiY3Vyc29yIjoiMTcxMjcxMDAwMCJ9",
"has_more": false
}
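The cursor is an opaque token to the client, but the sample value in the request above decodes to a small JSON payload: {"acct":"123","cursor":"1712700000"} (account ID plus a position). A sketch of that encoding, assuming exactly this payload shape:

```python
import base64
import json

def encode_cursor(account_id: str, position: int) -> str:
    """Pack the sync position into an opaque, URL-safe token the client echoes back."""
    payload = json.dumps({"acct": account_id, "cursor": str(position)}, separators=(",", ":"))
    return base64.urlsafe_b64encode(payload.encode()).rstrip(b"=").decode()

def decode_cursor(token: str) -> dict:
    """Recover account and position; base64 padding is restored before decoding."""
    padded = token + "=" * (-len(token) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))

print(encode_cursor("123", 1712700000))
# eyJhY2N0IjoiMTIzIiwiY3Vyc29yIjoiMTcxMjcwMDAwMCJ9  (matches the request example above)
```

The server treats the token as opaque so the encoding can change without breaking clients.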
Optional endpoints by level
| Endpoint | Description | Level |
|---|---|---|
| GET /files/{id}/versions | List version history for a file | L3/L4 |
| POST /files/{id}/restore | Restore a file to a previous version | L3/L4 |
| POST /files/{id}/share | Generate a shareable link with optional expiry | L5 |
| DELETE /files/{id} | Soft-delete a file (moves to trash, retains versions) | L5 |
| GET /files/search | Full-text search across file names and content | L7/L8 |
| POST /files/upload/resumable/init | TUS-compatible resumable upload for large files | L7/L8 |
POST /files/{id}/share — sharing endpoint
// Request: specify permission level, optional grantee, optional expiry
POST /files/{id}/share
Authorization: Bearer <token>
Content-Type: application/json
{
"permission": "view" | "edit", // access level granted
"grantee_id": "usr_abc123", // omit for a public link
"expiry_seconds": 604800 // optional; 0 = no expiry
}
// Response
HTTP 201 Created
{
"link_token": "tok_xK9mPq2r",
"share_url": "https://app.example.com/s/tok_xK9mPq2r",
"expires_at": "2026-04-18T12:00:00Z", // null if no expiry
"permission": "view"
}
The link_token is stored in the file_shares table (§8). On access, the API verifies the token is not expired, increments an access counter for audit, and returns a presigned download URL. For grantee_id shares (direct user grants), the notification fan-out (§10) adds the grantee to the affected file's change events so they receive real-time sync updates for the shared file.
Search design note
File-name search can be served without a dedicated search index: a WHERE user_id = ? AND name LIKE 'prefix%' query on a (user_id, name) composite index in the metadata DB handles autocomplete efficiently at L3/L4. Full-text content search — searching inside document bodies — requires an async pipeline: upload completion triggers an event → text-extraction worker pulls the chunk bytes, extracts plaintext, and pushes to Elasticsearch → the index maps terms to file IDs. Clients query Elasticsearch directly via the API gateway. This pipeline is an L7/L8 concern; state the distinction clearly when the interviewer asks about GET /files/search.
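A sketch of that autocomplete path, modelled here with in-memory SQLite: the left-anchored LIKE keeps the (user_id, name) composite index usable. Table contents and names are illustrative.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE files (file_id TEXT, user_id TEXT, name TEXT)")
db.execute("CREATE INDEX idx_files_user_name ON files (user_id, name)")
db.executemany("INSERT INTO files VALUES (?, ?, ?)",
               [("f1", "u1", "report.pdf"), ("f2", "u1", "resume.docx"), ("f3", "u2", "report.pdf")])

def autocomplete(user_id: str, prefix: str, limit: int = 20):
    # Left-anchored LIKE keeps the composite index usable; '%' or '_' characters
    # in the user-supplied prefix would need escaping in a real implementation.
    return db.execute(
        "SELECT file_id, name FROM files WHERE user_id = ? AND name LIKE ? ORDER BY name LIMIT ?",
        (user_id, prefix + "%", limit),
    ).fetchall()

print(autocomplete("u1", "re"))   # [('f1', 'report.pdf'), ('f2', 'resume.docx')]
```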
Core flow: upload & sync
The durability NFR (11 nines) and the upload bandwidth NFR (delta-only) from §2 drive every step in the upload flow, along with the quota check added in §8. The key question is: at what point does the server commit a file version, and what happens on device B when device A completes an upload?
Client-side early exit: before calling POST /upload/init, the client compares the whole-file content hash against its local version database. If the hash matches the last synced version, the upload is skipped entirely — zero network calls. This is the first and cheapest deduplication check, operating entirely on-device.
Conflict handling: what if Device A and Device B both edit the same file while offline? The server detects a conflict when Device B tries to commit a version whose parent version does not match the current head. Rather than silently overwriting, the system creates a conflict copy — both versions are preserved as separate files (e.g., "report.pdf (Device B's conflicted copy 2026-04-10)"). This matches Dropbox's actual conflict policy. At L5+, interviewers expect this approach: both versions are preserved, no data is lost, and the user resolves the conflict manually, which means the system's job is detection, not merge.
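A sketch of that commit-time check as an optimistic-concurrency UPDATE; column names follow the §8 schema, but transaction handling and the creation of the sibling conflict-copy record are omitted.

```python
import datetime

def commit_version(db, file_id, parent_version_id, new_version_id):
    """Accept the commit only if the client's parent version is still the current head."""
    # The WHERE clause re-checks the head inside the UPDATE itself,
    # so two racing commits cannot both succeed.
    updated = db.execute(
        "UPDATE files SET current_ver = ? WHERE file_id = ? AND current_ver = ?",
        (new_version_id, file_id, parent_version_id),
    ).rowcount
    if updated == 1:
        return {"status": "committed", "version_id": new_version_id}

    # Head moved since this device last synced: preserve both versions, never overwrite.
    stamp = datetime.date.today().isoformat()
    return {
        "status": "conflict",
        "conflict_copy_suffix": f"(conflicted copy {stamp})",
        "version_id": new_version_id,
    }
```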
Quota check in the upload path: before the server performs the missing-chunk lookup (step ③), the Sync API does an optimistic read of quota_used_bytes as an early exit — if the user is clearly over quota, return 507 Insufficient Storage immediately. However, this read-then-check alone has a race condition: two concurrent upload-init requests from the same user can both pass the quota check simultaneously, each within quota individually but together exceeding it. The authoritative enforcement happens at commit time using an atomic compare-and-swap: UPDATE user_quotas SET quota_used_bytes = quota_used_bytes + ? WHERE user_id = ? AND quota_used_bytes + ? <= quota_limit_bytes. If 0 rows are updated, the commit is rejected with 507 and any uploaded chunks are abandoned. This guarantees quota is never exceeded regardless of concurrency. The quota is decremented on soft-delete and reclaimed by the GC pipeline when a version is hard-deleted.
Resumable uploads for large files
For files larger than ~100 MB, network interruptions mid-upload are likely. The system implements the TUS resumable upload protocol (or an equivalent): the client first issues an upload initialization request, then uploads chunks in sequence. Each chunk upload is acknowledged. If the connection drops, the client queries the server for which offset was last committed and resumes from there. This directly addresses the upload bandwidth NFR from §2 — a 2 GB file with a dropped connection at 1.9 GB only re-uploads the last chunk, not the whole file.
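A client-side sketch of the resume handshake in the TUS style: ask the server how many bytes it has durably committed (HEAD, Upload-Offset header), then continue from that offset (PATCH). The upload URL is a placeholder and error handling is minimal.

```python
import requests

TUS_HEADERS = {"Tus-Resumable": "1.0.0"}
NETWORK_CHUNK = 4 * 1024 * 1024  # 4 MB per PATCH request (illustrative)

def resume_upload(upload_url: str, path: str) -> None:
    """Query the last committed offset, then stream only the remainder of the file."""
    offset = int(requests.head(upload_url, headers=TUS_HEADERS).headers["Upload-Offset"])
    with open(path, "rb") as f:
        f.seek(offset)
        while piece := f.read(NETWORK_CHUNK):
            resp = requests.patch(
                upload_url,
                data=piece,
                headers={
                    **TUS_HEADERS,
                    "Upload-Offset": str(offset),
                    "Content-Type": "application/offset+octet-stream",
                },
            )
            resp.raise_for_status()
            # Trust the server's committed offset rather than a local byte counter.
            offset = int(resp.headers["Upload-Offset"])
```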
Async thumbnail & preview generation
Thumbnail generation is a background pipeline, not a synchronous upload step. File browsers display 64×64 icons and hover previews independently of the full file — downloading the complete file just to render the list view would be impractical.
When the Sync API publishes the upload-complete event to the Notification Bus (step ⑨ above), a Thumbnail Worker subscribes as a second consumer on that same topic. For each image or document file, the worker:
- Downloads the first chunk (or the full file for images) from the Block Store using the version's chunk list.
- Generates two thumbnail variants: 64×64 px (file-list icon) and 256×256 px (hover preview).
- Writes both variants to object storage under deterministic keys: thumbs/{file_id}/{version_id}/64.webp and thumbs/{file_id}/{version_id}/256.webp.
- Updates the file_versions row with thumb_url_64 and thumb_url_256 columns, which the metadata API then returns to clients.
Because the thumbnail worker is an async consumer, it does not block the upload acknowledgement: the client gets its file_id + version_id immediately, and thumbnails appear in the UI once the worker finishes (typically within 1–3 seconds for images). For file types without a thumbnail (e.g., .zip), the worker is a no-op and the client falls back to a generic file-type icon.
Common L4+ follow-up: "How does the file browser show a preview without downloading the whole file?" — Answer: the upload-complete event triggers the async thumbnail pipeline above. The client receives thumb_url_64 in the metadata response and fetches the thumbnail from the CDN directly; the full file content is never downloaded just to render the list view.
§7.5 — Offline access: the local sync daemon L5+
The "offline access" functional requirement (§2) is satisfied by a persistent background daemon on the client device, not by the server. The daemon has three responsibilities: watching for local file changes, maintaining a local state database, and flushing queued work on reconnect.
- File system watcher: the daemon registers with the OS-native change API — FSEvents on macOS, inotify on Linux, ReadDirectoryChangesW on Windows. When a watched file changes, the watcher fires an event within milliseconds. The daemon debounces rapid successive changes (e.g., an editor saving every keystroke) with a short idle delay (~500 ms) before treating the file as stable.
- Local state database: a lightweight SQLite database tracks the last-synced version_id, content hash, and sync status (synced | pending_upload | pending_download) for every file in the sync folder. This is the ground truth for what the device believes is current. On app startup or reconnect, the daemon compares the local state DB against the server's GET /sync/delta response to identify any divergence.
- Reconnect flush: when the device comes back online, the daemon walks its local state DB for any rows with pending_upload status — these are edits made while offline. Each is processed through the normal upload flow (§7). Offline edits discovered on reconnect are treated identically to concurrent device edits: the parent_version_id check at commit time detects conflicts, and the conflict-copy mechanism in §7 handles them — no special offline-conflict code path is needed.
Selective sync (marking only certain folders for offline access) works by limiting which paths the daemon watches and which entries it downloads to the local state DB. Files not marked for offline access are stubs: their metadata is local but their chunks are fetched on-demand when opened.
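A minimal sketch of the daemon's local state database and the reconnect flush, using SQLite as named above; column names and the upload callback are illustrative.

```python
import sqlite3

db = sqlite3.connect("sync_state.db")
db.execute("""
CREATE TABLE IF NOT EXISTS local_files (
    path         TEXT PRIMARY KEY,
    version_id   TEXT,                            -- last version this device synced
    content_hash TEXT,                            -- whole-file SHA-256 of the local copy
    status       TEXT NOT NULL DEFAULT 'synced'   -- synced | pending_upload | pending_download
)""")

def mark_dirty(path: str, new_hash: str) -> None:
    """Called by the file-system watcher after the debounce window expires."""
    db.execute("UPDATE local_files SET content_hash = ?, status = 'pending_upload' WHERE path = ?",
               (new_hash, path))
    db.commit()

def flush_on_reconnect(upload) -> None:
    """Walk everything edited while offline and push it through the normal §7 upload flow."""
    rows = db.execute("SELECT path, version_id FROM local_files WHERE status = 'pending_upload'")
    for path, parent_version in rows.fetchall():
        upload(path, parent_version_id=parent_version)   # conflict detection happens server-side
        db.execute("UPDATE local_files SET status = 'synced' WHERE path = ?", (path,))
    db.commit()
```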
Data model
There are four categories of entities in this system: users + accounts, file metadata, chunk index, and sharing + permissions. Each has different access patterns, and those differences determine where each entity lives and how it is indexed.
Access patterns
| Operation | Frequency | Query shape |
|---|---|---|
| Fetch file tree for a user | Very high (on sync / app open) | Range scan by user_id + parent_folder_id |
| Look up chunk by hash | Very high (per chunk on every upload) | Point lookup by SHA-256 hash |
| List version history for a file | Medium (on user request) | Range scan by file_id ordered by created_at DESC |
| Fetch sharing permissions | High (on every file access) | Point lookup by file_id + requesting user_id |
| Write new file version | High (every upload) | INSERT + UPDATE file record (transactional) |
| Compute sync delta (cursor) | High (per sync event) | Ordered scan by account_id + updated_at > cursor timestamp |
All user-facing queries begin with a user_id filter, making user_id the natural sharding key: all records for a given user co-locate on one shard. The chunk hash lookup has no user context at all, which makes it a candidate for a separate, independently-scalable hash index.
Schema
Quota tracking
Storage quotas require two additions to the schema. A user_quotas table (co-located on the same shard as the user's files via user_id) tracks the current and limit values:
-- co-located with files shard via user_id
user_quotas (
user_id UUID PK,                             -- shard key
quota_limit_bytes BIGINT NOT NULL,           -- e.g. 15 GB for free tier
quota_used_bytes BIGINT NOT NULL DEFAULT 0,
updated_at TIMESTAMPTZ
)
quota_used_bytes is incremented atomically when a file version is committed and decremented on soft-delete. The Sync API reads this row at upload-init time and rejects the request with 507 Insufficient Storage if the upload would exceed the limit (see §7). The files table additionally carries deleted_at TIMESTAMPTZ and purge_after TIMESTAMPTZ columns to support the trash/GC pipeline: a daily background job hard-deletes rows where purge_after < NOW(), decrements quota_used_bytes, and decrements chunk_index.ref_count for orphaned chunks.
Version retention and GC lifecycle
The §2 NFR specifies a minimum 30-day version history. On each version creation, purge_after is set to created_at + retention_days (30 days for free tier, configurable per plan). The daily GC job runs DELETE FROM file_versions WHERE purge_after < NOW() and, for each removed version, atomically decrements chunk_index.ref_count for each chunk in the version's chunk list: UPDATE chunk_index SET ref_count = ref_count - 1 WHERE chunk_hash = ?. If the updated ref_count reaches 0, the GC then deletes the blob from the object store by its hash key (an S3/GCS DeleteObject call), and finally removes the chunk_index row. The object store has no ref_count column — it is a content-addressed blob store; all reference tracking lives in chunk_index. This two-step sequence (decrement in index → delete from object store if zero → delete index row) prevents races where a concurrent upload re-references a chunk between the decrement and the delete. This is distinct from the orphaned-chunk GC path (§11) which handles failed uploads — version-expiry GC is the planned lifecycle path for fully committed versions whose retention window has closed.
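A sketch of that decrement-then-delete sequence; delete_blob stands in for the object-store DeleteObject call, and in production the decrement and zero-check would run inside one transaction (or use a conditional delete) to close the race described above.

```python
def gc_expired_version(db, delete_blob, chunk_hashes):
    """Release one expired version's chunks: decrement first, delete the blob only at zero."""
    for chunk_hash in chunk_hashes:
        # Step 1: atomic decrement in the chunk index.
        db.execute("UPDATE chunk_index SET ref_count = ref_count - 1 WHERE chunk_hash = ?",
                   (chunk_hash,))
        (refs,) = db.execute("SELECT ref_count FROM chunk_index WHERE chunk_hash = ?",
                             (chunk_hash,)).fetchone()
        if refs == 0:
            # Step 2: no committed version references this chunk any more, so remove the blob...
            delete_blob(chunk_hash)
            # Step 3: ...then drop the index row. Keeping the index row until the blob is gone
            # prevents a concurrent upload from "deduplicating" against a blob being deleted.
            db.execute("DELETE FROM chunk_index WHERE chunk_hash = ?", (chunk_hash,))
    db.commit()
```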
Why denormalize user_id into file_versions? Sharding strategy ›
In a sharded database, cross-shard joins are expensive or impossible. If a query needs file_versions for a file owned by a user on shard 3, the version data must also be on shard 3. Including user_id in file_versions (even though it duplicates the files table relationship) ensures the shard router can keep all a user's related records co-located. This denormalization is intentional, not a mistake.
Caching strategy
The caching architecture follows from two NFRs: metadata reads must complete in under 200 ms (§2), and file downloads must start within 2 seconds from the nearest edge (§2). These are satisfied by different cache layers operating at different points in the request path established in §4.
| Cache layer | What's cached | TTL | Why it exists |
|---|---|---|---|
| Client local disk | Full file chunks for recently accessed files | LRU until storage limit | Offline access; zero latency for recently seen files |
| CDN edge nodes | Chunk objects by SHA-256 hash | Indefinite (content-addressed) | File content never changes for a given hash; cache hits are always valid |
| Redis cluster | File metadata, session tokens, permission checks | 60–300 s per entry | Metadata reads happen on every sync event; DB round-trip would bottleneck the API |
| API server in-process | User account info, routing config | 30–60 s | Avoid Redis hop for extremely hot data (logged-in user's account) |
Cache invalidation
Chunk objects are immutable by design (content-addressed), so the CDN never needs invalidation for chunks: the same hash always returns the same bytes. Every metadata write triggers a cache invalidation event: after each successful commit, the Sync API publishes to a durable Kafka topic (the same cluster already in the design), and consumers clear the relevant Redis keys. Using a durable topic rather than Redis pub/sub ensures that a temporarily disconnected consumer does not permanently miss invalidation events. Redis pub/sub is fire-and-forget: it delivers nothing to subscribers that restart or lag. This keeps cache coherence bounded to the event propagation latency, typically under 100 ms within a region.
Content-addressing eliminates the hardest part of CDN caching: because every chunk is keyed by its SHA-256 hash, CDN caching is trivially correct. There is no cache poisoning risk, no stale-data risk, and no need for cache-busting URLs. A chunk served from a Berlin CDN edge node 2 years after it was first uploaded is guaranteed to be byte-identical to what was stored at upload time.
Deep-dive: scalability
At 50M+ DAU the system faces four distinct scaling challenges: the chunk hash lookup table becomes a hot global bottleneck; metadata shards become unevenly loaded for power users; the sync notification layer must hold millions of persistent connections; and cross-region replication introduces consistency trade-offs.
Scaling the chunk hash index (global dedup table) L5 deep-dive ›
The chunk hash index is the single hottest table in the system. Every upload — by every user — must query it before uploading a chunk. At 50M DAU uploading 3 files/day averaging 10 chunks each, that's ~17,000 hash lookups per second sustained, with 5× peak during business hours (~85K/s).
A sharded key-value store (such as Apache Cassandra or Amazon DynamoDB) handles this naturally: the partition key is the SHA-256 hash (first 8 bytes), distributing evenly across nodes. Reads are single-key lookups with no fan-out. The table schema is simple: chunk_hash → {storage_path, size_bytes, ref_count}. At 10^11 unique chunks with 512 bytes per row, total table size is ~50 TB — comfortably within DynamoDB or Cassandra's capacity model.
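A sketch of the dedup lookup against such a key-value index, modelled here with DynamoDB via boto3; the table and attribute names mirror the schema above but are assumptions.

```python
import boto3

table = boto3.resource("dynamodb").Table("chunk_index")

def missing_chunks(chunk_hashes: list[str]) -> list[str]:
    """Return the subset of hashes the client still needs to upload.

    Single-key point reads: the SHA-256 hash itself is the partition key,
    so load spreads uniformly across partitions with no hot keys.
    """
    missing = []
    for h in chunk_hashes:
        resp = table.get_item(Key={"chunk_hash": h}, ProjectionExpression="chunk_hash")
        if "Item" not in resp:
            missing.append(h)
    return missing
```

In practice a batched read (batch_get_item) would cut round-trips for multi-chunk uploads; the per-key loop above keeps the access pattern obvious.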
Metadata shard hot-spots and large-account handling L6 deep-dive ›
Sharding by user_id distributes load evenly for average users. But enterprise customers with shared drives accessed by thousands of employees create a different problem: all write traffic for that shared namespace hits one shard. The shared drive is a logical account, and all editors write to the same shard concurrently.
The solution is hierarchical sharding: a shared drive gets its own synthetic "owner ID" distinct from any individual user ID. Traffic targeting that drive is further sub-sharded by folder ID within the drive, distributing concurrent write load across multiple shards. The metadata service maintains a routing table mapping drive IDs to their shard assignments.
Notification at 15M concurrent connections L6 deep-dive ›
A WebSocket server handling 300K connections per node (using epoll-based event loop I/O — Go, Node.js, C++) requires ~50 nodes per region for 15M concurrent connections. When a file changes, the Sync API publishes a change event to a durable message queue (Kafka topic = account_id). A notification service fan-out layer reads partitions by account_id and writes to the correct WebSocket server, which knows which client connections belong to that account. For files in shared folders, this fan-out must also look up the file_shares table for the affected file_id and publish change events to each grantee_id's notification topic — otherwise collaborators never receive push notifications for changes to files they have been granted access to. This means the fan-out requires a metadata lookup step and cannot be a simple per-account Kafka partition route.
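A sketch of that fan-out step with the queue client abstracted away; lookup_grantees and publish stand in for the file_shares query and the per-account topic write, and the topic naming is illustrative.

```python
def fan_out_change_event(event: dict, lookup_grantees, publish) -> None:
    """Deliver one file-change event to the owner's devices and to every collaborator.

    `event` carries at least file_id and the owner's account_id; `lookup_grantees`
    queries the file_shares table; `publish` writes to a per-account notification topic.
    """
    recipients = {event["account_id"]}                     # the owner's own devices
    recipients.update(lookup_grantees(event["file_id"]))   # users the file is shared with
    for account_id in recipients:
        publish(topic=f"notify.{account_id}", message=event)
```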
WebSocket servers must be stateful (they hold open connections) but must not be in the write path. If the WebSocket server goes down, clients fall back to polling the /sync/delta endpoint with their cursor. The cursor-based sync model means no sync events are permanently lost — a client that was offline simply polls on reconnect and catches up.
Cross-region consistency and the sync cursor L7 deep-dive ›
Metadata is written to the primary region and replicated asynchronously to replica regions with typical lag of 50–200 ms. A client in EU may read from the EU replica and see a slightly stale file tree. This is acceptable for the sync use case — eventual consistency bounded by replication lag is the explicit consistency model from §2. The sync cursor encodes a timestamp; a client reading stale data will receive the correct state on the next poll once replication catches up.
For writes, all clients route metadata mutations to the primary region. The cross-region latency adds 80–120 ms to write operations for EU clients writing to US-East primary. At L7, the discussion is whether to implement active-active multi-primary (each region accepts writes, with conflict resolution via vector clocks (logical timestamps that track causal ordering across concurrent writes)): the same problem Google Docs and CRDTs (Conflict-free Replicated Data Types — data structures where any two concurrent edits merge deterministically) solve, but for file metadata rather than document content.
Erasure coding for storage durability L7 deep-dive ›
Triple replication (3 full copies across 3 AZs) provides 11-nines durability but at 3× storage cost. Reed-Solomon erasure coding (RS 6+3 is common) splits each chunk into 6 data shards and 3 parity shards. Any 6 of the 9 shards can reconstruct the original chunk — tolerating 3 simultaneous shard failures. Storage overhead is 9/6 = 1.5×, vs 3× for triple replication. At petabyte scale, this is a 50% reduction in raw storage cost: the economic case is compelling.
The tradeoff: reading an erasure-coded chunk requires reading at least 6 shards in parallel and decoding via Galois field arithmetic (polynomial interpolation over GF(2⁸)) — not simple XOR, which only applies to single-parity schemes like RAID-5. This adds CPU overhead and small latency. Hot chunks in the CDN cache bypass this — they are served as whole objects. Only CDN-cold chunks require reconstruction.
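The durability comparison can be made concrete with a toy model: assume each shard independently fails with some probability before it can be repaired, and a chunk is lost only when more shards fail than the scheme tolerates. The 10^-3 per-repair-window failure probability below is purely illustrative; real durability math must also model repair time, correlated failures, and scrubbing.

```python
from math import comb

def chunk_loss_probability(n_total: int, n_data: int, p: float) -> float:
    """Probability a chunk is unrecoverable, given each shard independently fails
    with probability p before repair. Loss requires more than (n_total - n_data) failures."""
    tolerated = n_total - n_data
    return sum(comb(n_total, k) * p**k * (1 - p)**(n_total - k)
               for k in range(tolerated + 1, n_total + 1))

p = 1e-3  # illustrative: chance a shard is lost within one repair window

print(f"3x replication: overhead 3.0x, P(loss) ~ {chunk_loss_probability(3, 1, p):.1e}")
print(f"RS 6+3:         overhead 1.5x, P(loss) ~ {chunk_loss_probability(9, 6, p):.1e}")
```

Under these assumptions the 9-shard scheme tolerates three concurrent failures at half the storage overhead of three full copies, which is the economic argument made above.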
Failure modes & edge cases
| Scenario | Problem | Solution | Level |
|---|---|---|---|
| Upload exceeds user quota | User attempts to upload a file that would push their quota_used_bytes over quota_limit_bytes | Sync API does an optimistic quota read at upload-init for a fast early exit, but enforces quota atomically at commit time via UPDATE user_quotas SET quota_used_bytes = quota_used_bytes + ? WHERE user_id = ? AND quota_used_bytes + ? <= quota_limit_bytes. If 0 rows are updated, the commit is rejected with 507 Insufficient Storage. This prevents the race condition where two concurrent uploads both pass the init-time check but together exceed quota. Client surfaces "storage full" UI and optionally links to upgrade flow. | L3/L4 |
| Upload interrupted mid-stream | Client uploads 80% of chunks, then loses connection; server has partial data | TUS-protocol resumable uploads: client queries upload cursor on reconnect, re-uploads only from last committed chunk. Upload sessions expire after 24 hours; orphaned chunks are cleaned by background GC. | L3/L4 |
| Chunk store unavailable | Object storage returns 503; uploads fail | Return 503 to client immediately (don't buffer chunks in API servers). Client retries with exponential backoff. The system uses an object store with 99.99% availability SLA — outages are rare but must be handled gracefully by the client library. | L3/L4 |
| Two devices save the same file concurrently | Device A and Device B both modify a file while offline; first commit wins, second overwrites data | The server rejects the second commit if its parent_version_id doesn't match the current head. The Sync API creates a "conflict copy" (Device B's conflicted copy) instead of overwriting. No data is lost. User resolves manually. | L5 |
| Metadata DB shard fails | A shard failure makes all files for affected users inaccessible | Each shard has a primary + 2 synchronous replicas. On primary failure, the replica promotes automatically (typically in 10–30 s). During promotion, writes are rejected (503); reads may serve from a slightly stale replica. Clients retry with backoff. | L5 |
| Chunk hash collision (deliberate) | Attacker crafts a chunk whose SHA-256 matches an existing chunk they don't own, "stealing" access | Require proof-of-possession: client must upload the first 4 KB of any chunk it claims to already own. Server validates hash of the fragment matches. This makes hash-stealing attacks require actually having the content. | L5 |
| Notification delivery failure | WebSocket connection drops; change event not delivered to Device B | Events are also persisted in Kafka with a 7-day retention window. On reconnect, Device B polls /sync/delta with its last cursor. The cursor-based protocol ensures all events are catchable on poll regardless of push reliability. | L5 |
| Chunk hash index unavailable | The global chunk hash index (DynamoDB/Cassandra) is a synchronous dependency in every upload: the Block Service must perform the dedup lookup (step ③) before any chunk can be accepted or skipped. If the index is unavailable, all upload-init requests fail at the dedup check. | Add a circuit breaker on the hash index client. On open circuit, degrade to always-upload mode: skip the dedup lookup, accept all chunks unconditionally, and reconcile orphaned duplicates in the next GC run. Return a response header indicating dedup was bypassed so clients do not retry unnecessarily. Restore dedup checks once the circuit closes. Alternatively, return 503 with a Retry-After header if degraded-mode storage cost is unacceptable. | L5 |
| Orphaned chunks (upload abort after block write) | Block is written to chunk store but metadata commit never happens; storage leak accumulates | Upload sessions have a 24-hour expiry. A background garbage collector runs daily, scanning upload_sessions for expired/uncommitted sessions and deleting associated chunk_store objects whose ref_count has never been incremented. GC uses a two-phase check (mark then delete with a 1-hour delay) to guard against slow commits. Critically, the final delete must re-verify ref_count = 0 at deletion time: re-read chunk_index.ref_count atomically before issuing the object-store DeleteObject call, because a new upload could have incremented ref_count to 1 between the mark and delete phases. Only delete the object store blob and chunk_index row if ref_count is still 0 at that moment. | L7/L8 |
| Region-wide outage | US-East primary region loses connectivity; EU users can't write files | EU read replicas serve reads from the last consistent snapshot. Writes are queued in client (up to 24 hours of local changes). On region recovery, clients flush their local write queue. Active-active would eliminate this window but adds significant complexity (see §10 deep-dive). | L7/L8 |
How to answer by level
The file storage question is calibrated by how deep you go on the sync protocol, chunking algorithm, and consistency model. An L3 answer covers the mechanics; an L7 answer owns the cost structure and consistency trade-offs.
L3/L4 What good looks like at this level ›
- Identifies two core services: file metadata store and file content store
- Proposes chunked uploads to handle large files and network failures
- Designs a simple sync mechanism: client polls for changes, downloads new versions
- Handles the basic upload flow end-to-end: chunk → upload → commit metadata
- Mentions version history as a list of file records with timestamps
files (
file_id UUID PK,
user_id UUID,
path TEXT,
s3_url TEXT,
version INT,
updated_at TIMESTAMPTZ
)
No sharding concerns at this level. One table is enough to answer "what changed?" via a WHERE user_id = ? AND updated_at > ? scan.
- L3/L4 uses polling for sync; L5 introduces event-driven push
- L3/L4 may re-upload entire files; L5 proposes delta / chunk-diff sync
- L3/L4 doesn't bring up deduplication as a storage strategy
- L3/L4 uses a single flat file store; L5 separates metadata and blocks explicitly
L5 What good looks like at this level ›
- Explains content-defined chunking (CDC) and why it's superior to fixed-size for delta sync
- Proposes SHA-256 content addressing for deduplication; identifies the global hash index as a separate component
- Designs the "/upload/init → upload chunks → /upload/complete" two-phase commit
- Proposes WebSocket or long-poll for sync notifications instead of polling
- Identifies conflict resolution as a distinct problem and proposes conflict copies
- L5 proposes dedup; L6 reasons about privacy implications and proof-of-possession
- L5 designs for a single region; L6 reasons about cross-region read/write separation
- L5 describes conflict copies; L6 probes further (CRDTs, version vectors, shared drive conflicts with multiple owners)
- L5 caches metadata; L6 designs invalidation and discusses cache consistency under concurrent writes
L6 What good looks like at this level ›
- Owns the chunk index scaling problem end-to-end: partitioning strategy, concurrent upload race condition, GC for orphaned chunks
- Addresses the enterprise shared-drive shard hot-spot and proposes hierarchical sub-sharding
- Designs the notification connection server tier separately from the API tier, with cursor-based fallback
- Discusses cross-region metadata replication lag and its effect on sync cursors
- Proposes erasure coding (RS 6+3) vs. triple replication and reasons about cost vs. latency trade-off
- L6 designs for the current scale; L7 asks "what's the cheapest this system can get while meeting SLAs?"
- L6 accepts active-passive multi-region; L7 analyzes active-active trade-offs, vector clocks, and CRDT approaches
- L7 frames build vs. buy decisions at petabyte scale (e.g., Dropbox's decision to build Magic Pocket vs. staying on S3)
L7/L8 What good looks like at this level ›
- Frames the cost structure: storage dominates; dedup ratio and erasure coding are the two levers. Example at 50M DAU, 5 GB avg stored per user, $0.023/GB/month:
  | Scenario | Retention factor | Stored (PB) | Annual cost |
  |---|---|---|---|
  | Baseline dedup | 0.70× | 175 PB | ~$48 M/yr |
  | Improved dedup | 0.50× | 125 PB | ~$34 M/yr |

  Delta: ~$14 M/yr saved from a 0.20 improvement in the retention factor. Consistent with the interactive estimator defaults.
- Designs multi-primary active-active metadata with causal consistency (version vectors) and probes the consistency boundaries
- Identifies which parts of the system are genuinely novel engineering problems vs. solved infrastructure
- Over-engineering: proposing global strong consistency when eventual suffices
- Ignoring cost modeling — L7 must reason about money, not just correctness
- Not probing failure modes of their own proposed solutions (e.g., what breaks in active-active under network partition)
Classic probes table
| Question | L3/L4 answer | L5/L6 answer | L7/L8 answer |
|---|---|---|---|
| How does Device B know a file changed on Device A? | Polling every N seconds | Push notification via WebSocket / long-poll from Notification Bus; cursor-based delta pull on reconnect | Same, plus reasons about mobile constraints (APNs/FCM) and the economics of 15M persistent connections per region |
| How do you limit the upload bandwidth for a 5 GB video? | Split the file into chunks and upload in sequence | CDC chunking → hash → only upload missing chunks; TUS resumable protocol for interruption safety | Same, plus client-side adaptive bitrate for upload (detect congestion, reduce chunk rate), and server-side per-user upload quota enforcement at the load balancer layer |
| What happens when two people edit the same file simultaneously? | Last write wins (data loss) | Conflict copy: both versions preserved; parent_version_id mismatch triggers conflict creation; user resolves manually | Conflict copy for binary files; CRDT or OT for text files; CRDTs eliminate conflicts structurally but require intent-preserving merge semantics |
| How do you achieve 11-nines durability? | Backup the data frequently | Multi-AZ replication with erasure coding (RS 6+3), synchronous write to 2+ shards before ACK, bit-rot scrubbing | Same, plus cross-region replication for disaster recovery, integrity audit pipelines (checksumming at rest), and the math: at RS 6+3 with independent disk failure rates, you can compute the annual probability of data loss explicitly |
How the pieces connect
Trace a single 1-byte edit end to end
Device A makes a 1-byte change inside a 100 MB file. Here is every system touched, in order:
- CDC re-chunks the file (§5): only the chunk whose boundary shifted changes; the other ~6,000 chunks produce identical hashes. One new chunk is identified.
- Hash lookup in the chunk index (§4, §8): the client sends the new chunk's SHA-256 to POST /upload/init. The chunk index confirms the hash is not already stored; the server returns one presigned upload URL.
- PUT to Block Store (§4): the client uploads one chunk directly to the Block Store, bypassing the Sync API. The only bytes that cross the network are the one changed chunk: the upload bandwidth NFR (§2) is satisfied.
- Metadata commit (§7, §8): POST /upload/complete writes a new file_versions row pointing to the updated chunk list. The parent_version_id check ensures no concurrent edit from another device has been silently overwritten.
- Kafka event published (§4): the Sync API emits a change event async to the Notification Bus (step ⑨ in Fig. 3). This decouples the upload acknowledgement from the sync fan-out.
- Device B wakes via WebSocket (§10): the WebSocket gateway, subscribed to the Kafka topic for this user, pushes "file changed" to all of Device B's open connections within the sync latency NFR window (< 5 s, §2).
- Device B pulls delta (§7): Device B calls GET /sync/delta?cursor=…. The server returns the new chunk list diff: one new chunk hash and its presigned download URL. Device B downloads one chunk. The file is reconstructed from the unchanged cached chunks plus the one new chunk — no other data moves.
NFR chain traces
- Upload bandwidth NFR (§2) → fixed-size chunking causes boundary shift on insertion → CDC chunking chosen (§5) → only modified chunks re-uploaded → sync latency NFR also satisfied (§7)
- Storage efficiency NFR (§2) → global deduplication required → chunks keyed by SHA-256 → global chunk index table (§4, §8) → the index becomes the single hottest component and must be independently scaled (§10)
- Sync latency < 5 s (§2) → polling is too slow at scale → event-driven notification bus (§4) → Kafka fan-out to WebSocket gateway → cursor-based /sync/delta as durable fallback (§7)
- Durability 11 nines (§2) → triple replication alone costs 3× storage → erasure coding RS 6+3 (§10) reduces overhead to 1.5× at comparable durability → storage cost halved at petabyte scale
- Concurrent device edits (§11) → last-write-wins causes data loss → parent_version_id check + conflict copies (§7) → no data lost, user resolves → connects forward to CRDT trade-off discussion at L7 (§12)
- Content-addressable chunks (§5) → same hash = same content, always valid → CDN caching never needs invalidation for chunk content (§9) → cache TTL can be indefinite → file read latency hits CDN p99 (< 20 ms) rather than origin p99 (< 200 ms)