
Design an Online Judge — System Design Interview Guide

Three constraints pull against each other: keeping execution isolated, keeping latency under 5 seconds, and keeping resource limits identical for every user. Every architectural decision in this system flows from that tension.

~35 min read  ·  12 sections

L3/L4 — working system  ·  L5/L6 — tradeoffs & scale  ·  L7/L8 — architecture & strategy
01

What the interviewer is testing

An online judge is a deceptively tricky system design question. On the surface it sounds like a CRUD app: accept code, run it, return a result. The hard part is that running arbitrary, untrusted code is one of the most security-sensitive operations in software. Every architectural decision — from the execution environment to the queue topology to the storage layout — flows from that single constraint.

The question also spans a wide range of interesting sub-problems: secure sandboxing, resource fairness, queue-based decoupling, test case management, and leaderboard design. A strong answer demonstrates that you understand which of these is the primary constraint (isolation) and which follow from it (everything else).

Level | What good looks like | The trap to avoid
L3/L4 | Working flow: accept code, run it, return pass/fail. Recognise that execution must be isolated. | Treating this as a simple REST API. Missing sandboxing entirely, or vaguely saying "run in Docker" without addressing resource limits.
L5 | Queue-based decoupling, async result delivery, explicit CPU/memory/time limits, test case storage in object storage. | Conflating the API tier with the execution tier. Not addressing what happens when a worker crashes mid-execution.
L6 | Full sandbox technology comparison (seccomp vs gVisor vs Firecracker), worker autoscaling via Little's Law, contest mode pre-warming, leaderboard design. | Designing for average load, not contest spike load. Ignoring noisy-neighbour effects in shared execution environments.
L7/L8 | Multi-region execution, build caching across submissions, plagiarism detection pipeline, cost model (Firecracker vs dedicated VMs), SLA reasoning. | Over-engineering the API layer when the execution layer is the actual bottleneck. Ignoring operational cost of deep isolation.

Common opening probe: "What makes this different from a regular web service?" — The right answer centres on trust: every other web service you design assumes the request is data. An online judge's request is executable code. That changes the security model entirely.

02

Requirements clarification

Before designing anything, nail down scope. An online judge can mean a competitive programming platform (Codeforces), a hiring assessment tool (HackerRank), or a practice platform (LeetCode). The functional requirements are similar; the non-functional requirements diverge sharply on contest concurrency and plagiarism detection.

Functional requirements

Requirement | In scope | Notes
Problem browsing | Yes | List, filter by tag/difficulty, view problem statement
Code submission | Yes | Multiple languages; submission stored permanently
Execution & judging | Yes | Run against hidden test cases; return verdict within 5 s
Submission history | Yes | Per-user history with verdicts and runtime stats
Leaderboard / rankings | Yes | Global and per-contest; near-real-time during contests
Contest mode | Yes | Time-bounded, concurrent surge of submissions
Plagiarism detection | No (L7+) | Out of scope for core design; flag as extension
Custom test cases (user-defined) | No | Simplification; production systems support this

Non-functional requirements

NFR | Target | Rationale
Isolation | Sandbox escape must not occur structurally | Untrusted code can attempt kernel exploits, filesystem reads, network calls
Execution latency | < 5 s end-to-end (p99) | Competitive programming is interactive; users wait at the keyboard
Verdict accuracy | Deterministic for identical inputs | Wrong verdicts destroy platform credibility
Fairness | Identical resource limits per language across all submissions | A user on a busy node must not receive a TLE that a user on an idle node avoids
Availability | 99.9% (≈8.7 h downtime/year) | Not life-critical; brief outages acceptable outside contest windows
Throughput | 10× normal load during contest windows | Contest starts trigger synchronised submission bursts
Data durability | Submissions never lost | Users reference submission history months later

NFR reasoning

Isolation: why sandbox escape must not occur structurally

Submitted code runs with the platform's execution credentials. Without isolation, a submission could read other users' source code, exfiltrate test cases (which have commercial value), establish outbound network connections, or exploit kernel vulnerabilities to escalate to the host. Any of these constitutes a complete platform compromise.

"Must not occur structurally" is precise language: we cannot claim it never happens — zero-day hypervisor escapes exist. We claim the architecture provides no structural path: no shared filesystem, no shared kernel (with microVMs), no shared network namespace, and no capability to install kernel modules.

What this drives: Firecracker microVMs or gVisor as the execution layer (§5), not plain Docker. Each submission gets a fresh, throwaway VM. The isolation boundary is hardware virtualisation, not just Linux namespaces.
Execution latency: why 5 s, not 1 s or 30 s

1 second is too tight: VM boot overhead (Firecracker cold start is ~125 ms), compilation time for Java/Go/Rust, and network round-trips consume a non-trivial fraction. 30 seconds is too loose: users disengage. 5 seconds is the practical ceiling — fast enough to feel interactive, slow enough to accommodate compilation and microVM boot.

The 5 s budget breaks down roughly as: 100–200 ms API + queue enqueue, 100–200 ms worker dequeue + VM boot (warm pool, see §9), 0–3 s compilation + execution, 100 ms result write + notification.

What this drives: Async execution via a job queue (§4), pre-warmed VM pool (§9), and WebSocket or long-poll for result delivery (§6) — not synchronous HTTP execution.
Fairness: why identical limits matter for competitive integrity

If two users submit identical correct solutions but one hits a TLE because their worker was under CPU contention, the result is wrong. This is not just a UX issue — in contest mode, incorrect verdicts affect rankings and prize eligibility. Fairness requires that resource limits are enforced at the cgroup level, not just hoped-for by scheduling, and that no two submissions share CPU time within the same VM.

What this drives: One submission per microVM (never multi-tenant execution), per-language CPU/memory cgroup limits defined in a configuration table, and worker nodes reserved exclusively for execution (never co-located with API servers).
03

Capacity estimation

An online judge has two very different traffic profiles that must be estimated separately. Submission traffic is write-heavy, bursty, and computationally expensive — each submission triggers code execution that takes 1–5 seconds. Problem-browsing traffic is read-heavy, cheap, and highly cacheable — a user reads a problem statement many times before they submit once.

The most important output of this estimation isn't storage or QPS — it's worker count. Workers are the scarce resource, sized by Little's Law: at steady state, the number of items in the system equals the arrival rate multiplied by the service time (L = λW). This gives the minimum workers at 100% utilisation — but a system running at 100% utilisation has an unboundedly growing queue at any variance spike. Practical provisioning targets 70–80% utilisation, meaning you need the Little's Law result divided by 0.75. Getting this wrong means either under-provisioning (queue builds up, latency blows past 5 s) or over-provisioning (idle workers, wasted cost).
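As a sanity check, the sizing arithmetic fits in a few lines. A minimal sketch in Python; the input values are the same assumptions used in the worked example below, not measurements:

# Worker fleet sizing (Little's Law)
import math

submissions_per_day = 2_000_000   # assumption
peak_factor = 3                   # contest burst multiplier (assumption)
avg_execution_s = 3.0             # mean compile + run time (assumption)
target_utilisation = 0.75

avg_qps = submissions_per_day / 86_400        # ~23 submissions/s
peak_qps = avg_qps * peak_factor              # ~70 submissions/s

# Little's Law: L = lambda * W -> concurrent executions at 100% utilisation
min_slots = peak_qps * avg_execution_s        # ~210 slots
provisioned = math.ceil(min_slots / target_utilisation)   # ~280 slots

print(f"avg={avg_qps:.0f}/s  peak={peak_qps:.0f}/s  "
      f"min_slots={min_slots:.0f}  provisioned={provisioned}")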

Capacity estimate (worked example)

Assumptions: 2 M submissions/day · 20 : 1 read:write ratio · 3.0 s average execution time · 3-year retention.

Metric | Value | Derivation
Write QPS (avg) | 23 submissions/sec | 2 M ÷ 86,400
Read QPS (avg) | 462 requests/sec | 23 × 20
Peak write QPS | 70 submissions/sec | avg × 3× peak factor
Workers needed (peak, 75% util) | 280 concurrent sandbox slots | ceil(70 × 3.0 s ÷ 0.75)
Submission storage | ~22 TB over retention | 2 M/day × 10 KB × 3 yr
Total submissions | 2.19 B rows in DB | 2 M/day × 365 × 3

Key architectural implication: At 2 M submissions/day with a 3× contest spike, you need ~280 concurrent execution slots at a 75% utilisation target (see the worker row above — 210 is the theoretical minimum at 100% utilisation, which no healthy system should run at). These are not threads or processes sharing a machine — each slot is an isolated microVM. This cost model, not application logic, drives the decision to use a pre-warmed VM pool with autoscaling rather than spinning fresh VMs per submission.

Fixed assumptions: Average submission size 10 KB (code + metadata + result). Peak factor 3× average (contest burst). Test case storage is estimated separately — a typical problem has 50–100 test cases at 10–100 KB each, totalling ~5–10 MB per problem. With 3,000 problems, test case storage is ~15–30 GB — small relative to submissions but accessed at high frequency during execution.

04

High-level architecture

[Figure 1 diagram: Client (browser / app) → CDN (static + problems) → Load balancer (L7 / HTTPS) → API servers (stateless, N nodes) → Job queue (Kafka / SQS) → Exec workers (sandbox pool). Supporting stores: Redis (status, leaderboard), PostgreSQL (problems, users), Object storage (S3, test cases). Legend distinguishes synchronous from asynchronous paths.]
Figure 1 — High-level architecture. Submission ingestion (API) is decoupled from execution (workers) via a durable job queue.

Component breakdown

CDN (content delivery network) serves static assets and — critically — problem statement HTML/JSON. Problem content is read-only after creation, so it can be cached at edge nodes indefinitely with a long TTL. This alone eliminates the majority of API server requests.

Load balancer distributes HTTPS traffic across API server instances. It terminates TLS, applies rate limiting per IP and per API key, and routes health checks. It does not route execution traffic — that goes through the queue.

API servers are stateless application nodes. They validate submissions (syntax check, size limit, language support), write submission records to the database, enqueue execution jobs, and serve read requests from cache. Being stateless means they scale horizontally without coordination.

Job queue (Kafka or SQS) is the critical decoupling point. The API enqueues a submission job and immediately returns an accepted response with a submission ID. The queue provides durability — if all workers crash, jobs are not lost. It also provides natural backpressure during contest spikes.

Execution workers are the production bottleneck. Each worker dequeues one job, boots a fresh sandbox environment, compiles and runs the submission against every test case, and writes the verdict. Workers are stateless; the number of active workers is the primary scaling lever for throughput.

PostgreSQL is the system of record for problems, submissions, users, and test case metadata. Relational schema is appropriate because the entities have well-defined relationships and ACID guarantees simplify submission status consistency.

Redis serves two distinct purposes: submission status polling (short TTL keys, updated by workers as test cases run) and leaderboard rankings (sorted sets, updated atomically per accepted submission).

Object storage (S3) holds test case input/output files. Test case files are immutable after creation, referenced by a deterministic key, and accessed at high frequency by workers. Object storage is cheaper and more scalable than PostgreSQL BLOB columns for this workload.

Architectural rationale

Why a job queue instead of synchronous execution

If the API server executed code synchronously, a contest spike of 300 simultaneous submissions would require 300 simultaneous HTTP connections all held open for 3–5 seconds. This exhausts connection pools and makes the API unresponsive to other requests. With a queue, the API returns immediately (HTTP 202 Accepted) and workers drain the queue at their natural rate.

Tradeoff: The user must poll or subscribe for results instead of getting a synchronous response. This is an acceptable UX tradeoff — competitive programmers expect a few seconds of wait time.
Alternatives: gRPC streaming · WebSocket from API to worker · synchronous execution for "run" (not submit)
Why object storage for test cases, not PostgreSQL BLOBs

Test case files are read-only after creation, accessed by workers only (not the API), and range from 1 KB to several MB. Storing them in PostgreSQL as BYTEA columns bloats row sizes, increases checkpoint pressure, and couples test case reads to the same database handling user authentication and submission metadata. Object storage separates concerns, is an order of magnitude cheaper per GB, and supports parallel reads natively. AWS S3 provides strong read-after-write consistency (since December 2020), so workers always see the latest version of a test case object immediately after upload — no eventual consistency window to worry about. S3 versioning is still needed for deliberate rollbacks, not for consistency.

Tradeoff: An extra network hop for workers fetching test cases. Mitigated by caching test case files locally on the worker node across submissions for the same problem (workers process many submissions for popular problems).
Why Redis for leaderboards, not PostgreSQL ORDER BY

During a contest, rank queries happen every few seconds per active user — thousands of simultaneous "what's my rank?" requests. A SELECT rank() OVER (...) window query on PostgreSQL would work at small scale, but becomes expensive when the submissions table has millions of rows and leaderboard queries are concurrent. Redis sorted sets (ZADD + ZREVRANK) give O(log N) rank operations at any scale. Use ZREVRANK (not ZRANK) when higher score = better rank — ZRANK returns ascending order (lowest score = rank 0), which is correct for penalty-time rankings but wrong for point-based leaderboards. The individual ZADD operation is atomic, but the full verdict pipeline (PostgreSQL write → Redis ZADD → user stats update) is not: the §10 reconciliation job handles divergence when any step fails.

Tradeoff: Eventual consistency — a leaderboard read immediately after an accepted verdict may still show the old rank for a brief window while the Redis write propagates. During a contest this is acceptable; for final standings, a consistent SQL query is used.
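A minimal sketch of the sorted-set operations described above, using redis-py; the key layout and point-based scoring scheme (higher score = better) are illustrative assumptions:

# Leaderboard operations on a Redis sorted set
import redis

r = redis.Redis()

def record_accepted(contest_id: str, user_id: str, new_total_points: float) -> None:
    # ZADD sets the member's score; the caller supplies the user's recomputed total.
    r.zadd(f"leaderboard:{contest_id}", {user_id: new_total_points})

def rank_of(contest_id: str, user_id: str) -> int | None:
    # ZREVRANK because higher score = better; ZRANK would invert the ordering.
    rank = r.zrevrank(f"leaderboard:{contest_id}", user_id)
    return None if rank is None else rank + 1   # 1-based rank for display

def top(contest_id: str, n: int = 10):
    # Highest scores first, scores included; the live leaderboard page query.
    return r.zrevrange(f"leaderboard:{contest_id}", 0, n - 1, withscores=True)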

Real-world comparison

Decision | This design | LeetCode (inferred) | Codeforces
Sandbox technology | Firecracker microVMs | Custom Docker + seccomp | ptrace-based isolation
Execution queue | Kafka (durable, replayable) | SQS / internal queue | Custom in-memory queue
Result delivery | Redis polling + WebSocket | Long-poll / WebSocket | Page refresh / SSE
Test case storage | S3 (object storage) | S3 / GCS | Local filesystem on judge
Leaderboard | Redis sorted sets | Redis + PostgreSQL | PostgreSQL (small scale)

Codeforces' filesystem-local test case approach works because their judge fleet is small and contests are geographically concentrated. LeetCode's global user base and always-on practice mode require distributed object storage. Neither is wrong: the right choice follows from your NFRs and scale. Note: real-world values in the table above are inferred from public engineering posts, not confirmed by the platforms.

05

Sandboxing strategy

Four canonical approaches exist for executing untrusted code, each with a different isolation primitive and a different overhead cost. The choice between them is the most consequential decision in the execution layer:

[Figure 2 diagram — four approaches, ordered from low overhead to high isolation:]
① seccomp + nsjail — isolation primitive: Linux syscall filter; overhead ~5 ms startup; shared kernel with host; complex allowlist config per language; kernel exploit → escape possible; used by older judges.
② Docker + cgroups — isolation primitive: Linux namespaces + cgroups; overhead ~50–200 ms (pre-pulled image); shared kernel with host; familiar tooling, widely supported; container breakout CVEs exist; used by most platforms.
③ gVisor (runsc) — isolation primitive: user-space kernel; overhead ~100 ms; the Sentry intercepts and mediates all syscalls, giving a smaller attack surface; some syscalls unimplemented → compatibility issues; used by Google Cloud Run.
④ Firecracker microVM (our choice) — isolation primitive: hardware virtualisation (KVM-based); overhead ~125 ms kernel boot + runtime init; separate kernel per VM, no shared kernel surface; strongest isolation; requires KVM-capable host and more complex operations; used by AWS Lambda, Fly.io.
Figure 2 — Four sandboxing approaches. ① and ② share the host kernel; ③ and ④ do not. Our choice is ④ Firecracker for production, ② Docker + seccomp for development.

Our choice for this system: Firecracker microVMs in production, Docker + seccomp in development and CI. Firecracker boots a dedicated Linux kernel per submission — there is no shared kernel surface between submissions or between submissions and the host. AWS Lambda, Fly.io, and Render all use this model for the same reason. gVisor reduces the host kernel attack surface but does not eliminate it: its Sentry process still calls a subset (~50) of real host kernel syscalls to handle I/O and memory, so a host kernel vulnerability in that call set is still reachable.

The performance cost of Firecracker is real but manageable. Pre-warming solves the cold start problem: the worker pool maintains paused Firecracker snapshots per language, each with the language runtime already initialised inside the guest. The 125 ms figure cited in Firecracker's benchmarks is kernel boot only — total cold start including Python or JVM runtime initialisation is 400–700 ms. With a warm snapshot that already has the runtime loaded, restore takes ~10–50 ms (depending on guest memory size), well within the 5 s budget. The warm pool must be maintained per language runtime — a Python snapshot cannot serve a C++ submission. At ~280 provisioned worker slots across 4 languages (§3), a conservative warm pool holds 50–60 slots per popular language, with the remainder allocated on contest demand.

Defence-in-depth: Even with Firecracker, the system applies additional layers: no outbound network from the VM, no access to instance metadata endpoints, cgroup CPU/memory/process limits inside the guest, and a watchdog process that kills the VM if wall-clock time exceeds the limit. Isolation should never rely on a single mechanism.

Per-language resource limits: how fairness is enforced

A common misconception: cgroups do not kill a process on CPU time overrun — they throttle it. To enforce a hard kill, use setrlimit(RLIMIT_CPU, soft, hard) inside the sandbox: the OS sends SIGXCPU at the soft limit and SIGKILL at the hard limit. Alternatively, a watchdog polls cpuacct.usage and issues SIGKILL when accumulated CPU time exceeds the budget. The watchdog also enforces wall_ms — catching infinite loops that sleep on I/O, which CPU time accounting misses entirely.
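A minimal sketch of the dual enforcement, assuming a supervisor process that launches the submission as a child; in production the limits come from the per-language cgroup config below and run inside the guest, so the helper name and values here are illustrative:

# Dual timeout: hard CPU-time cap via rlimit, wall-clock cap via a watchdog
import resource
import subprocess

def run_limited(cmd: list[str], cpu_s: int, wall_s: float, mem_mb: int) -> int:
    def apply_limits():
        # SIGXCPU at the soft CPU limit, SIGKILL at the hard limit (kills spin loops).
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_s, cpu_s + 1))
        # Address-space cap: allocations beyond the limit fail (MLE).
        resource.setrlimit(resource.RLIMIT_AS,
                           (mem_mb * 1024 * 1024, mem_mb * 1024 * 1024))

    proc = subprocess.Popen(cmd, preexec_fn=apply_limits)
    try:
        # Wall-clock watchdog: catches loops that block on I/O or sleep(),
        # which CPU-time accounting never sees.
        return proc.wait(timeout=wall_s)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()
        return -9   # report as TLE (killed)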

Different languages have very different runtime characteristics. Java incurs JVM startup overhead; Python is slower than C++ by a factor of 10–50 for CPU-bound code. Judges address this with language-specific multipliers defined in a configuration table:

// Resource limits config (per language)
{
  "python3":  { "cpu_ms": 5000, "wall_ms": 8000, "mem_mb": 256 },
  "java":     { "cpu_ms": 3000, "wall_ms": 6000, "mem_mb": 512 },
  "cpp":      { "cpu_ms": 1000, "wall_ms": 3000, "mem_mb": 256 },
  "javascript": { "cpu_ms": 4000, "wall_ms": 7000, "mem_mb": 256 }
}

The mem_mb limit uses memory.limit_in_bytes (cgroups v1) or memory.max (v2), which OOM-kills the process when exceeded — unlike CPU, memory enforcement does kill.

One open question: Should Python get 5× the time limit of C++, or should the problem setter define limits per language? Most platforms use multipliers; competitive programmers argue this disadvantages algorithm quality vs language speed. There is no consensus answer: the key is to acknowledge the tradeoff.
05b

API design

The API surface for an online judge is small. Three endpoints carry most of the traffic: problem fetch, submission create, and submission status. The interesting design decisions are in validation, the response contract for async execution, and status polling.

POST /api/submissions — create a submission

// Request
POST /api/submissions
Authorization: Bearer <clerk_jwt>
Content-Type: application/json

{
  "problem_id": "two-sum",
  "language":   "python3",          // enum: cpp | java | python3 | javascript
  "source_code": "class Solution:\n    def twoSum(...",
  "contest_id":  "spring-2026"         // optional; null for practice
}

// Response — 202 Accepted (async; execution has not yet started)
{
  "submission_id": "sub_01HX8K...",
  "status":        "QUEUED",
  "poll_url":      "/api/submissions/sub_01HX8K...",
  "queued_at":     "2026-04-30T12:34:56Z"
}

Why 202 Accepted, not 201 Created? The submission row is created (so 201 would be defensible), but the primary resource — the execution verdict — does not yet exist. HTTP 202 explicitly signals "the request has been accepted for processing, but processing has not been completed," setting the correct expectation that the client must poll for the result.

Validation on submission create: source code size limit (64 KB), language must be in the supported set for the problem, rate limit enforced (5 submissions/minute per user per problem), contest window check if contest_id is set. Source code is stored verbatim with no sanitisation; the sandbox handles isolation. An Idempotency-Key request header (client-generated UUID) allows the API to return the existing submission record if the same request is retried within a 10-minute window, preventing duplicate executions from network retries or double-clicks.
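A minimal sketch of that idempotency check, assuming Redis holds the key-to-submission mapping; the key layout and function boundaries are illustrative, and the TTL mirrors the 10-minute window above:

# Idempotency-Key handling on submission create
import redis

r = redis.Redis()
IDEMPOTENCY_TTL_S = 600   # 10-minute replay window

def resolve_idempotent(user_id: str, idempotency_key: str, new_submission_id: str) -> str:
    """Return the submission ID to respond with: the stored one on replay, else the new one."""
    redis_key = f"idem:{user_id}:{idempotency_key}"
    # SET NX stores the mapping only if the key is absent; True means first write.
    if r.set(redis_key, new_submission_id, nx=True, ex=IDEMPOTENCY_TTL_S):
        return new_submission_id            # first request: insert row + enqueue job
    return r.get(redis_key).decode()        # replay: return the original submission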

GET /api/submissions/:id — poll for result

// Response while executing
{
  "submission_id": "sub_01HX8K...",
  "status":        "RUNNING",        // QUEUED | RUNNING | ACCEPTED | WRONG_ANSWER | TLE | MLE | RE | CE
  "tests_passed":  7,
  "tests_total":   20,
  "runtime_ms":    null,             // populated on completion
  "memory_kb":     null
}

// Response on completion (ACCEPTED)
{
  "submission_id": "sub_01HX8K...",
  "status":        "ACCEPTED",
  "tests_passed":  20,
  "tests_total":   20,
  "runtime_ms":    84,
  "memory_kb":     18432,
  "percentile_runtime": 92.4      // faster than 92.4% of accepted solutions
}
Optional endpoint | Method | Level | Notes
/api/problems/:id | GET | L3/L4 | Returns problem statement, constraints, examples. Served from CDN cache.
/api/submissions/:id/detail | GET | L5 | Returns per-test-case results. Actual vs expected diffs are only returned for sample cases (is_sample=true) during an active contest; hidden case diffs are withheld until contest end — preventing test case extraction via strategic wrong-answer submissions.
/api/contests/:id/leaderboard | GET | L5 | Paginated leaderboard from Redis sorted set. Cursor-based pagination.
/api/submissions (run mode) | POST | L6 | With "mode":"run" flag: runs against user-provided input only, not the test suite. Lower-priority queue.
/api/submissions/bulk | POST | L7/L8 | Batch submission for hiring assessment platforms. Async with webhook callback.

Run mode vs submit mode — separate Kafka topics, not a priority field. Kafka does not support per-message priority within a partition — a "low priority" flag in a shared topic would be ignored. Use two separate topics: submissions.submit (full test suite, contest-critical) and submissions.run (user-provided input only, best-effort). During a contest, worker consumer group assignments bias toward submissions.submit by increasing the number of workers subscribed to it. Workers subscribed only to submissions.run automatically shed load when not needed without any priority scheduling logic.

WebSocket vs polling: The 5 s end-to-end budget (§2) has ~500 ms of slack after execution; polling every 500 ms consumes that slack entirely. WebSocket push eliminates the polling lag at the cost of routing complexity — both are viable, pick based on your tolerance for stateful connections. Polling is simpler and stateless: the client calls GET /submissions/:id every 500 ms until status is terminal. WebSocket reduces latency by ~500 ms on average and eliminates wasted requests, but requires sticky sessions or a pub/sub fan-out (Redis Pub/Sub) to route worker completion events to the right connection. L5+ should name the tradeoff; L7+ should describe the Redis Pub/Sub design.

06

Core submission flow

The dominant write path — user submits code, gets a verdict — is where the latency NFR (§2) and the isolation NFR collide. A synchronous execution model satisfies neither: it blocks connections during the 1–5 s execution window and makes isolation harder by coupling the API process to the sandbox. The queue-based design satisfies both at the cost of one tradeoff: the client must poll or subscribe for results rather than waiting on a synchronous response.

[Figure 3 diagram — submission flow: ① user submits code (POST /api/submissions) → ② API validates and stores the submission row (QUEUED) → ③ enqueue job on Kafka topic submissions, return 202 Accepted with sub_id → ④ worker dequeues and resumes a pre-warmed microVM → ⑤ execute against test cases, comparing stdout (on Wrong / TLE / RE / MLE, record the failing test index) → ⑥ write verdict + metrics to PostgreSQL and the Redis status key; if all tests pass, update the leaderboard → ⑦ client receives the result by polling /submissions/:id or via WebSocket push. Steps ①–③ synchronous, ④–⑦ asynchronous.]
Figure 3 — Core submission flow. Steps ①–③ are synchronous (milliseconds). Steps ④–⑦ are async (seconds). The 202 response decouples ingestion from execution.

Latency budget recap (5 s end-to-end): Steps ①–③ take ~100–200 ms (API validation, DB write, queue enqueue). Step ④ takes ~10–50 ms with a pre-warmed Firecracker snapshot (§5). Step ⑤ takes 0–3 s (compilation + execution). Step ⑥ takes ~50 ms (DB write + Redis update). Step ⑦ adds up to ~500 ms if polling every 500 ms, or ~10 ms if using WebSocket push. Summing the worst case of each step gives roughly 3.8 s with polling (about 3.3 s with WebSocket push), within the 5 s NFR when the warm pool is healthy (§9); the remaining headroom is quickly absorbed by a cold VM boot (400–700 ms, §5) or a slow compile.

The critical tradeoff in this flow is between polling and WebSocket push for step ⑦. Polling is simpler: the client calls GET /submissions/:id every 500 ms; the API reads from Redis; no server-side connection state. WebSocket push is lower latency: the worker publishes a completion event to a Redis Pub/Sub channel, and the API server subscribed on behalf of the client's WebSocket connection forwards it immediately.

The polling design is strongly preferred at L3–L5. At L5–L6, the interviewer may probe whether you know that WebSocket connections require sticky routing (or Redis Pub/Sub fan-out) to work correctly behind a load balancer. Name that constraint and it becomes a strength, not a weakness.
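A minimal sketch of the push path, assuming Redis Pub/Sub for fan-out and an in-process registry of open WebSocket connections on each API node; the channel naming and the WebSocket send method are framework-specific assumptions:

# Worker publishes verdicts; API nodes forward them to connected WebSockets
import json
import redis.asyncio as aioredis

r = aioredis.Redis()
open_sockets = {}   # submission_id -> WebSocket connection held by this API node

# Worker side: publish after the verdict is written to PostgreSQL + Redis.
async def publish_verdict(submission_id: str, verdict: dict) -> None:
    await r.publish(f"verdict:{submission_id}", json.dumps(verdict))

# API side: one subscriber task per node forwards events to the right socket.
async def forward_verdicts() -> None:
    pubsub = r.pubsub()
    await pubsub.psubscribe("verdict:*")
    async for msg in pubsub.listen():
        if msg["type"] != "pmessage":
            continue                      # skip subscribe confirmations
        submission_id = msg["channel"].decode().split(":", 1)[1]
        ws = open_sockets.get(submission_id)
        if ws is not None:
            await ws.send_text(msg["data"].decode())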

Output comparison: why "compare stdout" is incomplete

Most problems use exact string comparison, but three categories require different comparators: floating-point problems (accept output within a relative or absolute epsilon), problems with multiple valid outputs (e.g., "output any valid spanning tree"), and interactive problems (a two-process protocol where the judge sends input based on the program's responses). Hardcoding string equality breaks all three.

The judge config should include a comparator_type field in the problem schema, with three values:

// problem config (stored in PostgreSQL problems table)
{
  "comparator_type": "exact",        // exact | float_eps | special_judge
  "float_epsilon":   1e-6,            // used when comparator_type = float_eps
  "special_judge_s3_key": null        // path to comparator binary in S3
}

Workers load the comparator alongside test case files. For special_judge, the worker downloads the comparator binary from S3, runs it as a subprocess with (input, expected_output, actual_output) as arguments, and interprets its exit code as pass/fail. The comparator binary itself is sandboxed — it runs in a separate (lighter) container, not in the submission microVM.
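A minimal sketch of the comparator dispatch inside the worker, assuming the special-judge binary was already downloaded to a local path; the file layout and function name are illustrative:

# Comparator dispatch: exact | float_eps | special_judge
from pathlib import Path
import subprocess

def outputs_match(cfg: dict, input_path: str, expected_path: str, actual_path: str) -> bool:
    kind = cfg["comparator_type"]
    if kind == "exact":
        # Trailing whitespace / final newline differences are conventionally ignored.
        return Path(actual_path).read_text().rstrip() == Path(expected_path).read_text().rstrip()
    if kind == "float_eps":
        eps = cfg["float_epsilon"]
        exp = Path(expected_path).read_text().split()
        act = Path(actual_path).read_text().split()
        return len(exp) == len(act) and all(
            abs(float(a) - float(e)) <= eps * max(1.0, abs(float(e)))
            for e, a in zip(exp, act)
        )
    if kind == "special_judge":
        # Comparator binary fetched from S3 earlier; exit code 0 means accepted.
        # It runs in its own (lighter) sandbox, not in the submission microVM.
        result = subprocess.run(
            ["./comparator", input_path, expected_path, actual_path], timeout=10
        )
        return result.returncode == 0
    raise ValueError(f"unknown comparator_type {kind!r}")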

What to say in the interview: "Compare stdout" is the naive approach and correct for most problems. Flag that real judges need at least three comparator modes, that the comparator type is a property of the problem (not the submission), and that special judge binaries introduce a second trusted code execution path that needs its own — though less strict — sandboxing.
07

Data model

Before defining any columns, identify the entities and how they get accessed. The schema follows the access patterns, not the other way around.

Entities and access patterns

Entity | Primary operations | Frequency | Query shape
Problem | Read by ID, list by tag/difficulty | Very high (cached) | Point lookup + tag filter
TestCase | Read all cases for a problem (by worker) | High per execution | Bulk fetch by problem_id
Submission | Insert on submit; update on verdict; read by user; read by contest | High writes, moderate reads | Point lookup by ID; scan by user_id; scan by contest_id + problem_id
User | Read on auth; update stats on accepted verdict | Low writes, moderate reads | Point lookup by ID or clerk_id
Contest | Read during contest window; leaderboard during + after | Low; bursty during contest | Point lookup; join to submissions

Two things stand out from this table. First, test cases are accessed in bulk by problem — all cases for a problem are fetched together by the worker, never individually. This means the test case files belong in object storage (S3) with a key pattern of problems/{problem_id}/cases/{case_id}.{in|out}, while metadata (case count, time limit per case) stays in PostgreSQL. Second, submissions are written at high volume but queried in two very different ways: point-lookup by ID (status polling) and range scan by user (submission history). Both access patterns need covering indexes.

[Figure 4 diagram — tables and key columns:]
problems: id ULID PK · slug TEXT UNIQUE · title TEXT · difficulty ENUM · tags TEXT[] · time_limit_ms INT · test_count INT
test_cases (metadata): id ULID PK · problem_id FK → problems · is_sample BOOL · s3_input_key TEXT · s3_output_key TEXT
submissions: id ULID PK · user_id FK → users · problem_id FK → problems · contest_id FK nullable · language ENUM · status ENUM · source_code TEXT · runtime_ms INT · memory_kb INT · submitted_at TIMESTAMPTZ
users: id ULID PK · clerk_id TEXT UNIQUE · username TEXT UNIQUE · solved_count INT · rating INT
contests: id ULID PK · slug TEXT UNIQUE · start_at TIMESTAMPTZ · end_at TIMESTAMPTZ · problem_ids ULID[]
Key indexes: submissions(id) for status polls · submissions(user_id, submitted_at) · submissions(contest_id, problem_id)
Figure 4 — Data model. Test case files live in S3; only metadata (S3 key paths) are stored in PostgreSQL. The submissions index callout shows the three covering indexes that drive the most frequent query patterns.
Why store source_code in PostgreSQL, not S3

Source code is small (median ~2 KB, max 64 KB) and accessed frequently — every submission history page load reads it. Storing it as a PostgreSQL TEXT column keeps reads fast and transactional (the source code and status are always consistent). The 64 KB limit must be enforced at the application layer (CHECK constraint or API validation) — PostgreSQL's TEXT type has no inherent length cap. PostgreSQL TOASTs column values over ~2 KB out-of-line automatically, so a 10 KB source code value incurs a second page read on access; this is acceptable for this workload but worth knowing if query latency becomes a concern.

When this breaks down: If the platform adds support for multi-file projects (e.g., import from other files), source code grows beyond a single column and object storage becomes appropriate. At that point the schema adds a source_s3_key column and keeps code out of PostgreSQL.
Why ULID for primary keys instead of auto-increment INT

ULIDs (Universally Unique Lexicographically Sortable Identifiers) are time-ordered, globally unique, and safe to expose in URLs. Sequential INT IDs expose submission volume (a competitor can enumerate your submission counts from public IDs), and UUID v4s are random (poor B-tree index locality). ULID gives you all three properties: uniqueness, time-ordering for index locality, and opacity.

Alternatives: UUID v7 (RFC 9562; time-ordered; PostgreSQL needs the pg_uuidv7 extension or app-level generation) · Snowflake ID (Twitter-style) · CUID2
08

Caching strategy

Caching in an online judge operates at three distinct layers, each serving a different read pattern. The key insight is that these layers have completely different invalidation requirements: problem content almost never changes (long TTL), submission status changes frequently during execution (short TTL), and leaderboards change on every accepted verdict (near-real-time).

[Figure 5 diagram — cache hierarchy: Client browser → Layer 1: CDN (CloudFront / Fastly) — problem HTML/JSON and static assets, 24 h TTL for problems, invalidated on edit, ~80% of reads served → on miss, Layer 2: Redis — submission status keys (30 min TTL), leaderboard sorted sets, problem metadata (5 min TTL), ~15% of reads served → on miss, Layer 3: PostgreSQL — all writes land here, submission history, authoritative source, ~5% of reads. Workers additionally keep a local cache of test case files fetched from S3.]
Figure 5 — Three-layer cache hierarchy. The CDN serves the vast majority of read traffic. Redis handles dynamic data (status, leaderboard). PostgreSQL is the fallback for cache misses only.
Layer | What it caches | TTL / invalidation | Why it exists
CDN edge | Problem statements (HTML/JSON), static assets | 24 h TTL; explicit invalidation on problem edit | Problem content is the highest-volume read and never changes once published. CDN eliminates origin hits entirely for the dominant traffic pattern.
Redis — status keys | sub:{id}:status → current verdict + partial results | 30 min TTL; set by worker on each test case completion | Status polling happens every 500 ms per active submission. Reading from PostgreSQL on every poll would be wasteful; Redis gives microsecond reads for this hot key.
Redis — leaderboard | Sorted set per contest: leaderboard:{contest_id} | No TTL; ZADD on every accepted verdict; archived after contest | O(log N) rank operations at any scale. Concurrent rank queries during a contest would create hot-spot reads on PostgreSQL without this layer.
Worker-local disk | Test case files (S3 objects) for recently-seen problems | LRU eviction; invalidated if test cases change (rare) | Workers process many submissions for the same popular problem. Fetching test cases from S3 on each submission adds ~50–100 ms of latency. Local caching reduces this to a disk read.

Leaderboard consistency: The Redis leaderboard may lag by the duration of a single ZADD write after an accepted verdict. In practice this window is under 10 ms in normal operation, but during a network partition it can grow. For final contest standings, the system always regenerates rankings from PostgreSQL — Redis is the live view, not the authoritative record.

09

Worker scaling and contest load

Scaling an online judge beyond a few thousand daily active users requires addressing three distinct bottlenecks: the execution worker pool (the primary constraint), the submission queue under contest burst load, and the database under high-write submission traffic. Each requires a different approach.

[Figure 6 diagram — production architecture: CDN (edge cached) → Load balancer (rate limit, TLS) → API cluster (api-1 … api-N, stateless, autoscale on CPU) → Kafka cluster (replication=3, min-ISR=2, acks=all, partitioned by language) → worker pool (Python, Java, C++, JS; autoscale on queue depth). PostgreSQL primary takes all writes behind a PgBouncer connection pool, with 2 read replicas for submission history and analytics reads. Redis Cluster serves status, leaderboard, and Pub/Sub for WS push. S3 holds test cases, with a worker-local LRU cache. An autoscaler watches queue depth and scales the worker fleet.]
Scaling levers (in priority order): 1. worker fleet — scale on the queue depth metric (primary bottleneck); 2. Kafka partitions — partition per language for parallel consumption; 3. PostgreSQL read replicas — route history queries off the primary; 4. Redis Cluster — shard leaderboard keys across slots; 5. API tier — autoscale on CPU (least constrained layer); 6. PgBouncer — connection pool in front of the PostgreSQL primary; 7. pre-warm the worker pool 15 min before contest start.
Figure 6 — Production architecture. The autoscaler monitors Kafka queue depth and provisions or terminates worker instances. Workers are the primary scaling constraint; the API tier is rarely the bottleneck.
Worker autoscaling: how to size and respond to contest load

The autoscaler watches Kafka consumer group lag for the submissions topics. When lag exceeds a threshold (e.g., 500 messages), it provisions additional worker instances. Lag alone is a blunt signal for slow-processing consumers, though: lag is calculated as latest_offset − committed_offset, and workers only commit offsets after writing the verdict 3–5 seconds later, so the metric lumps together jobs still waiting in the queue and jobs already executing, and it only moves in coarse steps. Supplement lag with a "submissions in RUNNING state" counter from Redis: the combined signal distinguishes queued from in-flight work. Tools like Burrow (LinkedIn's open-source Kafka consumer monitor) evaluate committed-offset lag over a sliding window rather than against a fixed threshold — useful precisely when consumers have long processing windows and instantaneous lag is noisy.
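A minimal sketch of the combined scaling signal, assuming lag is read from Kafka's admin API or a metrics exporter and the RUNNING count from a Redis counter that workers increment and decrement; the key name, drain target, and constants are illustrative:

# Autoscaler: size the fleet from queued + in-flight work, not lag alone
import math
import redis

r = redis.Redis()

AVG_EXECUTION_S = 3.0
TARGET_UTILISATION = 0.75
DRAIN_TARGET_S = 5.0   # drain the queued backlog within the latency budget

def running_count() -> int:
    # Workers INCR/DECR this key as executions start and finish (assumed convention).
    return int(r.get("submissions:running") or 0)

def desired_workers(consumer_lag: int, running: int) -> int:
    # Committed-offset lag counts queued + in-flight jobs; subtracting RUNNING
    # isolates the jobs still waiting in the queue.
    queued = max(0, consumer_lag - running)
    # Keep in-flight jobs on their current slots and add enough slots to drain
    # the queued backlog within the drain target at the utilisation target.
    extra = queued * AVG_EXECUTION_S / DRAIN_TARGET_S / TARGET_UTILISATION
    return max(1, running + math.ceil(extra))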

For contests, the autoscaler uses a scheduled pre-warm: 15 minutes before contest start, the pool scales to maximum capacity regardless of current queue depth. This eliminates cold-start delays during the critical first minutes when all participants submit simultaneously. Pre-warmed Firecracker snapshots are kept ready per language.

Tradeoff: Pre-warming wastes capacity if the contest is cancelled or undersubscribed. At ~$0.015/hr per instance, 300 pre-warmed slots for 15 minutes costs about $1.10 — an acceptable insurance cost.
Kafka partition strategy: why partition by language

Use separate topics per language (submissions.python3, submissions.cpp, etc.) rather than a single topic with a language key. This is important: with a single topic partitioned by language key, you get at most 4 active partitions for 4 languages, meaning at most 4 concurrent consumers, which cannot support a 280-worker pool. Separate topics each with 64–128 partitions allows the worker pool for each language to scale to the required depth independently.

Producer configuration is critical for durability: submissions must be produced with acks=all and enable.idempotence=true. Without acks=all, the min.insync.replicas=2 setting is completely ignored — a broker acknowledges the write before replication, and a leader failure before replication means the job is silently lost. Since a lost submission is unacceptable, acks=all is non-negotiable.
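A minimal sketch of those producer settings, using confluent-kafka (librdkafka property names); the broker addresses, topic naming, and flush-based error handling are illustrative assumptions:

# Durable submission enqueue: acks=all + idempotent producer
import json
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "kafka-1:9092,kafka-2:9092",
    "acks": "all",                  # wait for min.insync.replicas before acking
    "enable.idempotence": True,     # no duplicates on internal retries
})

def enqueue_submission(job: dict) -> None:
    producer.produce(
        topic=f"submissions.{job['language']}",
        key=job["submission_id"],
        value=json.dumps(job).encode(),
    )
    # Block until the broker acknowledges, so the API never returns 202 for a
    # job that was silently dropped.
    remaining = producer.flush(timeout=5)
    if remaining:
        raise RuntimeError("submission enqueue not acknowledged within 5 s")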

Alternative: A single topic with a shared worker pool handling all languages. Simpler operationally, but a language-specific contest surge can starve other language queues. Separate topics give independent scaling and fault isolation.
Database scaling: when read replicas and PgBouncer are needed

The primary bottleneck is the worker pool, not the database. PostgreSQL primary handles submission inserts (one write per submission, plus one update when the verdict is written). At 70 peak submissions/sec, that's ~140 writes/sec — well within a single Postgres instance's capacity. Read replicas become necessary when submission history queries, analytics dashboards, and problem listing queries compete with these writes for I/O on the primary.

PgBouncer in transaction mode is added in front of the primary when the worker fleet scales past ~100 nodes. Each worker opens a database connection to write verdicts; 200+ concurrent connections exhaust PostgreSQL's connection limit. PgBouncer multiplexes many client connections onto a small pool of real PostgreSQL connections.

Important: PgBouncer transaction mode is incompatible with server-side named prepared statements. Most ORMs (SQLAlchemy, ActiveRecord, Diesel) use server-side prepared statements by default — enabling transaction mode without disabling them causes prepared statement "..." does not exist errors when a connection is returned to the pool and a different client tries to use it. Configure your driver to use simple query protocol or client-side prepared statements before enabling transaction mode. PgBouncer session mode avoids this but provides weaker connection multiplexing.
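A minimal sketch of the driver-side change, assuming SQLAlchemy with the asyncpg driver; the DSN is illustrative. asyncpg's statement_cache_size=0 disables its per-connection prepared statement cache so connections are safe to multiplex through PgBouncer's transaction mode:

# Worker/API database engine configured for PgBouncer transaction pooling
from sqlalchemy.ext.asyncio import create_async_engine

engine = create_async_engine(
    "postgresql+asyncpg://judge:***@pgbouncer:6432/judge",
    # asyncpg caches server-side prepared statements per connection by default;
    # PgBouncer transaction pooling hands the connection to other clients, so
    # the cache must be disabled.
    connect_args={"statement_cache_size": 0},
    pool_size=10,
    max_overflow=5,
)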

What to say in the interview: "Database scaling is not the first thing I'd reach for: the worker fleet is the bottleneck. I'd add read replicas when analytics queries start competing with writes, and PgBouncer when the worker fleet grows past ~100 nodes and connection count becomes the constraint."
10

Failure modes & edge cases

An online judge has several failure modes that don't exist in typical web services, because the execution layer introduces an entirely new class of failure: the sandboxed process. Understanding how each failure propagates — and where it's caught — is what separates an L5 answer from an L3 answer.

Scenario | Problem | Solution | Level
Worker crashes mid-execution | Submission is dequeued but verdict is never written. Job is "lost." | Kafka's at-least-once delivery: the job is requeued when the consumer heartbeat times out. Worker commits the offset only after writing the verdict to PostgreSQL and Redis. Idempotent verdict writes handle duplicate execution: INSERT INTO submissions (...) ON CONFLICT (id) DO UPDATE SET status = EXCLUDED.status WHERE submissions.status = 'QUEUED': the guard prevents overwriting a completed verdict if a replay arrives late. | L3/L4
Infinite loop / TLE | Worker process hangs, holding a VM slot indefinitely. Other submissions queue behind it. | Dual timeout enforcement: the hard CPU-time limit (RLIMIT_CPU, §5) kills tight spin loops; a wall-clock watchdog (SIGKILL after wall_ms) catches I/O-blocking infinite loops. Both must fire independently — using only one can be bypassed via sleep(). | L3/L4
Submission produces non-deterministic output | Multithreaded code with race conditions passes sometimes, fails others. Users complain about "flaky" judgements. | The judge runs each submission once — it does not re-run on failure. Whatever verdict that single run produces stands (wrong answer or runtime error). Problem setters should avoid problems where correct solutions rely on thread scheduling. Document this as a known limitation. | L5
Test case file corruption in S3 | A test case input or expected output file is corrupted (bit flip, incomplete write). All submissions for a problem receive wrong verdicts. | Store a SHA-256 checksum of each test case in PostgreSQL alongside the S3 key. Workers verify the checksum after download; a mismatch triggers an alert and stops judging for that problem until the file is restored. S3 versioning retains previous known-good versions. | L5
Redis leaderboard diverges from PostgreSQL | A Redis crash between verdict write and ZADD leaves a correct submission missing from the leaderboard. | On Redis restart, replay accepted submission records from PostgreSQL to reconstruct the sorted set. A background reconciliation job runs every 5 minutes during contests to detect and repair divergence. Redis is the live view; PostgreSQL is the source of truth. | L5
Contest starts, queue fills instantly | 10,000 participants submit within the first 60 seconds. Queue depth spikes to 50,000+. Latency blows past 5 s for submissions in the tail. | Pre-warm worker pool 15 min before contest start (§9). Apply per-user rate limiting (5 submissions/min) to smooth the burst. If queue depth still exceeds threshold, surface an estimated wait time to users ("your submission is queued, expected in ~8 s") instead of a blank spinner. | L5
Sandbox escape via kernel exploit | A submission exploits a kernel vulnerability to break out of the microVM and access the host or other VMs. | Defence-in-depth: Firecracker provides a hardware-virtualisation boundary; even if the guest kernel is compromised, the hypervisor boundary limits the blast radius to the single VM. VMs run as unprivileged users; host network is unreachable from the guest; metadata endpoints are blocked. Monitor for anomalous syscall patterns (unexpected exec calls, file descriptor creation). No architecture fully prevents zero-day exploits; the goal is to minimise blast radius and detect quickly. | L7/L8
Test case authoring error — wrong expected output | A problem setter creates a test case with an incorrect expected output. All correct solutions are judged as Wrong Answer. | Separate the problem-setter role from the judging system: require at least one accepted reference solution before activating a problem. Run the reference solution against all test cases during problem creation; if any case fails, block publication. Store the reference solution's output hash as the ground truth. Problem authoring pipeline: test case files upload via an admin-only path; SHA-256 stored in PostgreSQL at write time; S3 versioning enabled so rollback is possible; reference solution run triggered automatically on publish: the problem goes live only if every test case passes. | L7/L8
10b

Security, trust & compliance

Three security concerns are specific to online judges and map directly to components already in the architecture: abuse of the submission pipeline, the problem setter trust model (who can create test cases and how are they validated?), and data integrity under adversarial conditions. L6+ interviewers probe all three explicitly.

Submission rate limiting and abuse prevention

Without rate limiting, a single malicious actor can flood the execution queue with thousands of submissions, consuming worker capacity and degrading fairness for other users. Rate limiting is enforced at two levels:

Per-user submission rate limiting: how it's enforced

The API enforces a per-user, per-problem rate limit: maximum 5 submissions/minute, implemented as a fixed-window Redis counter (INCR ratelimit:sub:{user_id}:{problem_id}:{minute_bucket} with a 60-second TTL). Requests exceeding the limit receive a 429 Too Many Requests with a Retry-After header. A separate, higher limit (30 submissions/minute globally per user) catches bulk scripted abuse before it reaches the per-problem check.
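A minimal sketch of those counters, using redis-py; the key layout follows the text above, and the limit values are the ones stated:

# Fixed-window submission rate limiting (global + per-problem)
import time
import redis

r = redis.Redis()
PER_PROBLEM_LIMIT = 5    # submissions per minute per user per problem
GLOBAL_LIMIT = 30        # submissions per minute per user across all problems

def allow_submission(user_id: str, problem_id: str) -> bool:
    minute = int(time.time()) // 60
    pipe = r.pipeline()
    for key in (
        f"ratelimit:sub:{user_id}:{minute}",                 # global window
        f"ratelimit:sub:{user_id}:{problem_id}:{minute}",    # per-problem window
    ):
        pipe.incr(key)
        pipe.expire(key, 60)
    counts = pipe.execute()   # [global_count, True, per_problem_count, True]
    return counts[0] <= GLOBAL_LIMIT and counts[2] <= PER_PROBLEM_LIMIT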

During an active contest, the submission window check also applies: submissions outside the contest time window are rejected at the API before they reach the queue, not silently enqueued to be judged incorrectly.

L5 probe: "A user submits the same wrong solution repeatedly to try to reverse-engineer the hidden test cases." — The per-problem rate limit (5/min) limits this to 5 attempts per minute. For full-suite submissions, hiding per-test-case diffs during an active contest (see §5b) prevents learning which test case failed, making brute-force test case extraction impractical even within the rate limit.
Problem setter trust model: who can upload test cases?

Problem setters are trusted users — they can upload test case files directly to S3 via an admin-only signed URL path. But trust is scoped: a problem setter can only write test cases for problems they own; they cannot read other problems' test cases, user submissions, or execution verdicts. The IAM policy on the S3 bucket enforces prefix-level write isolation: problems/{problem_id}/cases/* is writable only by the owning problem setter's IAM role.

The reference solution validation pipeline (see §10 — test case authoring failure mode) further constrains trust: a problem cannot go live until a reference solution passes all test cases. This means a problem setter who accidentally or maliciously uploads bad test cases is caught before any user submission is judged against them. The pipeline runs in an isolated worker: the reference solution itself is sandboxed exactly like a user submission.

What to say in the interview: "I treat problem setters as a semi-trusted role: they have write access to their own problem's test cases but cannot see submissions or other problems' data. Every problem requires a passing reference solution before going live — this is both a quality gate and a security check."
Alternatives: Problem setter uploads to a staging bucket and an admin reviews before promoting to production · Automated test case generation (fuzzing) replaces manual upload
Test case integrity: SHA-256 checksums end to end

Every test case file has a SHA-256 checksum stored in PostgreSQL alongside its S3 key at upload time. Workers verify the checksum after downloading a test case file before running any submission against it. A mismatch triggers an immediate alert and pauses judging for that problem — no user receives a wrong verdict from a corrupted input/output file. S3 versioning retains the previous known-good version for rollback.

Checksum verification also closes a subtle attack vector: a compromised problem setter who modifies a test case file in S3 directly (bypassing the admin upload path) would trigger checksum failures on workers, alerting the platform before any user is affected.
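A minimal sketch of the worker-side verification, assuming boto3 for the S3 download and that the expected digest comes from the test_cases row in PostgreSQL; the bucket name and function boundaries are illustrative:

# Verify a test case file's SHA-256 before judging against it
import hashlib
import boto3

s3 = boto3.client("s3")
BUCKET = "judge-test-cases"

class TestCaseIntegrityError(Exception):
    pass

def fetch_verified(s3_key: str, expected_sha256: str) -> bytes:
    """Download a test case file; refuse to judge against it if the digest differs."""
    body = s3.get_object(Bucket=BUCKET, Key=s3_key)["Body"].read()
    digest = hashlib.sha256(body).hexdigest()
    if digest != expected_sha256:
        # Caller raises an alert and pauses judging for this problem.
        raise TestCaseIntegrityError(f"{s3_key}: expected {expected_sha256}, got {digest}")
    return body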

Verification cost: SHA-256 of a 100 KB test case takes ~0.1 ms on modern hardware. At 50 test cases per problem, total verification overhead per submission is ~5 ms — negligible within the 5 s budget.
Plagiarism detection: a background pipeline, not a blocking check

Plagiarism detection cannot run in the critical path — it's too expensive. The standard approach is a background pipeline triggered after contest end:

  1. Normalise all accepted submissions: strip comments, rename variables to canonical names (var1, var2...), normalise whitespace.
  2. Tokenise the normalised source into a sequence of language tokens (keywords, operators, literals).
  3. MinHash LSH (Locality-Sensitive Hashing): generate a MinHash signature for each submission's token set. Submissions whose signatures fall in the same LSH bucket are near-duplicate candidates — O(1) lookup per submission, not O(N²) pairwise comparison.
  4. Flag candidates for human review. Never auto-disqualify — false positives (two people independently solving the same problem the same way) are common.
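A minimal sketch of steps 2–3 using the datasketch library; the shingle size, similarity threshold, and num_perm are illustrative tuning assumptions:

# Near-duplicate candidate detection over normalised token streams
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128

def signature(tokens: list[str], shingle: int = 5) -> MinHash:
    """MinHash over overlapping token 5-grams of the normalised source."""
    m = MinHash(num_perm=NUM_PERM)
    for i in range(max(1, len(tokens) - shingle + 1)):
        m.update(" ".join(tokens[i:i + shingle]).encode("utf-8"))
    return m

# Submissions whose token-shingle sets have Jaccard similarity >= 0.8 share a bucket.
lsh = MinHashLSH(threshold=0.8, num_perm=NUM_PERM)

def index_and_flag(submission_id: str, tokens: list[str]) -> list[str]:
    sig = signature(tokens)
    candidates = lsh.query(sig)       # near-duplicate candidates, no pairwise scan
    lsh.insert(submission_id, sig)    # make this submission findable by later ones
    return candidates                 # hand off to human review, never auto-disqualify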
L7 probe: "A user submits 50 slightly different copies of the same solution during a contest to inflate their attempt count and confuse plagiarism detection." — The per-user rate limit (5/min) bounds total attempts. MinHash LSH would flag all 50 as near-duplicates of each other. Token normalisation removes trivial obfuscations (rename variables, add comments). The real challenge is detecting structural plagiarism where the algorithm is copied but re-implemented — this requires AST-level comparison, which is out of scope for an interview answer.
11

How to answer by level

L3 / L4 SDE I / SDE II: can you build a working system?
What good looks like
  • Identifies the core flow: submit → queue → execute → return verdict
  • Recognises code must run in an isolated environment (even if just says "Docker")
  • Proposes PostgreSQL for problems and submissions
  • Handles the basic timeout/TLE case with a kill signal
  • Knows the API returns async (202) not sync
What separates L5 from L3/L4
  • L3 says "run in Docker" — L5 asks what happens if a container escapes
  • L3 doesn't explain why the queue is necessary — L5 quantifies the concurrency problem
  • L3 ignores worker crash recovery — L5 explains at-least-once delivery
  • L3 polls PostgreSQL directly — L5 introduces Redis for hot status keys
L5 Senior SDE: do you understand the tradeoffs?
What good looks like
  • Compares Docker vs gVisor vs Firecracker — names the isolation primitive for each
  • Applies Little's Law to size the worker pool
  • Describes at-least-once delivery and idempotent verdict writes
  • Separates test case metadata (PostgreSQL) from test case files (S3)
  • Explains dual timeout (CPU-time + wall-clock) and why one alone is insufficient
  • Designs polling vs WebSocket result delivery and names the WebSocket routing constraint
What separates L6 from L5
  • L5 says "autoscale workers" — L6 specifies the queue depth metric, pre-warm trigger, and cost model
  • L5 knows Redis sorted sets for leaderboard — L6 describes reconciliation on Redis failure
  • L5 designs the schema — L6 addresses the submission volume at 2 B rows and partitioning strategy
  • L5 designs for average load — L6 explicitly designs for the contest spike
L6 Staff SDE: can you own this end-to-end?
What good looks like
  • Owns the full execution layer: Firecracker snapshot strategy, warm pool sizing, per-language partition strategy
  • Contest mode as a first-class concern: pre-warm schedule, per-user rate limiting, wait time estimation
  • PostgreSQL partitioning by submission date for the 2 B row scale
  • Test case integrity: SHA-256 checksum verification by worker
  • Designs the reference solution validation pipeline for problem authoring
  • PgBouncer rationale: when connections, not queries, become the bottleneck
What separates L7 from L6
  • L6 designs for one region — L7 considers multi-region execution for latency and data residency
  • L6 ignores plagiarism — L7 sketches a token-fingerprint detection pipeline
  • L6 optimises for throughput — L7 models the cost of Firecracker vs spot instance farms
  • L6 knows the failure modes — L7 has opinions on SLA tiers and graceful degradation strategies
L7 / L8 Principal / Distinguished: should we build this, and how?
What good looks like
  • Frames build vs buy: AWS Fargate / Lambda for isolation vs owned Firecracker fleet — explicit cost and ops trade-offs
  • Multi-region: execution in the user's nearest region for latency; results replicated globally for leaderboard consistency
  • Plagiarism detection: token normalisation (rename variables) + MinHash LSH for near-duplicate detection — a background pipeline, not in the critical path
  • Graceful degradation SLA: during capacity exhaustion, run sample tests immediately and defer full test suite — give partial feedback within 1 s
  • Versioned judge engine: when the platform upgrades a language runtime, old submissions can be re-judged to verify reproducibility
Common gaps at L7
  • Over-engineering the API or database layer when the execution layer is the constraint
  • Proposing expensive solutions (dedicated bare-metal per user) without acknowledging the cost model
  • Ignoring operational complexity of self-managed Firecracker vs managed container services
  • Not discussing the platform's trust model for problem setters (who can write test cases?)

Classic probes — level-differentiated answers

Question | L3/L4 | L5/L6 | L7/L8
"How do you prevent a submission from reading other users' code?" | Run each in its own container / process | Separate kernel per microVM (Firecracker); no shared filesystem; no network; capability drop | Defence-in-depth: microVM + seccomp + no metadata endpoint + anomaly detection. Discuss blast radius of a zero-day hypervisor escape.
"What happens if the queue fills up during a contest?" | Add more servers | Pre-warm pool before contest; autoscale on queue depth; per-user rate limiting; surface estimated wait to user | Graceful degradation: run sample test cases immediately for fast partial feedback; defer full suite; SLA tiering for paid vs free tier; cost model for pre-warming
"How does your leaderboard stay consistent?" | Query the database | Redis sorted sets updated on accepted verdict; reconcile from PostgreSQL on Redis restart; Redis is live view, PostgreSQL is source of truth | Multi-region: eventual consistency for live view is acceptable during contest; strong consistency for final standings via distributed transaction or canonical read from single region
"A problem setter uploads a test case with a wrong expected output. How do you detect it?" | Someone reports it | Require a passing reference solution before publishing; SHA-256 integrity check on test case files by workers | Automated regression: any accepted solution for a problem can be added to a reference set; a test case change triggers re-judging the reference set before it goes live

How the pieces connect

1. Isolation NFR (§2) → Firecracker microVMs (§5) → one-submission-per-VM model (§4). The isolation boundary is hardware virtualisation, not Linux namespaces — that distinction is why gVisor (shared host kernel via Sentry) is the runner-up, not the choice.
2. Latency NFR ≤5 s (§2) → async job queue (§4) → client polls Redis for status (§8). The 5 s budget has ~500 ms of slack after execution; polling every 500 ms consumes that slack entirely, which is why WebSocket push is the L5+ answer when the interviewer probes result delivery.
3. Little's Law at peak contest load (§3) → 280 worker slots needed at 75% utilisation → worker autoscaling on queue depth is the primary scaling lever (§9). This is why §9 focuses on the worker fleet, not the API tier or database — those are not the bottleneck.
4. Dual timeout enforcement in §5 (CPU-time limit + wall-clock watchdog) → TLE verdict in §10. One limit alone is bypassable: CPU-time misses sleep()-based infinite loops; wall-clock alone misses tight spin loops. The failure mode is only fully closed when both fire independently.
5. Read:write ratio ~20:1 for problems (§3) → CDN edge cache + Redis (§8) handles problem reads without touching the database. The submission write path and the problem read path are completely decoupled — scaling one does not affect the other.
6. Fairness NFR (§2) → per-language cgroup limits (§5) + separate Kafka topics per language (§9). Fairness requires both identical resource enforcement inside the sandbox and independent queue depth per language — a Python surge must not starve the C++ queue.
