
Design an Online Judge — System Design Interview Guide

Three constraints pull against each other: keeping execution isolated, keeping latency under 5 seconds, and keeping resource limits identical for every user. Every architectural decision in this system flows from that tension.

~35 min read  ·  12 sections

L3/L4 — working system  ·  L5/L6 — tradeoffs & scale  ·  L7/L8 — architecture & strategy
01

What the interviewer is testing

An online judge is a deceptively tricky system design question. On the surface it sounds like a CRUD app: accept code, run it, return a result. The hard part is that running arbitrary, untrusted code is one of the most security-sensitive operations in software. Every architectural decision — from the execution environment to the queue topology to the storage layout — flows from that single constraint.

The question also spans a wide range of interesting sub-problems: secure sandboxing, resource fairness, queue-based decoupling, test case management, and leaderboard design. A strong answer demonstrates that you understand which of these is the primary constraint (isolation) and which follow from it (everything else).

Level | What good looks like | The trap to avoid
L3/L4 | Working flow: accept code, run it, return pass/fail. Recognise that execution must be isolated. | Treating this as a simple REST API. Missing sandboxing entirely, or vaguely saying "run in Docker" without addressing resource limits.
L5 | Queue-based decoupling, async result delivery, explicit CPU/memory/time limits, test case storage in object storage. | Conflating the API tier with the execution tier. Not addressing what happens when a worker crashes mid-execution.
L6 | Full sandbox technology comparison (seccomp vs gVisor vs Firecracker), worker autoscaling via Little's Law, contest mode pre-warming, leaderboard design. | Designing for average load, not contest spike load. Ignoring noisy-neighbour effects in shared execution environments.
L7/L8 | Multi-region execution, build caching across submissions, plagiarism detection pipeline, cost model (Firecracker vs dedicated VMs), SLA reasoning. | Over-engineering the API layer when the execution layer is the actual bottleneck. Ignoring operational cost of deep isolation.

Common opening probe: "What makes this different from a regular web service?" — The right answer centres on trust: every other web service you design assumes the request is data. An online judge's request is executable code. That changes the security model entirely.

02

Requirements clarification

Before designing anything, nail down scope. An online judge can mean a competitive programming platform (Codeforces), a hiring assessment tool (HackerRank), or a practice platform (LeetCode). The functional requirements are similar; the non-functional requirements diverge sharply on contest concurrency and plagiarism detection.

Functional requirements

Requirement | In scope | Notes
Problem browsing | Yes | List, filter by tag/difficulty, view problem statement
Code submission | Yes | Multiple languages; submission stored permanently
Execution & judging | Yes | Run against hidden test cases; return verdict within 5 s
Submission history | Yes | Per-user history with verdicts and runtime stats
Leaderboard / rankings | Yes | Global and per-contest; near-real-time during contests
Contest mode | Yes | Time-bounded, concurrent surge of submissions
Plagiarism detection | No (L7+) | Out of scope for core design; flag as extension
Custom test cases (user-defined) | No | Simplification; production systems support this

Non-functional requirements

NFR | Target | Rationale
Isolation | Sandbox escape must not occur structurally | Untrusted code can attempt kernel exploits, filesystem reads, network calls
Execution latency | < 5 s end-to-end (p99) | Competitive programming is interactive; users wait at the keyboard
Verdict accuracy | Deterministic for identical inputs | Wrong verdicts destroy platform credibility
Fairness | Identical resource limits per language across all submissions | A user on a busy node must not receive a TLE that a user on an idle node avoids
Availability | 99.9% (≈8.7 h downtime/year) | Not life-critical; brief outages acceptable outside contest windows
Throughput | 10× normal load during contest windows | Contest starts trigger synchronised submission bursts
Data durability | Submissions never lost | Users reference submission history months later

NFR reasoning

Isolation: why sandbox escape must not occur structurally

Submitted code runs with the platform's execution credentials. Without isolation, a submission could read other users' source code, exfiltrate test cases (which have commercial value), establish outbound network connections, or exploit kernel vulnerabilities to escalate to the host. Any of these constitutes a complete platform compromise.

"Must not occur structurally" is precise language: we cannot claim it never happens — zero-day hypervisor escapes exist. We claim the architecture provides no structural path: no shared filesystem, no shared kernel (with microVMs), no shared network namespace, and no capability to install kernel modules.

What this drives: Firecracker microVMs or gVisor as the execution layer (§5), not plain Docker. Each submission gets a fresh, throwaway VM. The isolation boundary is hardware virtualisation, not just Linux namespaces.
Execution latency: why 5 s, not 1 s or 30 s

1 second is too tight: VM boot overhead (Firecracker cold start is ~125 ms), compilation time for Java/Go/Rust, and network round-trips consume a non-trivial fraction. 30 seconds is too loose: users disengage. 5 seconds is the practical ceiling — fast enough to feel interactive, slow enough to accommodate compilation and microVM boot.

The 5 s budget breaks down roughly as: 100–200 ms API + queue enqueue, 100–200 ms worker dequeue + VM boot (warm pool, see §9), 0–3 s compilation + execution, 100 ms result write + notification.

What this drives: Async execution via a job queue (§4), pre-warmed VM pool (§9), and WebSocket or long-poll for result delivery (§6) — not synchronous HTTP execution.
Fairness: why identical limits matter for competitive integrity

If two users submit identical correct solutions but one hits a TLE because their worker was under CPU contention, the result is wrong. This is not just a UX issue — in contest mode, incorrect verdicts affect rankings and prize eligibility. Fairness requires that resource limits are enforced at the cgroup level, not just hoped-for by scheduling, and that no two submissions share CPU time within the same VM.

What this drives: One submission per microVM (never multi-tenant execution), per-language CPU/memory cgroup limits defined in a configuration table, and worker nodes reserved exclusively for execution (never co-located with API servers).
03

Capacity estimation

An online judge has two very different traffic profiles that must be estimated separately. Submission traffic is write-heavy, bursty, and computationally expensive — each submission triggers code execution that takes 1–5 seconds. Problem-browsing traffic is read-heavy, cheap, and highly cacheable — a user reads a problem statement many times before they submit once.

The most important output of this estimation isn't storage or QPS — it's worker count. Workers are the scarce resource, sized by Little's Law: at steady state, the number of items in the system equals the arrival rate multiplied by the service time (L = λW). This gives the minimum workers at 100% utilisation — but a system running at 100% utilisation has an unboundedly growing queue at any variance spike. Practical provisioning targets 70–80% utilisation, meaning you need the Little's Law result divided by 0.75. Getting this wrong means either under-provisioning (queue builds up, latency blows past 5 s) or over-provisioning (idle workers, wasted cost).
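As a sanity check, the sizing arithmetic fits in a few lines. A minimal sketch in Python; the input values are the same assumptions used in the worked example below, not measurements:

# Worker fleet sizing (Little's Law)
import math

submissions_per_day = 2_000_000   # assumption
peak_factor = 3                   # contest burst multiplier (assumption)
avg_execution_s = 3.0             # mean compile + run time (assumption)
target_utilisation = 0.75

avg_qps = submissions_per_day / 86_400        # ~23 submissions/s
peak_qps = avg_qps * peak_factor              # ~70 submissions/s

# Little's Law: L = lambda * W -> concurrent executions at 100% utilisation
min_slots = peak_qps * avg_execution_s        # ~210 slots
provisioned = math.ceil(min_slots / target_utilisation)   # ~280 slots

print(f"avg={avg_qps:.0f}/s  peak={peak_qps:.0f}/s  "
      f"min_slots={min_slots:.0f}  provisioned={provisioned}")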

Capacity estimate (worked example)

Assumptions: 2 M submissions/day · 20 : 1 read:write ratio · 3.0 s average execution time · 3-year retention.

Metric | Value | Derivation
Write QPS (avg) | 23 submissions/sec | 2 M ÷ 86,400
Read QPS (avg) | 462 requests/sec | 23 × 20
Peak write QPS | 70 submissions/sec | avg × 3× peak factor
Workers needed (peak, 75% util) | 280 concurrent sandbox slots | ceil(70 × 3.0 s ÷ 0.75)
Submission storage | ~22 TB over retention | 2 M/day × 10 KB × 3 yr
Total submissions | 2.19 B rows in DB | 2 M/day × 365 × 3

Key architectural implication: At 2 M submissions/day with a 3× contest spike, you need ~280 concurrent execution slots at a 75% utilisation target (see the worker row above — 210 is the theoretical minimum at 100% utilisation, which no healthy system should run at). These are not threads or processes sharing a machine — each slot is an isolated microVM. This cost model, not application logic, drives the decision to use a pre-warmed VM pool with autoscaling rather than spinning fresh VMs per submission.

Fixed assumptions: Average submission size 10 KB (code + metadata + result). Peak factor 3× average (contest burst). Test case storage is estimated separately — a typical problem has 50–100 test cases at 10–100 KB each, totalling ~5–10 MB per problem. With 3,000 problems, test case storage is ~15–30 GB — small relative to submissions but accessed at high frequency during execution.

04

High-level architecture

[Figure 1 diagram: Client (browser / app) → CDN (static + problems) → Load balancer (L7 / HTTPS) → API servers (stateless, N nodes) → Job queue (Kafka / SQS) → Exec workers (sandbox pool). Supporting stores: Redis (status, leaderboard), PostgreSQL (problems, users), Object storage (S3, test cases). Legend distinguishes synchronous from asynchronous paths.]
Figure 1 — High-level architecture. Submission ingestion (API) is decoupled from execution (workers) via a durable job queue.

Component breakdown

CDN (content delivery network) serves static assets and — critically — problem statement HTML/JSON. Problem content is read-only after creation, so it can be cached at edge nodes indefinitely with a long TTL. This alone eliminates the majority of API server requests.

Load balancer distributes HTTPS traffic across API server instances. It terminates TLS, applies rate limiting per IP and per API key, and routes health checks. It does not route execution traffic — that goes through the queue.

API servers are stateless application nodes. They validate submissions (syntax check, size limit, language support), write submission records to the database, enqueue execution jobs, and serve read requests from cache. Being stateless means they scale horizontally without coordination.

Job queue (Kafka or SQS) is the critical decoupling point. The API enqueues a submission job and immediately returns an accepted response with a submission ID. The queue provides durability — if all workers crash, jobs are not lost. It also provides natural backpressure during contest spikes.

Execution workers are the production bottleneck. Each worker dequeues one job, boots a fresh sandbox environment, compiles and runs the submission against every test case, and writes the verdict. Workers are stateless; the number of active workers is the primary scaling lever for throughput.

PostgreSQL is the system of record for problems, submissions, users, and test case metadata. Relational schema is appropriate because the entities have well-defined relationships and ACID guarantees simplify submission status consistency.

Redis serves two distinct purposes: submission status polling (short TTL keys, updated by workers as test cases run) and leaderboard rankings (sorted sets, updated atomically per accepted submission).

Object storage (S3) holds test case input/output files. Test case files are immutable after creation, referenced by a deterministic key, and accessed at high frequency by workers. Object storage is cheaper and more scalable than PostgreSQL BLOB columns for this workload.

Architectural rationale

Why a job queue instead of synchronous execution

If the API server executed code synchronously, a contest spike of 300 simultaneous submissions would require 300 simultaneous HTTP connections all held open for 3–5 seconds. This exhausts connection pools and makes the API unresponsive to other requests. With a queue, the API returns immediately (HTTP 202 Accepted) and workers drain the queue at their natural rate.

Tradeoff: The user must poll or subscribe for results instead of getting a synchronous response. This is an acceptable UX tradeoff — competitive programmers expect a few seconds of wait time.
Alternatives: gRPC streaming · WebSocket from API to worker · synchronous execution for "run" (not submit)
Why object storage for test cases, not PostgreSQL BLOBs

Test case files are read-only after creation, accessed by workers only (not the API), and range from 1 KB to several MB. Storing them in PostgreSQL as BYTEA columns bloats row sizes, increases checkpoint pressure, and couples test case reads to the same database handling user authentication and submission metadata. Object storage separates concerns, is an order of magnitude cheaper per GB, and supports parallel reads natively. AWS S3 provides strong read-after-write consistency (since December 2020), so workers always see the latest version of a test case object immediately after upload — no eventual consistency window to worry about. S3 versioning is still needed for deliberate rollbacks, not for consistency.

Tradeoff: An extra network hop for workers fetching test cases. Mitigated by caching test case files locally on the worker node across submissions for the same problem (workers process many submissions for popular problems).
Why Redis for leaderboards, not PostgreSQL ORDER BY

During a contest, rank queries happen every few seconds per active user — thousands of simultaneous "what's my rank?" requests. A SELECT rank() OVER (...) window query on PostgreSQL would work at small scale, but becomes expensive when the submissions table has millions of rows and leaderboard queries are concurrent. Redis sorted sets (ZADD + ZREVRANK) give O(log N) rank operations at any scale. Use ZREVRANK (not ZRANK) when higher score = better rank — ZRANK returns ascending order (lowest score = rank 0), which is correct for penalty-time rankings but wrong for point-based leaderboards. The individual ZADD operation is atomic, but the full verdict pipeline (PostgreSQL write → Redis ZADD → user stats update) is not: the §10 reconciliation job handles divergence when any step fails.

Tradeoff: Eventual consistency — a leaderboard read immediately after an accepted verdict may still show the old rank for a brief window while the Redis write propagates. During a contest this is acceptable; for final standings, a consistent SQL query is used.
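A minimal sketch of the sorted-set operations described above, using redis-py; the key layout and point-based scoring scheme (higher score = better) are illustrative assumptions:

# Leaderboard operations on a Redis sorted set
import redis

r = redis.Redis()

def record_accepted(contest_id: str, user_id: str, new_total_points: float) -> None:
    # ZADD sets the member's score; the caller supplies the user's recomputed total.
    r.zadd(f"leaderboard:{contest_id}", {user_id: new_total_points})

def rank_of(contest_id: str, user_id: str) -> int | None:
    # ZREVRANK because higher score = better; ZRANK would invert the ordering.
    rank = r.zrevrank(f"leaderboard:{contest_id}", user_id)
    return None if rank is None else rank + 1   # 1-based rank for display

def top(contest_id: str, n: int = 10):
    # Highest scores first, scores included; the live leaderboard page query.
    return r.zrevrange(f"leaderboard:{contest_id}", 0, n - 1, withscores=True)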

Real-world comparison

Decision | This design | LeetCode (inferred) | Codeforces
Sandbox technology | Firecracker microVMs | Custom Docker + seccomp | ptrace-based isolation
Execution queue | Kafka (durable, replayable) | SQS / internal queue | Custom in-memory queue
Result delivery | Redis polling + WebSocket | Long-poll / WebSocket | Page refresh / SSE
Test case storage | S3 (object storage) | S3 / GCS | Local filesystem on judge
Leaderboard | Redis sorted sets | Redis + PostgreSQL | PostgreSQL (small scale)

Codeforces' filesystem-local test case approach works because their judge fleet is small and contests are geographically concentrated. LeetCode's global user base and always-on practice mode require distributed object storage. Neither is wrong: the right choice follows from your NFRs and scale. Note: real-world values in the table above are inferred from public engineering posts, not confirmed by the platforms.

05

Sandboxing strategy

Four canonical approaches exist for executing untrusted code, each with a different isolation primitive and a different overhead cost. The choice between them is the most consequential decision in the execution layer:

[Figure 2 diagram — four approaches, ordered from low overhead to high isolation:]
① seccomp + nsjail — isolation primitive: Linux syscall filter; overhead ~5 ms startup; shared kernel with host; complex allowlist config per language; kernel exploit → escape possible; used by older judges.
② Docker + cgroups — isolation primitive: Linux namespaces + cgroups; overhead ~50–200 ms (pre-pulled image); shared kernel with host; familiar tooling, widely supported; container breakout CVEs exist; used by most platforms.
③ gVisor (runsc) — isolation primitive: user-space kernel; overhead ~100 ms; the Sentry intercepts and mediates all syscalls, giving a smaller attack surface; some syscalls unimplemented → compatibility issues; used by Google Cloud Run.
④ Firecracker microVM (our choice) — isolation primitive: hardware virtualisation (KVM-based); overhead ~125 ms kernel boot + runtime init; separate kernel per VM, no shared kernel surface; strongest isolation; requires KVM-capable host and more complex operations; used by AWS Lambda, Fly.io.
Figure 2 — Four sandboxing approaches. ① and ② share the host kernel; ③ and ④ do not. Our choice is ④ Firecracker for production, ② Docker + seccomp for development.

Our choice for this system: Firecracker microVMs in production, Docker + seccomp in development and CI. Firecracker boots a dedicated Linux kernel per submission — there is no shared kernel surface between submissions or between submissions and the host. AWS Lambda, Fly.io, and Render all use this model for the same reason. gVisor reduces the host kernel attack surface but does not eliminate it: its Sentry process still calls a subset (~50) of real host kernel syscalls to handle I/O and memory, so a host kernel vulnerability in that call set is still reachable.

The performance cost of Firecracker is real but manageable. Pre-warming solves the cold start problem: the worker pool maintains paused Firecracker snapshots per language, each with the language runtime already initialised inside the guest. The 125 ms figure cited in Firecracker's benchmarks is kernel boot only — total cold start including Python or JVM runtime initialisation is 400–700 ms. With a warm snapshot that already has the runtime loaded, restore takes ~10–50 ms (depending on guest memory size), well within the 5 s budget. The warm pool must be maintained per language runtime — a Python snapshot cannot serve a C++ submission. At ~280 provisioned worker slots across 4 languages (§3), a conservative warm pool holds 50–60 slots per popular language, with the remainder allocated on contest demand.

Defence-in-depth: Even with Firecracker, the system applies additional layers: no outbound network from the VM, no access to instance metadata endpoints, cgroup CPU/memory/process limits inside the guest, and a watchdog process that kills the VM if wall-clock time exceeds the limit. Isolation should never rely on a single mechanism.

Per-language resource limits: how fairness is enforced

A common misconception: cgroups do not kill a process on CPU time overrun — they throttle it. To enforce a hard kill, use setrlimit(RLIMIT_CPU, soft, hard) inside the sandbox: the OS sends SIGXCPU at the soft limit and SIGKILL at the hard limit. Alternatively, a watchdog polls cpuacct.usage and issues SIGKILL when accumulated CPU time exceeds the budget. The watchdog also enforces wall_ms — catching infinite loops that sleep on I/O, which CPU time accounting misses entirely.
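A minimal sketch of the dual enforcement, assuming a supervisor process that launches the submission as a child; in production the limits come from the per-language cgroup config below and run inside the guest, so the helper name and values here are illustrative:

# Dual timeout: hard CPU-time cap via rlimit, wall-clock cap via a watchdog
import resource
import subprocess

def run_limited(cmd: list[str], cpu_s: int, wall_s: float, mem_mb: int) -> int:
    def apply_limits():
        # SIGXCPU at the soft CPU limit, SIGKILL at the hard limit (kills spin loops).
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_s, cpu_s + 1))
        # Address-space cap: allocations beyond the limit fail (MLE).
        resource.setrlimit(resource.RLIMIT_AS,
                           (mem_mb * 1024 * 1024, mem_mb * 1024 * 1024))

    proc = subprocess.Popen(cmd, preexec_fn=apply_limits)
    try:
        # Wall-clock watchdog: catches loops that block on I/O or sleep(),
        # which CPU-time accounting never sees.
        return proc.wait(timeout=wall_s)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()
        return -9   # report as TLE (killed)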

Different languages have very different runtime characteristics. Java incurs JVM startup overhead; Python is slower than C++ by a factor of 10–50 for CPU-bound code. Judges address this with language-specific multipliers defined in a configuration table:

// Resource limits config (per language)
{
  "python3":  { "cpu_ms": 5000, "wall_ms": 8000, "mem_mb": 256 },
  "java":     { "cpu_ms": 3000, "wall_ms": 6000, "mem_mb": 512 },
  "cpp":      { "cpu_ms": 1000, "wall_ms": 3000, "mem_mb": 256 },
  "javascript": { "cpu_ms": 4000, "wall_ms": 7000, "mem_mb": 256 }
}

The mem_mb limit uses memory.limit_in_bytes (cgroups v1) or memory.max (v2), which OOM-kills the process when exceeded — unlike CPU, memory enforcement does kill.

One open question: Should Python get 5× the time limit of C++, or should the problem setter define limits per language? Most platforms use multipliers; competitive programmers argue this disadvantages algorithm quality vs language speed. There is no consensus answer: the key is to acknowledge the tradeoff.
05b

API design

The API surface for an online judge is small. Three endpoints carry most of the traffic: problem fetch, submission create, and submission status. The interesting design decisions are in validation, the response contract for async execution, and status polling.

POST /api/submissions — create a submission

// Request
POST /api/submissions
Authorization: Bearer <clerk_jwt>
Content-Type: application/json

{
  "problem_id": "two-sum",
  "language":   "python3",          // enum: cpp | java | python3 | javascript
  "source_code": "class Solution:\n    def twoSum(...",
  "contest_id":  "spring-2026"         // optional; null for practice
}

// Response — 202 Accepted (async; execution has not yet started)
{
  "submission_id": "sub_01HX8K...",
  "status":        "QUEUED",
  "poll_url":      "/api/submissions/sub_01HX8K...",
  "queued_at":     "2026-04-30T12:34:56Z"
}

Why 202 Accepted, not 201 Created? The submission row is created (so 201 would be defensible), but the primary resource — the execution verdict — does not yet exist. HTTP 202 explicitly signals "the request has been accepted for processing, but processing has not been completed," setting the correct expectation that the client must poll for the result.

Validation on submission create: source code size limit (64 KB), language must be in the supported set for the problem, rate limit enforced (5 submissions/minute per user per problem), contest window check if contest_id is set. Source code is stored verbatim with no sanitisation; the sandbox handles isolation. An Idempotency-Key request header (client-generated UUID) allows the API to return the existing submission record if the same request is retried within a 10-minute window, preventing duplicate executions from network retries or double-clicks.
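A minimal sketch of that idempotency check, assuming Redis holds the key-to-submission mapping; the key layout and function boundaries are illustrative, and the TTL mirrors the 10-minute window above:

# Idempotency-Key handling on submission create
import redis

r = redis.Redis()
IDEMPOTENCY_TTL_S = 600   # 10-minute replay window

def resolve_idempotent(user_id: str, idempotency_key: str, new_submission_id: str) -> str:
    """Return the submission ID to respond with: the stored one on replay, else the new one."""
    redis_key = f"idem:{user_id}:{idempotency_key}"
    # SET NX stores the mapping only if the key is absent; True means first write.
    if r.set(redis_key, new_submission_id, nx=True, ex=IDEMPOTENCY_TTL_S):
        return new_submission_id            # first request: insert row + enqueue job
    return r.get(redis_key).decode()        # replay: return the original submission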

GET /api/submissions/:id — poll for result

// Response while executing
{
  "submission_id": "sub_01HX8K...",
  "status":        "RUNNING",        // QUEUED | RUNNING | ACCEPTED | WRONG_ANSWER | TLE | MLE | RE | CE
  "tests_passed":  7,
  "tests_total":   20,
  "runtime_ms":    null,             // populated on completion
  "memory_kb":     null
}

// Response on completion (ACCEPTED)
{
  "submission_id": "sub_01HX8K...",
  "status":        "ACCEPTED",
  "tests_passed":  20,
  "tests_total":   20,
  "runtime_ms":    84,
  "memory_kb":     18432,
  "percentile_runtime": 92.4      // faster than 92.4% of accepted solutions
}
Optional endpoint | Method | Level | Notes
/api/problems/:id | GET | L3/L4 | Returns problem statement, constraints, examples. Served from CDN cache.
/api/submissions/:id/detail | GET | L5 | Returns per-test-case results. Actual vs expected diffs are only returned for sample cases (is_sample=true) during an active contest; hidden case diffs are withheld until contest end — preventing test case extraction via strategic wrong-answer submissions.
/api/contests/:id/leaderboard | GET | L5 | Paginated leaderboard from Redis sorted set. Cursor-based pagination.
/api/submissions (run mode) | POST | L6 | With "mode":"run" flag: runs against user-provided input only, not the test suite. Lower-priority queue.
/api/submissions/bulk | POST | L7/L8 | Batch submission for hiring assessment platforms. Async with webhook callback.

Run mode vs submit mode — separate Kafka topics, not a priority field. Kafka does not support per-message priority within a partition — a "low priority" flag in a shared topic would be ignored. Use two separate topics: submissions.submit (full test suite, contest-critical) and submissions.run (user-provided input only, best-effort). During a contest, worker consumer group assignments bias toward submissions.submit by increasing the number of workers subscribed to it. Workers subscribed only to submissions.run automatically shed load when not needed without any priority scheduling logic.

WebSocket vs polling: The 5 s end-to-end budget (§2) has ~500 ms of slack after execution; polling every 500 ms consumes that slack entirely. WebSocket push eliminates the polling lag at the cost of routing complexity — both are viable, pick based on your tolerance for stateful connections. Polling is simpler and stateless: the client calls GET /submissions/:id every 500 ms until status is terminal. WebSocket reduces latency by ~500 ms on average and eliminates wasted requests, but requires sticky sessions or a pub/sub fan-out (Redis Pub/Sub) to route worker completion events to the right connection. L5+ should name the tradeoff; L7+ should describe the Redis Pub/Sub design.

06

Core submission flow

The dominant write path — user submits code, gets a verdict — is where the latency NFR (§2) and the isolation NFR collide. A synchronous execution model satisfies neither: it blocks connections during the 1–5 s execution window and makes isolation harder by coupling the API process to the sandbox. The queue-based design satisfies both at the cost of one tradeoff: the client must poll or subscribe for results rather than waiting on a synchronous response.

[Figure 3 diagram — submission flow: ① user submits code (POST /api/submissions) → ② API validates and stores the submission row (QUEUED) → ③ enqueue job on Kafka topic submissions, return 202 Accepted with sub_id → ④ worker dequeues and resumes a pre-warmed microVM → ⑤ execute against test cases, comparing stdout (on Wrong / TLE / RE / MLE, record the failing test index) → ⑥ write verdict + metrics to PostgreSQL and the Redis status key; if all tests pass, update the leaderboard → ⑦ client receives the result by polling /submissions/:id or via WebSocket push. Steps ①–③ synchronous, ④–⑦ asynchronous.]
Figure 3 — Core submission flow. Steps ①–③ are synchronous (milliseconds). Steps ④–⑦ are async (seconds). The 202 response decouples ingestion from execution.

Latency budget recap (5 s end-to-end): Steps ①–③ take ~100–200 ms (API validation, DB write, queue enqueue). Step ④ takes ~10–50 ms with a pre-warmed Firecracker snapshot (§5). Step ⑤ takes 0–3 s (compilation + execution). Step ⑥ takes ~50 ms (DB write + Redis update). Step ⑦ adds up to ~500 ms if polling every 500 ms, or ~10 ms if using WebSocket push. Summing the worst case of each step gives roughly 3.8 s with polling (about 3.3 s with WebSocket push), within the 5 s NFR when the warm pool is healthy (§9); the remaining headroom is quickly absorbed by a cold VM boot (400–700 ms, §5) or a slow compile.

The critical tradeoff in this flow is between polling and WebSocket push for step ⑦. Polling is simpler: the client calls GET /submissions/:id every 500 ms; the API reads from Redis; no server-side connection state. WebSocket push is lower latency: the worker publishes a completion event to a Redis Pub/Sub channel, and the API server subscribed on behalf of the client's WebSocket connection forwards it immediately.

The polling design is strongly preferred at L3–L5. At L5–L6, the interviewer may probe whether you know that WebSocket connections require sticky routing (or Redis Pub/Sub fan-out) to work correctly behind a load balancer. Name that constraint and it becomes a strength, not a weakness.
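A minimal sketch of the push path, assuming Redis Pub/Sub for fan-out and an in-process registry of open WebSocket connections on each API node; the channel naming and the WebSocket send method are framework-specific assumptions:

# Worker publishes verdicts; API nodes forward them to connected WebSockets
import json
import redis.asyncio as aioredis

r = aioredis.Redis()
open_sockets = {}   # submission_id -> WebSocket connection held by this API node

# Worker side: publish after the verdict is written to PostgreSQL + Redis.
async def publish_verdict(submission_id: str, verdict: dict) -> None:
    await r.publish(f"verdict:{submission_id}", json.dumps(verdict))

# API side: one subscriber task per node forwards events to the right socket.
async def forward_verdicts() -> None:
    pubsub = r.pubsub()
    await pubsub.psubscribe("verdict:*")
    async for msg in pubsub.listen():
        if msg["type"] != "pmessage":
            continue                      # skip subscribe confirmations
        submission_id = msg["channel"].decode().split(":", 1)[1]
        ws = open_sockets.get(submission_id)
        if ws is not None:
            await ws.send_text(msg["data"].decode())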

Output comparison: why "compare stdout" is incomplete

Most problems use exact string comparison, but three categories require different comparators: floating-point problems (accept output within a relative or absolute epsilon), problems with multiple valid outputs (e.g., "output any valid spanning tree"), and interactive problems (a two-process protocol where the judge sends input based on the program's responses). Hardcoding string equality breaks all three.

The judge config should include a comparator_type field in the problem schema, with three values:

// problem config (stored in PostgreSQL problems table)
{
  "comparator_type": "exact",        // exact | float_eps | special_judge
  "float_epsilon":   1e-6,            // used when comparator_type = float_eps
  "special_judge_s3_key": null        // path to comparator binary in S3
}

Workers load the comparator alongside test case files. For special_judge, the worker downloads the comparator binary from S3, runs it as a subprocess with (input, expected_output, actual_output) as arguments, and interprets its exit code as pass/fail. The comparator binary itself is sandboxed — it runs in a separate (lighter) container, not in the submission microVM.
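A minimal sketch of the comparator dispatch inside the worker, assuming the special-judge binary was already downloaded to a local path; the file layout and function name are illustrative:

# Comparator dispatch: exact | float_eps | special_judge
from pathlib import Path
import subprocess

def outputs_match(cfg: dict, input_path: str, expected_path: str, actual_path: str) -> bool:
    kind = cfg["comparator_type"]
    if kind == "exact":
        # Trailing whitespace / final newline differences are conventionally ignored.
        return Path(actual_path).read_text().rstrip() == Path(expected_path).read_text().rstrip()
    if kind == "float_eps":
        eps = cfg["float_epsilon"]
        exp = Path(expected_path).read_text().split()
        act = Path(actual_path).read_text().split()
        return len(exp) == len(act) and all(
            abs(float(a) - float(e)) <= eps * max(1.0, abs(float(e)))
            for e, a in zip(exp, act)
        )
    if kind == "special_judge":
        # Comparator binary fetched from S3 earlier; exit code 0 means accepted.
        # It runs in its own (lighter) sandbox, not in the submission microVM.
        result = subprocess.run(
            ["./comparator", input_path, expected_path, actual_path], timeout=10
        )
        return result.returncode == 0
    raise ValueError(f"unknown comparator_type {kind!r}")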

What to say in the interview: "Compare stdout" is the naive approach and correct for most problems. Flag that real judges need at least three comparator modes, that the comparator type is a property of the problem (not the submission), and that special judge binaries introduce a second trusted code execution path that needs its own — though less strict — sandboxing.
07

Data model

Before defining any columns, identify the entities and how they get accessed. The schema follows the access patterns, not the other way around.

Entities and access patterns

Entity | Primary operations | Frequency | Query shape
Problem | Read by ID, list by tag/difficulty | Very high (cached) | Point lookup + tag filter
TestCase | Read all cases for a problem (by worker) | High per execution | Bulk fetch by problem_id
Submission | Insert on submit; update on verdict; read by user; read by contest | High writes, moderate reads | Point lookup by ID; scan by user_id; scan by contest_id + problem_id
User | Read on auth; update stats on accepted verdict | Low writes, moderate reads | Point lookup by ID or clerk_id
Contest | Read during contest window; leaderboard during + after | Low; bursty during contest | Point lookup; join to submissions

Two things stand out from this table. First, test cases are accessed in bulk by problem — all cases for a problem are fetched together by the worker, never individually. This means the test case files belong in object storage (S3) with a key pattern of problems/{problem_id}/cases/{case_id}.{in|out}, while metadata (case count, time limit per case) stays in PostgreSQL. Second, submissions are written at high volume but queried in two very different ways: point-lookup by ID (status polling) and range scan by user (submission history). Both access patterns need covering indexes.

[Figure 4 diagram — tables and key columns:]
problems: id ULID PK · slug TEXT UNIQUE · title TEXT · difficulty ENUM · tags TEXT[] · time_limit_ms INT · test_count INT
test_cases (metadata): id ULID PK · problem_id FK → problems · is_sample BOOL · s3_input_key TEXT · s3_output_key TEXT
submissions: id ULID PK · user_id FK → users · problem_id FK → problems · contest_id FK nullable · language ENUM · status ENUM · source_code TEXT · runtime_ms INT · memory_kb INT · submitted_at TIMESTAMPTZ
users: id ULID PK · clerk_id TEXT UNIQUE · username TEXT UNIQUE · solved_count INT · rating INT
contests: id ULID PK · slug TEXT UNIQUE · start_at TIMESTAMPTZ · end_at TIMESTAMPTZ · problem_ids ULID[]
Key indexes: submissions(id) for status polls · submissions(user_id, submitted_at) · submissions(contest_id, problem_id)
Figure 4 — Data model. Test case files live in S3; only metadata (S3 key paths) are stored in PostgreSQL. The submissions index callout shows the three covering indexes that drive the most frequent query patterns.
Why store source_code in PostgreSQL, not S3

Source code is small (median ~2 KB, max 64 KB) and accessed frequently — every submission history page load reads it. Storing it as a PostgreSQL TEXT column keeps reads fast and transactional (the source code and status are always consistent). The 64 KB limit must be enforced at the application layer (CHECK constraint or API validation) — PostgreSQL's TEXT type has no inherent length cap. PostgreSQL TOASTs column values over ~2 KB out-of-line automatically, so a 10 KB source code value incurs a second page read on access; this is acceptable for this workload but worth knowing if query latency becomes a concern.

When this breaks down: If the platform adds support for multi-file projects (e.g., import from other files), source code grows beyond a single column and object storage becomes appropriate. At that point the schema adds a source_s3_key column and keeps code out of PostgreSQL.
Why ULID for primary keys instead of auto-increment INT

ULIDs (Universally Unique Lexicographically Sortable Identifiers) are time-ordered, globally unique, and safe to expose in URLs. Sequential INT IDs expose submission volume (a competitor can enumerate your submission counts from public IDs), and UUID v4s are random (poor B-tree index locality). ULID gives you all three properties: uniqueness, time-ordering for index locality, and opacity.

Alternatives: UUID v7 (RFC 9562; time-ordered; PostgreSQL needs the pg_uuidv7 extension or app-level generation) · Snowflake ID (Twitter-style) · CUID2
08

Caching strategy

Caching in an online judge operates at three distinct layers, each serving a different read pattern. The key insight is that these layers have completely different invalidation requirements: problem content almost never changes (long TTL), submission status changes frequently during execution (short TTL), and leaderboards change on every accepted verdict (near-real-time).

[Figure 5 diagram — cache hierarchy: Client browser → Layer 1: CDN (CloudFront / Fastly) — problem HTML/JSON and static assets, 24 h TTL for problems, invalidated on edit, ~80% of reads served → on miss, Layer 2: Redis — submission status keys (30 min TTL), leaderboard sorted sets, problem metadata (5 min TTL), ~15% of reads served → on miss, Layer 3: PostgreSQL — all writes land here, submission history, authoritative source, ~5% of reads. Workers additionally keep a local cache of test case files fetched from S3.]
Figure 5 — Three-layer cache hierarchy. The CDN serves the vast majority of read traffic. Redis handles dynamic data (status, leaderboard). PostgreSQL is the fallback for cache misses only.
Layer | What it caches | TTL / invalidation | Why it exists
CDN edge | Problem statements (HTML/JSON), static assets | 24 h TTL; explicit invalidation on problem edit | Problem content is the highest-volume read and never changes once published. CDN eliminates origin hits entirely for the dominant traffic pattern.
Redis — status keys | sub:{id}:status → current verdict + partial results | 30 min TTL; set by worker on each test case completion | Status polling happens every 500 ms per active submission. Reading from PostgreSQL on every poll would be wasteful; Redis gives microsecond reads for this hot key.
Redis — leaderboard | Sorted set per contest: leaderboard:{contest_id} | No TTL; ZADD on every accepted verdict; archived after contest | O(log N) rank operations at any scale. Concurrent rank queries during a contest would create hot-spot reads on PostgreSQL without this layer.
Worker-local disk | Test case files (S3 objects) for recently-seen problems | LRU eviction; invalidated if test cases change (rare) | Workers process many submissions for the same popular problem. Fetching test cases from S3 on each submission adds ~50–100 ms of latency. Local caching reduces this to a disk read.

Leaderboard consistency: The Redis leaderboard may lag by the duration of a single ZADD write after an accepted verdict. In practice this window is under 10 ms in normal operation, but during a network partition it can grow. For final contest standings, the system always regenerates rankings from PostgreSQL — Redis is the live view, not the authoritative record.

09

Worker scaling and contest load

Scaling an online judge beyond a few thousand daily active users requires addressing three distinct bottlenecks: the execution worker pool (the primary constraint), the submission queue under contest burst load, and the database under high-write submission traffic. Each requires a different approach.

[Figure 6 diagram — production architecture: CDN (edge cached) → Load balancer (rate limit, TLS) → API cluster (api-1 … api-N, stateless, autoscale on CPU) → Kafka cluster (replication=3, min-ISR=2, acks=all, partitioned by language) → worker pool (Python, Java, C++, JS; autoscale on queue depth). PostgreSQL primary takes all writes behind a PgBouncer connection pool, with 2 read replicas for submission history and analytics reads. Redis Cluster serves status, leaderboard, and Pub/Sub for WS push. S3 holds test cases, with a worker-local LRU cache. An autoscaler watches queue depth and scales the worker fleet.]
Scaling levers (in priority order): 1. worker fleet — scale on the queue depth metric (primary bottleneck); 2. Kafka partitions — partition per language for parallel consumption; 3. PostgreSQL read replicas — route history queries off the primary; 4. Redis Cluster — shard leaderboard keys across slots; 5. API tier — autoscale on CPU (least constrained layer); 6. PgBouncer — connection pool in front of the PostgreSQL primary; 7. pre-warm the worker pool 15 min before contest start.
Figure 6 — Production architecture. The autoscaler monitors Kafka queue depth and provisions or terminates worker instances. Workers are the primary scaling constraint; the API tier is rarely the bottleneck.
Worker autoscaling: how to size and respond to contest load

The autoscaler watches Kafka consumer group lag for the submissions topics. When lag exceeds a threshold (e.g., 500 messages), it provisions additional worker instances. Lag alone is a blunt signal for slow-processing consumers, though: lag is calculated as latest_offset − committed_offset, and workers only commit offsets after writing the verdict 3–5 seconds later, so the metric lumps together jobs still waiting in the queue and jobs already executing, and it only moves in coarse steps. Supplement lag with a "submissions in RUNNING state" counter from Redis: the combined signal distinguishes queued from in-flight work. Tools like Burrow (LinkedIn's open-source Kafka consumer monitor) evaluate committed-offset lag over a sliding window rather than against a fixed threshold — useful precisely when consumers have long processing windows and instantaneous lag is noisy.
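A minimal sketch of the combined scaling signal, assuming lag is read from Kafka's admin API or a metrics exporter and the RUNNING count from a Redis counter that workers increment and decrement; the key name, drain target, and constants are illustrative:

# Autoscaler: size the fleet from queued + in-flight work, not lag alone
import math
import redis

r = redis.Redis()

AVG_EXECUTION_S = 3.0
TARGET_UTILISATION = 0.75
DRAIN_TARGET_S = 5.0   # drain the queued backlog within the latency budget

def running_count() -> int:
    # Workers INCR/DECR this key as executions start and finish (assumed convention).
    return int(r.get("submissions:running") or 0)

def desired_workers(consumer_lag: int, running: int) -> int:
    # Committed-offset lag counts queued + in-flight jobs; subtracting RUNNING
    # isolates the jobs still waiting in the queue.
    queued = max(0, consumer_lag - running)
    # Keep in-flight jobs on their current slots and add enough slots to drain
    # the queued backlog within the drain target at the utilisation target.
    extra = queued * AVG_EXECUTION_S / DRAIN_TARGET_S / TARGET_UTILISATION
    return max(1, running + math.ceil(extra))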

For contests, the autoscaler uses a scheduled pre-warm: 15 minutes before contest start, the pool scales to maximum capacity regardless of current queue depth. This eliminates cold-start delays during the critical first minutes when all participants submit simultaneously. Pre-warmed Firecracker snapshots are kept ready per language.

Tradeoff: Pre-warming wastes capacity if the contest is cancelled or undersubscribed. At ~$0.015/hr per instance, 300 pre-warmed slots for 15 minutes costs about $1.10 — an acceptable insurance cost.
Kafka partition strategy: why partition by language

Use separate topics per language (submissions.python3, submissions.cpp, etc.) rather than a single topic with a language key. This is important: with a single topic partitioned by language key, you get at most 4 active partitions for 4 languages, meaning at most 4 concurrent consumers, which cannot support a 280-worker pool. Separate topics each with 64–128 partitions allows the worker pool for each language to scale to the required depth independently.

Producer configuration is critical for durability: submissions must be produced with acks=all and enable.idempotence=true. Without acks=all, the min.insync.replicas=2 setting is completely ignored — a broker acknowledges the write before replication, and a leader failure before replication means the job is silently lost. Since a lost submission is unacceptable, acks=all is non-negotiable.
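A minimal sketch of those producer settings, using confluent-kafka (librdkafka property names); the broker addresses, topic naming, and flush-based error handling are illustrative assumptions:

# Durable submission enqueue: acks=all + idempotent producer
import json
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "kafka-1:9092,kafka-2:9092",
    "acks": "all",                  # wait for min.insync.replicas before acking
    "enable.idempotence": True,     # no duplicates on internal retries
})

def enqueue_submission(job: dict) -> None:
    producer.produce(
        topic=f"submissions.{job['language']}",
        key=job["submission_id"],
        value=json.dumps(job).encode(),
    )
    # Block until the broker acknowledges, so the API never returns 202 for a
    # job that was silently dropped.
    remaining = producer.flush(timeout=5)
    if remaining:
        raise RuntimeError("submission enqueue not acknowledged within 5 s")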

Alternative: A single topic with a shared worker pool handling all languages. Simpler operationally, but a language-specific contest surge can starve other language queues. Separate topics give independent scaling and fault isolation.
Database scaling: when read replicas and PgBouncer are needed

The primary bottleneck is the worker pool, not the database. PostgreSQL primary handles submission inserts (one write per submission, plus one update when the verdict is written). At 70 peak submissions/sec, that's ~140 writes/sec — well within a single Postgres instance's capacity. Read replicas become necessary when submission history queries, analytics dashboards, and problem listing queries compete with these writes for I/O on the primary.

PgBouncer in transaction mode is added in front of the primary when the worker fleet scales past ~100 nodes. Each worker opens a database connection to write verdicts; 200+ concurrent connections exhaust PostgreSQL's connection limit. PgBouncer multiplexes many client connections onto a small pool of real PostgreSQL connections.

Important: PgBouncer transaction mode is incompatible with server-side named prepared statements. Most ORMs (SQLAlchemy, ActiveRecord, Diesel) use server-side prepared statements by default — enabling transaction mode without disabling them causes prepared statement "..." does not exist errors when a connection is returned to the pool and a different client tries to use it. Configure your driver to use simple query protocol or client-side prepared statements before enabling transaction mode. PgBouncer session mode avoids this but provides weaker connection multiplexing.
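A minimal sketch of the driver-side change, assuming SQLAlchemy with the asyncpg driver; the DSN is illustrative. asyncpg's statement_cache_size=0 disables its per-connection prepared statement cache so connections are safe to multiplex through PgBouncer's transaction mode:

# Worker/API database engine configured for PgBouncer transaction pooling
from sqlalchemy.ext.asyncio import create_async_engine

engine = create_async_engine(
    "postgresql+asyncpg://judge:***@pgbouncer:6432/judge",
    # asyncpg caches server-side prepared statements per connection by default;
    # PgBouncer transaction pooling hands the connection to other clients, so
    # the cache must be disabled.
    connect_args={"statement_cache_size": 0},
    pool_size=10,
    max_overflow=5,
)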

What to say in the interview: "Database scaling is not the first thing I'd reach for: the worker fleet is the bottleneck. I'd add read replicas when analytics queries start competing with writes, and PgBouncer when the worker fleet grows past ~100 nodes and connection count becomes the constraint."
10

Failure modes & edge cases

An online judge has several failure modes that don't exist in typical web services, because the execution layer introduces an entirely new class of failure: the sandboxed process. Understanding how each failure propagates — and where it's caught — is what separates an L5 answer from an L3 answer.

Scenario | Problem | Solution | Level
Worker crashes mid-execution | Submission is dequeued but verdict is never written. Job is "lost." | Kafka's at-least-once delivery: the job is requeued when the consumer heartbeat times out. Worker commits the offset only after writing the verdict to PostgreSQL and Redis. Idempotent verdict writes handle duplicate execution: INSERT INTO submissions (...) ON CONFLICT (id) DO UPDATE SET status = EXCLUDED.status WHERE submissions.status = 'QUEUED': the guard prevents overwriting a completed verdict if a replay arrives late. | L3/L4
Infinite loop / TLE | Worker process hangs, holding a VM slot indefinitely. Other submissions queue behind it. | Dual timeout enforcement: the hard CPU-time limit (RLIMIT_CPU, §5) kills tight spin loops; a wall-clock watchdog (SIGKILL after wall_ms) catches I/O-blocking infinite loops. Both must fire independently — using only one can be bypassed via sleep(). | L3/L4
Submission produces non-deterministic output | Multithreaded code with race conditions passes sometimes, fails others. Users complain about "flaky" judgements. | The judge runs each submission once — it does not re-run on failure. Whatever verdict that single run produces stands (wrong answer or runtime error). Problem setters should avoid problems where correct solutions rely on thread scheduling. Document this as a known limitation. | L5
Test case file corruption in S3 | A test case input or expected output file is corrupted (bit flip, incomplete write). All submissions for a problem receive wrong verdicts. | Store a SHA-256 checksum of each test case in PostgreSQL alongside the S3 key. Workers verify the checksum after download; a mismatch triggers an alert and stops judging for that problem until the file is restored. S3 versioning retains previous known-good versions. | L5
Redis leaderboard diverges from PostgreSQL | A Redis crash between verdict write and ZADD leaves a correct submission missing from the leaderboard. | On Redis restart, replay accepted submission records from PostgreSQL to reconstruct the sorted set. A background reconciliation job runs every 5 minutes during contests to detect and repair divergence. Redis is the live view; PostgreSQL is the source of truth. | L5
Contest starts, queue fills instantly | 10,000 participants submit within the first 60 seconds. Queue depth spikes to 50,000+. Latency blows past 5 s for submissions in the tail. | Pre-warm worker pool 15 min before contest start (§9). Apply per-user rate limiting (5 submissions/min) to smooth the burst. If queue depth still exceeds threshold, surface an estimated wait time to users ("your submission is queued, expected in ~8 s") instead of a blank spinner. | L5
Sandbox escape via kernel exploit | A submission exploits a kernel vulnerability to break out of the microVM and access the host or other VMs. | Defence-in-depth: Firecracker provides a hardware-virtualisation boundary; even if the guest kernel is compromised, the hypervisor boundary limits the blast radius to the single VM. VMs run as unprivileged users; host network is unreachable from the guest; metadata endpoints are blocked. Monitor for anomalous syscall patterns (unexpected exec calls, file descriptor creation). No architecture fully prevents zero-day exploits; the goal is to minimise blast radius and detect quickly. | L7/L8
Test case authoring error — wrong expected output | A problem setter creates a test case with an incorrect expected output. All correct solutions are judged as Wrong Answer. | Separate the problem-setter role from the judging system: require at least one accepted reference solution before activating a problem. Run the reference solution against all test cases during problem creation; if any case fails, block publication. Store the reference solution's output hash as the ground truth. Problem authoring pipeline: test case files upload via an admin-only path; SHA-256 stored in PostgreSQL at write time; S3 versioning enabled so rollback is possible; reference solution run triggered automatically on publish: the problem goes live only if every test case passes. | L7/L8
10b

Security, trust & compliance

Three security concerns are specific to online judges and map directly to components already in the architecture: abuse of the submission pipeline, the problem setter trust model (who can create test cases and how are they validated?), and data integrity under adversarial conditions. L6+ interviewers probe all three explicitly.

Submission rate limiting and abuse prevention

Without rate limiting, a single malicious actor can flood the execution queue with thousands of submissions, consuming worker capacity and degrading fairness for other users. Rate limiting is enforced at two levels:

Per-user submission rate limiting: how it's enforced

The API enforces a per-user, per-problem rate limit: maximum 5 submissions/minute, implemented as a fixed-window Redis counter (INCR ratelimit:sub:{user_id}:{problem_id}:{minute_bucket} with a 60-second TTL). Requests exceeding the limit receive a 429 Too Many Requests with a Retry-After header. A separate, higher limit (30 submissions/minute globally per user) catches bulk scripted abuse before it reaches the per-problem check.
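A minimal sketch of those counters, using redis-py; the key layout follows the text above, and the limit values are the ones stated:

# Fixed-window submission rate limiting (global + per-problem)
import time
import redis

r = redis.Redis()
PER_PROBLEM_LIMIT = 5    # submissions per minute per user per problem
GLOBAL_LIMIT = 30        # submissions per minute per user across all problems

def allow_submission(user_id: str, problem_id: str) -> bool:
    minute = int(time.time()) // 60
    pipe = r.pipeline()
    for key in (
        f"ratelimit:sub:{user_id}:{minute}",                 # global window
        f"ratelimit:sub:{user_id}:{problem_id}:{minute}",    # per-problem window
    ):
        pipe.incr(key)
        pipe.expire(key, 60)
    counts = pipe.execute()   # [global_count, True, per_problem_count, True]
    return counts[0] <= GLOBAL_LIMIT and counts[2] <= PER_PROBLEM_LIMIT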

During an active contest, the submission window check also applies: submissions outside the contest time window are rejected at the API before they reach the queue, not silently enqueued to be judged incorrectly.

L5 probe: "A user submits the same wrong solution repeatedly to try to reverse-engineer the hidden test cases." — The per-problem rate limit (5/min) limits this to 5 attempts per minute. For full-suite submissions, hiding per-test-case diffs during an active contest (see §5b) prevents learning which test case failed, making brute-force test case extraction impractical even within the rate limit.
Problem setter trust model: who can upload test cases?

Problem setters are trusted users — they can upload test case files directly to S3 via an admin-only signed URL path. But trust is scoped: a problem setter can only write test cases for problems they own; they cannot read other problems' test cases, user submissions, or execution verdicts. The IAM policy on the S3 bucket enforces prefix-level write isolation: problems/{problem_id}/cases/* is writable only by the owning problem setter's IAM role.

The reference solution validation pipeline (see §10 — test case authoring failure mode) further constrains trust: a problem cannot go live until a reference solution passes all test cases. This means a problem setter who accidentally or maliciously uploads bad test cases is caught before any user submission is judged against them. The pipeline runs in an isolated worker: the reference solution itself is sandboxed exactly like a user submission.

What to say in the interview: "I treat problem setters as a semi-trusted role: they have write access to their own problem's test cases but cannot see submissions or other problems' data. Every problem requires a passing reference solution before going live — this is both a quality gate and a security check."
Alternatives: Problem setter uploads to a staging bucket and an admin reviews before promoting to production · Automated test case generation (fuzzing) replaces manual upload
Test case integrity: SHA-256 checksums end to end

Every test case file has a SHA-256 checksum stored in PostgreSQL alongside its S3 key at upload time. Workers verify the checksum after downloading a test case file before running any submission against it. A mismatch triggers an immediate alert and pauses judging for that problem — no user receives a wrong verdict from a corrupted input/output file. S3 versioning retains the previous known-good version for rollback.

Checksum verification also closes a subtle attack vector: a compromised problem setter who modifies a test case file in S3 directly (bypassing the admin upload path) would trigger checksum failures on workers, alerting the platform before any user is affected.
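A minimal sketch of the worker-side verification, assuming boto3 for the S3 download and that the expected digest comes from the test_cases row in PostgreSQL; the bucket name and function boundaries are illustrative:

# Verify a test case file's SHA-256 before judging against it
import hashlib
import boto3

s3 = boto3.client("s3")
BUCKET = "judge-test-cases"

class TestCaseIntegrityError(Exception):
    pass

def fetch_verified(s3_key: str, expected_sha256: str) -> bytes:
    """Download a test case file; refuse to judge against it if the digest differs."""
    body = s3.get_object(Bucket=BUCKET, Key=s3_key)["Body"].read()
    digest = hashlib.sha256(body).hexdigest()
    if digest != expected_sha256:
        # Caller raises an alert and pauses judging for this problem.
        raise TestCaseIntegrityError(f"{s3_key}: expected {expected_sha256}, got {digest}")
    return body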

Verification cost: SHA-256 of a 100 KB test case takes ~0.1 ms on modern hardware. At 50 test cases per problem, total verification overhead per submission is ~5 ms — negligible within the 5 s budget.
Plagiarism detection: a background pipeline, not a blocking check

Plagiarism detection cannot run in the critical path — it's too expensive. The standard approach is a background pipeline triggered after contest end:

  1. Normalise all accepted submissions: strip comments, rename variables to canonical names (var1, var2...), normalise whitespace.
  2. Tokenise the normalised source into a sequence of language tokens (keywords, operators, literals).
  3. MinHash LSH (Locality-Sensitive Hashing): generate a MinHash signature for each submission's token set. Submissions whose signatures fall in the same LSH bucket are near-duplicate candidates — O(1) lookup per submission, not O(N²) pairwise comparison.
  4. Flag candidates for human review. Never auto-disqualify — false positives (two people independently solving the same problem the same way) are common.
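A minimal sketch of steps 2–3 using the datasketch library; the shingle size, similarity threshold, and num_perm are illustrative tuning assumptions:

# Near-duplicate candidate detection over normalised token streams
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128

def signature(tokens: list[str], shingle: int = 5) -> MinHash:
    """MinHash over overlapping token 5-grams of the normalised source."""
    m = MinHash(num_perm=NUM_PERM)
    for i in range(max(1, len(tokens) - shingle + 1)):
        m.update(" ".join(tokens[i:i + shingle]).encode("utf-8"))
    return m

# Submissions whose token-shingle sets have Jaccard similarity >= 0.8 share a bucket.
lsh = MinHashLSH(threshold=0.8, num_perm=NUM_PERM)

def index_and_flag(submission_id: str, tokens: list[str]) -> list[str]:
    sig = signature(tokens)
    candidates = lsh.query(sig)       # near-duplicate candidates, no pairwise scan
    lsh.insert(submission_id, sig)    # make this submission findable by later ones
    return candidates                 # hand off to human review, never auto-disqualify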
L7 probe: "A user submits 50 slightly different copies of the same solution during a contest to inflate their attempt count and confuse plagiarism detection." — The per-user rate limit (5/min) bounds total attempts. MinHash LSH would flag all 50 as near-duplicates of each other. Token normalisation removes trivial obfuscations (rename variables, add comments). The real challenge is detecting structural plagiarism where the algorithm is copied but re-implemented — this requires AST-level comparison, which is out of scope for an interview answer.
11

How to answer by level

L3 / L4 SDE I / SDE II: can you build a working system?
What good looks like
  • Identifies the core flow: submit → queue → execute → return verdict
  • Recognises code must run in an isolated environment (even if just says "Docker")
  • Proposes PostgreSQL for problems and submissions
  • Handles the basic timeout/TLE case with a kill signal
  • Knows the API returns async (202) not sync
What separates L5 from L3/L4
  • L3 says "run in Docker" — L5 asks what happens if a container escapes
  • L3 doesn't explain why the queue is necessary — L5 quantifies the concurrency problem
  • L3 ignores worker crash recovery — L5 explains at-least-once delivery
  • L3 polls PostgreSQL directly — L5 introduces Redis for hot status keys
L5 Senior SDE: do you understand the tradeoffs?
What good looks like
  • Compares Docker vs gVisor vs Firecracker — names the isolation primitive for each
  • Applies Little's Law to size the worker pool
  • Describes at-least-once delivery and idempotent verdict writes
  • Separates test case metadata (PostgreSQL) from test case files (S3)
  • Explains dual timeout (CPU-time + wall-clock) and why one alone is insufficient
  • Designs polling vs WebSocket result delivery and names the WebSocket routing constraint
What separates L6 from L5
  • L5 says "autoscale workers" — L6 specifies the queue depth metric, pre-warm trigger, and cost model
  • L5 knows Redis sorted sets for leaderboard — L6 describes reconciliation on Redis failure
  • L5 designs the schema — L6 addresses the submission volume at 2 B rows and partitioning strategy
  • L5 designs for average load — L6 explicitly designs for the contest spike
L6 Staff SDE: can you own this end-to-end?
What good looks like
  • Owns the full execution layer: Firecracker snapshot strategy, warm pool sizing, per-language partition strategy
  • Contest mode as a first-class concern: pre-warm schedule, per-user rate limiting, wait time estimation
  • PostgreSQL partitioning by submission date for the 2 B row scale
  • Test case integrity: SHA-256 checksum verification by worker
  • Designs the reference solution validation pipeline for problem authoring
  • PgBouncer rationale: when connections, not queries, become the bottleneck
What separates L7 from L6
  • L6 designs for one region — L7 considers multi-region execution for latency and data residency
  • L6 ignores plagiarism — L7 sketches a token-fingerprint detection pipeline
  • L6 optimises for throughput — L7 models the cost of Firecracker vs spot instance farms
  • L6 knows the failure modes — L7 has opinions on SLA tiers and graceful degradation strategies
L7 / L8 Principal / Distinguished: should we build this, and how?
What good looks like
  • Frames build vs buy: AWS Fargate / Lambda for isolation vs owned Firecracker fleet — explicit cost and ops trade-offs
  • Multi-region: execution in the user's nearest region for latency; results replicated globally for leaderboard consistency
  • Plagiarism detection: token normalisation (rename variables) + MinHash LSH for near-duplicate detection — a background pipeline, not in the critical path
  • Graceful degradation SLA: during capacity exhaustion, run sample tests immediately and defer full test suite — give partial feedback within 1 s
  • Versioned judge engine: when the platform upgrades a language runtime, old submissions can be re-judged to verify reproducibility
Common gaps at L7
  • Over-engineering the API or database layer when the execution layer is the constraint
  • Proposing expensive solutions (dedicated bare-metal per user) without acknowledging the cost model
  • Ignoring operational complexity of self-managed Firecracker vs managed container services
  • Not discussing the platform's trust model for problem setters (who can write test cases?)

Classic probes — level-differentiated answers

Question | L3/L4 | L5/L6 | L7/L8
"How do you prevent a submission from reading other users' code?" | Run each in its own container / process | Separate kernel per microVM (Firecracker); no shared filesystem; no network; capability drop | Defence-in-depth: microVM + seccomp + no metadata endpoint + anomaly detection. Discuss blast radius of a zero-day hypervisor escape.
"What happens if the queue fills up during a contest?" | Add more servers | Pre-warm pool before contest; autoscale on queue depth; per-user rate limiting; surface estimated wait to user | Graceful degradation: run sample test cases immediately for fast partial feedback; defer full suite; SLA tiering for paid vs free tier; cost model for pre-warming
"How does your leaderboard stay consistent?" | Query the database | Redis sorted sets updated on accepted verdict; reconcile from PostgreSQL on Redis restart; Redis is live view, PostgreSQL is source of truth | Multi-region: eventual consistency for live view is acceptable during contest; strong consistency for final standings via distributed transaction or canonical read from single region
"A problem setter uploads a test case with a wrong expected output. How do you detect it?" | Someone reports it | Require a passing reference solution before publishing; SHA-256 integrity check on test case files by workers | Automated regression: any accepted solution for a problem can be added to a reference set; a test case change triggers re-judging the reference set before it goes live

How the pieces connect

1. Isolation NFR (§2) → Firecracker microVMs (§5) → one-submission-per-VM model (§4). The isolation boundary is hardware virtualisation, not Linux namespaces — that distinction is why gVisor (shared host kernel via Sentry) is the runner-up, not the choice.
2. Latency NFR ≤5 s (§2) → async job queue (§4) → client polls Redis for status (§8). The 5 s budget has ~500 ms of slack after execution; polling every 500 ms consumes that slack entirely, which is why WebSocket push is the L5+ answer when the interviewer probes result delivery.
3. Little's Law at peak contest load (§3) → 280 worker slots needed at 75% utilisation → worker autoscaling on queue depth is the primary scaling lever (§9). This is why §9 focuses on the worker fleet, not the API tier or database — those are not the bottleneck.
4. Dual timeout enforcement in §5 (CPU-time limit + wall-clock watchdog) → TLE verdict in §10. One limit alone is bypassable: CPU-time misses sleep()-based infinite loops; wall-clock alone misses tight spin loops. The failure mode is only fully closed when both fire independently.
5. Read:write ratio ~20:1 for problems (§3) → CDN edge cache + Redis (§8) handles problem reads without touching the database. The submission write path and the problem read path are completely decoupled — scaling one does not affect the other.
6. Fairness NFR (§2) → per-language cgroup limits (§5) + separate Kafka topics per language (§9). Fairness requires both identical resource enforcement inside the sandbox and independent queue depth per language — a Python surge must not starve the C++ queue.
