Design a Payment Processing System
Simple to describe at checkout. Brutally hard to make correct at scale: idempotency, ledger consistency, and zero double-charges.
~40 min read · Stripe / PayPal · Covers Sections 1–11
What the interviewer is testing
Payment systems are the canonical test of whether a candidate understands correctness at scale, not just throughput. Most distributed systems happily trade consistency for availability; payments cannot. A double-charge or a lost transaction is a regulatory event, a trust event, and sometimes a legal event.
The interview question is deceptively simple: "design Stripe." What examiners are really probing is whether you grasp that payments are state machines with external actors (card networks, banks) that cannot be rolled back. The hard problems — idempotency, exactly-once execution, ledger correctness, reconciliation — are invisible until you've reasoned about failure modes explicitly.
| Level | Bar |
|---|---|
| L3/L4 | Sketch a working payment flow (customer → API → processor → bank). Know what an idempotency key is. Identify the need for ACID storage. |
| L5 | Explain exactly-once semantics across retries. Design the ledger model. Reason about the PSP (payment service provider) as an external system with its own failure modes. |
| L6 | Own the consistency model end-to-end: idempotency key lifecycle, distributed transaction across ledger + payment state, reconciliation pipeline design. |
| L7/L8 | Design for global scale: cross-region consistency, multi-currency ledger, regulatory isolation (PCI DSS scoping), fraud pipeline integration, settlement SLAs. |
The single most common mistake: designing the payment flow like a standard CRUD service. A payment process crosses an external system boundary (the card network) that you cannot control, inspect, or roll back. Your design must account for this from the start.
Requirements clarification
Functional requirements
| Requirement | Notes |
|---|---|
| Initiate a payment | Card-on-file or tokenised card. Async confirmation via webhook. |
| Payment status enquiry | Poll or webhook-driven status updates (pending → succeeded / failed). |
| Refund a payment | Full or partial refund. Referential integrity with original charge. |
| Merchant payout | Settlement to merchant bank account. Timing depends on agreement and jurisdiction: typically T+1 or T+2 business days. New merchants may carry a rolling reserve. The nightly batch job (§9) handles this cadence; it is not a real-time transfer. |
| Transaction history | Per-customer and per-merchant listing, paginated. |
| Webhook delivery | At-least-once delivery with retry to merchant endpoints. |
Non-functional requirements
| NFR | Target | Reasoning |
|---|---|---|
| Exactly-once charge | Zero preventable double-charges; minimised in all failure windows | Double-charging is a financial and legal incident. Duplicate execution may occur in narrow failure windows; the system must detect and suppress it. Idempotency keys enforce this in steady state; the recovery sweep covers crash windows. |
| Availability | 99.99% (52 min/year downtime) | Every minute of payment downtime is lost GMV. Multi-region active-active deployment required. |
| Charge latency | p99 < 3 s (sync), < 30 s (async) | Processor round-trip dominates; our system must not add more than 200 ms of internal latency. |
| Consistency | Strong consistency for ledger writes | Ledger entries must reflect real money movement. Eventual consistency is acceptable for analytics and dashboards. |
| Data durability | Zero data loss (ledger) | A lost transaction entry is a financial discrepancy. WAL-based storage with synchronous replication required. |
| Compliance | PCI DSS Level 1 | Card data must not touch our servers. Use tokenisation via a vault service; scope stays narrow. |
Architectural rationale for NFRs
Exactly-once semantics correctness · §5 ›
Exactly-once in a distributed system means: the effect of an operation is applied exactly once, even if the request is delivered more than once. For payments this is non-negotiable — a customer charged twice will dispute both transactions, triggering chargebacks and merchant penalties.
The implementation is idempotency keys: the client generates a UUID per payment attempt and sends it as a header. The server stores the key → result mapping atomically with the ledger entry. Any replay returns the stored result. This requires the key store and ledger write to be in the same ACID transaction.
Strong consistency for ledger correctness · §7 ›
A balance derived from an eventually consistent view could show incorrect available funds, enabling over-spending. Ledger writes must be linearisable: each entry reflects an authoritative money movement, and reads must see all prior writes in sequence.
This drives the choice of a relational database with synchronous replication (not async replica lag). In practice, Stripe runs a sharded MySQL cluster with synchronously replicated secondaries; each payment routes to the shard that owns the merchant's account.
PCI DSS compliance scoping security · §4 ›
PCI DSS Level 1 mandates an annual on-site audit for any system that stores, processes, or transmits raw card data. The smallest-scope design keeps raw card data out of our systems entirely by delegating card collection to a client-side tokenisation SDK (Stripe.js, Braintree SDK). The client vault returns a single-use payment method token; our backend only ever sees the token.
Availability: 99.99% reliability · §4 §9 ›
99.99% availability = roughly 52 minutes of downtime per year. Achieving this requires multi-region active-active deployment with automatic failover. No single component — not the DB, not the payment API, not the PSP connector — can be a single point of failure.
Capacity estimation
Payment workloads are write-heavy on the ledger and read-heavy on dashboards. The write path (charge, refund, payout) is latency-sensitive and correctness-critical. The read path (transaction history, analytics) can tolerate slight staleness and is served from read replicas or a separate analytics store.
Four dimensions matter: transaction volume (writes/day), read demand, ledger row size, and retention. Adjust the numbers in the worked example below for your assumed scale.
A 10 M transactions/day load (~116 write QPS) is trivially handled by a single PostgreSQL primary. The complexity of a payment system at this scale comes entirely from enforcing idempotency and preventing dropped writes (SELECT FOR UPDATE + INSERT + UPDATE with WAL-synced commits). These are correctness problems, not scale problems, until you exceed roughly 100 K TPS — a Visa-level load.
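A minimal back-of-envelope sketch of that arithmetic (the peak factor and row size are assumptions, not measurements):

// Assumed: 10 M transactions/day, ~500 B per ledger row, 3 rows per transaction
const txPerDay = 10_000_000;
const writeQps = Math.round(txPerDay / 86_400);   // ≈ 116 writes/s average
const peakQps = writeQps * 3;                     // ≈ 350/s with an assumed 3x peak factor
const rowsPerTx = 3;                              // payment row + debit + credit ledger entries
const bytesPerRow = 500;                          // assumed average row size
const storagePerYearTB = (txPerDay * rowsPerTx * bytesPerRow * 365) / 1e12; // ≈ 5.5 TB/year
console.log({ writeQps, peakQps, storagePerYearTB });

Even with generous padding for indexes and WAL, this fits on a single well-provisioned PostgreSQL node, reinforcing that correctness, not scale, is the hard part at this volume.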
High-level architecture
Component breakdown
Token Vault (Stripe.js / Braintree SDK) collects raw card data in the client's browser before it ever reaches our servers. It returns a single-use payment method token, keeping our backend out of PCI scope entirely.
API Gateway + Load Balancer terminates TLS, validates API keys or OAuth tokens, enforces per-merchant rate limits (see Rate Limiter Design), and routes traffic to the payment service fleet. It is also where idempotency key headers are validated for basic format before reaching business logic.
Payment Service is the core business logic layer. It owns the payment state machine (created → processing → succeeded/failed/refunded), enforces idempotency, calls the PSP connector, and writes atomically to the ledger. This service is stateless — all state lives in the ledger DB.
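A sketch of the transition guard such a state machine implies (state names from this section; the enforcement helper is illustrative):

// Allowed transitions for the payment state machine (illustrative)
const TRANSITIONS = {
  created: ['processing'],
  processing: ['succeeded', 'failed'],
  succeeded: ['refunded'],
  failed: [],    // terminal
  refunded: [],  // terminal
};

function assertTransition(from, to) {
  if (!(TRANSITIONS[from] || []).includes(to)) {
    throw new Error(`Illegal payment transition: ${from} -> ${to}`);
  }
}

assertTransition('processing', 'succeeded'); // ok
// assertTransition('failed', 'succeeded');  // throws: terminal states never transition

Enforcing transitions in one place means a buggy caller cannot, say, move a failed payment to succeeded without an explicit new payment attempt.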
PSP Connector wraps the external payment service provider (e.g. Stripe, Adyen, Braintree). It translates internal payment intents into processor-specific API calls, normalises responses, and handles processor-specific retry semantics. Multiple connectors let us route by currency, geography, or cost.
Ledger DB is the authoritative source of truth for all money movement. It stores both the payment record and idempotency key results in a single ACID transaction. A relational database (PostgreSQL, MySQL, or at Google scale a globally distributed relational database like Spanner) is mandatory here — no eventual consistency shortcuts.
Webhook Dispatcher delivers asynchronous status updates (payment succeeded, refund processed) to merchant endpoints. At-least-once delivery with exponential back-off retry; merchants must be idempotent receivers. For deep dives into asynchronous delivery, see our Notification Service Design post.
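A sketch of the retry schedule, assuming the exponential back-off with a 24-hour cap described in §10's failure table:

// Delay before retry N of a webhook delivery: 1 s, 2 s, 4 s, ... capped at 24 h
function nextRetryDelaySeconds(attempt) {
  const DAY_SECONDS = 24 * 60 * 60;
  return Math.min(2 ** attempt, DAY_SECONDS); // attempt 0 -> 1 s; attempt 17+ -> 24 h cap
}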
Analytics Pipeline streams ledger events (via a change-data-capture stream like Debezium → Kafka) into an analytics store and a reconciliation engine. Separated from the hot path to avoid latency coupling.
Cache (Redis) stores recently resolved idempotency keys for fast hot-path deduplication, reducing ledger DB reads on retried requests. Cache is a performance optimisation only: the canonical truth remains the DB.
Architectural rationale
Why a dedicated Ledger DB instead of a general payments table? data model · §7 ›
A ledger is append-only by design — no UPDATE on balance fields, only INSERT of new debit/credit entries. This makes the system naturally idempotent (an inserted entry for an idempotency key is the proof of execution), fully auditable, and recoverable from any snapshot. Updating a balance field in place would require read-modify-write, which needs locks and is vulnerable to lost updates under concurrency.
Why separate the PSP Connector from the Payment Service? modularity · §6 ›
Payment processors have wildly different APIs, error codes, and retry semantics. The connector layer absorbs this complexity, presenting a normalised interface to the payment service. Swapping processors (e.g. adding Adyen as a fallback) or routing by geography doesn't touch the core business logic.
Why async analytics separated from the write path? observability · §9 ›
Reconciliation, fraud analysis, and chargebacks all require batch or near-real-time analytics over transaction data. Coupling this to the hot write path would add latency and create a blast radius if the analytics system falls behind. Change-data-capture (CDC) on the ledger DB ensures analytics always sees committed writes without polling or impacting write throughput.
Real-world comparison
| Decision | This design | Stripe | PayPal |
|---|---|---|---|
| Core DB | PostgreSQL / Spanner | Sharded MySQL (Vitess) | Oracle (migrating to PostgreSQL) |
| Idempotency | Key table + Redis cache | Idempotency-Key header, stored in MySQL | PayPal-Request-Id header |
| Ledger model | Append-only double-entry | Double-entry ledger | Account balance + transaction log |
| PSP layer | Connector service | Stripe IS the processor (direct network integration) | Braintree gateway + PayPal processor |
| Analytics | CDC → Kafka → ClickHouse | Kafka → Druid | Kafka → Hadoop + Flink |
| Reconciliation | Nightly batch + real-time CDC | Nightly batch against network settlement files | Nightly batch + dispute management system |
No single architecture fits every payment product. Stripe built its own card network integration (direct acquiring) to control the full stack — most companies cannot. The right design follows from what you own: merchant aggregator (Stripe-like), marketplace (PayPal-like), or in-house payment team for a single retailer (Amazon-like).
Core algorithm — idempotency
The defining algorithmic question for a payment system isn't "how do we charge a card?" — it's "how do we guarantee that we charge it exactly once, even if the network drops, the server crashes mid-request, or the client retries three times?" This is the idempotency problem.
There are two primary approaches for enforcing exactly-once execution at the API layer: distributed locks and idempotency keys. The choice between them determines how your server handles the hardest case: concurrent retries.
Our choice for this system: idempotency keys. Distributed locking prevents concurrent execution but not sequential duplicates — if the lock is released before the client receives the response, a new request can re-execute. Idempotency keys solve both: they record the result of the first execution and return it to any subsequent request, regardless of timing.
Idempotency key implementation: correct two-phase pattern implementation detail ›
Key rule: the PSP call must happen outside any open DB transaction. Holding a transaction open during a 1–2 s network call causes lock contention, connection pool starvation, and transaction timeouts. The correct pattern uses two short transactions around the external call:
async function processPayment(req) {
  // Node lowercases incoming header names
  const key = req.headers['idempotency-key'];
  if (!key) throw new Error('Missing Idempotency-Key header'); // surface as 400 Bad Request (§6)

  // Phase 1 -- fast dedup check (no txn needed for a read)
  const cached = await redis.get(`idem:${key}`);
  if (cached) return JSON.parse(cached); // hot path exits here

  // Phase 2 -- short txn: check DB + reserve the slot (~5 ms held)
  const existing = await db.transaction(async (txn) => {
    const res = await txn.query(
      'SELECT status, psp_ref FROM payments WHERE idempotency_key=$1 FOR UPDATE', [key]
    );
    if (res.rows.length > 0) {
      // Concurrent retry -- caller should back off and retry after a delay
      if (res.rows[0].status === 'processing') throw new Error('Payment in flight');
      return res.rows[0]; // completed payment -- return stored result
    }
    // Reserve slot -- marks in-flight so a concurrent retry sees it above
    await txn.query(
      'INSERT INTO payments(id, idempotency_key, status) VALUES(gen_random_uuid(), $1, $2)',
      [key, 'processing']
    );
    return null; // null = new payment, proceed
  }); // txn commits here; lock released

  if (existing) return existing; // duplicate detected via DB

  // Phase 3 -- PSP call OUTSIDE any transaction (1-2 s, no lock held)
  const result = await chargeCard(req.body);

  // Phase 4 -- short txn: commit result + ledger entry atomically (~5 ms held)
  await db.transaction(async (txn) => {
    await txn.query(
      'UPDATE payments SET status=$1, psp_ref=$2 WHERE idempotency_key=$3',
      [result.status, result.psp_ref, key]
    );
    await txn.query(
      'INSERT INTO idempotency_keys(key, result, created_at) VALUES($1, $2, NOW())',
      [key, JSON.stringify(result)]
    );
    await txn.query('INSERT INTO ledger_entries(...) VALUES(...)'); // debit/credit pair, elided
  });

  // Phase 5 -- populate Redis cache for future retries (24 h TTL)
  await redis.setex(`idem:${key}`, 86400, JSON.stringify(result));
  return result;
}
If the server crashes between Phase 3 (the PSP call) and Phase 4 (the DB commit), the payment row is stranded with status=processing. The recovery sweep detects and resolves this by querying the PSP directly.
The critical edge case: what happens if the server crashes after calling the PSP but before writing to the DB? The money moved but there's no ledger entry. Recovery: a background job periodically scans for in-flight transactions (rows stuck in status = 'processing' for longer than 2× the PSP's timeout SLA, typically 30–60 s) and queries the PSP for their actual outcome via the stored payment reference. This "recovery sweep" is mandatory in any production payment system.
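A hedged sketch of that sweep, reusing the table and column names from the code above (the psp.getPaymentByReference connector method and the 60 s threshold are assumptions):

// Recovery sweep: resolve payments stranded in 'processing' (run every ~60 s)
async function recoverySweep() {
  const stale = await db.query(
    `SELECT id FROM payments
     WHERE status = 'processing' AND created_at < NOW() - INTERVAL '60 seconds'`
  );
  for (const p of stale.rows) {
    // Ask the PSP what actually happened (hypothetical connector method)
    const outcome = await psp.getPaymentByReference(p.id);
    await db.transaction(async (txn) => {
      // status='processing' guard: a concurrent Phase 4 may have resolved it already
      const updated = await txn.query(
        'UPDATE payments SET status=$1, psp_ref=$2 WHERE id=$3 AND status=$4',
        [outcome.status, outcome.psp_ref, p.id, 'processing']
      );
      if (updated.rowCount === 1 && outcome.status === 'succeeded') {
        await txn.query('INSERT INTO ledger_entries(...) VALUES(...)'); // elided, as in §5
      }
    });
  }
}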
API design
The API surface for a payment system is deliberately minimal — fewer endpoints means fewer attack vectors and less surface for idempotency bugs. The two core endpoints handle initiation and retrieval. Everything else is additive.
POST /v1/payments — initiate a payment
Creates a payment intent. Returns immediately with status: "processing"; confirmation arrives via webhook. Requires Idempotency-Key header — requests without one are rejected with 400 Bad Request.
// Request
POST /v1/payments
Authorization: Bearer sk_live_...
Idempotency-Key: a8f3c2d1-4b5e-6789-abcd-ef0123456789
Content-Type: application/json
{
"amount": 4999, // in smallest currency unit (cents)
"currency": "usd",
"payment_method": "pm_1abc...", // token from vault, not raw card data
"merchant_id": "merch_xyz",
"metadata": { "order_id": "ord_9988" }
}
// Response — 201 Created
{
"id": "pay_7f3e...",
"status": "processing", // terminal: succeeded | failed | refunded
"amount": 4999,
"currency": "usd",
"created_at": "2026-04-20T20:00:00Z",
"idempotency_key": "a8f3c2d1-..."
}
Input validation: amount must be a positive integer; currency must be a valid ISO 4217 code; payment_method token must be unexpired and belong to the calling merchant; metadata keys must not contain PII. Reject requests failing any of these with descriptive 4xx errors — never a silent 200.
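A sketch of those checks (the supported-currency set is truncated for illustration; token ownership and expiry need a vault lookup not shown here):

const SUPPORTED_CURRENCIES = new Set(['usd', 'eur', 'gbp', 'jpy']); // illustrative ISO 4217 subset

function validateChargeRequest(body) {
  const errors = [];
  if (!Number.isInteger(body.amount) || body.amount <= 0)
    errors.push('amount must be a positive integer in the smallest currency unit');
  if (typeof body.currency !== 'string' || !SUPPORTED_CURRENCIES.has(body.currency))
    errors.push('currency must be a supported ISO 4217 code');
  if (typeof body.payment_method !== 'string' || !body.payment_method.startsWith('pm_'))
    errors.push('payment_method must be a vault token, never raw card data');
  return errors; // non-empty -> respond 400 with every message, never a silent 200
}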
GET /v1/payments/:id — retrieve payment status
// Request
GET /v1/payments/pay_7f3e
Authorization: Bearer sk_live_...
// Response — 200 OK
{
"id": "pay_7f3e...",
"status": "succeeded",
"amount": 4999,
"currency": "usd",
"processor_response": { "network_txn_id": "visa_abc123" },
"settled_at": "2026-04-21T01:00:00Z"
}
Optional endpoints by level
| Endpoint | Purpose | Level |
|---|---|---|
| POST /v1/refunds | Full or partial refund of a payment. References original payment ID. Requires its own Idempotency-Key — separate from the original charge key. A retried refund request with the same key returns the cached refund result without re-executing. Never reuse the charge's idempotency key for its refund. | L3/L4 |
| GET /v1/payments?merchant_id=&page=&limit= | Paginated transaction history with cursor-based pagination. | L5 |
| POST /v1/webhooks/endpoint | Register a merchant webhook URL. Validated by sending a test event. | L5 |
| POST /v1/payouts | Trigger merchant payout to bank account. Requires payout schedule or manual trigger. | L6 |
| GET /v1/balance | Merchant available balance derived from settled transaction sum. Served from CQRS read model (not the primary). L7/L8 concern: multi-currency balance requires currency-segregated ledger partitions with locked FX rates at settlement time — not just summing all entries. | L5 (L7/L8 for multi-currency) |
Core flow — charge a card
The payment flow crosses an external system boundary — that's what makes it categorically different from most CRUD operations. Our system must handle the case where we successfully submitted a charge request to the PSP, but never received a response. We cannot determine from our own logs whether money moved. This uncertainty is the reason for the state machine and recovery sweep.
The key tradeoff in this flow, referencing the exactly-once NFR from §2, is how to handle the timeout branch. Two strategies:

Strategy A: hold and reconcile (our choice)
- Leave payment in "processing" state
- Background job reconciles with PSP after 30 s–1 min
- One authoritative outcome per payment
- No risk of double-charge

Strategy B: blind client retry
- Client retries without idempotency key enforcement
- If the PSP processed the first request, the retry causes a double-charge
- Simple to implement, catastrophic failure mode
- Only safe with strict server-side idempotency
Fraud scoring sits at step ③: between the payment row INSERT and the PSP call. A real-time ML scoring service (~20–50 ms budget) returns a risk score. Low-risk charges proceed to the PSP immediately. High-risk charges are held for manual review (status=under_review). Above-threshold charges are blocked with a 402 Payment Required response and no PSP call is made — preventing chargebacks before they happen. L7/L8
Data model
Before writing a single column name, it helps to identify the entities and how they get used. A payment system has five main entities — each accessed very differently.
Entities and access patterns
| Entity | Operation | Frequency | Query shape |
|---|---|---|---|
| Payment | Create (charge) | Very High | Point write by merchant_id + idempotency key |
| Payment | Status lookup | High | Point read by payment_id |
| Payment | Merchant history | Medium | Range scan by merchant_id, sorted by created_at DESC |
| Ledger Entry | Create (debit/credit) | Very High | Append-only insert, paired with payment write in same txn |
| Ledger Entry | Balance calculation | Medium | Aggregate SUM by merchant_id and currency |
| Idempotency Key | Lookup on retry | Low-Medium | Point read by key string |
| Merchant | Auth / rate limit | Very High | Point read by API key (cached) |
| Webhook Endpoint | Delivery lookup | Medium | Point read by merchant_id |
Two things jump out from these access patterns. First, payments are accessed almost exclusively by payment_id (point reads) or by merchant_id + created_at (range scans) — these become primary key and secondary index. Second, ledger entries are append-only with aggregate queries by merchant — a strong hint that balance computation belongs in a materialised view or a separate read model, not in the hot write path.
Field-level rationale
amount as bigint (not float) correctness ›
Floating-point arithmetic is lossy. $49.99 has no exact binary floating-point representation (the stored double is off by a tiny fraction), so summing thousands of transactions accumulates rounding error. The industry standard is to store amounts as integers in the smallest currency unit (cents for USD, pence for GBP, yen for JPY). All arithmetic is exact integer arithmetic; display formatting applies the decimal on output.
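A few lines of standard IEEE 754 behaviour make the point (runnable in any JavaScript runtime):

console.log(0.1 + 0.2);       // 0.30000000000000004 -- not 0.3
let total = 0;
for (let i = 0; i < 10; i++) total += 0.1;
console.log(total === 1.0);   // false: total is 0.9999999999999999

// Integer cents are exact; the decimal point is a display concern only
const cents = 4999 + 4999;             // two $49.99 charges
console.log((cents / 100).toFixed(2)); // "99.98"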
ledger_entries.type as enum (debit | credit) double-entry bookkeeping ›
Double-entry bookkeeping requires every transaction to have a matching debit and credit. A customer card payment of $49.99 creates two entries in our ledger: a debit to our receivables account (funds incoming from the card network) and a credit to the merchant payable account (funds we owe the merchant). The customer's own bank account is debited by the issuing bank, not by us. Representing both as rows with a type column lets you verify correctness by asserting that SUM(credits) = SUM(debits) across any closed time window — this is the reconciliation invariant.
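A sketch of the paired insert and the invariant check, using assumed table and column names consistent with this section (paymentId and the window bounds are placeholders):

// One $49.99 charge -> a debit and a credit row, committed atomically
await db.transaction(async (txn) => {
  await txn.query(
    `INSERT INTO ledger_entries(payment_id, account, type, amount_cents, currency)
     VALUES ($1, 'network_receivables', 'debit',  4999, 'usd'),
            ($1, 'merchant_payable',    'credit', 4999, 'usd')`,
    [paymentId]
  );
});

// Reconciliation invariant over any closed window: SUM(debits) = SUM(credits)
const check = await db.query(
  `SELECT SUM(amount_cents) FILTER (WHERE type = 'debit')  AS debits,
          SUM(amount_cents) FILTER (WHERE type = 'credit') AS credits
   FROM ledger_entries
   WHERE created_at BETWEEN $1 AND $2`,
  [windowStart, windowEnd]
);
// A debits/credits mismatch here is an incident, not a warning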
idempotency_keys.expires_at: key expiry storage management ›
Idempotency keys are only useful during a retry window — typically 24–72 hours after the original request. After that, the same key from the same client would represent an accidental reuse, not a legitimate retry. Expiring keys frees storage and prevents false deduplication across separate payment attempts that happen to share a key by programmer error.
ledger_entries.id: bigserial vs distributed ID scalability · §9 ›
The schema uses bigserial (PostgreSQL auto-increment sequence) for ledger entry IDs. On a single-node deployment this is correct — sequential IDs are insert-ordered, which aligns with the append-only access pattern. However: bigserial requires a central sequence generator. In a sharded deployment (§9), each shard needs to generate globally unique IDs independently.
The fix: replace bigserial with a Snowflake-style distributed ID: [41-bit timestamp | 10-bit node ID | 12-bit sequence]. This generates unique, roughly-time-ordered IDs on each shard node with no coordination. Stripe and Twitter both use variants of this approach. The time-ordering property preserves sortability within a shard, though cross-shard ordering depends on the timestamp component (and therefore on clock skew).
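A minimal generator sketch under that bit layout (the custom epoch is an assumption; BigInt keeps the 63-bit result exact):

// Snowflake-style ID: [41-bit ms timestamp | 10-bit node ID | 12-bit sequence]
const EPOCH = 1577836800000n; // assumed custom epoch: 2020-01-01T00:00:00Z
let lastMs = 0n;
let seq = 0n;

function nextId(nodeId) {
  let now = BigInt(Date.now());
  if (now < lastMs) now = lastMs; // tolerate small clock regressions
  if (now === lastMs) {
    seq = (seq + 1n) & 0xfffn;    // 12-bit sequence: 4096 IDs per ms per node
    if (seq === 0n) {             // sequence exhausted: busy-wait for the next ms
      while (BigInt(Date.now()) <= lastMs) { /* spin */ }
      now = BigInt(Date.now());
    }
  } else {
    seq = 0n;
  }
  lastMs = now;
  return ((now - EPOCH) << 22n) | (BigInt(nodeId) << 12n) | seq;
}

console.log(nextId(7).toString()); // roughly time-ordered within this node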
Caching strategy
Caching in a payment system is more constrained than in most systems because correctness cannot be sacrificed for speed. The ledger itself must never be served from a stale cache. But several adjacent data paths benefit from caching without correctness risk: idempotency key lookups, merchant auth/config, and payment status reads.
Cache hierarchy
| Layer | What is cached | TTL | Why this layer exists | Invalidation |
|---|---|---|---|---|
| Rate limit counters (API Gateway; distributed in-memory cache like Redis) | Per-merchant request count, per-IP abuse counters | 1 min sliding window | Enforces >1,000 req/s limits without a DB read per request | Expires naturally; DECR on window slide |
| Idempotency keys (Payment Service; Redis) | Key → serialised result for recently executed payments | 24–72 h | Avoids a DB read (with FOR UPDATE lock) for every retry; reduces hot-path latency by ~5 ms | TTL-based. Never evicted early — a stale result is always correct |
| Merchant config (Payment Service; Redis) | API key hash → merchant object (name, payout settings, rate limits) | 5 min | Auth check on every request requires a merchant lookup — a hot read with a stable document | Invalidate on merchant config update (cache-aside pattern) |
| Dashboard / read replica (PostgreSQL async replica or OLAP read model) | Transaction history, balance summaries, analytics | Seconds of replica lag | Offloads aggregate queries from the write primary; balance from a replica is acceptable for display (not for charging) | Replica lag. Not acceptable for balance-before-charge decisions |
Critical invariant: The ledger write path — INSERT into ledger_entries + payments UPDATE — must always go to the primary DB, never through a cache or read replica. A stale balance from a read replica must never gate a charge decision. Cache misses here are acceptable; cache hits that serve stale payment state are not.
Deep-dive scalability
When transaction volume leaps to 100 K TPS (Visa-level), or business needs dictate cross-region consistency, a single-primary architecture hits its limits.
Ledger sharding strategy L5+ · scalability ›
The ledger cannot be sharded by transaction ID (random) because merchant-scoped queries would require scatter-gather across all shards. The natural shard key is merchant_id: all of a merchant's ledger entries live on the same shard, making balance queries and history scans single-shard operations.
Shard count should be provisioned at 4–8× expected peak to avoid live resharding. A hash ring (consistent hashing) with virtual nodes allows adding capacity with minimal rebalancing.
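A sketch of that routing, assuming SHA-256-based ring points and 128 virtual nodes per shard (a production ring would binary-search the sorted points):

const crypto = require('crypto');

// First 8 bytes of SHA-256 as an unsigned 64-bit BigInt
function hash64(s) {
  return BigInt('0x' + crypto.createHash('sha256').update(s).digest('hex').slice(0, 16));
}

// Each physical shard owns many virtual points on the ring
function buildRing(shards, vnodesPerShard = 128) {
  const ring = [];
  for (const shard of shards)
    for (let v = 0; v < vnodesPerShard; v++)
      ring.push({ point: hash64(`${shard}#${v}`), shard });
  return ring.sort((a, b) => (a.point < b.point ? -1 : 1));
}

// All of a merchant's ledger entries land on one shard
function shardFor(ring, merchantId) {
  const h = hash64(merchantId);
  const hit = ring.find((n) => n.point >= h); // first vnode clockwise from the hash
  return (hit || ring[0]).shard;              // wrap around the ring
}

const ring = buildRing(['shard-a', 'shard-b', 'shard-c']);
console.log(shardFor(ring, 'merch_xyz'));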
Multi-region consistency model L6+ · global deployment ›
Merchant-based sharding achieves regional affinity: a merchant's ledger shard lives in the region closest to them, and their payment service instances route writes to that shard. Cross-region writes happen only for merchants that have cross-region merchant accounts — a small minority. This avoids the need for a globally consistent DB on the critical path.
For truly global merchants (L7+ discussion), a globally distributed relational database (Google Spanner, CockroachDB) provides external consistency across regions at the cost of cross-region round-trip latency on writes (~100–200 ms). This is acceptable for payment settlement but not for real-time charge flow — mitigate by accepting writes locally and committing globally via a two-phase approach.
Distributed payment ID generation L5+ · uniqueness at scale ›
Payment IDs must be globally unique, sortable by creation time (for pagination), and generated without a central coordinator (to avoid a bottleneck). Stripe-style IDs (e.g. pay_1abc...) use a prefixed base62 encoding of a Snowflake-style ID: 41-bit timestamp + 10-bit datacenter/machine ID + 12-bit sequence. This generates 4096 unique IDs per millisecond per node, with no coordination required.
Reconciliation pipeline design L6+ · data integrity ›
Reconciliation compares our internal ledger against settlement files from card networks (Visa, Mastercard) and bank ACH files. The pipeline: (1) ingest the settlement file via SFTP/API at the day's cut-off, (2) parse and normalise into a staging table, (3) match by external reference ID + amount + currency, (4) classify mismatches as: missing in our ledger (possible bug), missing in settlement (payment not settled yet), amount mismatch (fee discrepancy), (5) route mismatches to an alerts queue for ops review.
Change-data-capture (Debezium on the Ledger DB → a durable message queue like Kafka) provides the near-real-time feed for ledger changes. The reconciliation job can also run as a nightly full-scan batch as a safety net.
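A sketch of the step (4) classification, with assumed record shapes (rows already paired by external reference ID in step 3):

// Classify one matched pair from the staging table against our ledger
function classifyMismatch(settlementRow, ledgerRow) {
  if (!ledgerRow) return 'MISSING_IN_LEDGER';   // possible bug: money moved we never recorded
  if (!settlementRow) return 'NOT_YET_SETTLED'; // usually benign; re-check next cycle
  if (settlementRow.currency !== ledgerRow.currency) return 'CURRENCY_MISMATCH';
  if (settlementRow.amount_cents !== ledgerRow.amount_cents)
    return 'AMOUNT_MISMATCH';                   // often a fee discrepancy; route to ops
  return 'MATCHED';
}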
PSP multi-routing and fallback L6+ · reliability ›
Individual PSPs have their own outages and degraded periods (Stripe had a major incident in 2023). A multi-PSP router maintains health scores per processor (success rate, p95 latency over a 5-minute window) and routes new payments to the healthiest processor for a given currency/card type. On failure, it falls back to the next-best processor — but only for "soft" failures (network timeout, 503), never for "hard" declines (card rejected, fraud block).
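A sketch of the routing decision (the scoring formula and failure taxonomy are assumptions):

// Order candidate PSPs for a currency by health; try in order
function rankPsps(psps, currency) {
  return psps
    .filter((p) => p.currencies.includes(currency))
    .map((p) => ({ ...p, score: p.successRate5m - p.p95LatencyMs / 10_000 })) // assumed weighting
    .sort((a, b) => b.score - a.score);
}

const SOFT_FAILURES = new Set(['network_timeout', 'http_503', 'connection_reset']);

// Fall back to the next PSP only on soft failures, never on hard declines
function shouldFallback(error) {
  return SOFT_FAILURES.has(error.code); // 'card_declined' / 'fraud_block' must not re-route
}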
3DS / Strong Customer Authentication (SCA) L6+ · EU compliance ›
EU PSD2 mandates Strong Customer Authentication (SCA) for most consumer card payments. This turns the single synchronous PSP call into a multi-step challenge flow — the cardholder must complete an additional verification (biometric, OTP, or app push) before the charge is authorised. Our PSP connector must handle this as a redirect/callback pattern, not a single HTTP call.
The flow with 3DS2: (1) Payment service initiates the charge intent with the PSP. (2) PSP assesses risk — low-risk transactions may be exempted (frictionless flow). (3) If a challenge is required, the PSP returns a redirect URL. The client sends the cardholder to the issuer's 3DS page. (4) After challenge completion, the PSP posts a webhook to our connector. (5) Connector calls the PSP to confirm authorisation and commits the ledger entry.
The state machine gains two intermediate states: challenge_required and challenge_completed. Payments that time out in challenge_required must be expired, not recovered — no funds have moved at this point (the authorisation was never completed), so the recovery sweep is irrelevant. Expire the intent and allow the customer to retry with a fresh attempt.
Failure modes & edge cases
| Scenario | Problem | Solution | Level |
|---|---|---|---|
| Client retries without idempotency key | Server executes duplicate charge because no deduplication key exists | Reject requests without Idempotency-Key header with 400 Bad Request. Never process a payment without one. | L3/L4 |
| Server crash after PSP charge, before DB write | Money moved but no ledger entry. Internal balance is incorrect. | Recovery sweep: background job scans payments in "processing" state older than 2× the PSP timeout SLA (typically 30–60 s), queries PSP for actual outcome via the stored payment reference, and writes the result to the ledger atomically. | L5 |
| PSP timeout — unknown outcome | Payment may have succeeded or failed at the network level. Cannot safely retry. | Never retry on timeout without first querying PSP for outcome via payment reference. Leave status as "processing" until confirmed. See §6. | L5 |
| Redis cache unavailable | Idempotency key cache miss forces every request to hit the DB with SELECT FOR UPDATE. Latency spikes but correctness is preserved. | Design cache as a performance optimisation only. The DB is the source of truth. Alert on p99 latency increase; no correctness incident. Restore Redis from cluster replica. | L5 |
| Ledger DB primary failure | Payment service cannot write. All payment initiations fail. | Synchronous replica promotion via automated failover (e.g. Patroni for PostgreSQL). Target RTO < 30 s. Accept degraded availability during failover window. | L6 |
| Webhook delivery failure | Merchant never receives payment confirmation. Manual reconciliation required. | At-least-once webhook delivery with exponential back-off (1 s → 2 s → 4 s → …, capped at 24 h between attempts). After ~16 retry attempts spanning 72 h of failures, publish to a dead-letter queue and alert. Merchant can also poll GET /v1/payments/:id. | L6 |
| Hot shard (large merchant) | One merchant's volume overwhelms its DB shard (write QPS exceeds primary capacity) | Detect via per-shard write QPS metrics. Options: dedicate a shard group to the merchant; sub-shard by transaction date range; upgrade to larger instance class. Avoid: re-hashing live data. | L7/L8 |
| Currency conversion race condition | FX rate applied at charge initiation vs settlement differs. Merchant receives incorrect payout amount. | Record the FX rate and source (provider + timestamp) at the moment of conversion. Payout amount is calculated from the locked rate, not the spot rate at settlement time. Discrepancy reporting in reconciliation pipeline. | L7/L8 |
| Chargeback received | Customer disputes a succeeded charge via their bank. Card network claws back funds from our settlement account — this happens before merchant is notified. | Create a chargeback ledger entry (debit) linked to the original payment ID. Freeze the disputed amount from the merchant's available balance — the network has already reclaimed these funds from our settlement account, and the freeze prevents the merchant from withdrawing money we no longer hold. Surface to the merchant dashboard for evidence submission within the network's response window (typically 7–20 days). Route to dead-letter queue if evidence deadline is missed. Track dispute.created → evidence_submitted → dispute.resolved_(won|lost) as payment state transitions. | L5/L6 |
How to answer by level
L3/L4: SDE I / SDE II bar: build a working system ›
- Sketch the payment flow: Client → API → PSP → response → DB write
- Know what an idempotency key is and why it's required
- Choose a relational DB for the ledger; justify with ACID requirement
- Identify the need for webhook notifications to merchants
- Address basic failure: what happens if the PSP call fails
- L3/L4 treats the PSP call as atomic — L5 knows it isn't
- L3/L4 doesn't address the server-crash-after-PSP-call scenario
- L3/L4 conflates "the request succeeded" with "the money moved"
- L3/L4 doesn't distinguish hard vs soft PSP declines
L5: Senior SDE bar: understand the tradeoffs ›
- Design the idempotency key lifecycle: generation, storage, expiry, cache
- Reason about the timeout failure mode and the recovery sweep
- Design the ledger as append-only (double-entry, not balance update)
- Separate the PSP connector from core payment logic
- Draw the data model with correct indexes
- Explain why amount is stored as integer cents, not float dollars
- L5 designs the payment flow; L6 designs end-to-end correctness including reconciliation
- L5 mentions sharding; L6 explains the shard key choice and its implications for hot merchants
- L5 knows idempotency keys need to be in the same DB txn as the ledger write; L6 can trace what breaks if they aren't
L6: Staff SDE bar: own it end-to-end ›
- Design the reconciliation pipeline: CDC → Kafka → OLAP → mismatch queue
- Reason through the distributed transaction problem (idempotency key + ledger write atomicity)
- Explain merchant-based sharding and the hot-shard mitigation
- Design multi-PSP routing with health-score-based fallback
- Address PCI DSS scoping via tokenisation; explain why narrow scope matters
- Propose a CQRS read model for balance queries
- L6 designs for one region; L7 designs for global consistency
- L6 knows reconciliation exists; L7 designs the SLA and mismatch classification schema
- L6 doesn't engage with regulatory isolation across geographies (GDPR, PSD2, RBI)
- L6 doesn't reason about FX rate locking and multi-currency ledger design
L7/L8: Principal / Distinguished bar: should we build this, and how? ›
- Multi-currency ledger: FX rate locking at charge time; currency-segregated entries
- Global consistency: Spanner or CockroachDB for cross-region ledger; tradeoff vs regional sharding
- Regulatory isolation: GDPR (EU data residency), PSD2 (open banking), RBI (India localisation) — each drives architectural constraints
- Fraud pipeline integration: real-time feature scoring on charge path (adds ~20 ms); async model refresh via streaming features
- Settlement architecture: netting across merchants before bank transfer; T+1 vs T+2 settlement cycles
- Builds the case for whether the company should build vs embed a PSP
- Gets lost in technical depth without connecting to business/regulatory drivers
- Doesn't address the build-vs-buy question for payment infrastructure
- Treats all failure modes as engineering problems, not risk management problems
- Doesn't acknowledge that card network rules (Visa/Mastercard) constrain what the system can legally do
Classic probes — level differentiated
| Question | L3/L4 | L5/L6 | L7/L8 |
|---|---|---|---|
| How do you prevent double charges? | Use a unique transaction ID; don't charge twice | Idempotency key stored atomically with ledger entry in a single ACID txn; SELECT FOR UPDATE on DB lookup | Full key lifecycle: generation, Redis hot cache, DB source of truth, expiry policy, cross-region key replication for global merchants |
| What happens if the PSP call times out? | Retry with a new request | Never retry blindly: the first request may have succeeded. Leave status=processing, query PSP for outcome via recovery sweep. Classify transient vs hard failures. | Plus: PSP-side idempotency keys (Stripe's own idempotency header), multi-PSP fallback with health scoring, timeout budget management across the full request chain |
| How do you design the ledger? | A transactions table with rows per payment | Append-only double-entry ledger (debit + credit per transaction); amount as integer cents; balance derived by summation; reconciliation via SUM(credits) = SUM(debits) | Plus: multi-currency entries with locked FX rate; currency-segregated balance views; CQRS read model for running balances; event sourcing for full audit trail reconstruction |
| How do you scale the system to 1 M TPS? | Add more servers; use a CDN | Shard the ledger DB by merchant_id; evaluate Spanner for global consistency; separate analytics from write path via CDC. Explain why UUID v4 is bad for pagination at this scale. | Plus: evaluate direct card network integration (acquiring bank membership) to eliminate PSP latency; design global settlement netting; reason through the capital reserve implications of holding merchant balances |
Related system design posts
- Rate Limiter System Design: atomic Redis operations, distributed race conditions, and multi-tier quota enforcement
- URL Shortener System Design: hash encoding tradeoffs, database sharding strategies, and viral key mitigation
- Web Crawler System Design: Bloom filter deduplication, politeness throttling, and distributed frontier design
- Twitter/X Feed System Design: fan-out write amplification, hybrid push/pull strategy, and celebrity threshold design
- Notification Service System Design: multi-channel delivery, idempotency keys, and priority queues at scale
- Search Autocomplete System Design: Trie data structures, prefix caching, and read-heavy scale strategies
- Key-Value Store System Design: consistent hashing, quorum consensus, and SSTable fundamentals
- Chat System (WhatsApp) System Design: WebSocket management, transient vs persistent storage, and read receipts
- Video Streaming (YouTube) System Design: ABR streaming, CDN distribution, and metadata management
- Distributed Message Queue System Design: Kafka partition tuning, exactly-once delivery, and geo-replication
- File Storage (Dropbox / Google Drive) System Design: chunking, delta sync, conflict resolution, and global deduplication
- Ride-Sharing System Design (Uber / Lyft): geohashing, WebSocket-driven location tracking, and ETA prediction
- Top-K Leaderboard System Design: Redis sorted sets, approximate counting, and stream aggregation
- Airbnb Booking & Reservation System: inventory locks, double-booking prevention, and async Elasticsearch sync
- Photo-Sharing Feed System Design: image pipelines, CDN delivery, and social graph scaling
- Proximity Search System Design (Yelp / Google Places): geohash indexing, quadtree partitioning, and Bayesian review ranking
- Online Judge System Design: secure sandboxing, execution queues, and worker scaling