System Design Interview Guide

Design a Payment Processing System

Simple to describe at checkout. Brutally hard to make correct at scale: idempotency, ledger consistency, and zero double-charges.

L3/L4: working flow · L5/L6: tradeoffs & guarantees · L7/L8: correctness at global scale

~40 min read · Stripe / PayPal · Covers Sections 1–11

01

What the interviewer is testing

Payment systems are the canonical test of whether a candidate understands correctness at scale, not just throughput. Most distributed systems happily trade consistency for availability; payments cannot. A double-charge or a lost transaction is a regulatory event, a trust event, and sometimes a legal event.

The interview question is deceptively simple: "design Stripe." What examiners are really probing is whether you grasp that payments are state machines with external actors (card networks, banks) that cannot be rolled back. The hard problems — idempotency, exactly-once execution, ledger correctness, reconciliation — are invisible until you've reasoned about failure modes explicitly.

Level expectations
L3/L4: Sketch a working payment flow (customer → API → processor → bank). Know what an idempotency key is. Identify the need for ACID storage.
L5: Explain exactly-once semantics across retries. Design the ledger model. Reason about the PSP (payment service provider) as an external system with its own failure modes.
L6: Own the consistency model end-to-end: idempotency key lifecycle, distributed transaction across ledger + payment state, reconciliation pipeline design.
L7/L8: Design for global scale: cross-region consistency, multi-currency ledger, regulatory isolation (PCI DSS scoping), fraud pipeline integration, settlement SLAs.
⚠️

The single most common mistake: designing the payment flow like a standard CRUD service. A payment process crosses an external system boundary (the card network) that you cannot control, inspect, or roll back. Your design must account for this from the start.

02

Requirements clarification

Functional requirements

Requirement | Notes
Initiate a payment | Card-on-file or tokenised card. Async confirmation via webhook.
Payment status enquiry | Poll or webhook-driven status updates (pending → succeeded / failed).
Refund a payment | Full or partial refund. Referential integrity with original charge.
Merchant payout | Settlement to merchant bank account. Timing depends on agreement and jurisdiction: typically T+1 or T+2 business days. New merchants may carry a rolling reserve. The nightly batch job (§9) handles this cadence; it is not a real-time transfer.
Transaction history | Per-customer and per-merchant listing, paginated.
Webhook delivery | At-least-once delivery with retry to merchant endpoints.

Non-functional requirements

NFR | Target | Reasoning
Exactly-once charge | Zero preventable double-charges; minimised in all failure windows | Double-charging is a financial and legal incident. Duplicate execution may occur in narrow failure windows; the system must detect and suppress it. Idempotency keys enforce this in steady state; the recovery sweep covers crash windows.
Availability | 99.99% (52 min/year downtime) | Every minute of payment downtime is lost GMV. Multi-region active-active deployment required.
Charge latency | p99 < 3 s (sync), < 30 s (async) | Processor round-trip dominates; our system must not add more than 200 ms of internal latency.
Consistency | Strong consistency for ledger writes | Ledger entries must reflect real money movement. Eventual consistency is acceptable for analytics and dashboards.
Data durability | Zero data loss (ledger) | A lost transaction entry is a financial discrepancy. WAL-based storage with synchronous replication required.
Compliance | PCI DSS Level 1 | Card data must not touch our servers. Use tokenisation via a vault service; scope stays narrow.

Architectural rationale for NFRs

Exactly-once semantics · correctness · §5

Exactly-once in a distributed system means: the effect of an operation is applied exactly once, even if the request is delivered more than once. For payments this is non-negotiable — a customer charged twice will dispute both transactions, triggering chargebacks and merchant penalties.

The implementation is idempotency keys: the client generates a UUID per payment attempt and sends it as a header. The server stores the key → result mapping atomically with the ledger entry. Any replay returns the stored result. This requires the key store and ledger write to be in the same ACID transaction.

Tradeoff: Idempotency keys require persistent storage, which adds a DB read on every payment initiation. This is acceptable because the alternative — a double-charge — has unbounded downside. Keys are typically expired after 24–72 hours to bound storage growth.
Drives: IdempotencyKey table · atomic ledger write · ACID DB requirement
Strong consistency for ledger · correctness · §7

A balance derived from an eventually consistent view could show incorrect available funds, enabling over-spending. Ledger writes must be linearisable: each entry reflects an authoritative money movement, and reads must see all prior writes in sequence.

This drives the choice of a relational database with synchronous replication (not async replica lag). In practice, Stripe uses a sharded MySQL cluster with synchronous secondary; payments go to the shard owning that merchant account.

Tradeoff: Strong consistency sacrifices availability during network partitions (CAP theorem). For payments this is the correct trade: a paused payment is recoverable; an incorrect balance is not.
Drives: relational DB (ACID) · synchronous replication · no eventual-consistency shortcuts in ledger path
PCI DSS compliance scoping · security · §4

PCI DSS Level 1 mandates an annual on-site audit for any system that stores, processes, or transmits raw card data. The smallest-scope design keeps raw card data out of our systems entirely by delegating card collection to a client-side tokenisation SDK (Stripe.js, Braintree SDK). The client vault returns a single-use payment method token; our backend only ever sees the token.

Tradeoff: Delegating tokenisation to a third-party vault means we depend on that vendor's availability for payment initiation. Mitigate by supporting multiple vault providers with fallback routing.
Drives: token vault (external) · no raw card string storage · narrow PCI scope
Availability: 99.99% · reliability · §4 §9

99.99% availability = roughly 52 minutes of downtime per year. Achieving this requires multi-region active-active deployment with automatic failover. No single component — not the DB, not the payment API, not the PSP connector — can be a single point of failure.

Tradeoff: Active-active with strong consistency is the hardest operational mode. The alternative (active-passive) gives simpler consistency but sacrifices availability during failover windows of 1–5 minutes, which already violates the SLA.
Drives: multi-region deployment · HSM-based key management per region · global DB (Spanner / CockroachDB) at L7+
03

Capacity estimation

Payment workloads are write-heavy on the ledger and read-heavy on dashboards. The write path (charge, refund, payout) is latency-sensitive and correctness-critical. The read path (transaction history, analytics) can tolerate slight staleness and is served from read replicas or a separate analytics store.

Four dimensions matter: transaction volume (writes/day), read demand, ledger row size, and retention. A worked example with mid-size defaults (10 M transactions/day, 10:1 read:write ratio, 500 B per ledger row, 7-year retention):

Write QPS: 10 M / 86,400 s ≈ 116 transactions/sec
Read QPS: 116 × 10 ≈ 1,160 reads/sec
Peak write QPS: 116 × 3 ≈ 350 req/sec (3× peak factor)
Ledger storage: 10 M/day × 500 B × 365 × 7 ≈ 12.8 TB total over retention
Cache size (hot 5% of a day's transactions): 500 K × 500 B ≈ 250 MB of recent txns in-memory
Total rows (ledger): 10 M × 365 × 7 ≈ 25.6 billion entries over retention
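The same arithmetic as a runnable sketch; the estimateCapacity name, the 3× peak factor, and the 5% hot set are assumptions mirroring the defaults above:

// Back-of-envelope capacity estimator (sketch).
function estimateCapacity({ txnsPerDay, readRatio, rowBytes, retentionYears }) {
  const writeQps = txnsPerDay / 86_400;              // seconds per day
  const days = 365 * retentionYears;
  return {
    writeQps: Math.round(writeQps),                  // ≈ 116
    readQps: Math.round(writeQps * readRatio),       // ≈ 1,160
    peakWriteQps: Math.round(writeQps * 3),          // 3× peak factor ≈ 350
    ledgerTB: (txnsPerDay * rowBytes * days) / 1e12, // ≈ 12.8 TB
    cacheMB: (txnsPerDay * 0.05 * rowBytes) / 1e6,   // hot 5% ≈ 250 MB
    totalRowsBillion: (txnsPerDay * days) / 1e9,     // ≈ 25.6 B entries
  };
}

console.log(estimateCapacity({ txnsPerDay: 10e6, readRatio: 10, rowBytes: 500, retentionYears: 7 }));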
💡

A 10 M transactions/day load (~116 write QPS) is trivially handled by a single PostgreSQL primary. The complexity of a payment system at this scale comes entirely from enforcing idempotency and preventing dropped writes (SELECT FOR UPDATE + INSERT + UPDATE with WAL-synced commits). These are correctness problems, not scale problems, until you exceed roughly 100 K TPS — a Visa-level load.

04

High-level architecture

[Diagram: the client (browser / mobile) tokenises card data via the Token Vault (Stripe.js / SDK), then calls the API Gateway + Load Balancer, which routes to the Payment Service (idempotency, state machine). The Payment Service writes to the Ledger DB (PostgreSQL / Spanner) and calls the PSP Connector (Stripe / Adyen / Braintree). Async paths feed the Webhook Dispatcher (at-least-once) and the Analytics Pipeline (reconciliation, fraud); Redis caches idempotency keys.]
Figure 1 — High-level payment system architecture. Sync paths: solid arrows. Async paths: dashed arrows.

Component breakdown

Token Vault (Stripe.js / Braintree SDK) collects raw card data in the client's browser before it ever reaches our servers. It returns a single-use payment method token, keeping our backend out of PCI scope entirely.

API Gateway + Load Balancer terminates TLS, validates API keys or OAuth tokens, enforces per-merchant rate limits (see Rate Limiter Design), and routes traffic to the payment service fleet. It is also where idempotency key headers are validated for basic format before reaching business logic.

Payment Service is the core business logic layer. It owns the payment state machine (created → processing → succeeded/failed/refunded), enforces idempotency, calls the PSP connector, and writes atomically to the ledger. This service is stateless — all state lives in the ledger DB.

PSP Connector wraps the external payment service provider (e.g. Stripe, Adyen, Braintree). It translates internal payment intents into processor-specific API calls, normalises responses, and handles processor-specific retry semantics. Multiple connectors let us route by currency, geography, or cost.

Ledger DB is the authoritative source of truth for all money movement. It stores both the payment record and idempotency key results in a single ACID transaction. A relational database (PostgreSQL, MySQL, or at Google scale a globally distributed relational database like Spanner) is mandatory here — no eventual consistency shortcuts.

Webhook Dispatcher delivers asynchronous status updates (payment succeeded, refund processed) to merchant endpoints. At-least-once delivery with exponential back-off retry; merchants must be idempotent receivers. For deep dives into asynchronous delivery, see our Notification Service Design post.
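A sketch of that retry schedule as a pure delay function. The 1 s base and 24 h cap come from the failure table in §10; the jitter term is an added assumption to avoid synchronised redelivery storms:

// Delay before redelivery attempt N (0-indexed): 1s, 2s, 4s, ..., capped at 24h.
function webhookRetryDelayMs(attempt) {
  const baseMs = 1_000;
  const capMs = 24 * 60 * 60 * 1_000;
  const backoff = Math.min(baseMs * 2 ** attempt, capMs);
  return backoff / 2 + Math.random() * (backoff / 2); // jitter: an assumption, not in the spec
}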

Analytics Pipeline streams ledger events (via a change-data-capture stream like Debezium → Kafka) into an analytics store and a reconciliation engine. Separated from the hot path to avoid latency coupling.

Cache (Redis) stores recently resolved idempotency keys for fast hot-path deduplication, reducing ledger DB reads on retried requests. Cache is a performance optimisation only: the canonical truth remains the DB.

Architectural rationale

Why a dedicated Ledger DB instead of a general payments table? · data model · §7

A ledger is append-only by design — no UPDATE on balance fields, only INSERT of new debit/credit entries. This makes the system naturally idempotent (an inserted entry for an idempotency key is the proof of execution), fully auditable, and recoverable from any snapshot. Updating a balance field in place would require read-modify-write, which needs locks and is vulnerable to lost updates under concurrency.

Tradeoff: Balance queries require summing ledger entries rather than reading a single field — slower for dashboards. Mitigate with a materialised view or a CQRS read model that maintains a running balance updated asynchronously.
Alternatives: event sourcing · account balance field
Why separate the PSP Connector from the Payment Service? · modularity · §6

Payment processors have wildly different APIs, error codes, and retry semantics. The connector layer absorbs this complexity, presenting a normalised interface to the payment service. Swapping processors (e.g. adding Adyen as a fallback) or routing by geography doesn't touch the core business logic.

Tradeoff: An extra network hop (payment service → connector service) adds ~5–10 ms latency inside our system. This is negligible relative to the 1–2 s PSP round-trip, but becomes relevant if running as a separate microservice rather than a library call.
Alternatives: inline SDK call · multi-PSP router
Why async analytics separated from the write path? · observability · §9

Reconciliation, fraud analysis, and chargebacks all require batch or near-real-time analytics over transaction data. Coupling this to the hot write path would add latency and create a blast radius if the analytics system falls behind. Change-data-capture (CDC) on the ledger DB ensures analytics always sees committed writes without polling or impacting write throughput.

Drives: Debezium → Kafka pipeline · separate OLAP store (ClickHouse)

Real-world comparison

Decision | This design | Stripe | PayPal
Core DB | PostgreSQL / Spanner | Sharded MySQL (Vitess) | Oracle (migrating to PostgreSQL)
Idempotency | Key table + Redis cache | Idempotency-Key header, stored in MySQL | PayPal-Request-Id header
Ledger model | Append-only double-entry | Double-entry ledger | Account balance + transaction log
PSP layer | Connector service | Stripe IS the processor (direct network integration) | Braintree gateway + PayPal processor
Analytics | CDC → Kafka → ClickHouse | Kafka → Druid | Kafka → Hadoop + Flink
Reconciliation | Nightly batch + real-time CDC | Nightly batch against network settlement files | Nightly batch + dispute management system
💡

No single architecture fits every payment product. Stripe built its own card network integration (direct acquiring) to control the full stack — most companies cannot. The right design follows from what you own: merchant aggregator (Stripe-like), marketplace (PayPal-like), or in-house payment team for a single retailer (Amazon-like).

05

Core algorithm — idempotency

The defining algorithmic question for a payment system isn't "how do we charge a card?" — it's "how do we guarantee that we charge it exactly once, even if the network drops, the server crashes mid-request, or the client retries three times?" This is the idempotency problem.

There are two primary approaches for enforcing exactly-once execution at the API layer. The choice between them determines how your server handles concurrent retries: the hardest case.

[Flow diagram. ① Idempotency key (recommended): client sends Idempotency-Key header → check Redis cache for the key (fast path) → check DB (source of truth) with SELECT FOR UPDATE → execute and store the result atomically in the same ACID txn. ② Distributed locking: client sends payment request → acquire distributed lock (Redis SETNX / Redlock) → execute under lock → release lock, after which a second request can execute (unsafe!).]
Figure 2 — Idempotency key approach (①, recommended) vs distributed locking (②). ① returns the cached result to any duplicate. ② releases the lock after execution — a second request can execute.

Our choice for this system: idempotency keys. Distributed locking prevents concurrent execution but not sequential duplicates — if the lock is released before the client receives the response, a new request can re-execute. Idempotency keys solve both: they record the result of the first execution and return it to any subsequent request, regardless of timing.

Idempotency key implementation: correct two-phase pattern · implementation detail

Key rule: the PSP call must happen outside any open DB transaction. Holding a transaction open during a 1–2 s network call causes lock contention, connection pool starvation, and transaction timeouts. The correct pattern uses two short transactions around the external call:

async function processPayment(req) {
  const key = req.headers['idempotency-key']; // Node lowercases incoming header names

  // Phase 1 -- fast dedup check (no txn needed for read)
  const cached = await redis.get(`idem:${key}`);
  if (cached) return JSON.parse(cached); // hot path exits here

  // Phase 2 -- short txn: check DB + reserve the slot (~5 ms held)
  const existing = await db.transaction(async (txn) => {
    const row = await txn.query(
      'SELECT status, psp_ref FROM payments WHERE idempotency_key=$1 FOR UPDATE', [key]
    );
    if (row.rows.length > 0) {
      if (row.rows[0].status === 'processing') throw new Error('Payment in flight'); // concurrent retry — caller retries after delay
      return row.rows[0]; // completed payment — return stored result
    }
    // Reserve slot -- marks in-flight; a UNIQUE index on idempotency_key makes
    // racing first-time inserts fail rather than duplicate
    await txn.query(
      'INSERT INTO payments(id,idempotency_key,status) VALUES(gen_random_uuid(),$1,$2)',
      [key, 'processing']
    );
    return null; // null = new payment, proceed
  }); // txn commits here; lock released

  if (existing) return existing; // duplicate detected via DB

  // Phase 3 -- PSP call OUTSIDE any transaction (1-2 s, no lock held)
  const result = await chargeCard(req.body);

  // Phase 4 -- short txn: commit result + ledger entry atomically (~5 ms held)
  await db.transaction(async (txn) => {
    await txn.query(
      'UPDATE payments SET status=$1, psp_ref=$2 WHERE idempotency_key=$3',
      [result.status, result.psp_ref, key]
    );
    await txn.query(
      'INSERT INTO idempotency_keys(key, result, created_at) VALUES($1,$2,NOW())',
      [key, JSON.stringify(result)]
    );
    await txn.query('INSERT INTO ledger_entries(...) VALUES(...)');
  });

  // Phase 5 -- populate Redis cache for future retries
  await redis.setex(`idem:${key}`, 86400, JSON.stringify(result));
  return result;
}
Why two transactions? Transaction 1 holds a lock for ~5 ms (one INSERT) then releases. The PSP call runs in open air: no DB connection held, no lock contention. Transaction 2 commits the result for ~5 ms. If the server crashes between Phases 3 and 4, the payment row sits in status=processing. The recovery sweep detects and resolves this by querying the PSP directly.
🎯

One open question: What happens if the server crashes after calling the PSP but before writing to the DB? The money moved but there's no ledger entry. Recovery: a background job periodically scans for in-flight transactions (older than 2× the PSP’s timeout SLA — typically 30–60 s) with status = 'processing' and queries the PSP for their actual outcome via payment reference. This "recovery sweep" is mandatory in any production payment system.
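A minimal sketch of that sweep, assuming the §5 payments table and a hypothetical psp.getChargeByReference lookup on the connector:

// Recovery sweep (sketch): resolve payments stuck in 'processing' by asking the
// PSP for the authoritative outcome. Scheduled periodically (e.g. every 30 s).
async function recoverySweep() {
  const stuck = await db.query(
    `SELECT id FROM payments
      WHERE status = 'processing'
        AND created_at < NOW() - INTERVAL '60 seconds'` // ~2x the PSP timeout SLA
  );
  for (const { id } of stuck.rows) {
    // Hypothetical connector call: look up the charge by our payment reference.
    const outcome = await psp.getChargeByReference(id);
    // If the PSP has no record, the charge never executed: safe to mark failed.
    const status = outcome ? outcome.status : 'failed';
    await db.transaction(async (txn) => {
      await txn.query('UPDATE payments SET status=$1, psp_ref=$2 WHERE id=$3',
        [status, outcome ? outcome.psp_ref : null, id]);
      if (status === 'succeeded') {
        await txn.query('INSERT INTO ledger_entries(...) VALUES(...)'); // elided, as in §5
      }
    });
  }
}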

5b

API design

The API surface for a payment system is deliberately minimal — fewer endpoints means fewer attack vectors and less surface for idempotency bugs. The two core endpoints handle initiation and retrieval. Everything else is additive.

POST /v1/payments — initiate a payment

Creates a payment intent. Returns immediately with status: "processing"; confirmation arrives via webhook. Requires Idempotency-Key header — requests without one are rejected with 400 Bad Request.

// Request
POST /v1/payments
Authorization: Bearer sk_live_...
Idempotency-Key: a8f3c2d1-4b5e-6789-abcd-ef0123456789
Content-Type: application/json

{
  "amount": 4999,              // in smallest currency unit (cents)
  "currency": "usd",
  "payment_method": "pm_1abc...", // token from vault, not raw card data
  "merchant_id": "merch_xyz",
  "metadata": { "order_id": "ord_9988" }
}

// Response — 201 Created
{
  "id": "pay_7f3e...",
  "status": "processing",     // terminal: succeeded | failed | refunded
  "amount": 4999,
  "currency": "usd",
  "created_at": "2026-04-20T20:00:00Z",
  "idempotency_key": "a8f3c2d1-..."
}

Input validation: amount must be a positive integer; currency must be a valid ISO 4217 code; payment_method token must be unexpired and belong to the calling merchant; metadata keys must not contain PII. Reject requests failing any of these with descriptive 4xx errors — never a silent 200.
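A sketch of those checks; the ISO 4217 set is abbreviated and vault.lookupToken is a hypothetical helper:

// Validation for POST /v1/payments (sketch). Returns a list of error strings.
const ISO_4217 = new Set(['usd', 'eur', 'gbp', 'jpy']); // abbreviated for the sketch

async function validatePaymentRequest(body, merchantId) {
  const errors = [];
  if (!Number.isInteger(body.amount) || body.amount <= 0)
    errors.push('amount must be a positive integer in the smallest currency unit');
  if (!ISO_4217.has(body.currency))
    errors.push('currency must be a valid ISO 4217 code');
  const pm = await vault.lookupToken(body.payment_method); // hypothetical helper
  if (!pm || pm.expired || pm.merchant_id !== merchantId)
    errors.push('payment_method must be an unexpired token owned by the caller');
  return errors; // non-empty => respond 4xx with details, never a silent 200
}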

GET /v1/payments/:id — retrieve payment status

// Request
GET /v1/payments/pay_7f3e
Authorization: Bearer sk_live_...

// Response — 200 OK
{
  "id": "pay_7f3e...",
  "status": "succeeded",
  "amount": 4999,
  "currency": "usd",
  "processor_response": { "network_txn_id": "visa_abc123" },
  "settled_at": "2026-04-21T01:00:00Z"
}

Optional endpoints by level

Endpoint | Purpose | Level
POST /v1/refunds | Full or partial refund of a payment. References original payment ID. Requires its own Idempotency-Key — separate from the original charge key. A retried refund request with the same key returns the cached refund result without re-executing. Never reuse the charge's idempotency key for its refund. | L3/L4
GET /v1/payments?merchant_id=&page=&limit= | Paginated transaction history with cursor-based pagination. | L5
POST /v1/webhooks/endpoint | Register a merchant webhook URL. Validated by sending a test event. | L5
POST /v1/payouts | Trigger merchant payout to bank account. Requires payout schedule or manual trigger. | L6
GET /v1/balance | Merchant available balance derived from settled transaction sum. Served from CQRS read model (not the primary). L7/L8 concern: multi-currency balance requires currency-segregated ledger partitions with locked FX rates at settlement time — not just summing all entries. | L5 (L7/L8 for multi-currency)
06

Core flow — charge a card

The payment flow crosses an external system boundary — that's what makes it categorically different from most CRUD operations. Our system must handle the case where we successfully submitted a charge request to the PSP, but never received a response. We cannot determine from our own logs whether money moved. This uncertainty is the reason for the state machine and recovery sweep.

[Flow diagram. ① Receive POST /v1/payments → ② idempotency key check (Redis cache, then DB SELECT FOR UPDATE); a cache hit returns the cached result → ③ insert payment row with status=processing (Ledger DB, ACID txn) → ④ call the PSP Connector (Stripe / Adyen charge API, ~1–2 s round-trip) → ⑤ on success, update status → succeeded; on failure, update status → failed; on timeout, leave status=processing and let the recovery sweep query the PSP for the outcome.]
Figure 3 — Charge a card: core flow. The timeout branch is the dangerous case — money may have moved but we have no confirmation.

The key tradeoff in this flow, referencing the exactly-once NFR from §2, is how to handle the timeout branch. Two strategies:

Recovery sweep (recommended)
  • Leave payment in "processing" state
  • Background job reconciles with PSP after 30 s/1 min
  • One authoritative outcome per payment
  • No risk of double-charge
Client-side retry (dangerous)
  • Client retries without idempotency key enforcement
  • If PSP processed the first request, retry causes double-charge
  • Simple to implement, catastrophic failure mode
  • Only safe with strict server-side idempotency
🛡️

Fraud scoring sits at step ③: between the payment row INSERT and the PSP call. A real-time ML scoring service (~20–50 ms budget) returns a risk score. Low-risk charges proceed to the PSP immediately. High-risk charges are held for manual review (status=under_review). Above-threshold charges are blocked with a 402 Payment Required response and no PSP call is made — preventing chargebacks before they happen. L7/L8
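As a sketch, the decision at step ③ reduces to two thresholds; the numeric cut-offs here are illustrative assumptions, not from the text:

// Risk routing at step 3 (sketch). Thresholds are illustrative, tuned per merchant.
function routeByRiskScore(score) {
  if (score < 0.7) return { action: 'charge' };                       // low risk: proceed to PSP
  if (score < 0.9) return { action: 'hold', status: 'under_review' }; // manual review
  return { action: 'block', httpStatus: 402 };                        // no PSP call made
}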

07

Data model

Before writing a single column name, it helps to identify the entities and how they get used. A payment system has four main entities — each accessed very differently.

Entities and access patterns

Entity | Operation | Frequency | Query shape
Payment | Create (charge) | Very high | Point write by merchant_id + idempotency key
Payment | Status lookup | High | Point read by payment_id
Payment | Merchant history | Medium | Range scan by merchant_id, sorted by created_at DESC
Ledger Entry | Create (debit/credit) | Very high | Append-only insert, paired with payment write in same txn
Ledger Entry | Balance calculation | Medium | Aggregate SUM by merchant_id and currency
Idempotency Key | Lookup on retry | Low-medium | Point read by key string
Merchant | Auth / rate limit | Very high | Point read by API key (cached)
Webhook Endpoint | Delivery lookup | Medium | Point read by merchant_id

Two things jump out from these access patterns. First, payments are accessed almost exclusively by payment_id (point reads) or by merchant_id + created_at (range scans) — these become primary key and secondary index. Second, ledger entries are append-only with aggregate queries by merchant — a strong hint that balance computation belongs in a materialised view or a separate read model, not in the hot write path.

payments: id uuid PK · merchant_id uuid FK · amount bigint · currency char(3) · status enum · idempotency_key text · psp_ref text · created_at timestamptz · index on (merchant_id, created_at)
ledger_entries: id bigserial PK · payment_id uuid FK · merchant_id uuid · type enum (debit|credit) · amount bigint · currency char(3) · created_at timestamptz
idempotency_keys: key text PK · result jsonb · created_at timestamptz · expires_at timestamptz
merchants: id uuid PK · name text · api_key_hash text · payout_account text · created_at timestamptz
Figure 4 — Core schema. FK relationships shown as dashed arrows. Ledger entries are append-only — no UPDATE paths on this table.

Field-level rationale

amount as bigint (not float) · correctness

Floating-point arithmetic is lossy. $49.99 has no exact binary floating-point representation (the nearest double is 49.990000000000002), so summing thousands of transactions produces accumulated rounding errors. The industry standard is to store amounts as integers in the smallest currency unit (cents for USD, pence for GBP, yen for JPY). All arithmetic is exact integer arithmetic; display formatting applies the decimal on output.

Gotcha: Some currencies (JPY, KRW) have no minor units — 1 yen IS the smallest unit. Your schema should handle this by storing amounts in the actual smallest unit, which is documented per-currency via ISO 4217.
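The failure is easy to demonstrate in any IEEE 754 runtime; a self-contained illustration:

// Binary floats cannot represent most decimal fractions exactly.
console.log(0.1 + 0.2);          // 0.30000000000000004
console.log(0.1 + 0.2 === 0.3);  // false

// Store integer minor units instead: arithmetic stays exact.
const amountCents = 4999;              // $49.99
const total = amountCents * 3;         // 14997, exact
console.log((total / 100).toFixed(2)); // "149.97": formatting applied on output only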
ledger_entries.type as enum (debit | credit) · double-entry bookkeeping

Double-entry bookkeeping requires every transaction to have a matching debit and credit. A customer card payment of $49.99 creates two entries in our ledger: a debit to our receivables account (funds incoming from the card network) and a credit to the merchant payable account (funds we owe the merchant). The customer's own bank account is debited by the issuing bank, not by us. Representing both as rows with a type column lets you verify correctness by asserting that SUM(credits) = SUM(debits) across any closed time window — this is the reconciliation invariant.

Alternatives: signed amount (negative = debit) · separate debit/credit tables
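A sketch of the paired write and the invariant check, assuming the §7 schema; RECEIVABLES_ACCOUNT is a hypothetical internal account ID reusing the merchant_id column:

// One charge produces a matching debit and credit inside one ACID transaction.
const RECEIVABLES_ACCOUNT = '00000000-0000-0000-0000-000000000001'; // assumption

async function writeDoubleEntry(txn, payment) {
  const sql = `INSERT INTO ledger_entries(payment_id, merchant_id, type, amount, currency)
               VALUES ($1, $2, $3, $4, $5)`;
  // Debit receivables: funds incoming from the card network.
  await txn.query(sql, [payment.id, RECEIVABLES_ACCOUNT, 'debit', payment.amount, payment.currency]);
  // Credit merchant payable: funds we owe the merchant.
  await txn.query(sql, [payment.id, payment.merchant_id, 'credit', payment.amount, payment.currency]);
}

// Reconciliation invariant over any closed window: SUM(credits) = SUM(debits).
const INVARIANT_SQL = `
  SELECT SUM(amount) FILTER (WHERE type = 'credit')
       = SUM(amount) FILTER (WHERE type = 'debit') AS balanced
    FROM ledger_entries
   WHERE created_at >= $1 AND created_at < $2`;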
idempotency_keys.expires_at: key expiry · storage management

Idempotency keys are only useful during a retry window — typically 24–72 hours after the original request. After that, the same key from the same client would represent an accidental reuse, not a legitimate retry. Expiring keys frees storage and prevents false deduplication across separate payment attempts that happen to share a key by programmer error.

ledger_entries.id: bigserial vs distributed ID · scalability · §9

The schema uses bigserial (PostgreSQL auto-increment sequence) for ledger entry IDs. On a single-node deployment this is correct — sequential IDs are insert-ordered, which aligns with the append-only access pattern. However: bigserial requires a central sequence generator. In a sharded deployment (§9), each shard needs to generate globally unique IDs independently.

At shard scale: Replace bigserial with a Snowflake-style distributed ID: [41-bit timestamp | 10-bit node ID | 12-bit sequence]. This generates unique, roughly-time-ordered IDs on each shard node with no coordination. Stripe and Twitter both use variants of this approach. The time-ordering property preserves sortability within a shard, though cross-shard ordering requires the timestamp component.
Alternatives: UUID v7 (time-ordered) · ULID · Snowflake ID
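A minimal single-threaded sketch of the 41/10/12 layout using BigInt; the custom epoch value is an arbitrary assumption:

// Snowflake-style ID (sketch): [41-bit ms timestamp | 10-bit node | 12-bit sequence].
const EPOCH = 1_577_836_800_000n; // 2020-01-01T00:00:00Z, an arbitrary custom epoch
let lastMs = -1n;
let seq = 0n;

function nextLedgerId(nodeId) { // nodeId in 0..1023
  let now = BigInt(Date.now()) - EPOCH;
  if (now === lastMs) {
    seq = (seq + 1n) & 0xFFFn;          // 12-bit sequence: 4096 IDs per ms per node
    if (seq === 0n) {                   // sequence exhausted: spin to the next ms
      while (now <= lastMs) now = BigInt(Date.now()) - EPOCH;
    }
  } else {
    seq = 0n;
  }
  lastMs = now;
  return (now << 22n) | (BigInt(nodeId) << 12n) | seq;
}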
08

Caching strategy

Caching in a payment system is more constrained than in most systems because correctness cannot be sacrificed for speed. The ledger itself must never be served from a stale cache. But several adjacent data paths benefit from caching without correctness risk: idempotency key lookups, merchant auth/config, and payment status reads.

[Diagram: Client → API Gateway (rate-limit cache) → Payment Service (idempotency-key cache) → Redis Cluster (in-memory cache); the Ledger DB is never served from cache (always fresh); dashboard queries go to a read replica. Layer 1: rate-limit counters (1 min TTL). Layer 2: idempotency keys + config (24 h TTL). Layer 3: read replica (async lag).]
Figure 5 — Cache layers anchored to the §4 request path. The Ledger DB write path is never served from cache (marked ✕). Dashboard reads use a read replica.

Cache hierarchy

Layer | What is cached | TTL | Why this layer exists | Invalidation
Rate limit counters (at API Gateway, in a distributed in-memory cache like Redis) | Per-merchant request count, per-IP abuse counters | 1 min sliding window | Enforcing >1,000 req/s limits without a DB read per request | Expires naturally. DECR on window slide.
Idempotency keys (at Payment Service, in Redis) | Key → serialised result for recently-executed payments | 24–72 h | Avoids a DB read (with FOR UPDATE lock) for every retry. Reduces hot-path latency by ~5 ms. | TTL-based. Never evicted early — a stale result is always correct.
Merchant config (at Payment Service, in Redis) | API key hash → merchant object (name, payout settings, rate limits) | 5 min | Auth check on every request requires a merchant lookup — a hot read with a stable document | Invalidate on merchant config update (cache-aside pattern)
Dashboard / read replica (PostgreSQL async replica or OLAP read model) | Transaction history, balance summaries, analytics | Seconds of replica lag | Offload aggregate queries from the write primary; balance from replica is acceptable for display (not for charging) | Replica lag. Not acceptable for balance-before-charge decisions.
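A sketch of the cache-aside pattern for the merchant-config layer, with the 5-minute TTL from the table; db.queryOne and the key scheme are assumptions:

// Cache-aside: read-through with TTL; explicit invalidation on config update.
async function getMerchantByApiKeyHash(hash) {
  const hit = await redis.get(`merchant:${hash}`);
  if (hit) return JSON.parse(hit);
  const merchant = await db.queryOne(
    'SELECT id, name, payout_account FROM merchants WHERE api_key_hash=$1', [hash]);
  if (merchant) await redis.setex(`merchant:${hash}`, 300, JSON.stringify(merchant)); // 5 min
  return merchant;
}

async function updateMerchantConfig(merchant) {
  await db.query('UPDATE merchants SET payout_account=$1 WHERE id=$2',
    [merchant.payout_account, merchant.id]);
  await redis.del(`merchant:${merchant.api_key_hash}`); // next read repopulates the cache
}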
⚠️

Critical invariant: The ledger write path — INSERT into ledger_entries + payments UPDATE — must always go to the primary DB, never through a cache or read replica. A stale balance from a read replica must never gate a charge decision. Cache misses here are acceptable; cache hits that serve stale payment state are not.

09

Deep-dive scalability

When transaction volume leaps to 100 K TPS (Visa-level), or business needs dictate cross-region consistency, a single-primary architecture hits its limits.

[Diagram: a Global LB with GeoDNS routing fronts two regions (US-EAST and EU-WEST). Each region runs Payment Service instances (3–5), a Redis cluster (idempotency keys, rate limits), a DB primary owning one merchant shard (A–M in US-EAST, N–Z in EU-WEST), and CDC → Kafka feeding the reconciliation pipeline; Kafka replicates cross-region.]
Figure 6 — Production-scale multi-region deployment. Merchant-based sharding pins a merchant's ledger writes to one region, eliminating cross-region consistency overhead on the critical path.
Ledger sharding strategy · L5+ · scalability

The ledger cannot be sharded by transaction ID (random) because merchant-scoped queries would require scatter-gather across all shards. The natural shard key is merchant_id: all of a merchant's ledger entries live on the same shard, making balance queries and history scans single-shard operations.

Shard count should be provisioned at 4–8× expected peak to avoid live resharding. A hash ring (consistent hashing) with virtual nodes allows adding capacity with minimal rebalancing.

Tradeoff: Merchant-based sharding creates hot shards for very large merchants (e.g. Amazon as a Stripe customer). Mitigate with sub-sharding large merchants by transaction date range, or routing to dedicated shard groups.
Alternatives: transaction_id hash shard · global distributed DB (Spanner)
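A compact sketch of merchant_id → shard routing on a hash ring with virtual nodes; the MD5-based hash and the 128-vnode count are assumptions:

// Consistent-hash ring with virtual nodes: merchant_id -> shard.
const crypto = require('crypto');
const h32 = (s) => crypto.createHash('md5').update(s).digest().readUInt32BE(0);

class ShardRing {
  constructor(shards, vnodes = 128) {
    this.ring = shards
      .flatMap((shard) => Array.from({ length: vnodes }, (_, v) => [h32(`${shard}#${v}`), shard]))
      .sort((a, b) => a[0] - b[0]);
  }
  shardFor(merchantId) {
    const h = h32(merchantId);
    // First vnode clockwise of h (linear scan; use binary search in production).
    const entry = this.ring.find(([pos]) => pos >= h) ?? this.ring[0];
    return entry[1];
  }
}

const ring = new ShardRing(['shard-1', 'shard-2', 'shard-3', 'shard-4']);
ring.shardFor('merch_xyz'); // all ledger rows for this merchant land on one shard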
Multi-region consistency model · L6+ · global deployment

Merchant-based sharding achieves regional affinity: a merchant's ledger shard lives in the region closest to them, and their payment service instances route writes to that shard. Cross-region writes happen only for merchants that have cross-region merchant accounts — a small minority. This avoids the need for a globally consistent DB on the critical path.

For truly global merchants (L7+ discussion), a globally distributed relational database (Google Spanner, CockroachDB) provides external consistency across regions at the cost of cross-region round-trip latency on writes (~100–200 ms). This is acceptable for payment settlement but not for real-time charge flow — mitigate by accepting writes locally and committing globally via a two-phase approach.

Options: regional sharding (recommended) · Google Spanner (L7+) · CockroachDB
Distributed payment ID generation · L5+ · uniqueness at scale

Payment IDs must be globally unique, sortable by creation time (for pagination), and generated without a central coordinator (to avoid a bottleneck). Stripe-style IDs (e.g. pay_1abc...) use a prefixed base62 encoding of a Snowflake-style ID: 41-bit timestamp + 10-bit datacenter/machine ID + 12-bit sequence. This generates 4096 unique IDs per millisecond per node, with no coordination required.

Alternatives: UUID v4 (not sortable) · UUID v7 (sortable, good option) · database sequence (bottleneck)
Reconciliation pipeline design · L6+ · data integrity

Reconciliation compares our internal ledger against settlement files from card networks (Visa, Mastercard) and bank ACH files. The pipeline: (1) ingest the settlement file via SFTP/API at the daily cut-off, (2) parse and normalise into a staging table, (3) match by external reference ID + amount + currency, (4) classify mismatches as: missing in our ledger (possible bug), missing in settlement (payment not settled yet), or amount mismatch (fee discrepancy), (5) route mismatches to an alerts queue for ops review.

Change-data-capture (Debezium on the Ledger DB → a durable message queue like Kafka) provides the near-real-time feed for ledger changes. The reconciliation job can also run as a nightly full-scan batch as a safety net.

Tradeoff: Real-time CDC-based reconciliation has lower latency but requires exactly-once message delivery guarantees. Nightly batch is simpler but catches discrepancies up to 24 hours late, which may breach SLA for certain dispute windows.
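Steps (3) and (4) reduce to a matching function. A sketch, assuming settlement rows are already normalised and keyed by external reference:

// Classify one normalised settlement row against our ledger (steps 3-4 above).
function classify(row, ledgerByExternalRef) {
  const ours = ledgerByExternalRef.get(row.external_ref);
  if (!ours) return { kind: 'missing_in_ledger', row };  // possible bug: alert ops
  if (ours.amount !== row.amount || ours.currency !== row.currency)
    return { kind: 'amount_mismatch', row, ours };       // often a fee discrepancy
  return { kind: 'matched' };
}
// Ledger entries with no settlement row are 'not settled yet': expected for
// recent payments, alert-worthy once they age past the settlement SLA.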
PSP multi-routing and fallback · L6+ · reliability

Individual PSPs have their own outages and degraded periods (Stripe had a major incident in 2023). A multi-PSP router maintains health scores per processor (success rate, p95 latency over a 5-minute window) and routes new payments to the healthiest processor for a given currency/card type. On failure, it falls back to the next-best processor — but only for "soft" failures (network timeout, 503), never for "hard" declines (card rejected, fraud block).

Tradeoff: Multi-PSP routing multiplies compliance scope — each processor requires separate PCI DSS certification alignment. For most companies, two processors (primary + one hot standby) is the right tradeoff.
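A sketch of health-scored routing with soft-failure-only fallback; the error-code set and the connector interface are assumptions:

// Route to the healthiest PSP; fall back only on soft failures, never hard declines.
const SOFT_FAILURES = new Set(['timeout', 'http_503', 'connection_reset']); // assumption

function rankByHealth(connectors, currency) {
  return connectors
    .filter((c) => c.supports(currency))
    .sort((a, b) => b.healthScore() - a.healthScore()); // success rate + p95, 5-min window
}

async function chargeWithFallback(connectors, request) {
  for (const psp of rankByHealth(connectors, request.currency)) {
    try {
      return await psp.charge(request);
    } catch (err) {
      if (!SOFT_FAILURES.has(err.code)) throw err; // hard decline: surface it, never re-route
    }
  }
  throw new Error('no healthy processor available');
}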
3DS / Strong Customer Authentication (SCA) · L6+ · EU compliance

EU PSD2 mandates Strong Customer Authentication (SCA) for most consumer card payments. This turns the single synchronous PSP call into a multi-step challenge flow — the cardholder must complete an additional verification (biometric, OTP, or app push) before the charge is authorised. Our PSP connector must handle this as a redirect/callback pattern, not a single HTTP call.

The flow with 3DS2: (1) Payment service initiates the charge intent with the PSP. (2) PSP assesses risk — low-risk transactions may be exempted (frictionless flow). (3) If a challenge is required, the PSP returns a redirect URL. The client sends the cardholder to the issuer's 3DS page. (4) After challenge completion, the PSP posts a webhook to our connector. (5) Connector calls the PSP to confirm authorisation and commits the ledger entry.

Architectural impact: The payment shifts from synchronous (one API call → result) to asynchronous (initiate → wait for PSP webhook → commit). The payment state machine gains two new states: challenge_required and challenge_completed. Payments that time out in challenge_required must be expired, not recovered — no funds have moved at this point (the authorisation was never completed), so the recovery sweep is irrelevant. Expire the intent and allow the customer to retry with a fresh attempt.
SCA exemptions: merchant-initiated transactions (MIT) · subscriptions after first auth · low-value (<€30) · low-risk
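A sketch of the extended state machine; the transition map and the expired state name are assumptions consistent with the description above:

// Payment state machine extended with the 3DS challenge states.
const TRANSITIONS = {
  created:             ['processing'],
  processing:          ['succeeded', 'failed', 'challenge_required'],
  challenge_required:  ['challenge_completed', 'expired'], // timed-out challenges expire
  challenge_completed: ['succeeded', 'failed'],            // confirm auth, then commit ledger
  succeeded:           ['refunded'],
  failed:              [],
  expired:             [],  // no funds moved; the customer simply retries
  refunded:            [],
};

const canTransition = (from, to) => (TRANSITIONS[from] ?? []).includes(to);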
10

Failure modes & edge cases

Scenario | Problem | Solution | Level
Client retries without idempotency key | Server executes duplicate charge because no deduplication key exists | Reject requests without Idempotency-Key header with 400 Bad Request. Never process a payment without one. | L3/L4
Server crash after PSP charge, before DB write | Money moved but no ledger entry. Internal balance is incorrect. | Recovery sweep: background job scans payments in "processing" state older than 2× the PSP timeout SLA (typically 30–60 s), queries PSP for actual outcome via the stored payment reference, and writes the result to the ledger atomically. | L5
PSP timeout — unknown outcome | Payment may have succeeded or failed at the network level. Cannot safely retry. | Never retry on timeout without first querying PSP for outcome via payment reference. Leave status as "processing" until confirmed. See §6. | L5
Redis cache unavailable | Idempotency key cache miss forces every request to hit the DB with SELECT FOR UPDATE. Latency spikes but correctness is preserved. | Design cache as a performance optimisation only. The DB is the source of truth. Alert on p99 latency increase; no correctness incident. Restore Redis from cluster replica. | L5
Ledger DB primary failure | Payment service cannot write. All payment initiations fail. | Synchronous replica promotion via automated failover (e.g. Patroni for PostgreSQL). Target RTO < 30 s. Accept degraded availability during failover window. | L6
Webhook delivery failure | Merchant never receives payment confirmation. Manual reconciliation required. | At-least-once delivery with exponential back-off (1 s → 2 s → 4 s → …, capped at 24 h between attempts). After ~16 retry attempts spanning 72 h of failures, publish to dead-letter queue and alert. Merchant can also poll GET /v1/payments/:id. | L6
Hot shard (large merchant) | One merchant's volume overwhelms its DB shard (write QPS exceeds primary capacity) | Detect via per-shard write QPS metrics. Options: dedicate a shard group to the merchant; sub-shard by transaction date range; upgrade to larger instance class. Avoid: re-hashing live data. | L7/L8
Currency conversion race condition | FX rate applied at charge initiation vs settlement differs. Merchant receives incorrect payout amount. | Record the FX rate and source (provider + timestamp) at the moment of conversion. Payout amount is calculated from the locked rate, not the spot rate at settlement time. Discrepancy reporting in reconciliation pipeline. | L7/L8
Chargeback received | Customer disputes a succeeded charge via their bank. Card network claws back funds from our settlement account — this happens before the merchant is notified. | Create a chargeback ledger entry (debit) linked to the original payment ID. Freeze the disputed amount from the merchant's available balance — the network has already reclaimed these funds from our settlement account, and the freeze prevents the merchant from withdrawing money we no longer hold. Surface to the merchant dashboard for evidence submission within the network's response window (typically 7–20 days). Route to dead-letter queue if evidence deadline is missed. Track dispute.created → evidence_submitted → dispute.resolved_(won|lost) as payment state transitions. | L5/L6
11

How to answer by level

L3/L4: SDE I / SDE II bar: build a working system
What good looks like
  • Sketch the payment flow: Client → API → PSP → response → DB write
  • Know what an idempotency key is and why it's required
  • Choose a relational DB for the ledger; justify with ACID requirement
  • Identify the need for webhook notifications to merchants
  • Address basic failure: what happens if the PSP call fails
What separates L5 from L3/L4
  • L3/L4 treats the PSP call as atomic — L5 knows it isn't
  • L3/L4 doesn't address the server-crash-after-PSP-call scenario
  • L3/L4 conflates "the request succeeded" with "the money moved"
  • L3/L4 doesn't distinguish hard vs soft PSP declines
L5: Senior SDE bar: understand the tradeoffs
What good looks like
  • Design the idempotency key lifecycle: generation, storage, expiry, cache
  • Reason about the timeout failure mode and the recovery sweep
  • Design the ledger as append-only (double-entry, not balance update)
  • Separate the PSP connector from core payment logic
  • Draw the data model with correct indexes
  • Explain why amount is stored as integer cents, not float dollars
What separates L6 from L5
  • L5 designs the payment flow; L6 designs end-to-end correctness including reconciliation
  • L5 mentions sharding; L6 explains the shard key choice and its implications for hot merchants
  • L5 knows idempotency keys need to be in the same DB txn as the ledger write; L6 can trace what breaks if they aren't
L6: Staff SDE bar: own it end-to-end
What good looks like
  • Design the reconciliation pipeline: CDC → Kafka → OLAP → mismatch queue
  • Reason through the distributed transaction problem (idempotency key + ledger write atomicity)
  • Explain merchant-based sharding and the hot-shard mitigation
  • Design multi-PSP routing with health-score-based fallback
  • Address PCI DSS scoping via tokenisation; explain why narrow scope matters
  • Propose a CQRS read model for balance queries
What separates L7 from L6
  • L6 designs for one region; L7 designs for global consistency
  • L6 knows reconciliation exists; L7 designs the SLA and mismatch classification schema
  • L6 doesn't engage with regulatory isolation across geographies (GDPR, PSD2, RBI)
  • L6 doesn't reason about FX rate locking and multi-currency ledger design
L7/L8: Principal / Distinguished bar: should we build this, and how?
What good looks like
  • Multi-currency ledger: FX rate locking at charge time; currency-segregated entries
  • Global consistency: Spanner or CockroachDB for cross-region ledger; tradeoff vs regional sharding
  • Regulatory isolation: GDPR (EU data residency), PSD2 (open banking), RBI (India localisation) — each drives architectural constraints
  • Fraud pipeline integration: real-time feature scoring on charge path (adds ~20 ms); async model refresh via streaming features
  • Settlement architecture: netting across merchants before bank transfer; T+1 vs T+2 settlement cycles
  • Builds the case for whether the company should build vs embed a PSP
Common L7 failure modes
  • Gets lost in technical depth without connecting to business/regulatory drivers
  • Doesn't address the build-vs-buy question for payment infrastructure
  • Treats all failure modes as engineering problems, not risk management problems
  • Doesn't acknowledge that card network rules (Visa/Mastercard) constrain what the system can legally do

Classic probes — level differentiated

Question | L3/L4 | L5/L6 | L7/L8
How do you prevent double charges? | Use a unique transaction ID; don't charge twice | Idempotency key stored atomically with ledger entry in a single ACID txn; SELECT FOR UPDATE on DB lookup | Full key lifecycle: generation, Redis hot cache, DB source of truth, expiry policy, cross-region key replication for global merchants
What happens if the PSP call times out? | Retry with a new request | Never retry blindly: the first request may have succeeded. Leave status=processing, query PSP for outcome via recovery sweep. Classify transient vs hard failures. | Plus: PSP-side idempotency keys (Stripe's own idempotency header), multi-PSP fallback with health scoring, timeout budget management across the full request chain
How do you design the ledger? | A transactions table with rows per payment | Append-only double-entry ledger (debit + credit per transaction); amount as integer cents; balance derived by summation; reconciliation via SUM(credits) = SUM(debits) | Plus: multi-currency entries with locked FX rate; currency-segregated balance views; CQRS read model for running balances; event sourcing for full audit trail reconstruction
How do you scale the system to 1 M TPS? | Add more servers; use a CDN | Shard the ledger DB by merchant_id; evaluate Spanner for global consistency; separate analytics from write path via CDC. Explain why UUID v4 is bad for pagination at this scale. | Plus: evaluate direct card network integration (acquiring bank membership) to eliminate PSP latency; design global settlement netting; reason through the capital reserve implications of holding merchant balances
How the pieces connect
01 Exactly-once NFR (§2) → idempotency key stored atomically with ledger write → ACID relational DB required for ledger (§7) → Redis cache as performance layer only, never as truth (§8)
02 PCI DSS compliance NFR (§2) → tokenisation SDK collects raw card data client-side → backend only sees payment method tokens → PSP connector translates tokens to charges (§4) without touching raw card data
03 External system boundary at PSP (§4, §6) → timeout means unknown outcome → payment state machine (created → processing → succeeded/failed) → recovery sweep queries PSP for ground truth (§10)
04 Append-only ledger design (§7) → balance derived by summing entries (no UPDATE) → balance queries are expensive on the primary → CQRS read model / read replica for dashboard queries (§8) → primary is never queried for analytics
05 Merchant-based sharding (§9) → single-shard history queries per merchant → hot merchant = hot shard risk → sub-sharding by date range or dedicated shard group for top merchants (§10)
06 Strong consistency NFR (§2) → ledger writes always to primary (§8) → cross-region correctness requires global DB (Spanner) or regional affinity sharding (§9) → reconciliation pipeline (§9) catches any discrepancy not caught by idempotency
