Design a Stock Exchange: The System Design Interview Walkthrough

Most candidates walk into a stock exchange system design interview and immediately start talking about databases and load balancers. Both matter. The harder problem is determinism. Two traders submit orders at the same millisecond. One of them gets filled. The other doesn't. Whoever gets filled first depends on a sequence number assigned before any matching happens. Get that sequencing wrong and you have a lawsuit, not just a bug.

Keep that frame in mind as we build this.

Michael Reeves on making his goldfish trade stocks Engineering a trading system correctly is harder than it looks. Even for the goldfish.

Scoping a Stock Exchange System Design Interview

Start by narrowing. Ask: are we building a full exchange (listing, settlement, market making) or just the trading core? For a 45-minute interview, scope down to the critical path.

In scope. Submit orders (limit and market), match them by price-time priority, publish trades and quotes, handle cancels.

Out of scope. Account management, margin, options/futures, payment rails.

Non-functional requirements. This is where it gets interesting. A typical exchange processes millions of orders per day, with peak throughput in the tens of thousands of orders per second at market open. Nasdaq's market data feed hits application latency at roughly 42 microseconds for co-located clients. LMAX, a retail derivatives exchange, demonstrated 6 million orders per second on a single JVM thread in 2011. Your target in an interview: single-digit millisecond round-trip, ~100K orders/sec sustained, deterministic ordering guaranteed.

High-Level Architecture: Six Boxes, Left to Right

Draw these as a pipeline. The interviewer wants to see the critical path before you dive into any single component.

The six-box pipeline from broker to settlement, with the sequencer highlighted as the fairness guarantee Every order travels this path. The sequencer is the component that keeps the lawyers away.

Gateway. The front door. Accepts orders via FIX (Financial Information eXchange), the wire protocol used by virtually every broker and exchange globally. Validates schema, rate-limits clients, translates FIX messages into internal structs. Stateless and horizontally scalable.

Risk Checks. Pre-trade validation: does the client have enough buying power? Is the order size within position limits? Is this ticker currently halted? Reject fast, before anything hits the sequencer. This is mostly in-memory reads, O(1) per check. It is also where you stop someone from accidentally submitting a $10 billion market order at 9:30am on a Tuesday.

Sequencer. A single-writer component that stamps every inbound order with a monotonically increasing sequence number. No two orders share a sequence number. This is the fairness guarantee. The sequencer also writes the order to a journal (append-only log on disk) before forwarding to the matching engine. Journal write, then sequence stamp. That ordering matters more than almost anything else in this design.

Matching Engine. Single-threaded per symbol. Maintains the order book in memory. Processes orders in strict sequence-number order. Emits trade events when orders cross.

Market Data Service. Consumes trade and quote events from the matching engine. Publishes to market data subscribers over multicast UDP. Packet loss is acceptable. Latency is not. Subscribers use sequence gaps to detect missed messages and request retransmission, but the primary feed is fire-and-forget.

Trade Reporter. Persists executed trades asynchronously. Feeds the downstream settlement system (T+1 in the US as of May 2024). This path doesn't touch the matching engine's critical latency.

Data Model

Three core tables, kept thin.

-- Orders: one row per submitted order
orders (
  order_id       BIGINT PRIMARY KEY,  -- from sequencer
  symbol         VARCHAR(8),
  side           ENUM('BUY','SELL'),
  type           ENUM('LIMIT','MARKET'),
  price          NUMERIC(12,4),       -- NULL for market orders
  quantity       INT,
  filled_qty     INT DEFAULT 0,
  status         ENUM('OPEN','PARTIAL','FILLED','CANCELLED'),
  client_id      BIGINT,
  sequence_num   BIGINT,              -- sequencer stamp
  created_at     TIMESTAMPTZ
);

-- Trades: one row per fill (partial or complete)
trades (
  trade_id       BIGINT PRIMARY KEY,
  symbol         VARCHAR(8),
  buy_order_id   BIGINT,
  sell_order_id  BIGINT,
  price          NUMERIC(12,4),
  quantity       INT,
  executed_at    TIMESTAMPTZ
);

-- Instruments: static reference data
instruments (
  symbol         VARCHAR(8) PRIMARY KEY,
  status         ENUM('OPEN','HALTED','CLOSED'),
  lot_size       INT,
  tick_size      NUMERIC(10,6)
);

The matching engine never reads from this database. It reconstructs order book state from the journal at startup and holds everything in memory. These tables are for the reporting and settlement path.

The Order Book: Get This Data Structure Right

Get the order book wrong and nothing else matters. It is the data structure that defines the exchange.

Each side (bids and asks) is a price-level map. At every price level sits a FIFO queue of resting orders. You also maintain a hash map from order ID to its position for O(1) cancels.

Order book structure showing bids and asks as sorted price levels, each with a FIFO queue, and a HashMap cancel index Bids descending, asks ascending. The spread in the middle is where the action happens.

from sortedcontainers import SortedList
from collections import deque, OrderedDict

class OrderBook:
    def __init__(self, side: str):
        # 'bid' -> descending (highest price first)
        # 'ask' -> ascending (lowest price first)
        self.side = side
        self.levels: dict[float, deque] = {}
        self.order_map: dict[str, object] = {}
        # SortedList with key for direction
        sign = -1 if side == 'bid' else 1
        self.prices = SortedList(key=lambda p: sign * p)

    def add(self, order) -> None:
        if order.price not in self.levels:
            self.levels[order.price] = deque()
            self.prices.add(order.price)
        self.levels[order.price].append(order)
        self.order_map[order.order_id] = order

    def best(self) -> float | None:
        return self.prices[0] if self.prices else None

    def cancel(self, order_id: str) -> None:
        order = self.order_map.pop(order_id, None)
        if not order:
            return
        q = self.levels[order.price]
        q.remove(order)             # O(n) on deque; prod uses doubly-linked list
        if not q:
            del self.levels[order.price]
            self.prices.remove(order.price)

In production you'd swap the deque.remove (O(n)) for a doubly-linked list node stored inside the order_map entry, giving O(1) cancel. The asymptotic profile then becomes: add O(log P) where P is distinct price levels, cancel O(1), match O(M log P) where M is matched orders.

The matching loop itself is a tight inner loop. A limit buy at price $100 gets matched against asks starting from the lowest ask, consuming levels until either the buy quantity is filled or no more asks satisfy the price.

The Matching Engine: Single-Threaded on Purpose

The matching engine runs single-threaded per symbol. No locks, no concurrent writes. This sounds like a mistake. It is not. It is the whole point.

Concurrency buys you nothing here because the output must be deterministic. If two threads race to match orders and produce different fill outcomes depending on which thread wins, you have violated price-time priority. A regulator audit will find it. LMAX's engineering team reached this same conclusion: profiling showed processors spending more time managing lock contention than executing actual matching logic, so they eliminated locks entirely. One thread, one ring buffer, no concurrency within a symbol.

The Disruptor pattern they built is a lock-free ring buffer with pre-allocated slots. Producers push orders to the matching engine thread without synchronization overhead. The matching thread reads from its claimed slot, processes the order, and increments its cursor. No GC pressure from allocating queue nodes. Power-of-two sizing means indexing is a single bitwise AND.

Scaling across symbols is the answer to throughput. Symbol A's order book lives on shard 1. Symbol B lives on shard 2. You can run hundreds of matching engine threads in parallel because no two order books share state. The single-threaded design is a feature of each shard, not a ceiling on the system.

Sequencer and Durability: Journal First, Always

The sequencer is a single-writer component shared across all symbols. Every order, regardless of symbol, flows through the sequencer before it hits a matching engine shard.

Two things happen in a specific order: journal write, then sequence stamp. Never reversed.

Sequencer flow showing journal write happening before sequence stamping, with crash recovery via journal replay If the matching engine crashes, it replays the journal from the last checkpoint. The in-memory order book is just a cache.

If the matching engine crashes, it replays the journal from the last checkpoint. LMAX demonstrated that a full JVM restart, snapshot load, and day-of-journal replay completes in under one minute.

The journal is the source of truth. The in-memory order book is a materialized view. This is event sourcing applied to a trading system. It gives you crash recovery, a complete audit trail, and the ability to replay history on a developer machine to reproduce a bug exactly. In 2010, the Tokyo Stock Exchange's Arrowhead system cut matching latency from roughly one second to five milliseconds by adopting exactly this model: triply-redundant in-memory servers with journal-based recovery, no disk-backed state on the hot path.

Market Data: Speed Over Everything

Trade events and quote updates get published over multicast UDP, not TCP. TCP's delivery guarantee requires acknowledgment per packet, which adds latency. Market data is not transactional. If a subscriber misses a quote update, the next update supersedes it anyway. Publish fast. Let subscribers detect gaps via sequence number holes, then request selective retransmission from a separate snapshot service.

Market data multicast fan-out showing the matching engine publishing to UDP multicast, with multiple subscriber types and sequence gap detection One publisher, many subscribers. Packet dropped? Detect the gap, request the retransmit. The primary feed does not wait.

Nasdaq's ITCH protocol uses exactly this model. For co-located clients, 90th percentile application-level latency is around 42 microseconds. NYSE's OpenBook Ultra publishes on every limit-order book event with microsecond-level feed latency.

Risk Checks and Circuit Breakers

Pre-trade risk runs before the sequencer. Checks include: position limit (don't let a client build a position that exceeds their margin), notional cap (reject a $1B limit order from a retail client), and instrument status (reject orders on halted stocks).

Post-match, the exchange enforces two circuit breaker mechanisms. Individual stock: LULD (Limit Up-Limit Down) bands calculated as a percentage of the rolling five-minute average price. If a trade would execute outside the band, it triggers a brief pause. Market-wide: if the S&P 500 drops 7%, trading halts for 15 minutes across all instruments. A 13% drop means another 15-minute halt. A 20% drop closes the market for the day. We do not ask why. We just implement the rule.

Scaling Bottlenecks

Component	Bottleneck	Solution
Matching engine	CPU throughput per symbol	Partition by symbol, one thread per shard
Sequencer	Single writer for all symbols	Shard by symbol group, each group gets its own sequencer
Gateway	Connection limits	Stateless horizontal scaling, sticky sessions per client
Market data	Fan-out to thousands of subscribers	Multicast UDP, separate snapshot servers
Trade DB	Write throughput	Append-only, partitioned by date, async from hot path

Flag the sequencer sharding point explicitly in an interview. A single global sequencer is the initial design. Under load it becomes the bottleneck. Shard by symbol group and cross-symbol ordering guarantees weaken. If the interviewer pushes here, acknowledge it directly: within a symbol, ordering is deterministic. Across symbols, you get best-effort ordering. That's acceptable for most exchange semantics.

Key Tradeoffs

Consistency vs. availability. The matching engine is CP. During a sequencer outage, orders queue up but nothing matches incorrectly. Market data dissemination is AP: stale data is published, subscribers reconcile via snapshots.

Latency vs. durability. Fsync on every journal write adds ~100 microseconds. Group commits (fsync every N orders or every T microseconds) reduce that cost with bounded risk: at most N orders are unreplayed on crash.

In-memory state vs. disk. The hot path (matching) is entirely in-memory. The cold path (persistence, reporting) uses async writes. This means a cold restart requires journal replay, which takes seconds to minutes depending on journal length. Periodic snapshots cap replay time.

The 45-Minute Clock

0-5 min. Requirements. Nail functional scope (limit/market orders, price-time priority, cancel, market data). Nail non-functional (low latency, deterministic ordering, high availability). Establish scale: 100K orders/sec sustained.
5-12 min. High-level diagram. Draw the six-box pipeline. Name each component and its responsibility in one sentence.
12-25 min. Matching engine deep dive. Order book data structure (TreeMap + Queue + HashMap). Single-threaded by design and why. Sequencer and the journal.
25-35 min. Market data dissemination and persistence. UDP multicast, sequence gaps, snapshot servers. Async trade reporting.
35-42 min. Risk checks, circuit breakers, scaling. Partition by symbol, sequencer sharding tradeoff.
42-45 min. Interviewer's follow-ups. Common ones: how do you handle a matching engine crash mid-trade? (Journal guarantees replay from last consistent state.) How do you test this? (Replay recorded order streams in CI.)

Recap

The core insight is determinism before throughput. The sequencer stamps every order before matching.
Order book = TreeMap + FIFO Queue + HashMap. Add O(log P), cancel O(1), match O(M log P).
Matching engine is single-threaded per symbol. Scale by sharding symbols, not by adding threads.
Journal-first, in-memory hot path. The matching engine's state is a materialized view of the journal.
Market data uses multicast UDP. Loss is acceptable; out-of-order or stale data is not; sequence numbers detect gaps.
Pre-trade risk checks run before the sequencer. Circuit breakers run at match time and at the exchange level.
Partition sequencers by symbol group when a single sequencer becomes the throughput ceiling.

If you want to practice talking through this design under real interview conditions, SpaceComplexity runs voice-based mock system design interviews with rubric feedback on your structure, tradeoff reasoning, and communication. Narrating a design while drawing it live is a different skill than reading about it.