Design a Stock Exchange: The System Design Interview Walkthrough

- Determinism before throughput: the sequencer stamps every order with a monotonically increasing number before any matching happens.
- Order book = TreeMap + FIFO Queue + HashMap: O(log P) add, O(1) cancel with a doubly-linked list, O(M log P) match.
- Matching engine is single-threaded per symbol: no locks, no concurrency within a symbol; scale by sharding symbols across threads.
- Journal-first architecture: the in-memory order book is a materialized view of an append-only log, enabling crash recovery in under a minute.
- Market data uses multicast UDP: loss is acceptable, latency is not; sequence gaps trigger selective retransmission from a snapshot service.
- Pre-trade risk checks run before the sequencer: position limits, notional caps, instrument halts, and LULD circuit breakers all fire here.
- Sequencer sharding is the key tradeoff: one global sequencer guarantees cross-symbol ordering; sharding by symbol group weakens that guarantee.
Most candidates walk into a stock exchange system design interview and immediately start talking about databases and load balancers. Both matter. The harder problem is determinism. Two traders submit orders at the same millisecond. One of them gets filled. The other doesn't. Whoever gets filled first depends on a sequence number assigned before any matching happens. Get that sequencing wrong and you have a lawsuit, not just a bug.
Keep that frame in mind as we build this.
Engineering a trading system correctly is harder than it looks. Even for the goldfish.
Scoping a Stock Exchange System Design Interview
Start by narrowing. Ask: are we building a full exchange (listing, settlement, market making) or just the trading core? For a 45-minute interview, scope down to the critical path.
In scope. Submit orders (limit and market), match them by price-time priority, publish trades and quotes, handle cancels.
Out of scope. Account management, margin, options/futures, payment rails.
Non-functional requirements. This is where it gets interesting. A typical exchange processes millions of orders per day, with peak throughput in the tens of thousands of orders per second at market open. Nasdaq's market data feed hits application latency at roughly 42 microseconds for co-located clients. LMAX, a retail derivatives exchange, demonstrated 6 million orders per second on a single JVM thread in 2011. Your target in an interview: single-digit millisecond round-trip, ~100K orders/sec sustained, deterministic ordering guaranteed.
High-Level Architecture: Six Boxes, Left to Right
Draw these as a pipeline. The interviewer wants to see the critical path before you dive into any single component.
Every order travels this path. The sequencer is the component that keeps the lawyers away.
Gateway. The front door. Accepts orders via FIX (Financial Information eXchange), the wire protocol used by virtually every broker and exchange globally. Validates schema, rate-limits clients, translates FIX messages into internal structs. Stateless and horizontally scalable.
Risk Checks. Pre-trade validation: does the client have enough buying power? Is the order size within position limits? Is this ticker currently halted? Reject fast, before anything hits the sequencer. This is mostly in-memory reads, O(1) per check. It is also where you stop someone from accidentally submitting a $10 billion market order at 9:30am on a Tuesday.
Sequencer. A single-writer component that stamps every inbound order with a monotonically increasing sequence number. No two orders share a sequence number. This is the fairness guarantee. The sequencer also writes the order to a journal (append-only log on disk) before forwarding to the matching engine. Journal write, then sequence stamp. That ordering matters more than almost anything else in this design.
Matching Engine. Single-threaded per symbol. Maintains the order book in memory. Processes orders in strict sequence-number order. Emits trade events when orders cross.
Market Data Service. Consumes trade and quote events from the matching engine. Publishes to market data subscribers over multicast UDP. Packet loss is acceptable. Latency is not. Subscribers use sequence gaps to detect missed messages and request retransmission, but the primary feed is fire-and-forget.
Trade Reporter. Persists executed trades asynchronously. Feeds the downstream settlement system (T+1 in the US as of May 2024). This path doesn't touch the matching engine's critical latency.
Data Model
Three core tables, kept thin.
-- Orders: one row per submitted order orders ( order_id BIGINT PRIMARY KEY, -- from sequencer symbol VARCHAR(8), side ENUM('BUY','SELL'), type ENUM('LIMIT','MARKET'), price NUMERIC(12,4), -- NULL for market orders quantity INT, filled_qty INT DEFAULT 0, status ENUM('OPEN','PARTIAL','FILLED','CANCELLED'), client_id BIGINT, sequence_num BIGINT, -- sequencer stamp created_at TIMESTAMPTZ ); -- Trades: one row per fill (partial or complete) trades ( trade_id BIGINT PRIMARY KEY, symbol VARCHAR(8), buy_order_id BIGINT, sell_order_id BIGINT, price NUMERIC(12,4), quantity INT, executed_at TIMESTAMPTZ ); -- Instruments: static reference data instruments ( symbol VARCHAR(8) PRIMARY KEY, status ENUM('OPEN','HALTED','CLOSED'), lot_size INT, tick_size NUMERIC(10,6) );
The matching engine never reads from this database. It reconstructs order book state from the journal at startup and holds everything in memory. These tables are for the reporting and settlement path.
The Order Book: Get This Data Structure Right
Get the order book wrong and nothing else matters. It is the data structure that defines the exchange.
Each side (bids and asks) is a price-level map. At every price level sits a FIFO queue of resting orders. You also maintain a hash map from order ID to its position for O(1) cancels.
Bids descending, asks ascending. The spread in the middle is where the action happens.
from sortedcontainers import SortedList from collections import deque, OrderedDict class OrderBook: def __init__(self, side: str): # 'bid' -> descending (highest price first) # 'ask' -> ascending (lowest price first) self.side = side self.levels: dict[float, deque] = {} self.order_map: dict[str, object] = {} # SortedList with key for direction sign = -1 if side == 'bid' else 1 self.prices = SortedList(key=lambda p: sign * p) def add(self, order) -> None: if order.price not in self.levels: self.levels[order.price] = deque() self.prices.add(order.price) self.levels[order.price].append(order) self.order_map[order.order_id] = order def best(self) -> float | None: return self.prices[0] if self.prices else None def cancel(self, order_id: str) -> None: order = self.order_map.pop(order_id, None) if not order: return q = self.levels[order.price] q.remove(order) # O(n) on deque; prod uses doubly-linked list if not q: del self.levels[order.price] self.prices.remove(order.price)
In production you'd swap the deque.remove (O(n)) for a doubly-linked list node stored inside the order_map entry, giving O(1) cancel. The asymptotic profile then becomes: add O(log P) where P is distinct price levels, cancel O(1), match O(M log P) where M is matched orders.
The matching loop itself is a tight inner loop. A limit buy at price $100 gets matched against asks starting from the lowest ask, consuming levels until either the buy quantity is filled or no more asks satisfy the price.
The Matching Engine: Single-Threaded on Purpose
The matching engine runs single-threaded per symbol. No locks, no concurrent writes. This sounds like a mistake. It is not. It is the whole point.
Concurrency buys you nothing here because the output must be deterministic. If two threads race to match orders and produce different fill outcomes depending on which thread wins, you have violated price-time priority. A regulator audit will find it. LMAX's engineering team reached this same conclusion: profiling showed processors spending more time managing lock contention than executing actual matching logic, so they eliminated locks entirely. One thread, one ring buffer, no concurrency within a symbol.
The Disruptor pattern they built is a lock-free ring buffer with pre-allocated slots. Producers push orders to the matching engine thread without synchronization overhead. The matching thread reads from its claimed slot, processes the order, and increments its cursor. No GC pressure from allocating queue nodes. Power-of-two sizing means indexing is a single bitwise AND.
Scaling across symbols is the answer to throughput. Symbol A's order book lives on shard 1. Symbol B lives on shard 2. You can run hundreds of matching engine threads in parallel because no two order books share state. The single-threaded design is a feature of each shard, not a ceiling on the system.
Sequencer and Durability: Journal First, Always
The sequencer is a single-writer component shared across all symbols. Every order, regardless of symbol, flows through the sequencer before it hits a matching engine shard.
Two things happen in a specific order: journal write, then sequence stamp. Never reversed.
If the matching engine crashes, it replays the journal from the last checkpoint. The in-memory order book is just a cache.
If the matching engine crashes, it replays the journal from the last checkpoint. LMAX demonstrated that a full JVM restart, snapshot load, and day-of-journal replay completes in under one minute.
The journal is the source of truth. The in-memory order book is a materialized view. This is event sourcing applied to a trading system. It gives you crash recovery, a complete audit trail, and the ability to replay history on a developer machine to reproduce a bug exactly. In 2010, the Tokyo Stock Exchange's Arrowhead system cut matching latency from roughly one second to five milliseconds by adopting exactly this model: triply-redundant in-memory servers with journal-based recovery, no disk-backed state on the hot path.
Market Data: Speed Over Everything
Trade events and quote updates get published over multicast UDP, not TCP. TCP's delivery guarantee requires acknowledgment per packet, which adds latency. Market data is not transactional. If a subscriber misses a quote update, the next update supersedes it anyway. Publish fast. Let subscribers detect gaps via sequence number holes, then request selective retransmission from a separate snapshot service.
One publisher, many subscribers. Packet dropped? Detect the gap, request the retransmit. The primary feed does not wait.
Nasdaq's ITCH protocol uses exactly this model. For co-located clients, 90th percentile application-level latency is around 42 microseconds. NYSE's OpenBook Ultra publishes on every limit-order book event with microsecond-level feed latency.
Risk Checks and Circuit Breakers
Pre-trade risk runs before the sequencer. Checks include: position limit (don't let a client build a position that exceeds their margin), notional cap (reject a $1B limit order from a retail client), and instrument status (reject orders on halted stocks).
Post-match, the exchange enforces two circuit breaker mechanisms. Individual stock: LULD (Limit Up-Limit Down) bands calculated as a percentage of the rolling five-minute average price. If a trade would execute outside the band, it triggers a brief pause. Market-wide: if the S&P 500 drops 7%, trading halts for 15 minutes across all instruments. A 13% drop means another 15-minute halt. A 20% drop closes the market for the day. We do not ask why. We just implement the rule.
Scaling Bottlenecks
| Component | Bottleneck | Solution |
|---|---|---|
| Matching engine | CPU throughput per symbol | Partition by symbol, one thread per shard |
| Sequencer | Single writer for all symbols | Shard by symbol group, each group gets its own sequencer |
| Gateway | Connection limits | Stateless horizontal scaling, sticky sessions per client |
| Market data | Fan-out to thousands of subscribers | Multicast UDP, separate snapshot servers |
| Trade DB | Write throughput | Append-only, partitioned by date, async from hot path |
Flag the sequencer sharding point explicitly in an interview. A single global sequencer is the initial design. Under load it becomes the bottleneck. Shard by symbol group and cross-symbol ordering guarantees weaken. If the interviewer pushes here, acknowledge it directly: within a symbol, ordering is deterministic. Across symbols, you get best-effort ordering. That's acceptable for most exchange semantics.
Key Tradeoffs
Consistency vs. availability. The matching engine is CP. During a sequencer outage, orders queue up but nothing matches incorrectly. Market data dissemination is AP: stale data is published, subscribers reconcile via snapshots.
Latency vs. durability. Fsync on every journal write adds ~100 microseconds. Group commits (fsync every N orders or every T microseconds) reduce that cost with bounded risk: at most N orders are unreplayed on crash.
In-memory state vs. disk. The hot path (matching) is entirely in-memory. The cold path (persistence, reporting) uses async writes. This means a cold restart requires journal replay, which takes seconds to minutes depending on journal length. Periodic snapshots cap replay time.
The 45-Minute Clock
- 0-5 min. Requirements. Nail functional scope (limit/market orders, price-time priority, cancel, market data). Nail non-functional (low latency, deterministic ordering, high availability). Establish scale: 100K orders/sec sustained.
- 5-12 min. High-level diagram. Draw the six-box pipeline. Name each component and its responsibility in one sentence.
- 12-25 min. Matching engine deep dive. Order book data structure (TreeMap + Queue + HashMap). Single-threaded by design and why. Sequencer and the journal.
- 25-35 min. Market data dissemination and persistence. UDP multicast, sequence gaps, snapshot servers. Async trade reporting.
- 35-42 min. Risk checks, circuit breakers, scaling. Partition by symbol, sequencer sharding tradeoff.
- 42-45 min. Interviewer's follow-ups. Common ones: how do you handle a matching engine crash mid-trade? (Journal guarantees replay from last consistent state.) How do you test this? (Replay recorded order streams in CI.)
Recap
- The core insight is determinism before throughput. The sequencer stamps every order before matching.
- Order book = TreeMap + FIFO Queue + HashMap. Add O(log P), cancel O(1), match O(M log P).
- Matching engine is single-threaded per symbol. Scale by sharding symbols, not by adding threads.
- Journal-first, in-memory hot path. The matching engine's state is a materialized view of the journal.
- Market data uses multicast UDP. Loss is acceptable; out-of-order or stale data is not; sequence numbers detect gaps.
- Pre-trade risk checks run before the sequencer. Circuit breakers run at match time and at the exchange level.
- Partition sequencers by symbol group when a single sequencer becomes the throughput ceiling.
If you want to practice talking through this design under real interview conditions, SpaceComplexity runs voice-based mock system design interviews with rubric feedback on your structure, tradeoff reasoning, and communication. Narrating a design while drawing it live is a different skill than reading about it.
Further Reading
- The LMAX Architecture (Martin Fowler, 2011)
- Trading curb (Wikipedia, circuit breaker mechanics)
- Order Matching Engine: Everything You Need to Know (Devexperts)
- Nasdaq TotalView-ITCH (Databento, market data protocol reference)
- System Design Stock Exchange (System Design Newsletter, Neo Kim)