Design Amazon's E-Commerce Platform: The System Design Interview Walkthrough

- Discovery vs transaction split: browsing tolerates eventual consistency; inventory and orders require strong consistency or you're overselling
- reserved_quantity with optimistic locking: soft-reserve at checkout start, hard-decrement only on payment confirmation, never at add-to-cart
- Cart is advisory: always revalidate prices and stock at checkout; stale cart data costs nothing, a wrong order does
- Saga over 2PC: three local commits with explicit compensating actions on failure, no distributed coordinator to take down
- Flash sale needs four layers: virtual waiting room (Redis sorted set), Lua atomic DECR, async Kafka pipeline, and a DB unique constraint as the final guard
- MySQL for active orders, Cassandra for archive: keeps the hot table small and fast; wide-column handles time-series history reads well
- Elasticsearch via async CDC pipeline: catalog changes propagate with a 30-second lag; that's fine because checkout revalidates anyway
You click "Buy Now." Amazon charges your card, decrements the inventory, assigns a warehouse, prints a label, and sends you a tracking number. Usually in under a second. Across 600 million product listings, from 9.7 million sellers, for 12 million orders every day.
That's the problem. Now design it. In 45 minutes. On a whiteboard. While someone watches and says nothing.
This Amazon system design interview guide covers requirements, the core data model, the five hardest services, how the system scales, and how to pace yourself in 45 minutes.
Scope It Before You Touch the Whiteboard
Amazon does a lot. You have 45 minutes. What separates good candidates is resisting the urge to design everything. The recommendation engine. The returns workflow. The seller analytics dashboard. All interesting. All out of scope.
Start by asking what the interviewer cares about. A reasonable scope for this problem is:
- Users can browse and search products
- Users can add items to a cart and place an order
- The system reserves inventory to prevent overselling
- Orders are paid and fulfilled
- Out of scope: recommendations, seller-facing tools, returns, advertising
Write that down. Then get to scale.
Scale That Shapes the Design
These numbers drive every decision you make:
| Metric | Estimate |
|---|---|
| Product listings | 600 million |
| Active sellers | 9.7 million |
| Orders per day | 12 million (~140/sec average) |
| Peak (Prime Day) | ~66,000 orders/sec |
| DynamoDB requests at peak | 126 million/sec |
| Daily revenue | ~$1.75 billion |
The average throughput is manageable. Prime Day is not. Those DynamoDB numbers should make you nervous. Design for the peak, and you design the right system. That forces decisions around inventory consistency, cache depth, and queue-based checkout that wouldn't appear in a vanilla CRUD walkthrough.
Read:write ratio for browsing vs. ordering is roughly 1000:1. Most traffic never results in a purchase.
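The averages in the table can be derived on the whiteboard in a few lines. The sketch below redoes that arithmetic using only the figures quoted above. Applying the 1000:1 ratio to the average order rate is an assumption made here for illustration, not a measured number.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

orders_per_day = 12_000_000    # from the table above
peak_orders_per_sec = 66_000   # Prime Day estimate from the table above
read_write_ratio = 1_000       # browse requests per order (rough)

avg_orders_per_sec = orders_per_day / SECONDS_PER_DAY
peak_to_avg = peak_orders_per_sec / avg_orders_per_sec
# Assumption: the 1000:1 ratio holds at average load
avg_reads_per_sec = avg_orders_per_sec * read_write_ratio

print(f"average orders/sec: {avg_orders_per_sec:.0f}")
print(f"peak-to-average multiplier: {peak_to_avg:.0f}x")
print(f"average browse reads/sec: {avg_reads_per_sec:,.0f}")
```

The peak is several hundred times the average, which is why the rest of the design is sized for Prime Day rather than for a normal Tuesday.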
The Architecture Splits Into Two Worlds
Draw a line down the middle of your diagram. Literally. The first mark you make. On the left: discovery (browsing, search, recommendations). On the right: transactions (cart, checkout, orders, payments, inventory).
Discovery is read-heavy and tolerates staleness. Transactions require correctness.
A user seeing slightly stale search results costs you nothing. A user who checks out two units of a sold-out item costs you a customer and a support ticket. The entire architecture flows from that distinction. Get it backwards and you'll end up with strong consistency on product titles and eventual consistency on payments.
                  [Client]
                     |
        [API Gateway / Load Balancer]
            |                    |
    [Discovery Path]     [Transaction Path]
    - Product Service    - Cart Service
    - Search Service     - Inventory Service
    - CDN / Redis        - Order Service
                         - Payment Service
Discovery and transaction paths diverge at the load balancer and never share a database.
Data Model: Start Normalized, Split Where You Must
Start with the product model. A product has multiple variants (color, size). Each variant has a SKU, and inventory tracks stock per SKU per warehouse.
-- Product catalog
products (product_id, title, description, category_id, seller_id)
product_variants (variant_id, product_id, sku, price, attributes JSONB)
-- Inventory (separate service, separate DB)
inventory (sku, warehouse_id, quantity, reserved_quantity, version)
-- Orders
orders (order_id, user_id, status, idempotency_key, total_amount, created_at)
order_items (item_id, order_id, sku, quantity, unit_price)
-- Cart (document store or Redis hash)
carts (user_id | session_id -> {sku: quantity, ...})
The reserved_quantity column is the key insight for inventory. Stock has two states: available and reserved. You never decrement actual quantity until the order is confirmed. The version field enables optimistic locking to prevent double-decrement bugs.
The SKU is the key that ties catalog and inventory together across separate service databases.
The Five Core Services
1. Product Catalog Service
The catalog is the source of truth for product metadata. It is not the source of truth for pricing or inventory. Those live in separate services.
The catalog feeds an Elasticsearch index asynchronously via a Kafka pipeline. The index is denormalized and search-optimized: a single document per variant with pre-joined category names, seller info, and image URLs. Stale search results are acceptable because the real validation happens at checkout, not at browse time.
Cache product pages aggressively in Redis with a 5-minute TTL. For a product that hasn't changed in a week, there is no reason to hit Postgres on every request.
2. Inventory Service
This is the hardest service. Get it wrong and you sell six laptops when you only have one. Along with orders, it is one of the few places in the system where you must have strong consistency.
The reservation model has two phases. When checkout begins, create a soft reservation: reserved_quantity += N, with a 15-minute TTL stored in Redis alongside the DB record. At payment confirmation, convert the reservation to a permanent decrement: quantity -= N, reserved_quantity -= N. If payment fails or the TTL expires, release the reservation with reserved_quantity -= N. The Redis TTL only signals expiry; something (for example a background sweeper) still has to apply that update to the DB row.
Why not decrement at add-to-cart? Because users add items to carts and abandon them constantly. If every cart addition was a hard decrement, you would perpetually under-report stock.
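The reserve, confirm, and release lifecycle can be sketched in memory. This is a minimal illustration, not the production design: the `StockRecord` class, its method names, and the `holds` dict (standing in for the Redis TTL keys) are all hypothetical, and time is passed in explicitly so the expiry logic is easy to follow.

```python
from dataclasses import dataclass, field

RESERVATION_TTL_SECONDS = 15 * 60  # the 15-minute window described above


@dataclass
class StockRecord:
    quantity: int                 # units physically on hand
    reserved_quantity: int = 0    # units held by in-flight checkouts
    # reservation_id -> (units, expiry time); stands in for the Redis TTL keys
    holds: dict = field(default_factory=dict)

    @property
    def available(self) -> int:
        return self.quantity - self.reserved_quantity

    def reserve(self, reservation_id: str, units: int, now: float) -> bool:
        """Soft-reserve at checkout start. Returns False if stock is short."""
        self.release_expired(now)
        if units <= 0 or units > self.available or reservation_id in self.holds:
            return False
        self.reserved_quantity += units
        self.holds[reservation_id] = (units, now + RESERVATION_TTL_SECONDS)
        return True

    def confirm(self, reservation_id: str, now: float) -> bool:
        """Hard-decrement on payment confirmation."""
        self.release_expired(now)
        hold = self.holds.pop(reservation_id, None)
        if hold is None:
            return False  # expired or never existed
        units, _ = hold
        self.quantity -= units
        self.reserved_quantity -= units
        return True

    def release(self, reservation_id: str) -> None:
        """Compensating action: payment failed or the user cancelled."""
        hold = self.holds.pop(reservation_id, None)
        if hold is not None:
            self.reserved_quantity -= hold[0]

    def release_expired(self, now: float) -> None:
        """Sweep holds whose TTL has passed."""
        expired = [r for r, (_, expiry) in self.holds.items() if expiry <= now]
        for rid in expired:
            self.release(rid)
```

Note that `quantity` only changes inside `confirm`. An abandoned checkout never touches it, which is the whole point of separating the two columns.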
The DB update uses optimistic locking:
UPDATE inventory
SET reserved_quantity = reserved_quantity + 1,
    version = version + 1
WHERE sku = 'ABC-123'
  AND warehouse_id = 42
  AND (quantity - reserved_quantity) >= 1
  AND version = :expected_version;
Zero rows updated means either another writer changed the row first (the version moved) or no stock is left. Re-read the row: retry if stock remains, otherwise return out-of-stock.
The soft reserve is the entire reason your flash sale doesn't sell the same unit to twelve people at once.
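To see the optimistic-lock query in a read, update, retry loop, here is a small sketch using Python's built-in `sqlite3` as a stand-in for the inventory database. The `reserve_one` helper, its return strings, and the retry limit are hypothetical choices for illustration. The `UPDATE` statement mirrors the one above.

```python
import sqlite3


def reserve_one(conn: sqlite3.Connection, sku: str, warehouse_id: int,
                max_attempts: int = 3) -> str:
    """Return 'reserved', 'out_of_stock', or 'conflict' after retries."""
    for _ in range(max_attempts):
        row = conn.execute(
            "SELECT quantity, reserved_quantity, version FROM inventory "
            "WHERE sku = ? AND warehouse_id = ?",
            (sku, warehouse_id),
        ).fetchone()
        if row is None or row[0] - row[1] < 1:
            return "out_of_stock"
        cur = conn.execute(
            "UPDATE inventory "
            "SET reserved_quantity = reserved_quantity + 1, version = version + 1 "
            "WHERE sku = ? AND warehouse_id = ? "
            "AND (quantity - reserved_quantity) >= 1 AND version = ?",
            (sku, warehouse_id, row[2]),
        )
        conn.commit()
        if cur.rowcount == 1:
            return "reserved"
        # Zero rows: another writer bumped the version; re-read and retry.
    return "conflict"


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (sku TEXT, warehouse_id INTEGER, quantity INTEGER, "
    "reserved_quantity INTEGER, version INTEGER, PRIMARY KEY (sku, warehouse_id))"
)
conn.execute("INSERT INTO inventory VALUES ('ABC-123', 42, 2, 0, 0)")
conn.commit()

results = [reserve_one(conn, "ABC-123", 42) for _ in range(3)]
print(results)  # two reservations succeed, the third finds no stock
```

With two units on hand, the third call fails the availability check rather than pushing `reserved_quantity` past `quantity`.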
3. Cart Service
The cart is advisory. Think of it as a wishlist that might convert. About 70 percent of the time it won't. Never trust the cart's prices at checkout. Always refetch the current price from its source of truth at order creation time. A seller might have changed the price in the thirty seconds since the user added the item.
Store carts in Redis as hash maps keyed by user_id. Carts are ephemeral. If you lose a cart, the user is annoyed. If you lose an order, you have a financial dispute.
For logged-out users, persist a session token in a cookie and merge carts on login.
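Both cart behaviours, merging on login and revalidating at checkout, fit in a short sketch. The function names, the choice to sum quantities when the same SKU is in both carts, and the use of integer cents for prices are assumptions made for this example.

```python
from __future__ import annotations


def merge_carts(guest_cart: dict[str, int], user_cart: dict[str, int]) -> dict[str, int]:
    """Merge a logged-out session cart into the user's cart, summing quantities."""
    merged = dict(user_cart)
    for sku, qty in guest_cart.items():
        merged[sku] = merged.get(sku, 0) + qty
    return merged


def revalidate(cart: dict[str, int], cart_prices: dict[str, int],
               current_prices: dict[str, int], available: dict[str, int]):
    """Compare the cart snapshot with live data; return (order_lines, warnings)."""
    lines, warnings = [], []
    for sku, qty in cart.items():
        in_stock = available.get(sku, 0)
        if sku not in current_prices or in_stock == 0:
            warnings.append(f"{sku}: no longer available")
            continue
        if qty > in_stock:
            warnings.append(f"{sku}: quantity reduced from {qty} to {in_stock}")
            qty = in_stock
        price = current_prices[sku]  # always the live price, never the cart copy
        if cart_prices.get(sku) != price:
            warnings.append(f"{sku}: price changed to {price}")
        lines.append((sku, qty, price))
    return lines, warnings
```

The order lines are built only from live prices and stock. The cart contributes the SKUs and the requested quantities, nothing else.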
4. Order Service
Orders are an explicit state machine. Define the states and make invalid transitions impossible.
PENDING_PAYMENT
|
v (payment authorized)
CONFIRMED
|
v (warehouse picked + packed)
SHIPPED
|
v (delivery confirmed)
DELIVERED
PENDING_PAYMENT --x--> CANCELLED (timeout or user action)
CONFIRMED ------x--> CANCELLED (before shipped, rare)
Enforce these transitions at the code level, not just in the diagram.
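One way to make invalid transitions impossible in code is an explicit transition table that every status change must pass through. The sketch below encodes the diagram above. The `ALLOWED` table, `transition` function, and `InvalidTransition` exception are hypothetical names.

```python
from enum import Enum


class OrderStatus(Enum):
    PENDING_PAYMENT = "PENDING_PAYMENT"
    CONFIRMED = "CONFIRMED"
    SHIPPED = "SHIPPED"
    DELIVERED = "DELIVERED"
    CANCELLED = "CANCELLED"


# Every legal edge from the diagram; anything not listed is rejected.
ALLOWED = {
    OrderStatus.PENDING_PAYMENT: {OrderStatus.CONFIRMED, OrderStatus.CANCELLED},
    OrderStatus.CONFIRMED: {OrderStatus.SHIPPED, OrderStatus.CANCELLED},
    OrderStatus.SHIPPED: {OrderStatus.DELIVERED},
    OrderStatus.DELIVERED: set(),   # terminal
    OrderStatus.CANCELLED: set(),   # terminal
}


class InvalidTransition(Exception):
    pass


def transition(current: OrderStatus, target: OrderStatus) -> OrderStatus:
    """Return the new status, or raise if the edge is not in the table."""
    if target not in ALLOWED[current]:
        raise InvalidTransition(f"{current.name} -> {target.name}")
    return target
```

A shipped order cannot be cancelled here because that edge simply does not exist in the table. Adding a new state means adding a row, which makes the change visible in review.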
Active orders live in MySQL. Delivered and cancelled orders archive to Cassandra after 30 days. MySQL does not need to hold the full history of every order from 2006. Cassandra's wide-column model handles time-series reads on historical data well.
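Each order carries an idempotency_key (typically a UUID the client generates before submitting) backed by a unique index. The server inserts the order using INSERT ... ON CONFLICT (idempotency_key) DO NOTHING. That is PostgreSQL syntax; since active orders live in MySQL here, the equivalent is INSERT IGNORE or INSERT ... ON DUPLICATE KEY UPDATE against the same unique index. A duplicate submission returns the existing order record. This makes the create-order endpoint safe to retry.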
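The idempotent insert can be tried locally with Python's built-in `sqlite3`, used here only because SQLite (3.24 or newer) accepts the same `ON CONFLICT ... DO NOTHING` clause. The `create_order` helper and the trimmed-down `orders` table are assumptions for this sketch.

```python
import sqlite3
import uuid


def create_order(conn: sqlite3.Connection, user_id: str, total_cents: int,
                 idempotency_key: str) -> str:
    """Insert the order once; a retry with the same key returns the same order_id."""
    conn.execute(
        "INSERT INTO orders (order_id, user_id, status, idempotency_key, total_amount) "
        "VALUES (?, ?, 'PENDING_PAYMENT', ?, ?) "
        "ON CONFLICT (idempotency_key) DO NOTHING",
        (str(uuid.uuid4()), user_id, idempotency_key, total_cents),
    )
    conn.commit()
    row = conn.execute(
        "SELECT order_id FROM orders WHERE idempotency_key = ?",
        (idempotency_key,),
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id TEXT PRIMARY KEY, user_id TEXT, status TEXT, "
    "idempotency_key TEXT UNIQUE, total_amount INTEGER)"
)

key = str(uuid.uuid4())           # generated by the client before submitting
first = create_order(conn, "user-1", 4999, key)
second = create_order(conn, "user-1", 4999, key)   # simulated retry
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

The retry generates a fresh `order_id` candidate, but the unique index on `idempotency_key` discards it, so both calls return the original order.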
5. Payment Service
The payment flow uses a Saga, not a 2PC. Distributed two-phase commit requires a coordinator and fails hard when that coordinator goes down. That's the thing you were trying to prevent. A Saga breaks the transaction into local commits with compensating actions on failure.
The sequence:
- Reserve inventory (local commit in inventory DB)
- Authorize payment via external gateway (local commit in payments DB)
- Confirm order (local commit in orders DB)
If step 2 fails: release the inventory reservation (compensating action). If step 3 fails: void the payment authorization + release reservation (two compensating actions).
No global coordinator. No locks held across services. Just explicit rollback logic.
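The compensation chain above can be sketched as a list of (action, compensation) pairs that unwinds in reverse order on failure. The `run_saga` function and the step names are hypothetical, and the actions are stubs rather than real service calls.

```python
from typing import Callable, List, Tuple

# (name, action, compensation)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order; on failure, undo completed steps in reverse."""
    log: List[str] = []
    completed: List[Tuple[str, Callable[[], None]]] = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated:{done_name}")
            return False, log
        log.append(f"done:{name}")
        completed.append((name, compensate))
    return True, log


def fail_confirm() -> None:
    raise RuntimeError("orders DB unavailable")  # simulated step-3 failure


def noop() -> None:
    pass


ok, log = run_saga([
    ("reserve_inventory", noop, noop),   # compensation: release reservation
    ("authorize_payment", noop, noop),   # compensation: void authorization
    ("confirm_order", fail_confirm, noop),
])
print(ok, log)
```

When step 3 fails, the payment authorization is voided before the reservation is released, matching the two compensating actions described above. A real implementation would also need compensations that are idempotent and retried until they succeed, which this sketch does not attempt.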
Scaling the Read Path
Product page reads outnumber order writes by roughly 1000 to 1. Scale them by never touching the database.
The chain is: CDN for static assets and cacheable product pages, Redis for product metadata (5-minute TTL), Elasticsearch for search queries. Your origin database should see a fraction of your user traffic.
For search, Elasticsearch handles text queries, filters (price range, category, rating), and fuzzy matching. The index gets updated asynchronously when product data changes. A 30-second lag between catalog update and search index update is acceptable.
Images and static content go through CloudFront or equivalent CDN. A product image does not change. Cache it with a long TTL and a content-addressed URL so cache-busting is explicit.
For product availability hints on the search results page (the "Only 3 left in stock" label), serve a Redis cached count refreshed every 60 seconds. Exact accuracy here is not worth the database load.
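The TTL-based caching described in this section follows the cache-aside pattern: check the cache, fall through to the origin on a miss, store the result with an expiry. The sketch below uses an in-memory dict as a stand-in for Redis. The `TTLCache` class and its injectable clock are assumptions made so the behaviour can be tested without waiting.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Cache-aside with a fixed TTL; a dict stands in for Redis."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}  # key -> (expiry, value)

    def get_or_load(self, key: str, loader: Callable[[], Any]) -> Any:
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]                 # fresh entry: no origin call
        value = loader()                  # miss or expired: hit the origin once
        self._store[key] = (now + self.ttl, value)
        return value
```

A product page would use something like `TTLCache(300)` for the 5-minute metadata TTL, and the "Only 3 left" hint would use `TTLCache(60)`. Within each window the database sees at most one read per key from this cache instance.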
How to Survive a Flash Sale
Imagine 10 million users competing for 10,000 units in 60 seconds. Your system will not survive this without explicit design choices. Every major retailer has a horror story here. The good ones built four layers of defense.
Layer 1: Virtual waiting room. Before users reach the buy button, queue them in a Redis sorted set with random scores. Admit 20,000 users per second. Issue each admitted user a signed JWT that the checkout path can verify. This converts a 10 million user spike into a steady 20K/sec stream.
ZADD waitingroom <random_score> <user_id>
ZPOPMIN waitingroom 20000 -- admit next batch
Redis sorted sets are built on a skip list (paired with a hash table) internally, which is why ZADD runs in O(log n) and ZPOPMIN runs in O(log n) per popped element.
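The waiting room's behaviour can be simulated without Redis. In the sketch below a binary heap plays the role of the sorted set: `join` mirrors `ZADD` with a random score and `admit` mirrors `ZPOPMIN`. The `WaitingRoom` class is hypothetical, and ignoring repeat joins is an assumption (in Redis that would correspond to `ZADD` with the `NX` option, so a refresh cannot reroll a user's position).

```python
import heapq
import random
from typing import List, Optional


class WaitingRoom:
    """In-memory stand-in for the Redis sorted set with random scores."""

    def __init__(self, seed: Optional[int] = None):
        self._rng = random.Random(seed)
        self._heap: List[tuple] = []   # (random score, user_id)
        self._seen: set = set()

    def join(self, user_id: str) -> bool:
        """Enqueue a user once; later attempts keep the original position."""
        if user_id in self._seen:
            return False
        self._seen.add(user_id)
        heapq.heappush(self._heap, (self._rng.random(), user_id))
        return True

    def admit(self, batch_size: int) -> List[str]:
        """Pop up to batch_size users with the lowest scores."""
        n = min(batch_size, len(self._heap))
        return [heapq.heappop(self._heap)[1] for _ in range(n)]
```

Random scores make admission a lottery rather than a race, so arriving a few milliseconds earlier gives no advantage. Calling `admit(20_000)` once per second gives the steady stream described above.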
Layer 2: Redis Lua atomic decrement. Never do a read-then-write for inventory under load. Use a Lua script executed atomically on Redis:
local stock = tonumber(redis.call('GET', KEYS[1]))
-- A missing key yields nil here; treat it as sold out instead of erroring
if not stock or stock <= 0 then
  return -1
end
redis.call('DECR', KEYS[1])
return 1
A single Redis node executes the script without interleaving. Throughput: 100,000+ ops/sec. For a hot SKU, shard the counter across 16-32 keys and sum on reads.
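The sharded-counter idea can be checked with a small concurrency simulation. Here a per-shard `threading.Lock` stands in for the atomicity a Redis Lua script provides on a single key. The `ShardedStock` class, the shard count, and the fall-through to other shards when one is empty are illustrative assumptions.

```python
import random
import threading
from concurrent.futures import ThreadPoolExecutor


class ShardedStock:
    """Stock split across shards; each shard does an atomic check-and-decrement."""

    def __init__(self, total_units: int, shards: int = 16):
        base, extra = divmod(total_units, shards)
        self._counts = [base + (1 if i < extra else 0) for i in range(shards)]
        self._locks = [threading.Lock() for _ in range(shards)]

    def try_buy(self) -> bool:
        """Start at a random shard and fall through to the others if it is empty."""
        n = len(self._counts)
        start = random.randrange(n)
        for offset in range(n):
            i = (start + offset) % n
            with self._locks[i]:          # check and decrement as one step
                if self._counts[i] > 0:
                    self._counts[i] -= 1
                    return True
        return False                      # every shard was empty

    def remaining(self) -> int:
        return sum(self._counts)          # "sum on reads"


stock = ShardedStock(total_units=100, shards=16)
with ThreadPoolExecutor(max_workers=32) as pool:
    outcomes = list(pool.map(lambda _: stock.try_buy(), range(1000)))
sold = sum(outcomes)
print(sold, stock.remaining())
```

Because the check and the decrement happen under the same lock, 1,000 concurrent attempts against 100 units can never sell more than 100, and the fall-through means none are left stranded in a quiet shard.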
Layer 3: Async order pipeline. After the Redis DECR succeeds, publish an order intent to Kafka and return HTTP 202 Accepted immediately. Workers consume from Kafka and write the actual order record. The synchronous path is just two network round-trips.
Layer 4: DB unique constraint as final guard. Even with all the above, put UNIQUE(user_id, sale_id) on the orders table. It prevents a user from exploiting race conditions to buy twice. It is the safety net, not the primary defense.
Throughput drops at each layer. By the time a request hits the database, it has already earned its spot.
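Layers 3 and 4 can be sketched together: the request path only enqueues an intent, and a worker later writes the order under a unique constraint. In this illustration `queue.Queue` stands in for the Kafka topic and `sqlite3` for the orders database. The function names and the trimmed-down table are assumptions.

```python
import queue
import sqlite3

intents: "queue.Queue[tuple]" = queue.Queue()   # stands in for a Kafka topic


def accept_purchase(user_id: str, sale_id: str) -> int:
    """Synchronous path: publish the intent and return 202 Accepted."""
    intents.put((user_id, sale_id))
    return 202


def drain(conn: sqlite3.Connection) -> tuple:
    """Worker: write orders; the unique constraint rejects duplicates."""
    written = rejected = 0
    while True:
        try:
            user_id, sale_id = intents.get_nowait()
        except queue.Empty:
            return written, rejected
        try:
            conn.execute(
                "INSERT INTO orders (user_id, sale_id) VALUES (?, ?)",
                (user_id, sale_id),
            )
            conn.commit()
            written += 1
        except sqlite3.IntegrityError:
            rejected += 1                 # same user, same sale: final guard fired


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id TEXT, "
    "sale_id TEXT, UNIQUE (user_id, sale_id))"
)

statuses = [
    accept_purchase("u1", "sale-1"),
    accept_purchase("u1", "sale-1"),      # duplicate that slipped past earlier layers
    accept_purchase("u2", "sale-1"),
]
written, rejected = drain(conn)
```

All three requests get a 202 immediately, but only two orders reach the table. A real worker would also return the rejected user's unit to the Redis counter, which this sketch leaves out.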
Which Services Need Strong Consistency?
Every service has a different answer to "availability vs. consistency."
| Service | Consistency Need | Why |
|---|---|---|
| Inventory | Strong (CP) | Overselling is a financial error |
| Orders | Strong (CP) | Immutable financial record |
| Product Catalog | Eventual (AP) | Stale title text costs nothing |
| Search Index | Eventual (AP) | 30s lag is fine |
| Cart | Soft (AP) | Revalidated at checkout |
| Notifications | Eventual (AP) | Delayed email is fine |
Stating this table in an interview demonstrates that you understand the CAP theorem as a tool for reasoning, not a buzzword to drop before your interviewer's coffee goes cold.
How to Pace the Amazon System Design Interview
This is more content than 45 minutes can hold. Here is what to prioritize.
0-5 min: Clarify scope. Agree on the six core flows: browse, search, cart, checkout, order, payment. Write them down visibly.
5-10 min: Scale estimates. Derive peak orders/sec. State the 1000:1 read/write ratio. Pick your consistency tiers.
10-20 min: Draw the two-world architecture (discovery vs. transaction). Name each service. Sketch the data model for inventory and orders specifically. These are the hardest parts and your interviewer knows it.
20-35 min: Go deep on inventory. Walk through the reservation model, the optimistic lock query, and what happens when the payment fails. Then walk through the order state machine and the Saga compensation chain.
35-45 min: Cover scaling. Redis cache for reads, Elasticsearch for search, the flash sale queue. Bring up the tradeoff table. Name what you would defer (recommendations, returns, internationalization) and why.
If you get stuck, say what the constraint is: "I'm choosing availability over consistency here because the cost of a stale catalog page is zero." Showing your reasoning is worth more than the answer itself. Going silent is the one move that leaves nothing on the page for the interviewer to write down.
Putting It Into Practice
Reading about this is one layer of understanding. Explaining it out loud under pressure is a completely different skill. System design interviews test your ability to structure ambiguity on the fly, defend tradeoffs, and communicate while the clock is running.
SpaceComplexity gives you voice-based mock interviews with rubric feedback on exactly this: how clearly you communicate your design, how well you scope the problem, and whether your tradeoffs hold up under follow-up questions. You can run through this problem end-to-end and get a score before the real thing.
Recap
- Split discovery (read-heavy, eventual consistency OK) from transactions (write-heavy, strong consistency required)
- The inventory table uses reserved_quantity + optimistic locking. Never decrement at add-to-cart
- The cart is advisory. Revalidate everything at checkout
- Orders are a state machine backed by MySQL for active orders, Cassandra for history
- Payment uses a Saga with compensating actions, not 2PC
- Flash sales need a virtual waiting room, Redis Lua atomic DECR, an async order pipeline, and a DB unique constraint as the final guard
- Cache product pages in Redis and CDN. The database should not see raw browse traffic