Food Delivery System Design Interview: The DoorDash Walkthrough

May 27, 202613 min read
interview-prepcareerdsaalgorithms
Food Delivery System Design Interview: The DoorDash Walkthrough
TL;DR
  • Browse is AP, ordering is CP: split consistency before drawing a single box, every storage decision follows from it
  • 25,000 location writes/sec is the dominant load, not the 150 order writes/sec most candidates fixate on
  • Driver locations live in Redis with a geospatial index, never in the same PostgreSQL table as orders
  • Order state machine: draw every state and actor explicitly, and make every transition idempotent
  • Dispatch uses Redis SETNX to prevent two concurrent orders from grabbing the same driver
  • WebSocket push beats polling for live tracking because one driver event fans out to exactly one customer
  • Geohash vs H3, Cassandra vs PostgreSQL: bring these tradeoffs up unprompted in the final five minutes

You open an app. You tap four things. Thirty minutes later a stranger hands you food through a car window. It sounds trivially simple, which is exactly why interviewers love this question. The moment you start drawing boxes, you realize you are actually designing three different systems with contradictory consistency requirements, a real-time logistics engine, and a geospatial matching problem. All at once. Good luck.

This walkthrough covers everything you need to navigate the DoorDash (or Uber Eats, or any food delivery) system design question: requirements, data model, the order state machine, driver dispatch, real-time tracking, and a clock breakdown for the 45-minute interview.


Start With Requirements, Not Boxes

The worst way to begin a system design interview is to immediately start drawing boxes. You will draw the wrong boxes. Spend three minutes on requirements first.

Functional requirements (agree on these explicitly):

  • Customers browse restaurants and menus
  • Customers place orders and track them in real time
  • Restaurants receive and accept/reject orders, update preparation status
  • Drivers (dashers) receive dispatch requests, accept or decline, navigate to pickup and dropoff
  • All parties see live order status

Non-functional requirements (these drive every tradeoff later):

  • Order placement: strong consistency required. A customer should never see "order placed" if payment failed or the item is unavailable
  • Location tracking: eventual consistency is fine. A 5-second delay on the driver's pin is acceptable
  • High availability for browse and search. A restaurant being temporarily stale in search results is tolerable; the checkout flow going down is not
  • Target: sub-200ms API responses for order placement, ~5s update interval for live tracking

Browse is AP (availability over consistency). Order and payment are CP (consistency over availability). Nail this split early. Every storage and architecture choice follows from it. If you get one thing across in the first five minutes, make it this.


The Number That Ruins Naive Designs

Interviewers want to see that you can estimate before designing. Round aggressively.

DoorDash processes roughly 2 million orders per day. Spread across 24 hours with lunch and dinner peaks around 3x baseline, peak order writes land around 100 to 150 per second. Not extreme. A single PostgreSQL primary handles that comfortably for years.

Here is the number that trips people up. At peak, DoorDash has around 100,000 active drivers, each pinging GPS every 4 seconds. That is 25,000 location writes per second sustained during dinner rush. This is the dominant write load in the system, and it needs a completely separate architecture from order writes.

Restaurant menu reads vastly outnumber writes. Assume a 1,000:1 read-to-write ratio. Cache aggressively, or watch your database cry.


Five Services, One Clear Boundary Each

Every candidate draws boxes. The signal is whether your boxes have real, defensible boundaries.

High-level service architecture: customer, restaurant, and driver apps hitting an API gateway which routes to five services: Order Service backed by PostgreSQL, Restaurant Service with Redis cache, Dispatch Service for geo-matching, Location Service ingesting GPS at 25K writes/sec, and Notification Service pushing via WebSocket

Five services, three storage tiers, one AP/CP decision that threads through all of it.

Before diving into each service, here is roughly what good microservices architecture looks like compared to the monolithic alternative:

Monolithic house vs microservice house: illustration comparing a single cluttered building to an over-engineered multi-component structure

The microservices house has a dedicated "waiting room" and an "event queue." Accurate.

Order Service owns the transaction. It validates the cart, calls the payment processor, writes a durable order record, and publishes an order.created event to Kafka. It talks to the Restaurant Service to confirm item availability before charging.

Restaurant Service holds the catalog, menus, operating hours, and preparation-time estimates. Heavily read. Backed by PostgreSQL with a Redis read-through cache.

Dispatch Service receives order.created, finds a nearby available driver, and assigns them. This is the hardest subproblem in the system. We will spend a lot of time here.

Location Service ingests GPS pings from driver apps, writes to Redis (short TTL), and publishes driver.location_updated events to Kafka. It calculates ETAs using a routing engine.

Notification Service consumes Kafka events and pushes updates to customer and driver via WebSocket. Stateless. Horizontally scalable. The easy one.


The Schema Is Two Tables and a Redis Key

Keep it tight. You do not need to schema everything. Show the core tables and explain the relationships.

-- Core order tables orders ( id UUID PRIMARY KEY, customer_id UUID, restaurant_id UUID, driver_id UUID, -- null until assigned status order_status, -- enum: PLACED, ACCEPTED, PREPARING, READY, PICKED_UP, DELIVERED, CANCELLED total_cents INT, idempotency_key VARCHAR UNIQUE, created_at TIMESTAMPTZ, updated_at TIMESTAMPTZ ) order_items ( id UUID PRIMARY KEY, order_id UUID REFERENCES orders, menu_item_id UUID, quantity INT, price_cents INT ) -- Driver location (in Redis, not Postgres) -- Key: driver:{driver_id} Value: { lat, lng, status, updated_at } -- Geospatial index: GEOADD drivers_geo lng lat driver_id

The orders table lives in PostgreSQL. You want ACID guarantees here. Payment deduction and order creation happen in a single transaction. The idempotency_key (set by the client before the first request) prevents double-charges if the network times out and the client retries. Without it, your system charges someone twice and now you have a very angry customer and a refund to process.

Driver location never touches PostgreSQL. It lives in Redis, updated in-place with a short TTL so stale drivers age out automatically. No joins. No transactions. Just a fast geospatial index.


Draw the State Machine. All of It.

Every order is a finite state machine. Draw it explicitly. Candidates who say "there are various states" without naming them signal that they have never thought through the edge cases.

Order state machine showing PLACED, RESTAURANT_ACCEPTED, PREPARING, READY_FOR_PICKUP, PICKED_UP, DELIVERED as the happy path, with CANCELLED branching off any state with compensation logic and an idempotency rule callout

The CANCELLED state has compensation logic. Unassigning a driver while processing a refund is not trivial. Worth a sentence in the interview.

PLACED → RESTAURANT_ACCEPTED → PREPARING → READY_FOR_PICKUP
  → PICKED_UP → DELIVERED

Any state → CANCELLED (with compensation logic)

Each transition is triggered by a specific actor: the restaurant app sends ACCEPTED and READY_FOR_PICKUP, the driver app sends PICKED_UP and DELIVERED. The Order Service validates that the incoming transition is legal from the current state before writing it.

State transitions must be idempotent. A driver's app might send PICKED_UP twice due to a retry. The correct behavior is to check IF current_status = 'READY_FOR_PICKUP' THEN update and return the current state if the transition is a no-op. Not an error. A no-op.

A background sweeper runs every minute scanning for orders stuck in a state longer than expected. An order RESTAURANT_ACCEPTED for 45 minutes with no PREPARING update triggers an alert. Out of scope for the interview, but worth a sentence.


Dispatch: The Hard Problem Made Tractable

Dispatch is where most candidates go vague. Break it into two concrete steps and the interviewer will like you.

Step 1: Find nearby available drivers.

Drivers continuously write their GPS coordinates to Redis via GEOADD. When an order is ready for dispatch, the Dispatch Service calls GEOSEARCH to get all drivers within 5 km of the restaurant.

GEOSEARCH drivers_geo FROMMEMBER restaurant:{id} BYRADIUS 5 km ASC COUNT 10

This returns up to 10 nearest drivers in under 1ms. Redis stores coordinates as geohash internally, so the query is a range scan on a sorted set.

Step 2: Score and assign.

From the candidate set, the Dispatch Service scores each driver by estimated drive time to restaurant, acceptance rate, and whether they already have a delivery in progress (batching). It picks the top candidate and sends a push notification via the Notification Service.

If the driver does not accept within 30 seconds, the service retries with the next candidate. This "offer timeout and retry" loop is the core of the dispatch algorithm.

Dispatch flow: Kafka consumer receives order.created, Dispatch Service queries Redis GEOSEARCH, scores candidates, acquires Redis SETNX lock, sends push notification to Driver 1 with 30-second timeout, on timeout releases lock and retries next candidate, on accept writes assignment to PostgreSQL

The SETNX lock is the part people forget. Without it, two orders race to grab the same driver and you have a very confused dasher.

Assignment must be atomic. Two orders could race to grab the same driver. Use a Redis SETNX lock on driver_lock:{driver_id} with a short TTL (60 seconds). If the lock fails, skip that driver and try the next candidate. Release the lock once assignment is committed to PostgreSQL.


The Moving Pin: Four Hops, Under Five Seconds

Customers want to see a moving pin. Here is how you build that.

The driver app sends a GPS ping (lat, lng, heading, timestamp) every 4 seconds. The Location Service receives it, writes the latest position to Redis, and publishes a driver.location_updated event to Kafka partitioned by order_id. Partition by order ID, not driver ID, so the consumer always sees updates for a single delivery in order.

A stream processor picks up the event, calculates an updated ETA using a routing API, and publishes a delivery.eta_updated event. The Notification Service consumes this and pushes it to the customer's WebSocket connection.

Location update pipeline: Driver App sends GPS ping every 4 seconds, Location Service writes to Redis and publishes to Kafka partitioned by order_id, ETA Consumer recalculates arrival time, Notification Service pushes via WebSocket to Customer App, total latency under 5 seconds with annotated budget per hop

Partition by order_id, not driver_id. A driver could have multiple deliveries and you want each customer's updates to stay in order.

End-to-end latency target: under 5 seconds. The system can tolerate occasional 10-second gaps without user impact. This is why you picked eventual consistency for location to begin with.

Why WebSocket over polling? At 25,000 location pings per second, if every customer app polled every 4 seconds you would multiply that write load by the number of customers actively watching deliveries. Push inverts the fan-out: one event from one driver goes to one customer via one WebSocket. See the push vs. pull tradeoff breakdown here for when that math flips.


The Two Storage Splits You Must Explain

Interviewers are listening for how you think about consistency. Do not make them ask. State it clearly.

Orders table: PostgreSQL. You need atomic transactions (charge card + create order), strong reads (no phantom "order confirmed" before payment succeeds), and foreign key integrity. Consistency trumps everything. At 150 peak writes/second, a single PostgreSQL primary handles this comfortably for years.

Driver locations: Redis. Eventual consistency is fine. A 5-second-stale driver position does not break correctness. You get sub-millisecond writes, automatic TTL expiry for disconnected drivers, and the built-in GEOSEARCH command. No joins. No transactions.

Menu catalog: PostgreSQL + Redis cache. Reads are 1,000x more frequent than writes. Write to PostgreSQL, invalidate the cache on update. If a menu item price is 5 minutes stale, the discrepancy surfaces at checkout, where the Order Service reads the authoritative price from PostgreSQL before charging. This is a deliberate product decision, not an architecture mistake.


How to Spend 45 Minutes Without Wasting Them

Most candidates run out of time mid-diagram. Follow this breakdown and you will always finish with tradeoffs on the table.

TimeWhat to cover
0-3 minRequirements, constraints, non-functional targets
3-7 minScale estimation (orders/sec, location writes/sec)
7-15 minHigh-level architecture sketch, five services
15-22 minData model (orders, order_items, Redis for location)
22-32 minDeep dive: order state machine + dispatch
32-40 minDeep dive: real-time location tracking pipeline
40-45 minTradeoffs, bottlenecks, what you would do next

Spend the last five minutes being proactive about what breaks. "The dispatch loop is currently single-region. To geo-shard by city, I would partition Kafka by city code and run separate Dispatch Service instances per region, each with their own Redis cluster." That kind of answer shows you think in production, not just in diagrams.


Three Tradeoffs to Bring Up Unprompted

Geohash vs. H3. Redis uses geohash internally. H3 (Uber's hexagonal grid) gives more uniform neighbor distances, which matters for ETA accuracy. DoorDash uses H3 for ETA modeling. For the interview, geohash plus Redis is the practical answer. Mention H3 as the production upgrade.

Cassandra vs. PostgreSQL for orders. Cassandra scales writes horizontally but gives up transactions. For food delivery, a double-charge is a customer service nightmare that costs real money. PostgreSQL with horizontal read replicas is the pragmatic choice at DoorDash's scale. The write path (150 orders/sec) is trivially handled by a single primary.

Driver batching. Assigning one driver to two nearby orders improves efficiency but complicates the state machine significantly. Mention it as a V2 feature: the dispatch algorithm scores in-progress drivers higher if their current dropoff is near the new pickup.


What This Interview Actually Tests

This question tests whether you can spot three independent complexity axes: transactional correctness for orders, geospatial indexing for dispatch, and high-throughput streaming for location updates. Candidates who treat the whole system as one homogeneous database do not pass.

The signal is your ability to choose different consistency models for different parts of the system and justify why. Not knowing the exact Redis command name is fine. Not knowing why driver locations should not live in the same PostgreSQL table as orders is not.

If you want to practice narrating a design like this under real interview pressure, SpaceComplexity runs voice-based mock system design interviews with rubric feedback. The feedback tells you whether your consistency argument landed, not just whether you drew the right boxes.


Recap

  • Split requirements early: browse is AP, ordering is CP. Every storage decision follows.
  • Scale estimation exposes the real bottleneck: 25,000 location writes/sec, not 150 order writes/sec.
  • Order table in PostgreSQL. Driver locations in Redis with geospatial index. Menu reads through cache.
  • Order state machine: enumerate every state and actor that triggers each transition. Idempotent transitions required.
  • Dispatch: Redis GEOSEARCH for candidates, score and offer sequentially, Redis SETNX lock to prevent double-assignment.
  • Location pipeline: driver app → Location Service → Redis + Kafka → ETA consumer → WebSocket push.
  • Discuss geohash vs H3, Cassandra vs PostgreSQL, and driver batching as tradeoffs in the final five minutes.

Further Reading