Design a Webhooks Delivery System: The 45-Minute Interview Walkthrough

June 11, 2026 · 11 min read
interview-prep, career, system-design, algorithms
TL;DR
  • At-least-once delivery is the only guarantee you can make over HTTP; exactly-once is impossible because a lost acknowledgment is indistinguishable from a lost request
  • Partition the delivery queue by subscription_id to isolate failing endpoints; one subscriber's outage cannot block deliveries to anyone else
  • Exponential backoff with jitter is not optional: synchronized retries from a mass failure create a thundering herd that triggers the next failure wave
  • Separate delivery_tasks from delivery_attempts and use a leased_until column for crash recovery instead of a distributed lock
  • HMAC-SHA256 sign every payload with a Unix timestamp in the signed content; subscribers verify with constant-time comparison to block replay and timing attacks
  • Per-subscription circuit breakers prevent retry budget waste; classify failures before retrying (5xx = retry, 4xx = dead letter queue immediately)

Sending an HTTP POST is trivial. Building a system that reliably sends millions of them to URLs you don't control, with guaranteed delivery, exponential backoff, and cryptographic verification, is the actual problem. That gap is why the webhooks system design interview runs deeper than it looks. It seems simple until you ask what happens when the subscriber's endpoint goes down for an hour. Spoiler: the answer is not "it's fine."

The Clock Is Already Running

Spend your time roughly like this:

  • 0-5 min: Clarify scope and constraints
  • 5-15 min: High-level architecture
  • 15-25 min: Data model and API
  • 25-40 min: Deep dives on delivery guarantees, fan-out, and security
  • 40-45 min: Trade-offs and scaling

The clock forces prioritization. Most candidates over-explain the API and run out of time before reaching delivery guarantees, which is where the real complexity lives. It's like spending your whole hiking trip putting on sunscreen and then never leaving the parking lot.

Clarify Before You Draw

Four scope questions that change the design:

Who produces events? Internal services (your own backend) make this simpler. A public event ingestion API adds authentication and validation concerns, plus a whole lot of "what if someone just POSTs garbage at us."

What delivery guarantee do you need? Exactly-once delivery over HTTP is impossible. A sender cannot distinguish "the 200 response was lost in transit" from "the subscriber never received the request." Saying this out loud early signals that you understand distributed systems, and not just that you read the buzzwords once.

What's the retry window? Stripe retries for 3 days with roughly 16 attempts. GitHub doesn't retry automatically at all. Your answer should be a deliberate design decision, not a guess you deliver with confident eyebrows.

What's the fan-out factor? One event going to one subscriber is trivial. One event going to 10,000 subscribers is a fundamentally different architecture problem.

A reasonable baseline scope: internal event producers, 10M events per day, up to 100 subscribers per event type, a 72-hour retry window, at-least-once delivery.

Five Components, in Order

The event queue decouples event production from delivery. Your internal services write events and return immediately. The dispatcher reads those events asynchronously, looks up matching subscriptions, and fans out to the delivery queue.

The delivery queue is partitioned by subscription_id. This is the key structural decision for avoiding head-of-line blocking: one subscriber's failing endpoint cannot slow down any other subscriber's deliveries.

Delivery workers pull tasks from the delivery queue, fire HTTP POSTs with a strict timeout (5 seconds), record the result, and either complete the task or re-enqueue it with backoff.

[Figure: Five-box webhook delivery architecture: internal services through event queue, dispatcher, delivery queue, and delivery workers, with retry loop and dead letter queue branch.]
The full delivery pipeline. Events flow down. Failures flow right into the retry queue, and then (after N attempts) into the dead letter queue.

Four Tables and Why They're Split

webhook_subscriptions

id              UUID PRIMARY KEY
tenant_id       UUID
endpoint_url    TEXT
event_types     TEXT[]        -- ['payment.completed', 'refund.created']
signing_secret  TEXT          -- per-subscription HMAC key
active          BOOLEAN
created_at      TIMESTAMPTZ

events

id          UUID PRIMARY KEY  -- stable across retries
type        TEXT
payload     JSONB
tenant_id   UUID
created_at  TIMESTAMPTZ

delivery_tasks

id               UUID PRIMARY KEY
event_id         UUID REFERENCES events
subscription_id  UUID REFERENCES webhook_subscriptions
status           ENUM(pending, in_flight, delivered, failed, dlq)
attempt_count    INT DEFAULT 0
next_retry_at    TIMESTAMPTZ
leased_until     TIMESTAMPTZ  -- lease expiry; lets another worker reclaim the task after a crash
created_at       TIMESTAMPTZ

delivery_attempts

id                UUID PRIMARY KEY
task_id           UUID REFERENCES delivery_tasks
http_status_code  INT
response_time_ms  INT
error_message     TEXT
attempted_at      TIMESTAMPTZ

The split between delivery_tasks and delivery_attempts matters. Tasks track the current state of a delivery. Attempts are an append-only audit log. Never overwrite history to record a retry. Your future self debugging a production incident will thank you.

The leased_until column solves worker crash recovery without a distributed lock. When a worker picks up a task, it sets leased_until = now() + 30s. If the worker dies mid-flight, another worker claims the task after the lease expires and retries rather than losing the delivery entirely. No Zookeeper required. No 3am Zookeeper incident required either.
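The lease claim can be sketched as a conditional update. This is a minimal illustration, not a production implementation: it uses SQLite from the Python standard library, stores timestamps as epoch seconds rather than TIMESTAMPTZ, and the function name claim_task is hypothetical. The key idea is that the UPDATE re-checks the lease, so two workers racing for the same row cannot both win.

```python
import sqlite3
from typing import Optional

LEASE_SECONDS = 30  # lease length used in the text above


def claim_task(conn: sqlite3.Connection, now: float) -> Optional[str]:
    """Claim one due task whose lease is absent or expired; return its id."""
    row = conn.execute(
        "SELECT id FROM delivery_tasks "
        "WHERE status IN ('pending', 'in_flight') "
        "AND next_retry_at <= ? "
        "AND (leased_until IS NULL OR leased_until <= ?) "
        "ORDER BY next_retry_at LIMIT 1",
        (now, now),
    ).fetchone()
    if row is None:
        return None
    # Re-check the lease inside the UPDATE: if another worker claimed the
    # row between our SELECT and UPDATE, this matches zero rows.
    cur = conn.execute(
        "UPDATE delivery_tasks SET status = 'in_flight', leased_until = ? "
        "WHERE id = ? AND (leased_until IS NULL OR leased_until <= ?)",
        (now + LEASE_SECONDS, row[0], now),
    )
    conn.commit()
    return row[0] if cur.rowcount == 1 else None
```

On PostgreSQL, a common alternative is SELECT ... FOR UPDATE SKIP LOCKED inside a transaction, which lets many workers claim different rows without blocking each other.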

Five Endpoints, One Easy to Forget

POST   /webhooks/subscriptions
GET    /webhooks/subscriptions
DELETE /webhooks/subscriptions/:id
GET    /webhooks/subscriptions/:id/deliveries
POST   /webhooks/subscriptions/:id/deliveries/:delivery_id/retry

The retry endpoint is easy to forget and important to mention. Subscribers need a way to replay events from the dead letter queue after they fix a broken endpoint. Without it, they have to re-trigger events from their source system, which is often impossible or requires persuading the payments team to do something uncomfortable.

Delivery Guarantees: The Hard Part

You can guarantee at-least-once. You cannot guarantee exactly-once over HTTP. Say this explicitly. Here is why: your delivery worker sends a POST and waits for a 200. If the subscriber processes the request and then crashes before sending the 200, you have no way to distinguish "delivered but acknowledgment lost" from "never delivered." So you retry. The subscriber processes the event twice.

The correct fix lives on the subscriber's side: process events idempotently using the event.id as an idempotency key. Check whether the ID has already been recorded before processing, and record it once processing succeeds, ideally in the same transaction and backed by a unique constraint so two concurrent duplicates cannot both pass the check. This is a standard database operation that the subscriber controls. You cannot enforce it from the sender side, and explaining that is the difference between a candidate who understands distributed systems and one who just read about them.
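A subscriber-side handler might look roughly like the sketch below. It is one possible variation on the check-then-insert idea: the insert itself acts as the check by relying on a primary key, and it runs in the same transaction as the side effect. The table name processed_events, the function handle_event, and the apply_side_effect callback are all hypothetical, and the sketch assumes the side effect writes to the same database.

```python
import sqlite3
from typing import Callable


def handle_event(
    conn: sqlite3.Connection,
    event_id: str,
    apply_side_effect: Callable[[sqlite3.Connection], None],
) -> bool:
    """Process an event at most once; return True if it was processed now."""
    with conn:  # one transaction: commits on success, rolls back on error
        try:
            # The primary key on event_id rejects a second insert.
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)",
                (event_id,),
            )
        except sqlite3.IntegrityError:
            return False  # duplicate delivery: already processed
        # If this raises, the insert above is rolled back too, so a later
        # retry of the same event is processed rather than skipped.
        apply_side_effect(conn)
    return True
```

Because the marker row and the side effect commit together, a crash between them cannot leave the event marked as processed without the work having been done.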

For the retry schedule, use exponential backoff with jitter. A reasonable schedule: 1 min, 5 min, 30 min, 2 hours, 6 hours, 12 hours, 24 hours. After 72 hours, stop retrying. The jitter is not optional. Without it, 10,000 webhooks that all fail during a subscriber outage will all retry at precisely the same moment when the window opens, creating a thundering herd that triggers another failure wave. You fixed the outage and then your own retry logic broke you again.

Adding a random offset of up to 25 percent of the base delay distributes the load across time. It is a small change with a disproportionate effect.

[Figure: Retry backoff comparison: synchronized spikes without jitter creating a thundering herd, versus spread-out load with jitter staying under capacity.]
Without jitter, all 10K webhooks retry at exactly the same instant. With jitter, the load fans out across the window and stays below capacity.
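The retry schedule with jitter could be computed along these lines. This is a sketch under stated assumptions: the base delays are the schedule from the text, the jitter is a uniform random offset of up to 25 percent added on top of the base delay, and the function name next_retry_delay is hypothetical.

```python
import random
from typing import Optional

# Base schedule from the text: 1 min, 5 min, 30 min, 2 h, 6 h, 12 h, 24 h.
BASE_DELAYS_S = [60, 300, 1800, 7200, 21600, 43200, 86400]
JITTER_FRACTION = 0.25  # random offset of up to 25% of the base delay


def next_retry_delay(
    failed_attempts: int, rng: Optional[random.Random] = None
) -> Optional[float]:
    """Seconds to wait before the next retry, or None when retries run out.

    failed_attempts counts deliveries that have already failed (1 means
    the first retry is being scheduled).
    """
    if failed_attempts < 1:
        raise ValueError("failed_attempts must be at least 1")
    if failed_attempts > len(BASE_DELAYS_S):
        return None  # schedule exhausted: move the task to the DLQ
    rng = rng or random
    base = BASE_DELAYS_S[failed_attempts - 1]
    return base + rng.uniform(0, JITTER_FRACTION * base)
```

A worker would set next_retry_at to the current time plus this delay, so tasks that failed in the same second no longer come due in the same second.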

Classify failures before deciding to retry. A 5xx response or a network timeout is transient: retry. A 400, 401, or 404 is permanent: skip retries and move directly to the dead letter queue. The usual 4xx exceptions are 408 and 429, which signal a timeout or a rate limit and are worth retrying. Retrying a 401 sixteen times over three days wastes resources and signals you haven't thought about failure modes. The endpoint is telling you something. Listen.
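The classification rule can be written as a small pure function. In this sketch, the Action enum and classify function are hypothetical names, a status of None stands for a timeout or connection error, and two choices are assumptions of the sketch rather than rules from the text: 408 and 429 are treated as transient, and 3xx responses are treated as permanent.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    DONE = "done"
    RETRY = "retry"
    DEAD_LETTER = "dead_letter"


# Assumption of this sketch: these 4xx codes signal a transient condition.
RETRYABLE_4XX = {408, 429}


def classify(status: Optional[int]) -> Action:
    """Map an HTTP status (None for a timeout or network error) to an action."""
    if status is None:
        return Action.RETRY  # no response at all: transient
    if 200 <= status < 300:
        return Action.DONE
    if status >= 500 or status in RETRYABLE_4XX:
        return Action.RETRY
    # Remaining 3xx and 4xx: the request itself is wrong, retrying won't help.
    return Action.DEAD_LETTER
```

Keeping this logic in one function makes the retry policy easy to test and easy to explain when the interviewer asks about a specific status code.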

Fan-Out Will Kill Your Dispatcher

If one event triggers 10,000 subscriptions, do not create 10,000 delivery tasks synchronously in the dispatcher. Create them in a background fan-out worker that batches inserts of 1,000 at a time. Otherwise the dispatcher becomes the bottleneck for every event, including the quiet ones.
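A batched fan-out could be sketched as follows. The insert_many callback is a hypothetical stand-in for whatever bulk-insert call the storage layer provides (for example a multi-row INSERT), and the function names are illustrative only.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fan_out(
    event_id: str,
    subscription_ids: Sequence[str],
    insert_many: Callable[[List[Tuple[str, str]]], None],
    batch_size: int = 1000,
) -> int:
    """Create one delivery task per subscription, one batch per insert call."""
    created = 0
    for chunk in chunked(subscription_ids, batch_size):
        # Each row is (event_id, subscription_id); the payload stays in events.
        insert_many([(event_id, sub_id) for sub_id in chunk])
        created += len(chunk)
    return created
```

With 10,000 subscriptions and the default batch size, this issues 10 insert calls instead of 10,000.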

Store only the event_id in delivery queue tasks, not the full payload. If 10,000 subscribers receive a 2 KB event and you embed the payload in each task, you're writing 20 MB to the queue per event. Store the event once in the events table, reference it by ID in each task, and let the delivery worker fetch it on demand. One extra DB read per delivery. Worth it.

Head-of-line blocking is the most underrated problem in webhook systems. If all subscribers share one delivery queue, a subscriber whose endpoint consistently times out at 5 seconds ties up delivery workers and adds latency for everyone else. Partition the delivery queue by subscription_id so each subscription has its own isolated logical queue. One subscriber's failures are fully contained.

[Figure: Head-of-line blocking: shared queue where subscriber A's timeouts block subscriber B, versus per-subscription partitioned queues with full isolation.]
Shared queue: subscriber A's broken endpoint holds up subscriber B. Partitioned queues: A's failure is contained to A's lane.
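When the queue has a fixed number of partitions, routing by subscription_id usually means hashing the ID. The sketch below assumes a hypothetical partition_for helper and a fixed partition count. It uses SHA-256 rather than Python's built-in hash(), because hash() on strings is salted per process and would route the same subscription differently after a restart.

```python
import hashlib


def partition_for(subscription_id: str, num_partitions: int) -> int:
    """Map a subscription to a stable partition index in [0, num_partitions)."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    digest = hashlib.sha256(subscription_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce modulo N.
    return int.from_bytes(digest[:8], "big") % num_partitions
```

One caveat worth stating: with fewer partitions than subscriptions, several subscriptions share a partition, so isolation is only as fine-grained as the partition count.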

Pair per-subscription queues with a per-subscription circuit breaker. After 5 failures in 10 seconds, pause delivery to that endpoint for 60 seconds. Probe with one request. Success: resume. Failure: extend the pause. The circuit breaker pattern prevents a bad endpoint from burning through your retry budget all at once.
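The breaker described above might be sketched like this. It is a single-threaded illustration with hypothetical names, it takes the current time as an argument so it can be tested without sleeping, and it interprets "extend the pause" as starting a fresh pause of the same length after a failed probe.

```python
from collections import deque
from typing import Deque, Optional


class CircuitBreaker:
    """Per-subscription breaker: opens after `threshold` failures in `window_s`."""

    def __init__(self, threshold: int = 5, window_s: float = 10.0,
                 pause_s: float = 60.0) -> None:
        self.threshold = threshold
        self.window_s = window_s
        self.pause_s = pause_s
        self.failures: Deque[float] = deque()  # timestamps of recent failures
        self.open_until: Optional[float] = None  # None means closed
        self.probe_in_flight = False

    def allow(self, now: float) -> bool:
        """Return True if a delivery may be attempted at time `now`."""
        if self.open_until is None:
            return True
        if now < self.open_until or self.probe_in_flight:
            return False
        self.probe_in_flight = True  # pause elapsed: let one probe through
        return True

    def record_success(self) -> None:
        self.failures.clear()
        self.open_until = None
        self.probe_in_flight = False

    def record_failure(self, now: float) -> None:
        if self.probe_in_flight:
            # The probe failed: start another pause.
            self.probe_in_flight = False
            self.open_until = now + self.pause_s
            return
        self.failures.append(now)
        while self.failures and now - self.failures[0] > self.window_s:
            self.failures.popleft()
        if len(self.failures) >= self.threshold:
            self.open_until = now + self.pause_s
            self.failures.clear()
```

A delivery worker would call allow() before each POST and report the outcome with record_success() or record_failure(), leaving skipped tasks in the queue with a later next_retry_at.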

Security: Sign Every Payload

Subscribers need to verify that incoming requests came from your system and not from an attacker who discovered their endpoint URL. And yes, people do discover endpoint URLs. Especially when they're /webhooks/123 and you're incrementing IDs.

Sign every payload with HMAC-SHA256 and include a timestamp. Stripe's format is the canonical example:

X-Stripe-Signature: t=1492774577,v1=5257a869e7ecebeda32affa62cdca3fa51cad7e77a0e56ff536d0ce8e108d8bd

The signed content is {timestamp}.{raw_body}. Including the timestamp in the signature prevents replay attacks. An attacker who captures a valid webhook cannot replay it an hour later because the subscriber rejects any request where abs(now - t) > 300 seconds.

Two implementation details subscribers often get wrong. First, compute the signature over the raw request body before any JSON parsing, because re-serialized JSON may differ in key ordering or whitespace. Second, use constant-time comparison functions to prevent timing attacks: hmac.compare_digest in Python, crypto.timingSafeEqual in Node.js. Regular string comparison leaks information through response timing. Small detail, real vulnerability.
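Signing and verification can be sketched with the Python standard library. This mirrors the t=...,v1=... header layout shown above but is an illustration, not Stripe's library: the function names sign and verify are hypothetical, and the 300-second tolerance is the value used in the text.

```python
import hashlib
import hmac
import time
from typing import Optional

TOLERANCE_S = 300  # maximum accepted clock difference, per the text


def sign(secret: bytes, raw_body: bytes, timestamp: int) -> str:
    """Build a header value of the form t=<unix_ts>,v1=<hex_hmac>."""
    signed = str(timestamp).encode("ascii") + b"." + raw_body
    digest = hmac.new(secret, signed, hashlib.sha256).hexdigest()
    return f"t={timestamp},v1={digest}"


def verify(secret: bytes, raw_body: bytes, header: str,
           now: Optional[float] = None) -> bool:
    """Check the timestamp window and the signature over the raw body."""
    try:
        parts = dict(item.split("=", 1) for item in header.split(","))
        timestamp = int(parts["t"])
        received = parts["v1"]
    except (KeyError, ValueError):
        return False  # malformed header
    now = time.time() if now is None else now
    if abs(now - timestamp) > TOLERANCE_S:
        return False  # too old or too far in the future: possible replay
    signed = str(timestamp).encode("ascii") + b"." + raw_body
    expected = hmac.new(secret, signed, hashlib.sha256).hexdigest()
    # Constant-time comparison; bytes avoid errors on non-ASCII input.
    return hmac.compare_digest(expected.encode(), received.encode("utf-8"))
```

The subscriber must pass the request bytes exactly as received. Parsing the JSON and serializing it again before verification is the most common reason valid signatures fail.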

Give each subscription its own unique signing secret. If one subscriber's secret leaks, you rotate it in isolation without affecting other subscriptions.

Three Bottlenecks Worth Naming

At 10M events per day, that's roughly 115 events per second, with peaks potentially 10x higher during incidents or launches. The queue absorbs this because you scale workers, not the dispatcher.
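The back-of-envelope numbers can be checked with a few lines of arithmetic. The sketch below uses the baseline scope from earlier (10M events per day, up to 100 subscribers per event type) and the 10x peak factor; the function name capacity is hypothetical. The prose rounds 115.7 events per second down to 115.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def capacity(events_per_day: int, subscribers_per_event: int,
             peak_multiplier: int) -> dict:
    """Average and peak rates for events and delivery-task writes."""
    avg_events = events_per_day / SECONDS_PER_DAY
    avg_task_writes = avg_events * subscribers_per_event
    return {
        "avg_events_per_s": avg_events,
        "avg_task_writes_per_s": avg_task_writes,
        "peak_events_per_s": avg_events * peak_multiplier,
        "peak_task_writes_per_s": avg_task_writes * peak_multiplier,
    }


estimate = capacity(10_000_000, 100, 10)
# avg_events_per_s is about 115.7; avg_task_writes_per_s is about 11,574.
```

Stating the peak delivery-task write rate out loud, roughly ten times the average, is a quick way to justify batched inserts and worker autoscaling.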

Three bottlenecks to name in the interview:

Dispatcher fan-out. At 115 events per second with 100 subscribers each, you need 11,500 database writes per second for delivery tasks. Use batched inserts and partition the subscriptions lookup by event_type so the dispatcher does not scan the full table per event. A full-table scan at 11,500 writes per second is how you find out what your database's ceiling is.

Delivery workers. Each worker spends most of its time waiting on network I/O. Use asynchronous I/O (aiohttp in Python, goroutines in Go, Vert.x on the JVM) to issue many concurrent requests per worker process. A single async worker can handle 100 to 500 concurrent HTTP requests depending on your timeout settings.

The delivery_tasks table. The most contended table in the system. Index on (status, next_retry_at) for the retry poller. Index on subscription_id for delivery history queries. Archive delivered tasks older than 30 days to cold storage. The message queue vs pub-sub choice matters here: Kafka is natural because it's already partitioned, supports consumer groups, and handles replay. SQS with FIFO queues per subscription works at lower scale.

The Trade-offs You Need to Name

Decision | Trade-off
At-least-once delivery | Simpler to implement, but requires subscribers to handle duplicates
Per-subscription queues | Full isolation, but more queue partitions to manage
Jitter in backoff | Prevents thundering herd, at the cost of unpredictable timing
Event ID in queue, not payload | Lower memory, but adds one DB read per delivery
Synchronous fan-out | Simple, but dispatcher blocks at high subscriber counts

What to Cover in 45 Minutes

  • Decouple event production from delivery with a durable queue.
  • Partition the delivery queue by subscription_id to prevent head-of-line blocking.
  • Use exponential backoff with jitter. Classify failures before retrying.
  • Keep delivery tasks and delivery attempts in separate tables; leases replace distributed locks for worker crash recovery.
  • Sign every payload with HMAC-SHA256, include a Unix timestamp, verify with constant-time comparison.
  • Apply per-subscription circuit breakers to avoid burning your retry budget on dead endpoints.
  • After N retries, move to a dead letter queue and expose a replay API.

The gap between reading this design and explaining it clearly under a 45-minute clock is bigger than it looks. You need to make these decisions out loud, in order, while an interviewer asks follow-up questions designed to find the edges of your understanding. SpaceComplexity runs voice-based system design mock interviews that ask follow-up questions in real time and grade your answers on a rubric, so you find out what you actually understand versus what you think you understand.
