Distributed Task Scheduler System Design: The 45-Minute Interview Walkthrough

May 27, 202612 min read
interview-prepcareerdsaalgorithms
Distributed Task Scheduler System Design: The 45-Minute Interview Walkthrough
TL;DR
  • SELECT FOR UPDATE SKIP LOCKED is the atomic claiming primitive that prevents two schedulers from dispatching the same job simultaneously
  • Leader election via Redis or etcd keeps a single active polling loop, reducing database load without sacrificing high availability
  • Heartbeats every 15 seconds let a timeout sweeper reclaim jobs from crashed workers and re-queue them automatically
  • Jitter on next_run_at spreads recurring jobs across a window so 50,000 cron jobs don't all fire at midnight simultaneously
  • At-least-once semantics with idempotent handlers is the right default; exactly-once requires unique execution IDs and a ON CONFLICT DO NOTHING constraint
  • Pull-based workers with three priority queues prevent slow bulk jobs from blocking critical work at scale
  • A partial index on (next_run_at) WHERE status = 'PENDING' keeps the polling query fast even against billions of historical job rows

You have one database. One cron server. Every night at 2 AM it runs billing jobs for your 10,000 users. Then your startup gets acquired, you now have 10 million users, and the billing job takes six hours and crashes halfway through. Your cron server is a single point of failure. The business is on fire.

That is the distributed task scheduler system design problem. Congratulations, you played yourself.

In a system design interview, this question tests whether you understand the gap between "run this thing on a schedule" and "guarantee this thing runs exactly once across a fleet of 500 workers that can crash at any time." This walkthrough covers requirements, architecture, data model, the scheduling loop, execution semantics, and the bottlenecks that show up at scale.


Obama gives medal to "Successfully built a task scheduler" from "Cron to run the scheduler"

The eternal trap. We have all been here.


Clarify Before You Draw Anything (0-5 min)

The first five minutes separate good answers from great ones. Four questions determine the whole design.

What kinds of jobs? Short-lived HTTP callbacks (under 30 seconds) versus long-running ETL jobs (hours) need fundamentally different heartbeat and timeout strategies.

One-time, recurring, or both? One-time delayed tasks (send this email in 10 minutes) and recurring cron-style jobs (run every day at 9 AM) share infrastructure but have different scheduling logic.

What delivery guarantee? At-most-once (garbage collection jobs, fine if skipped), at-least-once (default for most workloads), or exactly-once (billing, email sends). Exactly-once is the hardest. It is also the distributed systems version of "I'll do it for free this one time." Most interviewers are happy if you name all three and defend a choice.

Scale targets? A reasonable framing: 100 million jobs per day, up to 10 thousand concurrent workers. That is about 1,200 jobs per second on average, with spiky peaks perhaps 10x that.


Five Boxes on the Whiteboard (5-15 min)

Five-box architecture: Client API, Scheduler Service (leader-elected), Job Store (Postgres), Task Queue (Kafka/SQS), Worker Pool with arrows showing data flow and scheduling loop

Every responsible system starts with a diagram and ends with an incident. Here is the diagram.

Each box has exactly one responsibility. Keep it that way.

Client API. REST endpoints to submit, cancel, and query jobs. Thin layer, stateless, horizontally scalable.

Job Store. A relational database (PostgreSQL works well) that holds job definitions and current state. This is the source of truth.

Scheduler Service. The brain. It polls the Job Store for jobs that are due, publishes them to the Task Queue, and manages leader election so only one instance is active at a time.

Task Queue. A message queue (SQS, Kafka, or RabbitMQ) that decouples scheduling from execution. Workers pull from it. The queue absorbs bursts and provides natural backpressure.

Worker Pool. Stateless worker processes that pull jobs from the queue, execute them, and report results. Horizontally scalable. Each worker sends periodic heartbeats to signal liveness.


Two Tables Do Most of the Work (15-20 min)

CREATE TABLE jobs ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), name TEXT NOT NULL, type TEXT NOT NULL, -- 'HTTP', 'GRPC', 'SCRIPT' payload JSONB NOT NULL, -- {url, headers, body} for HTTP schedule TEXT, -- cron expression, NULL for one-time next_run_at TIMESTAMPTZ NOT NULL, -- indexed, the polling cursor status TEXT NOT NULL, -- PENDING, RUNNING, SUCCEEDED, FAILED, CANCELLED max_retries INT NOT NULL DEFAULT 3, retry_count INT NOT NULL DEFAULT 0, timeout_secs INT NOT NULL DEFAULT 30, created_at TIMESTAMPTZ NOT NULL DEFAULT now() ); CREATE INDEX idx_jobs_next_run ON jobs (next_run_at) WHERE status = 'PENDING'; CREATE TABLE job_executions ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), job_id UUID REFERENCES jobs(id), worker_id TEXT NOT NULL, status TEXT NOT NULL, -- CLAIMED, RUNNING, SUCCEEDED, FAILED, TIMED_OUT started_at TIMESTAMPTZ, completed_at TIMESTAMPTZ, error TEXT );

next_run_at is the scheduling cursor. The scheduler polls: SELECT * FROM jobs WHERE next_run_at <= NOW() AND status = 'PENDING' LIMIT 500. After a recurring job completes, the worker computes the next occurrence from the cron expression and updates next_run_at. The partial index on (next_run_at) WHERE status = 'PENDING' keeps that query fast even with billions of historical rows. Without it, that 2 AM poll hits a full table scan and you find out the hard way.


The Scheduling Loop: Where Most Designs Go Wrong (20-30 min)

This is the heart of the interview. The naive approach is obvious. The problem is subtle.

Naive approach. The scheduler polls the jobs table, finds due jobs, dispatches them to workers. Done.

The problem. You need more than one scheduler for high availability. Now two schedulers both find the same due job at the same millisecond. Both dispatch it. The job runs twice. For a billing job that charges a credit card, that is not a fun production incident.

Race condition: two scheduler nodes simultaneously read Job X as PENDING, both publish to the queue, two workers both execute it and both charge the credit card

Two schedulers walk into a database. One customer gets charged twice. Classic bit.

Fix: claim with an atomic update.

UPDATE jobs SET status = 'CLAIMED', claimed_at = NOW() WHERE id = $1 AND status = 'PENDING' AND next_run_at <= NOW() RETURNING *;

Only one writer wins the race. The loser gets zero rows back and moves on. PostgreSQL's SELECT ... FOR UPDATE SKIP LOCKED is even better for batch claiming: workers skip rows already locked by another transaction instead of blocking.

SELECT id, payload, type FROM jobs WHERE next_run_at <= NOW() AND status = 'PENDING' LIMIT 100 FOR UPDATE SKIP LOCKED;

After claiming, the scheduler publishes job IDs to the Task Queue. Workers pull from the queue, look up the job by ID, execute it, and write the result back.

Leader election still helps at scale. Even with atomic claiming, 10 scheduler nodes polling the same table simultaneously creates unnecessary database load. A simple leader election via a distributed lock in Redis or etcd means only the leader runs the polling loop. The others are hot standby.

Leader election diagram: three scheduler nodes competing for a Redis/etcd lock. One wins and becomes ACTIVE LEADER, the other two enter standby and watch the lock.

Scheduler A got the lock. Schedulers B and C are standing by, quietly judging.

If the leader crashes, a new one is elected within seconds (etcd lease expiry defaults to 5 seconds in most setups). Google's SRE book documents exactly this pattern for their distributed cron service.


Execution Semantics: Pick One and Defend It

At-most-once means the scheduler dispatches the job once and forgets. If the worker crashes mid-execution, the job is lost. Acceptable for idempotent housekeeping work. Unacceptable for anything user-facing.

At-least-once means the scheduler retries on failure, so the job runs at least once but possibly more. This is the right default. Workers must be idempotent (running the same job twice has the same effect as running it once). The Dropbox ATF system, which processes around 9,000 tasks per second, uses at-least-once semantics with application-level idempotency.

Exactly-once is never truly free. The closest approximation: give each execution a unique execution_id, pass it through to downstream systems, and use a database unique constraint to reject duplicate side effects.

INSERT INTO payments (execution_id, ...) ON CONFLICT (execution_id) DO NOTHING;

This is the standard pattern for financial operations.

Design for at-least-once and require idempotent job handlers. Document what exactly-once would cost and when it is worth it.


Heartbeats Keep Workers Honest

Long-running jobs need a liveness signal. If a worker claims a job and then crashes, without heartbeats the job is stuck in RUNNING forever. Like that one PR from 2019 still in review.

Workers send a heartbeat every 15 seconds: UPDATE job_executions SET last_heartbeat_at = NOW() WHERE id = $1. A separate timeout sweeper (run by the scheduler leader) fires every minute, scans for executions where last_heartbeat_at < NOW() - INTERVAL '60 seconds', marks them TIMED_OUT, resets the parent job status to PENDING, and re-queues them if the retry count is below the limit.

Heartbeat timeline: Worker A claims job, sends heartbeats at 0s/15s/30s/45s, crashes at 60s. Sweeper fires at 120s, detects stale heartbeat, re-queues the job. Worker B picks it up at 135s.

Worker A did its best. Worker B will finish the job. The sweeper is the unsung hero nobody gives a Slack reaction to.


What the API Looks Like

POST   /v1/jobs            # create a job
GET    /v1/jobs/{id}       # get job status
DELETE /v1/jobs/{id}       # cancel (sets status=CANCELLED if not yet RUNNING)
GET    /v1/jobs/{id}/executions  # execution history

A minimal job payload:

{ "name": "send-invoice", "type": "HTTP", "payload": { "url": "https://internal.example.com/billing/send", "method": "POST", "body": { "user_id": 42 } }, "schedule": "0 9 * * 1", "max_retries": 3, "timeout_seconds": 30 }

One-time jobs use "run_at": "2026-06-01T09:00:00Z" instead of schedule. The scheduler stores whichever was provided and computes next_run_at accordingly.


Three Bottlenecks That Show Up at Scale (30-40 min)

At 1,200 jobs per second average, a single PostgreSQL instance with proper indexing handles the load comfortably. The polling query touches a small hot index partition. Three bottlenecks appear as you push further.

Bottleneck 1: The DB polling wall. At tens of thousands of jobs per second, even SELECT FOR UPDATE SKIP LOCKED saturates the database. Partition the scheduler: assign each shard responsibility for a key range (hash job_id modulo N). Each shard polls only its own slice. Leader election still applies within each shard.

Bottleneck 2: The thundering herd at :00. If 50,000 recurring jobs all have next_run_at = 2026-06-01 00:00:00, the scheduler sees all 50,000 at once at midnight. It is the database equivalent of everyone trying to buy Taylor Swift tickets at the same moment. Add jitter when computing next_run_at for recurring jobs. A random offset up to 30 seconds spreads the burst without meaningfully affecting job timing.

next_run_at = cron_next_occurrence(schedule) + timedelta(seconds=random.randint(0, 30))

Bottleneck 3: Worker queue depth. When workers are slow and the queue grows deep, new jobs wait behind a backlog. Fix: priority lanes. Maintain three queues (high, normal, low). Workers poll high first, fall through to normal, then low. Critical jobs never wait behind bulk exports.

Three priority queues: Scheduler routes jobs to HIGH/NORMAL/LOW based on priority tag. Workers always poll HIGH first, then NORMAL, then LOW. Billing never waits behind a CSV export.

Your 2 AM billing job will not wait behind someone's quarterly analytics export. That is the promise.


What Happens When Things Fail

Scheduler dies. A standby leader takes over within one lease expiry period (5-10 seconds). The new leader immediately polls for overdue jobs. Jobs that should have fired during the gap are dispatched as soon as the leader recovers. Most systems catch up, but only within a configurable lookback window (1 hour is common).

Worker dies mid-execution. The heartbeat timeout sweeper detects the stale heartbeat and re-queues the job. The job runs again on a different worker. This is the at-least-once guarantee in practice.

Message queue loses a message. Use a durable queue with acknowledgment-based deletion. Workers only acknowledge (delete) a message after the execution is recorded in the database as SUCCEEDED. If the worker crashes after completing but before acknowledging, the message is re-delivered and the idempotent handler ignores the duplicate.

Database node failure. Use a primary-replica setup with automatic failover (Postgres with Patroni, or Amazon RDS Multi-AZ). Scheduler nodes reconnect to the new primary. Executions in-flight during failover may time out and get retried. Plan for it.


Tradeoffs Worth Defending (40-45 min)

DecisionOption AOption BWhen to choose A
Scheduler coordinationLeader election (one active)Partitioned shards (N active)Under ~50K jobs/sec
Job storePostgreSQL + SKIP LOCKEDRedis sorted set (scored by time)You need durability and queryability
Delivery guaranteeAt-least-once + idempotent handlersExactly-once via idempotency keysAlmost always at-least-once
Worker modelPull from queuePush from schedulerAlways pull: better load balancing
Missed jobs on restartSkip missed intervalsCatch up within lookback windowCatch up unless jobs are non-idempotent

The most common wrong answer: designing a push-based scheduler. If the scheduler holds a mapping of job to assigned worker and pushes work directly, it becomes a bottleneck and a single point of complexity. A pull-based queue lets workers self-select based on capacity and fails gracefully when workers crash.


The 45-Minute Clock

  • 0-5: Clarify job types, scale, delivery guarantee, one-time vs recurring
  • 5-15: Draw the five-box architecture, name each component
  • 15-20: Walk through the data model, explain next_run_at as the polling cursor
  • 20-30: The scheduling loop, the double-dispatch race, SKIP LOCKED, leader election
  • 30-35: Heartbeats, timeout detection, execution semantics
  • 35-42: Scaling bottlenecks (DB wall, thundering herd, priority queues), fault tolerance scenarios
  • 42-45: Tradeoffs table, open questions for interviewer

Further Reading


If you want to practice walking through this under actual interview pressure, SpaceComplexity runs voice-based mock system design sessions with rubric scoring. Talking through the scheduling loop out loud, under a 45-minute clock, is a different skill from reading about it.

The scheduler is a deceptively simple problem. The hard part is not the data model or the API. The hard part is the scheduling loop: atomic claiming to prevent double execution, heartbeats to detect stale workers, jitter to prevent thundering herds, and a clear-eyed choice of delivery semantics before you draw a single box. Get those four pieces right and the rest follows.


Internal: key-value-store-system-design, distributed-cache-system-design, the-tradeoff-maze