Job Scheduler System Design: Cron Expressions Are Easy. Coordination Isn't.

June 11, 2026 · 12 min read
TL;DR
  • At-least-once vs exactly-once is the first clarification to nail: exactly-once requires a transactional outbox, not just an atomic dispatch.
  • Partial index on next_run_at is load-bearing. Without it, every scheduler tick becomes a full table scan across millions of rows.
  • SELECT FOR UPDATE SKIP LOCKED is the Postgres-native fix for double-dispatch: multiple schedulers each atomically claim different job batches without Redis or leader election.
  • Thundering herd at midnight: batch with LIMIT 100, add 0-10s jitter to next_run_at, and give high-frequency jobs their own priority lane.
  • Missed execution policy is a per-job product decision: skip, fire-once within a catch-up window, or fire-all for financial audit trails.
  • Distributed coordination is the interview's hard part. An engineer who says SKIP LOCKED and explains the transaction boundary signals production experience.

You use cron every day. You know 0 2 * * * means 2 AM. What you probably haven't thought about is how to design the thing that reads that expression and makes work happen, reliably, across dozens of machines, at exactly the right time.

That's the job scheduler system design interview. Spoiler: the cron parsing is the easy part. You'll spend five minutes on it and thirty minutes explaining why running two schedulers without coordination means your user's credit card gets charged twice.

This is how to walk through it without that outcome.


Clarify Before You Draw Anything

Five minutes of requirements saves fifteen minutes of redesign. Ask:

  • What types of jobs? HTTP callbacks (call a URL), message queue events (push to Kafka), or arbitrary code execution?
  • What scale? Millions of jobs or thousands? Sub-second precision or minute-level?
  • How reliable? At-least-once (simpler) or exactly-once (much harder)?
  • Recurring only, or one-off too? "Send this email in 30 minutes" is a different problem from "run this every Monday at 9 AM."
  • Timezone support? This sounds like a detail. It is not. Daylight Saving will not respect your users. Plan accordingly.

A reasonable target for the interview: 10 million registered jobs, 10,000 dispatches per second, precision to the minute, at-least-once delivery. Pin these numbers early. They'll drive every tradeoff you make for the rest of the session.


Job Scheduler System Design: The Five Components

At the simplest level, a job scheduler is five components.

API Service  →  Job Store (Postgres)  ←  Scheduler  →  Message Queue  →  Workers

[Figure: five-component architecture. API Service, Job Store, Scheduler, Message Queue, and Worker Pool connected in sequence.]

The scheduler reads jobs from Postgres, dispatches to the queue, and updates next_run_at. Workers are stateless and scale out horizontally.

  1. API Service. CRUD for job definitions. Accepts cron expressions, validates them, persists to the job store.
  2. Job Store. A Postgres table holding every registered job, its cron expression, and (critically) its next_run_at timestamp.
  3. Scheduler. Polls the job store every few seconds. Grabs any job where next_run_at <= now(). Dispatches it, then updates next_run_at to the following fire time.
  4. Message Queue. Kafka or SQS. The scheduler writes a "fire this job" message; workers consume it.
  5. Worker Pool. Stateless processes that read from the queue, execute the job, and write the result back.

This is the answer you give at the 10-minute mark. It's correct. The next 30 minutes are about why it's harder than it looks.


Two Tables Do All the Work

Two tables carry everything.

    CREATE TABLE jobs (
        id            UUID PRIMARY KEY,
        name          TEXT NOT NULL,
        cron_expr     TEXT NOT NULL,          -- "0 2 * * *"
        callback      JSONB NOT NULL,         -- { type: "http", url: "...", method: "POST" }
        timezone      TEXT DEFAULT 'UTC',
        missed_policy TEXT DEFAULT 'skip',    -- skip | fire_once | fire_all
        enabled       BOOLEAN DEFAULT true,
        next_run_at   TIMESTAMPTZ NOT NULL,
        created_at    TIMESTAMPTZ DEFAULT now()
    );

    CREATE INDEX idx_jobs_next_run ON jobs (next_run_at) WHERE enabled = true;

    CREATE TABLE job_runs (
        id           UUID PRIMARY KEY,
        job_id       UUID REFERENCES jobs(id),
        scheduled_at TIMESTAMPTZ NOT NULL,
        started_at   TIMESTAMPTZ,
        completed_at TIMESTAMPTZ,
        status       TEXT,                    -- pending | running | success | failed
        error        TEXT
    );

The partial index on next_run_at is load-bearing. Every scheduler tick is a WHERE next_run_at <= now() AND enabled = true scan. Without that index, it degrades from O(log n) to a full table scan across 10 million rows on every tick.

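Store all timestamps in UTC and convert at display time. If you store local time instead, daylight saving transitions bite twice a year: in US time zones, a 1:30 AM job runs twice on the November fall-back night, and a 2:30 AM job never runs on the March spring-forward night because that wall-clock time does not exist. Your on-call engineer will not thank you.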
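To see why local wall-clock times are unsafe as stored values, here is a minimal sketch using Python's standard zoneinfo module. The helper name utc_instants and the America/New_York zone are illustrative assumptions, not part of the design above. It lists every UTC instant at which a given wall-clock time occurs: normally one, two on a fall-back night, none inside a spring-forward gap.

```python
from datetime import datetime, timezone
from typing import List
from zoneinfo import ZoneInfo

NY = ZoneInfo("America/New_York")  # example zone, chosen only for illustration


def utc_instants(local_naive: datetime, tz: ZoneInfo) -> List[datetime]:
    """Return every UTC instant at which the wall clock in tz reads local_naive."""
    instants: List[datetime] = []
    for fold in (0, 1):  # fold selects the first or second occurrence of a repeated time
        candidate = local_naive.replace(tzinfo=tz, fold=fold)
        as_utc = candidate.astimezone(timezone.utc)
        # A wall time inside a spring-forward gap does not survive the round trip.
        round_trip = as_utc.astimezone(tz).replace(tzinfo=None)
        if round_trip == local_naive and as_utc not in instants:
            instants.append(as_utc)
    return instants


ordinary = utc_instants(datetime(2026, 6, 11, 2, 0), NY)    # one instant
repeated = utc_instants(datetime(2026, 11, 1, 1, 30), NY)   # fall-back night
missing = utc_instants(datetime(2026, 3, 8, 2, 30), NY)     # spring-forward night
```

Storing next_run_at as a single UTC instant sidesteps the ambiguity: the job's timezone is only consulted when the next fire time is computed.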


Don't Write Your Own Parser

The format has forked. Classic Unix cron has five fields (minute, hour, day, month, weekday); Quartz Scheduler added a seconds field and a year field; AWS EventBridge uses its own dialect. Use an established library (cron-parser in Node, croniter in Python, cron in Go) and store the expression as-is. When computing next_run_at, pass it through the parser with the job's timezone.

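One edge case that comes up: February 29th in an annual cron (0 0 29 2 *). Standard cron semantics match only dates that exist, so most parsers fire it in leap years only and do not roll it over to February 28th or March 1st. That can leave next_run_at years in the future. Know this when the interviewer asks about edge cases.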


The Hard Part: Coordinating a Distributed Job Scheduler

One scheduler works fine until you need high availability. Two schedulers both polling the same job store will both see job_id=42 with next_run_at = 09:00:00 and both dispatch it. Your user's API gets called twice. Congratulations, you've just sent two invoices for the same order. Customer support is about to have a very good morning.

This is the core distributed systems problem. You have three options.

Option 1: Leader election. One scheduler is the leader and does all work. The others watch it via ZooKeeper or etcd. If the leader dies, a follower takes over within seconds. Simple to reason about. The failure window between leader death and new leader election means some jobs might fire late.

Option 2: Distributed lock per job. Before dispatching job_id=42, acquire a Redis lock with key job:42:run:2026-06-11T09:00:00. Set expiry to one minute. Whoever gets the lock dispatches; the others skip. This scales better than leader election but adds Redis as a dependency. You need to tune TTLs carefully so the lock doesn't expire before the job dispatches. See distributed lock design for the full Redis locking pattern.
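A minimal sketch of the lock acquisition in Option 2, assuming a redis-py style client whose set method accepts nx and ex keyword arguments. The helper names lock_key and try_claim and the owner argument are hypothetical. The key embeds the scheduled slot, so each fire time gets its own lock.

```python
from datetime import datetime, timezone


def lock_key(job_id, scheduled_at: datetime) -> str:
    """Build the per-run key, e.g. job:42:run:2026-06-11T09:00:00 (slot in UTC)."""
    # scheduled_at must be timezone-aware; it is normalized to UTC first.
    slot = scheduled_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")
    return f"job:{job_id}:run:{slot}"


def try_claim(client, job_id, scheduled_at: datetime, owner: str,
              ttl_seconds: int = 60) -> bool:
    """Return True if this scheduler won the right to dispatch this run."""
    # SET key value NX EX ttl: succeeds only if the key does not exist yet,
    # and expires on its own so a crashed scheduler cannot hold it forever.
    return bool(client.set(lock_key(job_id, scheduled_at), owner,
                           nx=True, ex=ttl_seconds))
```

Only the scheduler for which try_claim returns True dispatches; the others move on to the next job.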

Option 3: SELECT FOR UPDATE SKIP LOCKED. This is the Postgres-native answer that interviewers remember.

    BEGIN;

    SELECT id, cron_expr, callback, timezone
    FROM jobs
    WHERE next_run_at <= now() AND enabled = true
    ORDER BY next_run_at
    LIMIT 100
    FOR UPDATE SKIP LOCKED;

    -- dispatch these jobs, update next_run_at for each

    COMMIT;

SKIP LOCKED tells Postgres: if another transaction already locked this row, skip it rather than waiting. Multiple schedulers run simultaneously, each atomically claiming a different batch of 100 jobs. No Redis required. The database's own row-level locking is the coordination mechanism.

[Figure: SELECT FOR UPDATE SKIP LOCKED. Scheduler A locks job_42; Scheduler B sees it locked and skips to job_43 instead.]

Scheduler A and Scheduler B each claim a disjoint batch. No row is dispatched twice. No extra infrastructure needed.
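One way the scheduler side of this might look in application code, as a sketch rather than a definitive implementation. It assumes a DB-API connection with psycopg-style %s placeholders; publish and next_run are hypothetical callables standing in for the queue producer and the cron library.

```python
from datetime import datetime
from typing import Any, Callable

CLAIM_SQL = """
    SELECT id, cron_expr, callback, timezone
    FROM jobs
    WHERE next_run_at <= now() AND enabled = true
    ORDER BY next_run_at
    LIMIT %s
    FOR UPDATE SKIP LOCKED
"""
UPDATE_SQL = "UPDATE jobs SET next_run_at = %s WHERE id = %s"


def claim_and_dispatch(conn: Any,
                       publish: Callable[[Any, Any], None],
                       next_run: Callable[[str, str], datetime],
                       batch_size: int = 100) -> int:
    """Claim up to batch_size due jobs, dispatch them, advance next_run_at.

    DB-API connections open a transaction implicitly, so the row locks taken
    by the SELECT are held until commit() and other schedulers skip those rows.
    """
    cur = conn.cursor()
    try:
        cur.execute(CLAIM_SQL, (batch_size,))
        rows = cur.fetchall()
        for job_id, cron_expr, callback, tz in rows:
            # Dispatch happens before commit: a crash after this line rolls
            # back the update below, so the job fires again (at-least-once).
            publish(job_id, callback)
            cur.execute(UPDATE_SQL, (next_run(cron_expr, tz), job_id))
        conn.commit()
        return len(rows)
    except Exception:
        conn.rollback()  # releases the row locks so another scheduler can retry
        raise
```

A scheduler process would call claim_and_dispatch in a loop until it returns 0, then sleep until the next tick.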

The downside: it requires a database that supports SKIP LOCKED, such as Postgres 9.5+ or MySQL 8.0+. If your store doesn't have it, you need Option 1 or 2. In the interview, Option 3 is the most precise answer if you can explain the semantics. Mention Option 1 as the simpler fallback for non-Postgres stacks.


The Thundering Herd at Midnight

Ask yourself: how many of your users schedule daily jobs at midnight? A lot. Every single person who ever touched a cron job has written 0 0 * * * at least once. It's the "Hello World" of scheduling. It is also a distributed systems fire drill. When your scheduler ticks at 00:00:00 UTC, it might find 100,000 jobs all due at the same moment.

00:00:00 UTC scheduler tick  →  finds 100,000 jobs with next_run_at = 00:00:00  →  fires 100,000 queue messages  →  Kafka throughput spike  →  worker pool saturates  →  job_runs insert storm

[Figure: the thundering herd at midnight. 100K jobs fire in one tick, causing a massive queue spike; jitter spreads the load.]

Without jitter, midnight looks like a DDoS of your own infrastructure. With jitter, the spike flattens and nobody gets paged.

Three mitigations:

Batch and paginate. The LIMIT 100 in the query above is intentional. Claim 100 jobs per transaction, dispatch them, and loop until nothing is due. You drain the backlog of due jobs in controlled bursts rather than one explosion.

Add jitter. When computing next_run_at, add a random offset of 0-10 seconds. A job nominally at midnight becomes anywhere from 00:00:00 to 00:00:10. This spreads spike load without violating the cron contract. No user expects millisecond precision on a cron job.

Separate priority lanes. High-frequency jobs (every minute) and other latency-sensitive jobs get their own queue and worker pool. That way a 100,000-job backlog from midnight doesn't delay the 00:05 payroll run that would otherwise sit behind it.
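The jitter mitigation can be sketched in a few lines. The function name and the injectable rng parameter are assumptions for illustration; the 0-10 second window comes from the text above.

```python
import random
from datetime import datetime, timedelta, timezone
from typing import Optional


def jittered_next_run(nominal: datetime,
                      max_jitter_seconds: float = 10.0,
                      rng: Optional[random.Random] = None) -> datetime:
    """Return the value to store in next_run_at for a nominal fire time."""
    rng = rng or random.Random()
    # Only ever push the run later, never earlier than the cron slot.
    return nominal + timedelta(seconds=rng.uniform(0.0, max_jitter_seconds))


midnight = datetime(2026, 6, 12, 0, 0, 0, tzinfo=timezone.utc)
stored = jittered_next_run(midnight, rng=random.Random(7))
```

One design point worth stating in the interview: compute the following fire time from the nominal cron slot, not from the jittered value, so the offsets do not accumulate from run to run.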


Three Policies for Missed Executions

Your scheduler service restarts at 2:03 AM. Not at 2:00. Not at a convenient time. 2:03. Three jobs were due between 2:00 and 2:03. Welcome to production. What do you do?

This is a product question dressed up as a systems question. The job_runs table is what makes missed-execution recovery possible. On startup, the scheduler queries for jobs that have no run record for their last scheduled slot.

The behavior depends on missed_policy:

  • Skip. Set next_run_at to the next future fire time. Correct for "send a daily digest." A missed run is just gone.
  • Fire once. If the job is within a configurable catch-up window (say 15 minutes), fire it immediately and mark it late. Correct for billing jobs where the run must happen, just not catastrophically delayed.
  • Fire all. Execute every missed run in order. Correct for financial audit trails where every invocation must be recorded. Warning: this can generate a backlog storm after extended downtime. Always rate-limit the catch-up path.

[Figure: missed execution policies. skip drops missed runs, fire_once executes one late run, fire_all queues all missed runs in order.]

Store the policy per job. Different jobs in the same system have completely different semantics about what "missed" means.
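The three policies reduce to a small decision function. This is a sketch under assumptions: runs_to_fire is a hypothetical helper, missed is the list of scheduled slots with no job_runs record (oldest first), and the 15-minute default mirrors the example window above.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def runs_to_fire(policy: str,
                 missed: List[datetime],
                 now: datetime,
                 catch_up_window: timedelta = timedelta(minutes=15)) -> List[datetime]:
    """Return the missed slots that should be executed on recovery."""
    if not missed or policy == "skip":
        return []  # a missed run is just gone
    if policy == "fire_once":
        latest = missed[-1]
        # Fire a single late run, but only if it is still inside the window.
        return [latest] if now - latest <= catch_up_window else []
    if policy == "fire_all":
        return list(missed)  # caller should rate-limit this path
    raise ValueError(f"unknown missed_policy: {policy!r}")


now = datetime(2026, 6, 11, 2, 3, tzinfo=timezone.utc)
slots = [now - timedelta(minutes=m) for m in (3, 2, 1)]  # due at 2:00, 2:01, 2:02
```

Whatever the policy returns, next_run_at still advances to the next future fire time afterwards.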


Scaling the Workers

Workers are stateless and scale horizontally. The message queue is the buffer between scheduling rate and execution rate.

A few things to pin down:

  • Partition Kafka by hash(job_id) % num_partitions. Runs of the same job stay on the same partition, so per-job history is ordered without locking. If you need per-customer ordering, key by a customer ID instead.
  • Set a per-job timeout. A job that hangs indefinitely blocks a worker thread. Workers must enforce their own deadline and write a failed status record when it expires.
  • Retries go back on the queue with exponential backoff plus a dead-letter queue for jobs that exhaust retries. You want humans paged on DLQ growth, not on individual failures.
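Two of the points above, partition choice and retry backoff, can be sketched as follows. The function names, the defaults (1 second base, 5 minute cap, 5 attempts), and the choice of "full jitter" backoff are illustrative assumptions. In practice a Kafka producer hashes the message key itself, so passing job_id as the key is usually enough.

```python
import random
import uuid
from typing import Optional


def partition_for(job_id: uuid.UUID, num_partitions: int) -> int:
    """Map a job to a partition so all runs of that job stay in order."""
    # UUID.int is the 128-bit integer value, stable across processes and restarts.
    return job_id.int % num_partitions


def retry_delay(attempt: int,
                base_seconds: float = 1.0,
                cap_seconds: float = 300.0,
                max_attempts: int = 5,
                rng: Optional[random.Random] = None) -> Optional[float]:
    """Seconds to wait before retry number `attempt` (1-based).

    Returns None once retries are exhausted, meaning: send to the dead-letter queue.
    """
    if attempt > max_attempts:
        return None
    rng = rng or random.Random()
    ceiling = min(cap_seconds, base_seconds * 2 ** (attempt - 1))
    # Randomize within the exponential ceiling so failed jobs do not retry in lockstep.
    return rng.uniform(0.0, ceiling)
```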

At 10,000 dispatches per second, 50ms average job duration, and 100 worker threads per machine: you need roughly 10,000 * 0.05 / 100 = 5 machines. Start with 10 to absorb spikes.
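The estimate above is Little's law (jobs in flight = arrival rate × time per job) divided by threads per machine. A small helper, with a hypothetical name, makes it easy to rerun with different numbers during the interview:

```python
import math


def machines_needed(dispatches_per_second: int,
                    avg_job_ms: int,
                    threads_per_machine: int) -> int:
    """Minimum machines to keep up with steady-state load, before headroom."""
    # Little's law: concurrent jobs = arrival rate * average time in the system.
    in_flight = dispatches_per_second * avg_job_ms / 1000
    return math.ceil(in_flight / threads_per_machine)


baseline = machines_needed(10_000, 50, 100)  # the numbers from the text
```

Doubling the baseline, as suggested above, covers spikes and the loss of a machine or two.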


Tradeoffs to Know Cold

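At-least-once vs exactly-once. The SELECT FOR UPDATE approach gives at-least-once if you dispatch before committing: a crash between the dispatch and the COMMIT rolls back the next_run_at update, and the job fires again. True exactly-once requires a transactional outbox: write the "dispatch this" message in the same Postgres transaction as the next_run_at update, then have a separate relay process read the outbox and publish to Kafka. The relay itself can still publish a message twice, so workers deduplicate on (job_id, scheduled_at) to close the loop. More moving parts but a real guarantee.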

HTTP callbacks vs queue messages. HTTP is simpler (call a URL, check the status code) but your scheduler must handle timeouts, retries, and auth. Queue messages push execution logic into the consumer. Prefer queue messages for internal jobs, HTTP for third-party integrations.

Push vs pull for dispatch. The architecture above uses a queue (scheduler pushes, workers pull). An alternative is workers polling the job store directly. Direct polling is simpler to set up but worse under load, because every added worker is another client querying the job store. The queue gives you backpressure and decouples scheduling latency from execution latency. At 10,000 dispatches per second, you want the queue.

For a deeper look at how this problem relates to distributed task schedulers with dependency graphs and priority queues, see the post on distributed task schedulers, which covers the Celery/Temporal-style patterns.


The 45-Minute Clock

  • 0-5 min. Clarify requirements. At-least-once or exactly-once? Scale? Job types?
  • 5-15 min. Five-box architecture. Data model with the partial index.
  • 15-30 min. Deep dive on distributed coordination. Spend real time on SELECT FOR UPDATE SKIP LOCKED.
  • 30-40 min. Thundering herd, missed executions, worker scaling.
  • 40-45 min. Tradeoffs. Lead with at-least-once vs exactly-once and push vs pull.

If the interviewer cuts you off early, make sure the distributed coordination question is answered. That's what separates a pass from a strong hire on this problem. An engineer who says "run multiple schedulers" without addressing double-dispatch has missed the point. An engineer who says SKIP LOCKED and explains the transaction boundary has demonstrated they've thought about this in production.

The interviewer asking about SKIP LOCKED isn't being cruel. They're checking if you've actually shipped one of these and had to deal with the mess when two schedulers both woke up at the same time.

If you want to practice this kind of walkthrough out loud under real time pressure, SpaceComplexity runs voice-based mock system design interviews with rubric-based feedback on communication, trade-off reasoning, and problem-solving. The job scheduler is one of the problems in rotation.


Recap

  • Clarify at-least-once vs exactly-once and job types before you draw anything.
  • Five components: API, job store, scheduler, queue, workers. Partial index on next_run_at is load-bearing.
  • Use a parser library for cron expressions. Store timestamps in UTC. DST is a trap.
  • Distributed coordination: leader election (simple), Redis lock (flexible), SELECT FOR UPDATE SKIP LOCKED (Postgres-native and precise).
  • Thundering herd: batch and paginate, add jitter to next_run_at, separate priority lanes.
  • Missed executions are a product decision. Support skip, fire-once, and fire-all as a per-job policy field.
  • Exactly-once requires transactional outbox, not just atomic dispatch.

Further reading