"Is This Thing On?" — Building a Heartbeat System from Scratch

April 23, 2026 · 9 min read
Tags: distributed-systems, failure-detection, heartbeats
TL;DR
  • A heartbeat is just a node periodically telling a monitor "I'm alive." The monitor records the arrival time and derives liveness (now - last < timeout) instead of storing an isActive flag — one source of truth, no consistency bug.
  • Keep detection state in-memory. After a restart it rebuilds within one interval; a database just adds latency, write load, and a new failure dependency. Emit events to logs/a time-series DB for history.
  • Trust the receiver's clock, not the sender's — receiving the message is itself proof of life, and it sidesteps clock skew. Use epoch timestamps everywhere.
  • Start the timeout at ~10× the heartbeat interval, then tune it by weighing the cost of a false positive against the cost of slow detection for your system.
  • Run nodes as independent processes (so one can die alone), give each a stable ID, and detect-and-notify — let an orchestrator do the responding.

Part 1 of 4 in the Heartbeats & Failure Detection series:

  1. Is This Thing On? — Building a Heartbeat System from Scratch (you are here)
  2. The Trade-off Maze — Push vs Pull, and Why "Is It Dead?" Is Harder Than It Sounds
  3. Going Decentralized — Gossip, Rumors, and the SWIM Protocol
  4. Getting Smart — Phi Accrual and the End of Hardcoded Timeouts

You have a server. It's running. Probably. How do you actually know?

Turns out, the simplest answer is also the oldest trick in distributed systems: just have the thing tell you it's alive. Periodically. Over and over. Like a needy friend who texts "u up?" every second. That's a heartbeat.

The Setup

Two components. A node that says "I'm alive" every second, and a monitor that listens and keeps track. The node sends an HTTP POST, the monitor records when it heard from it. That's a push-based heartbeat — the node does the work, the monitor just sits there.

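// node: push a heartbeat every second
setInterval(() => sendHeartbeat(nodeId), 1000);

// monitor: record when we heard from each node (assumes an Express-style app)
const monitorMap = new Map<string, number>();

app.post("/", (req, res) => {
  const { message, id } = req.body;
  if (message === "alive" && id) {
    monitorMap.set(id, Date.now()); // arrival time, taken from the monitor's clock
    res.json({ message: "ack" });
  } else {
    // reject malformed heartbeats instead of leaving the request hanging
    res.status(400).json({ message: "invalid heartbeat" });
  }
});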

Forty lines of code. And yet every design decision behind those forty lines opens a rabbit hole.

"Should I Store isActive?" — No.

My first instinct was to store { lastTimestamp: 1234567890, isActive: true }. Two fields. Nice and explicit.

Bad idea. Now you have two sources of truth. What happens if the timestamp says the node hasn't checked in for 30 seconds but isActive still says true? Which one do you believe? You've created a consistency bug in a system that's supposed to detect failures.

Instead, store only the timestamp and derive liveness:

`${nodeKey}: ${timestamp > Date.now() - TIMEOUT ? "Alive" : "Dead"}`

One source of truth. If the last heartbeat was within the timeout window, the node is alive. Done. No boolean to get out of sync.
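Pulled out into a function, the same check might look like the sketch below. The names TIMEOUT_MS, isAlive, and statusReport are mine, not from the original code, and `now` is passed in so the logic can be tested without touching the real clock.

```typescript
// Hypothetical names: TIMEOUT_MS, isAlive, and statusReport are illustrative.
const TIMEOUT_MS = 10_000;

// A node is alive if its last heartbeat arrived within the timeout window.
function isAlive(lastSeen: number, now: number, timeoutMs: number = TIMEOUT_MS): boolean {
  return now - lastSeen < timeoutMs;
}

// Build one status line per node from the timestamp map alone.
function statusReport(lastSeenById: Map<string, number>, now: number): string[] {
  const lines: string[] = [];
  lastSeenById.forEach((lastSeen, id) => {
    lines.push(`${id}: ${isAlive(lastSeen, now) ? "Alive" : "Dead"}`);
  });
  return lines;
}

const seen = new Map<string, number>([
  ["node-1", 99_000], // heard from 1 second ago
  ["node-2", 70_000], // silent for 30 seconds
]);
console.log(statusReport(seen, 100_000)); // [ 'node-1: Alive', 'node-2: Dead' ]
```

There is no field to update when a node goes quiet. The answer changes on its own as `now` moves forward.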

Where Do You Store This Map?

In-memory Map? Redis? A database? If the monitor crashes, you lose everything. Surely persistence is better?

Well, think about what "everything" is. The heartbeat interval is 1 second. If the monitor restarts, every alive node will send a fresh heartbeat within one second. The map rebuilds itself before you've finished pouring your coffee.

Meanwhile, the database costs you:

  • A network hop on every heartbeat (latency)
  • 1000 writes/sec at 1000 nodes — for data that's stale in a second
  • A new failure dependency — if the DB goes down, your monitoring system is now the thing that's broken

In-memory for detection. But what about "what went down at 3 AM last Tuesday?" That's a different question. The monitor should emit events — "node-3 died at T=1234567890" — to a logging system or time-series DB. In-memory for the live detection, logs for the forensics. Two different problems, two different storage strategies.
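Here is a rough sketch of what emitting those events could look like. NodeEvent, EventSink, sweep, and reportedDown are hypothetical names, and the JSON-lines sink is just one possible target. Note the small Set: it only remembers which outages have already been announced so that each one is logged once. Liveness itself is still derived from the timestamps.

```typescript
// Hypothetical shapes: NodeEvent, EventSink, and sweep are illustrative names.
type NodeEvent = { type: "node-down" | "node-up"; id: string; at: number };
type EventSink = (event: NodeEvent) => void;

// Nodes already reported as down, so an outage is logged once rather than on every sweep.
const reportedDown = new Set<string>();

function sweep(
  lastSeenById: Map<string, number>,
  now: number,
  timeoutMs: number,
  emit: EventSink
): void {
  lastSeenById.forEach((lastSeen, id) => {
    const alive = now - lastSeen < timeoutMs;
    if (!alive && !reportedDown.has(id)) {
      reportedDown.add(id);
      emit({ type: "node-down", id, at: now });
    } else if (alive && reportedDown.has(id)) {
      reportedDown.delete(id);
      emit({ type: "node-up", id, at: now });
    }
  });
}

// Example sink: one JSON line per event, which a log shipper could pick up.
const logSink: EventSink = (event) => console.log(JSON.stringify(event));
```

In the monitor this would run on a timer, for example once per heartbeat interval, with Date.now() as `now` and logSink (or a time-series client) as the sink.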

The only time "just rebuild from the next heartbeat" breaks down is if your interval is really large. Nodes checking in every 5 minutes? That's a 5-minute blind spot after a restart. But if the interval is small enough, the state is essentially disposable.

Whose Clock Do You Trust?

I almost made a mistake here. "The node should send its own timestamp — that's more accurate since it captures when the heartbeat was actually sent, not when it arrived."

Sounds precise. Sounds wrong.

The fact that the monitor received the message is itself proof of liveness. If the node claims it sent a heartbeat at T=10 but the monitor gets it at T=15, what does T=10 tell you? The node might have crashed at T=11. The monitor only knows: "I heard from you at T=15." That's what matters.

And then there's the real killer: clock skew. Node clocks drift. One node thinks it's 3:00:00 PM, another thinks it's 2:59:47 PM. If your failure detection relies on timestamps from N machines with N clocks, a single misconfigured clock breaks the whole thing.

With one monitor recording Date.now() on receipt, you have one clock. One source of truth.
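A small sketch of that rule, assuming a payload that happens to carry a sender timestamp. The Heartbeat type, the sentAt field, and recordHeartbeat are made up for illustration. The point is that the sender's timestamp never reaches the map.

```typescript
// Hypothetical payload shape; sentAt stands in for a timestamp the node might include.
type Heartbeat = { message: string; id: string; sentAt?: number };

const lastSeenById = new Map<string, number>();

// Record the arrival time from the monitor's own clock.
// receivedAt is injected to keep this testable; in the server it would be Date.now().
function recordHeartbeat(hb: Heartbeat, receivedAt: number): boolean {
  if (hb.message !== "alive" || !hb.id) return false;
  lastSeenById.set(hb.id, receivedAt); // hb.sentAt is deliberately ignored
  return true;
}

// A node whose clock runs 13 seconds behind still gets the monitor's time.
const monitorNow = 1_700_000_000_000;
recordHeartbeat({ message: "alive", id: "node-1", sentAt: monitorNow - 13_000 }, monitorNow);
console.log(lastSeenById.get("node-1") === monitorNow); // true
```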

(At scale with multiple monitors, this gets harder — NTP, vector clocks, logical clocks. But that's a future chapter's problem.)

Oh, and use epoch timestamps everywhere. Sidesteps timezone nonsense entirely.

The Timeout: 10 Seconds, But Why?

The heartbeat interval is 1 second. The timeout is 10 seconds. Why 10?

If the timeout is 3 seconds, the node can only miss 2 heartbeats before being declared dead. One GC pause, one burst of network congestion, one moment of CPU saturation — congratulations, you just triggered an unnecessary failover. Which is often more disruptive than the failure you thought you detected.

If the timeout is 30 seconds, a genuinely dead node sits there for half a minute while requests pile up on a corpse.

Healthy nodes miss heartbeats all the time. Network congestion, GC pauses, CPU saturation, OS scheduling delays, packet loss. The question is how many missed heartbeats you can tolerate before deciding something is actually wrong.

The rule of thumb: 10x the heartbeat interval. The node can miss 9 heartbeats — enough to ride out transient hiccups without panicking, while still detecting real failures within 10 seconds.
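The arithmetic is simple enough to write down. This is a back-of-the-envelope sketch with hypothetical helper names, and it ignores network delay and how often the monitor actually checks the map.

```typescript
// Hypothetical helper names, for illustration only.

// Consecutive heartbeats a node can miss before the timeout expires.
function toleratedMisses(intervalMs: number, timeoutMs: number): number {
  return Math.max(0, Math.ceil(timeoutMs / intervalMs) - 1);
}

// Rough time from crash to "declared dead".
// Best case: the node dies just before its next heartbeat was due.
// Worst case: it dies right after a heartbeat was received.
function detectionWindow(intervalMs: number, timeoutMs: number): { minMs: number; maxMs: number } {
  return { minMs: Math.max(0, timeoutMs - intervalMs), maxMs: timeoutMs };
}

console.log(toleratedMisses(1000, 3000));  // 2
console.log(toleratedMisses(1000, 10000)); // 9
console.log(detectionWindow(1000, 10000)); // { minMs: 9000, maxMs: 10000 }
```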

But 10x isn't a law. It depends on what you're building. Payment processing? A false positive mid-transaction could be catastrophic — lean longer. Live video streaming? 10 seconds of undetected failure means users staring at a spinner — lean shorter.

The "right" timeout is the one where the cost of a false positive roughly balances the cost of slow detection for your system.

Who Watches the Watcher?

What if the monitor goes down?

First thought: put the data in Redis so it survives. But if the monitor is down, who's reading Redis? Who's acting on it? A Redis full of timestamps with nothing checking them is just a warm cache of irrelevance.

Second thought: a monitor that watches the monitor. But then who watches that? Turtles all the way down.

The actual answer: the monitors watch each other. Form a cluster of peers that monitor each other's health. No hierarchy, no infinite regress. We'll build this later.

Some terminology: active-passive (hot standby) means one handles traffic, a backup waits. Active-active means all handle traffic simultaneously. Failover is the switch from failed primary to backup. And the key question with any failover: who decides the primary is down, and who reroutes? Same detection problem, one level up.

Scaling to Multiple Nodes

One node works. What about five? The monitor side is boring — the Map already handles multiple keys. The problems are on the node side.

The identity problem: each node needs a unique ID. Node picks its own UUID? Simple, but no trust boundary. Monitor assigns? Needs registration. From config/environment? Simple and deterministic. We went with config. The security question is real, but it's a different chapter's problem.
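A minimal sketch of the config route, assuming the ID arrives in an environment variable. The variable name NODE_ID and the helper resolveNodeId are my own choices for illustration. The environment is passed in as a plain object so the function stays testable.

```typescript
// Hypothetical: the variable name NODE_ID and this helper are illustrative choices.
type Env = Record<string, string | undefined>;

function resolveNodeId(env: Env): string {
  const id = env["NODE_ID"]?.trim();
  if (!id) {
    // Fail fast: the monitor keys its map by ID, so a node without one cannot be tracked.
    throw new Error("NODE_ID is not set; refusing to start without a stable identity");
  }
  return id;
}

console.log(resolveNodeId({ NODE_ID: "node-3" })); // node-3
```

In a Node.js process the call would be resolveNodeId(process.env), with whatever launches the node setting the variable.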

Independent processes, not a loop: my first attempt was a for loop spawning 5 setInterval calls in one process. All five "nodes" die together. That's not distributed, that's just concurrent. The fix: child_process.fork(). Each node is its own OS process. Kill one, the others keep running. That's the whole point.

Detection vs response: when the monitor spots a dead node, should it spin up a replacement? No. The monitor detects. An orchestrator responds. In AWS terms, health checks detect, Auto Scaling Groups respond. In Kubernetes, roughly speaking, the kubelet's probes detect and the control plane's controllers respond. Detect and notify, don't detect and act.

Per-Node Timeouts

A payment processor and a log aggregator are both sending heartbeats. Should they share the same timeout? No. The payment processor needs a longer leash — false positives are catastrophic. The log aggregator can tolerate a shorter one.

Use a Map for per-node timeout overrides, with a fallback to the default. And the timeout config lives on the monitor, not the node. If a node could set its own timeout, a misbehaving node could say "my timeout is 24 hours" and effectively hide itself from failure detection.

One small thing: use ?? (nullish coalescing) instead of || for the fallback. || falls back on any falsy value, including 0, so an explicit override of 0 would be silently replaced by the default. ?? only falls back on null/undefined. Minor, but good habit.
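Put together, the override lookup could look something like this. The node IDs, the timeout values, and the idea of a 0 override for a draining node are all invented for the example.

```typescript
const DEFAULT_TIMEOUT_MS = 10_000;

// Overrides live on the monitor. IDs and values here are made up for illustration.
const timeoutOverrides = new Map<string, number>([
  ["payments-1", 30_000], // longer leash: false positives are costly
  ["logs-1", 5_000],      // shorter: slow detection costs more than a false alarm
]);

function timeoutFor(id: string): number {
  // ?? falls back only when there is no override (undefined), so an explicit 0 is kept.
  return timeoutOverrides.get(id) ?? DEFAULT_TIMEOUT_MS;
}

// Hypothetical: an override of 0 means "always report this node as dead", e.g. while draining.
timeoutOverrides.set("drained-1", 0);
const override = timeoutOverrides.get("drained-1");
console.log(override ?? DEFAULT_TIMEOUT_MS); // 0
console.log(override || DEFAULT_TIMEOUT_MS); // 10000, the override is lost
```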

What We Built

Maybe 60 lines of code across three files. But the design decisions behind those lines:

  • Derive, don't store. One source of truth beats two that can disagree.
  • In-memory for detection, logs for history. Separate concerns.
  • Trust the receiver's clock. Clock skew is real.
  • 10x timeout as a starting point. Adjust based on your system's cost structure.
  • Push for simplicity. The node does the work.
  • Independent processes for independent failures. Can't kill one without killing all? Not distributed.
  • Detect and notify, don't detect and act. The monitor finds problems. Something else fixes them.

Next up: what happens when push doesn't scale, when failure detection gets it wrong, and why "is it dead?" is a harder question than it sounds.


Next → The Trade-off Maze — Push vs Pull, and Why "Is It Dead?" Is Harder Than It Sounds