The Trade-off Maze — Push vs Pull, and Why "Is It Dead?" Is Harder Than It Sounds

Part 2 of 4 in the Heartbeats & Failure Detection series:

Is This Thing On? — Building a Heartbeat System from Scratch
The Trade-off Maze — Push vs Pull, and Why "Is It Dead?" Is Harder Than It Sounds (you are here)
Going Decentralized — Gossip, Rumors, and the SWIM Protocol
Getting Smart — Phi Accrual and the End of Hardcoded Timeouts

Last time I built a basic heartbeat system — nodes push "I'm alive" to a monitor, the monitor derives liveness from timestamps. Simple. Works. But simple systems break at scale, and this one breaks in interesting ways.

Push vs Pull: Who's In Control?

With 1000 nodes pushing heartbeats every second, that's 1000 requests/sec hitting a single monitor. And here's the fun part: they might all fire at the exact same millisecond. Burst of traffic, then silence, then another burst. This is called a thundering herd, and your monitor has zero control over it — the nodes decide when to send.

My first thought: put a queue in front of the monitor. Buffer the bursts.

Bad idea. A heartbeat says "I'm alive right now." If it sits in a queue for 8 seconds and your timeout is 10 seconds, you've got 2 seconds of margin before a healthy node gets declared dead, and the signal is already 8 seconds old by the time you process it. Queuing works for emails and order processing, where a few seconds of delay doesn't change the meaning of the message. Not for liveness signals.
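Here's that arithmetic as a rough sketch. The `Heartbeat` class and both helper functions are names I made up for illustration, and I'm assuming the sender's clock and the monitor's clock agree, which real systems can't take for granted.

```python
from dataclasses import dataclass


@dataclass
class Heartbeat:
    node_id: str
    sent_at: float  # seconds; assumes sender and monitor clocks are in sync


def age_when_processed(hb, processed_at):
    """How old the "I'm alive" claim is by the time the monitor reads it."""
    return processed_at - hb.sent_at


def margin_left(hb, processed_at, timeout_s):
    """Slack remaining before this node would be declared dead."""
    return timeout_s - age_when_processed(hb, processed_at)


hb = Heartbeat("node-1", sent_at=100.0)
age = age_when_processed(hb, processed_at=108.0)               # 8.0 seconds old
margin = margin_left(hb, processed_at=108.0, timeout_s=10.0)   # 2.0 seconds of slack
```

The queue doesn't lose the heartbeat, it just spends most of the timeout budget before the monitor ever looks at it.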

The real question isn't "how do I handle the burst?" — it's "should the nodes be driving this at all?"

Pull-based monitoring flips the model. Instead of nodes pushing to the monitor, the monitor reaches out to each node: "are you alive?" Now the monitor controls the timing. It can space checks evenly — no thundering herd, predictable load.
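One way to picture "space checks evenly" is to give each node a fixed offset inside the polling interval. This is only a sketch: `staggered_offsets` is a hypothetical helper, not something from the previous post, and a real monitor would also have to deal with slow probes and nodes joining mid-round.

```python
def staggered_offsets(node_ids, interval_s):
    """Spread one probe per node evenly across a polling interval.

    Returns a dict mapping node id -> offset in seconds from the start
    of each round at which the monitor probes that node.
    """
    if not node_ids:
        return {}
    step = interval_s / len(node_ids)
    return {node_id: i * step for i, node_id in enumerate(node_ids)}


offsets = staggered_offsets(["node-1", "node-2", "node-3", "node-4"], interval_s=1.0)
# node-1 is probed at 0.0 s, node-2 at 0.25 s, node-3 at 0.5 s, node-4 at 0.75 s
```

With 1000 nodes and a one-second interval, the same function schedules one probe per millisecond instead of 1000 at once.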

But pull has its own cost: the monitor needs to know about every node upfront. With push, a new node just starts sending and the monitor discovers it. With pull, someone has to register the node first.
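The discovery difference is easy to see in a toy version of each monitor. Both classes below are illustrative sketches with made-up names; the `probe` callable stands in for whatever transport a real pull monitor would use.

```python
class PushMonitor:
    """Learns about nodes the first time they send a heartbeat."""

    def __init__(self):
        self.last_seen = {}  # node id -> timestamp of latest heartbeat

    def on_heartbeat(self, node_id, now):
        self.last_seen[node_id] = now  # unknown nodes are added implicitly


class PullMonitor:
    """Only probes nodes that were registered ahead of time."""

    def __init__(self, probe):
        self.probe = probe  # callable: node id -> bool (hypothetical transport)
        self.registry = set()
        self.last_seen = {}

    def register(self, node_id):
        self.registry.add(node_id)

    def poll_once(self, now):
        for node_id in sorted(self.registry):
            if self.probe(node_id):
                self.last_seen[node_id] = now


push = PushMonitor()
push.on_heartbeat("node-9", now=1.0)  # brand-new node, no setup needed

pull = PullMonitor(probe=lambda node_id: True)
pull.poll_once(now=1.0)   # node-9 is never probed: nobody registered it
pull.register("node-9")
pull.poll_once(now=2.0)   # now it shows up
```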

Neither model is "better":

	Push	Pull
Who controls timing	Node	Monitor
Node discovery	Automatic	Needs a registry
Load pattern	Bursty	Smooth
Monitor complexity	Simple (just listen)	Higher (must poll)

And here's something that doesn't show up in technical comparisons: the organizational dimension.

Push means the monitoring team publishes an API spec. Other teams integrate on their own time. Self-service. No bottleneck. Pull means every new node is a ticket to the monitoring team: "please add us to your polling list." The monitoring team becomes a bottleneck, but they have full control.

This maps to Conway's Law — system architecture mirrors org structure. Autonomous teams lean toward push. Centralized platform teams lean toward pull. The technical trade-off and the organizational trade-off are inseparable.

Gray Failures: The Zombie Problem

A node that's alive but not responding — frozen in a GC pause, stuck in an infinite loop, so overloaded it can't send heartbeats — looks identical to a dead node from the monitor's perspective. No heartbeat within the timeout, therefore dead.

These are called gray failures: technically alive, functionally useless. The monitor can't distinguish why a heartbeat is missing. Dead, frozen, network-partitioned — same signal, or rather, same lack of signal.
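To make the point concrete, here's a tiny sketch of a timeout check fed three different failure causes. The function name and the timestamps are invented for illustration; the only input the monitor has is the time of the last heartbeat.

```python
def monitor_verdict(last_heartbeat_at, now, timeout_s):
    """The only thing a timeout-based monitor can compute."""
    return "dead" if now - last_heartbeat_at > timeout_s else "alive"


# Three very different situations, same last heartbeat at t=100.0
last_heartbeat = {
    "crashed": 100.0,
    "frozen in a GC pause": 100.0,
    "network-partitioned": 100.0,
}

verdicts = {
    cause: monitor_verdict(ts, now=115.0, timeout_s=10.0)
    for cause, ts in last_heartbeat.items()
}
# every cause maps to "dead": the cause never enters the calculation
```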

This matters because the response should be different. You don't want to start a replacement for a node that's merely network-partitioned: the original is still running, so you end up with two instances doing the same job. But the monitor can't tell the difference. All it sees is silence.

False Positives and False Negatives

Two ways failure detection can be wrong:

False positive: healthy node declared dead. The monitor panics, triggers a failover, restarts a perfectly good node, reroutes traffic unnecessarily. Annoying, wasteful, but recoverable. It's a fire drill.

False negative: dead node declared alive. Traffic keeps flowing to a black hole. Users see errors. Data might be lost. And nobody's investigating because the monitoring system says everything is fine. That's a silent outage.

False negatives are worse in most systems. A false positive wastes resources. A false negative loses users.

And here's the kicker: the same lever controls both. Shorter timeout → catch dead nodes faster (fewer false negatives) but panic on transient hiccups (more false positives). Longer timeout → patient through hiccups (fewer false positives) but slow to catch real failures (more false negatives).

You cannot minimize both at the same time. It's a dial, not a switch.
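A quick sketch of the dial, using made-up numbers. `evaluate_timeout` is a hypothetical helper and the pause durations are invented; the point is only the direction each number moves as the timeout changes.

```python
def evaluate_timeout(timeout_s, transient_pauses_s):
    """Score one timeout setting against a list of harmless pauses.

    Returns (false_positives, blind_window_s):
      false_positives: healthy nodes whose pause outlasted the timeout
      blind_window_s:  up to this long a really dead node still counts as alive
    """
    false_positives = sum(1 for pause in transient_pauses_s if pause > timeout_s)
    blind_window_s = timeout_s
    return false_positives, blind_window_s


pauses = [0.5, 1.2, 3.0, 4.5, 8.0]  # invented GC pauses / network hiccups, in seconds
results = {t: evaluate_timeout(t, pauses) for t in (2, 5, 10)}
# results == {2: (3, 2), 5: (1, 5), 10: (0, 10)}
```

Going from a 2-second to a 10-second timeout removes all three false alarms, and it also lets a dead node look alive five times longer.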

Breaking the Trade-off: Get a Second Opinion

So you can't tune the timeout to fix both. But what if, before declaring a node dead, you asked someone else: "hey, can you reach this node?"

If the monitor can't reach node-3, it could be because node-3 is dead. Or it could be because the network between the monitor and node-3 is broken. But if the monitor asks node-5 "can you reach node-3?" and node-5 says "yeah, it's fine" — you just avoided a false positive without needing a longer timeout.

This is secondary confirmation — getting a second opinion before acting. And it naturally leads to a bigger idea: what if multiple nodes are all checking each other?
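Sketched in code, a second opinion might look like this. Everything here is an assumption for illustration: `is_dead`, the two probe callables, and the `broken_links` set that fakes the network.

```python
def is_dead(target, direct_probe, peers, indirect_probe):
    """Declare `target` dead only if neither the monitor nor any peer reaches it.

    direct_probe(target) -> bool          monitor's own check (hypothetical)
    indirect_probe(peer, target) -> bool  ask `peer` to check `target` (hypothetical)
    """
    if direct_probe(target):
        return False
    for peer in peers:
        if indirect_probe(peer, target):
            return False  # a peer can reach it, so the problem is on our path
    return True


# Fake network: only the monitor -> node-3 link is broken.
broken_links = {("monitor", "node-3")}


def reachable(src, dst):
    return (src, dst) not in broken_links


verdict = is_dead(
    "node-3",
    direct_probe=lambda target: reachable("monitor", target),
    peers=["node-5"],
    indirect_probe=reachable,
)
# verdict is False: node-5 vouches for node-3
```

The timeout stays the same length. The extra probe only runs when the direct check has already failed.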

Quorum: The Majority Rules

If multiple nodes check each other, disagreements happen. Node-1 says node-3 is dead. Node-5 says node-3 is alive. Who do you believe?

Majority vote. If 3 out of 5 nodes say node-3 is alive, it's alive — even if 2 can't reach it. Those 2 probably have a localized network issue.

This is quorum-based failure detection, and it's one of those ideas that feels obvious in hindsight but solves a genuinely hard problem. A network failure that affects only a minority of observers can't produce a majority vote for a false death. For the quorum to declare a node dead, most observers must have lost contact with it, which usually means the node itself is down or cut off.
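The vote itself is only a few lines. This is a minimal sketch with hypothetical observer names, assuming each observer reports a simple "can reach" or "can't reach" and that a strict majority is required.

```python
def quorum_says_dead(reports):
    """reports maps observer id -> True if that observer can reach the target.

    The target is declared dead only if a strict majority cannot reach it.
    """
    unreachable_votes = sum(1 for can_reach in reports.values() if not can_reach)
    return unreachable_votes > len(reports) / 2


# Five observers reporting on node-3: three can reach it, two cannot.
reports = {
    "node-1": False,
    "node-2": True,
    "node-4": True,
    "node-5": True,
    "node-6": False,
}
dead = quorum_says_dead(reports)
# dead is False: two "can't reach" votes out of five is not a majority
```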

But there's a catch. Every node checking every other node means N × (N - 1) checks per round, which grows as N squared. Five nodes? 20 checks. Fine. A thousand nodes? Nearly a million checks per round. Doesn't scale.
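The numbers come from counting each ordered pair once, since node A checking node B is a separate check from B checking A. A one-line sketch (the function name is mine):

```python
def full_mesh_checks(n):
    """Checks per round when each of n nodes checks every other node."""
    return n * (n - 1)


small = full_mesh_checks(5)      # 20
large = full_mesh_checks(1000)   # 999000
```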

So you need a way to spread information through a cluster without everyone talking to everyone. You need gossip. But that's the next post.

What We Learned

Push vs pull isn't about scale alone — it's about control, discovery, and org structure.
Gray failures are invisible — the monitor can't tell dead from frozen from partitioned.
False negatives are worse than false positives, because a silent outage costs more than a fire drill.
A single timeout can't minimize both — it's a dial between false positives and false negatives.
Secondary confirmation breaks the trade-off — ask someone else before declaring death.
Quorum prevents localized network issues from triggering false alerts — but full-mesh checking doesn't scale.

Next: how gossip protocols spread information like rumors, and why three states (alive, suspect, dead) are better than two.
