The Trade-off Maze — Push vs Pull, and Why "Is It Dead?" Is Harder Than It Sounds

April 23, 2026 · 7 min read
distributed-systems · failure-detection · push-vs-pull · quorum
TL;DR
  • Push (nodes send) is simple and self-discovering but bursty — a thundering herd the monitor can't pace. Pull (monitor polls) smooths load and controls timing but needs a registry of every node. The choice is also organizational (Conway's Law).
  • Don't put a queue in front of heartbeats — a delayed "I'm alive" is a stale lie by the time you read it.
  • Gray failures (frozen, overloaded, network-partitioned) look identical to death: same missing signal, but each needs a different response the monitor can't choose.
  • False negatives (dead node declared alive → silent outage) are worse than false positives (healthy node declared dead → fire drill), and a single timeout dial trades one for the other — you can't minimize both.
  • Break the trade-off with a second opinion: secondary confirmation + quorum (majority vote) stops a localized network glitch from faking a death — but full-mesh checking is O(N²), which sets up the need for gossip.

Part 2 of 4 in the Heartbeats & Failure Detection series:

  1. Is This Thing On? — Building a Heartbeat System from Scratch
  2. The Trade-off Maze — Push vs Pull, and Why "Is It Dead?" Is Harder Than It Sounds (you are here)
  3. Going Decentralized — Gossip, Rumors, and the SWIM Protocol
  4. Getting Smart — Phi Accrual and the End of Hardcoded Timeouts

Last time I built a basic heartbeat system — nodes push "I'm alive" to a monitor, the monitor derives liveness from timestamps. Simple. Works. But simple systems break at scale, and this one breaks in interesting ways.

Push vs Pull: Who's In Control?

With 1000 nodes pushing heartbeats every second, that's 1000 requests/sec hitting a single monitor. And here's the fun part: they might all fire at the exact same millisecond. Burst of traffic, then silence, then another burst. This is called a thundering herd, and your monitor has zero control over it — the nodes decide when to send.

My first thought: put a queue in front of the monitor. Buffer the bursts.

Bad idea. A heartbeat says "I'm alive right now." If it sits in a queue for 8 seconds and your timeout is 10 seconds, you've got 2 seconds of margin left, and the signal is already 8 seconds out of date when you read it. A node that died right after sending would still look alive. Queuing works for emails and order processing, where a few seconds of delay changes nothing. Not for liveness signals.
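To put numbers on that, here's a minimal sketch. It assumes each heartbeat carries its own send timestamp and that the monitor could also stamp it when it finally dequeues it. The `QueuedHeartbeat` class and both helper functions are made up for illustration; they aren't from the previous post.

```python
from dataclasses import dataclass


@dataclass
class QueuedHeartbeat:
    node_id: str
    sent_at: float      # when the node emitted it (seconds)
    dequeued_at: float  # when the monitor finally read it (seconds)


def true_age(hb: QueuedHeartbeat, now: float) -> float:
    """How old the liveness claim really is."""
    return now - hb.sent_at


def apparent_age(hb: QueuedHeartbeat, now: float) -> float:
    """How old it looks if the monitor stamps heartbeats on dequeue."""
    return now - hb.dequeued_at


TIMEOUT_S = 10.0
hb = QueuedHeartbeat("node-3", sent_at=0.0, dequeued_at=8.0)

# At the moment of processing, 8 of the 10 seconds are already spent.
margin = TIMEOUT_S - true_age(hb, now=8.0)
print(margin)                      # 2.0
print(apparent_age(hb, now=8.0))   # 0.0, looks brand new but isn't
```

Stamping on dequeue hides the delay entirely, and stamping on send leaves almost no margin. Either way the queue makes the signal worse.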

The real question isn't "how do I handle the burst?" — it's "should the nodes be driving this at all?"

Pull-based monitoring flips the model. Instead of nodes pushing to the monitor, the monitor reaches out to each node: "are you alive?" Now the monitor controls the timing. It can space checks evenly — no thundering herd, predictable load.
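Here's a rough sketch of what "space checks evenly" could look like. It assumes the monitor already holds a list of node IDs and wants one check per node per interval. `poll_offsets` is a hypothetical helper, not a real library call.

```python
def poll_offsets(node_ids: list[str], interval_s: float) -> dict[str, float]:
    """Give each node an offset so checks are spread evenly across one interval."""
    if not node_ids:
        return {}
    step = interval_s / len(node_ids)
    return {node_id: i * step for i, node_id in enumerate(node_ids)}


registry = [f"node-{i}" for i in range(1, 5)]
offsets = poll_offsets(registry, interval_s=1.0)
print(offsets)
# {'node-1': 0.0, 'node-2': 0.25, 'node-3': 0.5, 'node-4': 0.75}
```

With 1000 nodes and a 1-second interval, that's one check every millisecond instead of 1000 arrivals in the same millisecond.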

But pull has its own cost: the monitor needs to know about every node upfront. With push, a new node just starts sending and the monitor discovers it. With pull, someone has to register the node first.

Neither model is "better":

                    | Push                 | Pull
--------------------+----------------------+--------------------
Who controls timing | Node                 | Monitor
Node discovery      | Automatic            | Needs a registry
Load pattern        | Bursty               | Smooth
Monitor complexity  | Simple (just listen) | Higher (must poll)

And here's something that doesn't show up in technical comparisons: the organizational dimension.

Push means the monitoring team publishes an API spec. Other teams integrate on their own time. Self-service. No bottleneck. Pull means every new node is a ticket to the monitoring team: "please add us to your polling list." The monitoring team becomes a bottleneck, but they have full control.

This maps to Conway's Law — system architecture mirrors org structure. Autonomous teams lean toward push. Centralized platform teams lean toward pull. The technical trade-off and the organizational trade-off are inseparable.

Gray Failures: The Zombie Problem

A node that's alive but not responding — frozen in a GC pause, stuck in an infinite loop, so overloaded it can't send heartbeats — looks identical to a dead node from the monitor's perspective. No heartbeat within the timeout, therefore dead.

These are called gray failures: technically alive, functionally useless. The monitor can't distinguish why a heartbeat is missing. Dead, frozen, network-partitioned — same signal, or rather, same lack of signal.

This matters because the response should be different. You don't restart a node that's just network-partitioned — that creates a duplicate. But the monitor can't tell the difference. All it sees is silence.
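A tiny sketch makes the blindness obvious. The `Cause` enum and `monitor_view` function are illustrative names, and the ground-truth table is exactly the information a real monitor does not have.

```python
from enum import Enum


class Cause(Enum):
    CRASHED = "crashed"
    FROZEN = "frozen"            # e.g. a long GC pause
    PARTITIONED = "partitioned"  # node is fine, the network is not


def monitor_view(last_heartbeat_at: float, now: float, timeout_s: float) -> str:
    """All the monitor can compute: is the last heartbeat recent enough?"""
    return "alive" if now - last_heartbeat_at < timeout_s else "dead"


# Ground truth the monitor never gets to see.
ground_truth = {
    "node-1": Cause.CRASHED,
    "node-2": Cause.FROZEN,
    "node-3": Cause.PARTITIONED,
}

# All three went silent at t=0; the monitor looks at t=15 with a 10 s timeout.
last_seen = {node: 0.0 for node in ground_truth}
verdicts = {
    node: monitor_view(ts, now=15.0, timeout_s=10.0)
    for node, ts in last_seen.items()
}
print(verdicts)  # every node maps to 'dead', whatever the real cause
```

Three different causes, one verdict. Nothing in the monitor's inputs carries the cause, so nothing in its output can.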

False Positives and False Negatives

Two ways failure detection can be wrong:

False positive: healthy node declared dead. The monitor panics, triggers a failover, restarts a perfectly good node, reroutes traffic unnecessarily. Annoying, wasteful, but recoverable. It's a fire drill.

False negative: dead node declared alive. Traffic keeps flowing to a black hole. Users see errors. Data might be lost. And nobody's investigating because the monitoring system says everything is fine. That's a silent outage.

False negatives are worse. Always. A false positive wastes resources. A false negative loses users.

And here's the kicker: the same lever controls both. Shorter timeout → catch dead nodes faster (fewer false negatives) but panic on transient hiccups (more false positives). Longer timeout → patient through hiccups (fewer false positives) but slow to catch real failures (more false negatives).

You cannot minimize both at the same time. It's a dial, not a switch.
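Here's the dial as a small sketch. The heartbeat gaps are invented numbers for a healthy node with two hiccups, and the model is deliberately crude: a gap longer than the timeout counts as a false positive, and a real crash goes unnoticed for one full timeout (the window in which a dead node still counts as alive).

```python
def false_positives(healthy_gaps_s: list[float], timeout_s: float) -> int:
    """Count gaps from a healthy node that this timeout would misread as death."""
    return sum(1 for gap in healthy_gaps_s if gap > timeout_s)


def detection_delay_s(timeout_s: float) -> float:
    """A crashed node is only declared dead once the timeout has elapsed."""
    return timeout_s


# Made-up gaps (seconds) between heartbeats from a healthy node.
gaps = [1.0, 1.1, 0.9, 4.0, 1.0, 6.5, 1.2]

dial = [(t, false_positives(gaps, t), detection_delay_s(t)) for t in (2.0, 5.0, 10.0)]
for timeout_s, fp, delay_s in dial:
    print(f"timeout={timeout_s}s  false_positives={fp}  crash_unnoticed_for={delay_s}s")
```

Under these assumptions, a 2-second timeout raises two false alarms but notices a crash in 2 seconds. A 10-second timeout raises none but leaves a dead node in rotation for 10 seconds. One column only improves when the other gets worse.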

Breaking the Trade-off: Get a Second Opinion

So you can't tune the timeout to fix both. But what if, before declaring a node dead, you asked someone else: "hey, can you reach this node?"

If the monitor can't reach node-3, it could be because node-3 is dead. Or it could be because the network between the monitor and node-3 is broken. But if the monitor asks node-5 "can you reach node-3?" and node-5 says "yeah, it's fine" — you just avoided a false positive without needing a longer timeout.

This is secondary confirmation — getting a second opinion before acting. And it naturally leads to a bigger idea: what if multiple nodes are all checking each other?
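A minimal sketch of that second opinion, assuming some `probe(source, target)` function exists that reports whether one node can reach another. Here it's faked with a lookup table; the names `confirm_dead` and `fake_probe` are hypothetical.

```python
from typing import Callable, Iterable

Probe = Callable[[str, str], bool]  # probe(source, target) -> reachable?


def confirm_dead(target: str, helpers: Iterable[str], probe: Probe) -> bool:
    """True only if none of the helpers can reach the target either."""
    return not any(probe(helper, target) for helper in helpers)


# Assumed scenario: the monitor's link to node-3 is broken, node-5's is fine.
reachable = {("monitor", "node-3"): False, ("node-5", "node-3"): True}


def fake_probe(source: str, target: str) -> bool:
    return reachable.get((source, target), False)


monitor_sees_dead = not fake_probe("monitor", "node-3")
declare_dead = monitor_sees_dead and confirm_dead("node-3", ["node-5"], fake_probe)
print(monitor_sees_dead, declare_dead)  # True False
```

The monitor alone would have declared node-3 dead. With node-5's answer it doesn't. Note that with an empty helper list this sketch falls back to the monitor's own verdict, so the confirmation is only as good as the helpers you can ask.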

Quorum: The Majority Rules

If multiple nodes check each other, disagreements happen. Node-1 says node-3 is dead. Node-5 says node-3 is alive. Who do you believe?

Majority vote. If 3 out of 5 observers say node-3 is alive, it's alive, even if the other 2 can't reach it. Those 2 probably have a localized network issue.

This is quorum-based failure detection, and it's one of those ideas that feels obvious in hindsight but solves a genuinely hard problem. A localized network failure can't cause a majority to agree on a false death. You need a real, widespread failure for the quorum to declare a node dead.
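The vote itself is only a few lines. This sketch assumes each observer reports a simple boolean ("can I reach the target?") and that a tie counts as alive. Both are choices I made for illustration, and `quorum_says_dead` is a made-up name.

```python
def quorum_says_dead(votes: dict[str, bool]) -> bool:
    """votes maps observer -> True if that observer can reach the target.

    The target is declared dead only if a strict majority cannot reach it.
    """
    unreachable = sum(1 for can_reach in votes.values() if not can_reach)
    return unreachable > len(votes) // 2


# Two observers behind a flaky link can't reach node-3; three can.
localized_glitch = {
    "node-1": False, "node-2": False,
    "node-4": True, "node-5": True, "node-6": True,
}

# Nobody can reach node-3 except one observer with a stale view.
real_failure = {
    "node-1": False, "node-2": False,
    "node-4": False, "node-5": False, "node-6": True,
}

print(quorum_says_dead(localized_glitch))  # False
print(quorum_says_dead(real_failure))      # True
```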

But there's a catch. Every node checking every other node means N × (N − 1) checks per round, which grows as N-squared. Five nodes? 20 checks. Fine. A thousand nodes? Nearly a million checks per round. Doesn't scale.
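The arithmetic, as a quick sanity check. I'm counting directed checks (node A checking node B is separate from B checking A), which is what gives 20 for five nodes.

```python
def full_mesh_checks(n: int) -> int:
    """Each of n nodes checks the other n - 1 nodes once per round."""
    return n * (n - 1)


for n in (5, 100, 1000):
    print(n, full_mesh_checks(n))
# 5 20
# 100 9900
# 1000 999000
```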

So you need a way to spread information through a cluster without everyone talking to everyone. You need gossip. But that's the next post.

What We Learned

  • Push vs pull isn't about scale alone — it's about control, discovery, and org structure.
  • Gray failures are invisible — the monitor can't tell dead from frozen from partitioned.
  • False negatives are worse than false positives: a fire drill is annoying, a silent outage loses users.
  • A single timeout can't minimize both — it's a dial between false positives and false negatives.
  • Secondary confirmation breaks the trade-off — ask someone else before declaring death.
  • Quorum prevents localized network issues from triggering false alerts — but full-mesh checking doesn't scale.

Next: how gossip protocols spread information like rumors, and why three states (alive, suspect, dead) are better than two.

