The Trade-off Maze — Push vs Pull, and Why "Is It Dead?" Is Harder Than It Sounds

- Push (nodes send) is simple and self-discovering but bursty — a thundering herd the monitor can't pace. Pull (monitor polls) smooths load and controls timing but needs a registry of every node. The choice is also organizational (Conway's Law).
- Don't put a queue in front of heartbeats — a delayed "I'm alive" is a stale lie by the time you read it.
- Gray failures (frozen, overloaded, network-partitioned) look identical to death: same missing signal, but each needs a different response the monitor can't choose.
- False negatives (dead node declared alive → silent outage) are worse than false positives (healthy node declared dead → fire drill), and a single timeout dial trades one for the other — you can't minimize both.
- Break the trade-off with a second opinion: secondary confirmation + quorum (majority vote) stops a localized network glitch from faking a death — but full-mesh checking is O(N²), which sets up the need for gossip.
Part 2 of 4 in the Heartbeats & Failure Detection series:
- Is This Thing On? — Building a Heartbeat System from Scratch
- The Trade-off Maze — Push vs Pull, and Why "Is It Dead?" Is Harder Than It Sounds (you are here)
- Going Decentralized — Gossip, Rumors, and the SWIM Protocol
- Getting Smart — Phi Accrual and the End of Hardcoded Timeouts
Last time I built a basic heartbeat system — nodes push "I'm alive" to a monitor, the monitor derives liveness from timestamps. Simple. Works. But simple systems break at scale, and this one breaks in interesting ways.
Push vs Pull: Who's In Control?
With 1000 nodes pushing heartbeats every second, that's 1000 requests/sec hitting a single monitor. And here's the fun part: they might all fire at the exact same millisecond. Burst of traffic, then silence, then another burst. This is called a thundering herd, and your monitor has zero control over it — the nodes decide when to send.
My first thought: put a queue in front of the monitor. Buffer the bursts.
Bad idea. A heartbeat says "I'm alive right now." If it sits in a queue for 8 seconds and your timeout is 10 seconds, you've got 2 seconds of margin. The heartbeat is stale by the time you process it. Queuing works for emails and order processing — tasks where timing doesn't matter. Not for liveness signals.
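The arithmetic above can be sketched in a few lines. This is only an illustration: `remaining_margin` and `looks_alive` are hypothetical names, and the timestamps are plain numbers in seconds rather than real clock readings.

```python
def remaining_margin(timeout_s: float, queue_delay_s: float) -> float:
    """Seconds of slack left after a heartbeat has waited in a queue."""
    return timeout_s - queue_delay_s


def looks_alive(sent_at: float, now: float, timeout_s: float) -> bool:
    """Judge liveness by when the heartbeat was sent, not when it was dequeued."""
    return (now - sent_at) <= timeout_s


# The example from the text: 8 seconds in the queue against a 10 second timeout.
margin = remaining_margin(timeout_s=10, queue_delay_s=8)
```

The queue never makes the heartbeat more accurate. It only spends part of the timeout budget before the monitor gets to look at the message.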
The real question isn't "how do I handle the burst?" — it's "should the nodes be driving this at all?"
Pull-based monitoring flips the model. Instead of nodes pushing to the monitor, the monitor reaches out to each node: "are you alive?" Now the monitor controls the timing. It can space checks evenly — no thundering herd, predictable load.
But pull has its own cost: the monitor needs to know about every node upfront. With push, a new node just starts sending and the monitor discovers it. With pull, someone has to register the node first.
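Here is a rough sketch of how a pull-based monitor might space its checks. It assumes the registry is just a list of node names and that `poll_offsets` (a made-up helper, not part of any library) is called once per polling interval.

```python
from typing import Dict, List


def poll_offsets(registry: List[str], interval_s: float) -> Dict[str, float]:
    """Give each registered node an evenly spaced offset within one interval."""
    if not registry:
        return {}
    step = interval_s / len(registry)
    return {node: i * step for i, node in enumerate(registry)}


# Four registered nodes polled once per second: one check every 0.25 s.
offsets = poll_offsets(["node-1", "node-2", "node-3", "node-4"], interval_s=1.0)
```

Note that the function only works because the registry exists. A node that was never registered gets no offset and is never polled, which is exactly the discovery cost described above.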
Neither model is "better":
| | Push | Pull |
|---|---|---|
| Who controls timing | Node | Monitor |
| Node discovery | Automatic | Needs a registry |
| Load pattern | Bursty | Smooth |
| Monitor complexity | Simple (just listen) | Higher (must poll) |
And here's something that doesn't show up in technical comparisons: the organizational dimension.
Push means the monitoring team publishes an API spec. Other teams integrate on their own time. Self-service. No bottleneck. Pull means every new node is a ticket to the monitoring team: "please add us to your polling list." The monitoring team becomes a bottleneck, but they have full control.
This maps to Conway's Law — system architecture mirrors org structure. Autonomous teams lean toward push. Centralized platform teams lean toward pull. The technical trade-off and the organizational trade-off are inseparable.
Gray Failures: The Zombie Problem
A node that's alive but not responding — frozen in a GC pause, stuck in an infinite loop, so overloaded it can't send heartbeats — looks identical to a dead node from the monitor's perspective. No heartbeat within the timeout, therefore dead.
These are called gray failures: technically alive, functionally useless. The monitor can't distinguish why a heartbeat is missing. Dead, frozen, network-partitioned — same signal, or rather, same lack of signal.
This matters because the response should be different. You don't want to start a replacement for a node that's only network-partitioned: the original is still running, so you end up with two copies. But the monitor can't tell the difference. All it sees is silence.
False Positives and False Negatives
Two ways failure detection can be wrong:
False positive: healthy node declared dead. The monitor panics, triggers a failover, restarts a perfectly good node, reroutes traffic unnecessarily. Annoying, wasteful, but recoverable. It's a fire drill.
False negative: dead node declared alive. Traffic keeps flowing to a black hole. Users see errors. Data might be lost. And nobody's investigating because the monitoring system says everything is fine. That's a silent outage.
False negatives are worse. Always. A false positive wastes resources. A false negative loses users.
And here's the kicker: the same lever controls both. Shorter timeout → catch dead nodes faster (fewer false negatives) but panic on transient hiccups (more false positives). Longer timeout → patient through hiccups (fewer false positives) but slow to catch real failures (more false negatives).
You cannot minimize both at the same time. It's a dial, not a switch.
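To see the dial in action, here is a toy calculation. The gap values are invented for illustration, and the model is deliberately crude: a false alarm is any gap between heartbeats from a healthy node that exceeds the timeout, and the worst-case detection delay for a truly dead node is taken to be the timeout itself.

```python
from typing import List


def false_alarms(gaps_s: List[float], timeout_s: float) -> int:
    """Count gaps from a healthy node that would be misread as a death."""
    return sum(1 for gap in gaps_s if gap > timeout_s)


def worst_case_detection_delay(timeout_s: float) -> float:
    """A node that dies right after a heartbeat is noticed one timeout later."""
    return timeout_s


# Made-up gaps (seconds) between heartbeats from a healthy node with two hiccups.
healthy_gaps = [1.0, 1.1, 0.9, 4.0, 1.0, 7.5, 1.2]

for timeout in (2.0, 5.0, 10.0):
    print(timeout, false_alarms(healthy_gaps, timeout), worst_case_detection_delay(timeout))
```

As the timeout grows, the false alarm count falls and the detection delay rises. No single value drives both to zero.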
Breaking the Trade-off: Get a Second Opinion
So you can't tune the timeout to fix both. But what if, before declaring a node dead, you asked someone else: "hey, can you reach this node?"
If the monitor can't reach node-3, it could be because node-3 is dead. Or it could be because the network between the monitor and node-3 is broken. But if the monitor asks node-5 "can you reach node-3?" and node-5 says "yeah, it's fine" — you just avoided a false positive without needing a longer timeout.
This is secondary confirmation — getting a second opinion before acting. And it naturally leads to a bigger idea: what if multiple nodes are all checking each other?
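A minimal sketch of that second opinion, under some loud assumptions: `can_reach(src, dst)` is a stand-in for whatever probe you actually use, the network is faked with a set of working links, and the monitor is assumed to be able to talk to the peers it asks.

```python
from typing import Callable, Iterable


def should_declare_dead(
    target: str,
    peers: Iterable[str],
    can_reach: Callable[[str, str], bool],
    monitor: str = "monitor",
) -> bool:
    """Declare `target` dead only if neither the monitor nor any peer reaches it."""
    if can_reach(monitor, target):
        return False
    for peer in peers:
        if peer != target and can_reach(peer, target):
            return False  # a second opinion says the target is alive
    return True


# Toy network: the monitor -> node-3 link is broken, but node-5 still reaches node-3.
working_links = {("node-5", "node-3")}


def toy_can_reach(src: str, dst: str) -> bool:
    return (src, dst) in working_links


verdict = should_declare_dead("node-3", ["node-5"], toy_can_reach)
```

Here `verdict` is `False`: the monitor alone would have raised the alarm, and one reachable peer was enough to stop it.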
Quorum: The Majority Rules
If multiple nodes check each other, disagreements happen. Node-1 says node-3 is dead. Node-5 says node-3 is alive. Who do you believe?
Majority vote. If 3 out of 5 nodes say node-3 is alive, it's alive — even if 2 can't reach it. Those 2 probably have a localized network issue.
This is quorum-based failure detection, and it's one of those ideas that feels obvious in hindsight but solves a genuinely hard problem. A localized network failure can't cause a majority to agree on a false death. You need a real, widespread failure for the quorum to declare a node dead.
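The vote itself is tiny. In this sketch `declared_dead` is a hypothetical helper, the reports are hard-coded, and a tie is treated as "not dead", which is an assumption rather than something the approach dictates.

```python
from typing import Dict


def declared_dead(reports: Dict[str, bool]) -> bool:
    """`reports` maps each voter to True if it can reach the target.

    The target is declared dead only when a strict majority cannot reach it.
    """
    unreachable = sum(1 for reachable in reports.values() if not reachable)
    return unreachable > len(reports) / 2


# Five voters reporting on node-3: three can reach it, two cannot.
reports = {
    "node-1": False,
    "node-2": True,
    "node-4": True,
    "node-5": True,
    "node-6": False,
}
node_3_dead = declared_dead(reports)
```

With two dissenting voters out of five, `node_3_dead` stays `False`. It takes at least three unreachable reports to flip the verdict.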
But there's a catch. Every node checking every other node means N × (N − 1) checks per round, which grows as N-squared. Five nodes? 20 checks. Fine. A thousand nodes? Nearly a million checks per round. Doesn't scale.
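Those numbers come from each of the N nodes checking the other N − 1. A quick sketch, with `full_mesh_checks` as an illustrative name and each direction counted as its own check:

```python
def full_mesh_checks(n: int) -> int:
    """Checks per round when every node probes every other node once."""
    return n * (n - 1)


small = full_mesh_checks(5)
large = full_mesh_checks(1000)
```

`small` is 20 and `large` is 999,000. Going from 5 to 1000 nodes multiplies the node count by 200 and the check count by roughly 50,000.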
So you need a way to spread information through a cluster without everyone talking to everyone. You need gossip. But that's the next post.
What We Learned
- Push vs pull isn't about scale alone — it's about control, discovery, and org structure.
- Gray failures are invisible — the monitor can't tell dead from frozen from partitioned.
- False negatives are worse than false positives: a silent outage is worse than a fire drill.
- A single timeout can't minimize both — it's a dial between false positives and false negatives.
- Secondary confirmation breaks the trade-off — ask someone else before declaring death.
- Quorum prevents localized network issues from triggering false alerts — but full-mesh checking doesn't scale.
Next: how gossip protocols spread information like rumors, and why three states (alive, suspect, dead) are better than two.