Getting Smart — Phi Accrual and the End of Hardcoded Timeouts

Part 4 of 4 in the Heartbeats & Failure Detection series:

Is This Thing On? — Building a Heartbeat System from Scratch
The Trade-off Maze — Push vs Pull, and Why "Is It Dead?" Is Harder Than It Sounds
Going Decentralized — Gossip, Rumors, and the SWIM Protocol
Getting Smart — Phi Accrual and the End of Hardcoded Timeouts (you are here)

Everything we've built so far has a hardcoded number somewhere. 10-second timeout. 5 suspect rounds. These work, but they're the same for every node in the system. A server in the same rack with sub-millisecond latency gets the same timeout as a server across the ocean with 200ms latency. That doesn't feel right.

My first instinct: measure network latency per node, calculate a rolling average, derive the timeout from that. Which is the right direction — but the details matter.

The Problem with Fixed Thresholds

A 10-second timeout treats all nodes equally. But "normal" is different for every node. Node-1 in the same datacenter responds in 1ms. Node-2 across the Atlantic responds in 200ms. A 300ms silence from node-1 is alarming. A 300ms silence from node-2 is Tuesday.

You could manually configure per-node timeouts (we built that in chapter 3). But that doesn't scale — hundreds of nodes, network conditions that change throughout the day, new nodes spinning up. You'd spend your life tuning thresholds.

What if the system could figure out what "normal" looks like for each node and flag when things deviate?

From Binary to Spectrum

Right now, failure detection is binary: alive or dead. A node either made the timeout or it didn't. But there's a whole spectrum between "perfectly healthy" and "definitely dead."

What about a node that's responding, but slowly? It's not dead — health checks succeed. But it's taking 10x longer than usual. Should you keep routing the same amount of traffic to it? With a binary alive/dead check, you have no choice. The node is "alive," so it gets full traffic. Meanwhile users on that node are waiting and waiting.

A suspicion score — a number from 0 to 1 — lets the system react proportionally. Low score: business as usual. Medium score: maybe reduce traffic. High score: stop sending requests entirely.

Inter-Arrival Times: The Right Signal

So what data do you feed into this score? My first thought was response latency — time each health check request. But there's a better signal: inter-arrival times.

Inter-arrival time is the gap between consecutive heartbeats arriving at the monitor. Not how long a single request took, but how long between "I heard from you" and "I heard from you again."

Why is this better? Because it's a superset signal. If the network is slow, inter-arrival times go up. If the node is overloaded and heartbeats fire late, inter-arrival times go up. If there's packet loss, inter-arrival times go up. GC pause? OS scheduling delay? All captured in one number, passively, from data you're already collecting.

Response latency tells you about one specific request. Inter-arrival time tells you about the node's overall behavior pattern.

But Wait — Gossip Doesn't Work Here

I almost made a mistake. I tried to bolt Phi Accrual onto the gossip system. The problem? In gossip, you pick a random peer each round. So you might check node-2 this round, then node-3, then node-2 again three rounds later. The gap for node-2 isn't "node-2 is slow" — it's "I didn't check node-2 for 6 seconds because I was gossiping with others."

Phi Accrual needs consistent, regular sampling to build a meaningful statistical picture. Gossip's random selection creates irregular gaps that have nothing to do with the node's health.

This is a key insight: gossip and Phi Accrual solve different problems. Gossip spreads information. Phi Accrual detects failures. They coexist, but they don't share a communication channel. Cassandra, for example, uses gossip for membership propagation and Phi Accrual on direct, regular heartbeats — separate mechanisms for separate concerns.

So for Phi Accrual, we went back to the push model: nodes send heartbeats at a fixed interval, the monitor collects arrival data. Simple, consistent, statistically meaningful.

The Math: Standard Deviation and Phi

You have a rolling window of inter-arrival times for each node. Say: [1020, 1050, 980, 1100, 1000]. The mean is 1030ms. That's "normal" for this node.

But how spread out are the values? That's the standard deviation — a measure of variance.

Take each value, subtract the mean
Square each difference (so negatives don't cancel out)
Average the squared differences — that's the variance
Square root — that's the standard deviation

For our example: stddev ≈ 40ms. So "normal" for this node is roughly 1030ms ± 40ms.

Now the node goes silent. It's been 1200ms since the last heartbeat. How worried should you be?

Phi = (current silence - mean) / stddev

Phi = (1200 - 1030) / 40 = 4.25

That's 4.25 standard deviations from the mean. Pretty unusual. But not catastrophic.

What if it's been 2000ms? Phi = (2000 - 1030) / 40 = 24.25. Something is very wrong.

What if the node is actually dead? The silence grows forever. Date.now() - lastHeardAt gets bigger and bigger. Phi climbs to 100, 200, 500. No ambiguity.

The Adaptive Part

Here's where it gets elegant. Node-1 has a mean of 100ms and stddev of 5ms. Node-2 has a mean of 500ms and stddev of 100ms.

A 300ms silence from node-1: Phi = (300 - 100) / 5 = 40. Extremely alarming. A 300ms silence from node-2: Phi = (300 - 500) / 100 = -2. Earlier than expected. Totally fine.

Same silence duration. Radically different interpretations. Each node is judged against its own history, not a global threshold. No manual tuning. No config per node. The system learns.

Thresholds

Phi itself is continuous, but you still need decision points:

Phi < 0 → heartbeat arrived sooner than average. Normal.
Phi 0-3 → within normal variance. Alive.
Phi 3-8 → unusual for this node. Suspect — maybe reduce traffic.
Phi > 8 → extremely abnormal. Declare dead.

Cassandra uses Phi > 8 as its default. These thresholds are configurable, but the point is you configure Phi thresholds once for the whole cluster, not timeouts per node. The adaptive math handles the per-node differences.

The Implementation

Back to the push-based setup. Nodes send heartbeats to a central monitor. Each node gets a random simulated delay (0-500ms) plus jitter to simulate different network conditions.

The monitor has three functions:

recordHeartbeat(nodeId) — first heartbeat creates the entry. Subsequent heartbeats calculate the gap since lastHeardAt, push it to a rolling window (capped at 100 entries), and update lastHeardAt.

computePhi(nodeId) — needs at least 3 samples (can't compute meaningful statistics from 1-2 data points). Calculates mean and stddev, measures current silence, returns the Phi score. Uses Math.max(stddev, 1) to avoid division by zero when arrivals are perfectly uniform.

getStatus(nodeId) — maps Phi to alive/suspect/dead using the thresholds.

Testing: started 3 nodes with different simulated delays. All nodes alive, Phi scores negative (heartbeats arriving on time). Killed one node. Phi climbed from -4 to 168 within seconds. Then to 274. Meanwhile the other nodes sat at -8, -12. Completely unaffected. The system correctly identified one failing node against a backdrop of healthy ones, with no hardcoded timeout anywhere.

What We Built Across This Series

Starting from "just ping it" and ending at statistical failure detection:

Basic heartbeat → derive liveness, don't store it. In-memory, not database. Trust the receiver's clock.
Timeout tuning → 10x rule as a starting point. Per-node overrides. The cost of false positives vs slow detection.
Push vs pull → who controls timing, who discovers nodes, and how does your org structure influence the choice.
False positives vs negatives → a single timeout can't minimize both. Secondary confirmation and quorum break the trade-off.
Gossip → O(log N) information spreading. Seed nodes for bootstrap. Three states prevent false rumors.
SWIM protocol → suspect buffer, secondary confirmation, and gossip fused into one design. Symmetric recovery prevents flapping.
Phi Accrual → statistical, adaptive detection. Each node judged against its own history. No hardcoded timeouts.

Each solution emerged from the limitations of the previous one. That's how distributed systems actually evolve — engineers solve one problem, hit a wall, and invent the next thing.

← Previous: Going Decentralized — Gossip, Rumors, and the SWIM Protocol

Start from the beginning: Is This Thing On? — Building a Heartbeat System from Scratch