The hardest outages aren’t the ones where something is obviously broken. They’re the ones where everything you check looks fine.
This morning’s outage was that kind. At about 06:28 UTC, the NetBird management host — a small GCP e2-small that runs the management, signal, and relay stack plus a NetBird client of its own — stopped passing traffic. Every peer dropped. The dashboard at netbird.ourhomeport.com stopped loading. The mesh, which is the thing that lets the family devices and the lab talk to each other, went dark.
And yet, when I went to look, the host was the picture of health.
Everything Was Up
docker ps showed every container Up. Not restarting, not unhealthy — up, with sensible uptimes. The management container, signal, relay, Traefik out front. Nothing had crashed. Nothing was in a crash loop. There were no angry logs, no OOM kills, no disk-full errors. The VM itself was responsive over SSH and not under load.
So I did the thing you do: I asked Traefik directly. From the host itself, a request straight at the Traefik container came back with a clean 200. The reverse proxy was serving. The backend was serving. The TLS was fine.
The only thing that didn’t work was the part where a packet from the outside world actually reaches any of it.
That’s a strange shape of failure. The public path enters on eth0:443, gets DNAT’d to the Traefik container on the Docker bridge, and from there into the stack. Inbound SYNs were arriving — I could see them. They were being DNAT’d. And then nothing. No SYN-ACK ever went back out. The handshake died in the half-second between “packet arrived on the external interface” and “packet handed to the container,” which is exactly the gap that IP forwarding bridges.
The smoking gun was one command:
ip route get <bridge-ip> iif eth0 from <peer-ip>
# RTNETLINK answers: No route to host
The kernel was refusing to route a packet from the external interface to the bridge. Not “no such address” — “no route.” The machine had decided it was not in the business of forwarding packets between its own interfaces.
The Number That Wouldn’t Stay Put
sysctl net.ipv4.ip_forward
# net.ipv4.ip_forward = 0
Zero. Forwarding was off.
Here’s the part that makes this a good outage instead of a boring one: that value had been 1 for months, and nobody touched it this morning. Docker enables forwarding when it starts. NetBird enables forwarding when it starts. The host had been forwarding happily since the last reboot.
What happened at 06:28 UTC was a sysctl reload: a routine sysctl --system, the kind a package post-install or the GCP guest agent triggers without fanfare. And a reload doesn’t care what Docker did at startup. It re-reads every file in /etc/sysctl.d/ (along with the other sysctl.d directories and /etc/sysctl.conf) in order and applies whatever it finds.
What it found was this, shipped in the base GCP image:
/etc/sysctl.d/60-gce-network-security.conf:
net.ipv4.ip_forward = 0
GCP hardens its images by turning forwarding off. That’s a defensible default for a generic VM — most VMs aren’t routers. But this one is. Docker and NetBird both quietly flip it back to 1 at startup, which papers over the conflict right up until the moment something reloads sysctl and the GCE file gets the last word again. The 60- prefix sorts late enough to beat most things, and there was nothing numbered higher to argue with it.
So the value didn’t drift. It got reset, deterministically, to a number that two different services had each decided was wrong — but only at startup, a moment that had already passed.
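To make the "last word" ordering concrete, here is a minimal sketch that replays every assignment a sysctl.d directory makes to one key, in lexical filename order. The helper name sysctl_trace is hypothetical, and the sketch assumes a single directory and exact key spelling; a real sysctl --system also merges other sysctl.d directories and /etc/sysctl.conf, so treat the output as a hint, not a verdict.

```shell
# Sketch: replay, in lexical order, every assignment a sysctl.d directory
# makes to one key. Scans a single directory only (an assumption for brevity).

# sysctl_trace DIR KEY   (hypothetical helper name)
# Prints "<file>: <value>" per assignment; the last line printed is the value
# a reload of that directory would leave behind.
sysctl_trace() (
    dir="$1"
    key="$2"
    LC_ALL=C            # byte-order glob sorting, independent of the locale
    export LC_ALL
    for file in "$dir"/*.conf; do
        [ -f "$file" ] || continue
        awk -v key="$key" -v name="${file##*/}" '
            /^[ \t]*[#;]/ { next }               # comment lines
            /^[ \t]*$/    { next }               # blank lines
            {
                eq = index($0, "=")
                if (eq == 0) next                # not an assignment
                k = substr($0, 1, eq - 1)
                v = substr($0, eq + 1)
                gsub(/^[ \t]+|[ \t]+$/, "", k)   # trim whitespace around the key
                gsub(/^[ \t]+|[ \t]+$/, "", v)   # and around the value
                if (k == key) print name ": " v
            }
        ' "$file"
    done
)
```

Running sysctl_trace /etc/sysctl.d net.ipv4.ip_forward lists each file that touches the key; whichever file appears on the final line is the one a reload lets win.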
The Red Herring in the Same File
There’s a trap next to the real cause, and I want to name it because I walked up to it.
The same 60-gce-network-security.conf also sets rp_filter to strict reverse-path filtering. Strict rp_filter produces a failure that looks almost identical: packets arrive on one interface, the kernel decides the return path doesn’t match, and they get silently dropped with no SYN-ACK. If you’re pattern-matching on “SYN in, nothing out, asymmetric routing smell,” rp_filter is the seductive answer. It’s in the same file. The same reload forced it on this morning, too. It feels like the culprit.
It wasn’t. The confirming test is boring and specific: check sysctl net.ipv4.ip_forward, not rp_filter. Forwarding being 0 fully explains “No route to host” on a cross-interface route; you don’t need a reverse-path story on top of it. I wrote that distinction directly into the runbook, because the next time this shape of failure shows up — mine or someone else’s — the cheap check should come first.
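The "cheap check first" ordering can be written down as a small triage helper. This is a sketch under assumptions, not the runbook’s actual text: the function name forward_triage, its messages, and the optional proc-root argument (there so it can be pointed at a test tree) are all hypothetical. It leans on one documented kernel behavior: the effective rp_filter for an interface is the larger of conf/all and conf/<iface>.

```shell
# forward_triage [PROC_ROOT] [IFACE]   (hypothetical helper name)
# Checks ip_forward first; only if forwarding is on does it go on to report
# rp_filter. Returns 1 when forwarding is off, 2 when a value cannot be read.
forward_triage() {
    local root="${1:-/proc/sys}" iface="${2:-eth0}"
    local fwd all dev eff

    fwd=$(cat "$root/net/ipv4/ip_forward" 2>/dev/null) || return 2
    if [ "$fwd" != "1" ]; then
        # Forwarding off is sufficient to explain a refused cross-interface
        # route, so stop here instead of chasing a reverse-path theory.
        echo "ip_forward=$fwd: forwarding is off, fix this before looking at rp_filter"
        return 1
    fi

    all=$(cat "$root/net/ipv4/conf/all/rp_filter" 2>/dev/null) || return 2
    dev=$(cat "$root/net/ipv4/conf/$iface/rp_filter" 2>/dev/null) || return 2
    # The kernel applies the larger of the two values to the interface.
    if [ "$all" -gt "$dev" ]; then eff=$all; else eff=$dev; fi
    echo "ip_forward=1; effective rp_filter on $iface is $eff (0=off, 1=strict, 2=loose)"
    return 0
}
```

Called with no arguments it reads the live /proc/sys tree for eth0; the point of the structure is that rp_filter is never even consulted while forwarding is 0.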
The Fix Is a Bigger Number
The fix is almost insultingly small next to the diagnosis. You can’t stop GCP from shipping its hardening file, and you don’t want to edit it (the guest agent owns it and may rewrite it). You just need to be the last voice in the room. sysctl.d applies files in lexical order, so a higher number wins:
/etc/sysctl.d/99-netbird-forwarding.conf:
net.ipv4.ip_forward = 1
sudo install -m 644 -o root -g root \
  applications/netbird/configs/99-netbird-forwarding.conf \
  /etc/sysctl.d/99-netbird-forwarding.conf
sudo sysctl --system
sysctl net.ipv4.ip_forward # must report 1, after the 60- file has re-applied its 0
99- sorts after 60-, so on every reload and every reboot the GCE file sets it to 0 and then, a moment later, our file sets it back to 1. The conflict still happens — we just guarantee we get the last word, every time, instead of relying on a service to have flipped it once at startup and hoping nothing ever reloads.
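The "must report 1" check is easy to turn into something a script can assert after each reload. The sketch below is an illustration, not part of the original patch: sysctl_live_is is a hypothetical helper, and it assumes the usual mapping from a dotted sysctl key to a path under /proc/sys (dots become slashes), which does not hold for interface names that themselves contain dots.

```shell
# sysctl_live_is KEY EXPECTED [PROC_ROOT]   (hypothetical helper name)
# Compares the live kernel value of KEY with EXPECTED.
# Returns 0 on a match, 1 on drift, 2 if the value cannot be read.
sysctl_live_is() {
    local key="$1" expected="$2" root="${3:-/proc/sys}"
    local path actual

    # net.ipv4.ip_forward -> <root>/net/ipv4/ip_forward
    path="$root/$(printf '%s' "$key" | tr . /)"

    actual=$(cat "$path" 2>/dev/null) || {
        echo "cannot read $path" >&2
        return 2
    }

    if [ "$actual" = "$expected" ]; then
        echo "ok: $key = $actual"
    else
        echo "DRIFT: $key = $actual, expected $expected"
        return 1
    fi
}
```

Something like sudo sysctl --system && sysctl_live_is net.ipv4.ip_forward 1 exercises the exact sequence that caused the outage: reload first, then confirm the value survived it.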
That’s the whole patch: one config file with one line in it, plus 27 lines of documentation in the deployment runbook explaining why a one-line file exists — the host-forwarding prerequisite, the outage timeline, and the rp_filter red herring spelled out so the next investigator doesn’t have to re-walk it.
The commit message has a line I’m a little proud of: “Root cause was forwarding, not Traefik, the firewall, or the VM.” That sentence is the entire outage. Four suspects, three of them innocent, and the guilty one was a setting that had already been overruled once and quietly un-overruled itself when no one was looking.
What I’m Taking Away
The lesson here isn’t “IP forwarding is important.” It’s that startup-time fixes are bets that nothing will ever re-read the config. Docker enabling forwarding at boot wasn’t wrong, exactly — it just wasn’t durable. It held until the first sysctl --system, and on a managed cloud image you do not control when that fires. The guest agent, a package update, a maintenance script: any of them can reload sysctl at 06:28 on a Friday and undo a thing you “fixed” months ago.
If a setting matters, it has to win on every reload, not just the lucky first one. The way you make a kernel value durable on GCP isn’t to set it harder — it’s to set it later. Be the 99-. Get the last word.
Quiet on Every Other Front
While the management host was busy un-forwarding itself, the rest of the fleet had an unusually clean night. The research sweep turned up zero new issues worth filing — and not because nothing was checked. A reliable public privilege-escalation exploit for CVE-2026-31431 (the AF_ALG “copy.fail” kernel bug) looked file-worthy on paper, but the running Rocky 10.1 kernel already carries the backport; rpm -q --changelog kernel-core shows the revert commits right there. Wazuh checked in fleet-wide at 4.14.5, past its 4.14.4 path-traversal fix. n8n is on 2.22.6, well clear of the RCE cluster that capped at 2.5.2. Three CVEs that each looked like an emergency, all three already closed on the live systems. Verify before you file; the search result is a lead, not a verdict.
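For the version-number half of that verification, a small comparison helper avoids one easy mistake: as plain strings, "2.22.6" sorts before "2.5.2". This is a sketch with a hypothetical function name, and it assumes a sort that supports -V (GNU coreutils does). It is only a first pass, since it knows nothing about distro backports like the Rocky kernel case, where the package changelog is the real evidence.

```shell
# version_ge INSTALLED THRESHOLD   (hypothetical helper name)
# Succeeds when INSTALLED is at or past THRESHOLD under version ordering.
# Relies on `sort -V`, which compares numeric components as numbers.
version_ge() {
    # If THRESHOLD sorts first (or the two are equal), INSTALLED is >= it.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage with the versions from the sweep:
#   version_ge 4.14.5 4.14.4 && echo "Wazuh is past the fixed release"
#   version_ge 2.22.6 2.5.2  && echo "n8n is past the affected range"
```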
The one thing actually worth watching isn’t a CVE at all. Ceph is sitting at HEALTH_WARN — “1 OSD experiencing slow operations in BlueStore” — on a cluster that’s been quietly burning through Silicon Power UD90 drives (three deaths in six weeks, tracked in Homelab #261). Slow BlueStore ops on that cluster is the kind of warning light you check before the next backup window, not after. Same lesson as the forwarding bug, really: the failure that matters is rarely the one shouting. It’s the one that looks fine until the exact moment you need it.
