Two days ago I wrote a post called “One Wrong Digit.” A transposed IP in a Netbird nameserver group — 72.72.2.22 instead of 76.76.2.22 — was firing DNS probe timeouts every 90 minutes, which triggered mesh-wide ICE resets, which dropped RDP sessions. I found it, disabled the bad nameserver group, watched the logs, declared victory.
Today I found out I was probably wrong.
Not entirely wrong — the bad DNS entry was real, and disabling it was the right call. But the hourly flap had a second, deeper driver that I missed. The evidence was sitting in dmesg the whole time. I just didn’t look there.
Building Observability to Celebrate a Fix
After closing OHP#58 on Thursday, the natural next step was to get better monitoring on the mesh so we’d catch anything similar faster. The observability stack went in last week — OpenObserve for storage and dashboards, an OpenTelemetry Collector to ship logs and metrics. But we hadn’t wired up the Netbird-specific stuff yet.
Today’s commits were that wiring. The otel-collector config got two new pipelines: one pulling logs from all 10 Netbird client hosts via a filelogt receiver pointed at their respective netbird.log paths, and one shipping logs and metrics from the GCP e2-small VM that runs netbird-server itself. Then an OpenObserve dashboard: a “Netbird mesh-health” panel showing ICE state transitions, relay vs. P2P connection ratios, “network map updated” event frequency, and peer reconnect timing.
It took a few iterations to get the panel layouts right. OpenObserve has a 180-column grid, and the default layout was stacking everything vertically. Resized panels, normalized the column spans, got it to something readable.
The goal was a post-fix health check — “look how calm the mesh is now that DNS is fixed.” And the mesh was calmer. But when I started ingesting the netbird-server logs alongside the client logs, a pattern emerged that I hadn’t expected.
What the GCP VM’s dmesg Said
The e2-small VM is a Google Cloud instance: 2 vCPUs, 2GB RAM, no swap. It runs the Netbird management server and signal server in Docker containers, proxied by Traefik. Small but sufficient for a 14-peer mesh.
Rocky Linux 9.7. And by default, Rocky Linux 9 ships with dnf-makecache.timer enabled. That timer fires every hour (offset by a random delay in the 0-15 minute window) and runs dnf makecache, which downloads and indexes the RPM repository metadata. Useful for keeping package cache fresh. Utterly brutal on a 2GB no-swap host.
When dnf makecache runs on this VM, it spikes to around 900MB of RSS. On a 2GB system with no swap and containers already holding ~1GB of their own working set, that’s a trip wire. The OOM killer fires. dmesg shows the SIGKILL landing on the dnf process after roughly five minutes of memory pressure — five minutes where the kernel is actively thrashing, borrowing from every available source, creating substantial latency for anything that needs to allocate.
During that five-minute window, Docker’s bridge network — the virtual networking layer that connects the Netbird containers to each other and to the host — experiences enough scheduling pressure that Netbird’s relay healthcheck times out. The signal server decides the relay is unhealthy. It fires “network map updated” across all peers. Every peer drops its relay connection simultaneously and reconnects a few seconds later.
The OOM burst aligned exactly to the minute with the peer reconnect clusters I was seeing in the new dashboard. Not “approximately” aligned — to the minute, in the shared timestamp view. dmesg had the SIGKILL. The OpenObserve panels had the reconnect surge. Same timestamp.
The DNS Fix Was Coincidental
This is the awkward part to write.
The “one wrong digit” post concluded that disabling the ControlD nameserver group stopped the flaps. That was probably true for the ~90-minute DNS-probe-triggered ICE resets. But the hourly OOM-driven relay drops? Those were always separate, and they continued after the DNS fix. I didn’t notice because the symptoms looked identical: peers reconnecting, “network map updated” in logs, occasional TCP session drops if you hit the window wrong.
Two different mechanisms. Same-looking symptoms. I fixed one and declared the problem solved because I stopped watching closely enough.
The fix I actually needed was two commands:
systemctl disable --now dnf-makecache.timer
And adding a 1GB swapfile as a safety margin:
fallocate -l 1G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile none swap sw 0 0' >> /etc/fstab
Swap isn’t a performance solution — if dnf makecache is regularly pushing the system into swap, something is wrong. But as a backstop against infrequent spikes (package cache, occasional container restarts), it means the OOM killer doesn’t get called in the first place. The relay stays healthy. The peers stay connected.
The Rest of the Day
The DNS mirror architecture also had a major documentation push today. Seven commits across the Homelab repo (and two more in OurHomePort with the corresponding diagrams) — design spec, implementation plan, spec revisions through three review passes. The short version: ns1 currently does DNS for the internal lab zones, but it’s a single point of failure. The plan is to stand up an Unbound instance on a second host, mirror the recursive resolver config, and add it as a secondary in the Netbird nameserver group once the dead-IP situation is fully cleaned up. Today was the planning work; implementation is queued.
A couple of smaller fixes also landed. The n8n nightly backup was hitting an SELinux denial when trying to tar volumes — the backup container runs with a context that doesn’t have write permission on /var/log for its own output path. Fixed with a targeted policy module rather than a blanket chcon. And the Filebrowser container was occasionally failing to start on boot because systemd was racing against the Ceph RBD mount; added an After= dependency on the rbd-filedrop.service unit to enforce ordering.
Sidebar: The Wazuh View
Tonight’s research sweep flagged something that keeps coming up: ns1 is running an unpatched version of BIND, and Wazuh is actively alerting on it — seven level-10 CVE-2026-1519 events in the last 24 hours. That CVE is a CPU DoS via excessive NSEC3 iteration during insecure delegation validation. Not RCE, but high-severity and actively flagged. The BIND errata (RLSA-2026-8091) is available via dnf update bind. Homelab #147 has been open for this for a while. It’s past time to close it.
The Ceph cluster is also in HEALTH_WARN — osd.0 experiencing slow BlueStore operations. This one is newer; Homelab #203 was filed today. The UD90 drives have had a complicated few months (osd.1 required an RMA, osd.1 is now oddly showing “up since 2d” in the status which might mean the replacement arrived, or might mean the drive temporarily recovered before its next failure). Slow ops on osd.0 after osd.1’s turbulence suggests keeping close watch on the drive health before assuming the cluster is stable.
What’s Left Open
- Homelab #147: Apply BIND errata on ns1 — Wazuh is not going to stop alerting until this is done
- Homelab #203: Investigate Ceph osd.0 slow BlueStore operations
- OHP#59: Netbird on public Wi-Fi — the dnf-makecache fix may actually help this, since relay drops were contributing to handshake failures
- DNS mirror: Phase 1 (Unbound deploy) is queued after the planning work lands
The mesh is now genuinely calm. The hourly OOM-driven reconnects are gone. The dashboard confirms it. I’m choosing to find it funny that it took building observability to celebrate a fix that wasn’t the fix, in order to see the evidence that revealed what the real fix needed to be.
The kernel had receipts. I just had to go looking for them.
