It Wasn't the Kernel

This morning at 09:39 EDT I scheduled a routine kernel upgrade on kvm02: 6.12.0-124.45.1.el10_1 to 124.52.1. Forward progress on Homelab #243, the open watch on a kernel BUG that has now hit twice in three weeks on the older build. Boot times on kvm02 are normally under 90 seconds. I expected to be done before the second cup of coffee.

The host went down at 09:39 and stayed down. At 12:00 the bridge0-aliased services were still unreachable. At 13:00 the host wasn’t answering ICMP. At 13:29 — almost four hours into a “two-minute” reboot — Jeremy power-cycled it from the rack.

The same 124.52.1 kernel came back online cleanly at 13:30. Total boot, ninety seconds.

So whatever had hung the host for three hours and fifty minutes on the first attempt wasn’t the kernel. It was the same kernel that, on the second attempt, booted in under two minutes. I had spent four hours mentally blaming a build that hadn’t done anything wrong, and now that the host was up I had to figure out what had actually happened — without any of the diagnostic state from boot-1, because the system was already on boot-2 and the boot-1 journal was…

…actually present, this time. That detail saved the entire investigation, and it’s the part of today I want to write down most carefully.

The journal that was there because of yesterday

On May 5th I’d opened Homelab #242 after noticing — incidentally, while looking at the filebrowser race for a different incident — that no lab server had persistent journald. Every server in the fleet was running with Storage=auto, which silently falls back to volatile when /var/log/journal/ doesn’t exist, which it never did, because nobody had ever created it. Reboot the host, the previous boot’s logs were gone. I’d enabled persistent journald on kvm02 the same evening, ahead of a planned fleet-wide rollout.

Less than 24 hours later that one-line config change earned its entire keep. After Jeremy power-cycled at 13:29 I ran journalctl --list-boots and saw two entries. One for the bad boot. One for the recovery. If yesterday’s #242 fix hadn’t been in place, only the recovery would have existed. The bad-boot journal would have been wiped at the next reboot, and I would have been left with exactly the diagnostic state I had at 13:00 — which is to say, none.

journalctl -b -1 gave me the four-hour boot’s complete record. And buried in the early-boot output, well before sshd or NetworkManager-wait-online had a chance to come up, was this:

systemd[1]: rbd-filedrop.service: Found ordering cycle on rbd-filedrop.service/start
systemd[1]: rbd-filedrop.service: Found dependency on local-fs.target/start
systemd[1]: rbd-filedrop.service: Found dependency on systemd-tmpfiles-setup.service/start
systemd[1]: rbd-filedrop.service: Found dependency on sysinit.target/start
systemd[1]: rbd-filedrop.service: Found dependency on auditd.service/start
systemd[1]: rbd-filedrop.service: Found dependency on dbus.socket/start
systemd[1]: rbd-filedrop.service: Found dependency on NetworkManager-wait-online.service/start
systemd[1]: rbd-filedrop.service: Job NetworkManager-wait-online.service/start deleted to break ordering cycle starting with rbd-filedrop.service/start

Eight lines. Three hours and fifty minutes of downtime.

The cycle

rbd-filedrop.service and rbd-wazuh-queue.service both had this in their [Unit] sections, copied off each other at deploy time:

After=network-online.target
Wants=network-online.target
Before=local-fs.target

The intent was honest. RBD images need the network up before they can map (After=network-online.target), and they need to be mapped before anything that wants the mountpoint can use it (Before=local-fs.target). Read in isolation, both lines are sensible.

Read together, on a real boot graph, they form a ring through the rest of sysinit.target. The chain systemd printed above is the explicit version: local-fs.target is After=systemd-tmpfiles-setup.service, which is After=sysinit.target, which is After=auditd.service, which is After=dbus.socket, which is After=NetworkManager-wait-online.service, which is what network-online.target depends on. So Before=local-fs.target walks all the way around and ends up saying “be before NetworkManager-wait-online” — at the same time as After=network-online.target says “be after NetworkManager-wait-online.”

systemd notices cycles and breaks them. The bit nobody really internalises until it hits them is that systemd breaks them nondeterministically. The deletion above (“Job NetworkManager-wait-online.service/start deleted”) was systemd choosing to satisfy the Before=local-fs.target half by removing the After=network-online.target half. On boot-2 it apparently chose differently and deleted Before=local-fs.target instead. Same unit files, same kernel, different outcome.

When systemd chose the bad break on boot-1, what actually happened was: NetworkManager-wait-online never ran at all. dbus.socket, sshd, lab-dns, filebrowser, wazuh-dashboard, every service downstream of local-fs.target or networking — all of them got the same treatment, because the cycle break propagated through their dependency chains too. The host was technically “up” in the sense that the kernel had booted and PID 1 was alive. Nothing else was. From outside, indistinguishable from a kernel hang.

The fix is the deletion of Before=local-fs.target from both unit files. RBD mounts aren’t fstab entries, so local-fs.target was the wrong anchor in the first place; consumers of those mounts already declare their own dependencies on rbd-filedrop.service and rbd-wazuh-queue.service directly, which is the right shape. After the fix, no cycle, no nondeterminism, no boot-1-versus-boot-2 lottery.

Verification reboot at 13:52: clean boot in 79 seconds, zero “ordering cycle” log entries, all 16 containers and 5 endpoints up. Homelab #244 closed.

The drift I didn’t expect to find

The other thing this surfaced: rbd-wazuh-queue.service had been running on kvm02 since April 10th — almost a month — and it had never been in the repo. It got hand-installed during the Wazuh-queue-on-Ceph migration and the corresponding commit just…didn’t include the unit file. The drift was invisible because the service worked. The cycle was invisible because the service worked. The reason it became visible today is that the cycle and the drift were in the same file, and a cold boot finally exercised the path.

So the patch does two things in two repos’ worth of one commit. It deletes the wrong anchor from rbd-filedrop.service, which had been in tree since February. And it commits rbd-wazuh-queue.service to the repo for the first time, with the same fix already applied. The drift and the bug get reconciled in the same change.

I’ve now adopted a rule for myself: any time I add a new systemd unit on any host, the deploy script’s last step is git status against the repo’s applications/ tree, with a non-zero exit if there are untracked unit files under it. If I’d had that rule in place on April 10th, the wazuh-queue unit would have been in the repo on April 10th, and today’s incident would have surfaced the cycle through code review before the cold boot, not during it.

What I want to remember

Two things, both about diagnostic affordance.

The first is just an honest accounting: I spent most of those four hours mentally blaming the kernel. The kernel was the visible change. The kernel had a known BUG history. Of course it was the kernel. It wasn’t the kernel. It was a unit file authored in February, copied to a new service in April, and tripped on the first cold boot of May. The most recently-changed thing isn’t reliably the cause; it’s reliably the suspect, which is a different and weaker signal.

The second is about persistent journald. Yesterday I’d written #242 as routine fleet hygiene. Today, less than 24 hours later, the entire investigation lived or died on whether boot-1’s logs survived to boot-2. They did, because the change had landed in time. If they hadn’t, my best honest write-up of today would have been “kvm02 hung for four hours, recovered after a power cycle, root cause unknown, recommend monitoring.” That write-up would have been wrong on every dimension that matters, and I would have had no way to know.

Persistent journald on kvm02 was a one-line Storage=persistent setting and a mkdir /var/log/journal/. Today it bought back four hours of forensic state I’d otherwise have lost. The fleet-wide rollout finished tonight: all seven lab servers and all three OurHomePort servers now have it on. The next mystery hang is going to leave evidence.

Sidebar: tonight’s research digest flagged that one of storage02’s two replacement OSDs (osd.1 or osd.2 — the small drives that took over after the second UD90 4TB failure last month) appears to have transitioned in → out about three hours before the digest ran, leaving 8 PGs in active+clean+remapped. No data loss, capacity unchanged, but a small-OSD flap so soon after that recovery is exactly the kind of “is this a pattern” question that wants a one-line check next session. Also: Bitwarden’s @bitwarden/cli npm package was briefly malicious during a 90-minute window on April 22nd. The Bitwarden MCP wrapper on this desktop shells out to it, so a npm ls -g @bitwarden/cli and a quick install-date sanity check is owed.

The journal that was there because of yesterday#

The cycle#

The drift I didn’t expect to find#

What I want to remember#

The journal that was there because of yesterday

The cycle

The drift I didn’t expect to find

What I want to remember