Three Upgrades I Never Ran

git log --since="24 hours ago" came back almost empty again. One line — yesterday’s blog post on iter8lab.net. Homelab, OurHomePort, RackPeek, homelab-agent: nothing. Two quiet commit days in a row.

Yesterday I wrote about how the research digest was loud even when the repos were silent, because external advisories kept moving while my code stood still. Tonight the digest was loud for a different reason. It wasn’t the outside world that moved. It was the inside — the running fleet itself had drifted to three new versions since I last had a clear picture of it, and not one of those version bumps came from a commit of mine.

Two of those drifts saved me. The third may have cost me. I want to write about the third, but I have to set up the first two to explain why it matters.

The two that drifted in my favor

First, Authentik. The nightly task’s baseline assumed server01 was running something in the 2026.2.x line. When the digest pulled the actual running image label tonight, it came back ghcr.io/goauthentik/server:2026.5.0 — a full minor series ahead of what I thought was deployed. The May 22 release, auto-pulled roughly four days ago. That’s the first release on Authentik’s new three-month cadence, and it folded in a stack of CVEs I’d otherwise have had to chase: the SAML signature-verification bypass, plus CVE-2026-42849, -41569, and -40165. I didn’t file any of those. The fleet was already past them before the advisories crossed my desk.

Second, Wazuh. The manager on kvm02 reported v4.14.5 tonight. CVE-2026-30893, the cluster-sync path traversal that scored 9.9, was fixed in 4.14.4. We’re one patch release past the fix, across the manager and all ten active agents. And here’s the part I find slightly unsettling: I have no record of when 4.14.5 landed. It’s not in any earlier digest. There’s no commit, no ticket, no memory note saying “upgraded Wazuh manager.” It simply slipped in through the container’s update channel and nobody, including me, noticed, because nothing broke.
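The "am I already past the fix?" question the digest answers comes down to a numeric version comparison. A minimal sketch, assuming plain dotted version strings with an optional leading "v"; the function names are made up for illustration and are not part of the digest tooling:

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn 'v4.14.5' or '4.14.5' into (4, 14, 5) so parts compare as numbers."""
    return tuple(int(part) for part in text.lstrip("vV").split("."))


def is_patched(running: str, fixed_in: str) -> bool:
    """True when the running version is at or past the release carrying the fix."""
    return parse_version(running) >= parse_version(fixed_in)


print(is_patched("v4.14.5", "4.14.4"))  # True
print(is_patched("4.9.2", "4.14.4"))    # False
```

Comparing tuples of integers matters here: as plain strings, "4.9.2" sorts after "4.14.4", which would report an old release as patched.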

If I stopped here, the post would be a cheerful one. Auto-update is doing its job. The lab keeps itself patched ahead of the advisory feed, and the digest’s role is mostly to confirm “yes, you’re already clear” rather than “go fix this.” That’s the dream version of vulnerability management: the boring path runs itself.

But the same mechanism that kept me patched is the one that quietly undid a thing I’d built by hand.

The one that drifted against me

In the last 24 hours, two level-10 alerts fired on site02-kvm01. The rule: “High amount of POST requests in a small period of time (likely bot)” — Wazuh rule 31533. I know this rule intimately, because back in late April I spent an evening proving it was firing on legitimate traffic. Loading any multi-panel OpenObserve dashboard kicks off eight or more simultaneous POST /api/default/_search_stream calls, one per panel, which trips the “30 POSTs in 30 seconds” frequency threshold. Uptime Kuma’s Socket.IO long-poll adds a steady drip on top. None of it is a bot. It’s me, opening a dashboard.
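To see how ordinary dashboard use can cross a "30 POSTs in 30 seconds" line, here is a rough sliding-window counter. It illustrates the arithmetic only and is not how Wazuh evaluates frequency rules internally. The session below (four loads of an eight-panel dashboard, five seconds apart) is a hypothetical example, not captured traffic:

```python
from collections import deque


def trips_threshold(timestamps, limit=30, window=30.0):
    """Return True if `limit` or more events fall inside any `window`-second span.

    `timestamps` are request times in seconds. The defaults mirror the
    "30 POSTs in 30 seconds" figures quoted above.
    """
    recent = deque()
    for ts in sorted(timestamps):
        recent.append(ts)
        # Drop events that have aged out of the window ending at `ts`.
        while ts - recent[0] >= window:
            recent.popleft()
        if len(recent) >= limit:
            return True
    return False


# Hypothetical session: 4 page loads, 5 seconds apart, 8 panel queries each.
dashboard = [load * 5.0 for load in range(4) for _panel in range(8)]
print(trips_threshold(dashboard))          # True
print(trips_threshold([0.0, 10.0, 20.0]))  # False
```

Thirty-two requests from one person clicking around is already over the limit, with no Uptime Kuma traffic added on top.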

The fix was a local override — rule 100533 — that lives in /var/ossec/etc/rules/local_rules.xml on the manager. It suppresses the level only for internal source IPs, so a real public POST flood still alerts at the default level 10 while my own dashboard traffic stays quiet. After I added it on April 26, the daily 31533 count dropped from the high hundreds to zero or one. It had been quiet for a month.
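The logic of that override, expressed outside Wazuh, looks roughly like the sketch below. The network ranges and the suppressed level of 0 are assumptions for illustration; the real rule's match criteria live in local_rules.xml and are not reproduced here:

```python
import ipaddress

# Assumption: "internal" means these private ranges. The ranges the real
# override matches are not shown in this post.
INTERNAL_NETS = [ipaddress.ip_network(n) for n in ("10.0.0.0/8", "192.168.0.0/16")]


def alert_level(src_ip: str, default_level: int = 10, suppressed_level: int = 0) -> int:
    """Return the level for a POST-flood match, suppressing internal sources only."""
    addr = ipaddress.ip_address(src_ip)
    if any(addr in net for net in INTERNAL_NETS):
        return suppressed_level
    return default_level


print(alert_level("192.168.1.20"))  # 0, internal dashboard traffic stays quiet
print(alert_level("203.0.113.7"))   # 10, a public source still alerts
print(alert_level("100.64.0.5"))    # 10, not in the list, so treated as external
```

The last call shows the weak point of any allow-list like this: a source address outside the listed ranges keeps the default level, whether or not the traffic is actually mine.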

Tonight it wasn’t quiet. Two alerts cleared the suppression. And the timing lines up uncomfortably well with the Wazuh manager rolling from whatever-it-was to 4.14.5, because local_rules.xml does not live in a git repo. It lives inside the container’s volume (single-node_wazuh_etc). A manager image upgrade that re-seeds its config directory is exactly the kind of event that would overwrite a hand-edited rules file and revert me to stock.

So the same auto-update channel that carried me safely past a 9.9 CVE is the prime suspect for silently deleting a customization I’d validated and forgotten about.

What I actually know versus what I suspect

I want to be careful here, because the honest version of this story has an open question in the middle of it, and I haven’t closed it yet.

I have not confirmed that the upgrade reset local_rules.xml. There are at least two explanations for tonight’s two alerts, and I can’t yet distinguish them from the digest alone:

1. The rule got reverted. The 4.14.5 manager re-seeded its config directory, my override is gone, and rule 31533 is back to firing on internal dashboard traffic at level 10. If this is it, I’ll see the daily count climb back into the dozens over the next few days.
2. The rule is intact, but the source IP didn’t match. My override keys on internal source IPs. Site02 traffic gets NAT’d through kvm01’s subnet-router address. If something shifted in how that traffic is presented (a Netbird route change, a different egress path), the two POSTs could have arrived with a source IP my filter doesn’t cover, and the rule worked exactly as written on traffic it was never told about.

Two alerts is not enough signal to choose. The right next move is the unglamorous one: SSH into the manager, grep for rule 100533 in the live local_rules.xml, and confirm whether it’s still there. If it is, I look at the two alerts’ source IPs. If it isn’t, I have my answer and my root cause. That check is tomorrow’s first task, not tonight’s — and I’d rather publish the open question than pretend I’d resolved it.
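The presence check itself is small enough to script. A minimal sketch, assuming the file's text has already been read from the manager; it uses a regular expression rather than an XML parser because a rules file may hold several top-level groups, and the sample content is a made-up stand-in, not the real override:

```python
import re


def has_rule(rules_text: str, rule_id: int) -> bool:
    """Return True if an uncommented <rule ... id="N"> opening tag is present."""
    # Strip XML comments first so a commented-out rule does not count.
    live_text = re.sub(r"<!--.*?-->", "", rules_text, flags=re.DOTALL)
    pattern = re.compile(r'<rule\b[^>]*\bid\s*=\s*["\']%d["\']' % rule_id)
    return bool(pattern.search(live_text))


# Made-up stand-in content for illustration only.
sample = """
<group name="local,">
  <rule id="100533" level="0">
    <description>placeholder</description>
  </rule>
</group>
"""

print(has_rule(sample, 100533))                             # True
print(has_rule('<group name="local,"></group>', 100533))    # False
```

A True result points at the source-IP explanation; a False result points at the upgrade.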

The actual bug isn’t the upgrade

Here’s the thing that nags at me regardless of which explanation wins. My own global rule — the one written at the top of every project I touch — says repos are the source of truth for all infrastructure config. And local_rules.xml isn’t in a repo. It’s a file I edited by hand inside a running container’s volume and then walked away from, satisfied because the alert count went to zero.

That worked for a month precisely because nothing upgraded. The moment the manager moved, the file was at the mercy of whatever the image author decided to do with the config directory. I had built a customization with no mechanism to survive the very update cadence I rely on for security. The auto-update didn’t create that fragility. It exposed it. Same as last week’s “the canary was on latest” — the convenience that helps you on the median day is the thing that bites you on the tail.

The fix, if hypothesis #1 holds, is not to pin Wazuh or disable its update channel. That would trade a one-off config revert for a permanent CVE exposure, which is a terrible trade. The fix is to get local_rules.xml into a repo and re-apply it on every manager start — an init step, an entrypoint hook, or a small reconcile that copies the tracked file into the volume after the image has done its seeding. Then the rule survives the next upgrade, and the next, without my attention. The state I care about stops living inside a container and starts living in git, where it belongs.
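The reconcile step could be as small as the sketch below: compare the tracked file with the live one and copy only when they differ. The function names are hypothetical, the demo runs against a temporary directory rather than the real volume, and file ownership, permissions, and ordering relative to the manager's start are left out:

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of the file's bytes, used to detect any difference."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def reconcile(tracked: Path, live: Path) -> bool:
    """Copy `tracked` over `live` when they differ. Returns True if a copy happened."""
    if live.exists() and file_digest(live) == file_digest(tracked):
        return False
    live.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(tracked, live)
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tracked = Path(tmp) / "repo" / "local_rules.xml"    # stands in for the git copy
        live = Path(tmp) / "volume" / "local_rules.xml"     # stands in for the volume copy
        tracked.parent.mkdir(parents=True)
        tracked.write_text("<!-- tracked rules -->\n")
        print(reconcile(tracked, live))  # True, first run seeds the file
        print(reconcile(tracked, live))  # False, already in sync
        live.write_text("<!-- stock rules -->\n")  # simulate an upgrade re-seeding it
        print(reconcile(tracked, live))  # True, the tracked copy wins again
```

Returning a boolean gives the caller something to log, so a silent revert at least shows up as a line in the next digest.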

There are two kinds of state in this fleet. The kind git controls (the Quadlets, the compose files, the configs I commit and deploy) drifts only when I tell it to. The kind baked into a running container (the binary version, the default rule set, the seeded config files) drifts whenever upstream ships and the puller runs. Auto-update is a gift when that second kind holds only upstream’s defaults, and a hazard when it also holds something I changed by hand. Tonight surfaced one file that had quietly fallen into the second category without my noticing. The whole point of writing this down is so that future-me, reading back, knows to go hunting for the others.
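One way to go hunting is to diff a live config directory against a pristine copy of the image's defaults and list whatever was added or changed. A rough sketch, assuming both trees are available as local directories; how the stock copy gets extracted from the image is not covered, and the names are hypothetical:

```python
import tempfile
from pathlib import Path


def audit(stock_dir: Path, live_dir: Path) -> dict[str, list[str]]:
    """List files under `live_dir` that are absent from, or differ from, `stock_dir`."""
    report = {"added": [], "modified": []}
    for live_file in sorted(p for p in live_dir.rglob("*") if p.is_file()):
        rel = live_file.relative_to(live_dir)
        stock_file = stock_dir / rel
        if not stock_file.is_file():
            report["added"].append(rel.as_posix())
        elif stock_file.read_bytes() != live_file.read_bytes():
            report["modified"].append(rel.as_posix())
    return report


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        stock, live = Path(tmp) / "stock", Path(tmp) / "live"
        for root in (stock, live):
            (root / "rules").mkdir(parents=True)
            (root / "ossec.conf").write_text("defaults\n")
        (stock / "rules" / "local_rules.xml").write_text("<!-- stock -->\n")
        (live / "rules" / "local_rules.xml").write_text("<!-- hand edit -->\n")
        print(audit(stock, live))
        # {'added': [], 'modified': ['rules/local_rules.xml']}
```

Every path in the report is a candidate for state that lives only inside a volume and belongs in a repo instead.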

Sidebar — the watch that’s holding. site02-kvm01 is still reporting cleanly to Wazuh; agent 011 logged a keepalive within minutes of tonight’s digest run. That’s day 16 of the 30-day clean-uptime watch under Homelab #252, started May 12 after the last hang. If it holds another two weeks, the BESSTAR xHCI theory I’ve been carrying for that box needs a rewrite. Elsewhere: Ceph is HEALTH_OK, 189 GiB used of 4.2 TiB, all 96 PGs active+clean, three mons in quorum. CIS compliance is stable — the Rocky 10 hosts (kvm01, kvm02) sit at 48–49%, a few points under the Rocky 9 fleet’s 53–55%, which is the tighter v1.0.0 benchmark doing its job rather than a regression. Nothing on fire. Just a fix I may have lost without a single line of code changing hands.
