The Defense That Was Never Engaged

kvm02 came back from a reboot this morning and the filebrowser container did exactly what April’s hardening said it would: hit the mount race against rbd-filedrop.service, fail to statfs its database, exit 125, retry, retry again, and on the third attempt come up clean. Total recovery time about 90 seconds. Beautiful.

Then I tried to actually load files.lab.towerbancorp.com and got a 502.

Three hours later, the nginx reverse proxy in front of filebrowser was still dead. The container behind it was healthy. The journal had no errors I could find on first read. systemctl status nginx-filebrowser reported Active: failed (Result: dependency) and a start counter of 1. One.

That number is the whole story. The fix from Homelab #197 — the one I’d written confidently in April — had two silent bugs in it, and the only reason I never caught them is that kvm02 hadn’t actually rebooted between the fix and now.

Both bugs landed in Homelab #241.

Bug 1: the burst limit was in the wrong section

The April fix added these lines to filebrowser.container:

[Service]
StartLimitBurst=20
StartLimitIntervalSec=300

…on the theory that the default systemd budget of 5 starts in 10 seconds was too tight for a container that needs to retry through a 30–45 second RBD map + XFS mount. Twenty starts in five minutes is generous. The arithmetic is right.

The placement was wrong. Modern systemd (anything >= 230, which is to say anything you’d find on a current distro) requires StartLimitBurst and StartLimitIntervalSec in the [Unit] section, not [Service]. The Quadlet generator on kvm02 had been printing this on every single boot since the original fix:

Unknown key 'StartLimitIntervalSec' in section [Service]
Unknown key 'StartLimitBurst' in section [Service]

…and journalctl -b 0 | grep "Unknown key" would have surfaced it any time in the last month. I never grepped for it. The unit file parsed. The container was retrying. The “20 starts in 300 seconds” budget I thought I’d granted it had silently reverted to the default 5-in-10. The burst defense the comment described — the one with the prose explanation about RBD map timing and runway and defense-in-depth — was never engaged.

It didn’t matter on the day I wrote the fix because filebrowser, on the day I wrote the fix, started cleanly. It also didn’t matter on any of the dozens of systemctl restart cycles since, because a manual restart isn’t racing rbd-filedrop. It mattered exactly once: when the box was cold-booting and the system was actually under the timing pressure the fix was for.

The first part of today’s patch moves both directives from [Service] to [Unit] and adds a comment naming the section requirement so future-me doesn’t repeat the mistake. Verified after deploy: systemctl show -p StartLimitBurst,StartLimitIntervalUSec filebrowser.service now reports 20 and 5min. The “Unknown key” messages are gone from the journal.
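A pre-deploy lint along these lines would have flagged the misplacement without waiting for a cold boot. This is a minimal sketch, not part of the actual patch: the function name lint_start_limit is hypothetical, and it only knows about the two directives discussed here.

```shell
#!/bin/sh
# Hypothetical helper: flag StartLimitBurst / StartLimitIntervalSec lines that
# sit outside [Unit] in one or more unit or Quadlet files.
# Prints "file:line: key in [Section], expected [Unit]" and returns 1 on any hit.
lint_start_limit() {
    awk '
        FNR == 1 { section = "" }                 # reset at the start of each file
        /^[ \t]*\[.*\][ \t]*$/ {                  # section header such as [Service]
            section = $0
            sub(/^[ \t]*\[/, "", section)         # strip the leading bracket
            sub(/\][ \t]*$/, "", section)         # strip the trailing bracket
            next
        }
        /^[ \t]*StartLimit(Burst|IntervalSec)[ \t]*=/ {
            if (section != "Unit") {
                key = $0
                sub(/[ \t]*=.*/, "", key)         # keep only the directive name
                sub(/^[ \t]*/, "", key)
                printf "%s:%d: %s in [%s], expected [Unit]\n", FILENAME, FNR, key, section
                bad = 1
            }
        }
        END { exit (bad ? 1 : 0) }
    ' "$@"
}
```

Run as lint_start_limit filebrowser.container before copying the file into place. A non-zero exit means at least one directive would be ignored at boot.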

That accounts for filebrowser. It does not account for why nginx stayed dead.

Bug 2: Requires= means forever

nginx-filebrowser.container had this:

[Unit]
After=filebrowser.service
Requires=filebrowser.service

Requires= says “if this dependency fails, fail us too.” After= says “wait for the dependency before we start.” Together they say “wait for filebrowser, and don’t bother starting if it didn’t come up.”

Here’s the thing nobody told me about Requires= until I read the manpage with the right question in my head: failure is a one-shot verdict. If the dependency fails its first start attempt, systemd marks the dependent unit as failed with Result=dependency and does not retry it, even if the dependency itself recovers later. Restart=always on the dependent unit is irrelevant — systemd never queued it to start in the first place. There is nothing to restart.

So this morning’s sequence was:

1. filebrowser.service first start attempt: fails (mount race).
2. nginx-filebrowser.service is queued, sees its Requires= dep failed, marks itself failed (Result: dependency). Counter increments to 1.
3. filebrowser.service retries, fails again (still racing the mount).
4. filebrowser.service retries a third time, succeeds. Healthy. Container running.
5. Nothing happens to nginx. Systemd considers the dependency question already answered: it was no, three minutes ago. The fact that it’s now yes is not a thing systemd asks again.

The fix is to weaken the relationship from Requires= to Wants=. Wants= still pulls filebrowser into the same start transaction, and After= still orders nginx behind filebrowser’s start job, but neither cares whether that start succeeded. With Wants= plus Restart=always on the nginx unit, the boot sequence becomes:

1. filebrowser fails its first start attempt.
2. nginx starts anyway. Tries to proxy upstream. Gets connection-refused. Returns 502.
3. nginx’s own Restart=always keeps it alive while filebrowser is still flapping. Each request to it 502s, but the unit is healthy.
4. filebrowser eventually comes up.
5. Next request to nginx succeeds. No human intervention.

The trade is real but it’s the right shape. Worst case is now ~30–45 seconds of 502s during boot — the duration of the RBD-map race — instead of an indefinite outage that requires me to notice and systemctl restart by hand. For an internal file dropbox, “service degraded for 45 seconds during a reboot” is unambiguously better than “service down until someone pages.”

What I want to remember

The whole shape of this one is uncomfortable in a way I want to name. The April fix wasn’t wrong per se — the prose comments described the right defense, the chosen burst budget would have worked, and the Requires= was a defensible choice if the only failure mode considered was “the dependency stays down forever.” It was wrong in the way fixes that never get exercised are often wrong: the prose was a hypothesis about behaviour, and the hypothesis went untested for a month.

Two specific lessons I’m pinning down so I don’t repeat them:

Quadlet’s “Unknown key” warnings are not informational. They look like style nits in the journal — the unit parsed, after all, and the container is running. They’re not. They mean a directive you wrote was thrown away. A grep for “Unknown key” against the post-boot journal of any container host is a five-second sanity check that I should add to the deployment script for every Quadlet rollout going forward. It would have caught this in April.
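The sanity check described above could be sketched as a small gate in the deployment script. This is an illustration under assumptions rather than the script as it exists: fail_on_unknown_keys is a hypothetical name, and it reads journal text on stdin so it can be exercised without a live journal.

```shell
#!/bin/sh
# Hypothetical deploy gate: read journal text on stdin, print any
# "Unknown key" lines, and return 1 if at least one was found.
fail_on_unknown_keys() {
    if grep -F "Unknown key"; then
        echo "a unit-file directive was ignored; fix it before calling the rollout done" >&2
        return 1
    fi
    return 0
}

# Usage on a live host after a reboot:
#   journalctl -b 0 --no-pager | fail_on_unknown_keys
```

Matching the fixed string rather than a specific directive name keeps the check useful for whatever key gets misplaced next.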

Requires= is the wrong tool for “I depend on this, but it’s allowed to be flaky.” Anywhere I have a unit that retries through a known race — and I have several of these now between filebrowser, the Wazuh queue mount, and the future ohp-dns Phase 4 work — the dependents on those units want Wants=, not Requires=. The mental model I had before today was “Requires is strict, Wants is loose.” The mental model I have now is “Requires says my dependency must have succeeded on its first attempt; Wants says my dependency must have been attempted.” Those are very different contracts and only the second one composes with retry loops.
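Finding the other dependents that need the same change is mostly a search problem. A rough sketch of that audit follows, with two labeled assumptions: the function name audit_requires is hypothetical, and the default directory /etc/containers/systemd is a guess at where the Quadlet files live on these hosts.

```shell
#!/bin/sh
# Hypothetical audit: print file:line:text for every Requires= line in the
# .container files under a directory. Each hit is a candidate for Wants=.
audit_requires() {
    dir=${1:-/etc/containers/systemd}   # assumption: Quadlet directory on the host
    find "$dir" -type f -name '*.container' \
        -exec grep -Hn '^[[:space:]]*Requires=' {} +
}

# Usage:
#   audit_requires                      # scan the assumed default directory
#   audit_requires ~/src/quadlets       # scan a checkout instead
```

Every hit still needs a human decision: a dependent that genuinely cannot run without its dependency may be right to keep Requires=.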

The patch is in main and verified on kvm02. The next reboot is the test, and I’m not in a hurry to schedule one for the sake of testing.

Sidebar: tonight’s research digest flagged CVE-2026-31431 (“copy.fail”), a kernel AF_ALG local-priv-esc that CISA added to the KEV catalog four days ago, with public exploits already in the wild. It applies to every Rocky 10 host in the fleet (kvm01, kvm02, site02-kvm01, server01, plex). No RLSA has landed yet. This is exactly the situation Homelab #178 — fleet patch management — was filed for. When the kernel update appears, the same nginx-stays-up-while-the-thing-behind-it-restarts pattern from today’s fix is going to be relevant for every reverse proxy on those hosts. I think the audit-the-Quadlets-for-Requires= pass and the kernel-patch pass want to happen in the same week.
