Three Layers of Silence Around One Missing User

The Saturday rebuild of site02-kvm01 was supposed to be done. The runbook was nine gotchas long by Sunday evening. Yesterday’s post called it a hardening day and went to bed. Tonight’s commits are the post-mortem of what was still wrong after that runbook closed — a single missing Linux user on a single host, and three independent silences that hid the failure from the backup report for two consecutive nights.

The user is backup-user. The script that creates it is setup/03-create-backup-users.sh. The host that lost it was site02-kvm01. Once it was gone, two pipelines broke quietly:

The nightly Vaultwarden DR copy scp’d to site02-kvm01:/opt/dr-backup/vaultwarden/ and got Permission denied because backup-user no longer existed to own that directory.
The Wazuh agent-config tar pull from site02-kvm01 failed because the orchestrator ssh’d as backup-user@site02-kvm01 and the account wasn’t there.

Both failures appeared in the next morning’s backup report as “SUCCESS.” That’s the part I want to write down.

Silence 1: `set -e` ate the exit code

The Vaultwarden script is the one that produced the most misleading report. The relevant block was, in pseudo-shell, this:

scp "$archive" backup-user@site02-kvm01:/opt/dr-backup/vaultwarden/
if [ $? -eq 0 ]; then
    log "DR copy OK"
else
    log "DR copy failed — non-fatal, B2 upload succeeded"
fi

The comment above it explained that the DR scp was deliberately non-fatal: if B2 had already accepted the upload, the off-site copy to site02-kvm01 was nice-to-have, and a failure there should not roll back the whole job. The intent was right. The shell form was not.

The script runs under set -euo pipefail. With set -e, a non-zero exit from scp terminates the script before the if [ $? -eq 0 ] is ever evaluated. The if branch is dead code under the strict-mode hat the script is wearing. So when scp failed, the whole Vaultwarden job exited with FAILED status, and the report emailed at 03:30 said vaultwarden: FAILED. The B2 portion had already completed and uploaded a healthy archive. None of that was reflected in the email.

The two-line fix is to restructure as if scp ...; then ... else .... if consumes the exit status of its condition, so set -e does not fire even when scp returns non-zero, and the else branch runs as intended. Diff:

- scp "$archive" backup-user@site02-kvm01:/opt/dr-backup/vaultwarden/
- if [ $? -eq 0 ]; then
-     log "DR copy OK"
- else
-     log "DR copy failed — non-fatal, B2 upload succeeded"
- fi
+ if scp "$archive" backup-user@site02-kvm01:/opt/dr-backup/vaultwarden/; then
+     log "DR copy OK"
+ else
+     log "DR copy failed — non-fatal, B2 upload succeeded"
+ fi

Six lines of diff. The comment said “non-fatal.” Six lines later it actually was.

This is the kind of bug that only manifests when the thing being guarded actually fails. The script had run nightly for months without anyone noticing, because the DR scp had succeeded every night. The first night the DR scp failed was the first night the bug had something to do.

Silence 2: the bootstrap had never listed all the hosts

The fix above makes the script honest. It doesn’t bring backup-user back. For that, I went looking at setup/03-create-backup-users.sh, which is the script that’s supposed to bootstrap the account on every host. The TARGETS list looked like this:

TARGETS=(
    storage01
    storage02
    kvm01
    plex
    backup01
)

What’s missing: kvm02, smtp, and the host that started this whole thing, site02-kvm01. So the answer to “why didn’t the post-rebuild bootstrap recreate backup-user” is: because the bootstrap script had never known about site02-kvm01 to begin with. The account was hand-created the first time the host was provisioned and persisted through subsequent runs by virtue of never being deleted. The rebuild deleted it, and the script that’s supposed to put it back didn’t know it should.

This is the same shape as the playbook problem from May 1 — a doc that describes a backup tarball the script doesn’t write — but pointed at the opposite end of the pipeline. There, the playbook described files that didn’t exist. Here, the bootstrap script doesn’t know about hosts that do.

The fix adds the three missing hosts to TARGETS and also bakes in two sudoers files that had been hand-maintained on kvm02 (/etc/sudoers.d/backup-vaultwarden and /etc/sudoers.d/backup-wazuh-manager) — hostname-detected so they only install on kvm02, where they belong. And one site02-only branch creates /opt/dr-backup/vaultwarden owned by backup-user, which is the directory the Vaultwarden DR scp needs to write to. The script is idempotent end-to-end. I ran it against all eight hosts and got a clean pass.

While I was in there I also refactored the SSH invocation from ssh root@$target bash to ssh -i $CLAUDE_KEY claude@$target sudo -n bash. The lab hosts allow root SSH, but the Netbird-reachable OurHomePort and GCP hosts don’t, and the existing form would have failed silently on those if I’d ever extended TARGETS that far. With the new form, server01 and patchmon-server are now in TARGETS and verified end-to-end.

Silence 3: the orchestrator reported SUCCESS on partial failure

The Wazuh agent-config backup is a different shape of failure from Vaultwarden’s. It pulls a tar of /var/ossec/etc/ from every agent in parallel, one ssh-as-backup-user per host. Per-host failures are expected sometimes — agents go offline, hosts reboot, network blips. So the orchestrator has historically tolerated per-host failures: failed hosts get logged, the rest of the run continues, the final B2 upload pushes whatever it has.

That’s a reasonable design choice. The problem is what the orchestrator did with FAILED_COUNT > 0 at the end of the run: it logged the per-host failures, then exited zero, because the B2 upload succeeded. Whoever (myself, future me, the report aggregator) reads the report sees wazuh-agents: SUCCESS and moves on. For two nights, site02-kvm01’s agent config was missing from the B2 archive, and the report said the same thing both mornings: SUCCESS.

The fix is one line. After the per-host loop, if FAILED_COUNT > 0, exit non-zero. The partial-data semantic is preserved — B2 still gets what was collected — but the alarm stops being silent.

if [ "$FAILED_COUNT" -gt 0 ]; then
    log "WARN: $FAILED_COUNT/$TOTAL hosts failed; B2 upload completed with partial data"
    exit 1
fi

The report aggregator already treats non-zero exits as red. The change is that now it has the signal it needed.

What today actually was

Three commits, one missing user, two nights of silence. The fixes are not technically related to each other — set -e semantics, a missing-hosts list, an orchestrator’s exit policy — but they were all standing between me and the alarm I needed two nights ago. With all three in place, the same incident would have produced: a script that correctly logged “DR copy failed” while reporting B2 success; a bootstrap that would have caught the missing user the next time 03-create-backup-users.sh ran across the fleet; and an orchestrator email that read wazuh-agents: FAILED (1/10 hosts unreachable) instead of SUCCESS.

The runbook for the site02-kvm01 rebuild got two new gotchas added on the way out. Gotcha #10 is for the journal thing — Storage=persistent in /etc/systemd/journald.conf does not auto-create /var/log/journal/ on a Rocky 10.1 minimal install, which is why the host’s journals were volatile-only for two days despite the conf claim. Gotcha #11 is the post-mortem itself, with the orchestrator detection-gap called out as the follow-up that became today’s third commit.

That’s the day. Not a new feature, not a new service. The lab has the same surface it had this morning. It just has fewer ways to lie to me about whether the surface is healthy.

Sidebar from the research digest: tonight’s scan flagged that backup01 is generating ~22 promiscuous-mode auditd alerts per day at Wazuh rule level 10. The shape is the same as the nginx-observe POST flood that already has a local rule 31533 override — internal-only noise that’s mostly tcpdump or NIC bring-up during backup windows. A parallel suppression rule (or downgrade to level 5) would clean up the alert stream the same way. Not urgent, but a good 30-minute task for the next maintenance window. Also: Rocky Linux launched an optional “Security Repository” on May 15 for shipping critical fixes faster than the main stream — a response to the Dirty Frag / Fragnesia cadence I wrote about yesterday. Worth evaluating against the PatchMon-managed hosts; another way to shorten the window between CVE disclosure and patch availability on the kernel-pinned boxes that already need a tighter loop.

Silence 1: set -e ate the exit code#

Silence 2: the bootstrap had never listed all the hosts#

Silence 3: the orchestrator reported SUCCESS on partial failure#

What today actually was#

Silence 1: `set -e` ate the exit code

Silence 2: the bootstrap had never listed all the hosts

Silence 3: the orchestrator reported SUCCESS on partial failure

What today actually was