Three days ago I wrote about what the DR script forgot — drift between production and the disaster-recovery mirror, decommissioned apps still living in restore code, an n8n version pin that never made it across the wall. That post named the problem. The last forty-eight hours have been about closing the gaps.
Tonight the closure landed in one go: a backup script for filebrowser, a backup script for n8n, in-place restore scripts for n8n, filebrowser, and vaultwarden, a quarterly cycle-breaker drill for vaultwarden, and the thing that mattered most — an automated monthly restore-test suite that drives every restore script in staging mode on the first of each month at 06:00, captures pass/fail per component, and emails the result.
I ran the suite once by hand to validate it.
It failed.
Six components, one of them lying for who knows how long
The suite runs six restores in sequence: vaultwarden, n8n, filebrowser, wazuh-manager, wazuh-agents, kvm-configs. Five passed. The sixth — wazuh-agents — reported 0 valid / 8 invalid. Every host in the agent backup was being rejected at the ossec.conf syntax-check step with the same parser error:
Extra content at the end of the document, line 185
Line 185 of any of those ossec.conf files is — predictably — the closing </ossec_config> of the first config block. The file keeps going past it. There’s another <ossec_config> after, and another, and another.
This is because Wazuh’s ossec.conf is not a single-root XML document. It is intentionally a sequence of <ossec_config> blocks at the document root, each one scoping a different policy (server, syscheck, rootcheck, active-response). Wazuh’s own parser handles this just fine. xmllint --noout does not. The XML 1.0 spec requires exactly one root element, and xmllint was correctly enforcing it against a file that intentionally violates it. The restore script had been running xmllint --noout directly on the file and marking every host invalid, every time.
The fix was four lines. Wrap the file contents in a phantom <wrap>...</wrap> root before piping into xmllint. Real syntax errors — unclosed tags, malformed attributes, bad entities — still get caught. The multi-root layout that Wazuh expects no longer gets rejected.
if ! { printf '<wrap>'; cat "$ossec"; printf '</wrap>'; } | xmllint --noout - 2>/dev/null; then
    log_error " $host: ossec.conf failed XML syntax check"
fi
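For anyone who wants to reproduce the behaviour outside the restore script, here is a minimal sketch of the same check as a standalone function. The name check_ossec_syntax is made up for illustration and is not what the restore script calls it; only the wrap-then-xmllint pipeline comes from the fix above. It assumes xmllint (from libxml2) is on the PATH.

```shell
#!/usr/bin/env bash
# Sketch only: check_ossec_syntax is a hypothetical name, not the restore script's.

# Return 0 if a multi-root ossec.conf parses once wrapped in a synthetic root.
check_ossec_syntax() {
    local ossec="$1"

    # Without this guard a missing file would "validate": cat would fail,
    # but xmllint would still see a well-formed <wrap></wrap>.
    [[ -r "$ossec" ]] || return 1

    { printf '<wrap>'; cat "$ossec"; printf '</wrap>'; } \
        | xmllint --noout - 2>/dev/null
}
```

Called as check_ossec_syntax /path/to/ossec.conf, it should return zero for a file with several sibling <ossec_config> blocks and non-zero for one with an unclosed tag. The readability guard is my addition; the original four lines may handle a missing file elsewhere.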
Re-ran the suite. All six components PASS. Suite exits zero. wazuh-agents-backup-20260522-062413 validated in two seconds against eight hosts.
The interesting part isn’t the fix. The interesting part is that this check had been silently broken since the day the script was written. If we’d ever actually had to restore Wazuh agents in anger, the staging-mode validation would have refused to promote the backup, and the operator at 03:00 would have stared at “0 valid / 8 invalid” for several minutes before either bypassing the check or correctly diagnosing a multi-root XML problem under duress. The whole point of the monthly suite is to catch exactly this kind of latent, never-exercised failure mode. It caught one on its first run.
The vaultwarden cycle-breaker
While building out the restore tests I also wrote a separate quarterly drill for vaultwarden specifically. This one is conceptually different from the standard restore — it’s a cycle-breaker.
The setup: vaultwarden’s database is backed up nightly to Backblaze B2. The B2 path is encrypted with rclone crypt. The rclone-crypt key is stored inside vaultwarden. So if you lose vaultwarden — fully — you can’t decrypt the B2 backup that contains vaultwarden, because the key you need is inside the thing you’re trying to recover. This is the classic recursive-trust problem, and the standard answer is “make a local DR copy that doesn’t depend on B2.”
The DR copy lives at site02-kvm01:/opt/dr-backup/vaultwarden/vaultwarden-latest.tar.gz. It’s the same database, written through a path that doesn’t go through rclone-crypt, refreshed nightly. The cycle-breaker drill validates that this copy can boot a working vaultwarden container without ever reaching B2 — the “lost the keys, lost the B2 path, prove you can still recover” scenario.
First execution today: PASS. 55 ciphers, /alive returned 200, the rsa_key.pem was present in the restored attachments directory, and the container booted in two seconds. Results captured at infrastructure/kvm-backup/applications/vaultwarden/restore/results/DR-VW-CYCLEBREAKER-2026-05-22.md.
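The drill itself boots a real container, which is the part that matters. A cheaper pre-check that could run before the boot is to confirm the local archive even contains the files vaultwarden needs. The sketch below is not the drill script; the function name is hypothetical, and the file names passed to it (db.sqlite3, rsa_key.pem) are assumptions about what the archive holds.

```shell
#!/usr/bin/env bash
# Sketch only: archive_has_files is a hypothetical helper, not part of the drill.

# Return 0 if every given file name appears somewhere inside the tarball,
# regardless of which directory it sits in.
archive_has_files() {
    local archive="$1"; shift
    local listing name entry found

    # Listing the archive also proves it is a readable gzip'd tar.
    listing="$(tar -tzf "$archive" 2>/dev/null)" || return 1

    for name in "$@"; do
        found=0
        while IFS= read -r entry; do
            # Compare on the basename so ./data/x and data/x both match x.
            if [[ "${entry##*/}" == "$name" ]]; then
                found=1
                break
            fi
        done <<<"$listing"
        if (( found == 0 )); then
            echo "missing from archive: $name" >&2
            return 1
        fi
    done
    return 0
}
```

Usage would look like archive_has_files /opt/dr-backup/vaultwarden/vaultwarden-latest.tar.gz db.sqlite3 rsa_key.pem. It never touches B2, which is the property the cycle-breaker is supposed to prove.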
This is the kind of drill that has to be calendared. Nobody is going to remember to test the lost-keys scenario unscheduled. The DISASTER-RECOVERY.md now lists it as a quarterly manual drill alongside the standard B2-path recovery, and Homelab #231 closes with both paths exercised.
What’s actually in the suite
The monthly suite is a single bash script — run-monthly-restore-tests.sh — that lives in the backup-orchestrator container on backup01. The systemd timer fires at *-*-01 06:00:00 America/New_York with Persistent=true, so if backup01 happens to be down at 06:00 on the first, the catch-up run happens whenever it comes back. There’s a 30-minute jitter so it doesn’t land at exactly the same time as any other monthly automation.
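For reference, a timer with that schedule could look roughly like the unit printed by the sketch below. The function name and the description text are hypothetical, and mapping the 30-minute jitter to RandomizedDelaySec is my assumption about how it is configured; the OnCalendar expression and Persistent=true are as described above.

```shell
#!/usr/bin/env bash
# Sketch only: the function name and the RandomizedDelaySec mapping are assumptions.

# Print a systemd timer unit matching the schedule described in the post.
render_restore_test_timer() {
    cat <<'EOF'
[Unit]
Description=Monthly restore-test suite (hypothetical unit)

[Timer]
# First of every month at 06:00, New York local time
OnCalendar=*-*-01 06:00:00 America/New_York
# Run a missed activation once the host is back up
Persistent=true
# Spread the actual start over up to 30 minutes
RandomizedDelaySec=30min

[Install]
WantedBy=timers.target
EOF
}
```

Redirect the output into a .timer file under /etc/systemd/system, run systemctl daemon-reload, and enable it. Running systemd-analyze calendar '*-*-01 06:00:00 America/New_York' is a quick way to confirm when the expression next elapses.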
Per component:
- Discover the latest backup via rclone lsf --dirs-only; lexical sort works because the backup names embed YYYYMMDD-HHMMSS.
- Set RESTORE_MODE=staging and shell out to the component’s restore script.
- Apply a 5-minute timeout (timeout 300).
- Capture per-component logs at /backup-logs/restore-test-<TS>-<component>.log.
- Write a combined summary at /backup-logs/restore-test-<TS>.log.
- Email the summary using smtplib inline (msmtp is referenced in config but not actually installed in the orchestrator image; same workaround backup-master.py already used).
- Prune restore-staging artifacts older than 7 days.
- Exit non-zero if any component failed, so the systemd unit status reflects reality.
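The discovery step in the list above is small enough to sketch. Both function names are hypothetical, and the remote path is whatever the component uses; the only things taken from the suite are rclone lsf --dirs-only and the reliance on lexical order.

```shell
#!/usr/bin/env bash
# Sketch only: pick_latest and latest_backup are hypothetical names.

# Read names on stdin and print the lexically greatest one.
# This equals "newest" only when every name shares the same prefix and
# embeds a fixed-width YYYYMMDD-HHMMSS timestamp.
pick_latest() {
    LC_ALL=C sort | tail -n 1
}

# List the directories under an rclone remote and keep the newest.
# rclone prints directory names with a trailing slash.
latest_backup() {
    local remote="$1"
    rclone lsf --dirs-only "$remote" | pick_latest
}
```

Forcing LC_ALL=C keeps the ordering byte-wise, so the result does not depend on the locale of the orchestrator image.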
set -uo pipefail without -e was the right choice. If -e were on, the first failure would halt the suite and the other five components would never run. The point is to get a full picture of what’s broken on the first of every month, not stop at the first red light.
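The shape of that loop, reduced to a sketch: this is not run-monthly-restore-tests.sh, and the restore-script layout ($RESTORE_DIR/<component>/restore.sh), the function name, and the summary line format are all assumptions made for illustration. The component list, RESTORE_MODE=staging, timeout 300, and the log paths come from the description above.

```shell
#!/usr/bin/env bash
# Sketch only: layout, names, and summary format are assumptions.
set -uo pipefail   # deliberately no -e: one failure must not stop the rest

RESTORE_DIR="${RESTORE_DIR:-/opt/restore}"   # hypothetical location
LOG_DIR="${LOG_DIR:-/backup-logs}"
COMPONENTS=(vaultwarden n8n filebrowser wazuh-manager wazuh-agents kvm-configs)

# Run every component's restore script in staging mode.
# Prints the combined summary; returns 1 if any component failed.
run_suite() {
    local ts summary component rc failed=0
    ts="$(date +%Y%m%d-%H%M%S)"
    summary="$LOG_DIR/restore-test-$ts.log"

    for component in "${COMPONENTS[@]}"; do
        # The variable assignment is exported to timeout and, through it,
        # to the restore script. A timeout shows up as exit status 124.
        RESTORE_MODE=staging timeout 300 \
            "$RESTORE_DIR/$component/restore.sh" \
            > "$LOG_DIR/restore-test-$ts-$component.log" 2>&1
        rc=$?

        if (( rc == 0 )); then
            echo "PASS $component" >> "$summary"
        else
            echo "FAIL $component (exit $rc)" >> "$summary"
            failed=1
        fi
    done

    cat "$summary"
    return "$failed"
}
```

A real script would end with run_suite followed by exit "$?", so the systemd unit inherits the failure. With -e enabled, the first failing restore script would end the run before rc=$? was ever reached.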
A sidebar from tonight’s research
Two things from the digest worth recording before I forget them.
First, the n8n CVE chain — the “Ni8mare” exploits and the four chained CVEs the news cycle was breathlessly covering this week — does not apply to this lab. Those CVEs target the 1.x branch (pre-1.121.0). We’re on 2.18.5. This is the second time this has come up in a research run; the same misread closed Homelab #266 earlier this month. Recording it loudly here so future-me stops re-investigating.
Second, backup01 is generating most of the lab’s noise floor right now. Seventy-eight of seventy-eight level-10 alerts in the last 24 hours came from backup01, all on rule 80710 (“Auditd: Device enables promiscuous mode”). This is the bridge/veth churn from the very backup orchestrator we just hardened — every restore-test run will trip this rule multiple times. Same shape as the rule 31533 override for nginx-observe — a candidate for a local-rule suppression scoped to agent.name == backup01 only. Not filing today; it’s the next thing on the Wazuh-tuning list.
What today felt like
The reason DR work pays off is not that backups happen; backups have been happening. The reason it pays off is that the path from a backup back to a working service gets exercised against the real script, against a real archive, before anyone needs it. The suite that ran tonight is the cheapest version of that exercise. It will run on the first of every month, in the dead air of 06:00 Eastern, and report back whether each restore script can actually do its job against last night’s archive.
It did its job tonight. It found a check that had been silently failing every host since the script was written. We get to fix that on a Friday evening with no time pressure, instead of at 03:00 during an actual outage. That is the entire point.
