The Playbook Found the Bugs

The plan was to scaffold the disaster-recovery playbooks for a one-shot drill that restores every backed-up component from Backblaze B2 onto site02-kvm01 — KVM XML configs, Wazuh agents, Vaultwarden, the five server01 apps, the Wazuh manager, and one Ceph RBD image — validates each with a smoke test, tears it down, and produces a results doc plus one GitHub issue per gap. The spec went in last night. The plan went in this morning. By the time I tagged out tonight there were eleven playbook files on disk totalling about 5,400 lines of markdown across the two repos.

The drill itself has not run. Not one of those playbooks has been executed against real B2 data.

The playbooks have already filed seven issues.

That is the part I want to write down, because before tonight I would have told you that DR playbooks are documentation — a thing you produce for the next operator or for future-you. They are. They are also a remarkably efficient way to find out what’s wrong with the system you’re documenting, because the act of writing one forces you to commit, in writing, to claims about production state that you can immediately check.

Drift the spec didn’t know about

The spec I shipped last night had a narrative arc — “you’ll pull this tarball from B2, extract it here, restore it there” — that was confident and wrong in four places. None of those places were obvious until I sat down to write the corresponding playbook section and found the path it pointed at didn’t exist or the format it described wasn’t what the backup actually wrote.

The B2 layout was the first one. The spec said `<latest>.tar.gz`. Reality is `<latest>/`: the rclone-sync pipeline writes timestamped directories, not tarballs. Second, the Postgres versions in the OurHomePort sub-procedures: the spec implicitly assumed Postgres 15 across the board, but n8n and Authentik are actually pinned to `16-alpine`, and ezbookkeeping is on 15. Third, the Authentik image: `:latest` in the spec, `ghcr.io/goauthentik/server:2026.2` in the running unit. Fourth, the Wazuh source path: `applications/wazuh/restore/` in the spec table, but the directory in the repo is `security/wazuh/restore/`. None of those would have stopped a drill, since the operator would have figured each out in the first sixty seconds. But the cumulative effect of four small wrongnesses in a procedure you’re following stressed and tired is one of those compounding failure modes I’d rather not test in a real recovery.
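The four mismatches above are all the same kind of check: a claim in the spec compared against a value read from production. A minimal sketch of that comparison follows, assuming both sides have been collected by hand into flat dictionaries. The key names and the `find_drift` helper are made up for illustration; only the values come from the drift described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Drift:
    key: str
    spec: str
    observed: str


def find_drift(spec: dict[str, str], observed: dict[str, str]) -> list[Drift]:
    """Return every claim where the spec disagrees with what was observed.

    A key missing from `observed` is reported as drift too, because an
    unchecked claim is not a verified one.
    """
    drifts = []
    for key, claimed in sorted(spec.items()):
        actual = observed.get(key, "<not checked>")
        if actual != claimed:
            drifts.append(Drift(key, claimed, actual))
    return drifts


# Hypothetical key names; the values are the ones quoted in the text.
SPEC = {
    "b2_layout": "<latest>.tar.gz",
    "authentik_image": ":latest",
    "wazuh_restore_path": "applications/wazuh/restore/",
    "n8n_postgres": "15",
    "ezbookkeeping_postgres": "15",
}
OBSERVED = {
    "b2_layout": "<latest>/",
    "authentik_image": "ghcr.io/goauthentik/server:2026.2",
    "wazuh_restore_path": "security/wazuh/restore/",
    "n8n_postgres": "16-alpine",
    "ezbookkeeping_postgres": "15",
}

if __name__ == "__main__":
    for d in find_drift(SPEC, OBSERVED):
        print(f"{d.key}: spec says {d.spec!r}, production says {d.observed!r}")
```

The ezbookkeeping entry is there on purpose: it matches, so it stays out of the report, which is the behaviour you want from a check that will be re-run after the spec is fixed.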

The fix was a single commit to the spec (Homelab #237). The lesson is that a spec written from memory and a playbook written from production are different artifacts, and the only one of those two I want to be using when site01 is gone is the second.

A P0 gap that turned out to be by design

The bigger reframing of the day was around Wazuh. The plan-review pass last night had flagged “indexer + dashboard volumes empty after restore” as a critical-priority data-loss gap, on the assumption that the backup of the Wazuh manager ought to include the indexer’s history of alerts and saved searches. The playbook section was going to log this as a P0, file it as Homelab #232, and recommend extending the backup pipeline to include the OpenSearch data volume.

What I found when I looked at the actual backup script was that the indexer volume is deliberately excluded. The pipeline backs up internal_users.yml, client.keys, local_rules.xml, the manager’s API certs, and the Filebeat config — every piece of state required to bring a fresh manager up and have it accept reconnections from existing agents — but not the alerts data or the saved searches or the dashboards. The trade is documented: backup size and restore simplicity in exchange for sacrificing historical alerts. Agents re-enroll from client.keys post-DR and start forwarding fresh alerts within minutes.

So the reframe came down to two edits. The header note changed from “Critical gap” to “Expected by design,” and the smoke test changed from “WARN … log P0 gap” to “INFO … expected baseline.” The playbook now teaches the right thing on first read instead of teaching the operator to go file an issue against a working system.
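As a rough illustration of what that smoke-test change means in practice, here is a sketch in Python. It is not the playbook's actual test. The `smoke_test` function and its inputs are assumptions, and the file list covers only the three files named above (the API certs and Filebeat config are left out because their filenames are not given here).

```python
# Files the backup is described as carrying, treated here as bare names.
REQUIRED_STATE = (
    "internal_users.yml",
    "client.keys",
    "local_rules.xml",
)


def smoke_test(restored_files: set[str], indexer_doc_count: int) -> list[tuple[str, str]]:
    """Classify the restored Wazuh manager state as (level, message) pairs."""
    results = []
    for name in REQUIRED_STATE:
        if name in restored_files:
            results.append(("PASS", f"{name} restored"))
        else:
            results.append(("FAIL", f"{name} missing from restore"))

    # The indexer volume is excluded from the backup by design, so an
    # empty indexer is the expected baseline and not a gap to file.
    if indexer_doc_count == 0:
        results.append(("INFO", "indexer empty after restore: expected baseline"))
    else:
        results.append(("INFO", f"indexer holds {indexer_doc_count} documents"))
    return results
```

The design choice worth noticing is that the empty-indexer branch can never produce a WARN or FAIL. Only missing manager state can fail the test, which matches what the backup actually promises to restore.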

I would not have caught this by re-reading the backup pipeline. I caught it because the playbook required me to write down, in a numbered section, what the operator should expect to see in the indexer after restore — and the only honest answer was “nothing, and that’s correct.”

Vaultwarden was running `:latest` with AutoUpdate

This one is the same shape as last night’s n8n-frozen-at-2.9.4 story but rotated. The Vaultwarden quadlet was running `vaultwarden/server:latest` with `AutoUpdate=registry`. While I was writing the section that says “to restore Vaultwarden during DR, run a container from this image tag,” I realized I could not name the image tag, because the production unit didn’t pin one and `podman-auto-update.timer` had been moving the running version forward unannounced.

Running `podman exec vaultwarden /vaultwarden --version` returned 1.35.3. So I pinned the unit to `vaultwarden/server:1.35.3`, dropped `AutoUpdate=registry`, and updated the `podman run` command in section 3 of the playbook to match. The repo change is not yet deployed to kvm02: the live container is still managed by the old unit and will keep auto-updating until the next scheduled deploy. But the next time someone (me) edits the unit and reloads it, the pin holds, and the playbook becomes accurate at the same instant.

The general failure-mode label I kept writing into these playbook sections is “production tag bumped after playbook last updated.” Pinning makes that failure mode a build-time inconsistency instead of a runtime surprise.
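One way to make that a build-time check is to lint the unit files in the repo. The sketch below is an assumption about how such a lint could look, not something that exists in either repo. `audit_quadlet` is a hypothetical helper that reads the `Image=` and `AutoUpdate=` keys of a Quadlet `.container` file with plain line parsing.

```python
def audit_quadlet(unit_text: str) -> list[str]:
    """Return human-readable findings for a Quadlet .container unit."""
    image = None
    auto_update = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and section headers such as [Container].
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if key == "Image":
            image = value
        elif key == "AutoUpdate":
            auto_update = value

    findings = []
    if image is None:
        findings.append("no Image= line found")
    else:
        # Look for the tag only in the final path component, so a registry
        # port such as host:5000/name is not mistaken for a tag.
        name = image.rsplit("/", 1)[-1]
        # A digest reference (name@sha256:...) counts as pinned.
        tag = "digest" if "@" in name else name.partition(":")[2]
        if tag in ("", "latest"):
            findings.append(f"Image {image!r} is not pinned to a version tag")
    if auto_update == "registry":
        findings.append("AutoUpdate=registry will move the running version forward")
    return findings
```

Fed the old unit (`Image=vaultwarden/server:latest` plus `AutoUpdate=registry`) it reports two findings, and fed the pinned unit it reports none. It only inspects the file in the repo, so it says nothing about what is deployed on kvm02.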

Termix needs a guacd it never told anyone about

The OurHomePort side surfaced its own gap. Termix is an SSH-in-the-browser app, and the deployment doc described it as a single container, `termix.container`. The DR sub-procedure required me to actually list everything that had to come up for terminal sessions to work. Termix is two containers: the web frontend and a guacd sidecar that does the actual SSH protocol relay. A new operator deploying only `termix.container` would see HTTP 200 on the UI, log in fine, and lose ten or twenty minutes diagnosing why every terminal session silently fails to connect.

That earned its own commit and its own issue (OHP #76) — a “Dependencies” section in applications/termix/docs/deployment.md that lists both images and ports and the operational gotcha, plus a cross-link from the DR playbook back to the dependency section.
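The same dependency list can be checked mechanically once it is written down. This is a small sketch, assuming the running container names are gathered some other way (for example from `podman ps`). The container names and the `missing_dependencies` helper are placeholders, since the only things named above are `termix.container` and a guacd sidecar.

```python
# Placeholder container names, one tuple per application.
REQUIRED = {
    "termix": ("termix", "guacd"),
}


def missing_dependencies(app: str, running: set[str]) -> list[str]:
    """List the containers an app needs that are not currently running."""
    return [name for name in REQUIRED[app] if name not in running]


if __name__ == "__main__":
    # The failure described above: the UI is up, the sidecar is not.
    print(missing_dependencies("termix", {"termix"}))
```

The point of a check like this is that an HTTP 200 from the UI is not evidence the app works. The smoke test has to ask about the sidecar by name.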

The cheap drill

The full taxonomy of what got written today is six playbooks on the Homelab side (KVM configs, Wazuh agents, Vaultwarden, the server01-apps wrapper, the Wazuh manager, and the Ceph-RBD smtp playbook), five server01 sub-procedures on the OurHomePort side (Actual Budget, Termix, ezbookkeeping, n8n, Authentik), a Stage 0 preflight that proves B2 connectivity and rclone keys before any restore runs, a results-doc template, a gap-issue template, and a master runbook that links the whole thing together. Plus seven issues filed against gaps the writing surfaced and one spec-drift commit and one image-pinning commit and one deployment-doc commit.

The drill still has not run. When it does run, in some week when I have a clean three-hour block on site02-kvm01, those gap counts will probably double: actually pulling backups from B2 and trying to bring services up will fail in ways the writing didn’t catch. That’s the point of doing the drill at all.

But the cheapest drill is the one where you write down, in unforgiving detail, what you would do, and the inconsistencies between what you wrote and what’s actually on disk fall out of the page on their own. I closed seven of those today without ever pulling a single backup. I’d rather be doing that than discovering them at 2 a.m. with site01 gone.

The thing this week’s worth of posts keeps circling is that “what is actually running” is rarely the same as “what we believe is running,” and the difference is only ever bridged by a question pointed at the running system itself. Yesterday it was podman exec n8n n8n --version. Today it was the act of writing down, in a numbered playbook section, exactly which image tag a recovery would pull — and finding I could not name it.

The playbook is the question. The system answers, eventually, in the form of every place the playbook turns out to be wrong.
