The Two-of-Two Was Always a Bet

Yesterday’s post was about a quiet day where the research digest did the work. Today was the opposite — six commits across the Homelab repo before sundown, three of them landing fixes for things I’ve been writing about for two weeks. The biggest of them, in the sense of “things that would have hurt if I’d lost them,” was finally adding a third Ceph monitor.

The 2-of-2 quorum was always a bet

Until this afternoon, the Ceph cluster had two monitors. One on storage01, one on storage02. A two-mon cluster has the unhappy property that losing either mon takes the cluster offline — quorum requires a strict majority, and 1-of-2 isn’t a majority. Three mons survives one loss. One mon is a single point of failure on its own. Two mons is a single point of failure dressed up as redundancy.
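The majority arithmetic behind that paragraph is small enough to write down. This is a minimal sketch of the strict-majority rule only, not anything from the Homelab repo; the function names are made up for illustration.

```python
def quorum_size(mon_count: int) -> int:
    """Smallest strict majority of mon_count monitors."""
    if mon_count < 1:
        raise ValueError("need at least one monitor")
    return mon_count // 2 + 1


def tolerated_losses(mon_count: int) -> int:
    """How many monitors can fail while a strict majority still survives."""
    return mon_count - quorum_size(mon_count)


for n in (1, 2, 3, 4, 5):
    print(f"{n} mons: quorum needs {quorum_size(n)}, tolerates {tolerated_losses(n)} lost")
```

Two mons need both present, so they tolerate zero losses, exactly like one mon. Three is the first count that tolerates a loss, and four tolerates no more than three does.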

I’ve known this since the cluster came up. I left it as a 2-mon for a long time because the third candidate host kept changing — first kvm01, then “wait until the storage cluster is rebuilt,” then “wait until the kernel-pin story is settled.” Every reason was real. None of them justified the bet.

The bet got cashed in three weeks ago when storage01 rebooted as part of the kernel-pin work and the cluster went WARN for a few minutes while storage02’s mon held quorum on its own — technically a quorum of 1 in a 2-mon cluster, which Ceph permits but flags loudly. Nothing was lost. The OSDs kept serving from the surviving mon’s view of the OSDMap. But I watched ceph -s from a Netbird-tunneled SSH and noticed that the timing was almost luck: if storage02 had blipped while storage01’s mon was still down, the cluster would have stopped accepting writes until at least one mon came back. The vms pool would have frozen mid-write on three running VMs.

Today the third mon went on kvm01. The plan (2026-05-17-add-3rd-ceph-mon.md) is mundane: pre-flight check, SSH key bootstrap, host registration, ceph orch apply mon --placement="storage01,storage02,kvm01", then an acceptance test that stops the storage01 mon, verifies the cluster keeps serving via 2-of-3 quorum, and restarts it.

The acceptance test is the part I actually care about. The implementation is one command. The proof that it works the way it’s supposed to is a deliberate failure injection. I ran it. Cluster I/O continued through the simulated outage. ceph quorum_status reported 2 of 3 mons live. The mon came back when restarted and rejoined cleanly without a full rebuild of the mon DB.
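The quorum check in that test can be scripted rather than eyeballed. The sketch below assumes the JSON form of ceph quorum_status carries a quorum_names list and a monmap.mons list. The sample document is hand-written for illustration and trimmed to those two fields; it is not captured output from this cluster, and the helper names are hypothetical.

```python
import json


def quorum_report(status: dict) -> tuple[int, int]:
    """Return (mons currently in quorum, mons listed in the monmap)."""
    in_quorum = len(status["quorum_names"])
    total = len(status["monmap"]["mons"])
    return in_quorum, total


def has_majority(status: dict) -> bool:
    """True when the mons in quorum form a strict majority of the monmap."""
    in_quorum, total = quorum_report(status)
    return in_quorum >= total // 2 + 1


# Hand-written sample (assumed field names), modelling storage01's mon stopped.
sample = json.loads("""
{
  "quorum_names": ["storage02", "kvm01"],
  "monmap": {"mons": [{"name": "storage01"}, {"name": "storage02"}, {"name": "kvm01"}]}
}
""")

print(quorum_report(sample), has_majority(sample))
```

In a real run the dict would come from parsing the command's JSON output while the storage01 mon is stopped, and the test would fail the run if has_majority returned False.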

What I noticed afterwards is that this is the first piece of cluster infrastructure I’ve added where I built the acceptance test before making the change. The writing-plans skill review loop flagged this: it surfaced two blocking items (a wrong cephadm check-host CLI variant, hardcoded hostnames where I should have captured them at runtime) and five should-fixes before I committed the plan. Every one of those was a thing I would have hit during execution and had to fix in the middle of the work. The skill’s job is to make me find them before I’m holding a half-applied change. Today it did the job. I’d like it to keep doing the job.

The rollback wasn’t sufficient

The second commit of the day codifies kernel pins as dnf exclude lines in /etc/dnf/dnf.conf on storage01 and site02-kvm01. Two days ago I would have told you the kernel pins were already in place — grubby --set-default-index is pointed at the known-good kernel on both hosts, and that’s been the pinning story since the May 11 post.

The 2026-05-16 fleet patch session showed it isn’t enough.

When patch_all runs on storage01, the new kernel package installs fine. The grub default doesn’t move, so the boot stays on the pinned 5.14.0-570.30.1. But microcode_ctl and linux-firmware upgraded with the run, and those packages persist across the grub-default rollback. Same for kmod-kvdo and vdo — patch_all blocked on kmod-kvdo wanting kernel-modules >= 5.14.0-611.35.1.el9_7, which is a dependency the pinned kernel will never satisfy. VDO isn’t even in use on storage01; it’s installed from a Rocky default group and the package manager doesn’t know that.

So the pin escalated. The new exclude list on storage01 is:

exclude=kernel kernel-* microcode_ctl linux-firmware grub2-* shim* kmod-kvdo vdo

microcode_ctl and linux-firmware made the list because the May 16 patch session proved they survive a rollback. grub2-* and shim* made the list because a bootloader update mid-rollback is a way to lose the rollback. kmod-kvdo and vdo made the list because the kernel-module-version dependency was the actual proximate cause of the patch session stalling. The diff is six lines of config and a long comment explaining the why for each exclusion, which is the version of this commit I’d want to read in six months when I’ve forgotten the failure that justified the exclusion.
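To sanity-check which package names a list like that would catch, the patterns can be approximated with shell-style globs. This is a rough model, assuming the exclude entries match on package name only. dnf's real matching also understands version and architecture forms, which this sketch ignores, and is_excluded is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# The exclude line from storage01, split into individual patterns.
EXCLUDE = "kernel kernel-* microcode_ctl linux-firmware grub2-* shim* kmod-kvdo vdo".split()


def is_excluded(package_name: str, patterns=EXCLUDE) -> bool:
    """Approximate dnf's exclude check by globbing on the bare package name."""
    return any(fnmatchcase(package_name, pattern) for pattern in patterns)


for name in ("kernel", "kernel-core", "grub2-common", "shim-x64", "openssl"):
    print(name, is_excluded(name))
```

Under this model an entry without a wildcard, such as vdo, matches only that exact name. That is the behavior to keep in mind when deciding whether an entry needs a trailing glob.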

This is the second time in two weeks the pinning story has gotten more specific. May 11 was “I thought I was pinned and I wasn’t.” May 17 is “I was pinned at one layer and the patches went around it.” The layers are deeper than I gave them credit for.

Site02-kvm01 had nine ways to break

The third commit is the runbook that closes out the four-day rebuild of site02-kvm01 after the Fanxiang NVMe failed three times. It runs to 231 lines and documents nine distinct gotchas. The two I want to call out:

Firewalld on a minimal Rocky 10 install silently RSTs everything except SSH. status.lab.towerbancorp.com was unreachable from VLAN 100 until firewall-cmd opened 443, 53, and 5080. The original post-rebuild verification step hit 127.0.0.1 from the host itself, which works fine because firewalld doesn’t filter loopback. Cross-host verification — curl https://status.lab.towerbancorp.com from a different VLAN-100 box — caught the regression. The runbook now requires that check.

restore.sh v1 copied unbound.conf with the literal text REPLACE_CONTROLD_ID still in it. The DoT upstream silently failed for hours because Unbound was trying to validate against a hostname containing a placeholder. The original deploy.sh runs a sed to substitute the real ControlD profile ID at deploy time; restore.sh was just copying the deployed-output file back, which on this host had never been deployed because the host was being rebuilt. Two correct scripts whose composition is wrong. The fix in restore.sh v2 is to re-run deploy.sh from a workstation that has the substitution variables in its environment, not to copy the live file.
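A cheap guard against this class of failure is to refuse to ship any config that still contains a placeholder token. The sketch below assumes placeholders follow a REPLACE_ prefix convention in upper case, generalized from the one token named above. The sample config text and its address are invented for illustration and are not the real unbound.conf.

```python
import re

# Assumed convention: placeholders look like REPLACE_SOMETHING in upper case.
PLACEHOLDER = re.compile(r"\bREPLACE_[A-Z0-9_]+\b")


def find_placeholders(text: str) -> list[tuple[int, str]]:
    """Return (line number, token) for every unsubstituted placeholder."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in PLACEHOLDER.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


# Invented sample, not the real config.
sample_conf = """\
forward-zone:
    name: "."
    forward-tls-upstream: yes
    forward-addr: 192.0.2.53@853#REPLACE_CONTROLD_ID.resolver.example
"""

print(find_placeholders(sample_conf))
```

Run against the restored file before restarting Unbound, a non-empty result would have turned hours of silent DoT failure into an immediate, named error.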

Both of these are the same shape: a verification that works on the host being verified, and fails to catch a configuration that only breaks when something else tries to reach it. The May 15 post about “what counts as ’this host came up the same way it went down’” was, in retrospect, the same observation pointed at a different layer of the stack. Smoke tests have to be exercised from the outside, not from the host that just came up.
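An outside-in check does not need much. This is a minimal sketch of a TCP reachability probe meant to run from a different box than the one being verified. The function name and timeout are assumptions, and it only proves that a TCP connect succeeds, not that the service behind the port answers correctly. DNS over UDP on port 53 would need its own check.

```python
import socket


def unreachable_ports(host: str, ports: list[int], timeout: float = 3.0) -> list[int]:
    """Return the ports on host that refuse or time out a TCP connect."""
    failed = []
    for port in ports:
        try:
            # A completed connect is enough here; close it straight away.
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError:
            failed.append(port)
    return failed
```

From another VLAN-100 host, calling unreachable_ports("status.lab.towerbancorp.com", [443, 53, 5080]) should return an empty list. Running the same call on the rebuilt host against 127.0.0.1 would not have caught the firewalld problem, because the original loopback check passed there too.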

What the day actually was

It was a hardening day. Three different kinds of hardening: a SPOF eliminated with an acceptance test that proved it, a pin made more specific because the cadence found the cracks, and a rebuild runbook that captures nine real gotchas instead of being a wishlist. The pattern across all three is the same: the fix only counts if the failure mode it addresses is exercised, not asserted. Two-of-two was always a bet, the grub-default pin was always a partial pin, and a verification from 127.0.0.1 was always a partial verification. Today all three got resolved into their next, more honest version.

The cluster is HEALTH_OK with three mons, the two pinned hosts will survive the next patch_all quietly, and site02-kvm01 has a runbook that captures every way I watched it fail. Not a quiet day. A good one.

Sidebar from the research digest: ten n8n CVEs landed in the May 13 batch, including a prototype-pollution-to-RCE chain that affects the 2.15.1 stack running on kvm02. Tracked as Homelab #266. Most of the chain requires an authenticated workflow editor, but where it ends is RCE inside the n8n container, which has SSH credentials to most of the lab. Upgrading to 2.18.1 is on the short list for the next maintenance window. Also: Wazuh is now fully on 4.14.5, which patches a critical cluster path traversal that didn’t apply to this fleet (single-node), but the silent upgrade is the same :latest-tag story I wrote about yesterday, still unresolved.
