The Ceph RBD backup pipeline runs at 02:00 EDT every morning on backup01. It snapshots six images out of the vms pool, exports each as a diff against the previous snapshot, gzips, ships to Backblaze, then prunes old snapshots. Last night it failed with one image written and five skipped. The visible error was one bug. Investigating it found a second bug — a much older, much quieter one — that had been pretending to work since October.
Both bugs landed in Homelab #240. The post-mortem is the more interesting half.
The visible failure
The orchestrator’s email went out around 02:04 with the kind of subject line you don’t want: RBD Backup: 1 ok, 5 failed. The log told a clean story for image #1 (smtp.lab.towerbancorp.com), then trailed off mid-cleanup with a single transient error:
Removing old snapshot: backup-20260423-020000
ssh: Could not resolve hostname tbc-site01-storage01.lab.towerbancorp.com:
Name or service not known
After which the script exited and never started on tbc-site01-hycu01, tbc-site01-hycu01-data, rocky10-vm01.img, filedrop, or wazuh-queue.
The DNS hiccup itself was not interesting. The lab DNS mirrors restarted earlier in the night and a single SSH lookup happened to land in a sub-second window where the mirror’s UDP listener was up but its container hadn’t reattached its bridge alias yet. By the time the next image’s run would have started, the resolver was fine again. There’s no fleet-wide problem to fix here. The interesting part is that the script gave up on a transient cleanup error.
The script header begins with set -euo pipefail. This is the right default for most of the body — Ceph snapshot creation, diff export, B2 upload, all of those need hard-fail-on-error semantics, because silently continuing past a snapshot creation failure means uploading a stale or empty diff. Cleanup is the one block in the script where that contract is wrong. Cleanup is a maintenance step: it removes old snapshots to bound the retention window. If it fails, the worst case is one extra snapshot lying around for 24 hours until tomorrow’s run gets another chance. That is not a reason to abort the rest of the loop and leave five images unbacked-up.
The fix is two lines:
set +e
echo "Cleaning up old snapshots..."
# ... cleanup loop ...
set -e
That’s the part anyone reading the diff will spot first. It is also not the bug worth writing about.
The bug it surfaced
While I was looking at the cleanup block — the one that just had set +e wrapped around it — I noticed the snapshot list it was iterating over, and stopped.
OLD_SNAPS=$(... "rbd --id $CEPH_USER snap ls ${CEPH_POOL}/${image}" \
| grep "backup-" | tail -n +5 | awk '{print $2}')
rbd snap ls outputs snapshots sorted oldest-first by SNAPID. The intent of the comment immediately above this block was “keep the newest 4 backup snapshots, delete everything older.” tail -n +5 selects from line 5 to the end. With oldest-first input, lines 5 and beyond are the newest snapshots, not the oldest. The cleanup was deleting the newest. It was keeping the four oldest snapshots in the pool.
I went and looked at what was actually in the cluster. On smtp.lab.towerbancorp.com, the surviving four backup-* snapshots in vms/smtp were dated October 23, October 24, October 25, and October 26 of last year. On the other five images, the four oldest were all from April 23, 2026 (the day after the previous deploy). And there was one fresh snapshot per image from this morning — except for smtp, where this morning’s snap had also been the one image where cleanup ran successfully and dutifully deleted the newest snap instead of the oldest.
This bug had been live since October 23, the day someone first ran the script with cleanup enabled. Everything since then has been a slow-motion artefact of one inverted command flag.
It is very worth tracing what this meant in practice, because the failure mode is non-obvious.
What “every diff was a full diff” actually means
Earlier in the same loop, the script picks a “previous snapshot” for rbd export-diff --from-snap:
PREVIOUS_SNAP=$(... "rbd --id $CEPH_USER snap ls ${CEPH_POOL}/${image}" \
| grep "backup-" | tail -2 | head -1 | awk '{print $2}')
tail -2 | head -1 picks the second-to-last line. With oldest-first sorting and only the four ancient snaps surviving, plus today’s freshly-created one, the second-to-last was one of the ancient ones — backup-20251025-020000, say. The script then exported a diff covering every change to that image from October 25 onward. Six months of writes. Every day. Compressed and shipped to B2 every morning, and labelled incremental.
The B2 storage bill says hi. The bandwidth is the smaller cost; the more interesting one is what this did to the restore semantics. An incremental backup chain is supposed to be: full snapshot at T0, diff T0→T1, diff T1→T2, diff T2→T3 … and you replay them in order. What we actually had was: full snapshot at T0 (October), diff T0→T1, diff T0→T2, diff T0→T3, … each diff a parallel sibling of the others, all rooted at the same six-month-old base. To restore, you’d take the most recent diff in B2 and apply it to the October base — which is fine, the latest day’s data still lands cleanly — but the entire intermediate chain is essentially unused. The “incremental” framing in the metadata file was technically true (each was a diff, not a full export) and operationally a lie (none of them was a true incremental over the previous one).
If the failure on October 23 had been a more visible one — say, the diff exports were 200x larger than the day before, or the metadata file showed “from-snap: backup-20251025-020000” against a “from-snap: backup-20260427-020000” expectation — I’d have seen it. The thing that hid it was that all the daily artifacts looked normal in isolation, and nothing in the pipeline ever compared two consecutive days against each other. The retention output was reasonable; the diff exports were valid; B2 was happy. The aggregate behaviour was wrong, but no single observation surfaced the aggregate.
The fix was a one-line swap: tail -n +5 → head -n -4. (Both are GNU coreutils idioms for “everything except the last N.” head -n -4 reads more directly given the sort order; tail -n +5 reads correctly only if you remember that “+5” is “from line 5 onward,” which is exactly the brittleness that produced this bug.) The first run after the fix is going to produce one large catch-up diff that covers the previous two months of real changes, then settle into actual day-over-day diffs. The retention will start working tomorrow, and inside four daily runs the surviving snapshot set will be (today, T-1, T-2, T-3) the way the comment always claimed it was.
The B2 cost-recovery is the other half — once the retention actually evicts the October base, the restore-from-B2 procedure has to use the most recent full export instead. That part of the playbook needs an update tomorrow. (As does yesterday’s post: the parallel raw-export work that landed for smtp on May 3 didn’t depend on the diff chain being correct, so it was unaffected; that part of the DR story still holds.)
What I want to remember
There are two patterns from this one I’m trying to pin down for next time.
Negative-offset arithmetic in shell pipelines is a footgun. tail -n +5 and head -n -4 look almost identical and mean orthogonal things relative to a sorted list. Whichever one you write, the adjacent command’s behaviour is the bug-shaped one. The comment above this block in the source described intent correctly, in plain English, and the code under it did the opposite — for six months. A unit-style assertion (“after cleanup, surviving snapshot timestamps are within the retention window”) would have caught this; a code review of tail -n +5 against a 30-line script with no test harness did not, originally, and would not now.
Hard-fail defaults in maintenance steps eat your lunch. The script’s set -euo pipefail was correct for the export and upload paths and incorrect for the cleanup path. The right discipline is to scope strictness per-block: hard-fail where data is being created, soft-fail where data is being garbage-collected. I had this same shape come up four days ago in the n8n volume tarball post (where a different set -e was the right call); the difference is direction-of-data-flow. Creating: fail loudly. Discarding: fail quietly and continue. I’d like the third instance of this pattern to feel familiar, not novel, so I’m noting it explicitly.
The script change is in main as of this commit. The orchestrator container on backup01 still has the old script bind-mounted from before, so tomorrow’s 02:00 run will be the first one with the corrected logic. I’ll be watching the diff size on the second image come through.
Sidebar: per the digest tonight, every Wazuh agent in the fleet plus the manager is on 4.14.5. This is the third silent version drift I’ve caught in two weeks (n8n on server01, the Vaultwarden/Actual Budget/ezbookkeeping :latest cluster, now this one). The pattern from yesterday’s post still holds: the repo is the source of truth for the value, the running process is the source of truth for the system, and the two have drifted three times for three different reasons. A nightly version-reconciliation task is starting to feel less like a nice-to-have.
