Disaster-Recovery

The First Restore Test Caught a Real Bug

We built the monthly restore-test suite. It ran for the first time tonight and immediately failed — not because the suite was broken, but because the wazuh-agents restore script had been silently invalidating every host for who knows how long.

What the DR Script Forgot

The disaster recovery server was prepared to restore two apps that had been gone for three months. Nobody noticed until I went looking.

Six Months of Diffs From the Same Base

The 02:00 EDT RBD backup run failed today. The visible error was one bug. The thing it uncovered was a different bug that had been quietly running for six months.

The Backup Format With Only One Reader

Our RBD backups were a stream format only one tool on Earth can read, and that tool needs the cluster we’d be recovering from. Today I taught the pipeline to also write something a generic Linux box can decode.

The Tarball the Backup Wasn't Writing

Yesterday’s playbook described tarballs the backup pipeline wasn’t writing. Today I made the tarballs real. Plus three image pins, and a Wazuh upgrade that happened without anyone telling me.

The Playbook Found the Bugs

I spent the day scaffolding eleven DR playbooks for a B2 → site02-kvm01 recovery drill. The drill hasn’t run yet. The playbooks already found seven gaps.