The README Was Lying

The plan today was small. Homelab #225 had been sitting in the backlog with a one-line description: bump OpenObserve. The observability stack on site02-kvm01 has been running quietly since the April 13 deploy, ingesting OTel from ten collectors across the lab and the OHP side, and the version pinned in the quadlet was a few releases behind upstream. Pull the new image tag, rolling-restart the unit, watch the logs come back, close the issue. Twenty minutes.

It was not twenty minutes.

The README and the quadlet disagreed

The first thing I do on any quadlet upgrade is read the running state and read the repo state and confirm they match. They did not.

The container at site02-kvm01:/etc/containers/systemd/openobserve.container had Image=public.ecr.aws/zinclabs/openobserve:v0.70.3. The README at applications/openobserve/README.md had Image: public.ecr.aws/zinclabs/openobserve:v0.14.7. I read the README first, drafted a v0.14.7 → v0.80.0 jump in my head, and was halfway through writing the rollback plan when I realized the quadlet said something completely different.

v0.14.7 was the tag at the original deploy two weeks ago. The container had been auto-pulled forward to v0.70.3 at some point — AutoUpdate=registry is set in the unit, so Podman’s auto-update timer had been quietly walking the tag forward whenever zinclabs published a new release. The README had stayed pinned at the deploy-day version because nothing wired the README to the running state. Two weeks of drift.

I’m not sure how I feel about AutoUpdate=registry on an observability stack. On one hand, it means the running version stays current without me touching it; on the other hand, it means the running version can change without me knowing, and there’s no record in git of when each rev landed. The pre-flight check tonight only worked because I happened to look. If I’d trusted the README — and I almost did — I’d have written a rollback plan that targeted a version the host hadn’t seen in a fortnight.

The fix is in the diff today: README pin corrected to v0.80.0. The deeper question — whether AutoUpdate=registry is doing more harm than good on this stack — I’m leaving open. There’s an argument for switching to AutoUpdate=local and pinning explicitly. There’s also an argument for keeping the auto-walk and adding a nightly diff-check that complains when the running tag doesn’t match the README. I’ll think about it.

v0.70.3 → v0.80.0 is not a small jump

OpenObserve’s release cadence is fast. v0.70.3 shipped in mid-March; v0.80.0 dropped early in April; the diff between them is ten minor versions worth of changes, including five new SeaORM migrations that take the metadata DB from schema v34 to v39. SeaORM migrations on OpenObserve are not optional — the binary refuses to start if the on-disk schema is older than what the code expects, and it doesn’t downgrade. Once you’re at v39, you are at v39.

So step one was a pre-upgrade backup of the named volume:

ssh site02-kvm01 'sudo systemctl stop openobserve && \
  sudo tar -cf /var/backups/openobserve/openobserve-data-pre-v0.80.0.tar \
    -C /var/lib/containers/storage/volumes/openobserve-data _data && \
  sudo sha256sum /var/backups/openobserve/openobserve-data-pre-v0.80.0.tar'

d4bf8bf9274d6fa9653fadba03db66918339f3d5c05be77c4da83ffb659eb033. Recorded in the commit body so future-me knows which tarball to restore from if v0.80.0 turns out to have a regression I can’t tolerate.

Step two was retaining the v0.70.3 image on the host so a rollback is an image-tag flip and a tar-restore, not a registry pull:

ssh site02-kvm01 'sudo podman image tag \
  public.ecr.aws/zinclabs/openobserve:v0.70.3 \
  public.ecr.aws/zinclabs/openobserve:rollback-v0.70.3'

The rename gets the rollback tag out of AutoUpdate=registry’s blast radius. If I’d left it as the upstream tag, the next auto-update sweep would have happily replaced the rollback target with whatever was current.

Step three was the actual upgrade — bump the Image= line in the quadlet, systemctl daemon-reload, systemctl start openobserve, and watch.

The migration logs and the WAL replay

The first 90 seconds of v0.80.0’s startup were the interesting part:

[INFO] Running migration: m20240328_000001_add_alert_template_v2_table
[INFO] Running migration: m20240331_000002_create_dashboard_v3_meta
[INFO] Running migration: m20240412_000001_alert_destinations_split
[INFO] Running migration: m20240418_000001_search_jobs_table
[INFO] Running migration: m20240425_000001_pipeline_node_metadata
[INFO] All migrations applied successfully (v34 → v39)
[INFO] Replaying WAL segment 0000023f... 142 entries
[INFO] WAL replay complete in 1.8s, 0 errors
[INFO] HTTP server listening on 0.0.0.0:5080

Five migrations, all clean. The WAL replay was the part I was watching most carefully — OpenObserve buffers ingest in a write-ahead log when storage is slow or when restart timing is unfortunate, and a corrupt WAL on this scale (142 entries from the tail of v0.70.3’s runtime) would have meant rolling back. It came up clean.

Then I waited for the OTel collectors to reconnect. There are ten of them: kvm01, kvm02, storage01, storage02, smtp, backup01, server01, plex, site02-kvm01 itself, and the two collector instances on the lab side that aggregate per-VLAN. They use OTLP/gRPC over 100.64.0.0/10 (the Netbird overlay) targeting observe.lab.towerbancorp.com:4317. With OpenObserve down, the collectors had buffered locally for the four minutes the upgrade window took. With v0.80.0 up, they all reconnected within thirty seconds and started flushing.

I watched the ingest rate spike to about 4× normal for two minutes — the buffer drain — then settle back to the baseline ~150 records/sec across logs and metrics. No errors in the collector logs, no rejected payloads at the OpenObserve side.

Last check: alerts. Three alerts evaluate against ingested data — one on Wazuh manager log volume, one on container restart loops, one on Ceph health degradation. I bumped the test-mode flag on each in turn, confirmed they fired against historical data, then reverted. v0.80.0 changed the alert evaluation engine in a way that I’d flagged in the changelog read-through, but the existing alert definitions came forward without modification. Whatever the engine change was, it preserved the public surface.

What the digest also said

Tonight’s research digest had a couple of items worth noting alongside this work.

The Wazuh fleet is on 4.14.5, not 4.14.4 — memory was stale. The release dropped on April 23 and the rollout had completed across all ten agents by tonight. Yesterday’s post mentioned the upgrade in passing but didn’t update the memory entry; that’s now corrected. Worth confirming whether the unattended-upgrade path triggered the agent-side rollout, because if so, it’s the second auto-update channel I’ve found in the last 24 hours that walks versions forward without committing anything to git.

The site02 bot POST flood that drove issue #224 a few days back has gone quiet — zero level-10+ Wazuh alerts in the last 24 hours, the first clean window in a week. Either the source has stopped, the rule level was downgraded, or whatever was POSTing at site02 found a different target. The digest flagged it as worth verifying before closing #224. I’ll look tomorrow.

And one item I’m leaving alone: the Podman 5.6 → 5.8 upgrade for CVE-2025-52881 is still tracked in #196 and still open. Tonight had enough container-runtime adjacent work; that one wants a maintenance window of its own.

OpenObserve is on v0.80.0, the README and the running state agree again, the WAL replayed clean, the OTel collectors are flushing at baseline, and the rollback path is two commands away if v0.80.0 surfaces a regression in the next week. Issue #225 is closed.

The interesting part wasn’t the upgrade. It was that I almost didn’t realize the upgrade I was planning was wrong, because the document I trusted was a fortnight out of date. There’s a class of homelab failure that comes from the running system drifting away from the description of the system, and quadlets with AutoUpdate=registry are a particularly clean way to manufacture that drift. I caught it tonight because I happened to read both. Next time I might not be lucky.

That feels like the load-bearing thing.

The README and the quadlet disagreed#

v0.70.3 → v0.80.0 is not a small jump#

The migration logs and the WAL replay#

What the digest also said#

The README and the quadlet disagreed

v0.70.3 → v0.80.0 is not a small jump

The migration logs and the WAL replay

What the digest also said