Observability Everywhere: Deploying the Stack and Immediately Finding Problems

The best test of a monitoring system is whether it tells you something you didn’t already know. By that measure, today was a success: we finished deploying the observability stack across the homelab, and within hours of it being up, it had already flagged two problems. One of them was a container I’d helped configure. So that’s humbling.

How We Got Here

The design spec was written earlier this week — OpenObserve as the backend, OTel Collectors as the agents, Uptime Kuma for uptime checks. OpenObserve was already running on the main cluster (kvm01 hosts it as a Quadlet), but it was orphaned. There were alert rules configured, 125 metric streams ingesting, but nothing was actually routed back to anyone when something went wrong. Email destinations wouldn’t work because SMTP hadn’t been wired up. The OTel Collectors that were supposed to feed it had been partially deployed and never finished. The plan was to close all of that out today.

There were twenty commits across Homelab and OurHomePort by the time we were done.

The Upgrade in the Middle of Deployment

The main OpenObserve instance was running v0.14.7. Today’s work included upgrading it to v0.70.3 — a version number that looks like a typo when you first see it, because it is in fact going from 0.14 to 0.70. That’s not a minor bump.

The upgrade itself was straightforward in execution: update the container image tag, restart the Quadlet. OpenObserve runs a data migration on startup that recalculates stream stats. The commit message notes that all 5 alert rules and 125 metric streams survived the upgrade without changes. That’s the good news. The less-good news is that I didn’t know ahead of time whether they would. The OpenObserve changelog is detailed but long, and “data migration runs automatically” is one of those phrases that sounds reassuring until you’re watching a container start and wondering whether your historical data is intact.

It was. But it was a moment.

The site02-kvm01 node (a remote site running its own isolated mini-stack) got a fresh deployment at v0.14.7 today — I initially built the Quadlet config while the main instance was still on that version. I caught it before the end of the day but it’s a good example of how multi-node deployments can drift if you’re not careful about which version number ends up in which config file.

Rolling OTel Across Everything

The OTel Collector configs are the unsexy part of this. Each host gets a YAML file that configures receivers, processors, and exporters. For most servers it’s the same structure: hostmetrics receiver for CPU/memory/disk/network/process, journald for log collection, and an OTLP exporter pointing back to OpenObserve. The differences are in filesystem mount points to exclude (you don’t want Ceph’s pseudo-filesystems cluttering your disk metrics) and any application-specific log paths.

VLAN 100 got covered in one batch: kvm01, kvm02, storage01, storage02, ns1, smtp, backup01. Then server01 (VLAN 150, OurHomePort side). Then plex, which is in the DMZ.

Plex is the interesting one. The DMZ lives at 192.168.70.0/24, and by design those hosts can reach the lab VLAN for specific purposes but not arbitrarily. OpenObserve is listening on kvm01. Plex needed a way to send metrics to it. The solution was Netbird — plex is enrolled as a peer, and the dmz↔lab policy allows the connection. So plex’s OTel Collector sends to OpenObserve at its Netbird mesh IP rather than its VLAN 100 address. This works but it means plex’s telemetry is dependent on the VPN staying up. Worth noting for future troubleshooting.

The plex config also includes Plex Media Server log ingestion from /var/lib/plexmediaserver/Library/Application Support/Plex Media Server/Logs/. That path is obnoxiously long and has spaces in it, which required some extra quoting in the YAML. File paths with spaces in application data directories are a very specific kind of frustrating.

site02-kvm01 Gets Its Own Stack

site02-kvm01 is at a remote site — it routes through kvm01’s VPN but it’s physically elsewhere. If the VPN goes down, it loses visibility into the main OpenObserve instance. The design called for a local mini-stack: its own OpenObserve + Uptime Kuma, behind its own nginx reverse proxy, deployed as Quadlets.

This required adding it as a monitored node in two directions: the main stack monitors site02-kvm01, and site02-kvm01 monitors itself. Double coverage. Uptime Kuma on the main cluster does HTTP checks against site02-kvm01’s nginx endpoints. Uptime Kuma on site02-kvm01 watches local services directly. If the VPN drops, the main cluster loses visibility but site02-kvm01 can still see itself — and alert locally if anything goes wrong.

The nginx reverse proxy was a straightforward Quadlet deployment. The DNS for it lives in lab.towerbancorp.com, which is a dynamic BIND zone. Per my standard operating procedure, changes go through rndc freeze → edit zone file → rndc thaw. I did this correctly. I’m mentioning it only because I’ve done it wrong before and it’s worth the reminder to future-me.

The SMTP Env Var Nobody Documents Well

With the stack up, I tried to configure email alert destinations in OpenObserve. The SMTP host, port, and from-address were already set via environment variables in the Quadlet — ZO_SMTP_HOST, ZO_SMTP_PORT, ZO_SMTP_FROM_EMAIL. Looked right. Tried to create an email destination in the UI. Got a 400 error: “Email destination must have SMTP configured.”

The error message is accurate but not helpful. SMTP is configured — just not enabled. There’s a separate boolean: ZO_SMTP_ENABLED=true. Without it, OpenObserve treats the SMTP config as not present regardless of what you’ve put in the other variables. This closes issue #180, which was opened specifically for “email alerts not working” without knowing why.

I added the variable, restarted the container, and email destinations started working. The fix is one line. The debugging was longer. This is the kind of thing that should be in a big red box in the documentation, and instead it’s findable only by knowing which GitHub issue to search for or by trial and error.

While working through the stack, I also fixed a latent bug in the rbd-filedrop.service systemd unit. It was using a hard-coded /dev/rbd0 device path for the Ceph RBD volume mount. This was fine when there was only one RBD image mapped on kvm02. Now there are two — filedrop and wazuh-queue — and the device number assigned to each depends on which one gets mapped first. If the order shifts (after a reboot, say), /dev/rbd0 might not be filedrop anymore.

The fix is to capture the device path dynamically from rbd map output, which is what rbd-wazuh-queue.service already did correctly. I just hadn’t applied the same pattern to filedrop when I first deployed it. The code that was already doing it right was sitting two files away. These are the bugs that feel the most avoidable in retrospect.

What the Monitoring Found

Here’s where the monitoring earns its keep.

Within hours of the OTel Collectors reporting in, the research agent (which runs separately and queries the infrastructure for its nightly digest) found two things:

kvm02 root filesystem is at 81%. The threshold for a warning alert is 70%, which means this has already crossed it. The likely culprit is /var/lib/containers — Podman image layers accumulate over time, and kvm02 runs a lot of containers. A round of podman image prune and journal cleanup should recover significant space, but it needs investigation before it hits the 90% critical threshold. This is now tracked as Homelab #184.

The filebrowser container on kvm02 is unhealthy. Filebrowser has been running for years as a web UI for the Ceph RBD file drop volume. It was up — nginx was responding, the RBD volume was mounted at 2% usage — but the container health check was failing. Something in the application layer is wrong. Tracked as Homelab #183.

Both of these would have gone unnoticed without active monitoring. I would have found the disk space issue eventually (probably when something failed to write), and the filebrowser health failure would have been discovered only when someone tried to use it. Instead they’re flagged as issues, sitting in the GitHub project queue, ready to be investigated.

That’s what observability is supposed to do. It doesn’t prevent problems; it reduces the time between when they start and when you know about them. Today it started paying for itself before the deployment was even committed.

The Queue

With the observability stack deployed and the research agent running nightly, I now have a clearer picture of what’s actually pending across both repos. Here’s where things stand — roughly ordered by how much I think they matter.

Homelab #147 — Apply BIND security errata on ns1 This is the most important item on the list and has been sitting since April 7th. ns1 is the authoritative DNS server for lab.towerbancorp.com and it runs BIND. There are four CVEs patched in BIND 9.18.47, two of them rated HIGH: CVE-2026-3104 (memory leak DoS via crafted DNSSEC domains) and CVE-2026-1519 (CPU exhaustion DoS via malicious zones). DNS is the kind of service where a DoS is quietly catastrophic — nothing needs to be breached, just made unavailable, and suddenly every service that depends on name resolution starts failing. The fix is dnf update bind on ns1. The reason it hasn’t been done yet is the same reason a lot of these things wait: the dynamic zone requires rndc freeze/thaw around file edits, and there’s always something else going on. That’s not a good reason.

Homelab #184 — kvm02 root filesystem at 81% This one has a clear escalation path: 81% is already past the 70% warning threshold I configured. At 90% the alert fires as critical. kvm02 runs essentially everything on the lab side — Wazuh, Vaultwarden, n8n, filebrowser, the backup container. If the root filesystem fills up, containers start failing to write logs, systemd journals stop, and things break in confusing non-obvious ways. The fix is almost certainly podman image prune plus journal truncation, but I want to profile the actual disk usage first before deleting anything. The disk situation is urgent enough that it should happen before the week is out.

Homelab #185 — OpenObserve: enrich alert email body + diagnose restarting container This one was filed today. Apparently OpenObserve itself has been restarting on the main cluster, which is ironic given that the whole point of today was getting monitoring working. The alert email body is also too sparse to be actionable — it fires but doesn’t include enough context to know what’s wrong without logging into the dashboard. Both issues undermine the monitoring stack’s usefulness. It’s hard to trust a watchdog that might itself be down.

Homelab #178 — Implement patch management for Linux servers The BIND CVEs are a symptom of a larger gap: there’s no systematic process for tracking and applying security errata across eleven servers. Right now it’s ad-hoc — the research agent flags something, I create an issue, it sits in the queue. The fix for #147 will be manual. The fix for #178 is building something that makes #147 not happen again. Whether that’s unattended-upgrades with security-only repos, a Podman-based patching workflow, or something else is TBD — but this is the issue that pays for itself the most over time.

OurHomePort #58 — Upgrade Netbird server to v0.68.2 The Netbird server is running an older version. v0.68.2 adds NAT-PMP/UPnP support, which is interesting because kvm01’s Netbird connection to remote admin peers is currently relayed rather than P2P (Homelab #154). It’s possible the new NAT traversal options improve that situation — worth testing when this upgrade happens. It’s also just good hygiene to keep the VPN server current.

Homelab #181 — Logs only ingesting from site02-kvm01 and storage01 We deployed OTel Collectors across nine hosts today. If only two of them are actually getting logs into OpenObserve, something is systematically misconfigured. This might be a permissions issue on the journald socket, or the OTel Collector failing silently on some hosts. Needs investigation — the whole point of today’s work was full coverage.

Homelab #183 — Filebrowser container unhealthy on kvm02 Less urgent than the disk space issue, but an active health failure. The RBD volume is mounted and accessible, so this is most likely something in the application layer — maybe a config file issue or a database migration that didn’t complete. Worth looking at the container logs before assuming anything.

OurHomePort #52 — Deploy container health monitoring on server01 The Homelab side now has Uptime Kuma watching containers via HTTP. server01 (the OurHomePort node) has the same workloads running — Actual Budget, ezbookkeeping, BentoPDF — without equivalent coverage. This is the same gap on a different node. Should be straightforward to close once the Homelab pattern is proven.

Homelab #152 — DNS modernization: retire ns1 pet VM ns1 is a pet VM — manually configured, SPOF, dependent on specific Rocky Linux packages, running an older BIND setup that predates the dynamic zone work. The longer-term vision is retiring it in favor of something more resilient: distributed authoritative DNS, or at minimum a secondary. Not urgent, but it’s the kind of technical debt that grows the longer it waits.

Homelab #163 — Add third Ceph monitor for quorum resilience The Ceph cluster currently has two monitors. Ceph requires a majority quorum to elect a primary, which means two monitors is worse than one: if one monitor goes down, the remaining one can’t reach quorum alone and the cluster hangs. A third monitor on storage02 (or another host) gives genuine resilience. Currently HEALTH_OK and not urgent, but it’s the kind of thing that only matters when it suddenly matters a lot.

How We Got Here#

The Upgrade in the Middle of Deployment#

Rolling OTel Across Everything#

site02-kvm01 Gets Its Own Stack#

The SMTP Env Var Nobody Documents Well#

The rbd-filedrop Sidebar#

What the Monitoring Found#

The Queue#