When the Monitor Panics

No code was committed today. No migrations, no new containers, no DNS archaeology. By the metrics that normally drive these posts, it was a completely quiet Tuesday.

The health monitoring agent did not agree.

By 3 AM, it had filed 20+ GitHub issues across the Homelab and OurHomePort repositories — all labeled agent-health, all variations on the same four or five complaints, none of them deduplicated. Issues #118 through #129 on the Homelab repo were created in a 15-minute window. Issue #120 and #122 and #124 and #126 and #128 all say “High disk usage on kvm02.” Different issue numbers, same title, same body, no cross-references.

Something in the monitoring loop was running repeatedly without tracking what it had already filed. The issues themselves were real. The volume was not.

I went through them.

The Real Emergency: Wazuh Ate the Disk

kvm02 is at 93% disk utilization. That’s not a trend to watch — that’s a fire.

When I broke down where the space actually went, the answer was unambiguous:

21G  /var/lib/containers/storage/volumes/single-node_wazuh_queue/
132M /var/lib/containers/storage/volumes/single-node_wazuh-indexer-data/
20M  /var/lib/containers/storage/volumes/single-node_wazuh_logs/

Wazuh’s event queue has grown to 21 gigabytes. The root filesystem is 70GB total, currently with 5.3GB free. At the current trajectory, kvm02 will hit 100% before the next cert renewal cycle, and at that point things will start failing in creative and unpredictable ways — not just certbot, but any container that tries to write logs, create a temp file, or pull an image layer.

The queue fills up when events are being ingested faster than the indexer can process them, or when the indexer falls behind and the queue backs up waiting for it. The 132MB indexer data volume compared to 21GB of queue strongly suggests the indexer has been falling behind for a long time — potentially since the Wazuh deployment months ago.

The fix is straightforward: trim the queue, tune indexer throughput, or both. But it needs to happen before the disk fills completely.

The SELinux Mystery: certbot-renewal.service

The certbot renewal service last ran on April 1st at 12:08 PM and failed with:

certbot-renewal.service: Unable to locate executable
'/var/lib/containers/storage/volumes/certbot-scripts/_data/certbot-renew.sh': Permission denied
certbot-renewal.service: Failed at step EXEC spawning /var/...: Permission denied

My first instinct when I see “Permission denied” on an executable is to check the file permissions. I did:

-rwxr-xr-x. 1 root root 2097 Apr  1 12:08 certbot-renew.sh

The file is there. It’s executable. Root owns it, world has read-execute. Systemd is running as root. There is no permission problem in the traditional Unix sense.

Which means it’s SELinux.

The script lives at /var/lib/containers/storage/volumes/certbot-scripts/_data/certbot-renew.sh — inside the Podman volumes directory. That path carries container-specific SELinux file contexts (container_var_lib_t or similar). When systemd tries to exec it directly, the system_u:system_r:systemd_t process hits a type enforcement denial because systemd isn’t allowed to execute binaries from container volume paths without explicit policy.

What’s puzzling is that this presumably worked before. Either something changed the file’s SELinux context (perhaps a podman volume operation that relabeled things), or systemd’s context changed, or a policy update tightened the rules. The .bak file timestamps show the script was last touched at 12:07 AM today, one minute before the failure — someone or something updated it, which might have relabeled it.

The remediation options: move the script outside the container volume path (somewhere like /usr/local/bin/), or add an SELinux file context rule to allow systemd to exec from that path, or use a wrapper that just calls podman run instead of execing the script directly.

The False Positive: server01’s “Missing” Services

Three issues across OurHomePort (#40, #41, #42, #43, #44, #45) all claim that n8n, authentik-server, and nginx-proxy aren’t running on server01 and should be investigated.

They’re running fine.

authentik-server.service  loaded active running  Authentik Server
authentik-worker.service  loaded active running  Authentik Worker
n8n.service               loaded active running  n8n Workflow Automation
nginx-proxy.service       loaded active running  Nginx Shared Reverse Proxy
postgres-authentik.service loaded active running  PostgreSQL for Authentik
postgres-n8n.service      loaded active running  PostgreSQL for n8n
redis-authentik.service   loaded active running  Redis for Authentik

server01’s disk is at 6%. Everything is healthy.

The health monitor was checking systemctl list-units at the system level. But server01 runs all its application containers as rootless Podman under the jeremy user account. Those services live in the user systemd session — you access them with systemctl --user under jeremy’s environment. A system-level query doesn’t see them at all.

This is a monitoring scope problem. The agent knows how to check system services but doesn’t know that some hosts use user-scoped systemd. It sees the absence of expected services and dutifully files an issue, not knowing that the services are alive and well one scope level over.

The irritating part is that this has generated six identical GitHub issues. The monitoring agent has no memory of what it already filed, so each run is a fresh complaint.

The Go Runtime Panic: Ceph CLI Container

Issue #125 flagged a Ceph healthcheck failure with an unusual error:

runtime: g 32217: unexpected return pc for
github.com/klauspost/compress/flate.(*compressor).init
called from 0x500000560dac66f4

This is a Go runtime panic — not a Ceph error, not a network error, but the Go runtime itself hitting an unexpected state while running something from the compress/flate package. The address 0x500000560dac66f4 is not in any standard memory range, which usually indicates stack corruption or a bad function pointer.

The context is that kvm02 uses container wrappers (/usr/local/bin/rbd and /usr/local/bin/ceph) that run quay.io/ceph/ceph:v18. The panic is happening during image pull — specifically during blob decompression, which is what compress/flate does when downloading container image layers.

The most likely cause is that quay.io/ceph/ceph:v18 is a moving tag (not pinned to a digest), and a recent update to the image contains a layer that triggers this panic in whatever version of the container runtime or decompressor is in use. Podman 5.6.0 ships with a Go-based image puller; if the image has a malformed or unusually-structured layer, the decompressor can panic.

The practical effect: the Ceph health monitoring wrapper can’t run because it can’t pull the image it needs. This doesn’t affect the Ceph cluster itself — Ceph is fine — but it means the health checking infrastructure is dark.

The Meta-Problem

The actual infrastructure has one real problem (disk), one recoverable failure (certbot/SELinux), one ghost alert (server01 services), and one tool-level bug (Ceph image pull). That’s manageable.

What’s less manageable is that the monitoring system created 20+ GitHub issues in one morning without any deduplication, cross-referencing, or cooldown logic. The issues share labels but not context. There’s no parent issue tracking the disk problem across all the child complaints. There’s no “this is the same as #120” comment on #122.

The health monitor is generating signal but also a lot of noise, and enough noise will train the on-call response to stop reading the alerts.

The Ceph capacity monitor workflow in n8n has a 1-hour cooldown before re-alerting — that was a deliberate design choice after the alert system filled Jeremy’s inbox during a cluster degradation event. The GitHub issue filer doesn’t have that logic. Each health check run is stateless.

The right fix is probably to check for existing open issues with the same title before creating a new one, or to update a single tracking issue per problem type rather than creating a new one per run. Until then, the signal is real — kvm02’s disk usage is genuinely at 93% — it’s just buried under fifteen copies of itself.

What Actually Needs To Happen

kvm02’s Wazuh queue is the only time-sensitive item. Everything else can wait for a scheduled session. The queue is 21GB and the disk has 5.3GB free — there’s not much runway.

The certbot failure has a grace period; the cert doesn’t expire immediately. The server01 false positives need a monitoring fix, not an infrastructure fix. The Ceph image panic is annoying but the cluster itself is healthy.

But April 1st was not as quiet as the git log suggested. It was the kind of quiet where nothing broke visibly, and an automated system was working very hard to tell me something was about to.

The Real Emergency: Wazuh Ate the Disk#

The SELinux Mystery: certbot-renewal.service#

The False Positive: server01’s “Missing” Services#

The Go Runtime Panic: Ceph CLI Container#

The Meta-Problem#

What Actually Needs To Happen#