The Alert That Cried Wolf Twice a Day

Yesterday I was writing about three ways the observability stack broke within 24 hours of deployment. One of the recurring patterns in that debugging session was exit code 137 — SIGKILL, the OS losing patience with something that wouldn’t stop gracefully. Today there was another one, from a completely different source, with a cleaner fix.

The certbot systemd unit was restarting nginx-proxy twice a day. Not because certificates were being renewed. Just because it ran.

How Certbot Works in This Setup

The certbot container on server01 runs as a oneshot systemd service, triggered by a timer twice daily at 06:00 and 18:00. The job of each run is to check whether any of the managed certificates are within 30 days of expiration. If they are, renew them. If not, exit cleanly.

Most runs do nothing. Let’s Encrypt certificates last 90 days. Unless a cert is within that 30-day window, certbot exits with code 0 and a message along the lines of “Certificate not yet due for renewal.” This is the expected happy path — certbot runs constantly, but renewals only happen a few times a year per domain.

The systemd unit file had this:

[Service]
Type=oneshot
ExecStartPost=/usr/bin/systemctl --user restart nginx-proxy

Every time certbot ran — renewal or not — the ExecStartPost line executed after it. nginx-proxy restarted. Unconditionally. Twice a day.

What This Looked Like in Practice

The alert system dutifully reported container-restart-detected for nginx-proxy at 06:00 and 18:00, every day. When the observability stack was first deployed, these showed up immediately as a wall of false positives in the alert history. Six identical alerts across three days, all nginx-proxy, all aligned with the certbot schedule.

The noise problem was compounded by how nginx-proxy actually stopped. Same exit-137 pattern from the previous post: nginx uses SIGQUIT for graceful shutdown by default, which means it waits for existing connections to close before exiting. In a reverse proxy with long-lived connections from the web apps it fronts, that wait can exceed systemd’s timeout window. Systemd escalates to SIGKILL. Exit 137. Alert fires.

So each twice-daily certbot run was generating one “container restarted” alert, and the restart itself was often a SIGKILL, which looked alarming in the logs even if nothing was actually wrong.

The monitoring was working correctly. The underlying behavior was the problem.

The Fix

Certbot has a --deploy-hook flag that runs a shell command only when a certificate is actually renewed. If certbot runs and decides nothing needs to be done, the deploy hook doesn’t execute. This is exactly the signal we needed.

The updated container definition:

Exec=renew --deploy-hook "echo CERT_RENEWED=true"

[Service]
Type=oneshot
TimeoutStartSec=300
ExecStartPost=-/bin/sh -c 'if podman logs certbot-renew 2>&1 | grep -q CERT_RENEWED; then systemctl --user restart nginx-proxy; fi'

Two changes. First, the --deploy-hook writes a recognizable string to stdout when renewal happens. Second, ExecStartPost now reads the container’s logs and only restarts nginx-proxy if it finds that string.

The - prefix on ExecStartPost tells systemd to ignore the exit code of that command. This is important because grep -q returns exit code 1 when it finds no match — which is the expected case on every non-renewal run. Without the -, a quiet certbot run would fail the service unit.

The podman logs certbot-renew call reads the container’s log from the just-completed run, which includes the deploy hook output if it fired. This is slightly fragile in the sense that it depends on the container name (certbot-renew) matching what was set in ContainerName=. It does. But it’s worth noting for future readers who might rename things.

Why Unconditional Restart Was There In the First Place

The original setup was doing the right thing conceptually — after certbot renews a cert, nginx needs to reload to pick up the new certificate files. The problem was conflating “certbot ran” with “certbot renewed something.” Those are not the same event, but the original unit treated them as if they were.

The idiomatic certbot way to handle this is exactly the deploy hook pattern. Most certbot documentation shows it as a --deploy-hook "systemctl reload nginx", which is cleaner still since it’s a reload rather than a full restart. In this setup, nginx-proxy is a Podman container managed by systemd, and a userspace systemctl --user restart is simpler than trying to send SIGHUP into a running container — but reload would work too if I wanted to avoid the brief downtime of a full restart. For now, full restart is fine. The services behind nginx-proxy stay up during the few seconds it takes to restart.

The Wazuh Situation

While sorting out the certbot alerts, the research agent flagged a new problem: Wazuh’s wazuh-modulesd daemon crashed on kvm02 and hasn’t recovered. This is a known upstream bug in the 4.13.1/4.14.x line — the daemon exits abnormally and the container itself stays “running” (the other Wazuh processes are fine), but the API returns error 1017 on everything because modulesd handles the query interface.

The consequence is that SCA compliance scores, vulnerability detection, and agent status queries are all offline until it’s restarted. The fix is probably just restarting the container, but there’s a complicating factor.

kvm02 has an XFS corruption issue (Homelab #187) that’s been slowly escalating. It started as occasional “structure needs cleaning” errors in dmesg. It’s now blocking container image pulls — the Podman overlay storage lives on that filesystem, and when the research agent tried to run the Ceph CLI wrapper (which podman run quay.io/ceph/ceph:v19), the pull failed with the XFS error. Ceph cluster status had to be retrieved via kvm01 instead.

There’s a real question about whether the Wazuh modulesd crash is related to the filesystem corruption. The modulesd daemon writes state files to disk. If those writes hit the XFS error and corrupted the state, restarting the container might just crash it again. Or the corruption might be limited to the overlay storage path and the Wazuh volume (a separate Ceph RBD mount) is fine.

Until someone runs xfs_repair on the affected filesystem — which requires unmounting it, which requires stopping all the containers that use it — this is going to keep accumulating. The XFS issue just became the prerequisite for most other kvm02 work.

A Side Note on Certbot Alerts and Monitoring Hygiene

One thing the certbot fix made obvious: the monitoring system had been seeing these twice-daily restart alerts for as long as it had been deployed, and they’d simply become background noise. Not because they were silenced or acknowledged — just because they were there, consistently, alongside legitimate data.

That’s how false positives corrupt monitoring. It’s not that any single false alert causes a problem. It’s that the accumulated noise raises the threshold at which something unusual registers as worth investigating. The certbot alerts weren’t dangerous, but they were training the habit of dismissing “container-restart-detected for nginx-proxy” — which is exactly the alert that would fire if nginx-proxy actually crashed for a real reason.

The fix took about ten minutes to implement. The cost of not fixing it was a slowly degrading signal-to-noise ratio on container restart events. Neither of those numbers is zero, but they’re not close.

The Wazuh modulesd situation is going to need attention tomorrow, probably alongside a plan for the XFS repair. That’s a longer maintenance window than a one-liner certbot fix — scheduling downtime for kvm02, unmounting the overlay filesystem, running xfs_repair, bringing everything back up. Worth doing cleanly rather than rushing.

For today, two less false alerts per day. The quiet kind of progress.

How Certbot Works in This Setup#

What This Looked Like in Practice#

The Fix#

Why Unconditional Restart Was There In the First Place#

The Wazuh Situation#

A Side Note on Certbot Alerts and Monitoring Hygiene#