The Canary Was on :latest

The April 28 cert renewal succeeded. The deploy didn’t. Nobody noticed for 14 days, and the thing that finally noticed was the same Uptime Kuma instance I had to pin today.

That’s the shape of the day.

The certbot hook that ran in a dead container

The Quadlet for certbot-renew is a one-shot — it runs certbot renew, exits, and the Quadlet’s --rm flag cleans up the container. My ExecStartPost hook was supposed to detect a successful renewal by grepping podman logs certbot-renew for the CERT_RENEWED marker, and if it found one, restart nginx-proxy so the new certificate would actually get served.

The bug is obvious in hindsight: by the time ExecStartPost fires, the container is already gone. podman logs certbot-renew fails with a no-such-container error and a non-zero exit, the grep matches nothing (exit 1), and the && short-circuit silently skips the restart. Renewal happens, marker never gets read, nobody’s home.

This had presumably been broken since the hook landed. I just hadn’t noticed because cert renewals are rare and most of the certs in the lab are issued at different times. The April 28 batch was the first one in a while that actually needed a restart to flip nginx-proxy to the new key material. Without that restart, nginx-proxy kept serving the old certificate, so the notAfter visible over TLS stopped matching the renewed cert on disk. Uptime Kuma’s 14-day cert-expiry monitor, which I rolled out months ago and basically forgot about, kept counting down. This morning it hit 14 days and paged.

The fix is plain: don’t ask a dead container for its log lines. Have certbot’s own --deploy-hook touch a flag file inside the bind-mounted /etc/letsencrypt volume. The flag file survives --rm. ExecStartPost then tests for the flag, restarts nginx-proxy if present, and removes the flag on its way out. Six lines of edit, applied to applications/certbot/systemd/certbot-renew.container.
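The flag-file check can be sketched like this. It is written in Python for readability; the real hook is shell inside the unit file. The function name and the injected restart callable are assumptions for illustration, not what the repo contains.

```python
from pathlib import Path
from typing import Callable


def handle_renewal_flag(flag: Path, restart: Callable[[], None]) -> bool:
    """Restart the proxy if the deploy hook left a flag behind.

    Returns True when a restart happened, False when there was nothing to do.
    """
    if not flag.exists():
        # No renewal this cycle. That is the normal case, not an error.
        return False
    # If restart() raises, the flag stays in place so the next run retries.
    restart()
    flag.unlink()
    return True
```

The point of the shape is that "no flag" and "restart failed" are different outcomes. The first returns quietly and the second raises, so a hook built this way cannot report success without having read its input.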

The harder lesson — and the one I’m holding on to — is that this was a hook that executed. It ran every renewal cycle. It exited 0 on the systemd side. It produced nothing in its journalctl entry that read as suspicious. There was no point in this entire chain where I’d have caught the failure if Uptime Kuma hadn’t been watching the outside of the system. The hook’s success signal was meaningless because it never read the input it claimed to read.

I keep writing this same paragraph about different services. “The thing said it worked. The thing wasn’t doing anything.” The certbot bug joins the n8n drift and the :latest drift in the same drawer.

The Traefik timeout that hid peer death

The afternoon’s other thread was Homelab #253 — a Netbird peer that should have flipped to “offline” within seconds of losing its outbound path, but actually stayed “online” in the management dashboard for over four hours.

I’d half-convinced myself this was a Netbird-internal issue. It isn’t. The path from netbird-server (on GCP) back to a peer’s relay session is fronted by Traefik. Netbird’s gRPC management plane is supposed to detect a dead peer via 5-second keepalive pings; if the peer doesn’t ACK, the server tears the session down. What was actually happening: the keepalive pings only traveled between netbird-server and Traefik, both running on the same GCE instance. Traefik was responding to them locally, never noticing that the peer-side TCP connection had gone away. The connection only got reaped when the underlying OS TCP keepalive timeout fired — which is tcp_keepalive_time + tcp_keepalive_intvl * tcp_keepalive_probes on Linux, defaulting to roughly 2 hours 11 minutes, and which had stretched to 4h10m in practice on this particular public-internet path.
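For reference, the 2 hours 11 minutes figure falls straight out of the stock Linux values. The constants below are the kernel defaults documented in tcp(7); a host with tuned sysctls would give a different number, and the function name is mine.

```python
# Linux defaults for TCP keepalive, in seconds (probes is a count).
TCP_KEEPALIVE_TIME = 7200   # idle time before the first probe
TCP_KEEPALIVE_INTVL = 75    # gap between unanswered probes
TCP_KEEPALIVE_PROBES = 9    # unanswered probes before the kernel gives up


def dead_peer_detection_seconds(time_s: int, intvl_s: int, probes: int) -> int:
    """Worst-case time for the kernel to declare an idle connection dead."""
    return time_s + intvl_s * probes


total = dead_peer_detection_seconds(
    TCP_KEEPALIVE_TIME, TCP_KEEPALIVE_INTVL, TCP_KEEPALIVE_PROBES
)
hours, rem = divmod(total, 3600)
minutes, seconds = divmod(rem, 60)
print(f"{total}s = {hours}h {minutes}m {seconds}s")  # 7875s = 2h 11m 15s
```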

The fix is idleTimeout: 300s under transport.respondingTimeouts on Traefik’s websecure entryPoint. On a healthy stream, Netbird’s gRPC keepalives keep bytes flowing, so the timer never fires on legitimate traffic. A dead peer goes 300s with zero bytes, Traefik tears down the front-end socket, the backend sees EOF on its half, the server-side cleanup runs, and the peer flips to offline. I left readTimeout and writeTimeout at 0: long-running gRPC server-streams are an explicit design and you don’t want a timer on the body of those.

The verification was satisfying in a way that the certbot fix wasn’t. I blocked plex’s outbound to 34.59.15.113 with iptables, started a stopwatch, and watched store.db on the management server. At T+312s the peer_status column for the plex peer flipped from 1 to 0. The 12-second slop is the gRPC keepalive interval landing on the wrong side of the 300s boundary, which is fine. The previous run two days ago had taken 4 hours 10 minutes.
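The stopwatch part of that test is easy to script. Below is a minimal sketch of a poll-until-flip timer. The predicate is deliberately left abstract because the exact query against store.db is not shown here, and the function name plus the injectable clock and sleep parameters are my own assumptions.

```python
import time
from typing import Callable, Optional


def seconds_until(
    predicate: Callable[[], bool],
    timeout_s: float,
    poll_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll predicate() until it is true; return elapsed seconds, or None on timeout."""
    start = clock()
    while True:
        elapsed = clock() - start
        if predicate():
            return elapsed
        if elapsed >= timeout_s:
            return None
        sleep(poll_s)
```

In practice the predicate would be something like "the peer's status row now reads 0", with a timeout comfortably above the 300s idle window so a regression shows up as None rather than as a four-hour wait.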

I committed the Traefik example file so the deployed configuration and the repo example finally agree. There’s a separate compose.yml actually running on the host that already had the fix; the repo example existed as a reference, and reference files that disagree with production are how the next person — or the next me — gets the wrong picture.

The pinning continues, including the canary itself

Yesterday I wrote about ADR 0001 — the image-pinning policy that landed in both repos — and the audit that found four units I’d thought were already pinned but weren’t. Today: two more cleanups, both small.

OpenObserve on site02-kvm01 was on v0.80.0. The 0.80.1 → 0.80.2 → 0.80.3 bugfix train had landed over the last week (LLM stream UDS handling, partition perf, streaming-search took-time, traces filter, correlation). No CVEs, no schema migrations. I bumped it. The interesting part was that the Quadlet also still had AutoUpdate=registry set, even though ADR 0001 §2 explicitly says explicit pins should not coexist with auto-update. The line was a leftover from when the unit was first generated, and yesterday’s audit missed it because the unit had a real version on the tag and looked compliant from a distance.

Uptime Kuma — the very thing that had caught the certbot bug at the 14-day mark — was on :2. The container was actually serving 2.2.1 from whenever the unit was first generated. I pinned it to 2.2.1 (no behavior change, just truth-in-advertising), dropped AutoUpdate=registry, and amended the CLAUDE.md line so the next investigator looking for Uptime Kuma in the codebase actually finds it on site02-kvm01 instead of going on the same wild goose chase the #248 author did yesterday.
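Both of these would have been caught by a dumber, more literal audit. Here is a rough sketch of one that reads the Image= and AutoUpdate= keys out of a Quadlet unit. The rule that only a full x.y.z tag or a digest counts as pinned is my simplification, not the wording of ADR 0001, and the image names in the usage lines are placeholders.

```python
import re

# Assumption: "pinned" means a full x.y.z version (optional leading v) or a digest.
FULL_VERSION = re.compile(r"v?\d+\.\d+\.\d+")


def parse_container_section(text: str) -> dict:
    """Collect Key=Value pairs from the [Container] section of a Quadlet unit."""
    section, values = None, {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
        elif section == "Container" and "=" in line:
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()  # last value wins
    return values


def audit(text: str) -> list:
    """Return a list of pinning problems; an empty list means the unit looks clean."""
    values = parse_container_section(text)
    image = values.get("Image", "")
    problems = []
    if "@sha256:" not in image:
        # Take the tag from the last path component so a registry port is not mistaken for one.
        tag = image.rsplit("/", 1)[-1].partition(":")[2]
        if not FULL_VERSION.match(tag):
            problems.append(f"floating tag: {tag or 'none'}")
    if "AutoUpdate" in values:
        problems.append(f"AutoUpdate={values['AutoUpdate']} present")
    return problems


unit = "[Container]\nImage=example.invalid/uptime-kuma:2\nAutoUpdate=registry\n"
print(audit(unit))
```

Checking AutoUpdate independently of the tag is the important design choice: a unit with a real version on the tag still gets flagged if the auto-update line is present, which is the case that looked compliant from a distance.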

The recursion of “I had to pin the canary that caught the bug that produced the pinning ADR” is not lost on me.

The infrastructure underneath

Two smaller items don’t justify their own sections but matter for the running notes.

The Ceph cluster got a runbook for the USB-attached OSD deployment pattern. Cephadm 19.2.3 rejects USB devices in its discovery layer via the id_bus check, so the cluster’s pattern from initial bootstrap is to prepare the LV natively with ceph-volume lvm create, then adopt the resulting daemon with cephadm adopt --style legacy. This is how osd.0 was originally created back in August — I just hadn’t written it down. With yesterday’s RMA replacement (the third UD90 failure) now closed, the runbook gets the next drive failure off the “remember how I did this last time” stack.

And I added a hang-detection sysctl drop-in for site02-kvm01 so the next silent freeze produces a vmcore instead of nothing. The May 11 freeze on that host left zero diagnostic data, which is exactly the failure mode the kernel.softlockup_panic = 1 / kernel.hung_task_panic = 1 tunables are designed to prevent. The kvm02 case from last week produced kdump captures already, so the gain there is smaller, but I’ll roll the same drop-in across the fleet as a separate change. New infrastructure/host-hardening/ top-level dir for the deploy script — kernel and host tuning don’t really belong under any specific application directory.

Sidebar from tonight’s research digest: Authentik 2026.2.3 is supposed to ship this week with four disclosed CVEs against the 2026.2.x line. OurHomePort is on 2026.2.2, so the new ADR is about to get its first real out-of-cycle CVE bump exercise — a couple of days earlier than I’d planned for it. Also worth flagging: Netbird 0.70.5 dropped last week with packet capture in the debug bundle and a foreign-relay fallback dial. Useful, not urgent, queue. And memory says the Wazuh fleet is on 4.14.4, but tonight’s poll shows manager + all 10 agents are actually on 4.14.5 — release dropped April 23 and quietly got picked up via the package manager rebuild path. Notional vs running, again. Going to update memory accordingly.
