The patchmon server stack — Postgres, Redis, the Go backend, the Caddy-style nginx fronted by Let’s Encrypt — has been running on its little e2-micro in us-central1-a since the night of the 7th. Yesterday’s blog was a six-hour autopsy of an n8n parser fight while wiring patchmon-server into the cert-distribution workflow. Today was supposed to be the easy follow-up: stamp the milestone in CLAUDE.md, tighten anything that was visibly held together with masking tape, and start Task 16 — the Phase 1 agent canary on backup01.
I got two of those done. The third is still queued for tomorrow because of what fell out of “tighten anything visibly held together with masking tape.”
The probe that always passed
The Quadlet for patchmon-server.container has a HealthCmd line. I wrote it on the 7th and never thought about it again, because the container has been showing healthy ever since:
HealthCmd=curl -fsS http://localhost:3000/api/v1/health
This is the shape of a thousand healthchecks. Hit a JSON endpoint, expect a 2xx, fail otherwise. curl -f flips a non-zero exit on 4xx/5xx. Done. Move on.
I went looking at it today only because I was about to start the agent canary, and Phase 1 is the first time another machine is going to depend on this server to be honestly up. If I’m going to install an agent on backup01 and tell it “report your patch state to this URL every fifteen minutes,” I want the upstream healthcheck to be telling me the truth.
So I curl’d the URL the probe was hitting. And I got back the React app.
$ curl -i http://localhost:3000/api/v1/health
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
<!doctype html>
<html lang="en">
<head>
<meta charset="UTF-8" />
...
That is a 200 OK with the body of index.html. That is the SPA catch-all route. The Go backend doesn’t have an /api/v1/health endpoint at all — it never has — and so the Vite-built frontend, which is configured (correctly, for a single-page app) to serve the index for any unmatched path so the React Router can decide what to do, was happily handing back HTML to my probe. The probe was checking for a 2xx. It was getting a 2xx. It was passing.
It would have kept passing if the Go backend had crashed and the embedded static handler had been the only thing still running. It would have kept passing even if the API had wanted to report {"status": "everything is on fire"}, because that JSON would never have been generated in the first place: the SPA catch-all would have answered the request long before the API got a look at it.
This is the kind of bug that wouldn’t have hurt me until the day the API actually broke. And then it would have hurt me a lot, because the orchestration above the container — Restart=on-failure, the systemd readiness gate, my own dashboards — all key off “is this container healthy?” If the answer is permanently “yes,” the rest of the system never gets a chance to react.
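One way to find out whether a status-only probe carries any signal is to ask the service for a path that cannot exist and see what comes back. What follows is a minimal sketch, not something from the deployment: the function names and the made-up path are hypothetical, and it assumes curl is available on whichever machine runs the audit.

```shell
# Print only the HTTP status code for a URL (curl prints 000 if the
# connection itself fails).
fetch_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1"
}

# Succeeds when a status-only probe could ever fail on this service, i.e.
# a path that cannot exist is NOT answered with a 2xx.
# $1 is the base URL; any remaining arguments name the command used to
# fetch a status code (defaults to fetch_status above).
status_probe_can_fail() {
  base=$1
  shift
  [ "$#" -gt 0 ] || set -- fetch_status
  code=$("$@" "$base/no-such-route-$$")
  case $code in
    2??) return 1 ;;  # the catch-all answered: status codes carry no signal
    *)   return 0 ;;
  esac
}
```

Something like `status_probe_can_fail http://localhost:3000 || echo "catch-all present: assert on the body"` would flag any service where a 2xx proves nothing.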
The fix that took thirty seconds
PatchMon does have a real health endpoint. It lives at /health, not under /api/v1/, because it’s served by the Go binary’s own router before the static-file handler ever sees the request. The body is plain text: literally the word healthy.
HealthCmd=sh -c 'wget -qO- --timeout=5 http://localhost:3000/health | grep -q healthy'
Two changes that matter. First, the right URL: one the SPA catch-all can’t intercept. Second, the grep -q healthy, so that if /health ever stops including that word in its body (a regression in the Go backend, an upstream proxy injecting itself, anything weird), the probe fails. I’m no longer trusting the status code alone. I’m asserting on the content.
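The content assertion is easy to pull out into a function and try against bodies it should reject. A minimal sketch, with a hypothetical function name: it uses a whole-line match, which is stricter than the substring match in the unit file (a plain `grep -q healthy` would also accept a body of `unhealthy`). That strictness assumes the body really is just the word on a line of its own.

```shell
# Read a response body on stdin; succeed only if some line is exactly
# "healthy". -x requires a whole-line match, -F treats the pattern as a
# fixed string rather than a regular expression.
body_is_healthy() {
  grep -qxF 'healthy'
}
```

In a pipeline such as `wget -qO- --timeout=5 http://localhost:3000/health | body_is_healthy`, the exit status is grep's, so a failed fetch produces an empty body and the check fails without any extra handling.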
I switched from curl to wget for one boring reason: the patchmon image is a slim Alpine base and wget is in there by default, curl isn’t. I could pull curl in, but that’s another thing to maintain. wget -qO- is the same shape, less surface area.
Reload the unit, restart the container, watch it report healthy for fifteen seconds, then for thirty, then for a minute, then I stop watching because that’s the answer. The probe now fails fast if the Go backend regresses, and stays passing only when the binary itself is responding.
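The "watch it report healthy" step can also be handed to a small polling loop. This is a sketch under assumptions: the function name and the POLL_INTERVAL variable are hypothetical, and the command it polls is whatever prints the container's health status on the host.

```shell
# Poll a command until it prints the wanted status or the attempts run out.
# $1: wanted output, $2: max attempts, remaining args: the command to run.
# POLL_INTERVAL (seconds between attempts, default 5) is a hypothetical knob.
wait_for_status() {
  want=$1
  tries=$2
  shift 2
  i=0
  while [ "$i" -lt "$tries" ]; do
    if [ "$("$@" 2>/dev/null)" = "$want" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "${POLL_INTERVAL:-5}"
  done
  return 1
}
```

A call might look like `wait_for_status healthy 12 podman inspect --format '{{.State.Health.Status}}' patchmon-server`. Both the container name and the inspect field are assumptions here: Quadlet picks its own container name unless the unit sets one, and the health field name has differed between Podman releases.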
The same CLAUDE.md commit that records the Layer 1 milestone carries the probe fix. Two lines about a deployment date and an IP, three about a healthcheck — but the healthcheck is the one I’d put in front of someone asking “what changed that mattered?”
The thing this pattern doesn’t cover
A healthcheck on a single endpoint, served by the very binary you’re trying to monitor, will never tell you that the database is down. It’ll tell you the Go process is up and willing to answer. That’s a useful signal — it’s the difference between a crashed container and a running one — but it’s not a synthetic check. It can’t tell you that Postgres has fallen out from under the app, or that Redis has gone unresponsive and the API is silently degraded.
Synthetic checks belong in Uptime Kuma, which is where they already live for everything else in the lab. I’ll add a Kuma monitor against https://patchmon.lab.towerbancorp.com/api/v1/dashboard/stats (a real authenticated API call, with a real read against Postgres) once the canary is in. The container probe is a liveness check. The Kuma probe is a real-functionality check. They are different jobs and they should be different tools.
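The same "assert on something the SPA could not have produced" idea applies to the functional check. The sketch below is not Kuma configuration; it is a shell rendering of what such a monitor would need to assert. The function name is hypothetical, and because the response schema of the stats endpoint is not shown here, it only checks for a 2xx status with a JSON Content-Type.

```shell
# Read `curl -si` output (status line, headers, body) on stdin and succeed
# only for a 2xx response whose Content-Type is JSON. The SPA fallback
# (200 with text/html) fails the second condition.
is_json_2xx() {
  awk '
    { sub(/\r$/, "") }                                  # strip CR from CRLF endings
    NR == 1 { ok_status = ($2 ~ /^2[0-9][0-9]$/); next }
    /^$/ { exit }                                       # blank line ends the headers
    tolower($0) ~ /^content-type:[ \t]*application\/json/ { ok_type = 1 }
    END { exit !(ok_status && ok_type) }
  '
}
```

Usage would be along the lines of `curl -si -H "Authorization: Bearer $PATCHMON_TOKEN" https://patchmon.lab.towerbancorp.com/api/v1/dashboard/stats | is_json_2xx`. The bearer-token header and the PATCHMON_TOKEN variable are assumptions; whatever authentication the API actually expects goes there instead.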
But the rule the day taught me is older than this stack and applies to every service in the lab: a healthcheck that can’t fail isn’t a healthcheck. If your probe asserts only on a status code, and your application has a route that always returns 200 (every SPA does), you’ve built a probe that will pass on a corpse. The fix is to assert on something the corpse couldn’t have produced.
Sidebar: the rest of the lab today
The research digest came in just before I started writing this. Three things from it are worth flagging out loud:
The Plex Wazuh agent has been disconnected for about sixty hours. The pattern is service-up-but-keepalive-stalled, which is exactly what Homelab #201 was filed to detect and auto-recover from, but #201 hasn’t been built yet. The probable trigger is the kvm02 reboot on the 6th: the agent’s last successful keepalive was at 13:53 EDT, which lines up with the manager-side restart window. A manual systemctl restart wazuh-agent on plex will fix it; the more interesting question is whether a manager-side cluster-control nudge could re-establish the slot without touching the agent at all. Tomorrow.
/ on kvm02 sits at 75%. Not an alert, but worth a podman image prune pass before the next Wazuh or n8n major upgrade pulls another two-gigabyte image and tips it over.
And — the prettiest data point of the week — there were zero level-10+ Wazuh alerts in the last 24 hours. Rule 100533 (the override for the OpenObserve dashboard / Uptime Kuma POST-bot false positive) is still doing its job, and nothing else interesting hit the public surface. That is the quietest the lab has been since I started writing these.
Tomorrow: the patchmon agent on backup01, and a Kuma monitor that actually exercises the database.
