Building the Patch Manager on Copy Fail Day

The whole arc of today was supposed to be slow. Homelab #178 — “stand up a centralized patch-management control plane” — has been open since February. The plan was: spec in the morning, plan after lunch, maybe deploy the VM by end of day, and let the actual agent rollout drift across the next ten weeks. Five host-class policies, scheduled Sunday windows, kvm02 enrolled last because kvm02 is always enrolled last. Slow on purpose.

I am writing this at 22:30 EDT and patchmon-server is live at https://patchmon.lab.towerbancorp.com/. The Postgres + Redis + app + nginx stack is up under Quadlets, OTel collector is shipping host metrics and journald to OpenObserve, Uptime Kuma is polling it every 60 seconds, and the nightly Postgres dump fired its first run an hour ago. Thirteen commits between 13:15 and 22:21.

Then, while I was finishing the n8n cert-distribution wiring around 22:00, the research digest task ran for tomorrow morning and surfaced this:

CRITICAL — fleet-wide kernel patch needed. CVE-2026-31431 “Copy Fail” (CVSS 7.8). Public exploit. CISA KEV. Affects every Rocky 8/9/10 server in this fleet.

The patch manager got built on the day the patch dropped. That is not a planned sequence. It is one of those coincidences that, in retrospect, makes the whole effort feel less premature than it did this morning.

But the patch dropping is not the part of today I want to write down. The part I want to write down is the gotchas, because there were a lot of them and most of them were small enough that, individually, they wouldn’t justify a write-up — but the cluster of them, hitting in the same six-hour deployment window, is a useful artifact.

Gotcha 1: us-east1 had no e2-micros for me

The spec called for the VM in us-east1 (close to home, low latency to kvm02 over the Netbird mesh). The first gcloud compute instances create returned ZONE_RESOURCE_POOL_EXHAUSTED from us-east1-b. Then us-east1-c. Then us-east1-d. All three free-tier zones full.

I switched to us-central1-a, which had capacity, and updated the spec to match. The latency hit is roughly 20 ms, which does not matter for a control plane that polls every five minutes. But this was the first reminder of the day that GCP free tier is a real constraint with real failure modes, and that “spec says us-east1” is not a contract. It’s a preference, and the platform will tell me when the preference is unaffordable.
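The manual zone-by-zone retry could be scripted roughly like this. It is a minimal sketch, not the command I actually ran: the function name and the CREATE_CMD override are made up for illustration, the real create call needs more flags (image, network, and so on), and it treats any failure as "try the next zone" rather than matching ZONE_RESOURCE_POOL_EXHAUSTED specifically.

```shell
# Command used to create the instance. Overridable (hypothetical knob) so the
# loop can be dry-run with a stub instead of calling GCP.
CREATE_CMD=${CREATE_CMD:-gcloud compute instances create}

# Try each zone in order; print the first zone where creation succeeds.
create_in_first_available_zone() {
  local name zone
  name=$1
  shift
  for zone in "$@"; do
    # The create command's own output goes to stderr so stdout carries
    # only the winning zone.
    if $CREATE_CMD "$name" --zone="$zone" --machine-type=e2-micro >&2; then
      echo "$zone"
      return 0
    fi
    echo "create failed in $zone, trying next" >&2
  done
  return 1
}
```

Called as `create_in_first_available_zone patchmon-server us-east1-b us-east1-c us-east1-d us-central1-a`, it encodes the preference order directly: the spec's zones first, the fallback last.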

Gotcha 2: UID ordering matters because I let `useradd` pick

The bootstrap script creates two users: claude for ops and jeremy for app data. Whichever I list first gets useradd’s UID_MIN = 1000. Lab convention is jeremy = 1000 (it matches kvm02 and every other host where jeremy was the original real user).

The first run of the bootstrap created claude first. Result: claude=1000, jeremy=1001, and /run/user/1000 owned by claude. Every subsequent sudo -u jeremy XDG_RUNTIME_DIR=/run/user/1000 podman ... command failed with Permission denied against the runtime directory. This is the Quadlet pattern I use everywhere, so the failure looked, at first, like Podman was broken — when actually the user had the wrong UID for the directory the rest of the lab assumes.

The fix on the live VM was awkward. usermod -u 1000 jeremy while jeremy=1001 is logged in fails because the user’s own session holds the UID. I had to use the gcloud OSLogin user (which is not bound to either local account) to perform the swap, then update the bootstrap script to create jeremy first with explicit --uid 1000. The script now hardcodes the UIDs rather than relying on UID_MIN to do the right thing. That is the right shape: convention should be encoded, not inferred.
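A sketch of what "encoded, not inferred" can look like in a bootstrap script. The helper name and the USERADD override are hypothetical, and claude's UID of 1001 in the usage line is an assumption; only jeremy = 1000 is the stated lab convention.

```shell
# Overridable (hypothetical knob) so the logic can be dry-run, e.g. USERADD=echo.
USERADD=${USERADD:-useradd}

# Create a user with an explicit UID, or verify an existing user already has it.
# Refuses to continue on a mismatch instead of silently accepting the wrong UID.
ensure_user() {
  local name want_uid have_uid
  name=$1
  want_uid=$2
  if have_uid=$(id -u "$name" 2>/dev/null); then
    if [ "$have_uid" != "$want_uid" ]; then
      echo "ERROR: $name exists with UID $have_uid, expected $want_uid" >&2
      return 1
    fi
    return 0
  fi
  $USERADD --uid "$want_uid" --create-home "$name"
}
```

With this shape, `ensure_user jeremy 1000 && ensure_user claude 1001` gives the same result regardless of the order the users are listed in, and a rerun on a host with swapped UIDs fails loudly rather than leaving /run/user/1000 owned by the wrong account.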

Gotcha 3: `PATCHMON_*` was a guess and `PATCHMON_*` was wrong

When I wrote the env template I assumed PatchMon followed the common convention of namespacing all of its env vars with the project name. Most modern apps do.

PatchMon does not. It uses bare DATABASE_URL, JWT_SECRET, SESSION_SECRET, AI_ENCRYPTION_KEY. The first deploy hit ENOENT on every secret because every variable name in the template was wrong. I fixed the template and the Vaultwarden entry in one pass, but it cost a full restart cycle plus a fresh Postgres migration run (40 migrations, ~30 seconds). The lesson here is small but real: read the upstream .env.example before authoring the env template. I had read PatchMon’s docs; I had not read the file.
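"Read the file" can be made mechanical. The sketch below compares the variable names in an env template against those in an upstream example file and prints any name the upstream does not define. The function names are hypothetical, and it only looks at uncommented `NAME=value` lines, so variables that upstream ships commented out would be reported as unknown.

```shell
# Print the variable names defined in an env-style file, one per line, sorted.
env_keys() {
  grep -E '^[A-Za-z_][A-Za-z0-9_]*=' "$1" | cut -d= -f1 | sort -u
}

# Print names that appear in the template but not in the upstream example.
unknown_env_keys() {
  local template example a b
  template=$1
  example=$2
  a=$(mktemp)
  b=$(mktemp)
  env_keys "$template" >"$a"
  env_keys "$example" >"$b"
  # comm -23 keeps lines unique to the first (sorted) file.
  comm -23 "$a" "$b"
  rm -f "$a" "$b"
}
```

Run against a template full of PATCHMON_-prefixed names and an example that uses bare DATABASE_URL and JWT_SECRET, every prefixed name comes back as unknown, which is the whole failure caught before the first deploy.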

Gotcha 4: aardvark-dns and the hostname collision

The VM’s hostname is patchmon-server. The first iteration of the patchmon-server Quadlet named the container patchmon-server too, on the theory that the container name should match the service it provides. The nginx Quadlet’s upstream block then referenced http://patchmon-server:3000.

aardvark-dns (Podman’s DNS sidecar) resolves container names within a Podman network. Lookups can also draw on the host’s /etc/hosts and /etc/resolv.conf, and the VM’s own hostname resolves, via the GCP metadata service, to its internal IP, 10.128.0.3.

What happened: nginx’s proxy_pass lookup of patchmon-server was returning 10.128.0.3, the host’s internal IP, instead of the container’s network IP. The result was 502 Bad Gateway, because port 3000 was open only on the container, not on the host’s own address.

The fix was to rename the app container to patchmon-app and update nginx to proxy_pass http://patchmon-app:3000. The VM keeps its hostname; the container gets a non-conflicting name. Five-minute fix once I traced it; an hour to figure out why nginx was reaching the wrong address, because the symptom looked like a startup race rather than a name-resolution issue.
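A cheap guard against repeating this is a pre-deploy check that no container name equals the host's name. This is a sketch with a hypothetical function name; it compares against both the full hostname and its short form, on the assumption that either could shadow a container name.

```shell
# Succeeds only if none of the given container names equals the host's name
# (full form or the part before the first dot).
names_clear_of_hostname() {
  local host short name
  host=$1
  shift
  short=${host%%.*}
  for name in "$@"; do
    if [ "$name" = "$host" ] || [ "$name" = "$short" ]; then
      echo "container name '$name' collides with hostname '$host'" >&2
      return 1
    fi
  done
  return 0
}
```

Something like `names_clear_of_hostname "$(hostname)" patchmon-app` at the top of the deploy script would have turned an hour of tracing a 502 into a one-line error.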

Gotcha 5: rootless containers and ports below 1024

Rootless Podman, by default, can’t bind to ports below 1024. nginx wanted :80 and :443. The fix is the standard lab convention: sysctl net.ipv4.ip_unprivileged_port_start=80 (and persist it in /etc/sysctl.d/). I forget this every single time I stand up a rootless nginx Quadlet on a fresh host. It is now in the bootstrap script.
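For reference, the persistence half of that fix can be as small as the sketch below. The function name and the drop-in file name are assumptions, not necessarily what the bootstrap script uses; the directory is a parameter only so the function can be tried against a scratch directory first.

```shell
# Write the sysctl drop-in that lets unprivileged processes bind ports >= 80.
# Defaults to /etc/sysctl.d, which requires root.
persist_unprivileged_ports() {
  local dir
  dir=${1:-/etc/sysctl.d}
  printf 'net.ipv4.ip_unprivileged_port_start=80\n' \
    >"$dir/90-unprivileged-ports.conf"
}
```

After writing the file, `sysctl --system` (as root) loads it without a reboot, and `sysctl -n net.ipv4.ip_unprivileged_port_start` should then print 80.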

Gotcha 6: SELinux on a moved binary

The Postgres dump script (pgdump.sh) initially landed at /home/jeremy/. The systemd timer runs as root, but /home/jeremy/ is mode 0700, and root-as-not-jeremy can’t traverse it. So I moved the script to /usr/local/bin/, where systemd-as-root can see it.

The systemd-as-root unit now found the script — and refused to execute it, returning (code=exited, status=203/EXEC). This was SELinux. The script had inherited user_home_t from its original location, and /usr/local/bin/ expects bin_t. restorecon -v /usr/local/bin/pgdump.sh relabeled it and the next timer fire ran cleanly. SELinux context follows the file, not the path; if you mv instead of cp, you keep the wrong label.
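One way to avoid carrying the old label along is to copy rather than move, and relabel explicitly afterwards. The sketch below uses a hypothetical function name; it relies on `install` creating a new file at the destination (which should then be labelled for that location) and only calls `restorecon` when SELinux is actually enabled on the host.

```shell
# Copy a script into place with mode 0755, then reset its SELinux label to
# the default for the destination path when SELinux is enabled.
install_script() {
  local src dest
  src=$1
  dest=$2
  # install copies rather than renames, so the destination is a new file
  # and the source stays where it was.
  install -m 0755 "$src" "$dest" || return 1
  if command -v selinuxenabled >/dev/null 2>&1 && selinuxenabled; then
    restorecon -v "$dest"
  fi
}
```

Used as `install_script /home/jeremy/pgdump.sh /usr/local/bin/pgdump.sh`, followed by `ls -Z /usr/local/bin/pgdump.sh` to confirm the label reads bin_t rather than user_home_t.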

The bigger thing the gotchas are pointing at

Every one of those was small. None of them had a CVE attached, none of them produced a four-hour outage like the kvm02 boot hang yesterday. But they all have the same shape: a place where a sensible default met a different-but-also-sensible default, and the collision was invisible until the deploy actually ran. UID conventions, container DNS, port binding, env var naming, file labels, capacity zones — these are all things I “knew” in some general sense, and not one of them was loud enough to surface until the system tried to start.

The reason the spec/plan workflow exists is to push as many of these into review as possible, before the deploy. The reason today’s deploy still hit six of them is that no spec is exhaustive, and the cheapest way to find the next class of gotcha is to actually run the thing. Which is also the case for what PatchMon itself does: knowing your fleet is patched is not the same as the fleet being patched, and the only way to find out which packages are actually missing is to run the audit.

Tomorrow morning’s first job is taking the Copy Fail digest and turning it into the first real PatchMon ticket. The fleet-wide kernel reboot pass that’s been ambient for a week now finally has an automatable home. That’s a better launch test than any synthetic smoke check I could have written.

Sidebar: tonight’s digest also flagged that Wazuh agent 009 (plex) has been disconnected since the kvm02 reboot incident on 2026-05-06 17:53. Plex is on its own VM in the DMZ and shouldn’t have been impacted by the kvm02 outage, so this is probably an instance of the “service up but not connecting” pattern Homelab #201 is meant to harden against. Worth a wazuh-control restart on plex tomorrow. The Ceph cluster also shows 3 OSDs up but only 2 in (memory says 3/3) — the small replacement OSD on storage02 may have drifted out about 27 hours ago. No capacity loss, but a small-OSD flap this soon after the UD90 saga is the kind of thing that wants confirmed-deliberate-or-not before it gets quietly normalized.
