The DR drill on April 30 produced a small flurry of “pin the :latest tags” commits in both repos. The motivation was the n8n discovery from a few days earlier — server01 had been quietly running n8n 2.9.4 for months, even though the on-disk image was 2.18.5 and the deploy log from April 10 said we’d “upgraded” it. The pins went in. I closed the issues. I moved on.

Today, while I was looking at something unrelated in applications/filebrowser/systemd/filebrowser.container, I noticed the Image= line still said :latest. I checked OurHomePort. Same story in a couple of files there too. I checked the running containers. They were on whatever digest :latest had pointed at when the units were first created, in some cases nearly a year ago. The April 30 commits had pinned some units and missed others. And, the worse half, I’d never restarted anything, so even the pinned units were still serving images from their original create time, not their pinned-and-deployed time.

The whole point of the April 30 work was to make the running version match the repo. What I’d actually done was make the repo match what I thought was running. The pins were notional. The containers had no opinion either way.

The shape of the bug

A Podman Quadlet with Image=docker.io/some/image:latest resolves the tag against the registry once, not at every restart. With Podman’s default pull policy (missing), the pull only fires when no local image carries that tag, i.e., on first deploy. After that, systemctl restart starts the service again from the image already in local storage. It does not pull. It does not re-resolve. The image ID is frozen on whatever :latest pointed at the day you first deployed.

There are exactly two ways out of this:

  1. AutoUpdate=registry on the unit, plus the podman-auto-update.timer. This nightly-pulls and restarts. It works, but it means you don’t know what version is running on any given morning, and you can’t pin a “known good” digest the moment a CVE drops.
  2. Pin the image yourself, in the unit file, and bump it deliberately.

I’d been doing a confused mix of both. Some units had AutoUpdate=registry and :latest. Others had :latest with no auto-update at all, which is the worst combination — you get whatever was current at first deploy, and you stay there forever, and the repo gives you no signal about what that version actually was.
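To make the four combinations concrete, here is a minimal sketch of a classifier for Quadlet .container files. It is not the audit I actually ran; the function names and the category labels are made up for illustration, and it only looks at the Image= and AutoUpdate= keys in the [Container] section.

```python
def read_container_keys(text):
    """Return the Image= and AutoUpdate= values from the [Container] section."""
    section = None
    found = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blanks and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        if section == "Container" and "=" in line:
            key, value = line.split("=", 1)
            if key.strip() in ("Image", "AutoUpdate"):
                found[key.strip()] = value.strip()
    return found


def is_floating(image):
    """True when the reference has no digest and its tag is missing or 'latest'."""
    if "@" in image:
        return False
    name = image.rsplit("/", 1)[-1]  # ignore any registry host:port prefix
    if ":" not in name:
        return True
    return name.rsplit(":", 1)[1] == "latest"


def classify(text):
    """Label a unit by image pinning and auto-update posture (hypothetical labels)."""
    keys = read_container_keys(text)
    auto = keys.get("AutoUpdate", "") == "registry"
    if is_floating(keys.get("Image", "")):
        return "floating+auto-update" if auto else "floating+frozen"
    return "pinned+auto-update" if auto else "pinned"
```

The "floating+frozen" label is the worst combination described above: :latest with no auto-update.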

The audit today found four more units that needed pinning in the Homelab repo and a couple in OurHomePort. Most were Unbound DNS resolvers (lab-dns, lab-dns-2, site02-dns — all on mvance/unbound:latest, all actually serving 1.22.0 from whenever they were first deployed) and one was filebrowser on filebrowser/filebrowser:latest, actually serving v2.61.1 from March.

The actual pinning, and the one that didn’t

The repo edits were boring: change :latest to the resolved version, drop AutoUpdate=registry. Six lines across four files in Homelab, two lines across two files in OurHomePort. The interesting part was the deploy.
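I made these edits by hand, but the transformation is mechanical enough to sketch. The helper below is hypothetical (pin_unit is not a real tool): it rewrites a :latest Image= line to a given version and drops AutoUpdate=registry, touching only the [Container] section.

```python
def pin_unit(text, version):
    """Return the unit text with Image=...:latest pinned and AutoUpdate=registry removed."""
    out = []
    section = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            section = stripped[1:-1]
        if section == "Container":
            if stripped == "AutoUpdate=registry":
                continue  # drop the auto-update line entirely
            if stripped.startswith("Image=") and stripped.endswith(":latest"):
                # swap the floating tag for the explicit version
                line = stripped[: -len(":latest")] + ":" + version
        out.append(line)
    return "\n".join(out) + "\n"
```

For the filebrowser unit, pin_unit(text, "v2.61.1") would turn filebrowser/filebrowser:latest into filebrowser/filebrowser:v2.61.1. The repo edit alone changes nothing on the host; the running container only picks it up on a deliberate restart.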

Three of the four pinned units were on hosts I could reach. I restarted them in order:

  • lab-dns on kvm02 → DNS query against lab.towerbancorp.com from a third machine resolved cleanly, no perceptible drop.
  • lab-dns-2 on kvm01 → same test, fine.
  • filebrowser on kvm02 → https://files.lab.towerbancorp.com returned 200 and the UI rendered.

The fourth was site02-dns, the Unbound resolver on site02-kvm01 in the satellite location. I went to restart it. The host was unreachable. ping 192.168.200.50 returned Destination Host Unreachable from kvm01’s gateway. The site-to-site link is down. Has been since some time today, near as I can tell from the Wazuh agent record (agent 005 disconnected at 13:56 EDT — that’s the proxy signal, the real link presumably went a few minutes earlier).

I didn’t fix the site-link. That’s a separate problem and I’m out of time on this thread today; I’ll pick it up tomorrow. What I did do is land the pin commit anyway and note in the message that the deploy is pending. The repo is now consistent with itself. The container will catch up the next time the host is reachable, which is exactly the contract the new ADR describes — repo is intent, the live container is state, and the bridge between them is a deliberate restart.

This was uncomfortable to write into a commit message. “Pinned. Not deployed. Will deploy later.” reads like a half-finished change. But the alternative — sit on the pin commit for an unknown amount of time until I can SSH to a host I currently can’t — would re-create exactly the drift I was trying to close. Pin now, deploy when reachable, document the gap in between. That’s the discipline.

The ADR, which is the part I’ll come back to

The 129-line file at docs/adr/0001-image-pinning-policy.md is the reason any of today’s actual work matters. The pin commits will rot. The next service we add will get a :latest again unless someone is paying attention. The ADR is the thing I’ll point at next time.

It defines three precision tiers:

  1. Full version: image:1.22.0. Default for stable services where the upgrade cadence is owned by us.
  2. Minor-tracking: image:1.22. For services where we want patch updates but want to gate minor bumps. Used sparingly; security tools mostly.
  3. Digest pin: image@sha256:abc.... Reserved for the CVE-response moment when we need cryptographic certainty about exactly what’s running.
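The three tiers can be told apart from the image reference alone. This is a rough sketch, not something the ADR ships: the pin_tier name and the returned labels are made up for illustration, and anything that fits none of the tiers (including :latest) is reported as "unpinned".

```python
import re


def pin_tier(image):
    """Map an image reference to 'digest', 'full', 'minor', or 'unpinned'."""
    if "@sha256:" in image:
        return "digest"
    name = image.rsplit("/", 1)[-1]  # ignore any registry host:port prefix
    if ":" not in name:
        return "unpinned"  # no tag at all means an implicit :latest
    tag = name.rsplit(":", 1)[1]
    if re.fullmatch(r"v?\d+\.\d+\.\d+", tag):
        return "full"  # e.g. 1.22.0 or v2.61.1
    if re.fullmatch(r"v?\d+\.\d+", tag):
        return "minor"  # e.g. 1.22
    return "unpinned"
```

The regexes assume dotted numeric tags with an optional leading v, which covers the versions mentioned in this post but not every upstream tagging scheme.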

It defines a bump cadence: quarterly across the stack, plus an out-of-cycle bump on any CVE that affects a tier-1 service. And it defines the per-app bump workflow as the same five steps I just walked through manually: edit the Quadlet, commit with the version in the message, restart on each host, smoke-test the endpoint, update CLAUDE.md if the version is referenced there.

It also names the trade-off honestly. We’re trading “I get patches automatically” for “I know what’s running.” That’s not a free trade — it does mean a CVE can sit unpatched for a quarter if no one’s watching. The mitigation is the existing nightly research digest, which already calls out CVEs against deployed software. The combination — explicit pins + CVE-triggered out-of-cycle bumps — is the same shape every team I’d respect runs. It’s not novel. It’s just discipline I hadn’t written down before.

The matching ADR file in OurHomePort points at the Homelab one as the canonical version. There’s one cross-repo policy. Two repos can’t drift on it.

What I keep promising myself

This is the third silent version drift I’ve written about in two weeks. The first was n8n on server01 (2.9.4 instead of 2.15.1, found while preparing the 2.18.5 upgrade). The second was the broader :latest cluster I noticed during the DR drill. This one is the audit that found I hadn’t actually closed the second one.

Each instance had a slightly different mechanism. The n8n one was :latest plus a long-running container that was never restarted after the host upgrade. The DR drill found a class of containers in the same shape across both repos. Today’s audit found that even the pinning response had been incomplete and undeployed. The pattern has a fractal quality — every layer I peel back has the same problem one level deeper.

The ADR is the thing I’m hoping breaks the cycle, because the ADR is the first artifact in this whole arc that says “here is what good looks like” rather than “here is the bug we just fixed.” A nightly job that diffs podman inspect output against the repo’s intended versions would close it harder; that’s on the list, but it’s a Phase 4 problem, not a tonight problem.
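For a sense of scale, the core of that nightly job could be as small as the sketch below. It is not written or deployed yet. It assumes podman is on PATH, that podman ps --format json returns objects with a Names list and an Image string, and that the intended mapping (container name to the repo's Image= value) is built elsewhere; find_drift and running_images are hypothetical names.

```python
import json
import subprocess


def find_drift(intended, running):
    """Return {name: (wanted_image, actual_image_or_None)} for every mismatch."""
    drift = {}
    for name, want in sorted(intended.items()):
        have = running.get(name)  # None when the container is not running
        if have != want:
            drift[name] = (want, have)
    return drift


def running_images():
    """Map container name to image reference, as reported by podman on this host.

    Assumption: each entry of `podman ps --format json` carries a "Names"
    list and an "Image" string.
    """
    out = subprocess.run(
        ["podman", "ps", "--format", "json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return {c["Names"][0]: c["Image"] for c in json.loads(out)}


# Usage on a host (not executed here):
#   drift = find_drift(intended, running_images())
```

Comparing the image reference the container was started with against the repo's Image= line would flag exactly today's failure: a pin that landed in git while the container still reports :latest.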


Sidebar: Authentik posted a security-announcements thread on May 6 about four CVEs in versions before 2025.12.5 / 2026.2.3. Severity isn’t public yet; both are upcoming releases. The OHP instance is on 2026.2.2, so it’s exposed to whatever the four CVEs are. The fix release is supposed to ship this week. The new ADR’s “CVE-triggered out-of-cycle bump” rule is going to get its first real exercise within seven days, which is faster than I expected. I’ll take that as a working test case.

And the plex Wazuh agent has been disconnected for six days now, since the May 6 manager-side restart. Last successful keepalive was 17:53 EDT, which lines up with the kvm02 reboot window. Manual restart on the agent will fix it; the more interesting question — same one as for the site02 agent a few weeks ago — is whether a manager-side cluster-control nudge could re-establish the slot without touching the agent. That’s tomorrow’s thread, alongside the site-link diagnosis and the new patchmon canary.