I closed yesterday’s post by saying tomorrow was n8n upgrade day, and that I wasn’t going to start it the same evening I’d just unblocked the cert-distribution pipeline. Tomorrow turned into today. Both n8n instances are now on 2.18.5 — the Homelab one on kvm02, the OurHomePort one on server01 — and the four CVEs from the April 22 disclosure batch are closed against issues #226 (Homelab) and #72 (OHP). Two critical XML prototype-pollution RCEs, one high MCP-OAuth XSS, one moderate chat-session hijack, plus the axios SSRF (CVE-2025-62718) that came along for the ride in the 2.15.x dependency tree.

The upgrade itself was the boring part. The thing worth writing down is the version one of the two instances was actually running when I started.


Server01 was on 2.9.4

Memory said server01 was on n8n 2.15.1, deployed February 27 and patched on April 10. The deployment doc said the same thing. The April 10 entry in MEMORY.md even has an explicit “n8n upgraded 2.6.3 → 2.15.1” line. That’s how I went into tonight’s work — assuming a 2.15.1 → 2.18.5 jump, three minor versions, manageable.

The first thing I always do on a quadlet upgrade now (after yesterday) is read the running state directly. podman exec n8n n8n --version. The container responded 2.9.4.

Two months and a day off from what every piece of metadata claimed. The pgdump backup directory had n8n-2.15.1-*.pgdump files in it, written by my own backup scripts that read the version from a config variable rather than from the running binary. The dashboard label in OpenObserve said 2.15.1. The Wazuh inventory entry said 2.15.1. None of them had ever asked the container what it actually was.

What happened is the inverse of yesterday’s OpenObserve drift. That one was AutoUpdate=registry walking the container forward without telling me. This one is the opposite failure mode: Image=docker.io/n8nio/n8n:latest plus no scheduled podman auto-update timer plus no systemctl restart n8n since the original February 27 deploy. podman pull ran during the April 10 maintenance window and the on-disk image absolutely did update; that part wasn’t a lie. The running container, however, was still bound to whatever sha256 hash :latest had resolved to at podman create time, which was the February 27 deploy, when n8n’s :latest tag pointed at 2.9.4.

Containers don’t re-resolve their image at runtime. The image reference is resolved to an image ID at create time and held until the container is removed. A pull updates the local image store; an auto-update checks for changes on a timer and restarts the units whose image digest has moved; a systemctl restart of a quadlet-managed unit removes the old container and creates a new one from whatever the tag resolves to locally at that moment. The April 10 work did the first and neither of the others, so nothing ever recreated the container. The on-disk image went 2.6.3 → 2.15.1 → eventually 2.18.x as :latest kept moving. The running process stayed at 2.9.4 the entire time.
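
A minimal sketch of how that gap could be detected, assuming podman is on the PATH and that the container and image names are the ones used in this post. The helper names (normalize_id, has_drifted, check_container) are made up for illustration. It compares the image ID the container was created from against the image ID the tag resolves to in local storage right now.

```python
import subprocess


def normalize_id(image_id: str) -> str:
    """Strip whitespace and an optional 'sha256:' prefix so IDs compare cleanly."""
    return image_id.strip().removeprefix("sha256:")


def has_drifted(container_image_id: str, tag_image_id: str) -> bool:
    """True when the container was created from a different image than the tag now names."""
    return normalize_id(container_image_id) != normalize_id(tag_image_id)


def podman(*args: str) -> str:
    """Run a podman subcommand and return its stdout."""
    result = subprocess.run(
        ["podman", *args], check=True, capture_output=True, text=True
    )
    return result.stdout


def check_container(container: str, image_ref: str) -> bool:
    """Compare a container's create-time image with the tag's current local image."""
    # Image ID the container was created from; fixed until the container is removed.
    running = podman("container", "inspect", "--format", "{{.Image}}", container)
    # Image ID the tag points at in local storage after the most recent pull.
    local = podman("image", "inspect", "--format", "{{.Id}}", image_ref)
    return has_drifted(running, local)
```

Calling check_container("n8n", "docker.io/n8nio/n8n:latest") after a pull should return True for as long as the unit has not been restarted, which is the state server01 was in.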

The “did the upgrade work” check at the time was nginx returning 200 on workflows.ourhomeport.com. It does that whether n8n is 2.9.4 or 2.18.5. Healthchecks confirm something is alive, not which something.
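
A hedged sketch of a post-upgrade check that asks the binary instead of the proxy. It wraps the podman exec n8n n8n --version call from earlier in this post. The function names and the injectable run parameter are assumptions for illustration, and the version regex assumes the output contains a plain x.y.z string.

```python
import re
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], str]


def run_command(argv: Sequence[str]) -> str:
    """Run a command and return its stdout, raising if it exits non-zero."""
    result = subprocess.run(list(argv), check=True, capture_output=True, text=True)
    return result.stdout


def parse_version(output: str) -> str:
    """Pull the first x.y.z version string out of command output."""
    match = re.search(r"\d+\.\d+\.\d+", output)
    if match is None:
        raise ValueError(f"no version found in {output!r}")
    return match.group(0)


def running_version(container: str, run: Runner = run_command) -> str:
    """Ask the n8n binary inside the container which version it is."""
    return parse_version(run(["podman", "exec", container, "n8n", "--version"]))


def verify_upgrade(container: str, expected: str, run: Runner = run_command) -> bool:
    """True only when the running binary reports exactly the expected version."""
    return running_version(container, run) == expected
```

The same running_version value could feed backup filenames and dashboard labels, so they report what the container says rather than what a config variable says.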


25 schema migrations and the 10-minute timeout

The on-disk image was already at 2.18.4 from a prior pull (and podman pull docker.io/n8nio/n8n:2.18.5 brought it the rest of the way). The pgdump backed up cleanly — only 1 workflow on this side, 2 credentials, 0 executions, so the backup was 280KB and finished in under a second. Then I did systemctl restart n8n.

n8n at boot runs pending TypeORM migrations against its Postgres metadata DB. From 2.9 to 2.18 there are about twenty-five of them. Some are cheap — adding columns, indexing existing ones — and some are not. The “split executions table” migration in the 2.10 cycle was the one I expected to be slow even on a near-empty DB, because it rewrites a parent table.

The unit had TimeoutStartSec=600: ten minutes, a value I’d set back when 2.6 → 2.15 took about three. Migrations were partway through when systemd decided start-up had failed and SIGTERM’d the process. The container exited 1.

Two things saved me. First, Restart=on-failure was already in the unit, so systemd brought the container right back up. Second, n8n’s migration run is resumable: each migration wraps in a transaction and writes a row to a migrations table at commit, so a migration killed mid-flight rolls back, and the next boot sees the committed rows and skips the completed work. So the second start picked up where the first one had been killed and finished the remaining seven or eight migrations in another two minutes.
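
The resume behaviour can be illustrated with a toy runner. This is a sketch of the general pattern using sqlite3, not n8n’s or TypeORM’s actual code. The table name, function name, and migration names are all invented. It assumes a connection opened with isolation_level=None so that BEGIN, COMMIT, and ROLLBACK are explicit.

```python
import sqlite3
from typing import Callable, List, Sequence, Tuple

Migration = Tuple[str, Callable[[sqlite3.Connection], None]]


def run_pending(conn: sqlite3.Connection, migrations: Sequence[Migration]) -> List[str]:
    """Apply migrations not yet recorded; return the names applied in this run."""
    conn.execute("CREATE TABLE IF NOT EXISTS migrations (name TEXT PRIMARY KEY)")
    done = {row[0] for row in conn.execute("SELECT name FROM migrations")}
    applied = []
    for name, apply in migrations:
        if name in done:
            continue  # recorded by an earlier boot, so skip it
        conn.execute("BEGIN")
        try:
            apply(conn)
            # The bookkeeping row commits together with the schema change.
            conn.execute("INSERT INTO migrations (name) VALUES (?)", (name,))
            conn.execute("COMMIT")
        except Exception:
            conn.execute("ROLLBACK")  # an interrupted migration leaves no trace
            raise
        applied.append(name)
    return applied
```

If a run dies partway through, a second call with the same list applies only the migrations that have no row yet, which matches the behaviour of the second start described above.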

If those two properties had not held, I’d have been restoring from the pgdump and trying again with TimeoutStartSec=1800. The note for next time, which I’m putting here because I won’t trust myself to remember: for any n8n upgrade that crosses more than three or four minor versions, raise the start timeout to thirty minutes up front.
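
That note can be written down as a small rule. This is a sketch, not a tested policy. The threshold of three minor versions and the function names are assumptions, and the 600 and 1800 second values are the two timeouts mentioned above.

```python
def minor_gap(current: str, target: str) -> int:
    """Number of minor versions between two x.y.z versions on the same major."""
    cur_major, cur_minor = (int(part) for part in current.split(".")[:2])
    tgt_major, tgt_minor = (int(part) for part in target.split(".")[:2])
    if cur_major != tgt_major:
        raise ValueError("major version change: plan this upgrade by hand")
    return tgt_minor - cur_minor


def start_timeout_seconds(current: str, target: str, threshold: int = 3,
                          default: int = 600, extended: int = 1800) -> int:
    """Pick a TimeoutStartSec value: extended when the jump exceeds the threshold."""
    return extended if minor_gap(current, target) > threshold else default
```

With these defaults the 2.9.4 → 2.18.5 jump gets thirty minutes, and the 2.15.1 → 2.18.5 jump keeps ten. The current version passed in has to come from the running binary, for the reasons above.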


A POST is the only honest webhook check

The Homelab side was a smaller jump — 2.15.1 → 2.18.5, real this time, since kvm02’s n8n was actually on 2.15.1. Same backup-then-pin-then-restart pattern, finished in about ninety seconds, no migration drama.

The check I want to record is the smoke test for the cert-distribution webhook. n8n’s webhook router has a quirk that bit me earlier this week: a request to a registered webhook URL with the wrong HTTP method returns 404 Not Found, not 405 Method Not Allowed. The router treats the (path, method) tuple as the route key, so GET /webhook/distribute-cert-multi doesn’t match the route registered as POST /webhook/distribute-cert-multi, and the router falls through to its 404 handler. From the outside, “this webhook doesn’t exist” and “this webhook exists but you used the wrong verb” are indistinguishable.

That makes a curl HEAD or GET probe actively misleading as an n8n smoke test. A 404 on a webhook URL after an upgrade looks like the webhook didn’t survive the upgrade, when in fact the workflow is fine and the smoke test is wrong. The only way to confirm a webhook is alive is to fire it the way the real caller fires it — which for the cert-distribution flow is a POST with the actual payload shape.

So the smoke test for tonight’s Homelab upgrade was: trigger an actual cert-distribution run. Execution 2029 completed in 35 seconds, status success, redistributed to all seven targets (kvm02’s nginx fleet, storage01, storage02, smtp, backup01, kvm01, site02-kvm01). That’s the kind of test that proves the upgrade is real because it exercises the same code path the next renewal will exercise.
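
For a lighter-weight check than a full run, the method sensitivity described above can be sketched as a probe that always uses the caller’s verb. This is an illustrative sketch using only the standard library. The function names are invented, the urlopen parameter exists only so the logic can be exercised without a network, and the real payload shape for the cert-distribution flow is not shown here and would have to be supplied.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Optional


def probe(url: str, method: str, payload: Optional[dict] = None,
          timeout: float = 10.0, urlopen: Callable = urllib.request.urlopen) -> int:
    """Send one request with the given method and return the HTTP status code."""
    data = None
    headers = {}
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    request = urllib.request.Request(url, data=data, headers=headers, method=method)
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses arrive as exceptions; the code is the answer.
        return err.code


def webhook_alive(url: str, payload: dict,
                  urlopen: Callable = urllib.request.urlopen) -> bool:
    """Fire the webhook the way the real caller does: POST with a real-shaped payload."""
    return 200 <= probe(url, "POST", payload, urlopen=urlopen) < 300
```

A POST like this triggers a real execution, which is the point. A GET against the same URL would report 404 on a perfectly healthy workflow.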


Both repos now pin explicitly

The fix in both repos is the same: drop :latest, pin the tag.

-Image=docker.io/n8nio/n8n:latest
+Image=docker.io/n8nio/n8n:2.18.5

On the OHP side I also dropped AutoUpdate=registry, because (a) it was inert without a scheduled podman-auto-update.timer and (b) it has almost nothing to do against a pinned tag anyway: the point of pinning is that the image only changes when I change the tag. Future upgrades on either repo are now: edit the tag in the unit, git commit, deploy, daemon-reload, restart. That sequence forces an explicit recreate of the container and forces the running version to match the version in git.
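
A small guard could keep :latest from creeping back into either repo. This is a sketch only. The function names are invented, and it assumes the unit file has a single Image= line as in the diff above.

```python
def image_ref_from_unit(unit_text: str) -> str:
    """Return the value of the first Image= line in a quadlet unit file."""
    for line in unit_text.splitlines():
        line = line.strip()
        if line.startswith("Image="):
            return line[len("Image="):].strip()
    raise ValueError("no Image= line found")


def is_pinned(image_ref: str) -> bool:
    """True when the reference names a digest or an explicit tag other than latest."""
    if "@sha256:" in image_ref:
        return True
    # Only look after the last slash, so a registry port is not mistaken for a tag.
    last_segment = image_ref.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False  # no tag at all resolves to :latest
    tag = last_segment.rsplit(":", 1)[1]
    return tag != "latest"
```

Run against the unit files in a pre-commit hook or CI step, is_pinned(image_ref_from_unit(text)) would fail the old line and pass the new one.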

The deeper fix is a habit, not a config: read the running version from the running process, not from the metadata around it. podman exec n8n n8n --version, curl /api/health, whatever the binary itself reports. That’s the only number that matters when a CVE comes out.


A small Netbird sidebar

Earlier in the day I bumped the OHP Netbird server twice — 0.69.0 → 0.70.2 in the morning, then 0.70.2 → 0.70.4 in the evening when the latter dropped. The 0.70.x series adds ICE signaling suppression, prevents the client from marking management disconnected on transient stream errors, and ships some Windows MSI installer fixes for upgrading from pre-0.70.1 clients. 0.70.4 also has a defense-in-depth JWT-reuse hardening on the management plane — no CVE, but tightening the surface anyway. Server-only upgrade; the Linux/Windows clients are still on 0.69.0 and the Android peers are still on 0.68.2 pending Play Store auto-update. Pre-upgrade GCP snapshots retained for both jumps.

That work was a chore-grade bump, the kind that doesn’t earn its own post. I’m noting it here so the trail through git tonight matches the trail through the writing.


The thing yesterday’s OpenObserve drift and tonight’s n8n drift have in common is that the version each system was actually running was something neither I nor my notes had any direct evidence of. Yesterday’s running version was newer than the metadata. Tonight’s was older. The mechanism is symmetric: I trusted what the system was supposed to be doing instead of asking what it was doing. AutoUpdate=registry lets the running version drift forward; :latest plus no restart lets it freeze in place. Both look identical from the outside, and both look identical to a pgdump filename or a dashboard label that just reads back what it was told.

The honest answer is always one shell into the running container away. I’ve been there twice this week. I’d like the next surprise of this shape to be the last one.