The Tenth Deploy Stayed Disabled

The thing I told myself yesterday was that today’s first job was opening the Copy Fail PatchMon ticket. The thing I actually did first was finish the n8n cert-distribution wiring for patchmon-server, because I’d left it 90% done at 22:30 last night and the hanging thread bothered me more than the kernel CVE that has another six days of grace.

The plan was simple. The cert-distribution workflow already runs nightly: it pulls the latest Let’s Encrypt cert chain into vault, then fans out to every TLS-fronted service in the lab — Filebrowser, Authentik on the OHP side, Vaultwarden, n8n itself, the Wazuh dashboard, OpenObserve, Uptime Kuma, RackPeek, plus the kvm02 local-cp step. Add patchmon-server as one more target. Same shape as every other node: an SSH-Execute that copies fullchain.pem into the right directory and reloads the consuming service.

The structural part went fine. The new node is in. The connections are wired. Aggregate Results learned a new input. Format Email Body got ${$json["patchmon-server"]?.status ?? "n/a"}. Write Deployment Log added the row. The SSH credential CertDistPatchmonSshPK is installed — sshPrivateKey type, key-only, because the lab’s known n8n credential gotcha is that mixing sshPassword with a key looks valid in the UI but fails silently at execute time, and that lesson was already in memory from when I tripped over it in March.
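
The fallback in that Format Email Body expression is easy to check outside n8n. This is a minimal sketch in plain JavaScript, assuming the aggregated results are an object keyed by target name with a status field. The statusLine helper, the sample object, and the "ok" value are made up for illustration and are not the workflow's real code.

```javascript
// Hypothetical helper: formats one line of the email body for a target.
// Mirrors the expression shape results[target]?.status ?? "n/a".
function statusLine(results, target) {
  // ?. stops at undefined instead of throwing when the target has no entry;
  // ?? then substitutes "n/a" for a null or undefined status.
  return `${target}: ${results[target]?.status ?? "n/a"}`;
}

// Assumed shape of the aggregated results; the real object may differ.
const sampleResults = {
  vaultwarden: { status: "ok" },
  // no entry at all for patchmon-server
};

console.log(statusLine(sampleResults, "vaultwarden"));     // vaultwarden: ok
console.log(statusLine(sampleResults, "patchmon-server")); // patchmon-server: n/a
```

Under these assumptions, a target that never reported shows up as n/a instead of throwing, so a missing result stays visible in the email rather than breaking it.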

Then I clicked Execute on the new deploy node and got:

ERROR: Invalid expression
Problem in node 'Deploy to patchmon-server'

No line number. No column. No surfaced offending substring. Just “Invalid expression” with a red banner.

The command in the node was exactly what every other deploy node uses, swapped for the patchmon target:

ssh patchmon-server 'sudo cp /tmp/fullchain.pem /home/jeremy/.config/patchmon/ssl/ \
  && sudo XDG_RUNTIME_DIR=/run/user/1000 systemctl --user reload nginx-patchmon.service'

This is the n8n SSH-Execute pattern that runs cleanly on every other target every night.

Hypothesis 1: the dot-directory

My first guess was that .config was tripping the parser. n8n’s expression engine is JavaScript-flavored, and .config looks like a member access. I’d never seen the parser complain about that on other targets — but other targets don’t put cert files inside ~/.config/. Most of them use /etc/letsencrypt/live/ or, for rootless services, a flat directory like /home/jeremy/wazuh-ssl/.

So I renamed the nginx mount from /home/jeremy/.config/patchmon/ssl to /home/jeremy/patchmon-ssl, updated the Quadlet, restarted nginx-patchmon, smoke-tested that the cert chain still served, and ran the node again. Same error. The hypothesis was wrong.

The rename stayed in the repo anyway. The flat-directory convention matches every other rootless service in the lab, and a path that no longer looks like JS member access is one less thing for the next person — me, three months from now — to wonder about. The change is in the wrong commit (it lives under “patchmon: cert-distribution wiring + path-rename for n8n parser”) but the path-rename has its own merit beyond the parser theory that motivated it. This is the kind of edit that lives in the repo even when the bug it was meant to fix wasn’t the bug.

Hypothesis 2: the regex that wasn’t

I went hunting again. Most n8n nodes accept template strings; the SSH-Execute “Command” field is one of them. The parser scans for ${...} interpolation tokens and validates the surrounding text as if it were JavaScript. That mostly works because shell and JS share enough surface — quotes, parens, dollar signs, backslashes.

The collision is here:

sudo XDG_RUNTIME_DIR=/run/user/1000 systemctl --user reload nginx-patchmon.service

Read that as JavaScript and you get:

XDG_RUNTIME_DIR is an identifier
= is assignment
/run/user/1000 systemctl --user reload nginx-patchmon.service is the start of a regex literal, because in JavaScript a / immediately after = can only begin a regex, never division

A strict JavaScript tokenizer closes that regex at the next unescaped /, so the literal is /run/ and user lands in the flags position. Those are not valid regex flags, and nothing after them parses either. Bad flags or half-open regex, the parser bails.
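
The JavaScript reading can be checked directly. The sketch below hands the fragment to a JavaScript engine's parser through new Function, which compiles the text without running it. tryParse is a hypothetical helper, and this only shows what a strict JavaScript parser makes of the string. It does not show what n8n's expression engine actually does with the Command field.

```javascript
// Hypothetical helper: parse (never execute) a string as a JavaScript function body.
function tryParse(source) {
  try {
    new Function(source); // compiles only; the resulting function is never called
    return { ok: true, error: null };
  } catch (err) {
    // Report only the error name; the message text differs between engines.
    return { ok: false, error: err.name };
  }
}

const fragment =
  "XDG_RUNTIME_DIR=/run/user/1000 systemctl --user reload nginx-patchmon.service";

console.log(tryParse(fragment));        // rejected: ok is false, error is "SyntaxError"
console.log(tryParse("total = 6 / 2")); // accepted: here "/" follows an operand, so it is division
```

The first call is rejected at the regex literal and the second parses, which is the whole asymmetry: the same character means two different things depending on what precedes it.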

Other deploys don’t have this shape because their reload commands don’t need an XDG_RUNTIME_DIR=... prefix. They use system-scope services, where systemctl reload <service> works as root over SSH without a runtime-dir hint. patchmon-server runs nginx-patchmon.service as a user-scope unit (Quadlets, rootless), so the runtime-dir prefix is mandatory for systemctl --user to find the right session bus.

The fix is structural: don’t put IDENT=/path/... in n8n’s command field. Anywhere.

The wrapper now lives at /usr/local/bin/patchmon-reload-nginx.sh:

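#!/bin/bash
# Restart the rootless nginx-patchmon unit in jeremy's user session.
# Invoked as root via sudo; XDG_RUNTIME_DIR points systemctl --user at jeremy's runtime directory.
set -e
sudo -u jeremy XDG_RUNTIME_DIR=/run/user/1000 \
  systemctl --user restart nginx-patchmon.service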

A sudoers drop-in lets jeremy invoke it NOPASSWD:

jeremy ALL=(root) NOPASSWD: /usr/local/bin/patchmon-reload-nginx.sh

The n8n command becomes:

ssh patchmon-server 'sudo cp /tmp/fullchain.pem /home/jeremy/patchmon-ssl/ \
  && sudo /usr/local/bin/patchmon-reload-nginx.sh'

No = followed by /. The wrapper does the job the inline version was meant to do (restarting the unit rather than reloading it), just one indirection deeper, with the offending substring locked inside a script the parser never sees.

What I expected to happen, and what happened

I expected the wrapper to clear the error.

It did not.

The node still fails with “Invalid expression”. Same banner, same lack of detail. Either there’s a second IDENT=/path/... shape I missed somewhere in the workflow’s expression context, or my mental model of what the parser is even scanning is wrong, or the credential reference itself is being template-parsed and choking on something inside the credential payload. n8n’s debugger surfaces the error but not the offending fragment, and the JSON I exported with n8n export:workflow doesn’t make the parse path obvious from inspection.
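
One way to test the "second shape I missed" theory is to walk the exported workflow JSON and flag every string that contains an identifier followed by =/. This is a sketch under assumptions: findEnvPathAssignments and the sample object are hypothetical, a real export has many more fields, and the pattern is deliberately broad, so it will also flag harmless strings such as URLs with path-valued query parameters.

```javascript
// Matches an identifier immediately followed by "=/", e.g. XDG_RUNTIME_DIR=/run/...
// No "g" flag, so .test() carries no state between calls.
const ENV_PATH_ASSIGN = /[A-Za-z_][A-Za-z0-9_]*=\//;

// Hypothetical helper: recursively collect every string value that matches,
// along with a JSON-path-like location so the hit can be found in the export.
function findEnvPathAssignments(value, path = "$", hits = []) {
  if (typeof value === "string") {
    if (ENV_PATH_ASSIGN.test(value)) hits.push({ path, value });
  } else if (Array.isArray(value)) {
    value.forEach((item, i) => findEnvPathAssignments(item, `${path}[${i}]`, hits));
  } else if (value !== null && typeof value === "object") {
    for (const [key, child] of Object.entries(value)) {
      findEnvPathAssignments(child, `${path}.${key}`, hits);
    }
  }
  return hits;
}

// Assumed, heavily trimmed stand-in for an exported workflow.
const sampleWorkflow = {
  nodes: [
    { name: "Deploy to patchmon-server",
      parameters: { command: "sudo /usr/local/bin/patchmon-reload-nginx.sh" } },
    { name: "Example node",
      parameters: { command: "sudo XDG_RUNTIME_DIR=/run/user/1000 systemctl --user reload example.service" } },
  ],
};

for (const hit of findEnvPathAssignments(sampleWorkflow)) {
  console.log(hit.path, "=>", hit.value);
}
```

Against a real export, the same function would be called on the result of JSON.parse over the file contents. An empty result would rule out the first theory and leave the other two.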

I have the two real hypothesis-driven changes in the repo and I do not have a working deploy.

Why disable instead of revert

The instinct, when a feature you just shipped doesn’t work, is to revert. Take the new node out, take the connections out, restore the workflow to the “this was passing yesterday” state. That keeps Aggregate Results clean and means there’s never a deploy that quietly skipped its target.

I didn’t revert because the structure is right. The node belongs there. The new Aggregate Results input belongs there. The label and the log row belong there. What’s wrong is one specific execution path, and disabling that path keeps the wrong-but-fixable thing visible in the workflow rather than hiding it in a git stash. The other nine deploys still run cleanly tonight. Reviewers — me, tomorrow morning — see “node disabled, see #178” and know exactly what’s outstanding. A reverted workflow is one that has to be re-derived from a commit message later.

There’s also a small philosophical thing at stake. Yesterday’s post had six gotchas, all of which had the same shape: a place where a sensible default met a different-but-also-sensible default, and the collision was invisible until the deploy actually ran. Today’s parser error is the same shape one level up — n8n’s “treat the command field as a JS template” decision met shell’s “use IDENT=value cmd for one-shot env injection” convention, and the collision was invisible until the node actually tried to parse. Both are reasonable defaults. Neither is documented as colliding with the other.

The shape of disabled: true plus a tracked issue is, I think, the right artifact for that kind of collision. The system wants to be in one of two states: working, or visibly broken with a pointer to where the work is. The thing to avoid is “looks fine, isn’t” — the silent-skip, the quiet partial success. A workflow that reports “9 of 10 deploys succeeded” with the tenth conspicuously off is honest. A workflow that quietly skips a target on every nightly run is the kind of thing that takes nine months to notice.
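
As a sketch of what an honest summary could look like, the function below counts a disabled target separately and treats an enabled target with no result as a failure rather than a skip. The summarize function, the shape of the target list, the issue field, and the "ok" status value are all assumptions for illustration; the real Aggregate Results node may be structured differently.

```javascript
// Hypothetical summary builder for a nightly run.
// targets: [{ name, disabled?, issue? }]; results: { [name]: { status } }
function summarize(targets, results) {
  const disabled = targets.filter((t) => t.disabled);
  const enabled = targets.filter((t) => !t.disabled);
  const succeeded = enabled.filter((t) => results[t.name]?.status === "ok");
  // An enabled target with a missing or non-ok result is a failure, never a silent skip.
  const failed = enabled.filter((t) => results[t.name]?.status !== "ok");

  const parts = [`${succeeded.length} of ${targets.length} deploys succeeded`];
  if (failed.length > 0) {
    parts.push(`failed: ${failed.map((t) => t.name).join(", ")}`);
  }
  if (disabled.length > 0) {
    const labels = disabled.map((t) => (t.issue ? `${t.name} (see ${t.issue})` : t.name));
    parts.push(`disabled: ${labels.join(", ")}`);
  }
  return parts.join("; ");
}

const sampleTargets = [
  { name: "vaultwarden" },
  { name: "uptime-kuma" },
  { name: "patchmon-server", disabled: true, issue: "#178" },
];
const sampleRun = {
  vaultwarden: { status: "ok" },
  "uptime-kuma": { status: "ok" },
};

console.log(summarize(sampleTargets, sampleRun));
// 2 of 3 deploys succeeded; disabled: patchmon-server (see #178)
```

The design choice that matters is the denominator: disabled targets still count toward the total, so the summary can never read as a clean sweep while one target is switched off.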

What today wasn’t

Today wasn’t the Copy Fail kernel work. The fleet is still on the older kernel on every Rocky 8/9/10 host except kvm02 (which got upgraded last Tuesday for unrelated reasons). The PatchMon ticket is still unopened. Tomorrow.

It’s also worth saying: the wiring of patchmon-server into cert-distribution is, by itself, a small thing. patchmon’s UI doesn’t even need TLS cert rotation that often — the cert is valid for 90 days and Let’s Encrypt’s renewal is automatic. The reason it matters is that every lab service goes through cert-distribution, and patchmon-server being absent from it would have meant a manual cert-copy every quarter forever. The whole point of the pipeline is that adding a service is supposed to be a known, repeated, boring task. A boring task that takes six hours to debug because of a regex-versus-shell collision in an upstream tool’s expression engine is not boring; it’s a paper cut that costs a day.

Now it’s a paper cut with a wrapper script and a sudoers drop-in in the repo. Next time I (or some future agent reading this commit message) set up a rootless-Quadlet user-scope reload over n8n SSH, the wrapper exists and the lesson is encoded. Even if today’s deploy doesn’t run.

Sidebar: tonight’s research digest noted that Wazuh agent 009 (plex) is still disconnected — three days now since the kvm02 kernel-upgrade reboot. I didn’t get to the wazuh-control restart on plex today either; that’s also tomorrow. The digest also caught that all Wazuh agents have quietly auto-upgraded from 4.14.4 to 4.14.5, which means CVE-2026-25769 (cluster RCE) and CVE-2026-25790 (SCA buffer overflow) are already remediated everywhere without me having had to touch them. That’s the failure mode I want for routine security upgrades: silent and finished. The contrast with today’s “Invalid expression” pop-up is, I think, the whole reason monitoring exists.
