Writing to the Wrong Zone

There’s a particular shape of bug where every step succeeds and the result is still wrong. The certbot renewal pipeline for *.lab.towerbancorp.com had been quietly failing for at least a week. The container exited 0. The Google Cloud DNS API was returning 200s. TXT records were being created. They just weren’t being created in the zone Let’s Encrypt was going to query.

I noticed because the cert finally hit its 30-day window and ACME started caring whether renewal actually worked.

The failure looked normal

Certbot’s dns-google plugin runs the standard DNS-01 dance: ask Let’s Encrypt for a challenge, write _acme-challenge.lab.towerbancorp.com TXT, wait for propagation, ask Let’s Encrypt to verify, delete the TXT. The plugin uses a service account credentials JSON to talk to Cloud DNS. The SA in question is acme-dns-manager@homelab-certificates.iam.gserviceaccount.com.

The error in the certbot log was the bland kind:

Error finalizing order: urn:ietf:params:acme:error:dns
Detail: DNS problem: NXDOMAIN looking up TXT for
_acme-challenge.lab.towerbancorp.com - check that a DNS record
exists for this domain

NXDOMAIN. The classic “you didn’t write the record.” Except the certbot log from the same run had printed “Successfully added TXT record” ten seconds earlier. The API was telling certbot the write worked. The resolver was telling Let’s Encrypt no such record existed. Both were correct.

Two projects, one zone name

The lab DNS lives across two Google Cloud projects, and that split is the load-bearing piece nobody told the SA about.

homelab-certificates was the original project. It hosts the towerbancorp.com parent zone, on the c-shard nameservers (ns-cloud-c1 through ns-cloud-c4). When the lab was small, this is where everything lived.

towerbancorp-lab is the newer project. It hosts the lab.towerbancorp.com child zone, on the e-shard nameservers. The parent zone in homelab-certificates has an NS delegation that says “for anything under lab.towerbancorp.com, ask the e-shard servers.” That delegation is what makes the split work — it’s also what makes it invisible from the parent’s API surface.

The SA had been provisioned back when there was only one project. It still had its roles/dns.admin binding on homelab-certificates. Nobody had migrated it when the child zone was carved out. The credentials JSON’s project_id field still pointed at the old project. The dns-google plugin reads that field at startup, builds a Cloud DNS client scoped to that project, and from then on every TXT write is routed to whatever zone in that project matches the FQDN.

And here’s the kicker: there IS a zone for towerbancorp.com in homelab-certificates. It’s the parent. So when certbot asked the plugin to write _acme-challenge.lab.towerbancorp.com, the plugin found a zone whose name was a valid suffix of the record (towerbancorp.com), and it wrote the TXT there. The API returned success. The record literally exists in that zone.
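To make the suffix match concrete, here is a minimal sketch of longest-suffix zone selection. It is not the plugin's actual code: the function names and the hard-coded zone inventory are my own stand-ins, and the real plugin asks the Cloud DNS API for the zones in the project instead of reading a dict.

```python
from __future__ import annotations


def candidate_zone_names(fqdn: str) -> list[str]:
    """Return every suffix of fqdn, longest first."""
    labels = fqdn.rstrip(".").split(".")
    return [".".join(labels[i:]) for i in range(len(labels))]


def find_zone(fqdn: str, zones_in_project: set[str]) -> str | None:
    """Pick the longest suffix of fqdn that is a zone in this one project."""
    for name in candidate_zone_names(fqdn):
        if name in zones_in_project:
            return name
    return None


RECORD = "_acme-challenge.lab.towerbancorp.com"

# Zone inventory per project, as described in the post (stand-in for an API call).
PROJECT_ZONES = {
    "homelab-certificates": {"towerbancorp.com"},
    "towerbancorp-lab": {"lab.towerbancorp.com"},
}

old = find_zone(RECORD, PROJECT_ZONES["homelab-certificates"])
new = find_zone(RECORD, PROJECT_ZONES["towerbancorp-lab"])
print(old)  # towerbancorp.com
print(new)  # lab.towerbancorp.com
```

The lookup only ever sees one project's zones, so inside homelab-certificates the parent is the best match available and nothing looks wrong.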

Resolvers, of course, don’t ask the parent for _acme-challenge.lab.towerbancorp.com. They follow the NS delegation down to the e-shard nameservers, which know nothing about homelab-certificates and have nothing for that name. NXDOMAIN. Records below an NS delegation can’t be shadowed by the parent — that’s the whole point of delegation.

The TXT was sitting in a zone no resolver would ever ask.
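The resolver's side of the story can be modelled the same way. This is a toy sketch, assuming a two-zone world held in a dict: it follows NS cuts downward and answers only from the deepest zone, which is a simplification of real resolution (no caching, no glue, no CNAMEs). The names `ZONES`, `authoritative_zone`, and `lookup_txt` are made up for illustration.

```python
from __future__ import annotations

RECORD = "_acme-challenge.lab.towerbancorp.com"

# Each zone maps owner name -> set of record types present at that name.
ZONES: dict[str, dict[str, set[str]]] = {
    "towerbancorp.com": {
        "lab.towerbancorp.com": {"NS"},  # delegation to the child zone
        RECORD: {"TXT"},                 # the misplaced write, below the cut
    },
    "lab.towerbancorp.com": {},          # live child zone, no challenge record
}


def is_at_or_below(name: str, apex: str) -> bool:
    return name == apex or name.endswith("." + apex)


def authoritative_zone(name: str, start: str) -> str:
    """Follow NS cuts down from `start` until no deeper delegation covers `name`."""
    zone = start
    while True:
        cuts = [
            owner
            for owner, types in ZONES[zone].items()
            if "NS" in types and owner != zone and is_at_or_below(name, owner)
        ]
        if not cuts:
            return zone
        zone = max(cuts, key=len)  # descend through the deepest matching cut


def lookup_txt(name: str, start: str = "towerbancorp.com") -> tuple[str, str]:
    """Return (zone that answered, response code) for a TXT query."""
    zone = authoritative_zone(name, start)
    records = ZONES[zone]
    if name not in records:
        return zone, "NXDOMAIN"
    return zone, "NOERROR" if "TXT" in records[name] else "NODATA"


print(lookup_txt(RECORD))
```

The TXT in the parent is never consulted, because the walk leaves the parent zone at the lab.towerbancorp.com cut before it gets that far.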

Path A and Path B

Two ways to fix it.

Path A: leave the SA where it is, but grant it roles/dns.admin on towerbancorp-lab (cross-project IAM binding), then flip the project_id field in the credentials JSON from homelab-certificates to towerbancorp-lab. The plugin then targets the live e-shard zone and the writes land where resolvers look.

Path B: provision a new SA inside towerbancorp-lab (project-local), grant it roles/dns.admin on the project, generate a new key, swap the credentials, rotate.

Path A is one IAM binding and one JSON edit. Path B is the better hygiene: the SA’s identity should match the project it operates in, partly so a future zone rename in homelab-certificates doesn’t reach back through cross-project IAM and confuse anyone. I picked Path A tonight because the cert was already inside its expiry window and I wanted a renewal to land within the hour. Path B is filed as #228, to be done in a quieter maintenance window.

gcloud projects add-iam-policy-binding towerbancorp-lab \
  --member="serviceAccount:acme-dns-manager@homelab-certificates.iam.gserviceaccount.com" \
  --role="roles/dns.admin"

Then a one-line diff in the JSON:

-  "project_id": "homelab-certificates",
+  "project_id": "towerbancorp-lab",
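Since the whole failure hung on that one field, a pre-flight check before each renewal is cheap. The following is a sketch only: `check_credentials_project` is a hypothetical helper, the demo writes a throwaway file in place of the real key, and nothing here is part of certbot or the dns-google plugin.

```python
import json
import tempfile
from pathlib import Path


def check_credentials_project(path: str, expected_project: str) -> None:
    """Raise if the key file's project_id is not the project hosting the live zone."""
    creds = json.loads(Path(path).read_text())
    actual = creds.get("project_id")
    if actual != expected_project:
        raise ValueError(
            f"{path}: project_id is {actual!r}, expected {expected_project!r}"
        )


# Demo against a throwaway file standing in for the real credentials JSON.
with tempfile.TemporaryDirectory() as tmp:
    key = Path(tmp) / "google.json"
    key.write_text(json.dumps({"project_id": "homelab-certificates"}))
    try:
        check_credentials_project(str(key), "towerbancorp-lab")
    except ValueError as exc:
        print(f"refusing to renew: {exc}")
```

It only catches this specific drift, not a missing IAM binding, but it fails loudly before any write happens.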

Renewal ran cleanly the next time the timer fired. New cert valid 2026-04-29 → 2026-07-28, R13 issuer, sha256 starting 96:38:2C:F9. The hash is recorded in the commit body so a future me has the exact handoff if the next renewal fails for some other reason.

The downstream pipeline finally got tested

Once the cert renewed, the cert-distribution n8n workflow took over. That’s the workflow I rewrote last week to push the freshly issued PEMs from server01 out to every node that terminates TLS — five nginx containers on kvm02, plus storage01, storage02, smtp, backup01, kvm01, and site02-kvm01. It uses sshPrivateKey credentials with authentication: 'privateKey' set on each Execute Command node, which is the configuration that actually works on n8n 2.15.1+ (the password+key hybrid that the UI lets you save silently fails at execute time).

Tonight was its first run against a real renewal rather than a dry-run with a forged cert payload. All eight nodes accepted the new files, restarted their nginx instances, and reported back. The workflow finished in 23 seconds.

I also patched one piece of monitoring noise. OpenObserve has a container-restart-detected alert that fires when any container in the Wazuh / nginx / certbot fleet restarts. Certbot’s container is intentionally short-lived — every renewal attempt spawns it, runs the plugin, exits it — so each run was tripping the alert. I’d been ignoring those because they were obviously spurious, but with the renewal pipeline back online they were about to come back at twice-daily cadence forever. Patched the alert filter to exclude images matching certbot:. Narrow enough that an actual long-running certbot container restart would still alert; if I ever build one (I won’t), the rule survives.

A small Netbird carve-out, while we’re here

Earlier in the day — before the certbot work — I spent twenty minutes on fallout from yesterday’s OHP#63, the “Closing the Default Allow” change that deleted the default all-to-all Netbird policy and replaced it with eighteen explicit ones. RDP from the work laptop (CDN-MV05EJ0, in work-devices) to the home desktop (SER5-Desk, in admin-devices) had been silently relying on that default-allow. It broke at 09:42 when mstsc just hung.

The fix is a pattern I’m calling peer-singleton groups: two new groups, peer-cdn-mv05ej0 and peer-ser5-desk, each containing exactly one peer, plus a new policy CDN to SER5-Desk RDP that allows TCP 3389 one-way between them. Eighteen policies became nineteen. The deployment doc now documents the pattern so the next narrow exception doesn’t get hand-waved into a broader role group that opens up more than was needed.

The Netbird group model wants you to think in roles (admin-devices, lab-servers, work-devices), but real-world exceptions are often peer-to-peer. Singleton groups are how you express “just these two, just this port” without inflating a role. They sort awkwardly in the dashboard. They’re correct.
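As a rough illustration of why the singleton groups stay narrow, here is a small default-deny model of the policy check. It is my own simplification, not Netbird's API or evaluation logic; the `Policy` fields and the peer `other-work-laptop` are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    name: str
    source_group: str
    destination_group: str
    protocol: str
    port: int
    bidirectional: bool = False


GROUPS = {
    "work-devices": {"CDN-MV05EJ0", "other-work-laptop"},  # second peer is hypothetical
    "admin-devices": {"SER5-Desk"},
    "peer-cdn-mv05ej0": {"CDN-MV05EJ0"},
    "peer-ser5-desk": {"SER5-Desk"},
}

POLICIES = [
    Policy("CDN to SER5-Desk RDP", "peer-cdn-mv05ej0", "peer-ser5-desk", "tcp", 3389),
]


def is_allowed(src: str, dst: str, protocol: str, port: int) -> bool:
    """Default deny: traffic passes only if some policy explicitly matches it."""
    for p in POLICIES:
        if (p.protocol, p.port) != (protocol, port):
            continue
        sources = GROUPS[p.source_group]
        destinations = GROUPS[p.destination_group]
        forward = src in sources and dst in destinations
        reverse = p.bidirectional and dst in sources and src in destinations
        if forward or reverse:
            return True
    return False


print(is_allowed("CDN-MV05EJ0", "SER5-Desk", "tcp", 3389))
print(is_allowed("other-work-laptop", "SER5-Desk", "tcp", 3389))
```

Had the policy been written from work-devices to admin-devices instead, the second call would pass too, which is exactly the inflation the singleton groups avoid.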

What’s parked

Tonight’s research digest dropped four advisories against n8n versions older than 2.17.4 / 2.18.1. Both n8n instances — Homelab’s on kvm02, OHP’s on server01 — are running 2.15.1, which is in scope for all four, including two critical XML prototype-pollution-to-RCE chains. Latest stable is 2.18.4. Issues #226 (Homelab) and #72 (OHP) are filed.

I’m not touching that tonight. The cert-distribution workflow that just succeeded above lives on the Homelab n8n, and starting an upgrade on the same evening I unblocked the cert pipeline is exactly the sequence that turns a five-line commit into an outage post. Tomorrow.

The thing I’ll remember from tonight is the gap between API success and resolution effect. The DNS plugin wrote a record. The API returned 200. The record existed. None of that means it would be visible to a resolver, because resolution doesn’t traverse the Cloud DNS API — it traverses delegated nameservers, which knew nothing about the project I was writing into. There’s no write-time validation that could catch this; the call was syntactically correct and operated on a zone the SA had permission to modify. The only honest check is end-to-end: query the resolver from outside, see the value you just wrote, then declare success.
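A sketch of that end-to-end check, under some assumptions: `wait_for_txt` is a hypothetical helper, and the actual DNS query is injected as `query_txt`, a callable that returns the TXT strings a public resolver sees for a name (empty if there are none). A real one could wrap dnspython or shell out to dig; that part is deliberately left out here.

```python
import time
from typing import Callable, Iterable


def wait_for_txt(
    name: str,
    expected: str,
    query_txt: Callable[[str], Iterable[str]],
    attempts: int = 10,
    delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until `expected` shows up in the TXT answers for `name`.

    Returns True as soon as the value is visible, False if it never appears.
    """
    for attempt in range(attempts):
        if expected in set(query_txt(name)):
            return True
        if attempt < attempts - 1:
            sleep(delay)  # give propagation a moment before asking again
    return False


# Demo with a fake resolver that only sees the record on the third query.
answers = iter([[], [], ["token-123"]])
visible = wait_for_txt(
    "_acme-challenge.lab.towerbancorp.com",
    "token-123",
    query_txt=lambda _name: next(answers),
    sleep=lambda _seconds: None,  # skip real waiting in the demo
)
print(visible)  # True
```

The point of injecting the query is that success is defined by what the resolver returns, not by what the write API said.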

Tomorrow’s certbot timer will fire at 06:00 against a cert that’s already fresh, so it won’t actually do anything. But the next time the 30-day window comes around, the plugin will write to the right zone, the right resolver will answer, and the right cert will land. That’s the version of “working” that survives.
