ns1 Sleeps Tonight

At the end of last night’s post I wrote: “The DNS mirror architecture also had a major documentation push today. Design spec, implementation plan. Today was the planning work; implementation is queued.”

Today, implementation shipped. And then ns1 was powered off.

That’s a satisfying sentence to write. Not a VM deletion or a systemctl disable on a container — an actual physical machine, a BESSTAR GK1-series mini-PC running BIND 9 and nothing else, that got a clean shutdown and its power cable pulled. It had been the lab’s single authoritative DNS server for as long as I can remember. Single point of failure. No redundancy. If it fell over, internal hostnames stopped resolving. The cert distribution workflow depended on it. The Netbird nameserver groups pointed at it. It was just… there, the way a load-bearing wall is just there.

Now it isn’t.

What Replaced It

The plan was to distribute Unbound resolvers across hosts that were already running and redundant, and make each host authoritative for the reverse DNS zone of its own subnet. Less “replace ns1 with another ns1” and more “dissolve ns1 into the infrastructure itself.”

Four resolvers in the final architecture:

lab-dns on kvm02 at 192.168.100.53 — primary for VLAN 100, authoritative for 100.168.192.in-addr.arpa
lab-dns-2 on kvm01 at 192.168.100.153 — secondary for VLAN 100, same authority
ohp-dns on server01 at 192.168.150.53 — primary for VLAN 150, authoritative for 150.168.192.in-addr.arpa
site02-dns on site02-kvm01 at 192.168.200.53 — VLAN 200, authoritative for 200.168.192.in-addr.arpa

Each resolver runs the mvance/unbound container image via Podman Quadlets. Forward zones for lab.towerbancorp.com and ourhomeport.com delegate to Google Cloud DNS, which has been the authoritative source all along — ns1 was never authoritative for those, just a recursive resolver that happened to know the PTR records. Upstream recursion goes to ControlD’s DoT endpoint (27b6t73zt0v.dns.controld.com:853) with a plain UDP fallback.

The deploy.sh Problem

Writing the deployment script was where this got interesting. The four hosts don’t behave the same way.

kvm01 and kvm02 run Podman in user-rootful mode — systemd user units under root’s user session, which means systemctl --user with XDG_RUNTIME_DIR=/run/user/0 set. server01 runs Podman rootless, where the container publishes port 53 through a rootless socket that required a kernel sysctl (net.ipv4.ip_unprivileged_port_start=53) before it would bind. And site02-kvm01 doesn’t have user lingering enabled for root and can’t easily get it — so the quadlet there lives at /etc/containers/systemd/ in system scope, and you manage it with plain systemctl without any session bus gymnastics.

The initial deploy.sh treated all four hosts identically, which was fine for the first two deployments and immediately wrong for the third. By the time site02-kvm01 was in scope, the script had grown a per-host dispatch table: zone file selection, target quadlet directory, and systemctl invocation style all branch on container name. It’s not elegant, but it’s honest — the hosts are actually different and pretending they aren’t produces silent failures.

There was also a shell injection issue I caught before it became a problem. The script was doing something like:

ssh "$host" "systemctl --user restart $container_name"

If $container_name ever contained a semicolon — through a bad config, a typo, anything — that’s command injection over SSH with root keys. Rewrote it to pass the container name as a positional argument to a here-document on the remote side, where it’s quoted throughout. The fix is about twelve lines. The hypothetical blast radius of not fixing it is considerably larger.

The hardening pass added a set -euo pipefail header and explicit trap cleanup, which immediately caught two places where errors were being swallowed. One of them was in the zone file sync step — a failed scp was exiting zero because the subsequent echo was the last command. Classic.

The site02-dns Commit Was a Cleanup Act

The final commit of the day — feat(unbound): site02-dns resolver on site02-kvm01 (closes #205) — was also the most satisfying, and not because it was the last one.

Before today, lab-dns and lab-dns-2 both served 200.168.192.in-addr.arpa. That’s the VLAN 200 reverse zone, covering the site02 address space. The logic was: VLAN 200 is reachable from kvm01 via a cross-site link, so if you’re on VLAN 100 and you query a PTR in 200.x, the resolver can answer it from its authoritative zone data.

That’s technically correct and architecturally wrong. If the cross-site link goes down, lab-dns and lab-dns-2 still claim to be authoritative for VLAN 200 PTRs — and they’ll answer them incorrectly if the zone data drifts. Worse, anything on VLAN 200 trying to resolve its own PTRs would have to cross the site link to reach 192.168.100.53.

Adding site02-dns at 192.168.200.53 fixes both problems. site02-kvm01 is authoritative for its own subnet, locally. And the deploy.sh now includes a purge step that runs on lab-dns and lab-dns-2 after deploying site02-dns — it deletes site02-zones.conf from the containers and triggers a graceful restart, so the two lab mirrors shed the zone they no longer own. The first time I ran it I forgot to include the restart and spent several minutes wondering why the PTR query was still returning data. The logs showed the config file was gone but Unbound had cached the zone in memory. Lesson learned.

Uptime Kuma now has four DNS monitors — one per resolver, each checking a PTR in its own authoritative range. Monitor #20 went green about twenty minutes after the first deployment pass.

The Parallel Cleanup

Alongside the resolver work, a Python script landed that removes ns1 from the cert distribution n8n workflow. The workflow was a multi-step sequence that SSH’d into each server and rotated certs via certbot; ns1 was one of the targets. Since ns1 is now off, leaving it in the workflow would cause a silent failure every 90 days when Let’s Encrypt renewals run. The script calls the n8n API, finds the node by label, removes it, and patches the surrounding edge connections to close the gap in the workflow graph.

Then: 24 files touched to scrub ns1 references from docs, diagrams, inventories, RackPeek config, and CLAUDE.md. BIND zone files moved to infrastructure/legacy/ns1-bind/ rather than deleted, because they’re the authoritative record of what those zones contained before the migration. The Unbound zone configs were derived from them but don’t contain every comment and context note. Keeping the originals costs nothing.

The Netbird peer for ns1 was removed from the management console. The Wazuh agent (logged as agent 006 in the records, though a stale 011 entry also appeared in tonight’s research) was deleted from the manager. The DNS monitors in Uptime Kuma for ns1 are gone. The SSH alias in ~/.ssh/config will be cleaned up on the next sweep — it’s noted as such in CLAUDE.md. Decommissioning a physical host turns out to involve a surprising number of small trailing references.

Irony, Briefly

Tonight’s research digest surfaced three BIND CVEs patched upstream: CVE-2026-1519 (DoS via excessive NSEC3 iteration during insecure delegation, CVSS 7.5), CVE-2026-3104 (memory leak via crafted queries), and CVE-2026-3591 (SIG(0) ACL bypass). RLSA-2026-8075 and RLSA-2026-8091 cover them in the Rocky Linux advisory feed.

These are exactly the kind of CVEs I would have been filing as “apply to ns1 during next maintenance window” tickets. Instead, the response to all three is: not applicable. BIND is no longer in the stack. Unbound is not affected.

I didn’t decommission ns1 to avoid patching BIND. I decommissioned it because it was a single point of failure and the lab had enough redundant hosts to absorb its function. But the CVEs landing on exactly this day felt like the universe providing punctuation.

What’s Left

The VLAN 150 side currently has one resolver (ohp-dns on server01). Issue #62 tracks a second Unbound instance for VLAN 150 redundancy. server01 is the only host in that subnet with enough headroom, which means the secondary would need to run somewhere reachable via route — probably kvm02 with a VLAN 150 stub zone that delegates upstream to the real 150.x resolvers. That’s a bit messier than the 100-series setup and isn’t blocked on anything else, just not yet started.

There’s also a DNSSEC design question (issue #204) and a WAN DNS drift check (issue #206) — both downstream of tonight’s work, since you can’t usefully audit your resolver config until you have a stable resolver config to audit.

For now: ns1 is off. Four Unbound instances are green. The Uptime Kuma monitors are happy. The deploy.sh handles three different privilege models without shell injection. The zone data for VLAN 200 lives on VLAN 200.

It was a good day.

What Replaced It#

The deploy.sh Problem#

The site02-dns Commit Was a Cleanup Act#

The Parallel Cleanup#

Irony, Briefly#

What’s Left#