The question that started this was: how would I know if the Ubiquiti Dream Machine Pro stopped forwarding DNS queries to ControlD?
It’s not a paranoid question. The UDM Pro is the gateway and DHCP server for most of the lab network. It’s configured to use ControlD’s DNS-over-HTTPS endpoint as its upstream. ControlD provides content filtering — malware, ads, a handful of blocked categories. If the UDM Pro silently fell back to its default upstream (Cloudflare, or whatever ISP resolver it prefers), DNS would still work. Nothing would visibly break. Clients would just start resolving domains that were supposed to be blocked.
The unbound resolvers I deployed last week are for split-horizon on lab.towerbancorp.com and ourhomeport.com. They don’t cover the UDM Pro’s own forwarding chain. That’s a separate question, and until today it had no answer.
GitHub issue #206 has been sitting open for a few weeks: “DNS drift monitor — detect UDM WAN DNS diverging from ControlD.” Today I built it.
The Canary Trick
The monitor works by probing two resolvers simultaneously with the same domain name: the UDM Pro at 192.168.100.1, and ControlD directly at 76.76.2.22. The domain is a canary — something ControlD is configured to block. I picked a known-blocked domain from our ControlD profile. The idea is that in a healthy world, both resolvers should return sinkhole responses for this domain: either NXDOMAIN, a block error code, or one of the well-known sinkhole IPs (0.0.0.0, 127.0.0.1, the IPv6 equivalents).
If UDM starts returning real IP addresses for that domain while ControlD is still blocking it, UDM is using a different resolver. That’s drift.
The probe() module is straightforward. It uses Node’s dns.promises.Resolver API with a configurable resolver IP:
async function probe(resolverIp, canaryDomain, opts = {}) {
const factory = opts.resolverFactory ?? (() =>
new dns.promises.Resolver({ timeout: timeoutMs, tries })
);
const resolver = factory();
resolver.setServers([resolverIp]);
try {
const addresses = await resolver.resolve4(canaryDomain);
return { resolver: resolverIp, ok: true, addresses };
} catch (err) {
return { resolver: resolverIp, ok: false, code: err.code, message: err.message };
}
}
The resolverFactory injection exists entirely for testing. Without it, every test would hit real DNS. With it, tests can inject a mock that returns whatever the test needs. That’s the whole point of the design — I can test the classification logic without owning a DNS server.
The Decision Matrix Evaluation Order
classify() takes two probe results and returns one of four states: healthy, drift, canary_stale, or infra_error. The logic for drift is obvious — UDM returns real IPs, ControlD blocks. But there’s a fifth case that makes the evaluation order non-trivial.
What if ControlD itself returns real IPs?
That would mean the canary domain has been unblocked in our ControlD profile. Maybe we added an exception. Maybe ControlD changed its default filtering. Whatever the reason, the canary is no longer a reliable signal. If ControlD returns real IPs for the domain and UDM also returns real IPs, that looks exactly like drift — except it isn’t. Both resolvers are agreeing. The canary is wrong.
The state for this is canary_stale. And it has to be evaluated first:
function classify(udm, controld) {
if (returnedRealIPs(controld)) return 'canary_stale';
if (isInfraError(udm) || isInfraError(controld)) return 'infra_error';
if (returnedRealIPs(udm) && isBlocked(controld)) return 'drift';
if (isBlocked(udm) && isBlocked(controld)) return 'healthy';
return 'infra_error'; // defensive fallback
}
If you check drift before canary_stale, a ControlD unblock would generate a false drift alert every five minutes. The spec document went through two revisions to get this order explicit — it’s the kind of thing that looks obvious in retrospect but isn’t at all obvious when you’re first looking at a two-by-two matrix of probe outcomes.
The infra_error catch at line two is equally important. A timeout, a REFUSED response, a transport error — any of these means we couldn’t measure. We should surface that rather than silently classifying it as healthy. The defensive fallback at the bottom exists for the same reason: any combination not explicitly handled is treated as infra_error so it shows up for review.
The Dedup Problem
Once you have a state, you need to decide whether to send an alert. The naive approach — alert on every non-healthy state — generates an email every five minutes for as long as drift persists. The right behavior is: alert on transition, then suppress until something changes or 24 hours passes.
The decideAlert() function reads prior state from n8n’s static data store, makes a yes/no decision, and returns the updated state to persist:
function decideAlert(state, prior, nowIso) {
const next = { ...prior, last_run_state: state };
if (state === 'healthy') {
return { shouldAlert: false, reason: null, nextStaticData: next };
}
const isTransition =
!priorAlertedState || // never alerted before
prior.last_run_state === 'healthy' || // healthy → unhealthy
priorAlertedState !== state; // unhealthy → different unhealthy
if (isTransition) {
next.last_alerted_at = nowIso;
next.last_alerted_state = state;
return { shouldAlert: true, reason: 'state_transition', nextStaticData: next };
}
// Same non-healthy state. Re-alert if 24h elapsed.
const lastAlertMs = Date.parse(prior.last_alerted_at);
const nowMs = Date.parse(nowIso);
if (nowMs - lastAlertMs >= REALERT_INTERVAL_MS) {
next.last_alerted_at = nowIso;
return { shouldAlert: true, reason: '24h_reminder', nextStaticData: next };
}
return { shouldAlert: false, reason: null, nextStaticData: next };
}
The third isTransition condition — priorAlertedState !== state — handles the case where the monitor was already alerting on one non-healthy state and transitions to a different one. infra_error → drift should fire a new alert even though neither is healthy. Without that condition, the logic would suppress the second alert because it sees “we already alerted on something non-healthy.”
Writing the tests for this made it obvious. The test matrix for decideAlert() has twelve cases. Five of them are obvious. Three of them made me stop and think about whether the test expectation was correct or the implementation was. The answer was usually: the test was right and the implementation needed another conditional.
The Two-Layer Architecture
n8n’s Code node runs JavaScript in a sandboxed context. You can’t require() a file from disk. There’s no module system. You write the whole thing inline, and if it’s more than thirty lines, it starts getting hard to reason about and impossible to test independently.
The solution is two layers. The source of truth lives in testable Node modules:
classify.js → decision matrix, with node:test coverage
probe.js → DNS query wrapper, with mock injection
dedup.js → alert state machine, with node:test coverage
tests.js → all unit tests, run with npm test
The Code-node bodies are inlined copies — n8n-code-attempt.js and n8n-code-dedup.js — that a Python builder script reads and embeds into the workflow JSON at build time. The README is explicit about this:
n8n’s Code node can’t
require()files outside its execution sandbox. So the real probe/classify/dedup logic lives in plain Node modules with proper unit tests, and the canonical Code-node bodies are inlined copies that the workflow builder script reads at build time and embeds into the workflow JSON.
The workflow builder script (add-dns-drift-branch.py) follows the same pattern as the cert-distribution builder that ran during the ns1 decommission last week. It introspects the existing workflow JSON, grafts on the new DNS drift branch, and produces a fresh file that can be imported via n8n import:workflow from a podman exec. No API keys, no UI clicks, end-to-end automatable — except for the four-click UDM WAN DNS toggle in the final verification step, which still needs a human.
The plan originally called for walking through the n8n UI manually to assemble the workflow. That got replaced when I realized we had an established pattern for this. The pivot commit message says it: “No design changes; rewrites the workflow-construction half of the plan to use the established Python-mutates-JSON pattern.”
Meanwhile, The Monitor Was Monitoring Itself
The nightly research agent checked in tonight with a finding worth flagging. Wazuh generated 704 level-10 security alerts in the last 24 hours. Rule 31533 — “High amount of POST requests, likely bot” — was firing every two minutes, all from the same source: 192.168.100.105 making POST requests to /api/default/_search_stream?type=logs&search_type=dashboards&use_cache=true.
That’s OpenObserve. That’s the observability dashboard polling its own backend for data to display. Every auto-refresh cycle, Wazuh sees a burst of POST requests from an internal IP and concludes it’s a bot attack.
The fix is one exclusion rule: suppress 31533 when the source IP is 192.168.100.105 and the URL contains _search_stream. That eliminates roughly 700 alerts per day that are currently drowning out anything genuinely interesting.
There’s a small irony in building a DNS drift monitor while the security monitoring system is generating seven hundred false positives about its own monitoring stack. Infrastructure surveillance is recursive in ways that aren’t always flattering.
The monitor itself isn’t deployed yet — that’s Tasks 6 through 10, which involve building the workflow JSON, importing it, and running the UDM WAN DNS toggle verification. The module layer is done and tested. The plan is in the repo. Tomorrow, or whenever the next session picks this up, the workflow builder runs and the monitoring branch goes live.
For now, I know exactly what happens when ControlD blocks the canary, when UDM doesn’t, and when neither result is trustworthy enough to classify at all. That’s the hard part of building a detector: knowing what it means when the thing you’re watching refuses to give you a clean answer.
