The Audit Was Wrong

Yesterday’s audit document opened with a confident diagnosis: the cross-cluster Relayed peer-pairs in the Netbird mesh were caused by symmetric NAT between two physically separate UDMs. The recommended fix was a UDM port forward and an --external-ip-map flag. It was a clean, plausible story that explained 41% of the mesh sitting on relay.

It was also wrong.

There aren’t two UDMs. There’s one. And the NAT isn’t symmetric. And the port forward isn’t really the fix. Today I rewrote most of the audit, applied the actual fix across three zone boundaries, and watched the mesh re-pair into something I’m finally willing to call done.

How I got the topology wrong

The mesh has eleven peers I can SSH to: seven on the lab cluster (VLAN 100), one transit-routed in from site02 (VLAN 200), three on the OurHomePort side (VLAN 150 / DMZ VLAN 70 / Trusted_Devices VLAN 10), and the GCP-hosted controller. When I ran netbird status -d from each peer and parsed the connection types, the pattern was clean: every pair within a VLAN was P2P, every pair across the home/lab boundary was Relayed.
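
The per-peer tally described above can be sketched in a few lines of Python. This is a minimal illustration, not the script actually used: the sample text is made up, and the "Connection type" field name and the one-peer-per-indented-heading layout are assumptions about what netbird status -d prints.

```python
import re
from collections import Counter

# Made-up excerpt shaped like `netbird status -d` peer details. The layout and
# the "Connection type" field name are assumptions about the CLI's text output.
SAMPLE = """\
Peers detail:
 kvm01.netbird.cloud:
  Status: Connected
  Connection type: P2P
 server01.netbird.cloud:
  Status: Connected
  Connection type: Relayed
"""

# A peer heading is a single token followed by a colon and nothing else.
PEER_RE = re.compile(r"^\s*(\S+):\s*$")
TYPE_RE = re.compile(r"^\s+Connection type:\s*(\S+)")


def connection_types(status_text):
    """Map each peer name to the connection type reported for it."""
    result = {}
    current = None
    for line in status_text.splitlines():
        peer = PEER_RE.match(line)
        if peer:
            current = peer.group(1)
            continue
        ctype = TYPE_RE.match(line)
        if ctype and current is not None:
            result[current] = ctype.group(1)
    return result


def relay_share(types):
    """Fraction of peers whose connection type is Relayed."""
    counts = Counter(types.values())
    total = sum(counts.values())
    return counts.get("Relayed", 0) / total if total else 0.0
```

Running connection_types over the output collected from each peer and stacking the results gives the peer-by-peer matrix; relay_share is the per-peer version of the mesh-wide relay percentage.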

I’d been carrying a mental model where the lab and home networks were behind two different physical UDMs — maybe one at site01, one at the house — and reasoned from there. With two separate WANs and a typical residential NAT, you’d see exactly this kind of cross-WAN relay pattern. So I wrote it up that way and recommended a port forward.

What changed my mind was running a STUN probe from server01 and kvm01 to a public reflector and noticing the external port was the same for both. Same external IPv4 too: 108.11.247.66. That’s not two WANs. That’s one WAN with endpoint-independent NAT (port-preserving, not symmetric). Which means the NAT-traversal premise of the original audit was wrong from sentence one.
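
For readers who want to repeat that check, here is a rough sketch of a STUN Binding probe using only the standard library. It is not the tool used for the original measurement (the post doesn't say which one that was); the function names are illustrative, and the parser handles only the IPv4 XOR-MAPPED-ADDRESS attribute defined in RFC 5389.

```python
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442      # fixed value from RFC 5389
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101
XOR_MAPPED_ADDRESS = 0x0020


def build_binding_request(txid=None):
    """20-byte STUN header: type, length 0, magic cookie, 12-byte transaction ID."""
    txid = txid or os.urandom(12)
    return struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE) + txid


def parse_xor_mapped_address(response):
    """Return (ip, port) from a Binding Success response, or None if absent."""
    msg_type, length, cookie = struct.unpack("!HHI", response[:8])
    if msg_type != BINDING_SUCCESS or cookie != MAGIC_COOKIE:
        return None
    pos, end = 20, 20 + length
    while pos + 4 <= end:
        attr_type, attr_len = struct.unpack("!HH", response[pos:pos + 4])
        value = response[pos + 4:pos + 4 + attr_len]
        if attr_type == XOR_MAPPED_ADDRESS and len(value) >= 8 and value[1] == 0x01:
            # Port and address are XORed with the magic cookie on the wire.
            port = struct.unpack("!H", value[2:4])[0] ^ (MAGIC_COOKIE >> 16)
            addr = struct.unpack("!I", value[4:8])[0] ^ MAGIC_COOKIE
            return socket.inet_ntoa(struct.pack("!I", addr)), port
        pos += 4 + attr_len + (-attr_len % 4)  # attributes are padded to 4 bytes
    return None


def probe(server, local_port=0, timeout=3.0):
    """Send one Binding Request over UDP and return the reflected (ip, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(("", local_port))
        sock.settimeout(timeout)
        sock.sendto(build_binding_request(), server)
        data, _ = sock.recvfrom(2048)
    return parse_xor_mapped_address(data)
```

Calling probe with the (host, port) of a public reflector from two machines and comparing the returned tuples is enough to tell one WAN from two: identical external IPs mean a single NAT, and an external port that matches the local one points to port preservation rather than symmetric mapping.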

The actual topology: a single UDM Pro at the house running UniFi OS 9.x with zone-based firewall, with all VLANs (lab, OHP, DMZ, Trusted_Devices) NAT’d to one public IP. Verizon Fios hands the UDM a global IPv6 /56, and every VLAN gets its own routable /64. Cross-VLAN traffic isn’t a NAT problem. It’s a firewall problem. The “Block All” default between zones was silently dropping the host-candidate UDP between, say, 192.168.100.x:51820 and 192.168.150.100:51820, and ICE was correctly falling back to relay because all three of its direct candidate types (host, server-reflexive, peer-reflexive) had failed.

I preserved the original analysis verbatim under a banner that says “the diagnosis below was wrong, skip to the bottom for the working state.” Deleting it felt dishonest. The methodology — the per-peer matrix, the connection-type counts — was still useful as evidence, even though the interpretation had been off.

The fix, in three zone boundaries

Once the diagnosis was right, the fix pattern was repetitive. For every pair of zones where I wanted Netbird P2P:

1. Add a UDM zone-firewall ALLOW rule for UDP in each direction.
2. Open UDP 51820 on each peer’s host firewall.
3. Bounce Netbird on the affected peers to force ICE to re-gather.

Three boundaries needed it: OurHomePort ↔ TBC_Lab, TBC_Lab ↔ Dmz (for plex), and TBC_Lab ↔ Trusted_Devices (for the SER5-Desk workstation). Six zone rules total — every pair gets two rules because the policies are directional.
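
The rule count follows mechanically from the directionality. As a small sketch (the dictionary keys here are illustrative placeholders, not the UniFi policy schema), expanding each zone pair into one rule per direction gives the six:

```python
# Zone pairs named in the post; each is unordered.
ZONE_PAIRS = [
    ("OurHomePort", "TBC_Lab"),
    ("TBC_Lab", "Dmz"),
    ("TBC_Lab", "Trusted_Devices"),
]


def directional_rules(pairs):
    """Expand each unordered zone pair into one ALLOW rule per direction."""
    rules = []
    for a, b in pairs:
        for src, dst in ((a, b), (b, a)):
            # Keys are hypothetical; a real policy body has its own field names.
            rules.append({
                "source_zone": src,
                "destination_zone": dst,
                "protocol": "udp",
                "action": "ALLOW",
            })
    return rules
```

Generating the list this way also makes it harder to forget the return direction, which is the easy half to miss when creating rules by hand.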

I found two API gotchas worth recording.

The CSRF token. The UDM’s UniFi OS 9.x firewall-policies endpoint at /proxy/network/v2/api/site/default/firewall-policies accepts cookie-only auth for GET requests. POST, PUT, and DELETE return a 403 with no hint as to why unless you also send an X-Csrf-Token header carrying the value from the x-updated-csrf-token response header on /api/auth/login. I figured this out the first time by getting 403’d on an obviously well-formed POST and going looking for the missing header. The second time, on the DMZ and Trusted_Devices rules, I just used it.
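
A sketch of the header handling, assuming the login has already happened and you hold its response headers and session cookie. The helper names and the base URL argument are hypothetical, the request is only built, never sent, and the policy body is left to the caller because its schema isn't reproduced here.

```python
import json
import urllib.request

# Path quoted from the post; everything else in this block is illustrative.
POLICIES_PATH = "/proxy/network/v2/api/site/default/firewall-policies"


def csrf_token(login_headers):
    """Pull the token out of the login response headers, case-insensitively."""
    for name, value in login_headers.items():
        if name.lower() == "x-updated-csrf-token":
            return value
    raise KeyError("x-updated-csrf-token not present in login response")


def build_policy_post(base_url, session_cookie, login_headers, policy):
    """Build (but do not send) a POST carrying both the cookie and the CSRF token."""
    return urllib.request.Request(
        base_url + POLICIES_PATH,
        data=json.dumps(policy).encode(),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Cookie": session_cookie,
            "X-Csrf-Token": csrf_token(login_headers),
        },
    )
```

Raising when the token header is missing is deliberate: failing at build time is easier to diagnose than an unexplained 403 from the UDM.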

Port match silently broken. The same API accepts destination.port: "51820" with port_matching_type: "SPECIFIC". It stores the value. The dashboard renders the rule. And the compiled nftables ruleset doesn’t honor it — packets matching every other field still get dropped. Switching to port_matching_type: "ANY" (allow all UDP between the zones) makes the rule actually fire. I don’t know if this is a UniFi bug, an undocumented behavior, or an interaction between my API path and the UI’s, but the workaround is “allow all UDP cross-zone,” which is acceptable here because no other UDP services exist between these zones and Netbird’s authenticated WireGuard handshake gates the only thing that lands at port 51820 anyway.

The host-firewall side was less interesting: firewall-cmd --permanent --add-port=51820/udp followed by firewall-cmd --reload on every Linux peer, and on the SER5-Desk workstation an explicit Windows Defender Firewall rule, because the Netbird installer’s rule had been scoped to the WireGuard overlay IP 100.69.208.87, which is only reachable after the handshake. ICE needs UDP 51820 open on the physical NIC for the handshake to happen in the first place. Chicken and egg.

The IPv6 surprise inside the fix

Once all the firewalling was open, most of the new P2P pairs landed on IPv6 host/host rather than the IPv4 path I’d been preparing for. The lab peers and server01 were both on routable Verizon-delegated IPv6 /64s, and ICE preferred the direct v6 path over the now-permitted v4 cross-VLAN one. That was fine — better than fine, actually, because the v6 path skips the gateway hop entirely.

The exception was smtp. It came up Relayed even after the UDM rule was in. Its enp1s0 NetworkManager profile had ipv6.method set to disabled, which translates to a kernel-level disable_ipv6=1 on that interface. SLAAC therefore never fired, and smtp had no global IPv6 address to advertise as an ICE candidate. The IPv4 path was permitted, but a per-peer ICE worker race on smtp’s side (“ICE Agent is not initialized yet” for the server01 peer specifically) had already locked it onto relay priority for the session.
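
A quick way to spot this condition on a peer is to check whether any of its interface addresses is a global IPv6 address at all. The sketch below uses Python's ipaddress module; the function name is made up, and feeding it the address list (for example from ip -6 addr output) is left to the reader.

```python
import ipaddress


def global_ipv6_candidates(addresses):
    """Keep only the addresses that are IPv6 and globally routable."""
    keep = []
    for text in addresses:
        # ip_interface accepts both "addr" and "addr/prefix" forms.
        ip = ipaddress.ip_interface(text).ip
        if ip.version == 6 and ip.is_global:
            keep.append(str(ip))
    return keep
```

An empty result means the host has nothing but link-local or IPv4 addresses to offer, which matches what smtp looked like before the profile change.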

ssh smtp 'sudo nmcli connection modify enp1s0 ipv6.method auto ipv6.addr-gen-mode default && sudo nmcli connection up enp1s0 && sudo systemctl restart netbird'

SLAAC delivered 2600:4040:f0a3:d702:4973:b10b:12ed:51df/64 and the next ICE pass came up host/host over IPv6. Eight samples across a 96-second poll, no flap. Plex went the other way — its DMZ VLAN doesn’t have IPv6 prefix delegation, so its lab pairs landed on IPv4 host/prflx over the new cross-zone UDP path. Both fine.
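
The "no flap" claim is easy to make mechanical. As a minimal sketch, assuming each poll has already been reduced to a connection-type string such as "P2P" or "Relayed" (the function names are illustrative):

```python
def count_flaps(samples):
    """Count transitions between consecutive samples."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)


def is_stable_p2p(samples):
    """True when there is at least one sample and every sample is P2P."""
    return bool(samples) and all(s == "P2P" for s in samples)
```

Eight consecutive "P2P" samples pass is_stable_p2p with zero flaps; a single "Relayed" in the middle of the run counts as two flaps, one on the way out and one on the way back.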

Final tally: every cross-cluster pair I have visibility into — 7 lab↔server01, 7 lab↔plex, 7 lab↔SER5-Desk, 21 in total — is now stable P2P. The mesh now relays only for genuinely-mobile peers (an Android phone, an Android tablet, the work laptop on cellular) where ICE doesn’t have a host candidate to begin with. That’s expected and not an infrastructure issue.

And, separately: Wazuh 4.14.5

While I had the lab open I also pulled Wazuh from 4.14.4 to 4.14.5. The release notes call out a buffer overflow in analysisd’s regex matcher and a rate-limit bypass on the cluster /events endpoint, both of which we’d have to hold our nose to ignore on a security monitoring system. Manager + indexer + dashboard restarted on kvm02, all nine agents picked up the new RPM via dnf update wazuh-agent without protest, cluster came back green within five minutes. Closes Homelab #223.

Sidebar: what the digest also said

The research digest noted that the 4.14.5 rollout actually completed across all agents in the last two hours — auto-upgrade timing suggests the unattended path triggered around the same time I was running the manager upgrade by hand. Worth confirming the policy is intentional rather than a surprise; it’s the kind of thing that should be deliberate, not coincidental.

A handful of advisories that are already covered by the current stack: the Authentik delegated-permission CVE was patched in the 2026.2.x line we’re on, the n8n Content-Type confusion RCE (CVSS 10.0) was patched in the 2.x train we upgraded to two weeks ago, and the BIND CVEs from this cycle simply don’t apply because ns1 was decommissioned six days ago. Two stack-adjacent items still on the books: the Podman 5.6→5.8 upgrade for CVE-2025-52881 (already tracked in #196) and the Rocky 10 RLSAs from this month (will land with #178 patch management).

The mesh is quieter than yesterday by every measure — fewer Relayed pairs, fewer “Connecting” entries in netbird status, and one fewer wrong audit document. I’ll take it.
