Closing the Default-Allow

The Netbird mesh has been quietly correct for about two months. Fifteen peers, four sites, eighteen explicit policies. Everything reaches what it should. Nothing reaches what it shouldn’t.

Except that wasn’t quite true. Three of the routes were Network Routes — Netbird’s older, looser model. And the very first policy in the list, the one named simply “Default”, was an All-to-All rule that had been disabled in March but never deleted. Both of those things were on the to-do list as OHP#64 and OHP#63. Today they came off it.

Two models, one mesh

Netbird has two ways to expose a subnet to the mesh.

Network Routes are the older model. You designate a peer as a routing peer for a CIDR, and any peer in any group whose policy lets it talk to the routing peer can reach the whole subnet through it. The ACL is at the peer level: if you can reach the gateway, you can reach everything behind the gateway. There’s no per-port narrowing, no per-destination scoping. It’s blunt.

Networks are the newer model. The subnet is a first-class resource — 192.168.100.0/24 is its own object, and policies bind groups to that resource with specific protocols and ports. If a policy says work-devices → lab-vlan100 on TCP 22, 443, then a work laptop can SSH into a lab server and hit its HTTPS dashboard, and nothing else. Not the Cockpit on port 9090. Not any other service. The ACL is at the resource level, and it’s surgical.

The lab had a hybrid: most peer-to-peer access was already on explicit pairwise policies (admin↔lab, lab↔lab, ohp↔ohp, etc.), but the three subnet routes — lab-vlan100, ohp-vlan150, site02-vlan200 — were still Network Routes. Which meant the work laptop, which had a “work-devices → lab-servers” pairwise policy at TCP 22+443, also had implicit full-/24 reach into the lab through the Network Route. The narrow policy looked correct in the dashboard. The actual mesh state was wider.
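The gap between the two models is easier to see as a toy evaluator. This is a minimal sketch, not Netbird's actual policy engine: the function names, the `ResourcePolicy` type, the group list, and the target address 192.168.100.10 are all hypothetical, and only the group names, the CIDR, and the ports come from the setup described above.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class ResourcePolicy:
    """Hypothetical stand-in for a Networks-model policy."""
    source_group: str
    resource_cidr: str
    tcp_ports: frozenset  # empty set is treated as "all ports" in this sketch


def network_route_allows(peer_groups, groups_reaching_router, route_cidr,
                         dst_ip, dst_port):
    """Legacy model: reaching the routing peer implies the whole CIDR.

    dst_port is accepted but deliberately ignored, because the route
    itself does no per-port narrowing.
    """
    can_reach_router = bool(set(peer_groups) & set(groups_reaching_router))
    return can_reach_router and ip_address(dst_ip) in ip_network(route_cidr)


def networks_allows(peer_groups, policies, dst_ip, dst_port):
    """Networks model: a policy must match group, resource, and port."""
    for policy in policies:
        if policy.source_group not in peer_groups:
            continue
        if ip_address(dst_ip) not in ip_network(policy.resource_cidr):
            continue
        if not policy.tcp_ports or dst_port in policy.tcp_ports:
            return True
    return False


LAB_CIDR = "192.168.100.0/24"
work_laptop = ["work-devices"]
lab_work_narrow = ResourcePolicy("work-devices", LAB_CIDR, frozenset({22, 443}))

# Same laptop, same target, Cockpit's port: the two models disagree.
legacy = network_route_allows(work_laptop, ["work-devices", "admin"],
                              LAB_CIDR, "192.168.100.10", 9090)
scoped = networks_allows(work_laptop, [lab_work_narrow], "192.168.100.10", 9090)
```

Under these assumptions `legacy` is true and `scoped` is false, which is the whole difference between "the policy looked narrow" and "the mesh was narrow".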

The migration work was: replace each Network Route with a Network resource, and write the policies that reproduce intended access — and only intended access.

The script

I wrote a Python migration tool — applications/netbird/scripts/migrate-routes-to-networks.py — that does the whole sequence idempotently. Read state, plan deltas, apply, verify. It’s 325 lines. Most of those lines are guard rails: “is this network already created”, “is this policy already present”, “skip if matches expected state”. Idempotency matters because I wanted to be able to run it, panic, run it again, and have it converge to the same end state instead of duplicating.
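The read, plan, apply shape can be sketched without any API calls. This is an illustration of the idempotency pattern only, not an excerpt from the real script: the function names and the dict-based state are assumptions, and the in-memory `apply_plan` stands in for whatever requests the real tool makes.

```python
def plan_deltas(current, desired):
    """Compare current and desired state (dicts of name -> spec).

    Returns which names need creating, which need updating, and which
    already match and should be skipped.
    """
    create = sorted(n for n in desired if n not in current)
    update = sorted(n for n in desired
                    if n in current and current[n] != desired[n])
    unchanged = sorted(n for n in desired
                       if n in current and current[n] == desired[n])
    return {"create": create, "update": update, "unchanged": unchanged}


def apply_plan(current, desired, plan):
    """Return the state after applying the plan (in memory, for the sketch)."""
    new_state = dict(current)
    for name in plan["create"] + plan["update"]:
        new_state[name] = desired[name]
    return new_state


# Routing peers as described in this post; the spec layout is made up.
desired_networks = {
    "lab-vlan100": {"router": "kvm01"},
    "ohp-vlan150": {"router": "server01"},
    "site02-vlan200": {"router": "kvm01"},
}

first_plan = plan_deltas({}, desired_networks)
state = apply_plan({}, desired_networks, first_plan)
second_plan = plan_deltas(state, desired_networks)  # nothing left to do
```

The second run plans no creates and no updates, which is the property that makes "run it, panic, run it again" safe.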

The logical operations are simple:

1. Create three Networks: lab-vlan100, ohp-vlan150, site02-vlan200.
2. Create routers for each, pointing at the same routing peers as the legacy Network Routes (kvm01 for lab and site02, server01 for ohp).
3. Create five Networks-model policies covering the intended access:
   - admin-lab-full: admin devices to the whole lab subnet
   - admin-ohp-full: admin devices to the whole OHP subnet
   - admin-site02-full: admin devices to site02
   - lab-work-narrow: work devices to lab on TCP 22, 443 only
   - ohp-lab-full: lab servers to OHP (preserves pre-migration reach)
4. Delete the three legacy Network Routes.

Step 4 is reversible only in the sense that the script can recreate the routes; the moment they’re gone, the access semantics change. The right thing to do before step 4 is to take a snapshot.

gcloud compute disks snapshot netbird-server-disk \
  --snapshot-names=netbird-server-pre-networks-migration-20260425-1214

Done. Now if anything went sideways I had a 30-second rollback to a known-good controller.

The client-version trap

Here’s the thing that almost bit me. The Networks model only works on Netbird clients version 0.69.0 or later. On older clients, Networks-model routing silently no-ops. Not “errors out”. Not “shows a warning”. It just doesn’t route. The client behaves as if the resource didn’t exist.

The lab had upgraded the server from 0.68.3 to 0.69.0 earlier in the day, but the routing peers — kvm01 (Lab side) and server01 (OHP side) — were still on 0.68.3. If I had blown through the migration without checking, I would have deleted the legacy Network Routes, the routing peers wouldn’t have picked up the new Networks, and the entire lab-to-OHP and OHP-to-lab transit would have stopped working with no error in the dashboard.

I don’t know how I’d have noticed except by trying to reach something and watching it time out. The dashboard would have shown five green policies and three new networks, all looking correct.

So the migration script enforces a precondition: it bumps both routing peers to 0.69.0 before it touches the Networks. That detail belongs in the script, not in a runbook somebody has to remember.
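The check half of that precondition might look like the sketch below. It covers only the comparison, not the upgrade itself, and it assumes the peer versions have already been collected into a dict; `parse_version` and `peers_below_minimum` are hypothetical names, not functions from the real script.

```python
MIN_CLIENT = (0, 69, 0)  # minimum client version for the Networks model


def parse_version(text):
    """Turn '0.68.3' or 'v0.69.0' into a comparable tuple of ints."""
    core = text.strip().lstrip("v").split("-", 1)[0].split("+", 1)[0]
    parts = core.split(".")
    if len(parts) != 3:
        raise ValueError(f"unexpected version string: {text!r}")
    return tuple(int(part) for part in parts)


def peers_below_minimum(peer_versions, minimum=MIN_CLIENT):
    """Return the names of peers whose client is older than the minimum."""
    return sorted(name for name, version in peer_versions.items()
                  if parse_version(version) < minimum)


# A hypothetical mid-upgrade state: one routing peer bumped, one not yet.
routing_peers = {"kvm01": "0.68.3", "server01": "0.69.0"}
blocked_by = peers_below_minimum(routing_peers)
if blocked_by:
    print("refusing to migrate; upgrade first:", ", ".join(blocked_by))
```

Comparing integer tuples rather than raw strings matters here: as strings, "0.100.1" sorts before "0.69.0", which would wrongly flag a newer client as too old.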

Verification

After the migration, I ran a 5-by-4 verification matrix from inside the mesh:

From → To	Lab subnet	site02 subnet	OHP subnet
ser5-desk (admin)	reachable	reachable	reachable
CDN-MV05EJ0 (work)	TCP 22+443 only	blocked	blocked
plex (dmz)	(n/a)	(n/a)	blocked
kvm01 (lab-servers)	(self)	(self)	open
server01 (ohp-servers)	open	open	(self)

Every cell matched the policy intent. The work-laptop narrowing was the most important thing to verify: from CDN-MV05EJ0 I could SSH into kvm01 (TCP 22 ✓) and hit the lab dashboard (TCP 443 ✓), but not Cockpit on 9090 (blocked ✓), and I couldn’t reach the OHP subnet at all (blocked ✓). Pre-migration that laptop could reach anything in 192.168.100.0/24. Post-migration it can reach two ports on it.
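One row of that matrix can be expressed as a list of expectations and a TCP probe. This is a sketch of the idea rather than the tooling I actually used: the address 192.168.100.10 for kvm01 is a placeholder, and a timed-out connection is treated as "blocked", which cannot distinguish a policy drop from a host that is simply down.

```python
import socket


def tcp_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_expectations(expectations, probe=tcp_open):
    """Compare observed reachability against intent.

    expectations is a list of (host, port, should_connect) tuples.
    Returns the mismatches as (host, port, expected, observed).
    """
    failures = []
    for host, port, expected in expectations:
        observed = probe(host, port)
        if observed != expected:
            failures.append((host, port, expected, observed))
    return failures


# Work-laptop row, to be run from the work laptop itself.
KVM01 = "192.168.100.10"  # placeholder address, not the real one
work_laptop_row = [
    (KVM01, 22, True),     # SSH allowed
    (KVM01, 443, True),    # HTTPS dashboard allowed
    (KVM01, 9090, False),  # Cockpit must be blocked
]
```

Calling `check_expectations(work_laptop_row)` from the laptop returns an empty list when every cell matches intent. The `probe` parameter exists so the comparison logic can be exercised with a fake probe, without touching the network.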

That’s the actual security improvement. Everything else is plumbing.

And then I deleted the rule that didn’t matter

With the migration verified end-to-end, the All-to-All default policy was empirically redundant. It had been disabled for two months — the explicit policies were what was actually carrying access — but the migration was the proof. If anything had been quietly depending on the default rule, the work-laptop reach test would have shown it (the laptop would have retained its wide /24 access via the disabled-but-still-evaluated default). It didn’t.

So I deleted three policies in one shot:

Default — the original All-to-All, disabled but never removed.
Lab to OHP — a stale orphan with a description-versus-rule mismatch, superseded by lab-ohp-narrow and ohp-lab-full.
ClearDATA to SER5 — malformed (destinations: null), superseded by lab-work-narrow.

/api/policies is now down to eighteen entries, all explicit, all enabled, all matching what the deployment doc says they match. The doc itself was already the canonical view; today the API caught up to the doc.

A side trip: where the relays are

While I was on the dashboard I ran a separate audit — the one tracked as Homelab#188 — to figure out which peer-to-peer pairs are actually relayed versus P2P. From every SSH-able peer I queried netbird status and pulled the Connection type for each remote.

The result, after filtering 149 directional observations: 21 peer pairs are confirmed relayed. All 21 are inter-NAT-cluster — that is, every relayed pair has one endpoint behind the Lab Tower-Bancorp USG and the other behind the home OHP UDM. Within each NAT cluster, everything’s P2P. Across the two clusters, nothing is.
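The filtering step reduces directional observations to unordered pairs. The sketch below shows one way to do that, with two assumptions worth flagging: a pair counts as "confirmed relayed" only if every observed direction reports a relay, and the connection type arrives as the strings "Relayed" or "P2P". The peer "lab-peer-b" is a made-up name for the example.

```python
from collections import defaultdict


def confirmed_relayed_pairs(observations):
    """Collapse (local, remote, connection_type) rows into relayed pairs.

    A pair is kept only if all observed directions agree on 'relayed'
    (an assumption about what 'confirmed' means).
    """
    seen = defaultdict(set)
    for local, remote, conn_type in observations:
        if local == remote:
            continue  # ignore self-observations
        seen[frozenset((local, remote))].add(conn_type.strip().lower())
    return sorted(tuple(sorted(pair)) for pair, types in seen.items()
                  if types == {"relayed"})


def crosses_clusters(pair, cluster_of):
    """True if the two peers sit behind different NAT clusters."""
    first, second = pair
    return cluster_of[first] != cluster_of[second]


observations = [
    ("kvm01", "server01", "Relayed"),
    ("server01", "kvm01", "Relayed"),
    ("kvm01", "lab-peer-b", "P2P"),
]
cluster_of = {"kvm01": "lab", "lab-peer-b": "lab", "server01": "ohp"}

relayed = confirmed_relayed_pairs(observations)
all_inter_cluster = all(crosses_clusters(pair, cluster_of) for pair in relayed)
```

With the sample rows, the two directional relay observations collapse into one pair, and that pair spans the lab and OHP clusters.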

That’s symmetric NAT on both routers. Netbird’s STUN-based hole-punching can’t traverse two symmetric NATs simultaneously; one peer ends up with a different public port for every destination, and the other side can’t predict it. The relay handles it transparently — bandwidth’s fine, latency adds about 80ms — but it’s a structural limit of the home network shape, not a Netbird bug.
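A toy model makes the prediction failure concrete. This is a deliberately simplified simulation, not a description of how either router or Netbird's ICE stack is implemented: `ToyNat`, the port numbers, and the endpoint strings are all invented for the illustration.

```python
import itertools


class ToyNat:
    """Maps internal sockets to external ports.

    An endpoint-independent NAT reuses one external port per internal
    socket. A symmetric NAT allocates a new one per destination.
    """

    def __init__(self, symmetric, first_port=40000):
        self.symmetric = symmetric
        self._next_port = itertools.count(first_port)
        self._mappings = {}

    def external_port(self, internal, destination):
        key = (internal, destination) if self.symmetric else internal
        if key not in self._mappings:
            self._mappings[key] = next(self._next_port)
        return self._mappings[key]


def srflx_prediction_holds(nat, internal, stun_server, remote_peer):
    """Does the port learned via STUN match the port the peer will see?"""
    advertised = nat.external_port(internal, stun_server)
    actual = nat.external_port(internal, remote_peer)
    return advertised == actual


cone_ok = srflx_prediction_holds(
    ToyNat(symmetric=False), "10.0.0.5:51820", "stun:3478", "peer:51820")
symmetric_ok = srflx_prediction_holds(
    ToyNat(symmetric=True), "10.0.0.5:51820", "stun:3478", "peer:51820")
```

In the endpoint-independent case the advertised port is the one the remote peer actually sees, so `cone_ok` is true. In the symmetric case the mapping toward the peer gets a fresh port, so `symmetric_ok` is false, and with that happening on both ends neither side has a usable candidate to aim at.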

There is a fix. Forwarding a single UDP port on the home UDM to server01 would give it a stable external endpoint, which lets the Lab-side peers establish P2P to it via the predictable srflx candidate. By the audit’s count, that one port forward would flip 7 of the 21 relayed pairs to P2P. The Lab side would need a parallel change on the corporate USG — separate issue, separate cooperation. The OHP-side change is mine to make whenever I’m ready.

I didn’t make it today. The audit captured the evidence; the action is the next ticket.

A note from the research feed

The morning research digest flagged something I want to mark down even though it didn’t affect anything I touched today. There was an npm supply-chain compromise on Tuesday — @bitwarden/cli 2026.4.0, malicious for about ninety minutes — and the worm involved (the third “Shai-Hulud” iteration) is the first one explicitly written to scan for MCP configuration files as part of its credential-harvesting payload. AI coding tools — Claude Code, Cursor, Codex CLI, Aider — are now in the npm threat model in a way they weren’t before.

The lab’s @bitwarden/mcp-server install wasn’t affected (it’s a separate package, and the local bw binary wasn’t installed during the bad version window). But the precedent is the thing. The MCP config file on this desktop holds a Netbird API token, a Vaultwarden session, a GitHub PAT, and SSH host aliases for ten servers. If a future worm catches it, that’s the keys to the lab in one file. Treating it as a high-value secrets target, not just a config, is the lesson, and that’s a question for tomorrow’s session, not tonight’s.

The lab is stricter tonight than it was this morning. Three Network Routes are gone, replaced by three Networks with five explicit policies. The work laptop can do exactly what its policy says it can do, no more. The All-to-All default rule is deleted. The relay topology is documented. Eighteen policies in the API match eighteen policies in the doc.

The mesh has been correct for two months. Now it’s also explicit.
