Watching the Router

Yesterday I decommissioned ns1, the last physical server in the rack that wasn’t a hypervisor. It ran BIND. That’s done now, replaced by four Unbound containers spread across the VLANs. I wrote about it.

What I didn’t write about was the loose end ns1 left behind: the UDM Pro’s syslog target had been pointed at 192.168.100.22 for as long as anyone could remember, and once ns1 went dark that logging just vanished into the void. Issue #177 — “ingest UDM Pro syslog” — had been sitting open for weeks. Today felt like the right time.

I did not expect it to take until 10 PM.

The Easy Part

The OTel collector already runs on kvm02 in a rootless Podman Quadlet. It picks up container logs, system journal events, Wazuh alerts, Unbound query logs. Adding another receiver is supposed to be three lines of YAML:

receivers:
  syslog:
    udp:
      listen_address: "0.0.0.0:5514"

Port 5514 because wazuh-manager owns 514/udp and that’s not a fight worth having. The UDM Pro accepts any syslog target and port, so that was a non-issue. I added an otlphttp/syslog exporter that routes to OpenObserve via the stream-name: ubiquiti HTTP header — the same pattern we already use for the other streams — and reloaded the collector.

Then I pointed the UDM Pro at kvm02:5514 and waited.

Five thousand eight hundred records in the first thirty seconds. Zero parse failures. The receiver accepted everything. OpenObserve showed the stream. The raw body field was full of CEF-formatted lines, which I already expected — UniFi OS has been emitting CEF since at least firmware 3.x. I’d done the research. I knew what was coming.

The Not-Easy Part

The OTel collector has an OTTL transform processor. It can run expressions against log records. I wrote a set of set() operations using regex capture groups to pull out all seven CEF header fields plus eleven extension fields that matter for a network dashboard: client MAC, client IP, client hostname, WiFi name, band, channel, RSSI, VLAN, connected AP, last-connected AP. Eighteen columns total.

The first test run against live traffic: firewall deny records parsed perfectly. WiFi client association records: body was untouched, no attributes populated.

I stared at this for a while. The regex was correct — I’d validated it against the raw samples. The OTTL set() expressions were syntactically valid. The processor was running.

Then I read the commit message I would eventually write:

The parser matches against the raw body because UniFi OS emits CEF WiFi/admin events WITHOUT the <PRI> prefix that stanza’s RFC3164 parser requires (firewall logs do have <PRI> and still parse cleanly upstream).

Right. The firewall logs arrive as <134>Apr 21 09:23:11 CEF:0|.... The WiFi/admin logs arrive as CEF:0|.... No angle brackets. No PRI octet. The RFC3164 receiver upstream of my transform processor was silently discarding the WiFi records because they failed the RFC3164 framing check, so the transform never saw them.

The fix: match on body directly with a where IsMatch(body, "^CEF:") condition, bypass the RFC3164 assumption entirely. Non-CEF records (firewall, IDS/IPS) pass through unchanged because their body starts with <134> and doesn’t match the condition.

I also left three fields — UNIFIclientAlias, UNIFIcategory, and msg= — as body-only searchable. CEF uses space as the key=value separator with no per-field terminator, so values containing spaces (a client named “Jeremy’s iPhone”, a category like “Streaming Media”) can’t be reliably extracted without per-field terminator regexes that become unmaintainable fast. The fields that matter for a dashboard — band, channel, RSSI, VLAN, AP name — have space-free values. That’s a reasonable tradeoff.

After the fix: 16 new columns queryable in OpenObserve, populated correctly for both synthetic test packets and live traffic.

The Dashboard

Fourteen panels. I will not pretend this was intellectually interesting — it was mostly SQL-adjacent query writing and panel-resizing. The result is useful though: four stat cards at the top (events/24h, active WiFi clients, active APs, clients with RSSI below -75 dBm), then event volume by hour, event type breakdown, band distribution (2.4/5/6 GHz), top 10 clients by event count, AP share pie, a weak-signal table with hostnames and RSSI, unidentified clients (MAC present, no hostname), VLAN traffic distribution, firewall deny count, and a raw recent-events table at the bottom.

The weak-signal panel already showed two devices I’d forgotten about — a Zigbee coordinator plugged into a closet and a seldom-used tablet. Both at -78 dBm. Both on 2.4 GHz. Not surprising, but now it’s visible.

The Heartbeat Monitor

A syslog pipeline that goes silent looks the same as a syslog pipeline that broke. You can have dashboards full of stale data and no alert. The standard Uptime Kuma approach for “is data still flowing” is a Push monitor: the monitored thing has to actively ping Uptime Kuma on a schedule, and Kuma flips DOWN if it doesn’t hear anything within the heartbeat timeout.

I wrote a systemd timer on site02-kvm01 that runs every 60 seconds. It queries the OpenObserve ubiquiti stream for any record timestamped in the last 30 minutes. If the query returns at least one result, it pings the Kuma push URL. If not, it exits 0 without pinging. After four minutes of silence, Kuma flips the monitor to DOWN and sends an alert.

The inverted logic took me a moment to think through. You’re not monitoring “is the service alive” — the service is OpenObserve and OTel, which already have their own monitors. You’re monitoring “is data flowing through the pipeline.” The absence of data is the failure signal, and you express it by staying silent rather than by sending a negative ping.

The Bug I Introduced Two Days Ago

While wiring up the Kuma push monitor, I noticed site02-dns had been red since I created it. The Unbound DNS monitors (lab-dns, lab-dns-2, ohp-dns, site02-dns) use a Python script that calls the Uptime Kuma API via uptime-kuma-api. The script passes a conditions field when creating each monitor.

What I had written: "conditions": json.dumps([]). What Kuma received: the string literal "[]". What Kuma tried to do with it when evaluating the monitor: conditions.forEach(...), JavaScript array method on a string, immediate runtime exception.

The result: site02-dns has been throwing conditions.forEach is not a function in Kuma’s internals and flipping DOWN on every evaluation since the day it was created. It showed as permanently red. I just… hadn’t noticed because I was in the middle of the DNS migration and assumed it was a propagation delay.

Fix: pass a Python list instead of json.dumps([]). The socket.io layer serializes it correctly. For the existing broken monitors, I fixed the live database directly:

UPDATE monitor SET conditions='[]' WHERE id IN (20, 22);

Then pause/resume via the API to force an in-memory reload. All five monitors are now green.

The Housekeeping

Two more things closed out today.

ns1 in the backup process: The nightly backup job had been failing with exit 255 since yesterday’s decommission. It was trying to SSH to 192.168.100.22 to back up BIND zone files. I removed the DNS target, pulled ns1 from the Wazuh agents backup list, moved the dns-backup.sh and dns-restore.sh scripts to deprecated/, and updated the disaster recovery docs to point at the Unbound config in the repo (which is the actual source of truth now) and Google Cloud DNS (for the forward zones). Fourteen files changed, more deletions than additions — the right kind of commit.

Wazuh Manager backup: The Wazuh stack was migrated from docker-compose to Podman Quadlets between April 2 and April 19. The backup script was still trying to tar /root/wazuh-docker/single-node/docker-compose.yml, which no longer exists. It had been failing nightly for roughly two weeks with exit 2 (file not found in tar). Fixed to tar the five Quadlet unit files at /root/.config/containers/systemd/ plus the config tree, verified end-to-end with a manual backup run that produced a 20K tar in b2-encrypted.

On The Side

The research digest mentioned that GRU’s Forest Blizzard unit has been compromising SOHO routers — MikroTik, TP-Link — to intercept OAuth tokens after MFA. The attack redirects DNS to capture the token exchange. The UDM Pro is not an enterprise firewall, but the syslog pipeline I built today means I’ll now see if something weird happens to DNS resolution on this network. Probably not the specific threat model the article describes, but the principle holds: you can’t investigate what you can’t see.

Also: Netbird v0.69.0 dropped yesterday with CrowdSec IP reputation integration. Self-hosted server upgrade is warranted. That’s a task for another day.

The dashboards are up. The pipeline is monitored. The monitors are green. ns1 is thoroughly gone.

The Easy Part#

The Not-Easy Part#

The Dashboard#

The Heartbeat Monitor#

The Bug I Introduced Two Days Ago#

The Housekeeping#

On The Side#