Homelab Implementation Roadmap

Last updated: 2026-05-20

Scope: Dell R430 (proxfold), covering infrastructure hardening, config-as-code, hardware upgrades, automation

Pace: Weekend-warrior (~8 weeks active, then ongoing)

Status: Phase 1/2/3 done · Phase 4A/4C done · Phase 5A/5B/5C/5D/5E done · Phase 6A/6D/6E done · Phase 7D done · Phase 6G (Komga) deployed 2026-07-06 · Phase 6H (Kapowarr + komf) deployed 2026-07-18, pending first-run config/verification · Phase 4B re-prioritised (CPU 2 + 8× 32 GB / 256 GB at 1 DPC; unblocks Phase 7) · Phase 6B/6C/6F + Phase 7A/B/C/E next

The critical path runs through PVE 9 upgrade → Ansible codification → hardware upgrades → boot drive swap. Everything else layers on top. Phase 7 (security stack) depends on 4B for RAM headroom.

gantt
    title Homelab implementation timeline
    dateFormat  YYYY-MM-DD
    axisFormat  %b %d

    section Phase 1 - Foundation
    Ansible control node + common role   :done, a1, 2026-04-01, 7d
    Dockhand deploy + gluetun VPN        :done, a2, 2026-04-01, 10d

    section Phase 2 - PVE 9 Upgrade
    PVE 8.4 → 9.x upgrade                :done, p1, 2026-04-19, 2d
    ZFS RAIDZ expansion                  :done, p3, 2026-04-20, 1d

    section Phase 3 - Codify
    3A proxmox + nut roles               :done, b1, 2026-04-21, 1d
    3B plex data codification            :done, b2, 2026-04-21, 1d
    3C rebuild kit (auto-install)        :done, b3, 2026-04-21, 1d
    3D scheduled drift detection         :done, b4, 2026-04-21, 1d

    section Phase 4 - Hardware
    UPS purchase + install               :done, c1, 2026-04-19, 1d
    CPU 2 + RAM upgrade                  :c2, after b4, 5d
    Boot drive swap (ZFS mirror)         :done, c3, 2026-04-22, 1d

    section Phase 5 - Backup + monitoring
    5A Proxmox Backup Server             :done, d1, 2026-04-23, 1d
    5B Beszel + notifications            :done, d2, 2026-04-24, 1d
    5C n8n (lab automation)              :done, d3, 2026-04-28, 1d
    5D edge LXC + Caddy (NPM → Caddy)    :done, d4, 2026-05-02, 1d

    section Phase 6 - New services
    Home Assistant                       :e1, after d2, 14d
    Obico (3D print monitoring)          :e2, after d2, 14d
    6D Music acquisition pipeline        :done, e3, 2026-05-16, 1d
    Matrix server                        :e4, after d2, 14d
    6F Music recommendations / discovery :e5, after e3, 7d

    section Phase 7 - Security
    7A Wazuh AIO + first agents          :f1, after c2, 7d
    7B Agent fleet rollout               :f2, after f1, 5d
    7C Suricata NIDS + integration       :f3, after f2, 5d
    7D CrowdSec edge                     :done, f4, 2026-05-04, 3d

Phase 1 — Foundation (weeks 1–2)

Goal: Get Ansible running against at least one host and migrate container management to Git-backed Dockhand with VPN routing in place. These two tracks are independent and can run in parallel.

1A. Ansible control node

Why first: Every subsequent phase depends on having playbooks ready, especially the boot drive swap in Phase 4, which becomes a single ansible-playbook site.yml run instead of hours of manual work.

Note

The Ansible repo is already scaffolded at rampantlemming/homelab-ansible with all roles written. These steps cover standing up the control node and testing the roles against live hosts for the first time. See the Ansible section for full repo documentation.

Create the Ansible control node LXC on Proxmox
- Unprivileged, Debian 12, 1 vCPU, 512MB RAM, 8GB disk
- Install pip and Ansible:
```
apt update && apt install -y python3-pip git
pip install ansible --break-system-packages
```

Note

--break-system-packages is required on Debian 12 when installing with pip outside a virtualenv. PEP 668 prevents pip from installing into the system Python environment by default; Debian enforces this to avoid conflicts with apt-managed packages. Using a virtualenv is the alternative, but for a dedicated control node LXC this flag is fine.

Clone the homelab-ansible repo
- git clone https://github.com/rampantlemming/homelab-ansible.git ~/homelab-ansible
- cd ~/homelab-ansible
Install Galaxy collections (requires requirements.yml from the cloned repo)
- ansible-galaxy collection install -r requirements.yml

Generate SSH keypair on the control node

ssh-keygen -t ed25519 -C "ansible@homelab"

Distribute the public key to all managed hosts:

ssh-copy-id root@192.168.1.250   # proxfold
ssh-copy-id root@192.168.1.252   # arrstack
ssh-copy-id root@192.168.1.230   # plex
ssh-copy-id root@192.168.1.249   # nginx

Populate and encrypt the vault file
- The repo contains group_vars/all/vault.yml as a template — fill in real secrets before encrypting
- ansible-vault encrypt group_vars/all/vault.yml
- Store the vault password in a safe location (e.g. a password manager); the password file itself must be excluded from git via .gitignore
- To edit secrets later: ansible-vault edit group_vars/all/vault.yml
Test the common role against arrstack first
- ansible-playbook playbooks/arrstack.yml --tags common --check --diff
- Verify idempotency: once the dry run looks right, apply without --check, then run again; the second real run should report changed=0
Once clean, run common across all hosts
- ansible-playbook site.yml --tags common

1B. Dockhand + gluetun deployment

Why now: Getting Dockhand deployed means container management is Git-backed before the boot drive swap. Gluetun gives qBittorrent proper VPN routing.

Commit compose files to the homelab GitHub repo
- Create stacks/arrstack/docker-compose.yml
- Confirm current state: Sonarr, Radarr, Prowlarr, Seerr, qBittorrent, MediaBot
- Remove deprecated version: key if still present
- Set TZ=Australia/Adelaide across all services
Add gluetun to the compose stack
- Add gluetun service with ProtonVPN WireGuard config
- Move qBittorrent to network_mode: "service:gluetun"
- Expose qBittorrent ports (8080, 6881) on the gluetun service
- Set FIREWALL_OUTBOUND_SUBNETS=192.168.1.0/24 so Sonarr/Radarr can still reach qBittorrent's API over LAN
- Ensure this subnet does not overlap with the ProtonVPN WireGuard tunnel address range (gluetun will reject the config if it does)
- Generate ProtonVPN WireGuard private key via the ProtonVPN website (select P2P-capable Australian server with NAT-PMP)
Test VPN routing
- docker exec qbittorrent curl ifconfig.me — should return a ProtonVPN IP, not your home IP
- Verify Sonarr/Radarr can still communicate with qBittorrent
Deploy Dockhand container
- Configure GitHub PAT for repo access
- Point at the stacks/arrstack/ directory
- Test: push a trivial compose change, confirm Dockhand picks it up
Remove Watchtower container
- Dockhand handles update management — Watchtower is redundant
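
The subnet constraint from the gluetun step can be checked before the stack is deployed. The following is a minimal sketch using Python's standard ipaddress module. The tunnel address 10.2.0.2/32 is an assumption for illustration only (take the real value from the Address line of the generated WireGuard config), and the function name is hypothetical.

```python
import ipaddress


def subnets_overlap(outbound_subnet: str, tunnel_address: str) -> bool:
    """Return True if the LAN subnet and the VPN tunnel range share any address."""
    lan = ipaddress.ip_network(outbound_subnet, strict=True)
    # strict=False accepts interface-style values such as 10.2.0.2/24
    # and normalises them to the containing network.
    tunnel = ipaddress.ip_network(tunnel_address, strict=False)
    return lan.overlaps(tunnel)


if __name__ == "__main__":
    lan = "192.168.1.0/24"   # FIREWALL_OUTBOUND_SUBNETS from the compose file
    tunnel = "10.2.0.2/32"   # assumed tunnel address, replace with your own
    print(f"{lan} overlaps {tunnel}: {subnets_overlap(lan, tunnel)}")
```

If the function returns True, change one of the two ranges before bringing the stack up.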

Phase 2 — PVE 9 Upgrade + RAIDZ Expansion (weeks 2–3)

Goal: Upgrade Proxmox VE from 8.4 to 9.x (Debian Bookworm → Trixie) to get OpenZFS 2.3, then expand the RAIDZ1 vdev with new drives. This must happen before Ansible codification so that roles capture the stable PVE 9 target state rather than the soon-to-be-replaced PVE 8 config.

Why before codification: Codifying PVE 8 state in Ansible and immediately rewriting it for PVE 9 is throwaway work. Upgrading first means Phase 3 captures reality from day one.

Warning

Enabling the ZFS raidz_expansion feature flag is a one-way operation. Once it is enabled, the pool cannot be imported on OpenZFS < 2.3, so there is no going back to PVE 8 after zpool upgrade.

See the PVE 9 Upgrade runbook for the full step-by-step procedure — and the Lessons from the April 2026 run appendix capturing 8 deviations found during execution. Key stages:

Pre-flight checks — completed 2026-04-19 (state bundle, pve8to9 --full green, BIOS X2APIC + I/OAT DMA enabled)
Upgrade — completed 2026-04-20: bookworm→trixie repos, apt full-upgrade, GRUB-pinned to kernel 6.14.11-6-pve (not PVE 9.1's 6.17 default, to keep Nvidia 550.x compatibility)
Post-upgrade verification — PVE 9.1.7 running, ZFS 2.4.1 userland / 2.3.4 kmod, Nvidia 550.163.01 via DKMS, Plex HW transcode confirmed ((hw) in dashboard) after fixing cgroup major drift 235→234 / 238→237
Stability soak — abbreviated. No regressions in the ~6h between upgrade and Phase 3 kickoff; stash at 91% full overrode the usual soak window.
RAIDZ expansion — completed 2026-04-20. zpool upgrade stash enabled raidz_expansion + 4 other features (irreversible). Drive 1 (sdg, scsi-35002538a97b1c620) attached at 13:59:59 ACST, reflow 2h08m at 1.73 GB/s. Drive 2 (sdh, scsi-35002538a97b19c40) attached after drive 1 auto-scrub, reflow 1h45m at 2.00 GB/s. All-SSD pool finished in hours, not days (runbook assumed HDD speed).
Post-expansion verification — pool 14.0T→21.0T, 91%→61% full, scrub repaired 0B with 0 errors, all 6 disks ONLINE, stash/plex-data quota intact
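
The post-expansion fill figure can be sanity-checked from the pre-expansion numbers, since the stored data stays the same while raw capacity grows. This is a rough sketch using the rounded values quoted above; it ignores how ZFS accounts for parity on blocks written before the expansion, so treat the result as an estimate. The function name is hypothetical.

```python
def fill_after_expansion(old_size_t: float, old_fill_pct: float, new_size_t: float) -> float:
    """Expected fill percentage once capacity grows and stored data stays constant."""
    used_t = old_size_t * old_fill_pct / 100.0
    return used_t / new_size_t * 100.0


if __name__ == "__main__":
    # 14.0T pool at 91% full, expanded to 21.0T
    pct = fill_after_expansion(14.0, 91.0, 21.0)
    print(f"expected fill after expansion: {pct:.1f}%")
```

The estimate lands within a point of the 61% reported after the expansion.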

Phase 3 — Codify the stack

Goal: Every piece of host-level configuration that matters is captured in Ansible roles so a proxfold rebuild in Phase 4 (or after a hardware failure) is a playbook run rather than hours of manual work.

Phase 3 breaks into four sub-phases, all of which landed on 2026-04-21.

3A. PVE host baseline — `proxmox` and `nut` roles

Completed 2026-04-21

Merged in homelab-ansible#3A. See the proxmox role and nut role pages for the full task inventory.

What landed:

proxmox role — deb822 repo management (no-sub enabled, enterprise + ceph disabled), kernel pin via proxmox-boot-tool (6.14.11-6-pve), nouveau blacklist, /etc/sysctl.conf → /etc/sysctl.d/ migration, stash pool import, nasbackup CIFS registration
nut role — Network UPS Tools server + monitor for CyberPower PR1500ERT2U (vendor 0764:0601), standalone mode, shutdown at 600s runtime-low
proxmox-host playbook wired: proxmox → common → security → zfs → nfs → nvidia → nut
Vault entries added: vault_nasbackup_password, vault_nut_admin_password, vault_nut_upsmon_password

3B. Plex data codification

Completed 2026-04-21

Merged in homelab-ansible#3B. Variables and structure documented on the plex role page.

What landed:

plex_data_zfs_dataset / plex_data_zfs_quota / plex_data_mount variables plumbed through plex host_vars
ZFS dataset stash/plex-data (100G quota) creation delegated to proxfold
LXC 100 mp1 mount point registered via lineinfile against /etc/pve/lxc/100.conf
Plex data symlink /var/lib/plexmediaserver/Library/Application Support/Plex Media Server/ → /stash/plex-data (fail-hard if real dir exists)
docker-restart handler gotcha documented (breaks network_mode: service:* chains — don't touch compose during a plex role run)

3C. Proxmox auto-install kit

Completed 2026-04-21

Merged in homelab-ansible#3C. Full operator procedure at the Proxfold rebuild runbook.

What landed:

rebuild/answer.toml.j2 (production — Samsung SSD disk filter, static IP) and rebuild/answer-test.toml.j2 (rehearsal — QEMU disk filter, DHCP)
rebuild/render.yml Ansible playbook that decrypts the vault and writes both answers; wrappers render-answer.sh + build-iso.sh
Rendered TOML + built ISOs gitignored (contain root password hash + SSH keys)
Nested-VM rehearsal validated template rendering, vault integration, disk filter fail-safe, and boot order. Installer initramfs finalisation fails inside nested PVE — documented as a known limitation (does not affect bare metal)
WSL control node bootstrap formalised as the DR cold-start path (CT104 doesn't exist until proxfold is rebuilt and its backup restored)

3D. Scheduled drift detection (done — 2026-04-21)

Kit and operator docs

Implementation kit lives in homelab-ansible/drift-detection/. Operator procedure: Drift Detection.

What landed:

Systemd timer on CT104 — daily at 04:00 ACST with ±5 min randomised delay and Persistent=true for missed-run catch-up
drift-check.sh wrapper — git pull --ff-only, ansible-playbook playbooks/site.yml --check --diff, parses the PLAY RECAP for changed / failed / unreachable totals, classifies outcomes
Dedicated #homelab-drift Discord webhook — amber embed for drift, red for failure, silent for clean runs (opt-in DRIFT_SUMMARY=1 for confirmation pings)
WSL fallback — same wrapper runs from WSL via env overrides, matching the dual-control pattern (CT104 is primary, WSL is DR cold-start)
Role hardening surfaced by the first live run: zfs + nvidia made check-mode safe (check_mode: false on health probes, and a "not ansible_check_mode" gate on systemd enables); the nvidia passthrough strip is now conditional on the managed block being absent, which killed a perennial false-positive; and the reload udev handler switched from command to shell so && actually parses
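
The recap parsing and classification described for drift-check.sh can be illustrated as follows. This is a Python sketch of the idea, not the wrapper's actual code (which is a shell script); the function names and the three outcome labels are assumptions chosen to match the drift / failure / clean behaviour described above.

```python
import re

# Matches one host line of Ansible's PLAY RECAP, e.g.
# "arrstack : ok=12 changed=0 unreachable=0 failed=0 skipped=3 ..."
RECAP_LINE = re.compile(
    r"^(?P<host>\S+)\s*:\s*ok=(?P<ok>\d+)\s+changed=(?P<changed>\d+)\s+"
    r"unreachable=(?P<unreachable>\d+)\s+failed=(?P<failed>\d+)"
)


def recap_totals(output: str) -> dict:
    """Sum changed / failed / unreachable across all hosts in the PLAY RECAP."""
    totals = {"changed": 0, "failed": 0, "unreachable": 0}
    in_recap = False
    for line in output.splitlines():
        if line.startswith("PLAY RECAP"):
            in_recap = True
            continue
        if not in_recap:
            continue
        match = RECAP_LINE.match(line.strip())
        if match:
            for key in totals:
                totals[key] += int(match.group(key))
    return totals


def classify(totals: dict) -> str:
    """Failure outranks drift; a run with no changes is clean."""
    if totals["failed"] or totals["unreachable"]:
        return "failure"
    if totals["changed"]:
        return "drift"
    return "clean"


if __name__ == "__main__":
    sample = (
        "PLAY RECAP *********************\n"
        "arrstack : ok=12 changed=0 unreachable=0 failed=0 skipped=3\n"
        "proxfold : ok=40 changed=2 unreachable=0 failed=0 skipped=5\n"
    )
    totals = recap_totals(sample)
    print(totals, classify(totals))
```

Because the playbook runs with --check, a non-zero changed count means a task would have changed something, which is what the sketch treats as drift.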

Decided during implementation (the "open questions" from the original 3D scope):

Cadence: daily, not weekly. Daily runs catch drift within 24 h, and because clean runs are silent, the only noise is a Discord embed when drift actually occurs
Notification target: dedicated #homelab-drift webhook, separate from the MediaBot channel; n8n deferred to Phase 5A and not needed here
Vault handling: no_log: true already set on vault-rendered NUT files; webhook only posts the PLAY RECAP (truncated to Discord's 1024-char field limit), no task-level diff
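
The notification decisions above (silent clean runs, amber for drift, red for failure, recap truncated to the 1024-character field limit) can be sketched as a payload builder. This is an illustrative Python sketch rather than the wrapper's implementation: the function names, the colour values, the embed title, and the summary flag standing in for DRIFT_SUMMARY=1 are all assumptions.

```python
import json
from typing import Optional

FIELD_LIMIT = 1024  # Discord's per-field value limit mentioned above
COLORS = {          # assumed colour values, pick your own
    "drift": 0xFFBF00,    # amber
    "failure": 0xFF0000,  # red
    "clean": 0x2ECC71,    # green, only used for opt-in summary pings
}


def truncate(text: str, limit: int = FIELD_LIMIT) -> str:
    """Cut text so it never exceeds the field limit, marking the cut."""
    if len(text) <= limit:
        return text
    return text[: limit - 3] + "..."


def build_payload(outcome: str, recap: str, summary: bool = False) -> Optional[dict]:
    """Return a webhook payload, or None when the run should stay silent."""
    if outcome == "clean" and not summary:
        return None
    return {
        "embeds": [
            {
                "title": f"Drift check: {outcome}",
                "color": COLORS[outcome],
                "fields": [{"name": "PLAY RECAP", "value": truncate(recap)}],
            }
        ]
    }


if __name__ == "__main__":
    payload = build_payload("drift", "proxfold : ok=40 changed=2 unreachable=0 failed=0")
    print(json.dumps(payload, indent=2))
```

Only the recap text goes into the payload, which keeps task-level diffs (and any secrets they might contain) out of Discord.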

Not done, deferred:

Failure quarantine / flap suppression — deferred until we actually see a flapping host; premature abstraction today
Dynamic fan curve for the R430 T400 — looked at during the 3D converge, concluded the BMC handles it fine in auto on this unit (see the Drift Detection page and the host_vars comment on proxfold's ipmi_fan_fix: false)

Phase 4 — Hardware upgrades (weeks 5–7)

Goal: UPS protection, full CPU/RAM capacity, and a clean boot drive on mirrored ZFS. The order is strict — each step protects or enables the next.

flowchart LR
    A[UPS install] --> B[CPU 2 + RAM]
    B --> C[Boot drive swap]
    C --> D[Ansible rebuild]

    style A fill:#FAECE7,stroke:#993C1D,color:#712B13
    style B fill:#FAECE7,stroke:#993C1D,color:#712B13
    style C fill:#FAECE7,stroke:#993C1D,color:#712B13
    style D fill:#EEEDFE,stroke:#534AB7,color:#3C3489

4A. UPS purchase + install

Why first: A power event during a boot drive swap or ZFS rebuild would be catastrophic. UPS goes in before any hardware is touched.

Completed 2026-04-19 (pulled forward from planned order)

The UPS was installed ahead of Phases 2 and 3 because the hardware arrived early and a safe install window was available. See UPS for the as-built configuration, USB IDs, NUT config, and battery transfer test results.

Purchase the CyberPower PR1500ERT2U — acquired from Scorptec for ~$1,179 AUD
- 1500VA/1500W, pure sine wave, 2U rackmount
- Active PFC compatible (required for R430 PSUs)
Install and connect
- Both R430 PSUs (dual 550W redundant) on battery-backed outlets via 2× IEC C13-to-C14 cables
- NAS and networking gear still on wall power — revisit as part of future rack work
Configure NUT on Proxmox
- nut 2.8.0 installed directly on the Proxmox host in standalone mode
- usbhid-ups driver matching 0764:0601 (CyberPower HID)
- battery.runtime.low raised to 600 s (from 300 s) for ZFS/VM shutdown buffer
- Battery transfer test 2026-04-19 — clean OL → OB → OL, ~2.5 min on battery, no guest disruption
- NUT monitoring integration with n8n deferred to Phase 5

4B. CPU 2 + RAM upgrade

Why now: The second CPU socket unlocks all 12 DIMM slots. Buying RAM in one lot gets a matched set and provides headroom for additional VMs.

Note

The R430 Hardware Upgrade runbook covers the full procedure including heatsink installation, CPU seating, and BIOS verification.

Source before starting:

1× Intel Xeon E5-2680 v4 (S-Spec: SR2N7) — match the existing socket 1 CPU
1× Dell heatsink P/N 02FKY9
1× 6th fan module (Dell P/N DNHNR or 79WM9) — Dell's minimum for dual-CPU is 5 fans; the recommended layout is 6. As-built proxfold has 5 populated bays + Fan6 empty (verified 2026-04-28), so this step adds the 6th fan to bring the chassis to the recommended layout

8× 32 GB DDR4-2400 PC4-19200R 2Rx4 ECC Registered RDIMM — single-vendor matched lot; 256 GB total. Populates 1 DIMM per channel per CPU (optimal 1 DPC electrical config — every channel active, no second-DIMM signal loading). Leaves 4 slots free for future expansion. SK Hynix is the recommended vendor since the chassis is already running Hynix sticks today (proxfold dmidecode 2026-04-28 shows 6× HMA41GR7-family, mixed AFR + MFR die — both work, no issues).

Acceptable part numbers (pick one vendor — don't mix across the kit):

| Vendor | Part number | Notes |
| --- | --- | --- |
| SK Hynix | `HMA84GR7AFR4N-UH` (A-die) or `HMA84GR7MFR4N-UH` (M-die) | 32 GB 2Rx4 PC4-19200R, CL17. Either die revision works; matches what's already in proxfold today. Recommended. |
| Samsung | `M393A4K40CB1-CRC` | 32 GB 2Rx4 PC4-19200R, CL17 |
| Micron | `MTA36ASF4G72PZ-2G3` (and `-2G3B1` / `-2G3A1` suffix variants) | 32 GB 2Rx4 PC4-19200R, CL17 |

Confirm the seller is selling a true matched 8-stick lot (single decommissioned server pull) and not 8 random sticks from different sources. Within a single matched kit, all sticks should share the same die revision.

Alternatives if 32 GB sticks aren't available

| Plan | Total | Trade-off |
| --- | --- | --- |
| 8× 32 GB / 256 GB (default) | 256 GB | Optimal 1 DPC config + most capacity + 4 slots free |
| 12× 16 GB / 192 GB | 192 GB | All slots filled. Slightly less bandwidth-clean (2 DPC on 4 of 8 channels). 16 GB part numbers: Samsung `M393A2G40EB1-CRC`, SK Hynix `HMA42GR7AFR4N-UH`, Micron `MTA36ASF2G72PZ-2G3`. Avoid Samsung `M393A2K40BB1-CRC`; that's the 1Rx4 single-rank variant. |
| 8× 16 GB / 128 GB | 128 GB | Lowest cost. Tight headroom: Wazuh + Phase 6 services + ARC squeeze the budget; ARC re-tune target drops to ~48 GB instead of ~128 GB. Workable but the 32 GB option is the better deal at typical price-per-GB ratios. |

Original spec was 12× 32 GB / 384 GB — re-scoped 2026-04-28 to right-size for actual fleet load and to dodge the 2026 DDR4 price spike.

2026 DDR4 RDIMM market context

DDR4 RDIMM pricing is elevated through 2026 because foundry capacity has been pulled toward HBM3E/DDR5 for AI workloads (Tom's Hardware DDR price tracker). Expect more variance than usual on eBay AU listings; watch the market for 1–2 weeks before committing to a kit. Don't panic-buy the first matched lot — and don't pay for DDR4-2666 sticks "for future-proofing" since the E5-2680 v4 caps at 2400 MT/s and 2666 sticks just downclock (Intel E5-2680 v4 product specs).

Steps:

Shut down R430 and disconnect power
Install 6th fan module (slot furthest from PSUs)
Install CPU 2 + heatsink
Remove all 6× existing 8 GB DDR4-2133 DIMMs (don't mix old + new — speed clocks down to slowest stick, and the chassis becomes asymmetric). Install the new RDIMM kit. Slot population:
- 8-stick (32 GB) plan — default: A1, A2, A3, A4 + B1, B2, B3, B4 (1 DIMM per channel per CPU; A5/A6/B5/B6 left empty)
- 12-stick (16 GB) alternative: all 12 slots A1–A6 + B1–B6
- Follow Dell population order for optimal memory interleaving (consult the R430 Owner's Manual or iDRAC lifecycle log if memory training errors appear after install)
Power on and verify in iDRAC
- Both CPUs recognised, correct stepping (SR2N7)
- Total RAM visible: 256 GB (8× 32 GB default) or 192 GB (12× 16 GB alternative), all DIMMs healthy
- No memory training errors in lifecycle log
Verify in Proxmox: lscpu shows 28C/56T, free -h shows ~252 GB (8× 32 GB) or ~189 GB (12× 16 GB) after kernel reservation
Re-tune ZFS ARC cap post-converge. Current cap is 14 GiB (changelog 2026-03 entry) — sized for the 48 GB-physical era. With 256 GB physical and a 21 T pool at 62% used, lift the cap toward ~half-RAM (≈ 128 GB) for materially better cache hit rates on media + PBS reads. Variable lives in the proxmox role (/etc/modprobe.d/zfs.conf); revisit after at least 24 h of stability soak with the new RAM, not on day 0. (For 12× 16 GB / 192 GB alternative: target ≈ 96 GB instead.)
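
The ARC re-tune target translates into a byte value for the zfs_arc_max module parameter. The sketch below shows the arithmetic for the half-of-RAM rule of thumb described above. It assumes the "GB" figures are meant as GiB, and the helper names are hypothetical; the real value is managed through the proxmox role.

```python
GIB = 1024 ** 3


def arc_max_bytes(physical_ram_gib: int, fraction: float = 0.5) -> int:
    """ARC cap in bytes for a given amount of physical RAM (default: half)."""
    return int(physical_ram_gib * GIB * fraction)


def modprobe_line(arc_bytes: int) -> str:
    """Render the option line as it would appear in /etc/modprobe.d/zfs.conf."""
    return f"options zfs zfs_arc_max={arc_bytes}"


if __name__ == "__main__":
    for label, ram_gib in (("8x 32 GB default", 256), ("12x 16 GB alternative", 192)):
        print(f"{label}: {modprobe_line(arc_max_bytes(ram_gib))}")
```

For comparison, the current 14 GiB cap corresponds to 15032385536 bytes.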

4C. Boot drive swap to mirrored ZFS

Why: Replace the single 128GB LVM boot drive with two 960GB SSDs in a ZFS mirror, providing OS-level redundancy.

Completed 2026-04-22

Executed via the Boot Drive Swap runbook + Proxfold rebuild runbook. Boot drive is now rpool ZFS mirror across Samsung SM843T + Intel DC S4500 (888G usable, 2% used after restore). See the "Lessons from the 2026-04-22 run" appendix at the bottom of the boot-drive-swap runbook for what bit during execution.

What landed:

Pre-swap backups — full vzdump of all VMs/containers to NAS (192.168.1.253) verified before power-down
Ansible playbooks committed and pushed prior to swap; fresh production ISO built via rebuild/build-iso.sh
answer.toml.j2 disk filter updated — used filter.ID_BUS="ata" (not array-form ID_MODEL — that was rejected by the installer validator); filter matches both 960GB SATA SSDs and the installer builds rpool across them as RAID1
Hardware swap — 128GB Samsung 840 PRO removed; Samsung SM843T 960GB + Intel DC S4500 960GB installed in Dell caddies in bays 2 & 3
Auto-install ran clean — ZFS mirror rpool created across both new drives, SSH/network up on 192.168.1.250, Ansible converge re-ran, stash pool re-imported with -f (new hostid), CIFS re-registered, CT100 + guest VMs restored from NAS
Post-swap verification — zpool status rpool both drives ONLINE, both drives populated in proxmox-boot-tool status, stale Boot0007 UEFI entry pointing at the defunct 840 PRO GPT UUID removed via efibootmgr -b 0007 -B
Followups codified — nvidia role LXC passthrough rewritten from blockinfile to per-line lineinfile (PVE's conf parser moves raw lxc.* keys to EOF on every rewrite, breaking marker semantics); proxmox role now disables ceph enterprise deb822 source and uses zpool import -f for fresh-hostid cases; ipmi_fan_fix: false path now self-healing

Phase 5 — Backup + monitoring (weeks 7–9)

Goal: Close the two real gaps left after Phase 4 — there's no scheduled backup and no infrastructure-wide health visibility beyond the 3D drift heartbeat. Everything in Phase 5 is additive; nothing blocks Phase 6.

Scope reset — 2026-04-23

The original Phase 5 scope (n8n + Ollama, vulnerability management) was written before Phase 3D landed its own Discord integration and before the 4C boot-drive swap exposed how thin the backup story actually is. Revised shape:

5A (new, priority) — Proxmox Backup Server, replacing the manual-only vzdump path
5B (new, priority) — Beszel + PVE 9 notification system + ZED webhook, one stack covering host metrics / backup status / zpool events
5C (de-scoped) — n8n as a lab/glue capability only; Ollama dropped (no GPU budget, no RAM headroom pre-4B, and the drift/ZFS/UPS workflows the original 5A called out are already covered elsewhere)
Re-scoped from "dropped" — unattended-upgrades landed 2026-04-25 as a standalone auto_updates role (wraps hifis.toolkit.unattended_upgrades). Security-only origins across the fleet, Proxmox repo added on proxfold + pbs, hypervisor kernels blocklisted, manual reboot with a /var/run/reboot-required Discord nag on proxfold. See auto_updates role page. The rest of the old 5B (full vulnerability management stack) remains out of scope
Not adopting — Grafana / Prometheus. Beszel's built-in historical metrics cover the use case at a fraction of the operational footprint

5A. Proxmox Backup Server

Completed 2026-04-23

PBS live as CT 105 (192.168.1.246, Debian 13 privileged, features=nesting=1). Datastore nas-primary on NFSv3 from the TS-269L, prune daily 03:00 (keep 7d/4w/6m), verify sun 04:00, GC mon 04:00. PVE storage registered on proxfold; daily 02:00 all-guest backup job pbs-daily active. First full backup + manual verify ran clean same day — ~95 GiB across 5 guests, ~11 min wall clock. See the pbs role page for the as-built task list and the gotchas captured during execution.

Why: /etc/cron.d/vzdump was empty post-4C — every backup on the NAS was a manual push. PBS gives scheduled incremental + deduplicated backups with verify jobs, and the dedup ratio typically runs 5:1+ on VM images. The existing CIFS nasbackup path stays available for manual vzdump fallback during the transition.

Shape: PBS LXC on proxfold, datastore on an NFS mount from the QNAP TS-269L at 192.168.1.253. The NAS can't run PBS itself (Atom D2701, EOL QTS 4.3.4) but it does export NFS, which is materially better-behaved than CIFS as a PBS datastore backend — no uid/gid remapping, no reported .chunks inode-collision issues.

QNAP firmware end-of-life

QTS 4.3.4 is the final firmware for the TS-269L and has been unpatched since 2020. Accepted as part of Phase 5A scoping — not a blocker, but a ticking replacement clock. See Accepted Risks — QNAP TS-269L firmware EOL for the full treatment.

Architecture decisions (locked in 2026-04-23):

Datastore on NAS, not local — DR requires a second failure domain. A datastore on stash would die with the pool if stash fails. The NAS is the only meaningful second domain in this environment.
NFS, not CIFS — PBS datastore on CIFS works but is fragile (uid/gid remap, inode collisions on some NAS firmwares). NFSv3 on the QNAP is the well-trodden path.
Not on rpool — the whole point of the 4C boot-mirror design was isolating OS from data; backups belong on the data path.
PBS as LXC, not VM — standalone service, no Docker needed, fits the established LXC-for-standalone / VM-for-Docker pattern.

Steps:

What landed:

5B. Server notification stack

Completed 2026-04-24

Three notification paths live into the shared #homelab-ops Discord channel: Beszel (CT 106, 192.168.1.247) scraping 6 agents (proxfold/arrstack/nginx/plex/pbs/control), ZED webhook on proxfold for zpool events, and a PVE 9 webhook endpoint with match-all matcher for backup/replication/node events. Install delegated to upstream scripts for hub + agents. See the beszel role page and the zfs role page for the as-built tasks.

Why: Before 5B the only Discord signal from the homelab was the 3D drift heartbeat, with no visibility into backup job results, zpool degradation, SMART health, temperature, or guest resource state. 5B closes all of those gaps in one stack.

Shape: three independent notification paths, all landing in a single #homelab-ops Discord channel (separate from #homelab-drift):

Beszel — host/container metrics, historical data, threshold alerts
PVE 9 built-in notification target — backup job results, replication, node up/down, update availability
ZED webhook — zpool state changes, scrub completion, vdev removal

Decisions (locked in 2026-04-23):

Beszel not Netdata / Prometheus / Grafana — ~10MB per agent vs 200-500MB for Netdata, SQLite historical store is enough for a 4-host homelab, no dashboard engineering needed
Beszel hub as dedicated LXC, not on arrstack — arrstack stays dedicated to the media compose; hub runs the native binary + systemd (no Docker, no nested-virt), matches the nut role shape
Beszel agents as systemd binaries, not Docker — one agent per host (proxfold, arrstack, nginx, plex, pbs, optionally control)
Uptime Kuma deferred — no external HTTP endpoints being tracked yet

What landed:

Beszel hub LXC — CT 106, Debian 13 (not 12 — systemd 257 in trixie), unprivileged, features=nesting=1, 1 vCPU / 1 GB / 4 GB on local-zfs, 192.168.1.247. Install delegated to get.beszel.dev/hub; cached installer at /root/install-beszel-hub.sh. Hub listens on plain HTTP :8090 (TLS terminates at future nginx reverse proxy).
Beszel agents — installer at get.beszel.dev runs with -k <hub-pubkey> -p 45876 to bake env into the systemd unit. Deployed to proxfold (with beszel_agent_enable_smart: true for smartmontools + disk-group SMART access), arrstack, nginx, plex, pbs, control. Agents registered manually in the hub UI.
Hub pubkey capture — derived via ssh-keygen -y -f /opt/beszel/beszel_data/id_ed25519 (Beszel doesn't write a separate .pub file); vaulted as vault_beszel_hub_pubkey.
Alert rules — left UI-owned per the Phase 5 scope reset decision; not codified in Ansible.
PVE 9 notification target: webhook endpoint discord-ops + match-all matcher ops-all, codified via roles/proxmox/tasks/notifications.yml. The body template uses Handlebars {{ escape ... }} with severity-based color conditionals and is stored in roles/proxmox/files/discord-notification-body.json so Ansible's Jinja doesn't try to render it. Critical gotcha: pvesh expects the --body, --header value, and --secret value as base64-encoded strings (per the schema), not raw. Raw JSON is accepted and stored without complaint but fails at delivery with "could not decode base64 value"; pass values through the | b64encode filter. The default mail-to-root matcher is left intact, so events fan out to both.
ZED webhook on proxfold — zed_webhook_enabled: true in proxfold host_vars triggers the zfs role's ZED tasks: installs curl, renders /etc/zfs/zed.d/discord.sh, and symlinks statechange-discord.sh, scrub_finish-discord.sh, resilver_finish-discord.sh onto it. Webhook URL comes from vault_discord_webhook_homelab_ops. Smoke-tested by invoking the script directly with ZEVENT_POOL=stash ZEVENT_SUBCLASS=statechange; embed delivered clean.
CT 104 (control) added to inventory — previously not an Ansible-managed host, now under lxc_containers with its own common + security + beszel_agent playbook. Adds the drift runner itself to observability.
WSL → plex SSH gap fixed — WSL pubkey appended to CT 100's authorized_keys so site.yml runs cleanly from either control or WSL. Pre-existing since the 4C rebuild.
Notification routing summary doc — deferred; content lives in the role pages for now and the header overview in changelog.
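
The base64 gotcha noted for the PVE 9 notification target is easy to reproduce outside Ansible. The sketch below shows the encoding step that the b64encode filter performs, using Python's standard library. The JSON body is a placeholder, not the real template from the role, and the helper name is hypothetical.

```python
import base64
import json


def encode_for_pvesh(value: str) -> str:
    """Base64-encode a raw string the way Ansible's b64encode filter does."""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


if __name__ == "__main__":
    # Placeholder body for illustration; the real template lives in the role's files/.
    raw_body = json.dumps({"content": "test notification"})
    encoded = encode_for_pvesh(raw_body)
    print(encoded)
    # Decoding returns the original JSON, which is what PVE does at delivery time.
    print(base64.b64decode(encoded).decode("utf-8"))
```

Decoding a stored value this way is also a quick check of whether an existing target was saved raw or encoded.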

5C. n8n — lab automation (de-scoped)

Completed 2026-04-28

n8n live as a Docker stack on a dedicated host VM. Reverse-proxied at n8n.rampancy.cloud. SQLite-backed (Postgres deferred until execution volume warrants). Captured by pbs-daily. See the n8n service page for the full picture and execution-time lessons.

Why: Lab/glue capability for cross-service workflows that don't belong in Ansible. Explicitly not replacing anything in 5A/5B/3D.

What changed from the original scope:

Pivoted from npm-on-LXC to Docker-on-VM mid-execution. Original plan was native Node.js install on an LXC. Modern n8n (v2.17+) bundles a heavy AI SDK ecosystem that OOMs a 4 GB LXC during install and won't compile against Debian trixie's apt-managed Node.js (isolated-vm ABI mismatch with Debian's V8 patches). Pivoted to the n8n team's recommended Docker path on a fresh VM, using Dockhand for git-managed deploys via the hawser agent on the host.
Ollama dropped (no GPU budget; lab-scope already covered by Anthropic API in workflows if/when needed).
Infrastructure workflows the original called out (ZFS health, drift detection, NUT) are now owned by ZED webhook (5B), the 3D heartbeat, and Beszel/NUT respectively. n8n is for user-facing glue only.

What landed:

n8n VM — Proxmox VM 108, Debian 13 trixie genericcloud qcow2, cloud-init seeded, 192.168.1.248, 2 vCPU / 8 GB / 32 GB on local-zfs
Roles applied — common, security, auto_updates, docker, beszel_agent, hawser
hawser role added — Dockhand's remote-host agent codified for the first time. Per-host vault token (vault_hawser_token_<host>), RW socket mount, named volume for stack cache, REQUEST_TIMEOUT=600s default. See hawser role page.
n8n stack — stacks/n8n/docker-compose.yml in the homelab-ansible repo, deployed via Dockhand → Hawser. SQLite under the n8n_data named volume. See n8n service page.
Reverse proxy — n8n.rampancy.cloud → 192.168.1.248:5678 via the nginx VM
Beszel — agent registered, host visible in the Beszel UI
PBS — first ad-hoc snapshot 2026-04-28 (1m 21s, 32 GiB transferred, 86% deduped); pbs-daily covers ongoing
Site-wide drift — --check --diff from CT104 reported changed=0 failed=0 unreachable=0 across all 9 hosts post-deploy

Execution-time lessons (full detail in n8n service page lessons section):

Modern n8n's npm install peaks past 4 GB RAM and won't compile against Debian's apt Node — Docker is the right path in 2026.
Docker 29's default storage driver (containerd-snapshotter / overlayfs) breaks pulls on certain layer patterns — pinned to legacy overlay2 per-host via docker_daemon_config. The new default only applies to fresh Docker 29.x installs; arrstack + nginx upgraded across the 28→29 boundary so they kept their existing overlay2 and were never bitten.
Hawser's default REQUEST_TIMEOUT=30s is too short for any real image pull — bumped to 600s as a role default. This was the actual blocker on multiple deploy attempts.
Dockhand v1.0.18 (what was running on arrstack) has a separate timeout bug (Finsys/dockhand#587); upgraded to v1.0.27 during the deploy.

Deferred:

First workflows — left for user exploration per Phase 5C's lab-scope intent
Postgres companion — only if execution volume crosses ~5–10k/day or DB ~4–5 GiB
Webhook/OAuth env vars in compose — N8N_HOST / N8N_PROTOCOL / WEBHOOK_URL are commented; uncomment when first workflow needs public callback URLs
~~nginx Hawser codification~~ — superseded by Phase 5D. nginx VM is retiring; new edge LXC runs Caddy (no Docker, no Hawser). Codification gap closed by the caddy role.
Internal wss:// for Hawser → Dockhand — currently plain ws:// on LAN; tighten when the internal-traffic TLS story lands
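
If the commented callback variables are enabled later, all three derive from the one public hostname. A small sketch, assuming n8n stays behind the reverse proxy on HTTPS; the helper name is hypothetical, and only the three variable names come from the compose file:

```python
def n8n_public_env(public_host, scheme="https"):
    """Build the callback-related env vars from a single public hostname."""
    return {
        "N8N_HOST": public_host,
        "N8N_PROTOCOL": scheme,
        # Base URL that n8n advertises to external services for webhooks.
        "WEBHOOK_URL": f"{scheme}://{public_host}/",
    }


def as_compose_lines(env):
    """Render the mapping as KEY=value lines for an environment block."""
    return [f"{key}={value}" for key, value in env.items()]
```

Deriving the values from one input keeps the three settings from drifting apart when the hostname changes.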

Phase 5D — Edge LXC + Caddy migration (NPM → Caddy)

Completed 2026-05-02

Caddy live on edge LXC (CT 107, 192.168.1.244) replacing hand-clicked NPM on VM 102. Single Let's Encrypt wildcard cert *.rampancy.cloud issued via DNS-01 against Cloudflare. Four hosts migrated. NPM container stopped — VM 102 in soak before destroy. See edge-cutover runbook.

Why: NPM was the only piece of homelab infra still managed by web UI rather than Git. Hand-clicked rules, no IaC story, upstream stalled (last release Feb 2025), CVEs in the LE-cert add flow. Caddy, as a single static binary, closes the codification gap, picks up automatic HTTPS renewal, and supports a single wildcard cert via DNS-01.

What landed:

edge LXC — CT 107, Debian 13 (trixie), unprivileged, features=nesting=1, 1 vCPU / 1 GB / 4 GB on local-zfs, 192.168.1.244 (mirrors CT 106 Beszel pattern)
caddy role — single binary built via xcaddy + caddy-dns/cloudflare, version-pinned in roles/caddy/files/caddy-2.11.2, deployed to /usr/local/bin/caddy, systemd unit with AmbientCapabilities=CAP_NET_BIND_SERVICE, Caddyfile templated from inventory, ACME DNS-01 against CF via vaulted scoped token. See caddy role page.
Roles applied — common, security, caddy, beszel_agent. No docker, no hawser — Caddy is config-as-code, not compose-as-code.
Four hosts migrated to Caddy: requests.rampancy.cloud (Overseerr), dash.rampancy.cloud (Beszel), n8n.rampancy.cloud (n8n), kosync.rampancy.cloud (korrosync). All return the same response codes as before, served by the LE wildcard.
UDM port-forward swap — TCP 80/443 → 192.168.1.244 (was 192.168.1.249)
NPM container stopped on VM 102 (docker stop nginx_proxy_manager-app-1); VM left running through soak

Cloudflare orange-cloud — attempted then reverted:

CF SSL/TLS encryption mode set to Full (strict) at zone level
Four CNAMEs flipped to orange via API
~~Reverted same day~~ — CF Universal SSL is disabled at the zone (no edge cert auto-issued for the new orange hostnames). Without a paid Advanced Certificate Manager order or Universal SSL re-enable, every TLS handshake to those hostnames at the CF edge failed (no SNI match). All four flipped back to gray; traffic resumed via direct WAN → UDM → Caddy. Edge security gap captured as an accepted risk until Phase 7D CrowdSec lands.

Execution-time lessons (full detail in edge-cutover runbook lessons):

Caddy's upstream systemd unit ships with the --environ flag — it prints all env vars (including secret tokens) to journalctl on startup. Removed in our rendered unit.
caddy validate doesn't see EnvironmentFile — ad-hoc validation needs the env var passed via Ansible's environment: task param. systemd's EnvironmentFile= only applies to processes systemd starts for the unit (ExecStart and friends), not to commands run by hand or from an Ansible task.
Caddyfile canonical fmt uses tabs — gofmt-style. Template indented with tabs to match.
CF token scopes are granular — Zone:Read + DNS:Edit are needed for ACME DNS-01; Zone Settings:Read is a separate scope (token cannot inspect SSL mode without it).
CF Universal SSL provisioning is per-hostname, not zone-wide — flipping orange on a hostname without an existing edge cert results in immediate handshake failure. Always verify Universal SSL is enabled before flipping orange.
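
The first two lessons can be sketched as two small helpers: one drops `--environ` from an ExecStart line, the other reads a systemd-style environment file so an ad-hoc `caddy validate` can be given the same variables. This is an illustration under simple assumptions (plain `KEY=value` lines, no multi-line values), not the caddy role's actual templating; the variable name `CF_API_TOKEN` in the usage note is hypothetical.

```python
import shlex


def strip_environ_flag(exec_start):
    """Return the command line with the --environ argument removed."""
    args = [arg for arg in shlex.split(exec_start) if arg != "--environ"]
    return shlex.join(args)


def load_env_file(text):
    """Parse KEY=value lines, skipping blanks and # comments."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"')
    return env
```

For an ad-hoc check, the parsed mapping could be merged into the environment of a `caddy validate --config <path>` subprocess, which mirrors what the Ansible `environment:` parameter does for the task.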

Deferred:

VM 102 decom — done 2026-05-03. Stopped, pre-decom vzdump on nasbackup (1.65 GB compressed), qm destroy 102 --purge clean. Ansible inventory retired (host_vars/nginx.yml, playbooks/nginx.yml, stacks/nginx/ removed).
CrowdSec on edge — covered separately as Phase 7D
Caddy access logs to file — currently stdout/journalctl only; add log directive when CrowdSec phase lands (bouncer reads structured access logs from a file)
trusted_proxies config — only relevant if/when CF orange is re-enabled (Universal SSL would need to land first, or per-host Advanced Certificate Manager)

Phase 5E — Host-level file backup (proxfold → PBS)

Completed 2026-05-06

Daily file-level backup of proxfold host config (/etc, /root, /var/lib/pve-cluster) to PBS namespace host/proxfold via proxmox-backup-client + systemd timer. First snapshot 7.4 MB in 0.42s; idempotent re-run reports changed=0. Three execution-time bugs caught (CLI --output-format rejection, missing namespace subcommand, ACL-on-token bug); the third revealed a latent bug in the existing pbs role which was patched in the same cycle. Full lessons in backup-restore runbook.

Why: pbs-daily (Phase 5A) backs up guests via vzdump; the PVE host itself isn't included. A full-loss scenario not covered by the 4C ZFS boot mirror (config corruption, fat-fingered rm, full reinstall) currently means rebuilding from arrstack docs. Host configs total only a few MB (the first snapshot was 7.4 MB), so chunked, deduplicated backups cost almost nothing.

Out of scope (deliberate):

Off-host replication of the PBS datastore. Single-failure-domain on the QNAP TS-269L is documented and accepted — no viable second target exists in the current environment. Not re-evaluated here.
Host backups for other PVE hosts. Single-host fleet; design uses a host/<hostname> namespace so a second host could opt in cleanly later.

Architecture decisions (locked in 2026-05-06):

Extend the proxmox role. New tasks file host_backup.yml and four templates (env, wrapper script, .service, .timer) — no new role, mirrors the existing pbs_client.yml split.
Separate PBS user + token. New host-backup@pbs!proxfold distinct from the guest-backup token, blast-radius separation. Same fingerprint reused.
Namespace host/proxfold. Cleanly separates from VM/CT snapshot listings.
Schedule 02:30. Vzdump kickoff is 02:00, recent runs complete in ~5 min (PVE task log 2026-05). 02:30 has >20 min headroom and stays ahead of the 03:00 PBS prune.
Crypt-mode none. Chunks live on a private NAS already; client-side encryption adds a key-loss footgun for negligible gain on host config files.
Retention — namespace-scoped prune-job, longer than the global pbs-daily retention since the dataset is tiny and per-snapshot deduplication is high. Patterned after the Proxmox PBS docs prune example, tuned for homelab: keep-daily 14, keep-weekly 8, keep-monthly 12, keep-yearly 2.
Manual one-shot bootstrap on PBS. Mirrors the Phase 5A token-gen pattern. Adding multi-user / multi-namespace abstraction to the pbs role for one extra user wasn't worth it. Procedure codified in the backup-restore runbook.
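
The 02:30 slot can be checked with a little clock arithmetic. A minimal sketch using the times stated above (vzdump at 02:00 taking about 5 minutes, PBS prune at 03:00); the function name is illustrative:

```python
from datetime import datetime, timedelta


def headroom_minutes(vzdump_start, vzdump_minutes, host_backup_start, prune_start):
    """Return (gap after vzdump ends, gap before prune starts) in minutes."""
    fmt = "%H:%M"
    vzdump_end = datetime.strptime(vzdump_start, fmt) + timedelta(minutes=vzdump_minutes)
    host_backup = datetime.strptime(host_backup_start, fmt)
    prune = datetime.strptime(prune_start, fmt)
    gap_before = (host_backup - vzdump_end).total_seconds() / 60
    gap_after = (prune - host_backup).total_seconds() / 60
    return gap_before, gap_after


print(headroom_minutes("02:00", 5, "02:30", "03:00"))
```

With those inputs the host backup starts 25 minutes after vzdump typically finishes and 30 minutes before the prune, which matches the ">20 min headroom" claim.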

Code landed:

roles/proxmox/tasks/host_backup.yml — installs client + creds + script + unit + timer
roles/proxmox/templates/pbs-host-backup.{env,sh,service,timer}.j2
roles/proxmox/tasks/main.yml — gated include (pbs_host_backup is defined and vault_pbs_host_token_secret is defined)
roles/proxmox/defaults/main.yml — example block + bootstrap pointer
inventory/host_vars/proxfold.yml — pbs_host_backup block

Bootstrap executed:

PBS-side one-shot on CT 105 — host-backup@pbs user + host-backup@pbs!proxfold token + host/proxfold namespace + DatastoreBackup ACL on both authids + namespace-scoped prune-job (14d/8w/12m/2y, 03:15 daily). Token captured to PBS tmpfs, SCP'd to control tmpfs, shredded at both ends.
Vault append — vault_pbs_host_token_secret written via decrypt-to-tmpfs / append / re-encrypt pattern; round-trip verified, source token file shredded.
First playbook run — --limit proxfold --tags host_backup reported changed=5 (env file, script, service, timer, timer enable).
Smoke test — systemctl start pbs-host-backup.service succeeded; snapshot host/proxfold/2026-05-06T02:01:48Z (7.4 MB, 3 pxar archives) registered on PBS.
Idempotency check — full site.yml --check --diff --limit proxfold reports ok=84 changed=0 failed=0.
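
To get a feel for what the 14d/8w/12m/2y policy retains, here is a simplified model of how the keep-* options stack. It assumes at most one snapshot per day and is only a sketch of the documented behaviour (each option keeps the newest snapshot per period and ignores periods already covered by an earlier option); it is not PBS's actual implementation.

```python
from datetime import date, timedelta


def prune(snapshot_dates, keep_daily=14, keep_weekly=8, keep_monthly=12, keep_yearly=2):
    """Return the set of snapshot dates that survive the retention policy."""
    ordered = sorted(set(snapshot_dates), reverse=True)  # newest first
    rules = [
        (keep_daily, lambda d: d.isoformat()),
        (keep_weekly, lambda d: tuple(d.isocalendar())[:2]),  # (ISO year, week)
        (keep_monthly, lambda d: (d.year, d.month)),
        (keep_yearly, lambda d: d.year),
    ]
    kept, decided = set(), set()
    for limit, period_of in rules:
        covered = {period_of(d) for d in kept}  # periods earlier rules already hold
        chosen = set()
        for snap in ordered:
            if snap in decided:
                continue
            period = period_of(snap)
            if period in covered:
                continue
            if period in chosen:
                decided.add(snap)  # older duplicate within a chosen period
                continue
            if len(chosen) >= limit:
                break
            chosen.add(period)
            kept.add(snap)
            decided.add(snap)
    return kept


daily = [date(2026, 5, 6) - timedelta(days=i) for i in range(400)]
print(len(prune(daily)))
```

Whatever the history looks like, the policy can never hold more than 14 + 8 + 12 + 2 = 36 snapshots in the namespace, which is why the longer retention is cheap for a dataset this small.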

Open follow-up:

Patch the pbs role to grant ACL on both user AND token auth-ids (closed 2026-05-06, same cycle). New Ensure datastore ACLs for PBS client TOKEN auth-id task in roles/pbs/tasks/main.yml mirrors the existing user-grant loop, gated on vault_pbs_token_id is defined. New default pbs_client_token_name: "pve". Verified idempotent against live PBS (manual 5A grant matches what the task produces — changed=0 on apply). A fresh PBS rebuild via the role no longer needs the manual second grant.

Phase 6 — New services (weeks 9+)

Greenfield — no dependencies between items. Work on whichever is most interesting at the time.

6A. Forgejo (self-hosted git, GitHub-mirrored)

Stand up a dedicated LXC running Forgejo, front it with the existing Caddy edge, migrate four GitHub repos with push-mirrors back to GitHub. Forgejo becomes the source of truth; GitHub stays as a private mirror so external Actions (e.g. meat-helmet's scheduled cron) keep firing.

Sub-stages 6A.1 (LXC stand-up), 6A.2 (edge integration + CF DNS token rotation), 6A.3 (repo migrations + push mirrors), 6A.4 (close-out + docs sweep). Full procedure in the Forgejo Setup runbook.

Phase 6A complete — 2026-05-05

All four sub-stages executed across two days. Forgejo 11.0.13 live at git.rampancy.cloud, all 4 repos imported with sync_on_commit push-mirrors to GitHub, local origins flipped, CF DNS token rotation closed the 7D leak follow-up. End-to-end mirror chain validated by the closing docs commit (Forgejo push → GitHub mirror within ~5s).

6A.1 — CT 109 created on rpool; forgejo-sqlite installed via Codeberg APT; service running on LAN at :3000 (executed 2026-05-04)
6A.2 — git.rampancy.cloud live behind Caddy on CT 107; CF DNS token Rolled + API-validated (closes 7D follow-up); Forgejo ROOT_URL flipped to https + DISABLE_SSH=true (executed 2026-05-05)
6A.3 — arrstack, homelab-ansible, mediabot, meat-helmet imported with full history (Forgejo migrate API, 2026-05-05); per-repo push-mirror live (fine-grained PATs, sync_on_commit: true); local remotes flipped on WSL with github retained as fallback
6A.4 — Phase 6A close-out (2026-05-05): docs sweep across role/service/runbook + homelab-ansible README refresh; Discord push-event webhook per repo deferred as low-signal for single-operator setup (procedure preserved in runbook for future collaboration trigger)

6B. Home Assistant

Stand up HAOS as a sealed Proxmox VM, integrate the Hue V2 bridge + Tapo P110M (via Matter, LAN-local) + Bambu A1 Mini (via HACS + ha-bambulab), front via Caddy edge at home.rampancy.cloud, build dashboard + automations.

Sub-stages 6B.1 (VM stand-up), 6B.2 (core integrations), 6B.3 (edge integration), 6B.4 (dashboard + automations + close-out). Full procedure in the Home Assistant Setup runbook.

HAOS is opaque to the homelab Ansible baseline

HAOS is a sealed Buildroot appliance — no SSH, no apt, no common / security / auto_updates / beszel_agent. Drift detection is blind to it; the inventory entry exists only as a documentation anchor and to drive the Caddy edge route. Compensating controls: PBS daily VM-layer backup (more comprehensive than HAOS's native backup), HA Supervisor's own update mechanism for HA Core + add-ons, HA's native Discord integration into #homelab-ops. To be recorded as an accepted risk at close-out. Wazuh agent on HAOS is deferred to Phase 7B — BeardedTinker's HAOS rule pack works API-side without an in-VM agent.

6B.1 — VM stand-up (2026-05-23): HAOS 17.3 (haos_ova-17.3.qcow2.xz, release 2026-05-06 — pin bumped from scaffold's 17.2 since 17.3 was current on the day of execution) imported to local-zfs as VM 110 (hass, 192.168.1.241). Q35 / OVMF / pre-enrolled-keys=0 (the boot-loop gotcha) / EFI disk on local-zfs / virtio-scsi-single / 2 vCPU / 4 GB / 32 GB. First boot via HA onboarding wizard at http://192.168.1.241:8123; static IP set via Settings → System → Network. Boot-to-HTTP-200 was ~90 s (not the scaffold's ~5 min). PBS daily pickup at 02:00 ACST 2026-05-24 — spot-check pending.
6B.2 — Core integrations: HACS bootstrap via the Studio Code Server add-on shell, accepted as a managed dependency. Tapo P110M via tplink integration primary, Matter as fallback (revised from scaffold's Matter-primary — community + upstream HA-core issues show P110M Matter pairing flaky and energy-endpoint enumeration incomplete; python-kasa now handles KLAP locally without cloud creds). Hue V2 bridge auto-discovered → push event stream confirmed. ha-bambulab installed via HACS; A1 Mini flipped to LAN Mode + Developer Mode to permit MQTT writes under firmware ≥ 01.05. Deferred — user driving hands-on.
6B.3 — Edge integration (2026-05-23): home.rampancy.cloud vhost added to host_vars/edge.yml (homelab-ansible commit b19f662); CrowdSec coverage automatic via the existing wildcard handler. HA's http: block in configuration.yaml set with use_x_forwarded_for: true + trusted_proxies: 192.168.1.244. Gotcha: after editing configuration.yaml, verify HA Core has actually restarted (docker inspect homeassistant --format '{{.State.StartedAt}}' newer than config mtime) before applying Caddy — otherwise the trusted-proxies block stays unloaded and every X-Forwarded-For-bearing request gets HTTP 400, easy to misread as edge config error. End-to-end validated from cellular.
6B.4 — Dashboard + automations + close-out: print-completion → #homelab-ops Discord webhook (rest notify platform with unified data: block — drop scaffold's deprecated data_template: split syntax), sunset → Hue lights blueprint, P110M energy draw surfaced on the printer dashboard tile. Remaining docs sweep: hosts/hass/index.md, services/home-assistant.md, mkdocs.yml nav, reference/accepted-risks.md HAOS-opacity entry.
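
The 6B.3 gotcha is a timestamp comparison: HA Core's container start time must be newer than the last edit to configuration.yaml. A minimal sketch of that check, assuming the `StartedAt` string from `docker inspect` and the file's mtime have already been collected; the function names are illustrative:

```python
from datetime import datetime, timezone


def parse_docker_timestamp(value):
    """Parse Docker's UTC timestamp, which may carry nanoseconds and a Z suffix."""
    body = value.rstrip("Z")
    if "." in body:
        seconds, fraction = body.split(".", 1)
        # datetime only holds microseconds, so trim or pad to 6 digits.
        body = f"{seconds}.{fraction[:6].ljust(6, '0')}"
    return datetime.fromisoformat(body).replace(tzinfo=timezone.utc)


def restarted_after_edit(started_at, config_mtime):
    """True when the container started after the config file was last modified."""
    return parse_docker_timestamp(started_at).timestamp() > config_mtime
```

Only when this returns True is the trusted-proxies block certain to be loaded, so it is a reasonable precondition before applying the Caddy route.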

6C. Obico (3D print failure detection)

On the A1 Mini, enable LAN Mode (Settings → Network) and then Developer Mode — these are two separate toggles; both are required for third-party integrations
Connect a USB webcam for AI detection (A1 Mini built-in camera stream is not suitable for Obico's failure detection model)
Deploy self-hosted Obico server in Docker
- Obico now supports direct Bambu connection without OctoPrint as middleware — use the native Bambu integration in recent Obico releases
Configure Discord/Telegram notifications

6D. Music acquisition pipeline

Phase 6D complete — 2026-05-16

Lidarr (hotio plugins branch image) + Tubifarry plugin + slskd (via gluetun) + beets landed on arrstack VM 101. PlexAmp/Plex stay as the streaming layer (no Navidrome — explicit decision; pipeline scoped to acquisition only). Music library shared back read-only on Soulseek (1,905 dirs / 15,986 files) with 5-slot / 1 MB/s upload cap, polite per-peer throttle (3 grabs/day per peer, max 5 queued/peer, min 5 files for "real album" filter). Smoke test artist (Gotye) downloaded Making Mirrors Deluxe (22 FLACs) end-to-end. The bulk-import-from-root-folder auto-triggered when the root folder was added — 877 albums / 10,691 tracks registered in one pass. Lessons in music-acquisition-bringup runbook.

Music library was already on ZFS at /stash/rodneystash/Music (Plex Tunes indexes it via the /mnt/plex/Music symlink managed by the plex role)
Lidarr on ghcr.io/hotio/lidarr:pr-plugins (mainline has no plugin system) + Tubifarry plugin installed via Lidarr UI from https://github.com/TypNull/Tubifarry
slskd 0.25 routed through existing gluetun container (network_mode: service:gluetun), native VPN integration via SLSKD_VPN=true + SLSKD_VPN_GLUETUN_* env vars
Gluetun control API auth wired (X-API-Key role for slskd, config TOML at /opt/mediaserver/gluetun/auth/config.toml)
Soulseek citizenship: shared /music (read-only) + /downloads, 5 upload slots, 1 MB/s cap, polite profile blurb signalling automated library
beets installed (on-demand sanity tagger, no Lidarr-pipeline integration)
Lidarr Plex Connect notification wired via Plex.tv OAuth (auto-rescan on import)
Existing 612-artist library imported (877 albums, 10,691 tracks) — auto-triggered on root folder add
Gotye Making Mirrors (Deluxe Edition, 22 FLACs) downloaded + imported + visible in Plex Tunes

6E. Matrix server

Goal: Stand up a closed, federated Matrix homeserver that replicates Discord's day-to-day chat experience — text rooms, threads, spaces, and group voice/video — for me plus a small circle of family and friends. Element X / Element Web are the clients. Mobile push, Discord bridging, and OIDC SSO are explicitly deferred.

Scoped 2026-05-21 — not started

Scoping replaces the original four-line stub ("Synapse + PostgreSQL + Caddy via Docker Compose"). Two pivots vs the original: Synapse → Tuwunel (Rust, embedded RocksDB, much lighter at idle) and LXC → VM (spantaleev's playbook explicitly warns against LXC). See the Matrix setup runbook for the execution checklist.

Architecture — Docker stack on a new VM, fronted by the existing edge Caddy + CrowdSec:

| Component | Purpose | Notes |
|---|---|---|
| Tuwunel | Homeserver (text / rooms / spaces / threads / E2EE / federation) | Embedded RocksDB — no Postgres. Official conduwuit successor per the project's own README. |
| LiveKit Server | WebRTC SFU for group voice/video calls (MatrixRTC) | Single binary. Default media UDP range trimmed via ICE/UDP mux on 7882. |
| lk-jwt-service | Issues short-lived JWTs that authenticate Matrix clients into LiveKit | Tiny Go binary. |
| Traefik (bundled) | Internal HTTP router for the playbook's containers | Bound to `0.0.0.0:81` on VM 111; external Caddy on CT 107 terminates TLS. |
| matrix-static-files | Serves `.well-known/matrix/*` from `matrix.rampancy.cloud` | Apex `rampancy.cloud` well-known files come from Caddy directly. |

Deployment approach — spantaleev's matrix-docker-ansible-deploy vendored separately on CT 104 at ~/matrix-deploy/, parallel to homelab-ansible/. The integration surface (well-known apex, federation routing, MatrixRTC ↔ LiveKit JWT signing, WebSocket forwarding for SFU) is large enough that re-deriving it in a custom role would cost more than the external-playbook tax. Drift detection skips the matrix VM — config drift is checked manually via the playbook's --check mode when wanted, not via the nightly drift cron.

Cross-phase decisions:

Homeserver: Tuwunel (over Synapse). Rust, ~512 MiB at rest, embedded RocksDB, no Postgres dependency. Synapse is more featureful but eats 2–4 GiB and brings Postgres in tow — wrong fit until 4B RAM upgrade lands.
VM, not LXC. Docker-in-LXC can be made to work, but arrstack (VM 101) is already a VM to avoid it; MatrixRTC's UDP networking + multiple-container fan-out + AppArmor footprint are even less LXC-friendly than the arrstack pattern.
Federation via well-known delegation, not port 8448. Apex rampancy.cloud serves /.well-known/matrix/server pointing federation to matrix.rampancy.cloud:443. Keeps all inbound traffic on the existing edge Caddy + CrowdSec path; no new TCP port-forward at the UDM.
LiveKit UDP is unavoidable. First service in the homelab that needs router-level UDP forwarding. Ports: 7881/tcp + 7882/udp (ICE) + 3479/udp + 5350/tcp (TURN) + 30000-30020/udp (TURN relay range). Forwarded straight to VM 111 — bypassing the edge LXC, no Caddy/CrowdSec coverage on those ports.
Audience: closed. matrix_tuwunel_config_allow_registration: false plus a vault_matrix_registration_token for invite-style signups. Federation is enabled, so my rooms can include matrix.org users; my server just doesn't accept walk-ins.
Element Call frontend skipped. Element X (mobile) and Element Web both embed the RTC widget internally. Standalone call.rampancy.cloud deployment is unnecessary unless we want a clickable web-call landing page later.
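
The well-known delegation decision comes down to two small JSON documents served from the apex. A sketch of their shape, following the Matrix server-discovery format and the `org.matrix.msc4143.rtc_foci` key that Element Call reads; the LiveKit service URL used in the example call is a hypothetical placeholder, not the value deployed here:

```python
import json


def well_known_server(delegate_host, port=443):
    """Body for /.well-known/matrix/server: where federation should connect."""
    return {"m.server": f"{delegate_host}:{port}"}


def well_known_client(homeserver_url, livekit_service_url=None):
    """Body for /.well-known/matrix/client, optionally advertising MatrixRTC."""
    doc = {"m.homeserver": {"base_url": homeserver_url}}
    if livekit_service_url:
        doc["org.matrix.msc4143.rtc_foci"] = [
            {"type": "livekit", "livekit_service_url": livekit_service_url}
        ]
    return doc


print(json.dumps(well_known_server("matrix.rampancy.cloud")))
print(json.dumps(well_known_client(
    "https://matrix.rampancy.cloud",
    "https://matrix.example.invalid/livekit-jwt",  # hypothetical URL
), indent=2))
```

Because both documents live on the apex, federation and clients reach the homeserver over 443 without a dedicated 8448 port-forward.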

Resource footprint (VM 111, sized to match arrstack/n8n VM pattern):

| Slice | Target |
|---|---|
| RAM | 8 GiB (Tuwunel ~2, LiveKit ~1, everything else <1, OS+Docker ~1, headroom ~3) |
| Disk | 32 GiB on rpool (RocksDB growth is the unknown — Beszel disk alert will catch it) |
| vCPU | 4 |
| IP | 192.168.1.243 |
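
As a quick check on the RAM row, the slices can be summed. This sketch treats "everything else <1" as an upper bound of 1 GiB, which is an assumption made only for the arithmetic:

```python
# Approximate per-slice budget in GiB, taken from the table above.
RAM_BUDGET_GIB = {
    "tuwunel": 2,
    "livekit": 1,
    "other_containers": 1,  # "<1" rounded up to its bound
    "os_and_docker": 1,
    "headroom": 3,
}


def total_gib(budget):
    """Sum of all slices in the budget."""
    return sum(budget.values())


print(total_gib(RAM_BUDGET_GIB))
```

The slices add up to the 8 GiB allocation, with roughly 3 GiB of it unassigned headroom.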

Completed 2026-05-22

All five sub-phases closed 2026-05-22. Tuwunel v1.7.0 on VM 111, federation green via apex well-known, MatrixRTC live with Element Call validated end-to-end (audio + video + screen-share). Two headline gotchas in the runbook's Lessons appendix: matrix_tuwunel_config_allowed_remote_server_names filtering events from our OWN server (text/federation phase) and the apex well-known missing org.matrix.msc4143.rtc_foci (RTC phase). Both fixes are now baked into the Caddy template and vars.yml respectively.

Execution checklist (actual):

6E.1 — VM 111 stand-up: Debian 13 genericcloud image, 4 vCPU / 8 GiB / 32 GiB at 192.168.1.243; common + security + beszel_agent applied (docker role intentionally dropped from playbooks/matrix.yml post-execution — spantaleev owns Docker here, see Lessons). Completed 2026-05-21.
6E.2 — spantaleev playbook bootstrap on CT 104: cloned at commit 9bd9d1a to /root/matrix-deploy/, inventory scaffolded for Tuwunel, vault-bridge symlink set up. matrix_tuwunel_version: v1.7.0 pinned (v1.6.2 has a token+password registration regression). ensure-matrix-users-created is Synapse-only — Tuwunel's grant_admin_to_first_user handles first admin via client-side registration token instead. just substituted with direct ansible-galaxy install -r requirements.yml -p roles/galaxy/ --force (no just on Debian bookworm). Completed 2026-05-22.
6E.3 — Edge integration: caddy role extended with caddy_matrix_enabled gate + caddy_matrix_upstream (single upstream — federation collapsed onto web entrypoint via matrix_federation_public_port: 443). Template adds apex rampancy.cloud block (well-known statics) and matrix.rampancy.cloud block (federation path skips CrowdSec). Apex LE cert issued via existing CF DNS-01. Completed 2026-05-22.
6E.4 — MatrixRTC UDM port-forwards: 5 forwards configured via UDM UI (matrix-rtc-ice-tcp 7881/tcp, matrix-rtc-ice-udp-mux 7882/udp, matrix-rtc-turn-udp 3479/udp, matrix-rtc-turn-tcp 5350/tcp, matrix-rtc-turn-relay 30000-30020/udp → 192.168.1.243). Apex .well-known/matrix/client extended to advertise org.matrix.msc4143.rtc_foci (Element Call queries the apex well-known and doesn't fall through to the matrix subdomain — see Lessons). Element Call validated end-to-end: desktop ↔ Element X mobile on cellular, audio + video + screen-share both directions. Completed 2026-05-22.
6E.5 — First user + smoke: @rampancy:rampancy.cloud registered via Element Web's token-gated registration, auto-promoted to admin by Tuwunel, joined the auto-created #admins:rampancy.cloud room with @conduit:rampancy.cloud bot. Federation tester green. Completed 2026-05-22.
6E.6 — Docs sweep: this roadmap entry flipped; matrix-setup runbook rewritten with actual execution + 9-item Lessons appendix; changelog entry below; hosts/proxfold guests update + services/matrix.md page pending (separate one-off cleanups, no functional block).

Deferred / future sub-phases (not committed; reassess after 6E lands):

6E.7 — Mobile push: Sygnal + FCM keys OR self-hosted UnifiedPush. Adds a Google dependency or a third small service. Skip if Element Web is good enough day-to-day.
6E.8 — Discord bridge: mautrix-discord. Adds Postgres 16+ (Tuwunel doesn't need it but the bridge does) and a per-user puppeting workflow. Worth doing only if there's an active Discord community to keep bridging into.
6E.9 — Pocket-ID OIDC integration — bundles cleanly into Phase 7E.

6F. Music recommendations / discovery

Goal: Add Spotify-style Discover Weekly / Daily Jams on top of the existing PlexAmp listening experience. Recommendations come from community listening data (ListenBrainz), missing tracks dispatch through the 6D acquisition pipeline, and the result lands as Plex playlists.

Scoped, not started — reference only

Scoped from a research session on 2026-05-16 immediately after 6D close-out; reassess at execution time. No work begun.

Architecture — four small Docker services, all fitting on arrstack VM 101 alongside the existing 6D stack:

| Layer | Purpose | Pick |
|---|---|---|
| Scrobble source | Captures what's been played | Plex / PlexAmp native scrobble webhook |
| Scrobble bridge | Forwards Plex listens to ListenBrainz (Plex doesn't speak LB natively) | RustyRin/Plex_Scrobble_App or FoxxMD/multi-scrobbler |
| Recommendation engine | Generates Discover Weekly / Daily Jams / similar-artists from listening history | ListenBrainz public instance (MetaBrainz) |
| Orchestrator | Pulls LB recs → resolves against local library → dispatches missing tracks via slskd → publishes a Plex playlist | LumePart/Explo |

Pipeline:

PlexAmp listening
   ↓ Plex webhook
Scrobble bridge (Docker, arrstack VM 101)
   ↓
ListenBrainz (public cloud)
   ↓ recommendations API
Explo (Docker, arrstack VM 101)
   ├── checks Plex/local library
   ├── missing → dispatches to slskd (already built in 6D)
   └── publishes Discover Weekly / Daily Jams as Plex playlists
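
The orchestrator step in the pipeline above is essentially a set difference between recommendations and the local library. A minimal sketch of that split, assuming tracks are dicts with `artist` and `title` keys; this illustrates the idea only and is not Explo's code or data model:

```python
def track_key(track):
    """Case-insensitive identity used to compare tracks."""
    return (track["artist"].casefold(), track["title"].casefold())


def split_recommendations(recommended, library):
    """Return (already_owned, missing), preserving recommendation order."""
    owned = {track_key(track) for track in library}
    present = [t for t in recommended if track_key(t) in owned]
    missing = [t for t in recommended if track_key(t) not in owned]
    return present, missing
```

Tracks in the first list can go straight into the playlist; the second list is what would be handed to the acquisition side.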

Execution checklist (when picking this up):

Stand up scrobble bridge (Plex_Scrobble_App or multi-scrobbler) as a Docker service on arrstack VM 101; wire Plex webhook in
Register ListenBrainz account on the public instance, wire the scrobbler → LB API
Accumulate ≥ 2 weeks of scrobble history before standing up Explo (recs are cold-start-sensitive)
Stand up Explo on arrstack VM 101; wire LB API → Plex library lookup → slskd dispatch → Plex playlist publish
Decide how Explo-grabbed-via-slskd items get tagged/imported: round-trip through Lidarr (consistent with 6D) or direct-land in the Plex library (faster but bypasses Tubifarry's import path)
Smoke test: confirm Discover Weekly / Daily Jams playlists appear in Plex with plausible picks based on actual recent listening
New services/listenbrainz.md + services/explo.md pages on close-out; services/arrstack.md services list updated; mkdocs.yml nav entries added

Open decisions / risks:

ListenBrainz public vs self-host. Public instance is free and rec quality benefits from cross-user collaborative filtering (self-hosting with N=1 dataset hurts the model). Default to public. Privacy posture: listening history is visible to anyone who looks (Last.fm-class).
Cold-start. First 2–3 weeks of recs will be weird/generic until LB has enough history. Mitigation: stand up scrobbling well before Explo so the dataset is already accumulating when the orchestrator lands.
Explo is single-maintainer (LumePart). Active per commit history in 2026 but smaller than the *arr stack — maintenance risk worth knowing; have a fallback in mind (manual LB rec → Lidarr import).
Plex scrobble webhook is reliable but not perfect — occasionally misses listens during network blips. Not a deal-breaker for recs.
No mood/activity-aware curation (Spotify "Focus" / "Late Night"). That's where the streaming-service experience still wins; explicitly out of scope for 6F.
Caveat captured but not validated: assumes Explo speaks slskd directly. Verify against current Explo README at execution time — if it only writes Lidarr Wanted entries, the slskd dispatch happens via Tubifarry as a side effect rather than directly.

6G. Comic / graphic-novel library (Komga)

Goal: Store a comic / graphic-novel library and read it on the iPad plus a broad range of devices. Komga serves a web reader + OPDS v1/v2; comics only — ebooks stay on the korrosync KOReader flow.

Deployed 2026-07-06 — reachable; library-loading + iPad validation user-driven

Chosen over Kavita (best comic reading engine + OPDS v2 for Panels/Chunky; Kavita's EPUB support is redundant given korrosync already covers ebooks). Deployed as its own Dockhand stack, not folded into the arrstack compose — matches the korrosync/mealie standalone-app precedent; the arr external network is reserved for the acquisition pipeline. Container up on VM 101, /data read-only mount verified, and https://comics.rampancy.cloud returns 200 through the edge (502→200 after Dockhand deploy). Remaining hands-on: populate the library + validate the iPad web reader / Panels OPDS from cellular. Full detail on the Komga service page.

Architecture — single container, own komga network + komga_config named volume on arrstack VM 101:

| Decision | Choice |
|---|---|
| Image | `gotson/komga` (upstream first-party — matches korrosync/mealie using official images) |
| Config | named volume `komga_config` (vzdump-captured); Docker seeds ownership from the image, no PUID/user override |
| Library | read-only bind of `/stash/rodneystash/Comics` → `/data:ro` (reader writes only DB/thumbnails to `/config`) |
| Port | `25600` (free on the host) |
| Edge | `comics.rampancy.cloud` via CT 107 Caddy — wildcard cert + CrowdSec cover it automatically |
| Acquisition | out of scope — Mylar3 is a possible future stack if wanted |

Sanity flag: the stash pool is at ~94% (zpool list stash: 19.7T/21.0T, ~787 GB free) — a comic library fits easily, but the pool is near the ZFS-degradation line independent of this, worth a separate cleanup decision.
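
The ~94% figure follows directly from the `zpool list` numbers. A small sketch of the calculation; the 90% warning threshold is an assumption chosen for illustration, not a documented ZFS limit:

```python
def pool_usage_percent(allocated_tib, size_tib):
    """Allocated space as a percentage of pool size."""
    return 100 * allocated_tib / size_tib


def needs_cleanup(allocated_tib, size_tib, threshold_percent=90.0):
    """True when usage is at or above the chosen warning threshold."""
    return pool_usage_percent(allocated_tib, size_tib) >= threshold_percent


print(round(pool_usage_percent(19.7, 21.0), 1))
```

At 19.7T of 21.0T the pool sits at about 93.8%, so the cleanup decision is warranted regardless of how small the comic library is.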

Execution checklist (actual):

stacks/komga/docker-compose.yml added (own network + volume, gotson/komga) — homelab-ansible 806f30c
comics.rampancy.cloud route added to inventory/host_vars/edge.yml; caddy role applied to edge (changed=2, failed=0)
/stash/rodneystash/Comics created on proxfold (root:root 0755, matching sibling libs)
Cloudflare CNAME comics → rampancy.cloud, gray cloud — missed in the original plan, caught by the resolve test; no wildcard A record exists
Stack registered in the Dockhand UI pointing at stacks/komga/; container Up; /data read-only bind verified; edge returns 200
Komga admin account created (strong password — edge is internet-facing); library added at /data; test CBZ scans — user-driven
iPad validated: Safari web reader + Panels over OPDS v2; public route confirmed from cellular (not LAN — hairpin NAT) — user-driven
Docs: service page, nav, changelog, roadmap

6H. Comic acquisition (Kapowarr) + metadata (komf)

Goal: Turn the 6G library into a full pipeline — automated acquisition in front of Komga, and accurate metadata inside it. Both land in the existing komga stack (now 4 containers: reader + VPN + acquisition + metadata).

Deployed 2026-07-18 — pending first-run config + verification

Compose, .env.example, temp dir and docs shipped; Dockhand redeploy + app configuration are the remaining steps. Kapowarr handles acquisition (ComicVine + GetComics DDL) behind a third ProtonVPN session; komf handles metadata. The komf addition is the important one — see rationale below.

Why komf is not optional. Komga has no built-in online metadata scraper (komga#1577) — it reads only embedded ComicInfo.xml or the filename. And Kapowarr does not write ComicInfo.xml (Kapowarr#50) — its ComicVine matching serves only its own naming. So acquisition alone leaves Komga with bare filename-derived metadata. komf closes the gap: Kapowarr gets & organises → Komga displays → komf makes it accurate.

| Decision | Choice |
|---|---|
| Acquisition | `mrcas/kapowarr` :5656 — ComicVine-driven, GetComics is its only source (DDL via Mega/MediaFire/Pixeldrain) |
| Kapowarr egress | Own `gluetun-kapowarr` (3rd ProtonVPN WG session), `VPN_PORT_FORWARDING=off` — DDL is outbound-only |
| Metadata | `sndxr/komf` :8085 — ComicVine (Western) + MangaUpdates/AniList/MangaDex (manga), API mode (writes into Komga's DB, non-destructive) |
| komf egress | Not tunnelled — metadata-provider APIs only, and it must resolve `komga` on the stack network |
| Temp downloads | `/stash/rodneystash/ComicsDL` — sibling of `Comics` (Kapowarr's non-intersect rule), same pool → atomic moves |
| ComicVine key | One key, entered in both apps; they share the ~200 req/hour limit — stagger big scans |
| Exposure | Both admin UIs LAN-only (no Caddy route/CNAME), like the *arrs |
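
The shared ComicVine limit is the reason to stagger scans. A rough estimate of scan duration, assuming a fixed number of API requests per series; that per-series figure is a guess for illustration and the function name is hypothetical:

```python
import math


def hours_for_scan(series_count, requests_per_series, limit_per_hour=200, concurrent_apps=1):
    """Whole hours needed when the hourly limit is split between running apps."""
    share = limit_per_hour / concurrent_apps
    return math.ceil(series_count * requests_per_series / share)


# Roughly 180 series at an assumed 3 requests each.
print(hours_for_scan(180, 3))
print(hours_for_scan(180, 3, concurrent_apps=2))
```

Under these assumptions a scan that fits in 3 hours alone takes 6 when Kapowarr and komf draw on the key at the same time.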

Execution checklist:

Why bulk matching is parked

ComicVine indexes issue runs, TPBs, one-shots and hardcovers as separate releases, and this library is almost entirely collected editions (TPBs + manga volumes). Name and year matching alone is not sufficient — Kapowarr's own Library Import screen warns of exactly this. Demonstrated live: 100 Bullets matched to ComicVine 6306 "100 Bullets (1999) (DC Comics)" (the 100-issue monthly run) when the files on disk are the 5-volume 2014–2016 collection, so komf mapped issue-level numbering onto 5 TPBs. Metadata was reset. Every series therefore needs a human decision about which ComicVine release it is — ~180 of those is disproportionate to the payoff.

If picked up later, two changes make it far less painful:

Flip provider priority — comicVine: 10, mangaUpdates: 20. The original ordering (mangaUpdates first) was based on a wrong assumption that its MANGA filter would let Western comics fall through; in practice MangaUpdates returns fuzzy junk for Western titles (searching "100 Bullets" yields Bully Bullets, Trigun – Multiple Bullets), so plain auto-match is unsafe for this library as configured.
Disable ComicVine's bookMetadata fields (title, number, numberSort, summary, releaseDate) so komf enriches series metadata only and leaves Komga's filename-derived volume numbering intact. Keep MangaUpdates' book metadata on — it is volume-oriented and maps correctly onto the manga.

Also use Kapowarr's Library Import with Import, never Import and Rename — the latter moves/renames files underneath Komga.
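
The 100 Bullets mismatch above (a 100-issue run matched against 5 files) suggests a cheap pre-filter if bulk matching is ever revisited. A minimal sketch, assuming the issue count of the candidate ComicVine release is available from whatever lookup is in use. The function name and the threshold are hypothetical, and it only flags suspicious matches; it does not pick the right release.

```python
def needs_human_review(files_on_disk: int, comicvine_issue_count: int,
                       tolerance: float = 0.5) -> bool:
    """Flag a match whose ComicVine issue count is far from the file count.

    A collected edition (few files) matched to a monthly run (many issues)
    shows a large relative gap. `tolerance` is an arbitrary illustrative
    threshold, not a tuned value.
    """
    if files_on_disk <= 0 or comicvine_issue_count <= 0:
        return True  # nothing to compare against: let a human decide
    larger = max(files_on_disk, comicvine_issue_count)
    smaller = min(files_on_disk, comicvine_issue_count)
    return (larger - smaller) / larger > tolerance
```

With the numbers from the example, `needs_human_review(5, 100)` flags the match, while a 5-file series matched to a 5-volume release passes.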

Gotchas captured during execution (2026-07-18):

1. komf requires /config/application.yml to exist: it does not create a default, and crash-loops with NoSuchFileException on a fresh named volume. Env vars alone are insufficient. Config had to be seeded by hand into komf_config; since it lives in a volume it is not in Git, so a rebuild must re-seed it. See komf bootstrap.
2. An enabled provider with no API key blocks startup: IllegalArgumentException: Api key is not configured for ComicVine provider. komf won't run degraded, so the ComicVine key became a required stack env var (KOMF_METADATA_PROVIDERS_COMIC_VINE_API_KEY) rather than a UI-entered setting as originally planned.
3. Provider ordering for a mixed library: mangaUpdates was originally given higher priority than comicVine on the assumption its MANGA mediaType filter would let Western comics fall through. This was wrong: MangaUpdates returns fuzzy matches for Western titles and can win a match it shouldn't (see gotcha 4).
4. komf's event listener auto-matched newly added series; it is now disabled. eventListener.enabled: true makes komf match every series added to Komga, with no prompting. Combined with gotcha 3 this silently mis-matched Astro City → "Astro Baby" (MangaUpdates) and rewrote book metadata on Lobo (1990) on 2026-07-18. Both were reset; the listener is now off and matching is manual only. Parking the bulk pass did not stop this, because the listener is independent of it. Verify with docker logs komf 2>&1 | grep -c "connected to Komga event source" → should be 0 (2>&1 so lines the container wrote to stderr are counted too).
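
The first two gotchas are both checkable before the container starts. A minimal preflight sketch, assuming the komf_config volume is reachable as a directory from wherever the check runs. The function name is hypothetical; the file name and env var name are the ones recorded above.

```python
from pathlib import Path
from typing import Mapping

# Env var name as recorded in the gotchas above.
API_KEY_VAR = "KOMF_METADATA_PROVIDERS_COMIC_VINE_API_KEY"

def komf_preflight(config_dir: Path, env: Mapping[str, str]) -> list[str]:
    """Return the problems that would stop komf from starting (empty = OK)."""
    problems = []
    # komf does not create a default config, so the file must already exist.
    if not (config_dir / "application.yml").is_file():
        problems.append("application.yml missing: seed it before first start")
    # An enabled ComicVine provider without a key blocks startup.
    if not env.get(API_KEY_VAR, "").strip():
        problems.append(f"{API_KEY_VAR} is empty or unset")
    return problems
```

Calling `komf_preflight(Path("/config"), os.environ)` inside a helper container would be one way to use it; the mount path there is an assumption. It only covers the ComicVine key, so a different enabled provider with a missing key would still need its own check.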

Deferred: manga acquisition (Suwayomi if that side grows — komf already covers manga metadata); torrent-backed GetComics posts via qBittorrent (needs a remote-path-mapping); komf COMIC_INFO write-back mode (bakes metadata into the CBZs — now possible since the library is :rw).

Housemate Proxmox access¶

Trusted-LAN scope: VMs land on vmbr0/VLAN 1 today, with Proxmox-side controls (RBAC, ZFS quota, scoped storage) as the boundary. L2 isolation onto a dedicated Lab VLAN is captured separately as Phase 8 — Network segmentation.

Trust posture: housemate is a known person, written guidelines on resource limits + risky operations, no hostile-tenant assumptions.

See the Housemate Access runbook for the full ACL set + execution order.

Enable UDM Network API token (or Local Admin user fallback) for read-only config access
Create PVE realm user hazel@pve with TOTP enforced
Create group housemate-lab-admins and ZFS dataset stash/housemate-vms (quota=500G)
Create PVE storage housemate-zfs (ZFS plugin, content images,rootdir) and resource pool housemate-lab
Apply group ACLs: pool → PVEVMAdmin, storage → PVEDatastoreUser, /sdn/zones/localnetwork/vmbr0 → PVESDNUser
PBS coverage: existing pbs-daily job picks up Hazel's pool VMs automatically (root namespace); operator-mediated restore. Self-service PBS UI for Hazel is deliberately deferred — see runbook step 6 future-enhancement block
Written guidelines for Hazel: resource caps, snapshot discipline, ping-before-passthrough
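
The three group ACLs in the checklist can be kept as data and rendered into commands for review. A minimal sketch that only builds strings and executes nothing. The /pool and /storage path prefixes are assumed from the usual PVE ACL path layout, and the Housemate Access runbook remains the authoritative ACL set.

```python
GROUP = "housemate-lab-admins"

# ACL path -> role, taken from the checklist. The /pool and /storage
# prefixes are an assumption about how the pool and storage are addressed.
ACLS = {
    "/pool/housemate-lab": "PVEVMAdmin",
    "/storage/housemate-zfs": "PVEDatastoreUser",
    "/sdn/zones/localnetwork/vmbr0": "PVESDNUser",
}

def acl_commands(group: str = GROUP, acls: dict[str, str] = ACLS) -> list[str]:
    """Render one `pveum acl modify` line per ACL entry (not executed)."""
    return [
        f"pveum acl modify {path} --groups {group} --roles {role}"
        for path, role in acls.items()
    ]
```

Printing the result gives three lines that can be diffed against the runbook before anything is applied.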

Phase 7 — Security stack (weeks 10+)¶

Goal: Add SIEM / endpoint detection coverage and close the edge-security gap left by Phase 5D. Drift detection, Beszel, and PBS cover availability; nothing today covers intent. Phase 7 closes that gap with Wazuh as the centrepiece, network IDS via Suricata, edge IPS via CrowdSec, and a lightweight identity provider (Pocket-ID) for OIDC-native admin UIs.

Pre-requisite: Phase 4B

Wazuh's all-in-one server wants 8–16 GB on a dedicated VM. Current 48 GB physical leaves no comfortable headroom once arrstack/n8n/PBS/Beszel/control/plex/nginx are accounted for. 4B's 8× 32 GB / 256 GB target is sized partly to unblock this phase, with comfortable ARC headroom + future expansion room (4 slots free). Without 4B done first, Phase 7 squeezes everything else and hits its retention/heap ceilings within months.
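
The headroom argument is simple arithmetic. A minimal sketch using only the figures in this section (48 GB today, 8× 32 GB target, 16 GB as Wazuh's upper bound). It ignores ZFS ARC and every other guest, so it shows the relative squeeze rather than a real memory budget.

```python
CURRENT_GB = 48
TARGET_GB = 8 * 32  # 4B target: 8 DIMMs of 32 GB

def wazuh_share(total_gb: int, wazuh_gb: int = 16) -> float:
    """Fraction of physical RAM the Wazuh VM claims at its upper bound."""
    return wazuh_gb / total_gb
```

At the upper bound Wazuh alone would take a third of today's RAM, against roughly 6% after 4B.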

flowchart LR
    A[7A: Wazuh AIO + first agents] --> B[7B: Agent fleet rollout]
    B --> C[7C: Suricata NIDS + integration]
    B --> D[7D: CrowdSec edge]
    B --> E[7E: Pocket-ID identity + SSO]

    style A fill:#FAECE7,stroke:#993C1D,color:#712B13
    style B fill:#FAECE7,stroke:#993C1D,color:#712B13
    style C fill:#EEEDFE,stroke:#534AB7,color:#3C3489
    style D fill:#EEEDFE,stroke:#534AB7,color:#3C3489
    style E fill:#EEEDFE,stroke:#534AB7,color:#3C3489

7D + 7E ship together

CrowdSec covers edge protection (drop scanners before they reach app login). Pocket-ID covers identity (single sign-on for OIDC-native admin UIs). They're complementary, share an "edge hardening" theme, and are scoped to land in one cycle. 7E does not depend on Phase 4B's RAM upgrade — Pocket-ID's footprint (~50 MB RAM, single Go binary) is trivial.

7A. Wazuh AIO + Ansible scaffolding + first agents¶

Why: Stand up Wazuh manager + indexer + dashboard on a single VM. Pin to Wazuh 4.14.x — current stable is 4.14.5 (release notes, 2026-04-23). Wazuh 5.0 beta1 shipped April 2026 with rewritten agent protocol, removed Filebeat, and cluster-by-default (5.0 brief) — treat 4.14.x as the runway through 5.0 GA.

7B. Agent fleet rollout¶

Why: Coverage. The proxfold host agent gives the most signal value (kernel events, package changes, ZFS state, SSH attempts, auditd), but each additional agent fills in another part of the security picture.

Roll out agents to remaining VMs: n8n (and plex if/when it moves from LXC to VM)
Roll out agents to LXCs: control, pbs, beszel, edge — but with caveats. Wazuh agent has documented install issues inside unprivileged LXCs (wazuh#24954). Test on one CT first; evaluate before promising fleet coverage. Worst case: skip agent in LXCs and rely on the proxfold host's view of /var/log/lxc/* and pct exec audit
Tune VPN-induced noise — qBittorrent's source IP via gluetun looks "wrong" to geoIP-style rules; suppress or whitelist early
Pull in BeardedTinker's UniFi/Synology/HomeAssistant rule packs (BeardedTinker/wazuh-homelab-security) — UDM and future Phase 6 HAOS integrate near zero-config
Rebuild-kit gotcha — Wazuh agent name can't be changed post-install (wazuh#19710); the rebuild/ kit must install agents after hostname is set, not as part of a base template. Codify or document the constraint

7C. Suricata NIDS + Wazuh integration¶

Why: Without an NIDS, Wazuh sees host-side events only and is blind to network traffic. Suricata + Wazuh is a first-class integration: once the agent is pointed at /var/log/suricata/eve.json, Wazuh parses it with its built-in decoders and surfaces alerts in the dashboard (Wazuh PoC: Suricata integration).

Decide host shape — dedicated Suricata VM, or co-located on the nginx/Caddy edge VM
Decide tap point — port-mirror from MikroTik PENFOLD-SW01, or in-line on the gateway path. Mirror is non-invasive and the right starting point; in-line gives blocking capability later
Codify as roles/suricata with roles/wazuh_agent already installed for the eve.json forward
Tune EmergingThreats Open ruleset — homelab traffic generates noise on alerts written for enterprise environments
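
Ruleset tuning starts with knowing which signatures fire most. A minimal sketch that counts alert signatures in eve.json, assuming the standard EVE layout (one JSON object per line, `event_type` of `alert`, signature under `alert.signature`). The function name is hypothetical.

```python
import json
from collections import Counter
from typing import Iterable

def noisy_signatures(lines: Iterable[str], top: int = 5) -> list[tuple[str, int]]:
    """Count alert signatures in eve.json lines, most frequent first."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partially written or corrupt lines
        if isinstance(event, dict) and event.get("event_type") == "alert":
            signature = event.get("alert", {}).get("signature", "<unknown>")
            counts[signature] += 1
    return counts.most_common(top)
```

Feeding it an open handle on /var/log/suricata/eve.json yields a shortlist of candidates to suppress or threshold. Whether a frequent signature is noise or a real finding is still a human call.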

7D. CrowdSec on edge¶

Completed 2026-05-04

CrowdSec engine + hslatman Caddy bouncer module live on edge (CT 107). Bouncer registered, polling LAPI on 15s ticker, crowdsec directive in every per-host handle block. End-to-end validation via cellular phone confirmed: blocked IP got 403, removed IP returned to 200. Accepted risk: edge security gap closed same day. See crowdsec_engine role and crowdsec-validation runbook — the runbook captures six bugs / gotchas hit during first execution.

crowdsec_engine role landed: packagecloud any/any apt source, crowdsecurity/caddy collection, Restart=on-failure drop-in, stat-gated for --check --diff cleanliness
caddy role updated: xcaddy-built binary caddy-2.11.2-cs1 with hslatman bouncer module + cloudflare DNS, gated bouncer block in Caddyfile template ({$CROWDSEC_BOUNCER_API_KEY} via parse-time substitution), site-block JSON access logs to journal
One-time cscli bouncers add caddy-edge operator step + vaulted key (registration is non-idempotent; not in the role)
Validation runbook executed end-to-end from cellular (LAN test masked by hairpin NAT — runbook now leads with this constraint)
Accepted-risks register entry closed
Cloudflare API token rotation: closed 2026-05-05, rotated via CF Roll as part of Phase 6A.2. New secret validated by API probe (token verify + Zone:DNS:Edit TXT round-trip); script-based vault swap so the token never appeared in scrollback.

Validated post-go-live (2026-05-04): within 4 hours of bouncer enable, the engine's local Caddy-log parser caught a real HTTP-probing attempt from 49.178.191.113 (AU, Microplex) and auto-banned via the crowdsecurity/http-probing scenario. Confirms cscli setup auto-discovery wired Caddy log acquisition + base-http-scenarios + http-cve collections at install time — the role doesn't need to do that work explicitly. SSH log acquisition + crowdsecurity/sshd collection also auto-installed on edge; SSH brute-force detection now feeds the same reputation pool as Caddy, but enforcement for SSH still needs cs-firewall-bouncer (Caddy bouncer is HTTP-only).

7D follow-ups (deferred — flag for later)¶

Captured 2026-05-04 from a maturity-of-implementation review. None are blocking; pull into a future cycle when one feels warranted.

Discord notification profile. Engine-side notification system supports HTTP webhooks (/etc/crowdsec/notifications/http.yaml + /etc/crowdsec/profiles.yaml). Wire bans → vault_discord_webhook_homelab_ops for good visibility at low effort: ~30 min of work, and it shares the channel pattern used by PBS/ZFS/PVE/auto-updates events.
cs-firewall-bouncer + retire fail2ban on edge. SSH detection is already happening (auto-installed crowdsecurity/sshd parses ssh.service journald), but SSH enforcement still relies on the Caddy-side security role's fail2ban. Adding cs-firewall-bouncer (nftables-based) gives a unified federated reputation feed for both SSH and HTTP, and lets us cleanly retire fail2ban from the security role on edge. Needs a new crowdsec_firewall_bouncer role plus a small bootstrap: the same non-idempotent cscli bouncers add operator step used for the Caddy bouncer.
CrowdSec Console enrollment. app.crowdsec.net — free SaaS dashboard showing alerts/decisions/top scenarios. One-command enroll: cscli console enroll <token>. Privacy tradeoff: ships event metadata (timestamps, scenario names, IPs) to CrowdSec's cloud. Considered worth weighing if CLI feedback ever stops being enough.
Wazuh forwarding — already deferred to Phase 7A/B per scoping.

Scope cuts (2026-05-04):

CrowdSec ↔ Wazuh forwarding deferred to Phase 7A/B. Wazuh is gated on Phase 4B (CPU + RAM upgrade), which is deprioritised. Edge gap is more urgent than the broader SIEM build; closing 7D standalone unblocks it. Forwarding hook documented in the role doc as a one-line wiring job when Wazuh exists.
Lynis weekly cron split out to Phase 7A/B. Functionally orthogonal to edge bouncing; pairs naturally with Wazuh's SCA module via custom decoder. Not in 7D scope.
SOAR-lite stretch moves with the Wazuh piece — depends on Wazuh active-response.

Decisions (locked 2026-05-04):

Apt source = packagecloud any/any, not Debian trixie main. Upstream's trixie repo returns 404 (issue #3909, unresolved 2026-05-04); any/any is upstream's documented workaround. Debian trixie ships its own crowdsec package but the version freezes at trixie-release time and falls behind hub items — picked the upstream repo for engine/parser currency despite the external dep. Path component and suite must both be any (/debian + suite any returns HTTP 422; caught on first apply).
Integration via Caddy module, not file-log parser. hslatman/caddy-crowdsec-bouncer plugs in via xcaddy --with (we already xcaddy-build), per-request IP check against LAPI, no logfile detour, dodges the known caddy-logs parser bug. Tradeoff: Caddy upgrades require a rebuild — already our workflow.
fail2ban kept for SSH-only. The CrowdSec bouncer module is HTTP-only (Caddy choke-point). fail2ban on edge stays for SSH brute-force protection (LAN-only path but still sensible). No retire-fail2ban work in 7D.
LAPI stays on stock 127.0.0.1:8080. Initial design moved it to 6060 to dodge a hypothetical Caddy alt-port collision; reverted on first apply because the agent's local_api_credentials.yaml is hardcoded to 8080 by the installer, so the engine wouldn't start. Lesson: don't optimise for hypothetical future state when it touches multiple components.

7E. Pocket-ID identity provider + selective SSO¶

Why: Adds OIDC-based single sign-on in front of the OIDC-native admin UIs (Proxmox VE, PBS) without the operational footprint of a full IdP. Bundles with Phase 7D — CrowdSec drops unauthorized traffic at the edge; Pocket-ID handles identity for the apps that benefit from SSO. The two together close the edge-security accepted risk captured 2026-05-02.

New LXC auth (next free CTID), Debian 13 trixie, unprivileged, features=nesting=1, 1 vCPU / 1 GB / 4 GB on local-zfs, static IP in 192.168.1.0/24 (mirrors CT 106 Beszel / CT 107 edge pattern)
New pocket_id Ansible role — pinned binary (or apt install if/when Pocket-ID ships a Debian package), systemd unit, SQLite-backed, vault-stored signing secret. Same shape as the caddy role — single binary, config templated from inventory.
Caddy: add auth.rampancy.cloud → auth:1411 to caddy_proxy_hosts in host_vars/edge.yml. The existing LE wildcard *.rampancy.cloud already covers it; Cloudflare needs a CNAME auth → rampancy.cloud (gray-cloud).
OIDC integration: configure Proxmox VE OIDC realm pointing at Pocket-ID; configure PBS similarly. Both PVE 8+ and PBS 4+ support OIDC realms natively — no proxy auth needed.
User accounts: primary (operator) + Hazel. Each registers two passkeys — primary in Bitwarden (cloud-synced), secondary on a hardware key (YubiKey 5) or device-native (Touch ID / Windows Hello). Bitwarden emergency access pre-configured for Hazel as the recovery path.
Beszel agent on the new LXC.
Docs: new services/pocket-id.md + hosts/auth/index.md + ansible/roles/pocket-id.md, all wired into mkdocs.yml. New runbook runbooks/sso-rollout.md capturing the cutover (Proxmox/PBS realm switch + first passkey registration walkthrough).

Apps explicitly NOT in scope for SSO (deliberate, captured here so future-self doesn't re-litigate):

App	Reason
Beszel hub UI	No OIDC support. CrowdSec + Beszel's own auth is sufficient.
Dockhand UI	No OIDC support. CrowdSec + Dockhand's own auth.
n8n	OIDC is a paid Enterprise feature; free version stays on its own auth.
Overseerr (`requests`)	Already auth'd via Plex OAuth — already SSO of a sort.
arrstack admin UIs (Sonarr/Radarr/Prowlarr/qBittorrent)	LAN-only, no OIDC, low value.

Decisions (locked at scope time, 2026-05-03):

Pocket-ID, not Authentik. Authentik bundles OIDC + forward-auth in one component and would gate the non-OIDC apps too. But it introduces PostgreSQL + Redis (everything else in the homelab is SQLite), runs ~3-4 GB RAM vs Pocket-ID's ~50 MB, and the bundled forward-auth job is already covered by 7D CrowdSec. Reconsider Authentik if the public-facing SSO surface grows beyond ~5 services with weak per-app auth.
No Tinyauth pairing. The 2026 community pattern is Pocket-ID + Tinyauth for double-coverage of OIDC + forward-auth. Tinyauth in front of apps that have their own login produces a double-login UX (gate at proxy + app's own login still fires). With CrowdSec in 7D, the marginal benefit doesn't justify the friction.
Bitwarden as primary passkey vault, hardware key as backup. Operator already runs Bitwarden paid plan. Bitwarden emergency access closes the break-glass gap flagged during the 5D close-out — Hazel is the natural emergency-access nominee.
Skip Authelia. Its release cadence is slowing (Nov 2025 → Mar 2026 release gap, patch-only since), and its classical password+TOTP UX is dated relative to passkey-first. Pocket-ID is the cohort-aligned 2026 pick for new deployments.

Phase 8 — Network segmentation (deferred)¶

Goal: Migrate the homelab off a single flat VLAN onto a segmented design — Mgmt for infrastructure admin, IoT for smart-home devices, Lab for the housemate sandbox. Switch PENFOLD-SW01 from SwOS to RouterOS to enable programmatic config management. Reuses spare proxfold NICs to add the Lab VLAN bridge without reconfiguring vmbr0, so existing services don't blip during the additive phases.

Why this is its own phase

Triggered by the housemate-access work (Phase 6 entry above), but deliberately scoped separately. The immediate housemate change ships with Hazel's VMs on vmbr0/VLAN 1 with Proxmox-side controls only; their migration onto the Lab VLAN is captured here as 8G. Doing this work alongside Hazel's onboarding would conflate "first VLAN rollout" with "first RouterOS exposure" — too many simultaneous variables.

VLAN scheme (10-spacing buffer for future classes):

ID	Name	Subnet (TBD at scope time)	Purpose
1	LAN	192.168.1.0/24	Existing trusted servers/clients
10	Mgmt	192.168.10.0/24	UDM admin, switch admin, iDRAC, future managed APs
20	IoT	192.168.20.0/24	Smart-home devices
30	Lab	192.168.30.0/24	Housemate sandbox
40	Guest	192.168.40.0/24	Visitor WiFi (reserved, not built)
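
The scheme above follows one pattern: the VLAN ID is the third octet. A minimal sketch with the standard `ipaddress` module that derives each subnet and checks for overlaps. Since the subnets are still marked TBD, treat the pattern as the current proposal rather than a locked decision; the function names are hypothetical.

```python
import ipaddress
from itertools import combinations

VLANS = {1: "LAN", 10: "Mgmt", 20: "IoT", 30: "Lab", 40: "Guest"}

def subnet_for(vlan_id: int) -> ipaddress.IPv4Network:
    """Map a VLAN ID to 192.168.<id>.0/24, the pattern used in the table."""
    if not 1 <= vlan_id <= 255:
        raise ValueError("VLAN ID does not fit the third octet")
    return ipaddress.ip_network(f"192.168.{vlan_id}.0/24")

def overlapping(vlan_ids) -> list[tuple[int, int]]:
    """Return VLAN ID pairs whose derived subnets overlap (expected: none)."""
    return [
        (a, b)
        for a, b in combinations(sorted(vlan_ids), 2)
        if subnet_for(a).overlaps(subnet_for(b))
    ]
```

The pattern only holds for IDs up to 255, which the 10-spacing scheme stays well inside.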

flowchart LR
    A[8A: Console cable + SwOS backup] --> B[8B: Switch flip to RouterOS]
    B --> C[8C: RouterOS hardening]
    C --> D[8D: UDM VLAN networks + firewall rules]
    D --> E[8E: Switch port VLAN config]
    E --> F[8F: proxfold vmbr1 + PVE SDN]
    F --> G[8G: Migrate Hazel onto Lab VLAN]
    F --> H[8H: Migrate mgmt plane onto Mgmt VLAN]
    F --> I[8I: Migrate IoT devices onto IoT VLAN]

    style A fill:#FAECE7,stroke:#993C1D,color:#712B13
    style B fill:#FAECE7,stroke:#993C1D,color:#712B13
    style C fill:#FAECE7,stroke:#993C1D,color:#712B13
    style D fill:#EEEDFE,stroke:#534AB7,color:#3C3489
    style E fill:#EEEDFE,stroke:#534AB7,color:#3C3489
    style F fill:#EEEDFE,stroke:#534AB7,color:#3C3489
    style G fill:#E1F5E7,stroke:#2E7D32,color:#1B5E20
    style H fill:#E1F5E7,stroke:#2E7D32,color:#1B5E20
    style I fill:#E1F5E7,stroke:#2E7D32,color:#1B5E20

8A. Console cable + dated SwOS backup¶

Why: Escape path for the OS flip. Without console access, recovery from a misconfigured switch reboot is "laptop direct-cabled to RouterOS's default 192.168.88.1 mgmt subnet" — works but slower and error-prone. Switch has no autobackup automation today (UDM does; switch does not — gap worth closing regardless).

Order RJ45-to-USB-serial console adapter
Manual SwOS config export (download .swb from http://192.168.1.3/backup, dated copy stored alongside the ~/proxfold-pve9-upgrade/ artifacts)
(Optional) Codify daily SwOS backup as a small cron on the control LXC to fill the autobackup gap

8B. Switch flip — SwOS → RouterOS¶

Why: SwOS has no SSH or scriptable API. RouterOS gives full SSH + API + /export config readout. Switching is reversible — both OSes coexist on the device, configs preserved per slot. Marvell switching chip handles L2 in hardware on both OSes; line-rate preserved as long as RouterOS bridge VLAN filtering keeps hw=yes.

Pre-stage RouterOS minimal config as a .rsc script that reproduces today's behavior (24× 1G + 2× 10G in one untagged bridge, mgmt IP 192.168.1.3)
Console cable in hand, flip via /system routerboard settings set boot-os=router-os then reboot
Import .rsc on first boot
Verify all wired hosts reachable; verify hw=yes on every bridge port (/interface bridge port print)
Escape path: if anything fails, /system routerboard settings set boot-os=swos + reboot — instant rollback

8C. RouterOS hardening¶

Why: RouterOS exposes more services than SwOS — SSH, API, web, WinBox, optional FTP/Telnet. MikroTik has a CVE history when these are left exposed. Discipline: management plane on Mgmt VLAN only, unused services off.

Disable telnet, ftp, www (use www-ssl only), api if unused
Restrict winbox/ssh/www-ssl to a management address allowlist (admin IP for now; Mgmt VLAN once 8H lands)
Pin firmware version, document update cadence in hosts/penfold-sw01/

8D. UDM — VLAN networks + inter-VLAN firewall rules¶

Why: UDM controller defines the L3 + DHCP + inter-VLAN policy. Adding networks is additive; existing VLAN 1 untouched. WiFi clients on VLAN 1 don't blip from network creation alone, but UniFi controller commits do trigger AP provisioning — schedule outside peak housemate hours.

Create UniFi Networks: Mgmt (10), IoT (20), Lab (30); reserve Guest (40) as a placeholder
Inter-VLAN firewall: deny by default; explicit allowlist for required flows (e.g. PBS host → Lab on backup ports, Mgmt → all for admin)
Document policy in hosts/the-egg/firewall.md
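
Deny-by-default with an explicit allowlist is easy to model before it goes into the UDM. A minimal sketch that evaluates flows at VLAN granularity. The rule structure is hypothetical, the port number is a placeholder (the source only says "backup ports"), and the PBS host is coarsened to the whole LAN VLAN for simplicity.

```python
from typing import NamedTuple, Optional

class Rule(NamedTuple):
    src: str             # source VLAN name
    dst: Optional[str]   # destination VLAN name, None = any
    port: Optional[int]  # TCP port, None = any

# Illustrative allowlist only; the port is a placeholder, not a decision.
ALLOW = [
    Rule("Mgmt", None, None),  # Mgmt -> all, for admin
    Rule("LAN", "Lab", 8007),  # PBS host (on LAN) -> Lab, hypothetical port
]

def allowed(src: str, dst: str, port: int, rules=ALLOW) -> bool:
    """Default deny: a cross-VLAN flow passes only if some rule matches it."""
    if src == dst:
        return True  # same-VLAN traffic never reaches the inter-VLAN policy
    return any(
        r.src == src and r.dst in (None, dst) and r.port in (None, port)
        for r in rules
    )
```

A table of expected results (IoT → LAN denied, Mgmt → anything allowed) doubles as a test plan for 8D once the real rules exist.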

8E. Switch — VLAN port assignments¶

Why: L2 enforcement of the network design. Trunk to UDM tagged for VLANs 10/20/30/40; access ports per device classification.

Trunk to UDM: tag VLANs 10/20/30/40, untagged VLAN 1
proxfold ports: existing nic0 port stays untagged VLAN 1; new spare-NIC port (8F) configured for VLAN 30 (or VLAN-aware trunk if multi-VLAN exposure is wanted)
IoT device ports: untagged access VLAN 20

8F. proxfold — `vmbr1` VLAN-aware bridge + PVE SDN¶

Why: Additive to existing vmbr0. Spare NIC (nic1) bound to a new VLAN-aware bridge means existing services on vmbr0 are untouched throughout. PVE SDN gives a clean VNet abstraction with proper permission scoping.

Cable a spare proxfold NIC (nic1) to a switch port configured for VLAN 30 trunking
Add vmbr1 to /etc/network/interfaces with bridge-vlan-aware yes, bridge-vids 2-4094
Define PVE SDN zone (VLAN-aware) bound to vmbr1; create VNet lab for VLAN 30
Update Hazel's group ACLs: replace /sdn/zones/localnetwork/vmbr0 → PVESDNUser with /sdn/zones/<labzone>/lab → PVESDNUser
Codify as Ansible role or runbook step (sync-docs precedent)
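
If the bridge is codified, the stanza can be rendered from variables rather than hand-edited. A minimal sketch of what a template might emit for /etc/network/interfaces. The bridge-stp and bridge-fd lines are the usual Proxmox defaults and are an assumption here, as is the absence of a host IP on the bridge.

```python
def vmbr_stanza(bridge: str = "vmbr1", port: str = "nic1",
                vids: str = "2-4094") -> str:
    """Render an interfaces stanza for a VLAN-aware bridge with no host IP."""
    lines = [
        f"auto {bridge}",
        f"iface {bridge} inet manual",
        f"    bridge-ports {port}",
        "    bridge-stp off",          # assumed default
        "    bridge-fd 0",             # assumed default
        "    bridge-vlan-aware yes",
        f"    bridge-vids {vids}",
    ]
    return "\n".join(lines) + "\n"
```

Narrowing `vids` to just `30` would limit the bridge to the Lab VLAN, at the cost of another edit when a second VLAN is exposed.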

8G. Migrate Hazel's VMs onto Lab VLAN¶

Why: Completes the housemate-access architecture. Pre-Phase-8 her VMs are on vmbr0/VLAN 1 with only Proxmox-side controls; post-Phase-8 they're L2-isolated.

Per VM: shut down, change network bridge to the Lab VNet, boot, verify DHCP from UDM Lab pool, verify firewall behavior
Update permissions: revoke Hazel's SDN.Use on /sdn/zones/localnetwork/vmbr0

8H. Migrate management plane onto Mgmt VLAN¶

Why: Reduces blast radius. iDRAC, switch admin, UDM admin currently sit on VLAN 1 alongside production guests.

iDRAC: reconfigure dedicated NIC onto VLAN 10
Switch (RouterOS): mgmt IP onto VLAN 10
UDM: dedicated mgmt interface on VLAN 10
Update network/overview.md host map

8I. Migrate IoT devices onto IoT VLAN¶

Why: Default-deny outbound to LAN; cap blast radius from compromised IoT firmware.

Inventory IoT devices currently on VLAN 1
Migrate device-by-device (DHCP reservation move + reconnect to IoT WiFi SSID)

Decisions to lock at scope time:

PVE SDN VLAN zone vs. plain VLAN-aware bridge — SDN is the cleaner answer for permission scoping (per-VNet ACLs); plain bridge-vlan-aware yes is simpler if SDN feels like overkill for one VLAN
Guest VLAN — provisioned now or later — reserved as 40 either way
Bond instead of single NIC for vmbr1 — proxfold has 4 NICs, only nic0 in use; could bond 2-3 spares for the Lab bridge. Probably overkill for housemate experiments

Design decisions¶

These don't need answers now but will come up during implementation.

Decision	When it matters	Options
Ansible for compose deploys vs Dockhand	Phase 1–2	Use both (Ansible for OS, Dockhand for containers) or consolidate to Ansible's `community.docker.docker_compose_v2` module
Caddy vs Nginx Proxy Manager	Phase 1B or later	Caddy is more Git-friendly and lighter; NPM has a GUI. Can migrate incrementally
n8n deployment host	Phase 5C	LXC with native Node.js install (standalone service, fits the LXC-for-standalone pattern); no Docker/VM needed
UDM upgrade to Dream Router 7	Any time	Independent of everything else. WiFi 7 + better hardware

Infrastructure reference¶

Current state (pre-Phase 4 hardware upgrades):

graph TB
    subgraph net["Network"]
        udm["The-Egg · 192.168.1.1\nUniFi UDM · Gateway · WAP"]
        sw["PENFOLD-SW01 · 192.168.1.3\nMikroTik CRS326-24G-2S+"]
        udm --> sw
    end

    subgraph proxfold["proxfold · 192.168.1.250 — Proxmox VE · Dell R430"]
        subgraph ct100["CT 100 — plex · 192.168.1.230"]
            plex["Plex Media Server\nNvidia T400 GPU"]
        end
        subgraph vm101["VM 101 — arrstack · 192.168.1.252"]
            media["Sonarr · Radarr · Prowlarr\nSeerr · qBittorrent · MediaBot\ngluetun (ProtonVPN)"]
        end
        subgraph vm102["VM 102 — nginx · 192.168.1.249"]
            npm["Nginx Proxy Manager"]
        end
        subgraph ct104["CT 104 — control · 192.168.1.245"]
            ansible["Ansible control node"]
        end
        pbs["CT — pbs · 192.168.1.246\n(Phase 5A)"]
        beszel["CT — beszel · 192.168.1.247\n(Phase 5B)"]
        n8n["CT — n8n\n(Phase 5C)"]
    end

    nas["QNAP TS-269L · 192.168.1.253\nNFS datastore (Phase 5A)"]
    sw --> proxfold
    sw --> nas

Post-Phase 4 changes: Boot drive becomes ZFS mirror (rpool), RAM increases from 48 GB to 256 GB (8× 32 GB, per the 4B target), second CPU socket populated (28C/56T total).
