Homelab Implementation Roadmap

Last updated: 2026-05-20
Scope: Dell R430 (proxfold) — infrastructure hardening, config-as-code, hardware upgrades, automation
Pace: Weekend-warrior (~8 weeks active, then ongoing)
Status: Phase 1/2/3 done · Phase 4A/4C done · Phase 5A/5B/5C/5D/5E done · Phase 6A/6D done · Phase 7D done · Phase 4B re-prioritised (CPU 2 + 8× 32 GB / 256 GB at 1 DPC; unblocks Phase 7) · Phase 6B/6C/6E/6F + Phase 7A/B/C/E next

The critical path runs through PVE 9 upgrade → Ansible codification → hardware upgrades → boot drive swap. Everything else layers on top. Phase 7 (security stack) depends on 4B for RAM headroom.

gantt
    title Homelab implementation timeline
    dateFormat  YYYY-MM-DD
    axisFormat  %b %d

    section Phase 1 - Foundation
    Ansible control node + common role   :done, a1, 2026-04-01, 7d
    Dockhand deploy + gluetun VPN        :done, a2, 2026-04-01, 10d

    section Phase 2 - PVE 9 Upgrade
    PVE 8.4 → 9.x upgrade                :done, p1, 2026-04-19, 2d
    ZFS RAIDZ expansion                  :done, p3, 2026-04-20, 1d

    section Phase 3 - Codify
    3A proxmox + nut roles               :done, b1, 2026-04-21, 1d
    3B plex data codification            :done, b2, 2026-04-21, 1d
    3C rebuild kit (auto-install)        :done, b3, 2026-04-21, 1d
    3D scheduled drift detection         :done, b4, 2026-04-21, 1d

    section Phase 4 - Hardware
    UPS purchase + install               :done, c1, 2026-04-19, 1d
    CPU 2 + RAM upgrade                  :c2, after b4, 5d
    Boot drive swap (ZFS mirror)         :c3, after c2, 7d

    section Phase 5 - Backup + monitoring
    5A Proxmox Backup Server             :done, d1, 2026-04-23, 1d
    5B Beszel + notifications            :done, d2, 2026-04-24, 1d
    5C n8n (lab automation)              :done, d3, 2026-04-28, 1d
    5D edge LXC + Caddy (NPM → Caddy)    :done, d4, 2026-05-02, 1d

    section Phase 6 - New services
    Home Assistant                       :e1, after d2, 14d
    Obico (3D print monitoring)          :e2, after d2, 14d
    6D Music acquisition pipeline        :done, e3, 2026-05-16, 1d
    Matrix server                        :e4, after d2, 14d
    6F Music recommendations / discovery :e5, after e3, 7d

    section Phase 7 - Security
    7A Wazuh AIO + first agents          :f1, after c2, 7d
    7B Agent fleet rollout               :f2, after f1, 5d
    7C Suricata NIDS + integration       :f3, after f2, 5d
    7D CrowdSec edge                     :done, f4, 2026-05-04, 3d

Phase 1 — Foundation (weeks 1–2)

Goal: Get Ansible running against at least one host and migrate container management to Git-backed Dockhand with VPN routing in place. These two tracks are independent and can run in parallel.

1A. Ansible control node

Why first: Every subsequent phase depends on having playbooks ready — especially the boot drive swap in Phase 4, which becomes a single ansible-playbook site.yml instead of hours of manual work.

Note

The Ansible repo is already scaffolded at rampantlemming/homelab-ansible with all roles written. These steps cover standing up the control node and testing the roles against live hosts for the first time. See the Ansible section for full repo documentation.

  • Create the Ansible control node LXC on Proxmox
    • Unprivileged, Debian 12, 1 vCPU, 512MB RAM, 8GB disk
    • Install pip and Ansible:
      apt update && apt install -y python3-pip git
      pip install ansible --break-system-packages
      

    Note

    --break-system-packages is required on Debian 12. PEP 668 prevents pip from installing into the system Python environment by default — Debian enforces this to avoid conflicts with apt-managed packages. Using a virtualenv is the alternative, but for a dedicated control node LXC this flag is fine.

  • Clone the homelab-ansible repo
    • git clone https://github.com/rampantlemming/homelab-ansible.git ~/homelab-ansible
    • cd ~/homelab-ansible
  • Install Galaxy collections (requires requirements.yml from the cloned repo)
    • ansible-galaxy collection install -r requirements.yml
  • Generate SSH keypair on the control node
    • ssh-keygen -t ed25519 -C "ansible@homelab"
    • Distribute the public key to all managed hosts:
      ssh-copy-id root@192.168.1.250   # proxfold
      ssh-copy-id root@192.168.1.252   # arrstack
      ssh-copy-id root@192.168.1.230   # plex
      ssh-copy-id root@192.168.1.249   # nginx
      
  • Populate and encrypt the vault file
    • The repo contains group_vars/all/vault.yml as a template — fill in real secrets before encrypting
    • ansible-vault encrypt group_vars/all/vault.yml
    • Store the vault password in a safe location (e.g. a password manager); the password file itself must be excluded from git via .gitignore
    • To edit secrets later: ansible-vault edit group_vars/all/vault.yml
  • Test the common role against arrstack first
    • ansible-playbook playbooks/arrstack.yml --tags common --check --diff
    • Verify idempotency: apply for real (drop --check), then run again; the second run should report zero changes
  • Once clean, run common across all hosts
    • ansible-playbook site.yml --tags common

1B. Dockhand + gluetun deployment

Why now: Getting Dockhand deployed means container management is Git-backed before the boot drive swap. Gluetun gives qBittorrent proper VPN routing.

  • Commit compose files to the homelab GitHub repo
    • Create stacks/arrstack/docker-compose.yml
    • Confirm current state: Sonarr, Radarr, Prowlarr, Seerr, qBittorrent, MediaBot
    • Remove deprecated version: key if still present
    • Set TZ=Australia/Adelaide across all services
  • Add gluetun to the compose stack
    • Add gluetun service with ProtonVPN WireGuard config
    • Move qBittorrent to network_mode: "service:gluetun"
    • Expose qBittorrent ports (8080, 6881) on the gluetun service
    • Set FIREWALL_OUTBOUND_SUBNETS=192.168.1.0/24 so Sonarr/Radarr can still reach qBittorrent's API over LAN
    • Ensure this subnet does not overlap with the ProtonVPN WireGuard tunnel address range (gluetun will reject the config if they do)
    • Generate ProtonVPN WireGuard private key via the ProtonVPN website (select P2P-capable Australian server with NAT-PMP)
  • Test VPN routing
    • docker exec qbittorrent curl ifconfig.me — should return a ProtonVPN IP, not your home IP
    • Verify Sonarr/Radarr can still communicate with qBittorrent
  • Deploy Dockhand container
    • Configure GitHub PAT for repo access
    • Point at the stacks/arrstack/ directory
    • Test: push a trivial compose change, confirm Dockhand picks it up
  • Remove Watchtower container
    • Dockhand handles update management — Watchtower is redundant
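
The subnet overlap rule in the gluetun steps above can be checked before deploying. The following is a minimal sketch, not part of gluetun or the repo: prefix_bits and cidr_overlap are hypothetical helper names, and 10.2.0.2/32 is a placeholder for the tunnel address. Take the real value from the Address field of the WireGuard config you generate.

```shell
#!/bin/sh
# Sketch: do two IPv4 CIDR blocks share any address?
# Helper names and the tunnel address below are illustrative assumptions.

# Print the integer value of the first $2 bits of IPv4 address $1.
prefix_bits() {
  echo "$1" | awk -F. -v len="$2" '{
    ip = (($1 * 256 + $2) * 256 + $3) * 256 + $4
    printf "%.0f\n", int(ip / 2 ^ (32 - len))
  }'
}

# Exit status 0 when the blocks overlap, 1 when they are disjoint.
# Two CIDR blocks overlap exactly when they agree on the shorter prefix.
cidr_overlap() {
  a_ip=${1%/*}; a_len=${1#*/}
  b_ip=${2%/*}; b_len=${2#*/}
  if [ "$a_len" -lt "$b_len" ]; then len=$a_len; else len=$b_len; fi
  [ "$(prefix_bits "$a_ip" "$len")" = "$(prefix_bits "$b_ip" "$len")" ]
}

if cidr_overlap 192.168.1.0/24 10.2.0.2/32; then
  echo "overlap: choose a different FIREWALL_OUTBOUND_SUBNETS value"
else
  echo "no overlap"
fi
```

Comparing only the shorter prefix keeps the check to a single equality test, which is enough here because CIDR blocks are always either nested or disjoint.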

Phase 2 — PVE 9 Upgrade + RAIDZ Expansion (weeks 2–3)

Goal: Upgrade Proxmox VE from 8.4 to 9.x (Debian Bookworm → Trixie) to get OpenZFS 2.3, then expand the RAIDZ1 vdev with new drives. This must happen before Ansible codification so that roles capture the stable PVE 9 target state rather than the soon-to-be-replaced PVE 8 config.

Why before codification: Codifying PVE 8 state in Ansible and immediately rewriting it for PVE 9 is throwaway work. Upgrading first means Phase 3 captures reality from day one.

Warning

The ZFS raidz_expansion feature flag is a one-way operation. Once enabled, the pool cannot be imported on OpenZFS < 2.3. There is no going back to PVE 8 after zpool upgrade.

See the PVE 9 Upgrade runbook for the full step-by-step procedure — and the Lessons from the April 2026 run appendix capturing 8 deviations found during execution. Key stages:

  • Pre-flight checks — completed 2026-04-19 (state bundle, pve8to9 --full green, BIOS X2APIC + I/OAT DMA enabled)
  • Upgrade — completed 2026-04-20: bookworm→trixie repos, apt full-upgrade, GRUB-pinned to kernel 6.14.11-6-pve (not PVE 9.1's 6.17 default, to keep Nvidia 550.x compatibility)
  • Post-upgrade verification — PVE 9.1.7 running, ZFS 2.4.1 userland / 2.3.4 kmod, Nvidia 550.163.01 via DKMS, Plex HW transcode confirmed ((hw) in dashboard) after fixing cgroup major drift 235→234 / 238→237
  • Stability soak — abbreviated. No regressions in the ~6h between upgrade and Phase 3 kickoff; stash at 91% full overrode the usual soak window.
  • RAIDZ expansion — completed 2026-04-20. zpool upgrade stash enabled raidz_expansion + 4 other features (irreversible). Drive 1 (sdg, scsi-35002538a97b1c620) attached at 13:59:59 ACST, reflow 2h08m at 1.73 GB/s. Drive 2 (sdh, scsi-35002538a97b19c40) attached after drive 1 auto-scrub, reflow 1h45m at 2.00 GB/s. All-SSD pool finished in hours, not days (runbook assumed HDD speed).
  • Post-expansion verification — pool 14.0T→21.0T, 91%→61% full, scrub repaired 0B with 0 errors, all 6 disks ONLINE, stash/plex-data quota intact
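
The post-expansion figures can be sanity-checked with a little arithmetic. This is a minimal sketch, assuming the amount of stored data did not change during the reflow and that 14.0T and 21.0T are the raw pool sizes before and after the two drives were attached; pct_full_after is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: percent-full after a pool grows, assuming the stored data is unchanged.
# Arguments: percent full before, size before, size after (same units).
pct_full_after() {
  awk -v pct="$1" -v before="$2" -v after="$3" \
    'BEGIN { printf "%.0f\n", pct * before / after }'
}

# 91% of 14.0T is about 12.74T used; spread over 21.0T that is about 60.7%.
pct_full_after 91 14.0 21.0   # prints 61
```

The result agrees with the 91% to 61% change recorded above, and 14.0T across 4 disks versus 21.0T across 6 disks both work out to 3.5T per disk.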

Phase 3 — Codify the stack

Goal: Every piece of host-level configuration that matters is captured in Ansible roles so a proxfold rebuild in Phase 4 (or after a hardware failure) is a playbook run rather than hours of manual work.

Phase 3 breaks into four sub-phases. All four (3A/3B/3C/3D) landed on 2026-04-21.

3A. PVE host baseline — proxmox and nut roles

Completed 2026-04-21

Merged in homelab-ansible#3A. See the proxmox role and nut role pages for the full task inventory.

What landed:

  • proxmox role — deb822 repo management (no-sub enabled, enterprise + ceph disabled), kernel pin via proxmox-boot-tool (6.14.11-6-pve), nouveau blacklist, /etc/sysctl.conf → /etc/sysctl.d/ migration, stash pool import, nasbackup CIFS registration
  • nut role — Network UPS Tools server + monitor for CyberPower PR1500ERT2U (vendor 0764:0601), standalone mode, shutdown at 600s runtime-low
  • proxmox-host playbook wired: proxmox → common → security → zfs → nfs → nvidia → nut
  • Vault entries added: vault_nasbackup_password, vault_nut_admin_password, vault_nut_upsmon_password

3B. Plex data codification

Completed 2026-04-21

Merged in homelab-ansible#3B. Variables and structure documented on the plex role page.

What landed:

  • plex_data_zfs_dataset / plex_data_zfs_quota / plex_data_mount variables plumbed through plex host_vars
  • ZFS dataset stash/plex-data (100G quota) creation delegated to proxfold
  • LXC 100 mp1 mount point registered via lineinfile against /etc/pve/lxc/100.conf
  • Plex data symlink /var/lib/plexmediaserver/Library/Application Support/Plex Media Server/ → /stash/plex-data (fail-hard if real dir exists)
  • docker-restart handler gotcha documented (breaks network_mode: service:* chains — don't touch compose during a plex role run)
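
The "fail-hard if real dir exists" behaviour of the symlink step can be sketched in plain shell. This is an illustration of the guard only, not the plex role's actual task; link_data_dir is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: create or re-point a symlink, but refuse to replace a real
# directory or file (which may still hold data that has not been migrated).
# Arguments: link path (no trailing slash), target path.
link_data_dir() {
  link=$1
  target=$2
  if [ -L "$link" ]; then
    # Already a symlink: re-point it in place (no-op when already correct).
    ln -sfn "$target" "$link"
  elif [ -e "$link" ]; then
    echo "refusing to replace real path: $link" >&2
    return 1
  else
    ln -s "$target" "$link"
  fi
}
```

A call such as link_data_dir "/var/lib/plexmediaserver/Library/Application Support/Plex Media Server" /stash/plex-data would then fail loudly, rather than hide or clobber an existing library directory.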

3C. Proxmox auto-install kit

Completed 2026-04-21

Merged in homelab-ansible#3C. Full operator procedure at the Proxfold rebuild runbook.

What landed:

  • rebuild/answer.toml.j2 (production — Samsung SSD disk filter, static IP) and rebuild/answer-test.toml.j2 (rehearsal — QEMU disk filter, DHCP)
  • rebuild/render.yml Ansible playbook that decrypts the vault and writes both answers; wrappers render-answer.sh + build-iso.sh
  • Rendered TOML + built ISOs gitignored (contain root password hash + SSH keys)
  • Nested-VM rehearsal validated template rendering, vault integration, disk filter fail-safe, and boot order. Installer initramfs finalisation fails inside nested PVE — documented as a known limitation (does not affect bare metal)
  • WSL control node bootstrap formalised as the DR cold-start path (CT104 doesn't exist until proxfold is rebuilt and its backup restored)

3D. Scheduled drift detection (done — 2026-04-21)

Kit and operator docs

Implementation kit lives in homelab-ansible/drift-detection/. Operator procedure: Drift Detection.

What landed:

  • Systemd timer on CT104 — daily at 04:00 ACST with ±5 min randomised delay and Persistent=true for missed-run catch-up
  • drift-check.sh wrapper — git pull --ff-only, ansible-playbook playbooks/site.yml --check --diff, parses the PLAY RECAP for changed / failed / unreachable totals, classifies outcomes
  • Dedicated #homelab-drift Discord webhook — amber embed for drift, red for failure, silent for clean runs (opt-in DRIFT_SUMMARY=1 for confirmation pings)
  • WSL fallback — same wrapper runs from WSL via env overrides, matching the dual-control pattern (CT104 is primary, WSL is DR cold-start)
  • Role hardening surfaced by the first live run: the zfs and nvidia roles were made check-mode safe (check_mode: false on health probes, and systemd enables gated on not ansible_check_mode); the nvidia passthrough strip is now conditional on the managed block being absent (which killed a perennial false positive); and the reload udev handler switched from command to shell so that && actually parses
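
The recap parsing and classification described above can be sketched as follows. This is not the actual drift-check.sh: classify_recap and its one-line output format are hypothetical, and the mapping (failure beats drift, drift beats clean) is an assumption inferred from the amber/red/silent behaviour listed above.

```shell
#!/bin/sh
# Sketch: sum changed / failed / unreachable across all hosts in an
# ansible-playbook PLAY RECAP (read from stdin) and classify the run.
classify_recap() {
  awk '
    # Recap lines look like: "host : ok=12 changed=1 unreachable=0 failed=0 ..."
    /ok=[0-9]+/ && /changed=[0-9]+/ {
      for (i = 1; i <= NF; i++) {
        split($i, kv, "=")
        if (kv[1] == "changed")     changed += kv[2]
        if (kv[1] == "failed")      failed += kv[2]
        if (kv[1] == "unreachable") unreachable += kv[2]
      }
    }
    END {
      if (failed + unreachable > 0) state = "failure"
      else if (changed > 0)         state = "drift"
      else                          state = "clean"
      printf "%s changed=%d failed=%d unreachable=%d\n",
             state, changed, failed, unreachable
    }'
}

# Usage (on a control node):
#   ansible-playbook playbooks/site.yml --check --diff | classify_recap
```

Treating unreachable hosts as a failure rather than as drift keeps a host that is simply down from being reported as a config change.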

Decided during implementation (the "open questions" from the original 3D scope):

  • Cadence: daily, not weekly — catches drift inside 24h, and with no-signal runs silent the cost is only the Discord webhook embed on actual drift
  • Notification target: dedicated #homelab-drift webhook, separate from the MediaBot channel; n8n deferred to Phase 5 (5A in the original scope, 5C after the 2026-04-23 scope reset) and not needed here
  • Vault handling: no_log: true already set on vault-rendered NUT files; webhook only posts the PLAY RECAP (truncated to Discord's 1024-char field limit), no task-level diff

Not done, deferred:

  • Failure quarantine / flap suppression — deferred until we actually see a flapping host; premature abstraction today
  • Dynamic fan curve for the R430 T400 — looked at during the 3D converge, concluded the BMC handles it fine in auto on this unit (see the Drift Detection page and the host_vars comment on proxfold's ipmi_fan_fix: false)

Phase 4 — Hardware upgrades (weeks 5–7)

Goal: UPS protection, full CPU/RAM capacity, and a clean boot drive on mirrored ZFS. The order is strict — each step protects or enables the next.

flowchart LR
    A[UPS install] --> B[CPU 2 + RAM]
    B --> C[Boot drive swap]
    C --> D[Ansible rebuild]

    style A fill:#FAECE7,stroke:#993C1D,color:#712B13
    style B fill:#FAECE7,stroke:#993C1D,color:#712B13
    style C fill:#FAECE7,stroke:#993C1D,color:#712B13
    style D fill:#EEEDFE,stroke:#534AB7,color:#3C3489

4A. UPS purchase + install

Why first: A power event during a boot drive swap or ZFS rebuild would be catastrophic. UPS goes in before any hardware is touched.

Completed 2026-04-19 (pulled forward from planned order)

The UPS was installed ahead of Phases 2 and 3 because the hardware arrived early and a safe install window was available. See UPS for the as-built configuration, USB IDs, NUT config, and battery transfer test results.

  • Purchase the CyberPower PR1500ERT2U — acquired from Scorptec for ~$1,179 AUD
    • 1500VA/1500W, pure sine wave, 2U rackmount
    • Active PFC compatible (required for R430 PSUs)
  • Install and connect
    • Both R430 PSUs (dual 550W redundant) on battery-backed outlets via 2× IEC C13-to-C14 cables
    • NAS and networking gear still on wall power — revisit as part of future rack work
  • Configure NUT on Proxmox
    • nut 2.8.0 installed directly on the Proxmox host in standalone mode
    • usbhid-ups driver matching 0764:0601 (CyberPower HID)
    • battery.runtime.low raised to 600 s (from 300 s) for ZFS/VM shutdown buffer
    • Battery transfer test 2026-04-19 — clean OL → OB → OL, ~2.5 min on battery, no guest disruption
    • NUT monitoring integration with n8n deferred to Phase 5
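
To make the 600 s threshold concrete, the sketch below reads upsc-style "key: value" output and reports whether the UPS is on battery with runtime at or below the limit. It illustrates the threshold only and is not how NUT decides to shut down internally; runtime_low_reached is a hypothetical helper, and the UPS name in the usage comment is an assumption.

```shell
#!/bin/sh
# Sketch: succeed (exit 0) when the UPS is on battery (OB) and the reported
# battery.runtime is at or below the limit in seconds (default 600).
# Input: upsc-style "key: value" lines on stdin.
runtime_low_reached() {
  awk -F': ' -v limit="${1:-600}" '
    $1 == "ups.status"      { status = $2 }
    $1 == "battery.runtime" { runtime = $2 }
    END {
      on_battery = (status ~ /(^| )OB( |$)/)
      exit !(on_battery && runtime != "" && runtime + 0 <= limit + 0)
    }'
}

# Usage (the UPS name "ups" is a placeholder):
#   upsc ups@localhost | runtime_low_reached 600 && echo "runtime low"
```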

4B. CPU 2 + RAM upgrade

Why now: The second CPU socket unlocks all 12 DIMM slots. Buying RAM in one lot gets a matched set and provides headroom for additional VMs.

Note

The R430 Hardware Upgrade runbook covers the full procedure including heatsink installation, CPU seating, and BIOS verification.

Source before starting:

  • 1× Intel Xeon E5-2680 v4 (S-Spec: SR2N7) — match the existing socket 1 CPU
  • 1× Dell heatsink P/N 02FKY9
  • 1× 6th fan module (Dell P/N DNHNR or 79WM9) — Dell's minimum for dual-CPU is 5 fans; the recommended layout is 6. As-built proxfold has 5 populated bays + Fan6 empty (verified 2026-04-28), so this step adds the 6th fan to bring the chassis to the recommended layout
  • 8× 32 GB DDR4-2400 PC4-19200R 2Rx4 ECC Registered RDIMM — single-vendor matched lot; 256 GB total. Populates 1 DIMM per channel per CPU (optimal 1 DPC electrical config — every channel active, no second-DIMM signal loading). Leaves 4 slots free for future expansion. SK Hynix is the recommended vendor since the chassis is already running Hynix sticks today (proxfold dmidecode 2026-04-28 shows 6× HMA41GR7-family, mixed AFR + MFR die — both work, no issues).

    Acceptable part numbers (pick one vendor — don't mix across the kit):

    Vendor | Part number | Notes
    SK Hynix | HMA84GR7AFR4N-UH (A-die) or HMA84GR7MFR4N-UH (M-die) | 32 GB 2Rx4 PC4-19200R, CL17. Either die revision works; matches what's already in proxfold today. Recommended.
    Samsung | M393A4K40CB1-CRC | 32 GB 2Rx4 PC4-19200R, CL17
    Micron | MTA36ASF4G72PZ-2G3 (and -2G3B1 / -2G3A1 suffix variants) | 32 GB 2Rx4 PC4-19200R, CL17

    Confirm the seller is selling a true matched 8-stick lot (single decommissioned server pull) and not 8 random sticks from different sources. Within a single matched kit, all sticks should share the same die revision.

Alternatives if 32 GB sticks aren't available

Plan | Total | Trade-off
8× 32 GB (default) | 256 GB | Optimal 1 DPC config + most capacity + 4 slots free
12× 16 GB | 192 GB | All slots filled. Slightly less bandwidth-clean (2 DPC on 4 of 8 channels). 16 GB part numbers: Samsung M393A2G40EB1-CRC, SK Hynix HMA42GR7AFR4N-UH, Micron MTA36ASF2G72PZ-2G3. Avoid Samsung M393A2K40BB1-CRC — that's the 1Rx4 single-rank variant.
8× 16 GB | 128 GB | Lowest cost. Tight headroom — Wazuh + Phase 6 services + ARC squeeze the budget; ARC re-tune target drops to ~48 GB instead of ~128 GB. Workable but the 32 GB option is the better deal at typical price-per-GB ratios.

Original spec was 12× 32 GB / 384 GB — re-scoped 2026-04-28 to right-size for actual fleet load and to dodge the 2026 DDR4 price spike.

2026 DDR4 RDIMM market context

DDR4 RDIMM pricing is elevated through 2026 because foundry capacity has been pulled toward HBM3E/DDR5 for AI workloads (Tom's Hardware DDR price tracker). Expect more variance than usual on eBay AU listings; watch the market for 1–2 weeks before committing to a kit. Don't panic-buy the first matched lot — and don't pay for DDR4-2666 sticks "for future-proofing" since the E5-2680 v4 caps at 2400 MT/s and 2666 sticks just downclock (Intel E5-2680 v4 product specs).

Steps:

  • Shut down R430 and disconnect power
  • Install 6th fan module (slot furthest from PSUs)
  • Install CPU 2 + heatsink
  • Remove all 6× existing 8 GB DDR4-2133 DIMMs (don't mix old + new — speed clocks down to slowest stick, and the chassis becomes asymmetric). Install the new RDIMM kit. Slot population:
    • 8-stick (32 GB) plan — default: A1, A2, A3, A4 + B1, B2, B3, B4 (1 DIMM per channel per CPU; A5/A6/B5/B6 left empty)
    • 12-stick (16 GB) alternative: all 12 slots A1–A6 + B1–B6
    • Follow Dell population order for optimal memory interleaving (consult the R430 Owner's Manual or iDRAC lifecycle log if memory training errors appear after install)
  • Power on and verify in iDRAC
    • Both CPUs recognised, correct stepping (SR2N7)
    • Total RAM visible: 256 GB (8× 32 GB default) or 192 GB (12× 16 GB alternative), all DIMMs healthy
    • No memory training errors in lifecycle log
  • Verify in Proxmox: lscpu shows 28C/56T, free -h shows ~252 GB (8× 32 GB) or ~189 GB (12× 16 GB) after kernel reservation
  • Re-tune ZFS ARC cap post-converge. Current cap is 14 GiB (changelog 2026-03 entry) — sized for the 48 GB-physical era. With 256 GB physical and a 21 T pool at 62% used, lift the cap toward ~half-RAM (≈ 128 GB) for materially better cache hit rates on media + PBS reads. Variable lives in the proxmox role (/etc/modprobe.d/zfs.conf); revisit after at least 24 h of stability soak with the new RAM, not on day 0. (For 12× 16 GB / 192 GB alternative: target ≈ 96 GB instead.)
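
The ARC cap is set in bytes, so the targets above need converting. This is a minimal sketch, assuming the cap is expressed through the zfs_arc_max module option and that the "GB" targets are treated as GiB; arc_max_line is a hypothetical helper, and the real value is managed by the proxmox role variable rather than by hand.

```shell
#!/bin/sh
# Sketch: print the modprobe option line for an ARC cap given in GiB.
arc_max_line() {
  awk -v gib="$1" \
    'BEGIN { printf "options zfs zfs_arc_max=%.0f\n", gib * 1024 * 1024 * 1024 }'
}

arc_max_line 128   # prints: options zfs zfs_arc_max=137438953472
```

For the 12× 16 GB alternative, arc_max_line 96 gives 103079215104, and the current 14 GiB cap corresponds to 15032385536.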

4C. Boot drive swap to mirrored ZFS

Why: Replace the single 128GB LVM boot drive with two 960GB SSDs in a ZFS mirror, providing OS-level redundancy.

Completed 2026-04-22

Executed via the Boot Drive Swap runbook + Proxfold rebuild runbook. Boot drive is now rpool ZFS mirror across Samsung SM843T + Intel DC S4500 (888G usable, 2% used after restore). See the "Lessons from the 2026-04-22 run" appendix at the bottom of the boot-drive-swap runbook for what bit during execution.

What landed:

  • Pre-swap backups — full vzdump of all VMs/containers to NAS (192.168.1.253) verified before power-down
  • Ansible playbooks committed and pushed prior to swap; fresh production ISO built via rebuild/build-iso.sh
  • answer.toml.j2 disk filter updated — used filter.ID_BUS="ata" (not array-form ID_MODEL — that was rejected by the installer validator); filter matches both 960GB SATA SSDs and the installer builds rpool across them as RAID1
  • Hardware swap — 128GB Samsung 840 PRO removed; Samsung SM843T 960GB + Intel DC S4500 960GB installed in Dell caddies in bays 2 & 3
  • Auto-install ran clean — ZFS mirror rpool created across both new drives, SSH/network up on 192.168.1.250, Ansible converge re-ran, stash pool re-imported with -f (new hostid), CIFS re-registered, CT100 + guest VMs restored from NAS
  • Post-swap verification: zpool status rpool shows both drives ONLINE, both drives populated in proxmox-boot-tool status, stale Boot0007 UEFI entry pointing at the defunct 840 PRO GPT UUID removed via efibootmgr -b 0007 -B
  • Follow-ups codified: nvidia role LXC passthrough rewritten from blockinfile to per-line lineinfile (PVE's conf parser moves raw lxc.* keys to EOF on every rewrite, breaking marker semantics); proxmox role now disables ceph enterprise deb822 source and uses zpool import -f for fresh-hostid cases; ipmi_fan_fix: false path now self-healing

Phase 5 — Backup + monitoring (weeks 7–9)

Goal: Close the two real gaps left after Phase 4 — there's no scheduled backup and no infrastructure-wide health visibility beyond the 3D drift heartbeat. Everything in Phase 5 is additive; nothing blocks Phase 6.

Scope reset — 2026-04-23

The original Phase 5 scope (n8n + Ollama, vulnerability management) was written before Phase 3D landed its own Discord integration and before the 4C boot-drive swap exposed how thin the backup story actually is. Revised shape:

  • 5A (new, priority) — Proxmox Backup Server, replacing the manual-only vzdump path
  • 5B (new, priority) — Beszel + PVE 9 notification system + ZED webhook, one stack covering host metrics / backup status / zpool events
  • 5C (de-scoped) — n8n as a lab/glue capability only; Ollama dropped (no GPU budget, no RAM headroom pre-4B, and the drift/ZFS/UPS workflows the original 5A called out are already covered elsewhere)
  • Re-scoped from "dropped": unattended-upgrades landed 2026-04-25 as a standalone auto_updates role (wraps hifis.toolkit.unattended_upgrades). Security-only origins across the fleet, Proxmox repo added on proxfold + pbs, hypervisor kernels blocklisted, manual reboot with a /var/run/reboot-required Discord nag on proxfold. See auto_updates role page. The rest of the old 5B (full vulnerability management stack) remains out of scope
  • Not adopting — Grafana / Prometheus. Beszel's built-in historical metrics cover the use case at a fraction of the operational footprint

5A. Proxmox Backup Server

Completed 2026-04-23

PBS live as CT 105 (192.168.1.246, Debian 13 privileged, features=nesting=1). Datastore nas-primary on NFSv3 from the TS-269L, prune daily 03:00 (keep 7d/4w/6m), verify sun 04:00, GC mon 04:00. PVE storage registered on proxfold; daily 02:00 all-guest backup job pbs-daily active. First full backup + manual verify ran clean same day — ~95 GiB across 5 guests, ~11 min wall clock. See the pbs role page for the as-built task list and the gotchas captured during execution.

Why: /etc/cron.d/vzdump was empty post-4C — every backup on the NAS was a manual push. PBS gives scheduled incremental + deduplicated backups with verify jobs, and the dedup ratio typically runs 5:1+ on VM images. The existing CIFS nasbackup path stays available for manual vzdump fallback during the transition.

Shape: PBS LXC on proxfold, datastore on an NFS mount from the QNAP TS-269L at 192.168.1.253. The NAS can't run PBS itself (Atom D2701, EOL QTS 4.3.4) but it does export NFS, which is materially better-behaved than CIFS as a PBS datastore backend — no uid/gid remapping, no reported .chunks inode-collision issues.

QNAP firmware end-of-life

QTS 4.3.4 is the final firmware for the TS-269L and has been unpatched since 2020. Accepted as part of Phase 5A scoping — not a blocker, but a ticking replacement clock. See Accepted Risks — QNAP TS-269L firmware EOL for the full treatment.

Architecture decisions (locked in 2026-04-23):

  • Datastore on NAS, not local — DR requires a second failure domain. A datastore on stash would die with the pool if stash fails. The NAS is the only meaningful second domain in this environment.
  • NFS, not CIFS — PBS datastore on CIFS works but is fragile (uid/gid remap, inode collisions on some NAS firmwares). NFSv3 on the QNAP is the well-trodden path.
  • Not on rpool — the whole point of the 4C boot-mirror design was isolating OS from data; backups belong on the data path.
  • PBS as LXC, not VM — standalone service, no Docker needed, fits the established LXC-for-standalone / VM-for-Docker pattern.

What landed:

  • NFS export on the QNAP: backup share already exported; added 192.168.1.246 to the NFS host-access ACL with NO_ROOT_SQUASH via QTS 4.3.4 UI (Control Panel → Shared Folders → backup → Edit NFS host access). QTS's UI doesn't expose async/secure toggles; it uses sane defaults internally.
  • PBS LXC — CT 105, Debian 13 (not 12 — PBS 4.x is trixie-only), privileged (avoids the uid 34 shift for NFS writes), 2 vCPU / 4 GB / 8 GB on local-zfs, 192.168.1.246, features=nesting=1 (required for systemd 257 in Debian 13 LXCs).
  • PBS install — deb822 repo, keyring fetched from enterprise.proxmox.com/debian/proxmox-release-trixie.gpg, proxmox-backup-server 4.x installed.
  • NFS mount + datastore: /mnt/pbs-datastore owned backup:backup, nfsvers=3,x-systemd.automount, datastore nas-primary created idempotently (gated on .chunks/ existence).
  • Schedules codified — prune every day at 03:00 (schedule written as plain HH:MM, because PBS rejects "daily HH:MM": daily is a systemd macro that can't combine with a time), verify sun 04:00, GC mon 04:00. All registered via a list-and-create idempotent pattern.
  • PBS API token — user pbs-pve@pbs + token pbs-pve@pbs!pve, secret + fingerprint in vault. Critical PBS 4.x gotcha: ACLs must be granted to the user auth-id, not the token auth-id — token ACLs silently resolve to zero perms. Also needs DatastoreAudit alongside DatastoreBackup so pvesm add pbs can verify the datastore.
  • PVE-side storage: pvesm add pbs nas-primary wired via roles/proxmox/tasks/pbs_client.yml (idempotent, gated on pvesm status -storage).
  • Backup job pbs-daily — daily 02:00 ACST, all guests, snapshot mode, target nas-primary, retention delegated to the PBS-side prune job.
  • First full backup + verify — 5 guests (CT 100/104/105 + VM 101/102), ~95 GiB on NAS, ~11 min wall clock, manual verify-job run returned TASK OK on all 5 snapshots.
  • Decommission old CIFS backup path: nasbackup remains registered; leave read-only for 30 days of PBS-only operation, then remove.
  • Discord notification
    • PVE 9 notification target → Discord webhook for backup job status (see 5B — the webhook is shared)

5B. Server notification stack

Completed 2026-04-24

Three notification paths live into the shared #homelab-ops Discord channel: Beszel (CT 106, 192.168.1.247) scraping 6 agents (proxfold/arrstack/nginx/plex/pbs/control), ZED webhook on proxfold for zpool events, and a PVE 9 webhook endpoint with match-all matcher for backup/replication/node events. Install delegated to upstream scripts for hub + agents. See the beszel role page and the zfs role page for the as-built tasks.

Why: Before 5B the only Discord signal from the homelab was the 3D drift heartbeat. No visibility into backup job results, zpool degradation, SMART health, temperature, or guest resource state. Fixed all of that in one stack.

Shape: three independent notification paths, all landing in a single #homelab-ops Discord channel (separate from #homelab-drift):

  • Beszel — host/container metrics, historical data, threshold alerts
  • PVE 9 built-in notification target — backup job results, replication, node up/down, update availability
  • ZED webhook — zpool state changes, scrub completion, vdev removal

Decisions (locked in 2026-04-23):

  • Beszel not Netdata / Prometheus / Grafana — ~10MB per agent vs 200-500MB for Netdata, SQLite historical store is enough for a 4-host homelab, no dashboard engineering needed
  • Beszel hub as dedicated LXC, not on arrstack — arrstack stays dedicated to the media compose; hub runs the native binary + systemd (no Docker, no nested-virt), matches the nut role shape
  • Beszel agents as systemd binaries, not Docker — one agent per host (proxfold, arrstack, nginx, plex, pbs, optionally control)
  • Uptime Kuma deferred — no external HTTP endpoints being tracked yet

What landed:

  • Beszel hub LXC — CT 106, Debian 13 (not 12 — systemd 257 in trixie), unprivileged, features=nesting=1, 1 vCPU / 1 GB / 4 GB on local-zfs, 192.168.1.247. Install delegated to get.beszel.dev/hub; cached installer at /root/install-beszel-hub.sh. Hub listens on plain HTTP :8090 (TLS terminates at future nginx reverse proxy).
  • Beszel agents — installer at get.beszel.dev runs with -k <hub-pubkey> -p 45876 to bake env into the systemd unit. Deployed to proxfold (with beszel_agent_enable_smart: true for smartmontools + disk-group SMART access), arrstack, nginx, plex, pbs, control. Agents registered manually in the hub UI.
  • Hub pubkey capture — derived via ssh-keygen -y -f /opt/beszel/beszel_data/id_ed25519 (Beszel doesn't write a separate .pub file); vaulted as vault_beszel_hub_pubkey.
  • Alert rules — left UI-owned per the Phase 5 scope reset decision; not codified in Ansible.
  • PVE 9 notification target — webhook endpoint discord-ops + match-all matcher ops-all, codified via roles/proxmox/tasks/notifications.yml. Body template uses Handlebars {{ escape ... }} with severity-based color conditionals, stored in roles/proxmox/files/discord-notification-body.json so Ansible's Jinja doesn't interfere. Critical gotcha: pvesh expects --body, --header value, and --secret value as base64-encoded strings (per the schema), not raw. Raw JSON silently stores but fails at delivery with "could not decode base64 value"; pass via | b64encode. Default mail-to-root matcher left intact — events fan out to both.
  • ZED webhook on proxfold: zed_webhook_enabled: true in proxfold host_vars triggers the zfs role's ZED tasks, which install curl, render /etc/zfs/zed.d/discord.sh, and symlink statechange-discord.sh, scrub_finish-discord.sh, resilver_finish-discord.sh onto it. Webhook URL comes from vault_discord_webhook_homelab_ops. Smoke-tested by invoking the script directly with ZEVENT_POOL=stash ZEVENT_SUBCLASS=statechange; embed delivered clean.
  • CT 104 (control) added to inventory — previously not an Ansible-managed host, now under lxc_containers with its own common + security + beszel_agent playbook. Adds the drift runner itself to observability.
  • WSL → plex SSH gap fixed — WSL pubkey appended to CT 100's authorized_keys so site.yml runs cleanly from either control or WSL. Pre-existing since the 4C rebuild.
  • Notification routing summary doc — deferred; content lives in the role pages for now and the header overview in changelog.
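
The base64 gotcha in the PVE 9 notification target item above is easy to reproduce outside Ansible. This is a minimal sketch of the encoding step only, assuming a POSIX shell with the base64 utility available; b64_field is a hypothetical helper, and the sketch deliberately leaves out the pvesh call itself.

```shell
#!/bin/sh
# Sketch: base64-encode a value on one line (no wrapping), the form the
# webhook --body / --header value / --secret value fields are said to expect.
b64_field() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

# Usage: encode the raw JSON template before handing it to pvesh.
#   body_b64=$(b64_field "$(cat roles/proxmox/files/discord-notification-body.json)")
b64_field 'hi'   # prints aGk=
```

Stripping newlines matters for larger templates, because base64 wraps its output at 76 columns by default and a wrapped value would be split across lines.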

5C. n8n — lab automation (de-scoped)

Completed 2026-04-28

n8n live as a Docker stack on a dedicated host VM. Reverse-proxied at n8n.rampancy.cloud. SQLite-backed (Postgres deferred until execution volume warrants). Captured by pbs-daily. See the n8n service page for the full picture and execution-time lessons.

Why: Lab/glue capability for cross-service workflows that don't belong in Ansible. Explicitly not replacing anything in 5A/5B/3D.

What changed from the original scope:

  • Pivoted from npm-on-LXC to Docker-on-VM mid-execution. Original plan was native Node.js install on an LXC. Modern n8n (v2.17+) bundles a heavy AI SDK ecosystem that OOMs a 4 GB LXC during install and won't compile against Debian trixie's apt-managed Node.js (isolated-vm ABI mismatch with Debian's V8 patches). Pivoted to the n8n team's recommended Docker path on a fresh VM, using Dockhand for git-managed deploys via the hawser agent on the host.
  • Ollama dropped (no GPU budget; lab-scope already covered by Anthropic API in workflows if/when needed).
  • Infrastructure workflows the original called out (ZFS health, drift detection, NUT) are now owned by ZED webhook (5B), the 3D heartbeat, and Beszel/NUT respectively. n8n is for user-facing glue only.

What landed:

  • n8n VM — Proxmox VM 108, Debian 13 trixie genericcloud qcow2, cloud-init seeded, 192.168.1.248, 2 vCPU / 8 GB / 32 GB on local-zfs
  • Roles applied: common, security, auto_updates, docker, beszel_agent, hawser
  • hawser role added — Dockhand's remote-host agent codified for the first time. Per-host vault token (vault_hawser_token_<host>), RW socket mount, named volume for stack cache, REQUEST_TIMEOUT=600s default. See hawser role page.
  • n8n stack: stacks/n8n/docker-compose.yml in the homelab-ansible repo, deployed via Dockhand → Hawser. SQLite under the n8n_data named volume. See n8n service page.
  • Reverse proxy: n8n.rampancy.cloud → 192.168.1.248:5678 via the nginx VM
  • Beszel — agent registered, host visible in the Beszel UI
  • PBS — first ad-hoc snapshot 2026-04-28 (1m 21s, 32 GiB transferred, 86% deduped); pbs-daily covers ongoing
  • Site-wide drift: --check --diff from CT 104 reported changed=0 failed=0 unreachable=0 across all 9 hosts post-deploy

Execution-time lessons (full detail in n8n service page lessons section):

  1. Modern n8n's npm install peaks past 4 GB RAM and won't compile against Debian's apt Node — Docker is the right path in 2026.
  2. Docker 29's default storage driver (containerd-snapshotter / overlayfs) breaks pulls on certain layer patterns — pinned to legacy overlay2 per-host via docker_daemon_config. The new default only applies to fresh Docker 29.x installs; arrstack + nginx upgraded across the 28→29 boundary so they kept their existing overlay2 and were never bitten.
  3. Hawser's default REQUEST_TIMEOUT=30s is too short for any real image pull — bumped to 600s as a role default. This was the actual blocker on multiple deploy attempts.
  4. Dockhand v1.0.18 (what was running on arrstack) has a separate timeout bug (Finsys/dockhand#587); upgraded to v1.0.27 during the deploy.

Deferred:

  • First workflows — left for user exploration per Phase 5C's lab-scope intent
  • Postgres companion — only if execution volume crosses ~5–10k/day or DB ~4–5 GiB
  • Webhook/OAuth env vars in compose — N8N_HOST / N8N_PROTOCOL / WEBHOOK_URL are commented out; uncomment when the first workflow needs public callback URLs
  • ~~nginx Hawser codification~~ — superseded by Phase 5D. nginx VM is retiring; new edge LXC runs Caddy (no Docker, no Hawser). Codification gap closed by the caddy role.
  • Internal wss:// for Hawser → Dockhand — currently plain ws:// on LAN; tighten when the internal-traffic TLS story lands

Phase 5D — Edge LXC + Caddy migration (NPM → Caddy)

Completed 2026-05-02

Caddy live on edge LXC (CT 107, 192.168.1.244) replacing hand-clicked NPM on VM 102. Single Let's Encrypt wildcard cert *.rampancy.cloud issued via DNS-01 against Cloudflare. Four hosts migrated. NPM container stopped — VM 102 in soak before destroy. See edge-cutover runbook.

Why: NPM was the only piece of homelab infra still managed by web UI rather than Git. Hand-clicked rules, no IaC story, upstream stalled (last release Feb 2025), CVEs in the LE-cert add flow. Caddy on a single static binary closes the codification gap, picks up automatic HTTPS renewal, and supports a single wildcard cert via DNS-01.

What landed:

  • edge LXC — CT 107, Debian 13 (trixie), unprivileged, features=nesting=1, 1 vCPU / 1 GB / 4 GB on local-zfs, 192.168.1.244 (mirrors CT 106 Beszel pattern)
  • caddy role — single binary built via xcaddy + caddy-dns/cloudflare, version-pinned in roles/caddy/files/caddy-2.11.2, deployed to /usr/local/bin/caddy, systemd unit with AmbientCapabilities=CAP_NET_BIND_SERVICE, Caddyfile templated from inventory, ACME DNS-01 against CF via vaulted scoped token. See caddy role page.
  • Roles applied: common, security, caddy, beszel_agent. No docker, no hawser — Caddy is config-as-code, not compose-as-code.
  • Four hosts migrated to Caddy: requests.rampancy.cloud (Overseerr), dash.rampancy.cloud (Beszel), n8n.rampancy.cloud (n8n), kosync.rampancy.cloud (korrosync). All return same response codes as before, served by the LE wildcard.
  • UDM port-forward swap — TCP 80/443 → 192.168.1.244 (was 192.168.1.249)
  • NPM container stopped on VM 102 (docker stop nginx_proxy_manager-app-1); VM left running through soak

Cloudflare orange-cloud — attempted then reverted:

  • CF SSL/TLS encryption mode set to Full (strict) at zone level
  • Four CNAMEs flipped to orange via API
  • ~~Reverted same day~~ — CF Universal SSL is disabled at the zone (no edge cert auto-issued for the new orange hostnames). Without a paid Advanced Certificate Manager order or Universal SSL re-enable, every TLS handshake to those hostnames at the CF edge failed (no SNI match). All four flipped back to gray; traffic resumed via direct WAN → UDM → Caddy. Edge security gap captured as an accepted risk until Phase 7D CrowdSec lands.

Execution-time lessons (full detail in edge-cutover runbook lessons):

  1. Caddy's upstream systemd unit ships with the --environ flag, which prints all env vars (including secret tokens) to journalctl on startup. Removed in our rendered unit.
  2. caddy validate doesn't see EnvironmentFile — ad-hoc validation needs the env var passed via Ansible's environment: task param. systemd's EnvironmentFile= only applies to ExecStart.
  3. Caddyfile canonical fmt uses tabs — gofmt-style. Template indented with tabs to match.
  4. CF token scopes are granular — Zone:Read + DNS:Edit are needed for ACME DNS-01; Zone Settings:Read is a separate scope (token cannot inspect SSL mode without it).
  5. CF Universal SSL provisioning is per-hostname, not zone-wide — flipping orange on a hostname without an existing edge cert results in immediate handshake failure. Always verify Universal SSL is enabled before flipping orange.
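
Lesson 2 can be reproduced outside Ansible. The sketch below assumes a plain KEY=VALUE environment file (no quoting or line continuations) and hypothetical file paths supplied by the caller; it reads the same file systemd would load and injects it into an ad-hoc `caddy validate` run.

```python
import os
import subprocess


def parse_env_file(text: str) -> dict:
    """Parse simple KEY=VALUE lines; comments and blank lines are skipped."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip()
    return env


def validate_caddyfile(caddyfile: str, env_file: str) -> int:
    """Run `caddy validate` with the EnvironmentFile's variables injected."""
    with open(env_file, encoding="utf-8") as fh:
        extra = parse_env_file(fh.read())
    result = subprocess.run(
        ["caddy", "validate", "--config", caddyfile, "--adapter", "caddyfile"],
        env={**os.environ, **extra},  # systemd would only do this for ExecStart
        check=False,
    )
    return result.returncode
```

The Ansible equivalent is the task-level environment: parameter mentioned in the lesson; the Python version is only useful for one-off checks from a shell.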

Deferred:

  • VM 102 decom — done 2026-05-03. Stopped, pre-decom vzdump on nasbackup (1.65 GB compressed), qm destroy 102 --purge clean. Ansible inventory retired (host_vars/nginx.yml, playbooks/nginx.yml, stacks/nginx/ removed).
  • CrowdSec on edge — covered separately as Phase 7D
  • Caddy access logs to file — currently stdout/journalctl only; add log directive when CrowdSec phase lands (bouncer reads structured access logs from a file)
  • trusted_proxies config — only relevant if/when CF orange is re-enabled (Universal SSL would need to land first, or per-host Advanced Certificate Manager)

Phase 5E — Host-level file backup (proxfold → PBS)

Completed 2026-05-06

Daily file-level backup of proxfold host config (/etc, /root, /var/lib/pve-cluster) to PBS namespace host/proxfold via proxmox-backup-client + systemd timer. First snapshot 7.4 MB in 0.42s; idempotent re-run reports changed=0. Three execution-time bugs caught (CLI --output-format rejection, missing namespace subcommand, ACL-on-token bug); the third revealed a latent bug in the existing pbs role which was patched in the same cycle. Full lessons in backup-restore runbook.

Why: pbs-daily (Phase 5A) backs up guests via vzdump; the PVE host itself isn't included. A full-loss scenario not covered by the 4C ZFS boot mirror (config corruption, fat-fingered rm, full reinstall) currently means rebuilding from arrstack docs. Host configs are sub-MB, so chunked + dedup'd backups are negligible cost.

Out of scope (deliberate):

  • Off-host replication of the PBS datastore. Single-failure-domain on the QNAP TS-269L is documented and accepted — no viable second target exists in the current environment. Not re-evaluated here.
  • Host backups for other PVE hosts. Single-host fleet; design uses a host/<hostname> namespace so a second host could opt in cleanly later.

Architecture decisions (locked in 2026-05-06):

  • Extend the proxmox role. New tasks file host_backup.yml and four templates (env, wrapper script, .service, .timer) — no new role, mirrors the existing pbs_client.yml split.
  • Separate PBS user + token. New host-backup@pbs!proxfold distinct from the guest-backup token, blast-radius separation. Same fingerprint reused.
  • Namespace host/proxfold. Cleanly separates from VM/CT snapshot listings.
  • Schedule 02:30. Vzdump kickoff is 02:00, recent runs complete in ~5 min (PVE task log 2026-05). 02:30 has >20 min headroom and stays ahead of the 03:00 PBS prune.
  • Crypt-mode none. Chunks live on a private NAS already; client-side encryption adds a key-loss footgun for negligible gain on host config files.
  • Retention — namespace-scoped prune-job, longer than the global pbs-daily retention since the dataset is tiny and per-snapshot deduplication is high. Patterned after the Proxmox PBS docs prune example, tuned for homelab: keep-daily 14, keep-weekly 8, keep-monthly 12, keep-yearly 2.
  • Manual one-shot bootstrap on PBS. Mirrors the Phase 5A token-gen pattern. Adding multi-user / multi-namespace abstraction to the pbs role for one extra user wasn't worth it. Procedure codified in the backup-restore runbook.
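
The retention settings above can be sanity-checked with a small simulation. The sketch below is a simplified model of bucket-based keep rules (newest snapshot per day, ISO week, month, and year, applied in that order), not PBS's actual prune implementation; the function name and the three-year daily history are assumptions for illustration.

```python
from datetime import date, timedelta


def prune(snapshots, keep_daily, keep_weekly, keep_monthly, keep_yearly):
    """Return the set of snapshot dates to keep (simplified bucket model)."""
    ordered = sorted(snapshots, reverse=True)  # newest first
    rules = [
        (keep_daily, lambda d: d.isoformat()),
        (keep_weekly, lambda d: tuple(d.isocalendar())[:2]),  # (ISO year, ISO week)
        (keep_monthly, lambda d: (d.year, d.month)),
        (keep_yearly, lambda d: d.year),
    ]
    kept, removed = set(), set()
    for limit, bucket in rules:
        # Buckets already represented by an earlier rule are not counted again.
        covered = {bucket(d) for d in kept}
        chosen = set()
        for d in ordered:
            if d in kept or d in removed:
                continue
            b = bucket(d)
            if b in covered:
                continue
            if b in chosen:
                removed.add(d)  # older snapshot in a bucket that already has a keeper
                continue
            if len(chosen) >= limit:
                break
            chosen.add(b)
            kept.add(d)
    return kept


# Hypothetical history: one snapshot per day for three years.
history = [date(2026, 5, 6) - timedelta(days=i) for i in range(3 * 365)]
kept = prune(history, keep_daily=14, keep_weekly=8, keep_monthly=12, keep_yearly=2)
print(len(kept), min(kept), max(kept))
```

With one snapshot per day the kept set can never exceed 14 + 8 + 12 + 2 = 36 snapshots, which bounds the namespace's steady-state size.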

Code landed:

  • roles/proxmox/tasks/host_backup.yml — installs client + creds + script + unit + timer
  • roles/proxmox/templates/pbs-host-backup.{env,sh,service,timer}.j2
  • roles/proxmox/tasks/main.yml — gated include (pbs_host_backup is defined and vault_pbs_host_token_secret is defined)
  • roles/proxmox/defaults/main.yml — example block + bootstrap pointer
  • inventory/host_vars/proxfold.yml: pbs_host_backup block

Bootstrap executed:

  • PBS-side one-shot on CT 105 — host-backup@pbs user + host-backup@pbs!proxfold token + host/proxfold namespace + DatastoreBackup ACL on both auth-ids + namespace-scoped prune-job (14d/8w/12m/2y, 03:15 daily). Token captured to PBS tmpfs, SCP'd to control tmpfs, shredded at both ends.
  • Vault append: vault_pbs_host_token_secret written via decrypt-to-tmpfs / append / re-encrypt pattern; round-trip verified, source token file shredded.
  • First playbook run: --limit proxfold --tags host_backup reported changed=5 (env file, script, service, timer, timer enable).
  • Smoke test: systemctl start pbs-host-backup.service succeeded; snapshot host/proxfold/2026-05-06T02:01:48Z (7.4 MB, 3 pxar archives) registered on PBS.
  • Idempotency check — full site.yml --check --diff --limit proxfold reports ok=84 changed=0 failed=0.

Open follow-up:

  • Patch the pbs role to grant ACL on both user AND token auth-ids (closed 2026-05-06, same cycle). New Ensure datastore ACLs for PBS client TOKEN auth-id task in roles/pbs/tasks/main.yml mirrors the existing user-grant loop, gated on vault_pbs_token_id is defined. New default pbs_client_token_name: "pve". Verified idempotent against live PBS (manual 5A grant matches what the task produces — changed=0 on apply). A fresh PBS rebuild via the role no longer needs the manual second grant.

Phase 6 — New services (weeks 9+)

Greenfield — no dependencies between items. Work on whichever is most interesting at the time.

6A. Forgejo (self-hosted git, GitHub-mirrored)

Stand up a dedicated LXC running Forgejo, front it with the existing Caddy edge, migrate four GitHub repos with push-mirrors back to GitHub. Forgejo becomes source of truth; GitHub stays as private mirror so external Actions (e.g. meat-helmet's scheduled cron) keep firing.

Sub-stages 6A.1 (LXC stand-up), 6A.2 (edge integration + CF DNS token rotation), 6A.3 (repo migrations + push mirrors), 6A.4 (close-out + docs sweep). Full procedure in the Forgejo Setup runbook.

Phase 6A complete — 2026-05-05

All four sub-stages executed across two days. Forgejo 11.0.13 live at git.rampancy.cloud, all 4 repos imported with sync_on_commit push-mirrors to GitHub, local origins flipped, CF DNS token rotation closed the 7D leak follow-up. End-to-end mirror chain validated by the closing docs commit (Forgejo push → GitHub mirror within ~5s).

  • 6A.1 — CT 109 created on rpool; forgejo-sqlite installed via Codeberg APT; service running on LAN at :3000 (executed 2026-05-04)
  • 6A.2: git.rampancy.cloud live behind Caddy on CT 107; CF DNS token Rolled + API-validated (closes 7D follow-up); Forgejo ROOT_URL flipped to https + DISABLE_SSH=true (executed 2026-05-05)
  • 6A.3 — arrstack, homelab-ansible, mediabot, meat-helmet imported with full history (Forgejo migrate API, 2026-05-05); per-repo push-mirror live (fine-grained PATs, sync_on_commit: true); local remotes flipped on WSL with github retained as fallback
  • 6A.4 — Phase 6A close-out (2026-05-05): docs sweep across role/service/runbook + homelab-ansible README refresh; Discord push-event webhook per repo deferred as low-signal for single-operator setup (procedure preserved in runbook for future collaboration trigger)

6B. Home Assistant

Stand up HAOS as a sealed Proxmox VM, integrate the Hue V2 bridge + Tapo P110M (via Matter, LAN-local) + Bambu A1 Mini (via HACS + ha-bambulab), front via Caddy edge at home.rampancy.cloud, build dashboard + automations.

Sub-stages 6B.1 (VM stand-up), 6B.2 (core integrations), 6B.3 (edge integration), 6B.4 (dashboard + automations + close-out). Full procedure in the Home Assistant Setup runbook.

HAOS is opaque to the homelab Ansible baseline

HAOS is a sealed Buildroot appliance — no SSH, no apt, no common / security / auto_updates / beszel_agent. Drift detection is blind to it; the inventory entry exists only as a documentation anchor and to drive the Caddy edge route. Compensating controls: PBS daily VM-layer backup (more comprehensive than HAOS's native backup), HA Supervisor's own update mechanism for HA Core + add-ons, HA's native Discord integration into #homelab-ops. To be recorded as an accepted risk at close-out. Wazuh agent on HAOS is deferred to Phase 7B — BeardedTinker's HAOS rule pack works API-side without an in-VM agent.

  • 6B.1 — VM stand-up (2026-05-23): HAOS 17.3 (haos_ova-17.3.qcow2.xz, release 2026-05-06 — pin bumped from scaffold's 17.2 since 17.3 was current on the day of execution) imported to local-zfs as VM 110 (hass, 192.168.1.241). Q35 / OVMF / pre-enrolled-keys=0 (the boot-loop gotcha) / EFI disk on local-zfs / virtio-scsi-single / 2 vCPU / 4 GB / 32 GB. First boot via HA onboarding wizard at http://192.168.1.241:8123; static IP set via Settings → System → Network. Boot-to-HTTP-200 was ~90 s (not the scaffold's ~5 min). PBS daily pickup at 02:00 ACST 2026-05-24 — spot-check pending.
  • 6B.2 — Core integrations: HACS bootstrap via the Studio Code Server add-on shell, accepted as a managed dependency. Tapo P110M via tplink integration primary, Matter as fallback (revised from scaffold's Matter-primary — community + upstream HA-core issues show P110M Matter pairing flaky and energy-endpoint enumeration incomplete; python-kasa now handles KLAP locally without cloud creds). Hue V2 bridge auto-discovered → push event stream confirmed. ha-bambulab installed via HACS; A1 Mini flipped to LAN Mode + Developer Mode to permit MQTT writes under firmware ≥ 01.05. Deferred — user driving hands-on.
  • 6B.3 — Edge integration (2026-05-23): home.rampancy.cloud vhost added to host_vars/edge.yml (homelab-ansible commit b19f662); CrowdSec coverage automatic via the existing wildcard handler. HA's http: block in configuration.yaml set with use_x_forwarded_for: true + trusted_proxies: 192.168.1.244. Gotcha: after editing configuration.yaml, verify HA Core has actually restarted (docker inspect homeassistant --format '{{.State.StartedAt}}' newer than config mtime) before applying Caddy — otherwise the trusted-proxies block stays unloaded and every X-Forwarded-For-bearing request gets HTTP 400, easy to misread as edge config error. End-to-end validated from cellular.
  • 6B.4 — Dashboard + automations + close-out: print-completion → #homelab-ops Discord webhook (rest notify platform with unified data: block — drop scaffold's deprecated data_template: split syntax), sunset → Hue lights blueprint, P110M energy draw surfaced on the printer dashboard tile. Remaining docs sweep: hosts/hass/index.md, services/home-assistant.md, mkdocs.yml nav, reference/accepted-risks.md HAOS-opacity entry.
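
The restart gotcha in 6B.3 comes down to comparing two timestamps. A minimal sketch, assuming the StartedAt value is a UTC timestamp ending in Z (as docker inspect normally prints) and that the config mtime is available as a POSIX timestamp; the function names are illustrative, not part of any HA or Docker API.

```python
from datetime import datetime, timezone


def parse_docker_timestamp(value: str) -> datetime:
    """Parse a UTC RFC 3339 timestamp, truncating nanoseconds to microseconds."""
    value = value.rstrip("Z")
    if "." in value:
        head, frac = value.split(".", 1)
        value = f"{head}.{frac[:6]}"
        fmt = "%Y-%m-%dT%H:%M:%S.%f"
    else:
        fmt = "%Y-%m-%dT%H:%M:%S"
    return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)


def restarted_after_edit(started_at: str, config_mtime: float) -> bool:
    """True if the container started after configuration.yaml was last modified."""
    edited = datetime.fromtimestamp(config_mtime, tz=timezone.utc)
    return parse_docker_timestamp(started_at) > edited
```

In practice started_at would come from the docker inspect command quoted above and config_mtime from os.path.getmtime on configuration.yaml; a False result means the trusted_proxies block is not loaded yet.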

6C. Obico (3D print failure detection)

  • On the A1 Mini, enable LAN Mode (Settings → Network) and then Developer Mode — these are two separate toggles; both are required for third-party integrations
  • Connect a USB webcam for AI detection (A1 Mini built-in camera stream is not suitable for Obico's failure detection model)
  • Deploy self-hosted Obico server in Docker
    • Obico now supports direct Bambu connection without OctoPrint as middleware — use the native Bambu integration in recent Obico releases
  • Configure Discord/Telegram notifications

6D. Music acquisition pipeline

Phase 6D complete — 2026-05-16

Lidarr (hotio plugins branch image) + Tubifarry plugin + slskd (via gluetun) + beets landed on arrstack VM 101. PlexAmp/Plex stay as the streaming layer (no Navidrome — explicit decision; pipeline scoped to acquisition only). Music library shared back read-only on Soulseek (1,905 dirs / 15,986 files) with 5-slot / 1 MB/s upload cap, polite per-peer throttle (3 grabs/day per peer, max 5 queued/peer, min 5 files for "real album" filter). Smoke test artist (Gotye) downloaded Making Mirrors Deluxe (22 FLACs) end-to-end. The bulk-import-from-root-folder auto-triggered when the root folder was added — 877 albums / 10,691 tracks registered in one pass. Lessons in music-acquisition-bringup runbook.

  • Music library was already on ZFS at /stash/rodneystash/Music (Plex Tunes indexes it via the /mnt/plex/Music symlink managed by the plex role)
  • Lidarr on ghcr.io/hotio/lidarr:pr-plugins (mainline has no plugin system) + Tubifarry plugin installed via Lidarr UI from https://github.com/TypNull/Tubifarry
  • slskd 0.25 routed through existing gluetun container (network_mode: service:gluetun), native VPN integration via SLSKD_VPN=true + SLSKD_VPN_GLUETUN_* env vars
  • Gluetun control API auth wired (X-API-Key role for slskd, config TOML at /opt/mediaserver/gluetun/auth/config.toml)
  • Soulseek citizenship: shared /music (read-only) + /downloads, 5 upload slots, 1 MB/s cap, polite profile blurb signalling automated library
  • beets installed (on-demand sanity tagger, no Lidarr-pipeline integration)
  • Lidarr Plex Connect notification wired via Plex.tv OAuth (auto-rescan on import)
  • Existing 612-artist library imported (877 albums, 10,691 tracks) — auto-triggered on root folder add
  • Gotye Making Mirrors (Deluxe Edition, 22 FLACs) downloaded + imported + visible in Plex Tunes

6E. Matrix server

Goal: Stand up a closed, federated Matrix homeserver that replicates Discord's day-to-day chat experience — text rooms, threads, spaces, and group voice/video — for me plus a small circle of family and friends. Element X / Element Web are the clients. Mobile push, Discord bridging, and OIDC SSO are explicitly deferred.

Scoped 2026-05-21 — not started

Scoping replaces the original four-line stub ("Synapse + PostgreSQL + Caddy via Docker Compose"). Two pivots vs the original: Synapse → Tuwunel (Rust, embedded RocksDB, much lighter at idle) and LXC → VM (spantaleev's playbook explicitly warns against LXC). See the Matrix setup runbook for the execution checklist.

Architecture — Docker stack on a new VM, fronted by the existing edge Caddy + CrowdSec:

Component | Purpose | Notes
--- | --- | ---
Tuwunel | Homeserver (text / rooms / spaces / threads / E2EE / federation) | Embedded RocksDB — no Postgres. Official conduwuit successor per the project's own README.
LiveKit Server | WebRTC SFU for group voice/video calls (MatrixRTC) | Single binary. Default media UDP range trimmed via ICE/UDP mux on 7882.
lk-jwt-service | Issues short-lived JWTs that authenticate Matrix clients into LiveKit | Tiny Go binary.
Traefik (bundled) | Internal HTTP router for the playbook's containers | Bound to 0.0.0.0:81 on VM 111; external Caddy on CT 107 terminates TLS.
matrix-static-files | Serves .well-known/matrix/* from matrix.rampancy.cloud | Apex rampancy.cloud well-known files come from Caddy directly.

Deployment approach — spantaleev's matrix-docker-ansible-deploy vendored separately on CT 104 at ~/matrix-deploy/, parallel to homelab-ansible/. The integration surface (well-known apex, federation routing, MatrixRTC ↔ LiveKit JWT signing, WebSocket forwarding for SFU) is large enough that re-deriving it in a custom role would cost more than the external-playbook tax. Drift detection skips the matrix VM — config drift is checked manually via the playbook's --check mode when wanted, not via the nightly drift cron.

Cross-phase decisions:

  • Homeserver: Tuwunel (over Synapse). Rust, ~512 MiB at rest, embedded RocksDB, no Postgres dependency. Synapse is more featureful but eats 2–4 GiB and brings Postgres in tow — wrong fit until 4B RAM upgrade lands.
  • VM, not LXC. Docker-in-LXC works elsewhere (arrstack VM 101 is technically a VM for the same reason); MatrixRTC's UDP networking + multiple-container fan-out + AppArmor footprint are even less LXC-friendly than the arrstack pattern.
  • Federation via well-known delegation, not port 8448. Apex rampancy.cloud serves /.well-known/matrix/server pointing federation to matrix.rampancy.cloud:443. Keeps all inbound traffic on the existing edge Caddy + CrowdSec path; no new TCP port-forward at the UDM.
  • LiveKit UDP is unavoidable. First service in the homelab that needs router-level UDP forwarding. Ports: 7881/tcp + 7882/udp (ICE) + 3479/udp + 5350/tcp (TURN) + 30000-30020/udp (TURN relay range). Forwarded straight to VM 111 — bypassing the edge LXC, no Caddy/CrowdSec coverage on those ports.
  • Audience: closed. matrix_tuwunel_config_allow_registration: false plus a vault_matrix_registration_token for invite-style signups. Federation is enabled, so my rooms can include matrix.org users; my server just doesn't accept walk-ins.
  • Element Call frontend skipped. Element X (mobile) and Element Web both embed the RTC widget internally. Standalone call.rampancy.cloud deployment is unnecessary unless we want a clickable web-call landing page later.
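
The well-known delegation decision is small enough to show as data. This sketch builds the JSON body that the apex would serve at /.well-known/matrix/server, using the m.server key from the Matrix server discovery spec; the helper name is an assumption, and how Caddy actually serves the file is defined in the caddy role template, not here.

```python
import json


def well_known_server(delegate_host: str, port: int = 443) -> str:
    """Body for /.well-known/matrix/server (federation delegation)."""
    return json.dumps({"m.server": f"{delegate_host}:{port}"})


print(well_known_server("matrix.rampancy.cloud"))
```

Because the delegated port is 443, remote homeservers reach federation over the same edge path as client traffic, which is why no 8448 forward is needed.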

Resource footprint (VM 111, sized to match arrstack/n8n VM pattern):

Slice | Target
--- | ---
RAM | 8 GiB (Tuwunel ~2, LiveKit ~1, everything else <1, OS+Docker ~1, headroom ~3)
Disk | 32 GiB on rpool (RocksDB growth is the unknown — Beszel disk alert will catch it)
vCPU | 4
IP | 192.168.1.243

Completed 2026-05-22

All five sub-phases closed 2026-05-22. Tuwunel v1.7.0 on VM 111, federation green via apex well-known, MatrixRTC live with Element Call validated end-to-end (audio + video + screen-share). Two headline gotchas in the runbook's Lessons appendix: matrix_tuwunel_config_allowed_remote_server_names filtering events from our OWN server (text/federation phase) and the apex well-known missing org.matrix.msc4143.rtc_foci (RTC phase). Both fixes are now baked into the Caddy template and vars.yml respectively.

Execution checklist (actual):

  • 6E.1 — VM 111 stand-up: Debian 13 genericcloud image, 4 vCPU / 8 GiB / 32 GiB at 192.168.1.243; common + security + beszel_agent applied (docker role intentionally dropped from playbooks/matrix.yml post-execution — spantaleev owns Docker here, see Lessons). Completed 2026-05-21.
  • 6E.2 — spantaleev playbook bootstrap on CT 104: cloned at commit 9bd9d1a to /root/matrix-deploy/, inventory scaffolded for Tuwunel, vault-bridge symlink set up. matrix_tuwunel_version: v1.7.0 pinned (v1.6.2 has a token+password registration regression). ensure-matrix-users-created is Synapse-only — Tuwunel's grant_admin_to_first_user handles first admin via client-side registration token instead. just substituted with direct ansible-galaxy install -r requirements.yml -p roles/galaxy/ --force (no just on Debian bookworm). Completed 2026-05-22.
  • 6E.3 — Edge integration: caddy role extended with caddy_matrix_enabled gate + caddy_matrix_upstream (single upstream — federation collapsed onto web entrypoint via matrix_federation_public_port: 443). Template adds apex rampancy.cloud block (well-known statics) and matrix.rampancy.cloud block (federation path skips CrowdSec). Apex LE cert issued via existing CF DNS-01. Completed 2026-05-22.
  • 6E.4 — MatrixRTC UDM port-forwards: 5 forwards configured via UDM UI (matrix-rtc-ice-tcp 7881/tcp, matrix-rtc-ice-udp-mux 7882/udp, matrix-rtc-turn-udp 3479/udp, matrix-rtc-turn-tcp 5350/tcp, matrix-rtc-turn-relay 30000-30020/udp → 192.168.1.243). Apex .well-known/matrix/client extended to advertise org.matrix.msc4143.rtc_foci (Element Call queries the apex well-known and doesn't fall through to the matrix subdomain — see Lessons). Element Call validated end-to-end: desktop ↔ Element X mobile on cellular, audio + video + screen-share both directions. Completed 2026-05-22.
  • 6E.5 — First user + smoke: @rampancy:rampancy.cloud registered via Element Web's token-gated registration, auto-promoted to admin by Tuwunel, joined the auto-created #admins:rampancy.cloud room with @conduit:rampancy.cloud bot. Federation tester green. Completed 2026-05-22.
  • 6E.6 — Docs sweep: this roadmap entry flipped; matrix-setup runbook rewritten with actual execution + 9-item Lessons appendix; changelog entry below; hosts/proxfold guests update + services/matrix.md page pending (separate one-off cleanups, no functional block).
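
The rtc_foci fix from 6E.4 amounts to one extra key in the apex client well-known. The sketch below shows the general shape as commonly documented for Element Call (a list of foci with type livekit and a livekit_service_url); both URLs passed in the example are placeholders, and the real values live in the Caddy template, so treat the field layout as something to verify against the runbook.

```python
import json


def well_known_client(base_url: str, livekit_service_url: str) -> str:
    """Body for /.well-known/matrix/client including the MatrixRTC focus."""
    doc = {
        "m.homeserver": {"base_url": base_url},
        "org.matrix.msc4143.rtc_foci": [
            {"type": "livekit", "livekit_service_url": livekit_service_url}
        ],
    }
    return json.dumps(doc, indent=2)


# Placeholder URLs for illustration only.
print(well_known_client("https://matrix.example.org", "https://rtc.example.org/jwt"))
```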

Deferred / future sub-phases (not committed; reassess after 6E lands):

  • 6E.7 — Mobile push: Sygnal + FCM keys OR self-hosted UnifiedPush. Adds a Google dependency or a third small service. Skip if Element Web is good enough day-to-day.
  • 6E.8 — Discord bridge: mautrix-discord. Adds Postgres 16+ (Tuwunel doesn't need it but the bridge does) and a per-user puppeting workflow. Worth doing only if there's an active Discord community to keep bridging into.
  • 6E.9 — Pocket-ID OIDC integration — bundles cleanly into Phase 7E.

6F. Music recommendations / discovery

Goal: Add Spotify-style Discover Weekly / Daily Jams on top of the existing PlexAmp listening experience. Recommendations come from community listening data (ListenBrainz), missing tracks dispatch through the 6D acquisition pipeline, and the result lands as Plex playlists.

Scoped, not started — reference only

Scoped from a research session on 2026-05-16 immediately after 6D close-out; reassess at execution time. No work begun.

Architecture — four small Docker services, all fitting on arrstack VM 101 alongside the existing 6D stack:

Layer | Purpose | Pick
--- | --- | ---
Scrobble source | Captures what's been played | Plex / PlexAmp native scrobble webhook
Scrobble bridge | Forwards Plex listens to ListenBrainz (Plex doesn't speak LB natively) | RustyRin/Plex_Scrobble_App or FoxxMD/multi-scrobbler
Recommendation engine | Generates Discover Weekly / Daily Jams / similar-artists from listening history | ListenBrainz public instance (MetaBrainz)
Orchestrator | Pulls LB recs → resolves against local library → dispatches missing tracks via slskd → publishes a Plex playlist | LumePart/Explo

Pipeline:

PlexAmp listening
   ↓ Plex webhook
Scrobble bridge (Docker, arrstack VM 101)
   ↓ forwarded listens (ListenBrainz API)
ListenBrainz (public cloud)
   ↓ recommendations API
Explo (Docker, arrstack VM 101)
   ├── checks Plex/local library
   ├── missing → dispatches to slskd (already built in 6D)
   └── publishes Discover Weekly / Daily Jams as Plex playlists
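
The orchestrator step in the pipeline above reduces to a set comparison between recommendations and the local library. This is an illustrative sketch only, not Explo's code; the Track type, function name, and sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Track:
    artist: str
    title: str


def plan_playlist(recommended, library):
    """Split recommendations into tracks already owned and tracks to acquire."""
    owned = [t for t in recommended if t in library]
    missing = [t for t in recommended if t not in library]
    return owned, missing


# Hypothetical data: one track already in the library, one not.
library = {Track("Gotye", "Eyes Wide Open")}
recs = [Track("Gotye", "Eyes Wide Open"), Track("Example Artist", "Example Song")]
owned, missing = plan_playlist(recs, library)
print(len(owned), len(missing))  # 1 1
```

In the real pipeline the missing list is what would be handed to the acquisition path, and the playlist is published once those tracks have landed in the library.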

Execution checklist (when picking this up):

  • Stand up scrobble bridge (Plex_Scrobble_App or multi-scrobbler) as a Docker service on arrstack VM 101; wire Plex webhook in
  • Register ListenBrainz account on the public instance, wire the scrobbler → LB API
  • Accumulate ≥ 2 weeks of scrobble history before standing up Explo (recs are cold-start-sensitive)
  • Stand up Explo on arrstack VM 101; wire LB API → Plex library lookup → slskd dispatch → Plex playlist publish
  • Decide how Explo-grabbed-via-slskd items get tagged/imported: round-trip through Lidarr (consistent with 6D) or direct-land in the Plex library (faster but bypasses Tubifarry's import path)
  • Smoke test: confirm Discover Weekly / Daily Jams playlists appear in Plex with plausible picks based on actual recent listening
  • New services/listenbrainz.md + services/explo.md pages on close-out; services/arrstack.md services list updated; mkdocs.yml nav entries added

Open decisions / risks:

  • ListenBrainz public vs self-host. Public instance is free and rec quality benefits from cross-user collaborative filtering (self-hosting with N=1 dataset hurts the model). Default to public. Privacy posture: listening history is visible to anyone who looks (Last.fm-class).
  • Cold-start. First 2–3 weeks of recs will be weird/generic until LB has enough history. Mitigation: stand up scrobbling well before Explo so the dataset is already accumulating when the orchestrator lands.
  • Explo is single-maintainer (LumePart). Active per commit history in 2026 but smaller than the *arr stack — maintenance risk worth knowing; have a fallback in mind (manual LB rec → Lidarr import).
  • Plex scrobble webhook is reliable but not perfect — occasionally misses listens during network blips. Not a deal-breaker for recs.
  • No mood/activity-aware curation (Spotify "Focus" / "Late Night"). That's where the streaming-service experience still wins; explicitly out of scope for 6F.
  • Caveat captured but not validated: assumes Explo speaks slskd directly. Verify against current Explo README at execution time — if it only writes Lidarr Wanted entries, the slskd dispatch happens via Tubifarry as a side effect rather than directly.

Housemate Proxmox access

Trusted-LAN scope: VMs land on vmbr0/VLAN 1 today, with Proxmox-side controls (RBAC, ZFS quota, scoped storage) as the boundary. L2 isolation onto a dedicated Lab VLAN is captured separately as Phase 8 — Network segmentation.

Trust posture: housemate is a known person, written guidelines on resource limits + risky operations, no hostile-tenant assumptions.

See the Housemate Access runbook for the full ACL set + execution order.

  • Enable UDM Network API token (or Local Admin user fallback) for read-only config access
  • Create PVE realm user hazel@pve with TOTP enforced
  • Create group housemate-lab-admins and ZFS dataset stash/housemate-vms (quota=500G)
  • Create PVE storage housemate-zfs (ZFS plugin, content images,rootdir) and resource pool housemate-lab
  • Apply group ACLs: pool → PVEVMAdmin, storage → PVEDatastoreUser, /sdn/zones/localnetwork/vmbr0 → PVESDNUser
  • PBS coverage: existing pbs-daily job picks up Hazel's pool VMs automatically (root namespace); operator-mediated restore. Self-service PBS UI for Hazel is deliberately deferred — see runbook step 6 future-enhancement block
  • Written guidelines for Hazel: resource caps, snapshot discipline, ping-before-passthrough
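
The group ACL mapping in the checklist can be written down as data and rendered into commands for review. A rough sketch, assuming the usual PVE ACL path layout (/pool/<name>, /storage/<name>) and the pveum acl modify syntax; the Housemate Access runbook remains the authoritative source for the exact ACL set.

```python
GROUP = "housemate-lab-admins"

# Path -> role, taken from the checklist; pool and storage paths are assumed
# to follow PVE's /pool/<name> and /storage/<name> convention.
ACLS = {
    "/pool/housemate-lab": "PVEVMAdmin",
    "/storage/housemate-zfs": "PVEDatastoreUser",
    "/sdn/zones/localnetwork/vmbr0": "PVESDNUser",
}


def acl_commands(group, acls):
    """Render one `pveum acl modify` command per ACL entry (for review, not execution)."""
    return [
        f"pveum acl modify {path} --groups {group} --roles {role}"
        for path, role in acls.items()
    ]


for command in acl_commands(GROUP, ACLS):
    print(command)
```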

Phase 7 — Security stack (weeks 10+)

Goal: Add SIEM / endpoint detection coverage and close the edge-security gap left by Phase 5D. Drift detection, Beszel, and PBS cover availability; nothing today covers intent. Phase 7 closes that gap with Wazuh as the centrepiece, network IDS via Suricata, edge IPS via CrowdSec, and a lightweight identity provider (Pocket-ID) for OIDC-native admin UIs.

Pre-requisite: Phase 4B

Wazuh's all-in-one server wants 8–16 GB on a dedicated VM. Current 48 GB physical leaves no comfortable headroom once arrstack/n8n/PBS/Beszel/control/plex/edge are accounted for. 4B's 8× 32 GB / 256 GB target is sized partly to unblock this phase, with comfortable ARC headroom + future expansion room (4 slots free). Without 4B done first, Phase 7 squeezes everything else and hits its retention/heap ceilings within months.

flowchart LR
    A[7A: Wazuh AIO + first agents] --> B[7B: Agent fleet rollout]
    B --> C[7C: Suricata NIDS + integration]
    B --> D[7D: CrowdSec edge]
    B --> E[7E: Pocket-ID identity + SSO]

    style A fill:#FAECE7,stroke:#993C1D,color:#712B13
    style B fill:#FAECE7,stroke:#993C1D,color:#712B13
    style C fill:#EEEDFE,stroke:#534AB7,color:#3C3489
    style D fill:#EEEDFE,stroke:#534AB7,color:#3C3489
    style E fill:#EEEDFE,stroke:#534AB7,color:#3C3489

7D + 7E ship together

CrowdSec covers edge protection (drop scanners before they reach app login). Pocket-ID covers identity (single sign-on for OIDC-native admin UIs). They're complementary, share an "edge hardening" theme, and are scoped to land in one cycle. 7E does not depend on Phase 4B's RAM upgrade — Pocket-ID's footprint (~50 MB RAM, single Go binary) is trivial.

7A. Wazuh AIO + Ansible scaffolding + first agents

Why: Stand up Wazuh manager + indexer + dashboard on a single VM. Pin to Wazuh 4.14.x — current stable is 4.14.5 (release notes, 2026-04-23). Wazuh 5.0 beta1 shipped April 2026 with rewritten agent protocol, removed Filebeat, and cluster-by-default (5.0 brief) — treat 4.14.x as the runway through 5.0 GA.

  • Provision new VM (e.g. wazuh, next free VMID, Debian 13 trixie genericcloud cloud-init, 4 vCPU / 8 GB / 100 GB on local-zfs, static IP in 192.168.1.0/24)
  • Apply baseline: common, security, auto_updates (with the Wazuh package itself excluded from auto-updates — pinned manually to dodge cluster-version-mismatch landmines), beszel_agent
  • Add wrapper roles roles/wazuh_server and roles/wazuh_agent that pull the upstream wazuh.wazuh collection from Galaxy via requirements.yml — mirrors the existing auto_updates wrapping pattern around hifis.toolkit.unattended_upgrades. Don't dump upstream wazuh-ansible roles directly into roles/; they have opinionated facts/handlers that collide with the homelab baseline
  • Vault entries: vault_wazuh_admin_password, vault_wazuh_api_password, vault_wazuh_cluster_key, vault_wazuh_agent_password, vault_discord_webhook_homelab_security
  • Bake homelab-specific tunings into the wrapper role on day one — agents_disconnection_time: 1d, agents_disconnection_alert_time: 0, JVM heap Xms = Xmx = 1024m. Without these, every reboot generates noise and runtime heap resizing degrades indexer perf (Lazro homelab tuning writeup)
  • Reverse-proxy wazuh.rampancy.cloud → wazuh:443 via Caddy on edge (Phase 5D). TLS terminates upstream; Wazuh dashboard speaks HTTPS internally so the proxy is 443→443
  • Discord integration via Wazuh's built-in <integration> block → new #homelab-security channel (separate from #homelab-ops and #homelab-drift). Default to rule level ≥ 10; tune down later
  • Captured by pbs-daily automatically once the VM lands in inventory
  • First agents: proxfold (PVE host directly) + arrstack VM as the test fleet
  • Add docs: hosts/wazuh/index.md, services/wazuh.md, ansible/roles/wazuh-server.md, ansible/roles/wazuh-agent.md, all wired into mkdocs.yml
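
The "Xms = Xmx = 1024m" tuning above is easy to regress during an upgrade, so a small check can sit next to the wrapper role. This is a sketch only: it assumes the indexer's heap flags live in a `jvm.options`-style file (one flag per line), and the function names are hypothetical.

```python
"""Check that the indexer JVM heap is pinned (Xms == Xmx == expected)."""
import re

HEAP_FLAG = re.compile(r"-(Xms|Xmx)(\d+[kKmMgG]?)")


def heap_settings(jvm_options: str) -> dict[str, str]:
    """Extract -Xms/-Xmx values from jvm.options text, ignoring comments and other flags."""
    found = {}
    for line in jvm_options.splitlines():
        match = HEAP_FLAG.fullmatch(line.strip())
        if match:
            found[match.group(1)] = match.group(2).lower()
    return found


def heap_is_pinned(jvm_options: str, expected: str = "1024m") -> bool:
    """True only when both flags are present and equal to the expected literal.

    The comparison is textual, so "1g" does not count as "1024m".
    """
    settings = heap_settings(jvm_options)
    return settings.get("Xms") == settings.get("Xmx") == expected


if __name__ == "__main__":
    sample = "# heap\n-Xms1024m\n-Xmx1024m\n-XX:+UseG1GC\n"
    print(heap_is_pinned(sample))
```

Comparing literals rather than normalised byte counts is deliberate: the role templates one exact string, so any other spelling signals drift.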

7B. Agent fleet rollout

Why: Coverage. The proxfold host agent yields the most signal (kernel events, package changes, ZFS state, SSH attempts, auditd), but each additional agent extends that visibility to another guest.

  • Roll out agents to remaining VMs: n8n (and plex if/when it moves from LXC to VM)
  • Roll out agents to LXCs: control, pbs, beszel, edge — but with caveats. Wazuh agent has documented install issues inside unprivileged LXCs (wazuh#24954). Test on one CT first; evaluate before promising fleet coverage. Worst case: skip agent in LXCs and rely on the proxfold host's view of /var/log/lxc/* and pct exec audit
  • Tune VPN-induced noise — qBittorrent's source IP via gluetun looks "wrong" to geoIP-style rules; suppress or whitelist early
  • Pull in BeardedTinker's UniFi/Synology/HomeAssistant rule packs (BeardedTinker/wazuh-homelab-security) — UDM and future Phase 6 HAOS integrate near zero-config
  • Rebuild-kit gotcha — Wazuh agent name can't be changed post-install (wazuh#19710); the rebuild/ kit must install agents after hostname is set, not as part of a base template. Codify or document the constraint

7C. Suricata NIDS + Wazuh integration

Why: Without an NIDS, Wazuh sees only host-side telemetry and is blind to traffic on the wire. Suricata + Wazuh is a first-class integration: Wazuh auto-parses /var/log/suricata/eve.json and surfaces alerts in the dashboard (Wazuh PoC: Suricata integration).

  • Decide host shape — dedicated Suricata VM, or co-located on the nginx/Caddy edge VM
  • Decide tap point — port-mirror from MikroTik PENFOLD-SW01, or in-line on the gateway path. Mirror is non-invasive and the right starting point; in-line gives blocking capability later
  • Codify as roles/suricata with roles/wazuh_agent already installed for the eve.json forward
  • Tune EmergingThreats Open ruleset — homelab traffic generates noise on alerts written for enterprise environments
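
For the ruleset-tuning step above, a quick way to find the noisiest signatures is to count alert events in eve.json before writing any suppressions. This is a minimal sketch that relies only on the `event_type` and `alert.signature` fields of Suricata's EVE JSON output; the function name and the default path are assumptions.

```python
"""Rank Suricata alert signatures by frequency to guide ruleset tuning."""
import json
from collections import Counter


def top_signatures(eve_lines, limit=10):
    """Count alert signatures in an iterable of EVE JSON lines."""
    counts = Counter()
    for raw in eve_lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip truncated or partially written lines
        if not isinstance(event, dict) or event.get("event_type") != "alert":
            continue  # flow, dns, stats and other event types are ignored
        counts[event.get("alert", {}).get("signature", "<unknown>")] += 1
    return counts.most_common(limit)


if __name__ == "__main__":
    # Path taken from the roadmap text; adjust if Suricata logs elsewhere.
    with open("/var/log/suricata/eve.json", encoding="utf-8") as handle:
        for signature, hits in top_signatures(handle):
            print(f"{hits:6d}  {signature}")
```

Signatures at the top of the list that turn out to be benign homelab traffic are the first candidates for suppression or threshold rules.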

7D. CrowdSec on edge

Completed 2026-05-04

CrowdSec engine + hslatman Caddy bouncer module live on edge (CT 107). Bouncer registered, polling LAPI on 15s ticker, crowdsec directive in every per-host handle block. End-to-end validation via cellular phone confirmed: blocked IP got 403, removed IP returned to 200. Accepted risk: edge security gap closed same day. See crowdsec_engine role and crowdsec-validation runbook — the runbook captures six bugs / gotchas hit during first execution.

  • crowdsec_engine role landed: packagecloud any/any apt source, crowdsecurity/caddy collection, Restart=on-failure drop-in, stat-gated for --check --diff cleanliness
  • caddy role updated: xcaddy-built binary caddy-2.11.2-cs1 with hslatman bouncer module + cloudflare DNS, gated bouncer block in Caddyfile template ({$CROWDSEC_BOUNCER_API_KEY} via parse-time substitution), site-block JSON access logs to journal
  • One-time cscli bouncers add caddy-edge operator step + vaulted key (registration is non-idempotent; not in the role)
  • Validation runbook executed end-to-end from cellular (LAN test masked by hairpin NAT — runbook now leads with this constraint)
  • Accepted-risks register entry closed
  • Cloudflare API token rotation closed 2026-05-05: rotated via CF Roll as part of Phase 6A.2. New secret validated by API probe (token verify + Zone:DNS:Edit TXT round-trip); script-based vault swap so the token never appeared in scrollback.

Validated post-go-live (2026-05-04): within 4 hours of bouncer enable, the engine's local Caddy-log parser caught a real HTTP-probing attempt from 49.178.191.113 (AU, Microplex) and auto-banned via the crowdsecurity/http-probing scenario. Confirms cscli setup auto-discovery wired Caddy log acquisition + base-http-scenarios + http-cve collections at install time — the role doesn't need to do that work explicitly. SSH log acquisition + crowdsecurity/sshd collection also auto-installed on edge; SSH brute-force detection now feeds the same reputation pool as Caddy, but enforcement for SSH still needs cs-firewall-bouncer (Caddy bouncer is HTTP-only).

7D follow-ups (deferred — flag for later)

Captured 2026-05-04 from a maturity-of-implementation review. None are blocking; pull into a future cycle when one feels warranted.

  • Discord notification profile. Engine-side notification system supports HTTP webhooks (/etc/crowdsec/notifications/http.yaml + /etc/crowdsec/profiles.yaml). Wire bans → vault_discord_webhook_homelab_ops: high visibility for roughly 30 min of work, and it shares the channel pattern used by PBS/ZFS/PVE/auto-updates events.
  • cs-firewall-bouncer + retire fail2ban on edge. SSH detection is already happening (auto-installed crowdsecurity/sshd parses ssh.service journald), but SSH enforcement still relies on the Caddy-side security role's fail2ban. Adding cs-firewall-bouncer (nftables-based) gives a unified federated reputation feed for both SSH and HTTP, and lets us cleanly retire fail2ban from the security role on edge. New crowdsec_firewall_bouncer role; small bootstrap (same non-idempotent cscli bouncers add operator step as the Caddy bouncer).
  • CrowdSec Console enrollment. app.crowdsec.net is a free SaaS dashboard showing alerts/decisions/top scenarios. One-command enroll: cscli console enroll <token>. Privacy tradeoff: ships event metadata (timestamps, scenario names, IPs) to CrowdSec's cloud. Worth weighing if CLI feedback ever stops being enough.
  • Wazuh forwarding — already deferred to Phase 7A/B per scoping.

Scope cuts (2026-05-04):

  • CrowdSec ↔ Wazuh forwarding deferred to Phase 7A/B. Wazuh is gated on Phase 4B (CPU + RAM upgrade), which is deprioritised. Edge gap is more urgent than the broader SIEM build; closing 7D standalone unblocks it. Forwarding hook documented in the role doc as a one-line wiring job when Wazuh exists.
  • Lynis weekly cron split out to Phase 7A/B. Functionally orthogonal to edge bouncing; pairs naturally with Wazuh's SCA module via custom decoder. Not in 7D scope.
  • SOAR-lite stretch moves with the Wazuh piece — depends on Wazuh active-response.

Decisions (locked 2026-05-04):

  • Apt source = packagecloud any/any, not Debian trixie main. Upstream's trixie repo returns 404 (issue #3909, unresolved 2026-05-04); any/any is upstream's documented workaround. Debian trixie ships its own crowdsec package but the version freezes at trixie-release time and falls behind hub items — picked the upstream repo for engine/parser currency despite the external dep. Path component and suite must both be any (/debian + suite any returns HTTP 422; caught on first apply).
  • Integration via Caddy module, not file-log parser. hslatman/caddy-crowdsec-bouncer plugs in via xcaddy --with (we already xcaddy-build), per-request IP check against LAPI, no logfile detour, dodges the known caddy-logs parser bug. Tradeoff: Caddy upgrades require a rebuild — already our workflow.
  • fail2ban kept for SSH-only. The CrowdSec bouncer module is HTTP-only (Caddy choke-point). fail2ban on edge stays for SSH brute-force protection (LAN-only path but still sensible). No retire-fail2ban work in 7D.
  • LAPI stays on stock 127.0.0.1:8080. Initial design moved it to 6060 to dodge a hypothetical Caddy alt-port collision; reverted on first apply because the agent's local_api_credentials.yaml is hardcoded to 8080 by the installer, so the engine wouldn't start. Lesson: don't optimise for hypothetical future state when it touches multiple components.
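
The per-request IP check that the bouncer performs against LAPI can be reproduced by hand when debugging a ban. This sketch assumes a valid bouncer key (the kind created by `cscli bouncers add`) and the stock 127.0.0.1:8080 listener; the helper names are hypothetical.

```python
"""Ask the local CrowdSec LAPI whether an IP currently has a ban decision."""
import json
import urllib.parse
import urllib.request

LAPI_URL = "http://127.0.0.1:8080"  # stock listener, per the decision above


def decisions_request(ip: str, api_key: str, base_url: str = LAPI_URL):
    """Build the GET /v1/decisions request a bouncer would send for one IP."""
    query = urllib.parse.urlencode({"ip": ip})
    return urllib.request.Request(
        f"{base_url}/v1/decisions?{query}",
        headers={"X-Api-Key": api_key},
    )


def is_banned(body: bytes) -> bool:
    """LAPI answers with a JSON list of decisions, or null when there are none."""
    decisions = json.loads(body) if body.strip() else None
    return any(d.get("type") == "ban" for d in decisions or [])


def check(ip: str, api_key: str) -> bool:
    """Perform the lookup against the live LAPI (requires network access to it)."""
    with urllib.request.urlopen(decisions_request(ip, api_key), timeout=5) as resp:
        return is_banned(resp.read())
```

Keeping request construction and response parsing separate from `check` means both halves can be exercised without a running engine.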

7E. Pocket-ID identity provider + selective SSO

Why: Adds OIDC-based single sign-on in front of the OIDC-native admin UIs (Proxmox VE, PBS) without the operational footprint of a full IdP. Bundles with Phase 7D — CrowdSec drops unauthorized traffic at the edge; Pocket-ID handles identity for the apps that benefit from SSO. The two together close the edge-security accepted risk captured 2026-05-02.

  • New LXC auth (next free CTID), Debian 13 trixie, unprivileged, features=nesting=1, 1 vCPU / 1 GB / 4 GB on local-zfs, static IP in 192.168.1.0/24 (mirrors CT 106 Beszel / CT 107 edge pattern)
  • New pocket_id Ansible role — pinned binary (or apt install if/when Pocket-ID ships a Debian package), systemd unit, SQLite-backed, vault-stored signing secret. Same shape as the caddy role — single binary, config templated from inventory.
  • Caddy: add auth.rampancy.cloud → auth:1411 to caddy_proxy_hosts in host_vars/edge.yml. The existing LE wildcard *.rampancy.cloud already covers it; Cloudflare needs a CNAME auth → rampancy.cloud (gray-cloud).
  • OIDC integration: configure Proxmox VE OIDC realm pointing at Pocket-ID; configure PBS similarly. Both PVE 8+ and PBS 4+ support OIDC realms natively — no proxy auth needed.
  • User accounts: primary (operator) + Hazel. Each registers two passkeys — primary in Bitwarden (cloud-synced), secondary on a hardware key (YubiKey 5) or device-native (Touch ID / Windows Hello). Bitwarden emergency access pre-configured for Hazel as the recovery path.
  • Beszel agent on the new LXC.
  • Docs: new services/pocket-id.md + hosts/auth/index.md + ansible/roles/pocket-id.md, all wired into mkdocs.yml. New runbook runbooks/sso-rollout.md capturing the cutover (Proxmox/PBS realm switch + first passkey registration walkthrough).
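
Before pointing the PVE and PBS realms at the new issuer, it helps to confirm that the provider's OIDC discovery document is reachable and sane. The sketch below checks a subset of the fields that OpenID Connect Discovery defines; the function names are hypothetical, and fetching the document is left to whatever HTTP client is at hand.

```python
"""Sanity-check an OIDC discovery document before wiring up realms."""

REQUIRED_FIELDS = ("issuer", "authorization_endpoint", "token_endpoint", "jwks_uri")


def discovery_url(issuer: str) -> str:
    """Standard discovery location relative to the issuer URL."""
    return issuer.rstrip("/") + "/.well-known/openid-configuration"


def discovery_problems(doc: dict, issuer: str) -> list[str]:
    """Return human-readable problems; an empty list means the document looks usable."""
    problems = [f"missing {field}" for field in REQUIRED_FIELDS if not doc.get(field)]
    # Clients compare the advertised issuer with the configured one exactly.
    if doc.get("issuer") and doc["issuer"] != issuer:
        problems.append(f"issuer mismatch: {doc['issuer']}")
    return problems


if __name__ == "__main__":
    issuer = "https://auth.rampancy.cloud"
    print(discovery_url(issuer))
```

An issuer mismatch is the usual failure when the provider's public URL setting and the reverse-proxy hostname disagree, so it is worth checking before the realm switch in the cutover runbook.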

Apps explicitly NOT in scope for SSO (deliberate, captured here so future-self doesn't re-litigate):

| App | Reason |
|---|---|
| Beszel hub UI | No OIDC support. CrowdSec + Beszel's own auth is sufficient. |
| Dockhand UI | No OIDC support. CrowdSec + Dockhand's own auth. |
| n8n | OIDC is a paid Enterprise feature; free version stays on its own auth. |
| Overseerr (requests) | Already auth'd via Plex OAuth, which is already SSO of a sort. |
| arrstack admin UIs (Sonarr/Radarr/Prowlarr/qBittorrent) | LAN-only, no OIDC, low value. |

Decisions (locked at scope time, 2026-05-03):

  • Pocket-ID, not Authentik. Authentik bundles OIDC + forward-auth in one component, would gate the non-OIDC apps too. But it introduces PostgreSQL + Redis (everything else in the homelab is SQLite), runs ~3-4 GB RAM vs Pocket-ID's ~50 MB, and the bundled forward-auth job is already covered by 7D CrowdSec. Reconsider Authentik if the public-facing SSO surface grows beyond ~5 services with weak per-app auth.
  • No Tinyauth pairing. The 2026 community pattern is Pocket-ID + Tinyauth for double-coverage of OIDC + forward-auth. Tinyauth in front of apps that have their own login produces a double-login UX (gate at proxy + app's own login still fires). With CrowdSec in 7D, the marginal benefit doesn't justify the friction.
  • Bitwarden as primary passkey vault, hardware key as backup. Operator already runs Bitwarden paid plan. Bitwarden emergency access closes the break-glass gap flagged during the 5D close-out — Hazel is the natural emergency-access nominee.
  • Skip Authelia. Slowing release cadence (Nov 2025 → Mar 2026 release gap, patch-only since), classical password+TOTP UX is dated relative to passkey-first. Pocket-ID is the cohort-aligned 2026 pick for new deployments.

Phase 8 — Network segmentation (deferred)

Goal: Migrate the homelab off a single flat VLAN onto a segmented design — Mgmt for infrastructure admin, IoT for smart-home devices, Lab for the housemate sandbox. Switch PENFOLD-SW01 from SwOS to RouterOS to enable programmatic config management. Reuses spare proxfold NICs to add the Lab VLAN bridge without reconfiguring vmbr0, so existing services don't blip during the additive phases.

Why this is its own phase

Triggered by the housemate-access work (Phase 6 entry above), but deliberately scoped separately. The immediate housemate change ships with Hazel's VMs on vmbr0/VLAN 1 with Proxmox-side controls only; their migration onto the Lab VLAN is captured here as 8G. Doing this work alongside Hazel's onboarding would conflate "first VLAN rollout" with "first RouterOS exposure" — too many simultaneous variables.

VLAN scheme (10-spacing buffer for future classes):

| ID | Name | Subnet (TBD at scope time) | Purpose |
|---|---|---|---|
| 1 | LAN | 192.168.1.0/24 | Existing trusted servers/clients |
| 10 | Mgmt | 192.168.10.0/24 | UDM admin, switch admin, iDRAC, future managed APs |
| 20 | IoT | 192.168.20.0/24 | Smart-home devices |
| 30 | Lab | 192.168.30.0/24 | Housemate sandbox |
| 40 | Guest | 192.168.40.0/24 | Visitor WiFi (reserved, not built) |

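
The scheme in the table above follows a simple convention (third octet equals the VLAN ID), which makes it easy to validate mechanically. This is a minimal sketch using the standard `ipaddress` module; the subnets are the provisional ones from the table and the helper names are hypothetical.

```python
"""Validate the provisional VLAN plan: no overlaps, octet matches VLAN ID."""
import ipaddress
from itertools import combinations

# VLAN ID -> (name, subnet), copied from the table; subnets are still TBD.
VLANS = {
    1: ("LAN", "192.168.1.0/24"),
    10: ("Mgmt", "192.168.10.0/24"),
    20: ("IoT", "192.168.20.0/24"),
    30: ("Lab", "192.168.30.0/24"),
    40: ("Guest", "192.168.40.0/24"),
}


def overlapping(vlans):
    """Return pairs of VLAN IDs whose subnets overlap (should be empty)."""
    nets = {vid: ipaddress.ip_network(cidr) for vid, (_, cidr) in vlans.items()}
    return [(a, b) for a, b in combinations(sorted(nets), 2) if nets[a].overlaps(nets[b])]


def follows_convention(vlans):
    """True when every subnet's third octet equals its VLAN ID."""
    return all(int(cidr.split(".")[2]) == vid for vid, (_, cidr) in vlans.items())


def vlan_for(ip, vlans=VLANS):
    """Map an address to its VLAN ID, or None if it is outside the plan."""
    addr = ipaddress.ip_address(ip)
    for vid, (_, cidr) in vlans.items():
        if addr in ipaddress.ip_network(cidr):
            return vid
    return None


if __name__ == "__main__":
    print(overlapping(VLANS), follows_convention(VLANS), vlan_for("192.168.30.57"))
```

`vlan_for` doubles as a post-migration check in 8G and 8I: a guest that still resolves to VLAN 1 after its move has not picked up the new DHCP pool.
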
flowchart LR
    A[8A: Console cable + SwOS backup] --> B[8B: Switch flip to RouterOS]
    B --> C[8C: RouterOS hardening]
    C --> D[8D: UDM VLAN networks + firewall rules]
    D --> E[8E: Switch port VLAN config]
    E --> F[8F: proxfold vmbr1 + PVE SDN]
    F --> G[8G: Migrate Hazel onto Lab VLAN]
    F --> H[8H: Migrate mgmt plane onto Mgmt VLAN]
    F --> I[8I: Migrate IoT devices onto IoT VLAN]

    style A fill:#FAECE7,stroke:#993C1D,color:#712B13
    style B fill:#FAECE7,stroke:#993C1D,color:#712B13
    style C fill:#FAECE7,stroke:#993C1D,color:#712B13
    style D fill:#EEEDFE,stroke:#534AB7,color:#3C3489
    style E fill:#EEEDFE,stroke:#534AB7,color:#3C3489
    style F fill:#EEEDFE,stroke:#534AB7,color:#3C3489
    style G fill:#E1F5E7,stroke:#2E7D32,color:#1B5E20
    style H fill:#E1F5E7,stroke:#2E7D32,color:#1B5E20
    style I fill:#E1F5E7,stroke:#2E7D32,color:#1B5E20

8A. Console cable + dated SwOS backup

Why: Escape path for the OS flip. Without console access, recovery from a misconfigured switch reboot is "laptop direct-cabled to RouterOS's default 192.168.88.1 mgmt subnet" — works but slower and error-prone. Switch has no autobackup automation today (UDM does; switch does not — gap worth closing regardless).

  • Order RJ45-to-USB-serial console adapter
  • Manual SwOS config export (download .swb from http://192.168.1.3/backup, dated copy stored alongside the ~/proxfold-pve9-upgrade/ artifacts)
  • (Optional) Codify daily SwOS backup as a small cron on the control LXC to fill the autobackup gap

8B. Switch flip — SwOS → RouterOS

Why: SwOS has no SSH or scriptable API. RouterOS gives full SSH + API + /export config readout. Switching is reversible — both OSes coexist on the device, configs preserved per slot. Marvell switching chip handles L2 in hardware on both OSes; line-rate preserved as long as RouterOS bridge VLAN filtering keeps hw=yes.

  • Pre-stage RouterOS minimal config as a .rsc script that reproduces today's behavior (24× 1G + 2× 10G in one untagged bridge, mgmt IP 192.168.1.3)
  • Console cable in hand, flip via /system routerboard settings set boot-os=router-os then reboot
  • Import .rsc on first boot
  • Verify all wired hosts reachable; verify hw=yes on every bridge port (/interface bridge port print)
  • Escape path: if anything fails, /system routerboard settings set boot-os=swos + reboot — instant rollback

8C. RouterOS hardening

Why: RouterOS exposes more services than SwOS — SSH, API, web, WinBox, optional FTP/Telnet. MikroTik has a CVE history when these are left exposed. Discipline: management plane on Mgmt VLAN only, unused services off.

  • Disable telnet, ftp, www (use www-ssl only), api if unused
  • Restrict winbox/ssh/www-ssl to a management address allowlist (admin IP for now; Mgmt VLAN once 8H lands)
  • Pin firmware version, document update cadence in hosts/penfold-sw01/

8D. UDM — VLAN networks + inter-VLAN firewall rules

Why: UDM controller defines the L3 + DHCP + inter-VLAN policy. Adding networks is additive; existing VLAN 1 untouched. WiFi clients on VLAN 1 don't blip from network creation alone, but UniFi controller commits do trigger AP provisioning — schedule outside peak housemate hours.

  • Create UniFi Networks: Mgmt (10), IoT (20), Lab (30); reserve Guest (40) as a placeholder
  • Inter-VLAN firewall: deny by default; explicit allowlist for required flows (e.g. PBS host → Lab on backup ports, Mgmt → all for admin)
  • Document policy in hosts/the-egg/firewall.md
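
The deny-by-default policy above can be modelled as a small allowlist before it is clicked into the UniFi controller, which makes the intended flows reviewable in Git. This is a sketch only: the rule shape is an assumption, the LAN → Lab entry stands in for "PBS host → Lab on backup ports", and port 8007 is a placeholder rather than a confirmed requirement.

```python
"""Model the inter-VLAN policy: default deny plus an explicit allowlist."""
from typing import NamedTuple, Optional


class Flow(NamedTuple):
    src: str              # source VLAN name
    dst: str              # destination VLAN name, or "*" for any
    port: Optional[int]   # destination port, or None for any


ALLOWLIST = [
    Flow("LAN", "Lab", 8007),   # placeholder for the PBS backup flow
    Flow("Mgmt", "*", None),    # Mgmt -> all, for admin access
]


def allowed(src: str, dst: str, port: int, rules=ALLOWLIST) -> bool:
    """Return True when a flow matches the allowlist; everything else is denied."""
    if src == dst:
        return True  # intra-VLAN traffic is switched, not routed through the firewall
    return any(
        rule.src == src and rule.dst in (dst, "*") and rule.port in (port, None)
        for rule in rules
    )


if __name__ == "__main__":
    for flow in (("Mgmt", "IoT", 22), ("IoT", "LAN", 443), ("LAN", "Lab", 8007)):
        print(flow, "allow" if allowed(*flow) else "deny")
```

Keeping the allowlist as data means hosts/the-egg/firewall.md can be generated from, or at least diffed against, the same source.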

8E. Switch — VLAN port assignments

Why: L2 enforcement of the network design. Trunk to UDM tagged for VLANs 10/20/30/40; access ports per device classification.

  • Trunk to UDM: tag VLANs 10/20/30/40, untagged VLAN 1
  • proxfold ports: existing nic0 port stays untagged VLAN 1; new spare-NIC port (8F) configured for VLAN 30 (or VLAN-aware trunk if multi-VLAN exposure is wanted)
  • IoT device ports: untagged access VLAN 20

8F. proxfold — vmbr1 VLAN-aware bridge + PVE SDN

Why: Additive to existing vmbr0. Spare NIC (nic1) bound to a new VLAN-aware bridge means existing services on vmbr0 are untouched throughout. PVE SDN gives a clean VNet abstraction with proper permission scoping.

  • Cable a spare proxfold NIC (nic1) to a switch port configured for VLAN 30 trunking
  • Add vmbr1 to /etc/network/interfaces with bridge-vlan-aware yes, bridge-vids 2-4094
  • Define PVE SDN zone (VLAN-aware) bound to vmbr1; create VNet lab for VLAN 30
  • Update Hazel's group ACLs: replace /sdn/zones/localnetwork/vmbr0 → PVESDNUser with /sdn/zones/<labzone>/lab → PVESDNUser
  • Codify as Ansible role or runbook step (sync-docs precedent)
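
If the bridge ends up codified rather than hand-edited, the vmbr1 stanza is small enough to template. The sketch below renders a conventional ifupdown2 stanza from the values in the bullets above; the `bridge-stp off` and `bridge-fd 0` lines are the usual Proxmox defaults added as an assumption, and the function name is hypothetical.

```python
"""Render the /etc/network/interfaces stanza for the VLAN-aware Lab bridge."""


def bridge_stanza(name: str = "vmbr1", port: str = "nic1", vids: str = "2-4094") -> str:
    """Return an interfaces(5) stanza for a VLAN-aware bridge with no host IP."""
    lines = [
        f"auto {name}",
        f"iface {name} inet manual",
        f"    bridge-ports {port}",
        "    bridge-stp off",         # assumed default, not stated in the roadmap
        "    bridge-fd 0",            # assumed default, not stated in the roadmap
        "    bridge-vlan-aware yes",
        f"    bridge-vids {vids}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(bridge_stanza(), end="")
```

Using `inet manual` keeps the host itself off the Lab VLAN, so only guests attached to the bridge get an address there.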

8G. Migrate Hazel's VMs onto Lab VLAN

Why: Completes the housemate-access architecture. Pre-Phase-8 her VMs are on vmbr0/VLAN 1 with only Proxmox-side controls; post-Phase-8 they're L2-isolated.

  • Per VM: shut down, change network bridge to the Lab VNet, boot, verify DHCP from UDM Lab pool, verify firewall behavior
  • Update permissions: revoke Hazel's SDN.Use on /sdn/zones/localnetwork/vmbr0

8H. Migrate management plane onto Mgmt VLAN

Why: Reduces blast radius. iDRAC, switch admin, UDM admin currently sit on VLAN 1 alongside production guests.

  • iDRAC: reconfigure dedicated NIC onto VLAN 10
  • Switch (RouterOS): mgmt IP onto VLAN 10
  • UDM: dedicated mgmt interface on VLAN 10
  • Update network/overview.md host map

8I. Migrate IoT devices onto IoT VLAN

Why: Default-deny outbound to LAN; cap blast radius from compromised IoT firmware.

  • Inventory IoT devices currently on VLAN 1
  • Migrate device-by-device (DHCP reservation move + reconnect to IoT WiFi SSID)

Decisions to lock at scope time:

  • PVE SDN VLAN zone vs. plain VLAN-aware bridge — SDN is the cleaner answer for permission scoping (per-VNet ACLs); plain bridge-vlan-aware yes is simpler if SDN feels like overkill for one VLAN
  • Guest VLAN — provisioned now or later — reserved as 40 either way
  • Bond instead of single NIC for vmbr1 — proxfold has 4 NICs, only nic0 in use; could bond 2-3 spares for the Lab bridge. Probably overkill for housemate experiments

Design decisions

These don't need answers now but will come up during implementation.

| Decision | When it matters | Options |
|---|---|---|
| Ansible for compose deploys vs Dockhand | Phase 1–2 | Use both (Ansible for OS, Dockhand for containers) or consolidate to Ansible's community.docker.docker_compose_v2 module |
| Caddy vs Nginx Proxy Manager | Phase 1B or later | Caddy is more Git-friendly and lighter; NPM has a GUI. Can migrate incrementally |
| n8n deployment host | Phase 5C | LXC with native Node.js install (standalone service, fits the LXC-for-standalone pattern); no Docker/VM needed |
| UDM upgrade to Dream Router 7 | Any time | Independent of everything else. WiFi 7 + better hardware |

Infrastructure reference

Current state (pre-Phase 4 hardware upgrades):

graph TB
    subgraph net["Network"]
        udm["The-Egg · 192.168.1.1\nUniFi UDM · Gateway · WAP"]
        sw["PENFOLD-SW01 · 192.168.1.3\nMikroTik CRS326-24G-2S+"]
        udm --> sw
    end

    subgraph proxfold["proxfold · 192.168.1.250 — Proxmox VE · Dell R430"]
        subgraph ct100["CT 100 — plex · 192.168.1.230"]
            plex["Plex Media Server\nNvidia T400 GPU"]
        end
        subgraph vm101["VM 101 — arrstack · 192.168.1.252"]
            media["Sonarr · Radarr · Prowlarr\nSeerr · qBittorrent · MediaBot\ngluetun (ProtonVPN)"]
        end
        subgraph vm102["VM 102 — nginx · 192.168.1.249"]
            npm["Nginx Proxy Manager"]
        end
        subgraph ct104["CT 104 — control · 192.168.1.245"]
            ansible["Ansible control node"]
        end
        pbs["CT — pbs · 192.168.1.246\n(Phase 5A)"]
        beszel["CT — beszel · 192.168.1.247\n(Phase 5B)"]
        n8n["CT — n8n\n(Phase 5C)"]
    end

    nas["QNAP TS-269L · 192.168.1.253\nNFS datastore (Phase 5A)"]
    sw --> proxfold
    sw --> nas

Post-Phase 4 changes: Boot drive becomes ZFS mirror (rpool), RAM increases from 48 GB to 256 GB (8× 32 GB, per the Phase 4B target), second CPU socket populated (28C/56T total).