Homelab Implementation Roadmap¶
Last updated: 2026-05-20
Scope: Dell R430 (proxfold), covering infrastructure hardening, config-as-code, hardware upgrades, automation
Pace: Weekend-warrior (~8 weeks active, then ongoing)
Status: Phase 1/2/3 done · Phase 4A/4C done · Phase 5A/5B/5C/5D/5E done · Phase 6A/6D done · Phase 7D done · Phase 4B re-prioritised (CPU 2 + 8× 32 GB / 256 GB at 1 DPC; unblocks Phase 7) · Phase 6B/6C/6E/6F + Phase 7A/B/C/E next
The critical path runs through PVE 9 upgrade → Ansible codification → hardware upgrades → boot drive swap. Everything else layers on top. Phase 7 (security stack) depends on 4B for RAM headroom.
gantt
title Homelab implementation timeline
dateFormat YYYY-MM-DD
axisFormat %b %d
section Phase 1 - Foundation
Ansible control node + common role :done, a1, 2026-04-01, 7d
Dockhand deploy + gluetun VPN :done, a2, 2026-04-01, 10d
section Phase 2 - PVE 9 Upgrade
PVE 8.4 → 9.x upgrade :done, p1, 2026-04-19, 2d
ZFS RAIDZ expansion :done, p3, 2026-04-20, 1d
section Phase 3 - Codify
3A proxmox + nut roles :done, b1, 2026-04-21, 1d
3B plex data codification :done, b2, 2026-04-21, 1d
3C rebuild kit (auto-install) :done, b3, 2026-04-21, 1d
3D scheduled drift detection :done, b4, 2026-04-21, 1d
section Phase 4 - Hardware
UPS purchase + install :done, c1, 2026-04-19, 1d
CPU 2 + RAM upgrade :c2, after b4, 5d
Boot drive swap (ZFS mirror) :c3, after c2, 7d
section Phase 5 - Backup + monitoring
5A Proxmox Backup Server :done, d1, 2026-04-23, 1d
5B Beszel + notifications :done, d2, 2026-04-24, 1d
5C n8n (lab automation) :done, d3, 2026-04-28, 1d
5D edge LXC + Caddy (NPM → Caddy) :done, d4, 2026-05-02, 1d
section Phase 6 - New services
Home Assistant :e1, after d2, 14d
Obico (3D print monitoring) :e2, after d2, 14d
6D Music acquisition pipeline :done, e3, 2026-05-16, 1d
Matrix server :e4, after d2, 14d
6F Music recommendations / discovery :e5, after e3, 7d
section Phase 7 - Security
7A Wazuh AIO + first agents :f1, after c2, 7d
7B Agent fleet rollout :f2, after f1, 5d
7C Suricata NIDS + integration :f3, after f2, 5d
7D CrowdSec edge :f4, 2026-05-04, 3d
Phase 1 — Foundation (weeks 1–2)¶
Goal: Get Ansible running against at least one host and migrate container management to Git-backed Dockhand with VPN routing in place. These two tracks are independent and can run in parallel.
1A. Ansible control node¶
Why first: Every subsequent phase depends on having playbooks ready — especially the boot drive swap in Phase 4, which becomes a single ansible-playbook site.yml instead of hours of manual work.
Note
The Ansible repo is already scaffolded at rampantlemming/homelab-ansible with all roles written. These steps cover standing up the control node and testing the roles against live hosts for the first time. See the Ansible section for full repo documentation.
- Create the Ansible control node LXC on Proxmox
  - Unprivileged, Debian 12, 1 vCPU, 512MB RAM, 8GB disk
- Install pip and Ansible

  Note
  --break-system-packages is required on Debian 12. PEP 668 prevents pip from installing into the system Python environment by default; Debian enforces this to avoid conflicts with apt-managed packages. Using a virtualenv is the alternative, but for a dedicated control node LXC this flag is fine.
- Clone the homelab-ansible repo
  git clone https://github.com/rampantlemming/homelab-ansible.git ~/homelab-ansible
  cd ~/homelab-ansible
- Install Galaxy collections (requires requirements.yml from the cloned repo)
  ansible-galaxy collection install -r requirements.yml
- Generate SSH keypair on the control node
  ssh-keygen -t ed25519 -C "ansible@homelab"
  - Distribute the public key to all managed hosts
- Populate and encrypt the vault file
  - The repo contains group_vars/all/vault.yml as a template; fill in real secrets before encrypting
    ansible-vault encrypt group_vars/all/vault.yml
  - Store the vault password in a safe location (e.g. a password manager); the password file itself must be excluded from git via .gitignore
  - To edit secrets later:
    ansible-vault edit group_vars/all/vault.yml
- Test the common role against arrstack first
  ansible-playbook playbooks/arrstack.yml --tags common --check --diff
  - Verify idempotency: run twice, second run should show zero changes
- Once clean, run common across all hosts
  ansible-playbook site.yml --tags common
1B. Dockhand + gluetun deployment¶
Why now: Getting Dockhand deployed means container management is Git-backed before the boot drive swap. Gluetun gives qBittorrent proper VPN routing.
- Commit compose files to the homelab GitHub repo
  - Create stacks/arrstack/docker-compose.yml
  - Confirm current state: Sonarr, Radarr, Prowlarr, Seerr, qBittorrent, MediaBot
  - Remove deprecated version: key if still present
  - Set TZ=Australia/Adelaide across all services
- Add gluetun to the compose stack
  - Add gluetun service with ProtonVPN WireGuard config
  - Move qBittorrent to network_mode: "service:gluetun"
  - Expose qBittorrent ports (8080, 6881) on the gluetun service
  - Set FIREWALL_OUTBOUND_SUBNETS=192.168.1.0/24 so Sonarr/Radarr can still reach qBittorrent's API over LAN
  - Ensure this subnet does not overlap with the ProtonVPN WireGuard tunnel address range (gluetun will reject the config if they do)
  - Generate ProtonVPN WireGuard private key via the ProtonVPN website (select P2P-capable Australian server with NAT-PMP)
- Test VPN routing
  - docker exec qbittorrent curl ifconfig.me should return a ProtonVPN IP, not your home IP
  - Verify Sonarr/Radarr can still communicate with qBittorrent
- Deploy Dockhand container
  - Configure GitHub PAT for repo access
  - Point at the stacks/arrstack/ directory
  - Test: push a trivial compose change, confirm Dockhand picks it up
- Remove Watchtower container
  - Dockhand handles update management, so Watchtower is redundant
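The overlap condition on FIREWALL_OUTBOUND_SUBNETS can be checked before the stack is ever started. The following is a minimal sketch using only Python's standard ipaddress module; the tunnel address 10.2.0.2/32 is an assumed placeholder, not a value from this roadmap, and should be replaced with the Address line from the generated WireGuard config.

```python
import ipaddress


def overlaps(lan_subnet: str, tunnel_address: str) -> bool:
    """Return True if the LAN subnet and the VPN tunnel range share any address."""
    lan = ipaddress.ip_network(lan_subnet, strict=True)
    # strict=False accepts interface-style values such as 10.2.0.2/24,
    # where host bits are set, and normalises them to the enclosing network.
    tunnel = ipaddress.ip_network(tunnel_address, strict=False)
    return lan.overlaps(tunnel)


LAN_SUBNET = "192.168.1.0/24"   # FIREWALL_OUTBOUND_SUBNETS value from the roadmap
TUNNEL_ADDRESS = "10.2.0.2/32"  # ASSUMPTION: placeholder, read yours from the WireGuard config

if overlaps(LAN_SUBNET, TUNNEL_ADDRESS):
    print("Overlap: pick a different outbound subnet or tunnel address")
else:
    print("No overlap between LAN subnet and tunnel address")
```

If the check reports an overlap, the subnet bullet above is the one to revisit before debugging anything inside gluetun itself.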
Phase 2 — PVE 9 Upgrade + RAIDZ Expansion (weeks 2–3)¶
Goal: Upgrade Proxmox VE from 8.4 to 9.x (Debian Bookworm → Trixie) to get OpenZFS 2.3, then expand the RAIDZ1 vdev with new drives. This must happen before Ansible codification so that roles capture the stable PVE 9 target state rather than the soon-to-be-replaced PVE 8 config.
Why before codification: Codifying PVE 8 state in Ansible and immediately rewriting it for PVE 9 is throwaway work. Upgrading first means Phase 3 captures reality from day one.
Warning
The ZFS raidz_expansion feature flag is a one-way operation. Once enabled, the pool cannot be imported on OpenZFS < 2.3. There is no going back to PVE 8 after zpool upgrade.
See the PVE 9 Upgrade runbook for the full step-by-step procedure — and the Lessons from the April 2026 run appendix capturing 8 deviations found during execution. Key stages:
- Pre-flight checks: completed 2026-04-19 (state bundle, pve8to9 --full green, BIOS X2APIC + I/OAT DMA enabled)
- Upgrade: completed 2026-04-20. bookworm→trixie repos, apt full-upgrade, GRUB-pinned to kernel 6.14.11-6-pve (not PVE 9.1's 6.17 default, to keep Nvidia 550.x compatibility)
- Post-upgrade verification: PVE 9.1.7 running, ZFS 2.4.1 userland / 2.3.4 kmod, Nvidia 550.163.01 via DKMS, Plex HW transcode confirmed ((hw) in dashboard) after fixing cgroup major drift 235→234 / 238→237
- Stability soak: abbreviated. No regressions in the ~6h between upgrade and Phase 3 kickoff; stash at 91% full overrode the usual soak window.
- RAIDZ expansion: completed 2026-04-20. zpool upgrade stash enabled raidz_expansion + 4 other features (irreversible). Drive 1 (sdg, scsi-35002538a97b1c620) attached at 13:59:59 ACST, reflow 2h08m at 1.73 GB/s. Drive 2 (sdh, scsi-35002538a97b19c40) attached after drive 1 auto-scrub, reflow 1h45m at 2.00 GB/s. All-SSD pool finished in hours, not days (runbook assumed HDD speed).
- Post-expansion verification: pool 14.0T→21.0T, 91%→61% full, scrub repaired 0B with 0 errors, all 6 disks ONLINE, stash/plex-data quota intact
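The post-expansion figures can be sanity-checked with simple arithmetic. The sketch below assumes the amount of stored data did not change during the reflow and that the pool went from 4 to 6 equal-sized disks (6 disks are reported ONLINE and 2 were added); the function name is illustrative only.

```python
def fill_after_expansion(size_before_t: float, pct_before: float, size_after_t: float) -> float:
    """Percent full after the pool grows, assuming the stored data is unchanged."""
    used_t = size_before_t * pct_before / 100.0
    return used_t / size_after_t * 100.0


# Reported pool size divided by disk count should be the same before and after.
per_disk_before = 14.0 / 4
per_disk_after = 21.0 / 6

pct = fill_after_expansion(14.0, 91.0, 21.0)
print(f"per-disk share: {per_disk_before} T before, {per_disk_after} T after")
print(f"expected fill after expansion: {pct:.1f}%")
```

The result rounds to the 61% recorded in the verification step, which suggests the reported numbers are internally consistent.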
Phase 3 — Codify the stack¶
Goal: Every piece of host-level configuration that matters is captured in Ansible roles so a proxfold rebuild in Phase 4 (or after a hardware failure) is a playbook run rather than hours of manual work.
Phase 3 breaks into four sub-phases. 3A, 3B, 3C, and 3D all landed on 2026-04-21.
3A. PVE host baseline — proxmox and nut roles¶
Completed 2026-04-21
Merged in homelab-ansible#3A. See the proxmox role and nut role pages for the full task inventory.
What landed:
- proxmox role: deb822 repo management (no-sub enabled, enterprise + ceph disabled), kernel pin via proxmox-boot-tool (6.14.11-6-pve), nouveau blacklist, /etc/sysctl.conf → /etc/sysctl.d/ migration, stash pool import, nasbackup CIFS registration
- nut role: Network UPS Tools server + monitor for CyberPower PR1500ERT2U (vendor 0764:0601), standalone mode, shutdown at 600s runtime-low
- proxmox-host playbook wired: proxmox → common → security → zfs → nfs → nvidia → nut
- Vault entries added: vault_nasbackup_password, vault_nut_admin_password, vault_nut_upsmon_password
3B. Plex data codification¶
Completed 2026-04-21
Merged in homelab-ansible#3B. Variables and structure documented on the plex role page.
What landed:
- plex_data_zfs_dataset / plex_data_zfs_quota / plex_data_mount variables plumbed through plex host_vars
- ZFS dataset stash/plex-data (100G quota) creation delegated to proxfold
- LXC 100 mp1 mount point registered via lineinfile against /etc/pve/lxc/100.conf
- Plex data symlink /var/lib/plexmediaserver/Library/Application Support/Plex Media Server/ → /stash/plex-data (fail-hard if real dir exists)
- docker-restart handler gotcha documented (breaks network_mode: service:* chains, so don't touch compose during a plex role run)
3C. Proxmox auto-install kit¶
Completed 2026-04-21
Merged in homelab-ansible#3C. Full operator procedure at the Proxfold rebuild runbook.
What landed:
- rebuild/answer.toml.j2 (production: Samsung SSD disk filter, static IP) and rebuild/answer-test.toml.j2 (rehearsal: QEMU disk filter, DHCP)
- rebuild/render.yml Ansible playbook that decrypts the vault and writes both answers; wrappers render-answer.sh + build-iso.sh
- Rendered TOML + built ISOs gitignored (contain root password hash + SSH keys)
- Nested-VM rehearsal validated template rendering, vault integration, disk filter fail-safe, and boot order. Installer initramfs finalisation fails inside nested PVE — documented as a known limitation (does not affect bare metal)
- WSL control node bootstrap formalised as the DR cold-start path (CT104 doesn't exist until proxfold is rebuilt and its backup restored)
3D. Scheduled drift detection (done — 2026-04-21)¶
Kit and operator docs
Implementation kit lives in homelab-ansible/drift-detection/. Operator procedure: Drift Detection.
What landed:
- Systemd timer on CT104: daily at 04:00 ACST with ±5 min randomised delay and Persistent=true for missed-run catch-up
- drift-check.sh wrapper: git pull --ff-only, ansible-playbook playbooks/site.yml --check --diff, parses the PLAY RECAP for changed/failed/unreachable totals, classifies outcomes
- Dedicated #homelab-drift Discord webhook: amber embed for drift, red for failure, silent for clean runs (opt-in DRIFT_SUMMARY=1 for confirmation pings)
- WSL fallback: same wrapper runs from WSL via env overrides, matching the dual-control pattern (CT104 is primary, WSL is DR cold-start)
- Role hardening surfaced by the first live run: zfs + nvidia made check-mode safe (check_mode: false on health probes, not ansible_check_mode gate on systemd enables), nvidia passthrough strip is now conditional on the managed block being absent (killed a perennial false-positive), reload udev handler switched from command to shell so && actually parses
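The recap-parsing step of the wrapper can be illustrated as follows. This is a hedged Python sketch of the idea, not the contents of drift-check.sh (which is a shell script in the repo); the function names, the sample output, and the three outcome labels are assumptions chosen to mirror the amber/red/silent behaviour described above.

```python
import re

# One recap line looks like: "host : ok=12 changed=0 unreachable=0 failed=0 ..."
RECAP_LINE = re.compile(r"^(?P<host>\S+)\s*:\s*(?P<counters>(?:\w+=\d+\s*)+)$")


def parse_recap(output: str) -> dict:
    """Sum changed/failed/unreachable over every host line after 'PLAY RECAP'."""
    totals = {"changed": 0, "failed": 0, "unreachable": 0}
    in_recap = False
    for line in output.splitlines():
        if line.startswith("PLAY RECAP"):
            in_recap = True
            continue
        if not in_recap:
            continue
        match = RECAP_LINE.match(line.strip())
        if not match:
            continue
        for key, value in re.findall(r"(\w+)=(\d+)", match.group("counters")):
            if key in totals:
                totals[key] += int(value)
    return totals


def classify(totals: dict) -> str:
    """Failure outranks drift; a run with no changes at all is clean."""
    if totals["failed"] or totals["unreachable"]:
        return "failure"
    if totals["changed"]:
        return "drift"
    return "clean"


# Illustrative output only; host names borrowed from the roadmap.
SAMPLE = """\
PLAY RECAP *********************************************************************
arrstack                   : ok=12   changed=0    unreachable=0    failed=0    skipped=3    rescued=0    ignored=0
proxfold                   : ok=41   changed=2    unreachable=0    failed=0    skipped=5    rescued=0    ignored=0
"""

print(classify(parse_recap(SAMPLE)))
```

Summing across hosts keeps the classification to a single verdict per run, which is what a one-embed-per-run notification needs.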
Decided during implementation (the "open questions" from the original 3D scope):
- Cadence: daily, not weekly. Catches drift inside 24h, and because clean runs are silent, the only notification cost is the Discord embed on actual drift
- Notification target: dedicated #homelab-drift webhook, separate from the MediaBot channel; n8n deferred to Phase 5 (5A in the original scope, now 5C) and not needed here
- Vault handling: no_log: true already set on vault-rendered NUT files; webhook only posts the PLAY RECAP (truncated to Discord's 1024-char field limit), no task-level diff
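The vault-handling decision comes down to what goes into the webhook body: only the recap text, cut to fit a single embed field. A minimal Python sketch of that shape follows; the colour values, the title text, and the function name are assumptions (the roadmap only says amber and red), and nothing here posts to Discord.

```python
import json

FIELD_LIMIT = 1024  # Discord embed field value limit referenced in the roadmap

# ASSUMPTION: illustrative colour integers for "amber" and "red".
COLOURS = {"drift": 0xFFBF00, "failure": 0xFF0000}


def build_payload(outcome: str, recap: str) -> dict:
    """Build a webhook body that carries only the recap, truncated to one field."""
    if len(recap) > FIELD_LIMIT:
        recap = recap[: FIELD_LIMIT - 3] + "..."
    return {
        "embeds": [
            {
                "title": f"Drift check: {outcome}",
                "color": COLOURS[outcome],
                "fields": [{"name": "PLAY RECAP", "value": recap}],
            }
        ]
    }


long_recap = "proxfold : ok=41 changed=2 unreachable=0 failed=0\n" * 40
payload = build_payload("drift", long_recap)
print(len(payload["embeds"][0]["fields"][0]["value"]))
print(json.dumps(payload)[:80])
```

Because the task-level diff never enters the payload, secrets rendered by tasks cannot leak through the notification even if a task forgets no_log.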
Not done, deferred:
- Failure quarantine / flap suppression — deferred until we actually see a flapping host; premature abstraction today
- Dynamic fan curve for the R430 T400: looked at during the 3D converge, concluded the BMC handles it fine in auto on this unit (see the Drift Detection page and the host_vars comment on proxfold's ipmi_fan_fix: false)
Phase 4 — Hardware upgrades (weeks 5–7)¶
Goal: UPS protection, full CPU/RAM capacity, and a clean boot drive on mirrored ZFS. The order is strict — each step protects or enables the next.
flowchart LR
A[UPS install] --> B[CPU 2 + RAM]
B --> C[Boot drive swap]
C --> D[Ansible rebuild]
style A fill:#FAECE7,stroke:#993C1D,color:#712B13
style B fill:#FAECE7,stroke:#993C1D,color:#712B13
style C fill:#FAECE7,stroke:#993C1D,color:#712B13
style D fill:#EEEDFE,stroke:#534AB7,color:#3C3489
4A. UPS purchase + install¶
Why first: A power event during a boot drive swap or ZFS rebuild would be catastrophic. UPS goes in before any hardware is touched.
Completed 2026-04-19 (pulled forward from planned order)
The UPS was installed ahead of Phases 2 and 3 because the hardware arrived early and a safe install window was available. See UPS for the as-built configuration, USB IDs, NUT config, and battery transfer test results.
- Purchase the CyberPower PR1500ERT2U: acquired from Scorptec for ~$1,179 AUD
  - 1500VA/1500W, pure sine wave, 2U rackmount
  - Active PFC compatible (required for R430 PSUs)
- Install and connect
  - Both R430 PSUs (dual 550W redundant) on battery-backed outlets via 2× IEC C13-to-C14 cables
  - NAS and networking gear still on wall power; revisit as part of future rack work
- Configure NUT on Proxmox
  - nut 2.8.0 installed directly on the Proxmox host in standalone mode
  - usbhid-ups driver matching 0764:0601 (CyberPower HID)
  - battery.runtime.low raised to 600 s (from 300 s) for ZFS/VM shutdown buffer
  - Battery transfer test 2026-04-19: clean OL→OB→OL, ~2.5 min on battery, no guest disruption
  - NUT monitoring integration with n8n deferred to Phase 5
4B. CPU 2 + RAM upgrade¶
Why now: The second CPU socket unlocks all 12 DIMM slots. Buying RAM in one lot gets a matched set and provides headroom for additional VMs.
Note
The R430 Hardware Upgrade runbook covers the full procedure including heatsink installation, CPU seating, and BIOS verification.
Source before starting:
- 1× Intel Xeon E5-2680 v4 (S-Spec: SR2N7) — match the existing socket 1 CPU
- 1× Dell heatsink P/N 02FKY9
- 1× 6th fan module (Dell P/N DNHNR or 79WM9) — Dell's minimum for dual-CPU is 5 fans; the recommended layout is 6. As-built proxfold has 5 populated bays + Fan6 empty (verified 2026-04-28), so this step adds the 6th fan to bring the chassis to the recommended layout
- 8× 32 GB DDR4-2400 PC4-19200R 2Rx4 ECC Registered RDIMM: single-vendor matched lot; 256 GB total. Populates 1 DIMM per channel per CPU (optimal 1 DPC electrical config: every channel active, no second-DIMM signal loading). Leaves 4 slots free for future expansion. SK Hynix is the recommended vendor since the chassis is already running Hynix sticks today (proxfold dmidecode 2026-04-28 shows 6× HMA41GR7-family, mixed AFR + MFR die; both work, no issues).

Acceptable part numbers (pick one vendor; don't mix across the kit):

| Vendor | Part number | Notes |
|---|---|---|
| SK Hynix | HMA84GR7AFR4N-UH (A-die) or HMA84GR7MFR4N-UH (M-die) | 32 GB 2Rx4 PC4-19200R, CL17. Either die revision works; matches what's already in proxfold today. Recommended. |
| Samsung | M393A4K40CB1-CRC | 32 GB 2Rx4 PC4-19200R, CL17 |
| Micron | MTA36ASF4G72PZ-2G3 (and -2G3B1 / -2G3A1 suffix variants) | 32 GB 2Rx4 PC4-19200R, CL17 |

Confirm the seller is selling a true matched 8-stick lot (single decommissioned server pull) and not 8 random sticks from different sources. Within a single matched kit, all sticks should share the same die revision.
Alternatives if 32 GB sticks aren't available
| Plan | Total | Trade-off |
|---|---|---|
| 8× 32 GB / 256 GB (default) | 256 GB | Optimal 1 DPC config + most capacity + 4 slots free |
| 12× 16 GB / 192 GB | 192 GB | All slots filled. Slightly less bandwidth-clean (2 DPC on 4 of 8 channels). 16 GB part numbers: Samsung M393A2G40EB1-CRC, SK Hynix HMA42GR7AFR4N-UH, Micron MTA36ASF2G72PZ-2G3. Avoid Samsung M393A2K40BB1-CRC — that's the 1Rx4 single-rank variant. |
| 8× 16 GB / 128 GB | 128 GB | Lowest cost. Tight headroom — Wazuh + Phase 6 services + ARC squeeze the budget; ARC re-tune target drops to ~48 GB instead of ~128 GB. Workable but the 32 GB option is the better deal at typical price-per-GB ratios. |
Original spec was 12× 32 GB / 384 GB — re-scoped 2026-04-28 to right-size for actual fleet load and to dodge the 2026 DDR4 price spike.
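The capacity and DIMM-per-channel figures in the plans above follow from the slot layout. The sketch below reproduces them in Python under two stated assumptions: sticks are split evenly across the two CPUs, and every channel receives one DIMM before any channel receives a second. The class and property names are illustrative.

```python
from dataclasses import dataclass

CPUS = 2
CHANNELS_PER_CPU = 4  # E5-2600 v4 exposes four memory channels per socket
SLOTS_PER_CPU = 6     # 12 DIMM slots in total, per the roadmap


@dataclass
class Plan:
    sticks: int
    stick_gb: int

    @property
    def total_gb(self) -> int:
        return self.sticks * self.stick_gb

    @property
    def channels_at_2dpc(self) -> int:
        """Channels carrying a second DIMM, under the even-split assumption."""
        per_cpu = self.sticks // CPUS
        return max(0, per_cpu - CHANNELS_PER_CPU) * CPUS

    @property
    def free_slots(self) -> int:
        return CPUS * SLOTS_PER_CPU - self.sticks


PLANS = {
    "8x32 (default)": Plan(8, 32),
    "12x16": Plan(12, 16),
    "8x16": Plan(8, 16),
}

for name, plan in PLANS.items():
    print(f"{name}: {plan.total_gb} GB, "
          f"{plan.channels_at_2dpc} of {CPUS * CHANNELS_PER_CPU} channels at 2 DPC, "
          f"{plan.free_slots} slots free")
```

Only the 12-stick plan pushes any channel to 2 DPC, which is the trade-off the alternatives table calls out.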
2026 DDR4 RDIMM market context
DDR4 RDIMM pricing is elevated through 2026 because foundry capacity has been pulled toward HBM3E/DDR5 for AI workloads (Tom's Hardware DDR price tracker). Expect more variance than usual on eBay AU listings; watch the market for 1–2 weeks before committing to a kit. Don't panic-buy the first matched lot — and don't pay for DDR4-2666 sticks "for future-proofing" since the E5-2680 v4 caps at 2400 MT/s and 2666 sticks just downclock (Intel E5-2680 v4 product specs).
Steps:
- Shut down R430 and disconnect power
- Install 6th fan module (slot furthest from PSUs)
- Install CPU 2 + heatsink
- Remove all 6× existing 8 GB DDR4-2133 DIMMs (don't mix old + new — speed clocks down to slowest stick, and the chassis becomes asymmetric). Install the new RDIMM kit. Slot population:
- 8-stick (32 GB) plan — default: A1, A2, A3, A4 + B1, B2, B3, B4 (1 DIMM per channel per CPU; A5/A6/B5/B6 left empty)
- 12-stick (16 GB) alternative: all 12 slots A1–A6 + B1–B6
- Follow Dell population order for optimal memory interleaving (consult the R430 Owner's Manual or iDRAC lifecycle log if memory training errors appear after install)
- Power on and verify in iDRAC
- Both CPUs recognised, correct stepping (SR2N7)
- Total RAM visible: 256 GB (8× 32 GB default) or 192 GB (12× 16 GB alternative), all DIMMs healthy
- No memory training errors in lifecycle log
- Verify in Proxmox: lscpu shows 28C/56T, free -h shows ~252 GB (8× 32 GB) or ~189 GB (12× 16 GB) after kernel reservation
- Re-tune ZFS ARC cap post-converge. Current cap is 14 GiB (changelog 2026-03 entry), sized for the 48 GB-physical era. With 256 GB physical and a 21 T pool at 62% used, lift the cap toward ~half-RAM (≈ 128 GB) for materially better cache hit rates on media + PBS reads. Variable lives in the proxmox role (/etc/modprobe.d/zfs.conf); revisit after at least 24 h of stability soak with the new RAM, not on day 0. (For 12× 16 GB / 192 GB alternative: target ≈ 96 GB instead.)
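The ARC cap is expressed in bytes via the zfs_arc_max module parameter, so the re-tune targets need converting. A short Python sketch of that conversion follows; it assumes the roadmap's "GB" targets are meant as GiB and that the role renders a single options line, neither of which is confirmed by the text above.

```python
GIB = 1024 ** 3


def half_ram_cap(physical_gib: int) -> int:
    """Half of physical RAM, in GiB, as the ARC ceiling."""
    return physical_gib // 2


def arc_max_line(cap_gib: int) -> str:
    """Render a modprobe options line capping the ZFS ARC at cap_gib GiB."""
    return f"options zfs zfs_arc_max={cap_gib * GIB}"


current = arc_max_line(14)                       # today's cap
target_256 = arc_max_line(half_ram_cap(256))     # 8x 32 GB plan
target_192 = arc_max_line(half_ram_cap(192))     # 12x 16 GB alternative

for line in (current, target_256, target_192):
    print(line)
```

A change to this file only takes effect once the module parameter is reloaded, which is one more reason to schedule the re-tune after the soak rather than on day 0.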
4C. Boot drive swap to mirrored ZFS¶
Why: Replace the single 128GB LVM boot drive with two 960GB SSDs in a ZFS mirror, providing OS-level redundancy.
Completed 2026-04-22
Executed via the Boot Drive Swap runbook + Proxfold rebuild runbook. Boot drive is now rpool ZFS mirror across Samsung SM843T + Intel DC S4500 (888G usable, 2% used after restore). See the "Lessons from the 2026-04-22 run" appendix at the bottom of the boot-drive-swap runbook for what bit during execution.
What landed:
- Pre-swap backups — full vzdump of all VMs/containers to NAS (192.168.1.253) verified before power-down
- Ansible playbooks committed and pushed prior to swap; fresh production ISO built via rebuild/build-iso.sh
- answer.toml.j2 disk filter updated: used filter.ID_BUS="ata" (not array-form ID_MODEL, which the installer validator rejected); filter matches both 960GB SATA SSDs and the installer builds rpool across them as RAID1
- Hardware swap: 128GB Samsung 840 PRO removed; Samsung SM843T 960GB + Intel DC S4500 960GB installed in Dell caddies in bays 2 & 3
- Auto-install ran clean: ZFS mirror rpool created across both new drives, SSH/network up on 192.168.1.250, Ansible converge re-ran, stash pool re-imported with -f (new hostid), CIFS re-registered, CT100 + guest VMs restored from NAS
- Post-swap verification: zpool status rpool both drives ONLINE, both drives populated in proxmox-boot-tool status, stale Boot0007 UEFI entry pointing at the defunct 840 PRO GPT UUID removed via efibootmgr -b 0007 -B
- Followups codified: nvidia role LXC passthrough rewritten from blockinfile to per-line lineinfile (PVE's conf parser moves raw lxc.* keys to EOF on every rewrite, breaking marker semantics); proxmox role now disables ceph enterprise deb822 source and uses zpool import -f for fresh-hostid cases; ipmi_fan_fix: false path now self-healing
Phase 5 — Backup + monitoring (weeks 7–9)¶
Goal: Close the two real gaps left after Phase 4 — there's no scheduled backup and no infrastructure-wide health visibility beyond the 3D drift heartbeat. Everything in Phase 5 is additive; nothing blocks Phase 6.
Scope reset — 2026-04-23
The original Phase 5 scope (n8n + Ollama, vulnerability management) was written before Phase 3D landed its own Discord integration and before the 4C boot-drive swap exposed how thin the backup story actually is. Revised shape:
- 5A (new, priority) — Proxmox Backup Server, replacing the manual-only vzdump path
- 5B (new, priority) — Beszel + PVE 9 notification system + ZED webhook, one stack covering host metrics / backup status / zpool events
- 5C (de-scoped) — n8n as a lab/glue capability only; Ollama dropped (no GPU budget, no RAM headroom pre-4B, and the drift/ZFS/UPS workflows the original 5A called out are already covered elsewhere)
- Re-scoped from "dropped": unattended-upgrades landed 2026-04-25 as a standalone auto_updates role (wraps hifis.toolkit.unattended_upgrades). Security-only origins across the fleet, Proxmox repo added on proxfold + pbs, hypervisor kernels blocklisted, manual reboot with a /var/run/reboot-required Discord nag on proxfold. See auto_updates role page. The rest of the old 5B (full vulnerability management stack) remains out of scope
- Not adopting: Grafana / Prometheus. Beszel's built-in historical metrics cover the use case at a fraction of the operational footprint
5A. Proxmox Backup Server¶
Completed 2026-04-23
PBS live as CT 105 (192.168.1.246, Debian 13 privileged, features=nesting=1). Datastore nas-primary on NFSv3 from the TS-269L, prune daily 03:00 (keep 7d/4w/6m), verify sun 04:00, GC mon 04:00. PVE storage registered on proxfold; daily 02:00 all-guest backup job pbs-daily active. First full backup + manual verify ran clean same day — ~95 GiB across 5 guests, ~11 min wall clock. See the pbs role page for the as-built task list and the gotchas captured during execution.
Why: /etc/cron.d/vzdump was empty post-4C — every backup on the NAS was a manual push. PBS gives scheduled incremental + deduplicated backups with verify jobs, and the dedup ratio typically runs 5:1+ on VM images. The existing CIFS nasbackup path stays available for manual vzdump fallback during the transition.
Shape: PBS LXC on proxfold, datastore on an NFS mount from the QNAP TS-269L at 192.168.1.253. The NAS can't run PBS itself (Atom D2701, EOL QTS 4.3.4) but it does export NFS, which is materially better-behaved than CIFS as a PBS datastore backend — no uid/gid remapping, no reported .chunks inode-collision issues.
QNAP firmware end-of-life
QTS 4.3.4 is the final firmware for the TS-269L and has been unpatched since 2020. Accepted as part of Phase 5A scoping — not a blocker, but a ticking replacement clock. See Accepted Risks — QNAP TS-269L firmware EOL for the full treatment.
Architecture decisions (locked in 2026-04-23):
- Datastore on NAS, not local: DR requires a second failure domain. A datastore on stash would die with the pool if stash fails. The NAS is the only meaningful second domain in this environment.
- NFS, not CIFS: PBS datastore on CIFS works but is fragile (uid/gid remap, inode collisions on some NAS firmwares). NFSv3 on the QNAP is the well-trodden path.
- Not on rpool — the whole point of the 4C boot-mirror design was isolating OS from data; backups belong on the data path.
- PBS as LXC, not VM — standalone service, no Docker needed, fits the established LXC-for-standalone / VM-for-Docker pattern.
What landed:
- NFS export on the QNAP: backup share already exported; added 192.168.1.246 to the NFS host-access ACL with NO_ROOT_SQUASH via QTS 4.3.4 UI (Control Panel → Shared Folders → backup → Edit NFS host access). QTS's UI doesn't expose async/secure toggles; it uses sane defaults internally.
- PBS LXC: CT 105, Debian 13 (not 12; PBS 4.x is trixie-only), privileged (avoids the uid 34 shift for NFS writes), 2 vCPU / 4 GB / 8 GB on local-zfs, 192.168.1.246, features=nesting=1 (required for systemd 257 in Debian 13 LXCs).
- PBS install: deb822 repo, keyring fetched from enterprise.proxmox.com/debian/proxmox-release-trixie.gpg, proxmox-backup-server 4.x installed.
- NFS mount + datastore: /mnt/pbs-datastore owned backup:backup, nfsvers=3, x-systemd.automount, datastore nas-primary created idempotently (gated on .chunks/ existence).
- Schedules codified: prune daily 03:00 (plain HH:MM, since PBS rejects daily HH:MM; daily is a systemd macro that can't combine with a time), verify sun 04:00, GC mon 04:00. All registered via list-and-create idempotent pattern.
- PBS API token: user pbs-pve@pbs + token pbs-pve@pbs!pve, secret + fingerprint in vault. Critical PBS 4.x gotcha: ACLs must be granted to the user auth-id, not the token auth-id; token ACLs silently resolve to zero perms. Also needs DatastoreAudit alongside DatastoreBackup so pvesm add pbs can verify the datastore.
- PVE-side storage: pvesm add pbs nas-primary wired via roles/proxmox/tasks/pbs_client.yml (idempotent, gated on pvesm status -storage).
- Backup job pbs-daily: daily 02:00 ACST, all guests, snapshot mode, target nas-primary, retention delegated to the PBS-side prune job.
- First full backup + verify: 5 guests (CT 100/104/105 + VM 101/102), ~95 GiB on NAS, ~11 min wall clock, manual verify-job run returned TASK OK on all 5 snapshots.
- Decommission old CIFS backup path: nasbackup remains registered; leave read-only for 30 days of PBS-only operation, then remove.
- Discord notification
  - PVE 9 notification target → Discord webhook for backup job status (see 5B; the webhook is shared)
5B. Server notification stack¶
Completed 2026-04-24
Three notification paths live into the shared #homelab-ops Discord channel: Beszel (CT 106, 192.168.1.247) scraping 6 agents (proxfold/arrstack/nginx/plex/pbs/control), ZED webhook on proxfold for zpool events, and a PVE 9 webhook endpoint with match-all matcher for backup/replication/node events. Install delegated to upstream scripts for hub + agents. See the beszel role page and the zfs role page for the as-built tasks.
Why: Before 5B the only Discord signal from the homelab was the 3D drift heartbeat. No visibility into backup job results, zpool degradation, SMART health, temperature, or guest resource state. Fixed all of that in one stack.
Shape: three independent notification paths, all landing in a single #homelab-ops Discord channel (separate from #homelab-drift):
- Beszel — host/container metrics, historical data, threshold alerts
- PVE 9 built-in notification target — backup job results, replication, node up/down, update availability
- ZED webhook — zpool state changes, scrub completion, vdev removal
Decisions (locked in 2026-04-23):
- Beszel not Netdata / Prometheus / Grafana — ~10MB per agent vs 200-500MB for Netdata, SQLite historical store is enough for a 4-host homelab, no dashboard engineering needed
- Beszel hub as dedicated LXC, not on arrstack: arrstack stays dedicated to the media compose; hub runs the native binary + systemd (no Docker, no nested-virt), matches the nut role shape
- Beszel agents as systemd binaries, not Docker: one agent per host (proxfold, arrstack, nginx, plex, pbs, optionally control)
- Uptime Kuma deferred — no external HTTP endpoints being tracked yet
What landed:
- Beszel hub LXC: CT 106, Debian 13 (not 12; systemd 257 in trixie), unprivileged, features=nesting=1, 1 vCPU / 1 GB / 4 GB on local-zfs, 192.168.1.247. Install delegated to get.beszel.dev/hub; cached installer at /root/install-beszel-hub.sh. Hub listens on plain HTTP :8090 (TLS terminates at future nginx reverse proxy).
- Beszel agents: installer at get.beszel.dev runs with -k <hub-pubkey> -p 45876 to bake env into the systemd unit. Deployed to proxfold (with beszel_agent_enable_smart: true for smartmontools + disk-group SMART access), arrstack, nginx, plex, pbs, control. Agents registered manually in the hub UI.
- Hub pubkey capture: derived via ssh-keygen -y -f /opt/beszel/beszel_data/id_ed25519 (Beszel doesn't write a separate .pub file); vaulted as vault_beszel_hub_pubkey.
- Alert rules: left UI-owned per the Phase 5 scope reset decision; not codified in Ansible.
- PVE 9 notification target: webhook endpoint discord-ops + match-all matcher ops-all, codified via roles/proxmox/tasks/notifications.yml. Body template uses Handlebars {{ escape ... }} with severity-based color conditionals, stored in roles/proxmox/files/discord-notification-body.json so Ansible's Jinja doesn't interfere. Critical gotcha: pvesh expects --body, --header value, and --secret value as base64-encoded strings (per the schema), not raw. Raw JSON silently stores but fails at delivery with "could not decode base64 value"; pass via | b64encode. Default mail-to-root matcher left intact; events fan out to both.
- ZED webhook on proxfold: zed_webhook_enabled: true in proxfold host_vars triggers the zfs role's ZED tasks: installs curl, renders /etc/zfs/zed.d/discord.sh, and symlinks statechange-discord.sh, scrub_finish-discord.sh, resilver_finish-discord.sh onto it. Webhook URL comes from vault_discord_webhook_homelab_ops. Smoke-tested by invoking the script directly with ZEVENT_POOL=stash ZEVENT_SUBCLASS=statechange; embed delivered clean.
- CT 104 (control) added to inventory: previously not an Ansible-managed host, now under lxc_containers with its own common + security + beszel_agent playbook. Adds the drift runner itself to observability.
- WSL → plex SSH gap fixed: WSL pubkey appended to CT 100's authorized_keys so site.yml runs cleanly from either control or WSL. Pre-existing since the 4C rebuild.
- Notification routing summary doc: deferred; content lives in the role pages for now and the header overview in changelog.
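The base64 gotcha on the PVE notification target is easy to reproduce outside Ansible. The Python sketch below shows the encoding step that the b64encode filter performs; the one-field body is a hypothetical stand-in for the real discord-notification-body.json, and the function name is illustrative.

```python
import base64
import json


def encode_for_pvesh(raw: str) -> str:
    """Base64-encode a value as UTF-8 with the standard alphabet."""
    return base64.b64encode(raw.encode("utf-8")).decode("ascii")


# ASSUMPTION: minimal placeholder body; the real template is larger and
# lives in the role's files directory.
body_template = json.dumps({"content": "{{ escape message }}"})

encoded_body = encode_for_pvesh(body_template)
print(encoded_body)

# Decoding must return the template byte-for-byte, Handlebars braces included.
decoded = base64.b64decode(encoded_body).decode("utf-8")
print(decoded == body_template)
```

Checking the round trip before handing the value to pvesh catches the silent-store, fail-at-delivery case described above.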
5C. n8n — lab automation (de-scoped)¶
Completed 2026-04-28
n8n live as a Docker stack on a dedicated host VM. Reverse-proxied at n8n.rampancy.cloud. SQLite-backed (Postgres deferred until execution volume warrants). Captured by pbs-daily. See the n8n service page for the full picture and execution-time lessons.
Why: Lab/glue capability for cross-service workflows that don't belong in Ansible. Explicitly not replacing anything in 5A/5B/3D.
What changed from the original scope:
- Pivoted from npm-on-LXC to Docker-on-VM mid-execution. Original plan was native Node.js install on an LXC. Modern n8n (v2.17+) bundles a heavy AI SDK ecosystem that OOMs a 4 GB LXC during install and won't compile against Debian trixie's apt-managed Node.js (isolated-vm ABI mismatch with Debian's V8 patches). Pivoted to the n8n team's recommended Docker path on a fresh VM, using Dockhand for git-managed deploys via the hawser agent on the host.
- Ollama dropped (no GPU budget; lab-scope already covered by Anthropic API in workflows if/when needed).
- Infrastructure workflows the original called out (ZFS health, drift detection, NUT) are now owned by ZED webhook (5B), the 3D heartbeat, and Beszel/NUT respectively. n8n is for user-facing glue only.
What landed:
- n8n VM — Proxmox VM 108, Debian 13 trixie genericcloud qcow2, cloud-init seeded, 192.168.1.248, 2 vCPU / 8 GB / 32 GB on local-zfs
- Roles applied —
common,security,auto_updates,docker,beszel_agent,hawser -
hawserrole added — Dockhand's remote-host agent codified for the first time. Per-host vault token (vault_hawser_token_<host>), RW socket mount, named volume for stack cache,REQUEST_TIMEOUT=600sdefault. See hawser role page. - n8n stack —
stacks/n8n/docker-compose.ymlin the homelab-ansible repo, deployed via Dockhand → Hawser. SQLite under then8n_datanamed volume. See n8n service page. - Reverse proxy — n8n.rampancy.cloud → 192.168.1.248:5678 via the nginx VM
- Beszel — agent registered, host visible in the Beszel UI
- PBS — first ad-hoc snapshot 2026-04-28 (1m 21s, 32 GiB transferred, 86% deduped); `pbs-daily` covers ongoing
- Site-wide drift — `--check --diff` from CT104 reported `changed=0 failed=0 unreachable=0` across all 9 hosts post-deploy
Execution-time lessons (full detail in n8n service page lessons section):
- Modern n8n's npm install peaks past 4 GB RAM and won't compile against Debian's apt Node — Docker is the right path in 2026.
- Docker 29's default storage driver (`containerd-snapshotter` / overlayfs) breaks pulls on certain layer patterns — pinned to legacy `overlay2` per-host via `docker_daemon_config`. The new default only applies to fresh Docker 29.x installs; arrstack + nginx upgraded across the 28→29 boundary so they kept their existing overlay2 and were never bitten.
- Hawser's default `REQUEST_TIMEOUT=30s` is too short for any real image pull — bumped to 600s as a role default. This was the actual blocker on multiple deploy attempts.
- Dockhand v1.0.18 (what was running on arrstack) has a separate timeout bug (Finsys/dockhand#587); upgraded to v1.0.27 during the deploy.
Deferred:
- First workflows — left for user exploration per Phase 5C's lab-scope intent
- Postgres companion — only if execution volume crosses ~5–10k/day or DB ~4–5 GiB
- Webhook/OAuth env vars in compose — `N8N_HOST` / `N8N_PROTOCOL` / `WEBHOOK_URL` are commented; uncomment when first workflow needs public callback URLs
- ~~nginx Hawser codification~~ — superseded by Phase 5D. nginx VM is retiring; new edge LXC runs Caddy (no Docker, no Hawser). Codification gap closed by the `caddy` role.
- Internal `wss://` for Hawser → Dockhand — currently plain `ws://` on LAN; tighten when the internal-traffic TLS story lands
Phase 5D — Edge LXC + Caddy migration (NPM → Caddy)
Completed 2026-05-02
Caddy live on edge LXC (CT 107, 192.168.1.244) replacing hand-clicked NPM on VM 102. Single Let's Encrypt wildcard cert *.rampancy.cloud issued via DNS-01 against Cloudflare. Four hosts migrated. NPM container stopped — VM 102 in soak before destroy. See edge-cutover runbook.
Why: NPM was the only piece of homelab infra still managed by web UI rather than Git. Hand-clicked rules, no IaC story, upstream stalled (last release Feb 2025), CVEs in the LE-cert add flow. Caddy on a single static binary closes the codification gap, picks up automatic HTTPS renewal, and supports a single wildcard cert via DNS-01.
What landed:
- `edge` LXC — CT 107, Debian 13 (trixie), unprivileged, `features=nesting=1`, 1 vCPU / 1 GB / 4 GB on local-zfs, 192.168.1.244 (mirrors CT 106 Beszel pattern)
- `caddy` role — single binary built via `xcaddy + caddy-dns/cloudflare`, version-pinned in `roles/caddy/files/caddy-2.11.2`, deployed to `/usr/local/bin/caddy`, systemd unit with `AmbientCapabilities=CAP_NET_BIND_SERVICE`, Caddyfile templated from inventory, ACME DNS-01 against CF via vaulted scoped token. See caddy role page.
- Roles applied — `common`, `security`, `caddy`, `beszel_agent`. No `docker`, no `hawser` — Caddy is config-as-code, not compose-as-code.
- Four hosts migrated to Caddy: `requests.rampancy.cloud` (Overseerr), `dash.rampancy.cloud` (Beszel), `n8n.rampancy.cloud` (n8n), `kosync.rampancy.cloud` (korrosync). All return same response codes as before, served by the LE wildcard.
- UDM port-forward swap — TCP 80/443 → 192.168.1.244 (was 192.168.1.249)
- NPM container stopped on VM 102 (`docker stop nginx_proxy_manager-app-1`); VM left running through soak
Cloudflare orange-cloud — attempted then reverted:
- CF SSL/TLS encryption mode set to Full (strict) at zone level
- Four CNAMEs flipped to orange via API
- ~~Reverted same day~~ — CF Universal SSL is disabled at the zone (no edge cert auto-issued for the new orange hostnames). Without a paid Advanced Certificate Manager order or Universal SSL re-enable, every TLS handshake to those hostnames at the CF edge failed (no SNI match). All four flipped back to gray; traffic resumed via direct WAN → UDM → Caddy. Edge security gap captured as an accepted risk until Phase 7D CrowdSec lands.
Execution-time lessons (full detail in edge-cutover runbook lessons):
- Caddy upstream systemd unit ships `--environ` flag — prints all env vars (including secret tokens) to journalctl on startup. Removed in our rendered unit.
- `caddy validate` doesn't see EnvironmentFile — ad-hoc validation needs the env var passed via Ansible's `environment:` task param. systemd's `EnvironmentFile=` only applies to ExecStart.
- Caddyfile canonical fmt uses tabs — gofmt-style. Template indented with tabs to match.
- CF token scopes are granular — Zone:Read + DNS:Edit are needed for ACME DNS-01; Zone Settings:Read is a separate scope (token cannot inspect SSL mode without it).
- CF Universal SSL provisioning is per-hostname, not zone-wide — flipping orange on a hostname without an existing edge cert results in immediate handshake failure. Always verify Universal SSL is enabled before flipping orange.
Deferred:
- VM 102 decom — done 2026-05-03. Stopped, pre-decom vzdump on `nasbackup` (1.65 GB compressed), `qm destroy 102 --purge` clean. Ansible inventory retired (`host_vars/nginx.yml`, `playbooks/nginx.yml`, `stacks/nginx/` removed).
- CrowdSec on edge — covered separately as Phase 7D
- Caddy access logs to file — currently stdout/journalctl only; add `log` directive when CrowdSec phase lands (bouncer reads structured access logs from a file)
- `trusted_proxies` config — only relevant if/when CF orange is re-enabled (Universal SSL would need to land first, or per-host Advanced Certificate Manager)
Phase 5E — Host-level file backup (proxfold → PBS)
Completed 2026-05-06
Daily file-level backup of proxfold host config (/etc, /root, /var/lib/pve-cluster) to PBS namespace host/proxfold via proxmox-backup-client + systemd timer. First snapshot 7.4 MB in 0.42s; idempotent re-run reports changed=0. Three execution-time bugs caught (CLI --output-format rejection, missing namespace subcommand, ACL-on-token bug); the third revealed a latent bug in the existing pbs role which was patched in the same cycle. Full lessons in backup-restore runbook.
Why: pbs-daily (Phase 5A) backs up guests via vzdump; the PVE host itself isn't included. A full-loss scenario not covered by the 4C ZFS boot mirror (config corruption, fat-fingered rm, full reinstall) currently means rebuilding from arrstack docs. Host configs are sub-MB, so chunked + dedup'd backups are negligible cost.
Out of scope (deliberate):
- Off-host replication of the PBS datastore. Single-failure-domain on the QNAP TS-269L is documented and accepted — no viable second target exists in the current environment. Not re-evaluated here.
- Host backups for other PVE hosts. Single-host fleet; design uses a `host/<hostname>` namespace so a second host could opt in cleanly later.
Architecture decisions (locked in 2026-05-06):
- Extend the `proxmox` role. New tasks file `host_backup.yml` and four templates (env, wrapper script, `.service`, `.timer`) — no new role, mirrors the existing `pbs_client.yml` split.
- Separate PBS user + token. New `host-backup@pbs!proxfold` distinct from the guest-backup token, blast-radius separation. Same fingerprint reused.
- Namespace `host/proxfold`. Cleanly separates from VM/CT snapshot listings.
- Schedule 02:30. Vzdump kickoff is 02:00, recent runs complete in ~5 min (PVE task log 2026-05). 02:30 has >20 min headroom and stays ahead of the 03:00 PBS prune.
- Crypt-mode `none`. Chunks live on a private NAS already; client-side encryption adds a key-loss footgun for negligible gain on host config files.
- Retention — namespace-scoped prune-job, longer than the global `pbs-daily` retention since the dataset is tiny and per-snapshot deduplication is high. Patterned after the Proxmox PBS docs prune example, tuned for homelab: `keep-daily 14, keep-weekly 8, keep-monthly 12, keep-yearly 2`.
- Manual one-shot bootstrap on PBS. Mirrors the Phase 5A token-gen pattern. Adding multi-user / multi-namespace abstraction to the `pbs` role for one extra user wasn't worth it. Procedure codified in the backup-restore runbook.
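The schedule and retention arithmetic in the decisions above can be sanity-checked with a short sketch. The times and keep-values come from the list; the placeholder date, the function names, and the reading of the keep-values as a simple upper bound are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Times taken from the decisions above; the date itself is an arbitrary placeholder.
DAY = datetime(2026, 5, 6)
vzdump_start = DAY.replace(hour=2, minute=0)
vzdump_typical = timedelta(minutes=5)       # "recent runs complete in ~5 min"
host_backup_start = DAY.replace(hour=2, minute=30)
pbs_prune_start = DAY.replace(hour=3, minute=0)


def schedule_headroom():
    """Return (gap after a typical vzdump run, gap before the prune) as timedeltas."""
    after_vzdump = host_backup_start - (vzdump_start + vzdump_typical)
    before_prune = pbs_prune_start - host_backup_start
    return after_vzdump, before_prune


# keep-* values from the namespace-scoped prune job.
KEEP = {"daily": 14, "weekly": 8, "monthly": 12, "yearly": 2}


def max_retained(keep=KEEP):
    """Upper bound on retained snapshots: each keep-* bucket contributes at most its count."""
    return sum(keep.values())


if __name__ == "__main__":
    gap_after, gap_before = schedule_headroom()
    print(f"headroom after vzdump: {gap_after}, before prune: {gap_before}")
    print(f"snapshots retained (upper bound): {max_retained()}")
```

With a 7.4 MB first snapshot and heavy deduplication, even the upper bound of retained snapshots stays small, which is consistent with the choice of longer retention than `pbs-daily`.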
Code landed:
- `roles/proxmox/tasks/host_backup.yml` — installs client + creds + script + unit + timer
- `roles/proxmox/templates/pbs-host-backup.{env,sh,service,timer}.j2`
- `roles/proxmox/tasks/main.yml` — gated include (`pbs_host_backup is defined and vault_pbs_host_token_secret is defined`)
- `roles/proxmox/defaults/main.yml` — example block + bootstrap pointer
- `inventory/host_vars/proxfold.yml` — `pbs_host_backup` block
Bootstrap executed:
- PBS-side one-shot on CT 105 — `host-backup@pbs` user + `host-backup@pbs!proxfold` token + `host/proxfold` namespace + DatastoreBackup ACL on both authids + namespace-scoped prune-job (14d/8w/12m/2y, 03:15 daily). Token captured to PBS tmpfs, SCP'd to control tmpfs, shred at both ends.
- Vault append — `vault_pbs_host_token_secret` written via decrypt-to-tmpfs / append / re-encrypt pattern; round-trip verified, source token file shred.
- First playbook run — `--limit proxfold --tags host_backup` reported `changed=5` (env file, script, service, timer, timer enable).
- Smoke test — `systemctl start pbs-host-backup.service` succeeded; snapshot `host/proxfold/2026-05-06T02:01:48Z` (7.4 MB, 3 pxar archives) registered on PBS.
- Idempotency check — full `site.yml --check --diff --limit proxfold` reports `ok=84 changed=0 failed=0`.
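The idempotency check above comes down to reading the play recap counters. A minimal sketch of that check follows, assuming the usual `host : ok=N changed=N ...` recap line shape; the function names are hypothetical, and the sample line extends the reported `ok=84 changed=0 failed=0` with assumed zero values for the other counters.

```python
import re

# One recap line: "<host> : key=value key=value ..."
RECAP_LINE = re.compile(r"^(?P<host>\S+)\s*:\s*(?P<stats>(?:\w+=\d+\s*)+)$")


def parse_recap(text):
    """Map host -> {counter: int} for every recap-shaped line in text."""
    result = {}
    for line in text.splitlines():
        match = RECAP_LINE.match(line.strip())
        if match:
            pairs = (pair.split("=") for pair in match.group("stats").split())
            result[match.group("host")] = {key: int(value) for key, value in pairs}
    return result


def is_clean(recap):
    """True when at least one host was parsed and none reports changed, failed, or unreachable."""
    return bool(recap) and all(
        stats.get(key, 0) == 0
        for stats in recap.values()
        for key in ("changed", "failed", "unreachable")
    )


# Sample line: ok/changed/failed values from the run above, remaining counters assumed.
SAMPLE = "proxfold : ok=84 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0"

if __name__ == "__main__":
    print(is_clean(parse_recap(SAMPLE)))
```

Treating an empty recap as not clean is a deliberate choice: a run that parsed no hosts should not be mistaken for a drift-free run.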
Open follow-up:
- Patch the `pbs` role to grant ACL on both user AND token auth-ids (closed 2026-05-06, same cycle). New `Ensure datastore ACLs for PBS client TOKEN auth-id` task in `roles/pbs/tasks/main.yml` mirrors the existing user-grant loop, gated on `vault_pbs_token_id is defined`. New default `pbs_client_token_name: "pve"`. Verified idempotent against live PBS (manual 5A grant matches what the task produces — `changed=0` on apply). A fresh PBS rebuild via the role no longer needs the manual second grant.
Phase 6 — New services (weeks 9+)
Greenfield — no dependencies between items. Work on whichever is most interesting at the time.
6A. Forgejo (self-hosted git, GitHub-mirrored)
Stand up a dedicated LXC running Forgejo, front it with the existing Caddy edge, migrate four GitHub repos with push-mirrors back to GitHub. Forgejo becomes source of truth; GitHub stays as private mirror so external Actions (e.g. meat-helmet's scheduled cron) keep firing.
Sub-stages 6A.1 (LXC stand-up), 6A.2 (edge integration + CF DNS token rotation), 6A.3 (repo migrations + push mirrors), 6A.4 (close-out + docs sweep). Full procedure in the Forgejo Setup runbook.
Phase 6A complete — 2026-05-05
All four sub-stages executed across two days. Forgejo 11.0.13 live at git.rampancy.cloud, all 4 repos imported with sync_on_commit push-mirrors to GitHub, local origins flipped, CF DNS token rotation closed the 7D leak follow-up. End-to-end mirror chain validated by the closing docs commit (Forgejo push → GitHub mirror within ~5s).
- 6A.1 — CT 109 created on rpool; `forgejo-sqlite` installed via Codeberg APT; service running on LAN at :3000 (executed 2026-05-04)
- 6A.2 — `git.rampancy.cloud` live behind Caddy on CT 107; CF DNS token Rolled + API-validated (closes 7D follow-up); Forgejo `ROOT_URL` flipped to https + `DISABLE_SSH=true` (executed 2026-05-05)
- 6A.3 — arrstack, homelab-ansible, mediabot, meat-helmet imported with full history (Forgejo migrate API, 2026-05-05); per-repo push-mirror live (fine-grained PATs, `sync_on_commit: true`); local remotes flipped on WSL with `github` retained as fallback
- 6A.4 — Phase 6A close-out (2026-05-05): docs sweep across role/service/runbook + homelab-ansible README refresh; Discord push-event webhook per repo deferred as low-signal for single-operator setup (procedure preserved in runbook for future collaboration trigger)
6B. Home Assistant
Stand up HAOS as a sealed Proxmox VM, integrate the Hue V2 bridge + Tapo P110M (via Matter, LAN-local) + Bambu A1 Mini (via HACS + ha-bambulab), front via Caddy edge at home.rampancy.cloud, build dashboard + automations.
Sub-stages 6B.1 (VM stand-up), 6B.2 (core integrations), 6B.3 (edge integration), 6B.4 (dashboard + automations + close-out). Full procedure in the Home Assistant Setup runbook.
HAOS is opaque to the homelab Ansible baseline
HAOS is a sealed Buildroot appliance — no SSH, no apt, no common / security / auto_updates / beszel_agent. Drift detection is blind to it; the inventory entry exists only as a documentation anchor and to drive the Caddy edge route. Compensating controls: PBS daily VM-layer backup (more comprehensive than HAOS's native backup), HA Supervisor's own update mechanism for HA Core + add-ons, HA's native Discord integration into #homelab-ops. To be recorded as an accepted risk at close-out. Wazuh agent on HAOS is deferred to Phase 7B — BeardedTinker's HAOS rule pack works API-side without an in-VM agent.
- 6B.1 — VM stand-up (2026-05-23): HAOS 17.3 (`haos_ova-17.3.qcow2.xz`, release 2026-05-06 — pin bumped from scaffold's 17.2 since 17.3 was current on the day of execution) imported to `local-zfs` as VM 110 (hass, 192.168.1.241). Q35 / OVMF / `pre-enrolled-keys=0` (the boot-loop gotcha) / EFI disk on `local-zfs` / virtio-scsi-single / 2 vCPU / 4 GB / 32 GB. First boot via HA onboarding wizard at `http://192.168.1.241:8123`; static IP set via Settings → System → Network. Boot-to-HTTP-200 was ~90 s (not the scaffold's ~5 min). PBS daily pickup at 02:00 ACST 2026-05-24 — spot-check pending.
- 6B.2 — Core integrations: HACS bootstrap via the Studio Code Server add-on shell, accepted as a managed dependency. Tapo P110M via `tplink` integration primary, Matter as fallback (revised from scaffold's Matter-primary — community + upstream HA-core issues show P110M Matter pairing flaky and energy-endpoint enumeration incomplete; `python-kasa` now handles KLAP locally without cloud creds). Hue V2 bridge auto-discovered → push event stream confirmed. `ha-bambulab` installed via HACS; A1 Mini flipped to LAN Mode + Developer Mode to permit MQTT writes under firmware ≥ 01.05. Deferred — user driving hands-on.
- 6B.3 — Edge integration (2026-05-23): `home.rampancy.cloud` vhost added to `host_vars/edge.yml` (homelab-ansible commit `b19f662`); CrowdSec coverage automatic via the existing wildcard handler. HA's `http:` block in `configuration.yaml` set with `use_x_forwarded_for: true` + `trusted_proxies: 192.168.1.244`. Gotcha: after editing `configuration.yaml`, verify HA Core has actually restarted (`docker inspect homeassistant --format '{{.State.StartedAt}}'` newer than config mtime) before applying Caddy — otherwise the trusted-proxies block stays unloaded and every X-Forwarded-For-bearing request gets HTTP 400, easy to misread as edge config error. End-to-end validated from cellular.
- 6B.4 — Dashboard + automations + close-out: print-completion → `#homelab-ops` Discord webhook (`rest` notify platform with unified `data:` block — drop scaffold's deprecated `data_template:` split syntax), sunset → Hue lights blueprint, P110M energy draw surfaced on the printer dashboard tile. Remaining docs sweep: `hosts/hass/index.md`, `services/home-assistant.md`, `mkdocs.yml` nav, `reference/accepted-risks.md` HAOS-opacity entry.
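The 6B.3 gotcha (confirm HA Core restarted after the config edit) is a timestamp comparison. The sketch below assumes the `StartedAt` value is a UTC timestamp with optional fractional seconds and a trailing `Z`, and that the config mtime is available as epoch seconds; both function names are hypothetical.

```python
import re
from datetime import datetime, timezone

# Date-time part, optional fractional seconds of any length, literal Z suffix.
_TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z$")


def parse_started_at(value):
    """Parse a timestamp such as 2026-05-23T04:12:33.123456789Z into an aware datetime."""
    match = _TIMESTAMP.match(value.strip())
    if not match:
        raise ValueError(f"unrecognised timestamp: {value!r}")
    base = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
    # Truncate or pad the fraction to microseconds (datetime has no nanosecond field).
    micros = int((match.group(2) or "0")[:6].ljust(6, "0"))
    return base.replace(microsecond=micros, tzinfo=timezone.utc)


def restarted_since_edit(started_at, config_mtime):
    """True when the container start time is strictly newer than the config mtime (epoch seconds)."""
    edited = datetime.fromtimestamp(config_mtime, tz=timezone.utc)
    return parse_started_at(started_at) > edited
```

If `restarted_since_edit` returns False, the `trusted_proxies` block has not been loaded yet, and HTTP 400 responses through the edge are expected rather than a sign of a Caddy misconfiguration.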
6C. Obico (3D print failure detection)
- On the A1 Mini, enable LAN Mode (Settings → Network) and then Developer Mode — these are two separate toggles; both are required for third-party integrations
- Connect a USB webcam for AI detection (A1 Mini built-in camera stream is not suitable for Obico's failure detection model)
- Deploy self-hosted Obico server in Docker
- Obico now supports direct Bambu connection without OctoPrint as middleware — use the native Bambu integration in recent Obico releases
- Configure Discord/Telegram notifications
6D. Music acquisition pipeline
Phase 6D complete — 2026-05-16
Lidarr (hotio plugins branch image) + Tubifarry plugin + slskd (via gluetun) + beets landed on arrstack VM 101. PlexAmp/Plex stay as the streaming layer (no Navidrome — explicit decision; pipeline scoped to acquisition only). Music library shared back read-only on Soulseek (1,905 dirs / 15,986 files) with 5-slot / 1 MB/s upload cap, polite per-peer throttle (3 grabs/day per peer, max 5 queued/peer, min 5 files for "real album" filter). Smoke test artist (Gotye) downloaded Making Mirrors Deluxe (22 FLACs) end-to-end. The bulk-import-from-root-folder auto-triggered when the root folder was added — 877 albums / 10,691 tracks registered in one pass. Lessons in music-acquisition-bringup runbook.
- Music library was already on ZFS at `/stash/rodneystash/Music` (Plex `Tunes` indexes it via the `/mnt/plex/Music` symlink managed by the plex role)
- Lidarr on `ghcr.io/hotio/lidarr:pr-plugins` (mainline has no plugin system) + Tubifarry plugin installed via Lidarr UI from `https://github.com/TypNull/Tubifarry`
- slskd 0.25 routed through existing gluetun container (`network_mode: service:gluetun`), native VPN integration via `SLSKD_VPN=true` + `SLSKD_VPN_GLUETUN_*` env vars
- Gluetun control API auth wired (X-API-Key role for slskd, config TOML at `/opt/mediaserver/gluetun/auth/config.toml`)
- Soulseek citizenship: shared `/music` (read-only) + `/downloads`, 5 upload slots, 1 MB/s cap, polite profile blurb signalling automated library
- beets installed (on-demand sanity tagger, no Lidarr-pipeline integration)
- Lidarr Plex Connect notification wired via Plex.tv OAuth (auto-rescan on import)
- Existing 612-artist library imported (877 albums, 10,691 tracks) — auto-triggered on root folder add
- Gotye Making Mirrors (Deluxe Edition, 22 FLACs) downloaded + imported + visible in Plex Tunes
6E. Matrix server
Goal: Stand up a closed, federated Matrix homeserver that replicates Discord's day-to-day chat experience — text rooms, threads, spaces, and group voice/video — for me plus a small circle of family and friends. Element X / Element Web are the clients. Mobile push, Discord bridging, and OIDC SSO are explicitly deferred.
Scoped 2026-05-21 — not started
Scoping replaces the original four-line stub ("Synapse + PostgreSQL + Caddy via Docker Compose"). Two pivots vs the original: Synapse → Tuwunel (Rust, embedded RocksDB, much lighter at idle) and LXC → VM (spantaleev's playbook explicitly warns against LXC). See the Matrix setup runbook for the execution checklist.
Architecture — Docker stack on a new VM, fronted by the existing edge Caddy + CrowdSec:
| Component | Purpose | Notes |
|---|---|---|
| Tuwunel | Homeserver (text / rooms / spaces / threads / E2EE / federation) | Embedded RocksDB — no Postgres. Official conduwuit successor per the project's own README. |
| LiveKit Server | WebRTC SFU for group voice/video calls (MatrixRTC) | Single binary. Default media UDP range trimmed via ICE/UDP mux on 7882. |
| lk-jwt-service | Issues short-lived JWTs that authenticate Matrix clients into LiveKit | Tiny Go binary. |
| Traefik (bundled) | Internal HTTP router for the playbook's containers | Bound to 0.0.0.0:81 on VM 111; external Caddy on CT 107 terminates TLS. |
| matrix-static-files | Serves .well-known/matrix/* from matrix.rampancy.cloud | Apex rampancy.cloud well-known files come from Caddy directly. |
Deployment approach — spantaleev's matrix-docker-ansible-deploy vendored separately on CT 104 at ~/matrix-deploy/, parallel to homelab-ansible/. The integration surface (well-known apex, federation routing, MatrixRTC ↔ LiveKit JWT signing, WebSocket forwarding for SFU) is large enough that re-deriving it in a custom role would cost more than the external-playbook tax. Drift detection skips the matrix VM — config drift is checked manually via the playbook's --check mode when wanted, not via the nightly drift cron.
Cross-phase decisions:
- Homeserver: Tuwunel (over Synapse). Rust, ~512 MiB at rest, embedded RocksDB, no Postgres dependency. Synapse is more featureful but eats 2–4 GiB and brings Postgres in tow — wrong fit until 4B RAM upgrade lands.
- VM, not LXC. Docker-in-LXC works elsewhere (arrstack VM 101 is technically a VM for the same reason); MatrixRTC's UDP networking + multiple-container fan-out + AppArmor footprint are even less LXC-friendly than the arrstack pattern.
- Federation via well-known delegation, not port 8448. Apex `rampancy.cloud` serves `/.well-known/matrix/server` pointing federation to `matrix.rampancy.cloud:443`. Keeps all inbound traffic on the existing edge Caddy + CrowdSec path; no new TCP port-forward at the UDM.
- LiveKit UDP is unavoidable. First service in the homelab that needs router-level UDP forwarding. Ports: 7881/tcp + 7882/udp (ICE) + 3479/udp + 5350/tcp (TURN) + 30000-30020/udp (TURN relay range). Forwarded straight to VM 111 — bypassing the edge LXC, no Caddy/CrowdSec coverage on those ports.
- Audience: closed. `matrix_tuwunel_config_allow_registration: false` plus a `vault_matrix_registration_token` for invite-style signups. Federation is enabled, so my rooms can include matrix.org users; my server just doesn't accept walk-ins.
- Element Call frontend skipped. Element X (mobile) and Element Web both embed the RTC widget internally. Standalone `call.rampancy.cloud` deployment is unnecessary unless we want a clickable web-call landing page later.
Resource footprint (VM 111, sized to match arrstack/n8n VM pattern):
| Slice | Target |
|---|---|
| RAM | 8 GiB (Tuwunel ~2, LiveKit ~1, everything else <1, OS+Docker ~1, headroom ~3) |
| Disk | 32 GiB on rpool (RocksDB growth is the unknown — Beszel disk alert will catch it) |
| vCPU | 4 |
| IP | 192.168.1.243 |
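The RAM row is a small budget, and a quick sketch can confirm the slices add up to the VM size. The figures are copied from the table, with the "<1" slice rounded up to 1 GiB as a conservative assumption; the names are illustrative only.

```python
# GiB figures from the RAM row; "<1" is rounded up to 1 as a conservative assumption.
BUDGET_GIB = {
    "tuwunel": 2,
    "livekit": 1,
    "everything_else": 1,
    "os_docker": 1,
    "headroom": 3,
}
VM_RAM_GIB = 8


def committed(budget=BUDGET_GIB):
    """GiB expected to be in use, excluding the headroom slice."""
    return sum(size for name, size in budget.items() if name != "headroom")


def fits(budget=BUDGET_GIB, total=VM_RAM_GIB):
    """True when all slices, headroom included, fit inside the VM allocation."""
    return sum(budget.values()) <= total


if __name__ == "__main__":
    print(f"committed: {committed()} GiB of {VM_RAM_GIB}, fits: {fits()}")
```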
Completed 2026-05-22
All five sub-phases closed 2026-05-22. Tuwunel v1.7.0 on VM 111, federation green via apex well-known, MatrixRTC live with Element Call validated end-to-end (audio + video + screen-share). Two headline gotchas in the runbook's Lessons appendix: matrix_tuwunel_config_allowed_remote_server_names filtering events from our OWN server (text/federation phase) and the apex well-known missing org.matrix.msc4143.rtc_foci (RTC phase). Both fixes are now baked into the Caddy template and vars.yml respectively.
Execution checklist (actual):
- 6E.1 — VM 111 stand-up: Debian 13 genericcloud image, 4 vCPU / 8 GiB / 32 GiB at 192.168.1.243; common + security + beszel_agent applied (docker role intentionally dropped from `playbooks/matrix.yml` post-execution — spantaleev owns Docker here, see Lessons). Completed 2026-05-21.
- 6E.2 — spantaleev playbook bootstrap on CT 104: cloned at commit `9bd9d1a` to `/root/matrix-deploy/`, inventory scaffolded for Tuwunel, vault-bridge symlink set up. `matrix_tuwunel_version: v1.7.0` pinned (v1.6.2 has a token+password registration regression). `ensure-matrix-users-created` is Synapse-only — Tuwunel's `grant_admin_to_first_user` handles first admin via client-side registration token instead. `just` substituted with direct `ansible-galaxy install -r requirements.yml -p roles/galaxy/ --force` (no `just` on Debian bookworm). Completed 2026-05-22.
- 6E.3 — Edge integration: caddy role extended with `caddy_matrix_enabled` gate + `caddy_matrix_upstream` (single upstream — federation collapsed onto web entrypoint via `matrix_federation_public_port: 443`). Template adds apex `rampancy.cloud` block (well-known statics) and `matrix.rampancy.cloud` block (federation path skips CrowdSec). Apex LE cert issued via existing CF DNS-01. Completed 2026-05-22.
- 6E.4 — MatrixRTC UDM port-forwards: 5 forwards configured via UDM UI (matrix-rtc-ice-tcp 7881/tcp, matrix-rtc-ice-udp-mux 7882/udp, matrix-rtc-turn-udp 3479/udp, matrix-rtc-turn-tcp 5350/tcp, matrix-rtc-turn-relay 30000-30020/udp → 192.168.1.243). Apex `.well-known/matrix/client` extended to advertise `org.matrix.msc4143.rtc_foci` (Element Call queries the apex well-known and doesn't fall through to the matrix subdomain — see Lessons). Element Call validated end-to-end: desktop ↔ Element X mobile on cellular, audio + video + screen-share both directions. Completed 2026-05-22.
- 6E.5 — First user + smoke: `@rampancy:rampancy.cloud` registered via Element Web's token-gated registration, auto-promoted to admin by Tuwunel, joined the auto-created `#admins:rampancy.cloud` room with `@conduit:rampancy.cloud` bot. Federation tester green. Completed 2026-05-22.
- 6E.6 — Docs sweep: this roadmap entry flipped; matrix-setup runbook rewritten with actual execution + 9-item Lessons appendix; changelog entry below; hosts/proxfold guests update + services/matrix.md page pending (separate one-off cleanups, no functional block).
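For reference, the two apex well-known documents discussed in 6E.3 and 6E.4 can be sketched as plain JSON builders. The `m.server` and `m.homeserver` keys follow the Matrix well-known conventions; the shape of the `org.matrix.msc4143.rtc_foci` entry and the LiveKit service URL are assumptions (the real values live in the Caddy template, which is not shown here).

```python
import json

HOMESERVER_HOST = "matrix.rampancy.cloud"
# Hypothetical URL: the actual lk-jwt-service endpoint is not given in this roadmap.
LIVEKIT_SERVICE_URL = "https://matrix.rampancy.cloud/livekit-jwt-service"


def well_known_server():
    """Body for /.well-known/matrix/server: delegate federation to the subdomain on 443."""
    return {"m.server": f"{HOMESERVER_HOST}:443"}


def well_known_client():
    """Body for /.well-known/matrix/client, including the RTC focus Element Call looks for."""
    return {
        "m.homeserver": {"base_url": f"https://{HOMESERVER_HOST}"},
        # Assumed entry shape; verify against the runbook's Lessons appendix.
        "org.matrix.msc4143.rtc_foci": [
            {"type": "livekit", "livekit_service_url": LIVEKIT_SERVICE_URL}
        ],
    }


if __name__ == "__main__":
    print(json.dumps(well_known_server(), indent=2))
    print(json.dumps(well_known_client(), indent=2))
```

Serving both documents from the apex matters because, per 6E.4, Element Call reads the apex well-known and does not fall through to the matrix subdomain.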
Deferred / future sub-phases (not committed; reassess after 6E lands):
- 6E.7 — Mobile push: Sygnal + FCM keys OR self-hosted UnifiedPush. Adds a Google dependency or a third small service. Skip if Element Web is good enough day-to-day.
- 6E.8 — Discord bridge: mautrix-discord. Adds Postgres 16+ (Tuwunel doesn't need it but the bridge does) and a per-user puppeting workflow. Worth doing only if there's an active Discord community to keep bridging into.
- 6E.9 — Pocket-ID OIDC integration — bundles cleanly into Phase 7E.
6F. Music recommendations / discovery
Goal: Add Spotify-style Discover Weekly / Daily Jams on top of the existing PlexAmp listening experience. Recommendations come from community listening data (ListenBrainz), missing tracks dispatch through the 6D acquisition pipeline, and the result lands as Plex playlists.
Scoped, not started — reference only
Scoped from a research session on 2026-05-16 immediately after 6D close-out; reassess at execution time. No work begun.
Architecture — four small Docker services, all fitting on arrstack VM 101 alongside the existing 6D stack:
| Layer | Purpose | Pick |
|---|---|---|
| Scrobble source | Captures what's been played | Plex / PlexAmp native scrobble webhook |
| Scrobble bridge | Forwards Plex listens to ListenBrainz (Plex doesn't speak LB natively) | RustyRin/Plex_Scrobble_App or FoxxMD/multi-scrobbler |
| Recommendation engine | Generates Discover Weekly / Daily Jams / similar-artists from listening history | ListenBrainz public instance (MetaBrainz) |
| Orchestrator | Pulls LB recs → resolves against local library → dispatches missing tracks via slskd → publishes a Plex playlist | LumePart/Explo |
Pipeline:
PlexAmp listening
↓ Plex webhook
Scrobble bridge (Docker, arrstack VM 101)
↓
ListenBrainz (public cloud)
↓ recommendations API
Explo (Docker, arrstack VM 101)
├── checks Plex/local library
├── missing → dispatches to slskd (already built in 6D)
└── publishes Discover Weekly / Daily Jams as Plex playlists
Execution checklist (when picking this up):
- Stand up scrobble bridge (Plex_Scrobble_App or multi-scrobbler) as a Docker service on arrstack VM 101; wire Plex webhook in
- Register ListenBrainz account on the public instance, wire the scrobbler → LB API
- Accumulate ≥ 2 weeks of scrobble history before standing up Explo (recs are cold-start-sensitive)
- Stand up Explo on arrstack VM 101; wire LB API → Plex library lookup → slskd dispatch → Plex playlist publish
- Decide how Explo-grabbed-via-slskd items get tagged/imported: round-trip through Lidarr (consistent with 6D) or direct-land in the Plex library (faster but bypasses Tubifarry's import path)
- Smoke test: confirm Discover Weekly / Daily Jams playlists appear in Plex with plausible picks based on actual recent listening
- New `services/listenbrainz.md` + `services/explo.md` pages on close-out; `services/arrstack.md` services list updated; `mkdocs.yml` nav entries added
Open decisions / risks:
- ListenBrainz public vs self-host. Public instance is free and rec quality benefits from cross-user collaborative filtering (self-hosting with N=1 dataset hurts the model). Default to public. Privacy posture: listening history is visible to anyone who looks (Last.fm-class).
- Cold-start. First 2–3 weeks of recs will be weird/generic until LB has enough history. Mitigation: stand up scrobbling well before Explo so the dataset is already accumulating when the orchestrator lands.
- Explo is single-maintainer (LumePart). Active per commit history in 2026 but smaller than the *arr stack — maintenance risk worth knowing; have a fallback in mind (manual LB rec → Lidarr import).
- Plex scrobble webhook is reliable but not perfect — occasionally misses listens during network blips. Not a deal-breaker for recs.
- No mood/activity-aware curation (Spotify "Focus" / "Late Night"). That's where the streaming-service experience still wins; explicitly out of scope for 6F.
- Caveat captured but not validated: assumes Explo speaks slskd directly. Verify against current Explo README at execution time — if it only writes Lidarr Wanted entries, the slskd dispatch happens via Tubifarry as a side effect rather than directly.
Housemate Proxmox access
Trusted-LAN scope: VMs land on vmbr0/VLAN 1 today, with Proxmox-side controls (RBAC, ZFS quota, scoped storage) as the boundary. L2 isolation onto a dedicated Lab VLAN is captured separately as Phase 8 — Network segmentation.
Trust posture: housemate is a known person, written guidelines on resource limits + risky operations, no hostile-tenant assumptions.
See the Housemate Access runbook for the full ACL set + execution order.
- Enable UDM Network API token (or Local Admin user fallback) for read-only config access
- Create PVE realm user `hazel@pve` with TOTP enforced
- Create group `housemate-lab-admins` and ZFS dataset `stash/housemate-vms` (quota=500G)
- Create PVE storage `housemate-zfs` (ZFS plugin, content `images,rootdir`) and resource pool `housemate-lab`
- Apply group ACLs: pool → `PVEVMAdmin`, storage → `PVEDatastoreUser`, `/sdn/zones/localnetwork/vmbr0` → `PVESDNUser`
- PBS coverage: existing `pbs-daily` job picks up Hazel's pool VMs automatically (root namespace); operator-mediated restore. Self-service PBS UI for Hazel is deliberately deferred — see runbook step 6 future-enhancement block
- Written guidelines for Hazel: resource caps, snapshot discipline, ping-before-passthrough
Phase 7 — Security stack (weeks 10+)
Goal: Add SIEM / endpoint detection coverage and close the edge-security gap left by Phase 5D. Drift detection, Beszel, and PBS cover availability; nothing today covers intent. Phase 7 closes that gap with Wazuh as the centrepiece, network IDS via Suricata, edge IPS via CrowdSec, and a lightweight identity provider (Pocket-ID) for OIDC-native admin UIs.
Pre-requisite: Phase 4B
Wazuh's all-in-one server wants 8–16 GB on a dedicated VM. Current 48 GB physical leaves no comfortable headroom once arrstack/n8n/PBS/Beszel/control/plex/nginx are accounted for. 4B's 8× 32 GB / 256 GB target is sized partly to unblock this phase, with comfortable ARC headroom + future expansion room (4 slots free). Without 4B done first, Phase 7 squeezes everything else and hits its retention/heap ceilings within months.
flowchart LR
A[7A: Wazuh AIO + first agents] --> B[7B: Agent fleet rollout]
B --> C[7C: Suricata NIDS + integration]
B --> D[7D: CrowdSec edge]
B --> E[7E: Pocket-ID identity + SSO]
style A fill:#FAECE7,stroke:#993C1D,color:#712B13
style B fill:#FAECE7,stroke:#993C1D,color:#712B13
style C fill:#EEEDFE,stroke:#534AB7,color:#3C3489
style D fill:#EEEDFE,stroke:#534AB7,color:#3C3489
style E fill:#EEEDFE,stroke:#534AB7,color:#3C3489
7D + 7E ship together
CrowdSec covers edge protection (drop scanners before they reach app login). Pocket-ID covers identity (single sign-on for OIDC-native admin UIs). They're complementary, share an "edge hardening" theme, and are scoped to land in one cycle. 7E does not depend on Phase 4B's RAM upgrade — Pocket-ID's footprint (~50 MB RAM, single Go binary) is trivial.
7A. Wazuh AIO + Ansible scaffolding + first agents
Why: Stand up Wazuh manager + indexer + dashboard on a single VM. Pin to Wazuh 4.14.x — current stable is 4.14.5 (release notes, 2026-04-23). Wazuh 5.0 beta1 shipped April 2026 with rewritten agent protocol, removed Filebeat, and cluster-by-default (5.0 brief) — treat 4.14.x as the runway through 5.0 GA.
- Provision new VM (e.g. `wazuh`, next free VMID, Debian 13 trixie genericcloud cloud-init, 4 vCPU / 8 GB / 100 GB on `local-zfs`, static IP in 192.168.1.0/24)
- Apply baseline: `common`, `security`, `auto_updates` (with the Wazuh package itself excluded from auto-updates and pinned manually to dodge cluster-version-mismatch landmines), `beszel_agent`
- Add wrapper roles `roles/wazuh_server` and `roles/wazuh_agent` that pull the upstream `wazuh.wazuh` collection from Galaxy via `requirements.yml`; this mirrors the existing `auto_updates` wrapping pattern around `hifis.toolkit.unattended_upgrades`. Don't dump upstream `wazuh-ansible` roles directly into `roles/`; they have opinionated facts/handlers that collide with the homelab baseline
- Vault entries: `vault_wazuh_admin_password`, `vault_wazuh_api_password`, `vault_wazuh_cluster_key`, `vault_wazuh_agent_password`, `vault_discord_webhook_homelab_security`
- Bake homelab-specific tunings into the wrapper role on day one: `agents_disconnection_time: 1d`, `agents_disconnection_alert_time: 0`, JVM heap `Xms = Xmx = 1024m`. Without these, every reboot generates noise and runtime heap resizing degrades indexer perf (Lazro homelab tuning writeup)
- Reverse-proxy `wazuh.rampancy.cloud` → `wazuh:443` via Caddy on `edge` (Phase 5D). TLS terminates upstream; Wazuh dashboard speaks HTTPS internally so the proxy is 443→443
- Discord integration via Wazuh's built-in `<integration>` block → new `#homelab-security` channel (separate from `#homelab-ops` and `#homelab-drift`). Default to rule level ≥ 10; tune down later
- Captured by `pbs-daily` automatically once the VM lands in inventory
- First agents: proxfold (PVE host directly) + arrstack VM as the test fleet
- Add docs: `hosts/wazuh/index.md`, `services/wazuh.md`, `ansible/roles/wazuh-server.md`, `ansible/roles/wazuh-agent.md`, all wired into `mkdocs.yml`
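The "rule level ≥ 10" default for the Discord channel amounts to a threshold filter on each alert. The sketch below only illustrates that logic; it is not Wazuh's integration code, and `MIN_LEVEL`, `build_discord_payload`, and the message format are hypothetical names chosen for this example. The `rule.level`, `rule.description`, and `agent.name` keys are fields of Wazuh's alert JSON.

```python
import json
from typing import Optional

MIN_LEVEL = 10  # starting threshold from the plan; lower it once the noise is understood


def build_discord_payload(alert: dict, min_level: int = MIN_LEVEL) -> Optional[dict]:
    """Return a Discord webhook body for the alert, or None if it is below the threshold."""
    rule = alert.get("rule", {})
    level = int(rule.get("level", 0))
    if level < min_level:
        return None
    agent = alert.get("agent", {}).get("name", "unknown")
    description = rule.get("description", "(no description)")
    # Discord webhooks accept a JSON body with a "content" string.
    return {"content": f"[level {level}] {agent}: {description}"}


if __name__ == "__main__":
    sample = {
        "rule": {"level": 12, "description": "sshd: brute force attempt"},
        "agent": {"name": "proxfold"},
    }
    print(json.dumps(build_discord_payload(sample)))
```

Replaying a day of archived alerts through a filter like this is a cheap way to estimate channel volume before lowering the threshold.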
7B. Agent fleet rollout
Why: Coverage. The proxfold host agent carries the most signal (kernel events, package changes, ZFS state, SSH attempts, auditd), but each additional agent fills in more of the security picture.
- Roll out agents to remaining VMs: n8n (and plex if/when it moves from LXC to VM)
- Roll out agents to LXCs: control, pbs, beszel, edge, but with caveats. Wazuh agent has documented install issues inside unprivileged LXCs (wazuh#24954). Test on one CT first; evaluate before promising fleet coverage. Worst case: skip agent in LXCs and rely on the proxfold host's view of `/var/log/lxc/*` and `pct exec` audit
- Tune VPN-induced noise: qBittorrent's source IP via gluetun looks "wrong" to geoIP-style rules; suppress or whitelist early
- Pull in BeardedTinker's UniFi/Synology/HomeAssistant rule packs (BeardedTinker/wazuh-homelab-security); UDM and future Phase 6 HAOS integrate near zero-config
- Rebuild-kit gotcha: Wazuh agent name can't be changed post-install (wazuh#19710); the `rebuild/` kit must install agents after hostname is set, not as part of a base template. Codify or document the constraint
7C. Suricata NIDS + Wazuh integration
Why: Without an NIDS, Wazuh sees only host-side events and is blind to traffic on the wire. Suricata + Wazuh is a first-class integration: Wazuh auto-parses `/var/log/suricata/eve.json` and surfaces alerts in the dashboard (Wazuh PoC: Suricata integration).
- Decide host shape — dedicated Suricata VM, or co-located on the nginx/Caddy edge VM
- Decide tap point — port-mirror from MikroTik PENFOLD-SW01, or in-line on the gateway path. Mirror is non-invasive and the right starting point; in-line gives blocking capability later
- Codify as `roles/suricata` with `roles/wazuh_agent` already installed for the `eve.json` forward
- Tune EmergingThreats Open ruleset: homelab traffic generates noise on alerts written for enterprise environments
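When tuning the ruleset, it can help to look at `eve.json` directly before the alerts reach Wazuh. The following is a minimal sketch of such an offline filter, not part of either tool; the function name and the `max_severity` cut-off are assumptions. It relies on the standard EVE fields `event_type`, `src_ip`, `dest_ip`, `alert.signature`, and `alert.severity`.

```python
import json
from typing import Iterable, Iterator


def suricata_alerts(lines: Iterable[str], max_severity: int = 2) -> Iterator[dict]:
    """Yield alert events from eve.json lines, keeping severity <= max_severity.

    Suricata treats severity 1 as the most severe, so a smaller number
    means a more serious alert.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partially written lines
        if event.get("event_type") != "alert":
            continue
        alert = event.get("alert", {})
        if alert.get("severity", 99) <= max_severity:
            yield {
                "src_ip": event.get("src_ip"),
                "dest_ip": event.get("dest_ip"),
                "signature": alert.get("signature"),
                "severity": alert.get("severity"),
            }
```

Counting the yielded signatures over a day of mirrored traffic shows which rules are worth suppressing first.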
7D. CrowdSec on edge
Completed 2026-05-04
CrowdSec engine + hslatman Caddy bouncer module live on edge (CT 107). Bouncer registered, polling LAPI on 15s ticker, crowdsec directive in every per-host handle block. End-to-end validation via cellular phone confirmed: blocked IP got 403, removed IP returned to 200. Accepted risk: edge security gap closed same day. See crowdsec_engine role and crowdsec-validation runbook — the runbook captures six bugs / gotchas hit during first execution.
- `crowdsec_engine` role landed: packagecloud `any/any` apt source, `crowdsecurity/caddy` collection, `Restart=on-failure` drop-in, stat-gated for `--check --diff` cleanliness
- `caddy` role updated: xcaddy-built binary `caddy-2.11.2-cs1` with hslatman bouncer module + cloudflare DNS, gated bouncer block in Caddyfile template (`{$CROWDSEC_BOUNCER_API_KEY}` via parse-time substitution), site-block JSON access logs to journal
- One-time `cscli bouncers add caddy-edge` operator step + vaulted key (registration is non-idempotent; not in the role)
- Validation runbook executed end-to-end from cellular (LAN test masked by hairpin NAT; runbook now leads with this constraint)
- Accepted-risks register entry closed
- Closed 2026-05-05 — rotated via CF Roll as part of Phase 6A.2. New secret validated by API probe (token verify + Zone:DNS:Edit TXT round-trip); script-based vault swap so the token never appeared in scrollback.
Validated post-go-live (2026-05-04): within 4 hours of bouncer enable, the engine's local Caddy-log parser caught a real HTTP-probing attempt from 49.178.191.113 (AU, Microplex) and auto-banned via the crowdsecurity/http-probing scenario. Confirms cscli setup auto-discovery wired Caddy log acquisition + base-http-scenarios + http-cve collections at install time — the role doesn't need to do that work explicitly. SSH log acquisition + crowdsecurity/sshd collection also auto-installed on edge; SSH brute-force detection now feeds the same reputation pool as Caddy, but enforcement for SSH still needs cs-firewall-bouncer (Caddy bouncer is HTTP-only).
7D follow-ups (deferred, flag for later)
Captured 2026-05-04 from a maturity-of-implementation review. None are blocking; pull into a future cycle when one feels warranted.
- Discord notification profile. Engine-side notification system supports HTTP webhooks (`/etc/crowdsec/notifications/http.yaml` + `/etc/crowdsec/profiles.yaml`). Wire bans → `vault_discord_webhook_homelab_ops` for visibility per-effort. ~30 min of work; shares the channel pattern used by PBS/ZFS/PVE/auto-updates events.
- `cs-firewall-bouncer` + retire `fail2ban` on edge. SSH detection is already happening (auto-installed `crowdsecurity/sshd` parses ssh.service journald), but SSH enforcement still relies on the Caddy-side `security` role's `fail2ban`. Adding `cs-firewall-bouncer` (nftables-based) gives a unified federated reputation feed for both SSH and HTTP, and lets us cleanly retire fail2ban from the `security` role on edge. New `crowdsec_firewall_bouncer` role; small bootstrap (similar non-idempotent `cscli bouncers add` operator step pattern as the Caddy bouncer).
- CrowdSec Console enrollment. app.crowdsec.net is a free SaaS dashboard showing alerts/decisions/top scenarios. One-command enroll: `cscli console enroll <token>`. Privacy tradeoff: ships event metadata (timestamps, scenario names, IPs) to CrowdSec's cloud. Considered worth weighing if CLI feedback ever stops being enough.
- Wazuh forwarding: already deferred to Phase 7A/B per scoping.
Scope cuts (2026-05-04):
- CrowdSec ↔ Wazuh forwarding deferred to Phase 7A/B. Wazuh is gated on Phase 4B (CPU + RAM upgrade), which is deprioritised. Edge gap is more urgent than the broader SIEM build; closing 7D standalone unblocks it. Forwarding hook documented in the role doc as a one-line wiring job when Wazuh exists.
- Lynis weekly cron split out to Phase 7A/B. Functionally orthogonal to edge bouncing; pairs naturally with Wazuh's SCA module via custom decoder. Not in 7D scope.
- SOAR-lite stretch moves with the Wazuh piece — depends on Wazuh active-response.
Decisions (locked 2026-05-04):
- Apt source = packagecloud `any/any`, not Debian trixie main. Upstream's trixie repo returns 404 (issue #3909, unresolved 2026-05-04); `any/any` is upstream's documented workaround. Debian trixie ships its own `crowdsec` package but the version freezes at trixie-release time and falls behind hub items; picked the upstream repo for engine/parser currency despite the external dep. Path component and suite must both be `any` (`/debian` + suite `any` returns HTTP 422; caught on first apply).
- Integration via Caddy module, not file-log parser. hslatman/caddy-crowdsec-bouncer plugs in via `xcaddy --with` (we already xcaddy-build), per-request IP check against LAPI, no logfile detour, dodges the known caddy-logs parser bug. Tradeoff: Caddy upgrades require a rebuild, which is already our workflow.
- fail2ban kept for SSH-only. The CrowdSec bouncer module is HTTP-only (Caddy choke-point). fail2ban on edge stays for SSH brute-force protection (LAN-only path but still sensible). No retire-fail2ban work in 7D.
- LAPI stays on stock 127.0.0.1:8080. Initial design moved it to 6060 to dodge a hypothetical Caddy alt-port collision; reverted on first apply because the agent's `local_api_credentials.yaml` is hardcoded to 8080 by the installer, so the engine wouldn't start. Lesson: don't optimise for hypothetical future state when it touches multiple components.
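For debugging, the IP check a bouncer makes against LAPI can be reproduced by hand. The sketch below is an illustration, not the hslatman module's code; the function names are hypothetical. It assumes LAPI's `GET /v1/decisions?ip=…` endpoint with an `X-Api-Key` header on the stock 127.0.0.1:8080 address, where a body of `null` means no active decision.

```python
import json
import urllib.parse
import urllib.request

LAPI_URL = "http://127.0.0.1:8080"  # stock LAPI address from the decision above


def decision_url(ip: str, base: str = LAPI_URL) -> str:
    """Build the LAPI query URL for decisions that apply to one IP."""
    return f"{base}/v1/decisions?{urllib.parse.urlencode({'ip': ip})}"


def is_banned(body: str) -> bool:
    """Interpret a LAPI response body: 'null' means no decision, else a JSON list."""
    decisions = json.loads(body)
    if not decisions:
        return False
    return any(d.get("type") == "ban" for d in decisions)


def check_ip(ip: str, api_key: str) -> bool:
    """Query LAPI for one IP (needs a running engine and a registered bouncer key)."""
    req = urllib.request.Request(decision_url(ip), headers={"X-Api-Key": api_key})
    with urllib.request.urlopen(req, timeout=5) as resp:
        return is_banned(resp.read().decode())
```

`check_ip` only works on the edge host itself, since LAPI listens on loopback; the two pure helpers can be exercised anywhere.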
7E. Pocket-ID identity provider + selective SSO
Why: Adds OIDC-based single sign-on in front of the OIDC-native admin UIs (Proxmox VE, PBS) without the operational footprint of a full IdP. Bundles with Phase 7D — CrowdSec drops unauthorized traffic at the edge; Pocket-ID handles identity for the apps that benefit from SSO. The two together close the edge-security accepted risk captured 2026-05-02.
- New LXC `auth` (next free CTID), Debian 13 trixie, unprivileged, `features=nesting=1`, 1 vCPU / 1 GB / 4 GB on `local-zfs`, static IP in 192.168.1.0/24 (mirrors CT 106 Beszel / CT 107 edge pattern)
- New `pocket_id` Ansible role: pinned binary (or apt install if/when Pocket-ID ships a Debian package), systemd unit, SQLite-backed, vault-stored signing secret. Same shape as the `caddy` role: single binary, config templated from inventory.
- Caddy: add `auth.rampancy.cloud` → `auth:1411` to `caddy_proxy_hosts` in `host_vars/edge.yml`. The existing LE wildcard `*.rampancy.cloud` already covers it; Cloudflare needs a CNAME `auth → rampancy.cloud` (gray-cloud).
- OIDC integration: configure Proxmox VE OIDC realm pointing at Pocket-ID; configure PBS similarly. Both PVE 8+ and PBS 4+ support OIDC realms natively, so no proxy auth is needed.
- User accounts: primary (operator) + Hazel. Each registers two passkeys: primary in Bitwarden (cloud-synced), secondary on a hardware key (YubiKey 5) or device-native (Touch ID / Windows Hello). Bitwarden emergency access pre-configured for Hazel as the recovery path.
- Beszel agent on the new LXC.
- Docs: new `services/pocket-id.md` + `hosts/auth/index.md` + `ansible/roles/pocket-id.md`, all wired into `mkdocs.yml`. New runbook `runbooks/sso-rollout.md` capturing the cutover (Proxmox/PBS realm switch + first passkey registration walkthrough).
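Before switching the Proxmox and PBS realms, a quick sanity check of the provider's discovery document can catch a wrong issuer URL early. This is a sketch under assumptions: the helper names are hypothetical, and the issuer `https://auth.rampancy.cloud` is inferred from the proxy host above rather than confirmed. The well-known path and metadata keys come from the OpenID Connect Discovery specification.

```python
from typing import List

# Metadata keys a code-flow relying party such as a PVE realm needs.
REQUIRED_KEYS = ("issuer", "authorization_endpoint", "token_endpoint", "jwks_uri")


def discovery_url(issuer: str) -> str:
    """OIDC Discovery places provider metadata under the issuer's well-known path."""
    return issuer.rstrip("/") + "/.well-known/openid-configuration"


def metadata_problems(doc: dict, issuer: str) -> List[str]:
    """Return problems that would stop a relying party from using this provider."""
    problems = [key for key in REQUIRED_KEYS if key not in doc]
    if doc.get("issuer", "").rstrip("/") != issuer.rstrip("/"):
        problems.append("issuer mismatch")
    return problems
```

An empty list from `metadata_problems` means the document fetched from `discovery_url(...)` is at least structurally usable; it says nothing about client ID or secret configuration.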
Apps explicitly NOT in scope for SSO (deliberate, captured here so future-self doesn't re-litigate):
| App | Reason |
|---|---|
| Beszel hub UI | No OIDC support. CrowdSec + Beszel's own auth is sufficient. |
| Dockhand UI | No OIDC support. CrowdSec + Dockhand's own auth. |
| n8n | OIDC is a paid Enterprise feature; free version stays on its own auth. |
| Overseerr (requests) | Already auth'd via Plex OAuth, so already SSO of a sort. |
| arrstack admin UIs (Sonarr/Radarr/Prowlarr/qBittorrent) | LAN-only, no OIDC, low value. |
Decisions (locked at scope time, 2026-05-03):
- Pocket-ID, not Authentik. Authentik bundles OIDC + forward-auth in one component, would gate the non-OIDC apps too. But it introduces PostgreSQL + Redis (everything else in the homelab is SQLite), runs ~3-4 GB RAM vs Pocket-ID's ~50 MB, and the bundled forward-auth job is already covered by 7D CrowdSec. Reconsider Authentik if the public-facing SSO surface grows beyond ~5 services with weak per-app auth.
- No Tinyauth pairing. The 2026 community pattern is `Pocket-ID + Tinyauth` for double-coverage of OIDC + forward-auth. Tinyauth in front of apps that have their own login produces a double-login UX (gate at proxy + app's own login still fires). With CrowdSec in 7D, the marginal benefit doesn't justify the friction.
- Bitwarden as primary passkey vault, hardware key as backup. Operator already runs Bitwarden paid plan. Bitwarden emergency access closes the break-glass gap flagged during the 5D close-out; Hazel is the natural emergency-access nominee.
- Skip Authelia. Slowing release cadence (Nov 2025 → Mar 2026 release gap, patch-only since), classical password+TOTP UX is dated relative to passkey-first. Pocket-ID is the cohort-aligned 2026 pick for new deployments.
Phase 8 - Network segmentation (deferred)
Goal: Migrate the homelab off a single flat VLAN onto a segmented design — Mgmt for infrastructure admin, IoT for smart-home devices, Lab for the housemate sandbox. Switch PENFOLD-SW01 from SwOS to RouterOS to enable programmatic config management. Reuses spare proxfold NICs to add the Lab VLAN bridge without reconfiguring vmbr0, so existing services don't blip during the additive phases.
Why this is its own phase
Triggered by the housemate-access work (Phase 6 entry above), but deliberately scoped separately. The immediate housemate change ships with Hazel's VMs on vmbr0/VLAN 1 with Proxmox-side controls only; their migration onto the Lab VLAN is captured here as 8G. Doing this work alongside Hazel's onboarding would conflate "first VLAN rollout" with "first RouterOS exposure" — too many simultaneous variables.
VLAN scheme (10-spacing buffer for future classes):
| ID | Name | Subnet (TBD at scope time) | Purpose |
|---|---|---|---|
| 1 | LAN | 192.168.1.0/24 | Existing trusted servers/clients |
| 10 | Mgmt | 192.168.10.0/24 | UDM admin, switch admin, iDRAC, future managed APs |
| 20 | IoT | 192.168.20.0/24 | Smart-home devices |
| 30 | Lab | 192.168.30.0/24 | Housemate sandbox |
| 40 | Guest | 192.168.40.0/24 | Visitor WiFi (reserved, not built) |
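The scheme ties each VLAN ID to the third octet of its subnet, which makes address-to-VLAN lookups mechanical. A minimal sketch of that convention follows; the helper names are hypothetical, and the subnets are the provisional ones from the table, which are still marked TBD.

```python
import ipaddress

VLANS = {1: "LAN", 10: "Mgmt", 20: "IoT", 30: "Lab", 40: "Guest"}


def subnet_for(vlan_id: int) -> ipaddress.IPv4Network:
    """Map a VLAN ID to 192.168.<id>.0/24, the convention in the table above."""
    if not 1 <= vlan_id <= 255:
        raise ValueError("VLAN ID must fit in the third octet")
    return ipaddress.ip_network(f"192.168.{vlan_id}.0/24")


def vlan_of(address: str):
    """Return the (id, name) whose subnet contains the address, or None."""
    ip = ipaddress.ip_address(address)
    for vlan_id, name in VLANS.items():
        if ip in subnet_for(vlan_id):
            return vlan_id, name
    return None
```

For example, `vlan_of("192.168.1.250")` places proxfold's current address on VLAN 1, and any address outside the five subnets returns `None`.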
flowchart LR
A[8A: Console cable + SwOS backup] --> B[8B: Switch flip to RouterOS]
B --> C[8C: RouterOS hardening]
C --> D[8D: UDM VLAN networks + firewall rules]
D --> E[8E: Switch port VLAN config]
E --> F[8F: proxfold vmbr1 + PVE SDN]
F --> G[8G: Migrate Hazel onto Lab VLAN]
F --> H[8H: Migrate mgmt plane onto Mgmt VLAN]
F --> I[8I: Migrate IoT devices onto IoT VLAN]
style A fill:#FAECE7,stroke:#993C1D,color:#712B13
style B fill:#FAECE7,stroke:#993C1D,color:#712B13
style C fill:#FAECE7,stroke:#993C1D,color:#712B13
style D fill:#EEEDFE,stroke:#534AB7,color:#3C3489
style E fill:#EEEDFE,stroke:#534AB7,color:#3C3489
style F fill:#EEEDFE,stroke:#534AB7,color:#3C3489
style G fill:#E1F5E7,stroke:#2E7D32,color:#1B5E20
style H fill:#E1F5E7,stroke:#2E7D32,color:#1B5E20
style I fill:#E1F5E7,stroke:#2E7D32,color:#1B5E20
8A. Console cable + dated SwOS backup
Why: Escape path for the OS flip. Without console access, recovery from a misconfigured switch reboot is "laptop direct-cabled to RouterOS's default 192.168.88.1 mgmt subnet" — works but slower and error-prone. Switch has no autobackup automation today (UDM does; switch does not — gap worth closing regardless).
- Order RJ45-to-USB-serial console adapter
- Manual SwOS config export (download `.swb` from `http://192.168.1.3/backup`, dated copy stored alongside the `~/proxfold-pve9-upgrade/` artifacts)
- (Optional) Codify daily SwOS backup as a small cron on the control LXC to fill the autobackup gap
8B. Switch flip: SwOS → RouterOS
Why: SwOS has no SSH or scriptable API. RouterOS gives full SSH + API + /export config readout. Switching is reversible — both OSes coexist on the device, configs preserved per slot. Marvell switching chip handles L2 in hardware on both OSes; line-rate preserved as long as RouterOS bridge VLAN filtering keeps hw=yes.
- Pre-stage RouterOS minimal config as a `.rsc` script that reproduces today's behavior (24× 1G + 2× 10G in one untagged bridge, mgmt IP `192.168.1.3`)
- Console cable in hand, flip via `/system routerboard settings set boot-os=routeros` then reboot
- Import `.rsc` on first boot
- Verify all wired hosts reachable; verify `hw=yes` on every bridge port (`/interface bridge port print`)
- Escape path: if anything fails, `/system routerboard settings set boot-os=swos` + reboot gives an instant rollback
8C. RouterOS hardening
Why: RouterOS exposes more services than SwOS — SSH, API, web, WinBox, optional FTP/Telnet. MikroTik has a CVE history when these are left exposed. Discipline: management plane on Mgmt VLAN only, unused services off.
- Disable `telnet`, `ftp`, `www` (use `www-ssl` only), `api` if unused
- Restrict `winbox`/`ssh`/`www-ssl` to a management address allowlist (admin IP for now; Mgmt VLAN once 8H lands)
- Pin firmware version, document update cadence in `hosts/penfold-sw01/`
8D. UDM: VLAN networks + inter-VLAN firewall rules
Why: UDM controller defines the L3 + DHCP + inter-VLAN policy. Adding networks is additive; existing VLAN 1 untouched. WiFi clients on VLAN 1 don't blip from network creation alone, but UniFi controller commits do trigger AP provisioning — schedule outside peak housemate hours.
- Create UniFi Networks: Mgmt (10), IoT (20), Lab (30); reserve Guest (40) as a placeholder
- Inter-VLAN firewall: deny by default; explicit allowlist for required flows (e.g. PBS host → Lab on backup ports, Mgmt → all for admin)
- Document policy in `hosts/the-egg/firewall.md`
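The "deny by default, explicit allowlist" policy can be modelled in a few lines, which is useful for reasoning about flows before clicking them into the UniFi controller. This is a sketch only: the `Allow` type and `permitted` function are hypothetical, rules are matched per VLAN rather than per host, and port 8007 (PBS's default) is an assumption about what "backup ports" means here.

```python
from typing import NamedTuple, Optional


class Allow(NamedTuple):
    src: str             # source VLAN name
    dst: Optional[str]   # destination VLAN name, None = any
    port: Optional[int]  # destination port, None = any


# Hypothetical allowlist mirroring the two example flows in the list above.
ALLOWLIST = [
    Allow("LAN", "Lab", 8007),  # PBS host (on LAN) -> Lab on the backup port
    Allow("Mgmt", None, None),  # Mgmt -> all for admin
]


def permitted(src: str, dst: str, port: int, rules=ALLOWLIST) -> bool:
    """Default deny: inter-VLAN traffic passes only if an allow rule matches."""
    if src == dst:
        return True  # same-VLAN traffic never crosses the inter-VLAN firewall
    return any(
        r.src == src and r.dst in (None, dst) and r.port in (None, port)
        for r in rules
    )
```

Listing the flows each service needs and running them through `permitted` gives a first draft of the table for `hosts/the-egg/firewall.md`.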
8E. Switch: VLAN port assignments
Why: L2 enforcement of the network design. Trunk to UDM tagged for VLANs 10/20/30/40; access ports per device classification.
- Trunk to UDM: tag VLANs 10/20/30/40, untagged VLAN 1
- proxfold ports: existing `nic0` port stays untagged VLAN 1; new spare-NIC port (8F) configured for VLAN 30 (or VLAN-aware trunk if multi-VLAN exposure is wanted)
- IoT device ports: untagged access VLAN 20
8F. proxfold: vmbr1 VLAN-aware bridge + PVE SDN
Why: Additive to existing vmbr0. Spare NIC (nic1) bound to a new VLAN-aware bridge means existing services on vmbr0 are untouched throughout. PVE SDN gives a clean VNet abstraction with proper permission scoping.
- Cable a spare proxfold NIC (`nic1`) to a switch port configured for VLAN 30 trunking
- Add `vmbr1` to `/etc/network/interfaces` with `bridge-vlan-aware yes`, `bridge-vids 2-4094`
- Define PVE SDN zone (VLAN-aware) bound to `vmbr1`; create VNet `lab` for VLAN 30
- Update Hazel's group ACLs: replace `/sdn/zones/localnetwork/vmbr0` → `PVESDNUser` with `/sdn/zones/<labzone>/lab` → `PVESDNUser`
- Codify as Ansible role or runbook step (sync-docs precedent)
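If the bridge ends up codified, the `vmbr1` stanza is small enough to render from two variables. The sketch below shows one plausible shape of that output; `bridge_stanza` is a hypothetical helper, and the `inet manual`, `bridge-stp off`, and `bridge-fd 0` lines are assumptions based on Proxmox's usual bridge defaults rather than anything decided in this plan.

```python
def bridge_stanza(name: str, port: str, vids: str = "2-4094") -> str:
    """Render an ifupdown2 stanza for a VLAN-aware Linux bridge with no host IP."""
    lines = [
        f"auto {name}",
        f"iface {name} inet manual",
        f"    bridge-ports {port}",
        "    bridge-stp off",   # assumed default, not from the plan
        "    bridge-fd 0",      # assumed default, not from the plan
        "    bridge-vlan-aware yes",
        f"    bridge-vids {vids}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(bridge_stanza("vmbr1", "nic1"), end="")
```

Leaving the bridge without an address keeps proxfold itself off the Lab VLAN; only guests attached to the VNet get a presence there.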
8G. Migrate Hazel's VMs onto Lab VLAN
Why: Completes the housemate-access architecture. Pre-Phase-8 her VMs are on vmbr0/VLAN 1 with only Proxmox-side controls; post-Phase-8 they're L2-isolated.
- Per VM: shut down, change network bridge to the Lab VNet, boot, verify DHCP from UDM Lab pool, verify firewall behavior
- Update permissions: revoke Hazel's `SDN.Use` on `/sdn/zones/localnetwork/vmbr0`
8H. Migrate management plane onto Mgmt VLAN
Why: Reduces blast radius. iDRAC, switch admin, UDM admin currently sit on VLAN 1 alongside production guests.
- iDRAC: reconfigure dedicated NIC onto VLAN 10
- Switch (RouterOS): mgmt IP onto VLAN 10
- UDM: dedicated mgmt interface on VLAN 10
- Update `network/overview.md` host map
8I. Migrate IoT devices onto IoT VLAN
Why: Default-deny outbound to LAN; cap blast radius from compromised IoT firmware.
- Inventory IoT devices currently on VLAN 1
- Migrate device-by-device (DHCP reservation move + reconnect to IoT WiFi SSID)
Decisions to lock at scope time:
- PVE SDN VLAN zone vs. plain VLAN-aware bridge: SDN is the cleaner answer for permission scoping (per-VNet ACLs); plain `bridge-vlan-aware yes` is simpler if SDN feels like overkill for one VLAN
- Guest VLAN, provisioned now or later: reserved as 40 either way
- Bond instead of single NIC for `vmbr1`: proxfold has 4 NICs, only `nic0` in use; could bond 2-3 spares for the Lab bridge. Probably overkill for housemate experiments
Design decisions
These don't need answers now but will come up during implementation.
| Decision | When it matters | Options |
|---|---|---|
| Ansible for compose deploys vs Dockhand | Phase 1–2 | Use both (Ansible for OS, Dockhand for containers) or consolidate to Ansible's community.docker.docker_compose_v2 module |
| Caddy vs Nginx Proxy Manager | Phase 1B or later | Caddy is more Git-friendly and lighter; NPM has a GUI. Can migrate incrementally |
| n8n deployment host | Phase 5C | LXC with native Node.js install (standalone service, fits the LXC-for-standalone pattern); no Docker/VM needed |
| UDM upgrade to Dream Router 7 | Any time | Independent of everything else. WiFi 7 + better hardware |
Infrastructure reference
Current state (pre-Phase 4 hardware upgrades):
graph TB
subgraph net["Network"]
udm["The-Egg · 192.168.1.1\nUniFi UDM · Gateway · WAP"]
sw["PENFOLD-SW01 · 192.168.1.3\nMikroTik CRS326-24G-2S+"]
udm --> sw
end
subgraph proxfold["proxfold · 192.168.1.250 — Proxmox VE · Dell R430"]
subgraph ct100["CT 100 — plex · 192.168.1.230"]
plex["Plex Media Server\nNvidia T400 GPU"]
end
subgraph vm101["VM 101 — arrstack · 192.168.1.252"]
media["Sonarr · Radarr · Prowlarr\nSeerr · qBittorrent · MediaBot\ngluetun (ProtonVPN)"]
end
subgraph vm102["VM 102 — nginx · 192.168.1.249"]
npm["Nginx Proxy Manager"]
end
subgraph ct104["CT 104 — control · 192.168.1.245"]
ansible["Ansible control node"]
end
pbs["CT — pbs · 192.168.1.246\n(Phase 5A)"]
beszel["CT — beszel · 192.168.1.247\n(Phase 5B)"]
n8n["CT — n8n\n(Phase 5C)"]
end
nas["QNAP TS-269L · 192.168.1.253\nNFS datastore (Phase 5A)"]
sw --> proxfold
sw --> nas
Post-Phase 4 changes: Boot drive becomes ZFS mirror (rpool), RAM increases from 48 GB to 256 GB (8× 32 GB, per the Phase 4B target), second CPU socket populated (28C/56T total).