Proxfold rebuild runbook

Step-by-step guide to rebuilding proxfold from bare metal to a fully running homelab. Covers the Phase 4C boot drive swap and any future DR scenario.

Flow: WSL-primary cold-start → boot auto-install ISO → Ansible from WSL reconstitutes the PVE host → vzdump restores VMs/CTs → CT104 resumes scheduled duties.

The automation kit lives in the rebuild/ directory of the rampantlemming/homelab-ansible repo. Key files:

rebuild/answer.toml.j2 — production auto-install answer template
rebuild/answer-test.toml.j2 — nested-VM rehearsal template
rebuild/render-answer.sh — decrypts vault, writes answer.toml
rebuild/build-iso.sh — wraps proxmox-auto-install-assistant prepare-iso
rebuild/README.md — vault setup + usage

Pre-flight

Before you need this runbook in anger, confirm:

WSL control node is set up. See WSL control node bootstrap.
Vault contains vault_proxfold_root_password_hash and vault_proxfold_root_ssh_keys. See rebuild/README.md in the homelab-ansible repo.
Baked ISO exists and is recent. Rebuild it whenever answer.toml.j2 changes — you don't want to do this under pressure.
vzdump backups are recent for CT100, CT104, VM101, VM102. On proxfold: ls -lt /mnt/pve/nasbackup/dump/ | head.
Proxmox source ISO downloaded from proxmox.com/downloads to ~/iso/ on WSL.
Pre-upgrade artifact bundle (~/proxfold-pve9-upgrade/ on WSL) still available as a last-resort reference for pre-Ansible state.
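
The backup-freshness item can be checked mechanically. A minimal sketch, assuming the vzdump file-name pattern used in the restore section and a 7-day threshold; the function name and the threshold are illustrative, not part of the repo:

```shell
# Hypothetical helper: print the guest IDs that have no vzdump archive
# newer than max_days in the given dump directory.
stale_backups() {
  local dir=$1 max_days=$2; shift 2
  local id
  for id in "$@"; do
    # -mtime -N matches files modified less than N days ago;
    # only *.zst archives count, not the .log files vzdump writes alongside.
    if [ -z "$(find "$dir" -maxdepth 1 -name "vzdump-*-${id}-*.zst" \
                 -mtime "-${max_days}" -print -quit)" ]; then
      echo "$id"
    fi
  done
}

# Usage on proxfold (prints nothing when every guest has a recent archive):
#   stale_backups /mnt/pve/nasbackup/dump 7 100 101 102 104
```

An empty result means all four guests are covered; any ID printed needs a fresh vzdump before you rely on this runbook.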

Build the install media

Run on whichever host has proxmox-auto-install-assistant installed (proxfold, or WSL with the Proxmox no-sub repo added).

# On WSL, prepare answer.toml (decrypts vault)
cd ~/homelab-ansible
rebuild/render-answer.sh

# Option A — build on proxfold (preferred while proxfold is still up)
scp rebuild/answer.toml ~/iso/proxmox-ve_9.x-1.iso root@192.168.1.250:/root/
ssh root@192.168.1.250 'apt install -y proxmox-auto-install-assistant'
ssh root@192.168.1.250 'cd /root && proxmox-auto-install-assistant prepare-iso \
  proxmox-ve_9.x-1.iso --fetch-from iso --answer-file answer.toml \
  --output proxmox-ve-auto-prod.iso'
scp root@192.168.1.250:/root/proxmox-ve-auto-prod.iso ~/iso/

# Option B — build on WSL (if proxfold is already down)
# Add Proxmox no-sub apt repo to WSL first (one-time):
#   curl -fsSL https://enterprise.proxmox.com/debian/proxmox-release-trixie.gpg \
#     | sudo tee /etc/apt/keyrings/proxmox.gpg >/dev/null
#   echo 'deb [signed-by=/etc/apt/keyrings/proxmox.gpg] http://download.proxmox.com/debian/pve trixie pve-no-subscription' \
#     | sudo tee /etc/apt/sources.list.d/pve.list
#   sudo apt update && sudo apt install -y proxmox-auto-install-assistant
rebuild/build-iso.sh ~/iso/proxmox-ve_9.x-1.iso

The built ISO lives in rebuild/ (gitignored).

Write the USB

# Identify the USB device carefully — the wrong /dev/sdX is destructive
lsblk
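# Assume /dev/sdX is your USB stick. Point if= at wherever your build landed:
# rebuild/ for build-iso.sh (Option B), ~/iso/proxmox-ve-auto-prod.iso for Option A.
sudo dd if=rebuild/proxmox-ve_9.x-1-auto-prod.iso of=/dev/sdX \
  bs=4M status=progress conv=fdatasync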
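
A bad write is cheaper to catch now than at the console. A minimal sketch of a read-back check, assuming GNU cmp and stat; the function name is illustrative. The stick is normally larger than the ISO, so the comparison is limited to the image length:

```shell
# Hypothetical helper: compare the first <image size> bytes of the target
# against the image. Returns 0 on match, 1 on mismatch, 2 if the image
# cannot be read.
verify_written_image() {
  local image=$1 target=$2 size
  size=$(stat -c %s "$image") || return 2
  cmp -n "$size" "$image" "$target"
}

# Usage from a root shell (reading the raw device needs root):
#   verify_written_image rebuild/proxmox-ve_9.x-1-auto-prod.iso /dev/sdX \
#     && echo "USB matches image"
```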

Physical boot

Shut down proxfold (if still up): ssh root@192.168.1.250 shutdown -h now.
If replacing the boot drive: swap drive now. Leave the ZFS pool drives untouched. See boot-drive-swap.md for Phase 4C.
Insert USB. Power on. iDRAC console helps here — see your iDRAC notes.
BIOS → boot from USB (F11 boot menu on Dell R430).
Auto-install runs unattended (~5 min). The machine reboots when done.
Remove the USB on first reboot or the installer will loop.

First-boot sanity check

From WSL:

ssh root@192.168.1.250 pveversion
# Expect: pve-manager/9.x.x

If SSH fails, check the machine directly via iDRAC — the answer.toml sets reboot-on-error = false, so failed auto-installs stay at the installer shell for inspection.
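
Rather than retrying by hand while the install finishes, the check can be wrapped in a bounded retry loop. A minimal sketch; the function name, attempt count, and delay are assumptions:

```shell
# Hypothetical helper: run a command until it succeeds, giving up after a
# fixed number of attempts with a fixed delay (seconds) between them.
retry() {
  local attempts=$1 delay=$2; shift 2
  local n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "gave up after $n attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Usage from WSL: 30 attempts, 10 s apart, covers roughly five minutes.
#   retry 30 10 ssh -o ConnectTimeout=5 -o BatchMode=yes \
#     root@192.168.1.250 pveversion
# If ssh refuses because the reinstall generated new host keys, drop the old
# entry first: ssh-keygen -R 192.168.1.250
```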

Reconstitute the host with Ansible

From WSL:

cd ~/homelab-ansible
git pull --ff-only   # just in case

# First run — expect many changes (fresh PVE baseline)
ansible-playbook playbooks/proxmox-host.yml \
  --limit proxfold \
  --vault-password-file ~/.vault_pass \
  --diff

This runs the common, security, proxmox, zfs, nfs, nvidia, and nut roles. Key outcomes:

APT repos switched to deb822 no-sub
Kernel pinned to 6.14.11-6-pve
nouveau blacklisted; Nvidia 550 driver + NVENC patch installed
ZFS pool stash imported
nasbackup CIFS storage registered
NUT configured for the CyberPower UPS

If Nvidia install fails because pve-headers-{{ ansible_kernel }} isn't available for the fresh kernel, reboot once to ensure the pinned kernel is active, then re-run.

If the nvidia tasks fail because nouveau is still loaded, the blacklist has been written but has not taken effect: it only applies after a reboot, and the Nvidia driver install cannot succeed while nouveau is loaded. Reboot once the first proxmox-host.yml run completes, then re-run.
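
To tell the two failure modes apart before rebooting, check whether nouveau is actually loaded. A minimal sketch that reads a /proc/modules-style listing, where the module name is the first field; the function name is illustrative:

```shell
# Hypothetical helper: succeed if the named module appears in the
# /proc/modules-style listing supplied on stdin.
module_loaded() {
  awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Usage on proxfold after the first playbook run:
#   if module_loaded nouveau < /proc/modules; then
#     echo "nouveau still loaded: reboot before re-running proxmox-host.yml"
#   fi
```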

Kernel pin order: pin first, then refresh

proxmox-boot-tool kernel pin <version> writes /etc/kernel/proxmox-boot-pin, and a subsequent proxmox-boot-tool refresh propagates the pin to both ESPs. Running refresh before pin is a common mistake — the pin file doesn't exist yet, so the refresh has nothing to propagate. If the wrong kernel boots after a pin attempt, check cat /etc/kernel/proxmox-boot-pin first, and re-run refresh once the file is correct.
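
The pin-file check can be scripted. A minimal sketch, assuming the pin file holds only the kernel version string (for example 6.14.11-6-pve); the function name is illustrative:

```shell
# Hypothetical helper: compare the pinned version with a kernel release.
# Returns 0 on match, 1 on mismatch, 2 if the pin file is missing.
pin_matches() {
  local pin_file=$1 running=$2 pinned
  if [ ! -r "$pin_file" ]; then
    echo "no pin file at $pin_file: run 'kernel pin' first" >&2
    return 2
  fi
  pinned=$(tr -d '[:space:]' < "$pin_file")
  if [ "$pinned" = "$running" ]; then
    echo "pinned and running: $pinned"
  else
    echo "pinned $pinned but running $running" >&2
    return 1
  fi
}

# Usage on proxfold:
#   pin_matches /etc/kernel/proxmox-boot-pin "$(uname -r)"
```

A return code of 2 points at the ordering mistake described above. A return code of 1 usually means the pin is in place but the host has not rebooted into it yet.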

Restore VMs and CTs from vzdump

Clear stale ISO references after restore, before first start

vzdump backups capture whatever was attached to a VM at backup time, including ide2: local:iso/<something>.iso from an old Debian install. On a fresh rebuild that volume doesn't exist, and the VM will fail to start with volume 'local:iso/...' does not exist. Fix before first qm start:

qm set 101 --ide2 none,media=cdrom
qm set 102 --ide2 none,media=cdrom

Only affects VMs, not LXCs. Observed in Phase 4C with VM101 and VM102 both carrying a stale debian-12.4.0-amd64-netinst.iso reference.
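
If the slot is not always ide2, the affected drives can be found from the config instead of hard-coded. A minimal sketch that parses `qm config` output; the function name is illustrative. It flags every attached ISO, whether or not the volume still exists:

```shell
# Hypothetical helper: read `qm config <vmid>` output on stdin and print the
# drive slots that hold an ISO marked media=cdrom (empty drives are skipped).
cdrom_iso_slots() {
  awk -F': ' '/^(ide|sata|scsi)[0-9]+: / && /media=cdrom/ && $2 !~ /^none/ { print $1 }'
}

# Usage on proxfold, after qmrestore and before the first qm start:
#   for vmid in 101 102; do
#     for slot in $(qm config "$vmid" | cdrom_iso_slots); do
#       qm set "$vmid" "--$slot" none,media=cdrom
#     done
#   done
```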

Once nasbackup is mounted and visible in pvesm status, restore in this order so CT104 (Ansible control) is available soonest:

# On proxfold
BACKUP_DIR=/mnt/pve/nasbackup/dump

# Storage target: `local-zfs` after Phase 4C (ZFS mirror rpool). Pre-4C
# installs used `local-lvm` on the LVM-thin boot drive — substitute that if
# you're restoring onto an LVM install.

# CT104 (Ansible control) — restore first
LATEST_104=$(ls -t "$BACKUP_DIR"/vzdump-lxc-104-*.tar.zst | head -1)
pct restore 104 "$LATEST_104" --storage local-zfs --unique 0

# CT100 (plex)
LATEST_100=$(ls -t "$BACKUP_DIR"/vzdump-lxc-100-*.tar.zst | head -1)
pct restore 100 "$LATEST_100" --storage local-zfs --unique 0

# VM101 (arrstack)
LATEST_101=$(ls -t "$BACKUP_DIR"/vzdump-qemu-101-*.vma.zst | head -1)
qmrestore "$LATEST_101" 101 --storage local-zfs --unique 0

# VM102 (nginx)
LATEST_102=$(ls -t "$BACKUP_DIR"/vzdump-qemu-102-*.vma.zst | head -1)
qmrestore "$LATEST_102" 102 --storage local-zfs --unique 0

# Start all
# pct start takes one CT ID per call
pct start 104
pct start 100
qm start 101
qm start 102

--unique 0 preserves MAC addresses so DHCP leases and firewall rules still match.
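
The `ls -t | head -1` pattern yields an empty string when no archive matches, and the restore command then fails with a less obvious error. A minimal sketch of a guard that fails loudly instead; the function name is illustrative:

```shell
# Hypothetical helper: print the newest archive for a guest, or report that
# none exists. kind is "lxc" or "qemu", matching the vzdump file names.
latest_backup() {
  local dir=$1 kind=$2 vmid=$3 newest
  newest=$(ls -t "$dir"/vzdump-"$kind"-"$vmid"-*.zst 2>/dev/null | head -n 1)
  if [ -z "$newest" ]; then
    echo "no $kind backup for $vmid in $dir" >&2
    return 1
  fi
  printf '%s\n' "$newest"
}

# Usage on proxfold:
#   archive=$(latest_backup /mnt/pve/nasbackup/dump lxc 104) &&
#     pct restore 104 "$archive" --storage local-zfs --unique 0
```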

Reconfigure each guest with Ansible

Run guest playbooks from CT104, not WSL

Guest authorized_keys trust the scheduled-control node key (ansible@homelab on CT104), not the WSL key. Running guest playbooks directly from WSL fails with Permission denied (publickey) for CT100/VM101/VM102. CT104 is restored first so it can run these. The WSL key only needs to land on proxfold for cold-start — push the WSL key to guests only if CT104 itself is unavailable for some reason.

From CT104:

ssh root@192.168.1.245
cd ~/homelab-ansible && git pull --ff-only
ansible-playbook playbooks/plex.yml     --vault-password-file ~/.vault_pass --diff
ansible-playbook playbooks/arrstack.yml --vault-password-file ~/.vault_pass --diff
ansible-playbook playbooks/nginx.yml    --vault-password-file ~/.vault_pass --diff

Nvidia LXC passthrough needs a second host-reconcile pass

The first proxmox-host.yml run (under "Reconstitute the host") runs the nvidia LXC passthrough tasks, but /etc/pve/lxc/100.conf doesn't exist yet at that point — the tasks skip cleanly. After CT100 is restored, apply the deferred passthrough with:

# From the control node
ansible-playbook playbooks/proxmox-host.yml --limit proxfold \
  --vault-password-file ~/.vault_pass --tags lxc
# On proxfold
pct stop 100 && pct start 100

Verify inside the CT with pct exec 100 -- nvidia-smi.

Gotcha — Plex data directory (fresh install). If CT100's rootfs was restored from vzdump, the symlink target is already in place. If CT100 is being rebuilt from scratch, the Plex package's postinst creates /var/lib/plexmediaserver/Library/Application Support/Plex Media Server as a real directory, and the Plex role fails hard rather than overwriting it. Remove the empty dir with rm -rf "/var/lib/plexmediaserver/Library/Application Support/Plex Media Server" and re-run.

Gotcha — arrstack docker containers on restart. If the docker role fires its restart docker handler and gluetun is recreated, qbittorrent will fail with "No such container" (pinned to the old gluetun namespace via network_mode: service:gluetun). Redeploy the arrstack stack via dockhand to resolve.
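
The stale-namespace condition can be detected before qbittorrent is restarted. A minimal sketch, assuming Docker records the dependent container's NetworkMode as container:<ID> and that the containers are named gluetun and qbittorrent; the function name is illustrative:

```shell
# Hypothetical helper: given a dependent container's NetworkMode and the
# provider's current container ID, succeed if the dependent is pinned to a
# different (stale) container.
netns_is_stale() {
  local mode=$1 provider_id=$2 ref
  case $mode in
    container:*) ref=${mode#container:} ;;
    *) return 1 ;;          # not sharing another container's namespace
  esac
  case $provider_id in
    "$ref"*) return 1 ;;    # still points at the live container
    *) return 0 ;;
  esac
}

# Usage on the arrstack VM:
#   mode=$(docker inspect -f '{{.HostConfig.NetworkMode}}' qbittorrent)
#   gid=$(docker inspect -f '{{.Id}}' gluetun)
#   netns_is_stale "$mode" "$gid" && echo "qbittorrent needs recreating"
```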

Hand off to CT104

After CT104 is restored and Ansible bootstrapped, CT104 resumes its scheduled-automation role. Confirm:

ssh root@192.168.1.250 'pct exec 104 -- bash -c "cd /root/homelab-ansible && git pull --ff-only && ansible all -m ping --vault-password-file ~/.vault_pass"'

Timezone — set this on CT104 after restore

CT104's vzdump may have been taken while the container was on Etc/UTC. The drift-detection timer runs OnCalendar=*-*-* 04:00:00, which is interpreted in the container's local timezone, so leaving it on UTC means drift sweeps fire at 04:00 UTC, which is 13:30 ACST (14:30 ACDT during daylight saving), instead of 04:00 local time. After restore, set the timezone once:

ssh root@192.168.1.245 'timedatectl set-timezone Australia/Adelaide && \
  systemctl list-timers drift-detection.timer --no-pager'

The timer reschedule is automatic; no service restart required.
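
The offset arithmetic behind the UTC-versus-Adelaide fire times, as a small worked example. The function name is illustrative; offsets are in minutes east of UTC:

```shell
# Hypothetical helper: shift an HH:MM wall-clock time by an offset in minutes,
# wrapping around midnight.
shift_hhmm() {
  local h=${1%%:*} m=${1##*:} offset=$2 total
  # 10# forces base 10 so "04" or "08" is not parsed as octal
  total=$(( (10#$h * 60 + 10#$m + offset + 1440) % 1440 ))
  printf '%02d:%02d\n' $((total / 60)) $((total % 60))
}

shift_hhmm 04:00 570   # UTC+9:30  (ACST) -> 13:30
shift_hhmm 04:00 630   # UTC+10:30 (ACDT) -> 14:30
```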

Verification checklist

ssh root@192.168.1.250 pveversion returns expected version
zpool status stash shows ONLINE
pvesm status shows nasbackup active
ansible all -m ping succeeds from CT104 (guests trust the CT104 key, not the WSL key)
Proxmox UI reachable at https://192.168.1.250:8006
Plex UI reachable at http://192.168.1.230:32400/web
NPM UI reachable (nginx VM)
docker ps on arrstack shows all media-stack containers up
UPS status: upsc cyberpower returns battery data
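
The checklist can be run as a script so that one failure does not hide the rest. A minimal sketch of a check runner; the function and variable names are illustrative, and the commands in the usage comment come from the checklist:

```shell
# Hypothetical check runner: run each named check, print PASS or FAIL, and
# count failures so the caller can exit non-zero at the end.
failures=0
check() {
  local name=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $name"
  else
    echo "FAIL  $name"
    failures=$((failures + 1))
  fi
}

# Usage from WSL:
#   check "pveversion" ssh root@192.168.1.250 pveversion
#   check "stash pool" ssh root@192.168.1.250 'zpool status stash | grep -q ONLINE'
#   check "PVE UI"     curl -ksf -o /dev/null https://192.168.1.250:8006
#   [ "$failures" -eq 0 ]
```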

Rehearsal (validate before you need it)

Do this after any change to answer.toml.j2 or before any real rebuild.

Build the test ISO

cd ~/homelab-ansible
rebuild/render-answer.sh
# Build on proxfold (easier — tool already there)
scp rebuild/answer-test.toml ~/iso/proxmox-ve_9.x-1.iso root@192.168.1.250:/var/lib/vz/template/iso/
ssh root@192.168.1.250 'cd /var/lib/vz/template/iso && proxmox-auto-install-assistant prepare-iso \
  proxmox-ve_9.x-1.iso --fetch-from iso --answer-file answer-test.toml \
  --output proxmox-ve-auto-test.iso'

Create a nested PVE VM

ssh root@192.168.1.250 <<'EOF'
qm create 999 \
  --name proxfold-rehearsal \
  --memory 2048 --cores 2 --cpu host \
  --net0 virtio,bridge=vmbr0 \
  --scsihw virtio-scsi-single \
  --scsi0 local-zfs:8,ssd=1 \
  --ide2 local:iso/proxmox-ve-auto-test.iso,media=cdrom \
  --boot 'order=scsi0;ide2' \
  --ostype l26
qm start 999
EOF

Watch auto-install

From Proxmox UI: Datacenter → proxfold → 999 → Console. The installer runs unattended for ~5 min.

Known limitation: nested install fails at ~99%

The PVE 9 installer reliably fails in a nested VM at the final "make system bootable → update initramfs" step with bootloader setup errors: unable to install initramfs. This is a nested-KVM + virtio-scsi quirk, not a problem with the answer template. By the time it fails, the rehearsal has already validated the parts the template controls:

vault rendering (root hash + SSH keys baked into ISO)
hostname from answer-test.toml applied
disk filter matched the QEMU disk only (non-destructive guard works)
reboot-on-error = false keeps the VM at the error prompt for inspection

The post-install Ansible bootstrap (steps below) therefore can't be exercised in the nested VM. It's only meaningful against bare metal during a real rebuild. Treat the rehearsal as complete once the installer reaches the initramfs error with hostname/disk evidence intact, then tear down.

If you really want to validate the post-install path, options are:

Retry with --machine q35 --bios ovmf + an EFI disk (sometimes dodges it)
Test on a spare physical machine with the prod answer
Accept that the next real rebuild will be the first full end-to-end run

Find the VM's DHCP-assigned IP

# -i: qm config prints the MAC in uppercase, ip neigh prints it in lowercase
ssh root@192.168.1.250 'ip neigh | grep -i $(qm config 999 | grep -oP "virtio=\K[^,]+")'
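
The MAC lookup is easier to reuse as a function. A minimal sketch that extracts the net0 MAC from `qm config` output and lowercases it, the form `ip neigh` prints; the function name is illustrative and a virtio NIC is assumed, as in the qm create command above:

```shell
# Hypothetical helper: read `qm config <vmid>` output on stdin and print the
# net0 MAC address in lowercase.
net0_mac() {
  sed -n 's/^net0: .*virtio=\([0-9A-Fa-f:]*\).*/\1/p' | tr '[:upper:]' '[:lower:]'
}

# Usage on proxfold:
#   ip neigh | grep "$(qm config 999 | net0_mac)"
```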

Verify Ansible reachability

Use an inline single-host inventory (the trailing comma makes Ansible treat the string as a host list rather than a file path) and ping:

ansible -i "proxfold-rehearsal," -u root -m ping \
  -e ansible_host=<test-ip> proxfold-rehearsal

Run the common role against it as a bootstrap sanity check:

ansible-playbook -i "proxfold-rehearsal," \
  -e ansible_host=<test-ip> \
  --vault-password-file ~/.vault_pass \
  -l proxfold-rehearsal \
  -t common \
  playbooks/site.yml --diff

(Full proxmox-host.yml run would try to import the stash pool, which doesn't exist in the nested VM. common alone validates the SSH + Ansible bootstrap path — which is the piece the rehearsal is specifically validating.)

Tear down

ssh root@192.168.1.250 'qm stop 999 && qm destroy 999 --purge'
ssh root@192.168.1.250 'rm /var/lib/vz/template/iso/proxmox-ve-auto-test.iso'