Proxfold rebuild runbook¶
Step-by-step for rebuilding proxfold from bare metal to fully-running homelab. Covers the Phase 4C boot drive swap and any future DR scenario.
Flow: WSL-primary cold-start → boot auto-install ISO → Ansible from WSL reconstitutes the PVE host → vzdump restores VMs/CTs → CT104 resumes scheduled duties.
The automation kit lives in the rebuild/ directory of the
rampantlemming/homelab-ansible repo. Key files:
- rebuild/answer.toml.j2: production auto-install answer template
- rebuild/answer-test.toml.j2: nested-VM rehearsal template
- rebuild/render-answer.sh: decrypts vault, writes answer.toml
- rebuild/build-iso.sh: wraps proxmox-auto-install-assistant prepare-iso
- rebuild/README.md: vault setup + usage
Pre-flight¶
Before you need this runbook in anger, confirm:
- WSL control node is set up. See WSL control node bootstrap.
- Vault contains vault_proxfold_root_password_hash and vault_proxfold_root_ssh_keys. See rebuild/README.md in the homelab-ansible repo.
- Baked ISO exists and is recent. Rebuild it whenever answer.toml.j2 changes; you don't want to do this under pressure.
- vzdump backups are recent for CT100, CT104, VM101, VM102. On proxfold: ls -lt /mnt/pve/nasbackup/dump/ | head.
- Proxmox source ISO downloaded from proxmox.com/downloads to ~/iso/ on WSL.
- Pre-upgrade artifact bundle (~/proxfold-pve9-upgrade/ on WSL) still available as a last-resort reference for pre-Ansible state.
Build the install media¶
Run on whichever host has proxmox-auto-install-assistant installed
(proxfold, or WSL with the Proxmox no-sub repo added).
# On WSL, prepare answer.toml (decrypts vault)
cd ~/homelab-ansible
rebuild/render-answer.sh
# Option A — build on proxfold (preferred while proxfold is still up)
scp rebuild/answer.toml ~/iso/proxmox-ve_9.x-1.iso root@192.168.1.250:/root/
ssh root@192.168.1.250 'apt install -y proxmox-auto-install-assistant'
ssh root@192.168.1.250 'cd /root && proxmox-auto-install-assistant prepare-iso \
proxmox-ve_9.x-1.iso --fetch-from iso --answer-file answer.toml \
--output proxmox-ve-auto-prod.iso'
scp root@192.168.1.250:/root/proxmox-ve-auto-prod.iso ~/iso/
# Option B — build on WSL (if proxfold is already down)
# Add Proxmox no-sub apt repo to WSL first (one-time):
# curl -fsSL https://enterprise.proxmox.com/debian/proxmox-release-trixie.gpg \
# | sudo tee /etc/apt/keyrings/proxmox.gpg >/dev/null
# echo 'deb [signed-by=/etc/apt/keyrings/proxmox.gpg] http://download.proxmox.com/debian/pve trixie pve-no-subscription' \
# | sudo tee /etc/apt/sources.list.d/pve.list
# sudo apt update && sudo apt install -y proxmox-auto-install-assistant
rebuild/build-iso.sh ~/iso/proxmox-ve_9.x-1.iso
The built ISO lives in rebuild/ (gitignored).
Write the USB¶
# Identify the USB device carefully — the wrong /dev/sdX is destructive
lsblk
# Assume /dev/sdX is your USB stick
sudo dd if=rebuild/proxmox-ve_9.x-1-auto-prod.iso of=/dev/sdX \
bs=4M status=progress conv=fdatasync
Physical boot¶
- Shut down proxfold (if still up): ssh root@192.168.1.250 shutdown -h now.
- If replacing the boot drive: swap drive now. Leave the ZFS pool drives untouched. See boot-drive-swap.md for Phase 4C.
- Insert USB. Power on. iDRAC console helps here — see your iDRAC notes.
- BIOS → boot from USB (F11 boot menu on Dell R430).
- Auto-install runs unattended (~5 min). The machine reboots when done.
- Remove the USB on first reboot or the installer will loop.
First-boot sanity check¶
From WSL:
If SSH fails, check the machine directly via iDRAC — the answer.toml sets
reboot-on-error = false, so failed auto-installs stay at the installer
shell for inspection.
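A first-boot check from WSL might look like the sketch below. The `retry` helper and its attempt/delay values are hypothetical and not part of the rebuild kit; `pveversion` is the same check used in the verification checklist. A fresh install generates new SSH host keys, so the old known_hosts entry for 192.168.1.250 has to be removed first.

```shell
# Hypothetical helper: run a command until it succeeds, up to a fixed
# number of attempts, sleeping between tries.
retry() {
    attempts="$1"
    delay="$2"
    shift 2
    n=1
    while ! "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}

# Usage from WSL (attempt count and delay are illustrative):
#   ssh-keygen -R 192.168.1.250
#   retry 30 10 ssh -o ConnectTimeout=5 root@192.168.1.250 pveversion
```

If `retry` gives up, fall back to the iDRAC console as described above.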
Reconstitute the host with Ansible¶
From WSL:
cd ~/homelab-ansible
git pull --ff-only # just in case
# First run — expect many changes (fresh PVE baseline)
ansible-playbook playbooks/proxmox-host.yml \
--limit proxfold \
--vault-password-file ~/.vault_pass \
--diff
This runs the common, security, proxmox, zfs, nfs, nvidia, and
nut roles. Key outcomes:
- APT repos switched to deb822 no-sub
- Kernel pinned to 6.14.11-6-pve
- nouveau blacklisted; Nvidia 550 driver + NVENC patch installed
- ZFS pool stash imported
- nasbackup CIFS storage registered
- NUT configured for the CyberPower UPS
If Nvidia install fails because pve-headers-{{ ansible_kernel }} isn't
available for the fresh kernel, reboot once to ensure the pinned kernel is
active, then re-run.
If nvidia tasks fail because nouveau is still loaded, the blacklist
was written but the running kernel hasn't reloaded. Reboot after the first
proxmox-host.yml run completes — the nouveau blacklist only takes effect
after a reboot, and the Nvidia driver install won't succeed until it's gone.
Kernel pin order: pin first, then refresh
proxmox-boot-tool kernel pin <version> writes
/etc/kernel/proxmox-boot-pin, and a subsequent
proxmox-boot-tool refresh propagates the pin to both ESPs. Running
refresh before pin is a common mistake — the pin file doesn't
exist yet, so the refresh has nothing to propagate. If the wrong kernel
boots after a pin attempt, check cat /etc/kernel/proxmox-boot-pin
first, and re-run refresh once the file is correct.
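The ordering above can be sketched as a small wrapper. `pin_kernel` and the `PBT` override are hypothetical names introduced here for illustration; the override exists only so the sequence can be dry-run with `echo` on a machine without proxmox-boot-tool.

```shell
# Hypothetical wrapper: pin a kernel, then propagate the pin to the ESPs.
# PBT can be overridden (for example PBT=echo) to dry-run the sequence.
PBT="${PBT:-proxmox-boot-tool}"

pin_kernel() {
    version="$1"
    # 1. Pin first: this is the step that writes /etc/kernel/proxmox-boot-pin.
    "$PBT" kernel pin "$version" || return 1
    # 2. Refresh second: this propagates the pin to the ESPs.
    "$PBT" refresh
}

# Usage on proxfold:
#   pin_kernel 6.14.11-6-pve
#   cat /etc/kernel/proxmox-boot-pin
```

Running `PBT=echo pin_kernel 6.14.11-6-pve` prints the two commands in the order they would run, which is a cheap way to confirm the sequence before touching the boot configuration.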
Restore VMs and CTs from vzdump¶
Clear stale ISO references after restore, before first start
vzdump backups capture whatever was attached to a VM at backup time,
including ide2: local:iso/<something>.iso from an old Debian install.
On a fresh rebuild that volume doesn't exist, and the VM will fail to
start with volume 'local:iso/...' does not exist. Fix before first
qm start:
debian-12.4.0-amd64-netinst.iso reference.
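One way to find and clear such entries is sketched below. `iso_drive_keys` is a hypothetical helper, and VMID 101 with `ide2` is only an example; `qm config` and `qm set --delete` are standard qm subcommands.

```shell
# Hypothetical helper: read `qm config <vmid>` output on stdin and print
# the drive keys (ide2, sata0, ...) whose volume lives under <storage>:iso/.
iso_drive_keys() {
    grep -E '^(ide|sata|scsi)[0-9]+: [^,]*:iso/' | cut -d: -f1
}

# Usage on proxfold, after qmrestore and before the first qm start:
#   qm config 101 | iso_drive_keys     # lists drives that still point at an ISO
#   qm set 101 --delete ide2           # drop the stale CD-ROM entry
```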
Once nasbackup is mounted and visible in pvesm status, restore in this
order so CT104 (Ansible control) is available soonest:
# On proxfold
BACKUP_DIR=/mnt/pve/nasbackup/dump
# Storage target: `local-zfs` after Phase 4C (ZFS mirror rpool). Pre-4C
# installs used `local-lvm` on the LVM-thin boot drive — substitute that if
# you're restoring onto an LVM install.
# CT104 (Ansible control) — restore first
LATEST_104=$(ls -t "$BACKUP_DIR"/vzdump-lxc-104-*.tar.zst | head -1)
pct restore 104 "$LATEST_104" --storage local-zfs --unique 0
# CT100 (plex)
LATEST_100=$(ls -t "$BACKUP_DIR"/vzdump-lxc-100-*.tar.zst | head -1)
pct restore 100 "$LATEST_100" --storage local-zfs --unique 0
# VM101 (arrstack)
LATEST_101=$(ls -t "$BACKUP_DIR"/vzdump-qemu-101-*.vma.zst | head -1)
qmrestore "$LATEST_101" 101 --storage local-zfs --unique 0
# VM102 (nginx)
LATEST_102=$(ls -t "$BACKUP_DIR"/vzdump-qemu-102-*.vma.zst | head -1)
qmrestore "$LATEST_102" 102 --storage local-zfs --unique 0
# Start all
pct start 104
pct start 100
qm start 101
qm start 102
--unique 0 preserves MAC addresses so DHCP leases and firewall rules
still match.
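The `ls -t ... | head -1` pattern above returns an empty string when nothing matches, so the restore command would be called with an empty archive argument. A minimal sketch of a guard follows; `latest_backup` is a hypothetical helper, not part of the rebuild kit.

```shell
# Hypothetical helper: print the newest file (by mtime) in a dump directory
# whose name starts with the given prefix, or fail if there is none.
latest_backup() {
    # $1 = dump directory, $2 = filename prefix, e.g. vzdump-lxc-104-
    latest="$(ls -t "$1"/"$2"* 2>/dev/null | head -n 1)"
    if [ -z "$latest" ]; then
        echo "no backup matching $2* in $1" >&2
        return 1
    fi
    printf '%s\n' "$latest"
}

# Usage on proxfold:
#   LATEST_104=$(latest_backup "$BACKUP_DIR" vzdump-lxc-104-) &&
#       pct restore 104 "$LATEST_104" --storage local-zfs --unique 0
```

Chaining with `&&` means a missing backup stops that restore instead of passing an empty path to pct or qmrestore.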
Reconfigure each guest with Ansible¶
Run guest playbooks from CT104, not WSL
Guest authorized_keys trust the scheduled-control node key
(ansible@homelab on CT104), not the WSL key. Running guest playbooks
directly from WSL fails with Permission denied (publickey) for
CT100/VM101/VM102. CT104 is restored first so it can run these. The
WSL key only needs to land on proxfold for cold-start — push the WSL
key to guests only if CT104 itself is unavailable for some reason.
From CT104:
ssh root@192.168.1.245
cd ~/homelab-ansible && git pull --ff-only
ansible-playbook playbooks/plex.yml --vault-password-file ~/.vault_pass --diff
ansible-playbook playbooks/arrstack.yml --vault-password-file ~/.vault_pass --diff
ansible-playbook playbooks/nginx.yml --vault-password-file ~/.vault_pass --diff
Nvidia LXC passthrough needs a second host-reconcile pass
The first proxmox-host.yml run (under "Reconstitute the host")
runs the nvidia LXC passthrough tasks, but /etc/pve/lxc/100.conf
doesn't exist yet at that point — the tasks skip cleanly. After
CT100 is restored, apply the deferred passthrough with:
ansible-playbook playbooks/proxmox-host.yml --limit proxfold \
--vault-password-file ~/.vault_pass --tags lxc
Then restart CT100 on proxfold so the passthrough config is applied:
pct stop 100 && pct start 100
Verify with pct exec 100 -- nvidia-smi.
Gotcha — Plex data directory (fresh install). If CT100's rootfs was
restored from vzdump, the symlink target is already in place. If CT100 is
being rebuilt from scratch, the Plex package's postinst creates
/var/lib/plexmediaserver/Library/Application Support/Plex Media Server as
a real directory, and the Plex role fails hard rather than overwriting it.
Remove the empty dir with
rm -rf "/var/lib/plexmediaserver/Library/Application Support/Plex Media Server"
and re-run.
Gotcha — arrstack docker containers on restart. If the docker role
fires its restart docker handler and gluetun is recreated, qbittorrent
will fail with "No such container" (pinned to the old gluetun namespace
via network_mode: service:gluetun). Redeploy the arrstack stack via
dockhand to resolve.
Hand off to CT104¶
After CT104 is restored and Ansible bootstrapped, CT104 resumes its scheduled-automation role. Confirm:
ssh root@192.168.1.250 'pct exec 104 -- bash -c "cd /root/homelab-ansible && git pull --ff-only && ansible all -m ping --vault-password-file ~/.vault_pass"'
Timezone — set this on CT104 after restore
CT104's vzdump may have been taken while the container was Etc/UTC.
The drift-detection timer runs OnCalendar=*-*-* 04:00:00, which is
interpreted against the container's local timezone — leaving it on UTC
means drift sweeps fire at 13:30 ACST instead of 04:00 ACST. After
restore, set the timezone once:
ssh root@192.168.1.245 'timedatectl set-timezone Australia/Adelaide && \
systemctl list-timers drift-detection.timer --no-pager'
The timer reschedule is automatic; no service restart required.
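The 13:30 figure is the 04:00 UTC trigger shifted by the ACST offset of UTC+9:30. The arithmetic can be checked with a few lines of shell; `utc_to_local` is a hypothetical helper that takes a fixed offset in minutes and knows nothing about daylight saving.

```shell
# Hypothetical helper: shift a UTC wall-clock time by a fixed offset.
# $1 = hour, $2 = minute (plain integers, no leading zeros),
# $3 = offset in minutes east of UTC.
utc_to_local() {
    total=$(( ($1 * 60 + $2 + $3 + 1440) % 1440 ))
    printf '%02d:%02d\n' $(( total / 60 )) $(( total % 60 ))
}

# ACST is UTC+9:30, i.e. 570 minutes.
utc_to_local 4 0 570    # prints 13:30
```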
Verification checklist¶
- ssh root@192.168.1.250 pveversion returns expected version
- zpool status stash shows ONLINE
- pvesm status shows nasbackup active
- ansible all -m ping succeeds from WSL
- Proxmox UI reachable at https://192.168.1.250:8006
- Plex UI reachable at http://192.168.1.230:32400/web
- NPM UI reachable (nginx VM)
- docker ps on arrstack shows all media-stack containers up
- UPS status: upsc cyberpower returns battery data
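Parts of this checklist can be scripted. The sketch below assumes a hypothetical `check` helper that only looks at exit codes; the UI items and the expected-version comparison still need a human.

```shell
# Hypothetical helper: run a command quietly and report PASS or FAIL.
failures=0
check() {
    label="$1"
    shift
    if "$@" >/dev/null 2>&1; then
        echo "PASS  $label"
    else
        echo "FAIL  $label"
        failures=$((failures + 1))
    fi
}

# Usage from WSL (commands taken from the checklist above):
#   check "pveversion"  ssh root@192.168.1.250 pveversion
#   check "stash pool"  ssh root@192.168.1.250 \
#       'zpool list -H -o health stash | grep -qx ONLINE'
#   check "UPS"         ssh root@192.168.1.250 upsc cyberpower
#   echo "$failures check(s) failed"
```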
Rehearsal (validate before you need it)¶
Do this after any change to answer.toml.j2 or before any real rebuild.
Build the test ISO¶
cd ~/homelab-ansible
rebuild/render-answer.sh
# Build on proxfold (easier — tool already there)
scp rebuild/answer-test.toml ~/iso/proxmox-ve_9.x-1.iso root@192.168.1.250:/var/lib/vz/template/iso/
ssh root@192.168.1.250 'cd /var/lib/vz/template/iso && proxmox-auto-install-assistant prepare-iso \
proxmox-ve_9.x-1.iso --fetch-from iso --answer-file answer-test.toml \
--output proxmox-ve-auto-test.iso'
Create a nested PVE VM¶
ssh root@192.168.1.250 <<'EOF'
qm create 999 \
--name proxfold-rehearsal \
--memory 2048 --cores 2 --cpu host \
--net0 virtio,bridge=vmbr0 \
--scsihw virtio-scsi-single \
--scsi0 local-zfs:8,ssd=1 \
--ide2 local:iso/proxmox-ve-auto-test.iso,media=cdrom \
--boot 'order=scsi0;ide2' \
--ostype l26
qm start 999
EOF
Watch auto-install¶
From Proxmox UI: Datacenter → proxfold → 999 → Console. The installer runs unattended for ~5 min.
Known limitation — nested install fails at ~99%¶
The PVE 9 installer reliably fails in a nested VM at the final "make system
bootable → update initramfs" step with bootloader setup errors: unable to
install initramfs. This is a nested-KVM + virtio-scsi quirk, not a
problem with the answer template. By the time it fails, the rehearsal has
already validated the parts the template controls:
- vault rendering (root hash + SSH keys baked into ISO)
- hostname from answer-test.toml applied
- disk filter matched the QEMU disk only (non-destructive guard works)
- reboot-on-error = false keeps the VM at the error prompt for inspection
The post-install Ansible bootstrap (steps below) therefore can't be exercised in the nested VM. It's only meaningful against bare metal during a real rebuild. Treat the rehearsal as complete once the installer reaches the initramfs error with hostname/disk evidence intact, then tear down.
If you really want to validate the post-install path, options are:
- Retry with --machine q35 --bios ovmf + an EFI disk (sometimes dodges it)
- Test on a spare physical machine with the prod answer
- Accept that the next real rebuild will be the first full end-to-end run
Find the VM's DHCP-assigned IP¶
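One possible approach, assuming the VM has already exchanged traffic with proxfold on vmbr0 so that a neighbour entry exists: read the MAC from the VM config and look it up in the neighbour table. `net0_mac` is a hypothetical helper; if the lookup comes back empty, the DHCP server's lease table is the other place to check.

```shell
# Hypothetical helper: extract the MAC address from the net0 line of
# `qm config <vmid>` output read on stdin.
net0_mac() {
    sed -n 's/^net0: [A-Za-z0-9]*=\([0-9A-Fa-f:]*\).*/\1/p'
}

# Usage on proxfold (999 is the rehearsal VMID created above):
#   mac="$(qm config 999 | net0_mac)"
#   ip neigh | grep -i "$mac"
```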
Verify Ansible reachability¶
Add a temporary inventory entry and ping:
Run the common role against it as a bootstrap sanity check:
ansible-playbook -i "proxfold-rehearsal," \
-e ansible_host=<test-ip> \
--vault-password-file ~/.vault_pass \
-l proxfold-rehearsal \
-t common \
playbooks/site.yml --diff
(Full proxmox-host.yml run would try to import the stash pool, which
doesn't exist in the nested VM. common alone validates the
SSH + Ansible bootstrap path — which is the piece the rehearsal is
specifically validating.)