Boot Drive Swap — ZFS Mirror¶
Scope: Replace the single 128GB LVM boot drive with two 960GB SATA SSDs in a ZFS mirror (RAID1). Boot drive redundancy is the goal — the system continues from the surviving drive if one fails.
Revision: April 2026 — rewritten to lean on the Phase 3C Proxfold rebuild runbook, then revised post-execution (2026-04-22) with lessons from the live run.
Relationship to rebuild.md
Since Phase 3C, the install itself is handled by the Proxmox auto-install kit. This runbook covers only the swap-specific parts: hardware sourcing, disk-filter update, physical replacement, and boot-mirror verification. Everything from "insert ISO" through "guests restored" lives in rebuild.md.
Hardware¶
| Item | Detail |
|---|---|
| Current boot drive | Samsung SSD 840 PRO 128GB (single, LVM) |
| New drive 1 | Dell 39KRG — Samsung SM843T 960GB MLC SATA |
| New drive 2 | Dell 04T7DD — Intel DC S4500 960GB TLC SATA 6Gbps |
| Drive caddies | 2× Dell 2.5" SFF |
| Installer | Production ISO from rebuild.md (bakes in answer.toml) |
The data pool (stash) lives on 6× Samsung PM1633a SAS SSDs in separate bays. It is not touched by this procedure — the swap only affects the boot bay.
Part 1: Pre-flight¶
1.1 Source the hardware¶
- 2× 960GB SATA SSDs (Dell 39KRG + Dell 04T7DD, or equivalent)
- 2× Dell 2.5" SFF caddies (front drive bays 0 and 1)
1.2 How the installer picks drives — and what to verify¶
The answer.toml.j2 template filters target disks by ID_BUS = "ata". The PERC H730 in Non-RAID passthrough exposes SATA drives with ID_BUS=ata via the SAT translation layer, while the six stash pool members come through as ID_BUS=scsi (native SAS) and the IDSDM is ID_BUS=usb. The filter matches only the new pair, leaving the other drives untouched even though they remain connected during install.
During Phase 4C prep we initially went with an array-form ID_MODEL filter (filter.ID_MODEL = ["SAMSUNG_MZ7WD960*", "INTEL_SSDSC2KB960*"] with filter-match = "any", per Proxmox forum #147365). PVE 9.1.7's proxmox-auto-install-assistant rejects this syntax with invalid type: sequence, expected a string. The array form may work in a future version, but not in 9.1.7. filter.ID_BUS = "ata" sidesteps the issue — one value, one predicate, matches both drives.
Confirm the new drives really are SATA-via-H730 before committing to this template. Expected values from a live ISO (see §2.5 below):
| Drive | Expected ID_MODEL | ID_BUS |
|---|---|---|
| Dell 39KRG (Samsung SM843T) | SAMSUNG_MZ7WD960* | ata |
| Dell 04T7DD (Intel DC S4500) | INTEL_SSDSC2KB960* | ata |
| Six stash members (Samsung PM1633a) | MZILS3T8HMLH* | scsi |
| IDSDM | various | usb |
If you can verify on a workstation (USB enclosure, Linux live VM) ahead of time, do so and skip the outage-time recon. If the new drives don't enumerate as ID_BUS=ata, update filter.ID_BUS accordingly or fall back to disk-list = ["sda","sdb"] after confirming udev ordering.
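A minimal sketch of what the rendered [disk-setup] block would look like under the assumptions above. The key names follow the PVE 9 answer-file format; the sample file path and both helper functions are hypothetical and not part of the repo.

```shell
#!/bin/sh
# Sketch only: write an illustrative [disk-setup] block and sanity-check it.
# The real template lives in homelab-ansible/rebuild/answer.toml.j2; the file
# name and helpers below are made up for this example.

write_sample_disk_setup() {
  # $1 = output file
  cat > "$1" <<'EOF'
[disk-setup]
filesystem = "zfs"
zfs.raid = "raid1"
filter.ID_BUS = "ata"
EOF
}

check_disk_setup() {
  # $1 = answer file. Succeeds only if it asks for a ZFS mirror selected by a
  # single-string ID_BUS filter (the array form is what 9.1.7 rejects).
  grep -q '^filesystem = "zfs"$' "$1" &&
  grep -q '^zfs\.raid = "raid1"$' "$1" &&
  grep -q '^filter\.ID_BUS = "ata"$' "$1"
}

sample="${TMPDIR:-/tmp}/disk-setup-sample.toml"
write_sample_disk_setup "$sample"
check_disk_setup "$sample" && echo "disk-setup asks for a ZFS mirror on ID_BUS=ata"
```

With the assistant installed (§1.4a), proxmox-auto-install-assistant validate-answer run against the fully rendered answer file is the authoritative check; the fragment above would not pass on its own because the other sections are omitted.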
1.3 Production answer template (already updated for 4C)¶
As of branch phase-4c-prep, homelab-ansible/rebuild/answer.toml.j2 already contains the ZFS mirror [disk-setup] block targeting the 960GB SATA pair. Merge that branch before running the render — the vault render pulls from the working tree.
If bench recon (step 1.2) showed that the new drives do not enumerate as ID_BUS=ata, amend the filter on the branch before merge.
1.4 Render and build a fresh production ISO¶
The output ISO lands in rebuild/ (gitignored). Don't write to USB — we'll serve it via iDRAC virtual media. Keep a copy on the NAS or a second location as belt-and-braces.
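The belt-and-braces copy can be sketched as a copy followed by a checksum comparison. The function name stage_iso and both paths in the usage comment are placeholders, not names from the repo.

```shell
#!/bin/sh
# Sketch: copy an ISO to a second location and confirm the copy is intact.
# stage_iso and the example paths are illustrative only.

stage_iso() {
  src="$1"; dest_dir="$2"
  [ -f "$src" ] || { echo "missing ISO: $src" >&2; return 1; }
  mkdir -p "$dest_dir" || return 1
  cp "$src" "$dest_dir/" || return 1
  dest="$dest_dir/$(basename "$src")"
  # Compare SHA-256 digests of the source and the copy.
  a=$(sha256sum < "$src" | cut -d' ' -f1)
  b=$(sha256sum < "$dest" | cut -d' ' -f1)
  if [ "$a" = "$b" ]; then
    echo "OK $dest $a"
  else
    echo "checksum mismatch for $dest" >&2
    return 1
  fi
}

# Example (placeholder paths):
#   stage_iso rebuild/<production>.iso /mnt/nas/iso-archive
```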
1.4a Install proxmox-auto-install-assistant on WSL (one-time, pre-stage)¶
If not already installed, add it to WSL so the ISO can be rebuilt on-the-fly during the outage if the disk filter needs adjustment after recon:
curl -fsSL https://enterprise.proxmox.com/debian/proxmox-release-trixie.gpg \
| sudo tee /etc/apt/keyrings/proxmox.gpg >/dev/null
echo 'deb [signed-by=/etc/apt/keyrings/proxmox.gpg] http://download.proxmox.com/debian/pve trixie pve-no-subscription' \
| sudo tee /etc/apt/sources.list.d/pve.list
sudo apt update && sudo apt install -y proxmox-auto-install-assistant
1.5 Full vzdump backup of all guests¶
ssh root@192.168.1.250 \
'vzdump 100 101 102 104 --storage nasbackup --compress zstd --mode snapshot \
--notes-template "Pre boot swap - {{guestname}}"'
Verify sizes on the NAS:
Approximate expected sizes:
| VM/CT | Name | Size |
|---|---|---|
| LXC 100 | plex | ~5.6 GB |
| VM 101 | arrstack | ~3.9 GB |
| VM 102 | nginx | ~1.4 GB |
| LXC 104 | control | ~500 MB |
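One way to run the size check, as a sketch: it assumes the nasbackup storage is mounted at PVE's usual /mnt/pve/nasbackup/dump path (adjust if the NAS share is checked from elsewhere), and check_dumps is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: confirm each expected guest has a vzdump archive in the dump
# directory and print its size next to the file name.

check_dumps() {
  dump_dir="$1"; shift
  rc=0
  for id in "$@"; do
    # vzdump names archives vzdump-<lxc|qemu>-<vmid>-<timestamp>.<ext>;
    # with --compress zstd the archive ends in .zst.
    newest=$(ls -1t "$dump_dir"/vzdump-*-"$id"-*.zst 2>/dev/null | head -n 1)
    if [ -n "$newest" ]; then
      size=$(du -h "$newest" | cut -f1)
      echo "$id $size $(basename "$newest")"
    else
      echo "$id MISSING"
      rc=1
    fi
  done
  return $rc
}

# On the host (path is an assumption, see above):
#   check_dumps /mnt/pve/nasbackup/dump 100 101 102 104
```

Compare the printed sizes against the table above; a MISSING line or a size far below the expected figure is a reason to stop before H1.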
1.6 Verify Ansible playbooks are pushed¶
After the rebuild, the first control node (WSL, then CT104 once restored) clones the repo and runs ansible-playbook playbooks/site.yml. Anything not on origin/main will not be applied.
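A sketch of the "is everything on origin/main" check, run from the control node before shutdown. check_pushed is a hypothetical helper; the remote name origin and branch main are taken from the text above.

```shell
#!/bin/sh
# Sketch: fail if the repo has uncommitted changes or commits that are not
# yet contained in the upstream ref (default origin/main).

check_pushed() {
  repo="$1"; ref="${2:-origin/main}"
  git -C "$repo" fetch --quiet origin || return 2
  if [ -n "$(git -C "$repo" status --porcelain)" ]; then
    echo "uncommitted changes in $repo"
    return 1
  fi
  # Count commits reachable from HEAD but not from the upstream ref.
  ahead=$(git -C "$repo" rev-list --count "$ref..HEAD") || return 2
  if [ "$ahead" -gt 0 ]; then
    echo "$ahead commit(s) not on $ref"
    return 1
  fi
  echo "working tree clean, nothing ahead of $ref"
}

# Example:
#   check_pushed ~/homelab-ansible
```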
Part 2: Physical swap and pre-install recon¶
Hold point H1 — shutdown
Last moment to stop with zero disruption. Resume plan by bringing host back up.
- Shut down Proxmox cleanly:
ssh root@192.168.1.250 'shutdown -h now'
- Wait for iDRAC to report POST complete, machine powered off
Hold point H2 — physical swap
Drives out but nothing installed. Rollback = re-insert 840 PRO into slot 0, pull new drives, power on. Old PVE 9.1.7 boots unchanged.
- Disconnect both power cables from the rear PSUs
- Open the chassis (rotate latch, slide and lift the cover)
- Remove the Samsung 840 PRO 128GB from its front drive bay caddy (PERC slot 0). Set it aside on the bench — do not re-insert into the PERC unless rolling back.
- Mount both new SSDs in Dell 2.5" SFF caddies
- Insert both new drives into front drive bays 0 and 1
- Reassemble and reconnect power
Note
The Nvidia T400 GPU and all 6× SAS data drives remain in place throughout. Do not touch the PCIe riser or the data drive bays (PERC slots 2–7).
2.5 Live-ISO recon (if not done on a bench)¶
Do not rely on the Proxmox installer's debug shell for recon
The Proxmox auto-installer offers a debug shell when you abort with
Ctrl-C / "Abort installation". It is pre-init busybox, not a full
live environment — lsblk, udevadm, smartctl, zpool, blkid
are all missing. Only /proc and /sys are usable. This is enough
to verify which disks will match the filter (see below) but NOT
enough to SMART-check them. Use a proper Linux live ISO instead.
If you're already in the debug shell and need to sanity-check the
vendor class before committing, read /sys/block/sdX/device/vendor
for each disk directly; this is the same attribute udev uses to set
ID_VENDOR. Expected values: sda = ATA, sdb = ATA (new pair,
SATA-via-H730), sdc–sdh = IBM-C051 or similar (SAS stash drives),
and no USB/IDSDM entry.
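The sysfs read can be looped over every disk. A minimal sketch that sticks to shell builtins, so it should also work in the busybox debug shell; list_vendors is a hypothetical helper, and its optional argument only exists so the loop can be pointed at a test tree instead of /sys/block.

```shell
#!/bin/sh
# Sketch: print the SCSI vendor string of every sd* disk from sysfs.

list_vendors() {
  root="${1:-/sys/block}"
  for d in "$root"/sd*; do
    # Skip anything without a readable vendor attribute (or an unmatched glob).
    [ -r "$d/device/vendor" ] || continue
    # The attribute is space-padded; read trims the surrounding whitespace.
    read -r vendor < "$d/device/vendor"
    echo "${d##*/} = $vendor"
  done
}

list_vendors
```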
Power on, attach a Linux live ISO via iDRAC virtual media (Ubuntu desktop, SystemRescue, or any Debian live), boot into it, and capture the udev properties:
for d in /dev/sd[a-z]; do
  echo "==> $d"
  # Anchor on the exact keys so variants such as ID_MODEL_ENC don't crowd out ID_BUS
  udevadm info --query=property --name="$d" | grep -E "^(ID_BUS|ID_MODEL|ID_SERIAL|ID_PATH)="
done
Verify the new drives show:
- ID_BUS=ata, ID_MODEL=SAMSUNG_MZ7WD960... (Dell 39KRG)
- ID_BUS=ata, ID_MODEL=INTEL_SSDSC2KB960... (Dell 04T7DD)
- Six ID_BUS=scsi, MZILS3T8HMLH* entries for the stash drives (untouched)
Then SMART-check both new drives before committing to install. Used enterprise SSDs off eBay/resellers occasionally arrive DOA or with aggressive wear:
# Assuming sda and sdb are the two new drives per the udev dump above
for d in /dev/sda /dev/sdb; do
echo "==================== $d"
smartctl -H "$d" # overall PASS/FAIL
smartctl -A "$d" | grep -E \
"Power_On_Hours|Wear_Leveling|Media_Wearout|Reallocated_Sector|Total_LBAs_Written|Available_Spare"
done
Expect:
- SMART overall-health self-assessment test result: PASSED
- Reallocated_Sector_Ct raw value 0 (or very small)
- Wear indicator (Samsung Wear_Leveling_Count / Intel Media_Wearout_Indicator) normalized value well above the threshold: 90+ is fresh, 10–30 is near EOL
- Power_On_Hours sanity: used drives will have hours logged; catastrophically high (>40000) is worth flagging
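The wear thresholds above can be applied mechanically. A sketch that reads smartctl -A output on stdin and labels the normalized wear value; wear_verdict is a hypothetical helper, and treating everything at or below 30 as near-EOL is a simplification of the 10–30 band.

```shell
#!/bin/sh
# Sketch: classify the normalized wear value from `smartctl -A` output.
# In the attribute table, column 2 is the name and column 4 the normalized VALUE.

wear_verdict() {
  awk '
    $2 == "Wear_Leveling_Count" || $2 == "Media_Wearout_Indicator" {
      v = $4 + 0    # strip leading zeros, e.g. 099 becomes 99
      verdict = (v >= 90) ? "fresh" : ((v <= 30) ? "near-EOL" : "used")
      print $2, v, verdict
      seen = 1
    }
    END { if (!seen) { print "no wear attribute found"; exit 1 } }
  '
}

# Example, from the live ISO:
#   smartctl -A /dev/sda | wear_verdict
```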
If a drive fails SMART or looks near EOL: abort, re-insert the 840 PRO, reschedule with a replacement drive. This is the last cheap rollback.
If the new drives do not match the template's disk filter (ID_BUS = "ata"):
1. Power off via the live ISO
2. On WSL: edit homelab-ansible/rebuild/answer.toml.j2 with the real values
3. rebuild/render-answer.sh && rebuild/build-iso.sh ~/iso/proxmox-ve_9.1-1.iso
4. Swap iDRAC virtual media to the new prod ISO
If the actual strings match the template, proceed to Part 3.
Hold point H3 — pre-installer
Last moment to stop before writing to the new drives. Rollback same as H2: power off, re-insert 840 PRO, power on.
iDRAC boot-order quirk: stale UEFI entries can fall through to unexpected media
With both new drives blank, the firmware's existing Boot0007* proxmox entry (or equivalent) still points to the old 840 PRO's ESP GPT UUID — which no longer exists. The firmware skips it and falls through to the next bootable device in BootOrder. On proxfold that was the IDSDM's old ESXi install, which is surprising. Use F11 at POST to explicitly select the virtual media instead of waiting for auto-boot. Post-install, clear the stale entry with efibootmgr -b <hex> -B once the new systemd-boot entries are confirmed working.
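To find the hex number for the cleanup step, the efibootmgr listing can be filtered by label. A sketch, assuming the stale entry is labelled proxmox as in the note above; boot_entries_named is a hypothetical helper and only matches single-word labels.

```shell
#!/bin/sh
# Sketch: print the four-digit numbers of UEFI boot entries with a given label.
# stdin: efibootmgr output; $1: label to look for.

boot_entries_named() {
  awk -v label="$1" '
    /^Boot[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f]/ && $2 == label {
      print substr($1, 5, 4)    # "Boot0007*" yields "0007"
    }
  '
}

# Example, on the freshly installed host:
#   efibootmgr | boot_entries_named proxmox
#   efibootmgr -b 0007 -B    # only after confirming 0007 is the dead entry
```

Check each hit against the partition UUID it points to before deleting anything, since a new install may register entries of its own.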
Part 3: Run rebuild.md¶
At this point the procedure is identical to a bare-metal rebuild. Follow the Proxfold rebuild runbook from "Boot the installer" through to "Verify guests started". Key differences to watch for:
- The auto-install answer's filesystem section should already specify zfs (RAID1) across both new drives. The template default is a mirror, but verify after rendering.
- The installer will find both new SSDs matching the updated disk filter (ID_BUS = "ata", see §1.2) and create rpool as a mirror across them
- Stash import, NFS export, NUT config, GPU driver install are all handled by the proxmox / nfs / nut / nvidia roles; no manual post-install work beyond what rebuild.md covers
- Guest restore targets local-zfs (the default storage pool on a fresh ZFS install), not local-lvm. Verified in rebuild.md's restore block.
Hold points inside rebuild.md¶
| Hold | Where in rebuild.md | Recovery |
|---|---|---|
| H4 | After "First-boot sanity check" — fresh PVE, no Ansible yet | Roll back: power off, re-insert 840 PRO, boot old PVE |
| H5 | After proxmox-host.yml completes | Host fully configured, no guests. Safe to walk away; re-running Ansible is idempotent |
| H6 | After vzdump restore, before guest-specific playbooks | Rollback window effectively closed — guests are on new storage |
| H7 | After guest playbooks | Only loose ends are verification (Part 4 below) |
| H8 | Verification complete | Keep 840 PRO in the drawer as archival rollback for ≥1 week |
Part 4: Post-swap verification — boot mirror specific¶
These are the checks that matter because it was a mirror install, not generic "is the system up" checks (those are in rebuild.md's verification section).
4.1 Boot mirror health¶
Expected:
- state: ONLINE
- mirror-0 section listing both drives, both ONLINE
- errors: No known data errors
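These three expectations can be checked against zpool status rpool output in one pass. A sketch; check_boot_mirror is a hypothetical helper and assumes a single two-way mirror vdev, as this install creates.

```shell
#!/bin/sh
# Sketch: read `zpool status rpool` on stdin and succeed only if the pool is
# ONLINE, the mirror has exactly two ONLINE members, and no errors are known.

check_boot_mirror() {
  awk '
    $1 == "state:"                          { state = $2 }
    $1 ~ /^mirror-/                         { in_mirror = 1; next }
    in_mirror && NF == 0                    { in_mirror = 0 }
    in_mirror && $2 == "ONLINE"             { online++ }
    in_mirror && NF >= 2 && $2 != "ONLINE"  { bad++ }
    /^errors: No known data errors/         { clean = 1 }
    END {
      ok = (state == "ONLINE" && online == 2 && bad == 0 && clean)
      print (ok ? "rpool mirror healthy" : "rpool mirror needs attention")
      exit !ok
    }
  '
}

# Example:
#   zpool status rpool | check_boot_mirror
```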
4.2 Both ESPs registered¶
Proxmox uses proxmox-boot-tool to keep the EFI System Partition on both drives in sync. Kernel updates write to both automatically — but only if both ESPs are registered.
Expected: two lines, one per drive, each showing a UUID + "configured with: uefi".
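A sketch of that check using proxmox-boot-tool status, counting the lines that report a UEFI-configured ESP. esps_in_sync is a hypothetical helper; the match string comes from the expected output described above.

```shell
#!/bin/sh
# Sketch: read `proxmox-boot-tool status` on stdin and succeed only if exactly
# two ESPs are reported as configured for UEFI.

esps_in_sync() {
  n=$(grep -c 'configured with: uefi')
  echo "$n ESP(s) registered for UEFI"
  [ "$n" -eq 2 ]
}

# Example:
#   proxmox-boot-tool status | esps_in_sync
```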
4.3 Kernel pin survived¶
Expected: 6.14.11-6-pve under "Manually selected kernels". If it isn't pinned, the proxmox role will re-pin on the next run — but it's faster to confirm here.
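A sketch for confirming where the kernel shows up in proxmox-boot-tool kernel list. kernel_sections is a hypothetical helper that prints every heading under which the given version appears, so the heading can be read off rather than assumed.

```shell
#!/bin/sh
# Sketch: read `proxmox-boot-tool kernel list` on stdin and print the heading
# of each section that lists the given kernel version. Fails if none does.

kernel_sections() {
  awk -v want="$1" '
    /:$/        { section = $0; next }    # headings end with a colon
    $0 == want  { print section; found = 1 }
    END         { exit !found }
  '
}

# Example:
#   proxmox-boot-tool kernel list | kernel_sections 6.14.11-6-pve
```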
4.4 Media pipeline sanity¶
One spot-check beyond the role/rebuild validation: play a Plex title and force a transcode. (hw) should appear in the Plex dashboard, and nvidia-smi on proxfold should show the encoder process. This catches the "Nvidia major numbers shifted" class of failure that the role verifies but doesn't guarantee end-to-end.
Appendix: Replacing a failed boot mirror drive¶
Ongoing procedure — not part of the initial swap, but lives here because it's mirror-specific.
- Identify the failed drive:
zpool status rpool
- Physically replace the failed drive (front bay, hot-swap caddy)
- Copy the partition table from the surviving drive:
- Replace in the ZFS pool:
- Wait for resilver to complete:
zpool status rpool
- Initialise EFI on the new drive so proxmox-boot-tool keeps it in sync:
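The command steps above can be sketched as a plan that is printed for review instead of executed. This follows the usual Proxmox procedure for a failed bootable ZFS device and assumes the stock installer layout (partition 2 = ESP, partition 3 = ZFS); print_replace_plan and its arguments are hypothetical, and the commands are destructive, so check every path first.

```shell
#!/bin/sh
# Sketch: print (not run) the commands for re-adding a replacement boot disk.
# $1 = surviving disk, $2 = new disk (both /dev/disk/by-id/ paths),
# $3 = failed member exactly as zpool status names it.

print_replace_plan() {
  healthy="$1"; new="$2"; old_part="$3"
  cat <<EOF
sgdisk $healthy -R $new
sgdisk -G $new
zpool replace -f rpool $old_part ${new}-part3
# wait for the resilver to finish: zpool status rpool
proxmox-boot-tool format ${new}-part2
proxmox-boot-tool init ${new}-part2
EOF
}

# Example (placeholder IDs):
#   print_replace_plan /dev/disk/by-id/ata-GOOD /dev/disk/by-id/ata-NEW ata-OLD-part3
```

The first sgdisk call copies the partition table from the surviving disk to the new one and the second randomizes the GUIDs on the copy, so the order of the two device arguments matters.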
Note
Always use /dev/disk/by-id/ paths for ZFS operations — they are stable across reboots. sdX assignments are not.
Lessons from the 2026-04-22 run¶
Captured post-execution. The flow above has been patched to match — this section is the backstory for why certain steps exist.
- filter.ID_BUS = "ata" beats array-form ID_MODEL. The prep work went in with array-form ID_MODEL, but PVE 9.1.7's validator rejects it (invalid type: sequence, expected a string). Switching to ID_BUS = "ata" is simpler and works; §1.2 now leads with that.
- Proxmox installer debug shell is busybox, not a live environment. Realised mid-recon. Sysfs (/sys/block/*/device/vendor) gave enough ground truth to proceed, but SMART + udevadm had to wait for an Ubuntu live ISO via iDRAC. §2.5 now calls this out before you rely on the debug shell.
- Stale UEFI Boot0007* proxmox entry fell through to IDSDM ESXi. With both new blank drives in, firmware skipped the dead entry and auto-booted the IDSDM's long-forgotten ESXi install. F11 boot menu → select virtual media explicitly is the reliable path. §H3 now carries the note.
- proxmox-boot-tool kernel pin order matters: pin, then refresh. Running refresh first (before the pin file exists) silently propagates nothing. The rebuild runbook now mentions this; the role pins correctly, so you only hit it if you're doing a manual pin.
- Fresh install = new hostid. The stash pool needed zpool import -f stash once before the playbook could take over. The proxmox role now uses -f by default, which is safe here because there is one host per pool in this homelab.
- nvidia install failed first-pass while nouveau was still loaded. The blacklist lands on disk but the running kernel doesn't reload it. After the first proxmox-host.yml, reboot before re-running. Now documented in rebuild.md.
- vzdump captures stale ide2: local:iso/... references. VM101 and VM102 both carried a debian-12.4.0-amd64-netinst.iso that didn't exist on the fresh install. Run qm set <vmid> --ide2 none,media=cdrom before first start. Now documented in rebuild.md's restore section.
- Guest playbooks have to run from CT104, not WSL. Guest authorized_keys trust the CT104 ansible@homelab key only. Guest-side playbooks from WSL fail with pubkey-denied. CT104 is restored first on purpose; rebuild.md's "Reconfigure each guest" now says so explicitly.
- Missing vault_nasbackup_password blocked CIFS registration. The nasbackup config shipped with the role but the vault variable was never committed. Added during 4C. If you're cloning this setup fresh, make sure the vault has it.
- Nvidia LXC passthrough role used blockinfile; PVE re-sorts the block body out from under it. PVE's LXC config parser moves all raw lxc.* keys to end-of-file on any write. With blockinfile, the markers stay put while the content migrates out, so the block looks empty and drift reports changed=1 every day. Role was migrated to per-line lineinfile state=present, which is transparent to the reorder. Documented in the nvidia role page.
- Rehearsal couldn't validate post-install Ansible, as expected (nested PVE installer fails at ~99% on initramfs, see rebuild.md). The live run did expose a few role gaps that rehearsal couldn't have surfaced (ceph deb822 disable, zpool import -f, vault var), all now fixed.
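The stale ide2 lesson lends itself to a pre-start check. A sketch that lists ISO files a restored VM config references but the ISO directory lacks; missing_isos is a hypothetical helper, and /var/lib/vz/template/iso is assumed to be where the local storage keeps ISOs.

```shell
#!/bin/sh
# Sketch: print ISO names referenced by ide2 lines in a VM config that are
# absent from the ISO directory. $1 = config file, $2 = ISO directory.

missing_isos() {
  conf="$1"; iso_dir="$2"
  # Pull the file name out of lines like
  #   ide2: local:iso/<name>.iso,media=cdrom,...
  sed -n 's|^ide2: [^:]*:iso/\([^,]*\),.*|\1|p' "$conf" |
  while IFS= read -r iso; do
    [ -e "$iso_dir/$iso" ] || echo "$iso"
  done
}

# Example (paths are assumptions):
#   missing_isos /etc/pve/qemu-server/101.conf /var/lib/vz/template/iso
```

Any name it prints is a candidate for the qm set <vmid> --ide2 none,media=cdrom fix before first start.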
Related¶
- Proxfold rebuild runbook — the install procedure this runbook defers to
- proxmox role — kernel pin, repos, nouveau blacklist, sysctl
- nvidia role — GPU driver + LXC passthrough
- nut role — UPS monitoring restored as part of site.yml
- Storage — ZFS ARC cap