Boot Drive Swap: ZFS Mirror

Scope: Replace the single 128GB LVM boot drive with two 960GB SATA SSDs in a ZFS mirror (RAID1). Boot drive redundancy is the goal — the system continues from the surviving drive if one fails.

Revision: April 2026 — rewritten to lean on the Phase 3C Proxfold rebuild runbook, then revised post-execution (2026-04-22) with lessons from the live run.

Relationship to rebuild.md

Since Phase 3C, the install itself is handled by the Proxmox auto-install kit. This runbook covers only the swap-specific parts: hardware sourcing, the disk filter update, physical replacement, and boot-mirror verification. Everything from "insert ISO" through "guests restored" lives in rebuild.md.

Hardware

Item	Detail
Current boot drive	Samsung SSD 840 PRO 128GB (single, LVM)
New drive 1	Dell 39KRG — Samsung SM843T 960GB MLC SATA
New drive 2	Dell 04T7DD — Intel DC S4500 960GB TLC SATA 6Gbps
Drive caddies	2× Dell 2.5" SFF
Installer	Production ISO from rebuild.md (bakes in `answer.toml`)

The data pool (stash) lives on 6× Samsung PM1633a SAS SSDs in separate bays. It is not touched by this procedure — the swap only affects the boot bay.

Part 1: Pre-flight

1.1 Source the hardware

2× 960GB SATA SSDs (Dell 39KRG + Dell 04T7DD, or equivalent)
2× Dell 2.5" SFF caddies (front drive bays 0 and 1)

1.2 How the installer picks drives, and what to verify

The answer.toml.j2 template filters target disks by ID_BUS = "ata". The PERC H730 in Non-RAID passthrough exposes SATA drives with ID_BUS=ata via the SAT translation layer, while the six stash pool members come through as ID_BUS=scsi (native SAS) and the IDSDM is ID_BUS=usb. The filter matches only the new pair, leaving the other drives untouched even though they remain connected during install.

During Phase 4C prep we initially went with an array-form ID_MODEL filter (filter.ID_MODEL = ["SAMSUNG_MZ7WD960*", "INTEL_SSDSC2KB960*"] with filter-match = "any", per Proxmox forum #147365). PVE 9.1.7's proxmox-auto-install-assistant rejects this syntax with invalid type: sequence, expected a string. The array form may work in a future version, but not in 9.1.7. filter.ID_BUS = "ata" sidesteps the issue — one value, one predicate, matches both drives.
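Because the array form only fails when the assistant processes the answer file, a quick lint of the rendered file can catch it earlier. The following is a minimal sketch, not part of the auto-install kit: `lint_filters` is a hypothetical helper name, and it does nothing more than grep for `filter.<KEY> = [` lines.

```shell
# lint_filters FILE
# Hypothetical helper: flags array-valued filter.* keys, the form that
# proxmox-auto-install-assistant in PVE 9.1.7 rejected during 4C prep.
lint_filters() {
  if grep -nE '^[[:space:]]*filter\.[A-Za-z0-9_]+[[:space:]]*=[[:space:]]*\[' "$1"; then
    echo "array-form filter found in $1: use a single string value" >&2
    return 1
  fi
  echo "filters OK in $1"
}
```

Run it against wherever the render step writes the answer file (that path is not stated here, so substitute your own). A non-zero exit means the file still carries the rejected syntax.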

Confirm the new drives really are SATA-via-H730 before committing to this template. Expected values from a live ISO (see §2.5 below):

Drive	Expected `ID_MODEL`	`ID_BUS`
Dell 39KRG (Samsung SM843T)	`SAMSUNG_MZ7WD960*`	`ata`
Dell 04T7DD (Intel DC S4500)	`INTEL_SSDSC2KB960*`	`ata`
Six stash members (Samsung PM1633a)	`MZILS3T8HMLH*`	`scsi`
IDSDM	various	`usb`

udevadm info --query=property --name=/dev/sdX | grep -E "ID_BUS|ID_MODEL|ID_SERIAL"

If you can verify on a workstation (USB enclosure, Linux live VM) ahead of time, do so and skip the outage-time recon. If the new drives don't enumerate as ID_BUS=ata, update filter.ID_BUS accordingly or fall back to disk-list = ["sda","sdb"] after confirming udev ordering.
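To preview which disks a `filter.ID_BUS = "ata"` predicate would select, the udev check can be wrapped in a small loop. This is a sketch under the assumption that the filter behaves as an exact match on the `ID_BUS` property; `classify_disk` and `scan_disks` are hypothetical names.

```shell
# classify_disk NAME
# Reads `udevadm info --query=property` output on stdin and reports whether
# an ID_BUS=ata predicate would select the disk.
classify_disk() {
  if grep -qx 'ID_BUS=ata'; then
    echo "$1: ata (installer target)"
  else
    echo "$1: not ata (left alone)"
  fi
}

# scan_disks: classify every /dev/sdX block device that is present.
scan_disks() {
  for d in /dev/sd[a-z]; do
    [ -b "$d" ] || continue
    udevadm info --query=property --name="$d" | classify_disk "$d"
  done
}
```

Calling `scan_disks` from a full live environment should list exactly two installer targets; anything else is a reason to stop before building the ISO.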

1.3 Production answer template (already updated for 4C)

As of branch phase-4c-prep, homelab-ansible/rebuild/answer.toml.j2 already contains the ZFS mirror [disk-setup] block targeting the 960GB SATA pair. Merge that branch before running the render — the vault render pulls from the working tree.

If bench recon (step 1.2) produced different ID_BUS values than the template's filter expects, amend the filter on the branch before merge.

1.4 Render and build a fresh production ISO

cd ~/homelab-ansible
rebuild/render-answer.sh
rebuild/build-iso.sh ~/iso/proxmox-ve_9.1-1.iso

The output ISO lands in rebuild/ (gitignored). Don't write to USB — we'll serve it via iDRAC virtual media. Keep a copy on the NAS or a second location as belt-and-braces.
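When making the second copy, it is cheap to confirm that it is byte-identical to the freshly built ISO. A minimal sketch, assuming `sha256sum` is available on the WSL side; `stash_iso` is a hypothetical helper, and both paths are supplied by the caller.

```shell
# stash_iso ISO DEST_DIR
# Copies the ISO into DEST_DIR and confirms the copy's SHA-256 matches.
stash_iso() {
  src=$1
  dest=$2/$(basename "$src")
  cp -- "$src" "$dest" || return 1
  a=$(sha256sum < "$src")
  b=$(sha256sum < "$dest")
  if [ "$a" = "$b" ]; then
    echo "verified: $dest"
  else
    echo "checksum mismatch: $dest" >&2
    return 1
  fi
}
```

The destination directory is whatever NAS mount or second location you use; the function only prints `verified:` when the two hashes agree.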

1.4a Install `proxmox-auto-install-assistant` on WSL (one-time, pre-stage)

If not already installed, add it to WSL so the ISO can be rebuilt on-the-fly during the outage if the disk filter needs adjustment after recon:

curl -fsSL https://enterprise.proxmox.com/debian/proxmox-release-trixie.gpg \
  | sudo tee /etc/apt/keyrings/proxmox.gpg >/dev/null
echo 'deb [signed-by=/etc/apt/keyrings/proxmox.gpg] http://download.proxmox.com/debian/pve trixie pve-no-subscription' \
  | sudo tee /etc/apt/sources.list.d/pve.list
sudo apt update && sudo apt install -y proxmox-auto-install-assistant

1.5 Full vzdump backup of all guests

ssh root@192.168.1.250 \
  'vzdump 100 101 102 104 --storage nasbackup --compress zstd --mode snapshot \
    --notes-template "Pre boot swap - {{guestname}}"'

Verify sizes on the NAS:

ssh root@192.168.1.250 'ls -lh /mnt/pve/nasbackup/dump/'

Approximate expected sizes:

VM/CT	Name	Size
LXC 100	plex	~5.6 GB
VM 101	arrstack	~3.9 GB
VM 102	nginx	~1.4 GB
LXC 104	control	~500 MB
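Eyeballing `ls -lh` works, but a per-guest presence check is harder to misread under outage pressure. The sketch below assumes the usual `vzdump-<type>-<vmid>-<timestamp>` archive naming and the `.zst` suffix produced by `--compress zstd`; `check_dumps` is a hypothetical helper and does not compare against the approximate sizes in the table.

```shell
# check_dumps DIR VMID...
# For each VMID, require at least one non-empty zstd archive in DIR.
# Log and notes files are ignored because they do not end in .zst.
check_dumps() {
  dir=$1; shift
  rc=0
  for id in "$@"; do
    found=$(find "$dir" -maxdepth 1 -type f -name "vzdump-*-$id-*.zst" -size +0 | head -n 1)
    if [ -n "$found" ]; then
      echo "ok $id $(basename "$found")"
    else
      echo "MISSING $id" >&2
      rc=1
    fi
  done
  return $rc
}
```

On the host this would be invoked as `check_dumps /mnt/pve/nasbackup/dump 100 101 102 104`. It does not check that the archive is from today's run, so stale dumps from an earlier backup will also pass.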

1.6 Verify Ansible playbooks are pushed

cd ~/homelab-ansible
git status        # clean working tree
git push          # everything on origin/main

After the rebuild, the first control node (WSL, then CT104 once restored) clones the repo and runs ansible-playbook playbooks/site.yml. Anything not on origin/main will not be applied.

Part 2: Physical swap and pre-install recon¶

Hold point H1 — shutdown

Last moment to stop with zero disruption. Rollback = power the host back on.

Shut down Proxmox cleanly: ssh root@192.168.1.250 'shutdown -h now'
Wait for iDRAC to report the machine as powered off
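If IPMI over LAN is enabled on the iDRAC (an assumption; it is not stated in this runbook), the wait can be scripted instead of watched. In this sketch `IDRAC_HOST` and `IDRAC_USER` are placeholders you must set, and the password is taken from the `IPMI_PASSWORD` environment variable via `ipmitool -E`.

```shell
# power_is_off: reads `ipmitool ... chassis power status` output on stdin.
power_is_off() {
  grep -qi '^Chassis Power is off'
}

# wait_for_power_off: polls every 10 s until the chassis reports off.
# IDRAC_HOST and IDRAC_USER are placeholders, not values from this runbook.
wait_for_power_off() {
  until ipmitool -I lanplus -H "$IDRAC_HOST" -U "$IDRAC_USER" -E \
      chassis power status | power_is_off; do
    sleep 10
  done
  echo "host is off"
}
```

Note that the loop also keeps spinning if `ipmitool` itself fails (bad credentials, IPMI disabled), so test it once while the host is still up.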

Hold point H2 — physical swap

Drives out but nothing installed. Rollback = re-insert 840 PRO into slot 0, pull new drives, power on. Old PVE 9.1.7 boots unchanged.

Disconnect both power cables from the rear PSUs
Open the chassis (rotate latch, slide and lift the cover)
Remove the Samsung 840 PRO 128GB from its front drive bay caddy (PERC slot 0). Set it aside on the bench — do not re-insert into the PERC unless rolling back.
Mount both new SSDs in Dell 2.5" SFF caddies
Insert both new drives into front drive bays 0 and 1
Reassemble and reconnect power

Note

The Nvidia T400 GPU and all 6× SAS data drives remain in place throughout. Do not touch the PCIe riser or the data drive bays (PERC slots 2–7).

2.5 Live-ISO recon (if not done on a bench)

Do not rely on the Proxmox installer's debug shell for recon

The Proxmox auto-installer offers a debug shell when you abort with Ctrl-C / "Abort installation". It is pre-init busybox, not a full live environment — lsblk, udevadm, smartctl, zpool, blkid are all missing. Only /proc and /sys are usable. This is enough to verify which disks will match the filter (see below) but NOT enough to SMART-check them. Use a proper Linux live ISO instead.

If you're already in the debug shell and need to sanity-check the vendor class before committing, read sysfs directly — this is the same attribute udev uses to set ID_VENDOR:

for d in /sys/block/sd*; do
  echo "$(basename $d) = $(cat $d/device/vendor)"
done

Expected during 4C: sda = ATA, sdb = ATA (new pair, SATA-via-H730), sdc–sdh = IBM-C051 or similar (SAS stash drives), no USB/IDSDM entry.

Power on, attach a Linux live ISO via iDRAC virtual media (Ubuntu desktop, SystemRescue, or any Debian live), boot into it, and capture the udev properties:

for d in /dev/sd[a-z]; do
  echo "==> $d"
  # Anchor the match so only the four properties of interest are printed
  udevadm info --query=property --name="$d" | grep -E '^ID_(BUS|MODEL|SERIAL|PATH)='
done

Verify the new drives show:

ID_BUS=ata, ID_MODEL=SAMSUNG_MZ7WD960... (Dell 39KRG)
ID_BUS=ata, ID_MODEL=INTEL_SSDSC2KB960... (Dell 04T7DD)
Six ID_BUS=scsi, MZILS3T8HMLH* entries for the stash drives (untouched)

Then SMART-check both new drives before committing to install. Used enterprise SSDs off eBay/resellers occasionally arrive DOA or with aggressive wear:

# Assuming sda and sdb are the two new drives per the udev dump above
for d in /dev/sda /dev/sdb; do
  echo "==================== $d"
  smartctl -H "$d"                      # overall PASS/FAIL
  smartctl -A "$d" | grep -E \
    "Power_On_Hours|Wear_Leveling|Media_Wearout|Reallocated_Sector|Total_LBAs_Written|Available_Spare"
done

Expect:

SMART overall-health self-assessment test result: PASSED
Reallocated_Sector_Ct raw value 0 (or very small)
Wear indicator (Samsung Wear_Leveling_Count / Intel Media_Wearout_Indicator) normalized value well above the threshold — 90+ is fresh, 10–30 is near EOL
Power_On_Hours sanity — used drives will have hours logged; catastrophically high (>40000) is worth flagging

If a drive fails SMART or looks near EOL: abort, re-insert the 840 PRO, reschedule with a replacement drive. This is the last cheap rollback.
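The thresholds above can be applied mechanically to the attribute table. A rough sketch, assuming the standard ten-column `smartctl -A` layout (attribute name in column 2, normalized VALUE in column 4, RAW_VALUE in column 10); `smart_flags` is a hypothetical helper, and it is stricter than the text in that it flags any non-zero reallocated count.

```shell
# smart_flags: reads `smartctl -A` output on stdin and prints one line per
# concern. No output means nothing tripped the thresholds below.
smart_flags() {
  awk '
    $2 == "Reallocated_Sector_Ct" && $10 + 0 > 0 {
      print "reallocated sectors: " $10
    }
    ($2 == "Wear_Leveling_Count" || $2 == "Media_Wearout_Indicator") && $4 + 0 <= 30 {
      print "wear indicator low: " $4
    }
    $2 == "Power_On_Hours" && $10 + 0 > 40000 {
      print "high power-on hours: " $10
    }
  '
}
```

Usage would be `smartctl -A /dev/sda | smart_flags`. It complements rather than replaces `smartctl -H`, since the overall health verdict is not part of the attribute table.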

If the values from the live ISO differ from what the template's filter expects (for example, the new drives do not report ID_BUS=ata):

1. Power off via the live ISO
2. On WSL: edit homelab-ansible/rebuild/answer.toml.j2 with the real values
3. rebuild/render-answer.sh && rebuild/build-iso.sh ~/iso/proxmox-ve_9.1-1.iso
4. Swap iDRAC virtual media to the new prod ISO

If the actual strings match the template, proceed to Part 3.

Hold point H3 — pre-installer

Last moment to stop before writing to the new drives. Rollback same as H2: power off, re-insert 840 PRO, power on.

iDRAC boot-order quirk: stale UEFI entries can fall through to unexpected media

With both new drives blank, the firmware's existing Boot0007* proxmox entry (or equivalent) still points to the old 840 PRO's ESP GPT UUID, which no longer exists. The firmware skips it and falls through to the next bootable device in BootOrder. On proxfold that was, unexpectedly, the IDSDM's old ESXi install. Press F11 at POST to select the virtual media explicitly instead of waiting for auto-boot. Post-install, clear the stale entry with efibootmgr -b <hex> -B once the new systemd-boot entries are confirmed working.
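Finding the `<hex>` for the cleanup step is a one-line parse of `efibootmgr` output. This is only a sketch: `boot_entries_by_label` is a hypothetical helper, it matches on the entry label alone, and a label match does not by itself prove an entry is stale.

```shell
# boot_entries_by_label LABEL
# Reads `efibootmgr` output on stdin and prints the four-digit hex number of
# every BootXXXX entry whose label starts with LABEL.
boot_entries_by_label() {
  sed -n "s/^Boot\([0-9A-Fa-f]\{4\}\)[* ] $1.*/\1/p"
}
```

For example, `efibootmgr | boot_entries_by_label proxmox` lists candidates. Check each one with `efibootmgr -v` to confirm it references the old ESP before deleting anything.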

Part 3: Run rebuild.md

At this point the procedure is identical to a bare-metal rebuild. Follow the Proxfold rebuild runbook from "Boot the installer" through to "Verify guests started". Key differences to watch for:

The auto-install answer's filesystem section should already specify zfs (RAID1) across both new drives — the template default is a mirror, but verify after rendering
The installer will find both new SSDs matching the ID_BUS = "ata" filter and create rpool as a mirror across them
Stash import, NFS export, NUT config, GPU driver install are all handled by the proxmox / nfs / nut / nvidia roles — no manual post-install work beyond what rebuild.md covers
Guest restore targets local-zfs (the default storage pool on a fresh ZFS install), not local-lvm. Verified in rebuild.md's restore block.

Hold points inside rebuild.md

Hold	Where in rebuild.md	Recovery
H4	After "First-boot sanity check" — fresh PVE, no Ansible yet	Roll back: power off, re-insert 840 PRO, boot old PVE
H5	After `proxmox-host.yml` completes	Host fully configured, no guests. Safe to walk away; re-running Ansible is idempotent
H6	After vzdump restore, before guest-specific playbooks	Rollback window effectively closed — guests are on new storage
H7	After guest playbooks	Only loose ends are verification (Part 4 below)
H8	Verification complete	Keep 840 PRO in the drawer as archival rollback for ≥1 week

Part 4: Post-swap verification (boot mirror specific)

These are the checks that matter because it was a mirror install, not generic "is the system up" checks (those are in rebuild.md's verification section).

4.1 Boot mirror health

zpool status rpool

Expected:

state: ONLINE
mirror-0 section listing both drives, both ONLINE
errors: No known data errors
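The three expectations can be folded into a single pass/fail check. A minimal sketch, assuming the usual `zpool status` layout in which the member devices are listed directly under `mirror-0` and the config block ends with a blank line; `mirror_healthy` is a hypothetical name.

```shell
# mirror_healthy: reads `zpool status rpool` output on stdin.
# Succeeds only if the pool state is ONLINE, mirror-0 lists exactly two
# ONLINE members, and the errors line reports no known data errors.
mirror_healthy() {
  awk '
    $1 == "state:" && $2 == "ONLINE"   { state_ok = 1 }
    /^errors: No known data errors/    { errors_ok = 1 }
    in_mirror && NF == 0               { in_mirror = 0 }
    in_mirror && $2 == "ONLINE"        { members++ }
    $1 == "mirror-0"                   { in_mirror = 1 }
    END { exit !(state_ok && errors_ok && members == 2) }
  '
}
```

Used as `zpool status rpool | mirror_healthy && echo "boot mirror OK"`. A degraded mirror, a single-disk pool, or any reported data error makes it exit non-zero.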

4.2 Both ESPs registered

Proxmox uses proxmox-boot-tool to keep the EFI System Partition on both drives in sync. Kernel updates write to both automatically — but only if both ESPs are registered.

proxmox-boot-tool status

Expected: two lines, one per drive, each showing a UUID + "configured with: uefi".
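Counting the matching lines turns this into a scriptable check. The sketch assumes each registered ESP produces one line containing the text `configured with: uefi`, as described above; `esp_count` and `both_esps_registered` are hypothetical names.

```shell
# esp_count: reads `proxmox-boot-tool status` output on stdin and prints the
# number of ESPs reported as configured for UEFI boot.
esp_count() {
  grep -c 'configured with: uefi'
}

# both_esps_registered: succeeds only when exactly two ESPs are reported.
both_esps_registered() {
  [ "$(esp_count)" -eq 2 ]
}
```

For example: `proxmox-boot-tool status | both_esps_registered || echo "ESP missing"`.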

4.3 Kernel pin survived

proxmox-boot-tool kernel list

Expected: 6.14.11-6-pve under "Pinned kernel". If it isn't pinned, the proxmox role will re-pin on the next run, but it's faster to confirm here.

4.4 Media pipeline sanity

One spot-check beyond the role/rebuild validation: play a Plex title and force a transcode. (hw) should appear in the Plex dashboard, and nvidia-smi on proxfold should show the encoder process. This catches the "Nvidia major numbers shifted" class of failure, which the role checks for but cannot guarantee end-to-end.

Appendix: Replacing a failed boot mirror drive

Ongoing procedure — not part of the initial swap, but lives here because it's mirror-specific.

Identify the failed drive: zpool status rpool
Physically replace the failed drive (front bay, hot-swap caddy)

Copy the partition table from the surviving drive:

sgdisk /dev/disk/by-id/<surviving-drive> -R /dev/disk/by-id/<new-drive>
sgdisk -G /dev/disk/by-id/<new-drive>

Replace in the ZFS pool:

zpool replace rpool /dev/disk/by-id/<old-drive-partN> /dev/disk/by-id/<new-drive-partN>

Wait for resilver to complete: zpool status rpool

Initialise EFI on the new drive so proxmox-boot-tool keeps it in sync:

proxmox-boot-tool format /dev/<new-drive-efi-partition>
proxmox-boot-tool init /dev/<new-drive-efi-partition>
proxmox-boot-tool refresh
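The "wait for resilver" step in the procedure above can be polled rather than re-run by hand. A minimal sketch, assuming `zpool status` prints `resilver in progress` on its scan line while the rebuild is running; `resilver_done` and `wait_for_resilver` are hypothetical names.

```shell
# resilver_done: reads `zpool status` output on stdin and succeeds once no
# resilver is reported as in progress.
resilver_done() {
  ! grep -q 'resilver in progress'
}

# wait_for_resilver POOL: polls every 30 s, then prints the final status.
wait_for_resilver() {
  until zpool status "$1" | resilver_done; do
    sleep 30
  done
  zpool status "$1"
}
```

Call it as `wait_for_resilver rpool` before touching the new drive's EFI partition. On OpenZFS 2.x, `zpool wait -t resilver rpool` should achieve the same without a loop.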