Backup & Restore

Architecture (post-Phase 5A + 5E)

Two scheduled jobs, both targeting PBS CT 105 (nas-primary datastore on the QNAP TS-269L via NFSv3):

| Job | Captures | Schedule | Mechanism | Retention |
|---|---|---|---|---|
| `pbs-daily` | All guests (LXC + VM, all `/etc/pve/lxc/` and `/etc/pve/qemu-server/`) | 02:00 daily | `vzdump` → PBS via PVE storage `nas-primary` | global prune-job: 7d / 4w / 6m |
| `pbs-host-backup.timer` | proxfold host config (`/etc`, `/root`, `/var/lib/pve-cluster`) | 02:30 daily | `proxmox-backup-client` systemd timer (separate user, namespace `host/proxfold`) | namespace-scoped prune-job: 14d / 8w / 12m / 2y |

The original CIFS path (nasbackup, mounted at /mnt/pve/nasbackup) is still registered as PVE storage but no scheduled job writes to it. It's kept for ad-hoc vzdump --storage nasbackup pushes (e.g. pre-decom snapshots, see VM 102 retirement 2026-05-03).

What ZFS data is not backed up

LXC ZFS mount points are excluded from vzdump — /stash media on CT 100 isn't captured by pbs-daily, only the OS + app configs. The host's data pool (stash, including all media) is not backed up anywhere; that's accepted risk on a homelab media server. The 4C ZFS boot mirror covers physical-disk failure for rpool; pbs-host-backup.timer covers config-level corruption on the host OS.

Manual backup (ad-hoc)

# All guests to PBS (matches the scheduled job, useful pre-change)
vzdump 100 101 104 105 106 107 108 109 --storage nas-primary --mode snapshot \
  --notes-template "Manual backup - {{guestname}}"

# Single guest to CIFS (pre-decom or break-glass)
vzdump 102 --storage nasbackup --compress zstd --mode snapshot \
  --notes-template "Pre-decom snapshot"

Verify backups

# PBS-side guest snapshots
ssh root@192.168.1.246 'proxmox-backup-manager task list --limit 10'
ssh root@192.168.1.246 'proxmox-backup-client snapshot list ct/<id>'

# Host-backup snapshots (Phase 5E)
ssh root@192.168.1.246 'proxmox-backup-client snapshot list --ns host/proxfold host/proxfold'

# CIFS dump dir
ls -lh /mnt/pve/nasbackup/dump/
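
Listing snapshots shows that backups exist, not that last night's job ran. A minimal sketch of a freshness check follows, assuming GNU date and an ISO-8601 UTC timestamp like the one in a PBS volid (e.g. 2026-05-06T16:00:00Z). The function name and the 26-hour threshold in the usage line are illustrative and not part of the deployed tooling.

```shell
# Hypothetical helper: succeed if a snapshot timestamp is recent enough.
# Usage: snapshot_is_fresh <iso8601-utc> <max-age-hours> [now-epoch]
snapshot_is_fresh() {
  local ts="$1" max_hours="$2" now="${3:-$(date -u +%s)}"
  local snap_epoch age
  # GNU date parses ISO-8601 timestamps with a trailing Z as UTC.
  snap_epoch=$(date -u -d "$ts" +%s) || return 2
  age=$(( now - snap_epoch ))
  # Fresh means: not in the future, and no older than the allowed window.
  [ "$age" -ge 0 ] && [ "$age" -le $(( max_hours * 3600 )) ]
}
```

For a daily 02:00 job, something like `snapshot_is_fresh 2026-05-06T16:00:00Z 26 || echo "stale"` leaves a couple of hours of slack before flagging a missed run.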

Troubleshooting

Datastore not mounted (backups failing)

Signature: every backup fails and the nas-primary storage shows inactive in PVE, despite the QNAP being online (nasbackup CIFS still mounts fine on the host). The tell is pvesm status / the PBS journal repeating:

unable to open chunk store at "/mnt/pbs-datastore/.chunks" - No such file or directory (os error 2)

This means CT 105's NFS datastore mount is gone: the directory exists but nothing is mounted on it, so PBS sees no .chunks. Root cause history: the in-container mount used x-systemd.automount, which is unsupported inside an LXC, so the datastore never re-mounted after the 2026-06-19 power-cycle. Backups then failed silently until a manual remount the next morning: the last good backup was 2026-06-19 02:06 and the 2026-06-20 02:00 job errored (see the pbs role gotcha).

Immediate fix (restores backups now):

# From proxfold — remount inside CT 105
pct exec 105 -- mount /mnt/pbs-datastore
# Verify
pct exec 105 -- bash -c 'mountpoint -q /mnt/pbs-datastore && ls /mnt/pbs-datastore/.chunks >/dev/null && echo OK'
pvesm status -storage nas-primary        # should flip to "active"

Then confirm end-to-end with a small backup, e.g. vzdump 106 --storage nas-primary --mode snapshot.
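
To script the "should flip to active" check rather than eyeball it, something like the following could work. It is a sketch that assumes `pvesm status` prints the storage name in column 1 and its state in column 3; the function name is hypothetical.

```shell
# Hypothetical helper: succeed only if the named PVE storage reports "active".
# Assumed pvesm status columns: Name Type Status Total Used Available %
storage_is_active() {
  local storage="$1"
  pvesm status --storage "$storage" 2>/dev/null \
    | awk -v s="$storage" '$1 == s && $3 == "active" { found = 1 } END { exit !found }'
}
```

Run on proxfold, `storage_is_active nas-primary && echo "nas-primary is active"` gives a yes/no answer that can gate the follow-up test backup.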

Durable fix (in place since 2026-06-20): pbs_nfs_options is now a plain _netdev,nofail,nfsvers=3 mount (no automount), and a datastore healthcheck timer (pbs-datastore-healthcheck.timer, every 15 min) posts to the #homelab-ops Discord webhook if the mount or its .chunks dir is missing — so a future miss is loud, not silent. Check it with:

pct exec 105 -- systemctl list-timers pbs-datastore-healthcheck.timer --no-pager
pct exec 105 -- /usr/local/sbin/pbs-datastore-healthcheck.sh; echo "rc=$?"   # rc=0 = healthy
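
The deployed healthcheck script is not reproduced on this page. As an illustration of the check it is described as performing (mount present, .chunks visible, alert otherwise), a minimal sketch might look like the following. The function names, the `WEBHOOK_URL` variable, and the message text are assumptions; only the datastore path and the two conditions come from the description above.

```shell
# Sketch only: not the deployed pbs-datastore-healthcheck.sh.
DATASTORE="${DATASTORE:-/mnt/pbs-datastore}"

# Return 0 when the datastore is mounted and its chunk store is visible.
datastore_healthy() {
  mountpoint -q "$1" && [ -d "$1/.chunks" ]
}

# Post a short alert to a Discord webhook (WEBHOOK_URL is a placeholder).
notify() {
  curl -fsS -H 'Content-Type: application/json' \
    -d "{\"content\": \"$1\"}" "$WEBHOOK_URL" >/dev/null
}

# Return code follows the rc=0 = healthy convention.
run_healthcheck() {
  if datastore_healthy "$DATASTORE"; then
    return 0
  fi
  notify "PBS datastore $DATASTORE is not mounted or .chunks is missing"
  return 1
}
```

Checking both conditions matters because an unmounted directory still exists, so a bare directory test would pass while backups fail.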

Restore (guest)

List available snapshots

PBS UI at https://192.168.1.246:8007 (datastore nas-primary) is the easiest path. CLI equivalents:

# From proxfold (uses the registered PVE storage)
pvesm list nas-primary
# Or directly from PBS
ssh root@192.168.1.246 'proxmox-backup-client snapshot list ct/100'

Restore an LXC container from PBS

# Replace <volid> with the listing entry, e.g. nas-primary:backup/ct/100/2026-05-06T16:00:00Z
pct restore <vmid> <volid> --storage local-zfs

Restore a VM from PBS

qmrestore <volid> <vmid> --storage local-zfs

Restore from the legacy CIFS path

Only relevant for backups predating Phase 5A or pre-decom snapshots intentionally pushed to nasbackup:

ls /mnt/pve/nasbackup/dump/
pct restore <vmid> /mnt/pve/nasbackup/dump/<file>.tar.zst --storage local-zfs
qmrestore /mnt/pve/nasbackup/dump/<file>.vma.zst <vmid> --storage local-zfs
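
The archive suffix decides which tool applies: container dumps end in .tar.zst and go through pct restore, VM dumps end in .vma.zst and go through qmrestore, with the argument order swapped between the two. A small sketch that prints (rather than runs) the matching command is shown here; the function name is hypothetical and the target storage defaults to local-zfs as in the commands above.

```shell
# Hypothetical helper: print the restore command for a legacy vzdump archive.
# Usage: legacy_restore_cmd <archive-path> <vmid> [storage]
legacy_restore_cmd() {
  local archive="$1" vmid="$2" storage="${3:-local-zfs}"
  case "$archive" in
    # LXC dump: pct restore takes the vmid first, then the archive.
    *.tar.zst) echo "pct restore $vmid $archive --storage $storage" ;;
    # VM dump: qmrestore takes the archive first, then the vmid.
    *.vma.zst) echo "qmrestore $archive $vmid --storage $storage" ;;
    *) echo "unrecognised archive type: $archive" >&2; return 1 ;;
  esac
}
```

Printing the command first makes it easy to review the vmid and storage before anything is overwritten.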

Note

local-zfs has been the boot-drive-backed storage pool on proxfold since Phase 4C (2026-04-22). For historical restores taken against the pre-4C single-drive LVM install, pass --storage local-zfs anyway: the option overrides the storage recorded in the backup, so the guest's disks are created on local-zfs instead of the original LVM storage.

Re-add ZFS mount points to Plex LXC after restore

The vzdump backup does not include Proxmox-level mount point configuration — these must be re-added manually after restoring the Plex LXC:

pct set 100 -mp0 /stash,mp=/stash
pct set 100 -mp1 /stash/plex-data,mp=/stash/plex-data
# Quote the value: an unquoted ";" ends the command and the shell would try to run "cifs"
pct set 100 -features 'mount=nfs;cifs'

Note

The stash/plex-data ZFS dataset (100G quota) must also exist with correct ownership (999:996). This is codified by the plex role — ansible-playbook playbooks/plex.yml creates the dataset via delegate_to: proxfold and sets the mount points. Only re-add mp0/mp1 manually if the role is unavailable (cold-start before CT104 restore).
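
Whether the mount points were re-added by hand or by the role, it can be worth confirming they landed in the container config before starting Plex. A sketch follows, assuming `pct config` prints one `key: value` line per option (e.g. `mp0: /stash,mp=/stash`); the function name is hypothetical.

```shell
# Hypothetical helper: succeed if CT <vmid> has mount point <key> targeting <path>.
# Usage: ct_has_mountpoint 100 mp0 /stash
ct_has_mountpoint() {
  local vmid="$1" key="$2" path="$3"
  # Match "mp=<path>" followed by a comma or end of line, so /stash does not
  # accidentally match /stash/plex-data.
  pct config "$vmid" | grep -Eq "^${key}: .*mp=${path}(,|\$)"
}
```

A call such as `ct_has_mountpoint 100 mp1 /stash/plex-data || echo "mp1 missing"` catches a half-applied restore early.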

Start and verify after restore

# Start containers and VMs
pct start 100
qm start 101
qm start 102
pct start 104

# Verify Plex can see media
pct exec 100 -- ls /mnt/plex/Movies/ | head -5

# Verify arrstack NFS mount
qm guest exec 101 -- df -h /stash

# Verify Docker containers are running
qm guest exec 101 -- docker ps

Restart services

# Plex LXC
pct reboot 100

# Arrstack VM
qm reboot 101

# Individual Docker containers (from inside arrstack)
docker restart sonarr radarr qbittorrent

# Redeploy full stack — push to GitHub, Dockhand picks up the change

Host-level file backup (Phase 5E)

Daily file-level backup of proxfold host config to PBS via proxmox-backup-client. Closes the gap that vzdump leaves: guests are captured by pbs-daily, but the host's own /etc, /root, and /var/lib/pve-cluster are not.

Architecture summary

Client side — roles/proxmox/tasks/host_backup.yml renders a credentials env file, a wrapper script, a oneshot systemd service, and a daily timer. Activated when pbs_host_backup is defined in host_vars and vault_pbs_host_token_secret exists in vault.
PBS side — separate user host-backup@pbs with token host-backup@pbs!proxfold, scoped via ACL to namespace host/proxfold on the nas-primary datastore. Namespace-scoped prune-job applies its own (longer) retention.
Schedule — 02:30 daily. Vzdump kickoff is 02:00, recent runs complete in ~5 min, PBS prune is 03:00. 02:30 is clear of both.
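
The wrapper script itself is rendered by the role and is not reproduced on this page. As a rough illustration of what a wrapper for this job might do, the sketch below assembles a proxmox-backup-client backup call for the three host paths. The archive name root.pxar, the function name, and the PBS address in the repository string are assumptions (etc.pxar and pve-cluster.pxar are the names used elsewhere on this page); the token secret is expected to arrive in PBS_PASSWORD from the credentials env file.

```shell
# Sketch only: the real wrapper is rendered by roles/proxmox/tasks/host_backup.yml.
# PBS_REPOSITORY format is <auth-id>@<host>:<datastore>; the host IP is assumed.
PBS_REPOSITORY="${PBS_REPOSITORY:-host-backup@pbs!proxfold@192.168.1.246:nas-primary}"
export PBS_REPOSITORY

# Print the backup command; drop the leading "echo" to run it for real.
host_backup_cmd() {
  echo proxmox-backup-client backup \
    etc.pxar:/etc \
    root.pxar:/root \
    pve-cluster.pxar:/var/lib/pve-cluster \
    --ns host/proxfold
}
```

Each `name.pxar:/path` pair becomes one archive in the snapshot, which is why the restore steps later on this page refer to archives by name.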

Bootstrap (one-shot, manual)

Run once after the homelab-ansible code lands but before the role activates. Steps (1)–(5) on PBS CT 105 (192.168.1.246), (6)–(8) on the WSL/CT104 control node.

# (1) On PBS CT 105 — create user
ssh root@192.168.1.246
proxmox-backup-manager user create host-backup@pbs --comment 'proxfold host file-level backup'
# Set a throwaway password when prompted; real auth is the token below.

# (2) Generate the API token (value shown ONCE)
proxmox-backup-manager user generate-token host-backup@pbs proxfold
# Capture the `value` field IMMEDIATELY — there is no retrieval path.
# Do NOT echo it to scrollback. Pipe to /dev/shm or copy directly into the
# vault-append script. See feedback memory: never_view_vault_to_scrollback.

# (3) Create the namespace
proxmox-backup-client namespace create host/proxfold \
  --repository host-backup@pbs!proxfold@127.0.0.1:nas-primary
# Will prompt for the token value — paste the same one captured in (2).

# (4) Grant DatastoreBackup on the namespace to BOTH the user and the token
# auth-ids. The "token inherits from user" pattern documented in older versions
# of pbs.md is wrong — see [pbs role doc](../ansible/roles/pbs.md#gotchas-captured-during-execution).
# Without the token grant, the timer fails with "missing permissions
# 'Datastore.Backup'" even though `user permissions host-backup@pbs` looks fine.
proxmox-backup-manager acl update /datastore/nas-primary/host/proxfold \
  DatastoreBackup --auth-id host-backup@pbs
proxmox-backup-manager acl update /datastore/nas-primary/host/proxfold \
  DatastoreBackup --auth-id 'host-backup@pbs!proxfold'

# Verify BOTH resolve to a non-empty dict (an empty {} means the grant is missing):
proxmox-backup-manager user permissions host-backup@pbs \
  --path /datastore/nas-primary/host/proxfold --output-format json
proxmox-backup-manager user permissions 'host-backup@pbs!proxfold' \
  --path /datastore/nas-primary/host/proxfold --output-format json

# (5) Namespace-scoped prune-job — longer retention than the global pbs-daily prune.
# Patterned after the official PBS prune example, tuned for homelab.
proxmox-backup-manager prune-job create nas-primary-host-prune \
  --store nas-primary --ns host/proxfold --schedule '03:15' \
  --keep-daily 14 --keep-weekly 8 --keep-monthly 12 --keep-yearly 2

Then on the control node:

# (6) Append vault entry — append-only pattern, no plaintext to scrollback.
# The token value from step (2) goes into vault_pbs_host_token_secret.
cd ~/homelab-ansible
TMP=$(mktemp -p /dev/shm vault-edit.XXXXXX)
ansible-vault decrypt --output "$TMP" group_vars/all/vault.yml
printf '\nvault_pbs_host_token_secret: %s\n' '<paste-token-value-here>' >> "$TMP"
ansible-vault encrypt --output group_vars/all/vault.yml "$TMP"
shred -u "$TMP"
# Verify the encrypt round-tripped (decrypts cleanly without dumping plaintext):
ansible-vault view group_vars/all/vault.yml > /dev/null && echo "vault decrypts OK"
# Don't echo the variable's value. Activation in step (7) confirms the entry
# parsed correctly: pre-step (6) the host_backup include is gated off; post-(6)
# `--check --diff` shows the new tasks evaluating against proxfold.

# (7) First role run — only the host_backup tasks
ansible-playbook playbooks/site.yml --limit proxfold --tags host_backup --diff

# (8) Smoke test the timer
ssh root@192.168.1.250 'systemctl start pbs-host-backup.service'
ssh root@192.168.1.250 'systemctl status pbs-host-backup.service'
# Verify the snapshot landed:
ssh root@192.168.1.246 \
  'proxmox-backup-client snapshot list --ns host/proxfold host/proxfold'

Restore (host configs)

# List available snapshots
ssh root@192.168.1.250
proxmox-backup-client snapshot list --ns host/proxfold host/proxfold

# Mount a snapshot read-only (FUSE)
proxmox-backup-client mount host/proxfold/<snapshot> etc.pxar /mnt/restore --ns host/proxfold

# Selective restore — DO NOT blanket-copy /etc; pick the specific files
cp /mnt/restore/network/interfaces /etc/network/interfaces
# Files under /etc/pve (e.g. storage.cfg) are not in etc.pxar because pmxcfs is a
# separate mount; recover those from pve-cluster.pxar instead.

# Unmount when done
fusermount -u /mnt/restore

Host backups are not bare-metal restore

These backups capture files, not a bootable image. Recovering a dead proxfold means: reinstall PVE, run the homelab-ansible bootstrap (rebuild runbook), then selectively restore from the snapshot. The 4C ZFS boot mirror is the bare-metal protection; this is the config-level protection.

What's actually in the snapshot

proxmox-backup-client doesn't traverse mount points by default, so /etc/pve (the pmxcfs FUSE mount) is not captured by etc.pxar. That's fine — /etc/pve is a synthesised view; the source of truth is /var/lib/pve-cluster/config.db, captured by pve-cluster.pxar. Restore path: install PVE on a fresh host, restore pve-cluster.pxar to /var/lib/pve-cluster, restart pve-cluster.service, and /etc/pve repopulates from the DB. If you ever want the live FUSE view captured directly, add --all-file-systems to the wrapper script — but that's redundant given the DB is the source of truth.
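
The restore path described above could be scripted roughly as follows. This is a dry-run sketch that only prints the commands. The function name is hypothetical, the snapshot argument is whatever snapshot list reports, and stopping pve-cluster before writing into /var/lib/pve-cluster (rather than a single restart afterwards) is an added precaution that is not stated above.

```shell
# Dry-run sketch: print the steps to rebuild /etc/pve from pve-cluster.pxar.
# Usage: pve_cluster_restore_plan host/proxfold/<snapshot>
pve_cluster_restore_plan() {
  local snapshot="${1:-}"
  [ -n "$snapshot" ] || { echo "usage: pve_cluster_restore_plan <snapshot>" >&2; return 1; }
  # Stop pmxcfs, restore the database directory, then start pmxcfs again.
  cat <<EOF
systemctl stop pve-cluster.service
proxmox-backup-client restore $snapshot pve-cluster.pxar /var/lib/pve-cluster --ns host/proxfold
systemctl start pve-cluster.service
EOF
}
```

Reviewing the printed plan before running it is deliberate: restoring into /var/lib/pve-cluster replaces the database that /etc/pve is generated from.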

Lessons from the 2026-05-06 run

proxmox-backup-manager user generate-token doesn't accept --output-format. The CLI rejects it with "schema does not allow additional properties". Use proxmox-backup-debug api create /access/users/<userid>/token/<name> --output-format json instead — same effect, returns the token value as JSON for clean parsing.
proxmox-backup-manager namespace doesn't exist (caught earlier in the housemate-access runbook too — same memory). Create namespaces via proxmox-backup-debug api create /admin/datastore/<ds>/namespace --name <leaf> [--parent <p>]. Top-level first, then nest. The CLI panics in text_table.rs on the empty-result render after success — cosmetic, the operation succeeded.
ACL needs to land on BOTH user AND token auth-ids — see step (4) above. Caused a "missing permissions 'Datastore.Backup'" failure on the first timer fire even though user permissions host-backup@pbs looked correct. Revealed a latent bug in the pbs role (only granting on user); patched same cycle — Ensure datastore ACLs for PBS client TOKEN auth-id task added, gated on vault_pbs_token_id is defined, idempotent against the live state. The host-backup bootstrap above doesn't use the role (it's a separate user/token), so step (4) keeps the manual two-grant pattern.

Quarterly drill

The verify-job (Phase 5A, Sun 04:00) checks chunk integrity but not that the documented restore path works. Once a quarter, mount the latest host/proxfold snapshot to /tmp/restore-test and diff a known-stable file (e.g. /etc/network/interfaces) against the live host. Avoid /etc/pve paths such as storage.cfg for this check, since etc.pxar does not capture the pmxcfs mount. The drill surfaces silent regressions in the snapshot pipeline.
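
The comparison step of the drill can be sketched as a small function that diffs a list of files between the mounted snapshot and the live tree. The function name is hypothetical; it assumes the etc.pxar archive is already mounted, so paths are given relative to /etc.

```shell
# Hypothetical drill helper: diff files between a mounted snapshot and a live tree.
# Usage: drill_diff <mounted-etc-root> <live-etc-root> <relative-path>...
# Returns 0 if every file matches, 1 if any file differs or is missing.
drill_diff() {
  local restored="$1" live="$2" rc=0 f
  shift 2
  for f in "$@"; do
    if diff -q "$restored/$f" "$live/$f" >/dev/null 2>&1; then
      echo "OK    $f"
    else
      echo "DIFF  $f"
      rc=1
    fi
  done
  return "$rc"
}
```

With the snapshot mounted, `drill_diff /tmp/restore-test /etc network/interfaces` prints one line per file and returns non-zero if anything drifted, which makes the result easy to record in the drill log.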

Config file locations

Proxmox stores guest configurations at:

/etc/pve/lxc/ — LXC container configs
/etc/pve/qemu-server/ — VM configs

Docker Compose is managed via Dockhand (Git-backed, from the homelab-ansible repo stacks/arrstack/). App configs are under /opt/mediaserver/ on the arrstack VM.