Alerting¶

This page covers the active alert surfaces across the homelab and the gaps that are deliberate. Notifications fan out across several Discord channels by signal class, that is, by what the reader needs to do, not by source system or subject matter.

Channel taxonomy¶

Channel	Signal class	Question it answers	Vault webhook var
`#homelab-ops`	Reactive operational events	"Something happened to a running system" — backups succeeded / failed, Beszel thresholds breached, ZED pool events, PVE notifications, pending-reboot nags	`vault_discord_webhook_homelab_ops`
`#homelab-drift`	Declared state ≠ actual state	"Ansible says config diverged" — daily drift detection runs	dedicated drift webhook (separate from ops)
`#homelab-updates`	Pending updates	"Upstream moved, review when you have time" — Dockhand image-update + CVE scans, matrix-deploy fetch	`vault_discord_webhook_homelab_updates` (Dockhand uses its own UI-configured webhook on the same channel)
Service channels (`#mediabot`, `#seerr`, `#sonarr`, `#radarr`)	Service-specific runtime	"This service is speaking for itself" — per-app events	configured in each app, not in Ansible vault

Naming principle: channels are named for the signal class they carry, not the tool that emits the signal. `#dockhand` was renamed to `#homelab-updates` on 2026-05-23 specifically so the bucket can absorb every "go look at this when you have time" source (Dockhand updates, matrix-deploy fetches, future kernel-staleness pings, etc.) without the name going stale.
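The routing rule above can be sketched as a small lookup. This is an illustration only: the function name `channel_for_signal` and the class keywords `ops`, `drift`, and `updates` are hypothetical labels, not identifiers from the playbooks.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a signal class to the channel that carries it.
# The keywords (ops, drift, updates) are illustrative, not taken from Ansible.
channel_for_signal() {
  case "$1" in
    ops)     echo '#homelab-ops' ;;      # something happened to a running system
    drift)   echo '#homelab-drift' ;;    # declared state differs from actual state
    updates) echo '#homelab-updates' ;;  # upstream moved, review when convenient
    *)       echo "unknown signal class: $1" >&2; return 1 ;;
  esac
}
```

A new emitter picks its channel by asking which question it answers, so `channel_for_signal updates` is the right call for a future kernel-staleness ping, whichever tool produces it.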

Notification sources¶

Source	Trigger	Channel	Transport
Beszel hub	Per-system CPU / memory / disk / offline thresholds	`#homelab-ops`	Discord via shoutrrr (`discord://token@id`)
PVE 9 notification target	Proxmox events (backup jobs, cluster, HA, etc.)	`#homelab-ops`	Discord webhook endpoint `discord-ops` + match-all matcher `ops-all`
ZED webhook	ZFS `statechange`, `scrub_finish`, `resilver_finish` on proxfold	`#homelab-ops`	Direct curl to Discord webhook
auto_updates reboot-required notifier	`/var/run/reboot-required` exists on a host that opted in (proxfold, arrstack, matrix, n8n)	`#homelab-ops`	Systemd timer + curl, daily 09:00
Ansible drift-detection	Daily `--check --diff` run on CT 104; reports non-zero `changed` count	`#homelab-drift`	Drift script + curl
Dockhand (controller)	Image-update detection + vulnerability scans on managed stacks	`#homelab-updates`	Dockhand UI webhook config (not in Ansible vault)
matrix_deploy_notifier	First Monday of month: pending upstream commits in `/root/matrix-deploy/`	`#homelab-updates`	Systemd timer + curl, monthly
PBS — mail-to-root	Backup / verify / GC failures	local mail	Local mail (secondary to PVE notification target)
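The drift-detection row reports a non-zero `changed` count. One way to derive that number is to sum the `changed=` fields in the play recap, as in the minimal sketch below. The function name `count_changed` is hypothetical, and this is not necessarily how the drift script on CT 104 does it.

```shell
#!/usr/bin/env bash
# Sum the changed= counters from an ansible-playbook recap read on stdin.
# Only lines after "PLAY RECAP" are considered, so diff output that happens
# to contain "changed=" earlier in the run is ignored.
count_changed() {
  awk '
    /^PLAY RECAP/ { in_recap = 1; next }
    in_recap && match($0, /changed=[0-9]+/) {
      # "changed=" is 8 characters; the digits follow it.
      total += substr($0, RSTART + 8, RLENGTH - 8)
    }
    END { print total + 0 }
  '
}
```

A wrapper could pipe `ansible-playbook <playbook> --check --diff` into `count_changed` and post to the drift webhook only when the result is greater than zero.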

Beszel alert set (locked 2026-04-25, agent fleet refreshed 2026-04-28)¶

Configured in the Beszel hub UI, not codified in Ansible. Applied via the per-system bell icon or "All Systems" for bulk.

The agent fleet has grown since the original lock — n8n joined during Phase 5C (2026-04-28) and vintage CT 201 was added the same week. Current registered agents: proxfold, arrstack, nginx, plex, pbs, control, n8n, vintage (8 total). Apply the thresholds below to new agents on registration.

Alert	Threshold	Window	Applied to
Status (offline)	—	2 min	all agents
CPU	90 %	10 min	proxfold, plex
CPU	85 %	10 min	arrstack, nginx, pbs, n8n, vintage, control
Memory	90 %	15 min	proxfold
Memory	85 %	10 min	arrstack, nginx, plex, pbs, n8n, vintage, control
Disk	85 %	5 min	all agents (on proxfold covers `/` and `/stash` via `beszel_agent_extra_filesystems`)

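Proxfold runs a higher memory threshold and a longer window because the ZFS ARC counts toward "used" memory even though it is cache that shrinks under memory pressure. The 2-minute Status window survives a single missed poll.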
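To see how much of the "used" figure on proxfold is ARC, the arithmetic can be sketched as below. It assumes "used" means total minus available memory, which is an assumption about how the agent reports memory and is not confirmed here. The function names are hypothetical; the `size` row is the ARC size in bytes as exposed in `/proc/spl/kstat/zfs/arcstats`.

```shell
#!/usr/bin/env bash
# Read arcstats-formatted text on stdin and print the ARC size in bytes.
arc_size_bytes() {
  awk '$1 == "size" { print $3; exit }'
}

# Print two integer percentages: raw used, and used with the ARC subtracted.
# Args: total_kib available_kib arc_bytes (the first two as in /proc/meminfo).
used_pct() {
  local total_kib=$1 avail_kib=$2 arc_kib=$(( $3 / 1024 ))
  local raw=$(( (total_kib - avail_kib) * 100 / total_kib ))
  local adj=$(( (total_kib - avail_kib - arc_kib) * 100 / total_kib ))
  echo "$raw $adj"
}
```

With made-up figures of 64 GiB total, 8 GiB available, and a 32 GiB ARC, `used_pct 67108864 8388608 34359738368` prints `87 37`: a host that looks close to an 85 % threshold is at 37 % once the cache is excluded. On the host itself, the ARC size would come from `arc_size_bytes < /proc/spl/kstat/zfs/arcstats`.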

Temperature alerting is deliberately unwired¶

Locked decision — 2026-04-25

Thermals on proxfold are a spot-check, not an alerted surface. Do not re-propose Beszel temperature alerts or a standalone GPU watchdog without reopening the decision.

Why:

Beszel's temperature alert uses a single global threshold across all hwmon + nvidia-smi sensors. It cannot distinguish CPU package from GPU (see henrygd/beszel#1111).
On proxfold (Dell R430, Xeon E5-2680 v4), the CPU package idles at ~74 °C against a 90 °C TJunction. Any threshold tight enough to flag a T400 GPU excursion (target 83 °C, max-op 93 °C) collides with normal CPU workload temps.
Spinning drive temperatures are SMART-only and not covered by Beszel's temperature alert at all.
ZED already covers pool-level drive faults; hardware failsafes (GPU slowdown 98 °C, CPU throttle 100 °C) are the backstop.
Adding a standalone nvidia-smi watchdog purely for the T400 was judged inelegant relative to its value.

Spot-check path: `ssh root@proxfold 'sensors ; nvidia-smi -q -d TEMPERATURE'`.

Alert types intentionally skipped¶

Temperature, Load Avg, Disk I/O, and Bandwidth alerts are not enabled on any host. Load Avg duplicates the CPU signal; Disk I/O and Bandwidth are too noisy to be actionable in a homelab; Temperature is unwired for the reasons given in the previous section.

ZFS pool coverage¶

Beszel does not natively monitor zpool health (henrygd/beszel#1541). Coverage is split:

Pool capacity: Beszel Disk alert via `beszel_agent_extra_filesystems: ["/stash__stash"]` on proxfold (`/` catches rpool on the boot mirror).
Pool state changes and scrub/resilver completion: ZED webhook (`/etc/zfs/zed.d/{statechange,scrub_finish,resilver_finish}-discord.sh`) on proxfold. See Role: zfs.
Pool I/O: not covered. Beszel's ZFS I/O path is broken (henrygd/beszel#200); run `zpool iostat` by hand as a spot-check if needed.
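For orientation, a ZED hook of the kind listed above might look roughly like the sketch below. It is not the script shipped by the zfs role. `DISCORD_WEBHOOK_URL` and the function names are hypothetical; the `ZEVENT_*` variables are the ones ZED exports to zedlets, and the `{"content": ...}` body is the basic Discord webhook payload. The escaping handles only backslashes, double quotes, and newlines.

```shell
#!/usr/bin/env bash
# Escape backslashes, double quotes and newlines for use in a JSON string.
json_escape() {
  local s=$1
  s=${s//\\/\\\\}
  s=${s//\"/\\\"}
  s=${s//$'\n'/\\n}
  printf '%s' "$s"
}

# Build a one-line message from the event variables ZED sets for a zedlet.
zed_message() {
  local pool=${ZEVENT_POOL:-unknown} event=${ZEVENT_SUBCLASS:-unknown}
  local msg="ZFS ${event} on pool ${pool}"
  if [ -n "${ZEVENT_VDEV_STATE_STR:-}" ]; then
    msg+=" (vdev state: ${ZEVENT_VDEV_STATE_STR})"
  fi
  printf '%s' "$msg"
}

# Wrap a message in the JSON body a Discord webhook accepts.
discord_payload() {
  printf '{"content":"%s"}' "$(json_escape "$1")"
}

# Post the event; a no-op when the (hypothetical) webhook variable is unset.
notify() {
  [ -n "${DISCORD_WEBHOOK_URL:-}" ] || return 0
  curl -fsS -H 'Content-Type: application/json' \
    -d "$(discord_payload "$(zed_message)")" "$DISCORD_WEBHOOK_URL"
}

notify
```

Keeping message construction separate from the `curl` call means the same script body can serve all three event classes, with only the file name deciding when ZED runs it.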

Review triggers¶

Beszel gains per-sensor temperature thresholds (tracked in henrygd/beszel#1111) → revisit temperature alerting on proxfold.
Recurring unexplained performance drops → consider Load Avg or Disk I/O alerts.
T400 swapped for a different GPU → revisit the thermal-alert tradeoff (different idle and max-op envelopes change the math).