Alerting

Active alert surfaces across the homelab and the deliberate gaps. Notifications fan out across several Discord channels by signal class — i.e. by what the reader needs to do — not by source system or subject matter.

Channel taxonomy

| Channel | Signal class | Question it answers | Vault webhook var |
| --- | --- | --- | --- |
| #homelab-ops | Reactive operational events | "Something happened to a running system": backups succeeded / failed, Beszel thresholds breached, ZED pool events, PVE notifications, pending-reboot nags | vault_discord_webhook_homelab_ops |
| #homelab-drift | Declared state ≠ actual state | "Ansible says config diverged": daily drift detection runs | dedicated drift webhook (separate from ops) |
| #homelab-updates | Pending updates | "Upstream moved, review when you have time": Dockhand image-update + CVE scans, matrix-deploy fetch | vault_discord_webhook_homelab_updates (Dockhand uses its own UI-configured webhook on the same channel) |
| Service channels (#mediabot, #seerr, #sonarr, #radarr) | Service-specific runtime | "This service is speaking for itself": per-app events | configured in each app, not in Ansible vault |

Naming principle: channels are named for the signal class they carry, not the tool that emits the signal. #dockhand was renamed to #homelab-updates on 2026-05-23 specifically so the bucket can absorb every "go look at this when you have time" source (Dockhand updates, matrix-deploy fetches, future kernel-staleness pings, etc.) without the name going stale.

Notification sources

| Source | Trigger | Channel | Transport |
| --- | --- | --- | --- |
| Beszel hub | Per-system CPU / memory / disk / offline thresholds | #homelab-ops | Discord via shoutrrr (discord://token@id) |
| PVE 9 notification target | Proxmox events (backup jobs, cluster, HA, etc.) | #homelab-ops | Discord webhook endpoint discord-ops + match-all matcher ops-all |
| ZED webhook | ZFS statechange, scrub_finish, resilver_finish on proxfold | #homelab-ops | Direct curl to Discord webhook |
| auto_updates reboot-required notifier | /var/run/reboot-required exists on a host that opted in (proxfold, arrstack, matrix, n8n) | #homelab-ops | Systemd timer + curl, daily 09:00 |
| Ansible drift-detection | Daily --check --diff run on CT 104; reports non-zero changed count | #homelab-drift | Drift script + curl |
| Dockhand (controller) | Image-update detection + vulnerability scans on managed stacks | #homelab-updates | Dockhand UI webhook config (not in Ansible vault) |
| matrix_deploy_notifier | First Monday of month: pending upstream commits in /root/matrix-deploy/ | #homelab-updates | Systemd timer + curl, monthly |
| PBS (mail-to-root) | Backup / verify / GC failures | local mail | Local mail (secondary to PVE notification target) |
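
The curl-based rows above (ZED, the reboot-required notifier, drift detection, matrix_deploy_notifier) share one shape: decide whether there is anything to report, build a JSON body, and POST it to the channel's webhook. Below is a minimal POSIX shell sketch of that shape. The function names are hypothetical and the real scripts are not reproduced here; the fixed points are the /var/run/reboot-required flag file, the changed count from the --check --diff run, and Discord's webhook format, which accepts a JSON object with a content field. That the drift script reads its count from Ansible's PLAY RECAP lines is an assumption.

```sh
#!/bin/sh
# Sketch only: function names are hypothetical, not the homelab's real scripts.

# Escape backslashes and double quotes, and turn newlines into \n, so the
# text can sit inside a JSON string. Other control characters are not handled.
json_escape() {
  printf '%s' "$1" \
    | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' \
    | awk 'NR > 1 { printf "\\n" } { printf "%s", $0 }'
}

# Build the JSON body a Discord webhook accepts: {"content":"..."}.
discord_payload() {
  printf '{"content":"%s"}' "$(json_escape "$1")"
}

# Print a message if the reboot flag file exists; return 1 otherwise.
# $1 = host label, $2 = flag file (defaults to the path named in the table).
reboot_required_message() {
  flag=${2:-/var/run/reboot-required}
  [ -e "$flag" ] || return 1
  printf 'Reboot pending on %s' "$1"
}

# Sum every changed=N field found on stdin (assumed: Ansible PLAY RECAP lines).
drift_changed_count() {
  awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^changed=[0-9]+$/) { split($i, kv, "="); total += kv[2] } } END { print total + 0 }'
}

# POST a message to a webhook. $1 = webhook URL, $2 = message text.
# -f makes curl exit non-zero on HTTP errors; -sS stays quiet except for errors.
post_to_discord() {
  curl -fsS -H 'Content-Type: application/json' -d "$(discord_payload "$2")" "$1"
}
```

A timer-driven unit could then run something like `msg=$(reboot_required_message "$(hostname)") && post_to_discord "$WEBHOOK_URL" "$msg"`, where WEBHOOK_URL is a placeholder for the vault-provided value. Keeping the "is there anything to say" check separate from the POST means a quiet host produces no traffic at all.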

Beszel alert set (locked 2026-04-25, agent fleet refreshed 2026-04-28)

Configured in the Beszel hub UI, not codified in Ansible. Applied via the per-system bell icon or "All Systems" for bulk.

The agent fleet has grown since the original lock — n8n joined during Phase 5C (2026-04-28) and vintage CT 201 was added the same week. Current registered agents: proxfold, arrstack, nginx, plex, pbs, control, n8n, vintage (8 total). Apply the thresholds below to new agents on registration.

| Alert | Threshold | Window | Applied to |
| --- | --- | --- | --- |
| Status (offline) | n/a | 2 min | all agents |
| CPU | 90 % | 10 min | proxfold, plex |
| CPU | 85 % | 10 min | arrstack, nginx, pbs, n8n, vintage, control |
| Memory | 90 % | 15 min | proxfold |
| Memory | 85 % | 10 min | arrstack, nginx, plex, pbs, n8n, vintage, control |
| Disk | 85 % | 5 min | all agents (on proxfold, covers / and /stash via beszel_agent_extra_filesystems) |

Proxfold runs a higher memory threshold over a longer window because the ZFS ARC counts toward "used" memory and inflates the reading. The 2-minute Status window tolerates a single missed poll before alerting.
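
Because the alert set lives in the Beszel UI rather than in Ansible, a small lookup can serve as a checklist when registering an agent. The helper below is hypothetical and is not something Beszel consumes; it only encodes the table above, and it assumes any agent not named explicitly gets the 85 % tier, as n8n and vintage did.

```sh
#!/bin/sh
# Hypothetical checklist helper: prints threshold/window pairs for one agent.
# Values mirror the alert table; unknown agents are ASSUMED to use the 85 % tier.
beszel_thresholds() {
  cpu=85
  mem=85
  mem_window=10
  case "$1" in
    proxfold) cpu=90; mem=90; mem_window=15 ;;  # ZFS ARC inflates "used"
    plex)     cpu=90 ;;
  esac
  printf 'cpu=%s/10m mem=%s/%sm disk=85/5m offline=2m\n' "$cpu" "$mem" "$mem_window"
}
```

Running `beszel_thresholds n8n` prints the default tier, which is what a newly registered agent would be checked against in the UI.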

Temperature alerting is deliberately unwired

Locked decision — 2026-04-25

Thermals on proxfold are a spot-check, not an alerted surface. Do not re-propose Beszel temperature alerts or a standalone GPU watchdog without reopening the decision.

Why:

  • Beszel's temperature alert uses a single global threshold across all hwmon + nvidia-smi sensors. It cannot distinguish CPU package from GPU (see henrygd/beszel#1111).
  • On proxfold (Dell R430, Xeon E5-2680 v4), the CPU package idles at ~74 °C against a 90 °C TJunction. Any threshold tight enough to flag a T400 GPU excursion (target 83 °C, max-op 93 °C) collides with normal CPU workload temps.
  • Spinning drive temperatures are SMART-only and not covered by Beszel's temperature alert at all.
  • ZED already covers pool-level drive faults; hardware failsafes (GPU slowdown 98 °C, CPU throttle 100 °C) are the backstop.
  • Adding a standalone nvidia-smi watchdog purely for the T400 was judged inelegant relative to its value.
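
The first two points can be made concrete with a toy model of a single shared threshold. This is an illustration of the tradeoff, not Beszel's implementation; the readings in the usage comments are hypothetical, while 83 °C and 90 °C are the GPU target and CPU TJunction figures quoted above.

```sh
#!/bin/sh
# Toy model: succeeds (exit 0) if any sensor reading, in whole degrees C,
# is above the one shared threshold. $1 = threshold, remaining args = readings.
shared_threshold_fires() {
  threshold=$1
  shift
  for reading in "$@"; do
    if [ "$reading" -gt "$threshold" ]; then
      return 0
    fi
  done
  return 1
}

# Threshold at the GPU target (83): a CPU at a hypothetical 86 under load
# trips it while the GPU sits at a hypothetical 60.
#   shared_threshold_fires 83 86 60   # exit status 0: alert, but not about the GPU
# Threshold at the CPU TJunction (90): a GPU at a hypothetical 89, past its
# 83 target, goes unreported.
#   shared_threshold_fires 90 74 89   # exit status 1: no alert
```

Either placement of the threshold fails one of the two sensors, which is the collision the decision rests on.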

Spot-check path: ssh root@proxfold 'sensors ; nvidia-smi -q -d TEMPERATURE'.

Alert types intentionally skipped

  • Temperature, Load Avg, Disk I/O, Bandwidth — not enabled on any host. Load Avg duplicates CPU signal; Disk I/O and Bandwidth are too noisy to be actionable in a homelab; Temperature per above.

ZFS pool coverage

Beszel does not natively monitor zpool health (henrygd/beszel#1541). Coverage is split:

  • Pool capacity — Beszel Disk alert via beszel_agent_extra_filesystems: ["/stash__stash"] on proxfold (and / catches rpool on the boot mirror).
  • Pool state changes, scrub/resilver completion — ZED webhook (/etc/zfs/zed.d/{statechange,scrub_finish,resilver_finish}-discord.sh) on proxfold. See Role: zfs.
  • Pool I/O — not covered. Beszel's ZFS I/O path is broken (henrygd/beszel#200); standalone zpool iostat spot-checks if needed.
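
ZED runs zedlets whose file names begin with the event subclass and hands them the event details as ZEVENT_* environment variables. Below is a sketch of the message-building half of such a zedlet, assuming the variables used by the stock notify zedlets (ZEVENT_SUBCLASS, ZEVENT_POOL, ZEVENT_VDEV_PATH, ZEVENT_VDEV_STATE_STR). The name zed_event_message and the message wording are hypothetical; this is not the contents of the scripts on proxfold.

```sh
#!/bin/sh
# Hypothetical helper for a ZED zedlet: turn the event environment into one
# line of text. Returns 1 for subclasses this sketch does not cover.
zed_event_message() {
  pool=${ZEVENT_POOL:-unknown}
  case "${ZEVENT_SUBCLASS:-}" in
    statechange)
      printf 'ZFS %s: vdev %s is now %s' \
        "$pool" "${ZEVENT_VDEV_PATH:-unknown}" "${ZEVENT_VDEV_STATE_STR:-unknown}"
      ;;
    scrub_finish)
      printf 'ZFS %s: scrub finished' "$pool"
      ;;
    resilver_finish)
      printf 'ZFS %s: resilver finished' "$pool"
      ;;
    *)
      return 1
      ;;
  esac
}
```

The resulting line would then be sent with the same curl-to-webhook call the other notifiers use.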

Review triggers

  • Beszel begins emitting a per-sensor threshold (tracks #1111) → revisit temperature alerting on proxfold.
  • Recurring unexplained performance drops → consider Load Avg or Disk I/O alerts.
  • T400 swapped for a different GPU → revisit the thermal-alert tradeoff (different idle and max-op envelopes change the math).