Drift Detection

Scheduled ansible-playbook --check --diff runs on CT104, with Discord alerts on drift or failure. Closes the Phase 3D loop — config-as-code has no value if nobody notices when live state drifts away from the repo.

Kit lives in: rampantlemming/homelab-ansible/drift-detection/

How it works

flowchart LR
    timer[systemd timer<br/>daily 04:00 ACST] --> svc[drift-detection.service]
    svc --> wrap[drift-check.sh]
    wrap --> pull[git pull --ff-only]
    pull --> play[ansible-playbook --check --diff<br/>against all managed hosts]
    play --> classify{PLAY RECAP<br/>totals}
    classify -->|changed=0 rc=0| clean[silent exit]
    classify -->|changed&gt;0 rc=0| drift[POST :warning: drift<br/>to Discord]
    classify -->|rc&ne;0 or failed&gt;0| fail[POST :x: failure<br/>to Discord]

    style clean fill:#DCFCE7,stroke:#15803D,color:#14532D
    style drift fill:#FEF3C7,stroke:#B45309,color:#78350F
    style fail fill:#FECACA,stroke:#B91C1C,color:#7F1D1D

Classification

The wrapper reads the PLAY RECAP block and sums changed, failed, and unreachable across all hosts:

| Outcome | Condition | Action |
|---|---|---|
| clean | `rc=0`, `changed=0`, `failed=0`, `unreachable=0` | Silent (unless `DRIFT_SUMMARY=1`) |
| drift | `rc=0`, `changed>0` or `unreachable>0` | Amber embed to Discord |
| failure | `rc≠0` or `failed>0` | Red embed to Discord |
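The table above can be sketched as two small shell functions. This is a minimal illustration, not the actual contents of drift-check.sh: the function names `sum_recap_field` and `classify_run` are hypothetical, and the parser assumes uncoloured ansible-playbook output (for example with `ANSIBLE_NOCOLOR=1`) on stdin.

```shell
# Sum one PLAY RECAP counter (e.g. "changed") across all host lines on stdin.
sum_recap_field() {
  awk -v key="$1" '
    /^PLAY RECAP/ { in_recap = 1; next }   # per-host counters follow this banner
    in_recap {
      for (i = 1; i <= NF; i++) {
        split($i, kv, "=")                 # "changed=2" -> kv[1]="changed", kv[2]="2"
        if (kv[1] == key) total += kv[2]
      }
    }
    END { print total + 0 }                # print 0 when the counter never appears
  '
}

# Map the exit code and summed counters to an outcome, following the table.
# Arguments: rc, changed, failed, unreachable.
classify_run() {
  local rc="$1" changed="$2" failed="$3" unreachable="$4"
  if [ "$rc" -ne 0 ] || [ "$failed" -gt 0 ]; then
    echo failure
  elif [ "$changed" -gt 0 ] || [ "$unreachable" -gt 0 ]; then
    echo drift
  else
    echo clean
  fi
}
```

Failure is checked first so that a run with both changed and failed tasks is reported as a failure rather than as drift.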

Each post includes the full PLAY RECAP (truncated to Discord's 1024-char field limit) plus the path to the full log under /var/log/drift-detection/. Log retention is 30 days by default.
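One way to build such a post is to truncate the recap first and let jq do the JSON escaping. The sketch below is an assumption about the shape of the payload, not the wrapper's real code: the function names, the field names "PLAY RECAP" and "Full log", and the colour value are all hypothetical.

```shell
# Truncate text to Discord's 1024-character embed field limit.
# head -c counts bytes, which equals characters for an ASCII recap.
truncate_field() {
  printf '%s' "$1" | head -c 1024
}

# Build an embed payload with jq so quotes and newlines in the recap are escaped.
# Arguments: title, colour (decimal integer), recap text, log path.
build_payload() {
  jq -n \
    --arg title "$1" \
    --argjson color "$2" \
    --arg recap "$(truncate_field "$3")" \
    --arg log "$4" \
    '{embeds: [{title: $title, color: $color,
                fields: [{name: "PLAY RECAP", value: $recap},
                         {name: "Full log", value: $log}]}]}'
}
```

A drift post could then look like `build_payload ':warning: drift' 16096779 "$recap" "$log" | curl -fsS -H 'Content-Type: application/json' -d @- "$(cat /etc/drift-detection/webhook)"`, where `$recap` and `$log` are placeholders for the captured recap and the log path.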

Timing

| Setting | Value |
|---|---|
| Schedule | Daily, `04:00` local (Australia/Adelaide) |
| Randomised delay | 0–5 minutes (avoids 04:00 clash with other cron work) |
| Persistent | `true`: missed runs (CT104 down at 04:00) are caught on next boot |
| Timeout | 15 minutes; anything longer means something is stuck |
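For orientation, the settings above map onto standard systemd directives roughly as follows. The snippet renders illustrative units into a scratch directory for review only; the real units are installed by install.sh and may differ, and the `ExecStart` path and `Description` text are hypothetical.

```shell
# Render illustrative units into a scratch directory; nothing is installed.
unit_dir="$(mktemp -d)"

cat > "$unit_dir/drift-detection.timer" <<'EOF'
[Unit]
Description=Daily Ansible drift check

[Timer]
# Interpreted in the system's local timezone
OnCalendar=*-*-* 04:00:00
RandomizedDelaySec=5min
Persistent=true

[Install]
WantedBy=timers.target
EOF

cat > "$unit_dir/drift-detection.service" <<'EOF'
[Unit]
Description=Ansible drift check

[Service]
Type=oneshot
# Hypothetical path; run `systemctl cat drift-detection.service` for the real one
ExecStart=/usr/local/bin/drift-check.sh
TimeoutStartSec=15min
EOF

echo "$unit_dir"
```

On a host with systemd, `systemd-analyze calendar '*-*-* 04:00:00'` prints the next elapse time for the schedule, which is a quick way to confirm how the expression is interpreted.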

Timezone gotcha on fresh CT104

OnCalendar is interpreted in the container's local timezone. Debian templates ship with Etc/UTC, so an unconfigured container silently fires the timer at 04:00 UTC, which is 13:30 ACST (14:30 during daylight saving), instead of 04:00 local. After any CT104 rebuild, verify with timedatectl and set Australia/Adelaide if it reports UTC; see the rebuild runbook.
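The post-rebuild check can be scripted along these lines. This is a sketch only: `check_timezone` is a hypothetical helper that compares a timezone name against the expected one and prints the fix, without changing anything itself.

```shell
expected_tz="Australia/Adelaide"

# Compare a timezone name with the expected one; print the fix on mismatch.
check_timezone() {
  if [ "$1" = "$expected_tz" ]; then
    echo "ok: timezone is $1"
  else
    echo "mismatch: timezone is $1; run: timedatectl set-timezone $expected_tz"
    return 1
  fi
}
```

On CT104 it would be called as `check_timezone "$(timedatectl show --property=Timezone --value)"`.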

Installation (CT104)

One-time setup

sudo apt install -y jq curl git ansible
cd ~/homelab-ansible
git pull --ff-only

# Install with webhook URL in one step:
sudo drift-detection/install.sh \
  --webhook 'https://discord.com/api/webhooks/<id>/<token>'

# Or without the webhook (install units only, add the URL later):
sudo drift-detection/install.sh
sudo install -m 0700 -d /etc/drift-detection
echo 'https://discord.com/api/webhooks/<id>/<token>' \
  | sudo install -m 0600 /dev/stdin /etc/drift-detection/webhook

Verify

systemctl list-timers drift-detection.timer
systemctl cat drift-detection.service

Test run

# Fire the service manually (doesn't wait for 04:00)
sudo systemctl start drift-detection.service

# Watch output
sudo journalctl -u drift-detection.service -f

# Or follow the latest log file
sudo tail -f "/var/log/drift-detection/$(sudo ls -t /var/log/drift-detection/ | head -1)"

A clean test run posts a :white_check_mark: embed only if DRIFT_SUMMARY=1 is set in the service environment (for example via a systemd drop-in); otherwise clean runs are silent.
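A drop-in for the service environment might be written as below. This is a minimal sketch: `write_env_dropin` is a hypothetical helper, and naming the file after the variable is just one convention (systemd reads any `*.conf` file in the drop-in directory).

```shell
# Write a systemd drop-in that sets one environment variable for a service.
# Arguments: drop-in directory, variable name, value.
write_env_dropin() {
  mkdir -p "$1"
  printf '[Service]\nEnvironment=%s=%s\n' "$2" "$3" > "$1/$2.conf"
}
```

As root on CT104: `write_env_dropin /etc/systemd/system/drift-detection.service.d DRIFT_SUMMARY 1 && systemctl daemon-reload`. Removing the file and reloading again restores silent clean runs.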

Using the wrapper from WSL

The same drift-check.sh runs on WSL — useful for manual sweeps outside the CT104 schedule. Override paths via env:

cd ~/homelab-ansible
DRIFT_REPO_DIR=$HOME/homelab-ansible \
DRIFT_VAULT_PASS_FILE=$HOME/.vault_pass \
DRIFT_WEBHOOK_FILE=$HOME/.config/drift-detection/webhook \
DRIFT_LOG_DIR=/tmp/drift-logs \
  drift-detection/drift-check.sh

Note

WSL currently has SSH access to 3 of 4 managed hosts (plex LXC refuses WSL's key — a known gap from the CT104 bootstrap era). A WSL-run drift check will report unreachable=1 for plex and classify the run as drift. Either fix the SSH key on plex first, or accept the noise when running from WSL.

Webhook setup

The Discord webhook is a dedicated #homelab-drift channel webhook — kept separate from the MediaBot webhook so ops alerts don't mix with service notifications.

To rotate:

# On CT104
echo 'https://discord.com/api/webhooks/<new-id>/<new-token>' \
  | sudo install -m 0600 /dev/stdin /etc/drift-detection/webhook

No service restart required — the file is read on each run.
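After a rotation it can be worth sanity-checking the stored URL before the next scheduled run. The helpers below are hypothetical, and the URL pattern (numeric id, then a token of letters, digits, underscores, or hyphens) is an assumption about the usual shape of Discord webhook URLs rather than a documented guarantee.

```shell
# Check that a string has the general shape of a Discord webhook URL.
webhook_url_ok() {
  printf '%s\n' "$1" \
    | grep -Eq '^https://discord\.com/api/webhooks/[0-9]+/[A-Za-z0-9_-]+$'
}

# Send a plain test message; curl -f turns HTTP errors into a non-zero exit.
send_test_message() {
  curl -fsS -H 'Content-Type: application/json' \
    -d '{"content": "drift-detection webhook test"}' "$1"
}
```

For example, as root: `url="$(cat /etc/drift-detection/webhook)"; webhook_url_ok "$url" && send_test_message "$url"`. A message appearing in #homelab-drift confirms the new URL works.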

Expected drift sources

Things that legitimately cause drift between scheduled runs:

- Kernel upgrades: Nvidia cgroup majors can shift (234→235, 237→238) after apt dist-upgrade and a reboot on proxfold. The nvidia role re-reads the majors on every run, so the check keeps reporting a diff until a real (non-check) play applies the new values.
- Manual pvesm/pct set edits: anything changed through the Proxmox web UI that overlaps with Ansible-managed state.
- Docker Compose drift on arrstack: editing /opt/mediaserver/.env directly instead of going through Dockhand makes the arrstack role flag the diff.

Things that should not drift — if they do, the role is buggy, not the environment:

- Vault-rendered files (/etc/nut/upsd.users, /etc/nut/upsmon.conf): no_log: true is set, so a diff here won't leak secrets to Discord, but it still needs investigation.
- APT repo files: the proxmox role owns the full file content.

Troubleshooting

| Symptom | Likely cause |
|---|---|
| Timer listed but never fires | `systemctl daemon-reload` missed; re-run `install.sh` |
| Every run posts drift with same recap | Repo out of date on CT104; check the `git pull` output in the latest log |
| `rc=2` pre-flight failure | Missing `/root/.vault_pass` or `/etc/drift-detection/webhook` |
| `rc=3` Discord POST failed | Webhook URL wrong or rotated without updating the file |
| Log bloat over time | `DRIFT_LOG_RETENTION_DAYS` defaults to 30; lower it via a service env drop-in if needed |

Full log of a run (including the ansible output with diffs) lives at /var/log/drift-detection/drift-<timestamp>.log. Start there for anything unexpected.
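When reading a log, it usually helps to jump straight to the tasks that reported a change. The helpers below are a rough sketch with hypothetical names; they assume the log contains Ansible's default callback output, where each task starts with a `TASK [...]` line and changed hosts appear as `changed: [host]`.

```shell
# Print the path of the newest drift log in a directory (empty if none).
latest_drift_log() {
  ls -t "$1"/drift-*.log 2>/dev/null | head -n 1
}

# Print each task header that had at least one changed host,
# followed by its "changed: [host]" lines.
changed_tasks() {
  awk '
    /^TASK \[/    { task = $0 }             # remember the most recent task header
    /^changed: \[/ {
      if (task != "") print task            # print the header once per task
      task = ""
      print
    }
  ' "$1"
}
```

For example: `changed_tasks "$(latest_drift_log /var/log/drift-detection)"`. The diff for each change sits between the task header and the `changed:` line in the full log.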

- Ansible overview: control node layout (CT104 + WSL dual-control)
- WSL control node bootstrap: WSL-side vault pass + SSH key setup
- Roadmap, Phase 3D: design rationale