Drift Detection¶
Scheduled ansible-playbook --check --diff runs on CT104, with Discord alerts on drift or failure. Closes the Phase 3D loop — config-as-code has no value if nobody notices when live state drifts away from the repo.
Kit lives in: rampantlemming/homelab-ansible/drift-detection/
How it works¶
flowchart LR
timer[systemd timer<br/>daily 04:00 ACST] --> svc[drift-detection.service]
svc --> wrap[drift-check.sh]
wrap --> pull[git pull --ff-only]
pull --> play[ansible-playbook --check --diff<br/>against all managed hosts]
play --> classify{PLAY RECAP<br/>totals}
classify -->|changed=0 rc=0| clean[silent exit]
classify -->|changed>0 rc=0| drift[POST :warning: drift<br/>to Discord]
classify -->|rc≠0 or failed>0| fail[POST :x: failure<br/>to Discord]
style clean fill:#DCFCE7,stroke:#15803D,color:#14532D
style drift fill:#FEF3C7,stroke:#B45309,color:#78350F
style fail fill:#FECACA,stroke:#B91C1C,color:#7F1D1D
Classification¶
The wrapper reads the PLAY RECAP block and sums changed, failed, and unreachable across all hosts:
| Outcome | Condition | Action |
|---|---|---|
| clean | rc=0, changed=0, failed=0, unreachable=0 | Silent (unless DRIFT_SUMMARY=1) |
| drift | rc=0, changed>0 or unreachable>0 | Amber embed to Discord |
| failure | rc≠0 or failed>0 | Red embed to Discord |
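The classification rules above can be sketched in a few lines of bash. This is a minimal illustration, not the contents of drift-check.sh: the function names `sum_recap_field` and `classify_run` are hypothetical, and the parsing assumes the standard `host : ok=N changed=N unreachable=N failed=N ...` recap line format.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, not taken from drift-check.sh.

# Sum one counter (changed, failed, or unreachable) across every host line
# that follows a PLAY RECAP header in the given log file.
sum_recap_field() {
  local field="$1" log="$2"
  awk -v key="$field" '
    /^PLAY RECAP/ { in_recap = 1; next }
    in_recap {
      for (i = 1; i <= NF; i++) {
        split($i, kv, "=")
        if (kv[1] == key) total += kv[2]
      }
    }
    END { print total + 0 }
  ' "$log"
}

# Map the ansible-playbook exit code plus the recap totals to an outcome.
# Failure is checked first, so a non-zero rc always wins over drift.
classify_run() {
  local rc="$1" log="$2"
  local changed failed unreachable
  changed=$(sum_recap_field changed "$log")
  failed=$(sum_recap_field failed "$log")
  unreachable=$(sum_recap_field unreachable "$log")

  if [ "$rc" -ne 0 ] || [ "$failed" -gt 0 ]; then
    echo failure
  elif [ "$changed" -gt 0 ] || [ "$unreachable" -gt 0 ]; then
    echo drift
  else
    echo clean
  fi
}
```

Calling `classify_run "$rc" "$logfile"` after the playbook finishes prints one of `clean`, `drift`, or `failure`, which the wrapper could then use to pick the embed colour.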
Each post includes the PLAY RECAP, truncated to Discord's 1024-character embed field limit when it is longer, plus the path to the full log under /var/log/drift-detection/. Logs are kept for 30 days by default.
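As a rough illustration of how a recap could be fitted into a Discord embed field and posted, the sketch below truncates the text, builds the JSON with jq, and sends it with curl. The function names, the field labels, and the colour value are assumptions made for the example; the real wrapper may structure its payload differently.

```shell
#!/usr/bin/env bash
# Sketch only: names, field labels, and colour are illustrative assumptions.

# Truncate text to Discord's 1024-character embed field limit,
# marking the cut with a trailing "..." when truncation happens.
truncate_field() {
  local text="$1" limit=1024
  if [ "${#text}" -gt "$limit" ]; then
    printf '%s...' "${text:0:limit-3}"
  else
    printf '%s' "$text"
  fi
}

# Build a webhook payload containing a single embed. jq handles JSON escaping.
# $2 is a decimal RGB colour, for example 16753920 (0xFFA500, orange).
build_payload() {
  local title="$1" color="$2" recap="$3" log_path="$4"
  jq -n --arg title "$title" --argjson color "$color" \
        --arg recap "$(truncate_field "$recap")" --arg log "$log_path" \
    '{embeds: [{title: $title, color: $color,
       fields: [{name: "PLAY RECAP", value: $recap},
                {name: "Full log", value: $log}]}]}'
}

# POST the payload to the URL stored in the webhook file.
# Returning 3 on failure mirrors the rc=3 case in the troubleshooting table.
post_embed() {
  local webhook_file="$1" payload="$2"
  curl -fsS -H 'Content-Type: application/json' \
       -d "$payload" "$(cat "$webhook_file")" || return 3
}
```

Keeping truncation in its own function makes it easy to check the limit without touching the network.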
Timing¶
| Setting | Value |
|---|---|
| Schedule | Daily, 04:00 local (Australia/Adelaide) |
| Randomised delay | 0–5 minutes (avoids 04:00 clash with other cron work) |
| Persistent | true — missed runs (CT104 down at 04:00) are caught on next boot |
| Timeout | 15 minutes — anything longer means something is stuck |
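For orientation, the timer settings in the table map onto systemd directives roughly as follows. The unit text is a sketch under assumptions: the Description and the helper name `render_timer_unit` are made up for the example, and the unit shipped by install.sh may differ. The 15-minute timeout would live in the service unit rather than the timer (for instance as `TimeoutStartSec=15min`, which is also an assumption).

```shell
#!/usr/bin/env bash
# Sketch only: prints a timer unit matching the table; not the installed file.
render_timer_unit() {
  cat <<'EOF'
[Unit]
Description=Daily Ansible drift check

[Timer]
# Daily at 04:00 in the host's local timezone
OnCalendar=*-*-* 04:00:00
# Spread the start over 0 to 5 minutes
RandomizedDelaySec=5min
# Run a missed job the next time the timer is activated
Persistent=true
Unit=drift-detection.service

[Install]
WantedBy=timers.target
EOF
}
```

On a systemd host, `systemd-analyze calendar '*-*-* 04:00:00'` prints the next elapse time for that expression, which is a quick way to sanity-check the schedule.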
Timezone gotcha on fresh CT104
OnCalendar is interpreted against the container's local timezone. Debian templates ship with Etc/UTC, which silently fires the timer at 13:30 ACST instead of 04:00. After any CT104 rebuild, check the timezone with timedatectl and, if it reports Etc/UTC, set it to Australia/Adelaide; see the rebuild runbook.
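A small check along these lines could be run after a rebuild. It is a sketch: the helper names are hypothetical, and `timedatectl show --property=Timezone --value` assumes a reasonably recent systemd (239 or later).

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.
EXPECTED_TZ="Australia/Adelaide"

# Return 0 when the given timezone name matches the expected one.
tz_is_expected() {
  [ "$1" = "$EXPECTED_TZ" ]
}

# Print the timezone systemd is currently using.
current_tz() {
  timedatectl show --property=Timezone --value
}

# Usage on CT104 (left commented so sourcing this file changes nothing):
#   tz_is_expected "$(current_tz)" || sudo timedatectl set-timezone "$EXPECTED_TZ"
```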
Installation (CT104)¶
One-time setup¶
apt install -y jq curl git ansible
cd ~/homelab-ansible
git pull --ff-only
# Install with webhook URL in one step:
sudo drift-detection/install.sh \
--webhook 'https://discord.com/api/webhooks/<id>/<token>'
# Or without the webhook (install units only, add the URL later):
sudo drift-detection/install.sh
sudo install -m 0700 -d /etc/drift-detection
echo 'https://discord.com/api/webhooks/<id>/<token>' \
| sudo install -m 0600 /dev/stdin /etc/drift-detection/webhook
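To confirm the webhook file ended up the way the commands above intend, a pre-flight check could look like the sketch below. The function name `check_webhook_file` is hypothetical, `stat -c` assumes GNU coreutils, and the return code 2 simply mirrors the rc=2 pre-flight failure listed under Troubleshooting; the actual wrapper's checks are not shown in this page.

```shell
#!/usr/bin/env bash
# Sketch only: not the wrapper's real pre-flight logic.
check_webhook_file() {
  local file="$1" mode url

  # The file must exist and be readable by the current user.
  [ -r "$file" ] || { echo "missing or unreadable: $file" >&2; return 2; }

  # The install step sets mode 0600; anything else is flagged.
  mode=$(stat -c '%a' "$file")
  [ "$mode" = "600" ] || { echo "mode is $mode, expected 600" >&2; return 2; }

  # The first line should look like a Discord webhook URL.
  url=$(head -n 1 "$file")
  case "$url" in
    https://discord.com/api/webhooks/*/*) return 0 ;;
    *) echo "does not look like a Discord webhook URL" >&2; return 2 ;;
  esac
}
```

For example, `sudo bash -c 'source ./check.sh; check_webhook_file /etc/drift-detection/webhook'` would run it against the installed file, assuming the function was saved as check.sh.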
Verify¶
Test run¶
# Fire the service manually (doesn't wait for 04:00)
sudo systemctl start drift-detection.service
# Watch output
sudo journalctl -u drift-detection.service -f
# Or tail the latest log
sudo tail -n 50 "/var/log/drift-detection/$(sudo ls -t /var/log/drift-detection/ | head -1)"
A clean test run posts a :white_check_mark: embed only if DRIFT_SUMMARY=1 is set in the service environment (override via a systemd drop-in); otherwise clean runs are silent.
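One way to set DRIFT_SUMMARY=1 through a drop-in is sketched below. The file name summary.conf and the function name are arbitrary choices for the example; the drop-in directory follows the usual systemd convention for a unit named drift-detection.service.

```shell
#!/usr/bin/env bash
# Sketch only: file and function names are arbitrary.
write_summary_dropin() {
  # Default to the conventional drop-in directory; accept an override for testing.
  local dropin_dir="${1:-/etc/systemd/system/drift-detection.service.d}"
  mkdir -p "$dropin_dir"
  cat > "$dropin_dir/summary.conf" <<'EOF'
[Service]
Environment=DRIFT_SUMMARY=1
EOF
}

# On CT104, as root (left commented so sourcing this file changes nothing):
#   write_summary_dropin && systemctl daemon-reload
```

Removing the file and running `systemctl daemon-reload` again returns clean runs to silent.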
Using the wrapper from WSL¶
The same drift-check.sh runs on WSL — useful for manual sweeps outside the CT104 schedule. Override paths via env:
cd ~/homelab-ansible
DRIFT_REPO_DIR=$HOME/homelab-ansible \
DRIFT_VAULT_PASS_FILE=$HOME/.vault_pass \
DRIFT_WEBHOOK_FILE=$HOME/.config/drift-detection/webhook \
DRIFT_LOG_DIR=/tmp/drift-logs \
drift-detection/drift-check.sh
Note
WSL currently has SSH access to 3 of 4 managed hosts (plex LXC refuses WSL's key — a known gap from the CT104 bootstrap era). A WSL-run drift check will report unreachable=1 for plex and classify the run as drift. Either fix the SSH key on plex first, or accept the noise when running from WSL.
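The environment overrides used for the WSL run suggest the wrapper falls back to CT104 defaults when a variable is unset. A sketch of that pattern follows. The variable names and the vault, webhook, log, and retention defaults come from this page; the repo directory default and the function name `resolve_drift_config` are assumptions.

```shell
#!/usr/bin/env bash
# Sketch only: shows the override-with-default pattern, not the wrapper's code.
resolve_drift_config() {
  # Each value keeps whatever the caller exported, else the CT104 default.
  DRIFT_REPO_DIR="${DRIFT_REPO_DIR:-$HOME/homelab-ansible}"   # default is an assumption
  DRIFT_VAULT_PASS_FILE="${DRIFT_VAULT_PASS_FILE:-/root/.vault_pass}"
  DRIFT_WEBHOOK_FILE="${DRIFT_WEBHOOK_FILE:-/etc/drift-detection/webhook}"
  DRIFT_LOG_DIR="${DRIFT_LOG_DIR:-/var/log/drift-detection}"
  DRIFT_LOG_RETENTION_DAYS="${DRIFT_LOG_RETENTION_DAYS:-30}"
}
```

With this pattern, a WSL invocation only needs to export the paths that differ from CT104.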
Webhook setup¶
The Discord webhook is a dedicated #homelab-drift channel webhook — kept separate from the MediaBot webhook so ops alerts don't mix with service notifications.
To rotate:
# On CT104
echo 'https://discord.com/api/webhooks/<new-id>/<new-token>' \
| sudo install -m 0600 /dev/stdin /etc/drift-detection/webhook
No service restart required — the file is read on each run.
Expected drift sources¶
Things that legitimately cause drift between scheduled runs:
- Kernel upgrades: Nvidia cgroup majors can shift (234→235, 237→238) after apt dist-upgrade + reboot on proxfold; the nvidia role re-reads and reapplies them, but the check shows a diff until a real (non-check) play applies the change
- Manual pvesm/pct set edits: anything changed through the Proxmox web UI that overlaps with Ansible-managed state
- Docker Compose drift on arrstack: if you edit /opt/mediaserver/.env directly instead of going through Dockhand, the arrstack role will flag the diff
Things that should not drift — if they do, the role is buggy, not the environment:
- Vault-rendered files (/etc/nut/upsd.users, /etc/nut/upsmon.conf): no_log: true is set; any diff here needs investigation but won't leak secrets to Discord
- APT repo files: the proxmox role owns full file content
Troubleshooting¶
| Symptom | Likely cause |
|---|---|
| Timer listed but never fires | systemctl daemon-reload missed; re-run install.sh |
| Every run posts drift with same recap | Repo out of date on CT104 — check git pull output in the latest log |
| rc=2 pre-flight failure | Missing /root/.vault_pass or /etc/drift-detection/webhook |
| rc=3 Discord POST failed | Webhook URL wrong or rotated without updating the file |
| Log bloat over time | DRIFT_LOG_RETENTION_DAYS default is 30 — lower via service env drop-in if needed |
Full log of a run (including the ansible output with diffs) lives at /var/log/drift-detection/drift-<timestamp>.log. Start there for anything unexpected.
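Retention by age can be expressed with a single find call. The sketch below assumes the drift-<timestamp>.log naming shown above; the function name `prune_drift_logs` is hypothetical, and the wrapper's own cleanup may work differently.

```shell
#!/usr/bin/env bash
# Sketch only: one possible way to enforce DRIFT_LOG_RETENTION_DAYS.
prune_drift_logs() {
  local log_dir="$1" days="${2:-30}"
  # -mtime +N matches files last modified more than N whole days ago.
  find "$log_dir" -maxdepth 1 -type f -name 'drift-*.log' \
       -mtime +"$days" -delete
}

# Example (left commented so sourcing this file deletes nothing):
#   prune_drift_logs /var/log/drift-detection 30
```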
Related¶
- Ansible overview — control node layout (CT104 + WSL dual-control)
- WSL control node bootstrap — WSL-side vault pass + SSH key setup
- Roadmap — Phase 3D — design rationale