
Drift Detection

Scheduled ansible-playbook --check --diff runs on CT104, with Discord alerts on drift or failure. Closes the Phase 3D loop — config-as-code has no value if nobody notices when live state drifts away from the repo.

Kit lives in: rampantlemming/homelab-ansible/drift-detection/

How it works

flowchart LR
    timer[systemd timer<br/>daily 04:00 ACST] --> svc[drift-detection.service]
    svc --> wrap[drift-check.sh]
    wrap --> pull[git pull --ff-only]
    pull --> play[ansible-playbook --check --diff<br/>against all managed hosts]
    play --> classify{PLAY RECAP<br/>totals}
    classify -->|changed=0 rc=0| clean[silent exit]
    classify -->|changed&gt;0 rc=0| drift[POST :warning: drift<br/>to Discord]
    classify -->|rc&ne;0 or failed&gt;0| fail[POST :x: failure<br/>to Discord]

    style clean fill:#DCFCE7,stroke:#15803D,color:#14532D
    style drift fill:#FEF3C7,stroke:#B45309,color:#78350F
    style fail fill:#FECACA,stroke:#B91C1C,color:#7F1D1D
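
The pull-then-check steps in the flowchart can be sketched roughly as follows. This is a minimal illustration, not the actual drift-check.sh: the function name run_check, the GIT_BIN and ANSIBLE_PLAYBOOK_BIN overrides, the site.yml default, and the use of exit code 2 for a failed pull are all assumptions.

```shell
# Hypothetical skeleton of the pull-then-check steps (bash).
# Runs in a subshell so the cd does not leak into the caller.
# log_file should be an absolute path because of that cd.
run_check() (
  repo_dir="$1"
  log_file="$2"
  playbook="${3:-site.yml}"                      # assumed playbook name
  git_bin="${GIT_BIN:-git}"                      # assumed override hook
  ansible_bin="${ANSIBLE_PLAYBOOK_BIN:-ansible-playbook}"

  cd "$repo_dir" || exit 2
  # Refuse to check against a repo that cannot fast-forward cleanly.
  "$git_bin" pull --ff-only >>"$log_file" 2>&1 || exit 2
  # --check makes no changes; --diff records what would change.
  # The subshell exits with ansible-playbook's own return code.
  "$ansible_bin" --check --diff "$playbook" >>"$log_file" 2>&1
)
```

The return code and the log file are then the two inputs the classification step needs.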

Classification

The wrapper reads the PLAY RECAP block and sums changed, failed, and unreachable across all hosts:

| Outcome | Condition | Action |
| --- | --- | --- |
| clean | rc=0, changed=0, failed=0, unreachable=0 | Silent (unless DRIFT_SUMMARY=1) |
| drift | rc=0, changed>0 or unreachable>0 | Amber embed to Discord |
| failure | rc≠0 or failed>0 | Red embed to Discord |
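
As an illustration of these rules, here is a minimal bash sketch that sums the recap counters with awk and picks an outcome. The function names sum_recap and classify_run are hypothetical and the real wrapper may parse the recap differently. The sketch follows the rules literally; stock ansible-playbook normally exits non-zero when a host is unreachable, so the real wrapper presumably handles that case specially.

```shell
# Sum changed/failed/unreachable over every host line of a PLAY RECAP
# block read from stdin. Prints "changed failed unreachable".
sum_recap() {
  awk '
    /^PLAY RECAP/ { in_recap = 1; next }
    in_recap {
      for (i = 1; i <= NF; i++) {
        split($i, kv, "=")
        if (kv[1] == "changed")     changed     += kv[2]
        if (kv[1] == "failed")      failed      += kv[2]
        if (kv[1] == "unreachable") unreachable += kv[2]
      }
    }
    END { printf "%d %d %d\n", changed, failed, unreachable }
  '
}

# Usage: ansible output on stdin, ansible-playbook return code as $1.
classify_run() {
  local rc="$1" totals changed failed unreachable
  totals="$(sum_recap)"
  read -r changed failed unreachable <<<"$totals"
  if [ "$rc" -ne 0 ] || [ "$failed" -gt 0 ]; then
    echo failure
  elif [ "$changed" -gt 0 ] || [ "$unreachable" -gt 0 ]; then
    echo drift
  else
    echo clean
  fi
}
```

For example, `classify_run "$rc" <"$log_file"` would print one of clean, drift, or failure for a finished run.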

Each post includes the PLAY RECAP (truncated to Discord's 1024-character embed field limit) plus the path to the full log under /var/log/drift-detection/. Logs are retained for 30 days by default.
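
A post of this shape could be built roughly as below. This is a hedged sketch, assuming bash and jq: the function names, the embed title, the field names, and the exact colour values are illustrative choices, not taken from the real wrapper. Only the 1024-character field limit and the embeds/fields payload structure come from Discord's webhook API.

```shell
# Cut a string down to Discord's embed field value limit.
truncate_field() {
  local text="$1" limit="${2:-1024}"
  printf '%s' "${text:0:$limit}"
}

# Build the JSON body for one embed. jq handles all quoting/escaping.
build_payload() {
  local outcome="$1" recap="$2" log_path="$3" color
  case "$outcome" in
    drift)   color=$((0xF59E0B)) ;;   # amber (assumed shade)
    failure) color=$((0xDC2626)) ;;   # red (assumed shade)
    *)       color=$((0x22C55E)) ;;   # green, for optional clean summaries
  esac
  jq -n \
    --arg title "Drift check: $outcome" \
    --arg recap "$(truncate_field "$recap")" \
    --arg log "$log_path" \
    --argjson color "$color" \
    '{embeds: [{title: $title, color: $color,
                fields: [{name: "PLAY RECAP", value: $recap},
                         {name: "Full log",   value: $log}]}]}'
}

# POST a payload from stdin to the URL stored in the webhook file ($1).
# -f makes curl exit non-zero on HTTP errors such as a revoked webhook.
post_payload() {
  curl -fsS -H 'Content-Type: application/json' -d @- "$(cat "$1")"
}
```

Usage would look like `build_payload drift "$recap" "$log_file" | post_payload /etc/drift-detection/webhook`.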

Timing

| Setting | Value |
| --- | --- |
| Schedule | Daily, 04:00 local (Australia/Adelaide) |
| Randomised delay | 0–5 minutes (avoids 04:00 clash with other cron work) |
| Persistent | true: missed runs (CT104 down at 04:00) are caught on next boot |
| Timeout | 15 minutes; anything longer means something is stuck |
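
For orientation, these settings map onto standard systemd directives roughly as shown below. The snippet writes two fragments to a scratch directory for side-by-side comparison with `systemctl cat`; it is a sketch of the timing-related lines only (no ExecStart), and the installed units may differ. The function name write_unit_sketch and the .sketch filenames are hypothetical, as is the assumption that the timeout is set via TimeoutStartSec on a oneshot service.

```shell
# Write timing-only unit fragments into a scratch directory ($1).
write_unit_sketch() {
  local dir="$1"
  mkdir -p "$dir"
  cat >"$dir/drift-detection.timer.sketch" <<'EOF'
[Timer]
OnCalendar=*-*-* 04:00:00
RandomizedDelaySec=5min
Persistent=true

[Install]
WantedBy=timers.target
EOF
  cat >"$dir/drift-detection.service.sketch" <<'EOF'
[Service]
Type=oneshot
TimeoutStartSec=15min
EOF
}
```

`systemd-analyze calendar '*-*-* 04:00:00'` prints the next elapse time for the expression, which is a quick way to confirm the schedule means what you expect.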

Timezone gotcha on fresh CT104

OnCalendar is interpreted against the container's local timezone. Debian templates ship with Etc/UTC, which silently fires the timer at 04:00 UTC, i.e. 13:30 ACST (14:30 during daylight saving), instead of 04:00 local. After any CT104 rebuild, verify with timedatectl and set Australia/Adelaide if it reports UTC; see the rebuild runbook.
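
The check can be scripted along these lines. A minimal bash sketch: the function names are hypothetical, and it assumes a systemd recent enough to support `timedatectl show`.

```shell
want_tz='Australia/Adelaide'

# True (exit 0) when the given timezone is not the one we want.
tz_needs_fix() {
  [ "$1" != "$want_tz" ]
}

# Read the current timezone name, e.g. "Etc/UTC".
current_tz() {
  timedatectl show -p Timezone --value
}

# Set the timezone only if it is wrong, then show the result.
fix_tz_if_needed() {
  local cur
  cur="$(current_tz)"
  if tz_needs_fix "$cur"; then
    echo "timezone is $cur, switching to $want_tz"
    sudo timedatectl set-timezone "$want_tz"
  fi
  timedatectl
}
```

After a change, `systemctl list-timers drift-detection.timer` should show the next run at 04:00 local time.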

Installation (CT104)

One-time setup

sudo apt install -y jq curl git ansible
cd ~/homelab-ansible
git pull --ff-only

# Install with webhook URL in one step:
sudo drift-detection/install.sh \
  --webhook 'https://discord.com/api/webhooks/<id>/<token>'

# Or without the webhook (install units only, add the URL later):
sudo drift-detection/install.sh
sudo install -m 0700 -d /etc/drift-detection
echo 'https://discord.com/api/webhooks/<id>/<token>' \
  | sudo install -m 0600 /dev/stdin /etc/drift-detection/webhook

Verify

systemctl list-timers drift-detection.timer
systemctl cat drift-detection.service

Test run

# Fire the service manually (doesn't wait for 04:00)
sudo systemctl start drift-detection.service

# Watch output
sudo journalctl -u drift-detection.service -f

# Or tail the latest log
sudo tail -n 50 "/var/log/drift-detection/$(sudo ls -t /var/log/drift-detection/ | head -1)"

A successful test posts a :white_check_mark: embed if you set DRIFT_SUMMARY=1 in the service env (override via drop-in); otherwise clean runs are silent.
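
One way to add such a drop-in non-interactively is sketched below. The helper name write_env_dropin and the drop-in filename are hypothetical; `systemctl edit drift-detection.service` achieves the same thing interactively.

```shell
# Write a one-variable Environment= drop-in for drift-detection.service.
# $1 = variable name, $2 = value, $3 = unit root (defaults to /etc/systemd/system).
write_env_dropin() {
  local name="$1" value="$2" root="${3:-/etc/systemd/system}"
  local dir="$root/drift-detection.service.d"
  install -d -m 0755 "$dir"
  # File is named after the variable, lower-cased (bash 4+).
  printf '[Service]\nEnvironment=%s=%s\n' "$name" "$value" \
    >"$dir/${name,,}.conf"
}
```

Run from a root shell, `write_env_dropin DRIFT_SUMMARY 1` followed by `systemctl daemon-reload` would make the next run post a summary even when clean.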

Using the wrapper from WSL

The same drift-check.sh runs on WSL — useful for manual sweeps outside the CT104 schedule. Override paths via env:

cd ~/homelab-ansible
DRIFT_REPO_DIR=$HOME/homelab-ansible \
DRIFT_VAULT_PASS_FILE=$HOME/.vault_pass \
DRIFT_WEBHOOK_FILE=$HOME/.config/drift-detection/webhook \
DRIFT_LOG_DIR=/tmp/drift-logs \
  drift-detection/drift-check.sh

Note

WSL currently has SSH access to 3 of 4 managed hosts (plex LXC refuses WSL's key — a known gap from the CT104 bootstrap era). A WSL-run drift check will report unreachable=1 for plex and classify the run as drift. Either fix the SSH key on plex first, or accept the noise when running from WSL.

Webhook setup

The Discord webhook is a dedicated #homelab-drift channel webhook — kept separate from the MediaBot webhook so ops alerts don't mix with service notifications.

To rotate:

# On CT104
echo 'https://discord.com/api/webhooks/<new-id>/<new-token>' \
  | sudo install -m 0600 /dev/stdin /etc/drift-detection/webhook

No service restart required — the file is read on each run.
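
After rotating, a quick sanity check avoids waiting for the next scheduled run to discover a bad URL. The sketch below is illustrative only: the function names and the URL pattern are assumptions (Discord does not promise a fixed token format), and the test message text is arbitrary.

```shell
# Loose shape check for a Discord webhook URL (assumed pattern).
webhook_looks_valid() {
  local re='^https://discord(app)?\.com/api/webhooks/[0-9]+/[A-Za-z0-9_-]+$'
  [[ "$1" =~ $re ]]
}

# Read the webhook file and post a plain test message to it.
send_test_message() {
  local file="${1:-/etc/drift-detection/webhook}" url
  url="$(cat "$file")" || return 1
  if ! webhook_looks_valid "$url"; then
    echo "webhook URL in $file looks malformed" >&2
    return 1
  fi
  # -f turns HTTP errors (e.g. a deleted webhook) into a non-zero exit.
  curl -fsS -H 'Content-Type: application/json' \
    -d '{"content":"drift-detection webhook test"}' "$url"
}
```

Because the webhook file is mode 0600, send_test_message needs to run from a root shell on CT104.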

Expected drift sources

Things that legitimately cause drift between scheduled runs:

  • Kernel upgrades: Nvidia cgroup majors can shift (234→235, 237→238) after apt dist-upgrade + reboot on proxfold; the nvidia role re-reads the majors and keeps reporting a diff until a real (non-check) play applies the change
  • Manual pvesm / pct set edits, or anything changed through the Proxmox web UI, that overlap with Ansible-managed state
  • Docker Compose drift on arrstack: editing /opt/mediaserver/.env directly instead of going through Dockhand makes the arrstack role flag the diff

Things that should not drift — if they do, the role is buggy, not the environment:

  • Vault-rendered files (/etc/nut/upsd.users, /etc/nut/upsmon.conf) — no_log: true is set; any diff here needs investigation but won't leak secrets to Discord
  • APT repo files — the proxmox role owns full file content

Troubleshooting

| Symptom | Likely cause |
| --- | --- |
| Timer listed but never fires | systemctl daemon-reload missed; re-run install.sh |
| Every run posts drift with same recap | Repo out of date on CT104; check git pull output in the latest log |
| rc=2 pre-flight failure | Missing /root/.vault_pass or /etc/drift-detection/webhook |
| rc=3 Discord POST failed | Webhook URL wrong or rotated without updating the file |
| Log bloat over time | DRIFT_LOG_RETENTION_DAYS default is 30; lower via service env drop-in if needed |

Full log of a run (including the ansible output with diffs) lives at /var/log/drift-detection/drift-<timestamp>.log. Start there for anything unexpected.
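
Two small helpers for working with that directory, as a hedged sketch: the function names are hypothetical, and the pruning shown is only one plausible way to implement a retention window, not necessarily how the wrapper does it.

```shell
# Print the path of the newest drift-*.log in a directory (empty if none).
latest_drift_log() {
  local dir="${1:-/var/log/drift-detection}"
  ls -t "$dir"/drift-*.log 2>/dev/null | head -n 1
}

# Delete drift-*.log files whose mtime is more than N days old.
prune_drift_logs() {
  local dir="${1:-/var/log/drift-detection}" days="${2:-30}"
  find "$dir" -maxdepth 1 -type f -name 'drift-*.log' \
    -mtime +"$days" -delete
}
```

For example, `less "$(latest_drift_log)"` opens the most recent run, and `prune_drift_logs /var/log/drift-detection 14` would trim the directory to roughly two weeks of history.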