Matrix maintenance — keeping the homeserver current

This runbook covers the update cadence for the Matrix homeserver. It is triggered monthly by the matrix_deploy_notifier role, which POSTs the pending upstream commits to #homelab-updates on the first Monday of each month but never pulls or applies them. Applying updates is operator-driven, per upstream's maintenance-upgrading-services.md and the Phase 6E lessons appendix.

What updates and how

| Layer | Mechanism | Cadence | Notes |
| --- | --- | --- | --- |
| Matrix VM OS (Debian 13 trixie) | auto_updates role (unattended-upgrades, security-only, manual reboot) + needrestart | Daily, automatic | See auto-updates role. auto_updates_notify_discord: true on this host → #homelab-ops ping when a reboot is pending |
| spantaleev playbook on CT 104 | Manual git pull of /root/matrix-deploy/ | Monthly fetch + on-demand | The matrix_deploy_notifier timer fetches and notifies; you decide when to pull and apply |
| Docker images (Tuwunel, Traefik, LiveKit, …) | Manual: version-pinned in vars.yml; advance on apply | On-demand, tied to a playbook apply | Spantaleev pins image tags so the 9-container compatibility matrix doesn't drift; :latest is never used here |

Standing apply procedure (triggered by a #homelab-updates notification)

Run from a terminal on CT 104 (192.168.1.245). All commands assume root.

1. Read the upstream changelog

The Discord embed lists pending commit subjects but not the full context. Open matrix-docker-ansible-deploy CHANGELOG.md and skim entries covering the date range of the pending commits.

What to look for:

- Sections marked "Breaking change" / "Upgrading from x.y to x.z": these are why the migration-validation gate exists
- Image version bumps for components you actively use (Tuwunel, Element Call / LiveKit, Element Web)
- Removed defaults or changed variable semantics

If you see a breaking change that affects your config, plan the variable update before running the playbook — the playbook will refuse to run until you do.
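To find the date range worth skimming, the pending commits can be listed with their dates once the notifier timer (or a manual git fetch) has updated the remote-tracking branch. This is a minimal sketch, assuming the checkout tracks an upstream branch; pending_commits is a hypothetical helper name and not part of the playbook.

```shell
#!/usr/bin/env bash
# pending_commits [REPO_DIR]   (hypothetical helper, not shipped by upstream)
# List commits that are on the upstream tracking branch but not yet in the
# local checkout, oldest first, as "YYYY-MM-DD <short-hash> <subject>".
# Read-only: it neither fetches nor merges.
pending_commits() {
  local repo=${1:-/root/matrix-deploy}
  git -C "$repo" log --reverse --date=short \
    --format='%ad %h %s' 'HEAD..@{u}'
}
```

The first and last lines of the output bound the changelog entries to read. An empty result means nothing is pending.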

2. Pull the playbook + refresh galaxy roles

cd /root/matrix-deploy
git pull
rm -rf roles/galaxy
ansible-galaxy install -r requirements.yml -p roles/galaxy/ --force

Removing roles/galaxy and reinstalling with --force is the upstream-recommended pattern. The just update recipe is equivalent if you prefer; both end at the same state.

3. Acknowledge any migration breakpoint

grep matrix_playbook_migration_expected_version \
  /root/matrix-deploy/roles/custom/matrix-base/defaults/main.yml
grep matrix_playbook_migration_validated_version \
  /root/matrix-deploy/inventory/host_vars/matrix.rampancy.cloud/vars.yml

If the two values differ, the playbook will fail with a message linking to the changelog entry for the new expected version. Read the linked entry, adapt your vars if needed, then bump matrix_playbook_migration_validated_version in inventory/host_vars/matrix.rampancy.cloud/vars.yml to match.

If the values already match, you can skip this step.
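The comparison in this step can also be scripted. The sketch below assumes each variable appears as a single-line "key: value" scalar at the start of a line in its file; yaml_scalar and migration_gate are hypothetical helper names, and the file paths are passed in rather than hard-coded.

```shell
#!/usr/bin/env bash
# yaml_scalar KEY FILE
# Print the value of the first top-level "KEY: value" line, without quotes.
# Assumption: the value is a plain single-line scalar.
yaml_scalar() {
  sed -n "s/^$1:[[:space:]]*//p" "$2" | head -n 1 | tr -d "\"'"
}

# migration_gate DEFAULTS_FILE VARS_FILE
# Returns 0 if the versions match, 1 if a bump is needed,
# 2 if either value could not be read.
migration_gate() {
  local expected validated
  expected=$(yaml_scalar matrix_playbook_migration_expected_version "$1")
  validated=$(yaml_scalar matrix_playbook_migration_validated_version "$2")
  if [ -z "$expected" ] || [ -z "$validated" ]; then
    echo "could not read one of the versions" >&2
    return 2
  fi
  if [ "$expected" = "$validated" ]; then
    echo "match: $validated"
  else
    echo "bump needed: validated=$validated expected=$expected"
    return 1
  fi
}
```

Called with the defaults/main.yml and vars.yml paths from the grep commands above, a return status of 1 means the changelog entry still has to be read before bumping the validated version.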

4. Apply

cd /root/matrix-deploy
ansible-playbook -i inventory/hosts setup.yml \
  --tags=install-all,start \
  --vault-password-file /root/.vault_pass \
  2>&1 | tee /tmp/matrix-apply-$(date +%Y%m%d).log

Always pipe to tee + keep stderr

A bare pipe sends only stdout to tee, so stderr never reaches the log. The 2>&1 | tee pattern captures both streams; inside a script, add set -o pipefail so that the pipeline returns ansible-playbook's exit status rather than tee's. The Phase 6E run hit two cases where a silenced error would have masked a downstream cascade.
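For scripted runs, the same pattern can be wrapped so that the exit status survives the pipe. This is a minimal bash sketch; run_logged is a hypothetical helper name, and the subshell keeps pipefail from leaking into the calling shell.

```shell
#!/usr/bin/env bash
# run_logged LOGFILE CMD [ARGS...]   (hypothetical helper)
# Run CMD with stdout and stderr merged, copy everything to LOGFILE via tee,
# and return CMD's exit status instead of tee's.
run_logged() {
  local log=$1
  shift
  ( set -o pipefail; "$@" 2>&1 | tee "$log" )
}

# Example (not executed here):
#   run_logged "/tmp/matrix-apply-$(date +%Y%m%d).log" \
#     ansible-playbook -i inventory/hosts setup.yml \
#     --tags=install-all,start --vault-password-file /root/.vault_pass
```

Without pipefail the pipeline would report tee's status, which is almost always 0, so a failed apply would look successful to any calling script.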

install-all,start is the right tag set for routine upgrades — it's 2–5× faster than setup-all,ensure-matrix-users-created,start per spantaleev's changelog. Use setup-all,… only when you've removed a component from vars.yml (Tuwunel admins are deactivated by the spantaleev playbook on setup-all, so don't reach for it casually).

5. Verify

# From CT 104, SSH to VM 111 to check container state
ssh darcyn@192.168.1.243 \
  'sudo docker ps --filter name=matrix- --format "table {{.Names}}\t{{.Status}}"'

# Quick health probe
curl -s https://matrix.rampancy.cloud/_matrix/client/versions | jq '.versions[0]'

# Federation tester from outside (open this in a browser)
echo "https://federationtester.matrix.org/#rampancy.cloud"

All matrix-* containers should be Up (healthy) or Up <duration>. The federation tester should return all-green.
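The container check can be reduced to a pass/fail filter. The sketch below reads NAME<TAB>STATUS lines on stdin and prints only the rows that are not Up or that report (unhealthy). flag_unhealthy is a hypothetical helper name, and the sketch assumes docker ps is run without the table directive so that fields are separated by a literal tab.

```shell
#!/usr/bin/env bash
# flag_unhealthy   (hypothetical helper)
# stdin:  one "NAME<TAB>STATUS" line per container
# stdout: rows whose status does not start with "Up " or contains "(unhealthy)"
# status: 0 if every row is fine, 1 if any row was flagged
flag_unhealthy() {
  awk -F'\t' '
    $2 !~ /^Up / || $2 ~ /\(unhealthy\)/ { print; bad = 1 }
    END { exit bad }
  '
}

# Example (not executed here):
#   ssh darcyn@192.168.1.243 \
#     'sudo docker ps -a --filter name=matrix- --format "{{.Names}}\t{{.Status}}"' \
#     | flag_unhealthy
```

The example adds -a so that exited containers are listed too; plain docker ps shows only running ones and would hide a container that has stopped.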

6. Clean up

Move the apply log somewhere durable if anything notable came out of it; otherwise it's fine to leave in /tmp/ (cleared on reboot).

What if the playbook fails mid-apply

Spantaleev's setup.yml is idempotent, so re-running after fixing the reported issue is the standard recovery. The re-run starts from the top, but tasks that already converged report no change, so it effectively resumes at the point of failure.
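Before re-running, the apply log written by tee can be used to locate the task that failed. This is a minimal sketch, assuming Ansible's default output format in which task headers start with "TASK [" and failures start with "fatal: "; first_failure is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# first_failure LOGFILE   (hypothetical helper)
# Print the header of the task that was running and the first "fatal:" line.
# Returns 0 if a failure was found, 1 if the log contains none.
first_failure() {
  awk '
    /^TASK \[/ { task = $0 }                      # remember the current task
    /^fatal: / { print task; print; found = 1; exit }
    END { exit !found }
  ' "$1"
}

# Example (not executed here):
#   first_failure "/tmp/matrix-apply-$(date +%Y%m%d).log"
```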

Common failure shapes:

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| "Detected breaking change … validated version is X, expected version is Y" | New migration breakpoint upstream | Read the linked CHANGELOG entry, adapt vars, bump matrix_playbook_migration_validated_version to Y, re-run |
| Pull stalls / image not found | Registry blip or removed image tag | Re-run; if persistent, check the relevant matrix_*_version var against the image tag actually on the registry |
| M_SENDER_IGNORED in Tuwunel logs after apply | matrix_tuwunel_config_allowed_remote_server_names lost rampancy.cloud | Re-add rampancy.cloud to the allowlist (the variable's name says "remote" but the implementation applies to all senders, including the local server). See Phase 6E lessons #1 |

For other failures, check journalctl -u matrix-tuwunel -f on VM 111 (ssh darcyn@192.168.1.243) and consult the matrix-setup runbook's Lessons appendix before going wider.

Skipping a month

Skipping a #homelab-updates notification is fine if you've read the changelog and there's nothing security-relevant. The notification will fire again next month with the cumulative pending list. Because the timer is fetch-only, skipping changes nothing on the deployed server; only the list of pending commits grows.

If you're going to skip for more than 3 months, consider working through the changelog deliberately rather than letting it pile up. Large jumps raise the chance of hitting several migration breakpoints in one apply, and each is easier to handle in isolation.