Matrix setup: Phase 6E
Stand up a closed, federated Matrix homeserver on a new VM (matrix, VM 111, 192.168.1.243) using spantaleev/matrix-docker-ansible-deploy with Tuwunel as the homeserver and LiveKit-backed Element Call for group voice/video. The existing CT 107 Caddy + CrowdSec edge fronts everything; the apex domain serves the Matrix .well-known files for federation delegation.
Not started — scoping reference
This runbook is the execution plan written ahead of time. Hold points and commands will be exercised once the work begins; the Lessons from the run appendix lands on close-out.
Stages
| Stage | Scope | Hold points |
|---|---|---|
| 6E.1 | VM stand-up + base roles (common / security / docker / beszel_agent) | After qm create; after first boot; before declaring base host green |
| 6E.2 | spantaleev playbook bootstrap + Tuwunel install | After just update; after install-all first run; before opening any external ports |
| 6E.3 | Edge Caddy integration + apex .well-known delegation | Before flipping DNS; before federation tester |
| 6E.4 | MatrixRTC: LiveKit + lk-jwt-service + UDM port-forwards | Before opening UDM ports (irreversible-ish — touches router config); before first cellular call test |
| 6E.5 | Admin user + smoke tests | Before declaring done |
| 6E.6 | Docs sweep | None high-risk |
Cross-phase decisions
- Homeserver: Tuwunel. Embedded RocksDB; no Postgres. Lighter than Synapse under our pre-4B 19 GiB free constraint.
- Deploy: spantaleev playbook vendored on CT 104. Cloned at ~/matrix-deploy/, separate from ~/homelab-ansible/. Drift detection skips this host; --check runs are manual.
- Edge: external Caddy. Playbook in "external reverse proxy" mode (traefik_config_entrypoint_web_secure_enabled: false, Traefik bound to 0.0.0.0:81 and 0.0.0.0:8449). CT 107 Caddy terminates TLS for matrix.rampancy.cloud.
  Two HTTP layers, by design (not "Caddy everywhere")
  Two HTTP routers are involved, and they sit at different layers:
  - CT 107 Caddy (edge): terminates TLS, applies CrowdSec, fans *.rampancy.cloud out to the right backend. This is what "Caddy everywhere" in the homelab refers to.
  - VM 111 Traefik (internal, bundled by spantaleev): plain HTTP on port 81, routes between the playbook's containers (Tuwunel, lk-jwt-service, LiveKit, matrix-static-files).
  Traefik is mandatory in spantaleev's playbook as of their Jan 2024 consolidation: nginx-proxy was retired and there's no Caddy-as-internal-router option. The only valid values for matrix_playbook_reverse_proxy_type are playbook-managed-traefik and other-traefik-container; a "no internal proxy" mode exists but is documented as broken for addons. Treat the bundled Traefik as a sealed-box internal router: we don't write its config, we don't reason about it daily, we only interact with its bind port. Comparable to never thinking about whatever HTTP server n8n uses internally.
- Federation: well-known delegation, all traffic on 443. Apex rampancy.cloud serves /.well-known/matrix/server and /.well-known/matrix/client. Federation port 8448 is not opened.
- MatrixRTC: LiveKit only, no Element Call frontend. Clients embed the RTC UI internally. Five port/range UDM forwards: 7881/tcp + 7882/udp (ICE) + 3479/udp + 5350/tcp (TURN) + 30000-30020/udp (relay).
- Registration: token only. vault_matrix_registration_token rotated post-bootstrap. First registered account is admin.
- Backup: PBS daily VM snapshot picks it up automatically once VM 111 lands. No file-level backup configured; RocksDB lives entirely on rpool, so the VM-level snapshot is the source of truth.
Pre-flight gates
- VM 111 ID free on proxfold (qm list shows no 111)
- 192.168.1.243 free (not in inventory; no DHCP lease)
- proxfold ≥ 10 GiB free RAM (free -h on proxfold, accounting for the 8 GiB VM allocation)
- DNS: matrix.rampancy.cloud A/AAAA via Cloudflare to home WAN; wildcard *.rampancy.cloud LE cert already covers it (no new cert needed)
- Vault keys staged in homelab-ansible/inventory/group_vars/all/vault.yml:
  - vault_matrix_registration_token (32 bytes, hex-encoded)
  - vault_matrix_admin_password (kept for reference only; on Tuwunel the password is set at client registration, see 6E.5)
  - vault_matrix_homeserver_generic_secret_key (32 bytes, hex-encoded; master seed from which the LiveKit and lk-jwt-service credentials are derived, replacing the originally planned vault_livekit_secret, see 6E.2)
- UDM port-forward plan reviewed (5 forwards, all → 192.168.1.243; mix of TCP and UDP; one UDP range of 21 ports)
- Read feedback_never_view_vault_to_scrollback.md memory pattern; secrets are written via redirected pipes, never echoed
Stage 6E.1: VM 111 stand-up
Provision the VM on proxfold
Mirrors the VM 108 (n8n) precedent (qm config 108 is authoritative); the only deliberate deviation is 4 cores versus n8n's 2:
# on proxfold
qm create 111 \
--name matrix \
--memory 8192 --cores 4 --sockets 1 --cpu host \
--net0 virtio,bridge=vmbr0 \
--scsihw virtio-scsi-pci \
--boot order=scsi0 \
--agent enabled=1 \
--ostype l26 \
--onboot 1 \
--serial0 socket --vga serial0 \
--nameserver "1.1.1.1 9.9.9.9" --searchdomain local \
--ipconfig0 ip=192.168.1.243/24,gw=192.168.1.1 \
--ciuser darcyn --sshkeys /root/.ssh/authorized_keys
# Debian 13 cloud image (already at /var/lib/vz/template/iso/ from prior provisioning)
qm importdisk 111 /var/lib/vz/template/iso/debian-13-genericcloud-amd64.qcow2 local-zfs
qm set 111 --scsi0 local-zfs:vm-111-disk-0,discard=on,size=32G
qm set 111 --ide2 local-zfs:cloudinit
# IMPORTANT: --scsi0 size= on initial attach does NOT grow the disk;
# the imported qcow2 is 3 GiB. Explicit resize is required.
qm resize 111 scsi0 32G
qm start 111
qm set --scsi0 size= does not resize on initial attach
The disk lands at the qcow2 base size (3 GiB for the genericcloud image). qm config 111 will show size=32G because that's what was requested, but lsblk inside the guest will show 3 GiB until you run qm resize. Caught during the 2026-05-21 stand-up. Always run the explicit qm resize step.
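The gap between the requested size and the size the guest actually sees can be checked mechanically. The following is a minimal sketch, not part of the original run: the helper name disk_at_least_gib is hypothetical, and /dev/sda as the guest's root disk under virtio-scsi is an assumption to confirm with lsblk.

```shell
# Hypothetical helper: succeed only if a byte count is at least N GiB.
# Usage: disk_at_least_gib <size_in_bytes> <expected_gib>
disk_at_least_gib() {
  size_bytes=$1
  expected_gib=$2
  # 1 GiB = 1024^3 bytes
  [ "$size_bytes" -ge $((expected_gib * 1024 * 1024 * 1024)) ]
}

# Example (/dev/sda is an assumed device name inside the guest):
#   bytes=$(ssh darcyn@192.168.1.243 'lsblk -b -d -n -o SIZE /dev/sda | tr -d " "')
#   disk_at_least_gib "$bytes" 32 || echo "disk still at base size: run qm resize"
```

Comparing bytes rather than the human-readable lsblk column avoids rounding ambiguity between the 3 GiB base image and the 32 GiB target.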
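Hold point: VM reachable + cloud-init applied
qm guest exec 111 -- hostname # → matrix (needs qemu-guest-agent inside the guest, which no role installs; if it errors, rely on the SSH checks)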
ssh darcyn@192.168.1.243 'hostname; ip a show eth0' # confirm hostname + .243 static
ssh darcyn@192.168.1.243 'sudo apt list --installed 2>/dev/null | wc -l' # cloud-init done
Add to homelab-ansible inventory + apply base roles
Add to inventory/hosts.yml under virtual_machines:
A dedicated playbooks/matrix.yml (parallel to playbooks/n8n.yml) applies common + security + docker + beszel_agent. No hawser — matrix is not Dockhand-managed. The fleet-wide auto-updates.yml picks up the new host automatically via hosts: all.
A per-host inventory/host_vars/matrix.yml carries the storage-driver: overlay2 pin pre-emptively, mirroring n8n's workaround for the Docker 29 containerd-snapshotter extraction race (n8n host_vars + ddev/ddev#8136).
From CT 104 (per the drift-checks-from-CT104 rule):
cd ~/homelab-ansible && git pull --ff-only
ansible-playbook playbooks/matrix.yml --check --diff --limit matrix # preview
ansible-playbook playbooks/matrix.yml --limit matrix # apply
ansible-playbook playbooks/auto-updates.yml --limit matrix # fleet policy
ansible-playbook playbooks/matrix.yml --check --diff --limit matrix # idempotency: expect changed=0
--check is expected to fail on first apply
On a fresh Debian host, the docker role's --check preview can't simulate apt cache state after a brand-new repo source lands; the docker-ce package install task reports No package matching 'docker-ce' is available. This is a check-mode artifact (the new .sources file isn't actually written in check mode, so the simulated apt cache never learns about the docker.com repo). Run the full apply; it works. Same pattern hit during Forgejo CT 109 stand-up.
Hold point: base host healthy
ssh darcyn@192.168.1.243 'sudo docker info' # docker daemon up, storage-driver: overlay2
ssh darcyn@192.168.1.243 'systemctl is-active beszel-agent' # active
ssh darcyn@192.168.1.243 'systemctl is-enabled unattended-upgrades' # enabled
Beszel hub pairing is manual
The beszel_agent role installs and starts the agent but does NOT pair it to the hub. The agent's first-boot log shows WARN must set TOKEN or TOKEN_FILE — expected for fresh hosts. Pair via the Beszel hub UI at http://192.168.1.247:8090: Systems → Add System, generate the SSH-key pairing token, drop it into matrix's /etc/beszel-agent/env (or systemd drop-in), restart the agent. Same procedure used for every new host since Phase 5B.
qemu-guest-agent is unmanaged across the homelab
agent: enabled=1 in qm config is the host side; the guest package isn't installed by any role. VM 108 (n8n) is in the same state. This means host-side qm guest exec and IP reporting via qga don't work; SSH is the only management path. Pre-existing gap, not a 6E regression.
Stage 6E.2: spantaleev playbook bootstrap
Clone and scaffold inventory
On CT 104, kept separate from ~/homelab-ansible/:
cd ~
git clone https://github.com/spantaleev/matrix-docker-ansible-deploy.git matrix-deploy
cd ~/matrix-deploy
mkdir -p inventory/host_vars/matrix.rampancy.cloud
cp examples/vars.yml inventory/host_vars/matrix.rampancy.cloud/vars.yml
cp examples/hosts inventory/hosts
Edit inventory/hosts
[matrix_servers]
matrix.rampancy.cloud ansible_host=192.168.1.243 ansible_user=darcyn ansible_become=true
Vault keys staged in homelab-ansible
Three keys consumed by spantaleev via the vault-bridge symlink (see below):
| Vault key | Format | Consumer |
|---|---|---|
| vault_matrix_registration_token | 32-byte hex (64 chars) | Tuwunel registration_token for invite-style registration |
| vault_matrix_admin_password | 48-char URL-safe base64 | Reference only; Tuwunel doesn't use spantaleev's register-user flow |
| vault_matrix_homeserver_generic_secret_key | 32-byte hex (64 chars) | Master seed: spantaleev derives all per-service secrets (LiveKit, MAS, bridge tokens, postgres password) via SHA-512 hashing |
Generate via the append-only pattern from [[feedback_never_view_vault_to_scrollback]]. Do NOT generate a vault_livekit_secret — current spantaleev auto-derives LiveKit credentials from the master seed.
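For the two formats in the table, values of the right shape can be produced as sketched below. This is an illustration under assumptions: the helper names are hypothetical, and the append-only vault workflow referenced above is not reproduced here. The only assumption about that workflow is that it accepts the value on a pipe or redirect, so nothing is printed to a recorded terminal.

```shell
# Hypothetical helpers; both read /dev/urandom and write one line to stdout.
# Always redirect or pipe the output; never run them bare in a logged session.

# 32 random bytes, hex-encoded (64 characters).
gen_hex_secret() {
  head -c 32 /dev/urandom | od -An -v -tx1 | tr -d ' \n'
  echo
}

# 36 random bytes as URL-safe base64 (48 characters; no padding appears
# because 36 is a multiple of 3).
gen_urlsafe_password() {
  head -c 36 /dev/urandom | base64 | tr -d '\n' | tr '+/' '-_'
  echo
}
```

Only coreutils are used, so the sketch does not depend on openssl being present on CT 104.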
Bridge homelab-ansible vault into matrix-deploy inventory
# on CT 104
mkdir -p ~/matrix-deploy/inventory/group_vars/matrix_servers
ln -s /root/homelab-ansible/inventory/group_vars/all/vault.yml \
~/matrix-deploy/inventory/group_vars/matrix_servers/vault.yml
Spantaleev's inventory now auto-loads our vault for the matrix_servers group. The symlink survives ansible-vault encrypt --output rewrites (the file path is preserved across rewrites).
Edit inventory/host_vars/matrix.rampancy.cloud/vars.yml
Key settings (full file is large; only the deltas from examples/vars.yml listed here):
matrix_domain: rampancy.cloud
# Pin to v1.7.0 explicitly: spantaleev's default tracks the project's release
# tag, but v1.6.2 has a UIAA regression on the registration_token + password
# flow (matrix-construct/tuwunel#459) that corrupts user state silently.
# v1.7.0 fixes that and adds MSC4222 state_after support.
matrix_tuwunel_version: v1.7.0
# Homeserver: Tuwunel
matrix_homeserver_implementation: tuwunel
# Master seed — spantaleev derives per-service secrets (LiveKit, MAS, bridge
# tokens, postgres conn password) from this via SHA-512 hashing. One secret
# to manage; the dependent services auto-wire.
matrix_homeserver_generic_secret_key: "{{ vault_matrix_homeserver_generic_secret_key }}"
# Postgres connection password derived from the same master seed (matches
# spantaleev's own group_vars/matrix_servers pattern for MAS/bridge db creds).
# Tuwunel itself doesn't use Postgres (RocksDB), but spantaleev still requires
# this be set non-empty for the standby Postgres service install.
postgres_connection_password: "{{ (matrix_homeserver_generic_secret_key + ':postgres.connection') | hash('sha512') | to_uuid }}"
# Registration: token-gated. allow_registration MUST be `true` for the token
# flow to be honored — Tuwunel ignores registration_token when allow_registration
# is false (registration is then closed outright, not invite-only).
matrix_tuwunel_config_allow_registration: true
matrix_tuwunel_config_registration_token: "{{ vault_matrix_registration_token }}"
# Federation peer allowlist — CRITICAL GOTCHA: must include OUR OWN server name.
# Tuwunel's `allowed_remote_server_names_experimental` (the underlying config
# key) applies to every event's sender server, not just remote. Omitting our
# own server causes M_SENDER_IGNORED on every locally-sent event — PUT returns
# 200 + event_id but /messages, /sync, /event/{id} all filter the events out.
# See "Lessons from the 2026-05-21/22 run" appendix.
matrix_tuwunel_config_allowed_remote_server_names:
- rampancy.cloud # MUST be present — see gotcha above
- matrix.org
- chat.dacool.zone
# Federation on port 443 via well-known delegation, not port 8448. Setting
# this equal to traefik_config_entrypoint_web_secure_port (443) tells spantaleev
# to disable the dedicated federation Traefik entrypoint — federation routes
# through the web entrypoint. See group_vars/matrix_servers:53 for the conditional.
matrix_federation_public_port: 443
# External reverse proxy mode — CT 107 Caddy terminates TLS upstream of the
# bundled Traefik. See "Two HTTP layers" callout above for layering rationale.
matrix_playbook_reverse_proxy_type: playbook-managed-traefik
traefik_config_entrypoint_web_secure_enabled: false
traefik_container_web_host_bind_port: '0.0.0.0:81'
# MatrixRTC stack — server + jwt-service auto-enable from matrix_rtc_enabled.
# Per-service secrets auto-derive from matrix_homeserver_generic_secret_key
# (see group_vars/matrix_servers:6337-6339). No explicit livekit secret var needed.
matrix_rtc_enabled: true
# Static files served from matrix.rampancy.cloud directly; apex serves the
# .well-known/matrix/* JSON via edge Caddy (see 6E.3).
matrix_static_files_container_labels_base_domain_enabled: false
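Because omitting our own server name from the allowlist fails silently, a pre-flight check on vars.yml may be worth running before every playbook run. The following is a rough sketch with a hypothetical function name. It assumes matrix_domain and the list entries are unquoted plain scalars in block style, as in the excerpt above, and it is not a general YAML parser.

```shell
# Hypothetical check: succeed only if the value of matrix_domain also appears
# in matrix_tuwunel_config_allowed_remote_server_names.
# Returns 0 if present, 1 if missing, 2 if matrix_domain cannot be found.
check_allowlist_includes_self() {
  vars_file=$1
  domain=$(awk '$1 == "matrix_domain:" { print $2; exit }' "$vars_file")
  [ -n "$domain" ] || { echo "matrix_domain not found" >&2; return 2; }
  awk -v want="$domain" '
    /^matrix_tuwunel_config_allowed_remote_server_names:/ { in_list = 1; next }
    in_list && /^[ \t]*-[ \t]/ { if ($2 == want) found = 1; next }  # list item
    in_list && /^[ \t]*(#.*)?$/ { next }                            # blank or comment line
    in_list { in_list = 0 }                                         # next key ends the list
    END { exit(found ? 0 : 1) }
  ' "$vars_file"
}

# Example:
#   check_allowlist_includes_self inventory/host_vars/matrix.rampancy.cloud/vars.yml \
#     || echo "own server name missing from allowlist"
```

A quoted or flow-style list would need a real YAML tool; the sketch only covers the layout used in this runbook.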
Var names drift across playbook releases
The exact spellings above are valid for spantaleev as of the v1.7.0 cut (2026-05-21). The matrix_rtc_livekit_* family was collapsed into auto-derivation between Jan and May 2026; the runbook's pre-2026-05-21 spelling no longer matches. Consult ~/matrix-deploy/docs/configuring-playbook-{tuwunel,matrix-rtc,own-webserver}.md at execution time before flipping any value.
Pull spantaleev's role dependencies
just isn't packaged for Debian bookworm (CT 104's distribution); rather than add a non-apt binary just for one helper command, run its equivalent directly:
cd ~/matrix-deploy
rm -rf roles/galaxy
ansible-galaxy install -r requirements.yml -p roles/galaxy/ --force
First playbook run: install only
Split the run: do install-all first (creates configs + systemd units, pulls images), hold and verify, then run start.
cd ~/matrix-deploy
set -o pipefail
ansible-playbook -i inventory/hosts setup.yml \
--tags=install-all \
--vault-password-file /root/.vault_pass 2>&1 | tee install-all.log
Without set -o pipefail, a failed playbook will exit 0 because the trailing tee succeeds. Verify the final PLAY RECAP line shows failed=0 regardless of exit code.
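The recap check can be scripted against the tee'd log so that it does not depend on the pipeline's exit code. This is a minimal sketch with a hypothetical function name. It assumes the standard Ansible recap layout (a PLAY RECAP line followed by one line per host with key=value counters) and uncoloured output in the log.

```shell
# Hypothetical helper: succeed only if the log contains a PLAY RECAP section
# and every host line in it reports failed=0 and unreachable=0.
recap_ok() {
  log_file=$1
  awk '
    /^PLAY RECAP/ { in_recap = 1; next }
    in_recap && /failed=/ {
      hosts++
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^(failed|unreachable)=/) {
          split($i, kv, "=")
          if (kv[2] + 0 > 0) bad = 1
        }
      }
    }
    END { exit((hosts > 0 && !bad) ? 0 : 1) }
  ' "$log_file"
}

# Example:
#   recap_ok install-all.log || echo "install-all did not finish cleanly"
```

A log with no recap at all (for example, a run that died on a syntax error) is treated as a failure rather than a pass.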
Hold point: install complete, services not yet started
ssh darcyn@192.168.1.243 'ls /etc/systemd/system/matrix-*.service | wc -l'
# Expect: 9 (tuwunel, traefik, livekit-server, livekit-jwt-service,
# static-files, postgres, client-element, container-socket-proxy, exim-relay)
ssh darcyn@192.168.1.243 'sudo -n docker images --format "{{.Repository}}:{{.Tag}}" | grep -E "matrix|livekit|traefik|postgres"'
Start the stack
cd ~/matrix-deploy
set -o pipefail
ansible-playbook -i inventory/hosts setup.yml \
--tags=start \
--vault-password-file /root/.vault_pass 2>&1 | tee start.log
ensure-matrix-users-created tag does NOT work on Tuwunel
The runbook spec's earlier version chained install-all,ensure-matrix-users-created,start. On Tuwunel, the user-creator role is skipped silently (it's Synapse-specific). First admin user is created instead via the registration token in any Matrix client (see 6E.5). The vault_matrix_admin_password key in our vault is informational/reference only — it never reaches Tuwunel.
Hold point: services up on VM 111
ssh darcyn@192.168.1.243 'sudo docker ps --format "table {{.Names}}\t{{.Status}}"'
# Expect 9 containers, all "Up X minutes"
ssh darcyn@192.168.1.243 'curl -s -H "Host: matrix.rampancy.cloud" http://127.0.0.1:81/_matrix/client/versions | jq ".versions"'
# Expect: ["r0.0.1", ..., "v1.15", ...]
Do not open external ports yet — federation hasn't been routed and the well-known files aren't yet served from the apex.
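The "9 containers, all Up" expectation from this hold point can be turned into a pass/fail check. A small sketch follows, with a hypothetical helper name. It assumes docker ps is run with a tab-separated Names/Status format (no table header) and with -a, so that exited containers are listed instead of silently missing.

```shell
# Hypothetical helper: read "name<TAB>status" lines on stdin; succeed only if
# exactly <expected> containers are listed and every status starts with "Up ".
all_containers_up() {
  expected=$1
  awk -F '\t' -v want="$expected" '
    NF >= 2 {
      total++
      if ($2 !~ /^Up /) { print "not up: " $1 " (" $2 ")"; bad = 1 }
    }
    END {
      if (total != want) { print "expected " want " containers, saw " (total + 0); bad = 1 }
      exit(bad ? 1 : 0)
    }
  '
}

# Example:
#   ssh darcyn@192.168.1.243 'sudo docker ps -a --format "{{.Names}}\t{{.Status}}"' \
#     | all_containers_up 9
```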
Stage 6E.3: Edge integration + apex well-known delegation
The caddy role's data model uses a flat caddy_proxy_hosts list (one upstream per host). Matrix needs both an apex site block (well-known statics) and a matrix.rampancy.cloud block (single upstream, but path-routed so the federation path skips CrowdSec). Both are added as a one-off via the role's template, gated by caddy_matrix_enabled.
Role additions (roles/caddy/)
roles/caddy/defaults/main.yml — add the gate + upstream variables:
roles/caddy/templates/Caddyfile.j2 — append after the existing *.{{ caddy_zone }} { ... } block:
{% if caddy_matrix_enabled | default(false) %}
# Phase 6E apex: serve federation well-known delegation. Separate cert from
# the wildcard, issued via the same Cloudflare DNS-01 path.
{{ caddy_zone }} {
{% if caddy_crowdsec_enabled %}
log {
output stdout
format json
}
{% endif %}
@well_known_matrix_server path /.well-known/matrix/server
handle @well_known_matrix_server {
header Content-Type application/json
header Access-Control-Allow-Origin *
respond `{"m.server":"matrix.{{ caddy_zone }}:443"}` 200
}
@well_known_matrix_client path /.well-known/matrix/client
handle @well_known_matrix_client {
header Content-Type application/json
header Access-Control-Allow-Origin *
respond `{"m.homeserver":{"base_url":"https://matrix.{{ caddy_zone }}"}}` 200
}
handle {
respond "{{ caddy_zone }}" 200
}
}
# Phase 6E matrix subdomain: client + federation share one upstream (VM 111's
# Traefik web entrypoint, host-bind :81). CrowdSec on client path only; the
# federation path skips it because Tuwunel's allowed_remote_server_names
# already gates which homeservers can federate.
matrix.{{ caddy_zone }} {
{% if caddy_crowdsec_enabled %}
log {
output stdout
format json
}
{% endif %}
@federation path /_matrix/federation/*
handle @federation {
reverse_proxy http://{{ caddy_matrix_upstream }}
}
handle {
{% if caddy_crowdsec_enabled %}
crowdsec
{% endif %}
reverse_proxy http://{{ caddy_matrix_upstream }}
}
}
{% endif %}
Edge host_vars
inventory/host_vars/edge.yml — enable:
Single upstream because matrix_federation_public_port: 443 collapses Traefik's federation entrypoint into the web entrypoint inside the VM (see 6E.2).
Apply
Caddy auto-provisions a new LE cert for the apex rampancy.cloud via the existing CF DNS-01 path (the wildcard cert doesn't cover apex). Cert issuance is async — usually visible within ~30s.
Apex cert is a second LE cert, not a SAN on the wildcard
Caddy issues rampancy.cloud and *.rampancy.cloud as two distinct certs, both renewed independently by Caddy's per-site renewal logic. Both use the same CF API token. No CF token-scope change needed; no LE rate-limit concerns at our usage.
Hold point: DNS + delegation working
# From CT 104 (or anywhere with public DNS resolution)
curl -s https://rampancy.cloud/.well-known/matrix/server | jq .
# Expect: { "m.server": "matrix.rampancy.cloud:443" }
curl -s https://rampancy.cloud/.well-known/matrix/client | jq .
# Expect: { "m.homeserver": { "base_url": "https://matrix.rampancy.cloud" } }
curl -s https://matrix.rampancy.cloud/_matrix/federation/v1/version | jq .
# Expect: Tuwunel version block, served via Caddy
curl -s 'https://federationtester.matrix.org/api/report?server_name=rampancy.cloud' | jq .FederationOK
# Expect: true
If federation tester is unhappy, stop here — no point opening RTC ports until the homeserver itself federates cleanly.
Stage 6E.4: MatrixRTC + UDM port-forwards
Executed 2026-05-22
Five UDM port-forwards configured via UI (manual; no homelab automation for UDM port-forwards). Element Call validated end-to-end: desktop ↔ Element X mobile over cellular, audio + video + screen-share all working. One configuration gotcha caught and fixed (apex well-known missing rtc_foci).
UDM-Pro port-forward configuration
Five forwards, all destination 192.168.1.243. Edited in the UDM UI (no automation in the homelab yet — track manually in reference/config-files.md).
| Name | WAN port | Protocol | LAN target |
|---|---|---|---|
| matrix-rtc-ice-tcp | 7881 | TCP | 192.168.1.243:7881 |
| matrix-rtc-ice-udp-mux | 7882 | UDP | 192.168.1.243:7882 |
| matrix-rtc-turn-udp | 3479 | UDP | 192.168.1.243:3479 |
| matrix-rtc-turn-tcp | 5350 | TCP | 192.168.1.243:5350 |
| matrix-rtc-turn-relay | 30000-30020 | UDP | 192.168.1.243:30000-30020 |
Bypasses edge LXC and CrowdSec
These forwards reach the matrix VM directly. No CrowdSec coverage on the RTC ports. LiveKit's signed JWT auth is the only gate. Accept and record in accepted-risks.
Lessons from the 2026-05-22 RTC run
Apex .well-known/matrix/client was missing org.matrix.msc4143.rtc_foci — first Element Call attempt errored MISSING_MATRIX_RTC_TRANSPORT despite the forwards being correct and LiveKit being healthy.
Cause: Element Call resolves the user's MXID @<user>:rampancy.cloud, then queries https://rampancy.cloud/.well-known/matrix/client (the apex, not the matrix subdomain). Our apex well-known was a static respond in the Caddy template that only emitted m.homeserver. The matrix subdomain's well-known (generated by the playbook's matrix-static-files service) had the full rtc_foci field, but Element Call never queries it; client config discovery stops at the apex.
Fix: extend the Caddy template's apex well_known_matrix_client handler to also emit rtc_foci:
respond `{"m.homeserver":{"base_url":"https://matrix.{{ caddy_zone }}"},"org.matrix.msc4143.rtc_foci":[{"livekit_service_url":"https://matrix.{{ caddy_zone }}/livekit-jwt-service","type":"livekit"}]}` 200
This is now baked into roles/caddy/templates/Caddyfile.j2 post-2026-05-22 — any future Phase 6E rebuild gets it automatically. The matrix subdomain's well-known is still served (by matrix-static-files inside VM 111) for any client that does its own homeserver-side discovery, but Element Call doesn't.
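To catch a regression of this gotcha early, the apex response can be checked for the rtc_foci entry before any call test. The sketch below is a deliberately crude textual check with a hypothetical function name; it greps for the expected keys rather than parsing JSON, so a jq-based check would be more robust if jq is available.

```shell
# Hypothetical helper: read a .well-known/matrix/client body on stdin and
# succeed only if it advertises a LiveKit focus over HTTPS.
has_rtc_foci() {
  body=$(cat)
  printf '%s' "$body" | grep -q '"org.matrix.msc4143.rtc_foci"' || return 1
  printf '%s' "$body" | grep -Eq '"type"[[:space:]]*:[[:space:]]*"livekit"' || return 1
  printf '%s' "$body" | grep -Eq '"livekit_service_url"[[:space:]]*:[[:space:]]*"https://' || return 1
}

# Example:
#   curl -s https://rampancy.cloud/.well-known/matrix/client | has_rtc_foci \
#     || echo "apex well-known is missing rtc_foci"
```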
Hold point: RTC reachable from outside
From cellular (per the hairpin-NAT memory — LAN-side test is misleading because UDM rewrites source IPs on hairpin):
The cheapest practical validation is just running Element Call end-to-end: desktop (Element Web, OK from LAN since client traffic goes via TLS/443 through edge Caddy) ↔ Element X mobile on cellular data (WiFi off so the device's path actually exercises the port-forwards). If video + audio + screen-share work both ways, the entire LiveKit + lk-jwt-service + 5-forward stack is validated.
Optional per-port probing from a non-LAN box if you have one available:
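One way to approach per-port probing is to expand the forward table into a flat port list and probe the TCP entries with nc. This is a sketch under assumptions: the function names are hypothetical, the WAN hostname is whatever resolves to the home WAN address, and only TCP is probed because a connectionless UDP probe with nc gives no reliable open/closed answer.

```shell
# Forward plan copied from the UDM table: "proto first-port last-port".
forward_plan() {
  cat <<'EOF'
tcp 7881 7881
udp 7882 7882
udp 3479 3479
tcp 5350 5350
udp 30000 30020
EOF
}

# Expand the plan into one "proto port" line per forwarded port.
expand_forwards() {
  forward_plan | while read -r proto first last; do
    port=$first
    while [ "$port" -le "$last" ]; do
      echo "$proto $port"
      port=$((port + 1))
    done
  done
}

# Probe only the TCP forwards; must be run from outside the LAN.
probe_tcp_forwards() {
  wan_host=$1
  expand_forwards | while read -r proto port; do
    [ "$proto" = tcp ] || continue
    if nc -z -w 3 "$wan_host" "$port" </dev/null; then
      echo "open   tcp/$port"
    else
      echo "closed tcp/$port"
    fi
  done
}
```

The expansion yields 25 ports in total (2 TCP, 23 UDP), which matches the five forwards with one 21-port UDP range. The UDP side is still best validated by the end-to-end call described above.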
Stage 6E.5: Admin user + smoke tests
Register first user (becomes admin automatically)
Tuwunel doesn't use spantaleev's user-creator role. Registration is via a Matrix client using the registration token.
- Retrieve the registration token in your own terminal (NOT via tool output, per [[feedback_never_view_vault_to_scrollback]]):
- In any Matrix client that supports token-gated registration (Element Web at https://app.element.io, Element X mobile):
  - Edit the homeserver to rampancy.cloud
  - Create account, enter the registration token when prompted
- The first registered user is automatically promoted to admin (matrix_tuwunel_config_grant_admin_to_first_user: true is the spantaleev default) and auto-invited to a room aliased #admins:rampancy.cloud. The room contains a server bot @conduit:rampancy.cloud (legacy localpart from Tuwunel's Conduit lineage).
Promoting later users to admin
grant_admin_to_first_user only fires for the first user with no other users in the DB. If you ever need to deactivate the original admin and promote a replacement, do it via Tuwunel CLI (the in-room bot can't because only existing admins can drive it):
# on VM 111
sudo systemctl stop matrix-tuwunel
sudo docker run --rm \
-v /matrix/tuwunel/data:/var/lib/tuwunel \
-v /matrix/tuwunel/config:/etc/tuwunel:ro \
-e TUWUNEL_CONFIG=/etc/tuwunel/tuwunel.toml \
ghcr.io/matrix-construct/tuwunel:v1.7.0 \
--execute "users make-user-admin @<username>:rampancy.cloud"
sudo systemctl start matrix-tuwunel
Avoid --execute "rooms list": it hangs indefinitely (observed 2026-05-21).
Admin commands
!admin <command> in the admin room. Available top-level subcommands (from --execute help): appservices, users, rooms, federation, server, media, debug, query, token, help. The appservices plural form is what works, despite source-tree dirs being singular appservice/.
!admin -h prints the full clap help in the room (provided you're in the actual admin room — the bot silently drops any !admin outside #admins:<server_name>).
Smoke checklist
- Log in to https://app.element.io against https://matrix.rampancy.cloud (well-known auto-redirects from apex)
- Verify the room timeline persists across hard refresh; if "messages not sent" red banners show on successful PUTs, check that rampancy.cloud is in matrix_tuwunel_config_allowed_remote_server_names (see Lessons appendix)
- Send a curl PUT message and confirm it appears in /messages via the access-token endpoint
- Create a test room
- Join
#matrix-spec:matrix.org— confirms outbound federation works - Verify E2EE: enable encryption on a separate test room (NOT the admin room — bots don't decrypt E2EE), send a message, decrypt on a second device
- Start an Element Call group session in the test room from desktop, join from Element X mobile (over cellular, not LAN) — confirms SFU media flow
- Run federationtester.matrix.org one more time post-RTC — should still be green
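For the curl PUT item in the checklist, the main pitfall is that room IDs contain ! and :, which need percent-encoding in the URL path. The following sketch builds the two client-server API URLs involved. The helper names are hypothetical, and MATRIX_TOKEN and ROOM_ID in the usage comments are assumed environment variables set in your own terminal, never echoed.

```shell
# Percent-encode everything except RFC 3986 unreserved characters.
# Intended for ASCII input such as room IDs and transaction IDs.
urlencode() {
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}          # string without its first character
    c=${s%"$rest"}       # the first character itself
    case $c in
      [A-Za-z0-9._~-]) out="$out$c" ;;
      *) out="$out$(printf '%%%02X' "'$c")" ;;
    esac
    s=$rest
  done
  printf '%s\n' "$out"
}

# URL for PUT /_matrix/client/v3/rooms/{roomId}/send/m.room.message/{txnId}
send_url() {
  printf '%s/_matrix/client/v3/rooms/%s/send/m.room.message/%s\n' \
    "$1" "$(urlencode "$2")" "$(urlencode "$3")"
}

# URL for GET /_matrix/client/v3/rooms/{roomId}/messages (newest first)
messages_url() {
  printf '%s/_matrix/client/v3/rooms/%s/messages?dir=b&limit=5\n' \
    "$1" "$(urlencode "$2")"
}

# Usage:
#   curl -s -X PUT -H "Authorization: Bearer $MATRIX_TOKEN" \
#     -H 'Content-Type: application/json' \
#     -d '{"msgtype":"m.text","body":"smoke test"}' \
#     "$(send_url https://matrix.rampancy.cloud "$ROOM_ID" "smoke-$(date +%s)")"
#   curl -s -H "Authorization: Bearer $MATRIX_TOKEN" \
#     "$(messages_url https://matrix.rampancy.cloud "$ROOM_ID")"
```

If the PUT returns an event_id but the message is absent from the /messages response, that is the allowlist symptom described in the Lessons appendix.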
Stage 6E.6: Docs sweep
- services/matrix.md: new service page (Tuwunel + LiveKit + lk-jwt; LAN URL, public URL, registration token rotation)
- roadmap.md: flip 6E to !!! success "Completed YYYY-MM-DD" with - [x] items mirroring actual execution (not the original plan)
- changelog.md: 1–2 line pointer entry (lessons live here in this runbook, not in changelog)
- hosts/proxfold/index.md: guests table extended with VM 111
- reference/accepted-risks.md: RTC ports bypass CrowdSec; drift detection skips matrix VM
- reference/config-files.md: UDM port-forward list captured
- mkdocs.yml: services/matrix.md nav entry
- Lessons from the run appendix appended below
Lessons from the 2026-05-21/22 run
Phase 6E.1 (VM stand-up) executed cleanly. 6E.2 + 6E.3 closed the same day but the cumulative state took a second session (2026-05-22 morning) to actually surface working messages. Lessons in rough order of cost:
1. matrix_tuwunel_config_allowed_remote_server_names filters our OWN server
This cost ~8 hours of diagnosis before being found via direct /event/{event_id} lookup returning errcode: M_SENDER_IGNORED, sender: @<self>:<server>.
Tuwunel's allowed_remote_server_names_experimental config option (which our matrix_tuwunel_config_allowed_remote_server_names spantaleev var maps to) is documented as a remote-federation allowlist. Its implementation in src/core/config/net.rs::is_forbidden_remote_server_name applies the check to every event's sender's server, including the local homeserver. With our own server name omitted, is_ignored_pdu returns true for all locally-sent events. They persist in the DB but every read query (/messages, /sync, /event/{id}) filters them out as ignored.
Symptom: PUT returns 200 + event_id, but messages never appear in clients across refresh. Bot responses (same server) also vanish. Both Element Web and Element X show "not sent" red banners.
The fix is one line — include rampancy.cloud in the allowlist alongside genuine remote peers. The runbook's 6E.2 spec now includes this in the example block; saved to memory at [[feedback_tuwunel_allowlist_includes_self]].
2. Tuwunel v1.6.2 UIAA regression on registration_token + password flow
Spantaleev's role pinned latest, which resolved to v1.6.2 at our deploy time. v1.6.2 has a documented regression (matrix-construct/tuwunel#459) where the UIAA fallback acknowledgement rejected non-SSO flows. Token + password registration would fail with M_FORBIDDEN: SSO authentication not completed on the first attempt, sometimes succeed on retry, and leave user state in an unclear half-registered shape. v1.7.0 (released 2026-05-21) fixes this.
matrix_tuwunel_version: v1.7.0 is now pinned in vars.yml. If a future deploy lands on a newer Tuwunel, validate the registration flow before declaring success — pre-v1.7.0 latest was a footgun.
3. ensure-matrix-users-created tag is Synapse-only
Original runbook chained install-all,ensure-matrix-users-created,start --extra-vars='username=... password=... admin=yes'. On Tuwunel the user-creator role's include is silently skipped. The vault_matrix_admin_password we generated turned out to be a reference value with no consumer; first-user-becomes-admin is governed by matrix_tuwunel_config_grant_admin_to_first_user instead. Registration happens via a Matrix client + token.
4. matrix_rtc_livekit_* family vars no longer exist in current spantaleev
The pre-execution runbook spec used matrix_rtc_livekit_server_enabled, matrix_rtc_livekit_jwt_service_enabled, matrix_rtc_livekit_server_config_keys_secret. Spantaleev collapsed these into a single matrix_rtc_enabled toggle with per-service secrets auto-derived from matrix_homeserver_generic_secret_key via SHA-512 hashing. The vars.yml in 6E.2 now reflects current spelling.
5. postgres_connection_password required even when Tuwunel doesn't use Postgres
Spantaleev validates this be non-empty regardless of homeserver implementation — Postgres is installed as standby for future bridges (mautrix-discord etc.). Setting it to '' (default in examples/vars.yml) fails install-all with a validate_config.yml error. Use the same master-seed derivation pattern as the rest of the per-service secrets: "{{ (matrix_homeserver_generic_secret_key + ':postgres.connection') | hash('sha512') | to_uuid }}".
6. just not in Debian bookworm apt, not worth installing
Spantaleev's docs lead with just update / just roles shortcuts. CT 104 runs Debian 12 (bookworm), and just isn't packaged there. The shortcuts wrap git pull && ansible-galaxy install -r requirements.yml -p roles/galaxy/ --force; running the underlying commands directly is fine and keeps CT 104 fully apt-managed.
7. Tuwunel v1.7.0 vs Element Web sync timing
When the user-side bug looked like a sync issue (before we found the allowlist filter), we tried two compensating edge-Caddy hacks: HTTP/3 disable, and 404'ing the simplified_msc3575 sliding-sync endpoint. Both were red herrings and have been reverted. Element Web on app.element.io works fine over both HTTP/3 and sliding sync against Tuwunel v1.7.0 once the allowlist filter is corrected.
8. CLI admin mode requires service stop, runs in foreground
Tuwunel's --execute <command> and --console modes need the RocksDB lock, so the running container must be stopped first:
sudo systemctl stop matrix-tuwunel
sudo docker run --rm <same mounts> ghcr.io/matrix-construct/tuwunel:v1.7.0 --execute "<command>"
sudo systemctl start matrix-tuwunel
--execute "rooms list" hangs indefinitely under our deploy state — observed 2026-05-21, caused a 6-minute outage before the container was killed externally. Avoid it; use --execute "rooms info <room_id>" for specific rooms instead.
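To keep a hung admin command from turning into an outage, the stop/run/start sequence can be wrapped so that the CLI run is time-limited and the service is restarted regardless of the outcome. This is a sketch, not the procedure used during the run: the function name and the container name tuwunel-admin-cli are hypothetical, the mounts and image tag are copied from the promote-to-admin example above, and the 60-second default limit is an arbitrary assumption.

```shell
# Hypothetical wrapper: run one Tuwunel CLI admin command with a time limit,
# then always bring the service back up.
# Usage: tuwunel_execute "<admin command>" [timeout_seconds]
tuwunel_execute() {
  admin_cmd=$1
  limit=${2:-60}
  sudo systemctl stop matrix-tuwunel || return 1
  status=0
  # timeout(1) exits with 124 if the limit is reached.
  sudo timeout "$limit" docker run --rm --name tuwunel-admin-cli \
    -v /matrix/tuwunel/data:/var/lib/tuwunel \
    -v /matrix/tuwunel/config:/etc/tuwunel:ro \
    -e TUWUNEL_CONFIG=/etc/tuwunel/tuwunel.toml \
    ghcr.io/matrix-construct/tuwunel:v1.7.0 \
    --execute "$admin_cmd" || status=$?
  # Remove any leftover container so it cannot keep holding the RocksDB lock.
  sudo docker rm -f tuwunel-admin-cli >/dev/null 2>&1
  sudo systemctl start matrix-tuwunel
  return "$status"
}

# Example:
#   tuwunel_execute "rooms info <room_id>" 30
```

A return status of 124 indicates the command hit the time limit; the service is started again in either case.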
9. Element Web's MSC3916 service worker is unreliable; legacy media kept on
Tuwunel defaults allow_legacy_media: false — Matrix-spec-correct, since the legacy unauthenticated media endpoints were deprecated in favour of MSC3916 authenticated media at /_matrix/client/v1/media/*. Tuwunel does advertise MSC3916 support (org.matrix.msc3916.stable: true in /_matrix/client/versions), so in theory Element Web should use the new endpoints.
In practice, Element Web's MSC3916 implementation is a browser service worker (sw.js) that intercepts media URL requests and rewrites them to add the Authorization header. The service worker is fragile in real-world deployments — per element-hq/element-web#28011, #28370, and ongoing follow-ups:
| Failure mode | Observed |
|---|---|
| Service worker not registered at all | Common; "excessive page reloads" cited as the fix |
| Stale service worker from older Element Web version still cached | Common |
| Private/incognito browsing | Browsers disable service workers in private mode → MSC3916 broken by design |
| Browser quirks | Threads cite Firefox, Safari, Brave behaving differently |
When the service worker fails to register/rewrite, Element Web silently falls back to issuing the deprecated /_matrix/media/v3/... paths directly. Caddy access logs on app.element.io Firefox sessions show this clearly — /v3/thumbnail/... and /v3/download/... GETs hitting the server with no rewrite. Result: uploads succeed (POST /v3/upload → 200 + mxc URI), but every read returns 403 and images render as blank squares.
Element Desktop and Element X (mobile) use native HTTP clients, not the browser service-worker hack — they hit MSC3916 endpoints cleanly with no fallback issue. Only Element Web (app.element.io and self-hosted) has this dependency.
Fix: enable legacy media via env-var override (no dedicated spantaleev var):
# in inventory/host_vars/matrix.rampancy.cloud/vars.yml
matrix_tuwunel_environment_variables_extension: |
TUWUNEL_ALLOW_LEGACY_MEDIA=true
Don't also enable TUWUNEL_REQUEST_LEGACY_MEDIA=true — Tuwunel's own docs warn it "adds considerable federation requests which are unlikely to succeed" (most remote servers removed legacy endpoints in ~2024Q3). The authenticated federation media path (/_matrix/federation/v1/media/*) is what we use for outbound fetches and it works fine.
Security trade-off: legacy endpoints don't require Matrix auth — anyone with an mxc:// URI can fetch the bytes via /v3/download. For a closed-membership homeserver behind a federation allowlist this is low risk (only allowed peers can resolve our media in the first place). Revisit if Element Web ever ships an MSC3916 implementation that doesn't depend on a service worker — until then this isn't fixable from our side.
10. CT 104's docker role conflicted with spantaleev's Docker install
Both homelab-ansible's docker role and spantaleev's matrix-deploy playbook want to manage Docker on VM 111. Their /etc/apt/sources.list.d/docker.sources formats differ (cosmetic but persistent diff) and homelab-ansible was also setting /etc/docker/daemon.json with an overlay2 pin carried over from n8n. The pin had never been validated as required for matrix, and each run of either playbook would re-flip the state, producing perpetual changed=4 drift.
Resolution: dropped docker from playbooks/matrix.yml's role list. Spantaleev owns Docker lifecycle on this host end-to-end. The host_vars docker_daemon_config was removed. If extraction-snapshot races ever bite here, re-pin overlay2, but spantaleev's image pulls didn't show the symptom; the Tuwunel/Postgres/LiveKit images don't seem to trigger the n8n-style race.
Related
- Roadmap Phase 6E: scoping
- Edge cutover: Caddy reverse proxy on CT 107
- CrowdSec validation: edge security pattern; not covering RTC ports
- Forgejo setup: closest precedent for a new service + edge vhost + DNS delegation
- spantaleev/matrix-docker-ansible-deploy: upstream playbook (vendored at ~/matrix-deploy/ on CT 104)
- Tuwunel: homeserver implementation
- Tuwunel Admin Room Commands wiki: !admin syntax reference
- LiveKit ports & firewall: authoritative port reference