Edge cutover — NPM → Caddy¶
Phase 5D cutover runbook. Replaces hand-clicked Nginx Proxy Manager on VM 102 (nginx, 192.168.1.249) with config-as-code Caddy on a new LXC edge (CT 107, 192.168.1.244). Caddy serves a single Let's Encrypt wildcard *.rampancy.cloud issued via DNS-01 against Cloudflare — no per-host certs, no GUI, no plain-port-80 inbound dependency.
Executed 2026-05-02
Cutover complete. Four hosts (requests, dash, n8n, kosync) migrated from NPM to Caddy. The NPM container is stopped on VM 102; the VM itself stays up through the soak. CF orange-cloud was attempted, then reverted (Universal SSL is disabled at the zone). Lessons are captured at the bottom of this page.
Strategy¶
Parallel-run on a new IP, swap UDM port-forward in one step, soak, decom. Zero scheduled downtime; cutover window is a sub-minute UDM apply.
| Phase | What happens | Hold point | Rollback |
|---|---|---|---|
| 0 | Preflight: read-only audit | none | none |
| 1 | Build edge LXC, apply role on alt-ports 8080/8443 | After parallel-run tests pass | pct destroy 107 |
| 2.1 | Flip Caddy to bind 80/443 | Before UDM change | Flip alt-ports back, re-apply |
| 2.3 | UDM port-forward swap → .244 | This is the cutover | Revert UDM forward to .249, docker start nginx_proxy_manager-app-1 |
| 2.5 | Stop NPM container on VM 102 | none | docker start nginx_proxy_manager-app-1 |
| 3 | Soak 24–72h, then VM 102 decom | Before qm destroy 102 | Restore from PBS |
Pre-flight gates¶
- Enumerate every NPM proxy_host (read NPM SQLite; confirm hostname list, websocket flags, advanced_config blocks)
- Drop orphan certs in NPM UI (any cert without a matching proxy_host)
- Decide CF token scope: Zone:Read + DNS:Edit on the zone only; nothing else needed
- Build Caddy binary on WSL via xcaddy + caddy-dns/cloudflare, commit to roles/caddy/files/caddy-<version>
- Verify CF zone SSL/TLS encryption mode (only relevant if planning CF orange-cloud later)
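The first gate can be scripted against a copy of the NPM database. The following is a minimal sketch, assuming the sqlite3 CLI is installed. The proxy_host table and the advanced_config column are named in this runbook; the other column names (domain_names, forward_host, forward_port, allow_websocket_upgrade, is_deleted) are assumptions about NPM's schema and should be confirmed with `.schema proxy_host` before relying on the output. The function names are hypothetical.

```shell
# Print the audit query. Column names other than advanced_config are
# assumptions about NPM's schema; check them with:
#   sqlite3 "$db" '.schema proxy_host'
npm_proxy_host_query() {
  cat <<'SQL'
SELECT id,
       domain_names,
       forward_host || ':' || forward_port AS upstream,
       allow_websocket_upgrade,
       length(advanced_config) AS advanced_config_bytes
FROM proxy_host
WHERE is_deleted = 0
ORDER BY id;
SQL
}

# Run the query read-only against a copy of the NPM database.
# Usage: dump_npm_proxy_hosts /path/to/copy-of-database.sqlite
dump_npm_proxy_hosts() {
  sqlite3 -readonly -header -column "$1" "$(npm_proxy_host_query)"
}
```

Working on a copy rather than the live file keeps the audit strictly read-only from NPM's point of view.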
Phase 1 — Build edge in parallel¶
# on proxfold
pct create 107 local:vztmpl/debian-13-standard_13.1-2_amd64.tar.zst \
--hostname edge \
--cores 1 --memory 1024 --swap 512 \
--rootfs local-zfs:4 \
--net0 name=eth0,bridge=vmbr0,ip=192.168.1.244/24,gw=192.168.1.1 \
--features nesting=1 \
--unprivileged 1 \
--onboot 1 \
--ssh-public-keys /root/.ssh/authorized_keys \
--start 1
Mirrors CT 106 (Beszel). nesting=1 is there for systemd 257 in trixie, not for Docker.
Then on the WSL control node, with caddy_listen_alt_ports: true in host_vars/edge.yml:
ansible-playbook playbooks/edge.yml --check --diff --limit edge # preview
ansible-playbook playbooks/edge.yml --limit edge # apply
Verify on edge: ss -tlnp | grep caddy should show :8080 and :8443. Verify cert issuance: find /var/lib/caddy -name "*.crt" should list wildcard_.rampancy.cloud.crt.
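The port check above can be turned into a pass/fail test. This is a rough sketch that parses `ss -tlnp` output from stdin; the function name caddy_listens_on is hypothetical, and the match simply looks for `:<port>` followed by whitespace on lines that mention caddy.

```shell
# Read `ss -tlnp` output on stdin and succeed only if a caddy process
# is listening on every port given as an argument.
caddy_listens_on() {
  listing=$(cat)
  for port in "$@"; do
    printf '%s\n' "$listing" \
      | grep 'caddy' \
      | grep -Eq ":${port}[[:space:]]" || return 1
  done
  return 0
}

# Usage on edge:
#   ss -tlnp | caddy_listens_on 8080 8443 && echo "alt ports OK"
```

After the Phase 2 flip the same function can be called with `80 443` instead.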
Test parallel-run from the control node — bypass DNS, hit Caddy direct:
for h in n8n dash kosync requests; do
curl --resolve "$h.rampancy.cloud:8443:192.168.1.244" -k -sI \
-o /dev/null -w "$h: HTTP %{http_code}\n" "https://$h.rampancy.cloud:8443/"
done
Expect 200 / 307 / 404 per upstream behaviour. NPM is still serving prod traffic on .249 throughout.
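To make the expected codes explicit, the loop can be wrapped with a small classifier. This is a sketch under assumptions: the grouping below (000 as no response, 502 to 504 as upstream unreachable, anything else outside 2xx/3xx/404 as "check by hand") is an illustrative choice, not something the role enforces, and the function names are hypothetical.

```shell
# Classify an HTTP status from the parallel-run loop.
# 000 means curl never got a response (connection or TLS failure);
# 502-504 usually means the proxy answered but could not reach the upstream.
classify_status() {
  case "$1" in
    000)          echo "FAIL no-response" ;;
    502|503|504)  echo "FAIL upstream-unreachable" ;;
    2??|3??|404)  echo "OK" ;;
    *)            echo "CHECK" ;;
  esac
}

# Probe one host on Caddy's alt port, bypassing DNS (same curl flags as above).
probe_host() {
  code=$(curl --resolve "$1.rampancy.cloud:8443:192.168.1.244" -k -sI \
    -o /dev/null -w '%{http_code}' "https://$1.rampancy.cloud:8443/")
  echo "$1: HTTP $code $(classify_status "$code")"
}

# Usage from the control node:
#   for h in n8n dash kosync requests; do probe_host "$h"; done
```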
Phase 2 — Cutover¶
1. Flip caddy_listen_alt_ports: false in host_vars/edge.yml, re-apply. Caddy rebinds to 80/443. NPM is unaffected.
2. Hold point: confirm intent before UDM change.
3. UDM UI → Settings → Security → Port Forwarding → edit both rules (TCP 80 and TCP 443) to forward to 192.168.1.244.
4. Verify externally, from a cellular hotspot or from WSL using public DNS:

   for h in n8n dash kosync requests; do
     # Subject of the leaf certificate the public endpoint presents
     SUBJECT=$(echo | openssl s_client -servername "$h.rampancy.cloud" \
       -connect "$h.rampancy.cloud:443" 2>/dev/null \
       | openssl x509 -noout -subject 2>/dev/null)
     echo "$h: $SUBJECT"
   done

   Expect subject=CN = *.rampancy.cloud for all four (Caddy's LE wildcard). If you see a per-host CN, UDM hasn't applied yet; wait 30s and retry.
5. Stop NPM container on VM 102 (the inverse of the rollback command in the Strategy table):

   docker stop nginx_proxy_manager-app-1
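The external verification step in the list above can be reduced to a yes/no answer per host. This is a minimal sketch; the function names are hypothetical, and spaces are stripped before comparing in case the subject line is printed as `CN=` rather than `CN = `.

```shell
# Succeed if an `openssl x509 -noout -subject` line names the wildcard CN.
# Spaces are stripped first so "CN = x" and "CN=x" compare equal.
is_wildcard_subject() {
  normalized=$(printf '%s' "$1" | tr -d ' ')
  [ "$normalized" = "subject=CN=*.rampancy.cloud" ]
}

# Fetch the leaf certificate subject the public endpoint presents for a host.
served_subject() {
  echo | openssl s_client -servername "$1.rampancy.cloud" \
    -connect "$1.rampancy.cloud:443" 2>/dev/null \
    | openssl x509 -noout -subject 2>/dev/null
}

# Usage after the UDM change:
#   for h in n8n dash kosync requests; do
#     if is_wildcard_subject "$(served_subject "$h")"; then
#       echo "$h: wildcard (Caddy)"
#     else
#       echo "$h: not the wildcard yet"
#     fi
#   done
```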
Phase 3 — Soak + decom¶
Watch Caddy logs (journalctl -u caddy -f on edge) and Beszel for the next 24–72h.
After soak, decom VM 102:
# on proxfold
qm stop 102
qm config 102 > /root/vm102-pre-decom-config.txt # for the records
vzdump 102 --storage nasbackup --notes-template "pre-decom"
qm destroy 102 --purge # only after explicit go
Then remove nginx from inventory/hosts.yml and retire host_vars/nginx.yml, playbooks/nginx.yml.
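The "only after explicit go" note on qm destroy can be backed by a typed confirmation. This is a small sketch and not part of the original procedure; the function name confirm_vmid is hypothetical.

```shell
# Ask the operator to type the VM ID back before anything destructive runs.
# Reads one line from stdin; succeeds only on an exact match.
confirm_vmid() {
  printf 'Type %s to confirm destroy: ' "$1" >&2
  read -r answer || return 1
  [ "$answer" = "$1" ]
}

# Usage on proxfold, once the vzdump above has finished:
#   confirm_vmid 102 && qm destroy 102 --purge
```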
Lessons from the 2026-05-02 run¶
Caddy upstream systemd unit ships --environ flag — leaks env vars to journalctl¶
Caddy's official systemd unit sets ExecStart=/usr/bin/caddy run --environ --config .... The --environ flag tells Caddy to print all environment variables to its log on startup — including any secret values pulled from EnvironmentFile=. With our CF API token sourced from /etc/caddy/caddy.env, the token landed in plaintext in /var/log/journal on first start.
Fix: drop --environ from the rendered ExecStart. Already done in the role's systemd template.
If the token leaked, rotate it (CF dashboard → API Tokens → ⋯ → Roll). Roll preserves token ID and permissions; only the secret changes. Re-vault and re-apply the role to push the new env file. Existing certs stay valid; only renewals need the working token.
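For a host where the role's template has not been applied yet, the fix and the leak check described above might look like the sketch below. Both helpers are hypothetical and work on stdin, so they can be tried on copies of the unit file and journal output before touching anything live.

```shell
# Remove the --environ flag from ExecStart lines of a unit file read on stdin.
strip_environ_flag() {
  sed -E '/^ExecStart=/ s/[[:space:]]+--environ([[:space:]]|$)/\1/'
}

# Count journal lines on stdin that mention the token variable name,
# which is what --environ would have printed at startup.
count_token_mentions() {
  grep -c 'CADDY_CLOUDFLARE_API_TOKEN' || true
}

# Usage on edge:
#   systemctl cat caddy | strip_environ_flag      # preview the fixed ExecStart
#   journalctl -u caddy --no-pager | count_token_mentions
```

A non-zero count from the second command is a reason to rotate the token as described above.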
caddy validate doesn't see EnvironmentFile¶
The EnvironmentFile= directive in the systemd unit only applies to processes systemd starts for that unit (ExecStart and friends), not to ad-hoc commands run outside it. So /usr/local/bin/caddy validate --config /etc/caddy/Caddyfile run from Ansible fails with API token '' appears invalid, because Caddy's {env.CADDY_CLOUDFLARE_API_TOKEN} placeholder expands to an empty string.
Fix: pass the token via Ansible's environment: task parameter on the validate task. The role does this; if writing similar tasks elsewhere, remember the env-file scoping.
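When debugging by hand rather than through Ansible, the same scoping problem can be worked around in the shell. This is a sketch, not what the role does: it assumes /etc/caddy/caddy.env contains plain KEY=value lines that are also valid shell, and the helper name with_env_file is hypothetical.

```shell
# Run one command with variables from a KEY=value env file exported to it.
# The subshell keeps the secrets out of the calling shell's environment.
# Assumes the file holds plain KEY=value lines that are also valid shell.
with_env_file() {
  (
    env_file=$1
    shift
    set -a          # export every variable assigned while sourcing
    . "$env_file"
    set +a
    "$@"
  )
}

# Usage on edge (as a user that can read the env file):
#   with_env_file /etc/caddy/caddy.env \
#     /usr/local/bin/caddy validate --config /etc/caddy/Caddyfile
```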
Caddyfile fmt uses tabs (gofmt-style)¶
caddy fmt rewrites Caddyfile with tab indents and collapses redundant column-aligned spacing. The first run logged Caddyfile input is not formatted on every reload. Fix: tab-indent the Jinja template to match canonical style. Now the warning is gone.
If the template ever shows the warning again, run caddy fmt /etc/caddy/Caddyfile on edge to see the canonical form, then mirror the changes back into roles/caddy/templates/Caddyfile.j2.
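A cheap guard against the warning coming back is to flag space-indented lines in the template before it is rendered. This is a rough sketch with a hypothetical function name; it only catches leading spaces, not column-aligned spacing inside a line, so caddy fmt remains the reference for canonical form.

```shell
# Print (with line numbers) every line that starts with a space, which
# caddy fmt would rewrite to tabs. Exit status is 0 when the file is clean.
find_space_indents() {
  if grep -n '^ ' "$1"; then
    return 1
  fi
  return 0
}

# Usage in the repo:
#   find_space_indents roles/caddy/templates/Caddyfile.j2 || echo "re-indent with tabs"
```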
CF token scopes are granular — Zone:Read + DNS:Edit ≠ Zone Settings:Read¶
Token minted with Zone:Read + DNS:Edit on rampancy.cloud can: list zones, list DNS records, create/update/delete DNS records, verify itself via /user/tokens/verify.
Token cannot: read SSL/TLS settings, edit zone-level features, list certificate packs, change Universal SSL state. Those require Zone Settings:Read (and Edit for changes). When diagnosing certificate issues from the host with this token, the API returns 9109 Unauthorized to access requested resource. That points at the missing SSL settings scope, not at a problem with DNS:Edit.
For DNS-01 ACME the current scope is correct and minimal — don't add SSL scope to the production token. Mint a separate diagnostic token if you need to script SSL settings.
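When a scripted API call fails, it helps to separate "token lacks the scope" from other errors. The sketch below reads a Cloudflare API JSON response on stdin and keys off the 9109 code mentioned above. The function name and the CF_API_TOKEN variable are hypothetical, and plain pattern matching is used only to avoid a jq dependency.

```shell
# Read a Cloudflare API v4 JSON response on stdin and name the likely cause.
# Pattern matching keeps the sketch dependency-free; jq would be more robust.
classify_cf_response() {
  body=$(cat)
  if printf '%s' "$body" | grep -Eq '"code"[[:space:]]*:[[:space:]]*9109'; then
    echo "missing-scope"
  elif printf '%s' "$body" | grep -Eq '"success"[[:space:]]*:[[:space:]]*true'; then
    echo "ok"
  else
    echo "other-error"
  fi
}

# Usage (CF_API_TOKEN is a hypothetical variable holding the token):
#   curl -s -H "Authorization: Bearer $CF_API_TOKEN" \
#     https://api.cloudflare.com/client/v4/user/tokens/verify | classify_cf_response
```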
CF Universal SSL is per-hostname — flipping orange without an existing cert breaks TLS¶
CF "Universal SSL" auto-issues a cert per orange-cloud hostname when proxying is enabled for that hostname. If Universal SSL is disabled at the zone level — which is the default state on this account — flipping a record to orange means CF receives the SNI but has no matching cert, and TLS handshake fails immediately. Symptoms: openssl s_client returns alert number 40 (handshake_failure), curl returns error:0A000410, browser shows "ERR_SSL_PROTOCOL_ERROR".
Diagnostic: dig @luke.ns.cloudflare.com +short <host>.rampancy.cloud to confirm CF edge IPs, then echo | openssl s_client -servername <host>.rampancy.cloud -connect <CF-IP>:443 and check whether you get a cert subject or a handshake alert.
Fix: either re-enable Universal SSL at the zone (CF dashboard → SSL/TLS → Edge Certificates → Universal SSL → Enable, then wait for the cert to provision before flipping orange), or skip CF orange-cloud entirely and rely on edge-side protection (CrowdSec on Caddy, Phase 7D).
We chose the latter — orange-cloud reverted same day. Rollback was cheap: API PATCH on each record's proxied field back to false. DNS caches refreshed in <5 min.
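The diagnostic described above comes down to reading one of two outcomes from openssl s_client. A small sketch of that decision, with a hypothetical function name; it expects stdout and stderr merged, since the handshake alert is printed on stderr.

```shell
# Read `openssl s_client` output (stdout and stderr merged) on stdin and
# report whether the edge presented a certificate or refused the handshake.
classify_tls_probe() {
  out=$(cat)
  case "$out" in
    *"alert number 40"*|*"handshake failure"*) echo "handshake_failure" ;;
    *"subject="*)                               echo "cert_presented" ;;
    *)                                          echo "unknown" ;;
  esac
}

# Usage (fill in <host> and <CF-IP> as in the diagnostic above):
#   echo | openssl s_client -servername <host>.rampancy.cloud \
#     -connect <CF-IP>:443 2>&1 | classify_tls_probe
```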
The doc glitch — dash.rampancy.cloud ≠ beszel.rampancy.cloud¶
Earlier docs (docs/network/overview.md, docs/ansible/roles/beszel.md) called the public Beszel hostname beszel.rampancy.cloud. The actual NPM proxy_host record was dash.rampancy.cloud. Caught during the Phase 0 NPM SQLite dump. Cause: original NPM rule was created under one hostname; docs were written from intent rather than the live database. Fixed in the same docs sweep.
Lesson: read the actual database when documenting reverse-proxy state; don't trust the Caddyfile (or NPM equivalent) without cross-checking what's serving traffic.
Related¶
- caddy role page — defaults, behaviours, gotchas
- Network overview — public-services table
- Roadmap §Phase 5D
- Accepted risk: edge security gap until CrowdSec
- Roadmap §Phase 7D — CrowdSec — closes the gap