
Edge cutover — NPM → Caddy

Phase 5D cutover runbook. Replaces hand-clicked Nginx Proxy Manager on VM 102 (nginx, 192.168.1.249) with config-as-code Caddy on a new LXC edge (CT 107, 192.168.1.244). Caddy serves a single Let's Encrypt wildcard *.rampancy.cloud issued via DNS-01 against Cloudflare — no per-host certs, no GUI, no plain-port-80 inbound dependency.

Executed 2026-05-02

Cutover complete. Four hosts (requests, dash, n8n, kosync) migrated from NPM to Caddy. NPM container stopped on VM 102; VM running through soak. CF orange-cloud attempted then reverted (Universal SSL disabled at zone). Lessons captured at the bottom of this page.

Strategy

Parallel-run on a new IP, swap UDM port-forward in one step, soak, decom. Zero scheduled downtime; cutover window is a sub-minute UDM apply.

| Phase | What happens | Hold point | Rollback |
|---|---|---|---|
| 0 | Preflight: read-only audit | | |
| 1 | Build edge LXC, apply role on alt-ports 8080/8443 | After parallel-run tests pass | pct destroy 107 |
| 2.1 | Flip Caddy to bind 80/443 | Before UDM change | Flip alt-ports back, re-apply |
| 2.3 | UDM port-forward swap → .244 | This is the cutover | Revert UDM forward to .249, docker start nginx_proxy_manager-app-1 |
| 2.5 | Stop NPM container on VM 102 | | docker start nginx_proxy_manager-app-1 |
| 3 | Soak 24–72h, then VM 102 decom | Before qm destroy 102 | Restore from PBS |

Pre-flight gates

  • Enumerate every NPM proxy_host (read NPM SQLite; confirm hostname list, websocket flags, advanced_config blocks)
  • Drop orphan certs in NPM UI (any cert without a matching proxy_host)
  • Decide CF token scope — Zone:Read + DNS:Edit on the zone only; nothing else needed
  • Build Caddy binary on WSL via xcaddy + caddy-dns/cloudflare, commit to roles/caddy/files/caddy-<version>
  • Verify CF zone SSL/TLS encryption mode (only relevant if planning CF orange-cloud later)
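
The first gate (enumerating proxy hosts from the NPM SQLite file) could be scripted roughly as below. This is a sketch only: the table name `proxy_host` and the column names are assumptions about NPM's schema, the helper name is hypothetical, and the database path is a placeholder, so check the schema before trusting the output.

```shell
# Hypothetical helper: prints the SQL used to dump the live proxy hosts.
# Table and column names are ASSUMPTIONS about NPM's schema; confirm with
#   sqlite3 /tmp/npm.sqlite '.schema proxy_host'
# before relying on them.
npm_proxy_host_sql() {
  cat <<'SQL'
SELECT id, domain_names, forward_scheme, forward_host, forward_port,
       allow_websocket_upgrade, length(advanced_config) AS advanced_len
FROM proxy_host
WHERE is_deleted = 0
ORDER BY id;
SQL
}

# Usage (not run here). Work on a copy so the live database is never locked;
# the source path is a placeholder for wherever NPM's data volume lives.
#   cp /path/to/npm/data/database.sqlite /tmp/npm.sqlite
#   npm_proxy_host_sql | sqlite3 -header -column /tmp/npm.sqlite
```

Any row with a non-zero `advanced_len` has an advanced_config block that needs a manual look before it is translated into the Caddyfile.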

Phase 1 — Build edge in parallel

# on proxfold
pct create 107 local:vztmpl/debian-13-standard_13.1-2_amd64.tar.zst \
  --hostname edge \
  --cores 1 --memory 1024 --swap 512 \
  --rootfs local-zfs:4 \
  --net0 name=eth0,bridge=vmbr0,ip=192.168.1.244/24,gw=192.168.1.1 \
  --features nesting=1 \
  --unprivileged 1 \
  --onboot 1 \
  --ssh-public-keys /root/.ssh/authorized_keys \
  --start 1

Mirrors CT 106 (Beszel). nesting=1 is set for systemd 257 in trixie, not for Docker.

Then on the WSL control node, with caddy_listen_alt_ports: true in host_vars/edge.yml:

ansible-playbook playbooks/edge.yml --check --diff --limit edge   # preview
ansible-playbook playbooks/edge.yml --limit edge                  # apply

Verify on edge: ss -tlnp | grep caddy should show :8080 and :8443. Verify cert issuance: find /var/lib/caddy -name "*.crt" should list wildcard_.rampancy.cloud.crt.
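
The port check above can be made pass/fail instead of eyeballed. The sketch below assumes the usual `ss -tlnp` column layout (local address in the fourth column) and a process name of `caddy`; both function names are made up for illustration.

```shell
# Reads `ss -tlnp` output on stdin and prints the TCP ports that a process
# named "caddy" is listening on, one per line, sorted numerically.
# Assumes the local address:port is the 4th whitespace-separated column.
caddy_listen_ports() {
  awk '/"caddy"/ { n = split($4, a, ":"); print a[n] }' | sort -n -u
}

# Succeeds only if every port given as an argument is in that list.
caddy_has_ports() {
  ports=$(caddy_listen_ports)
  for want in "$@"; do
    printf '%s\n' "$ports" | grep -qx "$want" || return 1
  done
}

# Usage on edge (not run here):
#   ss -tlnp | caddy_has_ports 8080 8443 && echo "alt-ports bound"
```

After Phase 2 step 1 the same helper can be called with `80 443` instead.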

Test parallel-run from the control node — bypass DNS, hit Caddy direct:

for h in n8n dash kosync requests; do
  curl --resolve "$h.rampancy.cloud:8443:192.168.1.244" -k -sI \
       -o /dev/null -w "$h: HTTP %{http_code}\n" "https://$h.rampancy.cloud:8443/"
done

Expect 200 / 307 / 404 per upstream behaviour. NPM is still serving prod traffic on .249 throughout.
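
One possible way to turn the status codes from the loop above into a verdict is sketched here. The rule itself is an assumption, not something the run defined: `000` is what curl's `%{http_code}` prints when no HTTP response arrived, and a 5xx suggests Caddy answered but the upstream did not.

```shell
# Hypothetical pass/fail rule for the parallel-run loop.
#   000  curl never received an HTTP response (connect or TLS failure)
#   5xx  Caddy probably answered but could not get a good upstream reply
# Any other three-digit code is accepted as "upstream behaviour".
parallel_run_verdict() {
  case "$1" in
    000|5??) echo FAIL ;;
    [1-4]??) echo OK ;;
    *)       echo UNKNOWN ;;
  esac
}

# Usage (not run here), reusing the curl call from the loop above:
#   code=$(curl --resolve "$h.rampancy.cloud:8443:192.168.1.244" -k -sI \
#            -o /dev/null -w '%{http_code}' "https://$h.rampancy.cloud:8443/")
#   echo "$h: $code $(parallel_run_verdict "$code")"
```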

Phase 2 — Cutover

  1. Flip caddy_listen_alt_ports: false in host_vars/edge.yml, re-apply. Caddy rebinds to 80/443. NPM is unaffected.
  2. Hold point — confirm intent before UDM change.
  3. UDM UI → Settings → Security → Port Forwarding → edit both rules (TCP 80 and TCP 443) to forward to 192.168.1.244.
  4. Verify externally, from a cellular hotspot or from WSL using public DNS:

    for h in n8n dash kosync requests; do
      # -subject prints the certificate subject (the CN), not the issuer
      SUBJECT=$(echo | openssl s_client -servername "$h.rampancy.cloud" \
                -connect "$h.rampancy.cloud:443" 2>/dev/null \
                | openssl x509 -noout -subject 2>/dev/null)
      echo "$h: $SUBJECT"
    done
    

    Expect subject=CN = *.rampancy.cloud for all four (Caddy's LE wildcard). If you see a per-host CN, NPM is still answering and the UDM change hasn't applied yet; wait 30s and retry.

  5. Stop NPM container on VM 102:

    ansible nginx -m shell -a 'docker stop nginx_proxy_manager-app-1'
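
The certificate check in step 4 can also be scripted so that a stale per-host cert fails loudly. This is a minimal sketch with a hypothetical function name; it strips spaces before matching so that it does not depend on how the local OpenSSL build pads the subject line.

```shell
# Succeeds if an `openssl x509 -noout -subject` line names the wildcard CN.
# Spaces are removed first so "CN = x" and "CN=x" are treated the same.
is_wildcard_subject() {
  printf '%s\n' "$1" | tr -d ' ' | grep -q 'CN=\*\.rampancy\.cloud$'
}

# Usage (not run here):
#   for h in n8n dash kosync requests; do
#     s=$(echo | openssl s_client -servername "$h.rampancy.cloud" \
#           -connect "$h.rampancy.cloud:443" 2>/dev/null \
#           | openssl x509 -noout -subject 2>/dev/null)
#     is_wildcard_subject "$s" && echo "$h: ok" || echo "$h: NOT wildcard ($s)"
#   done
```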
    

Phase 3 — Soak + decom

Watch Caddy logs (journalctl -u caddy -f on edge) and Beszel for the next 24–72h.

After soak, decom VM 102:

# on proxfold
qm stop 102
qm config 102 > /root/vm102-pre-decom-config.txt   # for the records
vzdump 102 --storage nasbackup --notes-template "pre-decom"
qm destroy 102 --purge   # only after explicit go

Then remove nginx from inventory/hosts.yml and retire host_vars/nginx.yml, playbooks/nginx.yml.

Lessons from the 2026-05-02 run

Caddy upstream systemd unit ships --environ flag — leaks env vars to journalctl

Caddy's official systemd unit sets ExecStart=/usr/bin/caddy run --environ --config .... The --environ flag tells Caddy to print all environment variables to its log on startup — including any secret values pulled from EnvironmentFile=. With our CF API token sourced from /etc/caddy/caddy.env, the token landed in plaintext in /var/log/journal on first start.

Fix: drop --environ from the rendered ExecStart. Already done in the role's systemd template.

If the token leaked, rotate it (CF dashboard → API Tokens → ⋯ → Roll). Roll preserves token ID and permissions; only the secret changes. Re-vault and re-apply the role to push the new env file. Existing certs stay valid; only renewals need the working token.
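
To decide whether a rotation is needed, two checks are useful: is the flag still in the running unit, and did a token line ever reach the journal. The helpers below are a sketch with made-up names; they assume the dump appears as a plain `CADDY_CLOUDFLARE_API_TOKEN=...` line and they print only a count, never the value.

```shell
# Reads unit text on stdin (for example from `systemctl cat caddy`) and
# succeeds if any ExecStart line still carries --environ.
unit_has_environ() {
  grep -E '^ExecStart=' | grep -q -- '--environ'
}

# Reads journal text on stdin and prints how many lines contain the token
# variable assignment. `|| true` keeps a zero count from being an error.
count_token_lines() {
  grep -c 'CADDY_CLOUDFLARE_API_TOKEN=' || true
}

# Usage on edge (not run here):
#   systemctl cat caddy | unit_has_environ && echo "--environ still present"
#   journalctl -u caddy --no-pager | count_token_lines
```

A non-zero count means the token should be treated as leaked and rolled.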

caddy validate doesn't see EnvironmentFile

The EnvironmentFile= directive in the systemd unit applies only to processes that systemd starts for that unit, not to ad-hoc commands run from a shell or from Ansible. So /usr/local/bin/caddy validate --config /etc/caddy/Caddyfile run from Ansible fails with API token '' appears invalid, because Caddy's {env.CADDY_CLOUDFLARE_API_TOKEN} placeholder expands to an empty string.

Fix: pass the token via Ansible's environment: task parameter on the validate task. The role does this; if writing similar tasks elsewhere, remember the env-file scoping.
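
The same scoping applies when running validate by hand on edge. One way around it, sketched below, is a small wrapper that sources the env file in a subshell. The function name is hypothetical, and it assumes the file contains plain KEY=value lines that are also valid shell assignments, which systemd's EnvironmentFile syntax does not guarantee.

```shell
# Runs a command with the variables from an env file exported, inside a
# subshell so nothing leaks into the calling shell.
# ASSUMPTION: the file is plain KEY=value lines that the shell can source.
# Pass a path containing a slash so `.` does not search PATH for it.
with_env_file() {
  env_file=$1
  shift
  (
    set -a            # export every variable assigned while sourcing
    . "$env_file"
    set +a
    "$@"
  )
}

# Usage on edge (not run here):
#   with_env_file /etc/caddy/caddy.env \
#     /usr/local/bin/caddy validate --config /etc/caddy/Caddyfile
```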

Caddyfile fmt uses tabs (gofmt-style)

caddy fmt rewrites the Caddyfile with tab indents and collapses redundant column-aligned spacing. After the first run, every reload logged the warning Caddyfile input is not formatted. Fix: tab-indent the Jinja template to match the canonical style. The warning is now gone.

If the template ever shows the warning again, run caddy fmt /etc/caddy/Caddyfile on edge to see the canonical form, then mirror the changes back into roles/caddy/templates/Caddyfile.j2.
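
A cheap pre-check for the most common cause (space indentation) is sketched here. It only looks at leading whitespace in a rendered Caddyfile, so it will not catch the column-alignment rewrites that caddy fmt also makes; the function name is made up.

```shell
# Prints "line-number:content" for every line whose leading whitespace
# contains a space. Succeeds (exit 0) if at least one such line exists.
find_space_indents() {
  grep -nE '^[[:space:]]* ' "$1"
}

# Usage on edge (not run here):
#   find_space_indents /etc/caddy/Caddyfile
# For the full picture, diff the file against caddy's own formatting:
#   caddy fmt /etc/caddy/Caddyfile | diff -u /etc/caddy/Caddyfile -
```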

CF token scopes are granular — Zone:Read + DNS:Edit ≠ Zone Settings:Read

Token minted with Zone:Read + DNS:Edit on rampancy.cloud can: list zones, list DNS records, create/update/delete DNS records, verify itself via /user/tokens/verify.

Token cannot: read SSL/TLS settings, edit zone-level features, list certificate packs, change Universal SSL state. Those require Zone Settings:Read (and Edit for changes). When diagnosing certificate issues from the host, the API will answer this token with 9109 Unauthorized to access requested resource; that points at the missing SSL settings scope, not at a problem with DNS:Edit.

For DNS-01 ACME the current scope is correct and minimal — don't add SSL scope to the production token. Mint a separate diagnostic token if you need to script SSL settings.
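
When scripting against the API it helps to tell a scope error apart from everything else. The classifier below is a rough sketch: it greps the JSON instead of parsing it, the function name is hypothetical, and `CF_API_TOKEN` in the usage comment is a placeholder variable name.

```shell
# Reads a Cloudflare API JSON response on stdin and prints one of:
#   ok     the call succeeded
#   scope  error 9109, i.e. the token lacks the scope for that endpoint
#   other  any other failure
# Uses grep rather than a JSON parser; a rough classifier, not a validator.
cf_response_kind() {
  body=$(cat)
  if printf '%s\n' "$body" | grep -Eq '"code"[[:space:]]*:[[:space:]]*9109'; then
    echo scope
  elif printf '%s\n' "$body" | grep -Eq '"success"[[:space:]]*:[[:space:]]*true'; then
    echo ok
  else
    echo other
  fi
}

# Usage (not run here):
#   curl -s -H "Authorization: Bearer $CF_API_TOKEN" \
#     https://api.cloudflare.com/client/v4/user/tokens/verify | cf_response_kind
```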

CF Universal SSL is per-hostname — flipping orange without an existing cert breaks TLS

CF "Universal SSL" auto-issues a cert per orange-cloud hostname when proxying is enabled for that hostname. If Universal SSL is disabled at the zone level — which is the default state on this account — flipping a record to orange means CF receives the SNI but has no matching cert, and TLS handshake fails immediately. Symptoms: openssl s_client returns alert number 40 (handshake_failure), curl returns error:0A000410, browser shows "ERR_SSL_PROTOCOL_ERROR".

Diagnostic: dig @luke.ns.cloudflare.com +short <host>.rampancy.cloud to confirm CF edge IPs, then echo | openssl s_client -servername <host>.rampancy.cloud -connect <CF-IP>:443 and check whether you get a cert subject or a handshake alert.
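
The "cert subject or handshake alert" decision in that diagnostic can be automated along these lines. It is a sketch that pattern-matches s_client's text output, so the exact strings are assumptions about the local OpenSSL build; the function name is made up.

```shell
# Reads combined `openssl s_client` output on stdin and prints:
#   cert:<subject line>  if a certificate was presented
#   handshake_failure    if the output mentions TLS alert 40
#   unknown              otherwise
classify_s_client() {
  out=$(cat)
  subject=$(printf '%s\n' "$out" | sed -n '/^subject=/{p;q;}')
  if [ -n "$subject" ]; then
    echo "cert:$subject"
  elif printf '%s\n' "$out" | grep -Eq 'alert number 40|handshake failure'; then
    echo handshake_failure
  else
    echo unknown
  fi
}

# Usage (not run here; 2>&1 keeps the alert text, which goes to stderr):
#   echo | openssl s_client -servername <host>.rampancy.cloud \
#     -connect <CF-IP>:443 2>&1 | classify_s_client
```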

Fix: either re-enable Universal SSL at the zone (CF dashboard → SSL/TLS → Edge Certificates → Universal SSL → Enable, then wait for the cert to provision before flipping orange), or skip CF orange-cloud entirely and rely on edge-side protection (CrowdSec on Caddy, Phase 7D).

We chose the latter — orange-cloud reverted same day. Rollback was cheap: API PATCH on each record's proxied field back to false. DNS caches refreshed in <5 min.
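
For reference, the rollback PATCH could look roughly like this. It is a sketch, not the exact commands used on the day: the function names are hypothetical, and `CF_API_TOKEN` plus the zone and record IDs are placeholders the caller has to supply.

```shell
# Prints the JSON body for a partial DNS record update that only touches
# the proxied flag. Accepts "true" or "false"; anything else is rejected.
cf_proxied_payload() {
  case "$1" in
    true|false) printf '{"proxied":%s}\n' "$1" ;;
    *) echo "usage: cf_proxied_payload true|false" >&2; return 2 ;;
  esac
}

# Sends the PATCH. Arguments: zone ID, record ID, true|false.
# CF_API_TOKEN is a placeholder environment variable with DNS:Edit scope.
cf_set_proxied() {
  payload=$(cf_proxied_payload "$3") || return 2
  curl -sS -X PATCH \
    -H "Authorization: Bearer $CF_API_TOKEN" \
    -H "Content-Type: application/json" \
    --data "$payload" \
    "https://api.cloudflare.com/client/v4/zones/$1/dns_records/$2"
}

# Usage (not run here): cf_set_proxied "$ZONE_ID" "$RECORD_ID" false
```

Building the payload separately means a typo in the flag is rejected before any request is sent.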

The doc glitch: dash.rampancy.cloud vs beszel.rampancy.cloud

Earlier docs (docs/network/overview.md, docs/ansible/roles/beszel.md) called the public Beszel hostname beszel.rampancy.cloud. The actual NPM proxy_host record was dash.rampancy.cloud. Caught during the Phase 0 NPM SQLite dump. Cause: original NPM rule was created under one hostname; docs were written from intent rather than the live database. Fixed in the same docs sweep.

Lesson: when documenting reverse-proxy state, read the actual database rather than writing from intent, and don't trust the Caddyfile (or NPM equivalent) without cross-checking it against what is actually serving traffic.
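
That cross-check is easy to automate once both sides are reduced to a list of hostnames. The sketch below assumes two plain text files, one hostname per line (how they are produced is left open), and the function name is hypothetical.

```shell
# Compares two files of hostnames (one per line). Names found only in the
# docs are printed as "docs-only: ...", names found only in the live proxy
# config as "live-only: ...". No output means no drift.
hostname_drift() {
  docs_sorted=$(mktemp)
  live_sorted=$(mktemp)
  sort -u "$1" > "$docs_sorted"
  sort -u "$2" > "$live_sorted"
  comm -23 "$docs_sorted" "$live_sorted" | sed 's/^/docs-only: /'
  comm -13 "$docs_sorted" "$live_sorted" | sed 's/^/live-only: /'
  rm -f "$docs_sorted" "$live_sorted"
}

# Usage (not run here; both file names are placeholders):
#   hostname_drift docs-hostnames.txt live-hostnames.txt
```

With the hostnames from this incident, the docs list would have shown beszel.rampancy.cloud as docs-only and dash.rampancy.cloud as live-only.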