
Runbook: CrowdSec bouncer end-to-end validation

Proves the hslatman Caddy bouncer module is actually bouncing — not just that the engine is running. Use after the crowdsec_engine bootstrap and after flipping caddy_crowdsec_enabled: true in inventory/host_vars/edge.yml.

Active proof must run from outside the LAN

Hairpin NAT on the UDM rewrites the source IP for any LAN-internal request to the public hostname (requests.rampancy.cloud etc.) — Caddy sees client_ip: 192.168.1.1 instead of your real public IP. The bouncer correctly allows that LAN-side traffic, so a curl from WSL/CT104 will always show "not blocking" even when everything is wired right. Use a cellular phone (WiFi off) or a remote SSH host for the active test.
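A quick way to tell whether a machine is a valid vantage point is to compare its egress IP with the address the public hostname resolves to. This is a minimal sketch, assuming the hostname's A record points straight at the home WAN IP (no CDN proxy in front) and that curl and dig are installed; is_hairpin_vantage is a hypothetical helper name, not part of the role.

```shell
# Succeeds (exit 0) when the egress IP equals the hostname's resolved IP,
# i.e. requests from this machine would loop back through the UDM.
# Both arguments are plain IP strings; an empty egress IP never matches.
is_hairpin_vantage() {
    egress_ip=$1
    resolved_ip=$2
    [ -n "$egress_ip" ] && [ "$egress_ip" = "$resolved_ip" ]
}

# Live usage (makes network calls, so it is left commented out):
#   is_hairpin_vantage "$(curl -s https://api.ipify.org)" \
#       "$(dig +short requests.rampancy.cloud | tail -n1)" \
#       && echo "inside the LAN: bouncer tests from here will mislead"
```

If the check succeeds, move to a cellular device or a remote host before running the active steps.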

Prerequisites

  • crowdsec_engine role applied to edge, crowdsec.service active
  • caddy_crowdsec_enabled: true in inventory/host_vars/edge.yml, applied
  • Bouncer-enabled Caddy binary in roles/caddy/files/ (built per caddy.md §Binary build flow with the --with github.com/hslatman/caddy-crowdsec-bouncer/http flag)
  • SSH access to edge as root
  • A device on cellular (or any network whose egress IP is not your home WAN IP)

Step 1 — Read-only sanity (drift-safe, run from edge)

Confirm engine + bouncer + inflow before touching the active path.

ssh root@192.168.1.244

# Engine + Caddy both running.
systemctl is-active crowdsec caddy
# Expected: active / active

# Bouncer is registered and pulling decisions every ~15s.
cscli bouncers list
# Expected: caddy-edge, Valid ✔️, Last API pull within ~30s, Type caddy-cs-bouncer

# Engine has decisions to enforce (community CAPI pull on first start).
# -a/--all includes CAPI-sourced decisions, which the default listing hides.
cscli decisions list -a | head
# Expected: a non-empty list eventually (CAPI ships ~thousands by default).
# On a fresh install, allow up to 10 min for the first community pull.

# Caddy bouncer module loaded.
/usr/local/bin/caddy list-modules | grep crowdsec
# Expected: admin.api.crowdsec, crowdsec, http.handlers.crowdsec

Hold point: if any of those fail, stop. Don't proceed to active proof.
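The module check can be made strict rather than eyeballed. The sketch below assumes caddy list-modules prints one bare module ID per line (no version column); has_crowdsec_modules is a hypothetical helper, and the three IDs are the ones listed in the expected output.

```shell
# Reads `caddy list-modules` output on stdin. Succeeds only if all three
# crowdsec module IDs appear as whole lines; names the first one missing.
has_crowdsec_modules() {
    modules=$(cat)
    for wanted in admin.api.crowdsec crowdsec http.handlers.crowdsec; do
        # -x: whole-line match, -F: fixed string, -q: no output
        printf '%s\n' "$modules" | grep -qxF "$wanted" || {
            echo "missing module: $wanted" >&2
            return 1
        }
    done
}
```

On edge this would be run as `/usr/local/bin/caddy list-modules | has_crowdsec_modules`, treating a non-zero exit as the hold point.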

Step 2 — Capture your phone's cellular egress IP

On the phone (WiFi off, cellular on):

  • Visit https://api.ipify.org (just shows the IP) or any whatsmyip site
  • Note it (call it $PHONE_IP)
  • Confirm it's plausibly a cellular range, NOT your home WAN IP

Sanity-load https://requests.rampancy.cloud — should reach Overseerr's login screen normally.
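Before using the captured address in later steps, it may help to sanity-check it in the WSL shell. The sketch below is an assumption-laden helper (usable_phone_ip is a made-up name): it expects an IPv4 address, and the home WAN IP is something you supply yourself as the second argument.

```shell
# Succeeds when $1 is a dotted-quad IPv4 address that is not private
# (RFC 1918), not loopback, and not equal to the home WAN IP given as $2.
# The syntax check does not range-check octets (999.1.1.1 would pass it).
usable_phone_ip() {
    phone_ip=$1
    home_wan_ip=$2
    printf '%s\n' "$phone_ip" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$' || return 1
    case $phone_ip in
        10.*|192.168.*|127.*|172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 1 ;;
    esac
    [ "$phone_ip" != "$home_wan_ip" ]
}
```

For example, `PHONE_IP=203.0.113.50; usable_phone_ip "$PHONE_IP" "$HOME_WAN_IP" || echo "re-capture"`, where both values are placeholders for your own addresses.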

Step 3 — Add a test decision (5m TTL, from WSL)

ssh root@192.168.1.244 "cscli decisions add --ip $PHONE_IP --duration 5m --reason 'phase-7d-validation'"
# Expected: "Decision successfully added"

ssh root@192.168.1.244 "cscli decisions list --ip $PHONE_IP"
# Expected: one entry, type ban, ~5m remaining

5-minute TTL gives breathing room — the streaming bouncer ticker is 15s, so the decision needs at least one poll cycle to land in Caddy's cache.

Step 4 — Confirm the block (on phone)

Wait ~20s for the bouncer poll, then on the phone reload https://requests.rampancy.cloud.

Expected: the browser shows a blank page (Firefox); curl -i shows a 403 Forbidden status line. The bouncer's default deny response is 403 Forbidden with no body.

If still loading normally after 30s, see troubleshooting below.
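When the vantage point is a remote SSH host rather than a phone, the wait-and-reload can be scripted. This is a rough sketch, not part of the role: wait_for_status is a hypothetical helper, it must run on a host outside the LAN, and the decision in Step 3 must have been added for that host's egress IP.

```shell
# Polls URL every 5s until it returns the expected HTTP status code or the
# timeout elapses. Usage: wait_for_status URL EXPECTED_CODE [TIMEOUT_SECONDS]
wait_for_status() {
    url=$1
    expected=$2
    timeout=${3:-30}
    elapsed=0
    while :; do
        # -o /dev/null discards the body; -w prints only the status code.
        code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
        if [ "$code" = "$expected" ]; then
            echo "got $code after ${elapsed}s"
            return 0
        fi
        [ "$elapsed" -ge "$timeout" ] && break
        sleep 5
        elapsed=$((elapsed + 5))
    done
    echo "still $code after ${elapsed}s" >&2
    return 1
}
```

`wait_for_status https://requests.rampancy.cloud 403 30` covers this step; the same helper with 200 as the expected code covers the restored-access check in Step 7.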

Step 5 — Confirm the deny event landed in Caddy logs

On edge:

ssh root@192.168.1.244 "journalctl -u caddy --since '2 minutes ago' --no-pager | grep -E 'client_ip.*$PHONE_IP|crowdsec.*$PHONE_IP' | head"
# Expected: at least one access-log line with the phone IP and status 403,
# or a crowdsec logger line referencing the IP.
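A tighter filter is possible if the access log is in Caddy's JSON format. The sketch below assumes compact JSON lines with a client_ip field inside the request object and a top-level status field; denied_lines_for_ip is a hypothetical helper. Fixed-string matching also keeps the dots in the IP from acting as regex wildcards.

```shell
# Reads Caddy JSON access-log lines on stdin and prints only those that
# record a 403 for the given client IP (passed as $1).
denied_lines_for_ip() {
    grep -F "\"client_ip\":\"$1\"" | grep -F '"status":403'
}
```

Usage from WSL: `ssh root@192.168.1.244 "journalctl -u caddy --since '2 minutes ago' --no-pager" | denied_lines_for_ip "$PHONE_IP"`. No output means no deny event was logged for that address in the window.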

Step 6 — Expire the decision

ssh root@192.168.1.244 "cscli decisions delete --ip $PHONE_IP"
# Expected: "1 decision(s) deleted"

Step 7 — Confirm restored access (on phone)

Wait ~20s for the next bouncer poll (decisions are streamed as deltas, including deletions), then reload on the phone.

Expected: Overseerr loads normally again.

Step 8 — Soft-fail behaviour (optional, riskier)

To prove the bouncer fails open when the engine is unreachable (so a CrowdSec outage doesn't black-hole the public apps), first make sure no decision is blocking your phone IP, then:

ssh root@192.168.1.244 "systemctl stop crowdsec"
# On phone: reload requests.rampancy.cloud -- expected: still 200 (Caddy logs
# warn the bouncer can't reach LAPI; module fails open per hslatman default).
ssh root@192.168.1.244 "systemctl start crowdsec"

If the phone gets 5xx with engine stopped, enable_hard_fails got flipped on somewhere — check the Caddyfile global block.
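Checking the Caddyfile for that option can be done without opening an editor. A minimal sketch, assuming the option appears on its own line inside the crowdsec global block and that the Caddyfile lives at /etc/caddy/Caddyfile (an assumed path; adjust to where the role templates it); hard_fails_enabled is a hypothetical helper.

```shell
# Reads a Caddyfile on stdin. Succeeds if an uncommented enable_hard_fails
# line is present; leading whitespace is allowed, a leading '#' is not.
hard_fails_enabled() {
    grep -Eq '^[[:space:]]*enable_hard_fails([[:space:]]|$)'
}
```

Usage: `ssh root@192.168.1.244 "cat /etc/caddy/Caddyfile" | hard_fails_enabled && echo "hard fails are on"`.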

Troubleshooting

| Symptom | Likely cause | Diagnostic |
| --- | --- | --- |
| Phone reload still 200 after 30s | Bouncer stream cache hasn't picked up the decision | ssh root@edge cscli bouncers list, then check that Last API pull is recent. If it isn't, restart crowdsec.service |
| Phone reload still 200 even after cscli bouncers list shows recent pull | Caddy client_ip isn't what you think | Check journal access log: journalctl -u caddy --since '1 min ago' --no-pager \| grep client_ip. If client_ip ≠ $PHONE_IP, your phone is on a different egress (corporate VPN? CGNAT?); re-capture |
| All requests blocked (LAN included) | A decision exists for 192.168.1.1 (the UDM hairpin source) | cscli decisions list \| grep 192.168.1.1; delete it |
| caddy validate fails with "crowdsec API key must not be empty" | caddy_crowdsec_bouncer_key_var resolved to empty (vault key missing) | Check vault_caddy_crowdsec_bouncer_key exists in vault: ansible-vault view inventory/group_vars/all/vault.yml \| grep crowdsec |
| Caddy errors with unknown module: http.handlers.crowdsec on reload | Running Caddy is the OLD binary; new binary is on disk but the process didn't restart | ssh root@edge systemctl restart caddy (full restart, not reload; a reload pushes config into the running process, which doesn't have the module) |

Lessons from the 2026-05-04 run

The first execution caught six bugs / gotchas, all now codified into the role + this runbook + a feedback memory on hairpin NAT.

  1. packagecloud any/any URL gotcha. The upstream-documented workaround for the broken trixie repo is https://packagecloud.io/crowdsec/crowdsec/any (path component = any) + suite any. NOT crowdsec/crowdsec/debian + suite any. The /debian distro path rejects the any suite (returns HTTP 422 / "repository not signed"). Verified empirically: /any/dists/any/InRelease → 302 (works); /debian/dists/any/InRelease → 422.
  2. Don't move LAPI off port 8080 without re-templating the agent. The CrowdSec agent's /etc/crowdsec/local_api_credentials.yaml is wired to 127.0.0.1:8080 by the installer. Overriding LAPI's listen_uri via config.yaml.local without also re-templating the agent's credentials file leaves the agent unable to authenticate, and the engine fails to start. Lesson: don't optimise for hypothetical alt-port collisions; stay on stock unless there's a real reason.
  3. {env.X} doesn't substitute in the crowdsec module's api_key. Caddy's runtime env-var substitution doesn't fire for this field — the module reads its config before that pass. Use {$X} (parse-time substitution) instead. Verified by querying the running Caddy via /config/apps/crowdsec/ admin API and seeing the literal string {env.CROWDSEC_BOUNCER_API_KEY} in the loaded config.
  4. caddy validate needs every env var passed via Ansible's environment:. EnvironmentFile is systemd-only; ad-hoc validate doesn't see it. The validate task in roles/caddy/tasks/main.yml already passed CADDY_CLOUDFLARE_API_TOKEN; needed to add CROWDSEC_BOUNCER_API_KEY too.
  5. Handler-failure cascade leaves binaries unsynced. When the Restart crowdsec handler failed (due to the LAPI bug), Ansible aborted subsequent handlers — including caddy's Restart caddy. The new Caddy binary on disk wasn't loaded by the running process, so the next reload tried to load a Caddyfile with the crowdsec directive into a Caddy instance without the module → "unknown module" error. Fix: systemctl restart caddy (full restart loads the new binary). Future-proof: when a binary's content changes, prefer Restart over Reload, and check that handlers actually fired with --check --diff before assuming clean state.
  6. HTTP access logging needs site-block log, not global log. Caddy's global log directive only configures the default logger (the one used for the default log target); HTTP server access logs require log inside the site block. Without it, request flow is invisible — critical for diagnosing bouncer behaviour.
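The check described in lesson 3 can be repeated on demand. The sketch below assumes the admin API returns the crowdsec app config as JSON with an api_key field, and that the admin endpoint listens on Caddy's default localhost:2019; has_unexpanded_env_placeholder is a hypothetical helper.

```shell
# Reads the crowdsec app config JSON on stdin. Succeeds if api_key still
# holds a literal {env.NAME} placeholder, meaning substitution never ran.
has_unexpanded_env_placeholder() {
    grep -Eq '"api_key"[[:space:]]*:[[:space:]]*"\{env\.[^}]*\}"'
}
```

On edge: `curl -s http://localhost:2019/config/apps/crowdsec/ | has_unexpanded_env_placeholder && echo "switch to {\$X} in the Caddyfile"`.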

The big one: hairpin NAT made the bouncer look broken when it wasn't. Spent ~30 min testing the bounce from WSL with cscli decisions add for my WAN IP, saw nothing happen, chased four false leads (binary version, env-var syntax, streaming-vs-live mode, restart vs reload). Finally added access logging and saw client_ip: 192.168.1.1 — the UDM's LAN IP, not my WAN IP. UDM hairpin rewrites source on internal-loop traffic. The bouncer was correctly allowing LAN traffic the entire time. Always validate edge bouncers from outside the LAN. Captured as a feedback memory so future-me doesn't repeat this.