agnes-the-ai-analyst/docs/DEPLOYMENT.md
ZdenekSrotyr 5f6bb7a4b2
fix(security+ops) + release(0.12.1): #82 #85 #87 hardening + cut 0.12.1 (#104)
* fix(security+ops): #82 #85 #87 — auth hardening, API validation, deploy posture

Security and operational hardening across three issue groups:

- M23: docker-compose.override.yml → docker-compose.dev.yml (BREAKING, prod foot-gun)
- C13: Container runs as non-root user 'agnes' (USER directive in Dockerfile)
- M21: Docker resource limits (mem_limit, cpus) on app + scheduler
- M22: Caddyfile security headers (X-Frame-Options, X-Content-Type-Options, Referrer-Policy, -Server)
- M17: /api/health split into minimal (unauth) + /api/health/detailed (auth) (BREAKING)
- M26: release.yml restricts build-and-push to main + workflow_dispatch; paths-ignore for docs

- C2: table_id traversal validation on /api/data/{table_id}/download
- M4: Upload streaming (chunk-read + temp file) instead of full-buffer; /local-md hashed filename

- C5: reset_token removed from POST /api/users/{id}/reset-password response
- C8: Startup WARNING when no user has password_hash (bootstrap window visible)
- M9: Audit log on failed web form login (mirrors /auth/token endpoint)
- M10: Atomic magic-link consume via compare-and-swap (CONSUMED: marker + DuckDB conflict catch)

Also: SSRF protection on /api/admin/configure (#46), memory stats SQL aggregation (#90)

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* fix(review): SSRF 169.254.x.x + IPv6 multicast; M10 marker cleanup safety

Review fixes:
- Add 169.254.0.0/16 (link-local, cloud metadata) to SSRF regex — was
  missing, allowing requests to AWS/GCP/Azure metadata endpoints
- Add ff[0-9a-f]{2}: (IPv6 multicast) to SSRF regex
- M10: wrap Step 3 (CONSUMED marker cleanup) in try-except with
  warning log — prevents unhandled exception if DB write fails after
  successful token consumption
- Add test for 169.254.169.254 SSRF rejection

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* fix(review): SSRF IPv6 bypass, CLI health endpoint, upload FD leak

Address Devin Review findings on PR #104:

1. SSRF IPv6 bypass: Replace hostname regex with DNS resolution +
   ipaddress module checks. The old regex patterns like `fe80:` only
   matched up to the first colon, missing real IPv6 addresses like
   `fe80::1`, `fc00::1`, `ff02::1`. The new approach resolves the
   hostname via getaddrinfo and checks each resulting IP against
   ipaddress.is_private/is_loopback/is_link_local/is_reserved/is_multicast.

2. CLI commands broken: `da setup test-connection`, `da setup verify`,
   `da diagnose`, `da status` all called /api/health expecting the old
   format (status=="healthy", services dict). Now they call
   /api/health/detailed for service-level checks (with graceful fallback
   to the minimal endpoint when auth is not configured).

3. Temp file handle leak: _stream_to_temp returns an open
   NamedTemporaryFile; callers now close it before shutil.move() to
   prevent FD leaks until GC.

Also adds IPv6 SSRF test cases (loopback, link-local, unique-local,
multicast) with mocked DNS resolution for test environment independence.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* fix(review): download regex blocks hyphenated IDs; document health split

Address Devin Review round-3 findings on PR #104:

1. _SAFE_IDENTIFIER regex blocked hyphenated table IDs: The download
   endpoint used the strict SQL-identifier regex which does not allow
   dots or hyphens, but Keboola table IDs like in.c-crm.orders
   contain both. Switched to _SAFE_QUOTED_IDENTIFIER which allows dots
   and hyphens while still blocking path-traversal chars (/, .., \)
   and quote/control characters. Added test for hyphenated/dotted IDs.

2. Documented health endpoint split in DEPLOYMENT.md: Added Health
   checks & external monitoring section explaining both endpoints
   (minimal unauth /api/health vs authenticated /api/health/detailed)
   and how to wire external monitoring tools to the detailed endpoint
   with a PAT.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* release(0.12.1): cut hotfix for snapshot integrity + #82/#85/#87 hardening

* fix(security): apply CAS pattern to password reset confirm (#82/M10 follow-up)

Devin review on the rebased PR flagged the asymmetry: magic-link verify
got the atomic compare-and-swap pattern in the original M10 fix, but
password reset confirm at /auth/password/reset/confirm was still using
read-validate-clear. Two concurrent POSTs with the same valid reset
token could both succeed in setting different new passwords (last-write-
wins). Lower severity than the magic-link race because the attacker
would need the reset token AND to race the legitimate user, but the
asymmetry was a polish gap.

Mirrors app/auth/providers/email.py::_consume_token CAS exactly: write
unique CONSUMED:<random> marker via UPDATE...WHERE token=old_token, then
SELECT to verify our marker won, then proceed. Only the winner clears
the marker and applies the password change.

New regression test_concurrent_reset_only_one_wins in
tests/test_password_flows.py::TestResetConfirm pins the contract: two
ThreadPoolExecutor workers + Barrier hit /reset/confirm with the same
token; exactly one gets 302 (password applied), the other gets 200 with
'Invalid or expired'. Sanity-checked against the pre-CAS code — both
POSTs got 302 (race confirmed).

---------

Co-authored-by: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-04-28 19:57:30 +02:00

179 lines
8.7 KiB
Markdown

# Deployment Guide
Agnes supports two deployment paths. Pick the one that matches your use case.
## 1. Terraform — managed, multi-customer (recommended)
For Keboola-operated deployments and anyone running Agnes for multiple customers on GCP.
**Follow:** [`ONBOARDING.md`](ONBOARDING.md)
Highlights:
- Per-customer GCP project + private infra repo cloned from [`keboola/agnes-infra-template`](https://github.com/keboola/agnes-infra-template)
- Reusable Terraform module `infra/modules/customer-instance` (versioned — `infra-vX.Y.Z` tags)
- Prod + optional branch-aware dev VMs
- Persistent SSD data disk with daily snapshots
- Secret Manager for tokens (no plaintext in VM metadata)
- OS Login for SSH, dedicated VM service account with scoped `secretAccessor`
- Cron-based auto-upgrade (pulls `:stable` image digest every 5 min)
- Caddy TLS with corporate-CA or self-managed certs mounted from `/data/state/certs`; daily auto-rotation from a URL (`TLS_FULLCHAIN_URL`) with zero-downtime `SIGUSR1` reload
- Uptime check + alert policy per VM (wire a notification channel to be paged)
- CI/CD in the private repo: PR → `terraform plan`, merge to main → `apply-dev` auto, `apply-prod` gated by reviewer
- First-boot bootstrap via `POST /auth/bootstrap`
Target onboarding time: **< 1 hour** per customer.
## 2. Docker Compose — OSS self-host
For running Agnes on your own VM / bare metal without Terraform. You're responsible for provisioning and maintenance.
### Prerequisites
- Ubuntu 24.04 (or any Linux with Docker)
- 2 vCPU, 2 GB RAM, 30 GB SSD minimum
- Docker Engine + Compose plugin
- Public IP with ports 80/443 (if using Caddy TLS) or 8000 (plain HTTP) open
- Data-source credentials (e.g., Keboola Storage token)
### Steps
1. Clone the Agnes repository:
```bash
git clone https://github.com/keboola/agnes-the-ai-analyst.git /opt/agnes
cd /opt/agnes
```
2. Create `.env`:
```bash
cat > .env <<'EOF'
JWT_SECRET_KEY=$(openssl rand -hex 32)
DATA_DIR=/data
DATA_SOURCE=keboola
KEBOOLA_STORAGE_TOKEN=<your-token>
KEBOOLA_STACK_URL=<your-stack-url>
SEED_ADMIN_EMAIL=<your-email>
LOG_LEVEL=info
AGNES_TAG=stable
EOF
chmod 600 .env
```
3. Mount a persistent disk at `/data` (optional but recommended — survives host rebuild). If you do, use the overlay:
```bash
docker compose \
-f docker-compose.yml \
-f docker-compose.prod.yml \
-f docker-compose.host-mount.yml \
up -d
```
Without a persistent disk (data on Docker named volume, tied to boot disk):
```bash
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d
```
4. Bootstrap your admin password via `POST /auth/bootstrap`:
```bash
curl -X POST http://<host>:8000/auth/bootstrap \
-H "Content-Type: application/json" \
-d '{"email":"<your-email>","password":"<strong-password>"}'
```
5. Open `http://<host>:8000/login` and sign in.
### TLS (optional)
Caddy runs as the TLS terminator. It reads certs from `/data/state/certs/{fullchain,privkey}.pem` bind-mounted into the container. Two provisioning modes:
**A. Public internet (Let's Encrypt)** — for this path, override the `Caddyfile` to drop the `tls` directive (so Caddy auto-issues) and skip steps below. Not covered here anymore; see git history prior to the `feat(tls)` change if you need the ACME flow.
**B. Corporate CA / self-managed certs** (recommended, and what the infra repo ships):
Two bring-up flows, picked by whether `TLS_PRIVKEY_URL` is set in `.env`:
- **On-VM gen** (preferred for new deployments): leave `TLS_PRIVKEY_URL` empty. On first run, `agnes-tls-rotate.sh` generates an RSA-2048 key + CSR directly into `/data/state/certs/` using the subject string from `TLS_CSR_SUBJECT`. The key never leaves the host; the CSR (`/data/state/certs/cert.csr`) is what you submit to your corporate PKI. Until the CA signs and publishes, rotate falls back to a 30-day self-signed cert against the same key so Caddy can serve :443.
- **Pre-provisioned key** (legacy / VM-replace-resilient): set `TLS_PRIVKEY_URL=sm://<secret>` (or any supported scheme). Seed the key out-of-band before first rotate. Same real-cert fetch + self-signed fallback applies.
Both modes converge: once the CA publishes the signed chain at `TLS_FULLCHAIN_URL`, the daily rotate tick atomically swaps the fullchain in place and `SIGUSR1`-reloads Caddy. Zero key churn, zero downtime, no reload when the URL content hasn't moved.
1. Set the required env vars in `.env`:
```
DOMAIN=agnes.example.com
TLS_FULLCHAIN_URL=https://your-ca.example.com/agnes/fullchain.pem
TLS_PRIVKEY_URL= # empty → on-VM gen; or sm://<secret>
TLS_CSR_SUBJECT=/C=…/ST=…/L=…/O=…/CN=agnes.example.com
```
2. Start with the `tls` profile + overlay (`docker-compose.tls.yml` closes host `:8000` so all traffic enters via `:443`):
```bash
docker compose \
-f docker-compose.yml \
-f docker-compose.prod.yml \
-f docker-compose.tls.yml \
--profile tls up -d
```
3. Grab the CSR if you used on-VM gen:
```bash
sudo cat /data/state/certs/cert.csr
```
Submit to your corporate PKI. While waiting, Caddy is already up on :443 with the self-signed fallback.
#### Automatic rotation
`scripts/ops/agnes-tls-rotate.sh` is the single entry point — it handles fetch, self-signed fallback, auto-generation on missing key, atomic cert swap, and Caddy reload. Env vars it reads:
| Var | Required | Schemes | Notes |
|---|---|---|---|
| `DOMAIN` | yes | — | The hostname Caddy serves + the CN in auto-generated CSRs. |
| `TLS_FULLCHAIN_URL` | yes | `https://`, `sm://<secret>`, `gs://<obj>`, `file://` | Polled daily; rotate only reloads Caddy when the bytes change. |
| `TLS_PRIVKEY_URL` | optional | same | Empty activates on-VM gen. Set to pre-provisioned scheme (e.g. `sm://`) for VM-replace resilience. |
| `TLS_CSR_SUBJECT` | optional | — | Stamped on auto-generated CSRs. Defaults to `/CN=<DOMAIN>` if unset. Example: `/C=US/ST=Illinois/L=Chicago/O=Your Org/CN=agnes.example.com`. |
`scripts/tls-fetch.sh` at `/usr/local/bin/tls-fetch.sh` is required (generic URL fetcher used by rotate). On infra-repo-managed VMs, both scripts are installed by `startup.sh` and fired via a daily systemd timer; for manual compose deployments, copy them under `/usr/local/bin/` and wire a systemd timer (`OnBootSec=10min`, `OnUnitActiveSec=24h`, `Persistent=true`).
### Upgrades (manual)
```bash
cd /opt/agnes
git pull
docker compose -f docker-compose.yml -f docker-compose.prod.yml pull
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d
```
Or set up a cron job — see `infra/modules/customer-instance/startup-script.sh.tpl` for the reference implementation.
### Health checks & external monitoring
Two health endpoints serve different audiences:
| Endpoint | Auth | Response | Use for |
|---|---|---|---|
| `GET /api/health` | None | `{"status": "ok"}` | Load balancers, Docker `healthcheck`, uptime pings |
| `GET /api/health/detailed` | Bearer token | `{"status", "version", "services": {...}}` | Dashboards, alerting rules, `da diagnose`/`da status` CLI |
The Docker Compose `healthcheck` uses the minimal endpoint (`curl -sf http://localhost:8000/api/health`). For external monitoring tools (Datadog, Prometheus, UptimeRobot, etc.) that need service-level detail (DuckDB status, sync freshness, user count), point them at `/api/health/detailed` with an `Authorization: Bearer <token>` header. Any authenticated user can call it; a personal access token (`da admin create-pat`) works well for service accounts.
## Which path should I pick?
| | Terraform | Docker Compose |
|---|---|---|
| Setup time | ~45 min first customer, ~15 min each subsequent | ~30 min |
| Infra-as-Code | Full (all resources in git) | Partial (compose.yml only) |
| Secret storage | GCP Secret Manager | `.env` file on host |
| Upgrades | Auto via cron, gated prod apply | Manual `docker compose pull` |
| Backups | Daily GCP snapshots, 30-day retention | You set up yourself |
| Monitoring / alerts | GCP Uptime Checks + alert policy | You set up yourself |
| TLS | Caddy + corp cert, auto-rotated from URL | Caddy + corp cert, manual or user-scripted rotation |
| Best for | Multi-tenant SaaS, production | Single-instance self-host, learning |
## Related documentation
- [`ONBOARDING.md`](ONBOARDING.md) — end-to-end Terraform onboarding checklist
- [`CONFIGURATION.md`](CONFIGURATION.md) — `instance.yaml`, env vars, per-instance config
- [`architecture.md`](architecture.md) — internal architecture (orchestrator, extractors, DB layout)
- [`QUICKSTART.md`](QUICKSTART.md) — local development setup
- [`superpowers/specs/2026-04-21-multi-customer-deployment-spec.md`](superpowers/specs/2026-04-21-multi-customer-deployment-spec.md) — design rationale for the multi-customer model