agnes-the-ai-analyst/docs/DEPLOYMENT.md
ZdenekSrotyr 5f6bb7a4b2
2026-04-28 19:57:30 +02:00


Deployment Guide

Agnes supports two deployment paths. Pick the one that matches your use case.

1. Terraform

For Keboola-operated deployments and anyone running Agnes for multiple customers on GCP.

Follow: ONBOARDING.md

Highlights:

  • Per-customer GCP project + private infra repo cloned from keboola/agnes-infra-template
  • Reusable Terraform module infra/modules/customer-instance (versioned — infra-vX.Y.Z tags)
  • Prod + optional branch-aware dev VMs
  • Persistent SSD data disk with daily snapshots
  • Secret Manager for tokens (no plaintext in VM metadata)
  • OS Login for SSH, dedicated VM service account with scoped secretAccessor
  • Cron-based auto-upgrade (pulls :stable image digest every 5 min)
  • Caddy TLS with corporate-CA or self-managed certs mounted from /data/state/certs; daily auto-rotation from a URL (TLS_FULLCHAIN_URL) with zero-downtime SIGUSR1 reload
  • Uptime check + alert policy per VM (wire a notification channel to be paged)
  • CI/CD in the private repo: PR → terraform plan, merge to main → apply-dev auto, apply-prod gated by reviewer
  • First-boot bootstrap via POST /auth/bootstrap

Target onboarding time: < 1 hour per customer.

2. Docker Compose — OSS self-host

For running Agnes on your own VM / bare metal without Terraform. You're responsible for provisioning and maintenance.

Prerequisites

  • Ubuntu 24.04 (or any Linux with Docker)
  • 2 vCPU, 2 GB RAM, 30 GB SSD minimum
  • Docker Engine + Compose plugin
  • Public IP with ports 80/443 (if using Caddy TLS) or 8000 (plain HTTP) open
  • Data-source credentials (e.g., Keboola Storage token)

Steps

  1. Clone the Agnes repository:

    git clone https://github.com/keboola/agnes-the-ai-analyst.git /opt/agnes
    cd /opt/agnes
    
  2. Create .env:

    # Unquoted EOF: the shell must expand $(openssl rand -hex 32) while writing the file.
    cat > .env <<EOF
    JWT_SECRET_KEY=$(openssl rand -hex 32)
    DATA_DIR=/data
    DATA_SOURCE=keboola
    KEBOOLA_STORAGE_TOKEN=<your-token>
    KEBOOLA_STACK_URL=<your-stack-url>
    SEED_ADMIN_EMAIL=<your-email>
    LOG_LEVEL=info
    AGNES_TAG=stable
    EOF
    chmod 600 .env
    
  3. Mount a persistent disk at /data (optional but recommended — survives host rebuild). If you do, use the overlay:

    docker compose \
        -f docker-compose.yml \
        -f docker-compose.prod.yml \
        -f docker-compose.host-mount.yml \
        up -d
    

    Without a persistent disk (data on Docker named volume, tied to boot disk):

    docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d
    
  4. Bootstrap your admin password via POST /auth/bootstrap:

    curl -X POST http://<host>:8000/auth/bootstrap \
        -H "Content-Type: application/json" \
        -d '{"email":"<your-email>","password":"<strong-password>"}'
    
  5. Open http://<host>:8000/login and sign in.
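
Before starting the stack in step 3, it can help to sanity-check the .env written in step 2. The following is a minimal sketch, not part of the repository: `check_env_file` is a hypothetical helper, and it assumes the secret was generated with `openssl rand -hex 32` (64 hex characters) as shown above.

```shell
# Hypothetical helper: sanity-check a .env file before starting the stack.
# Prints one line per problem and returns non-zero if any were found.
check_env_file() {
    local env_file="$1" problems=0

    if [ ! -f "$env_file" ]; then
        echo "missing file: $env_file"
        return 1
    fi

    # Placeholders such as <your-token> must be replaced with real values.
    if grep -Eq '^[A-Za-z_][A-Za-z0-9_]*=<[^>]*>$' "$env_file"; then
        echo "unreplaced <placeholder> value found"
        problems=$((problems + 1))
    fi

    # openssl rand -hex 32 yields exactly 64 lowercase hex characters.
    # A literal "$(openssl ...)" string fails this check.
    if ! grep -Eq '^JWT_SECRET_KEY=[0-9a-f]{64}$' "$env_file"; then
        echo "JWT_SECRET_KEY is not a 64-character hex string"
        problems=$((problems + 1))
    fi

    [ "$problems" -eq 0 ]
}
```

Usage: `check_env_file /opt/agnes/.env && echo "env looks sane"`. The check is deliberately narrow; it says nothing about whether the token or stack URL is actually valid.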

TLS (optional)

Caddy runs as the TLS terminator. It reads certs from /data/state/certs/{fullchain,privkey}.pem bind-mounted into the container. Two provisioning modes:

A. Public internet (Let's Encrypt): override the Caddyfile to drop the tls directive (so Caddy auto-issues certificates) and skip the steps below. This flow is no longer covered here; see the git history prior to the feat(tls) change if you need the ACME flow.

B. Corporate CA / self-managed certs (recommended, and what the infra repo ships):

Two bring-up flows, picked by whether TLS_PRIVKEY_URL is set in .env:

  • On-VM gen (preferred for new deployments): leave TLS_PRIVKEY_URL empty. On first run, agnes-tls-rotate.sh generates an RSA-2048 key + CSR directly into /data/state/certs/ using the subject string from TLS_CSR_SUBJECT. The key never leaves the host; the CSR (/data/state/certs/cert.csr) is what you submit to your corporate PKI. Until the CA signs and publishes, rotate falls back to a 30-day self-signed cert against the same key so Caddy can serve :443.
  • Pre-provisioned key (legacy / VM-replace-resilient): set TLS_PRIVKEY_URL=sm://<secret> (or any supported scheme). Seed the key out-of-band before first rotate. Same real-cert fetch + self-signed fallback applies.

Both modes converge: once the CA publishes the signed chain at TLS_FULLCHAIN_URL, the daily rotate tick atomically swaps the fullchain in place and SIGUSR1-reloads Caddy. Zero key churn, zero downtime, no reload when the URL content hasn't moved.

  1. Set the required env vars in .env:
    DOMAIN=agnes.example.com
    TLS_FULLCHAIN_URL=https://your-ca.example.com/agnes/fullchain.pem
    TLS_PRIVKEY_URL=            # empty → on-VM gen; or sm://<secret>
    TLS_CSR_SUBJECT=/C=…/ST=…/L=…/O=…/CN=agnes.example.com
    
  2. Start with the tls profile + overlay (docker-compose.tls.yml closes host :8000 so all traffic enters via :443):
    docker compose \
        -f docker-compose.yml \
        -f docker-compose.prod.yml \
        -f docker-compose.tls.yml \
        --profile tls up -d
    
  3. Grab the CSR if you used on-VM gen:
    sudo cat /data/state/certs/cert.csr
    
    Submit to your corporate PKI. While waiting, Caddy is already up on :443 with the self-signed fallback.
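
Before submitting the CSR, it is worth confirming that it carries the subject you expect. The sketch below assumes the documented default (the subject falls back to /CN=<DOMAIN> when TLS_CSR_SUBJECT is unset) and uses standard `openssl req` flags; the function names are hypothetical and do not come from agnes-tls-rotate.sh.

```shell
# Resolve the expected CSR subject: TLS_CSR_SUBJECT if given,
# otherwise /CN=<DOMAIN> (the documented default).
csr_subject() {
    local domain="$1" subject="${2:-}"
    if [ -n "$subject" ]; then
        printf '%s\n' "$subject"
    else
        printf '/CN=%s\n' "$domain"
    fi
}

# Print the subject embedded in a CSR and verify its self-signature.
# The default path is the one named in step 3.
inspect_csr() {
    local csr="${1:-/data/state/certs/cert.csr}"
    openssl req -in "$csr" -noout -subject -verify
}
```

Usage: compare `csr_subject "$DOMAIN" "$TLS_CSR_SUBJECT"` against the output of `sudo bash -c "$(declare -f inspect_csr); inspect_csr"` (or simply run the `openssl req` line under sudo). Note that openssl may print the subject in a different notation than the slash-separated form used in .env.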

Automatic rotation

scripts/ops/agnes-tls-rotate.sh is the single entry point — it handles fetch, self-signed fallback, auto-generation on missing key, atomic cert swap, and Caddy reload. Env vars it reads:

| Var | Required | Schemes | Notes |
|---|---|---|---|
| DOMAIN | yes | | The hostname Caddy serves + the CN in auto-generated CSRs. |
| TLS_FULLCHAIN_URL | yes | https://, sm://<secret>, gs://<obj>, file:// | Polled daily; rotate only reloads Caddy when the bytes change. |
| TLS_PRIVKEY_URL | optional | same | Empty activates on-VM gen. Set to pre-provisioned scheme (e.g. sm://) for VM-replace resilience. |
| TLS_CSR_SUBJECT | optional | | Stamped on auto-generated CSRs. Defaults to /CN=<DOMAIN> if unset. Example: /C=US/ST=Illinois/L=Chicago/O=Your Org/CN=agnes.example.com. |

The rotate script depends on scripts/tls-fetch.sh, a generic URL fetcher that must be installed at /usr/local/bin/tls-fetch.sh. On infra-repo-managed VMs, both scripts are installed by startup.sh and fired via a daily systemd timer; for manual compose deployments, copy them under /usr/local/bin/ and wire a systemd timer (OnBootSec=10min, OnUnitActiveSec=24h, Persistent=true).
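
For manual compose deployments, the timer wiring described above could look roughly like this. It is a sketch under assumptions: the unit names are made up, and this guide does not say how the rotate script locates its configuration, so `EnvironmentFile=/opt/agnes/.env` is a guess you may need to adapt. Only the three timer settings and the script location come from the text above.

```shell
# Write a oneshot service + timer for the rotate script into a unit directory.
# Unit names and the EnvironmentFile path are assumptions, not repo defaults.
write_tls_rotate_units() {
    local unit_dir="${1:-/etc/systemd/system}"

    cat > "$unit_dir/agnes-tls-rotate.service" <<'EOF'
[Unit]
Description=Agnes TLS certificate rotation

[Service]
Type=oneshot
EnvironmentFile=/opt/agnes/.env
ExecStart=/usr/local/bin/agnes-tls-rotate.sh
EOF

    cat > "$unit_dir/agnes-tls-rotate.timer" <<'EOF'
[Unit]
Description=Daily Agnes TLS certificate rotation

[Timer]
OnBootSec=10min
OnUnitActiveSec=24h
Persistent=true

[Install]
WantedBy=timers.target
EOF
}
```

After writing the units as root, `systemctl daemon-reload` followed by `systemctl enable --now agnes-tls-rotate.timer` activates the schedule, and `systemctl list-timers` shows the next run.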

Upgrades (manual)

cd /opt/agnes
git pull
docker compose -f docker-compose.yml -f docker-compose.prod.yml pull
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d

Or set up a cron job — see infra/modules/customer-instance/startup-script.sh.tpl for the reference implementation.
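
If you do not want to read the Terraform template, the manual steps above can be wrapped for unattended use along these lines. This is a hedged sketch, not the reference implementation: `agnes_upgrade` and the `AGNES_DIR` variable are hypothetical names, and `--ff-only` is an added safety choice rather than something this guide prescribes.

```shell
# Hypothetical unattended upgrade wrapper around the manual steps.
# AGNES_DIR defaults to the clone location used in this guide.
# The body runs in a subshell so the caller's working directory is untouched.
agnes_upgrade() (
    cd "${AGNES_DIR:-/opt/agnes}" &&
    # Refuse non-fast-forward pulls so local edits are never merged silently.
    git pull --ff-only &&
    docker compose -f docker-compose.yml -f docker-compose.prod.yml pull &&
    # "up -d" recreates only containers whose image or configuration changed.
    docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d
)
```

Saved as a script that calls `agnes_upgrade` and scheduled from root's crontab, this stops at the first failing step, so a failed pull never restarts the stack. If you use the host-mount or TLS overlays, add the same `-f` files here that you used at bring-up.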

Health checks & external monitoring

Two health endpoints serve different audiences:

| Endpoint | Auth | Response | Use for |
|---|---|---|---|
| GET /api/health | None | {"status": "ok"} | Load balancers, Docker healthcheck, uptime pings |
| GET /api/health/detailed | Bearer token | {"status", "version", "services": {...}} | Dashboards, alerting rules, da diagnose/da status CLI |

The Docker Compose healthcheck uses the minimal endpoint (curl -sf http://localhost:8000/api/health). For external monitoring tools (Datadog, Prometheus, UptimeRobot, etc.) that need service-level detail (DuckDB status, sync freshness, user count), point them at /api/health/detailed with an Authorization: Bearer <token> header. Any authenticated user can call it; a personal access token (da admin create-pat) works well for service accounts.
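
As an illustration, the two probes might be scripted as follows. `AGNES_URL` and `AGNES_PAT` are hypothetical variable names chosen for this sketch; the endpoints, the `curl -sf` flags, and the Bearer header are the ones described above. The exact field values returned by the detailed endpoint are not specified here, so the sketch only prints that body rather than interpreting it.

```shell
# AGNES_URL and AGNES_PAT are hypothetical variable names for this sketch.

# True only if a response body reports status "ok" (the minimal endpoint's shape).
status_is_ok() {
    printf '%s' "$1" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'
}

# Minimal, unauthenticated probe: suitable for uptime pings.
probe_minimal() {
    local body
    body=$(curl -sf "${AGNES_URL:-http://localhost:8000}/api/health") || return 1
    status_is_ok "$body"
}

# Detailed probe: prints the JSON body; fails on any HTTP error response
# (for example a missing or invalid token).
probe_detailed() {
    curl -sf -H "Authorization: Bearer ${AGNES_PAT:?set AGNES_PAT}" \
        "${AGNES_URL:-http://localhost:8000}/api/health/detailed"
}
```

Usage: `probe_minimal || echo "agnes is down"` for a simple liveness check, and `AGNES_PAT=<token> probe_detailed` when a dashboard or alert rule needs the per-service fields.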

Which path should I pick?

| | Terraform | Docker Compose |
|---|---|---|
| Setup time | ~45 min first customer, ~15 min each subsequent | ~30 min |
| Infra-as-Code | Full (all resources in git) | Partial (compose.yml only) |
| Secret storage | GCP Secret Manager | .env file on host |
| Upgrades | Auto via cron, gated prod apply | Manual docker compose pull |
| Backups | Daily GCP snapshots, 30-day retention | You set up yourself |
| Monitoring / alerts | GCP Uptime Checks + alert policy | You set up yourself |
| TLS | Caddy + corp cert, auto-rotated from URL | Caddy + corp cert, manual or user-scripted rotation |
| Best for | Multi-tenant SaaS, production | Single-instance self-host, learning |