* chore(oss): isolate customer-specific deploy bits from scripts/grpn/ (#88)

  Vendor-neutralization step before public release. The directory mixed two concerns: (1) generic ops scripts referenced from mainline OSS infrastructure (TLS rotation, auto-upgrade cron) and (2) one operator's hackathon manual-deploy helper with hardcoded GCP project IDs, VM names, and admin emails. Splitting them per concern.

  Moved (still in OSS, just under a vendor-neutral name):
  - scripts/grpn/agnes-tls-rotate.sh → scripts/ops/agnes-tls-rotate.sh
  - scripts/grpn/agnes-auto-upgrade.sh → scripts/ops/agnes-auto-upgrade.sh

  Removed (belongs in private consumer infra repos, not upstream OSS):
  - scripts/grpn/Makefile (hardcoded prj-grp-foundryai-dev-7c37, foundryai-development VM name, e_zsrotyr@groupon.com bootstrap email)
  - scripts/grpn/README.md (GRPN hackathon deploy walkthrough)
  - docs/superpowers/plans/2026-04-22-grpn-deploy-learnings.md (org-specific deploy log)

  Cross-refs updated in README.md, CLAUDE.md, docs/DEPLOYMENT.md, docker-compose.yml. CHANGELOG entry flags BREAKING (ops) for any consumer infra repo that installs these scripts via path-based systemd timers.

  This is the first wave of #88 — the remaining leaks (test data with prj-grp-dataview-prod-1ff9, AIAgent.FoundryAI tags in OpenMetadata test fixtures, docstrings in connectors/openmetadata/enricher.py) will be a separate, smaller PR.

  Refs #88.

* chore(oss): comprehensive vendor-neutralization (#88 wave 2 + review fixes)

  PR #94 review found that the original wave-1 grep was scoped wrong and many leaks survived. This commit closes wave 1 properly AND folds in all wave-2 anonymization in a single pass — easier to review than two PRs.

  Wave-1 review-fix corrections:
  - Caddyfile: scripts/grpn/agnes-tls-rotate.sh → scripts/ops/ (the original wave-1 grep filter excluded extensionless files like Caddyfile).
  - CHANGELOG bullet rewritten — original wording implied an in-repo migration for infra/modules/customer-instance/, which is wrong (the TF module embeds the script inline via heredoc, never sourced from scripts/grpn/). Now flags downstream consumer infra repos only.
  - infra/modules/customer-instance/variables.tf: Czech docstring with `grpn` example → English description with `acme, example` placeholders.

  Wave-2 anonymization:
  - Code docstrings (connectors/openmetadata/{client,transformer,enricher}.py, src/catalog_export.py, scripts/duckdb_manager.py): prj-grp-… → my-bq-project / prj-example-1234, AIAgent.FoundryAI → AIAgent.MyAgent, FoundryAIDataModel → AnalyticsDataModel.
  - Test fixtures (4 files): same set of replacements — 157 tests still pass.
  - .github/workflows/keboola-deploy.yml: "Groupon-side dev VMs" comment → generic "per-developer dev VMs".
  - docs/auth-groups.md + scripts/debug/probe_google_groups.py: kids-ai-data-analysis project name → acme-internal-prod placeholder.
  - 5 planning/spec docs under `docs/superpowers/{plans,specs}/2026-04-21-*`: hardcoded IPs (34.77.94.14, 34.77.102.61) → `<dev-vm-ip>`/`<prod-vm-ip>`; GRPN/Groupon → Acme/another-customer; prj-grp-… → prj-example-….
  - scripts/switch-dev-vm.sh deleted — hackathon-era helper hardcoded to a specific shared dev VM. Per-developer dev VMs are the supported pattern.

  Final grep `groupon|grpn|foundryai|prj-grp|groupondev|34\.77\.(94|102)\.…|kids-ai-data` returns zero hits (excluding CHANGELOG.md historical entries). CHANGELOG entry expanded to document both waves under one bullet, with the BREAKING (ops) clarification about the TF module being unaffected.

  Refs review of #94, closes #88.

* fix(oss): close remaining #94 review-2 findings (Czech, padak refs, CHANGELOG)

  Reviewer of PR #94 round 2 caught 4 remaining items the wave-2 pass missed:

  1. infra/modules/customer-instance/variables.tf had Czech descriptions on 8 more variables. Previous review only flagged line 19; this round audited the rest. Translated lines 2, 28, 42-46 (heredoc), 60, 65, 71, 78, 84 to English. Same review concern: a Terraform module that is the customer-facing API surface in Czech is unfit for OSS distribution.
  2. infra/modules/customer-instance/outputs.tf had Czech descriptions on four outputs. Same fix.
  3. docs/padak-security.md referenced a private repo (padak/keboola_agent_cli#206) in two places. Replaced with generic 'tracked upstream in the auth-CLI repo' per CLAUDE.md vendor-agnostic rule (no cross-refs to private repos).
  4. scripts/fetch-env-from-secrets.sh:41 had a Czech comment. Translated.
  5. CHANGELOG cosmetic: bullet said 'AIAgent.FoundryAI -> AIAgent.MyAgent' but the actual code uses both MyAgent (in docstrings) and Example (in test fixtures). Reworded to mention both targets.

  Final grep across all shipping file types (.md, .py, .yml, .yaml, .sh, Makefile, .json, .tf, .tpl, Caddyfile, .toml) for `groupon|grpn|foundryai|prj-grp|groupondev|34.77.94.14|34.77.102.61|kids-ai-data|padak/keboola_agent_cli` returns ZERO hits (excluding CHANGELOG.md). Czech-diacritic grep across .tf/.toml/Caddyfile/Makefile/.yml returns ZERO hits. 157/157 OpenMetadata + DuckDB tests still pass.

* fix(oss): close #94 round-3 leaks (env.template, instance.yaml.example, padak typo)

  Round-3 reviewer caught two MUST-FIX leaks the round-2 grep missed (grep was scoped to extensions that did not include .template / .example suffixes — the audit was right, the previous grep was not paranoid enough):

  1. config/instance.yaml.example:114 — '(optional - Groupon-specific)' brand leak in a shipping config example. Replaced with '(optional)'.
  2. config/.env.template:68 — stale path 'scripts/grpn/agnes-tls-rotate.sh' in operator-facing env-template comment. The script lives at scripts/ops/ now (commit 16a85cc); this comment had been pointing operators at a non-existent path.
  3. docs/padak-security.md:188 — phrase duplication 'tracked in tracked upstream' from a sloppy substitution in round-2. Trivial wording fix.

  Final paranoid grep across .md/.py/.yml/.yaml/.sh/Makefile/.json/.tf/.tpl/Caddyfile/.toml/.template/.example/.env* with the full token set (`groupon|grpn|foundryai|prj-grp|groupondev|34\.77\.94\.14|34\.77\.102\.61|kids-ai-data|padak/keboola_agent_cli`) returns ZERO hits, excluding CHANGELOG.md historical entries.

* fix(oss): #94 round-4 — QUICKSTART.md + rename padak-security.md

  Devin Review caught two findings on the latest round-3 commit:

  1. docs/QUICKSTART.md:67 still pointed users at the deleted scripts/switch-dev-vm.sh. A Quickstart user following step-by-step would hit a missing-file error at the final step. Replaced with the inline gcloud-ssh equivalent that the Removed bullet documents.
  2. docs/padak-security.md filename retains the personal identifier 'padak'. The PR fixed the body content (replaced padak/keboola_agent_cli#206 references with generic wording) but missed the filename. Renamed to docs/security-audit-2026-04.md (date-anchored, vendor-neutral). Updated the historical CHANGELOG link to point at the new path with an inline note about the rename.

* fix(oss): redact remaining hardcoded IPs from planning docs + remove default email

  Devin Review caught two more leaks:

  1. scripts/fetch-env-from-secrets.sh line 16 had a hardcoded personal-email default (zdenek.srotyr@keboola.com). Replaced with ':?' bash error so SEED_ADMIN_EMAIL must be explicitly set — safer than carrying any specific identity.
  2. Planning docs still had 35.195.96.98 and 34.62.223.189 (legacy prod/dev IPs) that the round-1 IP-replace pattern missed (it only targeted 34.77.x.x). Generic regex redaction across all five planning docs replaces every public IP with `<redacted-ip>`, preserving private/loopback/IAP ranges.

# Deployment Guide

Agnes supports two deployment paths. Pick the one that matches your use case.

## 1. Terraform — managed, multi-customer (recommended)

For Keboola-operated deployments and anyone running Agnes for multiple customers on GCP.

**Follow:** [`ONBOARDING.md`](ONBOARDING.md)

Highlights:
- Per-customer GCP project + private infra repo cloned from [`keboola/agnes-infra-template`](https://github.com/keboola/agnes-infra-template)
- Reusable Terraform module `infra/modules/customer-instance` (versioned — `infra-vX.Y.Z` tags)
- Prod + optional branch-aware dev VMs
- Persistent SSD data disk with daily snapshots
- Secret Manager for tokens (no plaintext in VM metadata)
- OS Login for SSH, dedicated VM service account with scoped `secretAccessor`
- Cron-based auto-upgrade (pulls `:stable` image digest every 5 min)
- Caddy TLS with corporate-CA or self-managed certs mounted from `/data/state/certs`; daily auto-rotation from a URL (`TLS_FULLCHAIN_URL`) with zero-downtime `SIGUSR1` reload
- Uptime check + alert policy per VM (wire a notification channel to be paged)
- CI/CD in the private repo: PR → `terraform plan`, merge to main → `apply-dev` auto, `apply-prod` gated by reviewer
- First-boot bootstrap via `POST /auth/bootstrap`

Target onboarding time: **< 1 hour** per customer.

## 2. Docker Compose — OSS self-host

For running Agnes on your own VM / bare metal without Terraform. You're responsible for provisioning and maintenance.

### Prerequisites

- Ubuntu 24.04 (or any Linux with Docker)
- 2 vCPU, 2 GB RAM, 30 GB SSD minimum
- Docker Engine + Compose plugin
- Public IP with ports 80/443 (if using Caddy TLS) or 8000 (plain HTTP) open
- Data-source credentials (e.g., Keboola Storage token)

### Steps

1. Clone the Agnes repository:

   ```bash
   git clone https://github.com/keboola/agnes-the-ai-analyst.git /opt/agnes
   cd /opt/agnes
   ```

2. Create `.env`:

   ```bash
   # The heredoc delimiter is unquoted on purpose: a quoted 'EOF' would write
   # the literal text $(openssl rand -hex 32) instead of a generated secret.
   cat > .env <<EOF
   JWT_SECRET_KEY=$(openssl rand -hex 32)
   DATA_DIR=/data
   DATA_SOURCE=keboola
   KEBOOLA_STORAGE_TOKEN=<your-token>
   KEBOOLA_STACK_URL=<your-stack-url>
   SEED_ADMIN_EMAIL=<your-email>
   LOG_LEVEL=info
   AGNES_TAG=stable
   EOF
   # Restrict the file to the owner; it holds the JWT secret and the token.
   chmod 600 .env
   ```

3. Mount a persistent disk at `/data` (optional but recommended — survives host rebuild). If you do, use the overlay:

   ```bash
   docker compose \
     -f docker-compose.yml \
     -f docker-compose.prod.yml \
     -f docker-compose.host-mount.yml \
     up -d
   ```

   Without a persistent disk (data on Docker named volume, tied to boot disk):

   ```bash
   docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d
   ```

4. Bootstrap your admin password via `POST /auth/bootstrap`:

   ```bash
   curl -X POST http://<host>:8000/auth/bootstrap \
     -H "Content-Type: application/json" \
     -d '{"email":"<your-email>","password":"<strong-password>"}'
   ```

5. Open `http://<host>:8000/login` and sign in.
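
Before running step 3, it can help to sanity-check the `.env` from step 2 and to decide which overlay set applies. The helpers below are a minimal sketch and are not part of the Agnes repository: the names `check_env_file` and `compose_file_args` are hypothetical, the placeholder check only looks for leftover `<...>` values and unexpanded `$(` text, and `stat -c` assumes GNU coreutils (as on the Ubuntu host named in the prerequisites).

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight helpers; not shipped with Agnes.

# check_env_file FILE
# Returns 0 if FILE exists, has mode 600, and contains neither unfilled
# "<placeholder>" values nor an unexpanded "$(" command substitution.
check_env_file() {
  local file="$1" mode
  [ -f "$file" ] || { echo "missing: $file" >&2; return 1; }
  mode=$(stat -c '%a' "$file") || return 1   # GNU stat syntax
  if [ "$mode" != "600" ]; then
    echo "$file has mode $mode, expected 600" >&2
    return 1
  fi
  if grep -Eq '^[A-Za-z_][A-Za-z0-9_]*=<[^>]*>' "$file"; then
    echo "$file still contains <placeholder> values" >&2
    return 1
  fi
  if grep -Fq '$(' "$file"; then
    echo "$file contains an unexpanded \$( ... ) expression" >&2
    return 1
  fi
  return 0
}

# compose_file_args [DATA_DIR]
# Prints the -f arguments from step 3. The host-mount overlay is added
# only when DATA_DIR (default /data) is a mountpoint.
compose_file_args() {
  local data_dir="${1:-/data}"
  local args="-f docker-compose.yml -f docker-compose.prod.yml"
  if mountpoint -q "$data_dir" 2>/dev/null; then
    args="$args -f docker-compose.host-mount.yml"
  fi
  echo "$args"
}
```

A possible invocation from `/opt/agnes` is `check_env_file .env && docker compose $(compose_file_args /data) up -d`; the command substitution is left unquoted so that the arguments split into separate words.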

### TLS (optional)

Caddy runs as the TLS terminator. It reads certs from `/data/state/certs/{fullchain,privkey}.pem` bind-mounted into the container. Two provisioning modes:

**A. Public internet (Let's Encrypt)** — for this path, override the `Caddyfile` to drop the `tls` directive (so Caddy auto-issues) and skip the steps below. Not covered here anymore; see git history prior to the `feat(tls)` change if you need the ACME flow.

**B. Corporate CA / self-managed certs** (recommended, and what the infra repo ships):

Two bring-up flows, picked by whether `TLS_PRIVKEY_URL` is set in `.env`:

- **On-VM gen** (preferred for new deployments): leave `TLS_PRIVKEY_URL` empty. On first run, `agnes-tls-rotate.sh` generates an RSA-2048 key + CSR directly into `/data/state/certs/` using the subject string from `TLS_CSR_SUBJECT`. The key never leaves the host; the CSR (`/data/state/certs/cert.csr`) is what you submit to your corporate PKI. Until the CA signs and publishes, the rotate script falls back to a 30-day self-signed cert against the same key so Caddy can serve :443.
- **Pre-provisioned key** (legacy / VM-replace-resilient): set `TLS_PRIVKEY_URL=sm://<secret>` (or any supported scheme). Seed the key out-of-band before the first rotate run. Same real-cert fetch + self-signed fallback applies.
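
The on-VM gen flow can be pictured with plain `openssl` calls. This is a rough sketch of the behaviour described above, not the contents of `agnes-tls-rotate.sh`: the function names and the exact `openssl` flags are assumptions, and only the RSA-2048 key, the `cert.csr` file, the `fullchain.pem`/`privkey.pem` names, and the 30-day self-signed fallback come from the text.

```bash
#!/usr/bin/env bash
# Illustrative only: mirrors the on-VM gen description, not the real script.

# generate_key_and_csr CERT_DIR SUBJECT
# Creates privkey.pem (RSA-2048, unencrypted) and cert.csr in CERT_DIR.
# An existing key is never regenerated.
generate_key_and_csr() {
  local cert_dir="$1" subject="$2"
  if [ -s "$cert_dir/privkey.pem" ]; then
    return 0
  fi
  (
    umask 077   # the new key must be readable by its owner only
    openssl req -new -newkey rsa:2048 -nodes \
      -keyout "$cert_dir/privkey.pem" \
      -out "$cert_dir/cert.csr" \
      -subj "$subject"
  )
}

# self_signed_fallback CERT_DIR
# Signs the CSR with its own key for 30 days so the TLS terminator has
# something to serve until the CA publishes the real chain.
self_signed_fallback() {
  local cert_dir="$1"
  openssl x509 -req -days 30 \
    -in "$cert_dir/cert.csr" \
    -signkey "$cert_dir/privkey.pem" \
    -out "$cert_dir/fullchain.pem"
}
```

Called as `generate_key_and_csr /data/state/certs "$TLS_CSR_SUBJECT" && self_signed_fallback /data/state/certs`, this would leave a key, a CSR to submit, and a temporary certificate bound to that same key.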

Both modes converge: once the CA publishes the signed chain at `TLS_FULLCHAIN_URL`, the daily rotate tick atomically swaps the fullchain in place and `SIGUSR1`-reloads Caddy. Zero key churn, zero downtime, no reload when the URL content hasn't moved.
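
The swap-and-reload step amounts to "compare, rename, signal". The following is a minimal sketch, assuming the freshly fetched chain has already been written to a temporary file on the same filesystem as the live cert; `swap_if_changed` is a hypothetical name and is not taken from the rotate script.

```bash
#!/usr/bin/env bash
# swap_if_changed NEW_FILE LIVE_FILE
# Returns 0 when LIVE_FILE was replaced by NEW_FILE.
# Returns 1 when the bytes were identical; NEW_FILE is then discarded
# and the caller can skip the reload.
swap_if_changed() {
  local new="$1" live="$2"
  if [ -f "$live" ] && cmp -s "$new" "$live"; then
    rm -f "$new"
    return 1
  fi
  # A rename within one filesystem is atomic: readers see either the old
  # file or the new one, never a partially written cert.
  mv -f "$new" "$live"
}
```

A caller might then run `swap_if_changed "$tmp_chain" /data/state/certs/fullchain.pem && docker compose kill -s SIGUSR1 caddy`, where the service name `caddy` is an assumption about the compose file.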

1. Set the required env vars in `.env`:
   ```
   DOMAIN=agnes.example.com
   TLS_FULLCHAIN_URL=https://your-ca.example.com/agnes/fullchain.pem
   TLS_PRIVKEY_URL= # empty → on-VM gen; or sm://<secret>
   TLS_CSR_SUBJECT=/C=…/ST=…/L=…/O=…/CN=agnes.example.com
   ```
2. Start with the `tls` profile + overlay (`docker-compose.tls.yml` closes host `:8000` so all traffic enters via `:443`):
   ```bash
   docker compose \
     -f docker-compose.yml \
     -f docker-compose.prod.yml \
     -f docker-compose.tls.yml \
     --profile tls up -d
   ```
3. Grab the CSR if you used on-VM gen:
   ```bash
   sudo cat /data/state/certs/cert.csr
   ```
   Submit to your corporate PKI. While waiting, Caddy is already up on :443 with the self-signed fallback.

#### Automatic rotation

`scripts/ops/agnes-tls-rotate.sh` is the single entry point — it handles fetch, self-signed fallback, auto-generation on missing key, atomic cert swap, and Caddy reload. Env vars it reads:

| Var | Required | Schemes | Notes |
|---|---|---|---|
| `DOMAIN` | yes | — | The hostname Caddy serves + the CN in auto-generated CSRs. |
| `TLS_FULLCHAIN_URL` | yes | `https://`, `sm://<secret>`, `gs://<obj>`, `file://` | Polled daily; rotate only reloads Caddy when the bytes change. |
| `TLS_PRIVKEY_URL` | optional | same | Empty activates on-VM gen. Set to a pre-provisioned scheme (e.g. `sm://`) for VM-replace resilience. |
| `TLS_CSR_SUBJECT` | optional | — | Stamped on auto-generated CSRs. Defaults to `/CN=<DOMAIN>` if unset. Example: `/C=US/ST=Illinois/L=Chicago/O=Your Org/CN=agnes.example.com`. |
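
The Schemes column implies a fetcher that dispatches on the URL prefix. The sketch below shows one way such a dispatcher could look; it is not the contents of `scripts/tls-fetch.sh`. The function name `fetch_url` is hypothetical, and the `gcloud` invocations are assumptions about how `sm://` and `gs://` references would be resolved.

```bash
#!/usr/bin/env bash
# fetch_url URL
# Writes the referenced bytes to stdout. Returns non-zero when the fetch
# fails, and 2 when the scheme is not one of the four listed in the table.
fetch_url() {
  local url="$1"
  case "$url" in
    https://*)
      curl -fsSL "$url"
      ;;
    sm://*)
      # Assumed mapping: sm://<secret> -> latest Secret Manager version.
      gcloud secrets versions access latest --secret="${url#sm://}"
      ;;
    gs://*)
      # Assumed mapping: stream the Cloud Storage object to stdout.
      gcloud storage cat "$url"
      ;;
    file://*)
      cat -- "${url#file://}"
      ;;
    *)
      echo "unsupported scheme: $url" >&2
      return 2
      ;;
  esac
}
```

For example, `fetch_url "$TLS_FULLCHAIN_URL" > /data/state/certs/fullchain.pem.new` would stage the chain next to the live file before any comparison or swap.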

`scripts/tls-fetch.sh` must also be installed at `/usr/local/bin/tls-fetch.sh`; it is the generic URL fetcher that the rotate script calls. On infra-repo-managed VMs, both scripts are installed by `startup.sh` and fired via a daily systemd timer; for manual compose deployments, copy them under `/usr/local/bin/` and wire a systemd timer (`OnBootSec=10min`, `OnUnitActiveSec=24h`, `Persistent=true`).
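
For a manual compose deployment, the timer wiring could look like the sketch below. The unit names and the `write_tls_rotate_units` helper are hypothetical; only the three timer settings and the `/usr/local/bin/` install location come from the paragraph above. How the script receives `DOMAIN` and the `TLS_*` variables is left open, since that depends on the deployment.

```bash
#!/usr/bin/env bash
# write_tls_rotate_units [UNIT_DIR]
# Writes a oneshot service and a matching timer into UNIT_DIR
# (default /etc/systemd/system). Unit names are illustrative.
write_tls_rotate_units() {
  local unit_dir="${1:-/etc/systemd/system}"

  cat > "$unit_dir/agnes-tls-rotate.service" <<'EOF'
[Unit]
Description=Rotate Agnes TLS certificate

[Service]
Type=oneshot
# Supply DOMAIN and the TLS_* variables here (Environment= or
# EnvironmentFile=), depending on where your deployment keeps them.
ExecStart=/usr/local/bin/agnes-tls-rotate.sh
EOF

  # A timer activates the service of the same name by default.
  cat > "$unit_dir/agnes-tls-rotate.timer" <<'EOF'
[Unit]
Description=Daily Agnes TLS rotation

[Timer]
OnBootSec=10min
OnUnitActiveSec=24h
Persistent=true

[Install]
WantedBy=timers.target
EOF
}
```

After writing the units as root, `systemctl daemon-reload && systemctl enable --now agnes-tls-rotate.timer` would activate the schedule. The systemd documentation describes `Persistent=` as affecting only `OnCalendar=` timers, so with the two monotonic triggers it is likely a no-op; it is kept here to match the settings named in the text.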

### Upgrades (manual)

```bash
cd /opt/agnes
git pull
docker compose -f docker-compose.yml -f docker-compose.prod.yml pull
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d
```

Or set up a cron job — see `infra/modules/customer-instance/startup-script.sh.tpl` for the reference implementation.
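
A cron-driven variant of the manual commands could be as small as the sketch below. It is not the reference implementation from `startup-script.sh.tpl`: the `render_upgrade_cron` helper, the lock-file path, and the five-minute schedule (borrowed from the Terraform path) are all assumptions.

```bash
#!/usr/bin/env bash
# render_upgrade_cron [INSTALL_DIR]
# Prints a single /etc/cron.d entry (note the "root" user field) that pulls
# newer images and re-applies the compose stack from INSTALL_DIR.
# flock -n makes a tick exit immediately if the previous one still runs.
render_upgrade_cron() {
  local dir="${1:-/opt/agnes}"
  local compose="docker compose -f docker-compose.yml -f docker-compose.prod.yml"
  printf '%s\n' "*/5 * * * * root cd $dir && flock -n /var/lock/agnes-upgrade.lock sh -c '$compose pull -q && $compose up -d'"
}
```

Installing it could be done with `render_upgrade_cron /opt/agnes | sudo tee /etc/cron.d/agnes-upgrade`. Because `up -d` only recreates containers whose image or configuration changed, idle ticks should be cheap. This sketch skips the `git pull` step, so changes to the compose files themselves would still need a manual update.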

## Which path should I pick?

| | Terraform | Docker Compose |
|---|---|---|
| Setup time | ~45 min first customer, ~15 min each subsequent | ~30 min |
| Infra-as-Code | Full (all resources in git) | Partial (compose.yml only) |
| Secret storage | GCP Secret Manager | `.env` file on host |
| Upgrades | Auto via cron, gated prod apply | Manual `docker compose pull` |
| Backups | Daily GCP snapshots, 30-day retention | You set up yourself |
| Monitoring / alerts | GCP Uptime Checks + alert policy | You set up yourself |
| TLS | Caddy + corp cert, auto-rotated from URL | Caddy + corp cert, manual or user-scripted rotation |
| Best for | Multi-tenant SaaS, production | Single-instance self-host, learning |

## Related documentation

- [`ONBOARDING.md`](ONBOARDING.md) — end-to-end Terraform onboarding checklist
- [`CONFIGURATION.md`](CONFIGURATION.md) — `instance.yaml`, env vars, per-instance config
- [`architecture.md`](architecture.md) — internal architecture (orchestrator, extractors, DB layout)
- [`QUICKSTART.md`](QUICKSTART.md) — local development setup
- [`superpowers/specs/2026-04-21-multi-customer-deployment-spec.md`](superpowers/specs/2026-04-21-multi-customer-deployment-spec.md) — design rationale for the multi-customer model