diff --git a/docs/superpowers/specs/2026-04-09-multi-instance-deployment-design.md b/docs/superpowers/specs/2026-04-09-multi-instance-deployment-design.md
new file mode 100644
index 0000000..1763e66
--- /dev/null
+++ b/docs/superpowers/specs/2026-04-09-multi-instance-deployment-design.md
@@ -0,0 +1,504 @@
+# Multi-Instance Deployment & Versioning: Design Spec
+
+## Goal
+
+Make Agnes deployable to 20+ independent customer instances via self-service, with safe versioning that prevents one customer's PR from breaking another's deployment.
+
+## Context
+
+Agnes is an open-source AI Data Analyst platform. Customers (or their AI agents) deploy it as a Docker image on their own infrastructure. Each instance connects to different data sources (Keboola, BigQuery, Jira, custom).
+
+**Key constraints:**
+- Customers range from semi-technical to non-technical, assisted by AI agents
+- Cloud-agnostic (GCP, AWS, Azure, on-prem, VPS)
+- One repo, one Docker image, many instances
+- Community PRs must not break existing customers
+- AI agent is the primary "installer" and "developer"
+
+---
+
+## 1. Versioning & Release Channels
+
+### CalVer: `YYYY.MM.N`
+
+Format: year.month.sequential-number. Example: `2026.04.1`, `2026.04.2`, `2026.05.1`.
+
+No manual release decisions. Every merge to main is a release.
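
The numbering rule above can be sketched as a small helper. This is a minimal illustration, not the release workflow itself: `next_calver` is a hypothetical name, it assumes `N` restarts at 1 each month (as the `2026.05.1` example suggests), and it assumes the list of existing tags is supplied by the caller (for example from git or the registry).

```python
import re
from datetime import date

# Matches tags of the form YYYY.MM.N, e.g. 2026.04.3
CALVER = re.compile(r"^(\d{4})\.(\d{2})\.(\d+)$")


def next_calver(existing_tags, today):
    """Return the next YYYY.MM.N tag for the month of `today`.

    Tags that are not CalVer (e.g. "stable", "dev") are ignored.
    Assumption: N restarts at 1 at the beginning of each month.
    """
    current_month = (today.year, today.month)
    highest = 0
    for tag in existing_tags:
        match = CALVER.match(tag)
        if match is None:
            continue
        year, month, n = (int(part) for part in match.groups())
        if (year, month) == current_month:
            # Compare numerically so that .10 sorts after .9
            highest = max(highest, n)
    return f"{today.year:04d}.{today.month:02d}.{highest + 1}"
```

Called with the tags already published and the merge date, the helper returns the tag for the new release; comparing `N` as an integer avoids the string-sorting trap where `2026.04.10` would sort before `2026.04.9`.
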
+
+### Three channels
+
+| Channel | Docker tag | Source | Who uses it |
+|---------|-----------|--------|-------------|
+| **dev** | `:dev`, `:dev-sha-abc1234` | Every CI-passing push on any feature branch | Developers, PR testing |
+| **stable** | `:stable`, `:2026.04.N` | Every merge to main + CI pass | All production customers |
+| **deprecated** | `:deprecated-2026.04.N` | Previous stable after a breaking change, or a build that failed the smoke test | Grace period (30 days) |
+
+### Tag lifecycle
+
+```
+feature branch push → CI ✅ → :dev + :dev-sha-abc1234
+                         ❌ → nothing pushed
+
+merge to main → CI ✅ → :stable + :2026.04.N + :sha-abc1234
+                   ❌ → merge blocked (CI required)
+                   │
+                   ▼
+              smoke test on canary VM
+                   │
+                   ✅ → :stable confirmed
+                   ❌ → alert, roll back canary to previous :stable
+                        broken build tagged :deprecated-2026.04.N
+```
+
+### Customer pins version
+
+```yaml
+# docker-compose.prod.yml
+
+# Auto-update (recommended): always latest stable
+image: ghcr.io/keboola/agnes-the-ai-analyst:stable
+
+# Pinned: specific release, manual update
+image: ghcr.io/keboola/agnes-the-ai-analyst:2026.04.3
+```
+
+### Main = stable
+
+- `main` branch is always releasable
+- Every merge to main triggers a new stable release
+- Feature branches are the dev channel
+- No promotion pipeline, no manual approval for releases
+- Smoke test is a post-deploy safety net, not a gate
+
+---
+
+## 2. Breaking Change Detection
+
+### What is a breaking change
+
+- `_meta` table schema change (add/remove column)
+- `_remote_attach` table schema change
+- API endpoint removed or response field removed
+- DuckDB system schema migration that drops data
+- CLI command removed or argument renamed
+- `instance.yaml` required key added
+
+### Automated detection in CI
+
+Every PR runs:
+
+1. **Contract tests**: `_meta` and `_remote_attach` schema validation against frozen spec
+2. **OpenAPI diff**: Compare PR's `openapi.json` against main's. Flag removed endpoints/fields.
+3. **DuckDB schema diff**: Compare table definitions in system.duckdb
+4. **Config diff**: Compare `instance.yaml.example` required keys
+5. **Full connector matrix**: ALL connectors tested, not just changed ones
+
+If a breaking change is detected:
+- PR gets `BREAKING` label automatically
+- Requires 2 reviewers (elevated review)
+- Commit message must have `BREAKING:` prefix
+- CHANGELOG.md entry with migration guide required
+- On merge: previous stable tagged as `:deprecated-YYYY.MM.N`
+
+### Deprecated channel
+
+When a breaking change merges:
+1. Previous stable image retagged to `:deprecated-2026.04.N`
+2. New build becomes `:stable` + `:2026.04.(N+1)`
+3. Health endpoint on deprecated version shows warning:
+   ```json
+   {"warnings": ["Running deprecated version 2026.04.3. Update to stable."]}
+   ```
+4. Deprecated images removed from GHCR after 30 days
+
+---
+
+## 3. Smoke Test (Post-Deploy Safety Net)
+
+### What it tests
+
+Automated sequence run on the canary VM after every `:stable` deploy:
+
+```
+1. GET /api/health → status != "unhealthy"
+2. POST /auth/token → 200 (valid credentials)
+3. GET /api/catalog/tables → count > 0
+4. POST /api/query {sql: "SELECT 1"} → 200 + rows
+5. POST /api/sync/trigger → 200
+6. (wait 30s)
+7. GET /api/health → check no new errors
+```
+
+### On failure
+
+1. Alert (GitHub issue + optional webhook)
+2. Canary VM rolled back to previous stable: `docker compose pull && docker compose up -d` with previous tag
+3. Failed build tagged `:deprecated-YYYY.MM.N`
+4. `:stable` tag reverted to previous good build
+
+### Implementation
+
+GitHub Actions job that runs after the `build-and-push` job succeeds (shown here with `needs:`, which requires both jobs to be in the same workflow; a separate `smoke-test.yml` would be triggered via `workflow_run` instead):
+
+```yaml
+smoke-test:
+  needs: build-and-push
+  runs-on: ubuntu-latest
+  steps:
+    - name: Deploy to canary
+      run: |
+        gcloud compute ssh canary-vm --command="
+          cd /opt/agnes &&
+          docker compose pull &&
+          docker compose up -d"
+
+    - name: Wait for healthy
+      run: |
+        for i in $(seq 1 30); do
+          STATUS=$(curl -sf canary:8000/api/health | jq -r .status)
+          [ -n "$STATUS" ] && [ "$STATUS" != "unhealthy" ] && exit 0
+          sleep 10
+        done; exit 1  # no healthy response within ~5 minutes: fail the step
+
+    - name: Run smoke tests
+      run: |
+        # auth, catalog, query, sync checks
+        ./scripts/smoke-test.sh canary:8000
+
+    - name: Rollback on failure
+      if: failure()
+      run: |
+        # retag and rollback
+```
+
+---
+
+## 4. Self-Service Deployment
+
+### Target experience
+
+The customer (or their AI agent) goes from zero to a running instance:
+
+```bash
+# 1. Get the code
+git clone https://github.com/keboola/agnes-the-ai-analyst.git
+cd agnes-the-ai-analyst
+
+# 2. Start it
+docker compose up -d
+
+# 3. Open browser or use API
+# First visit: /setup wizard (no users exist)
+# Or headless: curl -X POST localhost:8000/auth/bootstrap ...
+```
+
+### Two setup modes
+
+**A) Interactive (browser):**
+- First visit when no users exist → redirected to `/setup`
+- Step 1: Create admin account (email + password)
+- Step 2: Choose data source (Keboola / BigQuery / CSV / Custom)
+- Step 3: Enter credentials (token, URL)
+- Step 4: Auto-discover and register tables
+- Step 5: Trigger first sync
+- Done → redirect to dashboard
+
+**B) Headless (AI agent / CLI):**
+```bash
+# Bootstrap admin
+curl -X POST http://localhost:8000/auth/bootstrap \
+  -H "Content-Type: application/json" \
+  -d '{"email":"admin@company.com","password":"SecurePass123!"}'
+
+# Configure data source ($TOKEN = the admin's access token)
+curl -X POST http://localhost:8000/api/admin/configure \
+  -H "Authorization: Bearer $TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{"data_source":"keboola","keboola_token":"...","keboola_url":"..."}'
+
+# Discover and register tables
+curl -X POST http://localhost:8000/api/admin/discover-and-register \
+  -H "Authorization: Bearer $TOKEN"
+
+# Trigger first sync
+curl -X POST http://localhost:8000/api/sync/trigger \
+  -H "Authorization: Bearer $TOKEN"
+```
+
+Both modes lead to the same result. The AI agent uses the headless mode.
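
For an agent that prefers Python over curl, the three authenticated calls of the headless mode could be prepared as follows. This is a sketch under assumptions: `build_headless_setup` is a hypothetical helper, the token is taken as a parameter because the spec does not say how `$TOKEN` is obtained, and the requests are only built here, not sent.

```python
import json
from urllib.request import Request


def build_headless_setup(base_url, token, data_source_config):
    """Build the authenticated setup requests in the order the spec lists them.

    Hypothetical helper: endpoint paths are taken from the curl commands in
    the spec; nothing is sent, the caller decides how to execute them.
    """

    def post(path, body=None):
        headers = {"Authorization": f"Bearer {token}"}
        data = None
        if body is not None:
            headers["Content-Type"] = "application/json"
            data = json.dumps(body).encode("utf-8")
        return Request(f"{base_url}{path}", data=data, headers=headers, method="POST")

    return [
        post("/api/admin/configure", data_source_config),  # configure data source
        post("/api/admin/discover-and-register"),          # discover and register tables
        post("/api/sync/trigger"),                         # trigger first sync
    ]
```

Each returned object can be passed to `urllib.request.urlopen` in order; keeping construction separate from sending makes the sequence easy to inspect or log before anything touches the instance.
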
+
+### Auto-configuration
+
+On first `docker compose up` with no `.env`:
+- `JWT_SECRET_KEY` auto-generated and persisted to `/data/state/.jwt_secret`
+- `SESSION_SECRET` auto-generated similarly
+- App starts in "setup mode": only `/setup`, `/auth/bootstrap`, and `/api/health` are accessible
+
+On first `docker compose up` with `.env` containing `KEBOOLA_STORAGE_TOKEN`:
+- Auto-discovers tables from Keboola on first sync
+- Skips manual table registration step
+
+### What customer must provide
+
+| Required | Optional |
+|----------|----------|
+| Server with Docker | Custom domain + TLS |
+| Admin email + password | Google OAuth credentials |
+| Data source credentials (Keboola token OR BigQuery creds OR CSV files) | Telegram bot token |
+| | Jira webhook secret |
+
+### What customer does NOT have to do
+
+- Edit YAML manually (setup wizard generates `instance.yaml`)
+- Generate JWT secret (auto-generated)
+- Register tables manually (auto-discovery)
+- Understand DuckDB internals
+
+---
+
+## 5. Custom Connectors (Three Tiers)
+
+All tiers produce the same output: `extract.duckdb` with `_meta` table + `data/*.parquet`. The orchestrator treats them identically.
+
+### Tier A: Local mount (fastest, AI-generated)
+
+The customer's AI agent generates a connector. It lives outside the Docker image and survives image updates.
+
+```
+/opt/agnes/
+├── docker-compose.yml            ← official image
+├── docker-compose.override.yml   ← customer additions
+└── custom-connectors/
+    └── snowflake/
+        ├── extractor.py
+        └── requirements.txt
+```
+
+```yaml
+# docker-compose.override.yml
+services:
+  app:
+    volumes:
+      - ./custom-connectors:/app/connectors/custom:ro
+```
+
+The orchestrator scans `connectors/custom/*/` in addition to the built-in connectors.
+
+**How the AI agent creates one:**
+1. Reads CLAUDE.md → understands extract.duckdb contract
+2. Reads existing connector as reference (e.g., `connectors/keboola/extractor.py`)
+3. Generates `custom-connectors/snowflake/extractor.py`
+4. Runs contract test to validate output
+5. Done: the orchestrator picks it up on the next rebuild
+
+**Requirements for this to work:**
+- CLAUDE.md must perfectly describe the contract
+- Contract test must be runnable standalone
+- Existing connectors must be readable as examples
+- Clear error messages when contract doesn't match
+
+### Tier B: Standalone container (complex dependencies)
+
+For connectors needing their own runtime (Java, .NET, heavy Python packages).
+
+```yaml
+# docker-compose.override.yml
+services:
+  connector-sap:
+    build: ./custom-connectors/sap
+    volumes:
+      - data:/data
+    environment:
+      - DATA_DIR=/data
+      - SAP_HOST=...
+    profiles:
+      - extract
+```
+
+The connector is its own Docker image and writes to `/data/extracts/sap/extract.duckdb`, where the orchestrator finds it automatically.
+
+### Tier C: Community PR (shared with all)
+
+The connector is contributed to the main repo via PR. After merge, it is available in the official image for all customers.
+
+```
+connectors/
+├── keboola/     ← built-in
+├── bigquery/    ← built-in
+├── jira/        ← built-in
+└── snowflake/   ← community contributed
+```
+
+**PR requirements:**
+- Must pass contract tests
+- Must include tests
+- Must not modify shared code (orchestrator, API, auth)
+- CI runs full connector matrix
+
+---
+
+## 6. CI/CD Pipeline
+
+### On feature branch push
+
+```yaml
+ci.yml:
+  - tests (all 654+)
+  - contract tests (all connectors)
+  - docker build
+  - push :dev + :dev-sha-xxx to GHCR
+```
+
+### On merge to main
+
+```yaml
+release.yml:
+  - tests (all)
+  - contract tests (all connectors)
+  - breaking change detection (OpenAPI diff, schema diff)
+  - docker build
+  - push :stable + :YYYY.MM.N + :sha-xxx to GHCR
+  - trigger smoke test on canary
+
+smoke-test.yml (triggered):
+  - deploy to canary VM
+  - run smoke test sequence
+  - on failure: rollback canary, tag build as deprecated, create alert
+```
+
+### On PR
+
+```yaml
+pr-check.yml:
+  - tests
+  - contract tests
+  - breaking change detection
+  - label PR: "BREAKING" if detected
+  - require 2 reviewers if breaking
+```
+
+---
+
+## 7. Infrastructure (Cloud-Agnostic)
+
+### Primary: Docker Compose
+
+Works everywhere Docker runs. This is the default and only required deployment method.
+
+```bash
+git clone https://github.com/keboola/agnes-the-ai-analyst.git
+cd agnes-the-ai-analyst
+docker compose up -d
+```
+
+### Optional: Terraform (GCP)
+
+For automated provisioning. The Terraform code lives in `infra/` and uses a GCS remote state backend.
+
+```bash
+cd infra
+terraform workspace new customer-name
+terraform apply -var-file=instances/customer-name.tfvars
+```
+
+This creates a VM, installs Docker, clones the repo, generates `.env` and `instance.yaml`, and starts Docker Compose.
+
+### Optional: Caddy TLS
+
+The production profile adds a Caddy reverse proxy with automatic Let's Encrypt certificates:
+
+```bash
+DOMAIN=data.customer.com docker compose --profile production up -d
+```
+
+### Directory layout on customer server
+
+```
+/opt/agnes/                        ← git clone
+├── docker-compose.yml             ← official
+├── docker-compose.prod.yml        ← GHCR images
+├── docker-compose.override.yml    ← customer customizations
+├── .env                           ← secrets (gitignored)
+├── config/
+│   └── instance.yaml              ← generated by setup wizard
+├── custom-connectors/             ← Tier A connectors
+│   └── snowflake/
+└── Caddyfile                      ← TLS config
+
+/data/                             ← Docker volume (persistent)
+├── state/system.duckdb            ← users, registry, sync state
+├── analytics/server.duckdb        ← views into extracts
+└── extracts/                      ← per-source data
+    ├── keboola/extract.duckdb
+    ├── bigquery/extract.duckdb
+    └── snowflake/extract.duckdb   ← from custom connector
+```
+
+---
+
+## 8. AI Agent as Primary Installer
+
+CLAUDE.md and documentation must be optimized for AI agent consumption:
+
+### CLAUDE.md requirements
+- Complete extract.duckdb contract with exact SQL for `_meta` and `_remote_attach`
+- Step-by-step setup instructions with exact curl commands
+- Existing connectors as reference for AI-generated new ones
+- Clear error messages explaining what went wrong and how to fix it
+
+### API requirements
+- All setup operations available as API calls (not just UI)
+- Self-describing error messages: `"Missing KEBOOLA_STORAGE_TOKEN. Set it in .env or pass via /api/admin/configure"`
+- `/api/health` returns structured diagnostics an AI agent can parse
+- `/api/admin/configure` accepts data source config without file editing
+
+### Documentation requirements
+- Machine-readable (no screenshots, no "click here")
+- Every manual step has an equivalent API/CLI command
+- QUICKSTART.md optimized for copy-paste by AI agent
+
+---
+
+## 9. What Needs to Be Built
+
+### Must have (blocks multi-instance)
+
+| # | What | Effort |
+|---|------|--------|
+| 1 | CalVer auto-tagging in CI (release.yml) | 1 day |
+| 2 | Smoke test script + CI workflow | 1 day |
+| 3 | Breaking change detection in CI (OpenAPI diff, contract diff) | 2 days |
+| 4 | `/setup` wizard (web) + `/api/admin/configure` (headless) | 3 days |
+| 5 | Auto-generate JWT_SECRET_KEY on first start | 0.5 day |
+| 6 | Auto-discovery for Keboola tables on first sync | 1 day |
+| 7 | Custom connector mount support in orchestrator | 1 day |
+| 8 | `CHANGELOG.md` + release notes template | 0.5 day |
+| 9 | Health endpoint version + channel info | 0.5 day |
+
+### Should have (improves experience)
+
+| # | What | Effort |
+|---|------|--------|
+| 10 | Deprecated version warning in health endpoint | 0.5 day |
+| 11 | `/api/admin/discover-and-register` auto-discovery endpoint | 1 day |
+| 12 | Standalone container connector example (Tier B) | 0.5 day |
+| 13 | CLAUDE.md optimization for AI agent setup | 1 day |
+| 14 | Terraform module refactor for multi-workspace | 1 day |
+
+### Nice to have (future)
+
+| # | What |
+|---|------|
+| 15 | Community connector contribution guide |
+| 16 | Instance health dashboard (central monitoring) |
+| 17 | Automated backup (GCP disk snapshots) |
+| 18 | Usage analytics (opt-in telemetry) |
+
+---
+
+## Non-Goals
+
+- Multi-tenancy in single process (each customer = separate instance)
+- Kubernetes/Helm (Docker Compose is sufficient for target scale)
+- Paid tier / license keys (open-source, monetization TBD)
+- GUI for connector development (AI agent + CLAUDE.md is sufficient)