# Multi-Instance Deployment & Versioning — Design Spec > **Historical note (2026-04-24):** This spec is a snapshot from 2026-04-09. Some operational details have evolved since — most notably, the Caddy `production` profile referenced in command examples below was renamed to `tls` (see PR #51). For the current deployment commands, follow `docs/DEPLOYMENT.md`. This file is preserved as design history. ## Goal Make Agnes deployable to 20+ independent customer instances via self-service, with safe versioning that prevents one customer's PR from breaking another's deployment. ## Context Agnes is an open-source AI Data Analyst platform. Customers (or their AI agents) deploy it as a Docker image on their own infrastructure. Each instance connects to different data sources (Keboola, BigQuery, Jira, custom). **Key constraints:** - Customers range from semi-technical to non-technical, assisted by AI agents - Cloud-agnostic (GCP, AWS, Azure, on-prem, VPS) - One repo, one Docker image, many instances - Community PRs must not break existing customers - AI agent is the primary "installer" and "developer" --- ## 1. Versioning & Release Channels ### CalVer: `YYYY.MM.N` Format: year.month.sequential-number. Example: `2026.04.1`, `2026.04.2`, `2026.05.1`. No manual release decisions. Every merge to main is a release. ### Three channels | Channel | Floating tag | Versioned tag | Source | Who uses it | |---------|-------------|---------------|--------|-------------| | **dev** | `:dev` | `:dev-2026.04.N` | Every CI-passing push on any feature branch | Developers, PR testing | | **stable** | `:stable` | `:stable-2026.04.N` | Every merge to main + CI pass | All production customers | | **deprecated** | — | `:deprecated-2026.04.N` | Previous stable after breaking change or failed smoke test | Grace period (30 days) | Every image also gets a `:sha-abc1234` tag for exact commit traceability. ### Tag lifecycle ``` feature branch push → CI ✅ → :dev + :dev-2026.04.N + :sha-abc1234 ❌ → nothing pushed merge to main → CI ✅ → :stable + :stable-2026.04.N + :sha-abc1234 ❌ → merge blocked (CI required) │ ▼ smoke test on canary VM │ ✅ → :stable confirmed ❌ → alert, rollback canary to previous :stable broken build tagged :deprecated-2026.04.N ``` ### Version numbering CalVer `YYYY.MM.N` where N is a global auto-incrementing counter per month across both channels. Example timeline: ``` Apr 8 feature/foo push → :dev-2026.04.1 Apr 8 feature/bar push → :dev-2026.04.2 Apr 8 merge foo to main → :stable-2026.04.3 Apr 9 feature/baz push → :dev-2026.04.4 Apr 9 merge bar to main → :stable-2026.04.5 ``` This avoids confusion — version `2026.04.3` exists only once, in one channel. ### Customer pins version ```yaml # docker-compose.prod.yml # Auto-update (recommended): always latest stable image: ghcr.io/keboola/agnes-the-ai-analyst:stable # Pinned: specific stable release, manual update image: ghcr.io/keboola/agnes-the-ai-analyst:stable-2026.04.3 # Testing: latest dev image: ghcr.io/keboola/agnes-the-ai-analyst:dev # Testing: specific dev build image: ghcr.io/keboola/agnes-the-ai-analyst:dev-2026.04.2 ``` ### Main = stable - `main` branch is always releasable - Every merge to main triggers a new stable release - Feature branches are the dev channel - No promotion pipeline, no manual approval for releases - Smoke test is a post-deploy safety net, not a gate --- ## 2. Breaking Change Detection ### What is a breaking change - `_meta` table schema change (add/remove column) - `_remote_attach` table schema change - API endpoint removed or response field removed - DuckDB system schema migration that drops data - CLI command removed or argument renamed - `instance.yaml` required key added ### Automated detection in CI Every PR runs: 1. **Contract tests**: `_meta` and `_remote_attach` schema validation against frozen spec 2. **OpenAPI diff**: Compare PR's `openapi.json` against main's. Flag removed endpoints/fields. 3. **DuckDB schema diff**: Compare table definitions in system.duckdb 4. **Config diff**: Compare `instance.yaml.example` required keys 5. **Full connector matrix**: ALL connectors tested, not just changed ones If breaking change detected: - PR gets `BREAKING` label automatically - Requires 2 reviewers (elevated review) - Commit message must have `BREAKING:` prefix - CHANGELOG.md entry with migration guide required - On merge: previous stable tagged as `:deprecated-YYYY.MM.N` ### Deprecated channel When a breaking change merges: 1. Previous stable image retagged to `:deprecated-2026.04.N` 2. New build becomes `:stable` + `:2026.04.(N+1)` 3. Health endpoint on deprecated version shows warning: ```json {"warnings": ["Running deprecated version 2026.04.3. Update to stable."]} ``` 4. Deprecated images removed from GHCR after 30 days --- ## 3. Smoke Test (Post-Deploy Safety Net) ### What it tests Automated sequence run on canary VM after every `:stable` deploy: ``` 1. GET /api/health → status != "unhealthy" 2. POST /auth/token → 200 (valid credentials) 3. GET /api/catalog/tables → count > 0 4. POST /api/query {sql: "SELECT 1"} → 200 + rows 5. POST /api/sync/trigger → 200 6. (wait 30s) 7. GET /api/health → check no new errors ``` ### On failure 1. Alert (GitHub issue + optional webhook) 2. Canary VM rolled back to previous stable: `docker compose pull && docker compose up -d` with previous tag 3. Failed build tagged `:deprecated-YYYY.MM.N` 4. `:stable` tag reverted to previous good build ### Implementation GitHub Actions workflow triggered after the build-and-push workflow completes: ```yaml smoke-test: needs: build-and-push runs-on: ubuntu-latest steps: - name: Deploy to canary run: | gcloud compute ssh canary-vm --command=" cd /opt/agnes && docker compose pull && docker compose up -d" - name: Wait for healthy run: | for i in $(seq 1 30); do STATUS=$(curl -sf canary:8000/api/health | jq -r .status) [ "$STATUS" != "unhealthy" ] && break sleep 10 done - name: Run smoke tests run: | # auth, catalog, query, sync checks ./scripts/smoke-test.sh canary:8000 - name: Rollback on failure if: failure() run: | # retag and rollback ``` --- ## 4. Self-Service Deployment ### Target experience Customer (or their AI agent) goes from zero to running instance: ```bash # 1. Get the code git clone https://github.com/keboola/agnes-the-ai-analyst.git cd agnes-the-ai-analyst # 2. Start it docker compose up -d # 3. Open browser or use API # First visit: /setup wizard (no users exist) # Or headless: curl -X POST localhost:8000/auth/bootstrap ... ``` ### Two setup modes **A) Interactive (browser):** - First visit when no users exist → redirected to `/setup` - Step 1: Create admin account (email + password) - Step 2: Choose data source (Keboola / BigQuery / CSV / Custom) - Step 3: Enter credentials (token, URL) - Step 4: Auto-discover and register tables - Step 5: Trigger first sync - Done → redirect to dashboard **B) Headless (AI agent / CLI):** ```bash # Bootstrap admin curl -X POST http://localhost:8000/auth/bootstrap \ -H "Content-Type: application/json" \ -d '{"email":"admin@company.com","password":"SecurePass123!"}' # Configure data source curl -X POST http://localhost:8000/api/admin/configure \ -H "Authorization: Bearer $TOKEN" \ -H "Content-Type: application/json" \ -d '{"data_source":"keboola","keboola_token":"...","keboola_url":"..."}' # Discover and register tables curl -X POST http://localhost:8000/api/admin/discover-and-register \ -H "Authorization: Bearer $TOKEN" # Trigger first sync curl -X POST http://localhost:8000/api/sync/trigger \ -H "Authorization: Bearer $TOKEN" ``` Both modes lead to same result. AI agent uses headless. ### Auto-configuration On first `docker compose up` with no `.env`: - `JWT_SECRET_KEY` auto-generated and persisted to `/data/state/.jwt_secret` - `SESSION_SECRET` auto-generated similarly - App starts in "setup mode" — only `/setup`, `/auth/bootstrap`, and `/api/health` accessible On first `docker compose up` with `.env` containing `KEBOOLA_STORAGE_TOKEN`: - Auto-discovers tables from Keboola on first sync - Skips manual table registration step ### What customer must provide | Required | Optional | |----------|----------| | Server with Docker | Custom domain + TLS | | Admin email + password | Google OAuth credentials | | Data source credentials (Keboola token OR BigQuery creds OR CSV files) | Telegram bot token | | | Jira webhook secret | ### What customer must NOT do - Edit YAML manually (setup wizard generates `instance.yaml`) - Generate JWT secret (auto-generated) - Register tables manually (auto-discovery) - Understand DuckDB internals --- ## 5. Custom Connectors (Three Tiers) All tiers produce the same output: `extract.duckdb` with `_meta` table + `data/*.parquet`. Orchestrator treats them identically. ### Tier A: Local mount (fastest, AI-generated) Customer's AI agent generates a connector. Lives outside Docker image, survives updates. ``` /opt/agnes/ ├── docker-compose.yml ← official image ├── docker-compose.override.yml ← customer additions └── custom-connectors/ └── snowflake/ ├── extractor.py └── requirements.txt ``` ```yaml # docker-compose.override.yml services: app: volumes: - ./custom-connectors:/app/connectors/custom:ro ``` Orchestrator scans `connectors/custom/*/` in addition to built-in connectors. **How the AI agent creates one:** 1. Reads CLAUDE.md → understands extract.duckdb contract 2. Reads existing connector as reference (e.g., `connectors/keboola/extractor.py`) 3. Generates `custom-connectors/snowflake/extractor.py` 4. Runs contract test to validate output 5. Done — orchestrator picks it up on next rebuild **Requirements for this to work:** - CLAUDE.md must perfectly describe the contract - Contract test must be runnable standalone - Existing connectors must be readable as examples - Clear error messages when contract doesn't match ### Tier B: Standalone container (complex dependencies) For connectors needing their own runtime (Java, .NET, heavy Python packages). ```yaml # docker-compose.override.yml services: connector-sap: build: ./custom-connectors/sap volumes: - data:/data environment: - DATA_DIR=/data - SAP_HOST=... profiles: - extract ``` Connector is its own Docker image. Writes to `/data/extracts/sap/extract.duckdb`. Orchestrator finds it automatically. ### Tier C: Community PR (shared with all) Connector contributed to main repo via PR. After merge, available in official image for all customers. ``` connectors/ ├── keboola/ ← built-in ├── bigquery/ ← built-in ├── jira/ ← built-in └── snowflake/ ← community contributed ``` **PR requirements:** - Must pass contract tests - Must include tests - Must not modify shared code (orchestrator, API, auth) - CI runs full connector matrix --- ## 6. CI/CD Pipeline ### On feature branch push ```yaml ci.yml: - tests (all 654+) - contract tests (all connectors) - docker build - push :dev + :dev-sha-xxx to GHCR ``` ### On merge to main ```yaml release.yml: - tests (all) - contract tests (all connectors) - breaking change detection (OpenAPI diff, schema diff) - docker build - push :stable + :YYYY.MM.N + :sha-xxx to GHCR - trigger smoke test on canary smoke-test.yml (triggered): - deploy to canary VM - run smoke test sequence - on failure: rollback canary, tag build as deprecated, create alert ``` ### On PR ```yaml pr-check.yml: - tests - contract tests - breaking change detection - label PR: "BREAKING" if detected - require 2 reviewers if breaking ``` --- ## 7. Infrastructure (Cloud-Agnostic) ### Primary: Docker Compose Works everywhere Docker runs. This is the default and only required deployment method. ```bash git clone https://github.com/keboola/agnes-the-ai-analyst.git cd agnes-the-ai-analyst docker compose up -d ``` ### Optional: Terraform (GCP) For automated provisioning. Lives in `infra/` with GCS remote state backend. ```bash cd infra terraform workspace new customer-name terraform apply -var-file=instances/customer-name.tfvars ``` Creates VM, installs Docker, clones repo, generates `.env` and `instance.yaml`, starts Docker Compose. ### Optional: Caddy TLS Production profile adds Caddy reverse proxy with automatic Let's Encrypt: ```bash DOMAIN=data.customer.com docker compose --profile production up -d ``` ### Directory layout on customer server ``` /opt/agnes/ ← git clone ├── docker-compose.yml ← official ├── docker-compose.prod.yml ← GHCR images ├── docker-compose.override.yml ← customer customizations ├── .env ← secrets (gitignored) ├── config/ │ └── instance.yaml ← generated by setup wizard ├── custom-connectors/ ← Tier A connectors │ └── snowflake/ └── Caddyfile ← TLS config /data/ ← Docker volume (persistent) ├── state/system.duckdb ← users, registry, sync state ├── analytics/server.duckdb ← views into extracts └── extracts/ ← per-source data ├── keboola/extract.duckdb ├── bigquery/extract.duckdb └── snowflake/extract.duckdb ← from custom connector ``` --- ## 8. AI Agent as Primary Installer CLAUDE.md and documentation must be optimized for AI agent consumption: ### CLAUDE.md requirements - Complete extract.duckdb contract with exact SQL for `_meta` and `_remote_attach` - Step-by-step setup instructions with exact curl commands - Existing connectors as reference for AI-generated new ones - Clear error messages explaining what went wrong and how to fix ### API requirements - All setup operations available as API calls (not just UI) - Self-describing error messages: `"Missing KEBOOLA_STORAGE_TOKEN. Set it in .env or pass via /api/admin/configure"` - `/api/health` returns structured diagnostics AI agent can parse - `/api/admin/configure` accepts data source config without file editing ### Documentation requirements - Machine-readable (no screenshots, no "click here") - Every manual step has an equivalent API/CLI command - QUICKSTART.md optimized for copy-paste by AI agent --- ## 9. What Needs to Be Built ### Must have (blocks multi-instance) | # | What | Effort | |---|------|--------| | 1 | CalVer auto-tagging in CI (release.yml) | 1 day | | 2 | Smoke test script + CI workflow | 1 day | | 3 | Breaking change detection in CI (OpenAPI diff, contract diff) | 2 days | | 4 | `/setup` wizard (web) + `/api/admin/configure` (headless) | 3 days | | 5 | Auto-generate JWT_SECRET_KEY on first start | 0.5 day | | 6 | Auto-discovery for Keboola tables on first sync | 1 day | | 7 | Custom connector mount support in orchestrator | 1 day | | 8 | `CHANGELOG.md` + release notes template | 0.5 day | | 9 | Health endpoint version + channel info | 0.5 day | ### Should have (improves experience) | # | What | Effort | |---|------|--------| | 10 | Deprecated version warning in health endpoint | 0.5 day | | 11 | `/api/admin/discover-and-register` auto-discovery endpoint | 1 day | | 12 | Standalone container connector example (Tier B) | 0.5 day | | 13 | CLAUDE.md optimization for AI agent setup | 1 day | | 14 | Terraform module refactor for multi-workspace | 1 day | ### Nice to have (future) | # | What | |---|------| | 15 | Community connector contribution guide | | 16 | Instance health dashboard (central monitoring) | | 17 | Automated backup (GCP disk snapshots) | | 18 | Usage analytics (opt-in telemetry) | --- ## Non-Goals - Multi-tenancy in single process (each customer = separate instance) - Kubernetes/Helm (Docker Compose is sufficient for target scale) - Paid tier / license keys (open-source, monetization TBD) - GUI for connector development (AI agent + CLAUDE.md is sufficient)