docs: multi-instance deployment and versioning design spec

2026-04-09 21:14:21 +02:00
commit 4ea22232ef
parent b7a3c8dd13
1 changed file with 504 additions and 0 deletions
--- a/docs/superpowers/specs/2026-04-09-multi-instance-deployment-design.md
+++ b/docs/superpowers/specs/2026-04-09-multi-instance-deployment-design.md
@@ -0,0 +1,504 @@
+# Multi-Instance Deployment & Versioning — Design Spec
+
+## Goal
+
+Make Agnes deployable to 20+ independent customer instances via self-service, with safe versioning that prevents one customer's PR from breaking another's deployment.
+
+## Context
+
+Agnes is an open-source AI Data Analyst platform. Customers (or their AI agents) deploy it as a Docker image on their own infrastructure. Each instance connects to different data sources (Keboola, BigQuery, Jira, custom).
+
+**Key constraints:**
+- Customers range from semi-technical to non-technical, assisted by AI agents
+- Cloud-agnostic (GCP, AWS, Azure, on-prem, VPS)
+- One repo, one Docker image, many instances
+- Community PRs must not break existing customers
+- AI agent is the primary "installer" and "developer"
+
+---
+
+## 1. Versioning & Release Channels
+
+### CalVer: `YYYY.MM.N`
+
+Format: `year.month.sequential-number`, where the sequence number restarts at 1 each month. Examples: `2026.04.1`, `2026.04.2`, `2026.05.1`.
+
+No manual release decisions. Every merge to main is a release.
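A minimal sketch of how CI could derive the next tag, assuming the release job can list the existing image or git tags. The function name `next_calver` and the idea of passing the tag list in as plain strings are illustrative assumptions, not part of the spec.

```python
import re
from datetime import date

# Matches tags of the form YYYY.MM.N, e.g. "2026.04.3".
CALVER_RE = re.compile(r"^(\d{4})\.(\d{2})\.(\d+)$")


def next_calver(existing_tags, today):
    """Return the next YYYY.MM.N tag for the month that contains `today`.

    `existing_tags` may contain unrelated tags (e.g. "stable"); they are ignored.
    """
    used = []
    for tag in existing_tags:
        match = CALVER_RE.match(tag)
        if match and (int(match.group(1)), int(match.group(2))) == (today.year, today.month):
            used.append(int(match.group(3)))
    # N starts at 1 in a new month and otherwise increments by one.
    return f"{today.year:04d}.{today.month:02d}.{max(used, default=0) + 1}"
```

For example, with `2026.04.1` and `2026.04.2` already published, a merge on 2026-04-09 would be tagged `2026.04.3`.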
+
+### Three channels
+
+| Channel | Docker tag | Source | Who uses it |
+|---------|-----------|--------|-------------|
+| **dev** | `:dev`, `:dev-sha-abc1234` | Every CI-passing push on any feature branch | Developers, PR testing |
+| **stable** | `:stable`, `:2026.04.N` | Every merge to main + CI pass | All production customers |
+| **deprecated** | `:deprecated-2026.04.N` | Previous stable after a breaking change, or a build that failed the smoke test | Grace period (30 days) |
+
+### Tag lifecycle
+
+```
+feature branch push → CI ✅ → :dev + :dev-sha-abc1234
+                         ❌ → nothing pushed
+
+merge to main       → CI ✅ → :stable + :2026.04.N + :sha-abc1234
+                         ❌ → merge blocked (CI required)
+                                │
+                                ▼
+                         smoke test on canary VM
+                                │
+                         ✅ → :stable confirmed
+                         ❌ → alert, rollback canary to previous :stable
+                              broken build tagged :deprecated-2026.04.N
+```
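The tag lifecycle above can be read as a small mapping from branch to tag set. The sketch below is one possible encoding; the function names and the convention that every non-`main` branch counts as a feature branch are assumptions made for illustration.

```python
def tags_for_build(branch, version, short_sha):
    """Tags pushed for a CI-passing build, following the lifecycle above.

    `version` is the CalVer string (e.g. "2026.04.3"); it is only used on main.
    """
    if branch == "main":
        return ["stable", version, f"sha-{short_sha}"]
    # Assumption: any other branch is a feature branch (dev channel).
    return ["dev", f"dev-sha-{short_sha}"]


def deprecated_tag(version):
    """Tag applied to a build that is moved to the deprecated channel."""
    return f"deprecated-{version}"
```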
+
+### Customer pins version
+
+```yaml
+# docker-compose.prod.yml
+services:
+  app:
+    # Auto-update (recommended): always latest stable
+    image: ghcr.io/keboola/agnes-the-ai-analyst:stable
+
+    # Pinned alternative: specific release, manual update
+    # image: ghcr.io/keboola/agnes-the-ai-analyst:2026.04.3
+```
+
+### Main = stable
+
+- `main` branch is always releasable
+- Every merge to main triggers a new stable release
+- Feature branches are the dev channel
+- No promotion pipeline, no manual approval for releases
+- Smoke test is a post-deploy safety net, not a gate
+
+---
+
+## 2. Breaking Change Detection
+
+### What is a breaking change
+
+- `_meta` table schema change (add/remove column)
+- `_remote_attach` table schema change
+- API endpoint removed or response field removed
+- DuckDB system schema migration that drops data
+- CLI command removed or argument renamed
+- `instance.yaml` required key added
+
+### Automated detection in CI
+
+Every PR runs:
+
+1. **Contract tests**: `_meta` and `_remote_attach` schema validation against frozen spec
+2. **OpenAPI diff**: Compare PR's `openapi.json` against main's. Flag removed endpoints/fields.
+3. **DuckDB schema diff**: Compare table definitions in `system.duckdb` against main's
+4. **Config diff**: Compare required keys in `instance.yaml.example` against main's
+5. **Full connector matrix**: ALL connectors tested, not just changed ones
+
+If breaking change detected:
+- PR gets `BREAKING` label automatically
+- Requires 2 reviewers (elevated review)
+- Commit message must have `BREAKING:` prefix
+- CHANGELOG.md entry with migration guide required
+- On merge: previous stable tagged as `:deprecated-YYYY.MM.N`
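As an illustration of the OpenAPI diff check, here is a minimal sketch that compares two already-parsed `openapi.json` documents. It only looks at `paths` and at `components.schemas.*.properties`; inline response schemas and `$ref` resolution are deliberately out of scope. The function names are hypothetical.

```python
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def removed_operations(base_spec, pr_spec):
    """Return (path, METHOD) pairs present in base_spec but missing in pr_spec."""

    def operations(spec):
        return {
            (path, method.upper())
            for path, item in spec.get("paths", {}).items()
            for method in item
            if method.lower() in HTTP_METHODS
        }

    return sorted(operations(base_spec) - operations(pr_spec))


def removed_schema_fields(base_spec, pr_spec):
    """Return (schema, field) pairs dropped from components.schemas."""

    def fields(spec):
        schemas = spec.get("components", {}).get("schemas", {})
        return {
            (name, prop)
            for name, schema in schemas.items()
            for prop in schema.get("properties", {})
        }

    return sorted(fields(base_spec) - fields(pr_spec))


def is_breaking(base_spec, pr_spec):
    """True if the PR removes any endpoint or any named schema field."""
    return bool(
        removed_operations(base_spec, pr_spec)
        or removed_schema_fields(base_spec, pr_spec)
    )
```

Added endpoints and fields are ignored on purpose, since the spec only treats removals as breaking for the API.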
+
+### Deprecated channel
+
+When a breaking change merges:
+1. Previous stable image retagged to `:deprecated-2026.04.N`
+2. New build becomes `:stable` + `:2026.04.(N+1)`
+3. Health endpoint on deprecated version shows warning:
+   ```json
+   {"warnings": ["Running deprecated version 2026.04.3. Update to stable."]}
+   ```
+4. Deprecated images removed from GHCR after 30 days
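A minimal sketch of how the health endpoint could build the warning shown in step 3, assuming the running image knows its own version and channel (for example from build-time metadata). The function name and the `channel` argument are assumptions.

```python
def deprecation_warnings(version, channel):
    """Return the list of warnings for the health payload's "warnings" field."""
    if channel == "deprecated":
        return [f"Running deprecated version {version}. Update to stable."]
    return []
```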
+
+---
+
+## 3. Smoke Test (Post-Deploy Safety Net)
+
+### What it tests
+
+Automated sequence run on canary VM after every `:stable` deploy:
+
+```
+1. GET  /api/health                    → status != "unhealthy"
+2. POST /auth/token                    → 200 (valid credentials)
+3. GET  /api/catalog/tables            → count > 0
+4. POST /api/query {sql: "SELECT 1"}   → 200 + rows
+5. POST /api/sync/trigger              → 200
+6. (wait 30s)
+7. GET  /api/health                    → check no new errors
+```
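The sequence above could be scripted roughly as follows. This is a sketch under stated assumptions: `call(method, path, body)` is a hypothetical helper that wraps an HTTP client and returns `(status_code, parsed_json)`; the `count` and `rows` response keys and the HTTP 200 check on `/api/health` are guesses at the response shapes, which the spec does not define.

```python
import time


def run_smoke_test(call, credentials, wait_seconds=30, sleep=time.sleep):
    """Run the seven smoke-test steps and return the names of failed checks."""
    failures = []

    def check(name, ok):
        if not ok:
            failures.append(name)

    # 1. Health must not report "unhealthy".
    status, body = call("GET", "/api/health", None)
    check("health", status == 200 and body.get("status") != "unhealthy")

    # 2. Valid credentials must yield a token.
    status, _ = call("POST", "/auth/token", credentials)
    check("auth", status == 200)

    # 3. The catalog must list at least one table (assumed "count" key).
    status, body = call("GET", "/api/catalog/tables", None)
    check("catalog", status == 200 and body.get("count", 0) > 0)

    # 4. A trivial query must return rows (assumed "rows" key).
    status, body = call("POST", "/api/query", {"sql": "SELECT 1"})
    check("query", status == 200 and bool(body.get("rows")))

    # 5. A sync can be triggered.
    status, _ = call("POST", "/api/sync/trigger", None)
    check("sync", status == 200)

    # 6-7. Give the sync time to run, then re-check health.
    sleep(wait_seconds)
    status, body = call("GET", "/api/health", None)
    check("health-after-sync", status == 200 and body.get("status") != "unhealthy")

    return failures
```

Returning the list of failed checks, rather than stopping at the first one, gives the alert a fuller picture of what broke.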
+
+### On failure
+
+1. Alert (GitHub issue + optional webhook)
+2. Canary VM rolled back to the previous stable build by re-running `docker compose pull && docker compose up -d` with the previous tag
+3. Failed build tagged `:deprecated-YYYY.MM.N`
+4. `:stable` tag reverted to previous good build
+
+### Implementation
+
+GitHub Actions workflow triggered after the build-and-push workflow completes:
+
+```yaml
+smoke-test:
+  needs: build-and-push
+  runs-on: ubuntu-latest
+  steps:
+    - name: Deploy to canary
+      run: |
+        gcloud compute ssh canary-vm --command="
+          cd /opt/agnes &&
+          docker compose pull &&
+          docker compose up -d"
+
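+    - name: Wait for healthy
+      run: |
+        # Fail the step if the canary never reports a non-unhealthy status.
+        for i in $(seq 1 30); do
+          # An unreachable server or a missing status field yields an empty STATUS.
+          STATUS=$(curl -sf canary:8000/api/health | jq -r '.status // empty' || true)
+          if [ -n "$STATUS" ] && [ "$STATUS" != "unhealthy" ]; then
+            exit 0
+          fi
+          sleep 10
+        done
+        echo "canary did not become healthy in time" >&2
+        exit 1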
+
+    - name: Run smoke tests
+      run: |
+        # auth, catalog, query, sync checks
+        ./scripts/smoke-test.sh canary:8000
+
+    - name: Rollback on failure
+      if: failure()
+      run: |
+        # retag and rollback
+```
+
+---
+
+## 4. Self-Service Deployment
+
+### Target experience
+
+Customer (or their AI agent) goes from zero to running instance:
+
+```bash
+# 1. Get the code
+git clone https://github.com/keboola/agnes-the-ai-analyst.git
+cd agnes-the-ai-analyst
+
+# 2. Start it
+docker compose up -d
+
+# 3. Open browser or use API
+# First visit: /setup wizard (no users exist)
+# Or headless: curl -X POST localhost:8000/auth/bootstrap ...
+```
+
+### Two setup modes
+
+**A) Interactive (browser):**
+- First visit when no users exist → redirected to `/setup`
+- Step 1: Create admin account (email + password)
+- Step 2: Choose data source (Keboola / BigQuery / CSV / Custom)
+- Step 3: Enter credentials (token, URL)
+- Step 4: Auto-discover and register tables
+- Step 5: Trigger first sync
+- Done → redirect to dashboard
+
+**B) Headless (AI agent / CLI):**
+```bash
+# Bootstrap admin
+curl -X POST http://localhost:8000/auth/bootstrap \
+  -H "Content-Type: application/json" \
+  -d '{"email":"admin@company.com","password":"SecurePass123!"}'
+
+# Configure data source ($TOKEN is the admin's bearer token, e.g. obtained via POST /auth/token)
+curl -X POST http://localhost:8000/api/admin/configure \
+  -H "Authorization: Bearer $TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{"data_source":"keboola","keboola_token":"...","keboola_url":"..."}'
+
+# Discover and register tables
+curl -X POST http://localhost:8000/api/admin/discover-and-register \
+  -H "Authorization: Bearer $TOKEN"
+
+# Trigger first sync
+curl -X POST http://localhost:8000/api/sync/trigger \
+  -H "Authorization: Bearer $TOKEN"
+```
+
+Both modes lead to the same result. The AI agent uses headless mode.
+
+### Auto-configuration
+
+On first `docker compose up` with no `.env`:
+- `JWT_SECRET_KEY` auto-generated and persisted to `/data/state/.jwt_secret`
+- `SESSION_SECRET` auto-generated similarly
+- App starts in "setup mode" — only `/setup`, `/auth/bootstrap`, and `/api/health` accessible
+
+On first `docker compose up` with `.env` containing `KEBOOLA_STORAGE_TOKEN`:
+- Auto-discovers tables from Keboola on first sync
+- Skips manual table registration step
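The secret auto-generation described above might look like the sketch below. It assumes the secret is stored as plain text in a single file such as `/data/state/.jwt_secret`; the function name and the 48-byte token length are illustrative choices, not requirements from the spec.

```python
import os
import secrets
from pathlib import Path


def load_or_create_secret(path):
    """Return the secret stored at `path`, generating and persisting it if absent."""
    path = Path(path)
    if path.exists():
        return path.read_text().strip()

    path.parent.mkdir(parents=True, exist_ok=True)
    secret = secrets.token_urlsafe(48)
    try:
        # O_EXCL makes creation atomic; 0o600 keeps the file owner-readable only.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    except FileExistsError:
        # Another process created the file first; use its secret.
        return path.read_text().strip()
    with os.fdopen(fd, "w") as handle:
        handle.write(secret)
    return secret
```

Because the file lives on the persistent `/data` volume, the same secret is reused across container restarts and image updates, so existing tokens stay valid.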
+
+### What customer must provide
+
+| Required | Optional |
+|----------|----------|
+| Server with Docker | Custom domain + TLS |
+| Admin email + password | Google OAuth credentials |
+| Data source credentials (Keboola token OR BigQuery creds OR CSV files) | Telegram bot token |
+| | Jira webhook secret |
+
+### What customer does NOT need to do
+
+- Edit YAML manually (setup wizard generates `instance.yaml`)
+- Generate JWT secret (auto-generated)
+- Register tables manually (auto-discovery)
+- Understand DuckDB internals
+
+---
+
+## 5. Custom Connectors (Three Tiers)
+
+All tiers produce the same output: `extract.duckdb` with `_meta` table + `data/*.parquet`. Orchestrator treats them identically.
+
+### Tier A: Local mount (fastest, AI-generated)
+
+Customer's AI agent generates a connector. Lives outside Docker image, survives updates.
+
+```
+/opt/agnes/
+├── docker-compose.yml              ← official image
+├── docker-compose.override.yml     ← customer additions
+└── custom-connectors/
+    └── snowflake/
+        ├── extractor.py
+        └── requirements.txt
+```
+
+```yaml
+# docker-compose.override.yml
+services:
+  app:
+    volumes:
+      - ./custom-connectors:/app/connectors/custom:ro
+```
+
+Orchestrator scans `connectors/custom/*/` in addition to built-in connectors.
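A minimal sketch of that scan, assuming a connector is any directory that contains an `extractor.py`. The function name and the rule that a built-in connector wins on a name collision are assumptions; the spec does not say how collisions are resolved.

```python
from pathlib import Path


def discover_connectors(connectors_root):
    """Map connector name -> path to its extractor.py.

    Scans built-in connectors in `connectors_root/*/` first, then mounted
    ones in `connectors_root/custom/*/`.
    """
    root = Path(connectors_root)
    found = {}
    for base in (root, root / "custom"):
        if not base.is_dir():
            continue
        for child in sorted(base.iterdir()):
            # The custom/ directory is a container, not a connector itself.
            if base == root and child.name == "custom":
                continue
            extractor = child / "extractor.py"
            if extractor.is_file():
                # Assumption: first match wins, so built-ins take precedence.
                found.setdefault(child.name, extractor)
    return found
```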
+
+**How the AI agent creates one:**
+1. Reads CLAUDE.md → understands extract.duckdb contract
+2. Reads existing connector as reference (e.g., `connectors/keboola/extractor.py`)
+3. Generates `custom-connectors/snowflake/extractor.py`
+4. Runs contract test to validate output
+5. Done — orchestrator picks it up on next rebuild
+
+**Requirements for this to work:**
+- CLAUDE.md must describe the contract completely and unambiguously
+- Contract test must be runnable standalone
+- Existing connectors must be readable as examples
+- Clear error messages when contract doesn't match
+
+### Tier B: Standalone container (complex dependencies)
+
+For connectors needing their own runtime (Java, .NET, heavy Python packages).
+
+```yaml
+# docker-compose.override.yml
+services:
+  connector-sap:
+    build: ./custom-connectors/sap
+    volumes:
+      - data:/data
+    environment:
+      - DATA_DIR=/data
+      - SAP_HOST=...
+    profiles:
+      - extract
+```
+
+Connector is its own Docker image. Writes to `/data/extracts/sap/extract.duckdb`. Orchestrator finds it automatically.
+
+### Tier C: Community PR (shared with all)
+
+Connector contributed to main repo via PR. After merge, available in official image for all customers.
+
+```
+connectors/
+├── keboola/          ← built-in
+├── bigquery/         ← built-in
+├── jira/             ← built-in
+└── snowflake/        ← community contributed
+```
+
+**PR requirements:**
+- Must pass contract tests
+- Must include tests
+- Must not modify shared code (orchestrator, API, auth)
+- CI runs full connector matrix
+
+---
+
+## 6. CI/CD Pipeline
+
+### On feature branch push
+
+```yaml
+ci.yml:
+  - tests (all 654+)
+  - contract tests (all connectors)
+  - docker build
+  - push :dev + :dev-sha-xxx to GHCR
+```
+
+### On merge to main
+
+```yaml
+release.yml:
+  - tests (all)
+  - contract tests (all connectors)
+  - breaking change detection (OpenAPI diff, schema diff)
+  - docker build
+  - push :stable + :YYYY.MM.N + :sha-xxx to GHCR
+  - trigger smoke test on canary
+
+smoke-test.yml (triggered):
+  - deploy to canary VM
+  - run smoke test sequence
+  - on failure: rollback canary, tag build as deprecated, create alert
+```
+
+### On PR
+
+```yaml
+pr-check.yml:
+  - tests
+  - contract tests
+  - breaking change detection
+  - label PR: "BREAKING" if detected
+  - require 2 reviewers if breaking
+```
+
+---
+
+## 7. Infrastructure (Cloud-Agnostic)
+
+### Primary: Docker Compose
+
+Works everywhere Docker runs. This is the default and only required deployment method.
+
+```bash
+git clone https://github.com/keboola/agnes-the-ai-analyst.git
+cd agnes-the-ai-analyst
+docker compose up -d
+```
+
+### Optional: Terraform (GCP)
+
+For automated provisioning. Lives in `infra/` with GCS remote state backend.
+
+```bash
+cd infra
+terraform workspace new customer-name
+terraform apply -var-file=instances/customer-name.tfvars
+```
+
+Creates VM, installs Docker, clones repo, generates `.env` and `instance.yaml`, starts Docker Compose.
+
+### Optional: Caddy TLS
+
+Production profile adds Caddy reverse proxy with automatic Let's Encrypt:
+
+```bash
+DOMAIN=data.customer.com docker compose --profile production up -d
+```
+
+### Directory layout on customer server
+
+```
+/opt/agnes/                           ← git clone
+├── docker-compose.yml                ← official
+├── docker-compose.prod.yml           ← GHCR images
+├── docker-compose.override.yml       ← customer customizations
+├── .env                              ← secrets (gitignored)
+├── config/
+│   └── instance.yaml                 ← generated by setup wizard
+├── custom-connectors/                ← Tier A connectors
+│   └── snowflake/
+└── Caddyfile                         ← TLS config
+
+/data/                                ← Docker volume (persistent)
+├── state/system.duckdb               ← users, registry, sync state
+├── analytics/server.duckdb           ← views into extracts
+└── extracts/                         ← per-source data
+    ├── keboola/extract.duckdb
+    ├── bigquery/extract.duckdb
+    └── snowflake/extract.duckdb      ← from custom connector
+```
+
+---
+
+## 8. AI Agent as Primary Installer
+
+CLAUDE.md and documentation must be optimized for AI agent consumption:
+
+### CLAUDE.md requirements
+- Complete extract.duckdb contract with exact SQL for `_meta` and `_remote_attach`
+- Step-by-step setup instructions with exact curl commands
+- Existing connectors as reference for AI-generated new ones
+- Clear error messages explaining what went wrong and how to fix
+
+### API requirements
+- All setup operations available as API calls (not just UI)
+- Self-describing error messages: `"Missing KEBOOLA_STORAGE_TOKEN. Set it in .env or pass via /api/admin/configure"`
+- `/api/health` returns structured diagnostics AI agent can parse
+- `/api/admin/configure` accepts data source config without file editing
+
+### Documentation requirements
+- Machine-readable (no screenshots, no "click here")
+- Every manual step has an equivalent API/CLI command
+- QUICKSTART.md optimized for copy-paste by AI agent
+
+---
+
+## 9. What Needs to Be Built
+
+### Must have (blocks multi-instance)
+
+| # | What | Effort |
+|---|------|--------|
+| 1 | CalVer auto-tagging in CI (release.yml) | 1 day |
+| 2 | Smoke test script + CI workflow | 1 day |
+| 3 | Breaking change detection in CI (OpenAPI diff, contract diff) | 2 days |
+| 4 | `/setup` wizard (web) + `/api/admin/configure` (headless) | 3 days |
+| 5 | Auto-generate JWT_SECRET_KEY on first start | 0.5 day |
+| 6 | Auto-discovery for Keboola tables on first sync | 1 day |
+| 7 | Custom connector mount support in orchestrator | 1 day |
+| 8 | `CHANGELOG.md` + release notes template | 0.5 day |
+| 9 | Health endpoint version + channel info | 0.5 day |
+
+### Should have (improves experience)
+
+| # | What | Effort |
+|---|------|--------|
+| 10 | Deprecated version warning in health endpoint | 0.5 day |
+| 11 | `/api/admin/discover-and-register` auto-discovery endpoint | 1 day |
+| 12 | Standalone container connector example (Tier B) | 0.5 day |
+| 13 | CLAUDE.md optimization for AI agent setup | 1 day |
+| 14 | Terraform module refactor for multi-workspace | 1 day |
+
+### Nice to have (future)
+
+| # | What |
+|---|------|
+| 15 | Community connector contribution guide |
+| 16 | Instance health dashboard (central monitoring) |
+| 17 | Automated backup (GCP disk snapshots) |
+| 18 | Usage analytics (opt-in telemetry) |
+
+---
+
+## Non-Goals
+
+- Multi-tenancy in single process (each customer = separate instance)
+- Kubernetes/Helm (Docker Compose is sufficient for target scale)
+- Paid tier / license keys (open-source, monetization TBD)
+- GUI for connector development (AI agent + CLAUDE.md is sufficient)