agnes-the-ai-analyst/docs/superpowers/specs/2026-04-09-multi-instance-deployment-design.md

# Multi-Instance Deployment & Versioning — Design Spec

## Goal

Make Agnes deployable to 20+ independent customer instances via self-service, with safe versioning that prevents one customer's PR from breaking another's deployment.

## Context

Agnes is an open-source AI Data Analyst platform. Customers (or their AI agents) deploy it as a Docker image on their own infrastructure. Each instance connects to different data sources (Keboola, BigQuery, Jira, custom).

**Key constraints:**
- Customers range from semi-technical to non-technical, assisted by AI agents
- Cloud-agnostic (GCP, AWS, Azure, on-prem, VPS)
- One repo, one Docker image, many instances
- Community PRs must not break existing customers
- AI agent is the primary "installer" and "developer"

---

## 1. Versioning & Release Channels

### CalVer: `YYYY.MM.N`

Format: year.month.sequential-number. Example: `2026.04.1`, `2026.04.2`, `2026.05.1`.

No manual release decisions. Every merge to main is a release.

### Three channels

| Channel | Floating tag | Versioned tag | Source | Who uses it |
|---------|-------------|---------------|--------|-------------|
| **dev** | `:dev` | `:dev-2026.04.N` | Every CI-passing push on any feature branch | Developers, PR testing |
| **stable** | `:stable` | `:stable-2026.04.N` | Every merge to main + CI pass | All production customers |
| **deprecated** | none | `:deprecated-2026.04.N` | Previous stable after a breaking change, or a build that failed the smoke test | Grace period (30 days) |

Every image also gets a `:sha-abc1234` tag for exact commit traceability.
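
The tag set for one build can be sketched as a small helper. This is an illustration only: the function name `image_tags` and the 7-character SHA truncation are assumptions inferred from the `:sha-abc1234` example, not part of the spec.

```python
def image_tags(channel: str, version: str, sha: str) -> list[str]:
    """Return the tags pushed for one CI-passing build.

    channel: "dev" or "stable"; version: CalVer string such as "2026.04.3";
    sha: commit hash, truncated to 7 characters (assumed length).
    """
    if channel not in ("dev", "stable"):
        raise ValueError(f"unknown channel: {channel!r}")
    return [f":{channel}", f":{channel}-{version}", f":sha-{sha[:7]}"]


print(image_tags("stable", "2026.04.3", "abc1234def"))
# [':stable', ':stable-2026.04.3', ':sha-abc1234']
```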

### Tag lifecycle

```
feature branch push → CI ✅ → :dev + :dev-2026.04.N + :sha-abc1234
                         ❌ → nothing pushed

merge to main       → CI ✅ → :stable + :stable-2026.04.N + :sha-abc1234
                         ❌ → merge blocked (CI required)
                                │
                                ▼
                         smoke test on canary VM
                                │
                         ✅ → :stable confirmed
                         ❌ → alert, rollback canary to previous :stable
                              broken build tagged :deprecated-2026.04.N
```

### Version numbering

CalVer `YYYY.MM.N`, where N is a single auto-incrementing counter shared by the dev and stable channels. The counter restarts at 1 each month.

Example timeline:
```
Apr 8  feature/foo push     → :dev-2026.04.1
Apr 8  feature/bar push     → :dev-2026.04.2
Apr 8  merge foo to main    → :stable-2026.04.3
Apr 9  feature/baz push     → :dev-2026.04.4
Apr 9  merge bar to main    → :stable-2026.04.5
```

This avoids confusion — version `2026.04.3` exists only once, in one channel.
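
A minimal sketch of the shared counter, assuming the next version is derived from the tags that already exist in the registry. The function name `next_version` and the tag-parsing regex are hypothetical; how CI actually lists the tags is left open.

```python
import re
from datetime import date

# Versioned tags from any channel, e.g. "dev-2026.04.1" or "stable-2026.04.3".
TAG_RE = re.compile(r"^(?:dev|stable|deprecated)-(\d{4})\.(\d{2})\.(\d+)$")


def next_version(existing_tags: list[str], today: date) -> str:
    """Return the next YYYY.MM.N, using one counter across all channels."""
    highest = 0
    for tag in existing_tags:
        match = TAG_RE.match(tag.lstrip(":"))
        if not match:
            continue  # floating tags (:dev, :stable) and :sha-* carry no counter
        year, month, n = (int(group) for group in match.groups())
        if (year, month) == (today.year, today.month):
            highest = max(highest, n)
    return f"{today.year:04d}.{today.month:02d}.{highest + 1}"


tags = [":dev", ":dev-2026.04.1", ":dev-2026.04.2", ":stable-2026.04.3", ":sha-abc1234"]
print(next_version(tags, date(2026, 4, 9)))  # 2026.04.4
print(next_version(tags, date(2026, 5, 1)))  # 2026.05.1
```

Two builds that finish at the same moment could compute the same N from the same tag list, so a real pipeline would need some form of serialization (for example, a concurrency group in CI) around this step.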

### Customer pins version

```yaml
# docker-compose.prod.yml

# Auto-update (recommended): always latest stable
image: ghcr.io/keboola/agnes-the-ai-analyst:stable

# Pinned: specific stable release, manual update
image: ghcr.io/keboola/agnes-the-ai-analyst:stable-2026.04.3

# Testing: latest dev
image: ghcr.io/keboola/agnes-the-ai-analyst:dev

# Testing: specific dev build
image: ghcr.io/keboola/agnes-the-ai-analyst:dev-2026.04.2
```

### Main = stable

- `main` branch is always releasable
- Every merge to main triggers a new stable release
- Feature branches are the dev channel
- No promotion pipeline, no manual approval for releases
- Smoke test is a post-deploy safety net, not a gate

---

## 2. Breaking Change Detection

### What is a breaking change

- `_meta` table schema change (add/remove column)
- `_remote_attach` table schema change
- API endpoint removed or response field removed
- DuckDB system schema migration that drops data
- CLI command removed or argument renamed
- `instance.yaml` required key added

### Automated detection in CI

Every PR runs:

1. **Contract tests**: `_meta` and `_remote_attach` schema validation against frozen spec
2. **OpenAPI diff**: Compare PR's `openapi.json` against main's. Flag removed endpoints/fields.
3. **DuckDB schema diff**: Compare table definitions in system.duckdb
4. **Config diff**: Compare `instance.yaml.example` required keys
5. **Full connector matrix**: ALL connectors tested, not just changed ones
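
The OpenAPI diff in step 2 could start from something like the sketch below, which only detects removed endpoints. The function name `removed_operations` is hypothetical, and removed response fields would need a deeper comparison of the schemas that is not shown here.

```python
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def removed_operations(base: dict, head: dict) -> list[str]:
    """List 'METHOD /path' operations present in base but missing from head."""
    removed = []
    head_paths = head.get("paths", {})
    for path, item in base.get("paths", {}).items():
        for method in item:
            if method.lower() not in HTTP_METHODS:
                continue  # skip non-operation keys such as "parameters"
            if method not in head_paths.get(path, {}):
                removed.append(f"{method.upper()} {path}")
    return sorted(removed)


# In CI both documents would be loaded with json.load() from main's and the PR's openapi.json.
main_spec = {"paths": {"/api/health": {"get": {}}, "/api/sync/trigger": {"post": {}}}}
pr_spec = {"paths": {"/api/health": {"get": {}}}}
print(removed_operations(main_spec, pr_spec))  # ['POST /api/sync/trigger']
```

A non-empty result would be one signal for applying the `BREAKING` label.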

If breaking change detected:
- PR gets `BREAKING` label automatically
- Requires 2 reviewers (elevated review)
- Commit message must have `BREAKING:` prefix
- CHANGELOG.md entry with migration guide required
- On merge: previous stable tagged as `:deprecated-YYYY.MM.N`

### Deprecated channel

When a breaking change merges:
1. Previous stable image retagged to `:deprecated-2026.04.N`
2. New build becomes `:stable` + `:stable-2026.04.M`, where M is the next value of the shared counter
3. Health endpoint on deprecated version shows warning:
   ```json
   {"warnings": ["Running deprecated version 2026.04.3. Update to stable."]}
   ```
4. Deprecated images removed from GHCR after 30 days
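
As an illustration of steps 3 and 4, the sketch below builds the warning payload and checks whether the grace period has elapsed. How an instance learns that its version is deprecated is not specified here, so the `deprecated_versions` set is an assumption, as are both function names and the reading of "after 30 days" as "30 or more days since deprecation".

```python
import json
from datetime import date, timedelta

GRACE_PERIOD = timedelta(days=30)


def health_warnings(running_version: str, deprecated_versions: set[str]) -> dict:
    """Build the warnings part of the health payload for the running version."""
    warnings = []
    if running_version in deprecated_versions:
        warnings.append(f"Running deprecated version {running_version}. Update to stable.")
    return {"warnings": warnings}


def is_due_for_removal(deprecated_on: date, today: date) -> bool:
    """True once the 30-day grace period has fully elapsed."""
    return today - deprecated_on >= GRACE_PERIOD


print(json.dumps(health_warnings("2026.04.3", {"2026.04.3"})))
print(is_due_for_removal(date(2026, 4, 9), date(2026, 5, 9)))  # True
```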

---

## 3. Smoke Test (Post-Deploy Safety Net)

### What it tests

Automated sequence run on canary VM after every `:stable` deploy:

```
1. GET  /api/health                    → status != "unhealthy"
2. POST /auth/token                    → 200 (valid credentials)
3. GET  /api/catalog/tables            → count > 0
4. POST /api/query {sql: "SELECT 1"}   → 200 + rows
5. POST /api/sync/trigger              → 200
6. (wait 30s)
7. GET  /api/health                    → check no new errors
```
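
The sequence above can be sketched in Python (3.9+) with the HTTP layer injected, so the checks are testable without a live canary. Several details are assumptions: the `Transport` callable, the `count`, `rows`, and `errors` response fields, and the way "no new errors" is measured. Passing the token from step 2 to later calls is left to the transport, because the response shape of `/auth/token` is not specified here.

```python
import time
from typing import Callable, Optional

# Hypothetical transport: (method, path, JSON body) -> (HTTP status, parsed JSON body)
Transport = Callable[[str, str, Optional[dict]], tuple[int, dict]]


def run_smoke(call: Transport, credentials: dict, wait_seconds: float = 30) -> list[str]:
    """Run the seven steps in order and describe every failed check."""
    failures = []

    _, before = call("GET", "/api/health", None)                      # step 1
    if before.get("status") == "unhealthy":
        failures.append("health: status is unhealthy")

    status, _ = call("POST", "/auth/token", credentials)              # step 2
    if status != 200:
        failures.append(f"auth: expected 200, got {status}")

    _, catalog = call("GET", "/api/catalog/tables", None)             # step 3
    if catalog.get("count", 0) <= 0:                                  # "count" field is assumed
        failures.append("catalog: no tables registered")

    status, result = call("POST", "/api/query", {"sql": "SELECT 1"})  # step 4
    if status != 200 or not result.get("rows"):                       # "rows" field is assumed
        failures.append("query: SELECT 1 returned no rows")

    status, _ = call("POST", "/api/sync/trigger", None)               # step 5
    if status != 200:
        failures.append(f"sync: expected 200, got {status}")

    time.sleep(wait_seconds)                                          # step 6

    _, after = call("GET", "/api/health", None)                       # step 7
    new_errors = len(after.get("errors", [])) - len(before.get("errors", []))
    if new_errors > 0:                                                # "errors" list is assumed
        failures.append(f"health: {new_errors} new error(s) after sync")

    return failures
```

An empty list means the build passes. Any entry would trigger the failure handling described next.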

### On failure

1. Alert (GitHub issue + optional webhook)
2. Canary VM rolled back to previous stable: `docker compose pull && docker compose up -d` with previous tag
3. Failed build tagged `:deprecated-YYYY.MM.N`
4. `:stable` tag reverted to previous good build

### Implementation

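GitHub Actions job that runs after the `build-and-push` job succeeds. Note that `needs` only links jobs within one workflow; a separate `smoke-test.yml` workflow would use a `workflow_run` trigger instead.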

```yaml
smoke-test:
  needs: build-and-push
  runs-on: ubuntu-latest
  steps:
    - name: Deploy to canary
      run: |
        gcloud compute ssh canary-vm --command="
          cd /opt/agnes &&
          docker compose pull &&
          docker compose up -d"

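    - name: Wait for healthy
      run: |
        # Poll for up to 5 minutes. An empty STATUS means the app is not answering yet,
        # so it must not count as healthy.
        for i in $(seq 1 30); do
          STATUS=$(curl -sf canary:8000/api/health | jq -r .status || true)
          if [ -n "$STATUS" ] && [ "$STATUS" != "null" ] && [ "$STATUS" != "unhealthy" ]; then
            exit 0
          fi
          sleep 10
        done
        echo "canary did not become healthy in time" >&2
        exit 1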

    - name: Run smoke tests
      run: |
        # auth, catalog, query, sync checks
        ./scripts/smoke-test.sh canary:8000

    - name: Rollback on failure
      if: failure()
      run: |
        # retag and rollback
```

---

## 4. Self-Service Deployment

### Target experience

Customer (or their AI agent) goes from zero to running instance:

```bash
# 1. Get the code
git clone https://github.com/keboola/agnes-the-ai-analyst.git
cd agnes-the-ai-analyst

# 2. Start it
docker compose up -d

# 3. Open browser or use API
# First visit: /setup wizard (no users exist)
# Or headless: curl -X POST localhost:8000/auth/bootstrap ...
```

### Two setup modes

**A) Interactive (browser):**
- First visit when no users exist → redirected to `/setup`
- Step 1: Create admin account (email + password)
- Step 2: Choose data source (Keboola / BigQuery / CSV / Custom)
- Step 3: Enter credentials (token, URL)
- Step 4: Auto-discover and register tables
- Step 5: Trigger first sync
- Done → redirect to dashboard

**B) Headless (AI agent / CLI):**
```bash
# Bootstrap admin
curl -X POST http://localhost:8000/auth/bootstrap \
  -H "Content-Type: application/json" \
  -d '{"email":"admin@company.com","password":"SecurePass123!"}'

# Configure data source
# ($TOKEN is the admin's access token, e.g. obtained via POST /auth/token)
curl -X POST http://localhost:8000/api/admin/configure \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"data_source":"keboola","keboola_token":"...","keboola_url":"..."}'

# Discover and register tables
curl -X POST http://localhost:8000/api/admin/discover-and-register \
  -H "Authorization: Bearer $TOKEN"

# Trigger first sync
curl -X POST http://localhost:8000/api/sync/trigger \
  -H "Authorization: Bearer $TOKEN"
```

Both modes lead to the same result. AI agents use the headless mode.

### Auto-configuration

On first `docker compose up` with no `.env`:
- `JWT_SECRET_KEY` auto-generated and persisted to `/data/state/.jwt_secret`
- `SESSION_SECRET` auto-generated similarly
- App starts in "setup mode" — only `/setup`, `/auth/bootstrap`, and `/api/health` accessible
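
The generate-once-and-persist behavior could look roughly like this. It is a sketch under assumptions: the function name `load_or_create_secret`, the secret length, and the file permissions are illustrative choices, and the demo writes to a temporary directory rather than `/data/state`.

```python
import os
import secrets
import tempfile
from pathlib import Path


def load_or_create_secret(path: Path) -> str:
    """Return the secret stored at `path`, generating and persisting it on first use."""
    if path.exists():
        return path.read_text().strip()
    path.parent.mkdir(parents=True, exist_ok=True)
    secret = secrets.token_urlsafe(48)
    # Create the file with owner-only permissions. O_EXCL makes the call fail if
    # another process created the file in the meantime, instead of overwriting it.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(secret)
    return secret


with tempfile.TemporaryDirectory() as tmp:
    secret_file = Path(tmp) / "state" / ".jwt_secret"
    first = load_or_create_secret(secret_file)
    second = load_or_create_secret(secret_file)  # simulates a container restart
    print(first == second)  # True
```

Persisting the value on the data volume is what keeps existing tokens valid across image updates.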

On first `docker compose up` with `.env` containing `KEBOOLA_STORAGE_TOKEN`:
- Auto-discovers tables from Keboola on first sync
- Skips manual table registration step

### What customer must provide

| Required | Optional |
|----------|----------|
| Server with Docker | Custom domain + TLS |
| Admin email + password | Google OAuth credentials |
| Data source credentials (Keboola token OR BigQuery creds OR CSV files) | Telegram bot token |
| | Jira webhook secret |

### What customer does NOT have to do

- Edit YAML manually (setup wizard generates `instance.yaml`)
- Generate JWT secret (auto-generated)
- Register tables manually (auto-discovery)
- Understand DuckDB internals

---

## 5. Custom Connectors (Three Tiers)

All tiers produce the same output: `extract.duckdb` with `_meta` table + `data/*.parquet`. Orchestrator treats them identically.

### Tier A: Local mount (fastest, AI-generated)

Customer's AI agent generates a connector. Lives outside Docker image, survives updates.

```
/opt/agnes/
├── docker-compose.yml              ← official image
├── docker-compose.override.yml     ← customer additions
└── custom-connectors/
    └── snowflake/
        ├── extractor.py
        └── requirements.txt
```

```yaml
# docker-compose.override.yml
services:
  app:
    volumes:
      - ./custom-connectors:/app/connectors/custom:ro
```

Orchestrator scans `connectors/custom/*/` in addition to built-in connectors.
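
A sketch of that scan, assuming a connector is any directory that contains an `extractor.py`. The function name `discover_connectors` and the rule that a later root overrides an earlier one on a name clash are assumptions, not confirmed orchestrator behavior.

```python
import tempfile
from pathlib import Path


def discover_connectors(*roots: Path) -> dict[str, Path]:
    """Map connector name to its extractor.py for every connector directory found."""
    found = {}
    for root in roots:
        if not root.is_dir():
            continue  # e.g. no custom connectors mounted
        for entry in sorted(root.iterdir()):
            extractor = entry / "extractor.py"
            if entry.is_dir() and extractor.is_file():
                found[entry.name] = extractor  # later roots win on a name clash
    return found


with tempfile.TemporaryDirectory() as tmp:
    connectors = Path(tmp) / "connectors"
    for name in ("keboola", "custom/snowflake"):
        (connectors / name).mkdir(parents=True)
        (connectors / name / "extractor.py").write_text("# placeholder\n")
    print(sorted(discover_connectors(connectors, connectors / "custom")))
    # ['keboola', 'snowflake']
```

The `custom` directory itself is skipped in the first pass because it has no `extractor.py` of its own.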

**How the AI agent creates one:**
1. Reads CLAUDE.md → understands extract.duckdb contract
2. Reads existing connector as reference (e.g., `connectors/keboola/extractor.py`)
3. Generates `custom-connectors/snowflake/extractor.py`
4. Runs contract test to validate output
5. Done — orchestrator picks it up on next rebuild

**Requirements for this to work:**
- CLAUDE.md must describe the contract completely and unambiguously
- Contract test must be runnable standalone
- Existing connectors must be readable as examples
- Clear error messages when contract doesn't match

### Tier B: Standalone container (complex dependencies)

For connectors needing their own runtime (Java, .NET, heavy Python packages).

```yaml
# docker-compose.override.yml
services:
  connector-sap:
    build: ./custom-connectors/sap
    volumes:
      - data:/data
    environment:
      - DATA_DIR=/data
      - SAP_HOST=...
    profiles:
      - extract
```

Connector is its own Docker image. Writes to `/data/extracts/sap/extract.duckdb`. Orchestrator finds it automatically.

### Tier C: Community PR (shared with all)

Connector contributed to main repo via PR. After merge, available in official image for all customers.

```
connectors/
├── keboola/          ← built-in
├── bigquery/         ← built-in
├── jira/             ← built-in
└── snowflake/        ← community contributed
```

**PR requirements:**
- Must pass contract tests
- Must include tests
- Must not modify shared code (orchestrator, API, auth)
- CI runs full connector matrix

---

## 6. CI/CD Pipeline

### On feature branch push

```yaml
ci.yml:
  - tests (all 654+)
  - contract tests (all connectors)
  - docker build
  - push :dev + :dev-YYYY.MM.N + :sha-xxx to GHCR
```

### On merge to main

```yaml
release.yml:
  - tests (all)
  - contract tests (all connectors)
  - breaking change detection (OpenAPI diff, schema diff)
  - docker build
  - push :stable + :stable-YYYY.MM.N + :sha-xxx to GHCR
  - trigger smoke test on canary

smoke-test.yml (triggered):
  - deploy to canary VM
  - run smoke test sequence
  - on failure: rollback canary, tag build as deprecated, create alert
```

### On PR

```yaml
pr-check.yml:
  - tests
  - contract tests
  - breaking change detection
  - label PR "BREAKING" if detected
  - require 2 reviewers if breaking
```

---

## 7. Infrastructure (Cloud-Agnostic)

### Primary: Docker Compose

Works everywhere Docker runs. This is the default and only required deployment method.

```bash
git clone https://github.com/keboola/agnes-the-ai-analyst.git
cd agnes-the-ai-analyst
docker compose up -d
```

### Optional: Terraform (GCP)

For automated provisioning. Lives in `infra/` with GCS remote state backend.

```bash
cd infra
terraform workspace new customer-name
terraform apply -var-file=instances/customer-name.tfvars
```

Creates VM, installs Docker, clones repo, generates `.env` and `instance.yaml`, starts Docker Compose.

### Optional: Caddy TLS

Production profile adds Caddy reverse proxy with automatic Let's Encrypt:

```bash
DOMAIN=data.customer.com docker compose --profile production up -d
```

### Directory layout on customer server

```
/opt/agnes/                           ← git clone
├── docker-compose.yml                ← official
├── docker-compose.prod.yml           ← GHCR images
├── docker-compose.override.yml       ← customer customizations
├── .env                              ← secrets (gitignored)
├── config/
│   └── instance.yaml                 ← generated by setup wizard
├── custom-connectors/                ← Tier A connectors
│   └── snowflake/
└── Caddyfile                         ← TLS config

/data/                                ← Docker volume (persistent)
├── state/system.duckdb               ← users, registry, sync state
├── analytics/server.duckdb           ← views into extracts
└── extracts/                         ← per-source data
    ├── keboola/extract.duckdb
    ├── bigquery/extract.duckdb
    └── snowflake/extract.duckdb      ← from custom connector
```

---

## 8. AI Agent as Primary Installer

CLAUDE.md and documentation must be optimized for AI agent consumption:

### CLAUDE.md requirements
- Complete extract.duckdb contract with exact SQL for `_meta` and `_remote_attach`
- Step-by-step setup instructions with exact curl commands
- Existing connectors as reference for AI-generated new ones
- Clear error messages explaining what went wrong and how to fix

### API requirements
- All setup operations available as API calls (not just UI)
- Self-describing error messages: `"Missing KEBOOLA_STORAGE_TOKEN. Set it in .env or pass via /api/admin/configure"`
- `/api/health` returns structured diagnostics AI agent can parse
- `/api/admin/configure` accepts data source config without file editing
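
One way to produce such self-describing errors is to validate the configuration against a table of required keys. The `REQUIRED_KEYS` mapping below is hypothetical and lists only the Keboola key named in this spec; the function name `config_errors` is also an assumption.

```python
# Hypothetical mapping; only the Keboola key is taken from this spec.
REQUIRED_KEYS = {"keboola": ["KEBOOLA_STORAGE_TOKEN"]}


def config_errors(data_source: str, settings: dict) -> list[str]:
    """Return actionable messages for every problem found in the configuration."""
    if data_source not in REQUIRED_KEYS:
        known = ", ".join(sorted(REQUIRED_KEYS))
        return [f"Unknown data source {data_source!r}. Choose one of: {known}"]
    return [
        f"Missing {key}. Set it in .env or pass via /api/admin/configure"
        for key in REQUIRED_KEYS[data_source]
        if not settings.get(key)
    ]


print(config_errors("keboola", {}))
# ['Missing KEBOOLA_STORAGE_TOKEN. Set it in .env or pass via /api/admin/configure']
```

Each message names the missing item and the two ways to supply it, which is the part an AI agent can act on without reading source code.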

### Documentation requirements
- Machine-readable (no screenshots, no "click here")
- Every manual step has an equivalent API/CLI command
- QUICKSTART.md optimized for copy-paste by AI agent

---

## 9. What Needs to Be Built

### Must have (blocks multi-instance)

| # | What | Effort |
|---|------|--------|
| 1 | CalVer auto-tagging in CI (release.yml) | 1 day |
| 2 | Smoke test script + CI workflow | 1 day |
| 3 | Breaking change detection in CI (OpenAPI diff, contract diff) | 2 days |
| 4 | `/setup` wizard (web) + `/api/admin/configure` (headless) | 3 days |
| 5 | Auto-generate JWT_SECRET_KEY on first start | 0.5 day |
| 6 | Auto-discovery for Keboola tables on first sync | 1 day |
| 7 | Custom connector mount support in orchestrator | 1 day |
| 8 | `CHANGELOG.md` + release notes template | 0.5 day |
| 9 | Health endpoint version + channel info | 0.5 day |

### Should have (improves experience)

| # | What | Effort |
|---|------|--------|
| 10 | Deprecated version warning in health endpoint | 0.5 day |
| 11 | `/api/admin/discover-and-register` auto-discovery endpoint | 1 day |
| 12 | Standalone container connector example (Tier B) | 0.5 day |
| 13 | CLAUDE.md optimization for AI agent setup | 1 day |
| 14 | Terraform module refactor for multi-workspace | 1 day |

### Nice to have (future)

| # | What |
|---|------|
| 15 | Community connector contribution guide |
| 16 | Instance health dashboard (central monitoring) |
| 17 | Automated backup (GCP disk snapshots) |
| 18 | Usage analytics (opt-in telemetry) |

---

## Non-Goals

- Multi-tenancy in single process (each customer = separate instance)
- Kubernetes/Helm (Docker Compose is sufficient for target scale)
- Paid tier / license keys (open-source, monetization TBD)
- GUI for connector development (AI agent + CLAUDE.md is sufficient)