Multi-Instance Deployment & Versioning — Design Spec
Goal
Make Agnes deployable to 20+ independent customer instances via self-service, with safe versioning that prevents one customer's PR from breaking another's deployment.
Context
Agnes is an open-source AI Data Analyst platform. Customers (or their AI agents) deploy it as a Docker image on their own infrastructure. Each instance connects to different data sources (Keboola, BigQuery, Jira, custom).
Key constraints:
- Customers range from semi-technical to non-technical, assisted by AI agents
- Cloud-agnostic (GCP, AWS, Azure, on-prem, VPS)
- One repo, one Docker image, many instances
- Community PRs must not break existing customers
- AI agent is the primary "installer" and "developer"
1. Versioning & Release Channels
CalVer: YYYY.MM.N
Format: year.month.sequential-number. Example: 2026.04.1, 2026.04.2, 2026.05.1.
No manual release decisions. Every merge to main is a release.
Three channels
| Channel | Floating tag | Versioned tag | Source | Who uses it |
|---|---|---|---|---|
| dev | :dev | :dev-2026.04.N | Every CI-passing push on any feature branch | Developers, PR testing |
| stable | :stable | :stable-2026.04.N | Every merge to main + CI pass | All production customers |
| deprecated | (none) | :deprecated-2026.04.N | Previous stable after breaking change or failed smoke test | Grace period (30 days) |
Every image also gets a :sha-abc1234 tag for exact commit traceability.
Tag lifecycle
feature branch push → CI ✅ → :dev + :dev-2026.04.N + :sha-abc1234
                         ❌ → nothing pushed
merge to main       → CI ✅ → :stable + :stable-2026.04.N + :sha-abc1234
                         ❌ → merge blocked (CI required)
                          │
                          ▼
                    smoke test on canary VM
                          │
                         ✅ → :stable confirmed
                         ❌ → alert, rollback canary to previous :stable
                              broken build tagged :deprecated-2026.04.N
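The tag lifecycle above can be sketched as a small function that decides which tags a build receives. This is only an illustration: the function name `tags_for_build` and its parameters are hypothetical, and the real CI workflow may compute tags differently.

```python
def tags_for_build(channel: str, version: str, sha: str, ci_passed: bool) -> list[str]:
    """Return the image tags a build would receive (hypothetical helper).

    channel: "dev" for feature-branch pushes, "stable" for merges to main.
    version: CalVer string such as "2026.04.3".
    sha:     full or short commit hash; only the first 7 characters are used.
    """
    if channel not in ("dev", "stable"):
        raise ValueError(f"builds are only published to dev or stable, got {channel!r}")
    if not ci_passed:
        # A failing CI run pushes nothing (and blocks the merge on main).
        return []
    return [channel, f"{channel}-{version}", f"sha-{sha[:7]}"]


print(tags_for_build("stable", "2026.04.3", "abc1234def5678", ci_passed=True))
# ['stable', 'stable-2026.04.3', 'sha-abc1234']
```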
Version numbering
CalVer YYYY.MM.N, where N is a single auto-incrementing counter that restarts each month and is shared by the dev and stable channels.
Example timeline:
Apr 8 feature/foo push → :dev-2026.04.1
Apr 8 feature/bar push → :dev-2026.04.2
Apr 8 merge foo to main → :stable-2026.04.3
Apr 9 feature/baz push → :dev-2026.04.4
Apr 9 merge bar to main → :stable-2026.04.5
This avoids confusion — version 2026.04.3 exists only once, in one channel.
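A minimal sketch of how the shared counter could be derived from the tags already in the registry. It assumes the tag list is available to CI as plain strings; `next_version` and `TAG_RE` are hypothetical names, not part of the existing codebase.

```python
import re
from datetime import date

# Matches versioned tags in any channel, e.g. "dev-2026.04.2" or "stable-2026.04.3".
TAG_RE = re.compile(r"^(?:dev|stable|deprecated)-(\d{4})\.(\d{2})\.(\d+)$")


def next_version(existing_tags: list[str], today: date) -> str:
    """Return the next YYYY.MM.N, counting across all channels for the current month."""
    highest = 0
    for tag in existing_tags:
        match = TAG_RE.match(tag)
        if match and (int(match.group(1)), int(match.group(2))) == (today.year, today.month):
            highest = max(highest, int(match.group(3)))
    return f"{today.year:04d}.{today.month:02d}.{highest + 1}"


tags = ["dev-2026.04.1", "dev-2026.04.2", "stable-2026.04.3", "dev", "sha-abc1234"]
print(next_version(tags, date(2026, 4, 9)))  # 2026.04.4
print(next_version(tags, date(2026, 5, 1)))  # 2026.05.1
```

Two CI runs that read the tag list at the same time could compute the same N, so a real implementation would likely need some form of serialization (for example, a concurrency group in the workflow). That choice is outside this sketch.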
Customer pins version
# docker-compose.prod.yml
# Auto-update (recommended): always latest stable
image: ghcr.io/keboola/agnes-the-ai-analyst:stable
# Pinned: specific stable release, manual update
image: ghcr.io/keboola/agnes-the-ai-analyst:stable-2026.04.3
# Testing: latest dev
image: ghcr.io/keboola/agnes-the-ai-analyst:dev
# Testing: specific dev build
image: ghcr.io/keboola/agnes-the-ai-analyst:dev-2026.04.2
Main = stable
- main branch is always releasable
- Every merge to main triggers a new stable release
- Feature branches are the dev channel
- No promotion pipeline, no manual approval for releases
- Smoke test is a post-deploy safety net, not a gate
2. Breaking Change Detection
What is a breaking change
- _meta table schema change (add/remove column)
- _remote_attach table schema change
- API endpoint removed or response field removed
- DuckDB system schema migration that drops data
- CLI command removed or argument renamed
- instance.yaml required key added
Automated detection in CI
Every PR runs:
- Contract tests: _meta and _remote_attach schema validation against frozen spec
- OpenAPI diff: Compare PR's openapi.json against main's. Flag removed endpoints/fields.
- DuckDB schema diff: Compare table definitions in system.duckdb
- Config diff: Compare instance.yaml.example required keys
- Full connector matrix: ALL connectors tested, not just changed ones
If breaking change detected:
- PR gets BREAKING label automatically
- Requires 2 reviewers (elevated review)
- Commit message must have BREAKING: prefix
- CHANGELOG.md entry with migration guide required
- On merge: previous stable tagged as :deprecated-YYYY.MM.N
Deprecated channel
When a breaking change merges:
- Previous stable image retagged to :deprecated-2026.04.N
- New build becomes :stable + :stable-2026.04.(N+1)
- Health endpoint on deprecated version shows warning: {"warnings": ["Running deprecated version 2026.04.3. Update to stable."]}
- Deprecated images removed from GHCR after 30 days
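The health warning and the 30-day retention rule can be expressed as two small checks. This is a sketch under assumptions: how a running instance learns which versions are deprecated is not specified here, so the set is simply passed in, and "after 30 days" is read as "30 or more days since deprecation". Both function names are hypothetical.

```python
from datetime import date, timedelta

GRACE_PERIOD = timedelta(days=30)


def health_warnings(running_version: str, deprecated_versions: set[str]) -> dict:
    """Build the warnings part of the health payload for the running version."""
    warnings = []
    if running_version in deprecated_versions:
        warnings.append(f"Running deprecated version {running_version}. Update to stable.")
    return {"warnings": warnings}


def is_due_for_removal(deprecated_on: date, today: date) -> bool:
    """True once the grace period has elapsed (assumed: 30 or more days)."""
    return today - deprecated_on >= GRACE_PERIOD


print(health_warnings("2026.04.3", {"2026.04.3"}))
# {'warnings': ['Running deprecated version 2026.04.3. Update to stable.']}
```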
3. Smoke Test (Post-Deploy Safety Net)
What it tests
Automated sequence run on canary VM after every :stable deploy:
1. GET /api/health → status != "unhealthy"
2. POST /auth/token → 200 (valid credentials)
3. GET /api/catalog/tables → count > 0
4. POST /api/query {sql: "SELECT 1"} → 200 + rows
5. POST /api/sync/trigger → 200
6. (wait 30s)
7. GET /api/health → check no new errors
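The seven steps above can be sketched as one function with the HTTP layer injected, which keeps the sequence testable without a live canary. The response shapes are assumptions, not the documented API: `access_token` in the auth response, `count` in the catalog response, `rows` in the query response, and an `errors` list in the health response are all hypothetical field names, as is the email/password credential payload.

```python
from typing import Callable, Optional

# (method, path, json_body, bearer_token) -> (status_code, parsed_json_body)
HttpCall = Callable[[str, str, Optional[dict], Optional[str]], tuple[int, dict]]


def run_smoke_test(call: HttpCall, credentials: dict, wait: Callable[[float], None]) -> list[str]:
    """Run the smoke sequence; return failure messages (an empty list means pass)."""
    failures: list[str] = []

    _, health = call("GET", "/api/health", None, None)
    if health.get("status") == "unhealthy":
        failures.append("step 1: health is unhealthy")
    errors_before = len(health.get("errors", []))  # assumed field

    code, auth = call("POST", "/auth/token", credentials, None)
    if code != 200:
        failures.append(f"step 2: auth returned {code}")
        return failures  # the remaining steps need a token
    token = auth.get("access_token")  # assumed field

    _, tables = call("GET", "/api/catalog/tables", None, token)
    if tables.get("count", 0) <= 0:  # assumed field
        failures.append("step 3: catalog is empty")

    code, result = call("POST", "/api/query", {"sql": "SELECT 1"}, token)
    if code != 200 or not result.get("rows"):
        failures.append("step 4: query failed")

    code, _ = call("POST", "/api/sync/trigger", None, token)
    if code != 200:
        failures.append(f"step 5: sync trigger returned {code}")

    wait(30)  # step 6

    _, health = call("GET", "/api/health", None, None)
    if len(health.get("errors", [])) > errors_before:
        failures.append("step 7: new errors after sync")
    return failures


def make_fake_call(table_count: int) -> HttpCall:
    """Canned responses standing in for a real canary (illustration only)."""
    responses = {
        ("GET", "/api/health"): (200, {"status": "healthy", "errors": []}),
        ("POST", "/auth/token"): (200, {"access_token": "t"}),
        ("GET", "/api/catalog/tables"): (200, {"count": table_count}),
        ("POST", "/api/query"): (200, {"rows": [[1]]}),
        ("POST", "/api/sync/trigger"): (200, {}),
    }

    def call(method, path, body, token):
        return responses[(method, path)]

    return call


creds = {"email": "admin@company.com", "password": "SecurePass123!"}
print(run_smoke_test(make_fake_call(3), creds, lambda seconds: None))  # []
print(run_smoke_test(make_fake_call(0), creds, lambda seconds: None))  # ['step 3: catalog is empty']
```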
On failure
- Alert (GitHub issue + optional webhook)
- Canary VM rolled back to previous stable: docker compose pull && docker compose up -d with previous tag
- Failed build tagged :deprecated-YYYY.MM.N
- :stable tag reverted to previous good build
Implementation
GitHub Actions workflow triggered after the build-and-push workflow completes:
smoke-test:
  needs: build-and-push
  runs-on: ubuntu-latest
  steps:
    - name: Deploy to canary
      run: |
        gcloud compute ssh canary-vm --command="
          cd /opt/agnes &&
          docker compose pull &&
          docker compose up -d"
    - name: Wait for healthy
      run: |
        # Poll for up to 5 minutes. An unreachable endpoint (empty STATUS)
        # counts as "not ready" instead of ending the loop early.
        for i in $(seq 1 30); do
          STATUS=$(curl -sf canary:8000/api/health | jq -r .status || true)
          if [ -n "$STATUS" ] && [ "$STATUS" != "null" ] && [ "$STATUS" != "unhealthy" ]; then
            exit 0
          fi
          sleep 10
        done
        echo "canary did not become healthy in time" >&2
        exit 1
    - name: Run smoke tests
      run: |
        # auth, catalog, query, sync checks
        ./scripts/smoke-test.sh canary:8000
    - name: Rollback on failure
      if: failure()
      run: |
        # retag and rollback
4. Self-Service Deployment
Target experience
Customer (or their AI agent) goes from zero to running instance:
# 1. Get the code
git clone https://github.com/keboola/agnes-the-ai-analyst.git
cd agnes-the-ai-analyst
# 2. Start it
docker compose up -d
# 3. Open browser or use API
# First visit: /setup wizard (no users exist)
# Or headless: curl -X POST localhost:8000/auth/bootstrap ...
Two setup modes
A) Interactive (browser):
- First visit when no users exist → redirected to /setup
- Step 1: Create admin account (email + password)
- Step 2: Choose data source (Keboola / BigQuery / CSV / Custom)
- Step 3: Enter credentials (token, URL)
- Step 4: Auto-discover and register tables
- Step 5: Trigger first sync
- Done → redirect to dashboard
B) Headless (AI agent / CLI):
# Bootstrap admin
curl -X POST http://localhost:8000/auth/bootstrap \
  -H "Content-Type: application/json" \
  -d '{"email":"admin@company.com","password":"SecurePass123!"}'
# The calls below expect $TOKEN to hold a bearer token for the admin account
# (for example, one obtained via POST /auth/token).
# Configure data source
curl -X POST http://localhost:8000/api/admin/configure \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"data_source":"keboola","keboola_token":"...","keboola_url":"..."}'
# Discover and register tables
curl -X POST http://localhost:8000/api/admin/discover-and-register \
  -H "Authorization: Bearer $TOKEN"
# Trigger first sync
curl -X POST http://localhost:8000/api/sync/trigger \
  -H "Authorization: Bearer $TOKEN"
Both modes lead to the same result. AI agents use the headless mode.
Auto-configuration
On first docker compose up with no .env:
- JWT_SECRET_KEY auto-generated and persisted to /data/state/.jwt_secret
- SESSION_SECRET auto-generated similarly
- App starts in "setup mode": only /setup, /auth/bootstrap, and /api/health accessible
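A minimal sketch of the generate-once-and-persist behaviour for the JWT secret, assuming the state directory is writable on first start. The function name `load_or_create_secret` is hypothetical, and the 0600 file mode and 48-byte length are choices made for this example, not requirements stated in this spec.

```python
import os
import secrets
from pathlib import Path


def load_or_create_secret(path: Path) -> str:
    """Return the secret stored at path, generating and persisting one if absent."""
    if path.exists():
        return path.read_text().strip()
    path.parent.mkdir(parents=True, exist_ok=True)
    secret = secrets.token_urlsafe(48)
    # O_EXCL makes creation fail instead of overwriting a file that appeared meanwhile;
    # mode 0o600 keeps the secret readable by the owning user only.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(secret)
    return secret
```

Calling it with `Path("/data/state/.jwt_secret")` on every start would return the same value across restarts, which is what keeps previously issued tokens valid after an image update.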
On first docker compose up with .env containing KEBOOLA_STORAGE_TOKEN:
- Auto-discovers tables from Keboola on first sync
- Skips manual table registration step
What customer must provide
| Required | Optional |
|---|---|
| Server with Docker | Custom domain + TLS |
| Admin email + password | Google OAuth credentials |
| Data source credentials (Keboola token OR BigQuery creds OR CSV files) | Telegram bot token |
| | Jira webhook secret |
What customer must NOT do
- Edit YAML manually (setup wizard generates instance.yaml)
- Generate JWT secret (auto-generated)
- Register tables manually (auto-discovery)
- Understand DuckDB internals
5. Custom Connectors (Three Tiers)
All tiers produce the same output: extract.duckdb with _meta table + data/*.parquet. Orchestrator treats them identically.
Tier A: Local mount (fastest, AI-generated)
Customer's AI agent generates a connector. Lives outside Docker image, survives updates.
/opt/agnes/
├── docker-compose.yml            ← official image
├── docker-compose.override.yml   ← customer additions
└── custom-connectors/
    └── snowflake/
        ├── extractor.py
        └── requirements.txt

# docker-compose.override.yml
services:
  app:
    volumes:
      - ./custom-connectors:/app/connectors/custom:ro
Orchestrator scans connectors/custom/*/ in addition to built-in connectors.
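The scan could be sketched as follows. It assumes a connector is any directory containing an `extractor.py`, which matches the layout shown for Tier A but is not confirmed as the orchestrator's actual rule. The function name and the choice to reject name clashes are hypothetical.

```python
from pathlib import Path


def discover_connectors(builtin_root: Path, custom_root: Path) -> dict[str, Path]:
    """Map connector name to its extractor.py, scanning built-in then custom directories."""
    found: dict[str, Path] = {}
    for root in (builtin_root, custom_root):
        if not root.is_dir():
            continue  # e.g. no custom connectors mounted
        for child in sorted(root.iterdir()):
            extractor = child / "extractor.py"
            if not (child.is_dir() and extractor.is_file()):
                continue
            if child.name in found:
                # Assumed policy: a custom connector may not shadow a built-in one.
                raise ValueError(f"duplicate connector name: {child.name}")
            found[child.name] = extractor
    return found
```

With the mount from the override file, `discover_connectors(Path("/app/connectors"), Path("/app/connectors/custom"))` would return the built-in connectors plus `snowflake`. The `custom` directory itself is skipped during the built-in pass because it has no `extractor.py` of its own.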
How the AI agent creates one:
- Reads CLAUDE.md → understands extract.duckdb contract
- Reads existing connector as reference (e.g.,
connectors/keboola/extractor.py) - Generates
custom-connectors/snowflake/extractor.py - Runs contract test to validate output
- Done — orchestrator picks it up on next rebuild
Requirements for this to work:
- CLAUDE.md must describe the contract completely and unambiguously
- Contract test must be runnable standalone
- Existing connectors must be readable as examples
- Clear error messages when contract doesn't match
Tier B: Standalone container (complex dependencies)
For connectors needing their own runtime (Java, .NET, heavy Python packages).
# docker-compose.override.yml
services:
  connector-sap:
    build: ./custom-connectors/sap
    volumes:
      - data:/data
    environment:
      - DATA_DIR=/data
      - SAP_HOST=...
    profiles:
      - extract
Connector is its own Docker image. Writes to /data/extracts/sap/extract.duckdb. Orchestrator finds it automatically.
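Automatic discovery of extracts, regardless of which tier produced them, might be as simple as a glob over the shared volume. This sketch assumes the `extracts/<source>/extract.duckdb` layout used elsewhere in this spec; `find_extracts` is a hypothetical name.

```python
from pathlib import Path


def find_extracts(data_dir: Path) -> dict[str, Path]:
    """Map source name to its extract.duckdb under <data_dir>/extracts/."""
    pattern = "*/extract.duckdb"
    return {path.parent.name: path for path in sorted((data_dir / "extracts").glob(pattern))}
```

For the Tier B example, `find_extracts(Path("/data"))` would include an entry `"sap"` once the connector container has written its file.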
Tier C: Community PR (shared with all)
Connector contributed to main repo via PR. After merge, available in official image for all customers.
connectors/
├── keboola/ ← built-in
├── bigquery/ ← built-in
├── jira/ ← built-in
└── snowflake/ ← community contributed
PR requirements:
- Must pass contract tests
- Must include tests
- Must not modify shared code (orchestrator, API, auth)
- CI runs full connector matrix
6. CI/CD Pipeline
On feature branch push
ci.yml:
  - tests (all 654+)
  - contract tests (all connectors)
  - docker build
  - push :dev + :dev-YYYY.MM.N + :sha-xxx to GHCR
On merge to main
release.yml:
  - tests (all)
  - contract tests (all connectors)
  - breaking change detection (OpenAPI diff, schema diff)
  - docker build
  - push :stable + :stable-YYYY.MM.N + :sha-xxx to GHCR
  - trigger smoke test on canary

smoke-test.yml (triggered):
  - deploy to canary VM
  - run smoke test sequence
  - on failure: rollback canary, tag build as deprecated, create alert
On PR
pr-check.yml:
  - tests
  - contract tests
  - breaking change detection
  - label PR: "BREAKING" if detected
  - require 2 reviewers if breaking
7. Infrastructure (Cloud-Agnostic)
Primary: Docker Compose
Works everywhere Docker runs. This is the default and only required deployment method.
git clone https://github.com/keboola/agnes-the-ai-analyst.git
cd agnes-the-ai-analyst
docker compose up -d
Optional: Terraform (GCP)
For automated provisioning. Lives in infra/ with GCS remote state backend.
cd infra
terraform workspace new customer-name
terraform apply -var-file=instances/customer-name.tfvars
Creates VM, installs Docker, clones repo, generates .env and instance.yaml, starts Docker Compose.
Optional: Caddy TLS
Production profile adds Caddy reverse proxy with automatic Let's Encrypt:
DOMAIN=data.customer.com docker compose --profile production up -d
Directory layout on customer server
/opt/agnes/                           ← git clone
├── docker-compose.yml                ← official
├── docker-compose.prod.yml           ← GHCR images
├── docker-compose.override.yml       ← customer customizations
├── .env                              ← secrets (gitignored)
├── config/
│   └── instance.yaml                 ← generated by setup wizard
├── custom-connectors/                ← Tier A connectors
│   └── snowflake/
└── Caddyfile                         ← TLS config

/data/                                ← Docker volume (persistent)
├── state/system.duckdb               ← users, registry, sync state
├── analytics/server.duckdb           ← views into extracts
└── extracts/                         ← per-source data
    ├── keboola/extract.duckdb
    ├── bigquery/extract.duckdb
    └── snowflake/extract.duckdb      ← from custom connector
8. AI Agent as Primary Installer
CLAUDE.md and documentation must be optimized for AI agent consumption:
CLAUDE.md requirements
- Complete extract.duckdb contract with exact SQL for _meta and _remote_attach
- Step-by-step setup instructions with exact curl commands
- Existing connectors as reference for AI-generated new ones
- Clear error messages explaining what went wrong and how to fix
API requirements
- All setup operations available as API calls (not just UI)
- Self-describing error messages: "Missing KEBOOLA_STORAGE_TOKEN. Set it in .env or pass via /api/admin/configure"
- /api/health returns structured diagnostics AI agent can parse
- /api/admin/configure accepts data source config without file editing
Documentation requirements
- Machine-readable (no screenshots, no "click here")
- Every manual step has an equivalent API/CLI command
- QUICKSTART.md optimized for copy-paste by AI agent
9. What Needs to Be Built
Must have (blocks multi-instance)
| # | What | Effort |
|---|---|---|
| 1 | CalVer auto-tagging in CI (release.yml) | 1 day |
| 2 | Smoke test script + CI workflow | 1 day |
| 3 | Breaking change detection in CI (OpenAPI diff, contract diff) | 2 days |
| 4 | /setup wizard (web) + /api/admin/configure (headless) | 3 days |
| 5 | Auto-generate JWT_SECRET_KEY on first start | 0.5 day |
| 6 | Auto-discovery for Keboola tables on first sync | 1 day |
| 7 | Custom connector mount support in orchestrator | 1 day |
| 8 | CHANGELOG.md + release notes template | 0.5 day |
| 9 | Health endpoint version + channel info | 0.5 day |
Should have (improves experience)
| # | What | Effort |
|---|---|---|
| 10 | Deprecated version warning in health endpoint | 0.5 day |
| 11 | /api/admin/discover-and-register auto-discovery endpoint | 1 day |
| 12 | Standalone container connector example (Tier B) | 0.5 day |
| 13 | CLAUDE.md optimization for AI agent setup | 1 day |
| 14 | Terraform module refactor for multi-workspace | 1 day |
Nice to have (future)
| # | What |
|---|---|
| 15 | Community connector contribution guide |
| 16 | Instance health dashboard (central monitoring) |
| 17 | Automated backup (GCP disk snapshots) |
| 18 | Usage analytics (opt-in telemetry) |
Non-Goals
- Multi-tenancy in single process (each customer = separate instance)
- Kubernetes/Helm (Docker Compose is sufficient for target scale)
- Paid tier / license keys (open-source, monetization TBD)
- GUI for connector development (AI agent + CLAUDE.md is sufficient)