feat(tls): corporate-CA HTTPS with URL-driven rotation, on-VM CSR gen, self-signed fallback (#51)

Replaces the implicit Let's Encrypt flow with a general corporate-CA HTTPS path:

- Caddy switches to cert-file mode (`tls /certs/fullchain.pem /certs/privkey.pem`) with HSTS + TLS 1.2/1.3 floor
- New `docker-compose.tls.yml` overlay closes host `:8000` when Caddy fronts (no TLS bypass)
- New `scripts/tls-fetch.sh` — generic URL fetcher for `sm://`, `gs://`, `https://`, `file://` with redirect refusal + PEM validation
- New `scripts/grpn/agnes-tls-rotate.sh` — daily rotation, self-signed fallback against same key (zero key churn), on-VM RSA-2048 + CSR auto-gen, atomic swap, SIGUSR1 reload
- `scripts/grpn/agnes-auto-upgrade.sh` becomes cert-aware (auto-enables tls overlay when certs present)
- Compose profile `production` renamed to `tls` (aligns with DEPLOYMENT.md and infra startup)

Pairs with FoundryAI/agnes-the-ai-analyst-infra#27 (merged) which wires per-VM `local.vm_tls`, writes `TLS_*` env vars into `.env`, auto-creates Secret Manager containers for `sm://` privkey URLs, and installs `agnes-tls-rotate.{service,timer}` for daily polling.

Includes hardening + docs follow-ups from code review:
- `TLS_CSR_SUBJECT` env-var parametrisation applied to both CSR and self-signed cert paths
- curl `--max-redirs 0 --proto '=https'` + post-fetch PEM validation in `tls-fetch.sh`
- `ulimit -c 0` + array-form `COMPOSE_FILES` (POSIX-safe, bash 3.2 compatible)
- TLS section added to `config/.env.template`
- Historical-note headers in `docs/superpowers/{plans,specs}/2026-04-09-*.md` flagging the profile rename

2026-04-25 19:51:25 +00:00

16 KiB


Multi-Instance Deployment & Versioning — Design Spec

Historical note (2026-04-24): This spec is a snapshot from 2026-04-09. Some operational details have evolved since — most notably, the Caddy production profile referenced in command examples below was renamed to tls (see PR #51). For the current deployment commands, follow docs/DEPLOYMENT.md. This file is preserved as design history.

Goal

Make Agnes deployable to 20+ independent customer instances via self-service, with safe versioning that prevents one customer's PR from breaking another's deployment.

Context

Agnes is an open-source AI Data Analyst platform. Customers (or their AI agents) deploy it as a Docker image on their own infrastructure. Each instance connects to different data sources (Keboola, BigQuery, Jira, custom).

Key constraints:

Customers range from semi-technical to non-technical, assisted by AI agents
Cloud-agnostic (GCP, AWS, Azure, on-prem, VPS)
One repo, one Docker image, many instances
Community PRs must not break existing customers
AI agent is the primary "installer" and "developer"

1. Versioning & Release Channels

CalVer: `YYYY.MM.N`

Format: year.month.sequential-number. Example: 2026.04.1, 2026.04.2, 2026.05.1.

No manual release decisions. Every merge to main is a release.

Three channels

Channel	Floating tag	Versioned tag	Source	Who uses it
dev	`:dev`	`:dev-2026.04.N`	Every CI-passing push on any feature branch	Developers, PR testing
stable	`:stable`	`:stable-2026.04.N`	Every merge to main + CI pass	All production customers
deprecated	—	`:deprecated-2026.04.N`	Previous stable after breaking change or failed smoke test	Grace period (30 days)

Every image also gets a :sha-abc1234 tag for exact commit traceability.

Tag lifecycle

feature branch push → CI ✅ → :dev + :dev-2026.04.N + :sha-abc1234
                         ❌ → nothing pushed

merge to main       → CI ✅ → :stable + :stable-2026.04.N + :sha-abc1234
                         ❌ → merge blocked (CI required)
                                │
                                ▼
                         smoke test on canary VM
                                │
                         ✅ → :stable confirmed
                         ❌ → alert, rollback canary to previous :stable
                              broken build tagged :deprecated-2026.04.N

Version numbering

CalVer YYYY.MM.N, where N is a single auto-incrementing counter per month shared by the dev and stable channels.

Example timeline:

Apr 8  feature/foo push     → :dev-2026.04.1
Apr 8  feature/bar push     → :dev-2026.04.2
Apr 8  merge foo to main    → :stable-2026.04.3
Apr 9  feature/baz push     → :dev-2026.04.4
Apr 9  merge bar to main    → :stable-2026.04.5

This avoids confusion — version 2026.04.3 exists only once, in one channel.
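
The numbering rule above can be sketched in a few lines. This is only an illustration, not the project's CI code: the function name `next_calver` and the idea of deriving the counter from the list of existing registry tags are assumptions.

```python
import re
from datetime import date

# Versioned tags look like dev-2026.04.1, stable-2026.04.3 or deprecated-2026.04.3.
TAG_RE = re.compile(r"^(?:dev|stable|deprecated)-(\d{4})\.(\d{2})\.(\d+)$")


def next_calver(existing_tags, today):
    """Return the next YYYY.MM.N version for the month of `today`.

    N is one more than the highest counter already used this month in any
    channel, so a version number is never reused across channels.
    """
    highest = 0
    for tag in existing_tags:
        match = TAG_RE.match(tag)
        if not match:
            continue  # floating tags (dev, stable) and sha-* tags carry no counter
        year, month, n = (int(group) for group in match.groups())
        if (year, month) == (today.year, today.month):
            highest = max(highest, n)
    return f"{today.year:04d}.{today.month:02d}.{highest + 1}"


tags = ["dev", "stable", "dev-2026.04.1", "dev-2026.04.2", "stable-2026.04.3", "sha-abc1234"]
print(next_calver(tags, date(2026, 4, 9)))  # 2026.04.4
print(next_calver(tags, date(2026, 5, 1)))  # 2026.05.1
```

The first call reproduces the Apr 9 step of the timeline; the second shows the counter restarting in a new month.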

Customer pins version

# docker-compose.prod.yml

# Auto-update (recommended): always latest stable
image: ghcr.io/keboola/agnes-the-ai-analyst:stable

# Pinned: specific stable release, manual update
image: ghcr.io/keboola/agnes-the-ai-analyst:stable-2026.04.3

# Testing: latest dev
image: ghcr.io/keboola/agnes-the-ai-analyst:dev

# Testing: specific dev build
image: ghcr.io/keboola/agnes-the-ai-analyst:dev-2026.04.2

Main = stable

main branch is always releasable
Every merge to main triggers a new stable release
Feature branches are the dev channel
No promotion pipeline, no manual approval for releases
Smoke test is a post-deploy safety net, not a gate

2. Breaking Change Detection

What is a breaking change

_meta table schema change (add/remove column)
_remote_attach table schema change
API endpoint removed or response field removed
DuckDB system schema migration that drops data
CLI command removed or argument renamed
instance.yaml required key added

Automated detection in CI

Every PR runs:

Contract tests: _meta and _remote_attach schema validation against frozen spec
OpenAPI diff: Compare PR's openapi.json against main's. Flag removed endpoints/fields.
DuckDB schema diff: Compare table definitions in system.duckdb
Config diff: Compare instance.yaml.example required keys
Full connector matrix: ALL connectors tested, not just changed ones
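
As a rough illustration of the OpenAPI diff, the sketch below lists operations that exist on main but are missing from the PR. It assumes both specs are already loaded as dictionaries; the function name `removed_operations` is hypothetical, and detecting removed response fields would need a deeper schema comparison that is not shown here.

```python
# Operation keys allowed in an OpenAPI Path Item Object.
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def removed_operations(base_spec, pr_spec):
    """List "METHOD /path" operations present in base_spec but missing in pr_spec."""
    removed = []
    pr_paths = pr_spec.get("paths", {})
    for path, item in base_spec.get("paths", {}).items():
        for method in item:
            if method not in HTTP_METHODS:
                continue  # skip non-operation keys such as "parameters"
            if method not in pr_paths.get(path, {}):
                removed.append(f"{method.upper()} {path}")
    return sorted(removed)


base = {"paths": {"/api/health": {"get": {}}, "/api/query": {"post": {}}}}
pr = {"paths": {"/api/health": {"get": {}}}}
print(removed_operations(base, pr))  # ['POST /api/query']
```

A non-empty result would be one signal for applying the BREAKING label.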

If breaking change detected:

PR gets BREAKING label automatically
Requires 2 reviewers (elevated review)
Commit message must have BREAKING: prefix
CHANGELOG.md entry with migration guide required
On merge: previous stable tagged as :deprecated-YYYY.MM.N

Deprecated channel

When a breaking change merges:

Previous stable image retagged to :deprecated-2026.04.N
New build becomes :stable + :stable-2026.04.(N+1)

Health endpoint on deprecated version shows warning:

{"warnings": ["Running deprecated version 2026.04.3. Update to stable."]}

Deprecated images removed from GHCR after 30 days
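
A minimal sketch of the two deprecated-channel rules, under stated assumptions: the helper names are hypothetical, an empty `warnings` list for non-deprecated versions is assumed, and "after 30 days" is read as "30 or more days since deprecation".

```python
from datetime import date, timedelta

GRACE_PERIOD = timedelta(days=30)


def health_warnings(version, channel):
    """Build the warnings part of the health payload for a running version."""
    if channel != "deprecated":
        return {"warnings": []}  # assumed shape when there is nothing to report
    return {"warnings": [f"Running deprecated version {version}. Update to stable."]}


def is_due_for_removal(deprecated_on, today):
    """True once the grace period since deprecation has elapsed."""
    return today - deprecated_on >= GRACE_PERIOD


print(health_warnings("2026.04.3", "deprecated"))
print(is_due_for_removal(date(2026, 4, 8), date(2026, 5, 8)))  # True
```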

3. Smoke Test (Post-Deploy Safety Net)

What it tests

Automated sequence run on canary VM after every :stable deploy:

1. GET  /api/health                    → status != "unhealthy"
2. POST /auth/token                    → 200 (valid credentials)
3. GET  /api/catalog/tables            → count > 0
4. POST /api/query {sql: "SELECT 1"}   → 200 + rows
5. POST /api/sync/trigger              → 200
6. (wait 30s)
7. GET  /api/health                    → check no new errors
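
The sequence above could be driven by a small runner like the one below. It is a sketch, not the contents of scripts/smoke-test.sh: the transport is injected as a `send(method, path, body)` callable, the response keys `count` and `rows` are assumptions, and step 7 is approximated by re-checking that the status is not "unhealthy".

```python
import time


def run_smoke_test(send, credentials, wait_seconds=30, sleep=time.sleep):
    """Run the seven-step sequence and return the names of failed checks.

    `send(method, path, body)` must return (http_status, parsed_json_dict).
    """
    failures = []

    def check(name, ok):
        if not ok:
            failures.append(name)

    _, health = send("GET", "/api/health", None)
    check("health", health.get("status") != "unhealthy")

    status, _ = send("POST", "/auth/token", credentials)
    check("auth", status == 200)

    _, catalog = send("GET", "/api/catalog/tables", None)
    check("catalog", catalog.get("count", 0) > 0)  # "count" key is assumed

    status, result = send("POST", "/api/query", {"sql": "SELECT 1"})
    check("query", status == 200 and bool(result.get("rows")))  # "rows" key is assumed

    status, _ = send("POST", "/api/sync/trigger", None)
    check("sync", status == 200)

    sleep(wait_seconds)  # give the triggered sync time to surface errors

    _, health = send("GET", "/api/health", None)
    check("health-after-sync", health.get("status") != "unhealthy")
    return failures


def fake_send(method, path, body):
    """Stand-in transport that answers every step successfully."""
    responses = {
        "/api/health": (200, {"status": "healthy"}),
        "/auth/token": (200, {}),
        "/api/catalog/tables": (200, {"count": 3}),
        "/api/query": (200, {"rows": [[1]]}),
        "/api/sync/trigger": (200, {}),
    }
    return responses[path]


creds = {"email": "admin@company.com", "password": "..."}
print(run_smoke_test(fake_send, creds, sleep=lambda seconds: None))  # []
```

Injecting `send` and `sleep` keeps the runner testable without a live canary; a real transport would wrap an HTTP client and attach the bearer token.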

On failure

Alert (GitHub issue + optional webhook)
Canary VM rolled back to previous stable: docker compose pull && docker compose up -d with previous tag
Failed build tagged :deprecated-YYYY.MM.N
:stable tag reverted to previous good build

Implementation

GitHub Actions workflow triggered after the build-and-push workflow completes:

smoke-test:
  needs: build-and-push
  runs-on: ubuntu-latest
  steps:
    - name: Deploy to canary
      run: |
        gcloud compute ssh canary-vm --command="
          cd /opt/agnes &&
          docker compose pull &&
          docker compose up -d"        

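    - name: Wait for healthy
      run: |
        # Poll for up to ~5 minutes; fail the step if the canary never becomes healthy.
        for i in $(seq 1 30); do
          # Empty STATUS means curl could not reach the canary yet.
          STATUS=$(curl -sf canary:8000/api/health | jq -r .status || true)
          if [ -n "$STATUS" ] && [ "$STATUS" != "unhealthy" ]; then exit 0; fi
          sleep 10
        done
        exit 1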

    - name: Run smoke tests
      run: |
        # auth, catalog, query, sync checks
        ./scripts/smoke-test.sh canary:8000        

    - name: Rollback on failure
      if: failure()
      run: |
        # retag and rollback

4. Self-Service Deployment

Target experience

Customer (or their AI agent) goes from zero to running instance:

# 1. Get the code
git clone https://github.com/keboola/agnes-the-ai-analyst.git
cd agnes-the-ai-analyst

# 2. Start it
docker compose up -d

# 3. Open browser or use API
# First visit: /setup wizard (no users exist)
# Or headless: curl -X POST localhost:8000/auth/bootstrap ...

Two setup modes

A) Interactive (browser):

First visit when no users exist → redirected to /setup
Step 1: Create admin account (email + password)
Step 2: Choose data source (Keboola / BigQuery / CSV / Custom)
Step 3: Enter credentials (token, URL)
Step 4: Auto-discover and register tables
Step 5: Trigger first sync
Done → redirect to dashboard

B) Headless (AI agent / CLI):

# Bootstrap admin
curl -X POST http://localhost:8000/auth/bootstrap \
  -H "Content-Type: application/json" \
  -d '{"email":"admin@company.com","password":"SecurePass123!"}'

# Configure data source (TOKEN is assumed to hold a bearer token for the admin account, e.g. from POST /auth/token)
curl -X POST http://localhost:8000/api/admin/configure \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"data_source":"keboola","keboola_token":"...","keboola_url":"..."}'

# Discover and register tables
curl -X POST http://localhost:8000/api/admin/discover-and-register \
  -H "Authorization: Bearer $TOKEN"

# Trigger first sync
curl -X POST http://localhost:8000/api/sync/trigger \
  -H "Authorization: Bearer $TOKEN"

Both modes lead to the same result; AI agents use the headless mode.

Auto-configuration

On first docker compose up with no .env:

JWT_SECRET_KEY auto-generated and persisted to /data/state/.jwt_secret
SESSION_SECRET auto-generated similarly
App starts in "setup mode" — only /setup, /auth/bootstrap, and /api/health accessible
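
One way to implement "auto-generated and persisted" is a load-or-create helper. The sketch below is an assumption about how this might look, not the application's actual code; the function name `load_or_create_secret` and the 48-byte secret length are illustrative choices.

```python
import os
import secrets
import tempfile
from pathlib import Path


def load_or_create_secret(path):
    """Return the secret stored at `path`, generating and persisting one if absent."""
    path = Path(path)
    if path.exists():
        return path.read_text().strip()
    path.parent.mkdir(parents=True, exist_ok=True)
    secret = secrets.token_urlsafe(48)
    # Request owner-only permissions; O_EXCL makes the call fail rather than
    # overwrite the file if another process created it first.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(secret)
    return secret


# Usage: a temporary directory stands in for /data/state here.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "state" / ".jwt_secret"
    first = load_or_create_secret(target)
    second = load_or_create_secret(target)
    print(first == second)  # True
```

Because the secret is read back from disk on later starts, tokens issued before a container restart stay valid.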

On first docker compose up with .env containing KEBOOLA_STORAGE_TOKEN:

Auto-discovers tables from Keboola on first sync
Skips manual table registration step

What customer must provide

Required	Optional
Server with Docker	Custom domain + TLS
Admin email + password	Google OAuth credentials
Data source credentials (Keboola token OR BigQuery creds OR CSV files)	Telegram bot token
	Jira webhook secret

What customer must NOT do

Edit YAML manually (setup wizard generates instance.yaml)
Generate JWT secret (auto-generated)
Register tables manually (auto-discovery)
Understand DuckDB internals

5. Custom Connectors (Three Tiers)

All tiers produce the same output: extract.duckdb with _meta table + data/*.parquet. Orchestrator treats them identically.

Tier A: Local mount (fastest, AI-generated)

Customer's AI agent generates a connector. Lives outside Docker image, survives updates.

/opt/agnes/
├── docker-compose.yml              ← official image
├── docker-compose.override.yml     ← customer additions
└── custom-connectors/
    └── snowflake/
        ├── extractor.py
        └── requirements.txt

# docker-compose.override.yml
services:
  app:
    volumes:
      - ./custom-connectors:/app/connectors/custom:ro

Orchestrator scans connectors/custom/*/ in addition to built-in connectors.
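
The scan could look roughly like the sketch below. It assumes that a directory counts as a connector only when it contains an `extractor.py`, mirroring the layout shown above; the function name `discover_custom_connectors` is hypothetical.

```python
import tempfile
from pathlib import Path


def discover_custom_connectors(root):
    """Return {name: extractor_path} for each <root>/<name>/extractor.py."""
    root = Path(root)
    found = {}
    if not root.is_dir():
        return found  # no custom connectors mounted
    for entry in sorted(root.iterdir()):
        extractor = entry / "extractor.py"
        if entry.is_dir() and extractor.is_file():
            found[entry.name] = extractor
    return found


# Usage: a temporary directory stands in for /app/connectors/custom.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "snowflake").mkdir()
    (Path(tmp) / "snowflake" / "extractor.py").write_text("# connector code\n")
    (Path(tmp) / "notes").mkdir()  # ignored: no extractor.py inside
    print(sorted(discover_custom_connectors(tmp)))  # ['snowflake']
```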

How the AI agent creates one:

Reads CLAUDE.md → understands extract.duckdb contract
Reads existing connector as reference (e.g., connectors/keboola/extractor.py)
Generates custom-connectors/snowflake/extractor.py
Runs contract test to validate output
Done — orchestrator picks it up on next rebuild

Requirements for this to work:

CLAUDE.md must perfectly describe the contract
Contract test must be runnable standalone
Existing connectors must be readable as examples
Clear error messages when contract doesn't match

Tier B: Standalone container (complex dependencies)

For connectors needing their own runtime (Java, .NET, heavy Python packages).

# docker-compose.override.yml
services:
  connector-sap:
    build: ./custom-connectors/sap
    volumes:
      - data:/data
    environment:
      - DATA_DIR=/data
      - SAP_HOST=...
    profiles:
      - extract

Connector is its own Docker image. Writes to /data/extracts/sap/extract.duckdb. Orchestrator finds it automatically.

Tier C: Community PR (shared with all)

Connector contributed to main repo via PR. After merge, available in official image for all customers.

connectors/
├── keboola/          ← built-in
├── bigquery/         ← built-in
├── jira/             ← built-in
└── snowflake/        ← community contributed

PR requirements:

Must pass contract tests
Must include tests
Must not modify shared code (orchestrator, API, auth)
CI runs full connector matrix

6. CI/CD Pipeline

On feature branch push

ci.yml:
  - tests (all 654+)
  - contract tests (all connectors)
  - docker build
  - push :dev + :dev-YYYY.MM.N + :sha-xxx to GHCR

On merge to main

release.yml:
  - tests (all)
  - contract tests (all connectors)
  - breaking change detection (OpenAPI diff, schema diff)
  - docker build
  - push :stable + :stable-YYYY.MM.N + :sha-xxx to GHCR
  - trigger smoke test on canary

smoke-test.yml (triggered):
  - deploy to canary VM
  - run smoke test sequence
  - on failure: rollback canary, tag build as deprecated, create alert

On PR

pr-check.yml:
  - tests
  - contract tests
  - breaking change detection
  - label PR: "BREAKING" if detected
  - require 2 reviewers if breaking

7. Infrastructure (Cloud-Agnostic)

Primary: Docker Compose

Works everywhere Docker runs. This is the default and only required deployment method.

git clone https://github.com/keboola/agnes-the-ai-analyst.git
cd agnes-the-ai-analyst
docker compose up -d

Optional: Terraform (GCP)

For automated provisioning. Lives in infra/ with GCS remote state backend.

cd infra
terraform workspace new customer-name
terraform apply -var-file=instances/customer-name.tfvars

Creates VM, installs Docker, clones repo, generates .env and instance.yaml, starts Docker Compose.

Optional: Caddy TLS

Production profile adds Caddy reverse proxy with automatic Let's Encrypt:

DOMAIN=data.customer.com docker compose --profile production up -d

Directory layout on customer server

/opt/agnes/                           ← git clone
├── docker-compose.yml                ← official
├── docker-compose.prod.yml           ← GHCR images
├── docker-compose.override.yml       ← customer customizations
├── .env                              ← secrets (gitignored)
├── config/
│   └── instance.yaml                 ← generated by setup wizard
├── custom-connectors/                ← Tier A connectors
│   └── snowflake/
└── Caddyfile                         ← TLS config

/data/                                ← Docker volume (persistent)
├── state/system.duckdb               ← users, registry, sync state
├── analytics/server.duckdb           ← views into extracts
└── extracts/                         ← per-source data
    ├── keboola/extract.duckdb
    ├── bigquery/extract.duckdb
    └── snowflake/extract.duckdb      ← from custom connector

8. AI Agent as Primary Installer

CLAUDE.md and documentation must be optimized for AI agent consumption:

CLAUDE.md requirements

Complete extract.duckdb contract with exact SQL for _meta and _remote_attach
Step-by-step setup instructions with exact curl commands
Existing connectors as reference for AI-generated new ones
Clear error messages explaining what went wrong and how to fix

API requirements

All setup operations available as API calls (not just UI)
Self-describing error messages: "Missing KEBOOLA_STORAGE_TOKEN. Set it in .env or pass via /api/admin/configure"
/api/health returns structured diagnostics AI agent can parse
/api/admin/configure accepts data source config without file editing

Documentation requirements

Machine-readable (no screenshots, no "click here")
Every manual step has an equivalent API/CLI command
QUICKSTART.md optimized for copy-paste by AI agent

9. What Needs to Be Built

Must have (blocks multi-instance)

#	What	Effort
1	CalVer auto-tagging in CI (release.yml)	1 day
2	Smoke test script + CI workflow	1 day
3	Breaking change detection in CI (OpenAPI diff, contract diff)	2 days
4	`/setup` wizard (web) + `/api/admin/configure` (headless)	3 days
5	Auto-generate JWT_SECRET_KEY on first start	0.5 day
6	Auto-discovery for Keboola tables on first sync	1 day
7	Custom connector mount support in orchestrator	1 day
8	`CHANGELOG.md` + release notes template	0.5 day
9	Health endpoint version + channel info	0.5 day

Should have (improves experience)

#	What	Effort
10	Deprecated version warning in health endpoint	0.5 day
11	`/api/admin/discover-and-register` auto-discovery endpoint	1 day
12	Standalone container connector example (Tier B)	0.5 day
13	CLAUDE.md optimization for AI agent setup	1 day
14	Terraform module refactor for multi-workspace	1 day

Nice to have (future)

#	What
15	Community connector contribution guide
16	Instance health dashboard (central monitoring)
17	Automated backup (GCP disk snapshots)
18	Usage analytics (opt-in telemetry)

Non-Goals

Multi-tenancy in single process (each customer = separate instance)
Kubernetes/Helm (Docker Compose is sufficient for target scale)
Paid tier / license keys (open-source, monetization TBD)
GUI for connector development (AI agent + CLAUDE.md is sufficient)
