docs: consolidate and de-clutter the documentation tree (#306)

CLAUDE.md rewritten (708 -> ~320 lines): four overlapping release
sections collapsed to one, stale v1->v35 schema history dropped (it
lives in CHANGELOG), marketplace endpoint internals and verbose
process sections moved out or tightened.

New focused docs:
- docs/RELEASING.md - release process, deploy workflows, CI quirks
  (RELEASE_TEMPLATE.md folded in as an appendix)
- docs/marketplace.md - marketplace ingestion + re-serving internals
- docs/README.md - documentation index by audience, linked from
  README.md and CLAUDE.md

Archived under docs/archive/: docs/superpowers/ (52 historical
planning artifacts), HACKATHON.md, pd-ps-comments.md,
security-audit-2026-04.md, future/NOTIFICATIONS.md.

Removed the docs/auto-install.md stub. Fixed dangling links in
connectors/jira/README.md and dev_docs/README.md, repointed
code/doc references to archived paths.

2026-05-14 18:54:22 +00:00

Deployment & Multi-Instance Readiness Plan

Historical note (2026-04-24): This plan is a snapshot from 2026-04-09. Some details have evolved — most notably, the Caddy production profile referenced throughout this document was renamed to tls (see PR #51). For the current deployment guide, follow docs/DEPLOYMENT.md. This file is preserved as design history.

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Make the platform deployable to N customer instances with minimal manual effort.

Architecture: Docker image on GHCR + per-instance config (instance.yaml + .env) + Terraform provisioning. One image, many instances.

Tech Stack: Docker, Terraform (GCP), GitHub Actions, Caddy (TLS proxy)

Source: Deployment readiness + multi-instance architecture reviews 2026-04-09 (findings C5-C7, I4-I9)

File Map

| File | Responsibility | Tasks |
| --- | --- | --- |
| `config/.env.template` | Complete env var reference | 1 |
| `docker-compose.yml` | Add restart policy, config mount, image ref, Caddy proxy | 2, 3 |
| `docker-compose.prod.yml` | Production override with GHCR image + Caddy | 2, 3 |
| `.github/workflows/deploy.yml` | Image versioning with SHA tag | 4 |
| `infra/main.tf` | Remote state backend, instance.yaml generation | 5 |
| `services/telegram_bot/config.py` | Fix hardcoded paths | 6 |
| `src/profiler.py` | Fix PROFILER_DATA_DIR | 6 |
| `docs/DEPLOYMENT.md` | Update for multi-instance | 7 |

Task 1: Complete .env.template with all env vars

The current template lists only 8 of the 20+ variables the platform reads; the rewrite below documents 22.

Files:

Modify: config/.env.template

Step 1: Rewrite .env.template

# Agnes AI Data Analyst - Environment Variables
# =============================================
# Copy to .env: cp config/.env.template .env
# .env is gitignored - NEVER commit it.

# ── REQUIRED ────────────────────────────────────────
JWT_SECRET_KEY=              # python -c "import secrets; print(secrets.token_hex(32))"
SESSION_SECRET=              # python -c "import secrets; print(secrets.token_hex(32))"

# ── GOOGLE OAUTH (required for Google login) ────────
# GOOGLE_CLIENT_ID=
# GOOGLE_CLIENT_SECRET=

# ── KEBOOLA (required for Keboola data source) ──────
# KEBOOLA_STORAGE_TOKEN=
# KEBOOLA_STACK_URL=https://connection.keboola.com

# ── BIGQUERY (required for BigQuery data source) ─────
# BIGQUERY_PROJECT=
# BIGQUERY_LOCATION=us

# ── BOOTSTRAP (first deploy only) ───────────────────
# SEED_ADMIN_EMAIL=admin@example.com

# ── EMAIL / SMTP (required for magic link auth) ─────
# SMTP_HOST=smtp.gmail.com
# SMTP_PORT=587
# SMTP_USER=
# SMTP_PASSWORD=

# ── OPTIONAL SERVICES ───────────────────────────────
# TELEGRAM_BOT_TOKEN=
# JIRA_WEBHOOK_SECRET=
# JIRA_API_TOKEN=
# ANTHROPIC_API_KEY=
# LLM_API_KEY=

# ── DESKTOP APP ─────────────────────────────────────
# DESKTOP_JWT_SECRET=       # Separate secret for desktop app tokens

# ── DEPLOYMENT ──────────────────────────────────────
# DATA_DIR=/data            # Default: /data in Docker, ./data locally
# LOG_LEVEL=info            # debug, info, warning, error
# CORS_ORIGINS=http://localhost:3000,http://localhost:8000
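
Before starting the stack, it can help to confirm that the two required secrets are actually filled in. The following is a minimal sketch, not code from the repository: `REQUIRED_VARS`, `parse_env`, `missing_required`, and `generate_secret` are hypothetical names, and the parser only understands simple `KEY=value` lines like those in the template above.

```python
import secrets

# Assumption: only the two variables under "REQUIRED" in the template are mandatory.
REQUIRED_VARS = ("JWT_SECRET_KEY", "SESSION_SECRET")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines; skip blanks and commented-out lines."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing inline comment such as `KEY=   # how to generate`.
        value = value.split(" #", 1)[0].strip()
        values[key.strip()] = value
    return values


def missing_required(text: str) -> list[str]:
    """Return the required variables that are absent or empty."""
    env = parse_env(text)
    return [name for name in REQUIRED_VARS if not env.get(name)]


def generate_secret() -> str:
    """Same recipe as the template comment: 32 random bytes, hex-encoded."""
    return secrets.token_hex(32)
```

Calling `missing_required(open(".env").read())` on a freshly copied template would report both secrets as missing until they are filled in, for example with `generate_secret()`.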

Step 2: Commit

git add config/.env.template
git commit -m "docs: complete .env.template with all 20+ env vars"

Task 2: Fix docker-compose for production (I7, I5, I8)

Add restart policy to app, config volume mount, and GHCR image reference.

Files:

Modify: docker-compose.yml
Create: docker-compose.prod.yml (production override)

Step 1: Add restart policy and config mount to docker-compose.yml

In docker-compose.yml, add to the app service:

  app:
    build: .
    command: uvicorn app.main:app --host 0.0.0.0 --port 8000
    ports:
      - "8000:8000"
    volumes:
      - data:/data
      - ./config:/app/config:ro
    env_file: .env
    environment:
      - DATA_DIR=/data
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-sf", "http://localhost:8000/api/health"]
      interval: 30s
      timeout: 5s
      retries: 3

Key changes: restart: unless-stopped added, ./config:/app/config:ro volume mount added.

Step 2: Create docker-compose.prod.yml

# Production override — uses pre-built GHCR image instead of local build.
# Usage: docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d
services:
  app:
    image: ghcr.io/keboola/agnes-the-ai-analyst:latest
    build: !reset null

  scheduler:
    image: ghcr.io/keboola/agnes-the-ai-analyst:latest
    build: !reset null

  extract:
    image: ghcr.io/keboola/agnes-the-ai-analyst:latest
    build: !reset null

  telegram-bot:
    image: ghcr.io/keboola/agnes-the-ai-analyst:latest
    build: !reset null

  ws-gateway:
    image: ghcr.io/keboola/agnes-the-ai-analyst:latest
    build: !reset null

  corporate-memory:
    image: ghcr.io/keboola/agnes-the-ai-analyst:latest
    build: !reset null

  session-collector:
    image: ghcr.io/keboola/agnes-the-ai-analyst:latest
    build: !reset null
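
Because the override repeats the same image reference for every service, pinning or rolling back means changing the tag in seven places. A small script can do that consistently. This is only a sketch under the assumption that all services share one repository, as in the override above; `IMAGE_REPO` and `pin_image_tag` are hypothetical names, not part of the project.

```python
import re

# Assumption: every service uses the same repository, as in the override above.
IMAGE_REPO = "ghcr.io/keboola/agnes-the-ai-analyst"


def pin_image_tag(compose_text: str, tag: str) -> str:
    """Replace the tag on every `image: <IMAGE_REPO>:<tag>` line."""
    pattern = re.compile(r"(image:\s*" + re.escape(IMAGE_REPO) + r"):[\w.-]+")
    # A function replacement avoids treating characters in `tag` as backreferences.
    return pattern.sub(lambda m: f"{m.group(1)}:{tag}", compose_text)
```

Working on the file text rather than parsed YAML keeps the `!reset` tags untouched, which a generic YAML round-trip might not preserve.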

Step 3: Commit

git add docker-compose.yml docker-compose.prod.yml
git commit -m "feat: add restart policy, config mount, production compose override with GHCR images"

Task 3: Add Caddy reverse proxy for TLS (I4)

No HTTPS in Docker Compose — data transits in plaintext.

Files:

Create: Caddyfile
Modify: docker-compose.yml (add caddy service)

Step 1: Create Caddyfile

{$DOMAIN:localhost} {
    reverse_proxy app:8000
}
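
The `{$DOMAIN:localhost}` token is a Caddyfile environment placeholder with a default value. As a rough Python model of that substitution (an illustration only, not Caddy's implementation; `expand_env_placeholders` is a hypothetical name, and how Caddy treats a variable that is set but empty is not modeled here):

```python
import re

# Matches {$NAME} and {$NAME:default}.
_PLACEHOLDER = re.compile(r"\{\$([A-Za-z_][A-Za-z0-9_]*)(?::([^}]*))?\}")


def expand_env_placeholders(text: str, env: dict[str, str]) -> str:
    """Substitute placeholders from `env`, falling back to the inline default."""

    def replace(match: re.Match) -> str:
        name, default = match.group(1), match.group(2)
        if name in env:
            return env[name]
        return default or ""

    return _PLACEHOLDER.sub(replace, text)
```

With no `DOMAIN` in the mapping, the site address in this model resolves to `localhost`; with `DOMAIN=data.yourcompany.com` it resolves to that hostname.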

Step 2: Add Caddy service to docker-compose.yml

Add to services section:

  caddy:
    image: caddy:2-alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile:ro
      - caddy_data:/data
      - caddy_config:/config
    environment:
      - DOMAIN=${DOMAIN:-localhost}
    depends_on:
      app:
        condition: service_healthy
    restart: unless-stopped
    profiles:
      - production

Add volumes:

volumes:
  data:
  caddy_data:
  caddy_config:

Step 3: Update DEPLOYMENT.md

Add section:

### HTTPS with Caddy (production)

Set `DOMAIN=data.yourcompany.com` in `.env`, then:

```bash
docker compose --profile production up -d
```

Caddy automatically provisions Let's Encrypt TLS certificates for a publicly resolvable domain.

Step 4: Commit

git add Caddyfile docker-compose.yml docs/DEPLOYMENT.md
git commit -m "feat: add Caddy reverse proxy for automatic HTTPS in production"

Task 4: Add Docker image versioning with commit SHA (C7)

Images are only tagged :latest — no versioning, no rollback.

Files:

Modify: .github/workflows/deploy.yml

Step 1: Update image tagging

In .github/workflows/deploy.yml, replace the build-and-push step:

      - name: Build and push
        uses: docker/build-push-action@v7
        with:
          push: true
          tags: |
            ghcr.io/${{ github.repository }}:latest
            ghcr.io/${{ github.repository }}:${{ github.sha }}
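
The two tags that step pushes can be written out as a small function, which makes the rollback target explicit. This is an illustrative sketch; `image_tags` is a hypothetical helper. Unlike the workflow step as written, it lowercases the repository name, because image references require lowercase repository names while a GitHub `owner/repo` string may contain uppercase letters.

```python
def image_tags(repository: str, sha: str) -> list[str]:
    """Build the two tags pushed per build: a moving one and an immutable one."""
    # Assumption: lowercasing is added here for safety; the workflow above
    # uses github.repository unchanged.
    image = f"ghcr.io/{repository.lower()}"
    return [f"{image}:latest", f"{image}:{sha}"]
```

Note that `github.sha` is the full commit SHA, so the immutable tag is the 40-character hash rather than an abbreviated one.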

Step 2: Commit

git add -f .github/workflows/deploy.yml
git commit -m "feat: tag Docker images with commit SHA for versioning and rollback"

Task 5: Add Terraform remote state backend (I6)

Local tfstate blocks multi-operator and multi-instance Terraform.

Files:

Modify: infra/main.tf
Modify: infra/variables.tf

Step 1: Add GCS backend to main.tf

In infra/main.tf, inside the terraform {} block:

terraform {
  required_version = ">= 1.5"

  backend "gcs" {
    bucket = "agnes-terraform-state"
    prefix = "instances"
  }

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
    random = {
      source  = "hashicorp/random"
      version = "~> 3.0"
    }
  }
}

Step 2: Add instance.yaml generation to startup script

In infra/main.tf, in the startup_script local, after the .env generation:

    echo "=== Creating instance.yaml ==="
    cat > "$APP_DIR/config/instance.yaml" << 'YAMLEOF'
    instance:
      name: "${var.instance_name}"
      subtitle: "Data Analytics Platform"
    server:
      host: "${google_compute_address.data_analyst.address}"
      hostname: "${var.domain != "" ? var.domain : google_compute_address.data_analyst.address}"
      port: 8000
    auth:
      allowed_domain: "${var.admin_email != "" ? split("@", var.admin_email)[1] : ""}"
    data_source:
      type: "${var.keboola_token != "" ? "keboola" : "local"}"
    YAMLEOF
    sed -i 's/^    //' "$APP_DIR/config/instance.yaml"
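
The three conditional expressions in that template decide what each instance ends up with. Restated in Python for readability (a sketch that mirrors the Terraform logic above; the function names are hypothetical and nothing here is part of the startup script):

```python
def allowed_domain(admin_email: str) -> str:
    """Domain part of the admin email, or "" when no email is configured."""
    if not admin_email:
        return ""
    # Like the Terraform expression, this fails if the address has no "@".
    return admin_email.split("@")[1]


def data_source_type(keboola_token: str) -> str:
    """Keboola when a token is provided, otherwise the local data source."""
    return "keboola" if keboola_token else "local"


def hostname(domain: str, ip_address: str) -> str:
    """Prefer the configured domain; fall back to the reserved IP address."""
    return domain if domain else ip_address
```

For example, an instance with `admin_email = "admin@customer.com"`, no Keboola token, and no domain would get `allowed_domain: "customer.com"`, `type: "local"`, and the VM's IP address as hostname.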

Step 3: Update repo URL in startup script

Replace line 73 git clone https://github.com/padak/tmp_oss.git with:

    git clone https://github.com/keboola/agnes-the-ai-analyst.git "$APP_DIR"

And line 75 git checkout feature/v2-fastapi-duckdb-docker-cli with:

    # main branch is default, no checkout needed

Step 4: Commit

git add infra/main.tf infra/variables.tf
git commit -m "feat: add Terraform GCS remote state, instance.yaml generation, update repo URL"

Task 6: Fix hardcoded paths in services (I9)

telegram_bot and profiler use hardcoded /data/... paths instead of DATA_DIR.

Files:

Modify: services/telegram_bot/config.py:14
Modify: src/profiler.py:87
Modify: services/telegram_bot/dispatch.py:17

Step 1: Fix telegram_bot config

In services/telegram_bot/config.py, replace line 14:

NOTIFICATIONS_DIR = os.path.join(os.environ.get("DATA_DIR", "/data"), "notifications")

Step 2: Fix profiler

In src/profiler.py, replace line 87:

DATA_DIR = Path(os.environ.get("DATA_DIR", "/data")) / "src_data"

Remove PROFILER_DATA_DIR reference — use standard DATA_DIR like everywhere else.
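
Both replacements follow the same pattern: read `DATA_DIR` with `/data` as the default, then append a subdirectory. One way to keep that pattern in a single place is a shared helper. This is a sketch only; `data_path` is a hypothetical function that does not exist in the codebase as described here.

```python
import os
from pathlib import Path


def data_path(*parts: str) -> Path:
    """Resolve a path under DATA_DIR, defaulting to /data as in the plan."""
    # Read the variable at call time so tests and late configuration
    # changes are honored, unlike a module-level constant.
    return Path(os.environ.get("DATA_DIR", "/data")).joinpath(*parts)
```

With such a helper, the two constants above would become `data_path("notifications")` and `data_path("src_data")`.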

Step 3: Fix dispatch.py

In services/telegram_bot/dispatch.py, replace line 17:

WS_GATEWAY_SOCKET_PATH = os.environ.get("WS_GATEWAY_SOCKET", "/run/ws-gateway/ws.sock")

Step 4: Run tests

Run: pytest tests/ -q --tb=short
Expected: All pass

Step 5: Commit

git add services/telegram_bot/config.py services/telegram_bot/dispatch.py src/profiler.py
git commit -m "fix: use DATA_DIR env var everywhere — remove hardcoded /data paths"

Task 7: Update DEPLOYMENT.md for multi-instance

Add production deployment with GHCR images, Caddy TLS, and multi-instance guidance.

Files:

Modify: docs/DEPLOYMENT.md

Step 1: Add sections to DEPLOYMENT.md

Add these sections:

Production with GHCR images:

### Production Deployment (pre-built images)

Instead of building locally, pull from GitHub Container Registry:

```bash
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d
```

Pin to a specific version:

```yaml
# In docker-compose.prod.yml, change :latest to :COMMIT_SHA.
# deploy.yml tags images with the full commit SHA; abc1234 is a placeholder.
image: ghcr.io/keboola/agnes-the-ai-analyst:abc1234
```

Multi-instance:

## Multi-Instance Deployment

Each customer gets a separate VM with isolated data and config.

1. Copy `infra/terraform.tfvars.example` to `infra/instances/customer-name.tfvars`
2. Fill in customer-specific values
3. Apply: `cd infra && terraform workspace new customer-name && terraform apply -var-file=instances/customer-name.tfvars`
4. SSH in and create `config/instance.yaml` from `config/instance.yaml.example`
5. Start: `docker compose -f docker-compose.yml -f docker-compose.prod.yml --profile production up -d`
6. Bootstrap: `curl -X POST http://IP:8000/auth/bootstrap -d '{"email":"admin@customer.com"}'`

Update/rollback:

## Updating an Instance

```bash
# Pull latest image
docker compose -f docker-compose.yml -f docker-compose.prod.yml pull

# Restart with new image
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d

# Rollback to specific version
# Edit docker-compose.prod.yml: change :latest to :PREVIOUS_SHA
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d
```

Step 2: Commit

git add docs/DEPLOYMENT.md
git commit -m "docs: add multi-instance deployment, GHCR images, update/rollback procedures"

Execution Order

Sequential recommended (some tasks depend on earlier ones):

Task 1 — .env.template (no deps)
Task 2 — docker-compose fixes (no deps)
Task 3 — Caddy TLS (depends on Task 2)
Task 4 — image versioning (no deps)
Task 5 — Terraform remote state (no deps)
Task 6 — hardcoded paths (no deps)
Task 7 — documentation (depends on all above)

Tasks 1, 2, 4, 5, 6 can run in parallel.
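
The dependency list above can be checked mechanically with a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9+); `DEPENDENCIES` and `execution_waves` are illustrative names, and the graph simply encodes the dependencies stated in the list.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on, as listed above.
DEPENDENCIES = {
    1: set(),
    2: set(),
    3: {2},
    4: set(),
    5: set(),
    6: set(),
    7: {1, 2, 3, 4, 5, 6},
}


def execution_waves(deps: dict[int, set[int]]) -> list[list[int]]:
    """Group tasks into waves; tasks within one wave can run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

For the graph above this yields three waves: Tasks 1, 2, 4, 5, 6 first, then Task 3, then Task 7, which matches the parallelism note.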

Verification after all tasks:

# Tests still pass
pytest tests/ -v --tb=short

# Docker builds
docker compose build

# Production compose validates
docker compose -f docker-compose.yml -f docker-compose.prod.yml config
