agnes-the-ai-analyst/docs/superpowers/plans/2026-04-09-deployment-readiness.md
Vojtech 0bbbf3e40b
feat(tls): corporate-CA HTTPS with URL-driven rotation, on-VM CSR gen, self-signed fallback (#51)
Replaces the implicit Let's Encrypt flow with a general corporate-CA HTTPS path:

- Caddy switches to cert-file mode (`tls /certs/fullchain.pem /certs/privkey.pem`) with HSTS + TLS 1.2/1.3 floor
- New `docker-compose.tls.yml` overlay closes host `:8000` when Caddy fronts (no TLS bypass)
- New `scripts/tls-fetch.sh` — generic URL fetcher for `sm://`, `gs://`, `https://`, `file://` with redirect refusal + PEM validation
- New `scripts/grpn/agnes-tls-rotate.sh` — daily rotation, self-signed fallback against same key (zero key churn), on-VM RSA-2048 + CSR auto-gen, atomic swap, SIGUSR1 reload
- `scripts/grpn/agnes-auto-upgrade.sh` becomes cert-aware (auto-enables tls overlay when certs present)
- Compose profile `production` renamed to `tls` (aligns with DEPLOYMENT.md and infra startup)

Pairs with FoundryAI/agnes-the-ai-analyst-infra#27 (merged) which wires per-VM `local.vm_tls`, writes `TLS_*` env vars into `.env`, auto-creates Secret Manager containers for `sm://` privkey URLs, and installs `agnes-tls-rotate.{service,timer}` for daily polling.

Includes hardening + docs follow-ups from code review:
- `TLS_CSR_SUBJECT` env-var parametrisation applied to both CSR and self-signed cert paths
- curl `--max-redirs 0 --proto '=https'` + post-fetch PEM validation in `tls-fetch.sh`
- `ulimit -c 0` + array-form `COMPOSE_FILES` (POSIX-safe, bash 3.2 compatible)
- TLS section added to `config/.env.template`
- Historical-note headers in `docs/superpowers/{plans,specs}/2026-04-09-*.md` flagging the profile rename
2026-04-25 19:51:25 +00:00

497 lines
13 KiB
Markdown

# Deployment & Multi-Instance Readiness Plan
> **Historical note (2026-04-24):** This plan is a snapshot from 2026-04-09. Some details have evolved — most notably, the Caddy `production` profile referenced throughout this document was renamed to `tls` (see PR #51). For the current deployment guide, follow `docs/DEPLOYMENT.md`. This file is preserved as design history.
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Make the platform deployable to N customer instances with minimal manual effort.
**Architecture:** Docker image on GHCR + per-instance config (instance.yaml + .env) + Terraform provisioning. One image, many instances.
**Tech Stack:** Docker, Terraform (GCP), GitHub Actions, Caddy (TLS proxy)
**Source:** Deployment readiness + multi-instance architecture reviews 2026-04-09 (findings C5-C7, I4-I9)
---
## File Map
| File | Responsibility | Tasks |
|------|---------------|-------|
| `config/.env.template` | Complete env var reference | 1 |
| `docker-compose.yml` | Add restart policy, config mount, image ref, Caddy proxy | 2, 3 |
| `docker-compose.prod.yml` | Production override with GHCR image + Caddy | 2, 3 |
| `.github/workflows/deploy.yml` | Image versioning with SHA tag | 4 |
| `infra/main.tf` | Remote state backend, instance.yaml generation | 5 |
| `services/telegram_bot/config.py` | Fix hardcoded paths | 6 |
| `src/profiler.py` | Fix PROFILER_DATA_DIR | 6 |
| `docs/DEPLOYMENT.md` | Update for multi-instance | 7 |
---
### Task 1: Complete .env.template with all env vars
The template lists only 8 of ~15 needed variables.
**Files:**
- Modify: `config/.env.template`
- [ ] **Step 1: Rewrite .env.template**
```bash
# Agnes AI Data Analyst - Environment Variables
# =============================================
# Copy to .env: cp config/.env.template .env
# .env is gitignored - NEVER commit it.
# ── REQUIRED ────────────────────────────────────────
JWT_SECRET_KEY= # python -c "import secrets; print(secrets.token_hex(32))"
SESSION_SECRET= # python -c "import secrets; print(secrets.token_hex(32))"
# ── GOOGLE OAUTH (required for Google login) ────────
# GOOGLE_CLIENT_ID=
# GOOGLE_CLIENT_SECRET=
# ── KEBOOLA (required for Keboola data source) ──────
# KEBOOLA_STORAGE_TOKEN=
# KEBOOLA_STACK_URL=https://connection.keboola.com
# ── BIGQUERY (required for BigQuery data source) ─────
# BIGQUERY_PROJECT=
# BIGQUERY_LOCATION=us
# ── BOOTSTRAP (first deploy only) ───────────────────
# SEED_ADMIN_EMAIL=admin@example.com
# ── EMAIL / SMTP (required for magic link auth) ─────
# SMTP_HOST=smtp.gmail.com
# SMTP_PORT=587
# SMTP_USER=
# SMTP_PASSWORD=
# ── OPTIONAL SERVICES ───────────────────────────────
# TELEGRAM_BOT_TOKEN=
# JIRA_WEBHOOK_SECRET=
# JIRA_API_TOKEN=
# ANTHROPIC_API_KEY=
# LLM_API_KEY=
# ── DESKTOP APP ─────────────────────────────────────
# DESKTOP_JWT_SECRET= # Separate secret for desktop app tokens
# ── DEPLOYMENT ──────────────────────────────────────
# DATA_DIR=/data # Default: /data in Docker, ./data locally
# LOG_LEVEL=info # debug, info, warning, error
# CORS_ORIGINS=http://localhost:3000,http://localhost:8000
```
- [ ] **Step 2: Commit**
```bash
git add config/.env.template
git commit -m "docs: complete .env.template with all 20+ env vars"
```
---
### Task 2: Fix docker-compose for production (I7, I5, I8)
Add restart policy to app, config volume mount, and GHCR image reference.
**Files:**
- Modify: `docker-compose.yml`
- Create: `docker-compose.prod.yml` (production override)
- [ ] **Step 1: Add restart policy and config mount to docker-compose.yml**
In `docker-compose.yml`, add to the `app` service:
```yaml
app:
build: .
command: uvicorn app.main:app --host 0.0.0.0 --port 8000
ports:
- "8000:8000"
volumes:
- data:/data
- ./config:/app/config:ro
env_file: .env
environment:
- DATA_DIR=/data
restart: unless-stopped
healthcheck:
test: ["CMD", "curl", "-sf", "http://localhost:8000/api/health"]
interval: 30s
timeout: 5s
retries: 3
```
Key changes: `restart: unless-stopped` added, `./config:/app/config:ro` volume mount added.
- [ ] **Step 2: Create docker-compose.prod.yml**
```yaml
# Production override — uses pre-built GHCR image instead of local build.
# Usage: docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d
services:
app:
image: ghcr.io/keboola/agnes-the-ai-analyst:latest
build: !reset null
scheduler:
image: ghcr.io/keboola/agnes-the-ai-analyst:latest
build: !reset null
extract:
image: ghcr.io/keboola/agnes-the-ai-analyst:latest
build: !reset null
telegram-bot:
image: ghcr.io/keboola/agnes-the-ai-analyst:latest
build: !reset null
ws-gateway:
image: ghcr.io/keboola/agnes-the-ai-analyst:latest
build: !reset null
corporate-memory:
image: ghcr.io/keboola/agnes-the-ai-analyst:latest
build: !reset null
session-collector:
image: ghcr.io/keboola/agnes-the-ai-analyst:latest
build: !reset null
```
- [ ] **Step 3: Commit**
```bash
git add docker-compose.yml docker-compose.prod.yml
git commit -m "feat: add restart policy, config mount, production compose override with GHCR images"
```
---
### Task 3: Add Caddy reverse proxy for TLS (I4)
No HTTPS in Docker Compose — data transits in plaintext.
**Files:**
- Create: `Caddyfile`
- Modify: `docker-compose.yml` (add caddy service)
- [ ] **Step 1: Create Caddyfile**
```
{$DOMAIN:localhost} {
reverse_proxy app:8000
}
```
- [ ] **Step 2: Add Caddy service to docker-compose.yml**
Add to services section:
```yaml
caddy:
image: caddy:2-alpine
ports:
- "80:80"
- "443:443"
volumes:
- ./Caddyfile:/etc/caddy/Caddyfile:ro
- caddy_data:/data
- caddy_config:/config
environment:
- DOMAIN=${DOMAIN:-localhost}
depends_on:
app:
condition: service_healthy
restart: unless-stopped
profiles:
- production
```
Add volumes:
```yaml
volumes:
data:
caddy_data:
caddy_config:
```
- [ ] **Step 3: Update DEPLOYMENT.md**
Add section:
```markdown
### HTTPS with Caddy (production)
Set `DOMAIN=data.yourcompany.com` in `.env`, then:
```bash
docker compose --profile production up -d
```
Caddy automatically provisions Let's Encrypt TLS certificates.
```
- [ ] **Step 4: Commit**
```bash
git add Caddyfile docker-compose.yml docs/DEPLOYMENT.md
git commit -m "feat: add Caddy reverse proxy for automatic HTTPS in production"
```
---
### Task 4: Add Docker image versioning with commit SHA (C7)
Images are only tagged `:latest` — no versioning, no rollback.
**Files:**
- Modify: `.github/workflows/deploy.yml`
- [ ] **Step 1: Update image tagging**
In `.github/workflows/deploy.yml`, replace the build-and-push step:
```yaml
- name: Build and push
uses: docker/build-push-action@v7
with:
push: true
tags: |
ghcr.io/${{ github.repository }}:latest
ghcr.io/${{ github.repository }}:${{ github.sha }}
```
- [ ] **Step 2: Commit**
```bash
git add -f .github/workflows/deploy.yml
git commit -m "feat: tag Docker images with commit SHA for versioning and rollback"
```
---
### Task 5: Add Terraform remote state backend (I6)
Local tfstate blocks multi-operator and multi-instance Terraform.
**Files:**
- Modify: `infra/main.tf`
- Modify: `infra/variables.tf`
- [ ] **Step 1: Add GCS backend to main.tf**
In `infra/main.tf`, inside the `terraform {}` block:
```hcl
terraform {
required_version = ">= 1.5"
backend "gcs" {
bucket = "agnes-terraform-state"
prefix = "instances"
}
required_providers {
google = {
source = "hashicorp/google"
version = "~> 5.0"
}
random = {
source = "hashicorp/random"
version = "~> 3.0"
}
}
}
```
- [ ] **Step 2: Add instance.yaml generation to startup script**
In `infra/main.tf`, in the `startup_script` local, after the `.env` generation:
```bash
echo "=== Creating instance.yaml ==="
cat > "$APP_DIR/config/instance.yaml" << 'YAMLEOF'
instance:
name: "${var.instance_name}"
subtitle: "Data Analytics Platform"
server:
host: "${google_compute_address.data_analyst.address}"
hostname: "${var.domain != "" ? var.domain : google_compute_address.data_analyst.address}"
port: 8000
auth:
allowed_domain: "${var.admin_email != "" ? join("", [split("@", var.admin_email)[1]]) : ""}"
data_source:
type: "${var.keboola_token != "" ? "keboola" : "local"}"
YAMLEOF
sed -i 's/^ //' "$APP_DIR/config/instance.yaml"
```
- [ ] **Step 3: Update repo URL in startup script**
Replace line 73 `git clone https://github.com/padak/tmp_oss.git` with:
```bash
git clone https://github.com/keboola/agnes-the-ai-analyst.git "$APP_DIR"
```
And line 75 `git checkout feature/v2-fastapi-duckdb-docker-cli` with:
```bash
# main branch is default, no checkout needed
```
- [ ] **Step 4: Commit**
```bash
git add infra/main.tf infra/variables.tf
git commit -m "feat: add Terraform GCS remote state, instance.yaml generation, update repo URL"
```
---
### Task 6: Fix hardcoded paths in services (I9)
telegram_bot and profiler use hardcoded `/data/...` paths instead of `DATA_DIR`.
**Files:**
- Modify: `services/telegram_bot/config.py:14`
- Modify: `src/profiler.py:87`
- Modify: `services/telegram_bot/dispatch.py:17`
- [ ] **Step 1: Fix telegram_bot config**
In `services/telegram_bot/config.py`, replace line 14:
```python
NOTIFICATIONS_DIR = os.path.join(os.environ.get("DATA_DIR", "/data"), "notifications")
```
- [ ] **Step 2: Fix profiler**
In `src/profiler.py`, replace line 87:
```python
DATA_DIR = Path(os.environ.get("DATA_DIR", "/data")) / "src_data"
```
Remove `PROFILER_DATA_DIR` reference — use standard `DATA_DIR` like everywhere else.
- [ ] **Step 3: Fix dispatch.py**
In `services/telegram_bot/dispatch.py`, replace line 17:
```python
WS_GATEWAY_SOCKET_PATH = os.environ.get("WS_GATEWAY_SOCKET", "/run/ws-gateway/ws.sock")
```
- [ ] **Step 4: Run tests**
Run: `pytest tests/ -q --tb=short`
Expected: All pass
- [ ] **Step 5: Commit**
```bash
git add services/telegram_bot/config.py services/telegram_bot/dispatch.py src/profiler.py
git commit -m "fix: use DATA_DIR env var everywhere — remove hardcoded /data paths"
```
---
### Task 7: Update DEPLOYMENT.md for multi-instance
Add production deployment with GHCR images, Caddy TLS, and multi-instance guidance.
**Files:**
- Modify: `docs/DEPLOYMENT.md`
- [ ] **Step 1: Add sections to DEPLOYMENT.md**
Add these sections:
**Production with GHCR images:**
```markdown
### Production Deployment (pre-built images)
Instead of building locally, pull from GitHub Container Registry:
```bash
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d
```
Pin to a specific version:
```bash
# In docker-compose.prod.yml, change :latest to :COMMIT_SHA
image: ghcr.io/keboola/agnes-the-ai-analyst:abc1234
```
```
**Multi-instance:**
```markdown
## Multi-Instance Deployment
Each customer gets a separate VM with isolated data and config.
1. Copy `infra/terraform.tfvars.example` to `infra/instances/customer-name.tfvars`
2. Fill in customer-specific values
3. Apply: `cd infra && terraform workspace new customer-name && terraform apply -var-file=instances/customer-name.tfvars`
4. SSH in and create `config/instance.yaml` from `config/instance.yaml.example`
5. Start: `docker compose -f docker-compose.yml -f docker-compose.prod.yml --profile production up -d`
6. Bootstrap: `curl -X POST http://IP:8000/auth/bootstrap -d '{"email":"admin@customer.com"}'`
```
**Update/rollback:**
```markdown
## Updating an Instance
```bash
# Pull latest image
docker compose -f docker-compose.yml -f docker-compose.prod.yml pull
# Restart with new image
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d
# Rollback to specific version
# Edit docker-compose.prod.yml: change :latest to :PREVIOUS_SHA
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d
```
```
- [ ] **Step 2: Commit**
```bash
git add docs/DEPLOYMENT.md
git commit -m "docs: add multi-instance deployment, GHCR images, update/rollback procedures"
```
---
## Execution Order
Sequential recommended (some tasks depend on earlier ones):
1. **Task 1** — .env.template (no deps)
2. **Task 2** — docker-compose fixes (no deps)
3. **Task 3** — Caddy TLS (depends on Task 2)
4. **Task 4** — image versioning (no deps)
5. **Task 5** — Terraform remote state (no deps)
6. **Task 6** — hardcoded paths (no deps)
7. **Task 7** — documentation (depends on all above)
Tasks 1, 2, 4, 5, 6 can run in parallel.
**Verification after all tasks:**
```bash
# Tests still pass
pytest tests/ -v --tb=short
# Docker builds
docker compose build
# Production compose validates
docker compose -f docker-compose.yml -f docker-compose.prod.yml config
```