agnes-the-ai-analyst/docs/archive/superpowers/plans/2026-04-21-multi-customer-deployment.md
ZdenekSrotyr a48524509a
docs: consolidate and de-clutter the documentation tree (#306)
CLAUDE.md rewritten (708 -> ~320 lines): four overlapping release
sections collapsed to one, stale v1->v35 schema history dropped (it
lives in CHANGELOG), marketplace endpoint internals and verbose
process sections moved out or tightened.

New focused docs:
- docs/RELEASING.md - release process, deploy workflows, CI quirks
  (RELEASE_TEMPLATE.md folded in as an appendix)
- docs/marketplace.md - marketplace ingestion + re-serving internals
- docs/README.md - documentation index by audience, linked from
  README.md and CLAUDE.md

Archived under docs/archive/: docs/superpowers/ (52 historical
planning artifacts), HACKATHON.md, pd-ps-comments.md,
security-audit-2026-04.md, future/NOTIFICATIONS.md.

Removed the docs/auto-install.md stub. Fixed dangling links in
connectors/jira/README.md and dev_docs/README.md, repointed
code/doc references to archived paths.
2026-05-14 18:54:22 +00:00

63 KiB
Raw Permalink Blame History

Multi-Customer Deployment Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Přejít z dnešního "prod běží z osobního forku padak/tmp_oss" na production-grade multi-customer setup podle spec docs/superpowers/specs/2026-04-21-multi-customer-deployment-spec.md.

Architecture: Public upstream (keboola/agnes-the-ai-analyst) s TF modulem + public image na GHCR. Privátní template repo (keboola/agnes-infra-template) jako skeleton. Per-customer privátní repo (keboola/agnes-infra-keboola pro Keboola-as-customer, {org}/agnes-infra pro další) s Terraform + GitHub Actions + SA JSON key. Každý zákazník má vlastní GCP projekt, vlastní Secret Manager, vlastní prod/dev VMs. Watchtower na VMs polluje GHCR pro auto-deploy. Branch-aware dev VMs přes pole dev_instances v tfvars.

Tech Stack: Terraform (google provider ~5.0), Docker Compose, Caddy (TLS), Watchtower, GHCR, Google Cloud (Compute Engine, Secret Manager, Cloud Storage, IAM), GitHub Actions, Argon2 (passwords), DuckDB.


Závislosti mezi fázemi

Fáze 1 (MVP)  ──────────────────────────┐
   │                                    │
   ▼                                    ▼
Fáze 2 (TF modul + PD + rebuild)    Fáze 0 (Předpoklady)
   │
   ├─────────┬──────────┬──────────┐
   ▼         ▼          ▼          ▼
Fáze 3    Fáze 4     Fáze 5     Fáze 6
(TLS)   (Watchtower) (CI/CD)  (Template)
   │         │          │          │
   └─────────┴──────────┴──────────┘
              ▼
          Hotovo

Fáze 0 a 1 jsou sériové. Po Fázi 2 mohou 3/4/5 běžet paralelně. Fáze 6 používá výstupy 3/4/5.


Fáze 0 — Předpoklady (manuální, mimo kód)

Tyto kroky vyžadují externí akce (oprávnění, Keboola UI). Musí být hotové před Fází 1.

Task 0.1: Ověřit přístupová práva

  • Step 1: Ověřit, že máš iam.serviceAccountAdmin na internal-prod
gcloud projects get-iam-policy internal-prod --format=json \
  | python3 -c "import json, sys; d=json.load(sys.stdin); \
      me='zdenek.srotyr@keboola.com'; \
      roles=[b['role'] for b in d['bindings'] if any(me in m for m in b.get('members', []))]; \
      print('\n'.join(roles) if roles else 'NO DIRECT ROLES — check org-level or ask Petr (owner)')"

Expected: seznam rolí, nebo poznámka "NO DIRECT ROLES".

  • Step 2: Pokud chybí SA admin práva, požádat Petra o dočasný roles/iam.serviceAccountAdmin + roles/resourcemanager.projectIamAdmin

Poslat mu odkaz na tuhle dokumentaci: https://cloud.google.com/iam/docs/understanding-roles#iam-roles

Napsat Petrovi ve Slacku / emailu: "Potřebuji dočasně roli iam.serviceAccountAdmin a resourcemanager.projectIamAdmin na projektu internal-prod pro vytvoření Agnes deploy SA. Zrušíme, jakmile bude hotovo."

  • Step 3: Ověřit, že image ghcr.io/keboola/agnes-the-ai-analyst je public
gh api /orgs/keboola/packages/container/agnes-the-ai-analyst --jq '.visibility' 2>&1

Expected: "public". Pokud "private", změnit přes GitHub UI: Keboola org → Packages → agnes-the-ai-analyst → Package settings → Change visibility → Public.

Task 0.2: Backup stávajících dat (safety net před Fází 2)

  • Step 1: Snapshot boot disku prod VM (obsahuje /data)
gcloud compute disks snapshot data-analyst \
    --zone=europe-west1-b \
    --snapshot-names=data-analyst-pre-migration-$(date +%Y%m%d) \
    --project=internal-prod

Expected: Created snapshot data-analyst-pre-migration-YYYYMMDD.

  • Step 2: Ověřit snapshot
gcloud compute snapshots list --project=internal-prod \
    --filter="name~pre-migration" --format="table(name, status, diskSizeGb, creationTimestamp)"

Expected: STATUS = READY, 30 GB.


Fáze 1 — MVP: Odstřihnout od osobního forku, přejít na :stable image

Goal fáze: Prod VM data-analyst pulluje image z GHCR, nikoliv git pull z ZdenekSrotyr/tmp_oss. Tokeny jsou v Secret Manageru. Přepnutí je reverzibilní.

Task 1.1: Přidat per-branch image tagging do release.yml

Files:

  • Modify: .github/workflows/release.yml:47-95

  • Step 1: Number current state of meta step

cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss"
grep -n "branch_slug\|feature_tag\|SLUG" .github/workflows/release.yml 2>&1 | head -5

Expected: žádné výsledky — pattern neexistuje, přidáme ho.

  • Step 2: Otevřít .github/workflows/release.yml a najít Claim version tag step

Sekce má id: meta. Za řádkem echo "short_sha=${SHORT_SHA}" >> "$GITHUB_OUTPUT" (~ř. 90) přidat:

          # Per-branch slug for dev images (only on feature branches)
          if [[ "${{ github.ref }}" != "refs/heads/main" ]]; then
            BRANCH_NAME="${GITHUB_REF#refs/heads/}"
            BRANCH_SLUG=$(echo "$BRANCH_NAME" | sed 's|^feature/||' | sed 's|[^a-zA-Z0-9-]|-|g' | tr '[:upper:]' '[:lower:]' | cut -c1-50)
            echo "branch_slug=${BRANCH_SLUG}" >> "$GITHUB_OUTPUT"
            echo "Branch slug: ${BRANCH_SLUG}"
          fi
  • Step 3: V Build and push stepu přidat branch-slug tag

Najít tags: | blok (~ř. 110), nahradit za:

          tags: |
            ghcr.io/${{ github.repository }}:${{ steps.meta.outputs.channel }}
            ghcr.io/${{ github.repository }}:${{ steps.meta.outputs.versioned_tag }}
            ghcr.io/${{ github.repository }}:sha-${{ steps.meta.outputs.short_sha }}
            ${{ steps.meta.outputs.channel == 'dev' && format('ghcr.io/{0}:dev-{1}', github.repository, steps.meta.outputs.branch_slug) || '' }}            

Poslední řádek přidá :dev-<branch-slug> jen při pushech na feature branch.

  • Step 4: Syntax check workflow
cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss"
gh workflow view release.yml 2>&1 | head -10

Expected: workflow info, žádné "Parse error".

  • Step 5: Commit
git add .github/workflows/release.yml
git commit -m "ci: add per-branch image tag :dev-<slug> for branch-aware dev deploys"

Task 1.2: Vytvořit GCP deploy service account

Files:

  • Create: scripts/bootstrap-gcp.sh

  • Step 1: Vytvořit bootstrap skript

cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss"
mkdir -p scripts

Write scripts/bootstrap-gcp.sh:

#!/usr/bin/env bash
# Bootstrap GCP projekt pro Agnes deployment.
# Jednorázové, idempotentní. Výstup = výpis secretů pro GitHub Actions.
#
# Usage: bootstrap-gcp.sh <GCP_PROJECT_ID> [SA_NAME]
# Pokud SA existuje, skript vygeneruje nový klíč a skončí.
set -euo pipefail

PROJECT_ID="${1:?Usage: $0 <GCP_PROJECT_ID> [SA_NAME=agnes-deploy]}"
SA_NAME="${2:-agnes-deploy}"
SA_EMAIL="${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"

echo "=== Bootstrap GCP projekt: ${PROJECT_ID} ==="
gcloud config set project "${PROJECT_ID}" 1>/dev/null

echo "=== Enable APIs ==="
gcloud services enable \
    compute.googleapis.com \
    iam.googleapis.com \
    iamcredentials.googleapis.com \
    secretmanager.googleapis.com \
    cloudresourcemanager.googleapis.com \
    storage.googleapis.com \
    --project="${PROJECT_ID}"

echo "=== Create deploy service account (if not exists) ==="
if ! gcloud iam service-accounts describe "${SA_EMAIL}" --project="${PROJECT_ID}" 2>/dev/null; then
    gcloud iam service-accounts create "${SA_NAME}" \
        --display-name="Agnes Terraform deploy" \
        --project="${PROJECT_ID}"
fi

echo "=== Grant roles ==="
for role in \
    compute.instanceAdmin.v1 \
    compute.securityAdmin \
    compute.networkAdmin \
    iam.serviceAccountUser \
    secretmanager.admin \
    storage.admin; do
    gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
        --member="serviceAccount:${SA_EMAIL}" \
        --role="roles/${role}" \
        --condition=None \
        --quiet 1>/dev/null
done

echo "=== Create tfstate bucket (if not exists) ==="
BUCKET="agnes-${PROJECT_ID}-tfstate"
if ! gsutil ls -b "gs://${BUCKET}" 2>/dev/null; then
    gsutil mb -p "${PROJECT_ID}" -l europe-west1 -b on "gs://${BUCKET}"
    gsutil versioning set on "gs://${BUCKET}"
fi

echo "=== Generate SA key ==="
KEY_FILE="./${SA_NAME}-${PROJECT_ID}-key.json"
gcloud iam service-accounts keys create "${KEY_FILE}" \
    --iam-account="${SA_EMAIL}" \
    --project="${PROJECT_ID}"

echo ""
echo "=== HOTOVO ==="
echo ""
echo "SA email:           ${SA_EMAIL}"
echo "TF state bucket:    gs://${BUCKET}"
echo "SA key file:        ${KEY_FILE}"
echo ""
echo "DALŠÍ KROKY:"
echo "1. Pushni klíč do GitHub secretu privátního infra repa:"
echo "   gh secret set GCP_SA_KEY --repo <owner>/<repo> < ${KEY_FILE}"
echo "2. POTOM smaž klíč z lokálu:"
echo "   rm ${KEY_FILE}"
echo ""
  • Step 2: Udělat skript executable
chmod +x scripts/bootstrap-gcp.sh
  • Step 3: Spustit skript na internal-prod
cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss"
./scripts/bootstrap-gcp.sh internal-prod

Expected: na konci výpis "HOTOVO" + instrukce.

Pokud selže na "Permission denied": viz Task 0.1 step 2 (požádat Petra).

  • Step 4: Ověřit SA a bucket
gcloud iam service-accounts list --project=internal-prod --filter="email~agnes-deploy" --format="value(email)"
gsutil ls -b gs://agnes-internal-prod-tfstate

Expected: SA email + bucket URL.

  • Step 5: Commit bootstrap skript
git add scripts/bootstrap-gcp.sh
git commit -m "infra: add bootstrap-gcp.sh for per-customer GCP setup"

Task 1.3: Nastavit tajemství v Secret Manageru

  • Step 1: Rotovat Keboola Storage token v Keboola UI

Přihlásit se do Keboola UI (https://connection.us-east4.gcp.keboola.com/), sekce Settings → Master Tokens → vygenerovat nový token.

Starý token zachovat aktivní, dokud nebude nový nasazený.

  • Step 2: Uložit nový token do Secret Manageru
read -s NEW_TOKEN
echo -n "$NEW_TOKEN" | gcloud secrets create keboola-storage-token \
    --data-file=- \
    --replication-policy=automatic \
    --project=internal-prod
unset NEW_TOKEN

Expected: Created secret [keboola-storage-token].

  • Step 3: Vygenerovat a uložit JWT secret
openssl rand -hex 32 | gcloud secrets create jwt-secret-key \
    --data-file=- \
    --replication-policy=automatic \
    --project=internal-prod

Expected: Created secret [jwt-secret-key].

  • Step 4: Ověřit secrets
gcloud secrets list --project=internal-prod --format="table(name, createTime)"

Expected: dva secrets — keboola-storage-token, jwt-secret-key.

  • Step 5: Přiřadit read access deploy SA
for secret in keboola-storage-token jwt-secret-key; do
    gcloud secrets add-iam-policy-binding "$secret" \
        --member="serviceAccount:agnes-deploy@internal-prod.iam.gserviceaccount.com" \
        --role=roles/secretmanager.secretAccessor \
        --project=internal-prod
done

Expected: Updated IAM policy × 2.

Task 1.4: Vytvořit skript, který na VM natáhne secrets ze Secret Manageru do .env

Files:

  • Create: scripts/fetch-env-from-secrets.sh

  • Step 1: Napsat skript

Write scripts/fetch-env-from-secrets.sh:

#!/usr/bin/env bash
# Stáhne secrets z GCP Secret Manageru a vytvoří .env pro Agnes.
# Spouští se jednorázově na VM během boot / deploy.
#
# Vyžaduje:
#   - gcloud CLI (už nainstalované na GCE default image)
#   - VM SA má roli roles/secretmanager.secretAccessor
set -euo pipefail

APP_DIR="${APP_DIR:-/home/deploy/app}"
ENV_FILE="${APP_DIR}/.env"

echo "Fetching secrets..."

KEBOOLA_TOKEN=$(gcloud secrets versions access latest --secret=keboola-storage-token 2>&1)
JWT_KEY=$(gcloud secrets versions access latest --secret=jwt-secret-key 2>&1)

# Non-secret config (může zůstat v metadatě/startup-scriptu)
DATA_SOURCE="${DATA_SOURCE:-keboola}"
KEBOOLA_STACK_URL="${KEBOOLA_STACK_URL:-https://connection.us-east4.gcp.keboola.com/}"
SEED_ADMIN_EMAIL="${SEED_ADMIN_EMAIL:-zdenek.srotyr@keboola.com}"
LOG_LEVEL="${LOG_LEVEL:-info}"
DATA_DIR="${DATA_DIR:-/data}"

cat > "${ENV_FILE}" <<EOF
JWT_SECRET_KEY=${JWT_KEY}
DATA_DIR=${DATA_DIR}
DATA_SOURCE=${DATA_SOURCE}
KEBOOLA_STORAGE_TOKEN=${KEBOOLA_TOKEN}
KEBOOLA_STACK_URL=${KEBOOLA_STACK_URL}
SEED_ADMIN_EMAIL=${SEED_ADMIN_EMAIL}
LOG_LEVEL=${LOG_LEVEL}
EOF

chmod 600 "${ENV_FILE}"
chown deploy:deploy "${ENV_FILE}" 2>/dev/null || true

echo "Wrote ${ENV_FILE} (chmod 600)"
  • Step 2: Chmod + commit
chmod +x scripts/fetch-env-from-secrets.sh
git add scripts/fetch-env-from-secrets.sh
git commit -m "infra: add fetch-env-from-secrets.sh for VM-side secret retrieval"

Task 1.5: Připravit prod docker-compose pro GHCR image

Files:

  • Modify: docker-compose.prod.yml

  • Step 1: Přečíst současný docker-compose.prod.yml

cat "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss/docker-compose.prod.yml"

Zaznamenat si strukturu (services, volumes).

  • Step 2: Ověřit, že prod overlay používá image: místo build:
grep -E "^\s*(image|build):" docker-compose.prod.yml

Expected: řádek image: ghcr.io/keboola/agnes-the-ai-analyst:${AGNES_TAG:-stable} (nebo podobně). Pokud chybí, přidat do services.app:

services:
  app:
    image: ghcr.io/keboola/agnes-the-ai-analyst:${AGNES_TAG:-stable}
    build: !reset null   # vypnout lokální build

A pro scheduler:

  scheduler:
    image: ghcr.io/keboola/agnes-the-ai-analyst:${AGNES_TAG:-stable}
    build: !reset null
  • Step 3: Commit změn (pokud nějaké)
git status docker-compose.prod.yml
# Pokud modified:
git add docker-compose.prod.yml
git commit -m "infra: prod compose pulls from GHCR via AGNES_TAG env (default :stable)"

Task 1.6: Deploy MVP na prod VM data-analyst

Tohle je destruktivní akce na prod. Předtím Task 0.2 (snapshot).

  • Step 1: SSH na prod VM a zastavit kontejnery
gcloud compute ssh data-analyst --zone=europe-west1-b --project=internal-prod --command="sudo -u deploy bash -c 'cd /home/deploy/app && docker compose down'"

Expected: Container app-app-1 Stopped, Container app-scheduler-1 Stopped.

  • Step 2: Nastavit VM SA na deploy VM (jednorázově)
# Ověřit aktuální SA
gcloud compute instances describe data-analyst --zone=europe-west1-b --project=internal-prod \
    --format="value(serviceAccounts[0].email)"

Pokud výstup 327445566538-compute@developer.gserviceaccount.com (default SA), je to OK pro MVP — má cloud-platform scope a může číst secrets. Ve Fázi 4 (hardening) to přepneme na dedikovaný SA.

Přidat mu explicitně secretmanager.secretAccessor (idempotentní):

gcloud projects add-iam-policy-binding internal-prod \
    --member="serviceAccount:327445566538-compute@developer.gserviceaccount.com" \
    --role="roles/secretmanager.secretAccessor" \
    --condition=None
  • Step 3: Uploadnout fetch-env skript na VM
gcloud compute scp \
    "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss/scripts/fetch-env-from-secrets.sh" \
    data-analyst:/tmp/fetch-env.sh \
    --zone=europe-west1-b --project=internal-prod
  • Step 4: Spustit fetch-env skript pod uživatelem deploy
gcloud compute ssh data-analyst --zone=europe-west1-b --project=internal-prod --command="sudo install -m 755 -o deploy -g deploy /tmp/fetch-env.sh /home/deploy/app/fetch-env.sh && sudo -u deploy bash -c 'cd /home/deploy/app && ./fetch-env.sh'"

Expected: Wrote /home/deploy/app/.env (chmod 600).

  • Step 5: Zkontrolovat .env na VM (bez vypisování hodnot)
gcloud compute ssh data-analyst --zone=europe-west1-b --project=internal-prod --command="sudo -u deploy bash -c 'ls -la /home/deploy/app/.env && wc -l /home/deploy/app/.env && cut -d= -f1 /home/deploy/app/.env'"

Expected: soubor 600 mode, 7 řádků, klíče: JWT_SECRET_KEY, DATA_DIR, DATA_SOURCE, KEBOOLA_STORAGE_TOKEN, KEBOOLA_STACK_URL, SEED_ADMIN_EMAIL, LOG_LEVEL.

  • Step 6: Aktualizovat docker-compose.yml konfiguraci na VM na pulling z GHCR
gcloud compute ssh data-analyst --zone=europe-west1-b --project=internal-prod --command="sudo -u deploy bash -c 'cd /home/deploy/app && git fetch origin feature/v2-fastapi-duckdb-docker-cli && git reset --hard origin/feature/v2-fastapi-duckdb-docker-cli'"

Pozor: VM má starý remote ZdenekSrotyr/tmp_oss. Tohle tedy nebude fungovat, pokud se ten repo smazal. Alternativa: nahradit origin remote za keboola/agnes-the-ai-analyst:

gcloud compute ssh data-analyst --zone=europe-west1-b --project=internal-prod --command="sudo -u deploy bash -c 'cd /home/deploy/app && git remote set-url origin https://github.com/keboola/agnes-the-ai-analyst.git && git fetch origin main && git reset --hard origin/main'"

Expected: HEAD is now at <sha> <message>.

  • Step 7: Pullnout image z GHCR a nastartovat s novým override
gcloud compute ssh data-analyst --zone=europe-west1-b --project=internal-prod --command="sudo -u deploy bash -c 'cd /home/deploy/app && export AGNES_TAG=stable && docker compose -f docker-compose.yml -f docker-compose.prod.yml pull && docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d'"

Expected: Container app-app-1 Started, Container app-scheduler-1 Started.

  • Step 8: Ověřit běh
# Počkat 30 sekund
sleep 30
curl -s --max-time 10 http://<redacted-ip>:8000/api/health | python3 -m json.tool | head -10

Expected: "status": "healthy" nebo "degraded" (stale tables jsou OK). Ne connection refused.

  • Step 9: Ověřit, že app používá nový image
gcloud compute ssh data-analyst --zone=europe-west1-b --project=internal-prod --command="sudo docker inspect app-app-1 --format '{{.Config.Image}}'"

Expected: ghcr.io/keboola/agnes-the-ai-analyst:stable (ne app-app).

  • Step 10: Ověřit login
curl -sS --max-time 5 -X POST http://<redacted-ip>:8000/auth/password/login \
  -H "Content-Type: application/json" \
  -d '{"email":"zdenek.srotyr@keboola.com","password":"1234"}' 2>&1 | python3 -c "import sys,json; d=json.load(sys.stdin); print('OK — role:', d.get('role'))"

Expected: OK — role: admin.

  • Step 11: Zapsat poznámku o nové .env strategii do dokumentace

Add to docs/DEPLOYMENT.md (if not present) section "Production environment":

## Production .env strategy

Secrets (KEBOOLA_STORAGE_TOKEN, JWT_SECRET_KEY) are fetched from GCP Secret Manager
by `scripts/fetch-env-from-secrets.sh` during VM boot. Non-secret config (STACK_URL,
SEED_ADMIN_EMAIL, LOG_LEVEL) is passed via env vars in the startup script.

To rotate a secret:
1. Add a new version via `gcloud secrets versions add ...`
2. SSH to VM and re-run `./fetch-env.sh`
3. Restart: `docker compose up -d --force-recreate app`
  • Step 12: Commit dokumentace
git add docs/DEPLOYMENT.md
git commit -m "docs: document Secret Manager-backed .env for production"

Task 1.7: Zopakovat MVP deploy na dev VM

  • Step 1: Opakovat Task 1.6 steps 1-10 proti data-analyst-dev VM

Stejné příkazy, jen zaměnit data-analyst za data-analyst-dev a IP <redacted-ip> za <redacted-ip>.

  • Step 2: Verify
curl -s --max-time 10 http://<redacted-ip>:8000/api/health | python3 -m json.tool | head -3

Expected: valid JSON s "status".

Task 1.8: Smazat osobní fork

  • Step 1: Odstranit deploy key z ZdenekSrotyr/tmp_oss (pokud existuje)
gh api repos/ZdenekSrotyr/tmp_oss/keys 2>&1 | python3 -m json.tool

Pokud něco vrací, smazat: gh api -X DELETE repos/ZdenekSrotyr/tmp_oss/keys/<id>.

  • Step 2: Smazat repo
gh repo delete ZdenekSrotyr/tmp_oss --yes

Expected: ✓ Deleted repository ZdenekSrotyr/tmp_oss.

  • Step 3: Ověřit, že je fuč
gh api repos/ZdenekSrotyr/tmp_oss 2>&1 | head -2

Expected: Not Found (HTTP 404).

Task 1.9: Invalidovat starý Keboola token

  • Step 1: V Keboola UI zrušit starý master token

(Ruční krok v Keboola UI. Nový token už je v Secret Manageru z Task 1.3.)

Ověřit, že nová verze tokenu funguje:

curl -s --max-time 10 http://<redacted-ip>:8000/api/sync/status 2>&1 | python3 -m json.tool | head -20

Expected: nějaký valid JSON. Pokud 401 Unauthorized nebo Invalid token, app ještě má cached starý token — restartovat:

gcloud compute ssh data-analyst --zone=europe-west1-b --project=internal-prod --command="sudo -u deploy bash -c 'cd /home/deploy/app && docker compose restart app'"

Task 1.10: Checkpoint — Fáze 1 hotová

  • Step 1: Přepnout heslo z 1234 na něco silného

Přes UI nebo:

read -s NEW_PASSWORD
TOKEN=$(curl -sS -X POST http://<redacted-ip>:8000/auth/password/login \
  -H "Content-Type: application/json" \
  -d '{"email":"zdenek.srotyr@keboola.com","password":"1234"}' | python3 -c "import sys,json;print(json.load(sys.stdin)['access_token'])")
# [Volba: použít admin endpoint pro změnu hesla, pokud existuje — jinak přes UI]
unset NEW_PASSWORD TOKEN
  • Step 2: Ověřit stav

Zkontrolovat checklist:

  • Prod VM data-analyst běží z ghcr.io/...:stable
  • Dev VM data-analyst-dev běží z ghcr.io/...:stable
  • Secrets v GCP Secret Manageru
  • Heslo admin usera není 1234
  • ZdenekSrotyr/tmp_oss je smazaný
  • Starý Keboola token je invalidován

Fáze 2 — TF modul + persistent disk + F1 rebuild

Goal fáze: Keboola instance běží na VMs, kterou spravuje Terraform modul z infra/modules/customer-instance/. Data jsou na samostatném persistent disku. TF state v GCS bucketu.

Task 2.1: Refactor infra/main.tf na modulární strukturu

Files:

  • Create: infra/modules/customer-instance/main.tf

  • Create: infra/modules/customer-instance/variables.tf

  • Create: infra/modules/customer-instance/outputs.tf

  • Create: infra/modules/customer-instance/startup-script.sh

  • Delete: infra/main.tf (old monolith)

  • Keep (upraveno): infra/variables.tf, infra/outputs.tf, infra/terraform.tfvars.example

  • Create: infra/examples/minimal/main.tf (usage example)

  • Step 1: Vytvořit adresářovou strukturu

cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss"
mkdir -p infra/modules/customer-instance
mkdir -p infra/examples/minimal
  • Step 2: Napsat infra/modules/customer-instance/variables.tf

Write:

variable "gcp_project_id" {
  description = "GCP project ID kde bude instance nasazená"
  type        = string
}

variable "region" {
  description = "GCP region"
  type        = string
  default     = "europe-west1"
}

variable "zone" {
  description = "GCP zone"
  type        = string
  default     = "europe-west1-b"
}

variable "customer_name" {
  description = "Krátké identifikátor zákazníka (např. keboola, another-customer). Použije se v prefixu resourců."
  type        = string
  validation {
    condition     = can(regex("^[a-z][a-z0-9-]{1,20}$", var.customer_name))
    error_message = "customer_name musí být lowercase, začínat písmenem, 2-21 znaků."
  }
}

variable "prod_instance" {
  description = "Prod VM konfigurace"
  type = object({
    name         = string
    machine_type = optional(string, "e2-small")
    disk_size_gb = optional(number, 30)
    data_disk_gb = optional(number, 50)
    image_tag    = optional(string, "stable")
    upgrade_mode = optional(string, "auto")
    tls_mode     = optional(string, "caddy")
    domain       = optional(string, "")
  })
}

variable "dev_instances" {
  description = "Seznam dev VMs. Prázdné pole = žádné dev VMs."
  type = list(object({
    name         = string
    machine_type = optional(string, "e2-small")
    image_tag    = optional(string, "dev")
  }))
  default = []
}

variable "seed_admin_email" {
  description = "Email prvního admin usera"
  type        = string
}

variable "data_source" {
  description = "Typ data source — keboola | bigquery | csv"
  type        = string
  default     = "keboola"
}

variable "keboola_stack_url" {
  description = "Keboola Stack URL (pokud data_source = keboola)"
  type        = string
  default     = ""
}

variable "image_repo" {
  description = "Docker image repo"
  type        = string
  default     = "ghcr.io/keboola/agnes-the-ai-analyst"
}
  • Step 3: Napsat infra/modules/customer-instance/main.tf

Write:

terraform {
  required_version = ">= 1.5"
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
    random = {
      source  = "hashicorp/random"
      version = "~> 3.0"
    }
  }
}

locals {
  all_instances = concat(
    [merge(var.prod_instance, { role = "prod" })],
    [for d in var.dev_instances : merge(d, {
      role         = "dev"
      disk_size_gb = 30
      data_disk_gb = 20
      upgrade_mode = "auto"
      tls_mode     = "caddy"
      domain       = ""
    })]
  )
}

# --- Secrets ---

resource "google_secret_manager_secret" "jwt" {
  secret_id = "agnes-${var.customer_name}-jwt-secret"
  project   = var.gcp_project_id
  replication { auto {} }
}

resource "random_password" "jwt" {
  length  = 48
  special = false
}

resource "google_secret_manager_secret_version" "jwt" {
  secret      = google_secret_manager_secret.jwt.id
  secret_data = random_password.jwt.result
}

# Keboola token — manuálně vytvořený secret (tenhle TF ho jen referenční).
data "google_secret_manager_secret_version" "keboola_token" {
  count   = var.data_source == "keboola" ? 1 : 0
  secret  = "keboola-storage-token"
  project = var.gcp_project_id
}

# --- VM service account (dedikovaný, bez cloud-platform scope) ---

resource "google_service_account" "vm" {
  account_id   = "agnes-${var.customer_name}-vm"
  display_name = "Agnes VM runtime SA (${var.customer_name})"
  project      = var.gcp_project_id
}

resource "google_project_iam_member" "vm_secrets" {
  project = var.gcp_project_id
  role    = "roles/secretmanager.secretAccessor"
  member  = "serviceAccount:${google_service_account.vm.email}"
}

# --- Network ---

resource "google_compute_firewall" "web" {
  name    = "agnes-${var.customer_name}-allow-web"
  project = var.gcp_project_id
  network = "default"

  allow {
    protocol = "tcp"
    ports    = ["22", "80", "443", "8000"]
  }

  source_ranges = ["<redacted-ip>/0"]
  target_tags   = ["agnes-${var.customer_name}"]
}

# --- Persistent data disks + VMs (prod + dev) ---

resource "google_compute_disk" "data" {
  for_each = { for inst in local.all_instances : inst.name => inst }

  name    = "${each.value.name}-data"
  project = var.gcp_project_id
  zone    = var.zone
  size    = each.value.data_disk_gb
  type    = "pd-ssd"
}

resource "google_compute_address" "ip" {
  for_each = { for inst in local.all_instances : inst.name => inst }

  name    = "${each.value.name}-ip"
  project = var.gcp_project_id
  region  = var.region
}

resource "google_compute_instance" "vm" {
  for_each = { for inst in local.all_instances : inst.name => inst }

  name         = each.value.name
  project      = var.gcp_project_id
  machine_type = each.value.machine_type
  zone         = var.zone
  tags         = ["agnes-${var.customer_name}"]

  boot_disk {
    initialize_params {
      image = "ubuntu-os-cloud/ubuntu-2404-lts-amd64"
      size  = each.value.disk_size_gb
      type  = "pd-ssd"
    }
  }

  attached_disk {
    source      = google_compute_disk.data[each.key].self_link
    device_name = "data"
  }

  network_interface {
    network = "default"
    access_config {
      nat_ip = google_compute_address.ip[each.key].address
    }
  }

  metadata = {
    enable-oslogin = "TRUE"
  }

  metadata_startup_script = templatefile("${path.module}/startup-script.sh", {
    customer_name     = var.customer_name
    image_repo        = var.image_repo
    image_tag         = each.value.image_tag
    upgrade_mode      = each.value.upgrade_mode
    tls_mode          = each.value.tls_mode
    domain            = each.value.domain
    data_source       = var.data_source
    keboola_stack_url = var.keboola_stack_url
    seed_admin_email  = var.seed_admin_email
    role              = each.value.role
  })

  service_account {
    email  = google_service_account.vm.email
    scopes = ["cloud-platform"]
  }

  labels = {
    app      = "agnes"
    customer = var.customer_name
    role     = each.value.role
    managed  = "terraform"
  }

  lifecycle {
    ignore_changes = [metadata_startup_script]
  }
}
  • Step 4: Napsat infra/modules/customer-instance/startup-script.sh

Write:

#!/bin/bash
# Agnes VM startup script.
# Idempotentní — spustí se při každém boot.
set -euo pipefail
exec > /var/log/agnes-startup.log 2>&1

CUSTOMER_NAME="${customer_name}"
IMAGE_REPO="${image_repo}"
IMAGE_TAG="${image_tag}"
UPGRADE_MODE="${upgrade_mode}"
TLS_MODE="${tls_mode}"
DOMAIN="${domain}"
DATA_SOURCE="${data_source}"
KEBOOLA_STACK_URL="${keboola_stack_url}"
SEED_ADMIN_EMAIL="${seed_admin_email}"
ROLE="${role}"

echo "=== [Agnes $CUSTOMER_NAME $ROLE] Startup ==="

# --- 1. Docker (install if missing) ---
if ! command -v docker &>/dev/null; then
    curl -fsSL https://get.docker.com | sh
fi
if ! docker compose version &>/dev/null; then
    apt-get update && apt-get install -y docker-compose-plugin
fi

# --- 2. Persistent disk mount ---
DATA_DEV="/dev/disk/by-id/google-data"
DATA_MNT="/data"
if [ -b "$DATA_DEV" ]; then
    if ! blkid "$DATA_DEV" | grep -q ext4; then
        mkfs.ext4 -F "$DATA_DEV"
    fi
    mkdir -p "$DATA_MNT"
    mountpoint -q "$DATA_MNT" || mount -o discard,defaults "$DATA_DEV" "$DATA_MNT"
    grep -q "$DATA_DEV" /etc/fstab || echo "$DATA_DEV $DATA_MNT ext4 discard,defaults,nofail 0 2" >> /etc/fstab
    mkdir -p "$DATA_MNT/state" "$DATA_MNT/analytics" "$DATA_MNT/extracts"
fi

# --- 3. App directory (pro docker-compose.yml) ---
APP_DIR="/opt/agnes"
mkdir -p "$APP_DIR"
cd "$APP_DIR"

# Fetch minimal docker-compose — z public repa na jejich tagu
curl -fsSL "https://raw.githubusercontent.com/keboola/agnes-the-ai-analyst/main/docker-compose.yml" \
    -o docker-compose.yml
curl -fsSL "https://raw.githubusercontent.com/keboola/agnes-the-ai-analyst/main/docker-compose.prod.yml" \
    -o docker-compose.prod.yml

# --- 4. Fetch secrets from Secret Manager ---
KEBOOLA_TOKEN=""
if [ "$DATA_SOURCE" = "keboola" ]; then
    KEBOOLA_TOKEN=$(gcloud secrets versions access latest --secret=keboola-storage-token 2>/dev/null || echo "")
fi
JWT_KEY=$(gcloud secrets versions access latest --secret=agnes-$CUSTOMER_NAME-jwt-secret)

cat > "$APP_DIR/.env" <<EOF
JWT_SECRET_KEY=$JWT_KEY
DATA_DIR=$DATA_MNT
DATA_SOURCE=$DATA_SOURCE
KEBOOLA_STORAGE_TOKEN=$KEBOOLA_TOKEN
KEBOOLA_STACK_URL=$KEBOOLA_STACK_URL
SEED_ADMIN_EMAIL=$SEED_ADMIN_EMAIL
LOG_LEVEL=info
DOMAIN=$DOMAIN
AGNES_TAG=$IMAGE_TAG
EOF
chmod 600 "$APP_DIR/.env"

# --- 5. Start Agnes ---
docker compose -f docker-compose.yml -f docker-compose.prod.yml pull
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d

# --- 6. Watchtower (auto pull nových image) ---
if [ "$UPGRADE_MODE" = "auto" ]; then
    docker run -d --name watchtower --restart=unless-stopped \
        -v /var/run/docker.sock:/var/run/docker.sock \
        containrrr/watchtower \
        --interval 300 --cleanup \
        $(docker ps --filter "ancestor=$IMAGE_REPO:$IMAGE_TAG" --format "{{.Names}}") 2>/dev/null || true
fi

echo "=== [Agnes $CUSTOMER_NAME $ROLE] Startup complete ==="
docker compose ps
  • Step 5: Napsat infra/modules/customer-instance/outputs.tf

Write:

output "instance_ips" {
  description = "Mapa { name → external IP }"
  value       = { for k, v in google_compute_address.ip : k => v.address }
}

output "prod_ip" {
  description = "External IP prod instance"
  value       = google_compute_address.ip[var.prod_instance.name].address
}

output "vm_service_account" {
  description = "Email VM SA (pro další IAM bindings, např. BigQuery)"
  value       = google_service_account.vm.email
}

output "jwt_secret_name" {
  description = "Plný název JWT secretu v Secret Manageru"
  value       = google_secret_manager_secret.jwt.name
}
  • Step 6: Smazat starý infra/main.tf a uložit si ho jako backup
mv infra/main.tf infra/main.tf.backup-pre-module
  • Step 7: Vytvořit infra/examples/minimal/main.tf

Write:

# Minimal example: single-VM Agnes deploy.
# Pro OSS self-hoster, co nechce ani persistent disk ani dev VM.
terraform {
  required_version = ">= 1.5"
  required_providers {
    google = { source = "hashicorp/google", version = "~> 5.0" }
  }
}

provider "google" {
  project = var.gcp_project_id
  region  = "europe-west1"
}

variable "gcp_project_id" {
  type = string
}

module "agnes" {
  source = "../../modules/customer-instance"

  gcp_project_id   = var.gcp_project_id
  customer_name    = "self-hosted"
  seed_admin_email = "admin@example.com"

  prod_instance = {
    name         = "agnes"
    data_disk_gb = 30
  }

  dev_instances = []

  data_source = "keboola"
}

output "agnes_ip" {
  value = module.agnes.prod_ip
}
  • Step 8: Smazat infra/variables.tf, infra/outputs.tf, infra/terraform.tfvars.example (už patří do modulu / examples)
# Backup si udělat
mv infra/variables.tf infra/variables.tf.backup-pre-module
mv infra/outputs.tf infra/outputs.tf.backup-pre-module
mv infra/terraform.tfvars.example infra/terraform.tfvars.example.backup-pre-module
  • Step 9: terraform init + validate v example
cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss/infra/examples/minimal"
terraform init -backend=false
terraform validate

Expected: Success! The configuration is valid.

  • Step 10: Commit
cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss"
git add infra/modules/ infra/examples/ 
git add -u infra/  # pro mv backupy
git commit -m "infra: extract customer-instance Terraform module; add minimal example"

Task 2.2: Tag prvního release TF modulu

  • Step 1: Otevřít PR z feature branch do main
git push origin feature/v2-fastapi-duckdb-docker-cli
gh pr create --title "feat: multi-customer deployment (Fáze 1-2)" \
    --body "Implements Phases 1-2 of docs/superpowers/plans/2026-04-21-multi-customer-deployment.md"
  • Step 2: Po mergi do main vytvořit tag infra-v1.0.0
git checkout main
git pull
git tag -a infra-v1.0.0 -m "Initial customer-instance module release"
git push origin infra-v1.0.0

Task 2.3: Založit privátní repo keboola/agnes-infra-keboola (manuálně)

Tohle je krok mimo tento repo. Plán jen popisuje.

  • Step 1: Vytvořit prázdný privátní repo
gh repo create keboola/agnes-infra-keboola --private --description "Agnes deployment — Keboola internal instance"
  • Step 2: Klonovat lokálně vedle tohohle repa
cd ~/Library/Mobile\ Documents/com\~apple\~CloudDocs/Sources/VsCode/component_factory/
gh repo clone keboola/agnes-infra-keboola
cd agnes-infra-keboola
  • Step 3: Vytvořit strukturu
mkdir -p terraform .github/workflows config

# Terraform root
cat > terraform/main.tf <<'EOF'
terraform {
  required_version = ">= 1.5"
  required_providers {
    google = { source = "hashicorp/google", version = "~> 5.0" }
  }
  backend "gcs" {
    bucket = "agnes-internal-prod-tfstate"
    prefix = "keboola"
  }
}

provider "google" {
  project = var.gcp_project_id
  region  = var.region
  zone    = var.zone
}

module "agnes" {
  source = "github.com/keboola/agnes-the-ai-analyst//infra/modules/customer-instance?ref=infra-v1.0.0"

  gcp_project_id    = var.gcp_project_id
  region            = var.region
  zone              = var.zone
  customer_name     = "keboola"
  seed_admin_email  = var.seed_admin_email
  data_source       = "keboola"
  keboola_stack_url = var.keboola_stack_url
  prod_instance     = var.prod_instance
  dev_instances     = var.dev_instances
}

output "prod_ip" { value = module.agnes.prod_ip }
output "instance_ips" { value = module.agnes.instance_ips }
EOF

cat > terraform/variables.tf <<'EOF'
variable "gcp_project_id"    { type = string }
variable "region"            { type = string, default = "europe-west1" }
variable "zone"              { type = string, default = "europe-west1-b" }
variable "seed_admin_email"  { type = string }
variable "keboola_stack_url" { type = string }
variable "prod_instance"     { type = any }
variable "dev_instances"     { type = any, default = [] }
EOF

cat > terraform/terraform.tfvars.example <<'EOF'
gcp_project_id    = "internal-prod"
seed_admin_email  = "zdenek.srotyr@keboola.com"
keboola_stack_url = "https://connection.us-east4.gcp.keboola.com/"

prod_instance = {
  name         = "agnes-prod"
  machine_type = "e2-small"
  data_disk_gb = 50
  image_tag    = "stable"
  upgrade_mode = "auto"
  tls_mode     = "caddy"
  domain       = ""
}

dev_instances = [
  { name = "agnes-dev", image_tag = "dev" }
]
EOF

cat > terraform/.gitignore <<'EOF'
terraform.tfvars
*.tfstate
*.tfstate.*
.terraform/
.terraform.lock.hcl
EOF

cp terraform/terraform.tfvars.example terraform/terraform.tfvars
# Edit terraform.tfvars on real values if they differ
  • Step 4: Initial commit
git add .
git commit -m "initial: Keboola-as-customer Agnes deployment"
git push -u origin main
  • Step 5: Uploadnout GCP_SA_KEY jako GitHub secret
# Klíč vytvořený v Task 1.2 step 3
gh secret set GCP_SA_KEY --repo keboola/agnes-infra-keboola \
    < ../tmp_oss/agnes-deploy-internal-prod-key.json

Poznámka: Pokud klíč ne už smazal, re-generate: gcloud iam service-accounts keys create ....

  • Step 6: První terraform init + plan (lokálně, abychom viděli diff)
cd terraform
export GOOGLE_APPLICATION_CREDENTIALS="../agnes-deploy-key.json"
terraform init
terraform plan

Expected: Plan: N to add, 0 to change, 0 to destroy. (N ~ 15-20 resources)

Zkontrolovat plán: žádné destroy na existujících data-analyst / data-analyst-dev (to teprve poté, co bude nové nahoře).

Task 2.4: Migrace dat ze starých VMs na nové (bez downtime risku)

Strategy: Zachovat staré VMs běžící. Terraform vytvoří nové VMs s jinými jmény (agnes-prod, agnes-dev). Data se zkopírují. Poté přepneme DNS/IP (nebo jen komunikujeme novou IP) a staré VMs smažeme.

  • Step 1: Snapshot starého /data

Už máme z Task 0.2. Pokud je snapshot starší než 24 h, udělat nový:

gcloud compute disks snapshot data-analyst \
    --zone=europe-west1-b \
    --snapshot-names=data-analyst-migration-$(date +%Y%m%d-%H%M) \
    --project=internal-prod
  • Step 2: Terraform apply — vytvoří nové VMs (agnes-prod, agnes-dev) vedle starých
cd ~/.../agnes-infra-keboola/terraform
terraform apply
# Type 'yes' to confirm

Expected: ~15-20 resources created, ~5 min. Outputs: prod_ip, instance_ips.

  • Step 3: Zkopírovat data ze starého boot-disku na nový persistent disk

Nové VMs mají prázdný /data. Musíme do něj nakopírovat stav z data-analyst VM.

Nejjednodušší cesta: rsync mezi VM přes SSH.

# SSH na nové prod VM
NEW_PROD_IP=$(cd ~/.../agnes-infra-keboola/terraform && terraform output -raw prod_ip)

# Zkopírovat SSH klíč na starou VM, aby mohla mít přístup na novou
# (nebo použít oslogin → další prerekvizita)

# Alternativa: udělat z druhé strany — SSH na starou VM, rsync na novou
gcloud compute ssh data-analyst --zone=europe-west1-b --project=internal-prod --command="sudo docker compose -f /home/deploy/app/docker-compose.yml -f /home/deploy/app/docker-compose.prod.yml down"

# Rsync přes gcloud compute scp recursive (funguje jen z lokálu)
gcloud compute scp --recurse --zone=europe-west1-b --project=internal-prod \
    data-analyst:/home/deploy/app/data-volume/ \
    agnes-prod:/data/

# Spustit app na nové VM znovu
gcloud compute ssh agnes-prod --zone=europe-west1-b --project=internal-prod --command="sudo docker compose -f /opt/agnes/docker-compose.yml -f /opt/agnes/docker-compose.prod.yml restart"

Alternativně (čistěji): restore ze snapshotu přes gcloud compute disks create --source-snapshot, pak attach místo prázdného data disku.

  • Step 4: Ověřit nový prod
NEW_PROD_IP=$(cd ~/.../agnes-infra-keboola/terraform && terraform output -raw prod_ip)
curl -s --max-time 10 "http://$NEW_PROD_IP:8000/api/health" | python3 -m json.tool | head -10

Expected: healthy / degraded, tables visible.

  • Step 5: Ověřit login na novém prod
curl -sS -X POST "http://$NEW_PROD_IP:8000/auth/password/login" \
  -H "Content-Type: application/json" \
  -d '{"email":"zdenek.srotyr@keboola.com","password":"<nové silné heslo z Task 1.10>"}' \
  | python3 -c "import sys,json;print('OK' if json.load(sys.stdin).get('role')=='admin' else 'FAIL')"

Expected: OK

  • Step 6: Zopakovat pro dev VM (agnes-dev)

Stejné kroky 1-5.

  • Step 7: Vypnout staré VMs (zatím NEmazat — jen stop)
gcloud compute instances stop data-analyst --zone=europe-west1-b --project=internal-prod
gcloud compute instances stop data-analyst-dev --zone=europe-west1-b --project=internal-prod
  • Step 8: Ověřit, že nový prod běží minimálně 24 h bez problému
# Poznámka v kalendáři / Slacku: "check agnes-prod health in 24h"
curl -s "http://$NEW_PROD_IP:8000/api/health" | python3 -m json.tool
  • Step 9: Po 24h stability smazat staré VMs + jejich disky + statické IP
gcloud compute instances delete data-analyst --zone=europe-west1-b --project=internal-prod --quiet
gcloud compute instances delete data-analyst-dev --zone=europe-west1-b --project=internal-prod --quiet

gcloud compute disks delete data-analyst --zone=europe-west1-b --project=internal-prod --quiet 2>&1 || true
gcloud compute disks delete data-analyst-dev --zone=europe-west1-b --project=internal-prod --quiet 2>&1 || true

gcloud compute addresses delete data-analyst-ip --region=europe-west1 --project=internal-prod --quiet 2>&1 || true
  • Step 10: Checkpoint — Fáze 2 hotová

Checklist:

  • Terraform modul v infra/modules/customer-instance/
  • keboola/agnes-infra-keboola privátní repo existuje, terraform apply funguje
  • Prod VM agnes-prod běží s persistent diskem
  • Dev VM agnes-dev běží
  • Data zmigrovaná, login funguje
  • Staré VMs smazané, projekt vyčištěný

Po Fázi 2 lze pokračovat paralelně Fázemi 3, 4, 5.


Fáze 3 — TLS přes Caddy

Goal fáze: Agnes je dostupná na HTTPS s automatickým Let's Encrypt certifikátem. Cookie secure=True funguje.

Task 3.1: Přidat Caddy service do docker-compose

Files:

  • Create: Caddyfile (v public repu root)

  • Modify: docker-compose.prod.yml (přidat caddy service)

  • Step 1: Vytvořit Caddyfile

Write Caddyfile:

# Agnes reverse proxy with automatic Let's Encrypt.
# Config přes ENV vars: AGNES_DOMAIN, ACME_EMAIL.

{$AGNES_DOMAIN} {
    # Health check endpoint bez TLS redirect (pro smoke testy interně)
    @health path /api/health
    
    encode gzip
    
    reverse_proxy app:8000 {
        header_up X-Forwarded-Proto https
    }
    
    tls {$ACME_EMAIL}
    
    log {
        output stdout
        format json
    }
}

# Fallback pro IP access (bez HTTPS, bez cert)
:80 {
    reverse_proxy app:8000
}
  • Step 2: Přidat caddy do docker-compose.prod.yml

Add to services (pokud už tam není):

  caddy:
    image: caddy:2-alpine
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile:ro
      - caddy_data:/data
      - caddy_config:/config
    environment:
      AGNES_DOMAIN: ${AGNES_DOMAIN:-:80}
      ACME_EMAIL: ${ACME_EMAIL:-admin@example.com}
    depends_on:
      - app
    profiles:
      - tls   # nezapne se bez --profile tls

volumes:
  caddy_data:
  caddy_config:
  • Step 3: Aktualizovat modul — předat tls_mode do startup-script

V infra/modules/customer-instance/startup-script.sh najít sekci # --- 5. Start Agnes --- a rozšířit:

# --- 5. Start Agnes ---
COMPOSE_PROFILES=""
if [ "$TLS_MODE" = "caddy" ] && [ -n "$DOMAIN" ]; then
    COMPOSE_PROFILES="--profile tls"
    # Další ENV pro Caddy
    {
        echo "AGNES_DOMAIN=$DOMAIN"
        echo "ACME_EMAIL=admin@$${DOMAIN#*.}"
    } >> "$APP_DIR/.env"
fi

docker compose -f docker-compose.yml -f docker-compose.prod.yml $COMPOSE_PROFILES pull
docker compose -f docker-compose.yml -f docker-compose.prod.yml $COMPOSE_PROFILES up -d
  • Step 4: Commit changes v public repu
cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss"
git add Caddyfile docker-compose.prod.yml infra/modules/customer-instance/startup-script.sh
git commit -m "feat(tls): add Caddy reverse proxy with Let's Encrypt support"
  • Step 5: Tag nového releasu modulu
# Po mergi PR do main
git checkout main && git pull
git tag -a infra-v1.1.0 -m "Add TLS support via Caddy"
git push origin infra-v1.1.0

Task 3.2: Zapnout TLS pro Keboola instanci

Tohle vyžaduje DNS záznam. Pokud nemáš doménu, skip a zůstaň na :8000.

  • Step 1: V keboola/agnes-infra-keboola/terraform/terraform.tfvars nastavit doménu

Pokud máme agnes.keboola.com (ověřit u IT), edit:

prod_instance = {
  name     = "agnes-prod"
  # ...
  tls_mode = "caddy"
  domain   = "agnes.keboola.com"
}

A v main.tf bumpnout module ref:

source = "github.com/keboola/agnes-the-ai-analyst//infra/modules/customer-instance?ref=infra-v1.1.0"
  • Step 2: Terraform apply
cd ~/.../agnes-infra-keboola/terraform
terraform apply
  • Step 3: Nastavit DNS A record agnes.keboola.com → prod_ip

Ruční krok (potřebuje přístup do Keboola DNS). Výstup prod_ip je IP.

  • Step 4: Počkat na DNS propagation + LE cert
until nslookup agnes.keboola.com | grep -q "$(terraform output -raw prod_ip)"; do sleep 30; done
sleep 60  # čas na LE cert issuance
curl -sSI --max-time 10 https://agnes.keboola.com | head -5

Expected: HTTP/2 200 (ne 301, ne TLS error).


Fáze 4 — Watchtower (dev VM auto-deploy), OS Login, VM SA

Goal fáze: Dev VMs auto-pullují nové image. OS Login pro SSH (bez osobního klíče). Dedikovaný VM SA.

Task 4.1: Watchtower integrace (už v Task 2 startup-script, zde jen ověření)

  • Step 1: SSH na dev VM a ověřit, že watchtower běží
gcloud compute ssh agnes-dev --zone=europe-west1-b --project=internal-prod --command="sudo docker ps | grep watchtower"

Expected: container watchtower STATUS Up X minutes.

  • Step 2: Otestovat auto-deploy: pushnout drobnou změnu na feature branch, počkat
# V public repu
cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss"
git checkout -b feature/watchtower-test
echo "# test" >> README.md
git add README.md
git commit -m "test: trigger :dev image rebuild"
git push origin feature/watchtower-test

Počkat ~ 5-10 min (CI build + watchtower poll interval 5 min).

# Kontrola image sha na dev VM
gcloud compute ssh agnes-dev --zone=europe-west1-b --project=internal-prod \
    --command="sudo docker inspect app-app-1 --format '{{.Image}}' && sudo docker image inspect \$(sudo docker inspect app-app-1 --format '{{.Image}}') --format '{{.Created}}'"

Expected: Created timestamp v posledních ~ 10 minutách.

Task 4.2: OS Login

  • Step 1: Ověřit, že modul nastavuje enable-oslogin=TRUE

Už je v infra/modules/customer-instance/main.tf:

metadata = {
  enable-oslogin = "TRUE"
}
  • Step 2: Zkontrolovat, že uživatelé mají roles/compute.osAdminLogin na projektu
gcloud projects get-iam-policy internal-prod \
    --flatten="bindings[].members" \
    --filter="bindings.role=roles/compute.osAdminLogin" \
    --format="value(bindings.members)"

Pokud prázdné, přidat:

gcloud projects add-iam-policy-binding internal-prod \
    --member=user:zdenek.srotyr@keboola.com \
    --role=roles/compute.osAdminLogin
  • Step 3: Test SSH přes OS Login
gcloud compute ssh agnes-prod --zone=europe-west1-b --project=internal-prod --command="whoami"

Expected: username ve formátu zdenek_srotyr_keboola_com (OS Login generated).

Task 4.3: VM SA už má správný scope (ověřit)

  • Step 1: Ověřit, že VM SA má jen secretmanager.secretAccessor
gcloud projects get-iam-policy internal-prod \
    --flatten="bindings[].members" \
    --filter="bindings.members:agnes-keboola-vm@" \
    --format="value(bindings.role)"

Expected: roles/secretmanager.secretAccessor (jen tohle).


Fáze 5 — CI/CD v privátním infra repu

Goal fáze: PR v keboola/agnes-infra-keboola spustí terraform plan; merge → terraform apply. Prod aplikuje přes environment protection s reviewerem.

Task 5.1: plan.yml workflow

Files (v keboola/agnes-infra-keboola repu):

  • Create: .github/workflows/plan.yml

  • Step 1: Napsat plan.yml

name: Terraform Plan

on:
  pull_request:
    paths:
      - 'terraform/**'

permissions:
  contents: read
  pull-requests: write

jobs:
  plan:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: terraform
    steps:
      - uses: actions/checkout@v5

      - uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ~1.7

      - run: terraform init
      - run: terraform fmt -check
      - id: plan
        run: |
          terraform plan -no-color -out=tfplan 2>&1 | tee plan.txt
          echo "status=$(echo $? )" >> $GITHUB_OUTPUT          

      - uses: actions/github-script@v7
        if: always()
        with:
          script: |
            const fs = require('fs');
            const plan = fs.readFileSync('terraform/plan.txt', 'utf8').slice(0, 60000);
            const body = `### Terraform plan\n\n\`\`\`\n${plan}\n\`\`\``;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: body
            });            
  • Step 2: Commit
cd ~/.../agnes-infra-keboola
git add .github/workflows/plan.yml
git commit -m "ci: add terraform plan on PR"
git push

Task 5.2: apply.yml workflow s environment protection

Files:

  • Create: .github/workflows/apply.yml

  • Step 1: Napsat apply.yml

name: Terraform Apply

on:
  push:
    branches: [main]
    paths:
      - 'terraform/**'
  workflow_dispatch: {}

permissions:
  contents: read

jobs:
  apply-dev:
    runs-on: ubuntu-latest
    environment: dev     # no protection
    defaults:
      run:
        working-directory: terraform
    steps:
      - uses: actions/checkout@v5
      - uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ~1.7
      - run: terraform init
      - run: terraform apply -auto-approve -target='module.agnes.google_compute_instance.vm["agnes-dev"]'

  apply-prod:
    needs: apply-dev
    runs-on: ubuntu-latest
    environment: prod    # protected — requires reviewer
    defaults:
      run:
        working-directory: terraform
    steps:
      - uses: actions/checkout@v5
      - uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ~1.7
      - run: terraform init
      - run: terraform apply -auto-approve
      
      - name: Smoke test
        run: |
          PROD_IP=$(terraform output -raw prod_ip)
          for i in 1 2 3 4 5; do
            if curl -sf "http://$PROD_IP:8000/api/health" >/dev/null; then
              echo "Healthy"; exit 0
            fi
            sleep 15
          done
          echo "Health check failed"; exit 1          
  • Step 2: V GitHub UI nastavit environmenty

Navigovat do keboola/agnes-infra-keboola → Settings → Environments → New environment:

  • dev: žádná protection

  • prod:

    • Required reviewers: @ZdenekSrotyr (nebo @keboola-ops-team)
    • Wait timer: 5 min
    • Deployment branches: Selected branches → main
  • Step 3: Commit workflow

git add .github/workflows/apply.yml
git commit -m "ci: add terraform apply with dev/prod environments and smoke test"
git push
  • Step 4: Test flow — otevřít dummy PR, sledovat plan, merge, apply
git checkout -b test/ci-flow
# trivial edit in tfvars, např. přidat dev VM
echo "# ci flow test" >> terraform/README.md
git add terraform/README.md
git commit -m "test: CI flow"
git push origin test/ci-flow
gh pr create --title "test: CI flow" --body "Testing plan → apply flow"

V PR:

  1. Počkat na plan.yml → komentář s plánem
  2. Schválit + merge
  3. Sledovat apply-dev (auto), pak apply-prod (čeká na reviewera)
  4. Schválit prod deploy
  5. Ověřit smoke test PASS

Task 5.3: Rotovat SA key (z lokálního -> jen v GH secret)

  • Step 1: Smazat lokální SA key
rm ~/.../agnes-deploy-internal-prod-key.json
  • Step 2: Na GCP smazat starý klíč (key rotation)
# Seznam klíčů
gcloud iam service-accounts keys list \
    --iam-account=agnes-deploy@internal-prod.iam.gserviceaccount.com \
    --project=internal-prod

Po ověření, že GH Actions s novým klíčem funguje (po úspěšném prvním apply), smazat starý.


Fáze 6 — Template repo + onboarding playbook

Goal fáze: Druhý zákazník (another-customer) se dá nasadit za < 1 hodinu.

Task 6.1: Vytvořit keboola/agnes-infra-template

  • Step 1: Založit prázdný repo jako template
gh repo create keboola/agnes-infra-template --public --description "Template for Agnes per-customer infrastructure" -c
cd ~/Library/Mobile\ Documents/com\~apple\~CloudDocs/Sources/VsCode/component_factory/
gh repo clone keboola/agnes-infra-template
cd agnes-infra-template
  • Step 2: Zkopírovat strukturu z agnes-infra-keboola, nahradit konkrétní hodnoty placeholdery
# Zkopírovat strukturu
cp -r ../agnes-infra-keboola/terraform .
cp -r ../agnes-infra-keboola/.github .

# Reset konkrétní hodnoty
cat > terraform/main.tf <<'EOF'
terraform {
  required_version = ">= 1.5"
  required_providers {
    google = { source = "hashicorp/google", version = "~> 5.0" }
  }
  backend "gcs" {
    bucket = "REPLACE_WITH_YOUR_BUCKET"
    prefix = "REPLACE_WITH_CUSTOMER_NAME"
  }
}

provider "google" {
  project = var.gcp_project_id
  region  = var.region
  zone    = var.zone
}

module "agnes" {
  source = "github.com/keboola/agnes-the-ai-analyst//infra/modules/customer-instance?ref=infra-v1.1.0"

  gcp_project_id    = var.gcp_project_id
  region            = var.region
  zone              = var.zone
  customer_name     = var.customer_name
  seed_admin_email  = var.seed_admin_email
  data_source       = var.data_source
  keboola_stack_url = var.keboola_stack_url
  prod_instance     = var.prod_instance
  dev_instances     = var.dev_instances
}

output "prod_ip"      { value = module.agnes.prod_ip }
output "instance_ips" { value = module.agnes.instance_ips }
EOF

cat > terraform/variables.tf <<'EOF'
variable "gcp_project_id"    { type = string }
variable "region"            { type = string, default = "europe-west1" }
variable "zone"              { type = string, default = "europe-west1-b" }
variable "customer_name"     { type = string }
variable "seed_admin_email"  { type = string }
variable "data_source"       { type = string, default = "keboola" }
variable "keboola_stack_url" { type = string, default = "" }
variable "prod_instance"     { type = any }
variable "dev_instances"     { type = any, default = [] }
EOF

cat > terraform/terraform.tfvars.example <<'EOF'
# Kopie tohoto souboru → terraform.tfvars, vyplnit hodnoty.
# terraform.tfvars je gitignored (nikdy necommitovat!)

gcp_project_id    = "REPLACE"             # Váš GCP projekt
customer_name     = "REPLACE"             # Krátký identifikátor, např. "acme"
seed_admin_email  = "admin@example.com"
data_source       = "keboola"             # keboola | bigquery | csv
keboola_stack_url = "https://connection.keboola.com/"

prod_instance = {
  name         = "agnes-prod"
  machine_type = "e2-small"
  data_disk_gb = 50
  image_tag    = "stable"
  upgrade_mode = "auto"
  tls_mode     = "caddy"
  domain       = ""
}

dev_instances = [
  { name = "agnes-dev", image_tag = "dev" }
]
EOF
  • Step 3: Zkopírovat bootstrap skript z public repa
cp ../tmp_oss/scripts/bootstrap-gcp.sh .
  • Step 4: Napsat README.md pro onboarding

Write:

# Agnes Infrastructure Template

Deploy Agnes (AI Data Analyst) into your own GCP project.

## Prerequisites

- GCP project with billing enabled
- `gcloud` CLI authenticated as project Owner
- `terraform` >= 1.5
- GitHub account (for private repo + Actions)

## 1. Bootstrap GCP

```bash
./bootstrap-gcp.sh <YOUR_GCP_PROJECT_ID>

Výstup: SA key JSON.

2. Klonovat template

gh repo create <YOUR_ORG>/agnes-infra --template keboola/agnes-infra-template --private
cd agnes-infra

3. Nastavit secrets

# SA key (z kroku 1)
gh secret set GCP_SA_KEY < path/to/key.json
rm path/to/key.json

# Keboola token (pokud data_source = keboola)
gcloud secrets create keboola-storage-token --data-file=- <<< "YOUR_TOKEN"

4. Konfigurace

Editovat terraform/main.tf — aktualizovat backend.bucket a backend.prefix.

Kopírovat terraform/terraform.tfvars.exampleterraform/terraform.tfvars, vyplnit.

5. První apply

cd terraform
terraform init
terraform plan
terraform apply

IP prod VM je v outputu.

6. Login

# Bootstrap prvního admin usera
curl -X POST http://$(terraform output -raw prod_ip):8000/auth/bootstrap \
    -H "Content-Type: application/json" \
    -d '{"email": "YOU@example.com", "password": "YOUR_STRONG_PASSWORD"}'

Otevřít http://<prod_ip>:8000/login.

7. Upgrade workflow

  • :stable image → auto-upgrade přes Watchtower
  • Infra změna: PR v tomto repu → terraform plan v PR → merge → apply (prod vyžaduje reviewer)
  • TF modul upgrade: Renovate otevře PR s novým ref=infra-vX.Y.Z

Další detaily: https://github.com/keboola/agnes-the-ai-analyst/blob/main/docs/ONBOARDING.md


- [ ] **Step 5: Vytvořit README + push + mark as template**

```bash
git add .
git commit -m "initial template"
git push -u origin main
gh repo edit keboola/agnes-infra-template --template

Task 6.2: Napsat ONBOARDING.md v public repu

Files:

  • Create: docs/ONBOARDING.md (v public repu)

  • Step 1: Napsat ONBOARDING.md

Write docs/ONBOARDING.md obsah identický s README v template repu + poznámkou "fyzická šablona: keboola/agnes-infra-template".

  • Step 2: Commit
cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss"
git add docs/ONBOARDING.md
git commit -m "docs: onboarding guide for deploying Agnes per customer"

Task 6.3: Vyzkoušet onboarding na dummy customer (sanity check)

  • Step 1: Vytvořit testovací GCP projekt
gcloud projects create agnes-onboarding-test-$(date +%s) --name="Agnes onboarding test"
# Link billing (via UI) if required
  • Step 2: Spustit bootstrap
./scripts/bootstrap-gcp.sh <test-project-id>
  • Step 3: Klonovat template do dummy repa
gh repo create zdeneksrotyr/agnes-infra-test --template keboola/agnes-infra-template --private
gh repo clone zdeneksrotyr/agnes-infra-test
cd agnes-infra-test
  • Step 4: Projít README krok za krokem a změřit čas

Cíl: end-to-end < 1 hod. Zaznamenat překážky, zpět do README.

  • Step 5: Cleanup — smazat test projekt
gcloud projects delete <test-project-id>
gh repo delete zdeneksrotyr/agnes-infra-test --yes

Task 6.4: Renovate configuration

  • Step 1: Přidat renovate.json do template repa

Write keboola/agnes-infra-template/renovate.json:

{
  "$schema": "https://docs.renovatebot.com/renovate-schema.json",
  "extends": ["config:base"],
  "customManagers": [
    {
      "customType": "regex",
      "fileMatch": ["\\.tf$"],
      "matchStrings": [
        "source\\s*=\\s*\"github\\.com/keboola/agnes-the-ai-analyst//infra/modules/customer-instance\\?ref=(?<currentValue>infra-v\\d+\\.\\d+\\.\\d+)\""
      ],
      "datasourceTemplate": "github-releases",
      "depNameTemplate": "keboola/agnes-the-ai-analyst",
      "packageNameTemplate": "keboola/agnes-the-ai-analyst",
      "versioningTemplate": "regex:^infra-v(?<major>\\d+)\\.(?<minor>\\d+)\\.(?<patch>\\d+)$"
    }
  ],
  "packageRules": [
    {
      "matchPackageNames": ["keboola/agnes-the-ai-analyst"],
      "matchUpdateTypes": ["major"],
      "prPriority": 10
    }
  ]
}
  • Step 2: Instalovat Renovate GitHub App na privátní repa

Ruční krok v GitHub: Settings → Integrations → Renovate → grant access.


Finální checkpoint

  • Fáze 1 complete — prod běží z :stable image, žádný git pull z forku
  • Fáze 2 complete — TF modul, PD, Keboola nasazena přes modul
  • Fáze 3 complete — HTTPS funguje (pokud DNS dostupné)
  • Fáze 4 complete — watchtower na dev VM auto-pulluje :dev, OS Login aktivní
  • Fáze 5 complete — GHA CI/CD funguje, prod apply vyžaduje review
  • Fáze 6 complete — template repo existuje, ONBOARDING.md, Renovate nakonfigurovaný
  • Starý osobní fork smazán
  • Keboola token rotován a v Secret Manageru
  • Dokumentace aktualizovaná

Self-Review

Spec coverage:

  • §2 Model self-deploy → Task 1.2 (bootstrap), Task 2.3 (private repo), Task 6 (template)
  • §3 Repo architektura → Task 2.1 (modul), Task 6.1 (template), Task 2.3 (customer repo)
  • §4 Release model → Task 1.1 (per-branch tagging), existuje release.yml
  • §5 Branch-aware dev → Task 2.1 (dev_instances proměnná), Task 4.1 (watchtower)
  • §6 Prod upgrade model → Task 4.1 (auto via watchtower), pinned mode přes tfvars (zákazník zvolí)
  • §7 Security → Task 1.2-1.4 (Secret Manager, SA), Task 4.2 (OS Login), Task 5.2 (env protection)
  • §8 Onboarding → Task 6.1-6.4
  • §9 Tok změn → Task 5.1-5.2 (plan/apply), Task 4.1 (watchtower pipeline)
  • §10 Backup/monitoring → částečně; monitoring je follow-up (§14)

Placeholder scan: Všechny kódy, konfigurace, příkazy jsou konkrétní.

Type consistency: prod_instance object a dev_instances list mají konzistentní schéma napříč Task 2.1, Task 2.3, Task 6.1.

Gap: Zákazníkem-zvolený pinned upgrade režim (§6.1) spouští Renovate — Renovate konfigurace je v Task 6.4, ale nepokrývá upgrade image tagu (jen modul ref). Follow-up: rozšířit customManagers v renovate.json na image_tag v tfvars.