Merge feature/multi-customer-deployment: multi-customer deployment infra

- infra/modules/customer-instance/ — reusable Terraform module (tag infra-v1.0.0)
- infra/examples/minimal/ — OSS self-host quickstart
- scripts/bootstrap-gcp.sh — per-customer GCP setup
- scripts/fetch-env-from-secrets.sh — VM-side .env from Secret Manager
- docker-compose.prod.yml — bind data volume to host /data for persistent disks
- docs/superpowers/specs/2026-04-21-multi-customer-deployment-spec.md
- docs/superpowers/plans/2026-04-21-multi-customer-deployment.md
- docs/superpowers/plans/2026-04-21-deployment-log.md
ZdenekSrotyr 2026-04-21 16:43:06 +02:00
commit 94b6a8eff2
15 changed files with 3136 additions and 308 deletions

.gitignore

@@ -141,3 +141,8 @@ docs/AGENT-REPORTS/
docs/ZS_PADAK_*
.github/workflows/ci.yml
/auth/
/tmp/
# GCP service account keys — never commit
*-key.json
/agnes-deploy-*.json

@@ -1,4 +1,7 @@
# Production override — uses pre-built GHCR image instead of local build.
# Production override — uses pre-built GHCR image instead of local build,
# and binds the `data` volume to /data on the host (so persistent-disk mounts
# at /data are used by all services).
#
# Usage: docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d
# Override tag: AGNES_TAG=stable-2026.04.3 docker compose -f ... up -d
services:
@@ -16,3 +19,15 @@ services:
image: ghcr.io/keboola/agnes-the-ai-analyst:${AGNES_TAG:-stable}
session-collector:
image: ghcr.io/keboola/agnes-the-ai-analyst:${AGNES_TAG:-stable}
# Override the `data` named volume to bind-mount /data from the host.
# This ensures a persistent disk mounted at /data (by Terraform startup
# script) is the actual backing store, not a Docker-managed volume on the
# boot disk.
volumes:
  data:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /data


@@ -0,0 +1,442 @@
# Multi-Customer Deployment: Design Spec
Date: 2026-04-21
Status: Proposal for implementation
Author: Zdeněk Šrotýř + Claude (sparring)
## 1. Goal
Introduce a *production-grade* deployment of Agnes that:
1. Keeps the **upstream repo public** (no customer information in it).
2. Allows **N customers in parallel**, each in an isolated space.
3. Is **anonymized**: one customer cannot see the existence or identity of the others.
4. Has **auto-deploy with sensible gates**: a feature branch push updates the dev VM within minutes; a merge to main goes to prod behind a review gate.
5. Supports **branch-aware dev environments**: several developers in parallel, each on their own branch, without interference.
6. **Scales at O(1) effort per customer**: adding GRPN next to Keboola means only cloning the template, not changing upstream.
## 2. Model: Pure Self-Deploy
### 2.1 Roles
| Party | What it does |
|---|---|
| **Keboola as upstream** | Maintains the app code, builds and pushes the Docker image to GHCR, maintains the TF module, maintains the infra template |
| **Customer (incl. Keboola-as-customer)** | Owns the GCP project, its own private infra repo, and its own CI/CD; manages its VMs; bears the costs |
Keboola as upstream **has no access to customer GCP projects**. The customer is responsible for its own deployment.
Keboola's internal production Agnes instance is a **special case of a customer**: Keboola IT owns the `kids-ai-data-analysis` GCP project and manages its Agnes there the same way GRPN will in its own GCP.
### 2.2 Future extensions (out of scope for this wave)
- **AWS support**: the TF module is GCP-specific today. Once the first AWS customer arrives, we will add a parallel module `modules/customer-instance-aws/`.
- **Managed service**: Keboola will offer "we deploy it for you", which means adding Keboola as an operator role with IAM delegation into the customer's GCP. The design in this spec is compatible; it only requires an extra layer of IAM bindings.
## 3. Repo architecture
### 3.1 Number and type of repositories
```
keboola/agnes-the-ai-analyst    PUBLIC    App + TF module + documentation
keboola/agnes-infra-template    PUBLIC    Skeleton for a private infra repo (template)
keboola/agnes-infra-keboola     PRIVATE   Keboola-as-customer deployment
{acme}/agnes-infra              PRIVATE   New customer, in their GitHub org, cloned from the template
```
Count: **2 upstream + N per-customer**. The upstream repos are stable; per-customer repos are created during onboarding.
### 3.2 Contents of `keboola/agnes-the-ai-analyst` (public)
```
agnes-the-ai-analyst/
├── app/ src/ connectors/ cli/          # product
├── Dockerfile docker-compose.yml
├── .github/workflows/
│   └── release.yml                     # build + push to GHCR; tags: :dev, :stable, :dev-branch-xyz
├── infra/
│   ├── modules/
│   │   └── customer-instance/          # versioned: tag infra-v1.0, v1.1, ...
│   │       ├── main.tf
│   │       ├── variables.tf
│   │       └── outputs.tf
│   └── examples/
│       └── minimal/                    # quickstart for OSS self-hosters
└── docs/
    ├── DEPLOYMENT.md                   # for self-host (compose, without Terraform)
    ├── ONBOARDING.md                   # for managed (path to TF + template)
    └── architecture.md
```
The **TF module `customer-instance`** is versioned separately with semver (`infra-v1.x`), distinct from the app image (CalVer `YYYY.MM.N`).
### 3.3 Contents of `keboola/agnes-infra-template` (public template)
```
agnes-infra-template/
├── terraform/
│   ├── main.tf                  # module { source = "github.com/keboola/agnes-the-ai-analyst//infra/modules/customer-instance?ref=infra-v1.0" }
│   ├── variables.tf
│   ├── backend.tf               # gcs by default, with a comment on how to switch to s3/remote
│   ├── terraform.tfvars.example
│   └── .gitignore               # terraform.tfvars, *.tfstate
├── .github/workflows/
│   ├── plan.yml                 # PR → terraform plan
│   └── apply.yml                # main → terraform apply
├── config/
│   └── instance.yaml.example
├── bootstrap.sh                 # one-time GCP setup: SA, API enable, bucket, secrets
└── README.md                    # step-by-step onboarding
```
The customer (or Keboola during onboarding) runs `gh repo create --template keboola/agnes-infra-template`, which yields a private repo with the structure already in place.
### 3.4 Contents of a per-customer private repo (e.g. `keboola/agnes-infra-keboola`)
Exactly the same structure as the template, only with concrete values in `terraform.tfvars`:
```hcl
# keboola/agnes-infra-keboola/terraform/terraform.tfvars
# (gitignored, or kept in Secret Manager; see §6)
gcp_project_id = "kids-ai-data-analysis"
region         = "europe-west1"
zone           = "europe-west1-b"
prod_instance = {
  name         = "agnes-prod"
  machine_type = "e2-small"
  image_tag    = "stable"   # floating | "stable-2026.04.N" (pinned)
  upgrade_mode = "auto"     # auto (watchtower) | pinned (Renovate)
  tls_mode     = "caddy"    # caddy | gcp-lb | cloudflare | none
  domain       = ""         # empty = IP only
}
dev_instances = [
  { name = "agnes-dev-default", image_tag = "dev" },
  # add more dev VMs per branch/developer
]
seed_admin_email = "zdenek.srotyr@keboola.com"
# Keboola-specific
data_source             = "keboola"
keboola_stack_url       = "https://connection.us-east4.gcp.keboola.com/"
keboola_token_secret_id = "keboola-storage-token"   # reference into Secret Manager
```
## 4. Release model
### 4.1 Image tagging in GHCR
The public repo CI (release.yml) builds and pushes to `ghcr.io/keboola/agnes-the-ai-analyst` on every push:
| Trigger | Tags produced |
|---|---|
| Push `main` | `:stable`, `:stable-YYYY.MM.N`, `:sha-xxxxxxx` |
| Push `feature/xyz` | `:dev`, `:dev-feature-xyz`, `:sha-xxxxxxx` |
| Push `release/1.2.x` | `:release-1.2.x`, `:release-1.2.x-YYYY.MM.N` |
`:dev` and `:stable` are **floating** tags: they move on every push (for `:stable`, only once the smoke test in §4.3 passes). Versioned tags are **immutable**.
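
The branch-to-tag mapping in the table can be sketched as a small shell helper. This is an illustration only: `branch_to_dev_tag` is a hypothetical name, and the slug rule (lowercase, every run of characters outside `a-z0-9._` becomes a single `-`) is an assumption that reproduces the examples in this spec. It is not taken from `release.yml`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (assumed slug rule, not the actual release.yml logic):
# derive the per-branch dev tag from a Git branch name.
branch_to_dev_tag() {
  local branch="$1" slug
  # Lowercase, collapse every run of characters outside [a-z0-9._] into "-",
  # then trim leading and trailing "-".
  slug=$(printf '%s' "$branch" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/[^a-z0-9._]+/-/g; s/^-+//; s/-+$//')
  printf 'dev-%s\n' "$slug"
}

branch_to_dev_tag "feature/xyz"             # prints dev-feature-xyz
branch_to_dev_tag "feature/carol/new-auth"  # prints dev-feature-carol-new-auth
```

The sketch does not handle the 128-character limit on Docker tags; a real workflow would likely need to truncate long branch names.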
### 4.2 Image visibility
`ghcr.io/keboola/agnes-the-ai-analyst` is a **public image**. Customer VMs pull it without credentials.
Rationale: the code is public, so the image must not contain anything the public code does not. Secrets go into `.env` on the VM, not into the image.
### 4.3 Smoke test
After a push to `main` and tagging `:stable-N`, CI runs a smoke test: `docker compose up` + curl `/api/health` + auth + query. PASS → the floating `:stable` tag moves. FAIL → the build gets a `:deprecated-N` label, `:stable` does not move, and a GitHub issue is opened with the logs.
### 4.4 CalVer + smoke test = continuous release
No manual release decisions. Every merge to main is a release (if the smoke test passes). The `YYYY.MM.N` numbering stands for year.month.sequence.
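
As a rough illustration of the `YYYY.MM.N` sequence, the next tag for a month can be computed from the tags that already exist. The helper name `next_calver_tag` is hypothetical, and the sketch assumes the sequence starts at 1 within each month, which this spec does not state.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compute the next stable-YYYY.MM.N tag.
# $1 = month prefix "YYYY.MM"; existing tags are read from stdin, one per line.
# Assumption: N starts at 1 each month.
next_calver_tag() {
  local month="$1" max=0 tag n
  while IFS= read -r tag; do
    case "$tag" in
      "stable-$month."*)
        n="${tag##*.}"                 # text after the last dot
        case "$n" in
          ''|*[!0-9]*) continue ;;     # skip non-numeric sequence parts
        esac
        n=$((10#$n))                   # force base 10 (guards against "08")
        if [ "$n" -gt "$max" ]; then max="$n"; fi
        ;;
    esac
  done
  printf 'stable-%s.%d\n' "$month" "$((max + 1))"
}

# Tags from other months are ignored.
printf '%s\n' stable-2026.04.1 stable-2026.04.2 stable-2026.03.9 \
  | next_calver_tag 2026.04           # prints stable-2026.04.3
```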
## 5. Branch-aware dev environments
### 5.1 Motivation
Several developers working in parallel need several dev environments without interference. A floating `:dev` alone is not enough: the last push overwrites everyone else's.
### 5.2 Mechanism
Every feature branch push produces a separate `:dev-{branch-slug}` tag in addition to the floating `:dev`.
In the private infra repo, the customer lists dev VMs with a pinned tag:
```hcl
dev_instances = [
  { name = "agnes-dev", image_tag = "dev" },                                  # floating (demo / reviewers)
  { name = "agnes-alice-feat1", image_tag = "dev-feature-alice-dashboard" },  # Alice has her own
  { name = "agnes-bob-pr142", image_tag = "dev-pr-142" },                     # Bob pinned to a PR
]
```
### 5.3 Dev VM lifecycle
```
1. Someone opens a PR in the private infra repo:
   + { name = "agnes-carol", image_tag = "dev-feature-carol-new-auth" }
2. CI plan.yml comments on the PR: "VM agnes-carol will be created (e2-small, europe-west1-b)"
3. Merge → apply.yml runs terraform apply
4. VM is up in ~2 min
5. Watchtower on the VM polls :dev-feature-carol-new-auth every 5 min
6. Every push to feature/carol/new-auth → new image → Watchtower pulls it → the VM runs the current version
7. When Carol finishes the feature (merge to main), she deletes the line in tfvars → terraform apply → VM destroy
```
**No new SA, no new GitHub environment, no extra infra operation.** Just an edit to the list in tfvars.
### 5.4 Ephemeral preview environments (future)
In a later phase, consider automating this: PR opened → GHA creates a per-PR VM; PR closed → destroy. For now the explicit flow through tfvars is enough.
## 6. Prod upgrade model
### 6.1 Two modes (selectable per instance)
| Mode | How | For whom |
|---|---|---|
| **auto** | Watchtower on the VM polls `:stable` (floating) and pulls + restarts when a new digest appears | Default: speed, low-touch |
| **pinned** | `image_tag = "stable-2026.04.7"` in tfvars. Renovate polls GHCR and opens a PR with the bump. Ops approves → merge → apply | Regulated customers, audit trail |
### 6.2 Gate for auto mode
The only protection against a broken `:stable` is the **CI smoke test** that runs before the floating tag moves. If the build passes there, prod auto-upgrades. Recommendation: the Keboola instance should also have **monitoring + an alert on a degraded `/api/health` status**, so that anything the smoke test misses does not go unnoticed for long.
### 6.3 Rollback
Rollback = change `image_tag` to the previous version and run `docker compose up -d`. In short:
- **Auto mode:** quickly switch Watchtower to a specific tag; then investigate
- **Pinned mode:** revert the PR, apply
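
One possible shape of an emergency rollback on the VM is to pin `AGNES_TAG` in the `.env` file that `docker-compose.prod.yml` reads. The helper name `pin_agnes_tag` is hypothetical. The `/opt/agnes` path mirrors the startup script in this commit, and the tag value is only an example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pin AGNES_TAG in an env file to a specific image tag.
pin_agnes_tag() {
  local env_file="$1" tag="$2" tmp
  tmp=$(mktemp)
  # Drop any existing AGNES_TAG line, then append the pinned value.
  grep -v '^AGNES_TAG=' "$env_file" > "$tmp" || true
  printf 'AGNES_TAG=%s\n' "$tag" >> "$tmp"
  # Overwrite in place so the file keeps its existing permissions.
  cat "$tmp" > "$env_file"
  rm -f "$tmp"
}

# On the VM (assumed paths and an example tag):
#   pin_agnes_tag /opt/agnes/.env stable-2026.04.6
#   cd /opt/agnes && docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d
```

Because the startup script rewrites `.env` on every boot, a pin made this way would not survive a reboot; it is a stopgap while the cause is investigated.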
## 7. Security model
### 7.1 Authentication between components
| Who → where | How it authenticates |
|---|---|
| Public CI → GHCR push | `${{ secrets.GITHUB_TOKEN }}` (built-in) |
| VM → GHCR pull | Public image, no auth |
| Private CI → GCP | SA JSON key in the `GCP_SA_KEY` secret (Phase 1); WIF (follow-up phase) |
| CI in the customer's GCP → Secret Manager | SA has `roles/secretmanager.admin` |
| App on the VM → Secret Manager | The VM has a dedicated SA with `roles/secretmanager.secretAccessor` |
| App on the VM → Keboola Storage | Token from Secret Manager |
### 7.2 Deploy SA: scope per customer
The SA `agnes-deploy@<gcp-project>` gets **only** these roles:
```
roles/compute.instanceAdmin.v1 # create/update/delete VMs
roles/compute.securityAdmin # firewall rules
roles/compute.networkAdmin # static IP
roles/iam.serviceAccountUser # attach the VM SA to instances
roles/secretmanager.admin # create/rotate secrets
roles/storage.admin # tfstate bucket
```
No `owner`, no `editor`. Blast radius of a leaked SA key = VMs in this project can be overwritten. Nothing outside the project, no data.
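
The role list above could be applied with a loop over `gcloud projects add-iam-policy-binding`. The sketch below is not the repo's `bootstrap-gcp.sh`. The function name `bind_deploy_roles`, the `DRY_RUN` switch, and the project ID `my-gcp-project` are assumptions made for illustration. By default it only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch: print (or run) the IAM bindings for the deploy SA.
# DRY_RUN=1 (the default here) only prints the gcloud commands.
DEPLOY_ROLES="
roles/compute.instanceAdmin.v1
roles/compute.securityAdmin
roles/compute.networkAdmin
roles/iam.serviceAccountUser
roles/secretmanager.admin
roles/storage.admin
"

bind_deploy_roles() {
  local project="$1" role
  local member="serviceAccount:agnes-deploy@${project}.iam.gserviceaccount.com"
  for role in $DEPLOY_ROLES; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "gcloud projects add-iam-policy-binding $project --member=$member --role=$role"
    else
      gcloud projects add-iam-policy-binding "$project" \
        --member="$member" --role="$role"
    fi
  done
}

bind_deploy_roles my-gcp-project   # hypothetical project ID; prints six commands
```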
### 7.3 GitHub environments
```yaml
environments:
  dev:
    # no protection
    secrets:
      GCP_SA_KEY: <same key>
  prod:
    protection_rules:
      required_reviewers: ["@keboola-ops-team"]
      wait_timer: 5m
      deployment_branches: main
    secrets:
      GCP_SA_KEY: <same key>
```
Both environments share the same SA key (one GCP project, one identity). The difference is **only in the protection rules**: who may push where.
### 7.4 VM hardening
- **OS Login** instead of per-user SSH keys (follow-up)
- **Dedicated VM SA** with minimal permissions (read-only access to Secret Manager, nothing else)
- **Ephemeral disk strategy**: boot disk = product (stateless), `/data` = persistent disk (stateful, snapshots)
- **No token in the startup-script metadata**: all secrets are fetched from Secret Manager only at boot
### 7.5 Secret rotation
| Secret | Where it lives | How it is rotated |
|---|---|---|
| Keboola Storage token | Secret Manager in the customer's GCP | Keboola UI → new version in SM → app restart |
| JWT_SECRET_KEY | Secret Manager, generated by TF | `terraform apply` with `-replace` on `random_password.jwt`, which yields a new value and a new secret version (replacing only `google_secret_manager_secret_version.jwt` would re-upload the same value) |
| SA JSON key | GitHub secret | Generate a new key, paste it into the GH secret, delete the old key in GCP |
| User passwords | Argon2 hash in DuckDB `users` | User-facing flow (reset endpoint, admin CLI) |
## 8. Onboarding a new customer
### 8.1 Steps (target time: < 1 hour)
```
1. The customer (or Keboola ops on their behalf) creates a GCP project + billing
2. Someone with the owner role in the project runs bootstrap.sh:
   - Enables APIs (compute, iam, secretmanager, storage, iamcredentials)
   - Creates the agnes-deploy SA with its roles
   - Generates an SA key (handed to the owner)
   - Creates gs://agnes-{project}-tfstate
3. The customer (or Keboola ops) clones the template:
   gh repo create {org}/agnes-infra --template keboola/agnes-infra-template --private
4. In the new repo:
   - Set the GH secret GCP_SA_KEY (pasted from step 2)
   - Adjust terraform.tfvars to their values
   - Create the initial commit + push
5. Set up the Secret Manager secrets (Keboola token etc.)
6. First PR with tfvars → plan → merge → apply
7. DNS: the customer later points an A record at the IP (or stays on the bare IP)
8. Admin user: bootstrap endpoint POST /auth/bootstrap or the admin CLI
9. Smoke test: login, sync, query
```
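
Step 2 can be pictured as a short list of `gcloud` commands. The sketch below only prints them and makes no claim about what the repo's actual `bootstrap.sh` does. The function name `bootstrap_plan`, the key file name, the default location, and the project ID are assumptions. The role bindings from §7.2 are left out for brevity.

```shell
#!/usr/bin/env bash
# Sketch of onboarding step 2 as a printed command list; nothing is executed.
# Assumed names: bootstrap_plan, agnes-deploy-key.json, default location.
bootstrap_plan() {
  local project="$1" location="${2:-europe-west1}"
  local sa="agnes-deploy@${project}.iam.gserviceaccount.com"
  cat <<EOF
gcloud services enable compute.googleapis.com iam.googleapis.com secretmanager.googleapis.com storage.googleapis.com iamcredentials.googleapis.com --project=${project}
gcloud iam service-accounts create agnes-deploy --project=${project}
gcloud iam service-accounts keys create agnes-deploy-key.json --iam-account=${sa}
gcloud storage buckets create gs://agnes-${project}-tfstate --project=${project} --location=${location}
EOF
}

bootstrap_plan my-gcp-project   # hypothetical project ID
```

The key file name is chosen so that it matches the `*-key.json` pattern added to `.gitignore` in this commit.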
### 8.2 Who can see what
| Role | Sees |
|---|---|
| Anyone on the internet | The public repo `agnes-the-ai-analyst`, its issues, PRs, and the image on GHCR |
| Keboola ops team | The above + the template repo + the infra-keboola repo |
| Customer (acme) | The public items above + their own infra-acme repo in their own org |
| Nobody | Any customer's deployment other than their own |
## 9. Change flow
### 9.1 Change in app code (most common)
```
1. Developer: push a feature branch in the public repo
2. Public CI: build :dev-feature-xyz (and the floating :dev)
3. Watchtower on every VM with image_tag = "dev": pulls within 5 min
   Watchtower on a VM with image_tag = "dev-feature-xyz": pulls as well
4. Dev review
5. Merge to main
6. Public CI: build :stable-YYYY.MM.N (and the floating :stable)
7. Smoke test CI: PASS → :stable moves
8. Prod VMs:
   - auto mode: Watchtower pulls within 5 min
   - pinned mode: Renovate opens a PR in the private repo
```
### 9.2 Change in infra (VM size, dev VM list, new disk)
```
1. Ops opens a PR in the private infra repo
2. CI plan.yml: terraform plan → comment on the PR
3. Review + merge
4. CI apply.yml:
   - for dev changes: environment "dev" → apply without a gate
   - for prod changes: environment "prod" → required reviewer → apply
5. After apply: smoke test via curl /api/health
```
### 9.3 Change in the TF module
```
1. Maintainer opens a PR in the public repo against infra/modules/customer-instance/
2. CI validates the module against examples/
3. Merge → automatic git tag infra-v1.1.0
4. Renovate in every private infra repo:
   → opens a PR "bump source ref to infra-v1.1.0"
5. Each customer approves independently → terraform plan → apply
```
## 10. Operational aspects
### 10.1 Monitoring and alerting (recommended, not in the first wave)
- Cloud Monitoring dashboard per customer
- Alert when `/api/health` reports `status != "healthy"` for longer than 5 min
- Alert when VM CPU > 80 % for longer than 30 min
- Log-based metrics: sync failures, auth failures, HTTP 5xx rate
- Slack/email integration via an Alerting policy
### 10.2 Backup
- Daily snapshots of the `/data` persistent disk, 30-day retention (TF `google_compute_resource_policy`)
- `system.duckdb` holds users/permissions; a snapshot copy is taken on schema migration (already exists as `*.pre-migrate`)
### 10.3 Disaster recovery
- Recreating a VM from scratch = `terraform apply` (~5 min) + restoring `/data` from a snapshot (~5 min)
- Total loss of a customer = the GCP project is destroyed; recreate from snapshot + tfstate
### 10.4 Cost per customer (indicative)
| Item | $/month |
|---|---|
| Prod VM e2-small + 30GB SSD | ~$15 |
| Dev VM e2-small + 30GB SSD | ~$15 |
| Persistent disk (50 GB) | ~$2 |
| Static IP (×2: prod, dev) | ~$5 |
| Snapshots (daily, 30d retention) | ~$2 |
| Secret Manager | ~$0 (within the free tier) |
| **Base total** | **~$40/month** |
Scales linearly with the number of dev VMs.
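
The table's arithmetic can be checked with a few lines of shell. The figures are the rounded estimates from the table. The helper name `monthly_cost_usd` is hypothetical, and the sketch assumes that only the dev VM line grows with each extra dev VM (additional IPs and data disks are ignored).

```shell
#!/usr/bin/env bash
# Hypothetical helper: rough monthly cost in USD for one customer.
# $1 = number of dev VMs. Figures are the table's rounded estimates.
monthly_cost_usd() {
  local dev_vms="$1"
  local prod_vm=15 dev_vm=15 data_disk=2 static_ips=5 snapshots=2
  echo $(( prod_vm + data_disk + static_ips + snapshots + dev_vm * dev_vms ))
}

monthly_cost_usd 1   # prints 39, the table's "~$40" base
monthly_cost_usd 3   # prints 69
```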
## 11. Principles / Non-goals
- ✅ **The public upstream stays public.** Nothing that identifies a customer lives there.
- ✅ **The customer has full control over their deployment.** Including the decision whether to upgrade.
- ✅ **No central Keboola ops infra.** No shared GCP project, no shared state.
- ❌ **It is not multi-tenant** within a single deployment. One `docker compose up` = one customer.
- ❌ **Keboola is not a SaaS host** (at least not now). If a customer wants managed, it is a manually provided service, not a product.
- ❌ **No cross-customer routing.** No shared load balancer, no shared DNS.
## 12. Decisions and questions
All design questions raised during brainstorming are resolved. They are listed here for traceability:
| Question | Decision |
|---|---|
| Managed vs self-deploy | A) Pure self-deploy (may change in Phase 2+ if needed) |
| Central ops repo | No: 1 public + 1 template + N per-customer |
| TF state location | gs:// in the customer's GCP (default); flexible to S3/TFC in the template |
| Template repo name | `keboola/agnes-infra-template` |
| CI auth | SA JSON key in a GH secret (Phase 1); WIF (follow-up) |
| Image visibility | Public on GHCR |
| Prod upgrade mode | Per-instance choice of auto/pinned, default auto |
| TLS | Caddy by default, flexible to gcp-lb/cloudflare |
| DNS | Handled by the customer, default is IP only |
| GCP project for Keboola | `kids-ai-data-analysis` stays |
| Dev VM model | `dev_instances` list in tfvars, image_tag per item |
| `ZdenekSrotyr/tmp_oss` | Delete after Phase 1 |
## 13. Glossary
| Abbreviation | Meaning |
|---|---|
| **GHCR** | GitHub Container Registry (ghcr.io) |
| **WIF** | Workload Identity Federation: GCP mechanism for CI auth without a static key |
| **SA** | Service Account (GCP) |
| **TF** | Terraform |
| **OIDC** | OpenID Connect: auth protocol; GitHub issues OIDC tokens for GHA |
| **CalVer** | Calendar Versioning: YYYY.MM.N |
| **PD** | Persistent Disk (GCP) |
## 14. Follow-up iterations
Out of scope for this first wave, but planned:
- **WIF instead of the SA JSON key** (security)
- **OS Login** (removes personal SSH keys)
- **Monitoring + alerting** (Cloud Monitoring, Slack integration)
- **Automatic snapshots** + restore procedure
- **Ephemeral PR preview environments**
- **AWS support** (parallel TF module)
- **Plugin API** for proprietary customer extensions (see issue #8)
- **Managed service variant** (Keboola hosts on behalf of the customer)
## 15. References
- Previous spec: `docs/superpowers/specs/2026-04-09-multi-instance-deployment-design.md` (CalVer release model)
- Issue: keboola/agnes-the-ai-analyst#8 (plugin API for private customer extensions)

@@ -0,0 +1,54 @@
# Minimal example: single-VM Agnes deploy.
# For an OSS self-hoster who wants a prod VM without dev VMs and without TLS.
terraform {
required_version = ">= 1.5"
required_providers {
google = {
source = "hashicorp/google"
version = "~> 5.0"
}
}
}
provider "google" {
project = var.gcp_project_id
region = "europe-west1"
}
variable "gcp_project_id" {
description = "GCP project ID (must have billing enabled)"
type = string
}
variable "admin_email" {
description = "Email for first admin user"
type = string
}
module "agnes" {
source = "../../modules/customer-instance"
gcp_project_id = var.gcp_project_id
customer_name = "self-hosted"
seed_admin_email = var.admin_email
prod_instance = {
name = "agnes"
machine_type = "e2-small"
data_disk_gb = 30
image_tag = "stable"
upgrade_mode = "auto"
tls_mode = "none"
domain = ""
}
dev_instances = []
# Customize below for your setup
data_source = "keboola"
}
output "agnes_ip" {
description = "SSH in via: ssh <user>@<ip>; UI at http://<ip>:8000"
value = module.agnes.prod_ip
}

@@ -1,170 +0,0 @@
terraform {
required_version = ">= 1.5"
backend "gcs" {
bucket = "agnes-terraform-state"
prefix = "instances"
}
required_providers {
google = {
source = "hashicorp/google"
version = "~> 5.0"
}
random = {
source = "hashicorp/random"
version = "~> 3.0"
}
}
}
provider "google" {
project = var.project_id
region = var.region
zone = var.zone
}
# --- Auto-generated secrets ---
resource "random_password" "jwt_secret" {
length = 48
special = false
}
# --- Network ---
resource "google_compute_firewall" "data_analyst" {
name = "${var.instance_name}-allow-web"
network = "default"
allow {
protocol = "tcp"
ports = ["22", "80", "443", "8000"]
}
source_ranges = ["0.0.0.0/0"]
target_tags = [var.instance_name]
}
# --- Static IP ---
resource "google_compute_address" "data_analyst" {
name = "${var.instance_name}-ip"
region = var.region
}
# --- Startup script ---
locals {
startup_script = <<-SCRIPT
#!/bin/bash
set -euo pipefail
exec > /var/log/startup.log 2>&1
echo "=== Installing Docker ==="
if ! command -v docker &> /dev/null; then
curl -fsSL https://get.docker.com | sh
usermod -aG docker ${var.ssh_user}
fi
# Install docker compose plugin
if ! docker compose version &> /dev/null; then
apt-get update && apt-get install -y docker-compose-plugin
fi
echo "=== Cloning repository ==="
APP_DIR="/opt/data-analyst"
if [ ! -d "$APP_DIR" ]; then
git clone https://github.com/keboola/agnes-the-ai-analyst.git "$APP_DIR"
cd "$APP_DIR"
git checkout main
else
cd "$APP_DIR"
git pull origin main || true
fi
echo "=== Creating .env ==="
cat > "$APP_DIR/.env" << 'ENVEOF'
JWT_SECRET_KEY=${random_password.jwt_secret.result}
DATA_DIR=/data
DATA_SOURCE=${var.keboola_token != "" ? "keboola" : "local"}
KEBOOLA_STORAGE_TOKEN=${var.keboola_token}
KEBOOLA_STACK_URL=${var.keboola_stack_url}
KEBOOLA_PROJECT_ID=${var.keboola_project_id}
SEED_ADMIN_EMAIL=${var.admin_email}
LOG_LEVEL=info
ENVEOF
# Strip leading whitespace from heredoc
sed -i 's/^ //' "$APP_DIR/.env"
chmod 600 "$APP_DIR/.env"
echo "=== Creating instance.yaml ==="
mkdir -p "$APP_DIR/config"
cat > "$APP_DIR/config/instance.yaml" << YAMLEOF
instance:
name: "${var.instance_name}"
subtitle: "Data Analytics Platform"
server:
host: "${google_compute_address.data_analyst.address}"
hostname: "${var.domain != "" ? var.domain : google_compute_address.data_analyst.address}"
port: 8000
auth:
allowed_domain: ""
data_source:
type: "${var.keboola_token != "" ? "keboola" : "local"}"
YAMLEOF
echo "=== Creating data directory ==="
mkdir -p /data/state /data/analytics /data/extracts
chown -R 1000:1000 /data
echo "=== Starting Docker Compose ==="
cd "$APP_DIR"
docker compose pull 2>/dev/null || true
docker compose build
docker compose up -d
echo "=== Startup complete ==="
docker compose ps
SCRIPT
}
# --- VM Instance ---
resource "google_compute_instance" "data_analyst" {
name = var.instance_name
machine_type = var.machine_type
zone = var.zone
tags = [var.instance_name]
boot_disk {
initialize_params {
image = "ubuntu-os-cloud/ubuntu-2404-lts-amd64"
size = var.disk_size_gb
type = "pd-ssd"
}
}
network_interface {
network = "default"
access_config {
nat_ip = google_compute_address.data_analyst.address
}
}
metadata = {
ssh-keys = "${var.ssh_user}:${file(pathexpand(var.ssh_public_key_path))}"
}
metadata_startup_script = local.startup_script
service_account {
scopes = ["cloud-platform"]
}
labels = {
app = "data-analyst"
managed = "terraform"
}
}

@@ -0,0 +1,163 @@
terraform {
required_version = ">= 1.5"
required_providers {
google = {
source = "hashicorp/google"
version = "~> 5.0"
}
random = {
source = "hashicorp/random"
version = "~> 3.0"
}
}
}
locals {
# Normalize all instances into a single list so for_each is uniform across prod + dev.
all_instances = concat(
[merge(var.prod_instance, { role = "prod" })],
[for d in var.dev_instances : merge(d, {
role = "dev"
disk_size_gb = 30
data_disk_gb = 20
upgrade_mode = "auto"
tls_mode = "caddy"
domain = ""
})]
)
}
# --- Secrets ---
resource "google_secret_manager_secret" "jwt" {
secret_id = "agnes-${var.customer_name}-jwt-secret"
project = var.gcp_project_id
replication {
auto {}
}
}
resource "random_password" "jwt" {
length = 48
special = false
}
resource "google_secret_manager_secret_version" "jwt" {
secret = google_secret_manager_secret.jwt.id
secret_data = random_password.jwt.result
}
# --- VM service account (dedicated, read-only access to Secret Manager) ---
resource "google_service_account" "vm" {
account_id = "agnes-${var.customer_name}-vm"
display_name = "Agnes VM runtime SA (${var.customer_name})"
project = var.gcp_project_id
}
resource "google_project_iam_member" "vm_secrets" {
project = var.gcp_project_id
role = "roles/secretmanager.secretAccessor"
member = "serviceAccount:${google_service_account.vm.email}"
}
# --- Network ---
resource "google_compute_firewall" "web" {
name = "agnes-${var.customer_name}-allow-web"
project = var.gcp_project_id
network = "default"
allow {
protocol = "tcp"
ports = ["22", "80", "443", "8000"]
}
source_ranges = ["0.0.0.0/0"]
target_tags = ["agnes-${var.customer_name}"]
}
# --- Persistent data disks + VMs (prod + dev) ---
resource "google_compute_disk" "data" {
for_each = { for inst in local.all_instances : inst.name => inst }
name = "${each.value.name}-data"
project = var.gcp_project_id
zone = var.zone
size = each.value.data_disk_gb
type = "pd-ssd"
}
resource "google_compute_address" "ip" {
for_each = { for inst in local.all_instances : inst.name => inst }
name = "${each.value.name}-ip"
project = var.gcp_project_id
region = var.region
}
resource "google_compute_instance" "vm" {
for_each = { for inst in local.all_instances : inst.name => inst }
name = each.value.name
project = var.gcp_project_id
machine_type = each.value.machine_type
zone = var.zone
tags = ["agnes-${var.customer_name}"]
boot_disk {
initialize_params {
image = "ubuntu-os-cloud/ubuntu-2404-lts-amd64"
size = each.value.disk_size_gb
type = "pd-ssd"
}
}
attached_disk {
source = google_compute_disk.data[each.key].self_link
device_name = "data"
}
network_interface {
network = "default"
access_config {
nat_ip = google_compute_address.ip[each.key].address
}
}
metadata = {
enable-oslogin = "TRUE"
}
metadata_startup_script = templatefile("${path.module}/startup-script.sh.tpl", {
customer_name = var.customer_name
image_repo = var.image_repo
image_tag = each.value.image_tag
upgrade_mode = each.value.upgrade_mode
tls_mode = each.value.tls_mode
domain = each.value.domain
data_source = var.data_source
keboola_stack_url = var.keboola_stack_url
seed_admin_email = var.seed_admin_email
role = each.value.role
})
service_account {
email = google_service_account.vm.email
scopes = ["cloud-platform"]
}
labels = {
app = "agnes"
customer = var.customer_name
role = each.value.role
managed = "terraform"
}
# Changing the startup script does not affect a running VM (the script runs only at boot).
# To apply changes, the VM has to be restarted or recreated.
lifecycle {
ignore_changes = [metadata_startup_script]
}
}

@@ -0,0 +1,19 @@
output "instance_ips" {
description = "Map { name => external IP }"
value = { for k, v in google_compute_address.ip : k => v.address }
}
output "prod_ip" {
description = "External IP of the prod instance"
value = google_compute_address.ip[var.prod_instance.name].address
}
output "vm_service_account" {
description = "Email of the VM SA (for further IAM bindings, e.g. BigQuery)"
value = google_service_account.vm.email
}
output "jwt_secret_name" {
description = "Full name of the JWT secret in Secret Manager"
value = google_secret_manager_secret.jwt.name
}

@@ -0,0 +1,100 @@
#!/bin/bash
# Agnes VM startup script — templated by Terraform.
# Idempotent: runs on every boot.
set -euo pipefail
exec > /var/log/agnes-startup.log 2>&1
CUSTOMER_NAME="${customer_name}"
IMAGE_REPO="${image_repo}"
IMAGE_TAG="${image_tag}"
UPGRADE_MODE="${upgrade_mode}"
TLS_MODE="${tls_mode}"
DOMAIN="${domain}"
DATA_SOURCE="${data_source}"
KEBOOLA_STACK_URL="${keboola_stack_url}"
SEED_ADMIN_EMAIL="${seed_admin_email}"
ROLE="${role}"
echo "=== [Agnes $CUSTOMER_NAME $ROLE] Startup at $(date) ==="
# --- 1. Docker (install if missing) ---
if ! command -v docker &>/dev/null; then
curl -fsSL https://get.docker.com | sh
fi
if ! docker compose version &>/dev/null; then
apt-get update && apt-get install -y docker-compose-plugin
fi
# --- 2. Persistent data disk mount ---
DATA_DEV="/dev/disk/by-id/google-data"
DATA_MNT="/data"
if [ -b "$DATA_DEV" ]; then
if ! blkid "$DATA_DEV" | grep -q ext4; then
mkfs.ext4 -F "$DATA_DEV"
fi
mkdir -p "$DATA_MNT"
mountpoint -q "$DATA_MNT" || mount -o discard,defaults "$DATA_DEV" "$DATA_MNT"
grep -qF "$DATA_DEV" /etc/fstab || echo "$DATA_DEV $DATA_MNT ext4 discard,defaults,nofail 0 2" >> /etc/fstab
mkdir -p "$DATA_MNT/state" "$DATA_MNT/analytics" "$DATA_MNT/extracts"
fi
# --- 3. App directory + docker-compose files from public repo ---
APP_DIR="/opt/agnes"
mkdir -p "$APP_DIR"
cd "$APP_DIR"
# Fetch minimal docker-compose from public repo (main branch — stable)
curl -fsSL "https://raw.githubusercontent.com/keboola/agnes-the-ai-analyst/main/docker-compose.yml" -o docker-compose.yml
curl -fsSL "https://raw.githubusercontent.com/keboola/agnes-the-ai-analyst/main/docker-compose.prod.yml" -o docker-compose.prod.yml
# TLS overlay (Caddy + Let's Encrypt), only when needed
if [ "$TLS_MODE" = "caddy" ] && [ -n "$DOMAIN" ]; then
curl -fsSL "https://raw.githubusercontent.com/keboola/agnes-the-ai-analyst/main/Caddyfile" -o Caddyfile 2>/dev/null || true
fi
# --- 4. Fetch secrets from Secret Manager ---
KEBOOLA_TOKEN=""
if [ "$DATA_SOURCE" = "keboola" ]; then
KEBOOLA_TOKEN=$(gcloud secrets versions access latest --secret=keboola-storage-token 2>/dev/null || echo "")
fi
JWT_KEY=$(gcloud secrets versions access latest --secret=agnes-$${CUSTOMER_NAME}-jwt-secret)
cat > "$APP_DIR/.env" <<ENVEOF
JWT_SECRET_KEY=$JWT_KEY
DATA_DIR=$DATA_MNT
DATA_SOURCE=$DATA_SOURCE
KEBOOLA_STORAGE_TOKEN=$KEBOOLA_TOKEN
KEBOOLA_STACK_URL=$KEBOOLA_STACK_URL
SEED_ADMIN_EMAIL=$SEED_ADMIN_EMAIL
LOG_LEVEL=info
DOMAIN=$DOMAIN
AGNES_TAG=$IMAGE_TAG
ACME_EMAIL=admin@$${DOMAIN#*.}
ENVEOF
chmod 600 "$APP_DIR/.env"
# --- 5. Start Agnes ---
COMPOSE_PROFILES_ARG=""
if [ "$TLS_MODE" = "caddy" ] && [ -n "$DOMAIN" ]; then
COMPOSE_PROFILES_ARG="--profile tls"
fi
docker compose -f docker-compose.yml -f docker-compose.prod.yml $COMPOSE_PROFILES_ARG pull
docker compose -f docker-compose.yml -f docker-compose.prod.yml $COMPOSE_PROFILES_ARG up -d
# --- 6. Watchtower (auto-pull new images) ---
if [ "$UPGRADE_MODE" = "auto" ]; then
# Remove the old Watchtower container if it exists (for idempotency)
docker rm -f agnes-watchtower 2>/dev/null || true
docker run -d \
--name agnes-watchtower \
--restart=unless-stopped \
-v /var/run/docker.sock:/var/run/docker.sock \
containrrr/watchtower \
--interval 300 \
--cleanup \
--include-restarting
fi
echo "=== [Agnes $CUSTOMER_NAME $ROLE] Startup complete at $(date) ==="
docker compose ps

@@ -0,0 +1,72 @@
variable "gcp_project_id" {
description = "GCP project ID where the instance will be deployed"
type = string
}
variable "region" {
description = "GCP region"
type = string
default = "europe-west1"
}
variable "zone" {
description = "GCP zone"
type = string
default = "europe-west1-b"
}
variable "customer_name" {
description = "Short customer identifier (e.g. keboola, grpn). Used as a prefix for resource names."
type = string
validation {
condition = can(regex("^[a-z][a-z0-9-]{1,20}$", var.customer_name))
error_message = "customer_name must be lowercase, start with a letter, and be 2-21 characters long."
}
}
variable "prod_instance" {
description = "Prod VM configuration"
type = object({
name = string
machine_type = optional(string, "e2-small")
disk_size_gb = optional(number, 30)
data_disk_gb = optional(number, 50)
image_tag = optional(string, "stable")
upgrade_mode = optional(string, "auto")
tls_mode = optional(string, "caddy")
domain = optional(string, "")
})
}
variable "dev_instances" {
description = "List of dev VMs. An empty list means no dev VMs."
type = list(object({
name = string
machine_type = optional(string, "e2-small")
image_tag = optional(string, "dev")
}))
default = []
}
variable "seed_admin_email" {
description = "Email of the first admin user"
type = string
}
variable "data_source" {
description = "Data source type: keboola | bigquery | csv"
type = string
default = "keboola"
}
variable "keboola_stack_url" {
description = "Keboola Stack URL (used when data_source = keboola)"
type = string
default = ""
}
variable "image_repo" {
description = "Docker image repo"
type = string
default = "ghcr.io/keboola/agnes-the-ai-analyst"
}
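
Note on the `customer_name` validation above. A wrapper script could apply the same pattern before `terraform plan` is invoked, so that a bad name fails early. The sketch below is one way to do that, assuming a POSIX shell with `grep -E`. `valid_customer_name` is a hypothetical helper, not part of the module.

```shell
# Sketch only: valid_customer_name is a hypothetical helper, not part of the module.
valid_customer_name() {
  # Same pattern as the Terraform validation: one lowercase letter followed by
  # 1-20 lowercase letters, digits, or hyphens (2-21 characters in total).
  # LC_ALL=C keeps the [a-z] range from matching uppercase in some locales.
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z][a-z0-9-]{1,20}$'
}

for name in keboola grpn k Acme 9lives; do
  if valid_customer_name "$name"; then
    echo "$name: ok"
  else
    echo "$name: rejected"
  fi
done
```

Here `keboola` and `grpn` pass, while `k` (too short), `Acme` (uppercase), and `9lives` (leading digit) are rejected.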


@ -1,39 +0,0 @@
output "instance_ip" {
description = "Public IP address of the server"
value = google_compute_address.data_analyst.address
}
output "ssh_command" {
description = "SSH command to connect"
value = "ssh ${var.ssh_user}@${google_compute_address.data_analyst.address}"
}
output "api_url" {
description = "API URL"
value = "http://${google_compute_address.data_analyst.address}:8000"
}
output "web_url" {
description = "Web UI URL"
value = var.domain != "" ? "https://${var.domain}" : "http://${google_compute_address.data_analyst.address}:8000"
}
output "swagger_url" {
description = "Swagger API docs URL"
value = "http://${google_compute_address.data_analyst.address}:8000/docs"
}
output "bootstrap_command" {
description = "Command to bootstrap first admin user"
value = "curl -X POST http://${google_compute_address.data_analyst.address}:8000/auth/bootstrap -H 'Content-Type: application/json' -d '{\"email\":\"admin@keboola.com\",\"name\":\"Admin\"}'"
}
output "cli_setup_commands" {
description = "Commands to set up local CLI"
value = <<-EOT
da setup init --server http://${google_compute_address.data_analyst.address}:8000
da setup bootstrap admin@keboola.com
da setup test-connection
da sync
EOT
}


@ -1,19 +0,0 @@
# Copy to terraform.tfvars and fill in values
project_id = "your-gcp-project"
region = "europe-north1"
zone = "europe-north1-a"
machine_type = "e2-small" # 2 vCPU, 2GB RAM, ~$7/mo
disk_size_gb = 30
instance_name = "data-analyst"
ssh_user = "deploy"
ssh_public_key_path = "~/.ssh/id_ed25519.pub"
# JWT secret is auto-generated by Terraform (random_password)
# Keboola (optional — leave empty for sample data)
keboola_token = ""
keboola_stack_url = "https://connection.keboola.com"
keboola_project_id = ""
# Domain (optional — leave empty for IP-only access)
domain = ""


@ -1,79 +0,0 @@
variable "project_id" {
description = "GCP project ID"
type = string
}
variable "region" {
description = "GCP region"
type = string
default = "europe-west1"
}
variable "zone" {
description = "GCP zone"
type = string
default = "europe-west1-b"
}
variable "machine_type" {
description = "VM machine type"
type = string
default = "e2-small"
}
variable "disk_size_gb" {
description = "Boot disk size in GB"
type = number
default = 30
}
variable "instance_name" {
description = "Name for the VM instance"
type = string
default = "data-analyst"
}
variable "ssh_user" {
description = "SSH username"
type = string
default = "deploy"
}
variable "ssh_public_key_path" {
description = "Path to SSH public key file"
type = string
default = "~/.ssh/id_ed25519.pub"
}
# App config (JWT secret auto-generated by Terraform)
variable "keboola_token" {
description = "Keboola Storage API token"
type = string
sensitive = true
default = ""
}
variable "keboola_stack_url" {
description = "Keboola Stack URL"
type = string
default = "https://connection.keboola.com"
}
variable "keboola_project_id" {
description = "Keboola project ID"
type = string
default = ""
}
variable "admin_email" {
description = "Admin email for initial seed (e.g., admin@company.com)"
type = string
default = ""
}
variable "domain" {
description = "Domain name for SSL (optional, empty = IP only)"
type = string
default = ""
}

84
scripts/bootstrap-gcp.sh Executable file

@ -0,0 +1,84 @@
#!/usr/bin/env bash
# Bootstrap a GCP project for Agnes deployment.
# One-time, idempotent. Run as an owner of the GCP project.
#
# Usage: bootstrap-gcp.sh <GCP_PROJECT_ID> [SA_NAME]
#
# Produces:
# - enabled APIs (compute, iam, iamcredentials, secretmanager, cloudresourcemanager, storage)
# - service account <SA_NAME> with the roles needed for terraform apply
# - GCS bucket agnes-<PROJECT_ID>-tfstate (versioned, uniform bucket-level access)
# - SA JSON key (local file; paste it into the GitHub secret GCP_SA_KEY, then delete it)
set -euo pipefail
PROJECT_ID="${1:?Usage: $0 <GCP_PROJECT_ID> [SA_NAME=agnes-deploy]}"
SA_NAME="${2:-agnes-deploy}"
SA_EMAIL="${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"
echo "=== Bootstrap GCP project: ${PROJECT_ID} ==="
gcloud config set project "${PROJECT_ID}" 1>/dev/null
echo "=== Enable APIs ==="
gcloud services enable \
compute.googleapis.com \
iam.googleapis.com \
iamcredentials.googleapis.com \
secretmanager.googleapis.com \
cloudresourcemanager.googleapis.com \
storage.googleapis.com \
--project="${PROJECT_ID}"
echo "=== Create deploy service account (if not exists) ==="
if ! gcloud iam service-accounts describe "${SA_EMAIL}" --project="${PROJECT_ID}" 2>/dev/null 1>&2; then
gcloud iam service-accounts create "${SA_NAME}" \
--display-name="Agnes Terraform deploy" \
--project="${PROJECT_ID}"
else
echo " (SA already exists — skipping creation)"
fi
echo "=== Grant roles ==="
for role in \
compute.instanceAdmin.v1 \
compute.securityAdmin \
compute.networkAdmin \
iam.serviceAccountUser \
iam.serviceAccountAdmin \
secretmanager.admin \
storage.admin \
resourcemanager.projectIamAdmin; do
gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
--member="serviceAccount:${SA_EMAIL}" \
--role="roles/${role}" \
--condition=None \
--quiet 1>/dev/null
done
echo "=== Create tfstate bucket (if not exists) ==="
BUCKET="agnes-${PROJECT_ID}-tfstate"
if ! gsutil ls -b "gs://${BUCKET}" 2>/dev/null 1>&2; then
gsutil mb -p "${PROJECT_ID}" -l europe-west1 -b on "gs://${BUCKET}"
gsutil versioning set on "gs://${BUCKET}"
else
echo " (bucket already exists — skipping creation)"
fi
echo "=== Generate SA key ==="
KEY_FILE="./${SA_NAME}-${PROJECT_ID}-key.json"
gcloud iam service-accounts keys create "${KEY_FILE}" \
--iam-account="${SA_EMAIL}" \
--project="${PROJECT_ID}"
echo ""
echo "=== DONE ==="
echo ""
echo "SA email: ${SA_EMAIL}"
echo "TF state bucket: gs://${BUCKET}"
echo "SA key file: ${KEY_FILE}"
echo ""
echo "NEXT STEPS:"
echo "1. Push the key into a GitHub secret of the private infra repo:"
echo " gh secret set GCP_SA_KEY --repo <owner>/<repo> < ${KEY_FILE}"
echo "2. THEN delete the local copy of the key:"
echo " rm ${KEY_FILE}"
echo ""


@ -0,0 +1,44 @@
#!/usr/bin/env bash
# Fetches secrets from GCP Secret Manager and creates the .env file for Agnes.
# Runs on the VM as a user whose gcloud has access to Secret Manager
# (typically via the VM service account with roles/secretmanager.secretAccessor).
#
# Usage: ./fetch-env-from-secrets.sh [APP_DIR]
# Default APP_DIR: /home/deploy/app
set -euo pipefail
APP_DIR="${1:-${APP_DIR:-/home/deploy/app}}"
ENV_FILE="${APP_DIR}/.env"
# Non-secret config (override via environment or hardcoded defaults)
DATA_SOURCE="${DATA_SOURCE:-keboola}"
KEBOOLA_STACK_URL="${KEBOOLA_STACK_URL:-https://connection.us-east4.gcp.keboola.com/}"
SEED_ADMIN_EMAIL="${SEED_ADMIN_EMAIL:-zdenek.srotyr@keboola.com}"
LOG_LEVEL="${LOG_LEVEL:-info}"
DATA_DIR="${DATA_DIR:-/data}"
AGNES_TAG="${AGNES_TAG:-stable}"
echo "Fetching secrets from Secret Manager..."
JWT_KEY=$(gcloud secrets versions access latest --secret=jwt-secret-key)
KEBOOLA_TOKEN=""
if [ "$DATA_SOURCE" = "keboola" ]; then
KEBOOLA_TOKEN=$(gcloud secrets versions access latest --secret=keboola-storage-token)
fi
echo "Writing ${ENV_FILE}..."
cat > "${ENV_FILE}" <<EOF
JWT_SECRET_KEY=${JWT_KEY}
DATA_DIR=${DATA_DIR}
DATA_SOURCE=${DATA_SOURCE}
KEBOOLA_STORAGE_TOKEN=${KEBOOLA_TOKEN}
KEBOOLA_STACK_URL=${KEBOOLA_STACK_URL}
SEED_ADMIN_EMAIL=${SEED_ADMIN_EMAIL}
LOG_LEVEL=${LOG_LEVEL}
AGNES_TAG=${AGNES_TAG}
EOF
chmod 600 "${ENV_FILE}"
# chown is best-effort: ignore the failure if the script is not running as root
chown deploy:deploy "${ENV_FILE}" 2>/dev/null || true
echo "Done. ${ENV_FILE} has $(wc -l < "${ENV_FILE}") lines, chmod 600."
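
Note on the `.env` write above. When the file does not exist yet, `cat > "${ENV_FILE}"` creates it with the default umask, and `chmod 600` only tightens the mode afterwards. One possible hardening, sketched below, writes to a `mktemp` file (created with mode 0600) and renames it into place. `write_env_atomically` and the demo values are hypothetical and not part of this commit.

```shell
# Sketch only: write_env_atomically is a hypothetical helper, not part of this commit.
write_env_atomically() {
  target="$1"
  # mktemp creates the file with mode 0600, so secrets are never group- or world-readable.
  tmp="$(mktemp "${target}.XXXXXX")"
  # The content is read from stdin, for example from a heredoc.
  cat > "$tmp"
  chmod 600 "$tmp"
  # A rename within one directory is atomic: readers see either the old or the new file.
  mv -f "$tmp" "$target"
}

# Demo in a throwaway directory with non-secret placeholder values.
demo_dir="$(mktemp -d)"
write_env_atomically "${demo_dir}/.env" <<EOF
LOG_LEVEL=info
AGNES_TAG=stable
EOF
ls -l "${demo_dir}/.env"
```

The temporary file sits next to the target so that the final `mv` stays on one filesystem.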