From 1ca5295d54edacf3910755986ba3c9f3aa727432 Mon Sep 17 00:00:00 2001
From: ZdenekSrotyr <139972147+ZdenekSrotyr@users.noreply.github.com>
Date: Tue, 21 Apr 2026 21:33:06 +0200
Subject: [PATCH] =?UTF-8?q?docs:=20add=20HACKATHON.md=20=E2=80=94=20conden?=
 =?UTF-8?q?sed=20deploy=20+=20dev=20playbooks=20(#21)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Written for both humans and AI agents — explicit commands, expected
outputs, troubleshooting tables, 'safe to run anytime' vs 'requires
thought' sections, pitfalls checklist.

Three parts:
1. Deploy for a new customer (45 min target, 7 steps)
2. Develop against Agnes (branch → image → dev VM loop, common tasks)
3. AI agent checklist (guardrails, verification, common pitfalls)

Complements the deep docs (ONBOARDING.md, DEPLOYMENT.md, architecture.md)
with a practical quick-reference for hackathon-style deploys.
---
 README.md         |   1 +
 docs/HACKATHON.md | 362 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 363 insertions(+)
 create mode 100644 docs/HACKATHON.md

diff --git a/README.md b/README.md
index e8598f7..4a57889 100644
--- a/README.md
+++ b/README.md
@@ -133,6 +133,7 @@ See `config/instance.yaml.example` for all available options.
 
 ## Documentation
 
+- [Hackathon TL;DR](docs/HACKATHON.md) — condensed deploy + dev playbooks (for both humans and AI agents)
 - [Onboarding Guide](docs/ONBOARDING.md) — end-to-end Terraform deployment into a GCP project (recommended for production)
 - [Deployment Guide](docs/DEPLOYMENT.md) — chooses between Terraform and Docker Compose; covers OSS self-host
 - [Configuration Reference](docs/CONFIGURATION.md) — `instance.yaml`, env vars, per-instance options
diff --git a/docs/HACKATHON.md b/docs/HACKATHON.md
new file mode 100644
index 0000000..ae1f858
--- /dev/null
+++ b/docs/HACKATHON.md
@@ -0,0 +1,362 @@
+# Hackathon TL;DR — Deploy & Develop
+
+Two condensed playbooks, written to be copy-pasteable by both humans and AI agents. For depth see [`ONBOARDING.md`](ONBOARDING.md) and [`DEPLOYMENT.md`](DEPLOYMENT.md).
+
+- [Part 1 — Deploy for a new customer](#part-1--deploy-for-a-new-customer)
+- [Part 2 — Develop against Agnes](#part-2--develop-against-agnes)
+- [Part 3 — AI agent checklist](#part-3--ai-agent-checklist)
+
+---
+
+## Part 1 — Deploy for a new customer
+
+**Goal:** Agnes running in `https://'s own GCP project`, accessible on an IP. Target time: **45 minutes**.
+
+### Prerequisites (verify first)
+
+```bash
+gcloud --version      # ≥ 500.0.0
+terraform --version   # ≥ 1.5
+gh --version          # any recent
+gh auth status        # must be logged in with repo + workflow + admin:repo_hook scopes
+```
+
+Plus:
+- GCP project with **billing linked**; you have `roles/owner` or equivalent
+- Keboola Storage token (if `data_source = "keboola"`)
+
+### Step 1 — Bootstrap GCP project
+
+```bash
+# Download + run bootstrap (creates deploy SA, tfstate bucket, enables APIs)
+curl -fsSL https://raw.githubusercontent.com/keboola/agnes-the-ai-analyst/main/scripts/bootstrap-gcp.sh -o bootstrap-gcp.sh
+chmod +x bootstrap-gcp.sh
+./bootstrap-gcp.sh
+```
+
+Outputs you need from the run:
+- SA email `agnes-deploy@.iam.gserviceaccount.com`
+- Bucket `gs://agnes--tfstate`
+- Key file `./agnes-deploy--key.json` — move to `~/.agnes-keys/` and `chmod 600`. **Never git-commit.**
+
+### Step 2 — Keboola token secret
+
+Skip if `data_source` is not keboola.
+
+```bash
+echo -n "" | gcloud secrets create keboola-storage-token \
+  --data-file=- --replication-policy=automatic --project=
+```
+
+### Step 3 — Create customer's private infra repo from template
+
+```bash
+# can be your own org (if Keboola is deploying) or customer's org
+gh repo create /agnes-infra- \
+  --template keboola/agnes-infra-template \
+  --private \
+  --clone
+cd agnes-infra-
+
+# SA key to GitHub secret
+gh secret set GCP_SA_KEY < ~/.agnes-keys/agnes-deploy--key.json
+```
+
+### Step 4 — GitHub environments + auto-merge
+
+```bash
+# dev environment — no protection
+gh api -X PUT repos//agnes-infra-/environments/dev
+
+# prod environment — branch policy (main only), no reviewer here; add manually via UI
+echo '{"deployment_branch_policy":{"protected_branches":true,"custom_branch_policies":false}}' \
+  | gh api -X PUT repos//agnes-infra-/environments/prod --input -
+
+# Settings → Environments → prod → Add required reviewer
+
+# Allow auto-merge on this repo (for Renovate)
+gh api -X PATCH repos//agnes-infra- \
+  -f allow_auto_merge=true -f delete_branch_on_merge=true
+
+# Install Renovate GitHub App on this repo:
+# https://github.com/apps/renovate → Configure → /agnes-infra-
+```
+
+### Step 5 — Edit `terraform/main.tf` and `terraform.tfvars`
+
+```hcl
+# terraform/main.tf — replace both placeholders
+backend "gcs" {
+  bucket = "agnes--tfstate"
+  prefix = ""
+}
+```
+
+```bash
+cp terraform/terraform.tfvars.example terraform/terraform.tfvars
+# Edit, minimum required:
+#   gcp_project_id    = ""
+#   customer_name     = ""   # lowercase, 2–21 chars
+#   seed_admin_email  = "admin@.com"
+#   keboola_stack_url = "https://connection..gcp.keboola.com/"
+```
+
+### Step 6 — First apply (from local, before CI is set up)
+
+```bash
+cd terraform
+export GOOGLE_APPLICATION_CREDENTIALS=~/.agnes-keys/agnes-deploy--key.json
+terraform init
+terraform plan    # verify counts: ~20 resources to add
+terraform apply   # type 'yes'
+
+# Expected: ~5 min. Outputs include prod_ip.
+terraform output -raw prod_ip
+```
+
+Alternative: commit tfvars + push to main — GitHub Actions `apply-dev` auto-runs, `apply-prod` waits for reviewer.
+
+### Step 7 — Bootstrap the admin + log in
+
+```bash
+PROD_IP=$(terraform output -raw prod_ip)
+curl -X POST "http://$PROD_IP:8000/auth/bootstrap" \
+  -H "Content-Type: application/json" \
+  -d '{"email":"","password":""}'
+```
+
+Expected: 200 OK + JSON with `role: "admin"`.
+
+Open `http://:8000/login` → sign in. Done.
+
+### Verify (smoke tests)
+
+```bash
+curl -s "http://$PROD_IP:8000/api/health" | jq '{status, version, channel}'
+curl -s "http://$PROD_IP:8000/api/version" | jq
+```
+
+Both should return JSON. Badge in UI footer shows `channel-version · deployed Xs ago`.
+
+### Troubleshooting
+
+| Symptom | Cause | Fix |
+|---|---|---|
+| `PERMISSION_DENIED: monitoring.uptimeCheckConfigs.create` | monitoring.googleapis.com not enabled / role missing | Re-run `bootstrap-gcp.sh` (grants `roles/monitoring.editor`) |
+| `At least one of the pre-conditions you specified did not hold` | Stale GCS state lock | Remove `gs://…/tfstate//default.tflock` + retry |
+| `Error acquiring the state lock` | Active apply concurrent | Wait, or `terraform force-unlock ` |
+| Docker pull fails `ghcr.io/…:: not found` | `image_tag` pins a branch/tag that was never built | Fix `image_tag` in tfvars (use `stable` or `dev` floating) |
+| VM up but `/api/health` 000 / connection refused | Startup script still running (takes 60–90 s) | Tail `/var/log/agnes-startup.log` via `gcloud compute ssh` |
+| `/auth/bootstrap` returns 403 `already have passwords` | Someone already bootstrapped. | Use normal login at `/login/password` |
+
+---
+
+## Part 2 — Develop against Agnes
+
+**Loop:** branch in public repo → auto-built `:dev-` image → point a dev VM at that tag → iterate.
+
+### Development architecture (mental model)
+
+```
+keboola/agnes-the-ai-analyst (public)
+│
+│  push branch zs/my-edit
+├──▶ release.yml builds ghcr.io/…:dev-zs-my-edit (one image per push to branch)
+│
+│  push to main
+└──▶ release.yml builds :stable + :stable-YYYY.MM.N
+
+/terraform.tfvars
+│
+│  dev_instances = [..., { name = "agnes-zs", image_tag = "dev-zs-my-edit" }, ...]
+└──▶ apply-dev recreates VM agnes-zs, pinned to that tag, cron pulls new digests every 5 min
+```
+
+### Step 1 — Branch and push (in `keboola/agnes-the-ai-analyst`)
+
+```bash
+git checkout -b zs/my-edit   # or feature/xyz, fix/bar — any slash prefix works
+# …edit code…
+git commit -am "wip: my experiment"
+git push origin zs/my-edit
+# → release.yml builds ghcr.io/keboola/agnes-the-ai-analyst:dev-zs-my-edit (~5 min)
+```
+
+**Slug rule:** `` is slugified — non-`[a-z0-9-]` → `-`, lowercased, max 50 chars. Leading `feature/` is stripped. So:
+- `zs/my-edit` → `:dev-zs-my-edit`
+- `feature/alice/dashboard` → `:dev-alice-dashboard`
+- `fix/issue_42` → `:dev-fix-issue-42`
+
+Verify the image exists before continuing:
+```bash
+docker manifest inspect ghcr.io/keboola/agnes-the-ai-analyst:dev-zs-my-edit
+```
+
+### Step 2 — Open PR in the customer's infra repo
+
+```bash
+cd
+git checkout -b add-dev-vm-zs
+# Edit terraform/variables.tf or terraform.tfvars:
+#   dev_instances = [
+#     { name = "agnes-dev", image_tag = "dev" },
+#     { name = "agnes-zs-edit", image_tag = "dev-zs-my-edit" },   # <-- added
+#   ]
+git commit -am "add: dev VM pinned to zs/my-edit"
+git push origin add-dev-vm-zs
+gh pr create
+```
+
+`plan.yml` comments on PR with diff. Review + merge → `apply-dev` creates VM (~2 min).
+
+### Step 3 — Access your VM
+
+```bash
+cd terraform
+terraform output -json instance_ips   # grep your VM's name
+
+# Open http://:8000/login
+# Use the customer's admin credentials (seed + password)
+```
+
+### Step 4 — Iterate
+
+Every push to your branch:
+1. release.yml rebuilds `:dev-zs-my-edit`
+2. Cron on your VM (every 5 min) detects new digest, pulls, restarts containers
+3. Within ~6 min of your push, your VM runs the new code
+
+No manual apply needed.
+
+### Step 5 — Merge to main
+
+Open PR on public repo → review → merge. This:
+1. Builds `:stable + :stable-YYYY.MM.N` (main)
+2. Smoke test in CI
+3. Cron on **all prod VMs** (every customer!) pulls new `:stable` within 5 min
+
+Your branch's `:dev-zs-my-edit` tag persists in GHCR but is no longer updated. Your dev VM still runs the last build of your branch until you change its `image_tag`.
+
+### Step 6 — Clean up
+
+PR in customer infra repo removing the entry:
+
+```diff
+ dev_instances = [
+   { name = "agnes-dev", image_tag = "dev" },
+-  { name = "agnes-zs-edit", image_tag = "dev-zs-my-edit" },
+ ]
+```
+
+Merge → apply-dev destroys the VM + data disk + IP + monitoring resources. Daily snapshot (if enabled) retains data for 30 days.
+
+### Common development tasks
+
+| Task | Where | How |
+|---|---|---|
+| Write code | public repo | Normal git workflow |
+| Run tests locally | public repo | `TESTING=1 pytest tests/ -v` |
+| Bump infra module | public repo | Edit `infra/modules/customer-instance/`, PR, merge, create `infra-vX.Y.Z` tag |
+| Point customer at new module | customer infra repo | Renovate opens PR; or edit `ref=` in main.tf manually |
+| Force-propagate startup script change | customer infra repo | Actions → Terraform Apply → Run workflow → `recreate_targets=module.agnes.google_compute_instance.vm["agnes-prod"]` |
+| Add dev VM for someone else | customer infra repo | Add entry to `dev_instances`, PR, merge |
+| Rotate Keboola token | customer GCP + VM | `gcloud secrets versions add keboola-storage-token --data-file=-` then SSH + `sudo /usr/local/bin/agnes-auto-upgrade.sh` (no manual edits to `.env`) |
+| Restart app manually | customer VM | `sudo docker compose -f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.host-mount.yml restart app` |
+| See current deployed version | any | Footer badge on UI, or `curl /api/version` |
+
+### Rules
+
+- **Never `gcloud compute ssh` + edit `/opt/agnes/.env`** — gets wiped on next VM recreate. Route all config through Terraform or Secret Manager.
+- **Never edit inside a VM's `/opt/agnes/`** — same reason. If you need a change to survive, it belongs in the module / compose files / tfvars.
+- **Never bump `infra-v*` by tagging upstream without first testing on a dev VM** — a broken module propagates to all customer repos via Renovate.
+- **Never delete a customer's data disk without a manual `gcloud compute disks snapshot` first** if in doubt about what's on it.
+
+---
+
+## Part 3 — AI agent checklist
+
+These are guardrails/verification steps an AI agent should follow autonomously.
+
+### Before taking destructive action
+
+Run and read, don't assume:
+```bash
+terraform plan                           # what will actually change?
+gh run list --limit 3                    # any CI failures?
+curl -s :8000/api/health | jq .status    # is prod actually healthy?
+```
+
+### When propagating module bumps
+
+1. Read `docs/superpowers/plans/2026-04-21-deployment-log.md` for context on iteration history.
+2. Check current `ref=` in customer infra repo against latest `infra-v*` tag in upstream.
+3. Prefer Renovate PR over manual edit — has automatic `terraform validate` gate.
+4. For startup-script changes (not just module-resource changes), use `workflow_dispatch` → `recreate_targets` to force VM recreate. Normal apply won't propagate (`ignore_changes`).
+
+### When a customer reports "it's broken"
+
+```bash
+# What version is deployed?
+curl -s http://:8000/api/version | jq
+
+# Recent deploys?
+gh run list --repo /agnes-infra- --limit 5
+
+# VM state?
+gcloud compute instances list --project= --filter="name~agnes-"
+
+# App logs (last 50)
+gcloud compute ssh agnes-prod --zone=... --project=... \
+  --command="sudo docker logs agnes-app-1 --tail 50"
+
+# Startup script log (if VM just booted)
+gcloud compute ssh agnes-prod --zone=... --project=... \
+  --command="sudo tail -30 /var/log/agnes-startup.log"
+```
+
+### When you're unsure
+
+Prefer non-destructive paths first:
+1. `terraform plan` (read-only) before `apply`
+2. Add a new resource before deleting an old one
+3. Snapshot before destroying a disk
+4. Dev VM before touching prod — always
+
+### Common pitfalls to detect
+
+| Pitfall | Check |
+|---|---|
+| Uncommitted local changes on operator's laptop | `git status -s` in infra repo |
+| Multiple concurrent applies (state lock) | `gsutil ls gs://.../tflock` |
+| `image_tag` points at non-existent GHCR image | `docker manifest inspect ghcr.io/…:` |
+| Seed user without password on fresh deploy | `curl /api/health | jq .services.users.count` — if 1 and nobody has logged in, `/auth/bootstrap` is still open |
+| Main branch protection prevents direct push | Use PR + auto-merge; never force-push to main |
+| Renovate not installed → module bumps don't happen | Check `https://github.com///pulls?q=author%3Aapp%2Frenovate` |
+| `/opt/agnes/.env` edited manually → drift | `git diff` against module's expected `.env` shape |
+
+### Safe-to-run anytime
+
+- `curl /api/health`, `curl /api/version` — no auth, no side effects
+- `terraform plan` — read-only
+- `gh run list`, `gh pr list` — read-only
+- `gcloud ... describe` / `list` — read-only
+- `docker logs` / `docker inspect` — read-only (on the VM)
+
+### Requires thought
+
+- `terraform apply` — mutates infra
+- `gh workflow run` with `recreate_targets` — destroys + recreates VMs
+- `gcloud compute instances delete` — unrecoverable after 30 days
+- `gcloud secrets versions destroy` — unrecoverable
+- `gh repo delete` — unrecoverable
+
+---
+
+## Reference links
+
+- Full onboarding: [`docs/ONBOARDING.md`](ONBOARDING.md)
+- Deployment comparison: [`docs/DEPLOYMENT.md`](DEPLOYMENT.md)
+- Spec (architecture rationale): [`docs/superpowers/specs/2026-04-21-multi-customer-deployment-spec.md`](superpowers/specs/2026-04-21-multi-customer-deployment-spec.md)
+- Deployment log (what we actually built, with iterations and known limitations): [`docs/superpowers/plans/2026-04-21-deployment-log.md`](superpowers/plans/2026-04-21-deployment-log.md)
+- Module source: [`infra/modules/customer-instance/`](../infra/modules/customer-instance/)
+- Upstream issues: https://github.com/keboola/agnes-the-ai-analyst/issues
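
The slug rule from Part 2, Step 1 can be approximated in shell to predict an image tag before running `docker manifest inspect`. This is a minimal sketch under stated assumptions, not the logic in `release.yml`: the function name `dev_tag` is hypothetical, lowercasing is assumed to happen before character replacement, and the 50-character limit is assumed to apply to the slug rather than to the full tag.

```sh
#!/bin/sh
# Hypothetical helper (not taken from release.yml): approximates the
# documented slug rule so a dev image tag can be predicted from a branch name.
dev_tag() {
  # Strip one leading "feature/", lowercase, map every character outside
  # [a-z0-9-] to "-", then keep at most 50 characters (assumed order).
  slug=$(printf '%s\n' "${1#feature/}" \
    | tr '[:upper:]' '[:lower:]' \
    | LC_ALL=C sed 's/[^a-z0-9-]/-/g' \
    | cut -c1-50)
  printf 'dev-%s\n' "$slug"
}

# Usage: print the tag for the three branch names listed with the slug rule.
for branch in zs/my-edit feature/alice/dashboard fix/issue_42; do
  dev_tag "$branch"
done
```

If a computed tag does not match what GHCR shows, treat `release.yml` as the source of truth.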