Hackathon TL;DR — Deploy & Develop
Two condensed playbooks, written to be copy-pasteable by both humans and AI agents. For depth see ONBOARDING.md and DEPLOYMENT.md.
Part 1 — Deploy for a new customer
Goal: Agnes running in <customer-name>'s own GCP project, accessible on an IP. Target time: 45 minutes.
Prerequisites (verify first)
gcloud --version # ≥ 500.0.0
terraform --version # ≥ 1.5
gh --version # any recent
gh auth status # must be logged in with repo + workflow + admin:repo_hook scopes
Plus:
- GCP project with billing linked; you have roles/owner or equivalent
- Keboola Storage token (if data_source = "keboola")
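The tool and version floors above can be checked mechanically before starting. The following is a minimal sketch, not part of the repo: the helper names missing_cmds and version_ge are hypothetical, and it assumes the local sort supports version ordering via -V.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the Agnes repo) for the prerequisite checks.

# Print every command from the argument list that is not on PATH.
missing_cmds() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
  done
}

# Succeed when version $1 is greater than or equal to version $2.
# Assumption: `sort -V` (version sort) is available.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Report missing tools without aborting, so the operator sees the full list.
missing="$(missing_cmds gcloud terraform gh)"
if [ -n "$missing" ]; then
  echo "Missing tools:" $missing >&2
fi
```

A version string can then be compared against a floor, for example `version_ge "$(terraform version -json | jq -r .terraform_version)" 1.5`, assuming jq is installed.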
Step 1 — Bootstrap GCP project
# Download + run bootstrap (creates deploy SA, tfstate bucket, enables APIs)
curl -fsSL https://raw.githubusercontent.com/keboola/agnes-the-ai-analyst/main/scripts/bootstrap-gcp.sh -o bootstrap-gcp.sh
chmod +x bootstrap-gcp.sh
./bootstrap-gcp.sh <GCP_PROJECT_ID>
Outputs you need from the run:
- SA email agnes-deploy@<project>.iam.gserviceaccount.com
- Bucket gs://agnes-<project>-tfstate
- Key file ./agnes-deploy-<project>-key.json: move to ~/.agnes-keys/ and chmod 600. Never git-commit.
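Moving the key and tightening its permissions can be wrapped in one small function. This is a sketch only; stash_key is a hypothetical name, and the default directory simply mirrors the ~/.agnes-keys/ location mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: move a service-account key into a private directory
# and restrict it to owner read/write (chmod 600).
# Usage: stash_key <key-file> [target-dir]
stash_key() {
  key_src="$1"
  key_dir="${2:-$HOME/.agnes-keys}"
  mkdir -p "$key_dir" || return 1
  mv "$key_src" "$key_dir/" || return 1
  chmod 600 "$key_dir/$(basename "$key_src")"
}

# Example (quote the path; substitute your real project ID):
#   stash_key "./agnes-deploy-my-project-key.json"
```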
Step 2 — Keboola token secret
Skip if data_source is not keboola.
echo -n "<KEBOOLA_STORAGE_TOKEN>" | gcloud secrets create keboola-storage-token \
--data-file=- --replication-policy=automatic --project=<GCP_PROJECT_ID>
Step 3 — Create customer's private infra repo from template
# <customer-org> can be your own org (if Keboola is deploying) or customer's org
gh repo create <customer-org>/agnes-infra-<customer-name> \
--template keboola/agnes-infra-template \
--private \
--clone
cd agnes-infra-<customer-name>
# SA key to GitHub secret
gh secret set GCP_SA_KEY < ~/.agnes-keys/agnes-deploy-<project>-key.json
Step 4 — GitHub environments + auto-merge
# dev environment — no protection
gh api -X PUT repos/<customer-org>/agnes-infra-<customer-name>/environments/dev
# prod environment — branch policy (main only), no reviewer here; add manually via UI
echo '{"deployment_branch_policy":{"protected_branches":true,"custom_branch_policies":false}}' \
| gh api -X PUT repos/<customer-org>/agnes-infra-<customer-name>/environments/prod --input -
# Settings → Environments → prod → Add required reviewer
# Allow auto-merge on this repo (for Renovate)
gh api -X PATCH repos/<customer-org>/agnes-infra-<customer-name> \
-F allow_auto_merge=true -F delete_branch_on_merge=true
# Install Renovate GitHub App on this repo:
# https://github.com/apps/renovate → Configure → <customer-org>/agnes-infra-<customer-name>
Step 5 — Edit terraform/main.tf and terraform.tfvars
# terraform/main.tf — replace both placeholders
backend "gcs" {
bucket = "agnes-<GCP_PROJECT_ID>-tfstate"
prefix = "<customer-name>"
}
cp terraform/terraform.tfvars.example terraform/terraform.tfvars
# Edit, minimum required:
# gcp_project_id = "<GCP_PROJECT_ID>"
# customer_name = "<customer-name>" # lowercase, 2–21 chars
# seed_admin_email = "admin@<customer>.com"
# keboola_stack_url = "https://connection.<region>.gcp.keboola.com/"
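The customer_name constraint noted in the comment (lowercase, 2–21 chars) is easy to check before running Terraform. A minimal sketch, assuming those two rules are the only ones that matter; the module may enforce more, and valid_customer_name is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical pre-flight check for customer_name.
# Succeeds when the name is 2 to 21 characters long and has no uppercase letters.
valid_customer_name() {
  name="$1"
  len=${#name}
  [ "$len" -ge 2 ] && [ "$len" -le 21 ] &&
    [ "$name" = "$(printf '%s' "$name" | tr '[:upper:]' '[:lower:]')" ]
}

# Example:
#   valid_customer_name "acme" || echo "customer_name violates the length/case rule" >&2
```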
Step 6 — First apply (from local, before CI is set up)
cd terraform
export GOOGLE_APPLICATION_CREDENTIALS=~/.agnes-keys/agnes-deploy-<project>-key.json
terraform init
terraform plan # verify counts: ~20 resources to add
terraform apply # type 'yes'
# Expected: ~5 min. Outputs include prod_ip.
terraform output -raw prod_ip
Alternative: commit tfvars + push to main — GitHub Actions apply-dev auto-runs, apply-prod waits for reviewer.
Step 7 — Bootstrap the admin + log in
PROD_IP=$(terraform output -raw prod_ip)
curl -X POST "http://$PROD_IP:8000/auth/bootstrap" \
-H "Content-Type: application/json" \
-d '{"email":"<seed_admin_email from tfvars>","password":"<STRONG_PASSWORD>"}'
Expected: 200 OK + JSON with role: "admin".
Open http://<prod_ip>:8000/login → sign in. Done.
Verify (smoke tests)
curl -s "http://$PROD_IP:8000/api/health" | jq '{status, version, channel}'
curl -s "http://$PROD_IP:8000/api/version" | jq
Both should return JSON. Badge in UI footer shows channel-version · deployed Xs ago.
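On a fresh VM the startup script can still be running when you first probe the health endpoint, so a retry loop is more reliable than a single curl. The sketch below uses a hypothetical retry helper; the attempt count and delay are illustrative values, not project defaults.

```shell
#!/bin/sh
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: retry <attempts> <delay_seconds> <command> [args...]
retry() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    # Sleep only between attempts, not after the last one.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Example: poll for up to about two minutes before giving up.
#   retry 12 10 curl -fsS "http://$PROD_IP:8000/api/health"
```

Using curl -f here makes HTTP error responses count as failures, so the loop keeps waiting until the app actually answers.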
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| PERMISSION_DENIED: monitoring.uptimeCheckConfigs.create | monitoring.googleapis.com not enabled / role missing | Re-run bootstrap-gcp.sh (grants roles/monitoring.editor) |
| At least one of the pre-conditions you specified did not hold | Stale GCS state lock | Remove gs://…/tfstate/<prefix>/default.tflock + retry |
| Error acquiring the state lock | Active apply concurrent | Wait, or terraform force-unlock <LOCK_ID> |
| Docker pull fails ghcr.io/…:<tag>: not found | image_tag pins a branch/tag that was never built | Fix image_tag in tfvars (use stable or dev floating) |
| VM up but /api/health 000 / connection refused | Startup script still running (takes 60–90 s) | Tail /var/log/agnes-startup.log via gcloud compute ssh |
| /auth/bootstrap returns 403 already have passwords | Someone already bootstrapped. | Use normal login at /login/password |
Part 2 — Develop against Agnes
Loop: branch in public repo → auto-built :dev-<slug> image → point a dev VM at that tag → iterate.
Development architecture (mental model)
keboola/agnes-the-ai-analyst (public)
│
│ push branch zs/my-edit
├──▶ release.yml builds ghcr.io/…:dev-zs-my-edit (one image per push to branch)
│
│ push to main
└──▶ release.yml builds :stable + :stable-YYYY.MM.N
<customer-infra-repo>/terraform.tfvars
│
│ dev_instances = [..., { name = "agnes-zs", image_tag = "dev-zs-my-edit" }, ...]
└──▶ apply-dev recreates VM agnes-zs, pinned to that tag, cron pulls new digests every 5 min
Step 1 — Branch and push (in keboola/agnes-the-ai-analyst)
git checkout -b zs/my-edit # or feature/xyz, fix/bar — any slash prefix works
# …edit code…
git commit -am "wip: my experiment"
git push origin zs/my-edit
# → release.yml builds ghcr.io/keboola/agnes-the-ai-analyst:dev-zs-my-edit (~5 min)
Slug rule: <branch> is slugified — non-[a-z0-9-] → -, lowercased, max 50 chars. Leading feature/ is stripped. So:
- zs/my-edit → :dev-zs-my-edit
- feature/alice/dashboard → :dev-alice-dashboard
- fix/issue_42 → :dev-fix-issue-42
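To predict the tag for a branch locally, the slug rule can be approximated in a few lines of shell. This is a sketch of the rule as described here, not the code release.yml actually runs; slugify_branch is a hypothetical name, and the order of operations (strip feature/, lowercase, replace, truncate) is an assumption.

```shell
#!/bin/sh
# Hypothetical re-implementation of the slug rule described above.
slugify_branch() {
  branch="${1#feature/}"              # strip a leading "feature/"
  printf '%s\n' "$branch" |
    tr '[:upper:]' '[:lower:]' |      # lowercase
    sed -E 's/[^a-z0-9-]/-/g' |       # anything outside [a-z0-9-] becomes "-"
    cut -c1-50                        # cap at 50 characters
}

echo ":dev-$(slugify_branch 'feature/alice/dashboard')"   # prints :dev-alice-dashboard
```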
Verify the image exists before continuing:
docker manifest inspect ghcr.io/keboola/agnes-the-ai-analyst:dev-zs-my-edit
Step 2 — Open PR in the customer's infra repo
cd <customer-infra-repo>
git checkout -b add-dev-vm-zs
# Edit terraform/variables.tf or terraform.tfvars:
# dev_instances = [
# { name = "agnes-dev", image_tag = "dev" },
# { name = "agnes-zs-edit", image_tag = "dev-zs-my-edit" }, # <-- added
# ]
git commit -am "add: dev VM pinned to zs/my-edit"
git push origin add-dev-vm-zs
gh pr create
plan.yml comments on PR with diff. Review + merge → apply-dev creates VM (~2 min).
Step 3 — Access your VM
cd terraform
terraform output -json instance_ips # grep your VM's name
# Open http://<ip>:8000/login
# Use the customer's admin credentials (seed + password)
Step 4 — Iterate
Every push to your branch:
- release.yml rebuilds :dev-zs-my-edit
- Cron on your VM (every 5 min) detects new digest, pulls, restarts containers
- Within ~6 min of your push, your VM runs the new code
No manual apply needed.
Step 5 — Merge to main
Open PR on public repo → review → merge. This:
- Builds :stable + :stable-YYYY.MM.N (main)
- Smoke test in CI
- Cron on all prod VMs (every customer!) pulls new :stable within 5 min
Your branch's :dev-zs-my-edit tag persists in GHCR but is no longer updated. Your dev VM still runs the last build of your branch until you change its image_tag.
Step 6 — Clean up
PR in customer infra repo removing the entry:
dev_instances = [
{ name = "agnes-dev", image_tag = "dev" },
- { name = "agnes-zs-edit", image_tag = "dev-zs-my-edit" },
]
Merge → apply-dev destroys the VM + data disk + IP + monitoring resources. Daily snapshot (if enabled) retains data for 30 days.
Common development tasks
| Task | Where | How |
|---|---|---|
| Write code | public repo | Normal git workflow |
| Run tests locally | public repo | TESTING=1 pytest tests/ -v |
| Bump infra module | public repo | Edit infra/modules/customer-instance/, PR, merge, create infra-vX.Y.Z tag |
| Point customer at new module | customer infra repo | Renovate opens PR; or edit ref= in main.tf manually |
| Force-propagate startup script change | customer infra repo | Actions → Terraform Apply → Run workflow → recreate_targets=module.agnes.google_compute_instance.vm["agnes-prod"] |
| Add dev VM for someone else | customer infra repo | Add entry to dev_instances, PR, merge |
| Rotate Keboola token | customer GCP + VM | gcloud secrets versions add keboola-storage-token --data-file=- then SSH + sudo /usr/local/bin/agnes-auto-upgrade.sh (no manual edits to .env) |
| Restart app manually | customer VM | sudo docker compose -f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.host-mount.yml restart app |
| See current deployed version | any | Footer badge on UI, or curl /api/version |
Rules
- Never gcloud compute ssh + edit /opt/agnes/.env: it gets wiped on the next VM recreate. Route all config through Terraform or Secret Manager.
- Never edit inside a VM's /opt/agnes/, for the same reason. If you need a change to survive, it belongs in the module / compose files / tfvars.
- Never bump infra-v* by tagging upstream without first testing on a dev VM: a broken module propagates to all customer repos via Renovate.
- Never delete a customer's data disk without a manual gcloud compute disks snapshot first if in doubt about what's on it.
Part 3 — AI agent checklist
These are guardrails/verification steps an AI agent should follow autonomously.
Before taking destructive action
Run and read, don't assume:
terraform plan # what will actually change?
gh run list --limit 3 # any CI failures?
curl -s <PROD_IP>:8000/api/health | jq .status # is prod actually healthy?
When propagating module bumps
- Read docs/superpowers/plans/2026-04-21-deployment-log.md for context on iteration history.
- Check current ref= in customer infra repo against latest infra-v* tag in upstream.
- Prefer Renovate PR over manual edit; it has an automatic terraform validate gate.
- For startup-script changes (not just module-resource changes), use workflow_dispatch → recreate_targets to force VM recreate. Normal apply won't propagate (ignore_changes).
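Comparing the pinned ref= with the newest upstream infra-v* tag can be scripted. The sketch below assumes the module source line in main.tf ends with a ?ref=infra-vX.Y.Z suffix, which is a guess at the layout; pinned_ref and latest_infra_tag are hypothetical helpers, and sort -V is assumed to be available.

```shell
#!/bin/sh
# Hypothetical helpers for comparing the pinned module ref with upstream tags.

# Print the first infra-v* ref found in a Terraform file.
# Assumption: the module source carries a "?ref=infra-vX.Y.Z" suffix.
pinned_ref() {
  grep -oE 'ref=infra-v[0-9]+(\.[0-9]+)*' "$1" | head -n 1 | cut -d= -f2
}

# Read tag names on stdin and print the highest infra-v* tag in version order.
latest_infra_tag() {
  grep -E '^infra-v[0-9]' | sort -V | tail -n 1
}

# Example (network access required):
#   git ls-remote --tags --refs https://github.com/keboola/agnes-the-ai-analyst.git 'infra-v*' \
#     | sed 's#.*refs/tags/##' | latest_infra_tag
#   pinned_ref terraform/main.tf
```

If the two values differ and no Renovate PR is open, that is a hint to check whether Renovate is installed on the repo.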
When a customer reports "it's broken"
# What version is deployed?
curl -s http://<ip>:8000/api/version | jq
# Recent deploys?
gh run list --repo <customer-org>/agnes-infra-<customer> --limit 5
# VM state?
gcloud compute instances list --project=<customer-project> --filter="name~agnes-"
# App logs (last 50)
gcloud compute ssh agnes-prod --zone=... --project=... \
--command="sudo docker logs agnes-app-1 --tail 50"
# Startup script log (if VM just booted)
gcloud compute ssh agnes-prod --zone=... --project=... \
--command="sudo tail -30 /var/log/agnes-startup.log"
When you're unsure
Prefer non-destructive paths first:
- terraform plan (read-only) before apply
- Add a new resource before deleting an old one
- Snapshot before destroying a disk
- Dev VM before touching prod — always
Common pitfalls to detect
| Pitfall | Check |
|---|---|
| Uncommitted local changes on operator's laptop | git status -s in infra repo |
| Multiple concurrent applies (state lock) | gsutil ls gs://.../tflock |
| image_tag points at non-existent GHCR image | docker manifest inspect ghcr.io/…:<tag> |
| Seed user without password on fresh deploy | curl /api/health |
| Main branch protection prevents direct push | Use PR + auto-merge; never force-push to main |
| Renovate not installed → module bumps don't happen | Check https://github.com/<org>/<repo>/pulls?q=author%3Aapp%2Frenovate |
| /opt/agnes/.env edited manually → drift | git diff against module's expected .env shape |
Safe-to-run anytime
- curl /api/health, curl /api/version: no auth, no side effects
- terraform plan: read-only
- gh run list, gh pr list: read-only
- gcloud ... describe / list: read-only
- docker logs / docker inspect: read-only (on the VM)
Requires thought
- terraform apply: mutates infra
- gh workflow run with recreate_targets: destroys + recreates VMs
- gcloud compute instances delete: unrecoverable after 30 days
- gcloud secrets versions destroy: unrecoverable
- gh repo delete: unrecoverable
Reference links
- Full onboarding: docs/ONBOARDING.md
- Deployment comparison: docs/DEPLOYMENT.md
- Spec (architecture rationale): docs/superpowers/specs/2026-04-21-multi-customer-deployment-spec.md
- Deployment log (what we actually built, with iterations and known limitations): docs/superpowers/plans/2026-04-21-deployment-log.md
- Module source: infra/modules/customer-instance/
- Upstream issues: https://github.com/keboola/agnes-the-ai-analyst/issues