Onboarding a new Agnes instance
End-to-end guide for deploying Agnes into a new GCP project. Target time: under 1 hour.
The target reader is a Keboola ops engineer or a customer with GCP Owner access.
Overview
Every Agnes instance lives in one GCP project per customer, driven by a private infra repo cloned from keboola/agnes-infra-template. The upstream app + TF module is in keboola/agnes-the-ai-analyst; customers do not fork it.
Prerequisites
- GCP project with billing linked (you / customer owns it)
- gcloud CLI authenticated as project Owner
- terraform ≥ 1.5
- gh CLI authenticated
- (optional) docker for local smoke tests
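A quick preflight check can catch a missing tool before the bootstrap starts. The sketch below is illustrative only: `missing_tools` is a hypothetical helper name, and it only tests whether each command is on `PATH`. It does not verify versions (such as terraform ≥ 1.5) or authentication state.

```shell
# Print the name of every listed tool that is not on PATH, one per line.
# Prints nothing when all of them are available.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
  return 0
}

# Example: check the tools named in the prerequisites list.
# missing_tools gcloud terraform gh
```

An empty result means every tool was found; any printed name needs installing before continuing.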
1. Bootstrap GCP
curl -fsSL https://raw.githubusercontent.com/keboola/agnes-the-ai-analyst/main/scripts/bootstrap-gcp.sh -o bootstrap-gcp.sh
chmod +x bootstrap-gcp.sh
./bootstrap-gcp.sh <GCP_PROJECT_ID>
Outputs:
- agnes-deploy@<project>.iam.gserviceaccount.com (Terraform SA with scoped roles)
- gs://agnes-<project>-tfstate (versioned, uniform bucket-level access)
- ./agnes-deploy-<project>-key.json (SA JSON key; store in ~/.agnes-keys/ or a password manager, not git)
Idempotent — safe to re-run.
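To check the bootstrap results for a given project, it can help to derive the expected names first. This is a minimal sketch that only applies the naming patterns listed under Outputs; `agnes_bootstrap_names` is a hypothetical helper, not part of bootstrap-gcp.sh.

```shell
# Print the resource names the bootstrap step is documented to produce
# for a given GCP project ID, as key=value lines.
agnes_bootstrap_names() {
  project="$1"
  printf 'sa=agnes-deploy@%s.iam.gserviceaccount.com\n' "$project"
  printf 'bucket=gs://agnes-%s-tfstate\n' "$project"
  printf 'key=./agnes-deploy-%s-key.json\n' "$project"
}

# Example:
# agnes_bootstrap_names my-gcp-project
```

The printed values can then be compared by hand with what exists in the project and on disk.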
2. Customer's data source secrets
If data_source = "keboola":
echo -n "<KEBOOLA_STORAGE_TOKEN>" | gcloud secrets create keboola-storage-token \
--data-file=- --replication-policy=automatic --project=<GCP_PROJECT_ID>
3. Create private infra repo from template
Create and clone in one step (the --clone flag waits for the template copy to finish; cloning in two steps can race):
gh repo create <customer-org>/agnes-infra-<customer> \
--template keboola/agnes-infra-template \
--private \
--clone
cd agnes-infra-<customer>
Upload the SA key to GitHub secrets:
gh secret set GCP_SA_KEY < ~/.agnes-keys/agnes-deploy-<project>-key.json
Create GitHub environments dev (no protection) and prod (required reviewer, wait timer 5 min, branch main only):
gh api -X PUT repos/<customer-org>/agnes-infra-<customer>/environments/dev
echo '{"wait_timer":300,"deployment_branch_policy":{"protected_branches":true,"custom_branch_policies":false}}' \
| gh api -X PUT repos/<customer-org>/agnes-infra-<customer>/environments/prod --input -
Add reviewers via GitHub UI (Settings → Environments → prod).
4. Configure tfvars and backend
Edit terraform/main.tf:
backend "gcs" {
bucket = "agnes-<GCP_PROJECT_ID>-tfstate"
prefix = "<customer>"
}
Copy the example and fill it in:
cp terraform/terraform.tfvars.example terraform/terraform.tfvars
# Required:
# gcp_project_id = "<GCP_PROJECT_ID>"
# customer_name = "<customer>"
# seed_admin_email = "...@customer.com"
# keboola_stack_url = "https://connection.<region>.gcp.keboola.com/"
#
# Optional (module infra-v1.4.0+):
# runtime_secrets = ["keboola-storage-token"] # empty if non-keboola data_source
# firewall_ssh_source_ranges = ["35.235.240.0/20"] # IAP range; "0.0.0.0/0" if public SSH
# notification_channel_ids = ["projects/<p>/notificationChannels/<id>"]
# compose_ref = "main" # or a "stable-YYYY.MM.N" tag
See the module README for the full variable schema.
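Because the example file ships with the required variables commented out, it is easy to leave one unset. The sketch below lists required keys that have no uncommented assignment. It assumes the four names under "Required:" above are the complete required set and that each assignment starts its own line; `missing_tfvars` is a hypothetical helper name.

```shell
# Print every required variable that has no uncommented assignment in the
# given tfvars file. Prints nothing when all four are set.
missing_tfvars() {
  file="$1"
  for key in gcp_project_id customer_name seed_admin_email keboola_stack_url; do
    grep -Eq "^[[:space:]]*${key}[[:space:]]*=" "$file" || printf '%s\n' "$key"
  done
  return 0
}

# Example:
# missing_tfvars terraform/terraform.tfvars
```

This only checks for presence, not for valid values; `terraform plan` remains the real validation.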
5. First apply
cd terraform
export GOOGLE_APPLICATION_CREDENTIALS=~/.agnes-keys/agnes-deploy-<project>-key.json
terraform init
terraform plan
terraform apply
Alternatively, commit terraform.tfvars, push to main, and let GitHub Actions run the apply:
git add . && git commit -m "initial: <customer> deployment" && git push origin main
# CI runs apply-dev, waits for prod reviewer, then apply-prod
Output: prod_ip, the external IP of the prod VM.
6. Bootstrap admin user
On first boot the app auto-seeds an admin user from SEED_ADMIN_EMAIL — but without a password, which means nobody can log in yet. Activate it via POST /auth/bootstrap:
PROD_IP=$(terraform output -raw prod_ip)
curl -X POST "http://$PROD_IP:8000/auth/bootstrap" \
-H "Content-Type: application/json" \
-d '{"email":"<seed_admin_email from tfvars>","password":"<STRONG_PASSWORD>"}'
If the email matches the seed user, the endpoint sets its password and promotes to admin. If it doesn't match, a new admin is created. The endpoint self-deactivates once any user has a password — so do this before exposing the URL.
Log in: http://<prod_ip>:8000/login with the email + password you just set.
Security: The bootstrap endpoint is only disabled by a real password being set. Running terraform destroy + apply recreates the seed user and re-opens bootstrap — so if you destroy/recreate, a new attacker window opens until you re-run bootstrap.
7. DNS + TLS (optional)
For HTTPS, set in terraform.tfvars:
prod_instance = {
...
tls_mode = "caddy"
domain = "agnes.<customer>.com"
}
Then create a DNS A record pointing agnes.<customer>.com → prod_ip. Caddy will then obtain a Let's Encrypt certificate automatically.
8. Smoke test
PROD_IP=$(cd terraform && terraform output -raw prod_ip)
# Health
curl "http://$PROD_IP:8000/api/health" | jq '.status' # "healthy" or "degraded"
# First sync (populates data from Keboola / other source)
curl -X POST "http://$PROD_IP:8000/api/sync/trigger" \
-H "Authorization: Bearer $ADMIN_JWT"
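Right after an apply the app may not be up yet, so a single health request can fail spuriously. A small polling loop is one way to handle that. The sketch below is an assumption-laden example: `retry` and `health_is_healthy` are hypothetical helpers, and the check treats only "healthy" as success, so "degraded" keeps polling.

```shell
# Run a command until it succeeds, up to <attempts> times, sleeping <delay>
# seconds between tries. Returns 0 on the first success, 1 if every try fails.
retry() {
  attempts="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Succeeds only when /api/health on the given IP reports "healthy".
health_is_healthy() {
  [ "$(curl -fsS "http://$1:8000/api/health" | jq -r '.status')" = "healthy" ]
}

# Example: poll for up to about 5 minutes.
# retry 30 10 health_is_healthy "$PROD_IP"
```

The attempt count and delay are arbitrary illustrative values; adjust them to how long a first boot takes in practice.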
9. Monitoring + backup (recommended)
- Cloud Monitoring alert on /api/health status != "healthy" for > 5 min
- Daily snapshot of /data PD: gcloud compute resource-policies create snapshot-schedule ...
- Slack webhook from Cloud Monitoring for alerts
(These are follow-ups — not required for first deploy.)
Ongoing maintenance
- App auto-upgrades (cron every 5 min) to latest :stable if upgrade_mode = "auto". Else Renovate will open PR on new stable-YYYY.MM.N.
- Infra module upgrade: change ref=infra-vX.Y.Z in terraform/main.tf, PR → plan → merge → apply. (Renovate opens these PRs automatically when enabled.)
- Add dev VM for a branch: add entry to dev_instances list with image_tag = "dev-feature-xyz", PR, merge, apply.
- Token rotation: gcloud secrets versions add keboola-storage-token --data-file=- then run the auto-upgrade script on each VM:
  gcloud compute ssh agnes-prod --zone=... --project=... --command="sudo /usr/local/bin/agnes-auto-upgrade.sh"
  Or restart containers directly: sudo docker compose -f ... restart app.
Propagating module (startup-script) changes
Important gotcha: The customer-instance module has lifecycle { ignore_changes = [metadata_startup_script] } on VMs — intentional, so terraform apply doesn't reboot VMs on every rerun. The consequence is that startup-script changes are not picked up on a normal terraform apply.
After bumping the module ref (e.g. ref=infra-v1.5.0 → infra-v1.6.0), do one of:
Option A — Workflow dispatch with recreate_targets (recommended)
apply.yml has a workflow_dispatch input recreate_targets that takes a comma-separated list of TF resource addresses and passes each as -replace= to terraform apply. Use this to destroy + recreate VMs with the new startup script, without any SSH.
Actions → Terraform Apply → Run workflow → recreate_targets:
module.agnes.google_compute_instance.vm["agnes-dev"],module.agnes.google_compute_instance.vm["agnes-prod"]
The workflow routes dev targets to apply-dev and prod targets to apply-prod, so the usual dev-first + prod-reviewer gate still applies. Persistent data disks and static IPs are separate resources and are preserved across replacement — only the VM (and its fresh boot disk) is recreated.
Downtime: ~2 min per VM, sequential. Data loss: none (persistent disk keeps /data; static IP keeps URL stable).
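The recreate_targets value is easy to mistype because of the quoted map keys. The sketch below builds it from VM names and shows how a comma-separated list could be expanded into `-replace=` flags. It is not the actual apply.yml logic: both function names are hypothetical, and the address pattern is copied from the example above.

```shell
# Build the recreate_targets value for a list of VM names.
recreate_targets() {
  out=""
  for vm in "$@"; do
    addr="module.agnes.google_compute_instance.vm[\"$vm\"]"
    out="${out:+$out,}$addr"
  done
  printf '%s\n' "$out"
}

# Expand a comma-separated target list into one -replace= flag per line.
# Globbing is disabled while splitting because addresses contain brackets.
replace_flags() {
  old_ifs="$IFS"; IFS=','
  set -f
  for addr in $1; do
    printf '%s\n' "-replace=$addr"
  done
  set +f; IFS="$old_ifs"
}

# Example:
# recreate_targets agnes-dev agnes-prod
```

The splitting is deliberately simple: it does not trim whitespace around commas, so the list should be written without spaces.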
Option B — Local terraform (emergency)
export GOOGLE_APPLICATION_CREDENTIALS=~/.agnes-keys/agnes-deploy-<project>-key.json
cd terraform
terraform apply -replace='module.agnes.google_compute_instance.vm["agnes-prod"]'
Same semantics as Option A, but no CI audit trail. Use only when CI is broken.
Do NOT
Do not manually edit /opt/agnes/.env or the docker-compose overlay files on a running VM. Any such change is lost on the next VM recreate, and it drifts from Terraform state. If a value needs changing, route it through a module variable or a module upgrade.
Restoring from backup
Daily snapshots of each data disk are created automatically (module ≥ infra-v1.3.0). Retention: 30 days.
To restore:
# List snapshots for a specific disk
gcloud compute snapshots list --project=<GCP_PROJECT_ID> \
--filter="sourceDisk~agnes-prod-data"
# Create a new disk from a snapshot
gcloud compute disks create agnes-prod-data-restored \
--source-snapshot=<SNAPSHOT_NAME> \
--zone=europe-west1-b \
--type=pd-ssd \
--project=<GCP_PROJECT_ID>
# Stop the VM, swap disks:
gcloud compute instances stop agnes-prod --zone=...
gcloud compute instances detach-disk agnes-prod --disk=agnes-prod-data --zone=...
gcloud compute instances attach-disk agnes-prod --disk=agnes-prod-data-restored --device-name=data --zone=...
gcloud compute instances start agnes-prod --zone=...
# Verify /api/health, then optionally delete the old disk
For Terraform state consistency after manual disk swap, you may need terraform state rm + terraform import for the disk resource.
Monitoring alerts
Module ≥ infra-v1.3.0 creates per-VM uptime checks + alert policies. To receive notifications, wire a Monitoring notification channel:
# Email channel
gcloud alpha monitoring channels create \
--display-name="Agnes ops email" \
--type=email \
--channel-labels=email_address=ops@<customer>.com \
--project=<GCP_PROJECT_ID>
# Get the channel ID, then in terraform.tfvars:
# notification_channel_ids = ["projects/<project>/notificationChannels/<id>"]
# terraform apply
For Slack integrations, use type slack with a webhook URL.
Keeping the template up-to-date (maintainer note)
New customers clone keboola/agnes-infra-template — so the template's terraform/main.tf must always point at the latest stable infra-v* tag. Two cooperating mechanisms keep it current:
- Upstream release hook (.github/workflows/propagate-infra-tag.yml in keboola/agnes-the-ai-analyst): on push of any infra-v* tag, opens a PR in the template repo that bumps the module ref. Requires a repository secret TEMPLATE_REPO_TOKEN (fine-grained PAT or GitHub App token with Contents:write + Pull requests:write on the template repo). Without the secret, the job is skipped (fail-soft).
- Renovate on the template repo: tracks infra-v* tags on polling cycles as a fallback when the release hook is unavailable. Config is already in renovate.json.
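The core of the release hook is a one-line rewrite of the module ref. The sketch below shows one way such a bump could be done with sed. It is not the actual workflow: `bump_module_ref` is a hypothetical name, and it assumes the ref appears in the file as `ref=infra-vX.Y.Z`.

```shell
# Rewrite every "ref=infra-v<major>.<minor>.<patch>" occurrence in a file to
# the given tag. Goes through a temp file because in-place sed flags differ
# between GNU and BSD sed.
bump_module_ref() {
  file="$1"; new_tag="$2"
  tmp="$(mktemp)" || return 1
  sed -E "s/ref=infra-v[0-9]+\.[0-9]+\.[0-9]+/ref=${new_tag}/g" "$file" > "$tmp" \
    && cat "$tmp" > "$file"
  status=$?
  rm -f "$tmp"
  return "$status"
}

# Example:
# bump_module_ref terraform/main.tf infra-v1.6.0
```

Committing the change and opening the PR are left out here; the fail-soft behaviour described above would sit around a call like this.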
For both to land automatically (no human clicks needed):
- allow_auto_merge: true on the template repo (set via gh api -X PATCH repos/keboola/agnes-infra-template -F allow_auto_merge=true; -F sends a typed boolean, whereas -f would send the string "true")
- automerge: true in renovate.json for minor+patch (already configured)
- CI validate gate (.github/workflows/validate.yml in the template repo; runs terraform init -backend=false + terraform validate on the PR). Renovate's platformAutomerge waits for this check to pass before merging.
- Major bumps stay manual (labeled breaking, automerge: false).
Customer-owned infra repos (e.g. keboola/agnes-infra-keboola) share the same Renovate config but typically leave patch/minor auto-merge disabled (because terraform apply touches live infrastructure; customers want a human to approve each bump). The template repo is different — it holds no state and doesn't touch GCP.
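The auto-merge rules above depend on whether a bump is major, minor, or patch. Renovate makes that decision itself; the sketch below only illustrates the classification for `infra-vX.Y.Z` tags. `bump_kind` is a hypothetical helper, and it assumes both tags are well-formed.

```shell
# Classify the change between two infra-vX.Y.Z tags as major, minor, patch,
# or none, by comparing the numeric components left to right.
bump_kind() {
  old="${1#infra-v}"; new="${2#infra-v}"
  old_major="${old%%.*}"; new_major="${new%%.*}"
  old_rest="${old#*.}";   new_rest="${new#*.}"
  old_minor="${old_rest%%.*}"; new_minor="${new_rest%%.*}"
  old_patch="${old_rest#*.}";  new_patch="${new_rest#*.}"
  if   [ "$old_major" -ne "$new_major" ]; then echo major
  elif [ "$old_minor" -ne "$new_minor" ]; then echo minor
  elif [ "$old_patch" -ne "$new_patch" ]; then echo patch
  else echo none
  fi
}

# Example:
# bump_kind infra-v1.5.0 infra-v1.6.0
```

Under the rules described above, a "major" result is the case that would stay manual, while "minor" and "patch" are the ones eligible for auto-merge on the template repo.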
One-time setup checklist
- Install Renovate GitHub App on keboola/agnes-infra-template and on each keboola/agnes-infra-<customer> repo
- Create a fine-grained PAT with Contents:write + Pull requests:write on the template repo
- Add it as TEMPLATE_REPO_TOKEN secret on keboola/agnes-the-ai-analyst
- Verify: tag a test infra-vX.Y.Z in upstream → PR appears in template → CI validates → auto-merges
Decommission
cd terraform
terraform destroy
Then delete:
- GCS bucket gs://agnes-<project>-tfstate (or keep for audit)
- Service account agnes-deploy@...
- Secret Manager secrets (keboola-storage-token, agnes-<customer>-jwt-secret)
- GitHub private repo <customer-org>/agnes-infra-<customer>
Troubleshooting
See keboola/agnes-the-ai-analyst issues and docs.