agnes-the-ai-analyst/docs/ONBOARDING.md
ZdenekSrotyr 1a55167234 docs: workflow-driven VM recreate for startup-script propagation
- ONBOARDING.md: replace 'propagating module changes' section with two
  explicit options — workflow_dispatch with recreate_targets (recommended,
  CI audit trail), or local terraform apply -replace (emergency). Adds a
  'do not' section banning manual .env edits on VMs.
- deployment-log.md: iteration 4 summary (version badge + module v1.5.0 +
  workflow_dispatch).
2026-04-21 20:24:31 +02:00


Onboarding a new Agnes instance

End-to-end guide for deploying Agnes into a new GCP project. Target time: under 1 hour.

The target reader is a Keboola ops engineer or a customer with GCP Owner access.

Overview

Every Agnes instance lives in its own GCP project, one per customer, and is driven by a private infra repo created from keboola/agnes-infra-template. The upstream app and Terraform module live in keboola/agnes-the-ai-analyst; customers do not fork it.

Prerequisites

  • GCP project with billing linked (you / customer owns it)
  • gcloud CLI authenticated as project Owner
  • terraform ≥ 1.5
  • gh CLI authenticated
  • (optional) docker for local smoke tests

1. Bootstrap GCP

curl -fsSL https://raw.githubusercontent.com/keboola/agnes-the-ai-analyst/main/scripts/bootstrap-gcp.sh -o bootstrap-gcp.sh
chmod +x bootstrap-gcp.sh
./bootstrap-gcp.sh <GCP_PROJECT_ID>

Outputs:

  • agnes-deploy@<project>.iam.gserviceaccount.com (Terraform SA with scoped roles)
  • gs://agnes-<project>-tfstate (versioned, uniform bucket-level access)
  • ./agnes-deploy-<project>-key.json (SA JSON key — store in ~/.agnes-keys/ or password manager, not git)

Idempotent — safe to re-run.
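To keep the downloaded key out of any git checkout, something like the following could be used. It is a minimal sketch: stash_key is a hypothetical helper (not part of bootstrap-gcp.sh), and the default directory simply follows the ~/.agnes-keys/ suggestion above.

```shell
#!/usr/bin/env bash
# Sketch: move a downloaded SA key into a private directory outside any git checkout.
# stash_key is a hypothetical helper; ~/.agnes-keys is the location suggested above.
stash_key() {
  local src="$1" dest_dir="${2:-$HOME/.agnes-keys}"
  # Create the directory if needed and restrict it to the current user.
  mkdir -p "$dest_dir" && chmod 700 "$dest_dir" || return 1
  mv "$src" "$dest_dir/" || return 1
  # Owner read/write only on the key file itself.
  chmod 600 "$dest_dir/$(basename "$src")"
}
```

Example call: stash_key ./agnes-deploy-<project>-key.json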

2. Customer's data source secrets

If data_source = "keboola":

echo -n "<KEBOOLA_STORAGE_TOKEN>" | gcloud secrets create keboola-storage-token \
    --data-file=- --replication-policy=automatic --project=<GCP_PROJECT_ID>

3. Create private infra repo from template

Create and clone in one step (the --clone flag waits for the template copy to finish; cloning in two steps can race):

gh repo create <customer-org>/agnes-infra-<customer> \
    --template keboola/agnes-infra-template \
    --private \
    --clone
cd agnes-infra-<customer>

Upload the SA key to GitHub secrets:

gh secret set GCP_SA_KEY < ~/.agnes-keys/agnes-deploy-<project>-key.json

Create GitHub environments dev (no protection) and prod (required reviewer, wait timer 5 min, branch main only):

gh api -X PUT repos/<customer-org>/agnes-infra-<customer>/environments/dev
echo '{"wait_timer":300,"deployment_branch_policy":{"protected_branches":true,"custom_branch_policies":false}}' \
  | gh api -X PUT repos/<customer-org>/agnes-infra-<customer>/environments/prod --input -

Add reviewers via GitHub UI (Settings → Environments → prod).

4. Configure tfvars and backend

Edit terraform/main.tf:

backend "gcs" {
  bucket = "agnes-<GCP_PROJECT_ID>-tfstate"
  prefix = "<customer>"
}

Copy the example and fill it in:

cp terraform/terraform.tfvars.example terraform/terraform.tfvars
# Required:
#   gcp_project_id    = "<GCP_PROJECT_ID>"
#   customer_name     = "<customer>"
#   seed_admin_email  = "...@customer.com"
#   keboola_stack_url = "https://connection.<region>.gcp.keboola.com/"
#
# Optional (module infra-v1.4.0+):
#   runtime_secrets            = ["keboola-storage-token"]  # empty if non-keboola data_source
#   firewall_ssh_source_ranges = ["35.235.240.0/20"]        # IAP range; "0.0.0.0/0" if public SSH
#   notification_channel_ids   = ["projects/<p>/notificationChannels/<id>"]
#   compose_ref                = "main"                     # or a "stable-YYYY.MM.N" tag

See the module README for the full variable schema.
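Before the first apply, it can help to confirm that none of the required variables were left commented out. The sketch below assumes the four keys listed under "Required" above are the complete required set; missing_tfvars is a hypothetical helper, and the grep only detects an uncommented assignment, not whether the value is valid.

```shell
#!/usr/bin/env bash
# Sketch: report required variables that are missing from a tfvars file.
# missing_tfvars is a hypothetical helper; the key list mirrors the "Required" comment above.
missing_tfvars() {
  local file="$1" key
  for key in gcp_project_id customer_name seed_admin_email keboola_stack_url; do
    # Match an uncommented assignment such as: customer_name = "acme"
    grep -Eq "^[[:space:]]*${key}[[:space:]]*=" "$file" || echo "$key"
  done
  return 0
}

# Prints one missing key per line; prints nothing when all four are set.
if [ -f terraform/terraform.tfvars ]; then
  missing_tfvars terraform/terraform.tfvars
fi
```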

5. First apply

cd terraform
export GOOGLE_APPLICATION_CREDENTIALS=~/.agnes-keys/agnes-deploy-<project>-key.json
terraform init
terraform plan
terraform apply

Alternatively, commit terraform.tfvars, push to main, and let GitHub Actions run the apply:

git add . && git commit -m "initial: <customer> deployment" && git push origin main
# CI runs apply-dev, waits for prod reviewer, then apply-prod

Output: prod_ip = external IP.

6. Bootstrap admin user

On first boot the app auto-seeds an admin user from SEED_ADMIN_EMAIL — but without a password, which means nobody can log in yet. Activate it via POST /auth/bootstrap:

PROD_IP=$(terraform output -raw prod_ip)
curl -X POST "http://$PROD_IP:8000/auth/bootstrap" \
    -H "Content-Type: application/json" \
    -d '{"email":"<seed_admin_email from tfvars>","password":"<STRONG_PASSWORD>"}'

If the email matches the seed user, the endpoint sets that user's password and promotes it to admin. If it doesn't match, a new admin user is created. The endpoint deactivates itself once any user has a password, so run this before exposing the URL.

Log in: http://<prod_ip>:8000/login with the email + password you just set.

Security: The bootstrap endpoint is only disabled by a real password being set. Running terraform destroy + apply recreates the seed user and re-opens bootstrap — so if you destroy/recreate, a new attacker window opens until you re-run bootstrap.

7. DNS + TLS (optional)

For HTTPS, set in terraform.tfvars:

prod_instance = {
  ...
  tls_mode = "caddy"
  domain   = "agnes.<customer>.com"
}

Then create a DNS A record pointing agnes.<customer>.com to prod_ip. Caddy will then automatically issue a Let's Encrypt certificate.
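Certificate issuance typically cannot succeed until the record resolves publicly, so a quick check before switching tls_mode may save a debugging round. This is a sketch only: dns_points_to is a hypothetical helper and it assumes the dig CLI is installed.

```shell
#!/usr/bin/env bash
# Sketch: check that a domain's A record resolves to the expected IP.
# dns_points_to is a hypothetical helper; it relies on the dig CLI being installed.
dns_points_to() {
  local domain="$1" expected_ip="$2"
  # dig +short prints one resolved address per line; -Fx demands an exact whole-line match.
  dig +short A "$domain" | grep -Fxq "$expected_ip"
}
```

Example call: dns_points_to "agnes.<customer>.com" "$PROD_IP" && echo "DNS ready"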

8. Smoke test

PROD_IP=$(cd terraform && terraform output -raw prod_ip)

# Health
curl "http://$PROD_IP:8000/api/health" | jq '.status'  # "healthy" or "degraded"

# First sync (populates data from Keboola / other source).
# ADMIN_JWT must hold a JWT for an admin user.
curl -X POST "http://$PROD_IP:8000/api/sync/trigger" \
     -H "Authorization: Bearer $ADMIN_JWT"
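Right after an apply the VM may still be booting, so a single health request can fail spuriously. A retry loop along these lines could be used instead. It is a sketch under assumptions: wait_healthy, its arguments, and the retry defaults are hypothetical, and it treats "degraded" as not yet healthy.

```shell
#!/usr/bin/env bash
# Sketch: poll /api/health until it reports "healthy" or attempts run out.
# wait_healthy, its arguments, and the retry defaults are hypothetical.
wait_healthy() {
  local base_url="$1" attempts="${2:-30}" delay="${3:-10}" i body
  for ((i = 1; i <= attempts; i++)); do
    # -f: treat HTTP errors as failures; -sS: quiet, but still report errors.
    body="$(curl -fsS --max-time 5 "$base_url/api/health" 2>/dev/null)" || body=""
    # Crude match on the status field; use jq for anything stricter.
    if printf '%s' "$body" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"healthy"'; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if (( i < attempts )); then sleep "$delay"; fi
  done
  echo "not healthy after $attempts attempt(s)" >&2
  return 1
}
```

Example call: wait_healthy "http://$PROD_IP:8000" 30 10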

Next steps:

  • Cloud Monitoring alert on /api/health status != "healthy" for > 5 min
  • Daily snapshot of /data PD: gcloud compute resource-policies create snapshot-schedule ...
  • Slack webhook from Cloud Monitoring for alerts

(These are follow-ups — not required for first deploy.)

Ongoing maintenance

  • App auto-upgrades (cron every 5 min) to the latest :stable image if upgrade_mode = "auto". Otherwise, Renovate opens a PR for each new stable-YYYY.MM.N tag.
  • Infra module upgrade: change ref=infra-vX.Y.Z in terraform/main.tf, PR → plan → merge → apply. (Renovate opens these PRs automatically when enabled.)
  • Add a dev VM for a branch: add an entry to the dev_instances list with image_tag = "dev-feature-xyz", then PR, merge, apply.
  • Token rotation: gcloud secrets versions add keboola-storage-token --data-file=- then run the auto-upgrade script on each VM:
    gcloud compute ssh agnes-prod --zone=... --project=... --command="sudo /usr/local/bin/agnes-auto-upgrade.sh"
    
    Or restart containers directly: sudo docker compose -f ... restart app.

Propagating module (startup-script) changes

Important gotcha: The customer-instance module has lifecycle { ignore_changes = [metadata_startup_script] } on VMs — intentional, so terraform apply doesn't reboot VMs on every rerun. The consequence is that startup-script changes are not picked up on a normal terraform apply.

After bumping the module ref (e.g. ref=infra-v1.5.0 → ref=infra-v1.6.0), do one of the following:

Option A: workflow_dispatch with recreate_targets (recommended)

apply.yml has a workflow_dispatch input recreate_targets that takes a comma-separated list of TF resource addresses and passes each as -replace= to terraform apply. Use this to destroy + recreate VMs with the new startup script, without any SSH.

Actions → Terraform Apply → Run workflow → recreate_targets:
  module.agnes.google_compute_instance.vm["agnes-dev"],module.agnes.google_compute_instance.vm["agnes-prod"]
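To illustrate how a comma-separated recreate_targets value maps onto -replace= flags, here is a minimal sketch. It is not taken from apply.yml; build_replace_args is a hypothetical name, and the real workflow may split the input differently.

```shell
#!/usr/bin/env bash
# Sketch: turn a comma-separated recreate_targets value into terraform -replace flags.
# build_replace_args is a hypothetical helper, not taken from apply.yml.
build_replace_args() {
  local targets="$1" addr
  local -a addrs
  # Split on commas only, so the quotes and brackets inside each address stay intact.
  IFS=',' read -r -a addrs <<< "$targets"
  for addr in "${addrs[@]}"; do
    # Skip empty entries caused by a trailing or doubled comma.
    [ -n "$addr" ] && printf -- '-replace=%s\n' "$addr"
  done
  return 0
}

targets='module.agnes.google_compute_instance.vm["agnes-dev"],module.agnes.google_compute_instance.vm["agnes-prod"]'
# Prints one -replace=<address> flag per line.
build_replace_args "$targets"
```

The flags can then be collected with mapfile -t args < <(build_replace_args "$targets") and passed as terraform apply "${args[@]}", which keeps each address a single argument despite the embedded quotes.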

The workflow routes dev targets to apply-dev and prod targets to apply-prod, so the usual dev-first + prod-reviewer gate still applies. Persistent data disks and static IPs are separate resources and are preserved across replacement — only the VM (and its fresh boot disk) is recreated.

Downtime: ~2 min per VM, sequential. Data loss: none (persistent disk keeps /data; static IP keeps URL stable).

Option B — Local terraform (emergency)

export GOOGLE_APPLICATION_CREDENTIALS=~/.agnes-keys/agnes-deploy-<project>-key.json
cd terraform
terraform apply -replace='module.agnes.google_compute_instance.vm["agnes-prod"]'

Same semantics as Option A, but no CI audit trail. Use only when CI is broken.

Do NOT

Do not manually edit /opt/agnes/.env or the docker-compose overlay files on a running VM. Any such change is lost on the next VM recreate, and it drifts from Terraform state. If a value needs changing, route it through a module variable or a module upgrade.

Restoring from backup

Daily snapshots of each data disk are created automatically (module ≥ infra-v1.3.0). Retention: 30 days.

To restore:

# List snapshots for a specific disk
gcloud compute snapshots list --project=<GCP_PROJECT_ID> \
    --filter="sourceDisk~agnes-prod-data"

# Create a new disk from a snapshot
gcloud compute disks create agnes-prod-data-restored \
    --source-snapshot=<SNAPSHOT_NAME> \
    --zone=europe-west1-b \
    --type=pd-ssd \
    --project=<GCP_PROJECT_ID>

# Stop the VM, swap disks:
gcloud compute instances stop agnes-prod --zone=...
gcloud compute instances detach-disk agnes-prod --disk=agnes-prod-data --zone=...
gcloud compute instances attach-disk agnes-prod --disk=agnes-prod-data-restored --device-name=data --zone=...
gcloud compute instances start agnes-prod --zone=...

# Verify /api/health, then optionally delete the old disk
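The four swap commands above can be wrapped in one function so that the zone and project are typed only once. This is a sketch: swap_data_disk and the DRY_RUN switch are hypothetical, and by default it only prints the gcloud commands instead of running them.

```shell
#!/usr/bin/env bash
# Sketch: the stop / detach / attach / start sequence as one function.
# swap_data_disk and DRY_RUN are hypothetical; DRY_RUN=1 (the default) only prints the commands.
swap_data_disk() {
  local vm="$1" old_disk="$2" new_disk="$3" zone="$4" project="$5"
  local -a run=()
  # In dry-run mode each command is prefixed with "echo" rather than executed.
  if [ "${DRY_RUN:-1}" = "1" ]; then run=(echo); fi
  # Chain with && so a failed step stops the sequence.
  "${run[@]}" gcloud compute instances stop "$vm" --zone="$zone" --project="$project" &&
    "${run[@]}" gcloud compute instances detach-disk "$vm" --disk="$old_disk" --zone="$zone" --project="$project" &&
    "${run[@]}" gcloud compute instances attach-disk "$vm" --disk="$new_disk" --device-name=data --zone="$zone" --project="$project" &&
    "${run[@]}" gcloud compute instances start "$vm" --zone="$zone" --project="$project"
}
```

Example call: review the output of swap_data_disk agnes-prod agnes-prod-data agnes-prod-data-restored europe-west1-b <GCP_PROJECT_ID> first, then repeat it with DRY_RUN=0 in front to execute.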

To keep Terraform state consistent after a manual disk swap, you may need to run terraform state rm followed by terraform import for the disk resource.

Monitoring alerts

Module ≥ infra-v1.3.0 creates per-VM uptime checks + alert policies. To receive notifications, wire a Monitoring notification channel:

# Email channel
gcloud alpha monitoring channels create \
    --display-name="Agnes ops email" \
    --type=email \
    --channel-labels=email_address=ops@<customer>.com \
    --project=<GCP_PROJECT_ID>

# Get the channel ID, then in terraform.tfvars:
#   notification_channel_ids = ["projects/<project>/notificationChannels/<id>"]
# terraform apply

For Slack integrations, use type slack with a webhook URL.

Decommission

cd terraform
terraform destroy

Then delete:

  • GCS bucket gs://agnes-<project>-tfstate (or keep for audit)
  • Service account agnes-deploy@...
  • Secret Manager secrets (keboola-storage-token, agnes-<customer>-jwt-secret)
  • GitHub private repo <customer-org>/agnes-infra-<customer>

Troubleshooting

See keboola/agnes-the-ai-analyst issues and docs.