From 3e9213bfc4525183916e575d2763e5adb7019cda Mon Sep 17 00:00:00 2001
From: ZdenekSrotyr
Date: Tue, 21 Apr 2026 19:06:20 +0200
Subject: [PATCH] docs(onboarding): add module propagation, backup restore, monitoring setup
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- 'Propagating module changes' — explains ignore_changes + -replace workflow
- 'Restoring from backup' — step-by-step disk swap from daily snapshot
- 'Monitoring alerts' — wiring notification channels
---
 docs/ONBOARDING.md | 82 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 80 insertions(+), 2 deletions(-)

diff --git a/docs/ONBOARDING.md b/docs/ONBOARDING.md
index 257da7b..1bc827c 100644
--- a/docs/ONBOARDING.md
+++ b/docs/ONBOARDING.md
@@ -158,9 +158,87 @@ curl -X POST "http://$PROD_IP:8000/api/sync/trigger" \
 ## Ongoing maintenance
 
 - **App auto-upgrades** (cron every 5 min) to latest `:stable` if `upgrade_mode = "auto"`. Else Renovate will open PR on new `stable-YYYY.MM.N`.
-- **Infra module upgrade:** change `ref=infra-vX.Y.Z` in `terraform/main.tf`, PR → plan → merge → apply.
+- **Infra module upgrade:** change `ref=infra-vX.Y.Z` in `terraform/main.tf`, PR → plan → merge → apply. (Renovate opens these PRs automatically when enabled.)
 - **Add dev VM for a branch:** add entry to `dev_instances` list with `image_tag = "dev-feature-xyz"`, PR, merge, apply.
-- **Token rotation:** `gcloud secrets versions add keboola-storage-token --data-file=-` then `sudo docker compose restart app` on each VM.
+- **Token rotation:** `gcloud secrets versions add keboola-storage-token --data-file=-` then run the auto-upgrade script on each VM:
+  ```bash
+  gcloud compute ssh agnes-prod --zone=... --project=... --command="sudo /usr/local/bin/agnes-auto-upgrade.sh"
+  ```
+  Or restart containers directly: `sudo docker compose -f ... restart app`.
+
+## Propagating module (startup-script) changes
+
+**Important gotcha:** The `customer-instance` module has `lifecycle { ignore_changes = [metadata_startup_script] }` on VMs — this is intentional so `terraform apply` doesn't reboot VMs on every rerun. The consequence is that **changes inside the startup script are not picked up on a normal `terraform apply`**.
+
+To propagate a startup-script change (for example, after bumping `ref=infra-v1.3.0`):
+
+```bash
+# VM is recreated; boot disk is fresh; persistent data disk is preserved
+terraform apply -replace='module.agnes.google_compute_instance.vm["agnes-prod"]'
+```
+
+Downtime: ~2 minutes. The persistent data disk (where `/data` lives) is *not* recreated — only the VM. Startup script re-runs on the new VM with the latest template content, and your data is still there.
+
+Alternative (less disruptive): hot-patch the VM via SSH:
+
+```bash
+gcloud compute ssh agnes-prod --zone=... --project=... --command="sudo bash -c '
+  cd /opt/agnes
+  curl -fsSL https://raw.githubusercontent.com/keboola/agnes-the-ai-analyst/main/docker-compose.prod.yml -o docker-compose.prod.yml
+  curl -fsSL https://raw.githubusercontent.com/keboola/agnes-the-ai-analyst/main/docker-compose.host-mount.yml -o docker-compose.host-mount.yml
+  docker compose -f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.host-mount.yml up -d
+'"
+```
+
+This preserves container state but won't re-install cron / rebuild persistent disk layout.
+
+## Restoring from backup
+
+Daily snapshots of each data disk are created automatically (module ≥ `infra-v1.3.0`). Retention: 30 days.
+
+To restore:
+
+```bash
+# List snapshots for a specific disk
+gcloud compute snapshots list --project= \
+  --filter="sourceDisk~agnes-prod-data"
+
+# Create a new disk from a snapshot
+gcloud compute disks create agnes-prod-data-restored \
+  --source-snapshot= \
+  --zone=europe-west1-b \
+  --type=pd-ssd \
+  --project=
+
+# Stop the VM, swap disks:
+gcloud compute instances stop agnes-prod --zone=...
+gcloud compute instances detach-disk agnes-prod --disk=agnes-prod-data --zone=...
+gcloud compute instances attach-disk agnes-prod --disk=agnes-prod-data-restored --device-name=data --zone=...
+gcloud compute instances start agnes-prod --zone=...
+
+# Verify /api/health, then optionally delete the old disk
+```
+
+For Terraform state consistency after manual disk swap, you may need `terraform state rm` + `terraform import` for the disk resource.
+
+## Monitoring alerts
+
+Module ≥ `infra-v1.3.0` creates per-VM uptime checks + alert policies. To receive notifications, wire a Monitoring notification channel:
+
+```bash
+# Email channel
+gcloud alpha monitoring channels create \
+  --display-name="Agnes ops email" \
+  --type=email \
+  --channel-labels=email_address=ops@.com \
+  --project=
+
+# Get the channel ID, then in terraform.tfvars:
+# notification_channel_ids = ["projects//notificationChannels/"]
+# terraform apply
+```
+
+For Slack integrations, use type `slack` with a webhook URL.
 
 ## Decommission
 