docs(onboarding): add module propagation, backup restore, monitoring setup

- 'Propagating module changes': explains ignore_changes + -replace workflow
- 'Restoring from backup': step-by-step disk swap from daily snapshot
- 'Monitoring alerts': wiring notification channels
2026-04-21 19:06:20 +02:00 · 3e9213bfc4
commit 3e9213bfc4
parent 0842debf8a
1 changed files with 80 additions and 2 deletions
--- a/docs/ONBOARDING.md
+++ b/docs/ONBOARDING.md
@@ -158,9 +158,87 @@ curl -X POST "http://$PROD_IP:8000/api/sync/trigger" \
 ## Ongoing maintenance

 - **App auto-upgrades** (cron every 5 min) to the latest `:stable` if `upgrade_mode = "auto"`. Otherwise Renovate opens a PR for each new `stable-YYYY.MM.N`.
- **Infra module upgrade:** change `ref=infra-vX.Y.Z` in `terraform/main.tf`, PR → plan → merge → apply.
+- **Infra module upgrade:** change `ref=infra-vX.Y.Z` in `terraform/main.tf`, PR → plan → merge → apply. (Renovate opens these PRs automatically when enabled.)
 - **Add dev VM for a branch:** add entry to `dev_instances` list with `image_tag = "dev-feature-xyz"`, PR, merge, apply.
- **Token rotation:** `gcloud secrets versions add keboola-storage-token --data-file=-` then `sudo docker compose restart app` on each VM.
+- **Token rotation:** `gcloud secrets versions add keboola-storage-token --data-file=-` then run the auto-upgrade script on each VM:
+  ```bash
+  gcloud compute ssh agnes-prod --zone=... --project=... --command="sudo /usr/local/bin/agnes-auto-upgrade.sh"
+  ```
+  Or restart containers directly: `sudo docker compose -f ... restart app`.
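+
+A minimal sketch of the `gcloud secrets versions add` step above, assuming the new token is held in a shell variable (`NEW_TOKEN` is a hypothetical name, not something this repo defines). `--data-file=-` tells `gcloud` to read the secret payload from stdin:
+
+```bash
+# NEW_TOKEN is a placeholder; printf avoids storing a trailing newline in the secret
+printf '%s' "$NEW_TOKEN" | gcloud secrets versions add keboola-storage-token --data-file=-
+```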
+
+## Propagating module (startup-script) changes
+
+**Important gotcha:** The `customer-instance` module sets `lifecycle { ignore_changes = [metadata_startup_script] }` on VMs. This is intentional: the Google provider recreates an instance whenever `metadata_startup_script` changes, and the module should not replace VMs as a side effect of a routine `terraform apply`. The consequence is that **changes inside the startup script are not picked up on a normal `terraform apply`**.
+
+To propagate a startup-script change (for example, after bumping `ref=infra-v1.3.0`):
+
+```bash
+# VM is recreated; boot disk is fresh; persistent data disk is preserved
+terraform apply -replace='module.agnes.google_compute_instance.vm["agnes-prod"]'
+```
+
+Downtime: ~2 minutes. Only the VM is recreated; the persistent data disk (where `/data` lives) is left alone. The startup script re-runs on the new VM with the latest template content, and your data is still there.
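+
+Before committing to the downtime, the replacement can be previewed. A sketch, assuming a Terraform version that accepts `-replace` on `plan` (0.15.2 or later) and the same resource address as above:
+
+```bash
+# Expect the VM to be marked for replacement and the data disk to show no changes
+terraform plan -replace='module.agnes.google_compute_instance.vm["agnes-prod"]'
+```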
+
+Alternative (less disruptive): hot-patch the VM via SSH:
+
+```bash
+# set -e stops at the first failed step, so a failed cd or download never reaches "up -d"
+gcloud compute ssh agnes-prod --zone=... --project=... --command="sudo bash -c '
+  set -e
+  cd /opt/agnes
+  curl -fsSL https://raw.githubusercontent.com/keboola/agnes-the-ai-analyst/main/docker-compose.prod.yml -o docker-compose.prod.yml
+  curl -fsSL https://raw.githubusercontent.com/keboola/agnes-the-ai-analyst/main/docker-compose.host-mount.yml -o docker-compose.host-mount.yml
+  docker compose -f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.host-mount.yml up -d
+'"
+```
+
+This keeps the VM and its boot disk in place, but it does not re-install the cron job or rebuild the persistent disk layout.
+
+## Restoring from backup
+
+Daily snapshots of each data disk are created automatically (module ≥ `infra-v1.3.0`). Retention: 30 days.
+
+To restore:
+
+```bash
+# List snapshots for a specific disk
+gcloud compute snapshots list --project=<GCP_PROJECT_ID> \
+    --filter="sourceDisk~agnes-prod-data"
+
+# Create a new disk from a snapshot
+gcloud compute disks create agnes-prod-data-restored \
+    --source-snapshot=<SNAPSHOT_NAME> \
+    --zone=europe-west1-b \
+    --type=pd-ssd \
+    --project=<GCP_PROJECT_ID>
+
+# Stop the VM, swap disks:
+gcloud compute instances stop agnes-prod --zone=...
+gcloud compute instances detach-disk agnes-prod --disk=agnes-prod-data --zone=...
+gcloud compute instances attach-disk agnes-prod --disk=agnes-prod-data-restored --device-name=data --zone=...
+gcloud compute instances start agnes-prod --zone=...
+
+# Verify /api/health, then optionally delete the old disk
+```
+
+To keep Terraform state consistent after a manual disk swap, you may need `terraform state rm` followed by `terraform import` for the disk resource.
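+
+A rough sketch of that re-import, under loud assumptions: the resource address `module.agnes.google_compute_disk.data["agnes-prod"]` is hypothetical (look up the real one with `terraform state list`), and the import ID uses the Google provider's `projects/<project>/zones/<zone>/disks/<name>` form:
+
+```bash
+# Find the actual address of the data disk resource in state
+terraform state list | grep google_compute_disk
+
+# Hypothetical address; substitute the one printed above
+terraform state rm 'module.agnes.google_compute_disk.data["agnes-prod"]'
+terraform import 'module.agnes.google_compute_disk.data["agnes-prod"]' \
+    'projects/<GCP_PROJECT_ID>/zones/europe-west1-b/disks/agnes-prod-data-restored'
+
+# Check the result before applying anything
+terraform plan
+```
+
+If the plan proposes replacing the disk because its name no longer matches the module's configuration, stop and resolve that first; applying such a plan would recreate the disk.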
+
+## Monitoring alerts
+
+Module ≥ `infra-v1.3.0` creates per-VM uptime checks and alert policies. To receive notifications, wire up a Monitoring notification channel:
+
+```bash
+# Email channel
+gcloud alpha monitoring channels create \
+    --display-name="Agnes ops email" \
+    --type=email \
+    --channel-labels=email_address=ops@<customer>.com \
+    --project=<GCP_PROJECT_ID>
+
+# Get the channel ID, then in terraform.tfvars:
+#   notification_channel_ids = ["projects/<project>/notificationChannels/<id>"]
+# terraform apply
+```
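+
+One way to look up the channel's full resource name, sketched on the assumption that the display name used above is unique in the project:
+
+```bash
+# Prints projects/<project>/notificationChannels/<id>, the form notification_channel_ids expects
+gcloud alpha monitoring channels list \
+    --filter='displayName="Agnes ops email"' \
+    --format='value(name)' \
+    --project=<GCP_PROJECT_ID>
+```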
+
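+For Slack, the `slack` channel type needs workspace authorization, so it is usually easiest to create that channel in the Cloud console; a plain webhook URL goes with the `webhook_tokenauth` type instead.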

 ## Decommission