docs(onboarding): add module propagation, backup restore, monitoring setup

- 'Propagating module changes' — explains ignore_changes + -replace workflow
- 'Restoring from backup' — step-by-step disk swap from daily snapshot
- 'Monitoring alerts' — wiring notification channels
ZdenekSrotyr 2026-04-21 19:06:20 +02:00
parent 0842debf8a
commit 3e9213bfc4

@@ -158,9 +158,87 @@ curl -X POST "http://$PROD_IP:8000/api/sync/trigger" \
## Ongoing maintenance
- **App auto-upgrades** (cron every 5 min) to the latest `:stable` if `upgrade_mode = "auto"`. Otherwise Renovate opens a PR for each new `stable-YYYY.MM.N`.
- **Infra module upgrade:** change `ref=infra-vX.Y.Z` in `terraform/main.tf`, PR → plan → merge → apply.
- **Infra module upgrade:** change `ref=infra-vX.Y.Z` in `terraform/main.tf`, PR → plan → merge → apply. (Renovate opens these PRs automatically when enabled.)
- **Add dev VM for a branch:** add entry to `dev_instances` list with `image_tag = "dev-feature-xyz"`, PR, merge, apply.
- **Token rotation:** `gcloud secrets versions add keboola-storage-token --data-file=-` then `sudo docker compose restart app` on each VM.
- **Token rotation:** `gcloud secrets versions add keboola-storage-token --data-file=-` then run the auto-upgrade script on each VM:
```bash
gcloud compute ssh agnes-prod --zone=... --project=... --command="sudo /usr/local/bin/agnes-auto-upgrade.sh"
```
Or restart containers directly: `sudo docker compose -f ... restart app`.
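The token-rotation step reads the new secret value from stdin (`--data-file=-`). A minimal sketch of feeding it in without a trailing newline, which `echo` would append and which may end up stored as part of the token. The function name `add_token_version` and the variable `NEW_TOKEN` are illustrative, not part of the repo:

```bash
# Add a new version to a Secret Manager secret from a shell variable.
# Usage: add_token_version SECRET_NAME TOKEN_VALUE
add_token_version() {
  local secret="$1" token="$2"
  # printf '%s' writes the value verbatim, with no trailing newline
  printf '%s' "$token" | gcloud secrets versions add "$secret" --data-file=-
}
```

Example call: `add_token_version keboola-storage-token "$NEW_TOKEN"`, followed by the auto-upgrade script or container restart described above.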
## Propagating module (startup-script) changes
**Important gotcha:** The `customer-instance` module sets `lifecycle { ignore_changes = [metadata_startup_script] }` on VMs. This is intentional: without it, `terraform apply` would replace the VM whenever the rendered startup script changes. The consequence is that **changes inside the startup script are not picked up on a normal `terraform apply`**.
To propagate a startup-script change (for example, after bumping `ref=infra-v1.3.0`):
```bash
# VM is recreated; boot disk is fresh; persistent data disk is preserved
terraform apply -replace='module.agnes.google_compute_instance.vm["agnes-prod"]'
```
Downtime: ~2 minutes. The persistent data disk (where `/data` lives) is *not* recreated — only the VM. Startup script re-runs on the new VM with the latest template content, and your data is still there.
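Since `-replace` destroys and recreates the VM, it can be worth previewing the plan first. The sketch below wraps the command above in two small helpers; `vm_address` and `replace_vm` are hypothetical names, and the module instance name `agnes` is taken from the example above and may differ in your configuration:

```bash
# Build the Terraform resource address for one VM in the module.
# Assumes the module instance is called "agnes" and VMs are keyed by name.
vm_address() {
  printf 'module.agnes.google_compute_instance.vm["%s"]\n' "$1"
}

# Preview the replacement, then apply it if the plan succeeded.
# `terraform apply` still asks for confirmation before changing anything.
replace_vm() {
  local addr
  addr="$(vm_address "$1")"
  terraform plan -replace="$addr" && terraform apply -replace="$addr"
}
```

Example call: `replace_vm agnes-prod`. In the plan output, only the VM should be marked for replacement; if the data disk appears as well, stop and investigate before applying.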
Alternative (less disruptive): hot-patch the VM via SSH:
```bash
gcloud compute ssh agnes-prod --zone=... --project=... --command="sudo bash -c '
cd /opt/agnes
curl -fsSL https://raw.githubusercontent.com/keboola/agnes-the-ai-analyst/main/docker-compose.prod.yml -o docker-compose.prod.yml
curl -fsSL https://raw.githubusercontent.com/keboola/agnes-the-ai-analyst/main/docker-compose.host-mount.yml -o docker-compose.host-mount.yml
docker compose -f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.host-mount.yml up -d
'"
```
This preserves container state but does not re-install cron jobs or rebuild the persistent disk layout.
## Restoring from backup
Daily snapshots of each data disk are created automatically (module ≥ `infra-v1.3.0`). Retention: 30 days.
To restore:
```bash
# List snapshots for a specific disk
gcloud compute snapshots list --project=<GCP_PROJECT_ID> \
--filter="sourceDisk~agnes-prod-data"
# Create a new disk from a snapshot
gcloud compute disks create agnes-prod-data-restored \
--source-snapshot=<SNAPSHOT_NAME> \
--zone=europe-west1-b \
--type=pd-ssd \
--project=<GCP_PROJECT_ID>
# Stop the VM, swap disks:
gcloud compute instances stop agnes-prod --zone=...
gcloud compute instances detach-disk agnes-prod --disk=agnes-prod-data --zone=...
gcloud compute instances attach-disk agnes-prod --disk=agnes-prod-data-restored --device-name=data --zone=...
gcloud compute instances start agnes-prod --zone=...
# Verify /api/health, then optionally delete the old disk
```
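The "verify `/api/health`" step can be scripted so the old disk is only deleted once the app responds. A minimal sketch, assuming the app listens on port 8000 as in the sync-trigger example earlier in this guide; the function name `wait_for_health` and its retry defaults are illustrative:

```bash
# Poll a health endpoint until it returns a successful HTTP status, or give up.
# Usage: wait_for_health URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_health() {
  local url="$1" attempts="${2:-30}" delay="${3:-5}" i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP error responses (4xx/5xx)
    if curl -fsS --max-time 5 -o /dev/null "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "still unhealthy after $attempts attempt(s)" >&2
  return 1
}
```

Example call: `wait_for_health "http://$PROD_IP:8000/api/health"`. A non-zero exit suggests keeping the old disk around and checking container logs on the VM.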
After a manual disk swap, Terraform state still points at the old disk. To bring it back in sync, you may need `terraform state rm` followed by `terraform import` for the disk resource.
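A rough sketch of that state fix follows. The resource address in `DISK_ADDR` is a guess at how the module names its data disk, not something confirmed by the module source; run `terraform state list` to find the real address. The import ID uses the `projects/<project>/zones/<zone>/disks/<name>` form accepted by the Google provider for zonal disks:

```bash
# Build the import ID for a zonal persistent disk.
# Usage: disk_import_id PROJECT ZONE DISK_NAME
disk_import_id() {
  local project="$1" zone="$2" disk="$3"
  printf 'projects/%s/zones/%s/disks/%s\n' "$project" "$zone" "$disk"
}

# Hypothetical address: check `terraform state list` for the real one.
DISK_ADDR='module.agnes.google_compute_disk.data["agnes-prod"]'

# Drop the old disk from state, then import the restored one in its place.
# Usage: reimport_disk PROJECT ZONE DISK_NAME
reimport_disk() {
  terraform state rm "$DISK_ADDR" &&
    terraform import "$DISK_ADDR" "$(disk_import_id "$@")"
}
```

Example call: `reimport_disk <GCP_PROJECT_ID> europe-west1-b agnes-prod-data-restored`. Because the restored disk has a different name from the one in the configuration, read the next `terraform plan` closely before applying; it may propose replacing the disk.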
## Monitoring alerts
Module ≥ `infra-v1.3.0` creates per-VM uptime checks + alert policies. To receive notifications, wire a Monitoring notification channel:
```bash
# Email channel
gcloud alpha monitoring channels create \
--display-name="Agnes ops email" \
--type=email \
--channel-labels=email_address=ops@<customer>.com \
--project=<GCP_PROJECT_ID>
# Get the channel ID, then in terraform.tfvars:
# notification_channel_ids = ["projects/<project>/notificationChannels/<id>"]
# terraform apply
```
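To get the channel ID for `terraform.tfvars`, the channels can be listed with `gcloud alpha monitoring channels list`; the `name` field already has the `projects/<project>/notificationChannels/<id>` form. A small sketch that turns that output into the tfvars line (the helper names `format_channel_ids` and `channel_ids_tfvars` are illustrative):

```bash
# Read one channel resource name per line on stdin and print
# a notification_channel_ids assignment suitable for terraform.tfvars.
format_channel_ids() {
  awk 'BEGIN { printf "notification_channel_ids = [" }
       NF    { printf "%s\"%s\"", (n++ ? ", " : ""), $0 }
       END   { print "]" }'
}

# List all notification channels in a project and format them.
# Usage: channel_ids_tfvars PROJECT_ID
channel_ids_tfvars() {
  gcloud alpha monitoring channels list \
    --project="$1" --format='value(name)' | format_channel_ids
}
```

Example call: `channel_ids_tfvars <GCP_PROJECT_ID>`. This includes every channel in the project; add a `--filter` on `displayName` if only some of them should receive Agnes alerts.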
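For Slack, use type `slack`; it takes a `channel_name` label and an OAuth token (`auth_token`) rather than a webhook URL. For a generic webhook endpoint, use type `webhook_tokenauth` with a `url` label.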
## Decommission