docs(onboarding): add module propagation, backup restore, monitoring setup

- 'Propagating module changes': explains ignore_changes + -replace workflow
- 'Restoring from backup': step-by-step disk swap from daily snapshot
- 'Monitoring alerts': wiring notification channels

parent 0842debf8a
commit 3e9213bfc4
1 changed file with 80 additions and 2 deletions

@@ -158,9 +158,87 @@ curl -X POST "http://$PROD_IP:8000/api/sync/trigger" \
## Ongoing maintenance

- **App auto-upgrades** (cron every 5 min) to the latest `:stable` if `upgrade_mode = "auto"`. Otherwise, Renovate opens a PR on each new `stable-YYYY.MM.N`.
- **Infra module upgrade:** change `ref=infra-vX.Y.Z` in `terraform/main.tf`, PR → plan → merge → apply. (Renovate opens these PRs automatically when enabled.)
- **Add dev VM for a branch:** add an entry to the `dev_instances` list with `image_tag = "dev-feature-xyz"`, PR, merge, apply.
- **Token rotation:** `gcloud secrets versions add keboola-storage-token --data-file=-`, then run the auto-upgrade script on each VM:
  ```bash
  gcloud compute ssh agnes-prod --zone=... --project=... --command="sudo /usr/local/bin/agnes-auto-upgrade.sh"
  ```
  Or restart containers directly: `sudo docker compose -f ... restart app`.
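
The token-rotation command reads the new value from stdin (`--data-file=-`). A minimal sketch of feeding it in without a stray trailing newline follows; the `NEW_TOKEN` variable and the `rotate_token` helper are hypothetical names used only for illustration.

```bash
# Sketch only: NEW_TOKEN and rotate_token are illustrative names, not part of the repo.
rotate_token() {
  # printf '%s' writes the value without the trailing newline that echo would add,
  # so the stored secret version is exactly the token.
  printf '%s' "${NEW_TOKEN:?set NEW_TOKEN first}" \
    | gcloud secrets versions add keboola-storage-token --data-file=-
}
```

After `rotate_token` succeeds, the VMs still need the auto-upgrade script or a container restart, as described in the bullet above, to pick up the new version.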

## Propagating module (startup-script) changes

**Important gotcha:** The `customer-instance` module sets `lifecycle { ignore_changes = [metadata_startup_script] }` on VMs. This is intentional, so that `terraform apply` doesn't reboot VMs on every rerun. The consequence is that **changes inside the startup script are not picked up by a normal `terraform apply`**.

To propagate a startup-script change (for example, after bumping to `ref=infra-v1.3.0`):

```bash
# VM is recreated; boot disk is fresh; persistent data disk is preserved
terraform apply -replace='module.agnes.google_compute_instance.vm["agnes-prod"]'
```

Downtime: ~2 minutes. Only the VM is recreated; the persistent data disk (where `/data` lives) is *not*. The startup script re-runs on the new VM with the latest template content, and your data is still there.
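
Before replacing a production VM, it can help to preview the plan and confirm that only the VM is affected. A minimal sketch follows, assuming Terraform ≥ 0.15.2 (where `-replace` is accepted by both `plan` and `apply`). The `vm_address` and `replace_vm` helpers and the `replace.tfplan` file name are hypothetical; the resource address follows the pattern from the command above.

```bash
# Build the resource address for one VM (pattern taken from the command above).
vm_address() {
  printf 'module.agnes.google_compute_instance.vm["%s"]' "$1"
}

# Save a plan that forces replacement of one VM, ask for confirmation, then apply it.
replace_vm() {
  addr="$(vm_address "$1")"
  terraform plan -replace="$addr" -out=replace.tfplan || return 1
  # Applying a saved plan does not prompt, so confirm explicitly here.
  printf 'Apply this plan? Only the VM should be replaced, not the data disk. [y/N] '
  read -r answer
  [ "$answer" = "y" ] || return 1
  terraform apply replace.tfplan
}
```

For example, `replace_vm agnes-prod` plans and, after confirmation, applies the same replacement as the one-liner above.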

Alternative (less disruptive): hot-patch the VM via SSH:

```bash
gcloud compute ssh agnes-prod --zone=... --project=... --command="sudo bash -c '
set -e  # stop at the first failed step
cd /opt/agnes
curl -fsSL https://raw.githubusercontent.com/keboola/agnes-the-ai-analyst/main/docker-compose.prod.yml -o docker-compose.prod.yml
curl -fsSL https://raw.githubusercontent.com/keboola/agnes-the-ai-analyst/main/docker-compose.host-mount.yml -o docker-compose.host-mount.yml
docker compose -f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.host-mount.yml up -d
'"
```

This preserves container state but won't re-install cron or rebuild the persistent disk layout.

## Restoring from backup

Daily snapshots of each data disk are created automatically (module ≥ `infra-v1.3.0`). Retention: 30 days.

To restore:

```bash
# List snapshots for a specific disk
gcloud compute snapshots list --project=<GCP_PROJECT_ID> \
  --filter="sourceDisk~agnes-prod-data"

# Create a new disk from a snapshot
gcloud compute disks create agnes-prod-data-restored \
  --source-snapshot=<SNAPSHOT_NAME> \
  --zone=europe-west1-b \
  --type=pd-ssd \
  --project=<GCP_PROJECT_ID>

# Stop the VM, swap the disks, then start it again
gcloud compute instances stop agnes-prod --zone=...
gcloud compute instances detach-disk agnes-prod --disk=agnes-prod-data --zone=...
gcloud compute instances attach-disk agnes-prod --disk=agnes-prod-data-restored --device-name=data --zone=...
gcloud compute instances start agnes-prod --zone=...

# Verify /api/health, then optionally delete the old disk
```
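
The final verification step can be scripted as a small polling loop. This is a sketch under assumptions: that the app listens on port 8000 at `$PROD_IP` (as in the earlier `curl` call in this document) and that `/api/health` returns a 2xx status once the app is up. The `wait_for_health` helper is a hypothetical name.

```bash
# Poll a URL until it returns a successful response, or give up.
# Usage: wait_for_health URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_health() {
  url="$1"; attempts="${2:-30}"; delay="${3:-5}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors (>= 400); -sS hides progress but shows errors.
    if curl -fsS --max-time 5 -o /dev/null "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    # Wait between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then sleep "$delay"; fi
    i=$((i + 1))
  done
  echo "still unhealthy after $attempts attempts" >&2
  return 1
}
```

For example, `wait_for_health "http://$PROD_IP:8000/api/health"` after starting the VM; delete the old disk only once this succeeds and the data looks right.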

To keep Terraform state consistent after a manual disk swap, you may need `terraform state rm` + `terraform import` for the disk resource.
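
A sketch of what that state fix-up might look like. The resource address in `DISK_ADDR` is a guess, not taken from the module (look up the real one with `terraform state list`), and the helper names are hypothetical. The import ID uses the `projects/<project>/zones/<zone>/disks/<name>` form accepted by the `google_compute_disk` resource.

```bash
# Hypothetical address; find the real one with: terraform state list | grep disk
DISK_ADDR='module.agnes.google_compute_disk.data["agnes-prod"]'

# Build the import ID for a zonal persistent disk.
disk_import_id() {
  printf 'projects/%s/zones/%s/disks/%s' "$1" "$2" "$3"
}

# Forget the old disk in state, import the restored one, then check the plan.
reimport_disk() {
  project="$1"; zone="$2"; disk="$3"
  terraform state rm "$DISK_ADDR" &&
    terraform import "$DISK_ADDR" "$(disk_import_id "$project" "$zone" "$disk")" &&
    terraform plan   # a disk-name mismatch with the module config would show up here
}
```

If the plan then proposes replacing the disk (for instance because the module expects the original disk name), stop and resolve that before applying, since a replacement would discard the restored data.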

## Monitoring alerts

Module ≥ `infra-v1.3.0` creates per-VM uptime checks and alert policies. To receive notifications, wire a Monitoring notification channel:

```bash
# Email channel
gcloud alpha monitoring channels create \
  --display-name="Agnes ops email" \
  --type=email \
  --channel-labels=email_address=ops@<customer>.com \
  --project=<GCP_PROJECT_ID>

# Get the channel ID, then in terraform.tfvars:
#   notification_channel_ids = ["projects/<project>/notificationChannels/<id>"]
# terraform apply
```
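
One way to get the channel ID is to list channels and filter by display name. A minimal sketch, assuming the display name used above; the `channel_name` and `tfvars_line` helpers are hypothetical. The `name` field of a channel is its full resource name, which is the form `notification_channel_ids` expects.

```bash
# Print the full resource name of the channel created above (matched by display name).
# Usage: channel_name <GCP_PROJECT_ID>
channel_name() {
  gcloud alpha monitoring channels list \
    --project="$1" \
    --filter='displayName="Agnes ops email"' \
    --format='value(name)'
}

# Format the terraform.tfvars line from one channel resource name.
tfvars_line() {
  printf 'notification_channel_ids = ["%s"]\n' "$1"
}
```

For example, `tfvars_line "$(channel_name <GCP_PROJECT_ID>)"` prints the line to paste into `terraform.tfvars`. This assumes exactly one channel has that display name.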
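
For Slack integrations, use a channel of type `slack`. Note that this type takes `channel_name` and `auth_token` labels (an OAuth token) rather than a plain webhook URL, so it is usually easiest to create from the Cloud Console.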

## Decommission