docs: workflow-driven VM recreate for startup-script propagation
- ONBOARDING.md: replace the 'propagating module changes' section with two explicit options: workflow_dispatch with recreate_targets (recommended, CI audit trail), or local terraform apply -replace (emergency). Adds a 'do not' section banning manual .env edits on VMs.
- deployment-log.md: iteration 4 summary (version badge + module v1.5.0 + workflow_dispatch).
parent b091cf7003
commit 1a55167234
2 changed files with 29 additions and 15 deletions
|
|
@@ -182,29 +182,36 @@ curl -X POST "http://$PROD_IP:8000/api/sync/trigger" \
|
||||||
|
|
||||||
## Propagating module (startup-script) changes
|
## Propagating module (startup-script) changes
|
||||||
|
|
||||||
**Important gotcha:** The `customer-instance` module has `lifecycle { ignore_changes = [metadata_startup_script] }` on VMs — this is intentional so `terraform apply` doesn't reboot VMs on every rerun. The consequence is that **changes inside the startup script are not picked up on a normal `terraform apply`**.
|
**Important gotcha:** The `customer-instance` module has `lifecycle { ignore_changes = [metadata_startup_script] }` on VMs — intentional, so `terraform apply` doesn't reboot VMs on every rerun. The consequence is that **startup-script changes are not picked up on a normal `terraform apply`**.
|
||||||
|
|
||||||
To propagate a startup-script change (for example, after bumping `ref=infra-v1.3.0`):
|
After bumping the module ref (e.g. `ref=infra-v1.5.0` → `infra-v1.6.0`), do one of:
|
||||||
|
|
||||||
|
### Option A — Workflow dispatch with `recreate_targets` (recommended)
|
||||||
|
|
||||||
|
`apply.yml` has a `workflow_dispatch` input `recreate_targets` that takes a comma-separated list of TF resource addresses and passes each as `-replace=` to `terraform apply`. Use this to destroy + recreate VMs with the new startup script, without any SSH.
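
The splitting step described above can be sketched as follows. This is a minimal illustration, not the contents of `apply.yml`: the function name `build_replace_args` is hypothetical, and the whitespace trimming is an assumption about how a forgiving implementation might treat the input.

```bash
# Hypothetical helper (not taken from apply.yml): turn a comma-separated
# recreate_targets value into one -replace=<address> argument per line.
build_replace_args() {
  local raw="$1" addr
  local -a targets=()
  # Split on commas only; resource addresses contain quotes and brackets.
  IFS=',' read -r -a targets <<< "$raw"
  for addr in "${targets[@]}"; do
    # Trim leading and trailing whitespace around each address.
    addr="${addr#"${addr%%[![:space:]]*}"}"
    addr="${addr%"${addr##*[![:space:]]}"}"
    # Skip empty entries such as a trailing comma.
    if [ -n "$addr" ]; then
      printf -- '-replace=%s\n' "$addr"
    fi
  done
}

# Example input in the same shape as the recreate_targets field.
build_replace_args 'module.agnes.google_compute_instance.vm["agnes-dev"], module.agnes.google_compute_instance.vm["agnes-prod"]'
```

In a real job the printed lines would typically be collected into an array (for example with `mapfile -t`) and appended to the `terraform apply` command line, so that each address stays a single quoted argument.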
|
||||||
|
|
||||||
|
```
|
||||||
|
Actions → Terraform Apply → Run workflow → recreate_targets:
|
||||||
|
module.agnes.google_compute_instance.vm["agnes-dev"],module.agnes.google_compute_instance.vm["agnes-prod"]
|
||||||
|
```
|
||||||
|
|
||||||
|
The workflow routes dev targets to `apply-dev` and prod targets to `apply-prod`, so the usual dev-first + prod-reviewer gate still applies. Persistent data disks and static IPs are separate resources and are **preserved** across replacement — only the VM (and its fresh boot disk) is recreated.
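
How the workflow tells dev targets from prod targets is not spelled out here. The sketch below assumes one plausible rule, matching on the `-dev` / `-prod` suffix of the instance key inside `vm["..."]`; both the rule and the name `route_target` are hypothetical and only illustrate the routing idea.

```bash
# Hypothetical routing rule: classify a resource address by the suffix of
# its instance key. The real apply.yml may decide this differently.
route_target() {
  case "$1" in
    *'-dev"]')  echo "apply-dev" ;;
    *'-prod"]') echo "apply-prod" ;;
    *)
      # Refuse to guess for addresses that match neither environment.
      echo "unrouted"
      return 1
      ;;
  esac
}

for addr in \
  'module.agnes.google_compute_instance.vm["agnes-dev"]' \
  'module.agnes.google_compute_instance.vm["agnes-prod"]'
do
  printf '%s -> %s\n' "$addr" "$(route_target "$addr")"
done
```

Failing on unrecognized addresses, rather than defaulting to one job, keeps an unexpected target from bypassing the prod reviewer gate.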
|
||||||
|
|
||||||
|
Downtime: ~2 min per VM, sequential. Data loss: none (persistent disk keeps `/data`; static IP keeps URL stable).
|
||||||
|
|
||||||
|
### Option B — Local terraform (emergency)
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# VM is recreated; boot disk is fresh; persistent data disk is preserved
|
export GOOGLE_APPLICATION_CREDENTIALS=~/.agnes-keys/agnes-deploy-<project>-key.json
|
||||||
|
cd terraform
|
||||||
terraform apply -replace='module.agnes.google_compute_instance.vm["agnes-prod"]'
|
terraform apply -replace='module.agnes.google_compute_instance.vm["agnes-prod"]'
|
||||||
```
|
```
|
||||||
|
|
||||||
Downtime: ~2 minutes. The persistent data disk (where `/data` lives) is *not* recreated — only the VM. Startup script re-runs on the new VM with the latest template content, and your data is still there.
|
Same semantics as Option A, but no CI audit trail. Use only when CI is broken.
|
||||||
|
|
||||||
Alternative (less disruptive): hot-patch the VM via SSH:
|
### Do NOT
|
||||||
|
|
||||||
```bash
|
Do not manually edit `/opt/agnes/.env` or the docker-compose overlay files on a running VM. Any such change is lost on the next VM recreate, and it drifts from Terraform state. If a value needs changing, route it through a module variable or a module upgrade.
|
||||||
gcloud compute ssh agnes-prod --zone=... --project=... --command="sudo bash -c '
|
|
||||||
cd /opt/agnes
|
|
||||||
curl -fsSL https://raw.githubusercontent.com/keboola/agnes-the-ai-analyst/main/docker-compose.prod.yml -o docker-compose.prod.yml
|
|
||||||
curl -fsSL https://raw.githubusercontent.com/keboola/agnes-the-ai-analyst/main/docker-compose.host-mount.yml -o docker-compose.host-mount.yml
|
|
||||||
docker compose -f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.host-mount.yml up -d
|
|
||||||
'"
|
|
||||||
```
|
|
||||||
|
|
||||||
This preserves container state but won't re-install cron / rebuild persistent disk layout.
|
|
||||||
|
|
||||||
## Restoring from backup
|
## Restoring from backup
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@@ -157,6 +157,13 @@ The data migration copied the users table, so the password is also valid on the new prod.
|
||||||
9. **Auth v2 → v3 action bump** in both workflow repos (silences the Node 20 deprecation warning).
|
9. **Auth v2 → v3 action bump** in both workflow repos (silences the Node 20 deprecation warning).
|
||||||
10. **Prod apply-dev succeeded** after a manual trigger (the initial apply hit a race with the timing of secret creation). apply-prod is waiting for a reviewer.
|
10. **Prod apply-dev succeeded** after a manual trigger (the initial apply hit a race with the timing of secret creation). apply-prod is waiting for a reviewer.
|
||||||
|
|
||||||
|
### Iteration 4: version badge + workflow-driven recreate
|
||||||
|
|
||||||
|
1. **Version badge in the UI**: a `/api/version` endpoint plus a footer badge in `base.html` that loads asynchronously and shows `<channel>-<version> · <tag> · deployed <relative> (<UTC>)`, with the commit SHA in the tooltip.
|
||||||
|
2. **Module `infra-v1.5.0`**: the startup script derives `AGNES_VERSION` and `RELEASE_CHANNEL` from the image tag (stable-YYYY.MM.N / dev-…) and reads `AGNES_COMMIT_SHA` from the `docker pull` digest. These variables go into `.env`, the app reads them, `/api/version` returns them, and the badge renders them.
|
||||||
|
3. **`workflow_dispatch` with `recreate_targets`** in `apply.yml` (both repos): a manually triggered workflow input that takes comma-separated TF resource addresses, each passed as `-replace=<addr>` to `terraform apply`. Works around the `ignore_changes = [metadata_startup_script]` gotcha. Dev targets are routed to `apply-dev`, prod targets to `apply-prod`.
|
||||||
|
4. **Propagation docs rewritten** in `docs/ONBOARDING.md`: Option A (workflow_dispatch, recommended) vs Option B (local TF), plus an explicit DO NOT section against manual SSH intervention.
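
The tag-to-variables derivation from item 2 could look roughly like this. It is a sketch under assumptions, not the module's startup script: the function name `parse_image_tag`, the `unknown` fallback, and the sample tag value are illustrative, and it assumes the version is everything after the channel prefix.

```bash
# Hypothetical sketch of deriving RELEASE_CHANNEL and AGNES_VERSION from an
# image tag shaped like stable-YYYY.MM.N or dev-<something>.
parse_image_tag() {
  local tag="$1"
  case "$tag" in
    stable-*|dev-*)
      RELEASE_CHANNEL="${tag%%-*}"   # text before the first hyphen
      AGNES_VERSION="${tag#*-}"      # text after the first hyphen
      ;;
    *)
      # Assumed fallback for tags that do not follow the channel scheme.
      RELEASE_CHANNEL="unknown"
      AGNES_VERSION="$tag"
      ;;
  esac
}

# Illustrative tag value only; real tags come from the deployed image.
parse_image_tag "stable-2025.01.3"
printf 'RELEASE_CHANNEL=%s\nAGNES_VERSION=%s\n' "$RELEASE_CHANNEL" "$AGNES_VERSION"
```

The two printed lines are in the `KEY=value` form that an `.env` file expects, which matches the flow described above where the app reads the values and `/api/version` returns them.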
|
||||||
|
|
||||||
### Iteration 3: code review + bootstrap fix + doc sweep
|
### Iteration 3: code review + bootstrap fix + doc sweep
|
||||||
|
|
||||||
1. **Code review** dispatched via the `superpowers:requesting-code-review` subagent. Findings: 7 critical, 9 important, 10 minor.
|
1. **Code review** dispatched via the `superpowers:requesting-code-review` subagent. Findings: 7 critical, 9 important, 10 minor.
|
||||||
|
|
|
||||||