infra(customer-instance): preserve operator AGNES_TAG / AGNES_TEMP_DIR (#214)

The startup script runs on every boot but the metadata_startup_script
field is in lifecycle.ignore_changes — so a TF apply that changed
image_tag does NOT reach a long-lived VM until someone explicitly
recreates it. Meanwhile, operators commonly hand-edit /opt/agnes/.env
to pin a specific image (custom branch builds, staged rollouts).
Pre-fix, every boot rewrote .env from the baked-in template and
clobbered the operator's choice — concretely, a stop+start triggered
by a machine_type change would reset AGNES_TAG to whatever was in
the template at first provision, regardless of the operator's
intervening edit.

Now the script reads the existing .env (when present) for AGNES_TAG
and AGNES_TEMP_DIR. An existing AGNES_TAG wins over the
template-computed $IMAGE_TAG, and an existing AGNES_TEMP_DIR is
carried over into the rewritten file. The script logs a line to
stdout when the preserved AGNES_TAG differs from $IMAGE_TAG, so the
boot leaves an audit trail for operators.
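
The lookup and precedence rule can be sketched in isolation as below. This is a minimal illustration, not the module's code: `read_env_value` and `resolve_tag` are hypothetical names, and the log line goes to stderr here (the startup script uses stdout) only so the function's stdout can be captured.

```shell
#!/usr/bin/env bash
# Sketch only. read_env_value and resolve_tag are hypothetical helpers
# that mirror the grep | head | cut | tr pipeline used by the script.

# Print the first value of KEY in FILE with double quotes stripped.
# Prints nothing when the file or the key is missing.
read_env_value() {
  local key="$1" file="$2"
  [ -f "$file" ] || return 0
  # cut -f2- keeps everything after the first '=', so values that
  # themselves contain '=' survive; '|| true' guards against a
  # no-match exit status under 'set -e' / 'pipefail'.
  grep -E "^${key}=" "$file" | head -1 | cut -d= -f2- | tr -d '"' || true
}

# An existing value wins; otherwise fall back to the Terraform-provided tag.
resolve_tag() {
  local existing="$1" image_tag="$2"
  if [ -n "$existing" ] && [ "$existing" != "$image_tag" ]; then
    # stderr in this sketch, so callers can capture the result on stdout
    echo "INFO: preserving operator-edited AGNES_TAG=$existing (TF variable said $image_tag)" >&2
  fi
  printf '%s\n' "${existing:-$image_tag}"
}
```

Called as `resolve_tag "$(read_env_value AGNES_TAG /opt/agnes/.env)" "$IMAGE_TAG"`, this returns the pinned tag when one exists and `$IMAGE_TAG` otherwise. Commented-out lines (`#AGNES_TAG=...`) are ignored because the pattern is anchored at the start of the line.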

Fresh provisions are unchanged (no .env yet → template values land).
To force a TF-driven reset on an existing VM: rm /opt/agnes/.env
and reboot. Cut as infra-v1.8.0 — additive, downstream consumers
opt in by bumping the module ref.
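
For operators, the pin and reset paths described above might be wrapped as follows. This is a hedged sketch: `pin_agnes_tag` and `reset_to_template` are hypothetical helpers that do not ship with the module, and `ENV_FILE` is an assumed override variable that defaults to the path named in this message.

```shell
#!/usr/bin/env bash
# Hypothetical operator helpers (not part of the module).
# ENV_FILE is an assumed knob; it defaults to the path from the commit message.
ENV_FILE="${ENV_FILE:-/opt/agnes/.env}"

# Pin AGNES_TAG: drop any existing AGNES_TAG lines and append the new one.
# Line order does not matter to the startup script, which greps by key.
pin_agnes_tag() {
  local tag="$1" tmp
  tmp="$(mktemp)"
  if [ -f "$ENV_FILE" ]; then
    # grep -v exits non-zero when nothing is left to print; tolerate that.
    grep -vE '^AGNES_TAG=' "$ENV_FILE" > "$tmp" || true
  fi
  printf 'AGNES_TAG=%s\n' "$tag" >> "$tmp"
  chmod 600 "$tmp"   # same mode the startup script applies to .env
  mv "$tmp" "$ENV_FILE"
}

# Remove the file so the next boot falls back to the template values.
# Rebooting is left to the operator.
reset_to_template() {
  rm -f "$ENV_FILE"
}
```

Writing to a temporary file and moving it into place avoids leaving a half-written `.env` behind if the edit is interrupted.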
ZdenekSrotyr 2026-05-07 11:36:36 +02:00 committed by GitHub
parent f8e5fd45a4
commit f6c2012d5b
2 changed files with 31 additions and 2 deletions

@@ -10,7 +10,9 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C
## [Unreleased]
## [0.44.1] — 2026-05-07
### Internal
- `infra/modules/customer-instance` (tag `infra-v1.8.0`): `startup-script.sh.tpl` no longer overwrites operator-edited `AGNES_TAG` / `AGNES_TEMP_DIR` in `/opt/agnes/.env` on every boot. Reads the existing values when present and lets them win over the template-computed `$IMAGE_TAG`. Pre-fix, an in-place TF action that stopped/started the VM (e.g. `machine_type` change) would re-run the startup script and clobber any manually-pinned image tag — operators had to re-edit the file post-restart. Fresh provisions still get the TF-driven values; the `.env` file's existence is the disambiguator. To force a TF-driven reset, `rm /opt/agnes/.env` and reboot.
### Fixed

@@ -142,6 +142,32 @@ if [ "$TLS_MODE" = "caddy" ] && [ -n "$DOMAIN" ]; then
fi
fi
# Preserve operator overrides on AGNES_TAG / AGNES_TEMP_DIR. Rationale: this script
# runs on every boot (and the `metadata_startup_script` is in
# `lifecycle.ignore_changes` so a TF apply that changed the
# `image_tag` variable does NOT propagate to a long-lived VM until
# someone explicitly recreates it). Operators commonly hand-edit
# `/opt/agnes/.env` to pin a custom image tag (e.g. for a dev branch
# build, or a staged rollout) — overwriting that on every reboot
# clobbers their decision. Read the existing AGNES_TAG and let it
# win when it disagrees with $IMAGE_TAG; ditto for AGNES_TEMP_DIR
# (a deployment-specific path tweak operators sometimes set to
# steer tempdirs onto a larger volume).
EXISTING_AGNES_TAG=""
EXISTING_AGNES_TEMP_DIR=""
if [ -f "$APP_DIR/.env" ]; then
  EXISTING_AGNES_TAG=$(grep -E '^AGNES_TAG=' "$APP_DIR/.env" | head -1 | cut -d= -f2- | tr -d '"' || true)
  EXISTING_AGNES_TEMP_DIR=$(grep -E '^AGNES_TEMP_DIR=' "$APP_DIR/.env" | head -1 | cut -d= -f2- | tr -d '"' || true)
fi
EFFECTIVE_AGNES_TAG="$${EXISTING_AGNES_TAG:-$IMAGE_TAG}"
if [ -n "$EXISTING_AGNES_TAG" ] && [ "$EXISTING_AGNES_TAG" != "$IMAGE_TAG" ]; then
  echo "INFO: preserving operator-edited AGNES_TAG=$EXISTING_AGNES_TAG (TF variable said $IMAGE_TAG; rm /opt/agnes/.env to reset)"
fi
AGNES_TEMP_DIR_LINE=""
if [ -n "$EXISTING_AGNES_TEMP_DIR" ]; then
  AGNES_TEMP_DIR_LINE="AGNES_TEMP_DIR=\"$EXISTING_AGNES_TEMP_DIR\""
fi
cat > "$APP_DIR/.env" <<ENVEOF
JWT_SECRET_KEY=$JWT_KEY
DATA_DIR=$DATA_MNT
@@ -153,11 +179,12 @@ SEED_ADMIN_PASSWORD=$SEED_ADMIN_PASSWORD
SCHEDULER_API_TOKEN=$SCHEDULER_API_TOKEN
LOG_LEVEL=info
DOMAIN=$DOMAIN
-AGNES_TAG=$IMAGE_TAG
+AGNES_TAG=$EFFECTIVE_AGNES_TAG
ACME_EMAIL=$ACME_EMAIL
GOOGLE_CLIENT_ID=$GOOGLE_CLIENT_ID
GOOGLE_CLIENT_SECRET=$GOOGLE_CLIENT_SECRET
$CADDY_TLS_LINE
$AGNES_TEMP_DIR_LINE
ENVEOF
chmod 600 "$APP_DIR/.env"
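
The boot cycle in this hunk can be exercised against a scratch directory without touching a VM. The sketch below is an approximation under stated assumptions: `render_env` is a hypothetical stand-in for the `.env` section of the template, only a few of the real keys are reproduced, the tag values are made up, and plain `${...}` replaces the template's `$${...}` because Terraform's escaping does not apply outside a template file.

```shell
#!/usr/bin/env bash
# Sketch: simulate repeated "boots" against a scratch directory.
# render_env is a hypothetical stand-in for the .env-writing part of
# the startup script; it reproduces only a subset of the real keys.
render_env() {
  local app_dir="$1" image_tag="$2"
  local existing_tag="" existing_tmp="" temp_dir_line=""
  if [ -f "$app_dir/.env" ]; then
    existing_tag=$(grep -E '^AGNES_TAG=' "$app_dir/.env" | head -1 | cut -d= -f2- | tr -d '"' || true)
    existing_tmp=$(grep -E '^AGNES_TEMP_DIR=' "$app_dir/.env" | head -1 | cut -d= -f2- | tr -d '"' || true)
  fi
  # Only emit AGNES_TEMP_DIR when the operator had set it.
  if [ -n "$existing_tmp" ]; then
    temp_dir_line="AGNES_TEMP_DIR=\"$existing_tmp\""
  fi
  cat > "$app_dir/.env" <<ENVEOF
LOG_LEVEL=info
AGNES_TAG=${existing_tag:-$image_tag}
$temp_dir_line
ENVEOF
  chmod 600 "$app_dir/.env"
}

# Usage with illustrative tag values.
app_dir="$(mktemp -d)"
render_env "$app_dir" "stable-2026.05.1"   # first provision: template value lands
printf 'AGNES_TAG=dev-2026.05.3\nAGNES_TEMP_DIR=/mnt/big/tmp\n' > "$app_dir/.env"   # operator edit
render_env "$app_dir" "stable-2026.05.2"   # later boot: the operator's values survive
cat "$app_dir/.env"
```

When no `AGNES_TEMP_DIR` override exists, the empty variable leaves a blank line at the end of the heredoc.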