fix(ops): fail-fast guard in agnes-auto-upgrade — refuse start if config disk not mounted (#146)

* fix(ops): fail-fast guard in agnes-auto-upgrade — refuse to start containers if config disk not mounted

Companion to keboola/agnes-the-ai-analyst-infra#62. Same incident:
foundryai-development 2026-04-30, marketplaces / DuckDB / session secret
written to /data (sdb) instead of the config disk (sdc), wiped on next
container recreate.

## Why an app-side guard

agnes-auto-upgrade.sh fires every 5 min on every VM. If `/data/state` is
not on the config disk (because of the propagation regression fixed by the
infra PR, or the boot-time udev race fixed by infra #58, or any future
mount-loss path), this script previously ran `docker compose up -d`
anyway — and the app silently wrote state onto the wrong disk. Next
recreate, that state was gone.

The boot-time fixes in infra are preventive. This is the runtime backstop.

## Behavior

Before the existing pull/up logic, when /dev/disk/by-id/google-config-disk
exists on the VM:

1. Up to 3 mount-and-verify attempts with backoff (2s, 4s, 6s).
   - Mount the config disk if /data/state is not a mountpoint.
   - Detect mismatch: if /data/state is mounted from the wrong source,
     umount and retry.
2. After the loop, assert findmnt source matches the config disk.
   - On mismatch: `logger -t agnes-auto-upgrade FATAL` + exit 1. systemd
     marks the service failed; no docker compose action runs; existing
     containers (if any) keep running on stale state, but no new write
     lands on the wrong disk.
3. Once verified mounted: re-apply `mount --make-rprivate /data /data/state`
   on every run. Idempotent. Guards against propagation regressions
   sneaking back in via future docker / kernel changes.

VMs without a config disk (foundryai-poc, single-disk legacy) skip the
whole block — the `if [ -e $CONFIG_DEVICE ]` guard.

## Tested

Patched script installed on foundryai-development as a hotfix; manual run
post-migration was a no-op (digest unchanged); /data/state stayed on sdc
across a full `docker compose down + up -d` cycle.

## Rollout

- This file is fetched by infra startup.sh from
  raw.githubusercontent.com/keboola/agnes-the-ai-analyst/main on every
  boot. Once merged to main, all VMs pick up the new script on their
  next boot — no infra recreate needed.
- For immediate rollout to running VMs without waiting for next boot:
  `scp scripts/ops/agnes-auto-upgrade.sh <vm>:/tmp/ &&
   ssh <vm> sudo install -m755 -o root -g root /tmp/agnes-auto-upgrade.sh
   /usr/local/bin/agnes-auto-upgrade.sh` (already done on
  foundryai-development).

* chore: vendor-agnostic comment + changelog text

Drop customer-specific VM names from the script comment and
CHANGELOG entry. The OSS distribution should not name a particular
operator's hosts; the technical description already conveys why
the guard exists.

* fix(ops): suppress mount stderr in retry loop

Match the rest of the script's error-tolerant idiom (2>/dev/null).
Mount failures in the cold-boot udev race the loop is designed
to handle gracefully should not flow to stdout — cron would mail
on every transient retry.

Devin BUG_0001 on PR #146.

* fix(changelog): move auto-upgrade entry to [Unreleased]

Entry landed under v0.20.0 because that section was [Unreleased]
when this branch first opened — releases v0.21–v0.24 cut in the
meantime stranded it inside an already-released section. Move it
back where new entries belong.

Devin BUG_0001 on PR #146.

* fix(infra): single-source agnes-auto-upgrade.sh via curl from main

Replace the inline heredoc copy of the auto-upgrade script in the
customer-instance Terraform startup template with a curl fetch from
raw.githubusercontent.com on every boot. The inline copy had drifted
several iterations behind canonical scripts/ops/agnes-auto-upgrade.sh
(missing TLS overlay detection, array-form COMPOSE_FILES, and now
the config-disk fail-fast guard from this PR).

Devin ANALYSIS_0001 on PR #146.

* fix(infra): fetch docker-compose.tls.yml unconditionally + document coupling

The canonical agnes-auto-upgrade.sh from main detects TLS at runtime
via cert files on disk, regardless of the TLS_MODE Terraform variable.
Certs can appear after boot via agnes-tls-rotate.sh or manual
provisioning, and the cron job would then fail every 5 min under
'set -euo pipefail' because docker-compose.tls.yml was never fetched.

Also document the main-vs-COMPOSE_REF coupling: when the canonical
script references a new compose file, the fetch list above must be
updated to match — pinned-ref VMs would otherwise break.

Devin BUG_0001 + ANALYSIS_0001 on PR #146.

* fix(ops,infra): unconditional Caddyfile + skip tls overlay if missing

Caddyfile fetch now matches docker-compose.tls.yml: unconditional in
startup-script.sh.tpl. Without it, Docker would auto-create an empty
directory at the bind-mount target and Caddy would crash-loop while
the tls overlay has already closed :8000 — making the app
unreachable on any non-caddy VM where certs land via rotate or
manual provisioning.

Defensive layer: agnes-auto-upgrade.sh now also requires Caddyfile
to exist (size > 0) before activating the tls profile, with a
WARN log if it's missing. Belt-and-suspenders so the failure mode
is contained even when the script is deployed by some other path
(not just the customer-instance TF module).

Devin BUG_0001 on PR #146.

* chore(release): cut 0.25.0

---------

Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
This commit is contained in:
Vojtech 2026-04-30 22:07:22 +04:00 committed by GitHub
parent fb1573766a
commit ddffdfeafd
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
4 changed files with 105 additions and 30 deletions

View file

@ -10,6 +10,35 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C
## [Unreleased] ## [Unreleased]
## [0.25.0] — 2026-04-30
### Fixed
- `scripts/ops/agnes-auto-upgrade.sh`: fail-fast guard before any `docker
compose` action — when the VM has a config disk attached
(`/dev/disk/by-id/google-config-disk` exists), `/data/state` MUST be backed
by it. Three retry attempts with backoff, then exit non-zero. Prevents the
silent regression where docker host-mount propagation unmounts the config
disk and the app writes user state (DuckDB, marketplaces, session secret)
onto `/data` (sdb) — wiped on the next container recreate. Re-applies
`mount --make-rprivate /data /data/state` on every run to defend against
propagation regressions.
- `infra/modules/customer-instance/startup-script.sh.tpl`: replaced the
inline heredoc copy of the auto-upgrade script with a `curl` from
`raw.githubusercontent.com/keboola/agnes-the-ai-analyst/main/scripts/ops/agnes-auto-upgrade.sh`
— single source of truth eliminates drift (the inline copy had fallen
behind on TLS overlay detection, array-form compose files, and the new
config-disk guard). VMs re-fetch on every boot, so script-only fixes
propagate without an infra recreate. Also: `docker-compose.tls.yml` is
now fetched unconditionally (not only when `tls_mode=caddy`), because
the canonical auto-upgrade script detects TLS at runtime via cert files
on disk — certs can appear after boot via `agnes-tls-rotate.sh` or
manual provisioning, and the cron job would otherwise fail every 5 min
until the file was placed. Same reasoning extends to `Caddyfile`:
fetched unconditionally now, plus `agnes-auto-upgrade.sh` skips the
tls overlay when `Caddyfile` is missing/empty (defensive — without
it the caddy service crash-loops while the overlay closes `:8000`,
net effect "app unreachable").
## [0.24.0] — 2026-04-30 ## [0.24.0] — 2026-04-30
### Changed ### Changed

View file

@ -60,11 +60,18 @@ curl -fsSL "$${RAW_BASE}/docker-compose.yml" -o docker-compose.yml
curl -fsSL "$${RAW_BASE}/docker-compose.prod.yml" -o docker-compose.prod.yml curl -fsSL "$${RAW_BASE}/docker-compose.prod.yml" -o docker-compose.prod.yml
# Overlay which binds `data` volume to host /data (persistent disk mounted above) # Overlay which binds `data` volume to host /data (persistent disk mounted above)
curl -fsSL "$${RAW_BASE}/docker-compose.host-mount.yml" -o docker-compose.host-mount.yml curl -fsSL "$${RAW_BASE}/docker-compose.host-mount.yml" -o docker-compose.host-mount.yml
# TLS overlay + Caddyfile — fetched unconditionally because agnes-auto-upgrade.sh
# TLS overlay (Caddy + Let's Encrypt) — fetch only when actually needed; surface failures # (curled from main below) detects TLS at runtime via cert files on disk,
if [ "$TLS_MODE" = "caddy" ] && [ -n "$DOMAIN" ]; then # regardless of TLS_MODE. Certs can appear after boot via agnes-tls-rotate.sh
# or manual provisioning, and:
# - the cron job would fail under `set -euo pipefail` every 5 min if
# docker-compose.tls.yml were missing, and
# - the caddy service in docker-compose.yml bind-mounts ./Caddyfile:ro,
# so without it on disk Docker auto-creates an empty directory there
# and Caddy crash-loops while the overlay has already closed :8000.
# Cheap to keep on disk either way.
curl -fsSL "$${RAW_BASE}/docker-compose.tls.yml" -o docker-compose.tls.yml
curl -fsSL "$${RAW_BASE}/Caddyfile" -o Caddyfile curl -fsSL "$${RAW_BASE}/Caddyfile" -o Caddyfile
fi
# --- 4. Fetch secrets from Secret Manager — fail loudly if missing --- # --- 4. Fetch secrets from Secret Manager — fail loudly if missing ---
KEBOOLA_TOKEN="" KEBOOLA_TOKEN=""
@ -161,28 +168,23 @@ docker compose $COMPOSE_FILES $COMPOSE_PROFILES_ARG up -d
# --- 6. Auto-upgrade via cron (pulls new image digest every 5 min) --- # --- 6. Auto-upgrade via cron (pulls new image digest every 5 min) ---
if [ "$UPGRADE_MODE" = "auto" ]; then if [ "$UPGRADE_MODE" = "auto" ]; then
# Cron script sources /opt/agnes/.env for AGNES_TAG — so if operator edits .env # Single-source the cron script from the OSS repo's main branch instead
# (e.g. to pin a specific stable-YYYY.MM.N), cron picks it up immediately. No # of inlining a copy here. Two reasons:
# drift between what compose up reads and what the digest-check inspects. # 1. Drift prevention — earlier inline copy missed several iterations
cat > /usr/local/bin/agnes-auto-upgrade.sh <<'SCRIPTEOF' # of the canonical script (TLS overlay detection, array-form compose
#!/bin/bash # files, config-disk fail-fast guard).
# Runs from cron — pulls new image if one is available, restarts containers. # 2. Re-fetched on every VM boot, so script-only fixes propagate
set -euo pipefail # without an infra recreate. For immediate rollout to running VMs,
cd /opt/agnes # operators can also re-run this fetch by hand.
# Source .env so AGNES_TAG reflects any operator edits since boot. #
# shellcheck disable=SC1091 # Coupling note: this URL is pinned to `main` while compose files above
set -a; . /opt/agnes/.env; set +a # honor $COMPOSE_REF. If a future canonical script references a NEW
IMAGE="ghcr.io/keboola/agnes-the-ai-analyst:$${AGNES_TAG:-stable}" # compose file, the fetch list above MUST be updated to match — pinned-
COMPOSE_FILES="-f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.host-mount.yml" # ref VMs would otherwise break on the next cron tick. Treat the docker-
BEFORE=$(docker images --no-trunc --format '{{.Digest}}' "$IMAGE" | head -1) # compose.* fetch list as the contract that agnes-auto-upgrade.sh relies
docker compose $COMPOSE_FILES pull >/dev/null 2>&1 # on; new compose files referenced from main need a corresponding fetch.
AFTER=$(docker images --no-trunc --format '{{.Digest}}' "$IMAGE" | head -1) SCRIPT_URL="https://raw.githubusercontent.com/keboola/agnes-the-ai-analyst/main/scripts/ops/agnes-auto-upgrade.sh"
if [ "$BEFORE" != "$AFTER" ]; then curl -fsSL --retry 3 --retry-delay 2 "$SCRIPT_URL" -o /usr/local/bin/agnes-auto-upgrade.sh
echo "$(date): new image digest for $IMAGE — recreating containers"
docker compose $COMPOSE_FILES up -d
docker image prune -f >/dev/null 2>&1
fi
SCRIPTEOF
chmod +x /usr/local/bin/agnes-auto-upgrade.sh chmod +x /usr/local/bin/agnes-auto-upgrade.sh
# Install cron entry idempotently: remove any prior agnes-auto-upgrade line, then append ours. # Install cron entry idempotently: remove any prior agnes-auto-upgrade line, then append ours.

View file

@ -1,6 +1,6 @@
[project] [project]
name = "agnes-the-ai-analyst" name = "agnes-the-ai-analyst"
version = "0.24.0" version = "0.25.0"
description = "Agnes — AI Data Analyst platform for AI analytical systems" description = "Agnes — AI Data Analyst platform for AI analytical systems"
requires-python = ">=3.11,<3.14" requires-python = ">=3.11,<3.14"
license = "MIT" license = "MIT"

View file

@ -10,6 +10,45 @@ set -euo pipefail
cd /opt/agnes cd /opt/agnes
# shellcheck disable=SC1091 # shellcheck disable=SC1091
set -a; . /opt/agnes/.env; set +a set -a; . /opt/agnes/.env; set +a
# Fail-fast guard: if the VM has a config disk attached, it MUST be
# mounted at /data/state before any container action. Otherwise the
# app would write state onto /data (sdb) and lose it on the next
# container recreate — the regression that motivated this guard.
# Three retries (mount may race with udev on cold boot) then hard exit.
CONFIG_DEVICE=/dev/disk/by-id/google-config-disk
if [ -e "$CONFIG_DEVICE" ]; then
attempt=0
while [ $attempt -lt 3 ]; do
attempt=$((attempt + 1))
if mountpoint -q /data/state; then
expected_dev=$(readlink -f "$CONFIG_DEVICE")
actual_dev=$(findmnt -n -o SOURCE /data/state)
if [ "$expected_dev" = "$actual_dev" ]; then
break
fi
logger -t agnes-auto-upgrade "WARN: /data/state on $actual_dev, expected $expected_dev — attempting remount"
umount /data/state 2>/dev/null || true
fi
mount "$CONFIG_DEVICE" /data/state 2>/dev/null || true
sleep $((attempt * 2))
done
if ! mountpoint -q /data/state || \
[ "$(readlink -f "$CONFIG_DEVICE")" != "$(findmnt -n -o SOURCE /data/state)" ]; then
logger -t agnes-auto-upgrade "FATAL: config disk not mounted at /data/state — refusing to start containers"
echo "FATAL: /data/state is not backed by the config disk." >&2
echo " Refusing to run docker compose — app state must NEVER land on /data (sdb)." >&2
echo " Inspect: mount | grep /data/state ; ls /dev/disk/by-id/google-config-disk" >&2
exit 1
fi
# Re-apply propagation in case a prior container teardown reset it.
# Idempotent — safe to call when already private.
mount --make-rprivate /data 2>/dev/null || true
mount --make-rprivate /data/state 2>/dev/null || true
fi
IMAGE="ghcr.io/keboola/agnes-the-ai-analyst:${AGNES_TAG:-stable}" IMAGE="ghcr.io/keboola/agnes-the-ai-analyst:${AGNES_TAG:-stable}"
# Array form (vs. word-split string) — quoted expansion survives paths # Array form (vs. word-split string) — quoted expansion survives paths
# with spaces and is the modern bash idiom. Functionally identical here # with spaces and is the modern bash idiom. Functionally identical here
@ -20,10 +59,15 @@ PROFILE_ARGS=()
# rotate.sh wrote a 0-byte cert and exited (or got SIGKILLed mid-write). # rotate.sh wrote a 0-byte cert and exited (or got SIGKILLed mid-write).
# Bringing up the tls profile against an empty cert would just crash # Bringing up the tls profile against an empty cert would just crash
# Caddy on start; better to fall back to plain :8000 until rotate # Caddy on start; better to fall back to plain :8000 until rotate
# regenerates real bytes. # regenerates real bytes. Same `-s` rule for Caddyfile: without it (or
if [ -s /data/state/certs/fullchain.pem ] && [ -s /data/state/certs/privkey.pem ]; then # with an empty one) the caddy service crash-loops while the tls overlay
# has already closed :8000 — net effect is "app unreachable". Skipping
# the overlay keeps the app on plain :8000 until config lands.
if [ -s /data/state/certs/fullchain.pem ] && [ -s /data/state/certs/privkey.pem ] && [ -s Caddyfile ]; then
COMPOSE_FILES+=( -f docker-compose.tls.yml ) COMPOSE_FILES+=( -f docker-compose.tls.yml )
PROFILE_ARGS=( --profile tls ) PROFILE_ARGS=( --profile tls )
elif [ -s /data/state/certs/fullchain.pem ] && [ -s /data/state/certs/privkey.pem ]; then
logger -t agnes-auto-upgrade "WARN: certs present but Caddyfile missing/empty — skipping tls overlay"
fi fi
BEFORE=$(docker images --no-trunc --format '{{.Digest}}' "$IMAGE" | head -1) BEFORE=$(docker images --no-trunc --format '{{.Digest}}' "$IMAGE" | head -1)
docker compose "${COMPOSE_FILES[@]}" pull >/dev/null 2>&1 docker compose "${COMPOSE_FILES[@]}" pull >/dev/null 2>&1