* fix(ops): fail-fast guard in agnes-auto-upgrade; refuse to start containers if config disk not mounted

Companion to keboola/agnes-the-ai-analyst-infra#62. Same incident: foundryai-development 2026-04-30, marketplaces / DuckDB / session secret written to /data (sdb) instead of the config disk (sdc), wiped on next container recreate.

## Why an app-side guard

agnes-auto-upgrade.sh fires every 5 min on every VM. If `/data/state` is not on the config disk (because of the propagation regression fixed by the infra PR, or the boot-time udev race fixed by infra #58, or any future mount-loss path), this script previously ran `docker compose up -d` anyway, and the app silently wrote state onto the wrong disk. On the next recreate, that state was gone.

The boot-time fixes in infra are preventive. This is the runtime backstop.

## Behavior

Before the existing pull/up logic, when /dev/disk/by-id/google-config-disk exists on the VM:

1. Up to 3 mount-and-verify attempts with backoff (2s, 4s, 6s).
   - Mount the config disk if /data/state is not a mountpoint.
   - Detect mismatch: if /data/state is mounted from the wrong source, umount and retry.
2. After the loop, assert findmnt source matches the config disk.
   - On mismatch: `logger -t agnes-auto-upgrade FATAL` + exit 1. systemd marks the service failed; no docker compose action runs; existing containers (if any) keep running on stale state, but no new write lands on the wrong disk.
3. Once verified mounted: re-apply `mount --make-rprivate` to /data and to /data/state on every run. Idempotent. Guards against propagation regressions sneaking back in via future docker / kernel changes.

VMs without a config disk (foundryai-poc, single-disk legacy) skip the whole block via the `if [ -e $CONFIG_DEVICE ]` guard.

## Tested

Patched script installed on foundryai-development as a hotfix; manual run post-migration was a no-op (digest unchanged); /data/state stayed on sdc across a full `docker compose down + up -d` cycle.
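The mount-and-verify steps described under Behavior can be traced without touching real mounts. The following is a minimal dry-run sketch, not the shipped script: `current_source`, `do_mount`, `do_umount`, and `do_sleep` are hypothetical stand-ins for `findmnt`, `mount`, `umount`, and `sleep`, and the device names are simulated values rather than anything read from a VM.

```shell
#!/bin/bash
# Dry run of the retry loop with every system call stubbed out.
# All helper names below are hypothetical; the real script calls
# mountpoint, findmnt, mount, umount and sleep directly.

EXPECTED_DEV=/dev/sdc      # stand-in for: readlink -f "$CONFIG_DEVICE"
CURRENT_SOURCE=/dev/sdb    # simulated source of /data/state ("" = not mounted)
DELAYS=""                  # records backoff delays instead of sleeping

current_source() { printf '%s' "$CURRENT_SOURCE"; }  # stand-in for findmnt
do_umount()      { CURRENT_SOURCE=""; }              # stand-in for umount
do_mount()       { CURRENT_SOURCE=$EXPECTED_DEV; }   # stand-in for mount
do_sleep()       { DELAYS="$DELAYS$1s "; }           # stand-in for sleep

verify_config_mount() {
  attempt=0
  while [ "$attempt" -lt 3 ]; do
    attempt=$((attempt + 1))
    actual=$(current_source)
    if [ -n "$actual" ]; then
      if [ "$actual" = "$EXPECTED_DEV" ]; then
        break                       # correct disk already mounted
      fi
      do_umount                     # wrong source: drop it and remount
    fi
    do_mount
    do_sleep $((attempt * 2))       # linear backoff: 2, 4, 6
  done
  # Final assertion, evaluated whether the loop broke early or ran out.
  [ "$(current_source)" = "$EXPECTED_DEV" ]
}

if verify_config_mount; then
  RESULT=ok
else
  RESULT=fatal                      # the real script logs FATAL and exits 1
fi
echo "result=$RESULT delays=$DELAYS"
```

With the simulated wrong source above, the first attempt unmounts and remounts, the second attempt finds the expected device and breaks, so only the 2s delay is recorded. Changing `do_mount` to a no-op exercises the fatal path with all three delays.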
## Rollout

- This file is fetched by infra startup.sh from raw.githubusercontent.com/keboola/agnes-the-ai-analyst/main on every boot. Once merged to main, all VMs pick up the new script on their next boot; no infra recreate is needed.
- For immediate rollout to running VMs without waiting for next boot: `scp scripts/ops/agnes-auto-upgrade.sh <vm>:/tmp/ && ssh <vm> sudo install -m755 -o root -g root /tmp/agnes-auto-upgrade.sh /usr/local/bin/agnes-auto-upgrade.sh` (already done on foundryai-development).

* chore: vendor-agnostic comment + changelog text

Drop customer-specific VM names from the script comment and CHANGELOG entry. The OSS distribution should not name a particular operator's hosts; the technical description already conveys why the guard exists.

* fix(ops): suppress mount stderr in retry loop

Match the rest of the script's error-tolerant idiom (2>/dev/null). Mount failures in the cold-boot udev race, which the loop is designed to handle gracefully, should not flow to stderr, where cron would mail on every transient retry.

Devin BUG_0001 on PR #146.

* fix(changelog): move auto-upgrade entry to [Unreleased]

Entry landed under v0.20.0 because that section was [Unreleased] when this branch first opened; releases v0.21–v0.24, cut in the meantime, stranded it inside an already-released section. Move it back where new entries belong.

Devin BUG_0001 on PR #146.

* fix(infra): single-source agnes-auto-upgrade.sh via curl from main

Replace the inline heredoc copy of the auto-upgrade script in the customer-instance Terraform startup template with a curl fetch from raw.githubusercontent.com on every boot. The inline copy had drifted several iterations behind canonical scripts/ops/agnes-auto-upgrade.sh (missing TLS overlay detection, array-form COMPOSE_FILES, and now the config-disk fail-fast guard from this PR).

Devin ANALYSIS_0001 on PR #146.
* fix(infra): fetch docker-compose.tls.yml unconditionally + document coupling

The canonical agnes-auto-upgrade.sh from main detects TLS at runtime via cert files on disk, regardless of the TLS_MODE Terraform variable. Certs can appear after boot via agnes-tls-rotate.sh or manual provisioning, and the cron job would then fail every 5 min under 'set -euo pipefail' because docker-compose.tls.yml was never fetched.

Also document the main-vs-COMPOSE_REF coupling: when the canonical script references a new compose file, the fetch list above must be updated to match; pinned-ref VMs would otherwise break.

Devin BUG_0001 + ANALYSIS_0001 on PR #146.

* fix(ops,infra): unconditional Caddyfile + skip tls overlay if missing

Caddyfile fetch now matches docker-compose.tls.yml: unconditional in startup-script.sh.tpl. Without it, Docker would auto-create an empty directory at the bind-mount target and Caddy would crash-loop while the tls overlay has already closed :8000, making the app unreachable on any non-caddy VM where certs land via rotate or manual provisioning.

Defensive layer: agnes-auto-upgrade.sh now also requires Caddyfile to exist (size > 0) before activating the tls profile, with a WARN log if it's missing. Belt-and-suspenders so the failure mode is contained even when the script is deployed by some other path (not just the customer-instance TF module).

Devin BUG_0001 on PR #146.

* chore(release): cut 0.25.0

---------

Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
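The Caddyfile requirement from the last fix can be illustrated with a small sketch run against a throwaway directory. It assumes a hypothetical helper, `choose_mode`, that mirrors the script's `-s` (exists and size > 0) tests; the real script inlines these tests against /data/state/certs and the Caddyfile in /opt/agnes, and the file contents written here are placeholders, not real certs or Caddy config.

```shell
#!/bin/bash
# Sketch of the tls-overlay decision. choose_mode is a hypothetical helper;
# it prints which branch the auto-upgrade script would take.
choose_mode() {
  certs_dir=$1
  caddyfile=$2
  if [ -s "$certs_dir/fullchain.pem" ] && [ -s "$certs_dir/privkey.pem" ]; then
    if [ -s "$caddyfile" ]; then
      echo tls          # certs and Caddyfile all non-empty: enable overlay
    else
      echo plain-warn   # certs present, Caddyfile missing or empty: WARN, skip
    fi
  else
    echo plain          # no usable certs: stay on plain :8000
  fi
}

tmp=$(mktemp -d)
mkdir "$tmp/certs"

# Case 1: nothing provisioned yet.
MODE_NO_CERTS=$(choose_mode "$tmp/certs" "$tmp/Caddyfile")

# Case 2: certs landed, but the Caddyfile is a 0-byte file.
# A plain -f test would pass here; -s does not.
echo placeholder-cert > "$tmp/certs/fullchain.pem"
echo placeholder-key  > "$tmp/certs/privkey.pem"
: > "$tmp/Caddyfile"
MODE_EMPTY_CADDYFILE=$(choose_mode "$tmp/certs" "$tmp/Caddyfile")

# Case 3: Caddyfile has content (placeholder text, not valid config).
echo placeholder-config > "$tmp/Caddyfile"
MODE_READY=$(choose_mode "$tmp/certs" "$tmp/Caddyfile")

rm -rf "$tmp"
echo "$MODE_NO_CERTS $MODE_EMPTY_CADDYFILE $MODE_READY"
```

The three cases print `plain plain-warn tls`. The middle case is the one the defensive layer exists for: certs are present but the overlay is skipped, so the app stays reachable on :8000.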
81 lines
4.1 KiB
Bash
Executable file
#!/bin/bash
# Deployed to /usr/local/bin/agnes-auto-upgrade.sh on the VM.
# Cron fires it every 5 min; pulls latest image for the pinned AGNES_TAG
# and recreates containers only if the digest moved.
#
# Cert-aware: if /data/state/certs/{fullchain,privkey}.pem both exist
# (populated by agnes-tls-rotate.sh), enables the tls overlay so Caddy
# fronts :443. Absence → plain HTTP on :8000.
set -euo pipefail
cd /opt/agnes
# shellcheck disable=SC1091
set -a; . /opt/agnes/.env; set +a

# Fail-fast guard: if the VM has a config disk attached, it MUST be
# mounted at /data/state before any container action. Otherwise the
# app would write state onto /data (sdb) and lose it on the next
# container recreate — the regression that motivated this guard.
# Three retries (mount may race with udev on cold boot) then hard exit.
CONFIG_DEVICE=/dev/disk/by-id/google-config-disk
if [ -e "$CONFIG_DEVICE" ]; then
    attempt=0
    while [ $attempt -lt 3 ]; do
        attempt=$((attempt + 1))
        if mountpoint -q /data/state; then
            expected_dev=$(readlink -f "$CONFIG_DEVICE")
            actual_dev=$(findmnt -n -o SOURCE /data/state)
            if [ "$expected_dev" = "$actual_dev" ]; then
                break
            fi
            logger -t agnes-auto-upgrade "WARN: /data/state on $actual_dev, expected $expected_dev — attempting remount"
            umount /data/state 2>/dev/null || true
        fi
        mount "$CONFIG_DEVICE" /data/state 2>/dev/null || true
        sleep $((attempt * 2))
    done

    if ! mountpoint -q /data/state || \
       [ "$(readlink -f "$CONFIG_DEVICE")" != "$(findmnt -n -o SOURCE /data/state)" ]; then
        logger -t agnes-auto-upgrade "FATAL: config disk not mounted at /data/state — refusing to start containers"
        echo "FATAL: /data/state is not backed by the config disk." >&2
        echo " Refusing to run docker compose — app state must NEVER land on /data (sdb)." >&2
        echo " Inspect: mount | grep /data/state ; ls /dev/disk/by-id/google-config-disk" >&2
        exit 1
    fi

    # Re-apply propagation in case a prior container teardown reset it.
    # Idempotent — safe to call when already private.
    mount --make-rprivate /data 2>/dev/null || true
    mount --make-rprivate /data/state 2>/dev/null || true
fi

IMAGE="ghcr.io/keboola/agnes-the-ai-analyst:${AGNES_TAG:-stable}"
# Array form (vs. word-split string) — quoted expansion survives paths
# with spaces and is the modern bash idiom. Functionally identical here
# since /opt/agnes paths are tame, but it's a cheap habit to keep.
COMPOSE_FILES=( -f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.host-mount.yml )
PROFILE_ARGS=()
# `-s` (size > 0) instead of `-f` — guards against the corner case where
# rotate.sh wrote a 0-byte cert and exited (or got SIGKILLed mid-write).
# Bringing up the tls profile against an empty cert would just crash
# Caddy on start; better to fall back to plain :8000 until rotate
# regenerates real bytes. Same `-s` rule for Caddyfile: without it (or
# with an empty one) the caddy service crash-loops while the tls overlay
# has already closed :8000 — net effect is "app unreachable". Skipping
# the overlay keeps the app on plain :8000 until config lands.
if [ -s /data/state/certs/fullchain.pem ] && [ -s /data/state/certs/privkey.pem ] && [ -s Caddyfile ]; then
    COMPOSE_FILES+=( -f docker-compose.tls.yml )
    PROFILE_ARGS=( --profile tls )
elif [ -s /data/state/certs/fullchain.pem ] && [ -s /data/state/certs/privkey.pem ]; then
    logger -t agnes-auto-upgrade "WARN: certs present but Caddyfile missing/empty — skipping tls overlay"
fi
BEFORE=$(docker images --no-trunc --format '{{.Digest}}' "$IMAGE" | head -1)
docker compose "${COMPOSE_FILES[@]}" pull >/dev/null 2>&1
AFTER=$(docker images --no-trunc --format '{{.Digest}}' "$IMAGE" | head -1)
if [ "$BEFORE" != "$AFTER" ]; then
    echo "$(date): new digest for $IMAGE — recreating containers"
    # ${arr[@]+"${arr[@]}"} pattern: expands to nothing when array is
    # empty (vs. plain "${arr[@]}" which trips `set -u` on bash <4.4).
    docker compose "${COMPOSE_FILES[@]}" ${PROFILE_ARGS[@]+"${PROFILE_ARGS[@]}"} up -d
    docker image prune -f >/dev/null 2>&1
fi
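
The empty-array expansion used in the final `docker compose ... up -d` call can be checked in isolation. This is a minimal sketch, assuming a hypothetical `count_args` helper in place of `docker compose`; the second array value contains a space purely to show that quoting is preserved, and is not a real profile name.

```shell
#!/bin/bash
# Demonstrates ${arr[@]+"${arr[@]}"} under `set -u`.
# count_args is a hypothetical stand-in for the docker compose call;
# it just reports how many arguments it received.
set -u

count_args() { echo "$#"; }

# Empty array: the expansion yields no words at all, and does not
# raise an "unbound variable" error on older bash releases.
PROFILE_ARGS=()
EMPTY_COUNT=$(count_args ${PROFILE_ARGS[@]+"${PROFILE_ARGS[@]}"})

# Non-empty array: each element arrives as exactly one argument,
# even when an element contains whitespace.
PROFILE_ARGS=( --profile "tls overlay" )
FULL_COUNT=$(count_args ${PROFILE_ARGS[@]+"${PROFILE_ARGS[@]}"})

echo "empty=$EMPTY_COUNT full=$FULL_COUNT"
```

This prints `empty=0 full=2`. The outer `${...+...}` is deliberately left unquoted so that an empty array contributes zero arguments rather than one empty string, while the inner quoted expansion keeps each element intact.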