* fix(ops): fail-fast guard in agnes-auto-upgrade — refuse to start containers if config disk not mounted

Companion to keboola/agnes-the-ai-analyst-infra#62. Same incident: foundryai-development 2026-04-30, marketplaces / DuckDB / session secret written to /data (sdb) instead of the config disk (sdc), wiped on next container recreate.

## Why an app-side guard

agnes-auto-upgrade.sh fires every 5 min on every VM. If `/data/state` is not on the config disk (because of the propagation regression fixed by the infra PR, or the boot-time udev race fixed by infra #58, or any future mount-loss path), this script previously ran `docker compose up -d` anyway — and the app silently wrote state onto the wrong disk. Next recreate, that state was gone.

The boot-time fixes in infra are preventive. This is the runtime backstop.

## Behavior

Before the existing pull/up logic, when /dev/disk/by-id/google-config-disk exists on the VM:

1. Up to 3 mount-and-verify attempts with backoff (2s, 4s, 6s).
   - Mount the config disk if /data/state is not a mountpoint.
   - Detect mismatch: if /data/state is mounted from the wrong source, umount and retry.
2. After the loop, assert findmnt source matches the config disk.
   - On mismatch: `logger -t agnes-auto-upgrade FATAL` + exit 1. systemd marks the service failed; no docker compose action runs; existing containers (if any) keep running on stale state, but no new write lands on the wrong disk.
3. Once verified mounted: re-apply `mount --make-rprivate /data /data/state` on every run. Idempotent. Guards against propagation regressions sneaking back in via future docker / kernel changes.

VMs without a config disk (foundryai-poc, single-disk legacy) skip the whole block — the `if [ -e $CONFIG_DEVICE ]` guard.

## Tested

Patched script installed on foundryai-development as a hotfix; manual run post-migration was a no-op (digest unchanged); /data/state stayed on sdc across a full `docker compose down + up -d` cycle.
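The Behavior steps above could be sketched roughly as follows. This is a minimal illustration, not the shipped agnes-auto-upgrade.sh: the function names (`current_source`, `source_matches`, `try_mount`, `try_unmount`, `ensure_config_mount`, `apply_private`, `run_guard`), the `STATE_MNT` variable, and the FATAL message text are all hypothetical, and splitting the `--make-rprivate` step into one `mount` call per mountpoint is an assumption about how the commit's shorthand is meant.

```shell
#!/bin/bash
# Hypothetical sketch of the mount-and-verify guard; defines functions only.
CONFIG_DEVICE="${CONFIG_DEVICE:-/dev/disk/by-id/google-config-disk}"
STATE_MNT="${STATE_MNT:-/data/state}"   # assumed variable name

# Device currently backing $STATE_MNT, or empty when it is not a mountpoint.
current_source() { findmnt -n -o SOURCE -M "$STATE_MNT" 2>/dev/null || true; }

# True when both paths resolve to the same node (by-id links point at /dev/sdX).
source_matches() {
  [ -n "$1" ] && [ "$(readlink -f "$1")" = "$(readlink -f "$2")" ]
}

# Mount failures are tolerated here; the verification step decides the outcome.
try_mount()   { mount "$CONFIG_DEVICE" "$STATE_MNT" 2>/dev/null || true; }
try_unmount() { umount "$STATE_MNT" 2>/dev/null || true; }

# Up to 3 mount-and-verify attempts with 2s/4s/6s backoff, then a final assert.
ensure_config_mount() {
  local attempt src
  for attempt in 1 2 3; do
    src="$(current_source)"
    source_matches "$src" "$CONFIG_DEVICE" && return 0
    if [ -n "$src" ]; then try_unmount; fi   # mounted from the wrong source
    try_mount
    source_matches "$(current_source)" "$CONFIG_DEVICE" && return 0
    sleep "$((attempt * 2))"
  done
  source_matches "$(current_source)" "$CONFIG_DEVICE"
}

# Assumption: one --make-rprivate call per mountpoint named in the commit text.
apply_private() {
  mount --make-rprivate /data 2>/dev/null || true
  mount --make-rprivate "$STATE_MNT" 2>/dev/null || true
}

# Returns 0 when it is safe to continue, 1 when the caller should abort.
run_guard() {
  [ -e "$CONFIG_DEVICE" ] || return 0      # single-disk VM: skip the whole block
  if ! ensure_config_mount; then
    logger -t agnes-auto-upgrade "FATAL: $STATE_MNT is not on $CONFIG_DEVICE"
    return 1
  fi
  apply_private
}
```

Under these assumptions a caller would run `run_guard || exit 1` before any `docker compose` command, so a failed verification stops the run before new containers can write to the wrong disk.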
## Rollout

- This file is fetched by infra startup.sh from raw.githubusercontent.com/keboola/agnes-the-ai-analyst/main on every boot. Once merged to main, all VMs pick up the new script on their next boot — no infra recreate needed.
- For immediate rollout to running VMs without waiting for next boot: `scp scripts/ops/agnes-auto-upgrade.sh <vm>:/tmp/ && ssh <vm> sudo install -m755 -o root -g root /tmp/agnes-auto-upgrade.sh /usr/local/bin/agnes-auto-upgrade.sh` (already done on foundryai-development).

* chore: vendor-agnostic comment + changelog text

Drop customer-specific VM names from the script comment and CHANGELOG entry. The OSS distribution should not name a particular operator's hosts; the technical description already conveys why the guard exists.

* fix(ops): suppress mount stderr in retry loop

Match the rest of the script's error-tolerant idiom (2>/dev/null). Mount failures in the cold-boot udev race the loop is designed to handle gracefully should not flow to stdout — cron would mail on every transient retry.

Devin BUG_0001 on PR #146.

* fix(changelog): move auto-upgrade entry to [Unreleased]

Entry landed under v0.20.0 because that section was [Unreleased] when this branch first opened — releases v0.21–v0.24 cut in the meantime stranded it inside an already-released section. Move it back where new entries belong.

Devin BUG_0001 on PR #146.

* fix(infra): single-source agnes-auto-upgrade.sh via curl from main

Replace the inline heredoc copy of the auto-upgrade script in the customer-instance Terraform startup template with a curl fetch from raw.githubusercontent.com on every boot. The inline copy had drifted several iterations behind canonical scripts/ops/agnes-auto-upgrade.sh (missing TLS overlay detection, array-form COMPOSE_FILES, and now the config-disk fail-fast guard from this PR).

Devin ANALYSIS_0001 on PR #146.
* fix(infra): fetch docker-compose.tls.yml unconditionally + document coupling

The canonical agnes-auto-upgrade.sh from main detects TLS at runtime via cert files on disk, regardless of the TLS_MODE Terraform variable. Certs can appear after boot via agnes-tls-rotate.sh or manual provisioning, and the cron job would then fail every 5 min under 'set -euo pipefail' because docker-compose.tls.yml was never fetched.

Also document the main-vs-COMPOSE_REF coupling: when the canonical script references a new compose file, the fetch list above must be updated to match — pinned-ref VMs would otherwise break.

Devin BUG_0001 + ANALYSIS_0001 on PR #146.

* fix(ops,infra): unconditional Caddyfile + skip tls overlay if missing

Caddyfile fetch now matches docker-compose.tls.yml: unconditional in startup-script.sh.tpl. Without it, Docker would auto-create an empty directory at the bind-mount target and Caddy would crash-loop while the tls overlay has already closed :8000 — making the app unreachable on any non-caddy VM where certs land via rotate or manual provisioning.

Defensive layer: agnes-auto-upgrade.sh now also requires Caddyfile to exist (size > 0) before activating the tls profile, with a WARN log if it's missing. Belt-and-suspenders so the failure mode is contained even when the script is deployed by some other path (not just the customer-instance TF module).

Devin BUG_0001 on PR #146.

* chore(release): cut 0.25.0

---------

Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
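The defensive layer described in the last fix above (activate the tls profile only when the Caddyfile is present and non-empty) might look something like the sketch below. It is illustrative only: the function name `build_compose_args`, the `COMPOSE_ARGS` array, the WARN wording, and especially the `CERT_FILE` path are assumptions, since the commit message does not say which cert files the canonical script checks.

```shell
#!/bin/bash
# Hypothetical sketch of runtime TLS overlay gating; defines a function only.
APP_DIR="${APP_DIR:-/opt/agnes}"
# Assumed cert location; the real script's cert paths are not given above.
CERT_FILE="${CERT_FILE:-/data/state/certs/fullchain.pem}"

# Populates the COMPOSE_ARGS array with the base files, plus the TLS overlay
# and profile only when certs, a non-empty Caddyfile, and the overlay exist.
build_compose_args() {
  COMPOSE_ARGS=(
    -f docker-compose.yml
    -f docker-compose.prod.yml
    -f docker-compose.host-mount.yml
  )

  # No cert on disk: plain HTTP deployment, nothing more to add.
  [ -s "$CERT_FILE" ] || return 0

  # Certs present but supporting files missing: warn and stay on the base set,
  # so Caddy is never started against an auto-created empty directory.
  if [ ! -s "$APP_DIR/Caddyfile" ] || [ ! -f "$APP_DIR/docker-compose.tls.yml" ]; then
    echo "WARN: certs found but Caddyfile or docker-compose.tls.yml missing; skipping tls overlay" >&2
    return 0
  fi

  COMPOSE_ARGS+=(-f docker-compose.tls.yml --profile tls)
}
```

A caller would then run `build_compose_args` followed by `docker compose "${COMPOSE_ARGS[@]}" up -d` from `$APP_DIR`. Keeping the arguments in an array rather than a flat string avoids relying on word splitting.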
196 lines
9.3 KiB
Smarty
#!/bin/bash
# Agnes VM startup script — templated by Terraform.
# Idempotent — runs on every boot.
set -euo pipefail
exec > /var/log/agnes-startup.log 2>&1
chmod 640 /var/log/agnes-startup.log  # defense in depth — not readable by non-root

CUSTOMER_NAME="${customer_name}"
IMAGE_REPO="${image_repo}"
IMAGE_TAG="${image_tag}"
UPGRADE_MODE="${upgrade_mode}"
TLS_MODE="${tls_mode}"
DOMAIN="${domain}"
ACME_EMAIL="${acme_email}"
DATA_SOURCE="${data_source}"
KEBOOLA_STACK_URL="${keboola_stack_url}"
SEED_ADMIN_EMAIL="${seed_admin_email}"
SEED_ADMIN_PASSWORD="${seed_admin_password}"
ROLE="${role}"
COMPOSE_REF="${compose_ref}"

echo "=== [Agnes $CUSTOMER_NAME $ROLE] Startup at $(date) ==="

# --- 1. Docker (install if missing) ---
if ! command -v docker &>/dev/null; then
  curl -fsSL https://get.docker.com | sh
fi
if ! docker compose version &>/dev/null; then
  apt-get update && apt-get install -y docker-compose-plugin
fi

# --- 2. Persistent data disk mount ---
DATA_DEV="/dev/disk/by-id/google-data"
DATA_MNT="/data"
if [ -b "$DATA_DEV" ]; then
  if ! blkid "$DATA_DEV" | grep -q ext4; then
    mkfs.ext4 -F "$DATA_DEV"
  fi
  mkdir -p "$DATA_MNT"
  mountpoint -q "$DATA_MNT" || mount -o discard,defaults "$DATA_DEV" "$DATA_MNT"
  grep -qF "$DATA_DEV" /etc/fstab || echo "$DATA_DEV $DATA_MNT ext4 discard,defaults,nofail 0 2" >> /etc/fstab
  mkdir -p "$DATA_MNT/state" "$DATA_MNT/analytics" "$DATA_MNT/extracts"
  # Match Dockerfile USER agnes (uid:gid 999:999). A freshly-attached PD is
  # root-owned by default; without this chown the non-root container cannot
  # write to /data/state/system.duckdb and every authed request 500s after
  # the first upgrade that flips USER from root to agnes (regression hit
  # agnes-development on 2026-04-29). Idempotent — safe on reboot.
  chown -R 999:999 "$DATA_MNT"
fi

# --- 3. App directory + docker-compose files from public repo ---
APP_DIR="/opt/agnes"
mkdir -p "$APP_DIR"
cd "$APP_DIR"

# Fetch docker-compose files pinned to $COMPOSE_REF (defaults to `main`; pin to a
# stable-YYYY.MM.N tag for reproducibility across VM rebuilds).
RAW_BASE="https://raw.githubusercontent.com/keboola/agnes-the-ai-analyst/$${COMPOSE_REF}"
curl -fsSL "$${RAW_BASE}/docker-compose.yml" -o docker-compose.yml
curl -fsSL "$${RAW_BASE}/docker-compose.prod.yml" -o docker-compose.prod.yml
# Overlay which binds `data` volume to host /data (persistent disk mounted above)
curl -fsSL "$${RAW_BASE}/docker-compose.host-mount.yml" -o docker-compose.host-mount.yml
# TLS overlay + Caddyfile — fetched unconditionally because agnes-auto-upgrade.sh
# (curled from main below) detects TLS at runtime via cert files on disk,
# regardless of TLS_MODE. Certs can appear after boot via agnes-tls-rotate.sh
# or manual provisioning, and:
#   - the cron job would fail under `set -euo pipefail` every 5 min if
#     docker-compose.tls.yml were missing, and
#   - the caddy service in docker-compose.yml bind-mounts ./Caddyfile:ro,
#     so without it on disk Docker auto-creates an empty directory there
#     and Caddy crash-loops while the overlay has already closed :8000.
# Cheap to keep on disk either way.
curl -fsSL "$${RAW_BASE}/docker-compose.tls.yml" -o docker-compose.tls.yml
curl -fsSL "$${RAW_BASE}/Caddyfile" -o Caddyfile

# --- 4. Fetch secrets from Secret Manager — fail loudly if missing ---
KEBOOLA_TOKEN=""
if [ "$DATA_SOURCE" = "keboola" ]; then
  # No `|| echo ""` fallback — if the token secret is missing, boot should fail
  # loudly rather than silently start an app that will fail sync cryptically later.
  KEBOOLA_TOKEN=$(gcloud secrets versions access latest --secret=keboola-storage-token)
fi
JWT_KEY=$(gcloud secrets versions access latest --secret=agnes-$${CUSTOMER_NAME}-jwt-secret)

# SCHEDULER_API_TOKEN — shared secret between the app and scheduler containers.
# Both source the same /opt/agnes/.env via Docker Compose env_file:, so the
# scheduler's outbound bearer token always matches the app's expected value.
# See app/auth/scheduler_token.py for the auth path it unlocks.
#
# Preserve across reboots: the token is plumbed into a long-lived synthetic
# user, and rotating it forces a restart of both containers. Read back from
# an existing .env when present; mint fresh only on the first boot.
SCHEDULER_API_TOKEN=""
if [ -f "$APP_DIR/.env" ]; then
  SCHEDULER_API_TOKEN=$(grep -E '^SCHEDULER_API_TOKEN=' "$APP_DIR/.env" | head -1 | cut -d= -f2- | tr -d '"' || true)
fi
if [ -z "$SCHEDULER_API_TOKEN" ]; then
  # 64 hex chars = 256 bits of /dev/urandom entropy. Floor enforced in
  # app/auth/scheduler_token.SCHEDULER_TOKEN_MIN_LENGTH is 32; 64 leaves
  # headroom for a future tightening without re-provisioning every VM.
  SCHEDULER_API_TOKEN=$(openssl rand -hex 32)
fi

# Optional Google OAuth credentials. If the operator has created
# google-oauth-client-{id,secret} secrets in the project's Secret Manager
# AND wired them via runtime_secrets in the calling Terraform, the VM SA can
# read them — write into .env so the Google sign-in flow works. Missing /
# 403 / empty → silent fallback to "" so password + email auth keep working.
GOOGLE_CLIENT_ID=$(gcloud secrets versions access latest --secret=google-oauth-client-id 2>/dev/null || echo "")
GOOGLE_CLIENT_SECRET=$(gcloud secrets versions access latest --secret=google-oauth-client-secret 2>/dev/null || echo "")

# AGNES_VERSION, RELEASE_CHANNEL, AGNES_COMMIT_SHA are baked into the image
# itself as ENV (see Dockerfile ARG/ENV + release.yml build-args). We do NOT
# set them here — doing so would override the image-level values with the
# floating tag name ("stable"/"dev"), hiding the real CalVer / git SHA.
# The app picks them up from the image's runtime environment.

# CADDY_TLS controls Caddyfile cert provisioning (see Caddyfile inline docs).
#   - tls_mode=caddy + ACME_EMAIL set → Let's Encrypt auto-issue (public domain)
#   - tls_mode=caddy + no ACME_EMAIL  → Caddy-managed self-signed (lab use)
#   - any other tls_mode              → leave CADDY_TLS unset, Caddyfile default
#                                       (cert-file mode for corporate PKI) applies.
# Operators wanting cert-file mode shouldn't set tls_mode at all on the dev
# instance — leave it "none" and let the corp-PKI rotate scripts handle certs.
CADDY_TLS_LINE=""
if [ "$TLS_MODE" = "caddy" ] && [ -n "$DOMAIN" ]; then
  # Value MUST be quoted in the .env file: agnes-auto-upgrade.sh sources
  # /opt/agnes/.env via `set -a; . .env; set +a`, and bash interprets an
  # unquoted `KEY=value with spaces` as `KEY=value` followed by trying to
  # exec `with`/`spaces` as commands → boot succeeds but every cron tick
  # logs "<email>: command not found".
  if [ -n "$ACME_EMAIL" ]; then
    CADDY_TLS_LINE="CADDY_TLS=\"tls $ACME_EMAIL\""
  else
    CADDY_TLS_LINE="CADDY_TLS=\"tls internal\""
  fi
fi

cat > "$APP_DIR/.env" <<ENVEOF
JWT_SECRET_KEY=$JWT_KEY
DATA_DIR=$DATA_MNT
DATA_SOURCE=$DATA_SOURCE
KEBOOLA_STORAGE_TOKEN=$KEBOOLA_TOKEN
KEBOOLA_STACK_URL=$KEBOOLA_STACK_URL
SEED_ADMIN_EMAIL=$SEED_ADMIN_EMAIL
SEED_ADMIN_PASSWORD=$SEED_ADMIN_PASSWORD
SCHEDULER_API_TOKEN=$SCHEDULER_API_TOKEN
LOG_LEVEL=info
DOMAIN=$DOMAIN
AGNES_TAG=$IMAGE_TAG
ACME_EMAIL=$ACME_EMAIL
GOOGLE_CLIENT_ID=$GOOGLE_CLIENT_ID
GOOGLE_CLIENT_SECRET=$GOOGLE_CLIENT_SECRET
$CADDY_TLS_LINE
ENVEOF
chmod 600 "$APP_DIR/.env"

# --- 5. Start Agnes ---
COMPOSE_PROFILES_ARG=""
if [ "$TLS_MODE" = "caddy" ] && [ -n "$DOMAIN" ]; then
  COMPOSE_PROFILES_ARG="--profile tls"
fi

COMPOSE_FILES="-f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.host-mount.yml"

docker compose $COMPOSE_FILES $COMPOSE_PROFILES_ARG pull
docker compose $COMPOSE_FILES $COMPOSE_PROFILES_ARG up -d

# --- 6. Auto-upgrade via cron (pulls new image digest every 5 min) ---
if [ "$UPGRADE_MODE" = "auto" ]; then
  # Single-source the cron script from the OSS repo's main branch instead
  # of inlining a copy here. Two reasons:
  #   1. Drift prevention — earlier inline copy missed several iterations
  #      of the canonical script (TLS overlay detection, array-form compose
  #      files, config-disk fail-fast guard).
  #   2. Re-fetched on every VM boot, so script-only fixes propagate
  #      without an infra recreate. For immediate rollout to running VMs,
  #      operators can also re-run this fetch by hand.
  #
  # Coupling note: this URL is pinned to `main` while compose files above
  # honor $COMPOSE_REF. If a future canonical script references a NEW
  # compose file, the fetch list above MUST be updated to match — pinned-
  # ref VMs would otherwise break on the next cron tick. Treat the docker-
  # compose.* fetch list as the contract that agnes-auto-upgrade.sh relies
  # on; new compose files referenced from main need a corresponding fetch.
  SCRIPT_URL="https://raw.githubusercontent.com/keboola/agnes-the-ai-analyst/main/scripts/ops/agnes-auto-upgrade.sh"
  curl -fsSL --retry 3 --retry-delay 2 "$SCRIPT_URL" -o /usr/local/bin/agnes-auto-upgrade.sh
  chmod +x /usr/local/bin/agnes-auto-upgrade.sh

  # Install cron entry idempotently: remove any prior agnes-auto-upgrade line, then append ours.
  CRON_LINE="*/5 * * * * /usr/local/bin/agnes-auto-upgrade.sh >> /var/log/agnes-auto-upgrade.log 2>&1"
  (crontab -l 2>/dev/null | grep -v agnes-auto-upgrade || true; echo "$CRON_LINE") | crontab -
fi

echo "=== [Agnes $CUSTOMER_NAME $ROLE] Startup complete at $(date) ==="
docker compose ps