chore(auto-upgrade): re-fetch compose + Caddyfile, self-update

Sibling change to the Caddy file_server PR (#182). Without this, existing long-uptime VMs would pull the new agnes image on auto-upgrade but keep their stale Caddyfile + docker-compose.yml, leaving the file_server route + the data:/srv:ro mount inert. Confirmed live 2026-05-05 when the file_server change merged into main but stayed unreachable on a running dev VM until /opt/agnes/* was scp'd by hand.

agnes-auto-upgrade.sh now re-fetches the bind-mounted config files (Caddyfile + every docker-compose overlay) on every 5 min tick, hashes them before and after, and triggers a `docker compose up -d` recreation when the hash drifts (same trigger path as an image-digest change). Fail-soft via the .new-then-mv pattern: a curl 404 / network blip leaves the existing file untouched.

Self-update at the bottom of the script: re-fetch /usr/local/bin/agnes-auto-upgrade.sh itself so the very fix that watches config files lands on running VMs without a manual ssh-and-curl cycle. Otherwise we'd have a self-perpetuating "old script problem": the watch-config logic never propagating to the VMs that need it.

Operators no longer need to ssh + scp Caddyfile/compose changes.
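The .new-then-mv pattern named in the message can be sketched in isolation as follows. This is a minimal illustration, not the code from the commit: the helper name `fetch_atomic` and its two-argument interface are assumptions made for the example. Only the `curl -fsSL`, `mv -f`, and `rm -f` calls mirror what the script does.

```shell
#!/usr/bin/env bash
# Sketch of a fail-soft fetch. fetch_atomic is a hypothetical helper name,
# not a function that exists in agnes-auto-upgrade.sh.
fetch_atomic() {
  local url="$1" dest="$2"
  # Download to a sibling temp file so a failed transfer never touches
  # the live file. -f makes curl exit non-zero on HTTP errors such as 404.
  if curl -fsSL "$url" -o "$dest.new" 2>/dev/null; then
    # A rename within one filesystem swaps the file in a single step:
    # readers see either the old content or the new, never a mix.
    mv -f "$dest.new" "$dest"
  else
    # Clean up any partial download and keep the existing file.
    rm -f "$dest.new"
    echo "WARN: failed to fetch $url; keeping existing $dest" >&2
    return 1
  fi
}
```

A call such as `fetch_atomic "$RAW_BASE/Caddyfile" /opt/agnes/Caddyfile` would then either replace the file completely or leave it exactly as it was.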
2026-05-05 15:18:48 +02:00 · 2026-05-05 15:18:48 +02:00 · ab61e30c91
commit ab61e30c91
parent 1be997f6d4
2 changed files with 65 additions and 2 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -12,6 +12,7 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C

 ### Added
 - **`data_source.bigquery.query_timeout_ms` config knob** (default 600 000 ms = 10 min). The DuckDB BigQuery extension's built-in default of 90 s was too tight for analyst-scale queries against view-backed BQ datasets — `agnes query --remote` would HTTP 400 with `Binder Error: Query execution exceeded the timeout. Job ID: …` whenever the underlying BQ job took longer than 90 s, even though the BQ job itself was healthy. The new knob is applied via `SET bq_query_timeout_ms` after every `LOAD bigquery` on every BQ-touching DuckDB session — the orchestrator's `_remote_attach` ATTACH path (`src/orchestrator.py`), the analytics-DB read-only reattach path (`src/db.py:_reattach_remote_extensions` — the primary `agnes query --remote` request path), the `BqAccess` session factory (`connectors/bigquery/access.py`), and the standalone extractor (`connectors/bigquery/extractor.py`). Sentinel `0` (or non-numeric / unparseable values) leaves the extension default in place so operators on legacy extension versions that don't recognise the setting aren't broken. Configurable via `/admin/server-config` UI. Note: BigQuery's `jobs.query` RPC caps the wait at ~200 s per call regardless of this setting; the extension polls on top so the effective ceiling is the value here but each poll is ~200 s. DuckDB emits an informational warning when the value is set above the BQ RPC cap — operators can safely ignore it.
+- **`scripts/ops/agnes-auto-upgrade.sh` now re-fetches Caddyfile + every compose overlay** from `keboola/agnes-the-ai-analyst@main` on every tick, hashes them, and triggers a `docker compose up -d` recreation when the hash changes — same path as an image-digest change. Pre-fix the script only watched `docker images` digests, so a Caddyfile or compose change in main never reached running VMs (only fresh boots ran `startup.sh`'s file fetch). Without this, the new file_server downloads-path below would land in the image but stay inert against an old Caddyfile. The script also self-updates from the same path so the very fix that watches config files isn't itself stuck on running VMs. Fail-soft on curl errors — keeps the existing file rather than blanking it.
 - **Caddy `file_server` for parquet downloads** — `GET /api/data/{table_id}/download` is now intercepted at the Caddy layer (TLS profile only) and served directly via sendfile/zero-copy from the data volume mounted read-only at `/srv` inside the caddy container. Caddy authorises every request via a new lightweight RBAC probe `GET /api/data/{table_id}/check-access` (returns 204 when the caller has read access on the table, 403 otherwise) using the `forward_auth` directive — the bulk byte transfer never touches uvicorn workers. Resolves a real production failure mode where a single multi-GB analyst pull held the app's only uvicorn worker for the duration of the stream and starved the UI / `/api/health` / every other API endpoint, eventually flipping the container to `unhealthy`. Path discovery uses Caddy's `try_files` over the known `extract.duckdb` v2 source subdirs (`bigquery/data/<id>.parquet`, `keboola/data/<id>.parquet`, `jira/data/<id>.parquet`); a parquet not at any of those paths transparently falls through to the existing app handler so legacy `src_data/parquet` layouts and future connectors keep working with no Caddyfile change. Non-Caddy deployments (dev `docker compose up` without `--profile tls`) continue to use the app handler unchanged.

 ### Fixed
--- a/scripts/ops/agnes-auto-upgrade.sh
+++ b/scripts/ops/agnes-auto-upgrade.sh
@@ -72,10 +72,72 @@ fi
 BEFORE=$(docker images --no-trunc --format '{{.Digest}}' "$IMAGE" | head -1)
 docker compose "${COMPOSE_FILES[@]}" pull >/dev/null 2>&1
 AFTER=$(docker images --no-trunc --format '{{.Digest}}' "$IMAGE" | head -1)
-if [ "$BEFORE" != "$AFTER" ]; then
-    echo "$(date): new digest for $IMAGE — recreating containers"
+
+# Re-fetch the bind-mounted config files (compose overlays + Caddyfile)
+# from the OSS main branch on every tick. Without this, an image-only
+# change is fine, but a change to the Caddyfile or any compose overlay
+# (e.g. a new bind mount, a route, an env_file path) only lands on VMs
+# that get a fresh `startup.sh` boot — leaving long-uptime VMs running
+# the new image against stale config. Confirmed live on 2026-05-05
+# when a Caddyfile change adding a `data:/srv:ro` mount + a new
+# `forward_auth` + `file_server` route for parquet downloads landed
+# in main but stayed inert on running VMs because auto-upgrade only
+# watched image digests.
+#
+# Hash before/after to detect content drift; treat as "trigger recreate"
+# alongside an image digest change. Atomic move-after-fetch guards
+# against a partial download corrupting compose at the next docker
+# action — `curl --fail` plus the `.new` rename means a 404 / network
+# blip leaves the existing file untouched.
+RAW_BASE="https://raw.githubusercontent.com/keboola/agnes-the-ai-analyst/main"
+CONFIG_FILES=(
+  docker-compose.yml docker-compose.prod.yml docker-compose.host-mount.yml
+  docker-compose.tls.yml Caddyfile
+)
+hash_config_files() {
+  # Sort so the hash does not depend on the order of CONFIG_FILES. A
+  # missing file contributes a "missing <name>" line instead of a
+  # sha256sum entry. Run from /opt/agnes to keep relative paths terse.
+  ( cd /opt/agnes && for f in "${CONFIG_FILES[@]}"; do
+      sha256sum "$f" 2>/dev/null || printf 'missing %s\n' "$f"
+    done ) | sort | sha256sum | awk '{print $1}'
+}
+CONFIG_BEFORE=$(hash_config_files)
+for f in "${CONFIG_FILES[@]}"; do
+  if curl -fsSL "$RAW_BASE/$f" -o "/opt/agnes/$f.new" 2>/dev/null; then
+    mv -f "/opt/agnes/$f.new" "/opt/agnes/$f"
+  else
+    rm -f "/opt/agnes/$f.new"
+    logger -t agnes-auto-upgrade "WARN: failed to fetch $f from $RAW_BASE — keeping existing /opt/agnes/$f"
+  fi
+done
+CONFIG_AFTER=$(hash_config_files)
+
+if [ "$BEFORE" != "$AFTER" ] || [ "$CONFIG_BEFORE" != "$CONFIG_AFTER" ]; then
+    REASON=()
+    [ "$BEFORE" != "$AFTER" ] && REASON+=("image digest")
+    [ "$CONFIG_BEFORE" != "$CONFIG_AFTER" ] && REASON+=("config files")
+    echo "$(date): change detected (${REASON[*]}) — recreating containers"
     # ${arr[@]+"${arr[@]}"} pattern: expands to nothing when array is
     # empty (vs. plain "${arr[@]}" which trips `set -u` on bash <4.4).
     docker compose "${COMPOSE_FILES[@]}" ${PROFILE_ARGS[@]+"${PROFILE_ARGS[@]}"} up -d
     docker image prune -f >/dev/null 2>&1
 fi
+
+# Self-update: re-fetch *this* script too. Without this, the very fix
+# that lets auto-upgrade watch config files would itself never land on
+# running VMs: a self-perpetuating "old script" problem. Atomic via
+# .new + mv, with the executable bit set before the move. The next tick
+# (5 min later) runs the new logic. A failed curl leaves the script as is.
+if curl -fsSL "$RAW_BASE/scripts/ops/agnes-auto-upgrade.sh" \
+   -o /usr/local/bin/agnes-auto-upgrade.sh.new 2>/dev/null; then
+  if ! cmp -s /usr/local/bin/agnes-auto-upgrade.sh.new \
+                /usr/local/bin/agnes-auto-upgrade.sh; then
+    chmod +x /usr/local/bin/agnes-auto-upgrade.sh.new
+    mv -f /usr/local/bin/agnes-auto-upgrade.sh.new \
+          /usr/local/bin/agnes-auto-upgrade.sh
+    logger -t agnes-auto-upgrade "self-update: replaced /usr/local/bin/agnes-auto-upgrade.sh"
+  else
+    rm -f /usr/local/bin/agnes-auto-upgrade.sh.new
+  fi
+fi
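
How the drift check behaves can be seen by running the same hashing pipeline against a scratch directory. The sketch below is adapted from `hash_config_files` in the diff; the `hash_dir_files` name, its directory-first argument list, and the demo file contents are assumptions for illustration. It shows that a missing file contributes a `missing <name>` line rather than an error, and that any content change in a listed file changes the combined hash.

```shell
#!/usr/bin/env bash
# hash_dir_files DIR FILE... : hypothetical, parameterised variant of
# hash_config_files from the diff above.
hash_dir_files() {
  local dir="$1"; shift
  ( cd "$dir" && for f in "$@"; do
      # Present files contribute "<sha256>  <name>"; absent ones a marker line.
      sha256sum "$f" 2>/dev/null || printf 'missing %s\n' "$f"
    done ) | sort | sha256sum | awk '{print $1}'
}

# Demo in a throwaway directory: only Caddyfile exists, the compose
# file is deliberately absent.
demo_dir=$(mktemp -d)
printf 'route v1\n' > "$demo_dir/Caddyfile"

before=$(hash_dir_files "$demo_dir" Caddyfile docker-compose.yml)
printf 'route v2\n' > "$demo_dir/Caddyfile"   # simulate a re-fetch with new content
after=$(hash_dir_files "$demo_dir" Caddyfile docker-compose.yml)

if [ "$before" != "$after" ]; then
  echo "config drift detected"
fi
rm -rf "$demo_dir"
```

Because the comparison is on a single combined hash, the script does not need to know which file changed; it only needs to know that a recreate is due.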