From ab61e30c91680865c570d75aaf4b21d0175a3ef2 Mon Sep 17 00:00:00 2001
From: ZdenekSrotyr
Date: Tue, 5 May 2026 15:18:48 +0200
Subject: [PATCH] chore(auto-upgrade): re-fetch compose + Caddyfile, self-update
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Sibling change to the Caddy file_server PR (#182). Without this, existing
long-uptime VMs would pull the new agnes image on auto-upgrade but keep
their stale Caddyfile + docker-compose.yml, leaving the file_server route
+ the data:/srv:ro mount inert. Confirmed live 2026-05-05 when the
file_server change merged in main but stayed unreachable on a running dev
VM until /opt/agnes/* was scp'd by hand.

agnes-auto-upgrade.sh now hashes the bind-mounted config files (Caddyfile
+ every docker-compose overlay) on every 5-minute tick and triggers a
`docker compose up -d` recreation when the hash drifts, the same trigger
path as an image-digest change. Fail-soft via the .new-then-mv pattern: a
curl 404 / network blip leaves the existing file untouched.

Self-update at the bottom of the script: re-fetch
/usr/local/bin/agnes-auto-upgrade.sh itself so the very fix that watches
config files lands on running VMs without a manual ssh-and-curl cycle.
Otherwise we'd have a self-perpetuating "old script problem": the
watch-config logic never propagating to the VMs that need it.

Operators no longer need to ssh + scp Caddyfile/compose changes.
---
 CHANGELOG.md                      |  1 +
 scripts/ops/agnes-auto-upgrade.sh | 66 ++++++++++++++++++++++++++++++-
 2 files changed, 65 insertions(+), 2 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index b23d870..406f6bb 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -12,6 +12,7 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C
 ### Added
 - **`data_source.bigquery.query_timeout_ms` config knob** (default 600 000 ms = 10 min).
The DuckDB BigQuery extension's built-in default of 90 s was too tight for analyst-scale queries against view-backed BQ datasets — `agnes query --remote` would HTTP 400 with `Binder Error: Query execution exceeded the timeout. Job ID: …` whenever the underlying BQ job took longer than 90 s, even though the BQ job itself was healthy. The new knob is applied via `SET bq_query_timeout_ms` after every `LOAD bigquery` on every BQ-touching DuckDB session — the orchestrator's `_remote_attach` ATTACH path (`src/orchestrator.py`), the analytics-DB read-only reattach path (`src/db.py:_reattach_remote_extensions` — the primary `agnes query --remote` request path), the `BqAccess` session factory (`connectors/bigquery/access.py`), and the standalone extractor (`connectors/bigquery/extractor.py`). Sentinel `0` (or non-numeric / unparseable values) leaves the extension default in place so operators on legacy extension versions that don't recognise the setting aren't broken. Configurable via `/admin/server-config` UI. Note: BigQuery's `jobs.query` RPC caps the wait at ~200 s per call regardless of this setting; the extension polls on top so the effective ceiling is the value here but each poll is ~200 s. DuckDB emits an informational warning when the value is set above the BQ RPC cap — operators can safely ignore it.
+- **`scripts/ops/agnes-auto-upgrade.sh` now re-fetches Caddyfile + every compose overlay** from `keboola/agnes-the-ai-analyst@main` on every tick, hashes them, and triggers a `docker compose up -d` recreation when the hash changes — same path as an image-digest change. Pre-fix the script only watched `docker images` digests, so a Caddyfile or compose change in main never reached running VMs (only fresh boots ran `startup.sh`'s file fetch). Without this, the new file_server downloads-path below would land in the image but stay inert against an old Caddyfile.
The script also self-updates from the same path so the very fix that watches config files isn't itself stuck on running VMs. Fail-soft on curl errors — keeps the existing file rather than blanking it.
 - **Caddy `file_server` for parquet downloads** — `GET /api/data/{table_id}/download` is now intercepted at the Caddy layer (TLS profile only) and served directly via sendfile/zero-copy from the data volume mounted read-only at `/srv` inside the caddy container. Caddy authorises every request via a new lightweight RBAC probe `GET /api/data/{table_id}/check-access` (returns 204 when the caller has read access on the table, 403 otherwise) using the `forward_auth` directive — the bulk byte transfer never touches uvicorn workers. Resolves a real production failure mode where a single multi-GB analyst pull held the app's only uvicorn worker for the duration of the stream and starved the UI / `/api/health` / every other API endpoint, eventually flipping the container to `unhealthy`. Path discovery uses Caddy's `try_files` over the known `extract.duckdb` v2 source subdirs (`bigquery/data/.parquet`, `keboola/data/.parquet`, `jira/data/.parquet`); a parquet not at any of those paths transparently falls through to the existing app handler so legacy `src_data/parquet` layouts and future connectors keep working with no Caddyfile change. Non-Caddy deployments (dev `docker compose up` without `--profile tls`) continue to use the app handler unchanged.
 ### Fixed
diff --git a/scripts/ops/agnes-auto-upgrade.sh b/scripts/ops/agnes-auto-upgrade.sh
index 537f2cf..2dbd922 100755
--- a/scripts/ops/agnes-auto-upgrade.sh
+++ b/scripts/ops/agnes-auto-upgrade.sh
@@ -72,10 +72,72 @@ fi
 BEFORE=$(docker images --no-trunc --format '{{.Digest}}' "$IMAGE" | head -1)
 docker compose "${COMPOSE_FILES[@]}" pull >/dev/null 2>&1
 AFTER=$(docker images --no-trunc --format '{{.Digest}}' "$IMAGE" | head -1)
-if [ "$BEFORE" != "$AFTER" ]; then
-  echo "$(date): new digest for $IMAGE — recreating containers"
+
+# Re-fetch the bind-mounted config files (compose overlays + Caddyfile)
+# from the OSS main branch on every tick. Without this, an image-only
+# change is fine, but a change to the Caddyfile or any compose overlay
+# (e.g. a new bind mount, a route, an env_file path) only lands on VMs
+# that get a fresh `startup.sh` boot — leaving long-uptime VMs running
+# the new image against stale config. Confirmed live on 2026-05-05
+# when a Caddyfile change adding a `data:/srv:ro` mount + a new
+# `forward_auth` + `file_server` route for parquet downloads landed
+# in main but stayed inert on running VMs because auto-upgrade only
+# watched image digests.
+#
+# Hash before/after to detect content drift; treat as "trigger recreate"
+# alongside an image digest change. Atomic move-after-fetch guards
+# against a partial download corrupting compose at the next docker
+# action — `curl --fail` plus the `.new` rename means a 404 / network
+# blip leaves the existing file untouched.
+RAW_BASE="https://raw.githubusercontent.com/keboola/agnes-the-ai-analyst/main"
+CONFIG_FILES=(
+  docker-compose.yml docker-compose.prod.yml docker-compose.host-mount.yml
+  docker-compose.tls.yml Caddyfile
+)
+hash_config_files() {
+  # Sort to keep hash stable across operator add/remove; a missing file
+  # contributes a `missing <name>` line in place of its digest. Run
+  # from /opt/agnes to keep relative paths terse in the hash input.
+  ( cd /opt/agnes && for f in "${CONFIG_FILES[@]}"; do
+      sha256sum "$f" 2>/dev/null || printf 'missing %s\n' "$f"
+    done ) | sort | sha256sum | awk '{print $1}'
+}
+CONFIG_BEFORE=$(hash_config_files)
+for f in "${CONFIG_FILES[@]}"; do
+  if curl -fsSL "$RAW_BASE/$f" -o "/opt/agnes/$f.new" 2>/dev/null; then
+    mv -f "/opt/agnes/$f.new" "/opt/agnes/$f"
+  else
+    rm -f "/opt/agnes/$f.new"
+    logger -t agnes-auto-upgrade "WARN: failed to fetch $f from $RAW_BASE — keeping existing /opt/agnes/$f"
+  fi
+done
+CONFIG_AFTER=$(hash_config_files)
+
+if [ "$BEFORE" != "$AFTER" ] || [ "$CONFIG_BEFORE" != "$CONFIG_AFTER" ]; then
+  REASON=()
+  [ "$BEFORE" != "$AFTER" ] && REASON+=("image digest")
+  [ "$CONFIG_BEFORE" != "$CONFIG_AFTER" ] && REASON+=("config files")
+  echo "$(date): change detected (${REASON[*]}) — recreating containers"
   # ${arr[@]+"${arr[@]}"} pattern: expands to nothing when array is
   # empty (vs. plain "${arr[@]}" which trips `set -u` on bash <4.4).
   docker compose "${COMPOSE_FILES[@]}" ${PROFILE_ARGS[@]+"${PROFILE_ARGS[@]}"} up -d
   docker image prune -f >/dev/null 2>&1
 fi
+
+# Self-update: re-fetch *this* script too. Without this, the very fix
+# that lets auto-upgrade watch config files would itself never land on
+# running VMs — a self-perpetuating "old script" problem. Atomic via
+# .new + mv; chmod preserved. The next tick (5 min later) runs the
+# new logic. Skipping if curl fails leaves the existing script in place.
+if curl -fsSL "$RAW_BASE/scripts/ops/agnes-auto-upgrade.sh" \
+    -o /usr/local/bin/agnes-auto-upgrade.sh.new 2>/dev/null; then
+  if ! cmp -s /usr/local/bin/agnes-auto-upgrade.sh.new \
+      /usr/local/bin/agnes-auto-upgrade.sh; then
+    chmod +x /usr/local/bin/agnes-auto-upgrade.sh.new
+    mv -f /usr/local/bin/agnes-auto-upgrade.sh.new \
+      /usr/local/bin/agnes-auto-upgrade.sh
+    logger -t agnes-auto-upgrade "self-update: replaced /usr/local/bin/agnes-auto-upgrade.sh"
+  else
+    rm -f /usr/local/bin/agnes-auto-upgrade.sh.new
+  fi
+fi
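
The config-drift check and the fail-soft replacement in this patch can be tried offline. The following is a minimal sketch, not the script itself: `hash_files` and `replace_if_fetched` are hypothetical names, the file contents are made up, and `cp` stands in for `curl -fsSL` so that nothing touches the network or `/opt/agnes`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# hash_files DIR FILE...: print one digest covering every listed file.
# A missing file contributes a "missing <name>" line, as in the patch.
hash_files() {
  local dir=$1; shift
  ( cd "$dir" && for f in "$@"; do
      sha256sum "$f" 2>/dev/null || printf 'missing %s\n' "$f"
    done ) | sort | sha256sum | awk '{print $1}'
}

# replace_if_fetched SRC DEST: stage SRC at DEST.new, then rename over DEST.
# Assumption: `cp` stands in for `curl -fsSL <url> -o "$dest.new"`.
replace_if_fetched() {
  local src=$1 dest=$2
  if cp "$src" "$dest.new" 2>/dev/null; then
    mv -f "$dest.new" "$dest"
  else
    rm -f "$dest.new"   # drop any partial file; DEST stays as it was
    return 1
  fi
}

# Demo in a throwaway directory (hypothetical file contents).
work=$(mktemp -d)
printf 'old config\n' > "$work/Caddyfile"
printf 'new config\n' > "$work/upstream-Caddyfile"

before=$(hash_files "$work" Caddyfile docker-compose.yml)
# A failed "fetch" must leave the existing file in place.
replace_if_fetched "$work/no-such-file" "$work/Caddyfile" || echo "fetch failed, file kept"
# A successful "fetch" swaps the file in with a single rename.
replace_if_fetched "$work/upstream-Caddyfile" "$work/Caddyfile"
after=$(hash_files "$work" Caddyfile docker-compose.yml)

if [ "$before" != "$after" ]; then
  echo "config drift detected, containers would be recreated"
fi
rm -rf "$work"
```

Staging to `.new` and renaming keeps a half-written download from ever sitting at the path that `docker compose` reads, and a failed fetch leaves the combined hash unchanged, so by itself it does not trigger a recreate.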