* fix(scheduler): HTTP marketplaces job + SCHEDULER_API_TOKEN shared secret

  Two scheduler-reliability bugs surfaced after the v0.12.1 USER-agnes flip:

  1. The marketplaces job called src.marketplace.sync_marketplaces() in-process from the scheduler container, racing the app's long-lived system.duckdb handle. DuckDB rejects cross-process writers, so every cron tick 500-ed on "Could not set lock on file ... PID 0".
  2. The data-refresh + new marketplaces jobs both 401-ed on the API because SCHEDULER_API_TOKEN was never propagated by the Terraform startup script. The scheduler had no credential to authenticate with.

  Fix:

  - New POST /api/marketplaces/sync-all (admin-only) drives the nightly refresh through the app process so it inherits the existing DB connection.
  - Scheduler swaps fn->http for marketplaces; all jobs are now plain HTTP and the scheduler is reduced to a cron clock.
  - New app/auth/scheduler_token.py adds a shared-secret auth path. The startup script generates a 256-bit secret on first boot, persists it across reboots, and writes it to /opt/agnes/.env. Both containers source the same .env. The app validates incoming Bearer tokens against the env var (constant-time, length-floored) and resolves matches to a synthetic scheduler@system.local user that's a member of the Admin system group. Audit-log entries from the scheduler are attributed to this user.
  - app/main.py seeds the synthetic user at startup so the first cron tick has a valid actor; lazy seed in get_scheduler_user covers token rotation before the next app restart.

  Tests: 5 new in tests/test_auth_scheduler_token.py covering empty/short secret rejection, exact-match comparison, idempotent user seeding, and lazy provisioning. 142 marketplace + scheduler tests + 96 auth tests remain green.

  Existing VMs with .env from before this change need a one-time re-provisioning (re-run startup-script or rotate via openssl rand); documented in CHANGELOG.
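The one-time rotation mentioned for existing VMs could look roughly like the sketch below. It is only an illustration: `rotate_scheduler_token` is a hypothetical helper that does not exist in the repo, and the `od` fallback is an assumption for hosts without `openssl`. It replaces (or appends) the `SCHEDULER_API_TOKEN=` line in an env file with a fresh 64-hex-char value and leaves every other line untouched.

```shell
#!/bin/bash
# Sketch only: rotate SCHEDULER_API_TOKEN inside an env file.
# rotate_scheduler_token is a hypothetical helper, not part of the repo.
set -euo pipefail

rotate_scheduler_token() {
  local env_file="$1"
  local new_token tmp

  # 32 random bytes rendered as 64 hex chars (256 bits).
  if command -v openssl >/dev/null 2>&1; then
    new_token=$(openssl rand -hex 32)
  else
    # Assumed fallback when openssl is unavailable.
    new_token=$(od -An -tx1 -N32 /dev/urandom | tr -d ' \n')
  fi

  # Build the new file next to the old one so the final mv is a rename.
  tmp=$(mktemp "${env_file}.XXXXXX")
  chmod 600 "$tmp"
  if [ -f "$env_file" ]; then
    # Keep every line except the old token; grep exits 1 when nothing is left.
    grep -v '^SCHEDULER_API_TOKEN=' "$env_file" > "$tmp" || true
  fi
  echo "SCHEDULER_API_TOKEN=$new_token" >> "$tmp"
  mv "$tmp" "$env_file"
}
```

On a real VM this would presumably be called as `rotate_scheduler_token /opt/agnes/.env`, followed by recreating both containers (for example with `docker compose ... up -d --force-recreate`) so the app and the scheduler pick up the same new value.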
* fix(audit): use '_all' sentinel for bulk marketplace sync (Devin review #127)

  Avoids the literal string 'marketplace:None' in the audit_log resource column when the bulk sync endpoint writes its summary row.

* fix(scheduler): unblock event loop + per-job timeouts (Devin review #127)

  Two findings from Devin re-review on commit 5fbad15:

  1. BUG: trigger_sync_all was async def, so FastAPI ran it on the asyncio event loop. sync_marketplaces() does blocking I/O (subprocess git clones up to GIT_TIMEOUT_SEC=300 each, threading.Lock, DuckDB writes) and would freeze every concurrent request for the duration of a bulk sync. Switched to plain def so FastAPI auto-routes to the thread pool.
  2. ANALYSIS: scheduler used a fixed 120s httpx timeout for every POST. Bulk marketplace sync iterates the registry under a single lock with up to 300s per repo, which easily exceeds 120s on 2-3 slow repos. The scheduler then sees a timeout, doesn't update last_run, and re-fires on the next 30s tick, queueing redundant work. Per-job timeout override added to the JOBS tuple; marketplaces gets 900s (15 min), data-refresh keeps 120s, health-check 30s.

* fix(auth): require_session_token rejects scheduler shared secret (Devin review #127)

  require_session_token gates /auth/tokens (PAT minting). Pre-fix it only rejected JWTs with typ=pat, but the scheduler shared secret is an opaque string, so verify_token() returns None, payload becomes {}, and the PAT-claim check silently passed. A caller bearing SCHEDULER_API_TOKEN could mint persistent PATs that survive a secret rotation. Added explicit is_scheduler_token() check before the PAT-claim check; new regression test in tests/test_auth_scheduler_token.py.

  Devin's other note (pre-existing async def trigger_sync at marketplaces.py:392 also calls blocking sync_one): Devin flagged it as out-of-scope for this PR and I agree; tracking separately.
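For operators who want to reproduce a scheduler call by hand, the per-job timeouts from the scheduler fix above can be mirrored in a small shell sketch. The real scheduler uses httpx and a JOBS tuple; the names `job_timeout`, `trigger_sync_all`, and `AGNES_URL`, and the default `http://localhost:8000`, are assumptions made for illustration only.

```shell
#!/bin/bash
# Sketch only: per-job timeouts plus a manual sync-all trigger.
# job_timeout, trigger_sync_all and AGNES_URL are hypothetical names.
set -euo pipefail

# Timeout in seconds for each job, using the values quoted in the commit message.
job_timeout() {
  case "$1" in
    marketplaces) echo 900 ;;   # bulk sync, up to 300s per repo
    data-refresh) echo 120 ;;
    health-check) echo 30 ;;
    *) echo "unknown job: $1" >&2; return 1 ;;
  esac
}

# POST to the bulk sync endpoint with the shared secret as a Bearer token.
# Fails fast if SCHEDULER_API_TOKEN is not set in the environment.
trigger_sync_all() {
  curl -fsS -X POST \
    --max-time "$(job_timeout marketplaces)" \
    -H "Authorization: Bearer ${SCHEDULER_API_TOKEN:?SCHEDULER_API_TOKEN not set}" \
    "${AGNES_URL:-http://localhost:8000}/api/marketplaces/sync-all"
}
```

Keeping the timeout lookup separate from the request makes it easy to check that the client-side limit stays above the worst-case server-side duration for that job.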
* release(0.17.0): cut + clean up CHANGELOG duplicates

  Cuts 0.17.0 (minor: scheduler shared-secret auth + sync-all endpoint plus the deploy-shape fixes that landed since the last release tag).

  Bumps pyproject from 0.15.0. This also corrects the missed bump from PR #120 (v0.16.0 was tagged on GitHub and shipped as :stable, but pyproject stayed at 0.15.0, so /api/version, /cli/latest, and `da --version` had been under-reporting the running release).

  Removes the long-form duplicate entries for 0.13.0 / 0.14.0 / 0.15.0 above [0.16.0]. The canonical short summaries (with GitHub-release links) already exist below 0.16.0; the long forms were leftover state from before those versions were cut and have been silently shadowed ever since.
194 lines
8.7 KiB
Smarty
#!/bin/bash
# Agnes VM startup script, templated by Terraform.
# Idempotent: runs on every boot.
set -euo pipefail
exec > /var/log/agnes-startup.log 2>&1
chmod 640 /var/log/agnes-startup.log  # defense in depth: not readable by non-root

CUSTOMER_NAME="${customer_name}"
IMAGE_REPO="${image_repo}"
IMAGE_TAG="${image_tag}"
UPGRADE_MODE="${upgrade_mode}"
TLS_MODE="${tls_mode}"
DOMAIN="${domain}"
ACME_EMAIL="${acme_email}"
DATA_SOURCE="${data_source}"
KEBOOLA_STACK_URL="${keboola_stack_url}"
SEED_ADMIN_EMAIL="${seed_admin_email}"
SEED_ADMIN_PASSWORD="${seed_admin_password}"
ROLE="${role}"
COMPOSE_REF="${compose_ref}"

echo "=== [Agnes $CUSTOMER_NAME $ROLE] Startup at $(date) ==="

# --- 1. Docker (install if missing) ---
if ! command -v docker &>/dev/null; then
  curl -fsSL https://get.docker.com | sh
fi
if ! docker compose version &>/dev/null; then
  apt-get update && apt-get install -y docker-compose-plugin
fi

# --- 2. Persistent data disk mount ---
DATA_DEV="/dev/disk/by-id/google-data"
DATA_MNT="/data"
if [ -b "$DATA_DEV" ]; then
  if ! blkid "$DATA_DEV" | grep -q ext4; then
    mkfs.ext4 -F "$DATA_DEV"
  fi
  mkdir -p "$DATA_MNT"
  mountpoint -q "$DATA_MNT" || mount -o discard,defaults "$DATA_DEV" "$DATA_MNT"
  grep -qF "$DATA_DEV" /etc/fstab || echo "$DATA_DEV $DATA_MNT ext4 discard,defaults,nofail 0 2" >> /etc/fstab
  mkdir -p "$DATA_MNT/state" "$DATA_MNT/analytics" "$DATA_MNT/extracts"
  # Match Dockerfile USER agnes (uid:gid 999:999). A freshly-attached PD is
  # root-owned by default; without this chown the non-root container cannot
  # write to /data/state/system.duckdb and every authed request 500s after
  # the first upgrade that flips USER from root to agnes (regression hit
  # agnes-development on 2026-04-29). Idempotent, safe on reboot.
  chown -R 999:999 "$DATA_MNT"
fi

# --- 3. App directory + docker-compose files from public repo ---
APP_DIR="/opt/agnes"
mkdir -p "$APP_DIR"
cd "$APP_DIR"

# Fetch docker-compose files pinned to $COMPOSE_REF (defaults to `main`; pin to a
# stable-YYYY.MM.N tag for reproducibility across VM rebuilds).
RAW_BASE="https://raw.githubusercontent.com/keboola/agnes-the-ai-analyst/$${COMPOSE_REF}"
curl -fsSL "$${RAW_BASE}/docker-compose.yml" -o docker-compose.yml
curl -fsSL "$${RAW_BASE}/docker-compose.prod.yml" -o docker-compose.prod.yml
# Overlay which binds `data` volume to host /data (persistent disk mounted above)
curl -fsSL "$${RAW_BASE}/docker-compose.host-mount.yml" -o docker-compose.host-mount.yml

# TLS overlay (Caddy + Let's Encrypt): fetch only when actually needed; surface failures
if [ "$TLS_MODE" = "caddy" ] && [ -n "$DOMAIN" ]; then
  curl -fsSL "$${RAW_BASE}/Caddyfile" -o Caddyfile
fi

# --- 4. Fetch secrets from Secret Manager, fail loudly if missing ---
KEBOOLA_TOKEN=""
if [ "$DATA_SOURCE" = "keboola" ]; then
  # No `|| echo ""` fallback: if the token secret is missing, boot should fail
  # loudly rather than silently start an app that will fail sync cryptically later.
  KEBOOLA_TOKEN=$(gcloud secrets versions access latest --secret=keboola-storage-token)
fi
JWT_KEY=$(gcloud secrets versions access latest --secret=agnes-$${CUSTOMER_NAME}-jwt-secret)

# SCHEDULER_API_TOKEN: shared secret between the app and scheduler containers.
# Both source the same /opt/agnes/.env via Docker Compose env_file:, so the
# scheduler's outbound bearer token always matches the app's expected value.
# See app/auth/scheduler_token.py for the auth path it unlocks.
#
# Preserve across reboots: the token is plumbed into a long-lived synthetic
# user, and rotating it forces a restart of both containers. Read back from
# an existing .env when present; mint fresh only on the first boot.
SCHEDULER_API_TOKEN=""
if [ -f "$APP_DIR/.env" ]; then
  SCHEDULER_API_TOKEN=$(grep -E '^SCHEDULER_API_TOKEN=' "$APP_DIR/.env" | head -1 | cut -d= -f2- | tr -d '"' || true)
fi
if [ -z "$SCHEDULER_API_TOKEN" ]; then
  # 64 hex chars = 256 bits of /dev/urandom entropy. Floor enforced in
  # app/auth/scheduler_token.SCHEDULER_TOKEN_MIN_LENGTH is 32; 64 leaves
  # headroom for a future tightening without re-provisioning every VM.
  SCHEDULER_API_TOKEN=$(openssl rand -hex 32)
fi

# Optional Google OAuth credentials. If the operator has created
# google-oauth-client-{id,secret} secrets in the project's Secret Manager
# AND wired them via runtime_secrets in the calling Terraform, the VM SA can
# read them; write into .env so the Google sign-in flow works. Missing /
# 403 / empty → silent fallback to "" so password + email auth keep working.
GOOGLE_CLIENT_ID=$(gcloud secrets versions access latest --secret=google-oauth-client-id 2>/dev/null || echo "")
GOOGLE_CLIENT_SECRET=$(gcloud secrets versions access latest --secret=google-oauth-client-secret 2>/dev/null || echo "")

# AGNES_VERSION, RELEASE_CHANNEL, AGNES_COMMIT_SHA are baked into the image
# itself as ENV (see Dockerfile ARG/ENV + release.yml build-args). We do NOT
# set them here; doing so would override the image-level values with the
# floating tag name ("stable"/"dev"), hiding the real CalVer / git SHA.
# The app picks them up from the image's runtime environment.

# CADDY_TLS controls Caddyfile cert provisioning (see Caddyfile inline docs).
#   - tls_mode=caddy + ACME_EMAIL set → Let's Encrypt auto-issue (public domain)
#   - tls_mode=caddy + no ACME_EMAIL  → Caddy-managed self-signed (lab use)
#   - any other tls_mode              → leave CADDY_TLS unset, Caddyfile default
#                                       (cert-file mode for corporate PKI) applies.
# Operators wanting cert-file mode shouldn't set tls_mode at all on the dev
# instance; leave it "none" and let the corp-PKI rotate scripts handle certs.
CADDY_TLS_LINE=""
if [ "$TLS_MODE" = "caddy" ] && [ -n "$DOMAIN" ]; then
  # Value MUST be quoted in the .env file: agnes-auto-upgrade.sh sources
  # /opt/agnes/.env via `set -a; . .env; set +a`, and bash interprets an
  # unquoted `KEY=value with spaces` as `KEY=value` followed by trying to
  # exec `with`/`spaces` as commands → boot succeeds but every cron tick
  # logs "<email>: command not found".
  if [ -n "$ACME_EMAIL" ]; then
    CADDY_TLS_LINE="CADDY_TLS=\"tls $ACME_EMAIL\""
  else
    CADDY_TLS_LINE="CADDY_TLS=\"tls internal\""
  fi
fi

cat > "$APP_DIR/.env" <<ENVEOF
JWT_SECRET_KEY=$JWT_KEY
DATA_DIR=$DATA_MNT
DATA_SOURCE=$DATA_SOURCE
KEBOOLA_STORAGE_TOKEN=$KEBOOLA_TOKEN
KEBOOLA_STACK_URL=$KEBOOLA_STACK_URL
SEED_ADMIN_EMAIL=$SEED_ADMIN_EMAIL
SEED_ADMIN_PASSWORD=$SEED_ADMIN_PASSWORD
SCHEDULER_API_TOKEN=$SCHEDULER_API_TOKEN
LOG_LEVEL=info
DOMAIN=$DOMAIN
AGNES_TAG=$IMAGE_TAG
ACME_EMAIL=$ACME_EMAIL
GOOGLE_CLIENT_ID=$GOOGLE_CLIENT_ID
GOOGLE_CLIENT_SECRET=$GOOGLE_CLIENT_SECRET
$CADDY_TLS_LINE
ENVEOF
chmod 600 "$APP_DIR/.env"

# --- 5. Start Agnes ---
COMPOSE_PROFILES_ARG=""
if [ "$TLS_MODE" = "caddy" ] && [ -n "$DOMAIN" ]; then
  COMPOSE_PROFILES_ARG="--profile tls"
fi

COMPOSE_FILES="-f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.host-mount.yml"

docker compose $COMPOSE_FILES $COMPOSE_PROFILES_ARG pull
docker compose $COMPOSE_FILES $COMPOSE_PROFILES_ARG up -d

# --- 6. Auto-upgrade via cron (pulls new image digest every 5 min) ---
if [ "$UPGRADE_MODE" = "auto" ]; then
  # Cron script sources /opt/agnes/.env for AGNES_TAG, so if operator edits .env
  # (e.g. to pin a specific stable-YYYY.MM.N), cron picks it up immediately. No
  # drift between what compose up reads and what the digest-check inspects.
  cat > /usr/local/bin/agnes-auto-upgrade.sh <<'SCRIPTEOF'
#!/bin/bash
# Runs from cron: pulls new image if one is available, restarts containers.
set -euo pipefail
cd /opt/agnes
# Source .env so AGNES_TAG reflects any operator edits since boot.
# shellcheck disable=SC1091
set -a; . /opt/agnes/.env; set +a
IMAGE="ghcr.io/keboola/agnes-the-ai-analyst:$${AGNES_TAG:-stable}"
COMPOSE_FILES="-f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.host-mount.yml"
BEFORE=$(docker images --no-trunc --format '{{.Digest}}' "$IMAGE" | head -1)
docker compose $COMPOSE_FILES pull >/dev/null 2>&1
AFTER=$(docker images --no-trunc --format '{{.Digest}}' "$IMAGE" | head -1)
if [ "$BEFORE" != "$AFTER" ]; then
  echo "$(date): new image digest for $IMAGE; recreating containers"
  docker compose $COMPOSE_FILES up -d
  docker image prune -f >/dev/null 2>&1
fi
SCRIPTEOF
  chmod +x /usr/local/bin/agnes-auto-upgrade.sh

  # Install cron entry idempotently: remove any prior agnes-auto-upgrade line, then append ours.
  CRON_LINE="*/5 * * * * /usr/local/bin/agnes-auto-upgrade.sh >> /var/log/agnes-auto-upgrade.log 2>&1"
  (crontab -l 2>/dev/null | grep -v agnes-auto-upgrade || true; echo "$CRON_LINE") | crontab -
fi

echo "=== [Agnes $CUSTOMER_NAME $ROLE] Startup complete at $(date) ==="
docker compose ps
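The script's comment about quoting `CADDY_TLS` in `.env` can be reproduced in isolation. The sketch below is plain bash rather than part of the Terraform template (so it uses `${...}` without the `$$` escape), and the file names, the `source_env` helper, and the example address `admin@example.com` are all made up for the demonstration. It sources an unquoted and a quoted variant the same way the cron script does and reports what `CADDY_TLS` ends up as.

```shell
#!/bin/bash
# Sketch only: why values with spaces must be quoted in a sourced .env file.
# File names, source_env and the example address are hypothetical.
set -uo pipefail

demo_dir=$(mktemp -d)
printf 'CADDY_TLS=tls admin@example.com\n'   > "$demo_dir/unquoted.env"
printf 'CADDY_TLS="tls admin@example.com"\n' > "$demo_dir/quoted.env"

# Source the file in a subshell, the way agnes-auto-upgrade.sh does,
# and print the resulting CADDY_TLS (or <unset>). stderr is discarded so the
# "command not found" message from the unquoted case does not clutter output.
source_env() {
  (
    unset CADDY_TLS
    set -a; . "$1"; set +a
    printf '%s' "${CADDY_TLS:-<unset>}"
  ) 2>/dev/null
}

unquoted_result=$(source_env "$demo_dir/unquoted.env")
quoted_result=$(source_env "$demo_dir/quoted.env")

echo "unquoted: $unquoted_result"
echo "quoted:   $quoted_result"

rm -rf "$demo_dir"
```

In the unquoted case bash treats `CADDY_TLS=tls` as a temporary assignment for a command named `admin@example.com`, so the variable never reaches the shell and the command lookup fails. The same reasoning would apply to any other `.env` value that may contain spaces.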