diff --git a/CHANGELOG.md b/CHANGELOG.md
index 5a5b848..e1a503e 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -10,6 +10,13 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C
 
 ## [Unreleased]
 
+## [0.26.0] — 2026-04-30
+
+### Changed
+
+- **BREAKING** **All host-side artifacts (compose files, `Caddyfile`, host bash scripts) now ship in the docker image, not curled from `main` at boot.** The Dockerfile bakes them at `/opt/agnes-host/` and the customer-instance startup template extracts the whole directory via `docker create` + `docker cp` from the same `image_tag` the operator already pinned. Removes 5 `curl`s against `raw.githubusercontent.com` from the customer template (`docker-compose.yml`, `docker-compose.prod.yml`, `docker-compose.host-mount.yml`, `docker-compose.tls.yml`, `Caddyfile`) plus the `agnes-auto-upgrade.sh` curl shipped in 0.25.0. The image also now ships `agnes-tls-rotate.sh` + `tls-fetch.sh` at `/opt/agnes-host/` so consumer-side deploy templates can adopt the same pattern. Replaces the curl-from-main pattern that decoupled host-side artifacts from the pinned image (split-brain — image at `stable-2026.04.516`, host artifacts floating on whatever `main` was when the VM last booted) and gave no rollback knob other than reverting upstream PRs globally. With everything baked in, host artifacts and app code are released together from one commit; `image_tag` controls all; rollback is one tag bump; egress simplifies to "private registry" only (no public-internet dependency on every boot). Drift prevention is preserved by construction — image and host artifacts CANNOT drift because they ship together. **Operator action**: `image_tag` MUST point to a tag from this release or later; older tags lack `/opt/agnes-host/` and the startup `docker cp` will fail-loud at first boot. Existing VMs are unaffected because the module sets `lifecycle { ignore_changes = [metadata_startup_script] }` — only newly-created VMs run the new script.
+- `compose_ref` variable on the customer-instance terraform module is **deprecated** — no longer used (compose files come from `image_tag` now). Variable retained for one release cycle to avoid breaking existing `terraform plan`s; will be removed in a future major bump. Pin `image_tag` instead.
+
 ## [0.25.0] — 2026-04-30
 
 ### Fixed
diff --git a/Dockerfile b/Dockerfile
index 249a7d7..8710690 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -17,6 +17,42 @@ WORKDIR /app
 
 COPY . .
 
+# Bake every host-side artifact at /opt/agnes-host/ — the contract path
+# VM startup uses to extract files via `docker create` + `docker cp`
+# instead of curling from raw.githubusercontent.com/main. Pins host
+# artifacts to AGNES_TAG the same way the app is already pinned —
+# eliminates the split-brain where the immutable image runs against
+# arbitrary main-branch compose files / bash scripts.
+#
+# Includes:
+#   - agnes-auto-upgrade.sh — host cron driver (5-min digest poll)
+#   - agnes-tls-rotate.sh — host cron driver (daily corp-PKI cert refetch)
+#   - tls-fetch.sh — generic URL fetcher (sm:// gs:// https:// file://)
+#   - docker-compose.{yml,prod.yml,host-mount.yml,tls.yml} — host runtime
+#   - Caddyfile — TLS reverse proxy config
+#
+# Why a copy out of /app instead of pointing at /app directly:
+#   /app is owned by uid 999 (USER agnes below); /opt/agnes-host is
+#   root-owned, mode 0755 across the board, stable path that won't
+#   shift if /app structure refactors. Stable contract for `docker cp`
+#   consumers.
+RUN mkdir -p /opt/agnes-host && \
+    cp /app/scripts/ops/agnes-auto-upgrade.sh \
+        /app/scripts/ops/agnes-tls-rotate.sh \
+        /app/scripts/tls-fetch.sh \
+        /opt/agnes-host/ && \
+    cp /app/docker-compose.yml /app/docker-compose.prod.yml \
+        /app/docker-compose.host-mount.yml /app/docker-compose.tls.yml \
+        /app/Caddyfile /opt/agnes-host/ && \
+    chmod 0755 /opt/agnes-host/agnes-auto-upgrade.sh \
+        /opt/agnes-host/agnes-tls-rotate.sh \
+        /opt/agnes-host/tls-fetch.sh && \
+    chmod 0644 /opt/agnes-host/docker-compose.yml \
+        /opt/agnes-host/docker-compose.prod.yml \
+        /opt/agnes-host/docker-compose.host-mount.yml \
+        /opt/agnes-host/docker-compose.tls.yml \
+        /opt/agnes-host/Caddyfile
+
 # Build wheel artifact (served at /cli/download)
 RUN uv build --wheel --out-dir /app/dist
 
diff --git a/infra/modules/customer-instance/startup-script.sh.tpl b/infra/modules/customer-instance/startup-script.sh.tpl
index 949cede..a31a585 100644
--- a/infra/modules/customer-instance/startup-script.sh.tpl
+++ b/infra/modules/customer-instance/startup-script.sh.tpl
@@ -48,30 +48,36 @@ if [ -b "$DATA_DEV" ]; then
   chown -R 999:999 "$DATA_MNT"
 fi
 
-# --- 3. App directory + docker-compose files from public repo ---
+# --- 3. App directory + extract host artifacts from the pinned image ---
 APP_DIR="/opt/agnes"
 mkdir -p "$APP_DIR"
 cd "$APP_DIR"
 
-# Fetch docker-compose files pinned to $COMPOSE_REF (defaults to `main`; pin to a
-# stable-YYYY.MM.N tag for reproducibility across VM rebuilds).
-RAW_BASE="https://raw.githubusercontent.com/keboola/agnes-the-ai-analyst/$${COMPOSE_REF}"
-curl -fsSL "$${RAW_BASE}/docker-compose.yml" -o docker-compose.yml
-curl -fsSL "$${RAW_BASE}/docker-compose.prod.yml" -o docker-compose.prod.yml
-# Overlay which binds `data` volume to host /data (persistent disk mounted above)
-curl -fsSL "$${RAW_BASE}/docker-compose.host-mount.yml" -o docker-compose.host-mount.yml
-# TLS overlay + Caddyfile — fetched unconditionally because agnes-auto-upgrade.sh
-# (curled from main below) detects TLS at runtime via cert files on disk,
-# regardless of TLS_MODE. Certs can appear after boot via agnes-tls-rotate.sh
-# or manual provisioning, and:
-#   - the cron job would fail under `set -euo pipefail` every 5 min if
-#     docker-compose.tls.yml were missing, and
-#   - the caddy service in docker-compose.yml bind-mounts ./Caddyfile:ro,
-#     so without it on disk Docker auto-creates an empty directory there
-#     and Caddy crash-loops while the overlay has already closed :8000.
-# Cheap to keep on disk either way.
-curl -fsSL "$${RAW_BASE}/docker-compose.tls.yml" -o docker-compose.tls.yml
-curl -fsSL "$${RAW_BASE}/Caddyfile" -o Caddyfile
+# Pull the pinned image first so we can extract host-side artifacts from it.
+# Everything we need on the host (compose files, Caddyfile, agnes-auto-upgrade.sh)
+# ships baked into the image at /opt/agnes-host/, released atomically with
+# the app. AGNES_TAG is the single version pin for both — no split-brain
+# with main-branch curl.
+#
+# Why image-extract beats curling raw.githubusercontent.com:
+#   - Version pin: customer pins AGNES_TAG → extracted artifacts match the
+#     same tag. main-branch curls would break that pin silently.
+#   - Egress: image is already pulled from the private registry; the public
+#     internet is no longer required for boot.
+#   - Rollback: revert is one tag bump. Curl-from-main has no per-customer
+#     rollback path.
+docker pull "$${IMAGE_REPO}:$${IMAGE_TAG}"
+EXTRACT_CONTAINER=$(docker create "$${IMAGE_REPO}:$${IMAGE_TAG}")
+trap "docker rm '$EXTRACT_CONTAINER' >/dev/null 2>&1 || true" EXIT
+docker cp "$EXTRACT_CONTAINER:/opt/agnes-host/." "$APP_DIR/"
+docker cp "$EXTRACT_CONTAINER:/opt/agnes-host/agnes-auto-upgrade.sh" /usr/local/bin/agnes-auto-upgrade.sh
+chmod +x /usr/local/bin/agnes-auto-upgrade.sh
+
+# docker-compose.tls.yml + Caddyfile land regardless of TLS_MODE. agnes-auto-upgrade.sh
+# detects TLS at runtime via cert files on disk; certs can appear after boot via
+# agnes-tls-rotate.sh or manual provisioning. The caddy service bind-mounts
+# ./Caddyfile, so it must exist on disk before any `docker compose up` even when
+# the tls overlay is currently inactive. Cheap to keep them on disk either way.
 
 # --- 4. Fetch secrets from Secret Manager — fail loudly if missing ---
 KEBOOLA_TOKEN=""
@@ -168,24 +174,10 @@ docker compose $COMPOSE_FILES $COMPOSE_PROFILES_ARG up -d
 
 # --- 6. Auto-upgrade via cron (pulls new image digest every 5 min) ---
 if [ "$UPGRADE_MODE" = "auto" ]; then
-  # Single-source the cron script from the OSS repo's main branch instead
-  # of inlining a copy here. Two reasons:
-  #   1. Drift prevention — earlier inline copy missed several iterations
-  #      of the canonical script (TLS overlay detection, array-form compose
-  #      files, config-disk fail-fast guard).
-  #   2. Re-fetched on every VM boot, so script-only fixes propagate
-  #      without an infra recreate. For immediate rollout to running VMs,
-  #      operators can also re-run this fetch by hand.
-  #
-  # Coupling note: this URL is pinned to `main` while compose files above
-  # honor $COMPOSE_REF. If a future canonical script references a NEW
-  # compose file, the fetch list above MUST be updated to match — pinned-
-  # ref VMs would otherwise break on the next cron tick. Treat the docker-
-  # compose.* fetch list as the contract that agnes-auto-upgrade.sh relies
-  # on; new compose files referenced from main need a corresponding fetch.
-  SCRIPT_URL="https://raw.githubusercontent.com/keboola/agnes-the-ai-analyst/main/scripts/ops/agnes-auto-upgrade.sh"
-  curl -fsSL --retry 3 --retry-delay 2 "$SCRIPT_URL" -o /usr/local/bin/agnes-auto-upgrade.sh
-  chmod +x /usr/local/bin/agnes-auto-upgrade.sh
+  # agnes-auto-upgrade.sh was already extracted to /usr/local/bin/ in
+  # section 3 alongside the compose files — the host artifacts ship
+  # together from the pinned image. Nothing more to fetch here.
+  :
 
   # Install cron entry idempotently: remove any prior agnes-auto-upgrade line, then append ours.
   CRON_LINE="*/5 * * * * /usr/local/bin/agnes-auto-upgrade.sh >> /var/log/agnes-auto-upgrade.log 2>&1"
diff --git a/infra/modules/customer-instance/variables.tf b/infra/modules/customer-instance/variables.tf
index 1fcce5f..2a2852a 100644
--- a/infra/modules/customer-instance/variables.tf
+++ b/infra/modules/customer-instance/variables.tf
@@ -25,7 +25,17 @@ variable "customer_name" {
 }
 
 variable "prod_instance" {
-  description = "Production VM configuration."
+  description = <<-EOT
+    Production VM configuration.
+
+    `image_tag` MUST point to an image that contains `/opt/agnes-host/`
+    (this directory was added in v0.26.0). Older tags will fail at first
+    boot with `docker cp: No such file or directory` because the startup
+    script extracts host artifacts from the image instead of curling
+    them. Existing VMs are unaffected by this constraint — the module
+    sets `lifecycle { ignore_changes = [metadata_startup_script] }` so
+    the new script only runs on freshly-created VMs.
+  EOT
   type = object({
     name = string
     machine_type = optional(string, "e2-small")
@@ -45,6 +55,9 @@ variable "dev_instances" {
     tls_mode + domain are optional and default to plain HTTP on :8000. Set
     tls_mode = "caddy" + domain to enable Caddy + Let's Encrypt (or whatever
     CADDY_TLS env var is configured to in the Caddyfile — see Caddyfile docs).
+
+    Same `image_tag >= v0.26.0` constraint as `prod_instance` — older tags
+    lack `/opt/agnes-host/` and the startup `docker cp` fails-loud.
   EOT
   type = list(object({
     name = string
@@ -93,7 +106,7 @@ variable "image_repo" {
 }
 
 variable "compose_ref" {
-  description = "Git ref to fetch docker-compose.yml and overlays from (in keboola/agnes-the-ai-analyst). Use `main` for latest, or a tag like `stable-2026.04.47` for reproducibility."
+  description = "DEPRECATED — no longer used. Compose files now ship inside the docker image at /opt/agnes-host/ and are extracted via `docker cp` from the same `image_tag` the operator pinned. Pin `image_tag` instead. Variable retained for one release cycle to avoid breaking existing terraform plans; will be removed in a future major bump."
   type = string
   default = "main"
 }
diff --git a/pyproject.toml b/pyproject.toml
index 7b29d22..adfd811 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -1,6 +1,6 @@
 [project]
 name = "agnes-the-ai-analyst"
-version = "0.25.0"
+version = "0.26.0"
 description = "Agnes — AI Data Analyst platform for AI analytical systems"
 requires-python = ">=3.11,<3.14"
 license = "MIT"