refactor(ops): bake all host artifacts into image, drop every curl-from-main (#149)

* refactor(ops): bake all host artifacts into image, drop every curl-from-main

Replaces the curl-from-main pattern (originally introduced in 0.25.0 for
agnes-auto-upgrade.sh; older for the compose files + Caddyfile) with image-
bundled host artifacts. Same-tag delivery for everything the host runs,
version-pinned by AGNES_TAG, atomically rolled back by reverting the image.

## Motivation

The customer-instance startup template was curling 6 files from
raw.githubusercontent.com on every VM boot:

  docker-compose.yml
  docker-compose.prod.yml
  docker-compose.host-mount.yml
  docker-compose.tls.yml
  Caddyfile
  scripts/ops/agnes-auto-upgrade.sh   (added in 0.25.0)

Every one of them already lives inside the image (`COPY . .` copies the
whole repo to /app/). Curling them from the public internet duplicates
content the image already carries and introduces three problems:

1. **Split-brain version pinning.** image_tag pins the docker image to an
   immutable digest. The compose files + script bypassed that pinning by
   tracking `main` (or the rarely-set compose_ref). A customer pinned to
   stable-2026.04.516 could wake up tomorrow with their host artifacts
   floating on whatever shipped to main overnight — even though they're
   explicitly pinned for stability.

2. **No rollback knob.** Reverting a bad host artifact meant reverting
   the upstream PR globally, and every customer that rebooted between
   the bad commit and the revert was affected. There was no "rollback
   for me only" path; tag-pinning gave no protection.

3. **Public-internet dependency on every boot.** The image is already
   pulled from a private registry on the same boot. Reusing that channel
   is strictly cheaper than adding a second one. Customers with restricted
   egress (no raw.githubusercontent.com reachability) silently broke on
   every boot.

## Changes

### Dockerfile (+19 -8)

After `COPY . .` and before the wheel build, an explicit `cp` lifts every
host-side artifact into a stable contract path /opt/agnes-host/:

  agnes-auto-upgrade.sh                  (mode 0755 — host cron driver)
  docker-compose.{yml,prod.yml,host-mount.yml,tls.yml}
  Caddyfile                              (mode 0644)

Why a copy instead of pointing at /app directly: /app is owned by uid 999
(USER agnes), whereas /opt/agnes-host is root-owned with fixed modes (0755
for scripts, 0644 for compose files and the Caddyfile) and is a stable path
that won't shift if the /app layout is refactored.
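
A minimal sketch of how a consumer could sanity-check the contract path after extraction. The function name `check_host_artifacts` is hypothetical and not part of this PR; the file list is the one given in this section, and the check only assumes the extracted files sit flat in one directory.

```shell
#!/bin/sh
# Hypothetical helper (not part of this PR): verify that a directory
# extracted from /opt/agnes-host/ holds every artifact listed above.
check_host_artifacts() {
    dir=$1
    missing=0
    for f in agnes-auto-upgrade.sh docker-compose.yml docker-compose.prod.yml \
             docker-compose.host-mount.yml docker-compose.tls.yml Caddyfile; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $f" >&2
            missing=1
        fi
    done
    # The cron driver is baked with mode 0755, so it must be executable.
    if [ -f "$dir/agnes-auto-upgrade.sh" ] && [ ! -x "$dir/agnes-auto-upgrade.sh" ]; then
        echo "not executable: agnes-auto-upgrade.sh" >&2
        missing=1
    fi
    return "$missing"
}
```

It could be called right after the `docker cp` step, for example `check_host_artifacts "$APP_DIR"`, so that an incomplete image is reported before `docker compose up` runs.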

### infra/modules/customer-instance/startup-script.sh.tpl (+22 -36)

Replaced six curls and the standalone agnes-auto-upgrade.sh extract block
(introduced earlier in this PR) with one extract sequence in section 3:

    docker pull "$${IMAGE_REPO}:$${IMAGE_TAG}"
    EXTRACT_CONTAINER=$(docker create "$${IMAGE_REPO}:$${IMAGE_TAG}")
    trap "docker rm '$EXTRACT_CONTAINER' >/dev/null 2>&1 || true" EXIT
    docker cp "$EXTRACT_CONTAINER:/opt/agnes-host/." "$APP_DIR/"
    docker cp "$EXTRACT_CONTAINER:/opt/agnes-host/agnes-auto-upgrade.sh" /usr/local/bin/agnes-auto-upgrade.sh
    chmod +x /usr/local/bin/agnes-auto-upgrade.sh

The fetch step in the auto-upgrade section (#6) is now a no-op: the script
is already in place. The cron entry is still installed there.

### infra/modules/customer-instance/variables.tf (+1 -1)

`compose_ref` marked DEPRECATED in description. Default unchanged for
one release cycle to avoid breaking existing terraform plans. Will be
removed in a future major bump.

### CHANGELOG.md

`### Changed` entry under [Unreleased] — supersedes the narrower entry
this PR previously had (which only covered the script).

## Out of scope (filed as follow-ups)

1. **agnes-the-ai-analyst-infra/startup.sh (operator deploy)** still
   curls the same artifacts from main. Symmetric fix needed there.
   Will file as a separate PR against the infra repo.

2. **Self-update inside agnes-auto-upgrade.sh** after a successful
   `docker compose pull` of a new digest. Otherwise the running cron
   keeps using the OLD baked-in script for one tick after image upgrade.
   ~10 LOC. Deferred to keep this PR scoped.

3. **scripts/ops/agnes-tls-rotate.sh** has the same shape — host-side
   bash currently sourced via the infra repo. Should follow the same
   bake-into-image pattern.
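
Follow-up 2 could look roughly like the sketch below. This is an illustration only, not the deferred implementation: `self_update` and both path arguments are hypothetical, and it assumes the new script has already been extracted from the new image to a local file.

```shell
#!/bin/sh
# Hypothetical sketch of follow-up 2: refresh the installed cron script
# from a copy extracted out of the newly pulled image.
self_update() {
    new_script=$1   # copy extracted from the new image (assumed path)
    installed=$2    # e.g. /usr/local/bin/agnes-auto-upgrade.sh
    [ -f "$new_script" ] || return 1
    # Nothing to do when both copies are byte-identical.
    if [ -f "$installed" ] && cmp -s "$new_script" "$installed"; then
        return 0
    fi
    # Stage next to the target, then rename. A rename within one
    # filesystem is atomic, so a cron run never reads a partial file.
    tmp="$installed.new.$$"
    cp "$new_script" "$tmp" && chmod 0755 "$tmp" && mv -f "$tmp" "$installed"
}
```

Staging and renaming, rather than overwriting in place, also leaves the currently running script untouched, since it keeps reading the old file until it exits.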

## Tested

- Local: `docker build .` succeeds with the new RUN block.
- `docker create` + `docker cp /opt/agnes-host/.` round-trips all 6
  artifacts; sha matches each source file.
- Not yet tested on a live VM bring-up — that requires a CI image with
  this Dockerfile change. **Recommend reviewer trigger CI build, then
  do a single VM-recreate against a dev VM (e.g. foundryai-development)
  to confirm the extract path works end-to-end before merge.**
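
The round-trip check in the second bullet can be sketched as follows. This is an assumed reconstruction, not the command that was actually run: `verify_round_trip` is a hypothetical name, source paths are taken from the Dockerfile section, and it uses `cmp -s` (byte-for-byte comparison) in place of comparing sha digests.

```shell
#!/bin/sh
# Hypothetical check: compare each extracted artifact against its source
# file in the repo checkout.
verify_round_trip() {
    repo=$1        # repo checkout root
    extracted=$2   # directory populated by `docker cp`
    rc=0
    for src in scripts/ops/agnes-auto-upgrade.sh docker-compose.yml \
               docker-compose.prod.yml docker-compose.host-mount.yml \
               docker-compose.tls.yml Caddyfile; do
        name=${src##*/}   # artifacts sit flat in the extracted directory
        if ! cmp -s "$repo/$src" "$extracted/$name"; then
            echo "mismatch: $src" >&2
            rc=1
        fi
    done
    return "$rc"
}
```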

## Compatibility

- Existing VMs running 0.25.0 are unaffected — they have host artifacts
  in place from `curl from main` already; this PR doesn't touch them.
  They pick up the new pattern only on next VM recreate.
- VMs pinned to an image_tag *older* than this PR (no /opt/agnes-host
  in the image) would FAIL the docker cp. The current diff fails loudly
  (no fallback). Operators should move to a fresh-enough image_tag
  together with the template upgrade; this is the same coupling as any
  compose-flag bump.
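
For illustration, the fail-loud behaviour could be paired with a more actionable message. The wrapper below is a hypothetical sketch, not what the template does: `extract_host_artifacts` and the error wording are assumptions, and it still has no fallback, matching the current diff.

```shell
#!/bin/sh
# Hypothetical wrapper: same create + cp sequence as the template, but
# with an explicit hint when the pinned image predates /opt/agnes-host/.
extract_host_artifacts() {
    image=$1
    dest=$2
    container=$(docker create "$image") || return 1
    if ! docker cp "$container:/opt/agnes-host/." "$dest/" 2>/dev/null; then
        docker rm "$container" >/dev/null 2>&1 || true
        echo "ERROR: $image has no /opt/agnes-host/; pin image_tag to v0.26.0 or later" >&2
        return 1
    fi
    docker rm "$container" >/dev/null 2>&1 || true
}
```

The temporary container is removed on both paths, so a failed boot does not leave a stopped container behind.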

* docs(infra): document image_tag >= v0.26.0 minimum on prod/dev_instances

The new startup script extracts host artifacts from /opt/agnes-host/
inside the image — a directory added in this PR (will ship as v0.26.0).
Pinning image_tag to an older tag would fail-loud at first boot with
'docker cp: No such file or directory'. Existing VMs are unaffected
because the module ignores metadata_startup_script changes.

Devin ANALYSIS_0004 on PR #149.

* fix(changelog): mark BREAKING + drop private-repo reference

Per CLAUDE.md, breaking changes start with **BREAKING** so operators
can grep before bumping the pin. The image_tag minimum constraint
introduced here qualifies — older tags fail-loud at first boot.

Also drop the explicit 'agnes-the-ai-analyst-infra' name from the
entry; the OSS distribution shouldn't reference operator-side
deploy templates by their private-repo names. Generic 'consumer-
side deploy templates' wording instead.

Devin BUG_0001 + WARN_0001 on PR #149.

* chore(release): cut 0.26.0

---------

Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
Vojtech 2026-04-30 23:40:25 +04:00 committed by GitHub
parent ddffdfeafd
commit 2447da7bb1
5 changed files with 89 additions and 41 deletions

@@ -10,6 +10,13 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C
## [Unreleased]
+## [0.26.0] — 2026-04-30
+### Changed
+- **BREAKING** **All host-side artifacts (compose files, `Caddyfile`, host bash scripts) now ship in the docker image, not curled from `main` at boot.** The Dockerfile bakes them at `/opt/agnes-host/` and the customer-instance startup template extracts the whole directory via `docker create` + `docker cp` from the same `image_tag` the operator already pinned. Removes 5 `curl`s against `raw.githubusercontent.com` from the customer template (`docker-compose.yml`, `docker-compose.prod.yml`, `docker-compose.host-mount.yml`, `docker-compose.tls.yml`, `Caddyfile`) plus the `agnes-auto-upgrade.sh` curl shipped in 0.25.0. The image also now ships `agnes-tls-rotate.sh` + `tls-fetch.sh` at `/opt/agnes-host/` so consumer-side deploy templates can adopt the same pattern. Replaces the curl-from-main pattern that decoupled host-side artifacts from the pinned image (split-brain — image at `stable-2026.04.516`, host artifacts floating on whatever `main` was when the VM last booted) and gave no rollback knob other than reverting upstream PRs globally. With everything baked in, host artifacts and app code are released together from one commit; `image_tag` controls all; rollback is one tag bump; egress simplifies to "private registry" only (no public-internet dependency on every boot). Drift prevention is preserved by construction — image and host artifacts CANNOT drift because they ship together. **Operator action**: `image_tag` MUST point to a tag from this release or later; older tags lack `/opt/agnes-host/` and the startup `docker cp` will fail-loud at first boot. Existing VMs are unaffected because the module sets `lifecycle { ignore_changes = [metadata_startup_script] }` — only newly-created VMs run the new script.
+- `compose_ref` variable on the customer-instance terraform module is **deprecated** — no longer used (compose files come from `image_tag` now). Variable retained for one release cycle to avoid breaking existing `terraform plan`s; will be removed in a future major bump. Pin `image_tag` instead.
## [0.25.0] — 2026-04-30
### Fixed

@@ -17,6 +17,42 @@ WORKDIR /app
COPY . .
+# Bake every host-side artifact at /opt/agnes-host/ — the contract path
+# VM startup uses to extract files via `docker create` + `docker cp`
+# instead of curling from raw.githubusercontent.com/main. Pins host
+# artifacts to AGNES_TAG the same way the app is already pinned —
+# eliminates the split-brain where the immutable image runs against
+# arbitrary main-branch compose files / bash scripts.
+#
+# Includes:
+# - agnes-auto-upgrade.sh — host cron driver (5-min digest poll)
+# - agnes-tls-rotate.sh — host cron driver (daily corp-PKI cert refetch)
+# - tls-fetch.sh — generic URL fetcher (sm:// gs:// https:// file://)
+# - docker-compose.{yml,prod.yml,host-mount.yml,tls.yml} — host runtime
+# - Caddyfile — TLS reverse proxy config
+#
+# Why a copy out of /app instead of pointing at /app directly:
+# /app is owned by uid 999 (USER agnes below); /opt/agnes-host is
+# root-owned, mode 0755 across the board, stable path that won't
+# shift if /app structure refactors. Stable contract for `docker cp`
+# consumers.
+RUN mkdir -p /opt/agnes-host && \
+cp /app/scripts/ops/agnes-auto-upgrade.sh \
+/app/scripts/ops/agnes-tls-rotate.sh \
+/app/scripts/tls-fetch.sh \
+/opt/agnes-host/ && \
+cp /app/docker-compose.yml /app/docker-compose.prod.yml \
+/app/docker-compose.host-mount.yml /app/docker-compose.tls.yml \
+/app/Caddyfile /opt/agnes-host/ && \
+chmod 0755 /opt/agnes-host/agnes-auto-upgrade.sh \
+/opt/agnes-host/agnes-tls-rotate.sh \
+/opt/agnes-host/tls-fetch.sh && \
+chmod 0644 /opt/agnes-host/docker-compose.yml \
+/opt/agnes-host/docker-compose.prod.yml \
+/opt/agnes-host/docker-compose.host-mount.yml \
+/opt/agnes-host/docker-compose.tls.yml \
+/opt/agnes-host/Caddyfile
# Build wheel artifact (served at /cli/download)
RUN uv build --wheel --out-dir /app/dist

@@ -48,30 +48,36 @@ if [ -b "$DATA_DEV" ]; then
chown -R 999:999 "$DATA_MNT"
fi
-# --- 3. App directory + docker-compose files from public repo ---
+# --- 3. App directory + extract host artifacts from the pinned image ---
APP_DIR="/opt/agnes"
mkdir -p "$APP_DIR"
cd "$APP_DIR"
-# Fetch docker-compose files pinned to $COMPOSE_REF (defaults to `main`; pin to a
-# stable-YYYY.MM.N tag for reproducibility across VM rebuilds).
-RAW_BASE="https://raw.githubusercontent.com/keboola/agnes-the-ai-analyst/$${COMPOSE_REF}"
-curl -fsSL "$${RAW_BASE}/docker-compose.yml" -o docker-compose.yml
-curl -fsSL "$${RAW_BASE}/docker-compose.prod.yml" -o docker-compose.prod.yml
-# Overlay which binds `data` volume to host /data (persistent disk mounted above)
-curl -fsSL "$${RAW_BASE}/docker-compose.host-mount.yml" -o docker-compose.host-mount.yml
-# TLS overlay + Caddyfile — fetched unconditionally because agnes-auto-upgrade.sh
-# (curled from main below) detects TLS at runtime via cert files on disk,
-# regardless of TLS_MODE. Certs can appear after boot via agnes-tls-rotate.sh
-# or manual provisioning, and:
-# - the cron job would fail under `set -euo pipefail` every 5 min if
-# docker-compose.tls.yml were missing, and
-# - the caddy service in docker-compose.yml bind-mounts ./Caddyfile:ro,
-# so without it on disk Docker auto-creates an empty directory there
-# and Caddy crash-loops while the overlay has already closed :8000.
-# Cheap to keep on disk either way.
-curl -fsSL "$${RAW_BASE}/docker-compose.tls.yml" -o docker-compose.tls.yml
-curl -fsSL "$${RAW_BASE}/Caddyfile" -o Caddyfile
+# Pull the pinned image first so we can extract host-side artifacts from it.
+# Everything we need on the host (compose files, Caddyfile, agnes-auto-upgrade.sh)
+# ships baked into the image at /opt/agnes-host/, released atomically with
+# the app. AGNES_TAG is the single version pin for both — no split-brain
+# with main-branch curl.
+#
+# Why image-extract beats curling raw.githubusercontent.com:
+# - Version pin: customer pins AGNES_TAG → extracted artifacts match the
+# same tag. main-branch curls would break that pin silently.
+# - Egress: image is already pulled from the private registry; the public
+# internet is no longer required for boot.
+# - Rollback: revert is one tag bump. Curl-from-main has no per-customer
+# rollback path.
+docker pull "$${IMAGE_REPO}:$${IMAGE_TAG}"
+EXTRACT_CONTAINER=$(docker create "$${IMAGE_REPO}:$${IMAGE_TAG}")
+trap "docker rm '$EXTRACT_CONTAINER' >/dev/null 2>&1 || true" EXIT
+docker cp "$EXTRACT_CONTAINER:/opt/agnes-host/." "$APP_DIR/"
+docker cp "$EXTRACT_CONTAINER:/opt/agnes-host/agnes-auto-upgrade.sh" /usr/local/bin/agnes-auto-upgrade.sh
+chmod +x /usr/local/bin/agnes-auto-upgrade.sh
+# docker-compose.tls.yml + Caddyfile land regardless of TLS_MODE. agnes-auto-upgrade.sh
+# detects TLS at runtime via cert files on disk; certs can appear after boot via
+# agnes-tls-rotate.sh or manual provisioning. The caddy service bind-mounts
+# ./Caddyfile, so it must exist on disk before any `docker compose up` even when
+# the tls overlay is currently inactive. Cheap to keep them on disk either way.
# --- 4. Fetch secrets from Secret Manager — fail loudly if missing ---
KEBOOLA_TOKEN=""
@@ -168,24 +174,10 @@ docker compose $COMPOSE_FILES $COMPOSE_PROFILES_ARG up -d
# --- 6. Auto-upgrade via cron (pulls new image digest every 5 min) ---
if [ "$UPGRADE_MODE" = "auto" ]; then
-# Single-source the cron script from the OSS repo's main branch instead
-# of inlining a copy here. Two reasons:
-# 1. Drift prevention — earlier inline copy missed several iterations
-# of the canonical script (TLS overlay detection, array-form compose
-# files, config-disk fail-fast guard).
-# 2. Re-fetched on every VM boot, so script-only fixes propagate
-# without an infra recreate. For immediate rollout to running VMs,
-# operators can also re-run this fetch by hand.
-#
-# Coupling note: this URL is pinned to `main` while compose files above
-# honor $COMPOSE_REF. If a future canonical script references a NEW
-# compose file, the fetch list above MUST be updated to match — pinned-
-# ref VMs would otherwise break on the next cron tick. Treat the docker-
-# compose.* fetch list as the contract that agnes-auto-upgrade.sh relies
-# on; new compose files referenced from main need a corresponding fetch.
-SCRIPT_URL="https://raw.githubusercontent.com/keboola/agnes-the-ai-analyst/main/scripts/ops/agnes-auto-upgrade.sh"
-curl -fsSL --retry 3 --retry-delay 2 "$SCRIPT_URL" -o /usr/local/bin/agnes-auto-upgrade.sh
-chmod +x /usr/local/bin/agnes-auto-upgrade.sh
+# agnes-auto-upgrade.sh was already extracted to /usr/local/bin/ in
+# section 3 alongside the compose files — the host artifacts ship
+# together from the pinned image. Nothing more to fetch here.
+:
# Install cron entry idempotently: remove any prior agnes-auto-upgrade line, then append ours.
CRON_LINE="*/5 * * * * /usr/local/bin/agnes-auto-upgrade.sh >> /var/log/agnes-auto-upgrade.log 2>&1"

@@ -25,7 +25,17 @@ variable "customer_name" {
}
variable "prod_instance" {
-description = "Production VM configuration."
+description = <<-EOT
+Production VM configuration.
+`image_tag` MUST point to an image that contains `/opt/agnes-host/`
+(this directory was added in v0.26.0). Older tags will fail at first
+boot with `docker cp: No such file or directory` because the startup
+script extracts host artifacts from the image instead of curling
+them. Existing VMs are unaffected by this constraint: the module
+sets `lifecycle { ignore_changes = [metadata_startup_script] }` so
+the new script only runs on freshly-created VMs.
+EOT
type = object({
name = string
machine_type = optional(string, "e2-small")
@@ -45,6 +55,9 @@ variable "dev_instances" {
tls_mode + domain are optional and default to plain HTTP on :8000. Set
tls_mode = "caddy" + domain to enable Caddy + Let's Encrypt (or whatever
CADDY_TLS env var is configured to in the Caddyfile; see Caddyfile docs).
+Same `image_tag >= v0.26.0` constraint as `prod_instance`: older tags
+lack `/opt/agnes-host/` and the startup `docker cp` fails-loud.
EOT
type = list(object({
name = string
@@ -93,7 +106,7 @@ variable "image_repo" {
}
variable "compose_ref" {
-description = "Git ref to fetch docker-compose.yml and overlays from (in keboola/agnes-the-ai-analyst). Use `main` for latest, or a tag like `stable-2026.04.47` for reproducibility."
+description = "DEPRECATED — no longer used. Compose files now ship inside the docker image at /opt/agnes-host/ and are extracted via `docker cp` from the same `image_tag` the operator pinned. Pin `image_tag` instead. Variable retained for one release cycle to avoid breaking existing terraform plans; will be removed in a future major bump."
type = string
default = "main"
}

@@ -1,6 +1,6 @@
[project]
name = "agnes-the-ai-analyst"
-version = "0.25.0"
+version = "0.26.0"
description = "Agnes — AI Data Analyst platform for AI analytical systems"
requires-python = ">=3.11,<3.14"
license = "MIT"