From 655822b953feb7d3f488eeb81d7d7bf591776bc0 Mon Sep 17 00:00:00 2001 From: Vojtech Rysanek Date: Tue, 5 May 2026 20:37:41 +0400 Subject: [PATCH 1/7] host-mount: replace named-volume driver_opts with direct service binds MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The previous version of docker-compose.host-mount.yml modified the 'data' named volume's driver_opts to point at /data with 'o: bind,rbind'. Docker named volumes have an immutability footgun: once a volume is created, its driver options are fixed for the life of the volume. Editing this file and re-running 'docker compose up -d' does NOT propagate the new options to existing volumes — they keep whatever options were in effect at create time. This bit a deployer (Groupon FoundryAI) on 2026-05-05: the volume was created before this overlay had bind,rbind, kept the old bind (non-recursive) propagation, and containers wrote to a shadowed subdirectory of the parent disk instead of the nested child mount. DuckDB went FATAL on a root-owned WAL during a routine container recreate; sign-in broke. Recovery required docker volume rm + manual data migration on every affected VM. Direct service-level bind mounts ('/host/path:/container/path') don't go through Docker's volume layer at all. They re-evaluate mount options every container start, and modern Docker Engine (20.10+) defaults to recursive bind for these. No options to forget, no immutable state to migrate, no shadow-mount class. Validated via 'docker compose config' merge — overlay correctly replaces 'data:/data' with bind type:none on app, extract, scheduler, telegram-bot, ws-gateway. Compose-spec version note: !override merge tag is part of the Compose Specification supported by Docker Compose v2.20+. Tested against Compose v5.1.3 used by Groupon's deployment. 
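Illustration (not part of the change): the frozen-options problem described above can be detected before trusting an overlay edit. This is a minimal sketch, assuming the JSON shape printed by `docker volume inspect` (a list of objects carrying `Name` and `Options`). The helper name `stale_bind_volumes` and the second sample volume are hypothetical, and the code only parses text; it never talks to Docker.

```python
import json


def stale_bind_volumes(inspect_json: str, required_flag: str = "rbind") -> list[str]:
    """Return names of bind-type volumes whose frozen `o` option lacks required_flag.

    inspect_json is assumed to be the output of `docker volume inspect <name>...`.
    """
    stale = []
    for volume in json.loads(inspect_json):
        # Options can be null for volumes created without driver_opts.
        options = volume.get("Options") or {}
        flags = [flag.strip() for flag in options.get("o", "").split(",") if flag.strip()]
        # Only volumes that were created as bind mounts are relevant here.
        if "bind" in flags and required_flag not in flags:
            stale.append(volume["Name"])
    return stale


# Sample data for illustration only: one volume created before the overlay
# gained rbind, and one (hypothetical) volume created afterwards.
sample = json.dumps([
    {"Name": "agnes_data", "Options": {"type": "none", "o": "bind", "device": "/data"}},
    {"Name": "other_data", "Options": {"type": "none", "o": "bind,rbind", "device": "/data"}},
])
print(stale_bind_volumes(sample))
```

A volume reported by such a check keeps its old options no matter what the compose file says, which is the case the direct service binds sidestep.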
--- docker-compose.host-mount.yml | 84 ++++++++++++++++++++++++++++------- 1 file changed, 67 insertions(+), 17 deletions(-) diff --git a/docker-compose.host-mount.yml b/docker-compose.host-mount.yml index dd06534..af03c33 100644 --- a/docker-compose.host-mount.yml +++ b/docker-compose.host-mount.yml @@ -1,14 +1,41 @@ -# Bind-mount overlay — replaces the `data` named volume with a bind mount -# to /data on the host. +# Bind-mount overlay — replaces the `data` named volume with a direct +# host bind mount per service. # -# Use this when /data is a persistent disk mounted by the VM startup script, -# so Agnes data lives on the PD (not on the boot disk's Docker volume). +# Why direct service-level bind, not driver_opts on the named volume +# ------------------------------------------------------------------ +# The previous version of this file modified the `data` named volume's +# `driver_opts` to point at /data with `o: bind,rbind`. Docker named +# volumes have an immutability footgun: once a volume is created, its +# driver options are fixed for the life of the volume. Editing this +# file and re-running `docker compose up -d` does NOT propagate the +# new options to existing volumes — they keep whatever options were +# in effect at create time. # -# `bind,rbind` (recursive bind) is required when the host nests a second -# disk under /data — e.g. the dual-disk layout where sdb is mounted on /data -# and sdc on /data/state. A plain `bind` captures only the top-level mount -# and silently shadows the sub-mount with an empty subdirectory inside the -# container, causing the app to write to the wrong disk. +# This bit a deployer (Groupon FoundryAI) on 2026-05-05: the volume +# was created before this overlay had `bind,rbind`, kept the old +# `bind` (non-recursive) propagation, and containers wrote to a +# shadowed subdirectory of the parent disk instead of the nested +# child mount. 
DuckDB went FATAL on a root-owned WAL during a +# routine container recreate; sign-in broke. +# +# Direct service-level bind mounts (`/host/path:/container/path`) +# don't go through Docker's volume layer at all. They re-evaluate +# the mount options every container start, and modern Docker Engine +# (20.10+) defaults to recursive bind for these. No options to +# forget, no immutable state to migrate, no shadow-mount class. +# +# What this overlay does +# ---------------------- +# `volumes: !override` on each service replaces the base +# `data:/data` named-volume mount with a direct `/data:/data` host +# bind. The named volume `data:` declared at the bottom of +# docker-compose.yml is left intact (still useful for local-dev +# `compose up` without this overlay) but is no longer referenced +# by any service when the overlay is active. +# +# When the operator's host has a nested mount under /data (e.g. a +# separate state disk mounted at /data/state), the recursive bind +# carries that nested mount into every container automatically. # # Usage (combined with docker-compose.prod.yml): # docker compose \ @@ -17,11 +44,34 @@ # -f docker-compose.host-mount.yml \ # up -d # -# Do NOT use this overlay in CI — /data does not exist on GitHub runners. -volumes: - data: - driver: local - driver_opts: - type: none - o: bind,rbind - device: /data +# Do NOT use this overlay in CI — /data does not exist on GitHub +# runners. +# +# Compose-spec version requirement: !override merge tag is part of +# the Compose Specification supported by Docker Compose v2.20+ and +# the compose-go library used by Compose v5+. If you need to support +# older clients, fork this overlay into per-service files. 
+ +services: + app: + volumes: !override + - /data:/data + - ./config:/app/config:ro + + extract: + volumes: !override + - /data:/data + - ./config:/app/config:ro + + scheduler: + volumes: !override + - /data:/data + - ./config:/app/config:ro + + telegram-bot: + volumes: !override + - /data:/data + + ws-gateway: + volumes: !override + - /data:/data From a303de0372d00895abad04eea1213d6f4fb1bc76 Mon Sep 17 00:00:00 2001 From: Vojtech Rysanek Date: Tue, 5 May 2026 20:51:17 +0400 Subject: [PATCH 2/7] feat: STATE_DIR env var + flat-mount overlay (parallel disks) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Introduces STATE_DIR as the single source of truth for the writable state directory path, with backward-compatible default of ${DATA_DIR}/state. Pairs with a new docker-compose.flat-mount.yml overlay that mounts the state disk in PARALLEL to the data disk (rather than nested under it). Why --- The default deployment topology nests state under data: sdb at /data, sdc at /data/state. That layout has known fragility documented in docs/state-dir.md — bind-propagation gotchas, two-writer collisions on the same prefix, mount-order coupling. The 2026-05-05 incident in the Groupon FoundryAI deployment was a manifestation of the propagation gotcha. The flat layout (sdb at /data, sdc at /data-state — parallel, not nested) eliminates the nested-mount class entirely. Each disk is its own bind mount, recursive by default in modern Docker. No volume options to forget. No two-writer collision (host scripts and container app share /data-state at the same path, single namespace). What changes ------------ App code (Python): - src/db.py: new _get_state_dir() helper. get_system_db() and schema migration snapshot use it. - app/secrets.py: new _state_dir() helper. _load_or_generate() uses it for .session_secret and .jwt_secret. - app/main.py: .env_overlay loaded from _state_dir(). 
Host scripts: - scripts/ops/agnes-auto-upgrade.sh: STATE_DIR drives mount-sanity check and cert detection. Defaults preserve existing behavior. - scripts/ops/agnes-tls-rotate.sh: STATE_DIR drives CERT_DIR. New compose overlay: - docker-compose.flat-mount.yml: parallel /data and /data-state binds per service. Mutually exclusive with docker-compose.host-mount.yml; pick one based on disk topology. Documentation: - docs/state-dir.md: layout choice (A nested vs B flat), pros/cons, migration steps, and which code paths read STATE_DIR. Backward compatibility ---------------------- STATE_DIR defaults to ${DATA_DIR}/state — current behavior. Existing deployers that don't set the var see no behavior change. Migration to flat layout is opt-in per the runbook in docs/state-dir.md. Validation ---------- - bash -n on both host scripts: pass - docker compose config -f docker-compose.flat-mount.yml: resolves cleanly with all 6 services binding /data and /data-state directly - python3 import + helper exercise: STATE_DIR override works, default falls back to ${DATA_DIR}/state Companion to PR #191 (drop named-volume driver_opts in host-mount.yml). That PR fixes the immutability footgun for Layout A; this PR offers Layout B as the architectural alternative. 
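Illustration (not part of the change): the resolution order that the Python helpers and host scripts share can be sketched in isolation. The function below is a hypothetical stand-in that takes the environment as a plain mapping so it can be exercised without touching `os.environ`; the real helpers are `_get_state_dir()` and `_state_dir()` in the diff.

```python
from pathlib import Path
from typing import Mapping


def resolve_state_dir(env: Mapping[str, str]) -> Path:
    """Resolve the state directory: STATE_DIR wins, else ${DATA_DIR}/state."""
    state = env.get("STATE_DIR", "")
    if state:
        # Explicit override, e.g. the flat layout's /data-state.
        return Path(state)
    # Backward-compatible default: state nested under the data directory.
    return Path(env.get("DATA_DIR", "./data")) / "state"


# Nested (legacy) layout: STATE_DIR unset.
print(resolve_state_dir({"DATA_DIR": "/data"}))
# Flat layout: STATE_DIR set explicitly.
print(resolve_state_dir({"DATA_DIR": "/data", "STATE_DIR": "/data-state"}))
```

An empty `STATE_DIR` is treated the same as an unset one, so a blank line in `.env` does not redirect state to the current directory.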
--- app/main.py | 3 +- app/secrets.py | 16 ++++- docker-compose.flat-mount.yml | 88 +++++++++++++++++++++++ docs/state-dir.md | 116 ++++++++++++++++++++++++++++++ scripts/ops/agnes-auto-upgrade.sh | 44 +++++++----- scripts/ops/agnes-tls-rotate.sh | 7 +- src/db.py | 21 +++++- 7 files changed, 271 insertions(+), 24 deletions(-) create mode 100644 docker-compose.flat-mount.yml create mode 100644 docs/state-dir.md diff --git a/app/main.py b/app/main.py index 371235f..42b3186 100644 --- a/app/main.py +++ b/app/main.py @@ -340,7 +340,8 @@ def create_app() -> FastAPI: app.add_middleware(RequestIdMiddleware) # Load .env_overlay (persisted by /api/admin/configure) - _overlay = Path(os.environ.get("DATA_DIR", "./data")) / "state" / ".env_overlay" + from app.secrets import _state_dir + _overlay = _state_dir() / ".env_overlay" if _overlay.exists(): for line in _overlay.read_text().splitlines(): if "=" in line and not line.startswith("#"): diff --git a/app/secrets.py b/app/secrets.py index 41f837d..cb88740 100644 --- a/app/secrets.py +++ b/app/secrets.py @@ -7,13 +7,25 @@ from pathlib import Path logger = logging.getLogger(__name__) +def _state_dir() -> Path: + """Return path to writable state directory. + + STATE_DIR env var takes precedence; otherwise defaults to + ${DATA_DIR}/state for backward compatibility with deployments + that nest state under the data disk. See docs/state-dir.md. 
+ """ + state = os.environ.get("STATE_DIR", "") + if state: + return Path(state) + return Path(os.environ.get("DATA_DIR", "./data")) / "state" + + def _load_or_generate(env_var: str, file_name: str) -> str: """Load secret from env var, or from file, or generate and persist.""" val = os.environ.get(env_var, "") if val: return val - data_dir = Path(os.environ.get("DATA_DIR", "./data")) - secret_path = data_dir / "state" / file_name + secret_path = _state_dir() / file_name if secret_path.exists(): val = secret_path.read_text().strip() if val: diff --git a/docker-compose.flat-mount.yml b/docker-compose.flat-mount.yml new file mode 100644 index 0000000..a95d7aa --- /dev/null +++ b/docker-compose.flat-mount.yml @@ -0,0 +1,88 @@ +# Flat-mount overlay — parallel host binds for /data and /data-state. +# +# Why this overlay +# ---------------- +# The default deployment topology nests state under data: sdb at /data, +# sdc at /data/state (i.e. /data/state is a separate disk mounted INSIDE +# the data disk). That layout works but has known fragility: +# +# - Bind-mount propagation matters. A non-recursive bind hides the +# nested mount, leading to silent shadow writes (the failure mode +# that caused 2026-05-05 in the Groupon FoundryAI deployment). +# +# - Two writers, one tree. Host-side timers (tls-rotate.timer) +# write to /data/state/certs as root, while the container app +# writes to /data/state/system.duckdb as uid 999. Same prefix, +# different mount-namespace views = ownership conflicts. +# +# - sdb resize requires umounting sdc first. Mount-order coupling. +# +# This overlay removes the nesting by mounting the state disk in +# PARALLEL to the data disk: +# +# sdb at /data (analytics, regenerable) +# sdc at /data-state (DuckDB, secrets, certs — irreplaceable) +# +# Both are direct service-level binds, recursive by default in modern +# Docker Engine. No volume options to forget. No nested propagation. 
+# No two-writer collision (app uses /data-state, host scripts also use +# /data-state — same path, single namespace). +# +# Usage +# ----- +# 1. On the operator's host: mount the config disk at /data-state +# (instead of /data/state). Update fstab. Move existing state +# contents from /data/state to /data-state. +# +# 2. In /opt/agnes/.env, set STATE_DIR=/data-state. The app's secrets +# module + DuckDB code, plus the host-side rotate.sh and +# auto-upgrade.sh scripts, all read this var. +# +# 3. Compose invocation: +# +# docker compose \ +# -f docker-compose.yml \ +# -f docker-compose.prod.yml \ +# -f docker-compose.flat-mount.yml \ +# up -d +# +# Note: this overlay is mutually exclusive with docker-compose.host-mount.yml. +# Pick one based on your disk topology. +# +# Do NOT use this overlay in CI — /data and /data-state do not exist +# on GitHub runners. + +services: + app: + volumes: !override + - /data:/data + - /data-state:/data-state + - ./config:/app/config:ro + + extract: + volumes: !override + - /data:/data + - /data-state:/data-state + - ./config:/app/config:ro + + scheduler: + volumes: !override + - /data:/data + - /data-state:/data-state + - ./config:/app/config:ro + + telegram-bot: + volumes: !override + - /data:/data + - /data-state:/data-state + + ws-gateway: + volumes: !override + - /data:/data + - /data-state:/data-state + + caddy: + volumes: !override + - ./Caddyfile:/etc/caddy/Caddyfile:ro + - /data-state/certs:/certs:ro + - caddy_data:/data diff --git a/docs/state-dir.md b/docs/state-dir.md new file mode 100644 index 0000000..7e40749 --- /dev/null +++ b/docs/state-dir.md @@ -0,0 +1,116 @@ +# State directory layout + +Agnes splits its persistent data into two tiers: + +| Tier | Path | Contents | Backup posture | +|---|---|---|---| +| **data** | `/data` | analytics workspace, extracts, DuckDB caches | regenerable | +| **state** | `${STATE_DIR}` | `system.duckdb`, `.session_secret`, `.jwt_secret`, `certs/*` | irreplaceable | + +`STATE_DIR` is an 
environment variable that selects the host path the state tier is mounted from. Two layouts are supported: + +## Layout A — nested (legacy default) + +``` +sdb at /data +sdc at /data/state (nested inside the data mount) +``` + +`STATE_DIR=/data/state` (or unset — that's the default). Used by the original deployment topology. + +**Pros**: single bind mount per service (`/data:/data` recursive). Single env var defaults work. + +**Cons**: +- Bind-mount propagation matters. Non-recursive bind silently shadows the nested sdc mount, causing the app to write to an invisible subdirectory of sdb. Recovery requires `docker volume rm` + manual data migration. +- Two writers (host's `tls-rotate.timer` running as root; container app running as uid 999) share `/data/state` with different mount-namespace views → ownership conflicts. +- Resizing sdb requires unmounting sdc first. + +The 2026-05-05 incident in the Groupon FoundryAI deployment was a manifestation of the propagation gotcha. See PRs in this repo and the deployer's infra repo for full root-cause notes. + +## Layout B — flat + +``` +sdb at /data (analytics, regenerable) +sdc at /data-state (state, irreplaceable — parallel to /data, not nested) +``` + +`STATE_DIR=/data-state`. Two parallel host binds per service: `/data:/data` and `/data-state:/data-state`. Use the `docker-compose.flat-mount.yml` overlay. + +**Pros**: +- No nested-mount propagation class. Each disk is its own bind. +- Single writer per disk (host scripts → certs on sdc, container app → DuckDB on sdc; both at the same path). +- sdb resize doesn't touch sdc. +- Direct service binds default to recursive in modern Docker — no `driver_opts` immutability footgun. + +**Cons**: +- One-time per-VM migration: tear down `/data/state` mount, mount sdc at `/data-state` instead, copy state contents. +- Two binds per service (slightly more compose YAML). 
+ +## Choosing + +| Situation | Recommendation | +|---|---| +| Existing deployment, no plans to expand | stay on layout A | +| New deployment | layout B (cleaner, no shadow class) | +| Existing deployment hit by 2026-05-05 shadow class | migrate to layout B | +| CI / local dev | neither (use ephemeral compose volumes) | + +## Migration A → B + +Steps to move an existing VM from nested to flat: + +```bash +# 1. Stop containers +sudo docker compose --env-file /opt/agnes/.env \ + -f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.host-mount.yml \ + --profile tls down + +# 2. Snapshot the existing state +sudo cp -a /data/state /tmp/state-backup-$(date -u +%Y%m%dT%H%M%SZ) + +# 3. Unmount sdc from /data/state (its current nested location) +sudo umount /data/state +sudo rmdir /data/state # remove the now-empty mount point on sdb + +# 4. Create the new flat mount point and remount sdc there +sudo mkdir /data-state +echo "LABEL=agnes-state /data-state ext4 defaults,nofail 0 2" | sudo tee -a /etc/fstab +# (also remove the old /data/state line from fstab) +sudo mount /data-state + +# 5. Restore state from the backup +sudo cp -a /tmp/state-backup-*/. /data-state/ + +# 6. Set STATE_DIR in /opt/agnes/.env +echo "STATE_DIR=/data-state" | sudo tee -a /opt/agnes/.env + +# 7. Bring the stack back up with the flat overlay +cd /opt/agnes +sudo docker compose --env-file /opt/agnes/.env \ + -f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.flat-mount.yml \ + --profile tls up -d +``` + +Verify: `sudo docker exec agnes-app-1 ls /data-state` should show `system.duckdb` etc. + +## What reads `STATE_DIR` + +App code: +- `src/db.py::_get_state_dir()` — the canonical helper. Used by `get_system_db()` and the schema migration snapshot. +- `app/secrets.py::_state_dir()` — for `.session_secret`, `.jwt_secret`. Mirrors the helper since `app/` shouldn't import from `src/`. +- `app/main.py` — for the `.env_overlay` startup file (loaded at process start). 
+ +Host scripts: +- `scripts/ops/agnes-auto-upgrade.sh` — mount-sanity check + cert detection. +- `scripts/ops/agnes-tls-rotate.sh` — `CERT_DIR=$STATE_DIR/certs`. + +Both scripts source `/opt/agnes/.env` with `set -a`, so adding `STATE_DIR=/data-state` to that file propagates everywhere. + +## Caddy cert mount + +Caddy mounts the cert directory from the host at `/certs:ro`. The host-side path follows `STATE_DIR/certs`: + +- Layout A: `/data/state/certs` (in `docker-compose.yml` directly). +- Layout B: `/data-state/certs` (overridden in `docker-compose.flat-mount.yml`). + +Compose-time env substitution happens at `compose up`, not at runtime, so the overlay must be selected at deploy time — there's no single compose YAML that switches based on `STATE_DIR`. diff --git a/scripts/ops/agnes-auto-upgrade.sh b/scripts/ops/agnes-auto-upgrade.sh index 54bdf43..7bc4358 100755 --- a/scripts/ops/agnes-auto-upgrade.sh +++ b/scripts/ops/agnes-auto-upgrade.sh @@ -3,50 +3,58 @@ # Cron fires it every 5 min; pulls latest image for the pinned AGNES_TAG # and recreates containers only if the digest moved. # -# Cert-aware: if /data/state/certs/{fullchain,privkey}.pem both exist +# Cert-aware: if ${STATE_DIR}/certs/{fullchain,privkey}.pem both exist # (populated by agnes-tls-rotate.sh), enables the tls overlay so Caddy # fronts :443. Absence → plain HTTP on :8000. +# +# STATE_DIR is the host path that backs the writable state disk. It +# defaults to /data/state for backward compatibility with the legacy +# nested-mount layout (sdb at /data, sdc nested under /data/state). +# Set STATE_DIR=/data-state in /opt/agnes/.env for the flat layout +# (sdb at /data, sdc parallel at /data-state) — see docs/state-dir.md. set -euo pipefail cd /opt/agnes # shellcheck disable=SC1091 set -a; . /opt/agnes/.env; set +a +STATE_DIR="${STATE_DIR:-/data/state}" + # Fail-fast guard: if the VM has a config disk attached, it MUST be -# mounted at /data/state before any container action. 
Otherwise the -# app would write state onto /data (sdb) and lose it on the next -# container recreate — the regression that motivated this guard. +# mounted at $STATE_DIR before any container action. Otherwise the +# app would write state onto the parent filesystem and lose it on the +# next container recreate — the regression that motivated this guard. # Three retries (mount may race with udev on cold boot) then hard exit. CONFIG_DEVICE=/dev/disk/by-id/google-config-disk if [ -e "$CONFIG_DEVICE" ]; then attempt=0 while [ $attempt -lt 3 ]; do attempt=$((attempt + 1)) - if mountpoint -q /data/state; then + if mountpoint -q "$STATE_DIR"; then expected_dev=$(readlink -f "$CONFIG_DEVICE") - actual_dev=$(findmnt -n -o SOURCE /data/state) + actual_dev=$(findmnt -n -o SOURCE "$STATE_DIR") if [ "$expected_dev" = "$actual_dev" ]; then break fi - logger -t agnes-auto-upgrade "WARN: /data/state on $actual_dev, expected $expected_dev — attempting remount" - umount /data/state 2>/dev/null || true + logger -t agnes-auto-upgrade "WARN: $STATE_DIR on $actual_dev, expected $expected_dev — attempting remount" + umount "$STATE_DIR" 2>/dev/null || true fi - mount "$CONFIG_DEVICE" /data/state 2>/dev/null || true + mount "$CONFIG_DEVICE" "$STATE_DIR" 2>/dev/null || true sleep $((attempt * 2)) done - if ! mountpoint -q /data/state || \ - [ "$(readlink -f "$CONFIG_DEVICE")" != "$(findmnt -n -o SOURCE /data/state)" ]; then - logger -t agnes-auto-upgrade "FATAL: config disk not mounted at /data/state — refusing to start containers" - echo "FATAL: /data/state is not backed by the config disk." >&2 - echo " Refusing to run docker compose — app state must NEVER land on /data (sdb)." >&2 - echo " Inspect: mount | grep /data/state ; ls /dev/disk/by-id/google-config-disk" >&2 + if ! 
mountpoint -q "$STATE_DIR" || \ + [ "$(readlink -f "$CONFIG_DEVICE")" != "$(findmnt -n -o SOURCE "$STATE_DIR")" ]; then + logger -t agnes-auto-upgrade "FATAL: config disk not mounted at $STATE_DIR — refusing to start containers" + echo "FATAL: $STATE_DIR is not backed by the config disk." >&2 + echo " Refusing to run docker compose — app state must land on the config disk, not the parent filesystem." >&2 + echo " Inspect: mount | grep $STATE_DIR ; ls /dev/disk/by-id/google-config-disk" >&2 exit 1 fi # Re-apply propagation in case a prior container teardown reset it. # Idempotent — safe to call when already private. mount --make-rprivate /data 2>/dev/null || true - mount --make-rprivate /data/state 2>/dev/null || true + mount --make-rprivate "$STATE_DIR" 2>/dev/null || true fi IMAGE="ghcr.io/keboola/agnes-the-ai-analyst:${AGNES_TAG:-stable}" @@ -116,10 +124,10 @@ CONFIG_AFTER=$(hash_config_files) # Evaluated AFTER the config re-fetch above so a freshly-added or # freshly-removed Caddyfile is reflected in this tick's compose set, # not the next one. -if [ -s /data/state/certs/fullchain.pem ] && [ -s /data/state/certs/privkey.pem ] && [ -s Caddyfile ]; then +if [ -s "$STATE_DIR/certs/fullchain.pem" ] && [ -s "$STATE_DIR/certs/privkey.pem" ] && [ -s Caddyfile ]; then COMPOSE_FILES+=( -f docker-compose.tls.yml ) PROFILE_ARGS=( --profile tls ) -elif [ -s /data/state/certs/fullchain.pem ] && [ -s /data/state/certs/privkey.pem ]; then +elif [ -s "$STATE_DIR/certs/fullchain.pem" ] && [ -s "$STATE_DIR/certs/privkey.pem" ]; then logger -t agnes-auto-upgrade "WARN: certs present but Caddyfile missing/empty — skipping tls overlay" fi diff --git a/scripts/ops/agnes-tls-rotate.sh b/scripts/ops/agnes-tls-rotate.sh index a4f6ac9..39a2166 100755 --- a/scripts/ops/agnes-tls-rotate.sh +++ b/scripts/ops/agnes-tls-rotate.sh @@ -34,7 +34,12 @@ set -a; . 
/opt/agnes/.env; set +a [ -n "${TLS_FULLCHAIN_URL:-}" ] || { echo "TLS_FULLCHAIN_URL empty — nothing to rotate"; exit 0; } -CERT_DIR=/data/state/certs +# STATE_DIR is the host path that backs the writable state disk. Defaults +# to /data/state for backward compatibility with the legacy nested-mount +# layout; set STATE_DIR=/data-state in /opt/agnes/.env for the flat layout. +# See docs/state-dir.md. +STATE_DIR="${STATE_DIR:-/data/state}" +CERT_DIR="$STATE_DIR/certs" mkdir -p "$CERT_DIR" # Match the agnes UID baked into the app image (Dockerfile: useradd --uid 999). # Without this, whoever happens to win the create race (this script as root diff --git a/src/db.py b/src/db.py index 2591244..f240463 100644 --- a/src/db.py +++ b/src/db.py @@ -453,6 +453,23 @@ def _get_data_dir() -> Path: return Path(os.environ.get("DATA_DIR", "./data")) +def _get_state_dir() -> Path: + """Return path to writable state directory. + + Resolution order: + 1. STATE_DIR env var (explicit override). + 2. ${DATA_DIR}/state (default — current behavior). + + Use the explicit override when the deployer wants state on a + separate disk mounted in parallel with /data rather than nested + inside it. See docs/state-dir.md. + """ + state = os.environ.get("STATE_DIR", "") + if state: + return Path(state) + return _get_data_dir() / "state" + + def get_system_db() -> duckdb.DuckDBPyConnection: """Get a connection to the system state database. @@ -461,7 +478,7 @@ def get_system_db() -> duckdb.DuckDBPyConnection: so callers can safely close() it without closing the underlying connection. 
""" global _system_db_conn, _system_db_path - db_path = str(_get_data_dir() / "state" / "system.duckdb") + db_path = str(_get_state_dir() / "system.duckdb") with _system_db_lock: if _system_db_conn is None or _system_db_path != db_path: @@ -1812,7 +1829,7 @@ def _ensure_schema(conn: duckdb.DuckDBPyConnection) -> None: # Snapshot before migration for rollback support if current > 0: try: - db_path = Path(os.environ.get("DATA_DIR", "./data")) / "state" / "system.duckdb" + db_path = _get_state_dir() / "system.duckdb" if db_path.exists(): # Flush WAL to main DB file before copying try: From a9ae5f9c350fbbb566b52a69c42bdb7be2c89909 Mon Sep 17 00:00:00 2001 From: ZdenekSrotyr Date: Tue, 5 May 2026 19:29:38 +0200 Subject: [PATCH 3/7] fix(flat-mount): preserve data:/srv:ro and caddy_config:/config in caddy override; CHANGELOG MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The flat-mount overlay's caddy `volumes: !override` block listed only three mounts, but the base docker-compose.yml caddy service has five. `!override` (compose-spec semantics) replaces the entire list, so two mounts were silently dropped under the flat layout: - `data:/srv:ro` — Caddy's read-only view of the agnes data dir, used by the `@download` file_server handler in Caddyfile (added in v0.36.0 as the perf bypass for multi-GB parquet downloads). Without this mount, `try_files /bigquery/data/.parquet …` finds no file and every parquet download falls through to the app's uvicorn worker — defeating the bypass entirely. - `caddy_config:/config` — Caddy's autosave/ACME state. Less critical (we feed certs in via /certs) but loses the autosaved adapter config across container recreates. Restated both mounts with a comment block explaining the !override caveat for any future overlay author. Plus: CHANGELOG entries for the host-mount.yml direct-bind fix and the STATE_DIR + flat-mount overlay under [Unreleased]. 
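Illustration (not part of the change): because `!override` replaces the whole list, a dropped mount can be caught by comparing container-side targets between the base service and the overlay. This is a rough sketch with hypothetical helper names. The base list is reconstructed from the description above rather than copied from docker-compose.yml, so treat it as an assumption, and the parser only handles the short `SOURCE:TARGET[:MODE]` syntax.

```python
def container_target(mount: str) -> str:
    """Return the container-side path of a short-syntax volume entry."""
    parts = mount.split(":")
    # Short syntax is SOURCE:TARGET[:MODE]; a bare path is an anonymous volume.
    return parts[1] if len(parts) > 1 else parts[0]


def dropped_targets(base: list[str], override: list[str]) -> list[str]:
    """List container paths mounted by the base service but absent from the override."""
    kept = {container_target(mount) for mount in override}
    return [container_target(m) for m in base if container_target(m) not in kept]


# Assumed base caddy mounts (five entries, per the commit message).
base_caddy = [
    "./Caddyfile:/etc/caddy/Caddyfile:ro",
    "/data/state/certs:/certs:ro",
    "caddy_data:/data",
    "caddy_config:/config",
    "data:/srv:ro",
]
# The overlay's caddy block before this fix (three entries).
overlay_caddy = [
    "./Caddyfile:/etc/caddy/Caddyfile:ro",
    "/data-state/certs:/certs:ro",
    "caddy_data:/data",
]
print(dropped_targets(base_caddy, overlay_caddy))
```

Comparing by target rather than by full string lets the overlay change a mount's source (for example the certs path) without it being reported as dropped.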
--- CHANGELOG.md | 6 ++++++ docker-compose.flat-mount.yml | 15 +++++++++++++++ 2 files changed, 21 insertions(+) diff --git a/CHANGELOG.md b/CHANGELOG.md index 888c5ba..68aedc8 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -10,6 +10,12 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C ## [Unreleased] +### Added +- **`STATE_DIR` env var + `docker-compose.flat-mount.yml` overlay** — operators can now place the writable state disk in **parallel** to the data disk (`sdb` at `/data`, `sdc` at `/data-state`) instead of nested (`sdc` at `/data/state` inside `/data`). The flat layout removes three structural fragilities of the legacy nested layout: bind-mount propagation gotchas (the 2026-05-05 shadow-mount class), two-writer collisions on a shared prefix (host's `tls-rotate.timer` as root + container app as uid 999 on the same path), and mount-order coupling on disk resize. `STATE_DIR` defaults to `${DATA_DIR}/state` so existing deployers see no behavior change; opt-in to flat layout via the new overlay + `STATE_DIR=/data-state` per the runbook in `docs/state-dir.md`. Read by `src/db.py:_get_state_dir()`, `app/secrets.py:_state_dir()`, `app/main.py` (`.env_overlay`), `scripts/ops/agnes-auto-upgrade.sh` (mount-sanity + cert detection), `scripts/ops/agnes-tls-rotate.sh` (`CERT_DIR=$STATE_DIR/certs`). + +### Changed +- **`docker-compose.host-mount.yml` switched from "named volume + driver_opts" to direct service-level bind mounts** (`volumes: !override` per service). Docker named volumes have an immutability footgun: once a volume is created, its driver options are fixed for the life of the volume, and editing this file does NOT propagate the new options to existing volumes. This bit a deployer on 2026-05-05: the volume was created before the overlay had `bind,rbind`, kept the old `bind` (non-recursive) propagation, and containers wrote to a shadowed subdirectory of the parent disk instead of the nested child mount. 
DuckDB went FATAL on a root-owned WAL during a routine container recreate; sign-in broke. Direct service binds re-evaluate options every container start and default to recursive in modern Docker (20.10+) — no immutable state to migrate, no shadow-mount class. Operators on this overlay: next `docker compose up -d` starts containers with direct binds; the old `agnes_data` named volume is no longer referenced and can be removed with `docker volume rm agnes_data` (operator's choice — orphaned but harmless if left). + ## [0.36.0] — 2026-05-05 Combined performance + analyst-clarity bundle. Folds three previously-staged work streams into one PR (#188): the long-running `agnes query --remote` timeout (#181), the Caddy parquet-download bypass (#182), and Pavel's #185 Phase 1 trace findings (silent 44-min first-init, opaque CLI tracebacks, no analyst-Claude size signal). Also performs the Tier 1 event-loop unblocking — the five hottest BQ-touching endpoints were `async def` over synchronous DuckDB / BQ-extension calls, so a single heavy `agnes query --remote` froze every other request for the duration of the BQ wait. The image-side fixes ship in this release; for existing VMs, the new auto-upgrade.sh self-fetches the matching Caddyfile + compose overlays from `main` on its next 5-minute tick, so deployment requires no operator action beyond letting the cron run. diff --git a/docker-compose.flat-mount.yml b/docker-compose.flat-mount.yml index a95d7aa..0aa3bfc 100644 --- a/docker-compose.flat-mount.yml +++ b/docker-compose.flat-mount.yml @@ -82,7 +82,22 @@ services: - /data-state:/data-state caddy: + # `!override` replaces the entire base volumes list, so every mount + # the base service depends on must be re-stated here. Two of those + # are easy to miss and silently regress functionality: + # - `data:/srv:ro` — Caddy's read-only view of the agnes data dir + # used by the `@download` `file_server` handler in Caddyfile. 
+ # Without it, `try_files /bigquery/data/.parquet …` finds no + # file and every parquet download falls through to the app's + # uvicorn worker — defeating the perf bypass landed in v0.36.0. + # - `caddy_config:/config` — Caddy's autosave/ACME state. Missing + # it doesn't break HTTPS (we feed certs in via `/certs`) but + # loses the autosaved adapter config across recreates. + # Same caveat applies to any future `volumes: !override` block — + # diff against the base service before merging. volumes: !override - ./Caddyfile:/etc/caddy/Caddyfile:ro - /data-state/certs:/certs:ro - caddy_data:/data + - caddy_config:/config + - /data:/srv:ro From b6543c9c5547fa05433d37f5721d1e4ce764f412 Mon Sep 17 00:00:00 2001 From: ZdenekSrotyr Date: Tue, 5 May 2026 19:47:12 +0200 Subject: [PATCH 4/7] =?UTF-8?q?fix:=20Devin=20Review=20on=20#194=20?= =?UTF-8?q?=E2=80=94=202=20BUG-class=20findings?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 1. .env_overlay write paths now match read path under STATE_DIR. app/main.py:343 reads via _state_dir() (post-PR #194), but two write sites still hardcoded ${DATA_DIR}/state/.env_overlay: - app/api/admin.py:2687 — configure endpoint secrets persistence - app/api/marketplaces.py:152 — marketplace PAT persistence Under flat-mount layout (STATE_DIR=/data-state) the admin UI wrote secrets to /data/state/.env_overlay while the app read from /data-state/.env_overlay, silently dropping the value on next restart. Both write sites now go through _state_dir(). 2. host-mount.yml: caddy inherits data:/srv:ro from base, but with no service populating the data: named volume (other services switched to direct /data binds), the inherited mount points at an empty Docker volume — try_files finds nothing, every parquet download falls through to uvicorn, defeating the v0.36.0 file_server bypass under the host-mount layout. Added a caddy override that restates all mounts including a direct /data:/srv:ro bind. 
Mirrors the comment + treatment already in flat-mount.yml. --- app/api/admin.py | 9 +++++++-- app/api/marketplaces.py | 14 +++++++++++--- docker-compose.host-mount.yml | 19 +++++++++++++++++++ 3 files changed, 37 insertions(+), 5 deletions(-) diff --git a/app/api/admin.py b/app/api/admin.py index fe43805..7fd1f82 100644 --- a/app/api/admin.py +++ b/app/api/admin.py @@ -2683,8 +2683,13 @@ async def configure_instance( secrets_to_persist["KEBOOLA_STACK_URL"] = request.keboola_url if secrets_to_persist: - data_dir = Path(os.environ.get("DATA_DIR", "./data")) - overlay_path = data_dir / "state" / ".env_overlay" + # Resolve via _state_dir() so the path matches app/main.py's + # startup-time read of the same overlay. Without this, an operator + # on the flat-mount layout (STATE_DIR=/data-state) would write + # secrets to /data/state/.env_overlay here while the app reads + # from /data-state/.env_overlay — silent loss on next restart. + from app.secrets import _state_dir + overlay_path = _state_dir() / ".env_overlay" overlay_path.parent.mkdir(parents=True, exist_ok=True) # Merge with existing overlay diff --git a/app/api/marketplaces.py b/app/api/marketplaces.py index b802726..f910259 100644 --- a/app/api/marketplaces.py +++ b/app/api/marketplaces.py @@ -147,9 +147,17 @@ def _token_env_name(slug: str) -> str: def _persist_token(env_name: str, value: str) -> None: - """Write (or update) a single key in data/state/.env_overlay and os.environ.""" - data_dir = Path(os.environ.get("DATA_DIR", "./data")) - overlay_path = data_dir / "state" / ".env_overlay" + """Write (or update) a single key in ``${STATE_DIR}/.env_overlay`` and ``os.environ``. + + Path resolution matches ``app/main.py``'s startup-time read; without + this alignment, marketplace PATs persisted under the flat-mount + layout (``STATE_DIR=/data-state``) would land at + ``/data/state/.env_overlay`` while the app reads from + ``/data-state/.env_overlay``, silently dropping the token on the + next restart. 
+ """ + from app.secrets import _state_dir + overlay_path = _state_dir() / ".env_overlay" overlay_path.parent.mkdir(parents=True, exist_ok=True) existing: dict[str, str] = {} diff --git a/docker-compose.host-mount.yml b/docker-compose.host-mount.yml index af03c33..fa1bb27 100644 --- a/docker-compose.host-mount.yml +++ b/docker-compose.host-mount.yml @@ -75,3 +75,22 @@ services: ws-gateway: volumes: !override - /data:/data + + caddy: + # Caddy was originally inheriting `data:/srv:ro` from the base + # service. Once the other services switch to direct binds and + # nothing populates the `data:` named volume, that inherited + # mount points at an empty Docker-managed volume — and the + # @download `try_files /bigquery/data/.parquet …` block + # in Caddyfile finds nothing, so every parquet download falls + # through to the app's uvicorn worker, defeating the v0.36.0 + # file_server bypass. + # + # Restate every mount the base caddy service depends on; mirror + # the same caveat that lives in flat-mount.yml. + volumes: !override + - ./Caddyfile:/etc/caddy/Caddyfile:ro + - /data/state/certs:/certs:ro + - caddy_data:/data + - caddy_config:/config + - /data:/srv:ro From df2c33147c0da28102a75c632dbdc07a7f7ae88b Mon Sep 17 00:00:00 2001 From: ZdenekSrotyr Date: Tue, 5 May 2026 20:02:50 +0200 Subject: [PATCH 5/7] =?UTF-8?q?fix:=20Devin=20Review=20on=20#194=20round?= =?UTF-8?q?=202=20=E2=80=94=203=20BUG-class=20findings?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 1. instance.yaml overlay path now matches read site under STATE_DIR. Three sites updated: - app/api/admin.py:1005 (server-config endpoint writer) - app/api/admin.py:2610 (configure endpoint writer) - app/instance_config.py:106 (overlay reader) All three now go through _state_dir() so under flat-mount layout (STATE_DIR=/data-state) the irreplaceable instance.yaml overlay lands on the state disk (sdc) instead of the regenerable data disk (sdb). 
Without this fix, .env_overlay correctly went to the state disk while instance.yaml went to the data disk — config would be lost if an operator wiped sdb. 2. Strip customer-specific tokens from OSS repo per CLAUDE.md vendor-agnostic rule: - docker-compose.host-mount.yml: 'a deployer (Groupon FoundryAI)' → 'a deployer in production' - docker-compose.flat-mount.yml: 'caused 2026-05-05 in the Groupon FoundryAI deployment' → generic 'production failure mode' - docs/state-dir.md: rewrote the incident reference to describe the failure mode abstractly without naming the deployment; updated the recommendation table to say 'shadow-mount class' instead of dating the specific incident. 3. Updated docs/state-dir.md 'What reads STATE_DIR' to list all read/write sites including the three migrated in this round (admin.py, instance_config.py, marketplaces.py). ANALYSIS finding (tls-rotate.sh hardcoded host-mount.yml) deferred — same operator-side class as auto-upgrade.sh hardcoded host-mount, documented limitation per the PR body. --- app/api/admin.py | 8 ++++---- app/instance_config.py | 9 +++++++-- docker-compose.flat-mount.yml | 4 ++-- docker-compose.host-mount.yml | 2 +- docs/state-dir.md | 7 +++++-- 5 files changed, 19 insertions(+), 11 deletions(-) diff --git a/app/api/admin.py b/app/api/admin.py index 7fd1f82..4469299 100644 --- a/app/api/admin.py +++ b/app/api/admin.py @@ -1001,8 +1001,8 @@ async def update_server_config( # atomic-write sequence; the audit log sits outside since it operates on # local snapshots. from app.instance_config import reset_cache - data_dir = Path(os.environ.get("DATA_DIR", "./data")) - config_path = data_dir / "state" / "instance.yaml" + from app.secrets import _state_dir + config_path = _state_dir() / "instance.yaml" config_path.parent.mkdir(parents=True, exist_ok=True) with _overlay_write_lock: @@ -2606,8 +2606,8 @@ async def configure_instance( # — they don't belong in the overlay at all. # 2. Patch only the sections this endpoint touches. 
# 3. Write the narrow overlay back atomically (tmp + os.replace). - data_dir = Path(os.environ.get("DATA_DIR", "./data")) - config_path = data_dir / "state" / "instance.yaml" + from app.secrets import _state_dir + config_path = _state_dir() / "instance.yaml" # Same serialization + corrupt-overlay handling as POST /server-config. with _overlay_write_lock: diff --git a/app/instance_config.py b/app/instance_config.py index 6155608..6901993 100644 --- a/app/instance_config.py +++ b/app/instance_config.py @@ -102,8 +102,13 @@ def load_instance_config() -> dict: # mirror the resolver here before the deep-merge — without it, the # LLM factory receives the literal placeholder and rejects it as an # invalid api key (#179 review fix). - data_dir = Path(os.environ.get("DATA_DIR", "./data")) - overlay_path = data_dir / "state" / "instance.yaml" + # Resolve via _state_dir() so the path matches the writer in + # app/api/admin.py — under the flat-mount layout (STATE_DIR=/data-state) + # both the configure-endpoint and the server-config-endpoint write + # ``/data-state/instance.yaml``; reading from ``/data/state/...`` here + # would silently load stale config from the regenerable data disk. + from app.secrets import _state_dir + overlay_path = _state_dir() / "instance.yaml" if overlay_path.exists(): try: overlay = yaml.safe_load(overlay_path.read_text()) or {} diff --git a/docker-compose.flat-mount.yml b/docker-compose.flat-mount.yml index 0aa3bfc..63125b0 100644 --- a/docker-compose.flat-mount.yml +++ b/docker-compose.flat-mount.yml @@ -7,8 +7,8 @@ # the data disk). That layout works but has known fragility: # # - Bind-mount propagation matters. A non-recursive bind hides the -# nested mount, leading to silent shadow writes (the failure mode -# that caused 2026-05-05 in the Groupon FoundryAI deployment). +# nested mount, leading to silent shadow writes — the production +# failure mode that motivated this overlay. # # - Two writers, one tree. 
Host-side timers (tls-rotate.timer) # write to /data/state/certs as root, while the container app diff --git a/docker-compose.host-mount.yml b/docker-compose.host-mount.yml index fa1bb27..3758a82 100644 --- a/docker-compose.host-mount.yml +++ b/docker-compose.host-mount.yml @@ -11,7 +11,7 @@ # new options to existing volumes — they keep whatever options were # in effect at create time. # -# This bit a deployer (Groupon FoundryAI) on 2026-05-05: the volume +# This bit a deployer in production: the volume # was created before this overlay had `bind,rbind`, kept the old # `bind` (non-recursive) propagation, and containers wrote to a # shadowed subdirectory of the parent disk instead of the nested diff --git a/docs/state-dir.md b/docs/state-dir.md index 7e40749..f89ef4d 100644 --- a/docs/state-dir.md +++ b/docs/state-dir.md @@ -25,7 +25,7 @@ sdc at /data/state (nested inside the data mount) - Two writers (host's `tls-rotate.timer` running as root; container app running as uid 999) share `/data/state` with different mount-namespace views → ownership conflicts. - Resizing sdb requires unmounting sdc first. -The 2026-05-05 incident in the Groupon FoundryAI deployment was a manifestation of the propagation gotcha. See PRs in this repo and the deployer's infra repo for full root-cause notes. +A production deployment hit this propagation gotcha: a volume was created with non-recursive `bind`, the file was later edited to `bind,rbind`, but Docker named-volume options are immutable after creation, so containers kept writing to a shadowed subdirectory of the parent disk. DuckDB went FATAL on a root-owned WAL during a routine container recreate; sign-in broke. Recovery required `docker volume rm` + per-VM data migration on every affected host. 
## Layout B — flat @@ -52,7 +52,7 @@ sdc at /data-state (state, irreplaceable — parallel to /data, not nested) |---|---| | Existing deployment, no plans to expand | stay on layout A | | New deployment | layout B (cleaner, no shadow class) | -| Existing deployment hit by 2026-05-05 shadow class | migrate to layout B | +| Existing deployment hit by the shadow-mount class above | migrate to layout B | | CI / local dev | neither (use ephemeral compose volumes) | ## Migration A → B @@ -99,6 +99,9 @@ App code: - `src/db.py::_get_state_dir()` — the canonical helper. Used by `get_system_db()` and the schema migration snapshot. - `app/secrets.py::_state_dir()` — for `.session_secret`, `.jwt_secret`. Mirrors the helper since `app/` shouldn't import from `src/`. - `app/main.py` — for the `.env_overlay` startup file (loaded at process start). +- `app/instance_config.py` — for the writable `instance.yaml` overlay (read at every config-load). +- `app/api/admin.py` — for the writable `instance.yaml` overlay (write site of `POST /api/admin/server-config` and `POST /api/admin/configure`) and for `.env_overlay` (write site of `POST /api/admin/configure`). +- `app/api/marketplaces.py` — for `.env_overlay` (write site of marketplace PAT persistence). Host scripts: - `scripts/ops/agnes-auto-upgrade.sh` — mount-sanity check + cert detection. From 4a1916a4b0526075dd00a1e87d25e19b2ee9ed57 Mon Sep 17 00:00:00 2001 From: ZdenekSrotyr Date: Tue, 5 May 2026 20:13:08 +0200 Subject: [PATCH 6/7] fix: v24 migration error message points to actual snapshot path The pre-migration snapshot was correctly migrated to STATE_DIR-aware path in src/db.py:1832 (`_get_state_dir() / 'system.duckdb.pre-migrate'`), but the error message in _migrate_v24_bq_source_queries still hardcoded the old `{DATA_DIR}/state/...` shape. Under flat-mount layout (STATE_DIR=/data-state), an operator hitting the v24 migration error would look in /data/state/ for a rollback snapshot that lives in /data-state/. 
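A minimal sketch of the path divergence this commit message describes, under stated assumptions. `resolve_state_dir` is a hypothetical stand-in for `_get_state_dir()` (the real helper body is not shown in this series). It assumes only what the changelog states: `STATE_DIR` defaults to `${DATA_DIR}/state`, with the `./data` fallback seen in the removed code. Treating an empty `STATE_DIR` as unset is an additional assumption.

```python
from pathlib import Path

SNAPSHOT_NAME = "system.duckdb.pre-migrate"


def resolve_state_dir(env: dict[str, str]) -> Path:
    """Hypothetical stand-in for the STATE_DIR-aware helper.

    Assumption: STATE_DIR wins when set and non-empty; otherwise the
    legacy nested location ${DATA_DIR}/state is used.
    """
    explicit = env.get("STATE_DIR")
    if explicit:
        return Path(explicit)
    return Path(env.get("DATA_DIR", "./data")) / "state"


def legacy_snapshot_path(env: dict[str, str]) -> Path:
    """Path shape the old error message pointed at (always under DATA_DIR)."""
    return Path(env.get("DATA_DIR", "./data")) / "state" / SNAPSHOT_NAME


def snapshot_path(env: dict[str, str]) -> Path:
    """Path shape after the fix: follows the resolved state dir."""
    return resolve_state_dir(env) / SNAPSHOT_NAME


nested = {"DATA_DIR": "/data"}
flat = {"DATA_DIR": "/data", "STATE_DIR": "/data-state"}

# Nested layout: old and new shapes agree, so the stale message was harmless.
print(snapshot_path(nested) == legacy_snapshot_path(nested))
# Flat layout: the old message names a directory the snapshot is not in.
print(legacy_snapshot_path(flat), "vs", snapshot_path(flat))
```

Under the nested layout both shapes resolve to the same file, which is presumably why the stale message went unnoticed until the flat layout was introduced.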
Devin Review on PR #194 round 3. --- src/db.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/db.py b/src/db.py index f240463..681c906 100644 --- a/src/db.py +++ b/src/db.py @@ -1771,7 +1771,7 @@ def _v23_to_v24_finalize(conn: duckdb.DuckDBPyConnection) -> None: f"`instance.yaml: data_source.bigquery.project`) and restart " f"the app to retry the migration. The schema version is NOT " f"bumped to 24 until this completes; pre-migration DB " - f"snapshot is at `{{DATA_DIR}}/state/system.duckdb.pre-migrate`." + f"snapshot is at `{_get_state_dir()}/system.duckdb.pre-migrate`." ) conn.execute("BEGIN TRANSACTION") From fdc6cd7fb4aa0a1d371090ecf017a213c7e99257 Mon Sep 17 00:00:00 2001 From: ZdenekSrotyr Date: Wed, 6 May 2026 06:53:48 +0200 Subject: [PATCH 7/7] =?UTF-8?q?release:=200.37.0=20=E2=80=94=20STATE=5FDIR?= =?UTF-8?q?=20+=20flat-mount=20overlay;=20host-mount=20direct-bind=20fix?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- CHANGELOG.md | 8 ++++++-- pyproject.toml | 2 +- 2 files changed, 7 insertions(+), 3 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 68aedc8..d311412 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -10,11 +10,15 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C ## [Unreleased] +## [0.37.0] — 2026-05-06 + +Operator-side disk-layout release. Closes the 2026-05-05 shadow-mount class identified in v0.36.0's deploy notes via two independent fixes that operators can adopt separately: (#194 folds in @cvrysanek's #191 + #192). The image-side change is invisible — `STATE_DIR` defaults to the legacy nested path, so existing deployments see no behavior change unless they opt into the new flat layout. Folds in three rounds of Devin Review (3 BUGs + 1 ANALYSIS class, ANALYSIS deferred per the operator-side limitation it describes). 
+ ### Added -- **`STATE_DIR` env var + `docker-compose.flat-mount.yml` overlay** — operators can now place the writable state disk in **parallel** to the data disk (`sdb` at `/data`, `sdc` at `/data-state`) instead of nested (`sdc` at `/data/state` inside `/data`). The flat layout removes three structural fragilities of the legacy nested layout: bind-mount propagation gotchas (the 2026-05-05 shadow-mount class), two-writer collisions on a shared prefix (host's `tls-rotate.timer` as root + container app as uid 999 on the same path), and mount-order coupling on disk resize. `STATE_DIR` defaults to `${DATA_DIR}/state` so existing deployers see no behavior change; opt-in to flat layout via the new overlay + `STATE_DIR=/data-state` per the runbook in `docs/state-dir.md`. Read by `src/db.py:_get_state_dir()`, `app/secrets.py:_state_dir()`, `app/main.py` (`.env_overlay`), `scripts/ops/agnes-auto-upgrade.sh` (mount-sanity + cert detection), `scripts/ops/agnes-tls-rotate.sh` (`CERT_DIR=$STATE_DIR/certs`). +- **`STATE_DIR` env var + `docker-compose.flat-mount.yml` overlay** — operators can now place the writable state disk in **parallel** to the data disk (`sdb` at `/data`, `sdc` at `/data-state`) instead of nested (`sdc` at `/data/state` inside `/data`). The flat layout removes three structural fragilities of the legacy nested layout: bind-mount propagation gotchas (the 2026-05-05 shadow-mount class), two-writer collisions on a shared prefix (host's `tls-rotate.timer` as root + container app as uid 999 on the same path), and mount-order coupling on disk resize. `STATE_DIR` defaults to `${DATA_DIR}/state` so existing deployers see no behavior change; opt-in to flat layout via the new overlay + `STATE_DIR=/data-state` per the runbook in `docs/state-dir.md`. 
Read by `src/db.py:_get_state_dir()`, `app/secrets.py:_state_dir()`, `app/main.py` (`.env_overlay`), `app/instance_config.py` (`instance.yaml` overlay reader), `app/api/admin.py` (writers for both `/api/admin/configure` and `/api/admin/server-config` against the same overlay), `app/api/marketplaces.py` (marketplace PAT persistence into `.env_overlay`), `scripts/ops/agnes-auto-upgrade.sh` (mount-sanity + cert detection), `scripts/ops/agnes-tls-rotate.sh` (`CERT_DIR=$STATE_DIR/certs`). All read/write sites resolve via the same helper so under `STATE_DIR=/data-state` the irreplaceable tier (`system.duckdb`, secrets, `instance.yaml`, `.env_overlay`, certs) lands on sdc consistently — partial migration would silently lose secrets on container restart. ### Changed -- **`docker-compose.host-mount.yml` switched from "named volume + driver_opts" to direct service-level bind mounts** (`volumes: !override` per service). Docker named volumes have an immutability footgun: once a volume is created, its driver options are fixed for the life of the volume, and editing this file does NOT propagate the new options to existing volumes. This bit a deployer on 2026-05-05: the volume was created before the overlay had `bind,rbind`, kept the old `bind` (non-recursive) propagation, and containers wrote to a shadowed subdirectory of the parent disk instead of the nested child mount. DuckDB went FATAL on a root-owned WAL during a routine container recreate; sign-in broke. Direct service binds re-evaluate options every container start and default to recursive in modern Docker (20.10+) — no immutable state to migrate, no shadow-mount class. Operators on this overlay: next `docker compose up -d` starts containers with direct binds; the old `agnes_data` named volume is no longer referenced and can be removed with `docker volume rm agnes_data` (operator's choice — orphaned but harmless if left). 
+- **`docker-compose.host-mount.yml` switched from "named volume + driver_opts" to direct service-level bind mounts** (`volumes: !override` per service). Docker named volumes have an immutability footgun: once a volume is created, its driver options are fixed for the life of the volume, and editing this file does NOT propagate the new options to existing volumes. This bit a deployer in production: the volume was created before the overlay had `bind,rbind`, kept the old `bind` (non-recursive) propagation, and containers wrote to a shadowed subdirectory of the parent disk instead of the nested child mount. DuckDB went FATAL on a root-owned WAL during a routine container recreate; sign-in broke. Direct service binds re-evaluate options every container start and default to recursive in modern Docker (20.10+) — no immutable state to migrate, no shadow-mount class. Operators on this overlay: next `docker compose up -d` starts containers with direct binds; the old `agnes_data` named volume is no longer referenced and can be removed with `docker volume rm agnes_data` (operator's choice — orphaned but harmless if left). Both `host-mount.yml` and `flat-mount.yml` `volumes: !override` blocks for `caddy` now restate every mount the base service depends on (notably `data:/srv:ro` for the v0.36.0 file_server bypass and `caddy_config:/config` for ACME state) — a Devin-caught regression where `!override` silently dropped these mounts under the new layout, defeating the parquet-download perf bypass. ## [0.36.0] — 2026-05-05 diff --git a/pyproject.toml b/pyproject.toml index 8fddab2..f925413 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -1,6 +1,6 @@ [project] name = "agnes-the-ai-analyst" -version = "0.36.0" +version = "0.37.0" description = "Agnes — AI Data Analyst platform for AI analytical systems" requires-python = ">=3.11,<3.14" license = "MIT"
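The `volumes: !override` caveat from the changelog entry above (every mount the base service depends on must be restated) can be checked mechanically. The sketch below is illustrative only. The `base` list is an assumption, since the base compose file is not part of this series. The `pre_fix` and `fixed` lists follow the `caddy` block in the `docker-compose.flat-mount.yml` hunk. The helper names are hypothetical, and only Compose short-syntax entries (`source:target[:mode]`) are handled.

```python
def mount_target(spec: str) -> str:
    """Return the container path of a short-syntax volume entry.

    Handles `source:target`, `source:target:mode`, and a bare `target`
    (anonymous volume). Long-syntax mappings are out of scope here.
    """
    parts = spec.split(":")
    return parts[0] if len(parts) == 1 else parts[1]


def dropped_targets(base: list[str], override: list[str]) -> list[str]:
    """Container paths mounted by the base service but absent from the override."""
    kept = {mount_target(s) for s in override}
    return sorted({mount_target(s) for s in base} - kept)


# Assumed base caddy mounts (hypothetical; the base file is not shown).
base = [
    "./Caddyfile:/etc/caddy/Caddyfile:ro",
    "./certs:/certs:ro",
    "caddy_data:/data",
    "caddy_config:/config",
    "data:/srv:ro",
]

# Override as it stood before the fix (the three context lines in the hunk).
pre_fix = [
    "./Caddyfile:/etc/caddy/Caddyfile:ro",
    "/data-state/certs:/certs:ro",
    "caddy_data:/data",
]

# Override after the fix restates the two missing mounts.
fixed = pre_fix + ["caddy_config:/config", "/data:/srv:ro"]

print(dropped_targets(base, pre_fix))
print(dropped_targets(base, fixed))
```

Comparing container paths rather than full entries is deliberate: the fix swaps the source (`data` named volume to a direct `/data` bind) while keeping the target `/srv`, so a string-equality diff would report a false regression.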