From a9ae5f9c350fbbb566b52a69c42bdb7be2c89909 Mon Sep 17 00:00:00 2001
From: ZdenekSrotyr
Date: Tue, 5 May 2026 19:29:38 +0200
Subject: [PATCH] fix(flat-mount): preserve data:/srv:ro and caddy_config:/config in caddy override; CHANGELOG
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The flat-mount overlay's caddy `volumes: !override` block listed only
three mounts, but the base docker-compose.yml caddy service has five.
`!override` (compose-spec semantics) replaces the entire list, so two
mounts were silently dropped under the flat layout:

- `data:/srv:ro` — Caddy's read-only view of the agnes data dir, used
  by the `@download` file_server handler in Caddyfile (added in
  v0.36.0 as the perf bypass for multi-GB parquet downloads). Without
  this mount, `try_files /bigquery/data/.parquet …` finds no file and
  every parquet download falls through to the app's uvicorn worker —
  defeating the bypass entirely.
- `caddy_config:/config` — Caddy's autosave/ACME state. Less critical
  (we feed certs in via /certs) but loses the autosaved adapter config
  across container recreates.

Restated both mounts with a comment block explaining the !override
caveat for any future overlay author.

Plus: CHANGELOG entries for the host-mount.yml direct-bind fix and the
STATE_DIR + flat-mount overlay under [Unreleased].
---
 CHANGELOG.md                  |  6 ++++++
 docker-compose.flat-mount.yml | 15 +++++++++++++++
 2 files changed, 21 insertions(+)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 888c5ba..68aedc8 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -10,6 +10,12 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C
 
 ## [Unreleased]
 
+### Added
+- **`STATE_DIR` env var + `docker-compose.flat-mount.yml` overlay** — operators can now place the writable state disk in **parallel** to the data disk (`sdb` at `/data`, `sdc` at `/data-state`) instead of nested (`sdc` at `/data/state` inside `/data`).
The flat layout removes three structural fragilities of the legacy nested layout: bind-mount propagation gotchas (the 2026-05-05 shadow-mount class), two-writer collisions on a shared prefix (host's `tls-rotate.timer` as root + container app as uid 999 on the same path), and mount-order coupling on disk resize. `STATE_DIR` defaults to `${DATA_DIR}/state` so existing deployers see no behavior change; opt-in to flat layout via the new overlay + `STATE_DIR=/data-state` per the runbook in `docs/state-dir.md`. Read by `src/db.py:_get_state_dir()`, `app/secrets.py:_state_dir()`, `app/main.py` (`.env_overlay`), `scripts/ops/agnes-auto-upgrade.sh` (mount-sanity + cert detection), `scripts/ops/agnes-tls-rotate.sh` (`CERT_DIR=$STATE_DIR/certs`).
+
+### Changed
+- **`docker-compose.host-mount.yml` switched from "named volume + driver_opts" to direct service-level bind mounts** (`volumes: !override` per service). Docker named volumes have an immutability footgun: once a volume is created, its driver options are fixed for the life of the volume, and editing this file does NOT propagate the new options to existing volumes. This bit a deployer on 2026-05-05: the volume was created before the overlay had `bind,rbind`, kept the old `bind` (non-recursive) propagation, and containers wrote to a shadowed subdirectory of the parent disk instead of the nested child mount. DuckDB went FATAL on a root-owned WAL during a routine container recreate; sign-in broke. Direct service binds re-evaluate options every container start and default to recursive in modern Docker (20.10+) — no immutable state to migrate, no shadow-mount class. Operators on this overlay: next `docker compose up -d` starts containers with direct binds; the old `agnes_data` named volume is no longer referenced and can be removed with `docker volume rm agnes_data` (operator's choice — orphaned but harmless if left).
+
 ## [0.36.0] — 2026-05-05
 
 Combined performance + analyst-clarity bundle.
Folds three previously-staged work streams into one PR (#188): the long-running `agnes query --remote` timeout (#181), the Caddy parquet-download bypass (#182), and Pavel's #185 Phase 1 trace findings (silent 44-min first-init, opaque CLI tracebacks, no analyst-Claude size signal). Also performs the Tier 1 event-loop unblocking — the five hottest BQ-touching endpoints were `async def` over synchronous DuckDB / BQ-extension calls, so a single heavy `agnes query --remote` froze every other request for the duration of the BQ wait. The image-side fixes ship in this release; for existing VMs, the new auto-upgrade.sh self-fetches the matching Caddyfile + compose overlays from `main` on its next 5-minute tick, so deployment requires no operator action beyond letting the cron run.
diff --git a/docker-compose.flat-mount.yml b/docker-compose.flat-mount.yml
index a95d7aa..0aa3bfc 100644
--- a/docker-compose.flat-mount.yml
+++ b/docker-compose.flat-mount.yml
@@ -82,7 +82,22 @@ services:
       - /data-state:/data-state
 
   caddy:
+    # `!override` replaces the entire base volumes list, so every mount
+    # the base service depends on must be re-stated here. Two of those
+    # are easy to miss and silently regress functionality:
+    #   - `data:/srv:ro` — Caddy's read-only view of the agnes data dir
+    #     used by the `@download` `file_server` handler in Caddyfile.
+    #     Without it, `try_files /bigquery/data/.parquet …` finds no
+    #     file and every parquet download falls through to the app's
+    #     uvicorn worker — defeating the perf bypass landed in v0.36.0.
+    #   - `caddy_config:/config` — Caddy's autosave/ACME state. Missing
+    #     it doesn't break HTTPS (we feed certs in via `/certs`) but
+    #     loses the autosaved adapter config across recreates.
+    # Same caveat applies to any future `volumes: !override` block —
+    # diff against the base service before merging.
     volumes: !override
       - ./Caddyfile:/etc/caddy/Caddyfile:ro
       - /data-state/certs:/certs:ro
       - caddy_data:/data
+      - caddy_config:/config
+      - /data:/srv:ro
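
Reviewer note: the "diff against the base service before merging" advice in the comment block can be approximated with a small check. The following is a minimal Python sketch, not part of this patch. It assumes short-syntax mount strings (`SOURCE:TARGET[:MODE]`) and compares container targets only, since an overlay may legitimately swap a named volume for a host path. The helper names are hypothetical, and the base list is reconstructed from the commit message; in particular the base `certs` source name is an assumption.

```python
def container_target(mount: str) -> str:
    """Return the container-side path of a short-syntax compose mount.

    "SOURCE:TARGET[:MODE]" -> TARGET; a bare "TARGET" (anonymous volume)
    is returned unchanged. Windows drive-letter paths are not handled.
    """
    parts = mount.split(":")
    return parts[0] if len(parts) == 1 else parts[1]


def missing_targets(base: list[str], override: list[str]) -> list[str]:
    """List base container targets that the override list fails to re-state."""
    restated = {container_target(m) for m in override}
    return [
        container_target(m)
        for m in base
        if container_target(m) not in restated
    ]


# Base caddy mounts as described in the commit message (five entries).
# The source of the /certs mount in the base file is an assumption.
base_caddy = [
    "./Caddyfile:/etc/caddy/Caddyfile:ro",
    "certs:/certs:ro",  # hypothetical source name
    "caddy_data:/data",
    "caddy_config:/config",
    "data:/srv:ro",
]

# The overlay's list before this patch (three entries) and after it.
overlay_before = [
    "./Caddyfile:/etc/caddy/Caddyfile:ro",
    "/data-state/certs:/certs:ro",
    "caddy_data:/data",
]
overlay_after = overlay_before + ["caddy_config:/config", "/data:/srv:ro"]

print(missing_targets(base_caddy, overlay_before))  # ['/config', '/srv']
print(missing_targets(base_caddy, overlay_after))   # []
```

In a real check the two lists would be read from the parsed compose files rather than hard-coded, and a non-empty result would fail the review or CI step.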