fix(flat-mount): preserve data:/srv:ro and caddy_config:/config in caddy override; CHANGELOG

The flat-mount overlay's caddy `volumes: !override` block listed only
three mounts, but the base docker-compose.yml caddy service has five.
`!override` (compose-spec semantics) replaces the entire list, so two
mounts were silently dropped under the flat layout:

- `data:/srv:ro` — Caddy's read-only view of the agnes data dir, used
  by the `@download` file_server handler in Caddyfile (added in v0.36.0
  as the perf bypass for multi-GB parquet downloads). Without this
  mount, `try_files /bigquery/data/<id>.parquet …` finds no file and
  every parquet download falls through to the app's uvicorn worker —
  defeating the bypass entirely.
- `caddy_config:/config` — Caddy's autosave/ACME state. Less critical
  (we feed certs in via /certs) but loses the autosaved adapter config
  across container recreates.

Restated both mounts with a comment block explaining the !override
caveat for any future overlay author.

Plus: CHANGELOG entries for the host-mount.yml direct-bind fix and
the STATE_DIR + flat-mount overlay under [Unreleased].
This commit is contained in:
ZdenekSrotyr 2026-05-05 19:29:38 +02:00
parent a303de0372
commit a9ae5f9c35
2 changed files with 21 additions and 0 deletions

View file

@ -10,6 +10,12 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C
## [Unreleased]
### Added
- **`STATE_DIR` env var + `docker-compose.flat-mount.yml` overlay** — operators can now place the writable state disk in **parallel** to the data disk (`sdb` at `/data`, `sdc` at `/data-state`) instead of nested (`sdc` at `/data/state` inside `/data`). The flat layout removes three structural fragilities of the legacy nested layout: bind-mount propagation gotchas (the 2026-05-05 shadow-mount class), two-writer collisions on a shared prefix (host's `tls-rotate.timer` as root + container app as uid 999 on the same path), and mount-order coupling on disk resize. `STATE_DIR` defaults to `${DATA_DIR}/state` so existing deployers see no behavior change; opt-in to flat layout via the new overlay + `STATE_DIR=/data-state` per the runbook in `docs/state-dir.md`. Read by `src/db.py:_get_state_dir()`, `app/secrets.py:_state_dir()`, `app/main.py` (`.env_overlay`), `scripts/ops/agnes-auto-upgrade.sh` (mount-sanity + cert detection), `scripts/ops/agnes-tls-rotate.sh` (`CERT_DIR=$STATE_DIR/certs`).
### Changed
- **`docker-compose.host-mount.yml` switched from "named volume + driver_opts" to direct service-level bind mounts** (`volumes: !override` per service). Docker named volumes have an immutability footgun: once a volume is created, its driver options are fixed for the life of the volume, and editing this file does NOT propagate the new options to existing volumes. This bit a deployer on 2026-05-05: the volume was created before the overlay had `bind,rbind`, kept the old `bind` (non-recursive) propagation, and containers wrote to a shadowed subdirectory of the parent disk instead of the nested child mount. DuckDB went FATAL on a root-owned WAL during a routine container recreate; sign-in broke. Direct service binds re-evaluate options every container start and default to recursive in modern Docker (20.10+) — no immutable state to migrate, no shadow-mount class. Operators on this overlay: next `docker compose up -d` starts containers with direct binds; the old `agnes_data` named volume is no longer referenced and can be removed with `docker volume rm agnes_data` (operator's choice — orphaned but harmless if left).
## [0.36.0] — 2026-05-05
Combined performance + analyst-clarity bundle. Folds three previously-staged work streams into one PR (#188): the long-running `agnes query --remote` timeout (#181), the Caddy parquet-download bypass (#182), and Pavel's #185 Phase 1 trace findings (silent 44-min first-init, opaque CLI tracebacks, no analyst-Claude size signal). Also performs the Tier 1 event-loop unblocking — the five hottest BQ-touching endpoints were `async def` over synchronous DuckDB / BQ-extension calls, so a single heavy `agnes query --remote` froze every other request for the duration of the BQ wait. The image-side fixes ship in this release; for existing VMs, the new auto-upgrade.sh self-fetches the matching Caddyfile + compose overlays from `main` on its next 5-minute tick, so deployment requires no operator action beyond letting the cron run.

View file

@ -82,7 +82,22 @@ services:
- /data-state:/data-state
caddy:
# `!override` replaces the entire base volumes list, so every mount
# the base service depends on must be re-stated here. Two of those
# are easy to miss and silently regress functionality:
# - `data:/srv:ro` — Caddy's read-only view of the agnes data dir
# used by the `@download` `file_server` handler in Caddyfile.
# Without it, `try_files /bigquery/data/<id>.parquet …` finds no
# file and every parquet download falls through to the app's
# uvicorn worker — defeating the perf bypass landed in v0.36.0.
# - `caddy_config:/config` — Caddy's autosave/ACME state. Missing
# it doesn't break HTTPS (we feed certs in via `/certs`) but
# loses the autosaved adapter config across recreates.
# Same caveat applies to any future `volumes: !override` block —
# diff against the base service before merging.
volumes: !override
- ./Caddyfile:/etc/caddy/Caddyfile:ro
- /data-state/certs:/certs:ro
- caddy_data:/data
- caddy_config:/config
- /data:/srv:ro