1. instance.yaml overlay path now matches read site under STATE_DIR.
   Three sites updated:
     - app/api/admin.py:1005 (server-config endpoint writer)
     - app/api/admin.py:2610 (configure endpoint writer)
     - app/instance_config.py:106 (overlay reader)
   All three now go through _state_dir() so under flat-mount layout
   (STATE_DIR=/data-state) the irreplaceable instance.yaml overlay
   lands on the state disk (sdc) instead of the regenerable data
   disk (sdb). Without this fix, .env_overlay correctly went to the
   state disk while instance.yaml went to the data disk, so config
   would be lost if an operator wiped sdb.

2. Strip customer-specific tokens from OSS repo per CLAUDE.md
   vendor-agnostic rule:
     - docker-compose.host-mount.yml: 'a deployer (Groupon FoundryAI)'
       → 'a deployer in production'
     - docker-compose.flat-mount.yml: 'caused 2026-05-05 in the
       Groupon FoundryAI deployment' → generic 'production failure
       mode'
     - docs/state-dir.md: rewrote the incident reference to describe
       the failure mode abstractly without naming the deployment;
       updated the recommendation table to say 'shadow-mount class'
       instead of dating the specific incident.

3. Updated docs/state-dir.md 'What reads STATE_DIR' to list all
   read/write sites including the three migrated in this round
   (admin.py, instance_config.py, marketplaces.py).

Deferred: the ANALYSIS finding that tls-rotate.sh hardcodes
host-mount.yml. It falls in the same operator-side class as the
auto-upgrade.sh hardcoded host-mount and is a documented limitation
per the PR body.
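
The routing described in item 1 can be sketched as follows. This is a minimal illustration, not the code in app/api/admin.py or app/instance_config.py: the function names `state_dir`, `instance_overlay_path`, `write_overlay` and `read_overlay` are hypothetical stand-ins for `_state_dir()` and its callers, and the `/data/state` fallback is an assumption based on the nested default layout described in the overlay header.

```python
import os
from pathlib import Path

# Assumed fallback (nested layout); the app's real default is not shown in the source.
DEFAULT_STATE_DIR = "/data/state"


def state_dir(env=None) -> Path:
    """Return the state directory, preferring the STATE_DIR variable."""
    env = os.environ if env is None else env
    return Path(env.get("STATE_DIR") or DEFAULT_STATE_DIR)


def instance_overlay_path(env=None) -> Path:
    """Single source of truth for where the instance.yaml overlay lives."""
    return state_dir(env) / "instance.yaml"


def write_overlay(text: str, env=None) -> Path:
    """Writer side: both endpoint writers would resolve the same path."""
    path = instance_overlay_path(env)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path


def read_overlay(env=None):
    """Reader side: returns None when no overlay has been written yet."""
    path = instance_overlay_path(env)
    return path.read_text(encoding="utf-8") if path.exists() else None
```

Because the writers and the reader share one path helper, changing STATE_DIR moves all of them together, which is the property the fix restores.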
103 lines
3.6 KiB
YAML
# Flat-mount overlay — parallel host binds for /data and /data-state.
#
# Why this overlay
# ----------------
# The default deployment topology nests state under data: sdb at /data,
# sdc at /data/state (i.e. /data/state is a separate disk mounted INSIDE
# the data disk). That layout works but has known fragility:
#
#   - Bind-mount propagation matters. A non-recursive bind hides the
#     nested mount, leading to silent shadow writes — the production
#     failure mode that motivated this overlay.
#
#   - Two writers, one tree. Host-side timers (tls-rotate.timer)
#     write to /data/state/certs as root, while the container app
#     writes to /data/state/system.duckdb as uid 999. Same prefix,
#     different mount-namespace views = ownership conflicts.
#
#   - sdb resize requires umounting sdc first. Mount-order coupling.
#
# This overlay removes the nesting by mounting the state disk in
# PARALLEL to the data disk:
#
#   sdb at /data        (analytics, regenerable)
#   sdc at /data-state  (DuckDB, secrets, certs — irreplaceable)
#
# Both are direct service-level binds, recursive by default in modern
# Docker Engine. No volume options to forget. No nested propagation.
# No two-writer collision (app uses /data-state, host scripts also use
# /data-state — same path, single namespace).
#
# Usage
# -----
#   1. On the operator's host: mount the config disk at /data-state
#      (instead of /data/state). Update fstab. Move existing state
#      contents from /data/state to /data-state.
#
#   2. In /opt/agnes/.env, set STATE_DIR=/data-state. The app's secrets
#      module + DuckDB code, plus the host-side rotate.sh and
#      auto-upgrade.sh scripts, all read this var.
#
#   3. Compose invocation:
#
#        docker compose \
#          -f docker-compose.yml \
#          -f docker-compose.prod.yml \
#          -f docker-compose.flat-mount.yml \
#          up -d
#
# Note: this overlay is mutually exclusive with docker-compose.host-mount.yml.
# Pick one based on your disk topology.
#
# Do NOT use this overlay in CI — /data and /data-state do not exist
# on GitHub runners.

services:
  app:
    volumes: !override
      - /data:/data
      - /data-state:/data-state
      - ./config:/app/config:ro

  extract:
    volumes: !override
      - /data:/data
      - /data-state:/data-state
      - ./config:/app/config:ro

  scheduler:
    volumes: !override
      - /data:/data
      - /data-state:/data-state
      - ./config:/app/config:ro

  telegram-bot:
    volumes: !override
      - /data:/data
      - /data-state:/data-state

  ws-gateway:
    volumes: !override
      - /data:/data
      - /data-state:/data-state

  caddy:
    # `!override` replaces the entire base volumes list, so every mount
    # the base service depends on must be re-stated here. Two of those
    # are easy to miss and silently regress functionality:
    #   - `data:/srv:ro` — Caddy's read-only view of the agnes data dir
    #     used by the `@download` `file_server` handler in Caddyfile.
    #     Without it, `try_files /bigquery/data/<id>.parquet …` finds no
    #     file and every parquet download falls through to the app's
    #     uvicorn worker — defeating the perf bypass landed in v0.36.0.
    #   - `caddy_config:/config` — Caddy's autosave/ACME state. Missing
    #     it doesn't break HTTPS (we feed certs in via `/certs`) but
    #     loses the autosaved adapter config across recreates.
    # Same caveat applies to any future `volumes: !override` block —
    # diff against the base service before merging.
    volumes: !override
      - ./Caddyfile:/etc/caddy/Caddyfile:ro
      - /data-state/certs:/certs:ro
      - caddy_data:/data
      - caddy_config:/config
      - /data:/srv:ro
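
The caveat in the caddy block (a `volumes: !override` list replaces the base list wholesale, so a forgotten mount disappears silently) can be checked mechanically. The sketch below is one possible way to diff an override against its base service by container path. It assumes short-syntax volume entries (`SOURCE:TARGET[:MODE]`), and the helper names `mount_target` and `dropped_targets` are hypothetical, as is the base list in the usage example: the base compose file is not shown here, so only the `data:/srv:ro` and `caddy_config:/config` entries are taken from the comment above.

```python
def mount_target(spec: str) -> str:
    """Return the container path of a short-syntax volume entry."""
    parts = spec.split(":")
    # Short syntax is SOURCE:TARGET[:MODE]; a bare entry is just TARGET.
    return parts[1] if len(parts) >= 2 else parts[0]


def dropped_targets(base, override):
    """Container paths mounted by the base service but absent from the override."""
    kept = {mount_target(v) for v in override}
    return sorted({mount_target(v) for v in base} - kept)


# Hypothetical base list for the caddy service (illustration only).
base_caddy = [
    "./Caddyfile:/etc/caddy/Caddyfile:ro",
    "caddy_data:/data",
    "caddy_config:/config",
    "data:/srv:ro",
]

# The override from this file re-states every target, so nothing is dropped.
flat_mount_caddy = [
    "./Caddyfile:/etc/caddy/Caddyfile:ro",
    "/data-state/certs:/certs:ro",
    "caddy_data:/data",
    "caddy_config:/config",
    "/data:/srv:ro",
]

# An override that forgets the two easy-to-miss mounts.
incomplete_caddy = [
    "./Caddyfile:/etc/caddy/Caddyfile:ro",
    "/data-state/certs:/certs:ro",
    "caddy_data:/data",
]

print(dropped_targets(base_caddy, flat_mount_caddy))
print(dropped_targets(base_caddy, incomplete_caddy))
```

Comparing by container path rather than by full entry is deliberate: the override is expected to change the source side (for example `data` to `/data`) while keeping the target the service relies on.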