Introduces STATE_DIR as the single source of truth for the writable
state directory path, with a backward-compatible default of
${DATA_DIR}/state. Pairs with a new docker-compose.flat-mount.yml
overlay that mounts the state disk in PARALLEL to the data disk
(rather than nested under it).

Why
---
The default deployment topology nests state under data: sdb at /data,
sdc at /data/state. That layout has known fragility documented in
docs/state-dir.md: bind-propagation gotchas, two-writer collisions
on the same prefix, and mount-order coupling. The 2026-05-05 incident in
the Groupon FoundryAI deployment was a manifestation of the
propagation gotcha.

The flat layout (sdb at /data, sdc at /data-state; parallel, not
nested) eliminates the nested-mount failure class entirely. Each disk is
its own bind mount, recursive by default in modern Docker. No volume
options to forget. No two-writer collision (host scripts and the
container app share /data-state at the same path, single namespace).

What changes
------------
App code (Python):
- src/db.py: new _get_state_dir() helper. get_system_db() and
schema migration snapshot use it.
- app/secrets.py: new _state_dir() helper. _load_or_generate() uses
it for .session_secret and .jwt_secret.
- app/main.py: .env_overlay loaded from _state_dir().
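
The resolution rule shared by these helpers can be sketched roughly as follows. This is an illustration, not the code in src/db.py or app/secrets.py: the names `resolve_state_dir` and `state_paths` are hypothetical, and the `/data` fallback for an unset DATA_DIR is an assumption. Only STATE_DIR, the `${DATA_DIR}/state` default, and the file names come from this PR.

```python
import os
from pathlib import Path
from typing import Dict, Mapping, Optional


def resolve_state_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return STATE_DIR if set and non-empty, else ${DATA_DIR}/state."""
    env = os.environ if env is None else env
    explicit = env.get("STATE_DIR", "").strip()
    if explicit:
        return Path(explicit)
    # Assumption: DATA_DIR falls back to /data when it is unset or empty.
    data_dir = env.get("DATA_DIR", "").strip() or "/data"
    return Path(data_dir) / "state"


def state_paths(env: Optional[Mapping[str, str]] = None) -> Dict[str, Path]:
    """Hypothetical helper: derive every state file from one resolved root."""
    state = resolve_state_dir(env)
    return {
        "system_db": state / "system.duckdb",
        "session_secret": state / ".session_secret",
        "jwt_secret": state / ".jwt_secret",
        "env_overlay": state / ".env_overlay",
    }
```

Passing the environment as a mapping keeps the sketch easy to exercise without mutating `os.environ`. For example, `resolve_state_dir({"STATE_DIR": "/data-state"})` selects the flat layout, while `resolve_state_dir({})` keeps the nested default.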

Host scripts:
- scripts/ops/agnes-auto-upgrade.sh: STATE_DIR drives mount-sanity
check and cert detection. Defaults preserve existing behavior.
- scripts/ops/agnes-tls-rotate.sh: STATE_DIR drives CERT_DIR.
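
The idea behind a mount-sanity check can be illustrated in Python. The actual check in agnes-auto-upgrade.sh is a shell script and is not shown in this description, so the function names and sample mount tables below are assumptions. The sketch only demonstrates the principle: the resolved state directory should appear as its own mount point, otherwise writes land on the parent filesystem.

```python
from typing import Set

# Sample /proc/self/mountinfo-style records (illustrative data, not real hosts).
NESTED = """\
24 1 8:1 / / rw,relatime shared:1 - ext4 /dev/sda1 rw
30 24 8:16 / /data rw,relatime shared:2 - ext4 /dev/sdb rw
31 30 8:32 / /data/state rw,relatime shared:3 - ext4 /dev/sdc rw
"""

FLAT = """\
24 1 8:1 / / rw,relatime shared:1 - ext4 /dev/sda1 rw
30 24 8:16 / /data rw,relatime shared:2 - ext4 /dev/sdb rw
31 24 8:32 / /data-state rw,relatime shared:3 - ext4 /dev/sdc rw
"""


def mount_points(mountinfo_text: str) -> Set[str]:
    """Collect mount points from mountinfo-style text."""
    points = set()
    for line in mountinfo_text.splitlines():
        fields = line.split()
        # The fifth field (index 4) of a mountinfo record is the mount point.
        if len(fields) > 4:
            points.add(fields[4])
    return points


def state_dir_is_own_mount(state_dir: str, mountinfo_text: str) -> bool:
    """True when state_dir itself is a mount point, not just a directory."""
    normalized = state_dir.rstrip("/") or "/"
    return normalized in mount_points(mountinfo_text)
```

On a live Linux host the text would come from reading `/proc/self/mountinfo`. Mount points containing spaces are octal-escaped there, which this sketch does not handle.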

New compose overlay:
- docker-compose.flat-mount.yml: parallel /data and /data-state binds
per service. Mutually exclusive with docker-compose.host-mount.yml;
pick one based on disk topology.
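
One way to keep the two overlays from being combined is to derive the compose command from a single layout choice. The sketch below is hypothetical tooling, not part of this PR. The file names are taken from this description, but the `nested`/`flat` keys and the assumption that the host-mount overlay uses the same two base files are illustrative.

```python
from typing import List

BASE_FILES = ["docker-compose.yml", "docker-compose.prod.yml"]

# Exactly one overlay per layout, so both can never be passed together.
OVERLAYS = {
    "nested": "docker-compose.host-mount.yml",  # Layout A
    "flat": "docker-compose.flat-mount.yml",    # Layout B
}


def compose_command(layout: str) -> List[str]:
    """Build the `docker compose ... up -d` argument list for one layout."""
    if layout not in OVERLAYS:
        raise ValueError(f"unknown layout {layout!r}; expected one of {sorted(OVERLAYS)}")
    cmd = ["docker", "compose"]
    for compose_file in BASE_FILES + [OVERLAYS[layout]]:
        cmd += ["-f", compose_file]
    return cmd + ["up", "-d"]
```

The returned list can be handed to `subprocess.run` as-is. Because the overlay is chosen by key, there is no code path that emits both overlay files.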

Documentation:
- docs/state-dir.md: layout choice (A nested vs B flat), pros/cons,
migration steps, and which code paths read STATE_DIR.

Backward compatibility
----------------------
STATE_DIR defaults to ${DATA_DIR}/state, which is the current behavior.
Existing deployments that don't set the variable see no behavior change.
Migration to the flat layout is opt-in, per the runbook in
docs/state-dir.md.

Validation
----------
- bash -n on both host scripts: pass
- docker compose -f docker-compose.flat-mount.yml config: resolves
cleanly for all 6 services; app, extract, scheduler, telegram-bot, and
ws-gateway bind /data and /data-state directly, and caddy binds
/data-state/certs
- python3 import + helper exercise: STATE_DIR override works,
default falls back to ${DATA_DIR}/state

Companion to PR #191 (drop named-volume driver_opts in host-mount.yml).
That PR fixes the immutability footgun for Layout A; this PR offers
Layout B as the architectural alternative.
88 lines
2.7 KiB
YAML
# Flat-mount overlay: parallel host binds for /data and /data-state.
#
# Why this overlay
# ----------------
# The default deployment topology nests state under data: sdb at /data,
# sdc at /data/state (i.e. /data/state is a separate disk mounted INSIDE
# the data disk). That layout works but has known fragility:
#
#   - Bind-mount propagation matters. A non-recursive bind hides the
#     nested mount, leading to silent shadow writes (the failure mode
#     that caused 2026-05-05 in the Groupon FoundryAI deployment).
#
#   - Two writers, one tree. Host-side timers (tls-rotate.timer)
#     write to /data/state/certs as root, while the container app
#     writes to /data/state/system.duckdb as uid 999. Same prefix,
#     different mount-namespace views = ownership conflicts.
#
#   - sdb resize requires umounting sdc first. Mount-order coupling.
#
# This overlay removes the nesting by mounting the state disk in
# PARALLEL to the data disk:
#
#   sdb at /data        (analytics, regenerable)
#   sdc at /data-state  (DuckDB, secrets, certs; irreplaceable)
#
# Both are direct service-level binds, recursive by default in modern
# Docker Engine. No volume options to forget. No nested propagation.
# No two-writer collision (app uses /data-state, host scripts also use
# /data-state: same path, single namespace).
#
# Usage
# -----
# 1. On the operator's host: mount the config disk at /data-state
#    (instead of /data/state). Update fstab. Move existing state
#    contents from /data/state to /data-state.
#
# 2. In /opt/agnes/.env, set STATE_DIR=/data-state. The app's secrets
#    module + DuckDB code, plus the host-side rotate.sh and
#    auto-upgrade.sh scripts, all read this var.
#
# 3. Compose invocation:
#
#      docker compose \
#        -f docker-compose.yml \
#        -f docker-compose.prod.yml \
#        -f docker-compose.flat-mount.yml \
#        up -d
#
# Note: this overlay is mutually exclusive with docker-compose.host-mount.yml.
# Pick one based on your disk topology.
#
# Do NOT use this overlay in CI: /data and /data-state do not exist
# on GitHub runners.

services:
  app:
    volumes: !override
      - /data:/data
      - /data-state:/data-state
      - ./config:/app/config:ro

  extract:
    volumes: !override
      - /data:/data
      - /data-state:/data-state
      - ./config:/app/config:ro

  scheduler:
    volumes: !override
      - /data:/data
      - /data-state:/data-state
      - ./config:/app/config:ro

  telegram-bot:
    volumes: !override
      - /data:/data
      - /data-state:/data-state

  ws-gateway:
    volumes: !override
      - /data:/data
      - /data-state:/data-state

  caddy:
    volumes: !override
      - ./Caddyfile:/etc/caddy/Caddyfile:ro
      - /data-state/certs:/certs:ro
      - caddy_data:/data