fix: Devin Review on #194 round 2 — 3 BUG-class findings
1. instance.yaml overlay path now matches read site under STATE_DIR.
Three sites updated:
- app/api/admin.py:1005 (server-config endpoint writer)
- app/api/admin.py:2610 (configure endpoint writer)
- app/instance_config.py:106 (overlay reader)
All three now go through _state_dir() so under flat-mount layout
(STATE_DIR=/data-state) the irreplaceable instance.yaml overlay
lands on the state disk (sdc) instead of the regenerable data
disk (sdb). Without this fix, .env_overlay correctly went to the
state disk while instance.yaml went to the data disk — config
would be lost if an operator wiped sdb.
2. Strip customer-specific tokens from OSS repo per CLAUDE.md
vendor-agnostic rule:
- docker-compose.host-mount.yml: 'a deployer (Groupon FoundryAI)'
→ 'a deployer in production'
- docker-compose.flat-mount.yml: 'caused 2026-05-05 in the
Groupon FoundryAI deployment' → generic 'production failure
mode'
- docs/state-dir.md: rewrote the incident reference to describe
the failure mode abstractly without naming the deployment;
updated the recommendation table to say 'shadow-mount class'
instead of dating the specific incident.
3. Updated docs/state-dir.md 'What reads STATE_DIR' to list all
read/write sites including the three migrated in this round
(admin.py, instance_config.py, marketplaces.py).
ANALYSIS finding (tls-rotate.sh hardcoded host-mount.yml) deferred
— same operator-side class as auto-upgrade.sh hardcoded host-mount,
documented limitation per the PR body.
This commit is contained in:
parent
b6543c9c55
commit
df2c33147c
5 changed files with 19 additions and 11 deletions
|
|
@ -1001,8 +1001,8 @@ async def update_server_config(
|
||||||
# atomic-write sequence; the audit log sits outside since it operates on
|
# atomic-write sequence; the audit log sits outside since it operates on
|
||||||
# local snapshots.
|
# local snapshots.
|
||||||
from app.instance_config import reset_cache
|
from app.instance_config import reset_cache
|
||||||
data_dir = Path(os.environ.get("DATA_DIR", "./data"))
|
from app.secrets import _state_dir
|
||||||
config_path = data_dir / "state" / "instance.yaml"
|
config_path = _state_dir() / "instance.yaml"
|
||||||
config_path.parent.mkdir(parents=True, exist_ok=True)
|
config_path.parent.mkdir(parents=True, exist_ok=True)
|
||||||
|
|
||||||
with _overlay_write_lock:
|
with _overlay_write_lock:
|
||||||
|
|
@ -2606,8 +2606,8 @@ async def configure_instance(
|
||||||
# — they don't belong in the overlay at all.
|
# — they don't belong in the overlay at all.
|
||||||
# 2. Patch only the sections this endpoint touches.
|
# 2. Patch only the sections this endpoint touches.
|
||||||
# 3. Write the narrow overlay back atomically (tmp + os.replace).
|
# 3. Write the narrow overlay back atomically (tmp + os.replace).
|
||||||
data_dir = Path(os.environ.get("DATA_DIR", "./data"))
|
from app.secrets import _state_dir
|
||||||
config_path = data_dir / "state" / "instance.yaml"
|
config_path = _state_dir() / "instance.yaml"
|
||||||
|
|
||||||
# Same serialization + corrupt-overlay handling as POST /server-config.
|
# Same serialization + corrupt-overlay handling as POST /server-config.
|
||||||
with _overlay_write_lock:
|
with _overlay_write_lock:
|
||||||
|
|
|
||||||
|
|
@ -102,8 +102,13 @@ def load_instance_config() -> dict:
|
||||||
# mirror the resolver here before the deep-merge — without it, the
|
# mirror the resolver here before the deep-merge — without it, the
|
||||||
# LLM factory receives the literal placeholder and rejects it as an
|
# LLM factory receives the literal placeholder and rejects it as an
|
||||||
# invalid api key (#179 review fix).
|
# invalid api key (#179 review fix).
|
||||||
data_dir = Path(os.environ.get("DATA_DIR", "./data"))
|
# Resolve via _state_dir() so the path matches the writer in
|
||||||
overlay_path = data_dir / "state" / "instance.yaml"
|
# app/api/admin.py — under the flat-mount layout (STATE_DIR=/data-state)
|
||||||
|
# both the configure-endpoint and the server-config-endpoint write
|
||||||
|
# ``/data-state/instance.yaml``; reading from ``/data/state/...`` here
|
||||||
|
# would silently load stale config from the regenerable data disk.
|
||||||
|
from app.secrets import _state_dir
|
||||||
|
overlay_path = _state_dir() / "instance.yaml"
|
||||||
if overlay_path.exists():
|
if overlay_path.exists():
|
||||||
try:
|
try:
|
||||||
overlay = yaml.safe_load(overlay_path.read_text()) or {}
|
overlay = yaml.safe_load(overlay_path.read_text()) or {}
|
||||||
|
|
|
||||||
|
|
@ -7,8 +7,8 @@
|
||||||
# the data disk). That layout works but has known fragility:
|
# the data disk). That layout works but has known fragility:
|
||||||
#
|
#
|
||||||
# - Bind-mount propagation matters. A non-recursive bind hides the
|
# - Bind-mount propagation matters. A non-recursive bind hides the
|
||||||
# nested mount, leading to silent shadow writes (the failure mode
|
# nested mount, leading to silent shadow writes — the production
|
||||||
# that caused 2026-05-05 in the Groupon FoundryAI deployment).
|
# failure mode that motivated this overlay.
|
||||||
#
|
#
|
||||||
# - Two writers, one tree. Host-side timers (tls-rotate.timer)
|
# - Two writers, one tree. Host-side timers (tls-rotate.timer)
|
||||||
# write to /data/state/certs as root, while the container app
|
# write to /data/state/certs as root, while the container app
|
||||||
|
|
|
||||||
|
|
@ -11,7 +11,7 @@
|
||||||
# new options to existing volumes — they keep whatever options were
|
# new options to existing volumes — they keep whatever options were
|
||||||
# in effect at create time.
|
# in effect at create time.
|
||||||
#
|
#
|
||||||
# This bit a deployer (Groupon FoundryAI) on 2026-05-05: the volume
|
# This bit a deployer in production: the volume
|
||||||
# was created before this overlay had `bind,rbind`, kept the old
|
# was created before this overlay had `bind,rbind`, kept the old
|
||||||
# `bind` (non-recursive) propagation, and containers wrote to a
|
# `bind` (non-recursive) propagation, and containers wrote to a
|
||||||
# shadowed subdirectory of the parent disk instead of the nested
|
# shadowed subdirectory of the parent disk instead of the nested
|
||||||
|
|
|
||||||
|
|
@ -25,7 +25,7 @@ sdc at /data/state (nested inside the data mount)
|
||||||
- Two writers (host's `tls-rotate.timer` running as root; container app running as uid 999) share `/data/state` with different mount-namespace views → ownership conflicts.
|
- Two writers (host's `tls-rotate.timer` running as root; container app running as uid 999) share `/data/state` with different mount-namespace views → ownership conflicts.
|
||||||
- Resizing sdb requires unmounting sdc first.
|
- Resizing sdb requires unmounting sdc first.
|
||||||
|
|
||||||
The 2026-05-05 incident in the Groupon FoundryAI deployment was a manifestation of the propagation gotcha. See PRs in this repo and the deployer's infra repo for full root-cause notes.
|
A production deployment hit this propagation gotcha: a volume was created with non-recursive `bind`, the file was later edited to `bind,rbind`, but Docker named-volume options are immutable after creation, so containers kept writing to a shadowed subdirectory of the parent disk. DuckDB went FATAL on a root-owned WAL during a routine container recreate; sign-in broke. Recovery required `docker volume rm` + per-VM data migration on every affected host.
|
||||||
|
|
||||||
## Layout B — flat
|
## Layout B — flat
|
||||||
|
|
||||||
|
|
@ -52,7 +52,7 @@ sdc at /data-state (state, irreplaceable — parallel to /data, not nested)
|
||||||
|---|---|
|
|---|---|
|
||||||
| Existing deployment, no plans to expand | stay on layout A |
|
| Existing deployment, no plans to expand | stay on layout A |
|
||||||
| New deployment | layout B (cleaner, no shadow class) |
|
| New deployment | layout B (cleaner, no shadow class) |
|
||||||
| Existing deployment hit by 2026-05-05 shadow class | migrate to layout B |
|
| Existing deployment hit by the shadow-mount class above | migrate to layout B |
|
||||||
| CI / local dev | neither (use ephemeral compose volumes) |
|
| CI / local dev | neither (use ephemeral compose volumes) |
|
||||||
|
|
||||||
## Migration A → B
|
## Migration A → B
|
||||||
|
|
@ -99,6 +99,9 @@ App code:
|
||||||
- `src/db.py::_get_state_dir()` — the canonical helper. Used by `get_system_db()` and the schema migration snapshot.
|
- `src/db.py::_get_state_dir()` — the canonical helper. Used by `get_system_db()` and the schema migration snapshot.
|
||||||
- `app/secrets.py::_state_dir()` — for `.session_secret`, `.jwt_secret`. Mirrors the helper since `app/` shouldn't import from `src/`.
|
- `app/secrets.py::_state_dir()` — for `.session_secret`, `.jwt_secret`. Mirrors the helper since `app/` shouldn't import from `src/`.
|
||||||
- `app/main.py` — for the `.env_overlay` startup file (loaded at process start).
|
- `app/main.py` — for the `.env_overlay` startup file (loaded at process start).
|
||||||
|
- `app/instance_config.py` — for the writable `instance.yaml` overlay (read at every config-load).
|
||||||
|
- `app/api/admin.py` — for the writable `instance.yaml` overlay (write site of `POST /api/admin/server-config` and `POST /api/admin/configure`) and for `.env_overlay` (write site of `POST /api/admin/configure`).
|
||||||
|
- `app/api/marketplaces.py` — for `.env_overlay` (write site of marketplace PAT persistence).
|
||||||
|
|
||||||
Host scripts:
|
Host scripts:
|
||||||
- `scripts/ops/agnes-auto-upgrade.sh` — mount-sanity check + cert detection.
|
- `scripts/ops/agnes-auto-upgrade.sh` — mount-sanity check + cert detection.
|
||||||
|
|
|
||||||
Loading…
Reference in a new issue