fix(duckdb): CHECKPOINT on shutdown + 60s compose grace to prevent WAL corruption (#235)
* fix(duckdb): CHECKPOINT on shutdown + 60s compose grace to prevent WAL corruption Default 10s stop_grace_period + missing CHECKPOINT on shutdown produced a class of WAL-replay failures during agnes-auto-upgrade recreates. Sequence: 1. New image digest detected → docker compose up -d → SIGTERM to app 2. App's lifespan close_system_db() called .close() but never CHECKPOINT, so any uncheckpointed ops stayed in system.duckdb.wal 3. Container didn't exit within 10s → dockerd SIGKILL (verified in journal: "Container failed to exit within 10s of signal 15 - using the force") 4. New container started with possibly-different DuckDB version, replay hit "Failure while replaying WAL ... GetDefaultDatabase with no default database set" assertion → 500 on every authed request Observed on foundryai-dev-vrysanek 2026-05-05; recovered by removing the WAL manually. _try_open_system_db already exists as a recovery net but requires a system.duckdb.pre-migrate snapshot, which doesn't exist outside migration windows. Two-part prevention: - src/db.py::close_system_db: execute CHECKPOINT before .close() so the WAL is empty when the file is released. Best-effort (try/except), so a locked or full-disk CHECKPOINT does not block close. - docker-compose.yml: stop_grace_period: 60s on app + scheduler, gives uvicorn + lifespan room to run shutdown handlers under load before Docker's SIGKILL fires. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * log CHECKPOINT outcome on system DB close Silent except: pass on both CHECKPOINT and close() left operators without any signal when the WAL-flush safety net actually saved them (or didn't). Add logger.warning on CHECKPOINT failure (operator-actionable - recovery via _try_open_system_db kicks in next start) and debug-level trace on success / close exception. * drop customer-specific token + add CHANGELOG entries Per CLAUDE.md vendor-agnostic OSS rule: nothing customer-specific in shipped code/comments. Replaced "foundryai-dev-vrysanek 2026-05-05" references in docker-compose.yml and src/db.py docstring with generic "Docker image upgrade window where DuckDB versions differ" framing. The original incident date + host live in the commit history / PR body, not in the tree. Adds CHANGELOG entries under Unreleased: - Fixed: close_system_db CHECKPOINT-on-shutdown semantics + WAL-replay failure mode the fix protects against - Changed: docker-compose stop_grace_period 60s on app + scheduler --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
This commit is contained in:
parent
41829e8a45
commit
8b2f6620a8
3 changed files with 48 additions and 3 deletions
|
|
@ -542,6 +542,14 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C
|
|||
|
||||
- Schema bumped v28 → v29 → v30. New tests: news repository (14), sanitizer (20), API (8), web (5), CLI (14) — 61 total — plus updated home/auth/template tests for the shared-shell architecture. CLAUDE.md "Run tests before every push" section codifies `pytest tests/ -n auto -q` as non-negotiable before each push.
|
||||
|
||||
### Fixed (system DB shutdown)
|
||||
|
||||
- **`close_system_db()` now CHECKPOINTs before closing the system DB connection**, so the WAL flushes into `system.duckdb` and the file is left in a clean state across `docker compose up -d` recreate windows. Previously, a SIGKILL after the default 10s `stop_grace_period` could leave a populated `.wal` that the next process must replay on open; if the next image carried a different DuckDB version, replay could trip an internal assertion (`Failure while replaying WAL ... GetDefaultDatabase with no default database set`) and 500 every authed request until the WAL file was manually removed. CHECKPOINT is best-effort with operator-visible logging — `WARNING` on failure, `DEBUG` on success.
|
||||
|
||||
### Changed (compose grace)
|
||||
|
||||
- **`docker-compose.yml` `stop_grace_period: 60s`** on the `app` and `scheduler` services (was Docker's 10s default). Gives uvicorn time to drain in-flight requests + run the new shutdown CHECKPOINT before SIGKILL. Healthy `docker compose down` is unaffected (services still stop as soon as their lifespan exits).
|
||||
|
||||
## [0.47.4] — 2026-05-08
|
||||
|
||||
### Fixed
|
||||
|
|
|
|||
|
|
@ -38,6 +38,16 @@ services:
|
|||
mem_limit: 4g
|
||||
mem_reservation: 1g
|
||||
cpus: 2.0
|
||||
# Default 10s is too short for graceful uvicorn shutdown — under load,
|
||||
# in-flight requests + DuckDB CHECKPOINT (see lifespan in app/main.py)
|
||||
# need more headroom. SIGKILL mid-WAL-write produces a corrupt
|
||||
# system.duckdb.wal that the next image's DuckDB version cannot replay
|
||||
# ("Failure while replaying WAL ... GetDefaultDatabase with no default
|
||||
# database set"), 500-ing every authed request until WAL is removed.
|
||||
# Hits hardest during a Docker image upgrade window where the new
|
||||
# image's DuckDB version differs from the old container's, since
|
||||
# WAL replay across versions trips on internal assertions.
|
||||
stop_grace_period: 60s
|
||||
|
||||
# One-shot: run extractor then rebuild orchestrator views
|
||||
extract:
|
||||
|
|
@ -78,6 +88,9 @@ services:
|
|||
restart: unless-stopped
|
||||
mem_limit: 2g
|
||||
cpus: 1.0
|
||||
# Match app service — scheduler holds DuckDB connections too; same
|
||||
# WAL-corruption risk on SIGKILL during recreate.
|
||||
stop_grace_period: 60s
|
||||
|
||||
telegram-bot:
|
||||
build: .
|
||||
|
|
|
|||
30
src/db.py
30
src/db.py
|
|
@ -2914,12 +2914,36 @@ def get_schema_version(conn: duckdb.DuckDBPyConnection) -> int:
|
|||
|
||||
|
||||
def close_system_db() -> None:
|
||||
"""Close the shared system DB connection. Called on app shutdown."""
|
||||
"""Close the shared system DB connection. Called on app shutdown.
|
||||
|
||||
CHECKPOINT before close so the WAL flushes into ``system.duckdb`` and
|
||||
the file is left in a clean state. If we skip this and the process
|
||||
later gets SIGKILL'd (e.g. Docker's default 10s stop_grace_period
|
||||
expires during ``docker compose up -d`` recreate), DuckDB leaves a
|
||||
populated ``.wal`` that the next process must replay on open. When
|
||||
the next process is a different DuckDB version (image upgrade
|
||||
window), replay can hit internal assertions like
|
||||
``Failure while replaying WAL ... GetDefaultDatabase with no default
|
||||
database set`` and the app 500s on every authed request.
|
||||
|
||||
CHECKPOINT is best-effort: if it raises (locked, disk full, etc.)
|
||||
we still proceed to close — the recovery path in ``_try_open_system_db``
|
||||
plus the longer ``stop_grace_period`` in compose are the safety nets.
|
||||
"""
|
||||
global _system_db_conn, _system_db_path
|
||||
if _system_db_conn:
|
||||
try:
|
||||
_system_db_conn.execute("CHECKPOINT")
|
||||
logger.debug("close_system_db: CHECKPOINT ok")
|
||||
except Exception as exc:
|
||||
# Log + proceed — CHECKPOINT failure is not fatal (recovery path
|
||||
# in _try_open_system_db handles a dirty WAL on next open), but
|
||||
# we want operators to see WHY the safety net was needed if a
|
||||
# WAL-replay failure does surface later.
|
||||
logger.warning("close_system_db: CHECKPOINT failed (%s); proceeding to close", exc)
|
||||
try:
|
||||
_system_db_conn.close()
|
||||
except Exception:
|
||||
pass
|
||||
except Exception as exc:
|
||||
logger.debug("close_system_db: close raised (%s); ignoring", exc)
|
||||
_system_db_conn = None
|
||||
_system_db_path = None
|
||||
|
|
|
|||
Loading…
Reference in a new issue