From 8b2f6620a8cd19f8c60e4633169759cad639929c Mon Sep 17 00:00:00 2001 From: Vojtech <119944107+cvrysanek@users.noreply.github.com> Date: Sun, 10 May 2026 23:02:30 +0400 Subject: [PATCH] fix(duckdb): CHECKPOINT on shutdown + 60s compose grace to prevent WAL corruption (#235) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * fix(duckdb): CHECKPOINT on shutdown + 60s compose grace to prevent WAL corruption Default 10s stop_grace_period + missing CHECKPOINT on shutdown produced a class of WAL-replay failures during agnes-auto-upgrade recreates. Sequence: 1. New image digest detected → docker compose up -d → SIGTERM to app 2. App's lifespan close_system_db() called .close() but never CHECKPOINT, so any uncheckpointed ops stayed in system.duckdb.wal 3. Container didn't exit within 10s → dockerd SIGKILL (verified in journal: "Container failed to exit within 10s of signal 15 - using the force") 4. New container started with possibly-different DuckDB version, replay hit "Failure while replaying WAL ... GetDefaultDatabase with no default database set" assertion → 500 on every authed request Observed on foundryai-dev-vrysanek 2026-05-05; recovered by removing the WAL manually. _try_open_system_db already exists as a recovery net but requires a system.duckdb.pre-migrate snapshot, which doesn't exist outside migration windows. Two-part prevention: - src/db.py::close_system_db: execute CHECKPOINT before .close() so the WAL is empty when the file is released. Best-effort (try/except), so a locked or full-disk CHECKPOINT does not block close. - docker-compose.yml: stop_grace_period: 60s on app + scheduler, gives uvicorn + lifespan room to run shutdown handlers under load before Docker's SIGKILL fires. Co-Authored-By: Claude Opus 4.7 (1M context) * log CHECKPOINT outcome on system DB close Silent except: pass on both CHECKPOINT and close() left operators without any signal when the WAL-flush safety net actually saved them (or didn't). Add logger.warning on CHECKPOINT failure (operator-actionable - recovery via _try_open_system_db kicks in next start) and debug-level trace on success / close exception. * drop customer-specific token + add CHANGELOG entries Per CLAUDE.md vendor-agnostic OSS rule: nothing customer-specific in shipped code/comments. Replaced "foundryai-dev-vrysanek 2026-05-05" references in docker-compose.yml and src/db.py docstring with generic "Docker image upgrade window where DuckDB versions differ" framing. The original incident date + host live in the commit history / PR body, not in the tree. Adds CHANGELOG entries under Unreleased: - Fixed: close_system_db CHECKPOINT-on-shutdown semantics + WAL-replay failure mode the fix protects against - Changed: docker-compose stop_grace_period 60s on app + scheduler --------- Co-authored-by: Claude Opus 4.7 (1M context) Co-authored-by: ZdenekSrotyr --- CHANGELOG.md | 8 ++++++++ docker-compose.yml | 13 +++++++++++++ src/db.py | 30 +++++++++++++++++++++++++++--- 3 files changed, 48 insertions(+), 3 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 3ed9580..236d732 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -542,6 +542,14 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C - Schema bumped v28 → v29 → v30. New tests: news repository (14), sanitizer (20), API (8), web (5), CLI (14) — 61 total — plus updated home/auth/template tests for the shared-shell architecture. CLAUDE.md "Run tests before every push" section codifies `pytest tests/ -n auto -q` as non-negotiable before each push. +### Fixed (system DB shutdown) + +- **`close_system_db()` now CHECKPOINTs before closing the system DB connection**, so the WAL flushes into `system.duckdb` and the file is left in a clean state across `docker compose up -d` recreate windows. Previously, a SIGKILL after the default 10s `stop_grace_period` could leave a populated `.wal` that the next process must replay on open; if the next image carried a different DuckDB version, replay could trip an internal assertion (`Failure while replaying WAL ... GetDefaultDatabase with no default database set`) and 500 every authed request until the WAL file was manually removed. CHECKPOINT is best-effort with operator-visible logging — `WARNING` on failure, `DEBUG` on success. + +### Changed (compose grace) + +- **`docker-compose.yml` `stop_grace_period: 60s`** on the `app` and `scheduler` services (was Docker's 10s default). Gives uvicorn time to drain in-flight requests + run the new shutdown CHECKPOINT before SIGKILL. Healthy `docker compose down` is unaffected (services still stop as soon as their lifespan exits). + ## [0.47.4] — 2026-05-08 ### Fixed diff --git a/docker-compose.yml b/docker-compose.yml index 4ea3a49..8e99783 100644 --- a/docker-compose.yml +++ b/docker-compose.yml @@ -38,6 +38,16 @@ services: mem_limit: 4g mem_reservation: 1g cpus: 2.0 + # Default 10s is too short for graceful uvicorn shutdown — under load, + # in-flight requests + DuckDB CHECKPOINT (see lifespan in app/main.py) + # need more headroom. SIGKILL mid-WAL-write produces a corrupt + # system.duckdb.wal that the next image's DuckDB version cannot replay + # ("Failure while replaying WAL ... GetDefaultDatabase with no default + # database set"), 500-ing every authed request until WAL is removed. + # Hits hardest during a Docker image upgrade window where the new + # image's DuckDB version differs from the old container's, since + # WAL replay across versions trips on internal assertions. + stop_grace_period: 60s # One-shot: run extractor then rebuild orchestrator views extract: @@ -78,6 +88,9 @@ services: restart: unless-stopped mem_limit: 2g cpus: 1.0 + # Match app service — scheduler holds DuckDB connections too; same + # WAL-corruption risk on SIGKILL during recreate. + stop_grace_period: 60s telegram-bot: build: . diff --git a/src/db.py b/src/db.py index 54ff4a3..054684b 100644 --- a/src/db.py +++ b/src/db.py @@ -2914,12 +2914,36 @@ def get_schema_version(conn: duckdb.DuckDBPyConnection) -> int: def close_system_db() -> None: - """Close the shared system DB connection. Called on app shutdown.""" + """Close the shared system DB connection. Called on app shutdown. + + CHECKPOINT before close so the WAL flushes into ``system.duckdb`` and + the file is left in a clean state. If we skip this and the process + later gets SIGKILL'd (e.g. Docker's default 10s stop_grace_period + expires during ``docker compose up -d`` recreate), DuckDB leaves a + populated ``.wal`` that the next process must replay on open. When + the next process is a different DuckDB version (image upgrade + window), replay can hit internal assertions like + ``Failure while replaying WAL ... GetDefaultDatabase with no default + database set`` and the app 500s on every authed request. + + CHECKPOINT is best-effort: if it raises (locked, disk full, etc.) + we still proceed to close — the recovery path in ``_try_open_system_db`` + plus the longer ``stop_grace_period`` in compose are the safety nets. + """ global _system_db_conn, _system_db_path if _system_db_conn: + try: + _system_db_conn.execute("CHECKPOINT") + logger.debug("close_system_db: CHECKPOINT ok") + except Exception as exc: + # Log + proceed — CHECKPOINT failure is not fatal (recovery path + # in _try_open_system_db handles a dirty WAL on next open), but + # we want operators to see WHY the safety net was needed if a + # WAL-replay failure does surface later. + logger.warning("close_system_db: CHECKPOINT failed (%s); proceeding to close", exc) try: _system_db_conn.close() - except Exception: - pass + except Exception as exc: + logger.debug("close_system_db: close raised (%s); ignoring", exc) _system_db_conn = None _system_db_path = None