fix(duckdb): CHECKPOINT on shutdown + 60s compose grace to prevent WAL corruption (#235)

* fix(duckdb): CHECKPOINT on shutdown + 60s compose grace to prevent WAL corruption Default 10s stop_grace_period + missing CHECKPOINT on shutdown produced a class of WAL-replay failures during agnes-auto-upgrade recreates. Sequence: 1. New image digest detected → docker compose up -d → SIGTERM to app 2. App's lifespan close_system_db() called .close() but never CHECKPOINT, so any uncheckpointed ops stayed in system.duckdb.wal 3. Container didn't exit within 10s → dockerd SIGKILL (verified in journal: "Container failed to exit within 10s of signal 15 - using the force") 4. New container started with possibly-different DuckDB version, replay hit "Failure while replaying WAL ... GetDefaultDatabase with no default database set" assertion → 500 on every authed request Observed on foundryai-dev-vrysanek 2026-05-05; recovered by removing the WAL manually. _try_open_system_db already exists as a recovery net but requires a system.duckdb.pre-migrate snapshot, which doesn't exist outside migration windows. Two-part prevention: - src/db.py::close_system_db: execute CHECKPOINT before .close() so the WAL is empty when the file is released. Best-effort (try/except), so a locked or full-disk CHECKPOINT does not block close. - docker-compose.yml: stop_grace_period: 60s on app + scheduler, gives uvicorn + lifespan room to run shutdown handlers under load before Docker's SIGKILL fires. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * log CHECKPOINT outcome on system DB close Silent except: pass on both CHECKPOINT and close() left operators without any signal when the WAL-flush safety net actually saved them (or didn't). Add logger.warning on CHECKPOINT failure (operator-actionable - recovery via _try_open_system_db kicks in next start) and debug-level trace on success / close exception. * drop customer-specific token + add CHANGELOG entries Per CLAUDE.md vendor-agnostic OSS rule: nothing customer-specific in shipped code/comments. Replaced "foundryai-dev-vrysanek 2026-05-05" references in docker-compose.yml and src/db.py docstring with generic "Docker image upgrade window where DuckDB versions differ" framing. The original incident date + host live in the commit history / PR body, not in the tree. Adds CHANGELOG entries under Unreleased: - Fixed: close_system_db CHECKPOINT-on-shutdown semantics + WAL-replay failure mode the fix protects against - Changed: docker-compose stop_grace_period 60s on app + scheduler --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
2026-05-10 23:02:30 +04:00 · 2026-05-10 23:02:30 +04:00 · 8b2f6620a8
commit 8b2f6620a8
parent 41829e8a45
3 changed files with 48 additions and 3 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@ -542,6 +542,14 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C

 - Schema bumped v28 → v29 → v30. New tests: news repository (14), sanitizer (20), API (8), web (5), CLI (14) — 61 total — plus updated home/auth/template tests for the shared-shell architecture. CLAUDE.md "Run tests before every push" section codifies `pytest tests/ -n auto -q` as non-negotiable before each push.

+### Fixed (system DB shutdown)
+
+- **`close_system_db()` now CHECKPOINTs before closing the system DB connection**, so the WAL flushes into `system.duckdb` and the file is left in a clean state across `docker compose up -d` recreate windows. Previously, a SIGKILL after the default 10s `stop_grace_period` could leave a populated `.wal` that the next process must replay on open; if the next image carried a different DuckDB version, replay could trip an internal assertion (`Failure while replaying WAL ... GetDefaultDatabase with no default database set`) and 500 every authed request until the WAL file was manually removed. CHECKPOINT is best-effort with operator-visible logging — `WARNING` on failure, `DEBUG` on success.
+
+### Changed (compose grace)
+
+- **`docker-compose.yml` `stop_grace_period: 60s`** on the `app` and `scheduler` services (was Docker's 10s default). Gives uvicorn time to drain in-flight requests + run the new shutdown CHECKPOINT before SIGKILL. Healthy `docker compose down` is unaffected (services still stop as soon as their lifespan exits).
+
 ## [0.47.4] — 2026-05-08

 ### Fixed
--- a/docker-compose.yml
+++ b/docker-compose.yml
@ -38,6 +38,16 @@ services:
    mem_limit: 4g
    mem_reservation: 1g
    cpus: 2.0
+    # Default 10s is too short for graceful uvicorn shutdown — under load,
+    # in-flight requests + DuckDB CHECKPOINT (see lifespan in app/main.py)
+    # need more headroom. SIGKILL mid-WAL-write produces a corrupt
+    # system.duckdb.wal that the next image's DuckDB version cannot replay
+    # ("Failure while replaying WAL ... GetDefaultDatabase with no default
+    # database set"), 500-ing every authed request until WAL is removed.
+    # Hits hardest during a Docker image upgrade window where the new
+    # image's DuckDB version differs from the old container's, since
+    # WAL replay across versions trips on internal assertions.
+    stop_grace_period: 60s

  # One-shot: run extractor then rebuild orchestrator views
  extract:
@ -78,6 +88,9 @@ services:
    restart: unless-stopped
    mem_limit: 2g
    cpus: 1.0
+    # Match app service — scheduler holds DuckDB connections too; same
+    # WAL-corruption risk on SIGKILL during recreate.
+    stop_grace_period: 60s

  telegram-bot:
    build: .
--- a/src/db.py
+++ b/src/db.py
@ -2914,12 +2914,36 @@ def get_schema_version(conn: duckdb.DuckDBPyConnection) -> int:


 def close_system_db() -> None:
-    """Close the shared system DB connection. Called on app shutdown."""
+    """Close the shared system DB connection. Called on app shutdown.
+
+    CHECKPOINT before close so the WAL flushes into ``system.duckdb`` and
+    the file is left in a clean state. If we skip this and the process
+    later gets SIGKILL'd (e.g. Docker's default 10s stop_grace_period
+    expires during ``docker compose up -d`` recreate), DuckDB leaves a
+    populated ``.wal`` that the next process must replay on open. When
+    the next process is a different DuckDB version (image upgrade
+    window), replay can hit internal assertions like
+    ``Failure while replaying WAL ... GetDefaultDatabase with no default
+    database set`` and the app 500s on every authed request.
+
+    CHECKPOINT is best-effort: if it raises (locked, disk full, etc.)
+    we still proceed to close — the recovery path in ``_try_open_system_db``
+    plus the longer ``stop_grace_period`` in compose are the safety nets.
+    """
    global _system_db_conn, _system_db_path
    if _system_db_conn:
+        try:
+            _system_db_conn.execute("CHECKPOINT")
+            logger.debug("close_system_db: CHECKPOINT ok")
+        except Exception as exc:
+            # Log + proceed — CHECKPOINT failure is not fatal (recovery path
+            # in _try_open_system_db handles a dirty WAL on next open), but
+            # we want operators to see WHY the safety net was needed if a
+            # WAL-replay failure does surface later.
+            logger.warning("close_system_db: CHECKPOINT failed (%s); proceeding to close", exc)
        try:
            _system_db_conn.close()
-        except Exception:
-            pass
+        except Exception as exc:
+            logger.debug("close_system_db: close raised (%s); ignoring", exc)
        _system_db_conn = None
        _system_db_path = None