From 8b2f6620a8cd19f8c60e4633169759cad639929c Mon Sep 17 00:00:00 2001
From: Vojtech <119944107+cvrysanek@users.noreply.github.com>
Date: Sun, 10 May 2026 23:02:30 +0400
Subject: [PATCH] fix(duckdb): CHECKPOINT on shutdown + 60s compose grace to
 prevent WAL corruption (#235)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* fix(duckdb): CHECKPOINT on shutdown + 60s compose grace to prevent WAL corruption

Default 10s stop_grace_period + missing CHECKPOINT on shutdown produced a
class of WAL-replay failures during agnes-auto-upgrade recreates. Sequence:

1. New image digest detected → docker compose up -d → SIGTERM to app
2. App's lifespan close_system_db() called .close() but never CHECKPOINT,
   so any uncheckpointed ops stayed in system.duckdb.wal
3. Container didn't exit within 10s → dockerd SIGKILL (verified in journal:
   "Container failed to exit within 10s of signal 15 - using the force")
4. New container started with possibly-different DuckDB version, replay
   hit "Failure while replaying WAL ... GetDefaultDatabase with no
   default database set" assertion → 500 on every authed request

Observed on foundryai-dev-vrysanek 2026-05-05; recovered by removing the
WAL manually. _try_open_system_db already exists as a recovery net but
requires a system.duckdb.pre-migrate snapshot, which doesn't exist
outside migration windows.

Two-part prevention:

- src/db.py::close_system_db: execute CHECKPOINT before .close() so the
  WAL is empty when the file is released. Best-effort (try/except), so a
  locked or full-disk CHECKPOINT does not block close.

- docker-compose.yml: stop_grace_period: 60s on app + scheduler, gives
  uvicorn + lifespan room to run shutdown handlers under load before
  Docker's SIGKILL fires.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* log CHECKPOINT outcome on system DB close

Silent except: pass on both CHECKPOINT and close() left operators
without any signal when the WAL-flush safety net actually saved them
(or didn't). Add logger.warning on CHECKPOINT failure (operator-actionable
- recovery via _try_open_system_db kicks in next start) and debug-level
trace on success / close exception.

* drop customer-specific token + add CHANGELOG entries

Per CLAUDE.md vendor-agnostic OSS rule: nothing customer-specific in
shipped code/comments. Replaced "foundryai-dev-vrysanek 2026-05-05"
references in docker-compose.yml and src/db.py docstring with generic
"Docker image upgrade window where DuckDB versions differ" framing.
The original incident date + host live in the commit history / PR body,
not in the tree.

Adds CHANGELOG entries under Unreleased:
- Fixed: close_system_db CHECKPOINT-on-shutdown semantics + WAL-replay
  failure mode the fix protects against
- Changed: docker-compose stop_grace_period 60s on app + scheduler

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
---
 CHANGELOG.md       |  8 ++++++++
 docker-compose.yml | 13 +++++++++++++
 src/db.py          | 30 +++++++++++++++++++++++++++---
 3 files changed, 48 insertions(+), 3 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 3ed9580..236d732 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -542,6 +542,14 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C
 
 - Schema bumped v28 → v29 → v30. New tests: news repository (14), sanitizer (20), API (8), web (5), CLI (14) — 61 total — plus updated home/auth/template tests for the shared-shell architecture. CLAUDE.md "Run tests before every push" section codifies `pytest tests/ -n auto -q` as non-negotiable before each push.
 
+### Fixed (system DB shutdown)
+
+- **`close_system_db()` now CHECKPOINTs before closing the system DB connection**, so the WAL flushes into `system.duckdb` and the file is left in a clean state across `docker compose up -d` recreate windows. Previously, a SIGKILL after the default 10s `stop_grace_period` could leave a populated `.wal` that the next process must replay on open; if the next image carried a different DuckDB version, replay could trip an internal assertion (`Failure while replaying WAL ... GetDefaultDatabase with no default database set`) and 500 every authed request until the WAL file was manually removed. CHECKPOINT is best-effort with operator-visible logging — `WARNING` on failure, `DEBUG` on success.
+
+### Changed (compose grace)
+
+- **`docker-compose.yml` `stop_grace_period: 60s`** on the `app` and `scheduler` services (was Docker's 10s default). Gives uvicorn time to drain in-flight requests + run the new shutdown CHECKPOINT before SIGKILL. Healthy `docker compose down` is unaffected (services still stop as soon as their lifespan exits).
+
 ## [0.47.4] — 2026-05-08
 
 ### Fixed
diff --git a/docker-compose.yml b/docker-compose.yml
index 4ea3a49..8e99783 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -38,6 +38,16 @@ services:
     mem_limit: 4g
     mem_reservation: 1g
     cpus: 2.0
+    # Default 10s is too short for graceful uvicorn shutdown — under load,
+    # in-flight requests + DuckDB CHECKPOINT (see lifespan in app/main.py)
+    # need more headroom. SIGKILL mid-WAL-write produces a corrupt
+    # system.duckdb.wal that the next image's DuckDB version cannot replay
+    # ("Failure while replaying WAL ... GetDefaultDatabase with no default
+    # database set"), 500-ing every authed request until WAL is removed.
+    # Hits hardest during a Docker image upgrade window where the new
+    # image's DuckDB version differs from the old container's, since
+    # WAL replay across versions trips on internal assertions.
+    stop_grace_period: 60s
 
   # One-shot: run extractor then rebuild orchestrator views
   extract:
@@ -78,6 +88,9 @@ services:
     restart: unless-stopped
     mem_limit: 2g
     cpus: 1.0
+    # Match app service — scheduler holds DuckDB connections too; same
+    # WAL-corruption risk on SIGKILL during recreate.
+    stop_grace_period: 60s
 
   telegram-bot:
     build: .
diff --git a/src/db.py b/src/db.py
index 54ff4a3..054684b 100644
--- a/src/db.py
+++ b/src/db.py
@@ -2914,12 +2914,36 @@ def get_schema_version(conn: duckdb.DuckDBPyConnection) -> int:
 
 
 def close_system_db() -> None:
-    """Close the shared system DB connection. Called on app shutdown."""
+    """Close the shared system DB connection. Called on app shutdown.
+
+    CHECKPOINT before close so the WAL flushes into ``system.duckdb`` and
+    the file is left in a clean state. If we skip this and the process
+    later gets SIGKILL'd (e.g. Docker's default 10s stop_grace_period
+    expires during ``docker compose up -d`` recreate), DuckDB leaves a
+    populated ``.wal`` that the next process must replay on open. When
+    the next process is a different DuckDB version (image upgrade
+    window), replay can hit internal assertions like
+    ``Failure while replaying WAL ... GetDefaultDatabase with no default
+    database set`` and the app 500s on every authed request.
+
+    CHECKPOINT is best-effort: if it raises (locked, disk full, etc.)
+    we still proceed to close — the recovery path in ``_try_open_system_db``
+    plus the longer ``stop_grace_period`` in compose are the safety nets.
+    """
     global _system_db_conn, _system_db_path
     if _system_db_conn:
+        try:
+            _system_db_conn.execute("CHECKPOINT")
+            logger.debug("close_system_db: CHECKPOINT ok")
+        except Exception as exc:
+            # Log + proceed — CHECKPOINT failure is not fatal (recovery path
+            # in _try_open_system_db handles a dirty WAL on next open), but
+            # we want operators to see WHY the safety net was needed if a
+            # WAL-replay failure does surface later.
+            logger.warning("close_system_db: CHECKPOINT failed (%s); proceeding to close", exc)
         try:
             _system_db_conn.close()
-        except Exception:
-            pass
+        except Exception as exc:
+            logger.debug("close_system_db: close raised (%s); ignoring", exc)
         _system_db_conn = None
         _system_db_path = None