fix(duckdb): CHECKPOINT on shutdown + 60s compose grace to prevent WAL corruption (#235)

* fix(duckdb): CHECKPOINT on shutdown + 60s compose grace to prevent WAL corruption

Default 10s stop_grace_period + missing CHECKPOINT on shutdown produced a
class of WAL-replay failures during agnes-auto-upgrade recreates. Sequence:

1. New image digest detected → docker compose up -d → SIGTERM to app
2. App's lifespan close_system_db() called .close() but never CHECKPOINT,
   so any uncheckpointed ops stayed in system.duckdb.wal
3. Container didn't exit within 10s → dockerd SIGKILL (verified in journal:
   "Container failed to exit within 10s of signal 15 - using the force")
4. New container started with possibly-different DuckDB version, replay
   hit "Failure while replaying WAL ... GetDefaultDatabase with no
   default database set" assertion → 500 on every authed request

Observed on foundryai-dev-vrysanek 2026-05-05; recovered by removing the
WAL manually. _try_open_system_db already exists as a recovery net but
requires a system.duckdb.pre-migrate snapshot, which doesn't exist
outside migration windows.

Two-part prevention:

- src/db.py::close_system_db: execute CHECKPOINT before .close() so the
  WAL is empty when the file is released. Best-effort (try/except), so a
  locked or full-disk CHECKPOINT does not block close.

- docker-compose.yml: stop_grace_period: 60s on app + scheduler, gives
  uvicorn + lifespan room to run shutdown handlers under load before
  Docker's SIGKILL fires.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* log CHECKPOINT outcome on system DB close

Silent except: pass on both CHECKPOINT and close() left operators
without any signal when the WAL-flush safety net actually saved them
(or didn't). Add logger.warning on CHECKPOINT failure (operator-actionable
- recovery via _try_open_system_db kicks in next start) and debug-level
trace on success / close exception.

* drop customer-specific token + add CHANGELOG entries

Per CLAUDE.md vendor-agnostic OSS rule: nothing customer-specific in
shipped code/comments. Replaced "foundryai-dev-vrysanek 2026-05-05"
references in docker-compose.yml and src/db.py docstring with generic
"Docker image upgrade window where DuckDB versions differ" framing.
The original incident date + host live in the commit history / PR body,
not in the tree.

Adds CHANGELOG entries under Unreleased:
- Fixed: close_system_db CHECKPOINT-on-shutdown semantics + WAL-replay
  failure mode the fix protects against
- Changed: docker-compose stop_grace_period 60s on app + scheduler

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
This commit is contained in:
Vojtech 2026-05-10 23:02:30 +04:00 committed by GitHub
parent 41829e8a45
commit 8b2f6620a8
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
3 changed files with 48 additions and 3 deletions

View file

@ -542,6 +542,14 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C
- Schema bumped v28 → v29 → v30. New tests: news repository (14), sanitizer (20), API (8), web (5), CLI (14) — 61 total — plus updated home/auth/template tests for the shared-shell architecture. CLAUDE.md "Run tests before every push" section codifies `pytest tests/ -n auto -q` as non-negotiable before each push.
### Fixed (system DB shutdown)
- **`close_system_db()` now CHECKPOINTs before closing the system DB connection**, so the WAL flushes into `system.duckdb` and the file is left in a clean state across `docker compose up -d` recreate windows. Previously, a SIGKILL after the default 10s `stop_grace_period` could leave a populated `.wal` that the next process must replay on open; if the next image carried a different DuckDB version, replay could trip an internal assertion (`Failure while replaying WAL ... GetDefaultDatabase with no default database set`) and 500 every authed request until the WAL file was manually removed. CHECKPOINT is best-effort with operator-visible logging — `WARNING` on failure, `DEBUG` on success.
### Changed (compose grace)
- **`docker-compose.yml` `stop_grace_period: 60s`** on the `app` and `scheduler` services (was Docker's 10s default). Gives uvicorn time to drain in-flight requests + run the new shutdown CHECKPOINT before SIGKILL. Healthy `docker compose down` is unaffected (services still stop as soon as their lifespan exits).
## [0.47.4] — 2026-05-08
### Fixed

View file

@ -38,6 +38,16 @@ services:
mem_limit: 4g
mem_reservation: 1g
cpus: 2.0
# Default 10s is too short for graceful uvicorn shutdown — under load,
# in-flight requests + DuckDB CHECKPOINT (see lifespan in app/main.py)
# need more headroom. SIGKILL mid-WAL-write produces a corrupt
# system.duckdb.wal that the next image's DuckDB version cannot replay
# ("Failure while replaying WAL ... GetDefaultDatabase with no default
# database set"), 500-ing every authed request until WAL is removed.
# Hits hardest during a Docker image upgrade window where the new
# image's DuckDB version differs from the old container's, since
# WAL replay across versions trips on internal assertions.
stop_grace_period: 60s
# One-shot: run extractor then rebuild orchestrator views
extract:
@ -78,6 +88,9 @@ services:
restart: unless-stopped
mem_limit: 2g
cpus: 1.0
# Match app service — scheduler holds DuckDB connections too; same
# WAL-corruption risk on SIGKILL during recreate.
stop_grace_period: 60s
telegram-bot:
build: .

View file

@ -2914,12 +2914,36 @@ def get_schema_version(conn: duckdb.DuckDBPyConnection) -> int:
def close_system_db() -> None:
"""Close the shared system DB connection. Called on app shutdown."""
"""Close the shared system DB connection. Called on app shutdown.
CHECKPOINT before close so the WAL flushes into ``system.duckdb`` and
the file is left in a clean state. If we skip this and the process
later gets SIGKILL'd (e.g. Docker's default 10s stop_grace_period
expires during ``docker compose up -d`` recreate), DuckDB leaves a
populated ``.wal`` that the next process must replay on open. When
the next process is a different DuckDB version (image upgrade
window), replay can hit internal assertions like
``Failure while replaying WAL ... GetDefaultDatabase with no default
database set`` and the app 500s on every authed request.
CHECKPOINT is best-effort: if it raises (locked, disk full, etc.)
we still proceed to close the recovery path in ``_try_open_system_db``
plus the longer ``stop_grace_period`` in compose are the safety nets.
"""
global _system_db_conn, _system_db_path
if _system_db_conn:
try:
_system_db_conn.execute("CHECKPOINT")
logger.debug("close_system_db: CHECKPOINT ok")
except Exception as exc:
# Log + proceed — CHECKPOINT failure is not fatal (recovery path
# in _try_open_system_db handles a dirty WAL on next open), but
# we want operators to see WHY the safety net was needed if a
# WAL-replay failure does surface later.
logger.warning("close_system_db: CHECKPOINT failed (%s); proceeding to close", exc)
try:
_system_db_conn.close()
except Exception:
pass
except Exception as exc:
logger.debug("close_system_db: close raised (%s); ignoring", exc)
_system_db_conn = None
_system_db_path = None