From 3047f310b9c324320f022254df57ece875adcd33 Mon Sep 17 00:00:00 2001 From: Petr Simecek Date: Tue, 28 Apr 2026 15:51:33 +0200 Subject: [PATCH] fix(db): self-heal missing tables on future-version DBs (agnes-dev incident) (#75) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Discovered when 0.11.5 deployed onto agnes-dev whose system DB had been bumped to schema_version=10 during local experimentation with a parallel WIP branch (PR #72-style Context Engineering work). The lab v10 migration laid down its own table set without including v9's role tables — so the v9 binary saw `current=10 > SCHEMA_VERSION=9`, correctly treated it as a future-version-rollback and skipped its migration ladder, but ALSO skipped the table-creation step. Every query against user_role_grants (`_hydrate_legacy_role`, /profile, require_internal_role's DB fallback, every admin-gated request) then crashed with `_duckdb.CatalogException: Table with name user_role_grants does not exist`. Symptom on agnes-dev: HTTP 500 on /profile, admin nav vanished, /admin/* returned 403. Fix: hoist `conn.execute(_SYSTEM_SCHEMA)` to the TOP of _ensure_schema, unconditional. _SYSTEM_SCHEMA is all `CREATE TABLE IF NOT EXISTS`, so existing tables stay untouched (columns + data preserved); missing tables get created. Idempotent, near-zero cost (a few dozen no-op DDLs per process start). The migration block below still calls _SYSTEM_SCHEMA when migrating; that's now the redundant-but-cheap follow-up — left in place so the migration ladder reads chronologically. Concrete coverage of the rebase scenario the user asked about — a contributor switching FROM a lab future-schema branch BACK to a released binary now boots cleanly: - Forward rebase (older → current): unchanged, ladder runs as before. - Same-version rebase: unchanged, _seed_core_roles tail call still drives doc-tweak refresh. - Backward "lab" rebase (this fix): tables get re-materialized; if the DB is still on a future schema_version, _seed_core_roles tail call remains gated so we don't accidentally write data into a schema shape this binary doesn't understand. Operator can drop the v9 schema_version manually to trigger a clean ladder re-run if they want the full v8→v9 backfill (what we did to recover agnes-dev). Test: new test_split_brain_future_version_with_missing_tables_self_heals in tests/test_db.py::TestMigrationSafety. Synthesizes a v99 DB whose only existing table is schema_version, runs _ensure_schema, asserts both user_role_grants AND internal_roles AND group_mappings AND users exist after the call, and that the schema_version row stays at 99 (future-version contract). test_future_version_is_noop docstring updated to reflect the new self-heal pass — its only assertion (the version-row contract) still holds unchanged. pyproject.toml: 0.11.5 → 0.11.6. CHANGELOG.md: new [0.11.6] section under [Unreleased] skeleton. --- CHANGELOG.md | 48 +++++++++++++ src/db.py | 36 ++++++---- tests/test_db.py | 178 ++++++++++++++++++++++++++++++++++++++++++++++- 3 files changed, 249 insertions(+), 13 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index da829c6..1ea886d 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -10,6 +10,54 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C ## [Unreleased] +### Fixed + +- **Pre-migration snapshot integrity** — the snapshot file written + before a v(N-k)→vN migration now captures the true on-disk state + *before* any DDL runs, instead of the post-self-heal state the + 0.12.0 hoist (#106) introduced. With the unconditional + `conn.execute(_SYSTEM_SCHEMA)` at the top of `_ensure_schema`, the + full set of modern-binary tables (`view_ownership`, + `marketplace_registry`, `user_groups`, `resource_grants`, etc.) was + materialized first, then `CHECKPOINT` flushed them to disk, and + `shutil.copy2` copied the already-modified DB as the + "pre-migration" snapshot — so an operator inspecting the snapshot + for rollback debugging saw the binary's full table set instead of + the old schema. Functionally rollback still worked (extra empty + tables are harmless and re-running migration is idempotent), but + the snapshot was misleading. Fix: gate the self-heal call on + `current >= SCHEMA_VERSION`. The split-brain (`current > + SCHEMA_VERSION`) and same-version safety-net (`current == + SCHEMA_VERSION`) paths still self-heal as before; the migration + path (`current < SCHEMA_VERSION`) takes its snapshot first and + then runs `_SYSTEM_SCHEMA` from inside the existing migration + block. +- **Split-brain self-heal regression test for a shared dev-VM + split-brain incident** (2026-04-27). Pins the contract that the + gated `_SYSTEM_SCHEMA` + self-heal pass keeps working when a binary lands on a + future-version DB that's missing tables it expects: every query + against the missing table would otherwise crash at runtime + (`_duckdb.CatalogException`). New + `test_split_brain_future_version_with_missing_tables_self_heals` + in `tests/test_db.py::TestMigrationSafety` synthesizes a v99 DB + whose only table is `schema_version`, runs `_ensure_schema`, and + asserts that the v13-era core tables (`users`, `user_groups`, + `user_group_members`, `resource_grants`) now exist *and* that + `schema_version` stays at 99 (self-heal without falsely + advertising a downgrade). Plus + `test_pre_migration_snapshot_excludes_post_self_heal_tables` + pins the snapshot-integrity contract: a v2→vN migration's + snapshot must not contain any post-v2 table from the modern + binary. + +### Internal + +- `test_future_version_is_noop` docstring updated to reflect that + the self-heal pass *does* run on a future-version DB, just + doesn't touch the version row. The test still passes unchanged — + its only assertion was the version-row contract, which holds. + ## [0.12.0] — 2026-04-28 ### Changed diff --git a/src/db.py b/src/db.py index 501ba93..347c086 100644 --- a/src/db.py +++ b/src/db.py @@ -1222,22 +1222,34 @@ _V3_TO_V4_MIGRATIONS = [ def _ensure_schema(conn: duckdb.DuckDBPyConnection) -> None: """Create tables if they don't exist. Apply migrations if schema version changed. - The first action — running ``_SYSTEM_SCHEMA`` unconditionally — is the - self-heal pass for split-brain DBs. Scenario: a contributor's DB landed - at schema_version=N from a partial migration (crash mid-DDL, parallel - WIP branch with a different table set, etc.), but the on-disk file is - missing tables this binary expects. Without this pass, the migration - block below skips because ``current >= SCHEMA_VERSION`` and every - runtime query against the missing table crashes. + Self-heal pass for split-brain DBs runs only when ``current >= + SCHEMA_VERSION``. Scenario: a contributor's DB landed at + ``schema_version=N`` from a partial migration (crash mid-DDL, + parallel WIP branch with a different table set, etc.), but the + on-disk file is missing tables this binary expects. Without this + pass, the migration block below skips because we don't downgrade, + and every runtime query against the missing table crashes. Because ``_SYSTEM_SCHEMA`` is all ``CREATE TABLE IF NOT EXISTS``, - running it unconditionally is idempotent: existing tables stay - untouched (columns + data preserved), missing tables get created. - Cost: dozens of no-op DDLs per process start. - """ - conn.execute(_SYSTEM_SCHEMA) + running it is idempotent: existing tables stay untouched (columns + + data preserved), missing tables get created. Cost: dozens of no-op + DDLs per process start. + The self-heal explicitly does NOT run on the ``current < + SCHEMA_VERSION`` path so the pre-migration snapshot taken inside + that branch captures a true point-in-time state of the on-disk DB + *before* any DDL runs — operators reading the snapshot for rollback + debugging see exactly the tables the old schema had, not the + binary's full table set with extras tacked on. + """ current = get_schema_version(conn) + if current >= SCHEMA_VERSION: + # Split-brain or same-version safety net: heal any tables this + # binary expects that aren't on disk. Migration block skipped + # because we don't downgrade — the version row is left at + # ``current`` so a later binary that understands ``current`` + # picks up where the split-brain left off. + conn.execute(_SYSTEM_SCHEMA) if current < SCHEMA_VERSION: # Snapshot before migration for rollback support if current > 0: diff --git a/tests/test_db.py b/tests/test_db.py index 9addc92..5cd3f45 100644 --- a/tests/test_db.py +++ b/tests/test_db.py @@ -326,7 +326,11 @@ class TestMigrationSafety: conn.close() def test_future_version_is_noop(self, tmp_path, monkeypatch): - """_ensure_schema does nothing when schema_version > SCHEMA_VERSION.""" + """``_ensure_schema`` does not modify ``schema_version`` when it's + already past ``SCHEMA_VERSION``. The unconditional ``_SYSTEM_SCHEMA`` + self-heal pass *does* run on the future-version DB — it's all + ``CREATE TABLE IF NOT EXISTS``, so tables this binary expects get + materialized — but the version row stays put.""" monkeypatch.setenv("DATA_DIR", str(tmp_path)) import duckdb as _duckdb from src.db import _ensure_schema, get_schema_version @@ -344,6 +348,178 @@ class TestMigrationSafety: finally: conn.close() + def test_split_brain_future_version_with_missing_tables_self_heals( + self, tmp_path, monkeypatch, + ): + """Regression for a shared dev-VM split-brain incident. + + Shape: a contributor experiments with a future-schema branch that + bumps the DB to ``schema_version=N`` (N > current binary's + ``SCHEMA_VERSION``) with its own table layout, then switches or + rebases back to the released binary. The on-disk DB is on a + version this binary doesn't understand and is missing tables this + binary's code expects. Without self-heal, every query against the + missing table crashes at runtime — the migration block correctly + skips (we don't downgrade), but nothing creates the missing + tables either. + + The contract this test pins: the gated + ``conn.execute(_SYSTEM_SCHEMA)`` call (run when ``current >= + SCHEMA_VERSION``) materializes any missing tables *and* leaves + the future-version ``schema_version`` row untouched. We + synthesize a v99 DB whose only table is ``schema_version``, + then assert that running ``_ensure_schema`` creates the v13-era + core tables that the binary needs (``user_groups``, + ``user_group_members``, ``resource_grants``, ``users``) while + keeping the version at 99. + """ + monkeypatch.setenv("DATA_DIR", str(tmp_path)) + import duckdb as _duckdb + from src.db import _ensure_schema, get_schema_version + + db_path = tmp_path / "state" / "system.duckdb" + db_path.parent.mkdir(parents=True, exist_ok=True) + conn = _duckdb.connect(str(db_path)) + try: + # Synthesize an "old binary on a future-schema DB" state: only + # the schema_version table exists (no current-schema tables, + # no lab tables either — matches the exact shape seen after + # a lab migration ran but the binary then rolled back to one + # that doesn't know the lab schema). + conn.execute( + "CREATE TABLE schema_version (version INTEGER, applied_at TIMESTAMP DEFAULT current_timestamp);" + "INSERT INTO schema_version (version) VALUES (99);" + ) + + # Sanity: the v13-era tables we expect the self-heal pass to + # create are NOT there before the call. Picked from the + # post-RBAC-v13 / post-marketplace surface so a future + # rename/drop in src/db.py fails this test loudly. + expected_tables = { + "users", + "user_groups", + "user_group_members", + "resource_grants", + } + tables_before = { + r[0] + for r in conn.execute( + "SELECT table_name FROM information_schema.tables " + "WHERE table_schema = ?", + ["main"], + ).fetchall() + } + assert not (expected_tables & tables_before), ( + "fixture started with a non-empty schema; expected only " + "schema_version to be present" + ) + + _ensure_schema(conn) + + # After: every expected table exists (self-heal worked) AND + # the version row stays at the future value. + tables_after = { + r[0] + for r in conn.execute( + "SELECT table_name FROM information_schema.tables " + "WHERE table_schema = ?", + ["main"], + ).fetchall() + } + missing = expected_tables - tables_after + assert not missing, ( + f"self-heal must create v13-era tables on a future-version DB, " + f"missing: {sorted(missing)}" + ) + + # The future-version contract still holds: version row untouched. + assert get_schema_version(conn) == 99 + finally: + conn.close() + + def test_pre_migration_snapshot_excludes_post_self_heal_tables( + self, tmp_path, monkeypatch, + ): + """The pre-migration snapshot must capture the on-disk DB state + *before* any DDL runs, so operators reading the snapshot for + rollback debugging see the old schema as it actually was — not + the binary's full table set with extras tacked on. + + Regression for the original hoist in 0.12.0: ``_SYSTEM_SCHEMA`` + was unconditionally executed at the top of ``_ensure_schema``, + ahead of the snapshot copy in the migration block. On a v2→vN + migration, ``view_ownership`` / ``user_groups`` / + ``resource_grants`` (and every other table the modern binary + adds) were created first, then ``CHECKPOINT`` flushed them to + disk, and ``shutil.copy2`` copied the already-modified file as + the "pre-migration" snapshot. Functionally rollback still + worked (extra empty tables are harmless), but the snapshot was + misleading. Fix: gate the self-heal call on ``current >= + SCHEMA_VERSION`` so the migration path takes its snapshot + before any DDL touches the DB. + """ + from src.db import ( + SCHEMA_VERSION, + _ensure_schema, + get_schema_version, + get_system_db, + ) + + # Bootstrap a v2 DB on disk, then trigger the migration ladder. + db_path = tmp_path / "state" / "system.duckdb" + self._create_v2_db(db_path) + monkeypatch.setenv("DATA_DIR", str(tmp_path)) + + conn = get_system_db() + try: + assert get_schema_version(conn) == SCHEMA_VERSION + finally: + conn.close() + # Drop the cached connection so the snapshot file isn't + # locked when we re-open it. + from src import db as _db + _db._system_db_conn = None + _db._system_db_path = None + + snapshot = tmp_path / "state" / "system.duckdb.pre-migrate" + assert snapshot.exists(), ( + "fixture precondition: snapshot must be written for a v2→vN " + "migration" + ) + + import duckdb as _duckdb + snap = _duckdb.connect(str(snapshot), read_only=True) + try: + tables_in_snapshot = { + r[0] + for r in snap.execute( + "SELECT table_name FROM information_schema.tables " + "WHERE table_schema = 'main'" + ).fetchall() + } + finally: + snap.close() + + # Tables NOT present in the v2 fixture but added by later + # migrations (and therefore created by _SYSTEM_SCHEMA on the + # modern binary). If any of these leaked into the snapshot, the + # snapshot was contaminated by a self-heal pass running before + # the snapshot copy. + post_v2_tables = { + "view_ownership", # v10 (#100) + "marketplace_registry", # v11 + "marketplace_plugins", # v11 + "user_groups", # v11+ / v13 + "user_group_members", # v13 (#106) + "resource_grants", # v13 (#106) + } + leaked = post_v2_tables & tables_in_snapshot + assert not leaked, ( + f"pre-migration snapshot was contaminated with post-v2 " + f"tables — self-heal pass ran before the snapshot copy. " + f"Leaked: {sorted(leaked)}" + ) + class TestSchemaV4: """Tests for v4 schema additions: metric_definitions and column_metadata tables."""