fix(db): self-heal missing tables on future-version DBs (agnes-dev incident) (#75)

Discovered when 0.11.5 deployed onto agnes-dev whose system DB had been
bumped to schema_version=10 during local experimentation with a parallel
WIP branch (PR #72-style Context Engineering work). The lab v10 migration
laid down its own table set without including v9's role tables — so the
v9 binary saw `current=10 > SCHEMA_VERSION=9`, correctly treated it as a
future-version-rollback and skipped its migration ladder, but ALSO
skipped the table-creation step. Every query against user_role_grants
(`_hydrate_legacy_role`, /profile, require_internal_role's DB fallback,
every admin-gated request) then crashed with `_duckdb.CatalogException:
Table with name user_role_grants does not exist`. Symptom on agnes-dev:
HTTP 500 on /profile, admin nav vanished, /admin/* returned 403.

Fix: hoist `conn.execute(_SYSTEM_SCHEMA)` to the TOP of _ensure_schema,
unconditional. _SYSTEM_SCHEMA is all `CREATE TABLE IF NOT EXISTS`, so
existing tables stay untouched (columns + data preserved); missing
tables get created. Idempotent, near-zero cost (a few dozen no-op DDLs
per process start). The migration block below still calls
_SYSTEM_SCHEMA when migrating; that's now the redundant-but-cheap
follow-up — left in place so the migration ladder reads chronologically.

Concrete coverage of the rebase scenario the user asked about — a
contributor switching FROM a lab future-schema branch BACK to a
released binary now boots cleanly:

- Forward rebase (older → current): unchanged, ladder runs as before.
- Same-version rebase: unchanged, _seed_core_roles tail call still
  drives doc-tweak refresh.
- Backward "lab" rebase (this fix): tables get re-materialized; if the
  DB is still on a future schema_version, _seed_core_roles tail call
  remains gated so we don't accidentally write data into a schema
  shape this binary doesn't understand. Operator can drop the v9
  schema_version manually to trigger a clean ladder re-run if they
  want the full v8→v9 backfill (what we did to recover agnes-dev).

Test: new test_split_brain_future_version_with_missing_tables_self_heals
in tests/test_db.py::TestMigrationSafety. Synthesizes a v99 DB whose
only existing table is schema_version, runs _ensure_schema, asserts
both user_role_grants AND internal_roles AND group_mappings AND users
exist after the call, and that the schema_version row stays at 99
(future-version contract). test_future_version_is_noop docstring
updated to reflect the new self-heal pass — its only assertion (the
version-row contract) still holds unchanged.

pyproject.toml: 0.11.5 → 0.11.6.
CHANGELOG.md: new [0.11.6] section under [Unreleased] skeleton.
This commit is contained in:
Petr Simecek 2026-04-28 15:51:33 +02:00 committed by GitHub
parent d9c4a95d9e
commit 3047f310b9
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
3 changed files with 249 additions and 13 deletions

View file

@ -10,6 +10,54 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C
## [Unreleased]
### Fixed
- **Pre-migration snapshot integrity** — the snapshot file written
before a v(N-k)→vN migration now captures the true on-disk state
*before* any DDL runs, instead of the post-self-heal state the
0.12.0 hoist (#106) introduced. With the unconditional
`conn.execute(_SYSTEM_SCHEMA)` at the top of `_ensure_schema`, the
full set of modern-binary tables (`view_ownership`,
`marketplace_registry`, `user_groups`, `resource_grants`, etc.) was
materialized first, then `CHECKPOINT` flushed them to disk, and
`shutil.copy2` copied the already-modified DB as the
"pre-migration" snapshot — so an operator inspecting the snapshot
for rollback debugging saw the binary's full table set instead of
the old schema. Functionally rollback still worked (extra empty
tables are harmless and re-running migration is idempotent), but
the snapshot was misleading. Fix: gate the self-heal call on
`current >= SCHEMA_VERSION`. The split-brain (`current >
SCHEMA_VERSION`) and same-version safety-net (`current ==
SCHEMA_VERSION`) paths still self-heal as before; the migration
path (`current < SCHEMA_VERSION`) takes its snapshot first and
then runs `_SYSTEM_SCHEMA` from inside the existing migration
block.
- **Split-brain self-heal regression test for a shared dev-VM
split-brain incident** (2026-04-27). Pins the contract that the
gated `_SYSTEM_SCHEMA`
self-heal pass keeps working when a binary lands on a
future-version DB that's missing tables it expects: every query
against the missing table would otherwise crash at runtime
(`_duckdb.CatalogException`). New
`test_split_brain_future_version_with_missing_tables_self_heals`
in `tests/test_db.py::TestMigrationSafety` synthesizes a v99 DB
whose only table is `schema_version`, runs `_ensure_schema`, and
asserts that the v13-era core tables (`users`, `user_groups`,
`user_group_members`, `resource_grants`) now exist *and* that
`schema_version` stays at 99 (self-heal without falsely
advertising a downgrade). Plus
`test_pre_migration_snapshot_excludes_post_self_heal_tables`
pins the snapshot-integrity contract: a v2→vN migration's
snapshot must not contain any post-v2 table from the modern
binary.
### Internal
- `test_future_version_is_noop` docstring updated to reflect that
the self-heal pass *does* run on a future-version DB, just
doesn't touch the version row. The test still passes unchanged —
its only assertion was the version-row contract, which holds.
## [0.12.0] — 2026-04-28
### Changed

View file

@ -1222,22 +1222,34 @@ _V3_TO_V4_MIGRATIONS = [
def _ensure_schema(conn: duckdb.DuckDBPyConnection) -> None:
"""Create tables if they don't exist. Apply migrations if schema version changed.
The first action running ``_SYSTEM_SCHEMA`` unconditionally is the
self-heal pass for split-brain DBs. Scenario: a contributor's DB landed
at schema_version=N from a partial migration (crash mid-DDL, parallel
WIP branch with a different table set, etc.), but the on-disk file is
missing tables this binary expects. Without this pass, the migration
block below skips because ``current >= SCHEMA_VERSION`` and every
runtime query against the missing table crashes.
Self-heal pass for split-brain DBs runs only when ``current >=
SCHEMA_VERSION``. Scenario: a contributor's DB landed at
``schema_version=N`` from a partial migration (crash mid-DDL,
parallel WIP branch with a different table set, etc.), but the
on-disk file is missing tables this binary expects. Without this
pass, the migration block below skips because we don't downgrade,
and every runtime query against the missing table crashes.
Because ``_SYSTEM_SCHEMA`` is all ``CREATE TABLE IF NOT EXISTS``,
running it unconditionally is idempotent: existing tables stay
untouched (columns + data preserved), missing tables get created.
Cost: dozens of no-op DDLs per process start.
"""
conn.execute(_SYSTEM_SCHEMA)
running it is idempotent: existing tables stay untouched (columns +
data preserved), missing tables get created. Cost: dozens of no-op
DDLs per process start.
The self-heal explicitly does NOT run on the ``current <
SCHEMA_VERSION`` path so the pre-migration snapshot taken inside
that branch captures a true point-in-time state of the on-disk DB
*before* any DDL runs operators reading the snapshot for rollback
debugging see exactly the tables the old schema had, not the
binary's full table set with extras tacked on.
"""
current = get_schema_version(conn)
if current >= SCHEMA_VERSION:
# Split-brain or same-version safety net: heal any tables this
# binary expects that aren't on disk. Migration block skipped
# because we don't downgrade — the version row is left at
# ``current`` so a later binary that understands ``current``
# picks up where the split-brain left off.
conn.execute(_SYSTEM_SCHEMA)
if current < SCHEMA_VERSION:
# Snapshot before migration for rollback support
if current > 0:

View file

@ -326,7 +326,11 @@ class TestMigrationSafety:
conn.close()
def test_future_version_is_noop(self, tmp_path, monkeypatch):
"""_ensure_schema does nothing when schema_version > SCHEMA_VERSION."""
"""``_ensure_schema`` does not modify ``schema_version`` when it's
already past ``SCHEMA_VERSION``. The unconditional ``_SYSTEM_SCHEMA``
self-heal pass *does* run on the future-version DB it's all
``CREATE TABLE IF NOT EXISTS``, so tables this binary expects get
materialized but the version row stays put."""
monkeypatch.setenv("DATA_DIR", str(tmp_path))
import duckdb as _duckdb
from src.db import _ensure_schema, get_schema_version
@ -344,6 +348,178 @@ class TestMigrationSafety:
finally:
conn.close()
def test_split_brain_future_version_with_missing_tables_self_heals(
self, tmp_path, monkeypatch,
):
"""Regression for a shared dev-VM split-brain incident.
Shape: a contributor experiments with a future-schema branch that
bumps the DB to ``schema_version=N`` (N > current binary's
``SCHEMA_VERSION``) with its own table layout, then switches or
rebases back to the released binary. The on-disk DB is on a
version this binary doesn't understand and is missing tables this
binary's code expects. Without self-heal, every query against the
missing table crashes at runtime the migration block correctly
skips (we don't downgrade), but nothing creates the missing
tables either.
The contract this test pins: the gated
``conn.execute(_SYSTEM_SCHEMA)`` call (run when ``current >=
SCHEMA_VERSION``) materializes any missing tables *and* leaves
the future-version ``schema_version`` row untouched. We
synthesize a v99 DB whose only table is ``schema_version``,
then assert that running ``_ensure_schema`` creates the v13-era
core tables that the binary needs (``user_groups``,
``user_group_members``, ``resource_grants``, ``users``) while
keeping the version at 99.
"""
monkeypatch.setenv("DATA_DIR", str(tmp_path))
import duckdb as _duckdb
from src.db import _ensure_schema, get_schema_version
db_path = tmp_path / "state" / "system.duckdb"
db_path.parent.mkdir(parents=True, exist_ok=True)
conn = _duckdb.connect(str(db_path))
try:
# Synthesize an "old binary on a future-schema DB" state: only
# the schema_version table exists (no current-schema tables,
# no lab tables either — matches the exact shape seen after
# a lab migration ran but the binary then rolled back to one
# that doesn't know the lab schema).
conn.execute(
"CREATE TABLE schema_version (version INTEGER, applied_at TIMESTAMP DEFAULT current_timestamp);"
"INSERT INTO schema_version (version) VALUES (99);"
)
# Sanity: the v13-era tables we expect the self-heal pass to
# create are NOT there before the call. Picked from the
# post-RBAC-v13 / post-marketplace surface so a future
# rename/drop in src/db.py fails this test loudly.
expected_tables = {
"users",
"user_groups",
"user_group_members",
"resource_grants",
}
tables_before = {
r[0]
for r in conn.execute(
"SELECT table_name FROM information_schema.tables "
"WHERE table_schema = ?",
["main"],
).fetchall()
}
assert not (expected_tables & tables_before), (
"fixture started with a non-empty schema; expected only "
"schema_version to be present"
)
_ensure_schema(conn)
# After: every expected table exists (self-heal worked) AND
# the version row stays at the future value.
tables_after = {
r[0]
for r in conn.execute(
"SELECT table_name FROM information_schema.tables "
"WHERE table_schema = ?",
["main"],
).fetchall()
}
missing = expected_tables - tables_after
assert not missing, (
f"self-heal must create v13-era tables on a future-version DB, "
f"missing: {sorted(missing)}"
)
# The future-version contract still holds: version row untouched.
assert get_schema_version(conn) == 99
finally:
conn.close()
def test_pre_migration_snapshot_excludes_post_self_heal_tables(
self, tmp_path, monkeypatch,
):
"""The pre-migration snapshot must capture the on-disk DB state
*before* any DDL runs, so operators reading the snapshot for
rollback debugging see the old schema as it actually was not
the binary's full table set with extras tacked on.
Regression for the original hoist in 0.12.0: ``_SYSTEM_SCHEMA``
was unconditionally executed at the top of ``_ensure_schema``,
ahead of the snapshot copy in the migration block. On a v2vN
migration, ``view_ownership`` / ``user_groups`` /
``resource_grants`` (and every other table the modern binary
adds) were created first, then ``CHECKPOINT`` flushed them to
disk, and ``shutil.copy2`` copied the already-modified file as
the "pre-migration" snapshot. Functionally rollback still
worked (extra empty tables are harmless), but the snapshot was
misleading. Fix: gate the self-heal call on ``current >=
SCHEMA_VERSION`` so the migration path takes its snapshot
before any DDL touches the DB.
"""
from src.db import (
SCHEMA_VERSION,
_ensure_schema,
get_schema_version,
get_system_db,
)
# Bootstrap a v2 DB on disk, then trigger the migration ladder.
db_path = tmp_path / "state" / "system.duckdb"
self._create_v2_db(db_path)
monkeypatch.setenv("DATA_DIR", str(tmp_path))
conn = get_system_db()
try:
assert get_schema_version(conn) == SCHEMA_VERSION
finally:
conn.close()
# Drop the cached connection so the snapshot file isn't
# locked when we re-open it.
from src import db as _db
_db._system_db_conn = None
_db._system_db_path = None
snapshot = tmp_path / "state" / "system.duckdb.pre-migrate"
assert snapshot.exists(), (
"fixture precondition: snapshot must be written for a v2→vN "
"migration"
)
import duckdb as _duckdb
snap = _duckdb.connect(str(snapshot), read_only=True)
try:
tables_in_snapshot = {
r[0]
for r in snap.execute(
"SELECT table_name FROM information_schema.tables "
"WHERE table_schema = 'main'"
).fetchall()
}
finally:
snap.close()
# Tables NOT present in the v2 fixture but added by later
# migrations (and therefore created by _SYSTEM_SCHEMA on the
# modern binary). If any of these leaked into the snapshot, the
# snapshot was contaminated by a self-heal pass running before
# the snapshot copy.
post_v2_tables = {
"view_ownership", # v10 (#100)
"marketplace_registry", # v11
"marketplace_plugins", # v11
"user_groups", # v11+ / v13
"user_group_members", # v13 (#106)
"resource_grants", # v13 (#106)
}
leaked = post_v2_tables & tables_in_snapshot
assert not leaked, (
f"pre-migration snapshot was contaminated with post-v2 "
f"tables — self-heal pass ran before the snapshot copy. "
f"Leaked: {sorted(leaked)}"
)
class TestSchemaV4:
"""Tests for v4 schema additions: metric_definitions and column_metadata tables."""