fix(db): self-heal missing tables on future-version DBs (agnes-dev incident) (#75)
Discovered when 0.11.5 deployed onto agnes-dev whose system DB had been bumped to schema_version=10 during local experimentation with a parallel WIP branch (PR #72-style Context Engineering work). The lab v10 migration laid down its own table set without including v9's role tables — so the v9 binary saw `current=10 > SCHEMA_VERSION=9`, correctly treated it as a future-version-rollback and skipped its migration ladder, but ALSO skipped the table-creation step. Every query against user_role_grants (`_hydrate_legacy_role`, /profile, require_internal_role's DB fallback, every admin-gated request) then crashed with `_duckdb.CatalogException: Table with name user_role_grants does not exist`. Symptom on agnes-dev: HTTP 500 on /profile, admin nav vanished, /admin/* returned 403. Fix: hoist `conn.execute(_SYSTEM_SCHEMA)` to the TOP of _ensure_schema, unconditional. _SYSTEM_SCHEMA is all `CREATE TABLE IF NOT EXISTS`, so existing tables stay untouched (columns + data preserved); missing tables get created. Idempotent, near-zero cost (a few dozen no-op DDLs per process start). The migration block below still calls _SYSTEM_SCHEMA when migrating; that's now the redundant-but-cheap follow-up — left in place so the migration ladder reads chronologically. Concrete coverage of the rebase scenario the user asked about — a contributor switching FROM a lab future-schema branch BACK to a released binary now boots cleanly: - Forward rebase (older → current): unchanged, ladder runs as before. - Same-version rebase: unchanged, _seed_core_roles tail call still drives doc-tweak refresh. - Backward "lab" rebase (this fix): tables get re-materialized; if the DB is still on a future schema_version, _seed_core_roles tail call remains gated so we don't accidentally write data into a schema shape this binary doesn't understand. Operator can drop the v9 schema_version manually to trigger a clean ladder re-run if they want the full v8→v9 backfill (what we did to recover agnes-dev). Test: new test_split_brain_future_version_with_missing_tables_self_heals in tests/test_db.py::TestMigrationSafety. Synthesizes a v99 DB whose only existing table is schema_version, runs _ensure_schema, asserts both user_role_grants AND internal_roles AND group_mappings AND users exist after the call, and that the schema_version row stays at 99 (future-version contract). test_future_version_is_noop docstring updated to reflect the new self-heal pass — its only assertion (the version-row contract) still holds unchanged. pyproject.toml: 0.11.5 → 0.11.6. CHANGELOG.md: new [0.11.6] section under [Unreleased] skeleton.
This commit is contained in:
parent
d9c4a95d9e
commit
3047f310b9
3 changed files with 249 additions and 13 deletions
48
CHANGELOG.md
48
CHANGELOG.md
|
|
@ -10,6 +10,54 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C
|
||||||
|
|
||||||
## [Unreleased]
|
## [Unreleased]
|
||||||
|
|
||||||
|
### Fixed
|
||||||
|
|
||||||
|
- **Pre-migration snapshot integrity** — the snapshot file written
|
||||||
|
before a v(N-k)→vN migration now captures the true on-disk state
|
||||||
|
*before* any DDL runs, instead of the post-self-heal state the
|
||||||
|
0.12.0 hoist (#106) introduced. With the unconditional
|
||||||
|
`conn.execute(_SYSTEM_SCHEMA)` at the top of `_ensure_schema`, the
|
||||||
|
full set of modern-binary tables (`view_ownership`,
|
||||||
|
`marketplace_registry`, `user_groups`, `resource_grants`, etc.) was
|
||||||
|
materialized first, then `CHECKPOINT` flushed them to disk, and
|
||||||
|
`shutil.copy2` copied the already-modified DB as the
|
||||||
|
"pre-migration" snapshot — so an operator inspecting the snapshot
|
||||||
|
for rollback debugging saw the binary's full table set instead of
|
||||||
|
the old schema. Functionally rollback still worked (extra empty
|
||||||
|
tables are harmless and re-running migration is idempotent), but
|
||||||
|
the snapshot was misleading. Fix: gate the self-heal call on
|
||||||
|
`current >= SCHEMA_VERSION`. The split-brain (`current >
|
||||||
|
SCHEMA_VERSION`) and same-version safety-net (`current ==
|
||||||
|
SCHEMA_VERSION`) paths still self-heal as before; the migration
|
||||||
|
path (`current < SCHEMA_VERSION`) takes its snapshot first and
|
||||||
|
then runs `_SYSTEM_SCHEMA` from inside the existing migration
|
||||||
|
block.
|
||||||
|
- **Split-brain self-heal regression test for a shared dev-VM
|
||||||
|
split-brain incident** (2026-04-27). Pins the contract that the
|
||||||
|
gated `_SYSTEM_SCHEMA`
|
||||||
|
self-heal pass keeps working when a binary lands on a
|
||||||
|
future-version DB that's missing tables it expects: every query
|
||||||
|
against the missing table would otherwise crash at runtime
|
||||||
|
(`_duckdb.CatalogException`). New
|
||||||
|
`test_split_brain_future_version_with_missing_tables_self_heals`
|
||||||
|
in `tests/test_db.py::TestMigrationSafety` synthesizes a v99 DB
|
||||||
|
whose only table is `schema_version`, runs `_ensure_schema`, and
|
||||||
|
asserts that the v13-era core tables (`users`, `user_groups`,
|
||||||
|
`user_group_members`, `resource_grants`) now exist *and* that
|
||||||
|
`schema_version` stays at 99 (self-heal without falsely
|
||||||
|
advertising a downgrade). Plus
|
||||||
|
`test_pre_migration_snapshot_excludes_post_self_heal_tables`
|
||||||
|
pins the snapshot-integrity contract: a v2→vN migration's
|
||||||
|
snapshot must not contain any post-v2 table from the modern
|
||||||
|
binary.
|
||||||
|
|
||||||
|
### Internal
|
||||||
|
|
||||||
|
- `test_future_version_is_noop` docstring updated to reflect that
|
||||||
|
the self-heal pass *does* run on a future-version DB, just
|
||||||
|
doesn't touch the version row. The test still passes unchanged —
|
||||||
|
its only assertion was the version-row contract, which holds.
|
||||||
|
|
||||||
## [0.12.0] — 2026-04-28
|
## [0.12.0] — 2026-04-28
|
||||||
|
|
||||||
### Changed
|
### Changed
|
||||||
|
|
|
||||||
36
src/db.py
36
src/db.py
|
|
@ -1222,22 +1222,34 @@ _V3_TO_V4_MIGRATIONS = [
|
||||||
def _ensure_schema(conn: duckdb.DuckDBPyConnection) -> None:
|
def _ensure_schema(conn: duckdb.DuckDBPyConnection) -> None:
|
||||||
"""Create tables if they don't exist. Apply migrations if schema version changed.
|
"""Create tables if they don't exist. Apply migrations if schema version changed.
|
||||||
|
|
||||||
The first action — running ``_SYSTEM_SCHEMA`` unconditionally — is the
|
Self-heal pass for split-brain DBs runs only when ``current >=
|
||||||
self-heal pass for split-brain DBs. Scenario: a contributor's DB landed
|
SCHEMA_VERSION``. Scenario: a contributor's DB landed at
|
||||||
at schema_version=N from a partial migration (crash mid-DDL, parallel
|
``schema_version=N`` from a partial migration (crash mid-DDL,
|
||||||
WIP branch with a different table set, etc.), but the on-disk file is
|
parallel WIP branch with a different table set, etc.), but the
|
||||||
missing tables this binary expects. Without this pass, the migration
|
on-disk file is missing tables this binary expects. Without this
|
||||||
block below skips because ``current >= SCHEMA_VERSION`` and every
|
pass, the migration block below skips because we don't downgrade,
|
||||||
runtime query against the missing table crashes.
|
and every runtime query against the missing table crashes.
|
||||||
|
|
||||||
Because ``_SYSTEM_SCHEMA`` is all ``CREATE TABLE IF NOT EXISTS``,
|
Because ``_SYSTEM_SCHEMA`` is all ``CREATE TABLE IF NOT EXISTS``,
|
||||||
running it unconditionally is idempotent: existing tables stay
|
running it is idempotent: existing tables stay untouched (columns +
|
||||||
untouched (columns + data preserved), missing tables get created.
|
data preserved), missing tables get created. Cost: dozens of no-op
|
||||||
Cost: dozens of no-op DDLs per process start.
|
DDLs per process start.
|
||||||
"""
|
|
||||||
conn.execute(_SYSTEM_SCHEMA)
|
|
||||||
|
|
||||||
|
The self-heal explicitly does NOT run on the ``current <
|
||||||
|
SCHEMA_VERSION`` path so the pre-migration snapshot taken inside
|
||||||
|
that branch captures a true point-in-time state of the on-disk DB
|
||||||
|
*before* any DDL runs — operators reading the snapshot for rollback
|
||||||
|
debugging see exactly the tables the old schema had, not the
|
||||||
|
binary's full table set with extras tacked on.
|
||||||
|
"""
|
||||||
current = get_schema_version(conn)
|
current = get_schema_version(conn)
|
||||||
|
if current >= SCHEMA_VERSION:
|
||||||
|
# Split-brain or same-version safety net: heal any tables this
|
||||||
|
# binary expects that aren't on disk. Migration block skipped
|
||||||
|
# because we don't downgrade — the version row is left at
|
||||||
|
# ``current`` so a later binary that understands ``current``
|
||||||
|
# picks up where the split-brain left off.
|
||||||
|
conn.execute(_SYSTEM_SCHEMA)
|
||||||
if current < SCHEMA_VERSION:
|
if current < SCHEMA_VERSION:
|
||||||
# Snapshot before migration for rollback support
|
# Snapshot before migration for rollback support
|
||||||
if current > 0:
|
if current > 0:
|
||||||
|
|
|
||||||
178
tests/test_db.py
178
tests/test_db.py
|
|
@ -326,7 +326,11 @@ class TestMigrationSafety:
|
||||||
conn.close()
|
conn.close()
|
||||||
|
|
||||||
def test_future_version_is_noop(self, tmp_path, monkeypatch):
|
def test_future_version_is_noop(self, tmp_path, monkeypatch):
|
||||||
"""_ensure_schema does nothing when schema_version > SCHEMA_VERSION."""
|
"""``_ensure_schema`` does not modify ``schema_version`` when it's
|
||||||
|
already past ``SCHEMA_VERSION``. The unconditional ``_SYSTEM_SCHEMA``
|
||||||
|
self-heal pass *does* run on the future-version DB — it's all
|
||||||
|
``CREATE TABLE IF NOT EXISTS``, so tables this binary expects get
|
||||||
|
materialized — but the version row stays put."""
|
||||||
monkeypatch.setenv("DATA_DIR", str(tmp_path))
|
monkeypatch.setenv("DATA_DIR", str(tmp_path))
|
||||||
import duckdb as _duckdb
|
import duckdb as _duckdb
|
||||||
from src.db import _ensure_schema, get_schema_version
|
from src.db import _ensure_schema, get_schema_version
|
||||||
|
|
@ -344,6 +348,178 @@ class TestMigrationSafety:
|
||||||
finally:
|
finally:
|
||||||
conn.close()
|
conn.close()
|
||||||
|
|
||||||
|
def test_split_brain_future_version_with_missing_tables_self_heals(
|
||||||
|
self, tmp_path, monkeypatch,
|
||||||
|
):
|
||||||
|
"""Regression for a shared dev-VM split-brain incident.
|
||||||
|
|
||||||
|
Shape: a contributor experiments with a future-schema branch that
|
||||||
|
bumps the DB to ``schema_version=N`` (N > current binary's
|
||||||
|
``SCHEMA_VERSION``) with its own table layout, then switches or
|
||||||
|
rebases back to the released binary. The on-disk DB is on a
|
||||||
|
version this binary doesn't understand and is missing tables this
|
||||||
|
binary's code expects. Without self-heal, every query against the
|
||||||
|
missing table crashes at runtime — the migration block correctly
|
||||||
|
skips (we don't downgrade), but nothing creates the missing
|
||||||
|
tables either.
|
||||||
|
|
||||||
|
The contract this test pins: the gated
|
||||||
|
``conn.execute(_SYSTEM_SCHEMA)`` call (run when ``current >=
|
||||||
|
SCHEMA_VERSION``) materializes any missing tables *and* leaves
|
||||||
|
the future-version ``schema_version`` row untouched. We
|
||||||
|
synthesize a v99 DB whose only table is ``schema_version``,
|
||||||
|
then assert that running ``_ensure_schema`` creates the v13-era
|
||||||
|
core tables that the binary needs (``user_groups``,
|
||||||
|
``user_group_members``, ``resource_grants``, ``users``) while
|
||||||
|
keeping the version at 99.
|
||||||
|
"""
|
||||||
|
monkeypatch.setenv("DATA_DIR", str(tmp_path))
|
||||||
|
import duckdb as _duckdb
|
||||||
|
from src.db import _ensure_schema, get_schema_version
|
||||||
|
|
||||||
|
db_path = tmp_path / "state" / "system.duckdb"
|
||||||
|
db_path.parent.mkdir(parents=True, exist_ok=True)
|
||||||
|
conn = _duckdb.connect(str(db_path))
|
||||||
|
try:
|
||||||
|
# Synthesize an "old binary on a future-schema DB" state: only
|
||||||
|
# the schema_version table exists (no current-schema tables,
|
||||||
|
# no lab tables either — matches the exact shape seen after
|
||||||
|
# a lab migration ran but the binary then rolled back to one
|
||||||
|
# that doesn't know the lab schema).
|
||||||
|
conn.execute(
|
||||||
|
"CREATE TABLE schema_version (version INTEGER, applied_at TIMESTAMP DEFAULT current_timestamp);"
|
||||||
|
"INSERT INTO schema_version (version) VALUES (99);"
|
||||||
|
)
|
||||||
|
|
||||||
|
# Sanity: the v13-era tables we expect the self-heal pass to
|
||||||
|
# create are NOT there before the call. Picked from the
|
||||||
|
# post-RBAC-v13 / post-marketplace surface so a future
|
||||||
|
# rename/drop in src/db.py fails this test loudly.
|
||||||
|
expected_tables = {
|
||||||
|
"users",
|
||||||
|
"user_groups",
|
||||||
|
"user_group_members",
|
||||||
|
"resource_grants",
|
||||||
|
}
|
||||||
|
tables_before = {
|
||||||
|
r[0]
|
||||||
|
for r in conn.execute(
|
||||||
|
"SELECT table_name FROM information_schema.tables "
|
||||||
|
"WHERE table_schema = ?",
|
||||||
|
["main"],
|
||||||
|
).fetchall()
|
||||||
|
}
|
||||||
|
assert not (expected_tables & tables_before), (
|
||||||
|
"fixture started with a non-empty schema; expected only "
|
||||||
|
"schema_version to be present"
|
||||||
|
)
|
||||||
|
|
||||||
|
_ensure_schema(conn)
|
||||||
|
|
||||||
|
# After: every expected table exists (self-heal worked) AND
|
||||||
|
# the version row stays at the future value.
|
||||||
|
tables_after = {
|
||||||
|
r[0]
|
||||||
|
for r in conn.execute(
|
||||||
|
"SELECT table_name FROM information_schema.tables "
|
||||||
|
"WHERE table_schema = ?",
|
||||||
|
["main"],
|
||||||
|
).fetchall()
|
||||||
|
}
|
||||||
|
missing = expected_tables - tables_after
|
||||||
|
assert not missing, (
|
||||||
|
f"self-heal must create v13-era tables on a future-version DB, "
|
||||||
|
f"missing: {sorted(missing)}"
|
||||||
|
)
|
||||||
|
|
||||||
|
# The future-version contract still holds: version row untouched.
|
||||||
|
assert get_schema_version(conn) == 99
|
||||||
|
finally:
|
||||||
|
conn.close()
|
||||||
|
|
||||||
|
def test_pre_migration_snapshot_excludes_post_self_heal_tables(
|
||||||
|
self, tmp_path, monkeypatch,
|
||||||
|
):
|
||||||
|
"""The pre-migration snapshot must capture the on-disk DB state
|
||||||
|
*before* any DDL runs, so operators reading the snapshot for
|
||||||
|
rollback debugging see the old schema as it actually was — not
|
||||||
|
the binary's full table set with extras tacked on.
|
||||||
|
|
||||||
|
Regression for the original hoist in 0.12.0: ``_SYSTEM_SCHEMA``
|
||||||
|
was unconditionally executed at the top of ``_ensure_schema``,
|
||||||
|
ahead of the snapshot copy in the migration block. On a v2→vN
|
||||||
|
migration, ``view_ownership`` / ``user_groups`` /
|
||||||
|
``resource_grants`` (and every other table the modern binary
|
||||||
|
adds) were created first, then ``CHECKPOINT`` flushed them to
|
||||||
|
disk, and ``shutil.copy2`` copied the already-modified file as
|
||||||
|
the "pre-migration" snapshot. Functionally rollback still
|
||||||
|
worked (extra empty tables are harmless), but the snapshot was
|
||||||
|
misleading. Fix: gate the self-heal call on ``current >=
|
||||||
|
SCHEMA_VERSION`` so the migration path takes its snapshot
|
||||||
|
before any DDL touches the DB.
|
||||||
|
"""
|
||||||
|
from src.db import (
|
||||||
|
SCHEMA_VERSION,
|
||||||
|
_ensure_schema,
|
||||||
|
get_schema_version,
|
||||||
|
get_system_db,
|
||||||
|
)
|
||||||
|
|
||||||
|
# Bootstrap a v2 DB on disk, then trigger the migration ladder.
|
||||||
|
db_path = tmp_path / "state" / "system.duckdb"
|
||||||
|
self._create_v2_db(db_path)
|
||||||
|
monkeypatch.setenv("DATA_DIR", str(tmp_path))
|
||||||
|
|
||||||
|
conn = get_system_db()
|
||||||
|
try:
|
||||||
|
assert get_schema_version(conn) == SCHEMA_VERSION
|
||||||
|
finally:
|
||||||
|
conn.close()
|
||||||
|
# Drop the cached connection so the snapshot file isn't
|
||||||
|
# locked when we re-open it.
|
||||||
|
from src import db as _db
|
||||||
|
_db._system_db_conn = None
|
||||||
|
_db._system_db_path = None
|
||||||
|
|
||||||
|
snapshot = tmp_path / "state" / "system.duckdb.pre-migrate"
|
||||||
|
assert snapshot.exists(), (
|
||||||
|
"fixture precondition: snapshot must be written for a v2→vN "
|
||||||
|
"migration"
|
||||||
|
)
|
||||||
|
|
||||||
|
import duckdb as _duckdb
|
||||||
|
snap = _duckdb.connect(str(snapshot), read_only=True)
|
||||||
|
try:
|
||||||
|
tables_in_snapshot = {
|
||||||
|
r[0]
|
||||||
|
for r in snap.execute(
|
||||||
|
"SELECT table_name FROM information_schema.tables "
|
||||||
|
"WHERE table_schema = 'main'"
|
||||||
|
).fetchall()
|
||||||
|
}
|
||||||
|
finally:
|
||||||
|
snap.close()
|
||||||
|
|
||||||
|
# Tables NOT present in the v2 fixture but added by later
|
||||||
|
# migrations (and therefore created by _SYSTEM_SCHEMA on the
|
||||||
|
# modern binary). If any of these leaked into the snapshot, the
|
||||||
|
# snapshot was contaminated by a self-heal pass running before
|
||||||
|
# the snapshot copy.
|
||||||
|
post_v2_tables = {
|
||||||
|
"view_ownership", # v10 (#100)
|
||||||
|
"marketplace_registry", # v11
|
||||||
|
"marketplace_plugins", # v11
|
||||||
|
"user_groups", # v11+ / v13
|
||||||
|
"user_group_members", # v13 (#106)
|
||||||
|
"resource_grants", # v13 (#106)
|
||||||
|
}
|
||||||
|
leaked = post_v2_tables & tables_in_snapshot
|
||||||
|
assert not leaked, (
|
||||||
|
f"pre-migration snapshot was contaminated with post-v2 "
|
||||||
|
f"tables — self-heal pass ran before the snapshot copy. "
|
||||||
|
f"Leaked: {sorted(leaked)}"
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
class TestSchemaV4:
|
class TestSchemaV4:
|
||||||
"""Tests for v4 schema additions: metric_definitions and column_metadata tables."""
|
"""Tests for v4 schema additions: metric_definitions and column_metadata tables."""
|
||||||
|
|
|
||||||
Loading…
Reference in a new issue