agnes-the-ai-analyst/tests/test_db_wal_recovery.py
ZdenekSrotyr 506a378c3a
release: 0.47.1 — Keboola connector v27 (incremental, partitioned, where_filters, typed parquet) (#217)
## Summary

Brings the Keboola connector to feature parity with the legacy internal data-analyst's per-table sync strategies. Closes the four documented gaps from the spec branch (`zs/keboola-connector-specs`):

- **Typed parquet** in the legacy SDK extraction path — column types from Keboola Storage metadata (provider cascade `user > ai-metadata-enrichment > keboola.snowflake-transformation`) survive the CSV → parquet roundtrip; invalid date strings (`'0000-00-00'`) and invalid numeric strings (`'Non-Manager'`) become NULL while keeping the column's typed schema. Pre-fix everything was VARCHAR.
- **Incremental sync** via Storage API `changedSince` — opt-in per table; pulls only delta rows, merges into the existing parquet by `primary_key` (drop_duplicates with keep='last'). Cuts daily extraction from O(full table) to O(delta).
- **Partitioned sync** — flat per-partition layout `data/<table>/<key>.parquet` (e.g. `2026_05.parquet`), per-affected-partition merge for daily updates, chunked initial load with 1-day overlap and 2-empty-chunk stop heuristic.
- **`where_filters`** — server-side row filter with date placeholders (`{{today}}`, `{{last_3_months}}`, `{{start_of_3_months_ago}}`, etc.) resolved at sync time. Force the SDK path; reject `incremental + where_filters` combination at API layer (changedSince already filters temporally).

## Architecture

- **Schema migration v25 → v26**: 7 new columns on `table_registry`. Existing `sync_strategy` column reused (pre-v26 it was inert catalog metadata; post-v26 the extractor dispatches off it).
- **Per-table dispatcher** in `extractor.run()` routes to one of `_extract_via_extension` (full_refresh + extension), `_extract_via_legacy` (full_refresh + filters or extension fallback), `extract_incremental`, or `extract_partitioned`.
- **API conflict policy**: `incremental + where_filters` → 422; `partitioned + query_mode='remote'` → 422; `partitioned ⇒ partition_by required`.
- **Admin UI**: third "Direct extract (Storage API)" radio in the Keboola Register / Edit modals, alongside existing "Whole table (extension)" and "Custom SQL". When selected, exposes a v26 sync-strategy panel with conditional fields per strategy.

## Test plan

- [x] **Unit + module** — 134 v26 tests covering migration, repo, parquet_io, where_filters, incremental (compute_changed_since + merge_parquet + extract_incremental E2E), partitioned (key derivation + merge_partition + chunked windows + extract_partitioned E2E), extractor dispatcher, admin API validators, PUT field clearing, registry-shape → dispatcher bridge
- [x] **HTML form structure** — all v26 inputs + visibility classes + JS payload fields verified in rendered template
- [x] **Real Keboola roundtrip** — registered a small test table as `sync_strategy='incremental'` against a test Storage project, triggered two syncs:
  - Sync 1: `changedSince=None` → full pull → 9 rows typed parquet
  - Sync 2: `changedSince=last_sync - 1d window` → 9 delta rows merged with 9 existing → 9 after dedup on primary_key (PK merge confirmed)
- [x] **Browser UX** — agent-browser session against a local uvicorn: login → admin/tables → register modal → switch radios → verify field visibility per strategy → submit → edit existing row → switch to Direct/Incremental → save → confirm DB persistence
- [x] **Regression** — no regressions in the broader 3252-test suite (3 pre-v26 tests updated for the deprecation-marker removal + schema-version bump; 2 pre-existing environment-sensitive test failures unrelated to this change)

## Bugs caught + fixed during E2E

The browser + real-Keboola roundtrip exposed four bugs the unit tests missed:

1. **JS visibility race** — two competing `forEach` loops set `display=''` then `display='none'` on form elements sharing `kb-strategy-incremental kb-strategy-partitioned` classes (window_days + max_history_days are reused across strategies). Fix: single-pass selector with class-based visibility resolver.
2. **PUT cannot clear field** — pre-v26 `updates = {k: v ... if v is not None}` collapsed "omitted from body" and "sent as null" into the same case, so admin couldn't switch a partitioned row back to full_refresh and have stale `partition_by` clear. Fix: `model_dump(exclude_unset=True)`.
3. **Subprocess DB lock conflict** — `_read_last_sync` reopened `system.duckdb` while the parent server held the write lock (subprocess contract at `app/api/sync.py:_run_sync` line 260). Fix: parent injects `__last_sync__` into table_config before subprocess spawn.
4. **Wrong KBC table_id** — `extract_incremental` / `extract_partitioned` built the Storage API table_id from the registry row's slugified `id` (`circle_inc`) instead of `bucket.source_table` (`in.c-finance.circle`), producing 404s. Fix: prefer `bucket+source_table`; fall back to `id` only when bucket empty.

## Operator notes

- Existing tables stay on `full_refresh` after migration; admins opt individual tables in via `agnes admin register-table --sync-strategy ...`, the Keboola Edit modal, or `POST/PUT /api/admin/registry`.
- `merge_parquet` and `merge_partition` use `pd.concat + drop_duplicates`, loading both existing and delta into pandas RAM. For tables in the multi-million-row range this may OOM — switch to `partitioned` strategy for those (per-partition merge keeps memory bounded). Documented in `### Internal` of the changelog entry.
- Date placeholders are resolved at **sync time**, not register time — a typo'd `{{lasst_week}}` is accepted at register and surfaces only when the next sync runs. By design (rolling windows need late-binding).

## Spec source

The four corresponding plans on the `zs/keboola-connector-specs` branch under `docs/superpowers/plans/2026-05-07-0[1-4]-*.md` capture the design rationale and link back to internal repo references for each subsystem.
<!-- devin-review-badge-begin -->

---

<a href="https://app.devin.ai/review/keboola/agnes-the-ai-analyst/pull/217" target="_blank">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1">
    <img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open in Devin Review">
  </picture>
</a>
<!-- devin-review-badge-end -->
2026-05-07 19:01:27 +02:00

197 lines
7.6 KiB
Python

"""WAL-replay auto-recovery for ``system.duckdb``.
Reproduces the production failure observed during PR #217's v27
rollout: a container kill mid-migration leaves an unflushed
``ALTER TABLE … ADD COLUMN`` op in ``system.duckdb.wal``. On the next
start, DuckDB's ``ReplayAlter`` path raises
``INTERNAL Error: Calling DatabaseManager::GetDefaultDatabase with no
default database set`` and the system database becomes unrecoverable
from the running binary — the operator has to restore from the
pre-migrate snapshot by hand.
The fix is two-pronged:
1. ``_ensure_schema`` runs ``CHECKPOINT`` immediately after the
migration ladder so a fresh ALTER doesn't sit in the WAL beyond
the migration window. Tested implicitly by every migration test
that survives a process restart between fixture runs (covered by
the existing v25→v26→v27 tests).
2. ``_try_open_system_db`` catches the WAL-replay error class and
falls back to ``system.duckdb.pre-migrate``. That's the path
this file exercises.
"""
from __future__ import annotations
import shutil
from pathlib import Path
from unittest.mock import patch
import duckdb
import pytest
@pytest.fixture
def state_dir(monkeypatch, tmp_path):
"""Point DATA_DIR at a fresh tmp dir so each test gets its own
state/system.duckdb without bleed."""
data = tmp_path / "data"
(data / "state").mkdir(parents=True)
monkeypatch.setenv("DATA_DIR", str(data))
# Clear any cached connection from earlier tests in the same process.
import src.db as db_mod
db_mod._system_db_conn = None
db_mod._system_db_path = None
yield data / "state"
db_mod._system_db_conn = None
db_mod._system_db_path = None
def test_recovery_restores_pre_migrate_snapshot_on_wal_replay_error(
state_dir,
):
"""Simulate a corrupted DB + valid pre-migrate snapshot. Open path
must restore from the snapshot, leave the broken DB aside (with
`.broken.<ts>` suffix for post-mortem), drop the broken WAL, and
return a working connection. The migration ladder runs once on the
restored DB to land back at the current SCHEMA_VERSION."""
from src.db import _try_open_system_db, _ensure_schema, SCHEMA_VERSION
db_path = state_dir / "system.duckdb"
snapshot_path = state_dir / "system.duckdb.pre-migrate"
# 1. Create a clean valid DB at snapshot_path. This stands in for
# the snapshot taken at the start of the most recent migration.
seed_conn = duckdb.connect(str(snapshot_path))
_ensure_schema(seed_conn)
seed_conn.close()
assert snapshot_path.exists(), "fixture setup: snapshot must exist"
# 2. Plant a fake "broken" DB at db_path: simply copy the snapshot
# and add a .wal sentinel so the restore path moves both aside.
shutil.copy2(str(snapshot_path), str(db_path))
wal_path = Path(str(db_path) + ".wal")
wal_path.write_bytes(b"FAKE_WAL_CONTENT")
# 3. Make `duckdb.connect(db_path)` raise the WAL-replay error on
# the FIRST call only. The auto-recovery's second call (after
# snapshot restore) must succeed.
real_connect = duckdb.connect
call_count = {"n": 0}
fake_error = duckdb.Error(
"INTERNAL Error: Failure while replaying WAL file: "
"Calling DatabaseManager::GetDefaultDatabase with no default "
"database set"
)
def flaky_connect(path, *args, **kwargs):
if str(path) == str(db_path) and call_count["n"] == 0:
call_count["n"] += 1
raise fake_error
return real_connect(path, *args, **kwargs)
with patch("src.db.duckdb.connect", side_effect=flaky_connect):
conn = _try_open_system_db(str(db_path))
# 4. The returned connection is usable.
ver = conn.execute(
"SELECT version FROM schema_version ORDER BY applied_at DESC LIMIT 1"
).fetchone()
assert ver is not None
assert ver[0] == SCHEMA_VERSION, (
f"recovered DB should be at the current schema version "
f"({SCHEMA_VERSION}); got {ver[0]}"
)
# 5. Broken DB and broken WAL were moved aside (kept for forensics).
all_broken = sorted(state_dir.glob("system.duckdb.broken.*"))
broken_dbs = [p for p in all_broken if not p.name.endswith(".wal")]
broken_wals = [p for p in all_broken if p.name.endswith(".wal")]
assert len(broken_dbs) == 1, all_broken
assert len(broken_wals) == 1, all_broken
# 6. The current main DB exists and the (broken) WAL beside it does
# NOT — the recovery path must drop the unflushed WAL or the
# next start would replay the same broken op.
assert db_path.exists()
assert not (state_dir / "system.duckdb.wal").exists()
def test_recovery_does_not_fire_on_unrelated_error(state_dir):
"""Recovery must be narrow — only the WAL-replay error class. A
generic ``IO Error: file is locked`` (a real corruption / permission
case) must propagate so an operator notices instead of silently
losing whatever's in the WAL by overwriting from snapshot."""
from src.db import _try_open_system_db
db_path = state_dir / "system.duckdb"
db_path.write_bytes(b"corrupted")
real_connect = duckdb.connect
unrelated_error = duckdb.Error(
"IO Error: file is locked by another process"
)
def always_unrelated(path, *args, **kwargs):
if str(path) == str(db_path):
raise unrelated_error
return real_connect(path, *args, **kwargs)
with patch("src.db.duckdb.connect", side_effect=always_unrelated):
with pytest.raises(duckdb.Error, match="file is locked"):
_try_open_system_db(str(db_path))
def test_recovery_propagates_when_no_snapshot_exists(state_dir):
"""If the WAL-replay error fires but ``system.duckdb.pre-migrate``
is missing, recovery has nowhere to fall back. Re-raise the
original error so the operator sees what's actually wrong."""
from src.db import _try_open_system_db
db_path = state_dir / "system.duckdb"
db_path.write_bytes(b"corrupted")
# No snapshot at state_dir / "system.duckdb.pre-migrate"
real_connect = duckdb.connect
wal_error = duckdb.Error(
"INTERNAL Error: Failure while replaying WAL file: "
"Calling DatabaseManager::GetDefaultDatabase"
)
def fake(path, *args, **kwargs):
if str(path) == str(db_path):
raise wal_error
return real_connect(path, *args, **kwargs)
with patch("src.db.duckdb.connect", side_effect=fake):
with pytest.raises(duckdb.Error, match="GetDefaultDatabase"):
_try_open_system_db(str(db_path))
def test_recovery_re_raises_if_snapshot_also_broken(state_dir):
"""Edge case: snapshot exists but is itself corrupted (operator
edited it / disk error). The first recovery ``duckdb.connect``
succeeds (via the mock) so the function returns, but the second
call would still fail. We assert the re-attempted connect's error
propagates rather than being swallowed."""
from src.db import _try_open_system_db
db_path = state_dir / "system.duckdb"
snapshot_path = state_dir / "system.duckdb.pre-migrate"
db_path.write_bytes(b"corrupted")
snapshot_path.write_bytes(b"also-corrupted")
wal_error = duckdb.Error(
"INTERNAL Error: ReplayAlter failed"
)
snapshot_error = duckdb.Error("IO Error: malformed database file")
call_count = {"n": 0}
def fake(path, *args, **kwargs):
call_count["n"] += 1
if call_count["n"] == 1:
raise wal_error
raise snapshot_error
with patch("src.db.duckdb.connect", side_effect=fake):
with pytest.raises(duckdb.Error, match="malformed"):
_try_open_system_db(str(db_path))