## Summary
Brings the Keboola connector to feature parity with the legacy internal data-analyst's per-table sync strategies. Closes the four documented gaps from the spec branch (`zs/keboola-connector-specs`):
- **Typed parquet** in the legacy SDK extraction path — column types from Keboola Storage metadata (provider cascade `user > ai-metadata-enrichment > keboola.snowflake-transformation`) survive the CSV → parquet roundtrip; invalid date strings (`'0000-00-00'`) and invalid numeric strings (`'Non-Manager'`) become NULL while keeping the column's typed schema. Pre-fix everything was VARCHAR.
- **Incremental sync** via Storage API `changedSince` — opt-in per table; pulls only delta rows, merges into the existing parquet by `primary_key` (drop_duplicates with keep='last'). Cuts daily extraction from O(full table) to O(delta).
- **Partitioned sync** — flat per-partition layout `data/<table>/<key>.parquet` (e.g. `2026_05.parquet`), per-affected-partition merge for daily updates, chunked initial load with 1-day overlap and 2-empty-chunk stop heuristic.
- **`where_filters`** — server-side row filter with date placeholders (`{{today}}`, `{{last_3_months}}`, `{{start_of_3_months_ago}}`, etc.) resolved at sync time. Force the SDK path; reject `incremental + where_filters` combination at API layer (changedSince already filters temporally).
## Architecture
- **Schema migration v25 → v26**: 7 new columns on `table_registry`. Existing `sync_strategy` column reused (pre-v26 it was inert catalog metadata; post-v26 the extractor dispatches off it).
- **Per-table dispatcher** in `extractor.run()` routes to one of `_extract_via_extension` (full_refresh + extension), `_extract_via_legacy` (full_refresh + filters or extension fallback), `extract_incremental`, or `extract_partitioned`.
- **API conflict policy**: `incremental + where_filters` → 422; `partitioned + query_mode='remote'` → 422; `partitioned ⇒ partition_by required`.
- **Admin UI**: third "Direct extract (Storage API)" radio in the Keboola Register / Edit modals, alongside existing "Whole table (extension)" and "Custom SQL". When selected, exposes a v26 sync-strategy panel with conditional fields per strategy.
## Test plan
- [x] **Unit + module** — 134 v26 tests covering migration, repo, parquet_io, where_filters, incremental (compute_changed_since + merge_parquet + extract_incremental E2E), partitioned (key derivation + merge_partition + chunked windows + extract_partitioned E2E), extractor dispatcher, admin API validators, PUT field clearing, registry-shape → dispatcher bridge
- [x] **HTML form structure** — all v26 inputs + visibility classes + JS payload fields verified in rendered template
- [x] **Real Keboola roundtrip** — registered a small test table as `sync_strategy='incremental'` against a test Storage project, triggered two syncs:
- Sync 1: `changedSince=None` → full pull → 9 rows typed parquet
- Sync 2: `changedSince=last_sync - 1d window` → 9 delta rows merged with 9 existing → 9 after dedup on primary_key (PK merge confirmed)
- [x] **Browser UX** — agent-browser session against a local uvicorn: login → admin/tables → register modal → switch radios → verify field visibility per strategy → submit → edit existing row → switch to Direct/Incremental → save → confirm DB persistence
- [x] **Regression** — no regressions in the broader 3252-test suite (3 pre-v26 tests updated for the deprecation-marker removal + schema-version bump; 2 pre-existing environment-sensitive test failures unrelated to this change)
## Bugs caught + fixed during E2E
The browser + real-Keboola roundtrip exposed four bugs the unit tests missed:
1. **JS visibility race** — two competing `forEach` loops set `display=''` then `display='none'` on form elements sharing `kb-strategy-incremental kb-strategy-partitioned` classes (window_days + max_history_days are reused across strategies). Fix: single-pass selector with class-based visibility resolver.
2. **PUT cannot clear field** — pre-v26 `updates = {k: v ... if v is not None}` collapsed "omitted from body" and "sent as null" into the same case, so admin couldn't switch a partitioned row back to full_refresh and have stale `partition_by` clear. Fix: `model_dump(exclude_unset=True)`.
3. **Subprocess DB lock conflict** — `_read_last_sync` reopened `system.duckdb` while the parent server held the write lock (subprocess contract at `app/api/sync.py:_run_sync` line 260). Fix: parent injects `__last_sync__` into table_config before subprocess spawn.
4. **Wrong KBC table_id** — `extract_incremental` / `extract_partitioned` built the Storage API table_id from the registry row's slugified `id` (`circle_inc`) instead of `bucket.source_table` (`in.c-finance.circle`), producing 404s. Fix: prefer `bucket+source_table`; fall back to `id` only when bucket empty.
## Operator notes
- Existing tables stay on `full_refresh` after migration; admins opt individual tables in via `agnes admin register-table --sync-strategy ...`, the Keboola Edit modal, or `POST/PUT /api/admin/registry`.
- `merge_parquet` and `merge_partition` use `pd.concat + drop_duplicates`, loading both existing and delta into pandas RAM. For tables in the multi-million-row range this may OOM — switch to `partitioned` strategy for those (per-partition merge keeps memory bounded). Documented in `### Internal` of the changelog entry.
- Date placeholders are resolved at **sync time**, not register time — a typo'd `{{lasst_week}}` is accepted at register and surfaces only when the next sync runs. By design (rolling windows need late-binding).
## Spec source
The four corresponding plans on the `zs/keboola-connector-specs` branch under `docs/superpowers/plans/2026-05-07-0[1-4]-*.md` capture the design rationale and link back to internal repo references for each subsystem.
<!-- devin-review-badge-begin -->
---
<a href="https://app.devin.ai/review/keboola/agnes-the-ai-analyst/pull/217" target="_blank">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1">
<img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open in Devin Review">
</picture>
</a>
<!-- devin-review-badge-end -->
197 lines
7.6 KiB
Python
197 lines
7.6 KiB
Python
"""WAL-replay auto-recovery for ``system.duckdb``.
|
|
|
|
Reproduces the production failure observed during PR #217's v27
|
|
rollout: a container kill mid-migration leaves an unflushed
|
|
``ALTER TABLE … ADD COLUMN`` op in ``system.duckdb.wal``. On the next
|
|
start, DuckDB's ``ReplayAlter`` path raises
|
|
``INTERNAL Error: Calling DatabaseManager::GetDefaultDatabase with no
|
|
default database set`` and the system database becomes unrecoverable
|
|
from the running binary — the operator has to restore from the
|
|
pre-migrate snapshot by hand.
|
|
|
|
The fix is two-pronged:
|
|
1. ``_ensure_schema`` runs ``CHECKPOINT`` immediately after the
|
|
migration ladder so a fresh ALTER doesn't sit in the WAL beyond
|
|
the migration window. Tested implicitly by every migration test
|
|
that survives a process restart between fixture runs (covered by
|
|
the existing v25→v26→v27 tests).
|
|
2. ``_try_open_system_db`` catches the WAL-replay error class and
|
|
falls back to ``system.duckdb.pre-migrate``. That's the path
|
|
this file exercises.
|
|
"""
|
|
from __future__ import annotations
|
|
|
|
import shutil
|
|
from pathlib import Path
|
|
from unittest.mock import patch
|
|
|
|
import duckdb
|
|
import pytest
|
|
|
|
|
|
@pytest.fixture
|
|
def state_dir(monkeypatch, tmp_path):
|
|
"""Point DATA_DIR at a fresh tmp dir so each test gets its own
|
|
state/system.duckdb without bleed."""
|
|
data = tmp_path / "data"
|
|
(data / "state").mkdir(parents=True)
|
|
monkeypatch.setenv("DATA_DIR", str(data))
|
|
# Clear any cached connection from earlier tests in the same process.
|
|
import src.db as db_mod
|
|
db_mod._system_db_conn = None
|
|
db_mod._system_db_path = None
|
|
yield data / "state"
|
|
db_mod._system_db_conn = None
|
|
db_mod._system_db_path = None
|
|
|
|
|
|
def test_recovery_restores_pre_migrate_snapshot_on_wal_replay_error(
|
|
state_dir,
|
|
):
|
|
"""Simulate a corrupted DB + valid pre-migrate snapshot. Open path
|
|
must restore from the snapshot, leave the broken DB aside (with
|
|
`.broken.<ts>` suffix for post-mortem), drop the broken WAL, and
|
|
return a working connection. The migration ladder runs once on the
|
|
restored DB to land back at the current SCHEMA_VERSION."""
|
|
from src.db import _try_open_system_db, _ensure_schema, SCHEMA_VERSION
|
|
|
|
db_path = state_dir / "system.duckdb"
|
|
snapshot_path = state_dir / "system.duckdb.pre-migrate"
|
|
|
|
# 1. Create a clean valid DB at snapshot_path. This stands in for
|
|
# the snapshot taken at the start of the most recent migration.
|
|
seed_conn = duckdb.connect(str(snapshot_path))
|
|
_ensure_schema(seed_conn)
|
|
seed_conn.close()
|
|
assert snapshot_path.exists(), "fixture setup: snapshot must exist"
|
|
|
|
# 2. Plant a fake "broken" DB at db_path: simply copy the snapshot
|
|
# and add a .wal sentinel so the restore path moves both aside.
|
|
shutil.copy2(str(snapshot_path), str(db_path))
|
|
wal_path = Path(str(db_path) + ".wal")
|
|
wal_path.write_bytes(b"FAKE_WAL_CONTENT")
|
|
|
|
# 3. Make `duckdb.connect(db_path)` raise the WAL-replay error on
|
|
# the FIRST call only. The auto-recovery's second call (after
|
|
# snapshot restore) must succeed.
|
|
real_connect = duckdb.connect
|
|
call_count = {"n": 0}
|
|
fake_error = duckdb.Error(
|
|
"INTERNAL Error: Failure while replaying WAL file: "
|
|
"Calling DatabaseManager::GetDefaultDatabase with no default "
|
|
"database set"
|
|
)
|
|
|
|
def flaky_connect(path, *args, **kwargs):
|
|
if str(path) == str(db_path) and call_count["n"] == 0:
|
|
call_count["n"] += 1
|
|
raise fake_error
|
|
return real_connect(path, *args, **kwargs)
|
|
|
|
with patch("src.db.duckdb.connect", side_effect=flaky_connect):
|
|
conn = _try_open_system_db(str(db_path))
|
|
|
|
# 4. The returned connection is usable.
|
|
ver = conn.execute(
|
|
"SELECT version FROM schema_version ORDER BY applied_at DESC LIMIT 1"
|
|
).fetchone()
|
|
assert ver is not None
|
|
assert ver[0] == SCHEMA_VERSION, (
|
|
f"recovered DB should be at the current schema version "
|
|
f"({SCHEMA_VERSION}); got {ver[0]}"
|
|
)
|
|
|
|
# 5. Broken DB and broken WAL were moved aside (kept for forensics).
|
|
all_broken = sorted(state_dir.glob("system.duckdb.broken.*"))
|
|
broken_dbs = [p for p in all_broken if not p.name.endswith(".wal")]
|
|
broken_wals = [p for p in all_broken if p.name.endswith(".wal")]
|
|
assert len(broken_dbs) == 1, all_broken
|
|
assert len(broken_wals) == 1, all_broken
|
|
|
|
# 6. The current main DB exists and the (broken) WAL beside it does
|
|
# NOT — the recovery path must drop the unflushed WAL or the
|
|
# next start would replay the same broken op.
|
|
assert db_path.exists()
|
|
assert not (state_dir / "system.duckdb.wal").exists()
|
|
|
|
|
|
def test_recovery_does_not_fire_on_unrelated_error(state_dir):
|
|
"""Recovery must be narrow — only the WAL-replay error class. A
|
|
generic ``IO Error: file is locked`` (a real corruption / permission
|
|
case) must propagate so an operator notices instead of silently
|
|
losing whatever's in the WAL by overwriting from snapshot."""
|
|
from src.db import _try_open_system_db
|
|
|
|
db_path = state_dir / "system.duckdb"
|
|
db_path.write_bytes(b"corrupted")
|
|
|
|
real_connect = duckdb.connect
|
|
unrelated_error = duckdb.Error(
|
|
"IO Error: file is locked by another process"
|
|
)
|
|
|
|
def always_unrelated(path, *args, **kwargs):
|
|
if str(path) == str(db_path):
|
|
raise unrelated_error
|
|
return real_connect(path, *args, **kwargs)
|
|
|
|
with patch("src.db.duckdb.connect", side_effect=always_unrelated):
|
|
with pytest.raises(duckdb.Error, match="file is locked"):
|
|
_try_open_system_db(str(db_path))
|
|
|
|
|
|
def test_recovery_propagates_when_no_snapshot_exists(state_dir):
|
|
"""If the WAL-replay error fires but ``system.duckdb.pre-migrate``
|
|
is missing, recovery has nowhere to fall back. Re-raise the
|
|
original error so the operator sees what's actually wrong."""
|
|
from src.db import _try_open_system_db
|
|
|
|
db_path = state_dir / "system.duckdb"
|
|
db_path.write_bytes(b"corrupted")
|
|
# No snapshot at state_dir / "system.duckdb.pre-migrate"
|
|
|
|
real_connect = duckdb.connect
|
|
wal_error = duckdb.Error(
|
|
"INTERNAL Error: Failure while replaying WAL file: "
|
|
"Calling DatabaseManager::GetDefaultDatabase"
|
|
)
|
|
|
|
def fake(path, *args, **kwargs):
|
|
if str(path) == str(db_path):
|
|
raise wal_error
|
|
return real_connect(path, *args, **kwargs)
|
|
|
|
with patch("src.db.duckdb.connect", side_effect=fake):
|
|
with pytest.raises(duckdb.Error, match="GetDefaultDatabase"):
|
|
_try_open_system_db(str(db_path))
|
|
|
|
|
|
def test_recovery_re_raises_if_snapshot_also_broken(state_dir):
|
|
"""Edge case: snapshot exists but is itself corrupted (operator
|
|
edited it / disk error). The first recovery ``duckdb.connect``
|
|
succeeds (via the mock) so the function returns, but the second
|
|
call would still fail. We assert the re-attempted connect's error
|
|
propagates rather than being swallowed."""
|
|
from src.db import _try_open_system_db
|
|
|
|
db_path = state_dir / "system.duckdb"
|
|
snapshot_path = state_dir / "system.duckdb.pre-migrate"
|
|
db_path.write_bytes(b"corrupted")
|
|
snapshot_path.write_bytes(b"also-corrupted")
|
|
|
|
wal_error = duckdb.Error(
|
|
"INTERNAL Error: ReplayAlter failed"
|
|
)
|
|
snapshot_error = duckdb.Error("IO Error: malformed database file")
|
|
|
|
call_count = {"n": 0}
|
|
|
|
def fake(path, *args, **kwargs):
|
|
call_count["n"] += 1
|
|
if call_count["n"] == 1:
|
|
raise wal_error
|
|
raise snapshot_error
|
|
|
|
with patch("src.db.duckdb.connect", side_effect=fake):
|
|
with pytest.raises(duckdb.Error, match="malformed"):
|
|
_try_open_system_db(str(db_path))
|