## Summary
Brings the Keboola connector to feature parity with the legacy internal data-analyst's per-table sync strategies. Closes the four documented gaps from the spec branch (`zs/keboola-connector-specs`):
- **Typed parquet** in the legacy SDK extraction path — column types from Keboola Storage metadata (provider cascade `user > ai-metadata-enrichment > keboola.snowflake-transformation`) survive the CSV → parquet roundtrip; invalid date strings (`'0000-00-00'`) and invalid numeric strings (`'Non-Manager'`) become NULL while keeping the column's typed schema. Pre-fix everything was VARCHAR.
- **Incremental sync** via Storage API `changedSince` — opt-in per table; pulls only delta rows, merges into the existing parquet by `primary_key` (drop_duplicates with keep='last'). Cuts daily extraction from O(full table) to O(delta).
- **Partitioned sync** — flat per-partition layout `data/<table>/<key>.parquet` (e.g. `2026_05.parquet`), per-affected-partition merge for daily updates, chunked initial load with 1-day overlap and 2-empty-chunk stop heuristic.
- **`where_filters`** — server-side row filter with date placeholders (`{{today}}`, `{{last_3_months}}`, `{{start_of_3_months_ago}}`, etc.) resolved at sync time. Force the SDK path; reject `incremental + where_filters` combination at API layer (changedSince already filters temporally).
## Architecture
- **Schema migration v25 → v26**: 7 new columns on `table_registry`. Existing `sync_strategy` column reused (pre-v26 it was inert catalog metadata; post-v26 the extractor dispatches off it).
- **Per-table dispatcher** in `extractor.run()` routes to one of `_extract_via_extension` (full_refresh + extension), `_extract_via_legacy` (full_refresh + filters or extension fallback), `extract_incremental`, or `extract_partitioned`.
- **API conflict policy**: `incremental + where_filters` → 422; `partitioned + query_mode='remote'` → 422; `partitioned ⇒ partition_by required`.
- **Admin UI**: third "Direct extract (Storage API)" radio in the Keboola Register / Edit modals, alongside existing "Whole table (extension)" and "Custom SQL". When selected, exposes a v26 sync-strategy panel with conditional fields per strategy.
## Test plan
- [x] **Unit + module** — 134 v26 tests covering migration, repo, parquet_io, where_filters, incremental (compute_changed_since + merge_parquet + extract_incremental E2E), partitioned (key derivation + merge_partition + chunked windows + extract_partitioned E2E), extractor dispatcher, admin API validators, PUT field clearing, registry-shape → dispatcher bridge
- [x] **HTML form structure** — all v26 inputs + visibility classes + JS payload fields verified in rendered template
- [x] **Real Keboola roundtrip** — registered a small test table as `sync_strategy='incremental'` against a test Storage project, triggered two syncs:
- Sync 1: `changedSince=None` → full pull → 9 rows typed parquet
- Sync 2: `changedSince=last_sync - 1d window` → 9 delta rows merged with 9 existing → 9 after dedup on primary_key (PK merge confirmed)
- [x] **Browser UX** — agent-browser session against a local uvicorn: login → admin/tables → register modal → switch radios → verify field visibility per strategy → submit → edit existing row → switch to Direct/Incremental → save → confirm DB persistence
- [x] **Regression** — no regressions in the broader 3252-test suite (3 pre-v26 tests updated for the deprecation-marker removal + schema-version bump; 2 pre-existing environment-sensitive test failures unrelated to this change)
## Bugs caught + fixed during E2E
The browser + real-Keboola roundtrip exposed four bugs the unit tests missed:
1. **JS visibility race** — two competing `forEach` loops set `display=''` then `display='none'` on form elements sharing `kb-strategy-incremental kb-strategy-partitioned` classes (window_days + max_history_days are reused across strategies). Fix: single-pass selector with class-based visibility resolver.
2. **PUT cannot clear field** — pre-v26 `updates = {k: v ... if v is not None}` collapsed "omitted from body" and "sent as null" into the same case, so admin couldn't switch a partitioned row back to full_refresh and have stale `partition_by` clear. Fix: `model_dump(exclude_unset=True)`.
3. **Subprocess DB lock conflict** — `_read_last_sync` reopened `system.duckdb` while the parent server held the write lock (subprocess contract at `app/api/sync.py:_run_sync` line 260). Fix: parent injects `__last_sync__` into table_config before subprocess spawn.
4. **Wrong KBC table_id** — `extract_incremental` / `extract_partitioned` built the Storage API table_id from the registry row's slugified `id` (`circle_inc`) instead of `bucket.source_table` (`in.c-finance.circle`), producing 404s. Fix: prefer `bucket+source_table`; fall back to `id` only when bucket empty.
## Operator notes
- Existing tables stay on `full_refresh` after migration; admins opt individual tables in via `agnes admin register-table --sync-strategy ...`, the Keboola Edit modal, or `POST/PUT /api/admin/registry`.
- `merge_parquet` and `merge_partition` use `pd.concat + drop_duplicates`, loading both existing and delta into pandas RAM. For tables in the multi-million-row range this may OOM — switch to `partitioned` strategy for those (per-partition merge keeps memory bounded). Documented in `### Internal` of the changelog entry.
- Date placeholders are resolved at **sync time**, not register time — a typo'd `{{lasst_week}}` is accepted at register and surfaces only when the next sync runs. By design (rolling windows need late-binding).
## Spec source
The four corresponding plans on the `zs/keboola-connector-specs` branch under `docs/superpowers/plans/2026-05-07-0[1-4]-*.md` capture the design rationale and link back to internal repo references for each subsystem.
<!-- devin-review-badge-begin -->
---
<a href="https://app.devin.ai/review/keboola/agnes-the-ai-analyst/pull/217" target="_blank">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1">
<img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open in Devin Review">
</picture>
</a>
<!-- devin-review-badge-end -->
279 lines
9.9 KiB
Python
279 lines
9.9 KiB
Python
"""Pure-function tests for incremental sync helpers."""
|
|
import csv as _csv
|
|
from datetime import datetime, timezone, timedelta
|
|
from pathlib import Path
|
|
|
|
import pandas as pd
|
|
import pyarrow as pa
|
|
import pyarrow.parquet as pq
|
|
import pytest
|
|
|
|
|
|
# ───────────────────────────── compute_changed_since ──────────────────────────
|
|
|
|
|
|
def test_subsequent_sync_uses_last_sync_minus_window():
|
|
from connectors.keboola.incremental import compute_changed_since
|
|
|
|
last_sync = datetime(2026, 5, 1, 12, 0, 0, tzinfo=timezone.utc)
|
|
result = compute_changed_since(
|
|
last_sync=last_sync, window_days=7, max_history_days=None,
|
|
now=last_sync + timedelta(days=1),
|
|
)
|
|
assert result == "2026-04-24T12:00:00+00:00"
|
|
|
|
|
|
def test_subsequent_sync_default_window_is_seven():
|
|
from connectors.keboola.incremental import compute_changed_since
|
|
|
|
last_sync = datetime(2026, 5, 1, tzinfo=timezone.utc)
|
|
result = compute_changed_since(
|
|
last_sync=last_sync, window_days=None, max_history_days=None, now=last_sync,
|
|
)
|
|
assert result == "2026-04-24T00:00:00+00:00"
|
|
|
|
|
|
def test_first_sync_no_max_history_returns_none():
|
|
from connectors.keboola.incremental import compute_changed_since
|
|
|
|
now = datetime(2026, 5, 1, tzinfo=timezone.utc)
|
|
result = compute_changed_since(
|
|
last_sync=None, window_days=7, max_history_days=None, now=now,
|
|
)
|
|
assert result is None
|
|
|
|
|
|
def test_first_sync_with_max_history_caps_to_now_minus_max():
|
|
from connectors.keboola.incremental import compute_changed_since
|
|
|
|
now = datetime(2026, 5, 1, tzinfo=timezone.utc)
|
|
result = compute_changed_since(
|
|
last_sync=None, window_days=7, max_history_days=180, now=now,
|
|
)
|
|
expected = (now - timedelta(days=180)).isoformat()
|
|
assert result == expected
|
|
|
|
|
|
def test_negative_window_raises():
|
|
from connectors.keboola.incremental import compute_changed_since
|
|
|
|
last_sync = datetime(2026, 5, 1, tzinfo=timezone.utc)
|
|
with pytest.raises(ValueError, match="window_days"):
|
|
compute_changed_since(
|
|
last_sync=last_sync, window_days=-1, max_history_days=None, now=last_sync,
|
|
)
|
|
|
|
|
|
# ───────────────────────────── merge_parquet ──────────────────────────────────
|
|
|
|
|
|
def _seed_parquet(path: Path, rows: list[dict], schema: pa.Schema) -> None:
|
|
table = pa.Table.from_pylist(rows, schema=schema)
|
|
pq.write_table(table, path, compression="snappy")
|
|
|
|
|
|
def _seed_csv(path: Path, rows: list[dict]) -> None:
|
|
with path.open("w", newline="") as f:
|
|
if not rows:
|
|
return
|
|
writer = _csv.DictWriter(f, fieldnames=list(rows[0].keys()))
|
|
writer.writeheader()
|
|
writer.writerows(rows)
|
|
|
|
|
|
def test_merge_inserts_new_rows(tmp_path):
|
|
from connectors.keboola.incremental import merge_parquet
|
|
|
|
schema = pa.schema([pa.field("id", pa.int64()), pa.field("v", pa.int64())])
|
|
pq_path = tmp_path / "t.parquet"
|
|
_seed_parquet(pq_path, [{"id": 1, "v": 10}, {"id": 2, "v": 20}], schema)
|
|
|
|
csv_path = tmp_path / "delta.csv"
|
|
_seed_csv(csv_path, [{"id": "3", "v": "30"}])
|
|
|
|
merge_parquet(
|
|
existing_parquet=pq_path,
|
|
new_csv=csv_path,
|
|
primary_key=["id"],
|
|
dtypes={"id": "Int64", "v": "Int64"},
|
|
date_columns=[],
|
|
pyarrow_schema=schema,
|
|
)
|
|
|
|
out = pq.read_table(pq_path).to_pylist()
|
|
assert sorted(o["id"] for o in out) == [1, 2, 3]
|
|
|
|
|
|
def test_merge_updates_existing_row_keeping_last(tmp_path):
|
|
from connectors.keboola.incremental import merge_parquet
|
|
|
|
schema = pa.schema([pa.field("id", pa.int64()), pa.field("v", pa.int64())])
|
|
pq_path = tmp_path / "t.parquet"
|
|
_seed_parquet(pq_path, [{"id": 1, "v": 10}, {"id": 2, "v": 20}], schema)
|
|
|
|
csv_path = tmp_path / "delta.csv"
|
|
_seed_csv(csv_path, [{"id": "2", "v": "999"}, {"id": "3", "v": "30"}])
|
|
|
|
merge_parquet(
|
|
existing_parquet=pq_path,
|
|
new_csv=csv_path,
|
|
primary_key=["id"],
|
|
dtypes={"id": "Int64", "v": "Int64"},
|
|
date_columns=[],
|
|
pyarrow_schema=schema,
|
|
)
|
|
|
|
out = sorted(pq.read_table(pq_path).to_pylist(), key=lambda r: r["id"])
|
|
assert out == [{"id": 1, "v": 10}, {"id": 2, "v": 999}, {"id": 3, "v": 30}]
|
|
|
|
|
|
def test_merge_composite_primary_key(tmp_path):
|
|
from connectors.keboola.incremental import merge_parquet
|
|
|
|
schema = pa.schema([
|
|
pa.field("a", pa.int64()), pa.field("b", pa.int64()), pa.field("v", pa.int64()),
|
|
])
|
|
pq_path = tmp_path / "t.parquet"
|
|
_seed_parquet(pq_path, [{"a": 1, "b": 1, "v": 1}, {"a": 1, "b": 2, "v": 2}], schema)
|
|
|
|
csv_path = tmp_path / "delta.csv"
|
|
_seed_csv(csv_path, [{"a": "1", "b": "2", "v": "999"}])
|
|
|
|
merge_parquet(
|
|
existing_parquet=pq_path,
|
|
new_csv=csv_path,
|
|
primary_key=["a", "b"],
|
|
dtypes={"a": "Int64", "b": "Int64", "v": "Int64"},
|
|
date_columns=[],
|
|
pyarrow_schema=schema,
|
|
)
|
|
|
|
out = sorted(pq.read_table(pq_path).to_pylist(), key=lambda r: (r["a"], r["b"]))
|
|
assert out == [{"a": 1, "b": 1, "v": 1}, {"a": 1, "b": 2, "v": 999}]
|
|
|
|
|
|
def test_merge_without_primary_key_appends_without_dedup(tmp_path, caplog):
|
|
"""Per legacy behavior, missing PK = pure append. Operator's responsibility."""
|
|
from connectors.keboola.incremental import merge_parquet
|
|
|
|
schema = pa.schema([pa.field("id", pa.int64())])
|
|
pq_path = tmp_path / "t.parquet"
|
|
_seed_parquet(pq_path, [{"id": 1}, {"id": 2}], schema)
|
|
|
|
csv_path = tmp_path / "delta.csv"
|
|
_seed_csv(csv_path, [{"id": "2"}, {"id": "3"}])
|
|
|
|
import logging
|
|
with caplog.at_level(logging.WARNING):
|
|
merge_parquet(
|
|
existing_parquet=pq_path,
|
|
new_csv=csv_path,
|
|
primary_key=[],
|
|
dtypes={"id": "Int64"},
|
|
date_columns=[],
|
|
pyarrow_schema=schema,
|
|
)
|
|
|
|
out = pq.read_table(pq_path).to_pylist()
|
|
assert len(out) == 4 # 2 existing + 2 new, includes duplicate id=2
|
|
assert "no primary_key" in caplog.text.lower()
|
|
|
|
|
|
def test_merge_atomic_on_failure(tmp_path, monkeypatch):
|
|
"""If write fails mid-flight, the existing parquet must remain intact."""
|
|
from connectors.keboola.incremental import merge_parquet
|
|
|
|
schema = pa.schema([pa.field("id", pa.int64())])
|
|
pq_path = tmp_path / "t.parquet"
|
|
_seed_parquet(pq_path, [{"id": 1}], schema)
|
|
original_bytes = pq_path.read_bytes()
|
|
|
|
csv_path = tmp_path / "delta.csv"
|
|
_seed_csv(csv_path, [{"id": "2"}])
|
|
|
|
def boom(*a, **kw):
|
|
raise RuntimeError("disk full")
|
|
monkeypatch.setattr("pyarrow.parquet.write_table", boom)
|
|
|
|
with pytest.raises(RuntimeError, match="disk full"):
|
|
merge_parquet(
|
|
existing_parquet=pq_path,
|
|
new_csv=csv_path,
|
|
primary_key=["id"],
|
|
dtypes={"id": "Int64"},
|
|
date_columns=[],
|
|
pyarrow_schema=schema,
|
|
)
|
|
|
|
assert pq_path.read_bytes() == original_bytes
|
|
|
|
|
|
def test_merge_pk_dtype_conversion_failure_raises_hard(tmp_path, monkeypatch):
|
|
"""Devin Review finding 0004 regression guard.
|
|
|
|
If `_convert_column` fails for a primary_key column, the merge must raise
|
|
rather than warn-and-continue. The pre-fix behavior left the PK column as
|
|
object/string in the delta while existing_df has it typed (e.g. int64),
|
|
producing a mixed-type column after concat that silently broke
|
|
`drop_duplicates` (int 1 != str '1' under Python equality)."""
|
|
from connectors.keboola import incremental as _incremental
|
|
|
|
schema = pa.schema([pa.field("id", pa.int64()), pa.field("v", pa.int64())])
|
|
pq_path = tmp_path / "t.parquet"
|
|
_seed_parquet(pq_path, [{"id": 1, "v": 10}], schema)
|
|
|
|
csv_path = tmp_path / "delta.csv"
|
|
_seed_csv(csv_path, [{"id": "2", "v": "20"}])
|
|
|
|
def boom(series, dtype, col_name=""):
|
|
raise ValueError(f"synthetic conversion failure on {col_name!r}")
|
|
monkeypatch.setattr(_incremental, "_convert_column", boom)
|
|
|
|
with pytest.raises(RuntimeError, match="PK column 'id' dtype conversion failed"):
|
|
_incremental.merge_parquet(
|
|
existing_parquet=pq_path,
|
|
new_csv=csv_path,
|
|
primary_key=["id"],
|
|
dtypes={"id": "Int64", "v": "Int64"},
|
|
date_columns=[],
|
|
pyarrow_schema=schema,
|
|
)
|
|
|
|
|
|
def test_merge_non_pk_dtype_conversion_failure_warns_and_continues(tmp_path, monkeypatch, caplog):
|
|
"""Inverse of the PK-fail guard: a non-PK dtype conversion failure is
|
|
soft-handled (logged warning, delta column stays as string). Locks the
|
|
asymmetric policy in `merge_parquet` — PK failures are load-bearing for
|
|
dedup correctness, non-PK failures degrade gracefully (pyarrow_schema=None
|
|
here mirrors the path used when Keboola Storage metadata is unavailable)."""
|
|
from connectors.keboola import incremental as _incremental
|
|
|
|
schema = pa.schema([pa.field("id", pa.int64()), pa.field("v", pa.string())])
|
|
pq_path = tmp_path / "t.parquet"
|
|
_seed_parquet(pq_path, [{"id": 1, "v": "10"}], schema)
|
|
|
|
csv_path = tmp_path / "delta.csv"
|
|
_seed_csv(csv_path, [{"id": "2", "v": "20"}])
|
|
|
|
real_convert = _incremental._convert_column
|
|
|
|
def selective_boom(series, dtype, col_name=""):
|
|
if col_name == "v":
|
|
raise ValueError("synthetic non-pk conversion failure")
|
|
return real_convert(series, dtype, col_name=col_name)
|
|
monkeypatch.setattr(_incremental, "_convert_column", selective_boom)
|
|
|
|
import logging
|
|
with caplog.at_level(logging.WARNING):
|
|
_incremental.merge_parquet(
|
|
existing_parquet=pq_path,
|
|
new_csv=csv_path,
|
|
primary_key=["id"],
|
|
dtypes={"id": "Int64", "v": "Int64"},
|
|
date_columns=[],
|
|
pyarrow_schema=None,
|
|
)
|
|
|
|
assert "failed to apply dtype" in caplog.text.lower()
|
|
out = sorted(pq.read_table(pq_path).to_pylist(), key=lambda r: r["id"])
|
|
assert [r["id"] for r in out] == [1, 2]
|