agnes-the-ai-analyst/tests/test_keboola_partitioned_e2e.py
ZdenekSrotyr 506a378c3a
release: 0.47.1 — Keboola connector v27 (incremental, partitioned, where_filters, typed parquet) (#217)
## Summary

Brings the Keboola connector to feature parity with the legacy internal data-analyst's per-table sync strategies. Closes the four documented gaps from the spec branch (`zs/keboola-connector-specs`):

- **Typed parquet** in the legacy SDK extraction path — column types from Keboola Storage metadata (provider cascade `user > ai-metadata-enrichment > keboola.snowflake-transformation`) survive the CSV → parquet roundtrip; invalid date strings (`'0000-00-00'`) and invalid numeric strings (`'Non-Manager'`) become NULL while keeping the column's typed schema. Pre-fix everything was VARCHAR.
- **Incremental sync** via Storage API `changedSince` — opt-in per table; pulls only delta rows, merges into the existing parquet by `primary_key` (drop_duplicates with keep='last'). Cuts daily extraction from O(full table) to O(delta).
- **Partitioned sync** — flat per-partition layout `data/<table>/<key>.parquet` (e.g. `2026_05.parquet`), per-affected-partition merge for daily updates, chunked initial load with 1-day overlap and 2-empty-chunk stop heuristic.
- **`where_filters`** — server-side row filter with date placeholders (`{{today}}`, `{{last_3_months}}`, `{{start_of_3_months_ago}}`, etc.) resolved at sync time. Force the SDK path; reject `incremental + where_filters` combination at API layer (changedSince already filters temporally).
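
The per-affected-partition merge described above can be sketched without any dependencies. This is a minimal illustration, not the connector's code: it assumes month granularity maps a date to a `YYYY_MM` key (as in `2026_05.parquet`) and models each partition as a list of dicts. The helper names `partition_key`, `merge_keep_last`, and `merge_delta_into_partitions` are hypothetical; the connector itself is described as merging with pandas `pd.concat` + `drop_duplicates`.

```python
from collections import defaultdict
from datetime import date


def partition_key(d: date, granularity: str = "month") -> str:
    """Derive a partition file stem such as '2026_05' from a date.

    Only month granularity is sketched here; other granularities are
    not covered by this illustration.
    """
    if granularity != "month":
        raise ValueError(f"unsupported granularity in this sketch: {granularity}")
    return f"{d.year:04d}_{d.month:02d}"


def merge_keep_last(existing, delta, primary_key):
    """Merge rows by primary key; a later row replaces an earlier one."""
    merged = {}
    for row in [*existing, *delta]:
        merged[tuple(row[k] for k in primary_key)] = row
    return list(merged.values())


def merge_delta_into_partitions(partitions, delta, primary_key, partition_by):
    """Merge delta rows into the partitions they belong to.

    Partitions that receive no delta rows are never read or rewritten.
    Returns the sorted list of partition keys that were touched.
    """
    by_key = defaultdict(list)
    for row in delta:
        by_key[partition_key(row[partition_by])].append(row)
    for key, rows in by_key.items():
        partitions[key] = merge_keep_last(partitions.get(key, []), rows, primary_key)
    return sorted(by_key)
```

Because only the keys present in the delta are visited, memory use is bounded by the largest touched partition plus the delta, rather than by the whole table.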

## Architecture

- **Schema migration v25 → v26**: 7 new columns on `table_registry`. Existing `sync_strategy` column reused (pre-v26 it was inert catalog metadata; post-v26 the extractor dispatches off it).
- **Per-table dispatcher** in `extractor.run()` routes to one of `_extract_via_extension` (full_refresh + extension), `_extract_via_legacy` (full_refresh + filters or extension fallback), `extract_incremental`, or `extract_partitioned`.
- **API conflict policy**: `incremental + where_filters` → 422; `partitioned + query_mode='remote'` → 422; `partitioned ⇒ partition_by required`.
- **Admin UI**: third "Direct extract (Storage API)" radio in the Keboola Register / Edit modals, alongside existing "Whole table (extension)" and "Custom SQL". When selected, exposes a v26 sync-strategy panel with conditional fields per strategy.
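
The API conflict policy can be read as a small set of validation rules. The sketch below is an assumption about how such a check might look: `validate_sync_config` is a hypothetical name, the config is modelled as a plain dict, and it returns error messages where the real API responds with HTTP 422.

```python
def validate_sync_config(cfg: dict) -> list[str]:
    """Return a list of policy violations; an empty list means the config is accepted."""
    errors = []
    strategy = cfg.get("sync_strategy", "full_refresh")

    # changedSince already filters temporally, so the combination is rejected.
    if strategy == "incremental" and cfg.get("where_filters"):
        errors.append("incremental cannot be combined with where_filters")

    if strategy == "partitioned":
        if cfg.get("query_mode") == "remote":
            errors.append("partitioned cannot be combined with query_mode='remote'")
        if not cfg.get("partition_by"):
            errors.append("partitioned requires partition_by")

    return errors
```

A caller would translate a non-empty result into a 422 response and persist the row only when the list is empty.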

## Test plan

- [x] **Unit + module** — 134 v26 tests covering migration, repo, parquet_io, where_filters, incremental (compute_changed_since + merge_parquet + extract_incremental E2E), partitioned (key derivation + merge_partition + chunked windows + extract_partitioned E2E), extractor dispatcher, admin API validators, PUT field clearing, registry-shape → dispatcher bridge
- [x] **HTML form structure** — all v26 inputs + visibility classes + JS payload fields verified in rendered template
- [x] **Real Keboola roundtrip** — registered a small test table as `sync_strategy='incremental'` against a test Storage project, triggered two syncs:
  - Sync 1: `changedSince=None` → full pull → 9 rows typed parquet
  - Sync 2: `changedSince=last_sync - 1d window` → 9 delta rows merged with 9 existing → 9 after dedup on primary_key (PK merge confirmed)
- [x] **Browser UX** — agent-browser session against a local uvicorn: login → admin/tables → register modal → switch radios → verify field visibility per strategy → submit → edit existing row → switch to Direct/Incremental → save → confirm DB persistence
- [x] **Regression** — no regressions in the broader 3252-test suite (3 pre-v26 tests updated for the deprecation-marker removal + schema-version bump; 2 pre-existing environment-sensitive test failures unrelated to this change)
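
The two syncs in the real Keboola roundtrip differ only in the `changedSince` value. A minimal sketch of that computation follows, assuming the overlap is a whole number of days subtracted from the previous sync time; `overlap_changed_since` is a hypothetical helper and not the signature of the connector's `compute_changed_since`.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def overlap_changed_since(
    last_sync: Optional[datetime], window_days: int = 1
) -> Optional[datetime]:
    """Return the lower bound for the next delta pull.

    None means "no previous sync", which triggers a full pull.
    Otherwise the bound is moved back by the overlap window so that
    rows changed shortly before the last sync are fetched again.
    """
    if last_sync is None:
        return None
    return last_sync - timedelta(days=window_days)
```

The overlap means some rows are fetched twice, which is why the second sync returned 9 delta rows yet still ended with 9 rows: the primary-key merge makes the re-fetch idempotent.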

## Bugs caught + fixed during E2E

The browser + real-Keboola roundtrip exposed four bugs the unit tests missed:

1. **JS visibility race** — two competing `forEach` loops set `display=''` then `display='none'` on form elements sharing `kb-strategy-incremental kb-strategy-partitioned` classes (window_days + max_history_days are reused across strategies). Fix: single-pass selector with class-based visibility resolver.
2. **PUT cannot clear field** — pre-v26 `updates = {k: v ... if v is not None}` collapsed "omitted from body" and "sent as null" into the same case, so admin couldn't switch a partitioned row back to full_refresh and have stale `partition_by` clear. Fix: `model_dump(exclude_unset=True)`.
3. **Subprocess DB lock conflict** — `_read_last_sync` reopened `system.duckdb` while the parent server held the write lock (subprocess contract at `app/api/sync.py:_run_sync` line 260). Fix: parent injects `__last_sync__` into table_config before subprocess spawn.
4. **Wrong KBC table_id** — `extract_incremental` / `extract_partitioned` built the Storage API table_id from the registry row's slugified `id` (`circle_inc`) instead of `bucket.source_table` (`in.c-finance.circle`), producing 404s. Fix: prefer `bucket+source_table`; fall back to `id` only when bucket empty.
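
The fix for bug 4 amounts to a small resolution rule. The following is a sketch under assumptions: the registry row is a plain dict, and `storage_table_id` is a hypothetical helper name. It also falls back to `id` when `source_table` is missing, which is a choice made for this illustration.

```python
def storage_table_id(row: dict) -> str:
    """Build the Storage API table id for a registry row.

    Prefers '<bucket>.<source_table>' and falls back to the row's own
    'id' only when that pair is not available.
    """
    bucket = (row.get("bucket") or "").strip()
    source_table = (row.get("source_table") or "").strip()
    if bucket and source_table:
        return f"{bucket}.{source_table}"
    return row["id"]
```

With the example from the bug report, a row with `id="circle_inc"`, `bucket="in.c-finance"`, and `source_table="circle"` resolves to `in.c-finance.circle` instead of the slug that produced the 404s.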

## Operator notes

- Existing tables stay on `full_refresh` after migration; admins opt individual tables in via `agnes admin register-table --sync-strategy ...`, the Keboola Edit modal, or `POST/PUT /api/admin/registry`.
- `merge_parquet` and `merge_partition` use `pd.concat + drop_duplicates`, loading both existing and delta into pandas RAM. For tables in the multi-million-row range this may OOM — switch to `partitioned` strategy for those (per-partition merge keeps memory bounded). Documented in `### Internal` of the changelog entry.
- Date placeholders are resolved at **sync time**, not register time: a mistyped `{{lasst_week}}` is accepted at registration and surfaces only when the next sync runs. This is by design, because rolling windows need late binding.
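
Late binding of date placeholders can be illustrated with a small resolver. This is a sketch under stated assumptions: `resolve_placeholders` is a hypothetical name, only `{{today}}` is handled because the exact semantics of the other placeholders are not given here, and rendering dates in ISO format is also an assumption.

```python
import re
from datetime import date

# Matches {{name}} with optional whitespace inside the braces.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def resolve_placeholders(value: str, today: date) -> str:
    """Substitute known date placeholders in a filter value at sync time."""
    known = {"today": today.isoformat()}

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in known:
            # An unknown name is only detected here, when the sync runs.
            raise ValueError("unknown date placeholder: " + match.group(0))
        return known[name]

    return _PLACEHOLDER.sub(substitute, value)
```

Since nothing calls the resolver at registration, a misspelled placeholder is stored verbatim and raises only on the next sync, which matches the behaviour described in the note above.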

## Spec source

The four corresponding plans on the `zs/keboola-connector-specs` branch under `docs/superpowers/plans/2026-05-07-0[1-4]-*.md` capture the design rationale and link back to internal repo references for each subsystem.
2026-05-07 19:01:27 +02:00


"""End-to-end partitioned sync with mocked Storage SDK."""
from datetime import date, datetime, timezone
from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq
import pytest


def test_first_sync_chunked_writes_per_partition_files(tmp_path, monkeypatch):
    """Two-chunk history: latest chunk has rows, previous has rows, then 2 empty
    chunks stop the loop."""
    from connectors.keboola.partitioned import extract_partitioned
    from connectors.keboola.client import KeboolaClient

    chunk_payloads = iter([
        # most recent → oldest
        "id,date\n1,2026-05-01\n2,2026-05-15\n",
        "id,date\n3,2026-04-10\n",
        "id,date\n",  # empty 1
        "id,date\n",  # empty 2 -> stop
    ])

    def fake_export(self, table_id, output_path, changed_since=None, changed_until=None, **kw):
        body = next(chunk_payloads)
        Path(output_path).write_text(body)
        rows = max(0, len(body.strip().split("\n")) - 1)
        return {"exported_rows": rows}

    fake_schema = pa.schema([
        pa.field("id", pa.int64()), pa.field("date", pa.date32()),
    ])
    monkeypatch.setattr(KeboolaClient, "__init__", lambda self, **kw: None)
    monkeypatch.setattr(KeboolaClient, "export_table", fake_export)
    monkeypatch.setattr(KeboolaClient, "get_pyarrow_schema", lambda self, tid: fake_schema)
    monkeypatch.setattr(KeboolaClient, "get_pandas_dtypes", lambda self, tid: {"id": "Int64"})
    monkeypatch.setattr(KeboolaClient, "get_date_columns", lambda self, tid: ["date"])

    out_dir = tmp_path / "data" / "sales"
    result = extract_partitioned(
        table_config={
            "id": "in.c-sales.orders", "name": "orders",
            "bucket": "in.c-sales", "source_table": "orders",
            "primary_key": ["id"], "partition_by": "date",
            "partition_granularity": "month",
            "incremental_window_days": 1,
            "max_history_days": None,
            "initial_load_chunk_days": 30,
        },
        output_dir=out_dir,
        last_sync=None,
        keboola_url="https://kbc.example", keboola_token="tok",
        now=datetime(2026, 5, 7, tzinfo=timezone.utc),
    )

    files = sorted(p.name for p in out_dir.glob("*.parquet"))
    assert files == ["2026_04.parquet", "2026_05.parquet"]
    assert result["rows"] == 3


def test_incremental_partitioned_merges_only_affected(tmp_path, monkeypatch):
    """Existing partitions for 2026_04 and 2026_05. Delta touches only 2026_05.
    2026_04's bytes must be unchanged."""
    from connectors.keboola.partitioned import extract_partitioned
    from connectors.keboola.client import KeboolaClient

    schema = pa.schema([
        pa.field("id", pa.int64()),
        pa.field("date", pa.date32()),
        pa.field("v", pa.int64()),
    ])
    out_dir = tmp_path / "data" / "sales"
    out_dir.mkdir(parents=True)

    # Seed 2026_04 (untouched by this delta)
    apr = out_dir / "2026_04.parquet"
    pq.write_table(pa.Table.from_pylist([
        {"id": 100, "date": date(2026, 4, 1), "v": 1},
    ], schema=schema), apr, compression="snappy")
    apr_bytes_before = apr.read_bytes()

    # Seed 2026_05
    may = out_dir / "2026_05.parquet"
    pq.write_table(pa.Table.from_pylist([
        {"id": 1, "date": date(2026, 5, 1), "v": 10},
    ], schema=schema), may, compression="snappy")

    delta_payload = "id,date,v\n1,2026-05-01,999\n2,2026-05-15,20\n"

    def fake_export(self, table_id, output_path, **kw):
        Path(output_path).write_text(delta_payload)
        return {"exported_rows": 2}

    monkeypatch.setattr(KeboolaClient, "__init__", lambda self, **kw: None)
    monkeypatch.setattr(KeboolaClient, "export_table", fake_export)
    monkeypatch.setattr(KeboolaClient, "get_pyarrow_schema", lambda self, tid: schema)
    monkeypatch.setattr(KeboolaClient, "get_pandas_dtypes",
                        lambda self, tid: {"id": "Int64", "v": "Int64"})
    monkeypatch.setattr(KeboolaClient, "get_date_columns", lambda self, tid: ["date"])

    extract_partitioned(
        table_config={
            "id": "in.c-sales.orders", "name": "orders",
            "bucket": "in.c-sales", "source_table": "orders",
            "primary_key": ["id"], "partition_by": "date",
            "partition_granularity": "month",
            "incremental_window_days": 1, "max_history_days": None,
            "initial_load_chunk_days": 30,
        },
        output_dir=out_dir,
        last_sync=datetime(2026, 5, 6, tzinfo=timezone.utc),
        keboola_url="https://kbc.example", keboola_token="tok",
        now=datetime(2026, 5, 7, tzinfo=timezone.utc),
    )

    # 2026_04 bytes-identical (no read, no write)
    assert apr.read_bytes() == apr_bytes_before
    # 2026_05 has the updated v=999 row and the new id=2 row
    rows = sorted(pq.read_table(may).to_pylist(), key=lambda r: r["id"])
    assert len(rows) == 2
    assert rows[0]["v"] == 999
    assert rows[1]["id"] == 2


def test_zero_delta_is_noop_for_partitioned(tmp_path, monkeypatch):
    from connectors.keboola.partitioned import extract_partitioned
    from connectors.keboola.client import KeboolaClient

    schema = pa.schema([
        pa.field("id", pa.int64()),
        pa.field("date", pa.date32()),
    ])
    out_dir = tmp_path / "data" / "orders"
    out_dir.mkdir(parents=True)
    pq.write_table(pa.Table.from_pylist([
        {"id": 1, "date": date(2026, 5, 1)},
    ], schema=schema), out_dir / "2026_05.parquet", compression="snappy")

    def fake_export(self, table_id, output_path, **kw):
        Path(output_path).write_text("id,date\n")
        return {"exported_rows": 0}

    monkeypatch.setattr(KeboolaClient, "__init__", lambda self, **kw: None)
    monkeypatch.setattr(KeboolaClient, "export_table", fake_export)
    monkeypatch.setattr(KeboolaClient, "get_pyarrow_schema", lambda self, tid: schema)
    monkeypatch.setattr(KeboolaClient, "get_pandas_dtypes", lambda self, tid: {"id": "Int64"})
    monkeypatch.setattr(KeboolaClient, "get_date_columns", lambda self, tid: ["date"])

    result = extract_partitioned(
        table_config={
            "id": "in.c-sales.orders", "name": "orders",
            "bucket": "in.c-sales", "source_table": "orders",
            "primary_key": ["id"], "partition_by": "date",
            "partition_granularity": "month",
            "incremental_window_days": 1,
        },
        output_dir=out_dir,
        last_sync=datetime(2026, 5, 6, tzinfo=timezone.utc),
        keboola_url="https://kbc.example", keboola_token="tok",
        now=datetime(2026, 5, 7, tzinfo=timezone.utc),
    )

    assert result["delta_rows"] == 0
    assert result["partitions_touched"] == 0
    assert result["rows"] == 1


def test_missing_partition_by_raises(tmp_path):
    from connectors.keboola.partitioned import extract_partitioned, InvalidPartitionConfigError

    out_dir = tmp_path / "data" / "x"
    with pytest.raises(InvalidPartitionConfigError, match="partition_by"):
        extract_partitioned(
            table_config={
                "id": "in.c-x.y", "name": "y",
                "bucket": "in.c-x", "source_table": "y",
                "partition_granularity": "month",
            },
            output_dir=out_dir,
            last_sync=None,
            keboola_url="https://kbc.example", keboola_token="tok",
            now=datetime(2026, 5, 7, tzinfo=timezone.utc),
        )