agnes-the-ai-analyst/connectors/keboola/incremental.py
ZdenekSrotyr 506a378c3a
release: 0.47.1 — Keboola connector v27 (incremental, partitioned, where_filters, typed parquet) (#217)
## Summary

Brings the Keboola connector to feature parity with the legacy internal data-analyst's per-table sync strategies. Closes the four documented gaps from the spec branch (`zs/keboola-connector-specs`):

- **Typed parquet** in the legacy SDK extraction path — column types from Keboola Storage metadata (provider cascade `user > ai-metadata-enrichment > keboola.snowflake-transformation`) survive the CSV → parquet roundtrip; invalid date strings (`'0000-00-00'`) and invalid numeric strings (`'Non-Manager'`) become NULL while keeping the column's typed schema. Pre-fix everything was VARCHAR.
- **Incremental sync** via Storage API `changedSince` — opt-in per table; pulls only delta rows, merges into the existing parquet by `primary_key` (drop_duplicates with keep='last'). Cuts daily extraction from O(full table) to O(delta).
- **Partitioned sync** — flat per-partition layout `data/<table>/<key>.parquet` (e.g. `2026_05.parquet`), per-affected-partition merge for daily updates, chunked initial load with 1-day overlap and 2-empty-chunk stop heuristic.
- **`where_filters`** — server-side row filter with date placeholders (`{{today}}`, `{{last_3_months}}`, `{{start_of_3_months_ago}}`, etc.) resolved at sync time. Force the SDK path; reject `incremental + where_filters` combination at API layer (changedSince already filters temporally).

## Architecture

- **Schema migration v25 → v26**: 7 new columns on `table_registry`. Existing `sync_strategy` column reused (pre-v26 it was inert catalog metadata; post-v26 the extractor dispatches off it).
- **Per-table dispatcher** in `extractor.run()` routes to one of `_extract_via_extension` (full_refresh + extension), `_extract_via_legacy` (full_refresh + filters or extension fallback), `extract_incremental`, or `extract_partitioned`.
- **API conflict policy**: `incremental + where_filters` → 422; `partitioned + query_mode='remote'` → 422; `partitioned ⇒ partition_by required`.
- **Admin UI**: third "Direct extract (Storage API)" radio in the Keboola Register / Edit modals, alongside existing "Whole table (extension)" and "Custom SQL". When selected, exposes a v26 sync-strategy panel with conditional fields per strategy.

## Test plan

- [x] **Unit + module** — 134 v26 tests covering migration, repo, parquet_io, where_filters, incremental (compute_changed_since + merge_parquet + extract_incremental E2E), partitioned (key derivation + merge_partition + chunked windows + extract_partitioned E2E), extractor dispatcher, admin API validators, PUT field clearing, registry-shape → dispatcher bridge
- [x] **HTML form structure** — all v26 inputs + visibility classes + JS payload fields verified in rendered template
- [x] **Real Keboola roundtrip** — registered a small test table as `sync_strategy='incremental'` against a test Storage project, triggered two syncs:
  - Sync 1: `changedSince=None` → full pull → 9 rows typed parquet
  - Sync 2: `changedSince=last_sync - 1d window` → 9 delta rows merged with 9 existing → 9 after dedup on primary_key (PK merge confirmed)
- [x] **Browser UX** — agent-browser session against a local uvicorn: login → admin/tables → register modal → switch radios → verify field visibility per strategy → submit → edit existing row → switch to Direct/Incremental → save → confirm DB persistence
- [x] **Regression** — no regressions in the broader 3252-test suite (3 pre-v26 tests updated for the deprecation-marker removal + schema-version bump; 2 pre-existing environment-sensitive test failures unrelated to this change)

## Bugs caught + fixed during E2E

The browser + real-Keboola roundtrip exposed four bugs the unit tests missed:

1. **JS visibility race** — two competing `forEach` loops set `display=''` then `display='none'` on form elements sharing `kb-strategy-incremental kb-strategy-partitioned` classes (window_days + max_history_days are reused across strategies). Fix: single-pass selector with class-based visibility resolver.
2. **PUT cannot clear field** — pre-v26 `updates = {k: v ... if v is not None}` collapsed "omitted from body" and "sent as null" into the same case, so admin couldn't switch a partitioned row back to full_refresh and have stale `partition_by` clear. Fix: `model_dump(exclude_unset=True)`.
3. **Subprocess DB lock conflict** — `_read_last_sync` reopened `system.duckdb` while the parent server held the write lock (subprocess contract at `app/api/sync.py:_run_sync` line 260). Fix: parent injects `__last_sync__` into table_config before subprocess spawn.
4. **Wrong KBC table_id** — `extract_incremental` / `extract_partitioned` built the Storage API table_id from the registry row's slugified `id` (`circle_inc`) instead of `bucket.source_table` (`in.c-finance.circle`), producing 404s. Fix: prefer `bucket+source_table`; fall back to `id` only when bucket empty.

## Operator notes

- Existing tables stay on `full_refresh` after migration; admins opt individual tables in via `agnes admin register-table --sync-strategy ...`, the Keboola Edit modal, or `POST/PUT /api/admin/registry`.
- `merge_parquet` and `merge_partition` use `pd.concat + drop_duplicates`, loading both existing and delta into pandas RAM. For tables in the multi-million-row range this may OOM — switch to `partitioned` strategy for those (per-partition merge keeps memory bounded). Documented in `### Internal` of the changelog entry.
- Date placeholders are resolved at **sync time**, not register time — a typo'd `{{lasst_week}}` is accepted at register and surfaces only when the next sync runs. By design (rolling windows need late-binding).

## Spec source

The four corresponding plans on the `zs/keboola-connector-specs` branch under `docs/superpowers/plans/2026-05-07-0[1-4]-*.md` capture the design rationale and link back to internal repo references for each subsystem.
<!-- devin-review-badge-begin -->

---

<a href="https://app.devin.ai/review/keboola/agnes-the-ai-analyst/pull/217" target="_blank">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1">
    <img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open in Devin Review">
  </picture>
</a>
<!-- devin-review-badge-end -->
2026-05-07 19:01:27 +02:00

308 lines
12 KiB
Python

"""Incremental Keboola extraction — watermark, merge, and orchestration.
Mirrors internal repo's `src/data_sync.py:_incremental_single_file_sync`
(lines 366-527), simplified to a single-file (non-partitioned) flow.
Partitioned tables are handled by `connectors.keboola.partitioned`.
Pipeline per table:
1. Read watermark (last_sync) from sync_state — caller's job.
2. compute_changed_since: subtract incremental_window_days from last_sync,
or use max_history_days for first sync.
3. KeboolaClient.export_table(changed_since=...) — pulls only delta rows.
4. If 0 rows, no-op (parquet untouched, log "no changes").
5. If existing parquet, merge_parquet (concat → drop_duplicates by PK).
6. If first sync, csv_to_parquet directly.
"""
from __future__ import annotations
import logging
import os
import tempfile
from datetime import datetime, timezone, timedelta
from pathlib import Path
from typing import Any, Dict, List, Optional
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from connectors.keboola.parquet_io import (
apply_schema_to_table,
convert_date_columns_to_date32,
csv_to_parquet,
_convert_column,
)
logger = logging.getLogger(__name__)
DEFAULT_WINDOW_DAYS = 7
def compute_changed_since(
*,
last_sync: Optional[datetime],
window_days: Optional[int],
max_history_days: Optional[int],
now: datetime,
) -> Optional[str]:
"""Compute the `changedSince` ISO timestamp to pass to Keboola Storage API.
Args:
last_sync: When this table was last successfully synced. None on first sync.
window_days: Backtrack window applied to last_sync. None → DEFAULT_WINDOW_DAYS (7).
max_history_days: Cap on first-sync history depth. None → unbounded (returns None).
now: Current time, injected for testability.
Returns:
ISO 8601 timestamp string, or None when the first sync should download all rows.
"""
if window_days is not None and window_days < 0:
raise ValueError(f"window_days must be >= 0, got {window_days}")
if last_sync is not None:
win = window_days if window_days is not None else DEFAULT_WINDOW_DAYS
return (last_sync - timedelta(days=win)).isoformat()
if max_history_days is not None:
return (now - timedelta(days=max_history_days)).isoformat()
return None
def merge_parquet(
*,
existing_parquet: Path,
new_csv: Path,
primary_key: List[str],
dtypes: Dict[str, str],
date_columns: List[str],
pyarrow_schema: Optional[pa.Schema],
) -> Dict[str, int]:
"""Merge a CSV delta into an existing parquet by primary key.
Read existing parquet → load delta CSV with explicit dtypes → concat →
drop_duplicates(subset=primary_key, keep='last') → write to a sibling
`.tmp` → atomic rename. The .tmp lives in the same directory as the
target so the rename is atomic on the same filesystem.
No primary key = pure append (matches legacy behavior; logs a warning
so operators notice the missing PK on a sync that needed dedup).
Reuses the typed-schema helpers from `parquet_io.py` so the merged
parquet has the same column types as a fresh `csv_to_parquet` write.
NOTE on memory: this loads both existing parquet and delta CSV into
pandas RAM. For tables in the multi-million-row range this may OOM.
Switch to partitioned strategy for those tables (per-partition merge
keeps memory bounded) — see `connectors.keboola.partitioned`.
"""
existing_parquet = Path(existing_parquet)
new_csv = Path(new_csv)
existing_df = pq.read_table(existing_parquet).to_pandas()
# Memory-pressure heuristic. pandas concat + drop_duplicates loads
# both sides into RAM; large slowly-changing tables can OOM the
# extractor subprocess. Surface a WARNING at the merge call site —
# operators see the signal at the point the work is happening
# instead of post-mortem after a kill. 5M rows is roughly 0.5-2 GB
# depending on column count + dtype mix; for tables that consistently
# cross this, switch to sync_strategy='partitioned' (per-partition
# merge keeps memory bounded).
if len(existing_df) > 5_000_000:
logger.warning(
"merge: %s has %d existing rows — pandas merge loads the full "
"parquet + delta into RAM and may OOM. Consider switching this "
"table to sync_strategy='partitioned' (per-partition merge "
"keeps memory bounded). See CHANGELOG ### Internal for the "
"OOM caveat.",
existing_parquet.name, len(existing_df),
)
delta_df = pd.read_csv(new_csv, dtype=str)
if dtypes:
for col, dtype in dtypes.items():
if col in delta_df.columns and "datetime" not in dtype:
try:
delta_df[col] = _convert_column(delta_df[col], dtype, col_name=col)
except Exception as e:
# Devin Review finding 0004: a primary_key column whose
# dtype conversion fails leaves delta_df[col] as object
# (string) while existing_df has it typed (e.g. int64).
# `pd.concat` then produces an object column for that PK,
# and `drop_duplicates` compares int `1` against string
# `'1'` as unequal — silent dedup failure. Fail loudly
# for PK columns instead of warn-and-continue, since PK
# integrity is load-bearing for the merge contract.
if primary_key and col in primary_key:
raise RuntimeError(
f"merge: PK column {col!r} dtype conversion failed "
f"({e}); refusing to silently dedup with mixed "
f"int/string types. Fix the upstream Keboola "
f"data quality issue or change sync_strategy "
f"to full_refresh."
) from e
logger.warning("merge: failed to apply dtype %s to %r: %s", dtype, col, e)
combined = pd.concat([existing_df, delta_df], ignore_index=True)
if primary_key:
# Defensive coercion: if the concat produced an object-dtype PK column
# (existing typed + delta string), drop_duplicates would compare
# heterogeneous values via Python equality which doesn't bridge
# int↔string. Coerce both sides to string for the dedup comparison
# so the result is deterministic; the post-dedup pa.Table.from_pandas
# + apply_schema_to_table cast restores the canonical type.
for pk in primary_key:
if pk in combined.columns and combined[pk].dtype == object:
combined[pk] = combined[pk].astype(str)
combined = combined.drop_duplicates(subset=primary_key, keep="last")
logger.info(
"merge: %s, %d existing + %d delta rows → %d after dedup on %s",
existing_parquet.name, len(existing_df), len(delta_df),
len(combined), primary_key,
)
else:
logger.warning(
"merge: %s has no primary_key configured — appending without dedup. "
"Duplicates between existing and delta will accumulate.",
existing_parquet.name,
)
table = pa.Table.from_pandas(combined, preserve_index=False)
if date_columns:
table = convert_date_columns_to_date32(table, date_columns)
if pyarrow_schema is not None:
table = apply_schema_to_table(table, pyarrow_schema)
tmp_path = existing_parquet.with_suffix(existing_parquet.suffix + ".tmp")
if tmp_path.exists():
tmp_path.unlink()
try:
pq.write_table(table, tmp_path, compression="snappy")
os.replace(tmp_path, existing_parquet)
finally:
if tmp_path.exists():
tmp_path.unlink()
return {"rows": len(combined), "delta_rows": len(delta_df)}
def extract_incremental(
*,
table_config: Dict[str, Any],
parquet_path: Path,
last_sync: Optional[datetime],
keboola_url: str,
keboola_token: str,
now: Optional[datetime] = None,
) -> Dict[str, Any]:
"""Extract one Keboola table incrementally.
`table_config` keys consumed: id, name, bucket, source_table,
primary_key, incremental_window_days, max_history_days.
Returns:
{
"rows": total rows in the parquet after merge,
"delta_rows": rows in the delta export (may be 0),
"changed_since_used": ISO string passed to Storage API
(None on first sync without max_history_days),
}
"""
from connectors.keboola.client import KeboolaClient
parquet_path = Path(parquet_path)
parquet_path.parent.mkdir(parents=True, exist_ok=True)
# Build the Keboola Storage API table_id from bucket + source_table (the
# canonical KBC reference, e.g. `in.c-finance.circle`). Fall back to the
# registry's `id` only when bucket is empty — the registry id is the
# slugified agnes view name (e.g. `circle_inc`), NOT a valid KBC ref.
bucket = table_config.get("bucket", "")
source_table = table_config.get("source_table") or table_config.get("name")
if bucket and source_table:
table_id = f"{bucket}.{source_table}"
else:
table_id = table_config.get("id") or table_config.get("name")
primary_key = table_config.get("primary_key") or []
if isinstance(primary_key, str):
primary_key = [primary_key]
now = now or datetime.now(timezone.utc)
changed_since = compute_changed_since(
last_sync=last_sync,
window_days=table_config.get("incremental_window_days"),
max_history_days=table_config.get("max_history_days"),
now=now,
)
client = KeboolaClient(token=keboola_token, url=keboola_url)
try:
pyarrow_schema = client.get_pyarrow_schema(table_id)
except Exception as e:
logger.warning("Schema unavailable for %s: %s", table_id, e)
pyarrow_schema = None
try:
dtypes = client.get_pandas_dtypes(table_id) if pyarrow_schema else {}
except Exception:
dtypes = {}
try:
date_columns = client.get_date_columns(table_id) if pyarrow_schema else []
except Exception:
date_columns = []
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
csv_path = Path(tmp.name)
try:
export_info = client.export_table(
table_id, csv_path, changed_since=changed_since,
)
delta_rows = export_info.get("exported_rows", 0)
if delta_rows == 0:
existing_rows = 0
if parquet_path.exists():
existing_rows = pq.read_metadata(parquet_path).num_rows
logger.info("incremental: %s — no changes since %s", table_id, changed_since)
return {
"rows": existing_rows,
"delta_rows": 0,
"changed_since_used": changed_since,
}
if parquet_path.exists():
merge_info = merge_parquet(
existing_parquet=parquet_path,
new_csv=csv_path,
primary_key=primary_key,
dtypes=dtypes,
date_columns=date_columns,
pyarrow_schema=pyarrow_schema,
)
return {
"rows": merge_info["rows"],
"delta_rows": delta_rows,
"changed_since_used": changed_since,
}
else:
csv_to_parquet(
csv_path=csv_path,
parquet_path=parquet_path,
dtypes=dtypes,
date_columns=date_columns,
pyarrow_schema=pyarrow_schema,
table_id=table_id,
)
rows = pq.read_metadata(parquet_path).num_rows
return {
"rows": rows,
"delta_rows": delta_rows,
"changed_since_used": changed_since,
}
finally:
if csv_path.exists():
csv_path.unlink()