## Summary
Brings the Keboola connector to feature parity with the legacy internal data-analyst's per-table sync strategies. Closes the four documented gaps from the spec branch (`zs/keboola-connector-specs`):
- **Typed parquet** in the legacy SDK extraction path — column types from Keboola Storage metadata (provider cascade `user > ai-metadata-enrichment > keboola.snowflake-transformation`) survive the CSV → parquet roundtrip; invalid date strings (`'0000-00-00'`) and invalid numeric strings (`'Non-Manager'`) become NULL while keeping the column's typed schema. Pre-fix everything was VARCHAR.
- **Incremental sync** via Storage API `changedSince` — opt-in per table; pulls only delta rows, merges into the existing parquet by `primary_key` (drop_duplicates with keep='last'). Cuts daily extraction from O(full table) to O(delta).
- **Partitioned sync** — flat per-partition layout `data/<table>/<key>.parquet` (e.g. `2026_05.parquet`), per-affected-partition merge for daily updates, chunked initial load with 1-day overlap and 2-empty-chunk stop heuristic.
- **`where_filters`** — server-side row filter with date placeholders (`{{today}}`, `{{last_3_months}}`, `{{start_of_3_months_ago}}`, etc.) resolved at sync time. Tables with filters are forced onto the SDK path, and the `incremental + where_filters` combination is rejected at the API layer because `changedSince` already filters temporally.
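
To make the partitioned initial load above more concrete, here is a minimal sketch of the flat per-partition path, the chunked windows with a 1-day overlap, and the 2-empty-chunk stop heuristic. Everything named here (`partition_key`, `partition_path`, `chunk_windows`, `initial_load`, the `fetch` callback, monthly key granularity, walking backwards from the newest date) is an assumption for illustration, not the connector's actual code.

```python
from datetime import date, timedelta
from typing import Callable, Iterator, List, Tuple


def partition_key(day: date) -> str:
    """Monthly key (assumed granularity), e.g. 2026-05-17 -> '2026_05'."""
    return f"{day.year:04d}_{day.month:02d}"


def partition_path(table: str, day: date) -> str:
    """Flat layout: data/<table>/<key>.parquet."""
    return f"data/{table}/{partition_key(day)}.parquet"


def chunk_windows(end: date, chunk_days: int,
                  max_history_days: int) -> Iterator[Tuple[date, date]]:
    """Yield half-open [start, stop) windows, newest first.

    Each older window stops one day after the newer window starts, so
    adjacent windows share exactly one day.
    """
    if chunk_days < 2:
        raise ValueError("chunk_days must be >= 2 to leave room for the overlap")
    oldest = end - timedelta(days=max_history_days)
    stop = end
    while True:
        start = max(stop - timedelta(days=chunk_days), oldest)
        yield start, stop
        if start <= oldest:
            return
        stop = start + timedelta(days=1)  # 1-day overlap with the window just yielded


def initial_load(fetch: Callable[[date, date], List[dict]], end: date,
                 chunk_days: int = 30, max_history_days: int = 365,
                 max_empty_chunks: int = 2) -> List[dict]:
    """Pull chunks until `max_empty_chunks` consecutive chunks come back empty."""
    rows: List[dict] = []
    empty_streak = 0
    for start, stop in chunk_windows(end, chunk_days, max_history_days):
        chunk = fetch(start, stop)
        if chunk:
            empty_streak = 0
            rows.extend(chunk)  # overlap rows repeat; a primary-key merge would dedupe them
        else:
            empty_streak += 1
            if empty_streak >= max_empty_chunks:
                break
    return rows
```

The overlap trades a few duplicate rows for protection against rows landing exactly on a chunk boundary, which is why a primary-key dedup step is assumed downstream.
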
## Architecture
- **Schema migration v25 → v26**: 7 new columns on `table_registry`. Existing `sync_strategy` column reused (pre-v26 it was inert catalog metadata; post-v26 the extractor dispatches off it).
- **Per-table dispatcher** in `extractor.run()` routes to one of `_extract_via_extension` (full_refresh + extension), `_extract_via_legacy` (full_refresh + filters or extension fallback), `extract_incremental`, or `extract_partitioned`.
- **API conflict policy**: `incremental + where_filters` → 422; `partitioned + query_mode='remote'` → 422; `partitioned ⇒ partition_by required`.
- **Admin UI**: third "Direct extract (Storage API)" radio in the Keboola Register / Edit modals, alongside existing "Whole table (extension)" and "Custom SQL". When selected, exposes a v26 sync-strategy panel with conditional fields per strategy.
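
The conflict policy above can be sketched as a plain validation function. The field names (`sync_strategy`, `where_filters`, `partition_by`, `query_mode`) come from this description; the function name, signature, and error messages are hypothetical, and the real validators presumably live in the admin API's request models.

```python
from typing import List, Optional


def validate_sync_config(sync_strategy: str = "full_refresh",
                         where_filters: Optional[list] = None,
                         partition_by: Optional[str] = None,
                         query_mode: Optional[str] = None) -> List[str]:
    """Return a list of conflict messages; an empty list means the config is accepted."""
    errors: List[str] = []
    if sync_strategy == "incremental" and where_filters:
        # changedSince already filters temporally, so the combination is rejected
        errors.append("incremental cannot be combined with where_filters")
    if sync_strategy == "partitioned":
        if query_mode == "remote":
            errors.append("partitioned cannot be combined with query_mode='remote'")
        if not partition_by:
            errors.append("partitioned requires partition_by")
    return errors


# Hypothetical usage: an API handler would map a non-empty result to a rejection.
print(validate_sync_config("incremental", where_filters=[{"column": "day"}]))
print(validate_sync_config("partitioned", partition_by="created_at"))
```
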
## Test plan
- [x] **Unit + module** — 134 v26 tests covering migration, repo, parquet_io, where_filters, incremental (compute_changed_since + merge_parquet + extract_incremental E2E), partitioned (key derivation + merge_partition + chunked windows + extract_partitioned E2E), extractor dispatcher, admin API validators, PUT field clearing, registry-shape → dispatcher bridge
- [x] **HTML form structure** — all v26 inputs + visibility classes + JS payload fields verified in rendered template
- [x] **Real Keboola roundtrip** — registered a small test table as `sync_strategy='incremental'` against a test Storage project, triggered two syncs:
  - Sync 1: `changedSince=None` → full pull → 9 rows typed parquet
  - Sync 2: `changedSince=last_sync - 1d window` → 9 delta rows merged with 9 existing → 9 after dedup on primary_key (PK merge confirmed)
- [x] **Browser UX** — agent-browser session against a local uvicorn: login → admin/tables → register modal → switch radios → verify field visibility per strategy → submit → edit existing row → switch to Direct/Incremental → save → confirm DB persistence
- [x] **Regression** — no regressions in the broader 3252-test suite (3 pre-v26 tests updated for the deprecation-marker removal + schema-version bump; 2 pre-existing environment-sensitive test failures unrelated to this change)
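
The PK merge exercised in the roundtrip can be illustrated with a small pandas sketch. It uses the `pd.concat` + `drop_duplicates(keep='last')` approach this PR describes, but `merge_by_primary_key` is a hypothetical stand-in operating on in-memory frames, not the actual `merge_parquet` function.

```python
import pandas as pd


def merge_by_primary_key(existing: pd.DataFrame, delta: pd.DataFrame,
                         primary_key: list) -> pd.DataFrame:
    """Append delta rows after existing rows, then keep the last row per key.

    Because delta rows come last, they win over existing rows with the same key.
    """
    combined = pd.concat([existing, delta], ignore_index=True)
    merged = combined.drop_duplicates(subset=primary_key, keep="last")
    return merged.reset_index(drop=True)


existing = pd.DataFrame({"id": [1, 2, 3], "amount": [10, 20, 30]})
delta = pd.DataFrame({"id": [2, 3, 4], "amount": [21, 31, 40]})
merged = merge_by_primary_key(existing, delta, ["id"])
print(merged)
```

When the delta contains exactly the same keys as the existing file, as in Sync 2 above, the row count is unchanged after the merge.
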
## Bugs caught + fixed during E2E
The browser + real-Keboola roundtrip exposed four bugs the unit tests missed:
1. **JS visibility race** — two competing `forEach` loops set `display=''` then `display='none'` on form elements sharing `kb-strategy-incremental kb-strategy-partitioned` classes (window_days + max_history_days are reused across strategies). Fix: single-pass selector with class-based visibility resolver.
2. **PUT cannot clear field** — pre-v26 `updates = {k: v ... if v is not None}` collapsed "omitted from body" and "sent as null" into the same case, so an admin couldn't switch a partitioned row back to full_refresh and have the stale `partition_by` cleared. Fix: `model_dump(exclude_unset=True)`.
3. **Subprocess DB lock conflict** — `_read_last_sync` reopened `system.duckdb` while the parent server held the write lock (subprocess contract at `app/api/sync.py:_run_sync` line 260). Fix: parent injects `__last_sync__` into table_config before subprocess spawn.
4. **Wrong KBC table_id** — `extract_incremental` / `extract_partitioned` built the Storage API table_id from the registry row's slugified `id` (`circle_inc`) instead of `bucket.source_table` (`in.c-finance.circle`), producing 404s. Fix: prefer `bucket+source_table`; fall back to `id` only when `bucket` is empty.
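
The difference behind bug 2 can be shown without any framework. This is a dependency-free sketch of the "omitted" versus "sent as null" distinction that `model_dump(exclude_unset=True)` preserves; the `UPDATABLE_FIELDS` tuple and both helper functions are hypothetical and only mimic the behaviour on plain dicts.

```python
import json
from typing import Any, Dict

# Hypothetical subset of updatable registry fields, for illustration only.
UPDATABLE_FIELDS = ("sync_strategy", "partition_by", "where_filters")


def updates_pre_fix(body: Dict[str, Any]) -> Dict[str, Any]:
    """Old behaviour: every field defaults to None, then None values are dropped.

    An explicit null and an omitted field become indistinguishable.
    """
    parsed = {field: body.get(field) for field in UPDATABLE_FIELDS}
    return {k: v for k, v in parsed.items() if v is not None}


def updates_post_fix(body: Dict[str, Any]) -> Dict[str, Any]:
    """Fixed behaviour: keep exactly the fields the client sent, nulls included."""
    return {k: body[k] for k in UPDATABLE_FIELDS if k in body}


body = json.loads('{"sync_strategy": "full_refresh", "partition_by": null}')
print(updates_pre_fix(body))   # partition_by is silently dropped, so the stale value stays
print(updates_post_fix(body))  # partition_by is kept as None, so the column can be cleared
```
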
## Operator notes
- Existing tables stay on `full_refresh` after migration; admins opt individual tables in via `agnes admin register-table --sync-strategy ...`, the Keboola Edit modal, or `POST/PUT /api/admin/registry`.
- `merge_parquet` and `merge_partition` use `pd.concat + drop_duplicates`, loading both existing and delta into pandas RAM. For tables in the multi-million-row range this may OOM — switch to `partitioned` strategy for those (per-partition merge keeps memory bounded). Documented in `### Internal` of the changelog entry.
- Date placeholders are resolved at **sync time**, not at registration, so a typo'd `{{lasst_week}}` is accepted at registration and surfaces only when the next sync runs. This is by design: rolling windows need late binding.
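
A rough sketch of late-bound placeholder resolution, showing why a typo only fails when a sync runs. The resolver name, the regex, and the exact meaning assigned to each placeholder (for example, `{{start_of_3_months_ago}}` as the first day of the month three months back) are assumptions; only two of the placeholders mentioned in this PR are covered.

```python
import re
from datetime import date
from typing import Optional

_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def _start_of_months_ago(day: date, months: int) -> date:
    """First day of the month that lies `months` months before `day`."""
    index = day.year * 12 + (day.month - 1) - months
    return date(index // 12, index % 12 + 1, 1)


def resolve_placeholders(value: str, today: Optional[date] = None) -> str:
    """Replace {{name}} tokens with ISO dates computed relative to `today`."""
    today = today or date.today()
    known = {
        "today": today,
        "start_of_3_months_ago": _start_of_months_ago(today, 3),  # assumed semantics
    }

    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in known:
            # Nothing validates the name earlier, so the typo surfaces here.
            raise ValueError("unknown date placeholder: " + match.group(0))
        return known[name].isoformat()

    return _PLACEHOLDER.sub(substitute, value)


print(resolve_placeholders("{{start_of_3_months_ago}}", today=date(2026, 5, 17)))
```

Passing `today` explicitly keeps the function deterministic in tests while still defaulting to the current date at sync time.
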
## Spec source
The four corresponding plans on the `zs/keboola-connector-specs` branch under `docs/superpowers/plans/2026-05-07-0[1-4]-*.md` capture the design rationale and link back to internal repo references for each subsystem.
{
  "hooks": {
    "SessionStart": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "agnes self-upgrade --quiet 2>/dev/null || true; agnes pull --quiet 2>/dev/null || true"
          }
        ]
      },
      {
        "hooks": [
          {
            "type": "command",
            "command": "bash -c \"agnes refresh-marketplace --quiet 2>/dev/null || true\""
          }
        ]
      }
    ],
    "SessionEnd": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "bash -c \"( nohup agnes push --quiet </dev/null >/dev/null 2>&1 & ) ; true\""
          }
        ]
      }
    ]
  },
  "permissions": {
    "allow": [
      "Bash(git rebase:*)",
      "Bash(git add:*)",
      "Bash(git checkout:*)",
      "Bash(git branch:*)",
      "Bash(git cherry-pick:*)",
      "Bash(git log:*)",
      "Bash(git show:*)",
      "Bash(git commit:*)",
      "Bash(git fetch:*)",
      "Bash(git diff:*)",
      "Bash(git status:*)",
      "Bash(git remote:*)",
      "Bash(git tag:*)",
      "Bash(find:*)",
      "Bash(ls:*)",
      "Bash(tree:*)",
      "Bash(head:*)",
      "Bash(tail:*)",
      "Bash(wc:*)",
      "Bash(which:*)",
      "Bash(where:*)",
      "Bash(pwd:*)",
      "Bash(whoami:*)",
      "Bash(echo:*)",
      "Bash(file:*)",
      "Bash(stat:*)",
      "Bash(bash server/scripts/*)",
      "Bash(python server/scripts/*)",
      "Bash(ssh:*)",
      "Bash(scp:*)",
      "WebFetch(domain:github.com)",
      "WebSearch"
    ],
    "deny": [
      "Read(**/.env)",
      "Read(**/.env.*)",
      "Read(**/credentials*)",
      "Read(**/*credentials*)",
      "Read(**/.credentials*)",
      "Read(**/secrets*)",
      "Read(**/*secrets*)",
      "Read(**/.secrets*)",
      "Read(**/*.pem)",
      "Read(**/*.key)",
      "Read(**/*.p12)",
      "Read(**/*.pfx)",
      "Read(**/*.keystore)",
      "Read(**/*id_rsa*)",
      "Read(**/*id_dsa*)",
      "Read(**/*id_ecdsa*)",
      "Read(**/*id_ed25519*)",
      "Read(**/.ssh/*)",
      "Read(**/.aws/credentials)",
      "Read(**/.aws/config)",
      "Read(**/.kube/config)",
      "Read(**/.docker/config.json)",
      "Read(**/.npmrc)",
      "Read(**/.pypirc)",
      "Read(**/.netrc)",
      "Read(**/.git-credentials)",
      "Read(**/master.key)",
      "Read(**/config/master.key)",
      "Read(**/*.crt)",
      "Read(**/*.cer)",
      "Read(**/*.jks)",
      "Read(**/password*)",
      "Read(**/*password*)",
      "Read(**/token*)",
      "Read(**/*token*)",
      "Read(**/apikey*)",
      "Read(**/*apikey*)",
      "Read(**/.htpasswd)",
      "Write(**/.env)",
      "Write(**/.env.*)",
      "Write(**/credentials*)",
      "Write(**/*credentials*)",
      "Write(**/secrets*)",
      "Write(**/*secrets*)",
      "Write(**/*.pem)",
      "Write(**/*.key)",
      "Write(**/.ssh/*)",
      "Edit(**/.env)",
      "Edit(**/.env.*)",
      "Edit(**/credentials*)",
      "Edit(**/*credentials*)",
      "Edit(**/secrets*)",
      "Edit(**/*secrets*)",
      "Edit(**/*.pem)",
      "Edit(**/*.key)",
      "Edit(**/.ssh/*)",
      "Bash(cat:*)",
      "Write(server/**)",
      "Edit(server/**)"
    ],
    "ask": [
      "Bash(rm:*)",
      "Bash(git reset:--hard:*)",
      "Bash(git clean:*)",
      "Bash(git push:--force:*)",
      "Bash(git push:-f:*)",
      "Bash(npm install:*)",
      "Bash(yarn add:*)",
      "Bash(pip install:*)",
      "Bash(composer install:*)",
      "Bash(docker:*)",
      "Bash(kubectl:*)",
      "Bash(grep:*)",
      "Bash(env:*)",
      "Write(**/package.json)",
      "Edit(**/package.json)",
      "Write(**/composer.json)",
      "Edit(**/composer.json)",
      "Write(**/package-lock.json)",
      "Write(**/composer.lock)",
      "Write(**/yarn.lock)",
      "Write(**/.gitignore)",
      "Edit(**/.gitignore)"
    ]
  }
}