## Summary
Brings the Keboola connector to feature parity with the legacy internal data-analyst's per-table sync strategies. Closes the four documented gaps from the spec branch (`zs/keboola-connector-specs`):
- **Typed parquet** in the legacy SDK extraction path — column types from Keboola Storage metadata (provider cascade `user > ai-metadata-enrichment > keboola.snowflake-transformation`) survive the CSV → parquet roundtrip; invalid date strings (`'0000-00-00'`) and invalid numeric strings (`'Non-Manager'`) become NULL while keeping the column's typed schema. Pre-fix everything was VARCHAR.
- **Incremental sync** via Storage API `changedSince` — opt-in per table; pulls only delta rows, merges into the existing parquet by `primary_key` (drop_duplicates with keep='last'). Cuts daily extraction from O(full table) to O(delta).
- **Partitioned sync** — flat per-partition layout `data/<table>/<key>.parquet` (e.g. `2026_05.parquet`), per-affected-partition merge for daily updates, chunked initial load with 1-day overlap and 2-empty-chunk stop heuristic.
- **`where_filters`** — server-side row filter with date placeholders (`{{today}}`, `{{last_3_months}}`, `{{start_of_3_months_ago}}`, etc.) resolved at sync time. Force the SDK path; reject `incremental + where_filters` combination at API layer (changedSince already filters temporally).
## Architecture
- **Schema migration v25 → v26**: 7 new columns on `table_registry`. Existing `sync_strategy` column reused (pre-v26 it was inert catalog metadata; post-v26 the extractor dispatches off it).
- **Per-table dispatcher** in `extractor.run()` routes to one of `_extract_via_extension` (full_refresh + extension), `_extract_via_legacy` (full_refresh + filters or extension fallback), `extract_incremental`, or `extract_partitioned`.
- **API conflict policy**: `incremental + where_filters` → 422; `partitioned + query_mode='remote'` → 422; `partitioned ⇒ partition_by required`.
- **Admin UI**: third "Direct extract (Storage API)" radio in the Keboola Register / Edit modals, alongside existing "Whole table (extension)" and "Custom SQL". When selected, exposes a v26 sync-strategy panel with conditional fields per strategy.
## Test plan
- [x] **Unit + module** — 134 v26 tests covering migration, repo, parquet_io, where_filters, incremental (compute_changed_since + merge_parquet + extract_incremental E2E), partitioned (key derivation + merge_partition + chunked windows + extract_partitioned E2E), extractor dispatcher, admin API validators, PUT field clearing, registry-shape → dispatcher bridge
- [x] **HTML form structure** — all v26 inputs + visibility classes + JS payload fields verified in rendered template
- [x] **Real Keboola roundtrip** — registered a small test table as `sync_strategy='incremental'` against a test Storage project, triggered two syncs:
- Sync 1: `changedSince=None` → full pull → 9 rows typed parquet
- Sync 2: `changedSince=last_sync - 1d window` → 9 delta rows merged with 9 existing → 9 after dedup on primary_key (PK merge confirmed)
- [x] **Browser UX** — agent-browser session against a local uvicorn: login → admin/tables → register modal → switch radios → verify field visibility per strategy → submit → edit existing row → switch to Direct/Incremental → save → confirm DB persistence
- [x] **Regression** — no regressions in the broader 3252-test suite (3 pre-v26 tests updated for the deprecation-marker removal + schema-version bump; 2 pre-existing environment-sensitive test failures unrelated to this change)
## Bugs caught + fixed during E2E
The browser + real-Keboola roundtrip exposed four bugs the unit tests missed:
1. **JS visibility race** — two competing `forEach` loops set `display=''` then `display='none'` on form elements sharing `kb-strategy-incremental kb-strategy-partitioned` classes (window_days + max_history_days are reused across strategies). Fix: single-pass selector with class-based visibility resolver.
2. **PUT cannot clear field** — pre-v26 `updates = {k: v ... if v is not None}` collapsed "omitted from body" and "sent as null" into the same case, so admin couldn't switch a partitioned row back to full_refresh and have stale `partition_by` clear. Fix: `model_dump(exclude_unset=True)`.
3. **Subprocess DB lock conflict** — `_read_last_sync` reopened `system.duckdb` while the parent server held the write lock (subprocess contract at `app/api/sync.py:_run_sync` line 260). Fix: parent injects `__last_sync__` into table_config before subprocess spawn.
4. **Wrong KBC table_id** — `extract_incremental` / `extract_partitioned` built the Storage API table_id from the registry row's slugified `id` (`circle_inc`) instead of `bucket.source_table` (`in.c-finance.circle`), producing 404s. Fix: prefer `bucket+source_table`; fall back to `id` only when bucket empty.
## Operator notes
- Existing tables stay on `full_refresh` after migration; admins opt individual tables in via `agnes admin register-table --sync-strategy ...`, the Keboola Edit modal, or `POST/PUT /api/admin/registry`.
- `merge_parquet` and `merge_partition` use `pd.concat + drop_duplicates`, loading both existing and delta into pandas RAM. For tables in the multi-million-row range this may OOM — switch to `partitioned` strategy for those (per-partition merge keeps memory bounded). Documented in `### Internal` of the changelog entry.
- Date placeholders are resolved at **sync time**, not register time — a typo'd `{{lasst_week}}` is accepted at register and surfaces only when the next sync runs. By design (rolling windows need late-binding).
## Spec source
The four corresponding plans on the `zs/keboola-connector-specs` branch under `docs/superpowers/plans/2026-05-07-0[1-4]-*.md` capture the design rationale and link back to internal repo references for each subsystem.
<!-- devin-review-badge-begin -->
---
<a href="https://app.devin.ai/review/keboola/agnes-the-ai-analyst/pull/217" target="_blank">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1">
<img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open in Devin Review">
</picture>
</a>
<!-- devin-review-badge-end -->
## Summary
`claude -p` (headless mode) gives SessionEnd hook subprocesses ~1 second before SIGTERM, regardless of work in progress. `agnes push` for a typical workspace takes 5-30s. The current synchronous SessionEnd hook (`agnes push --quiet 2>/dev/null || true`) was therefore being killed mid-first-upload — `|| true` masks the SIGTERM as exit 0, so this regression was invisible until I traced it via a wrapper script and Claude's `~/.claude/debug/<sid>.txt` log.
Fix: wrap SessionEnd push in `bash -c "( nohup agnes push --quiet </dev/null >/dev/null 2>&1 & ) ; true"`. The subshell exits immediately, orphaning the upload child to init so it survives the hook subprocess kill. Same `bash -c` pattern as the existing `refresh-marketplace` SessionStart entry (for Windows compatibility).
End-to-end verified against production: claude exited in 5s, detached child completed the upload, file `491e3a23-...jsonl` landed on the server within 30s with mtime 14:30 UTC.
## Test plan
- [x] `pytest tests/test_lib_hooks.py` — added `test_session_end_push_is_detached` regression test asserting `nohup`, `&`, `</dev/null` are all present.
- [x] `pytest tests/test_setup_hooks_template.py` — assertions loosened from `==` to `in` where necessary.
- [x] Verified end-to-end against production with the detached wrapper before opening this PR (manual probe).
<!-- devin-review-badge-begin -->
---
<a href="https://app.devin.ai/review/keboola/agnes-the-ai-analyst/pull/222" target="_blank">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1">
<img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open in Devin Review">
</picture>
</a>
<!-- devin-review-badge-end -->
## Summary
Verified against production: `claude -p` headless mode doesn't fire SessionEnd hooks (proven via `--output-format stream-json --include-hook-events`: zero `SessionEnd` events), so any session JSONLs from `-p` invocations stay orphaned locally and never reach the server. Fix: add `agnes push --quiet` as a third SessionStart entry — symmetric self-heal alongside the existing `agnes pull` entry. Existing workspaces pick this up on their next `agnes init` via the marker-based migration already in `cli/lib/hooks.py`.
Separately: a colleague's fresh install showed `agnes diagnose` warning "uploads are not being processed", which led them to suspect their `agnes push` was broken. The warning is actually about the LLM-based `verification-detector` backlog (uploads themselves were arriving fine — confirmed by 23+3 JSONLs landed on the server while the warning was firing). Reword the warning to "verification-detector backlog" + add `last_processed` to the diagnose dict so operators don't have to grep logs to confirm.
## Test plan
- [x] `pytest tests/test_lib_hooks.py` — updated count + added `agnes push in SessionStart` assertion.
- [x] `pytest tests/test_setup_hooks_template.py` — updated.
- [x] `pytest tests/test_clean_install_integration.py` — updated.
- [x] `pytest tests/test_health_session_pipeline.py` — updated warning text + asserted `last_processed` field.
<!-- devin-review-badge-begin -->
---
<a href="https://app.devin.ai/review/keboola/agnes-the-ai-analyst/pull/220" target="_blank">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1">
<img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open in Devin Review">
</picture>
</a>
<!-- devin-review-badge-end -->
Caught by my own broader test scope after Devin fixes — three test files
asserted on user-visible strings that were renamed by the bootstrap PR
but the assertions weren't updated:
- tests/test_api_query_guardrail.py:110 — asserted `da fetch in suggestion`
on /api/query 400 response. Renamed to `agnes snapshot create`.
- tests/test_query_materialized_error_message.py:56 — asserted `da sync`
in materialized-not-yet error detail. Renamed to `agnes pull`.
- tests/test_cli_error_render.py:71 — fixture data + assertion both
carried `da fetch`. Updated to `agnes snapshot create`.
Plus an actual content miss: docs/setup/claude_settings.json (a template
shipped to operators) still installed `da sync` / `da sync --upload-only`
hooks. The companion test file (tests/test_setup_hooks_template.py) was
asserting that legacy state. Updated both:
- Template hooks: `agnes pull --quiet` / `agnes push --quiet`
- Test assertions + function name match the new commands
The analyst flow becomes a closed loop with the server-curated table catalog:
- `da analyst setup` writes `<workspace>/.claude/settings.json` with two hooks:
SessionStart → `da sync --quiet || true` — pulls fresh RBAC-filtered parquets at session start
SessionEnd → `da sync --upload-only --quiet || true` — uploads session jsonl + CLAUDE.local.md
- `|| true` keeps Claude Code unblocked when the server is down.
- Workspace-level (not user-home) so the hooks fire only when Claude Code opens this analyst workspace.
- `da sync --quiet` rewrites the CLI output for hook consumption — 0 stdout on success, single-line error on failure.
- Existing settings.json is patched (deep-merged), not overwritten; malformed JSON is reported, not silently overwritten.
Tests cover: workspace bootstrap, hook insertion, malformed-json safety, quiet-mode output shape.
* fix(analyst): document BigQuery remote-query capability in bootstrap CLAUDE.md template
Closes#153.
The CLAUDE.md template generated by `da analyst bootstrap` (config/claude_md_template.txt)
covered metrics, sync, corporate memory, and directory layout — but had ZERO
mention of query_mode: "remote", da fetch, da query --remote, or --register-bq.
Result: the AI analyst running in a freshly-bootstrapped workspace had no
idea BigQuery-backed tables existed, no path to fetch unsynced data, and no
fallback for tables not in the catalog.
Validated against /Users/<user>/foundry-ai/foundryai-data-analyst/CLAUDE.md
on 2026-05-01: section confirmed missing. Workspace-level (parent-dir)
CLAUDE.md carried legacy SSH-heredoc instructions but the analyst-level
file (which Claude reads as primary project context) had nothing.
## Changes
### config/claude_md_template.txt (+83)
Added a `## Remote Queries (BigQuery)` section covering:
- Discovery first — `da catalog --json | jq '...'` to see all tables
with their query_mode, then `da schema` and `da describe` for shape.
- Three query patterns:
- `da fetch` (preferred) — materialize a filtered subset locally,
query the snapshot, drop when done.
- `da query --remote` — one-shot server-side execution (cheap probes).
- `da query --register-bq` — hybrid joins between local + ad-hoc BQ.
- `da fetch` estimate-first discipline — rules of thumb on
--select / --where / --estimate / snapshot reuse.
- BigQuery SQL flavor cheat sheet for `--where` (DATE literal,
DATE_SUB, REGEXP_CONTAINS, CAST AS INT64).
- Unknown-table fallback: when a table isn't in `da catalog` at all,
use ad-hoc `--register-bq` if the agnes server SA has BQ access, or
ask admin to register with `query_mode: "remote"` for ongoing use.
- Pointer to `da skills show agnes-data-querying` for deeper guidance.
### docs/setup/claude_md_template.txt (deleted)
Stale 359-line template that documented the deprecated SSH-heredoc
remote_query.sh protocol. No code references it (verified via grep
across .py / .sh / .yml / .md). Removing eliminates two failure
modes:
1. A future refactor accidentally pulling it into a workspace and
shipping deprecated guidance to analyst Claude sessions.
2. Reviewer confusion over which template is canonical.
### CHANGELOG.md
`### Fixed` and `### Removed` entries under [Unreleased].
## Tested
- Manually walked the diff against `da skills show agnes-data-querying`
output on a live VM (foundryai-development) — patterns + flags
match the modern CLI exactly.
- Re-bootstrap test deferred: requires network round-trip; pattern
is identical to existing template substitution path so render is
not at risk.
## Out of scope
- The companion gap that data_description.md often only enumerates
query_mode: "local" tables (no signal that other modes exist) —
separate concern, fix likely belongs in the metadata generator
on the server side, not in the analyst template.
- Encouraging admins to register frequently-queried BQ tables as
`query_mode: "remote"` in the registry — workflow improvement, not
a code bug.
* chore(release): cut 0.28.0
---------
Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
Session testing revealed 3 issues with remote queries:
1. CLAUDE.md template recommended `cat <<HEREDOC | ssh ...` but
claude_settings.json had `cat` in deny list, causing 2-3 failed
attempts per query. Replaced with file-based approach: Write tool
creates JSON file, then `ssh ... < file` avoids the cat deny.
2. ssh/scp commands were not in the allow list, requiring manual
approval for every remote query. Added both to allow list.
3. DuckDB fetch_arrow_table() emitted DeprecationWarning on every
parquet export. Replaced with .arrow().read_all().
Also added instruction for proactive hybrid analysis when remote
tables are available (agent was only using local data until asked).
Agent was failing 3x on SSH commands due to backticks (BQ table names)
and single quotes (SQL string literals) getting mangled by nested shell
interpretation (local -> SSH -> bash -> Python).
New --stdin mode reads query spec as JSON from stdin via heredoc:
cat <<'QUERY' | ssh alias 'bash remote_query.sh --stdin'
{"register_bq": {"alias": "SELECT ... FROM \`table\` ..."}, "sql": "..."}
QUERY
Heredoc with <<'QUERY' (quoted) passes everything literally -- no
escaping needed for backticks, quotes, or parentheses.
Updated claude_md_template.txt to use --stdin as the primary method.
Analyst user (foundry_e_psimecek) couldn't access /opt/data-analyst/.
Added to data-ops group on server.
New scripts/remote_query.sh wrapper handles env setup (PYTHONPATH,
CONFIG_DIR, .env) so agents use simple:
ssh alias 'bash ~/server/scripts/remote_query.sh --sql "..." --format table'
Updated claude_md_template.txt to use wrapper instead of raw commands.
Add src/remote_query.py CLI module enabling the AI agent to run SQL
queries spanning local Parquet tables and remote BigQuery tables in a
single DuckDB session on the server. Two-phase protocol: BQ sub-queries
(--register-bq) fetch filtered/aggregated data, then DuckDB SQL (--sql)
joins everything.
Safety: COUNT(*) pre-check, memory estimation (2GB cap), row limits
(500K per BQ sub-query, 100K final result).
Changes:
- New src/remote_query.py with CLI, BQ registration, output formatting
- Add bq_entity_type field to TableConfig (view vs table routing)
- Extract create_local_views() from duckdb_manager.py for reuse
- Update claude_md_template.txt with remote query agent instructions
- Update example configs with remote_query section and docs
- 52 new tests (42 remote_query + 10 bq_entity_type), all passing
Server venv is created during bootstrap via SSH (same package list,
installed natively on Linux). Removes sync_data.sh section that copied
pip freeze output across platforms (Windows/macOS freeze is incompatible
with Linux).
- Rewrite bootstrap.yaml as clean structured YAML with steps, commands,
descriptions, conditions, and notes
- Add _generate_setup_instructions() in app.py that reads YAML, substitutes
placeholders, and produces clipboard-ready plain text
- Replace 50-line hardcoded JS string builder with single tojson variable
- All setup instructions now editable in one YAML file
Add {ssh_alias} and {ssh_key} placeholders so each instance can use
its own SSH config name (avoids conflicts when user has multiple instances).
Remove Keboola-specific sync_settings and dataset references.
Simplify to single download_server_data step (rsync with scp fallback).
Handle SSH alias conflicts gracefully.
Phase 1 - Internal reference cleanup:
- Delete dev_docs/meetings/ (internal meeting notes/transcripts)
- Replace hardcoded usernames (padak/matejkys/dasa) with deploy/generic
- Replace "Internal AI Data Analyst" with "AI Data Analyst"
- Replace keboola/internal_ai_data_analyst URLs with your-org/ai-data-analyst
- Replace /tmp/keboola_load/ with /tmp/data_analyst_staging/ in dev_docs
Phase 2 - Deployment hardening:
- Tighten sudoers wildcards to explicit paths (visudo, sudoers cp)
- setup.sh creates all groups (data-ops, dataread, data-private) and deploy user
- webapp-setup.sh copies sudoers-webapp from repo instead of inline definition
- deploy.sh conditional copy for data_description.md (not in git for OSS)
- deploy.sh ownership changed to deploy:data-ops for /data/{scripts,docs,examples}
Phase 3 - Config and misc:
- Add ${ENV_VAR} interpolation to config/loader.py
- Expand config/instance.yaml.example with all sections (admins, deployment, auth, etc.)
- Create config/.env.template for secret values
- Add MIT LICENSE
- Fix .gitignore: add .venv/, docs/data_description.md
- Fix README.md: CSV status Planned, remove metrics/, update license text
- Translate Czech comments in requirements.txt to English
- Fix test_account_service.py: mock username mapping instead of relying on instance config
All 118 tests pass.
Open-source AI data analyst platform extracted from internal repo.
Includes data sync engine, Keboola adapter, Flask web portal,
server deployment scripts, and configuration templates.