agnes-the-ai-analyst

Author	SHA1	Message	Date
ZdenekSrotyr	506a378c3a	release: 0.47.1 — Keboola connector v27 (incremental, partitioned, where_filters, typed parquet) (#217 ) ## Summary Brings the Keboola connector to feature parity with the legacy internal data-analyst's per-table sync strategies. Closes the four documented gaps from the spec branch (`zs/keboola-connector-specs`): - Typed parquet in the legacy SDK extraction path — column types from Keboola Storage metadata (provider cascade `user > ai-metadata-enrichment > keboola.snowflake-transformation`) survive the CSV → parquet roundtrip; invalid date strings (`'0000-00-00'`) and invalid numeric strings (`'Non-Manager'`) become NULL while keeping the column's typed schema. Pre-fix everything was VARCHAR. - Incremental sync via Storage API `changedSince` — opt-in per table; pulls only delta rows, merges into the existing parquet by `primary_key` (drop_duplicates with keep='last'). Cuts daily extraction from O(full table) to O(delta). - Partitioned sync — flat per-partition layout `data/<table>/<key>.parquet` (e.g. `2026_05.parquet`), per-affected-partition merge for daily updates, chunked initial load with 1-day overlap and 2-empty-chunk stop heuristic. - `where_filters` — server-side row filter with date placeholders (`{{today}}`, `{{last_3_months}}`, `{{start_of_3_months_ago}}`, etc.) resolved at sync time. Force the SDK path; reject `incremental + where_filters` combination at API layer (changedSince already filters temporally). ## Architecture - Schema migration v25 → v26: 7 new columns on `table_registry`. Existing `sync_strategy` column reused (pre-v26 it was inert catalog metadata; post-v26 the extractor dispatches off it). - Per-table dispatcher in `extractor.run()` routes to one of `_extract_via_extension` (full_refresh + extension), `_extract_via_legacy` (full_refresh + filters or extension fallback), `extract_incremental`, or `extract_partitioned`. 
- API conflict policy: `incremental + where_filters` → 422; `partitioned + query_mode='remote'` → 422; `partitioned ⇒ partition_by required`. - Admin UI: third "Direct extract (Storage API)" radio in the Keboola Register / Edit modals, alongside existing "Whole table (extension)" and "Custom SQL". When selected, exposes a v26 sync-strategy panel with conditional fields per strategy. ## Test plan - [x] Unit + module — 134 v26 tests covering migration, repo, parquet_io, where_filters, incremental (compute_changed_since + merge_parquet + extract_incremental E2E), partitioned (key derivation + merge_partition + chunked windows + extract_partitioned E2E), extractor dispatcher, admin API validators, PUT field clearing, registry-shape → dispatcher bridge - [x] HTML form structure — all v26 inputs + visibility classes + JS payload fields verified in rendered template - [x] Real Keboola roundtrip — registered a small test table as `sync_strategy='incremental'` against a test Storage project, triggered two syncs: - Sync 1: `changedSince=None` → full pull → 9 rows typed parquet - Sync 2: `changedSince=last_sync - 1d window` → 9 delta rows merged with 9 existing → 9 after dedup on primary_key (PK merge confirmed) - [x] Browser UX — agent-browser session against a local uvicorn: login → admin/tables → register modal → switch radios → verify field visibility per strategy → submit → edit existing row → switch to Direct/Incremental → save → confirm DB persistence - [x] Regression — no regressions in the broader 3252-test suite (3 pre-v26 tests updated for the deprecation-marker removal + schema-version bump; 2 pre-existing environment-sensitive test failures unrelated to this change) ## Bugs caught + fixed during E2E The browser + real-Keboola roundtrip exposed four bugs the unit tests missed: 1. 
JS visibility race — two competing `forEach` loops set `display=''` then `display='none'` on form elements sharing `kb-strategy-incremental kb-strategy-partitioned` classes (window_days + max_history_days are reused across strategies). Fix: single-pass selector with class-based visibility resolver. 2. PUT cannot clear field — pre-v26 `updates = {k: v ... if v is not None}` collapsed "omitted from body" and "sent as null" into the same case, so admin couldn't switch a partitioned row back to full_refresh and have stale `partition_by` clear. Fix: `model_dump(exclude_unset=True)`. 3. Subprocess DB lock conflict — `_read_last_sync` reopened `system.duckdb` while the parent server held the write lock (subprocess contract at `app/api/sync.py:_run_sync` line 260). Fix: parent injects `__last_sync__` into table_config before subprocess spawn. 4. Wrong KBC table_id — `extract_incremental` / `extract_partitioned` built the Storage API table_id from the registry row's slugified `id` (`circle_inc`) instead of `bucket.source_table` (`in.c-finance.circle`), producing 404s. Fix: prefer `bucket+source_table`; fall back to `id` only when bucket empty. ## Operator notes - Existing tables stay on `full_refresh` after migration; admins opt individual tables in via `agnes admin register-table --sync-strategy ...`, the Keboola Edit modal, or `POST/PUT /api/admin/registry`. - `merge_parquet` and `merge_partition` use `pd.concat + drop_duplicates`, loading both existing and delta into pandas RAM. For tables in the multi-million-row range this may OOM — switch to `partitioned` strategy for those (per-partition merge keeps memory bounded). Documented in `### Internal` of the changelog entry. - Date placeholders are resolved at sync time, not register time — a typo'd `{{lasst_week}}` is accepted at register and surfaces only when the next sync runs. By design (rolling windows need late-binding). 
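The late-binding behaviour in the last operator note can be illustrated with a minimal sketch. The `resolve_placeholders` helper and its placeholder table are hypothetical; only `{{today}}` is modeled, and the ISO date format is an assumption, since this message does not spell out the connector's real placeholder semantics.

```python
import re
from datetime import date

# Hypothetical placeholder table. The real connector supports more names
# (e.g. {{last_3_months}}) whose exact semantics are not given in this message.
PLACEHOLDERS = {
    "today": lambda today: today.isoformat(),
}

_PATTERN = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def resolve_placeholders(value: str, today: date) -> str:
    """Substitute {{name}} tokens at sync time; unknown names fail here."""

    def _sub(match: re.Match) -> str:
        name = match.group(1)
        if name not in PLACEHOLDERS:
            # A typo such as {{lasst_week}} is only detected at this point,
            # i.e. when a sync runs, not when the table is registered.
            raise ValueError(f"unknown date placeholder: {{{{{name}}}}}")
        return PLACEHOLDERS[name](today)

    return _PATTERN.sub(_sub, value)
```

Because the substitution takes `today` as an argument, the same registered filter yields a different window on each sync, which is the reason the message gives for resolving late.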
## Spec source The four corresponding plans on the `zs/keboola-connector-specs` branch under `docs/superpowers/plans/2026-05-07-0[1-4]-*.md` capture the design rationale and link back to internal repo references for each subsystem.	2026-05-07 19:01:27 +02:00
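The incremental merge described in release 0.47.1 keeps the last row seen per `primary_key`. The sketch below shows the same last-write-wins rule on plain dicts. It is a stand-in for illustration, not the pandas-based `merge_parquet`, and the `merge_rows` name is hypothetical.

```python
from typing import Any, Dict, Iterable, List, Sequence, Tuple

Row = Dict[str, Any]


def merge_rows(existing: Iterable[Row], delta: Iterable[Row],
               primary_key: Sequence[str]) -> List[Row]:
    """Merge delta rows into existing rows; the later row wins per key."""
    merged: Dict[Tuple[Any, ...], Row] = {}
    # Existing rows go in first so that delta rows overwrite them.
    for row in list(existing) + list(delta):
        key = tuple(row[col] for col in primary_key)
        merged[key] = row
    return list(merged.values())
```

Row order here follows first insertion of each key, which may differ from the order pandas produces with `keep='last'`. Like the pandas version, this holds both inputs in memory, so the OOM caveat from the operator notes applies equally.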
ZdenekSrotyr	28430ced09	Keboola cutover: native parquet path + sync correctness + auto-discover protection (#190 ) * fix: cutover regressions + parallel Keboola legacy fallback Bundled fixes from a fresh-deploy run on a Keboola Storage backend with the block-shared-snowflake-access feature flag — DuckDB Keboola extension's per-table scan can't access bucket schemas, so the legacy kbcstorage Storage-API client is the only working path. CUTOVER REGRESSIONS - agnes pull hash mismatch on every Keboola local-mode table — src/orchestrator.py:_update_sync_state stored md5(mtime+size)[:12] while the CLI compares against full 32-char content MD5. Now stores the same content MD5 the materialized SQL path already used. - Trailing-slash sanitization in connectors/keboola/access.py and extractor.py — DuckDB Keboola extension's ATTACH fails when the URL ends in / (canonical form). - src/profiler.py:TableInfo.description becomes optional — two call sites instantiated without it, crashing the profiler pass. - scripts/ops/agnes-auto-upgrade.sh: chown on UID change — older images ran as root, current runs as agnes (uid 999). Reads target uid:gid from /etc/passwd inside the new image and chowns ${STATE_DIR}, /data/extracts, /data/analytics when the digest moves. - POST /api/sync/trigger is now singleton per process — two near-simultaneous trigger calls each forked an extractor subprocess, fought for extract.duckdb's file lock, starved uvicorn, flipped the container to unhealthy. Trigger now returns 409 (sync_already_in_progress) when held; _run_sync acquires non-blocking. PARALLEL LEGACY FALLBACK - Process pool fan-out for the _extract_via_legacy queue (default 8 workers, override via AGNES_KEBOOLA_PARALLELISM). Process pool, not thread pool, because connectors/keboola/client.py:export_table does os.chdir(temp_dir) — process-global, so threads raced and slice files landed in the wrong directory ("[Errno 2] No such file or directory: '<job_id>.csv_X_Y_Z.csv'"). 
- Extractor subprocess timeout 1800s -> 3600s (configurable via AGNES_EXTRACTOR_TIMEOUT_SEC). 28+ tables × multi-minute Keboola export jobs need the headroom on telemetry-class projects. - Process group cleanup on timeout — Popen(start_new_session=True) puts the extractor in its own group. On timeout the parent SIGTERMs the group (10s grace) then SIGKILLs stragglers. Without this, the pool workers were reparented to PID 1 and continued holding open Keboola Storage export jobs. Inline extractor script also installs a SIGTERM -> sys.exit(143) handler so the with ProcessPoolExecutor(...) block __exit__ runs cleanly. Tests: existing tests that patched subprocess.run updated to patch subprocess.Popen with a _FakePopen stand-in (same exit-code-injection contract). Two tests that exercised the parallel path forced AGNES_KEBOOLA_PARALLELISM=1 to keep mocks alive (mocks don't ride into ProcessPoolExecutor subprocesses). Squashed onto current main (was 7 commits + multi-commit CHANGELOG + agnes-auto-upgrade.sh conflicts; squash avoids per-commit conflict resolution against main's flat-mount STATE_DIR refactor and 0.38.0 release cut). * feat(keboola): Storage API direct extract path; drop extension data path The DuckDB Keboola extension's COPY routes through Keboola QueryService, which is unreliable on linked-bucket projects (extension v0.1.6 fixes that case but isn't yet in the community CDN, and pre-fix any project with the block-shared-snowflake-access feature flag couldn't see bucket schemas at all). Move the extract path off the extension entirely and talk to the Storage API directly via signed-URL download — works on any project, regardless of extension state. connectors/keboola/storage_api.py (NEW) Lightweight client built on requests.Session. 
Three endpoints: - POST /v2/storage/tables/{id}/export-async (kicks off job) - GET /v2/storage/jobs/{id} (poll until done) - GET /v2/storage/files/{id}?federationToken=1 (signed URL detail) - GET <signed_url> (download bytes) Supports sliced exports (manifest + per-slice signed URLs) and gzipped payloads. ExportFilter dataclass mirrors the Keboola filter spec (whereFilters / columns / changedSince / limit) and handles JSON round-trip with the registry's source_query column. Token redaction in error messages. Bounded exponential backoff on job polling. No cloud-SDK dependency on the data path; thread-safe. connectors/keboola/extractor.py - materialize_query() rewritten: takes bucket/source_table/source_query (JSON filter spec), exports via KeboolaStorageClient, converts CSV to parquet via DuckDB, atomic os.replace. Same return shape so sync.py downstream code stays uniform with the BQ branch. - _extract_via_legacy() also moved to Storage API direct (kept the name for caller compatibility with _legacy_worker / the parallel batch extractor). Per-call temp directories — no os.chdir, threads don't race. app/api/sync.py _run_materialized_pass for source_type='keboola' rows now constructs a KeboolaStorageClient (replaces KeboolaAccess) and passes bucket/source_table/source_query to materialize_query. Reuses one client across rows for HTTP keep-alive. Sources keboola URL from env too (KEBOOLA_STACK_URL) when instance.yaml doesn't have stack_url configured. cli/commands/admin.py discover-and-register defaults Keboola rows to query_mode='materialized' (NULL source_query = full table), matching the v26 migration's unification of the local/materialized split for Keboola. BigQuery and Jira keep their per-source defaults. src/db.py Schema bump 25 → 26. Migration: UPDATE table_registry SET query_mode='materialized' WHERE source_type='keboola' AND query_mode='local'. 
NULL source_query on those rows means "full table export" — same effective behavior the local mode provided, but now via Storage API instead of the extension. pyproject.toml kbcstorage dep stays (admin-side bucket/table list still uses the SDK in app/api/admin.py / connectors/keboola/client.py); only the data path is migrated off the SDK. Comment updated to reflect the new boundary. tests - test_keboola_storage_api.py (NEW, 19 tests): ExportFilter parsing, HTTP client (token redaction, retry logic, polling), download_file (single, gzipped, sliced), end-to-end export_table_to_csv. - test_keboola_materialize.py rewritten: mocks KeboolaStorageClient instead of FakeAccess; same atomic-write + zero-rows + unsafe-id contracts. - test_sync_trigger_keboola_materialized.py: registry rows now carry bucket+source_table+JSON-shape source_query. 114+ Keboola-impacted tests green locally. * test: schema version assertion bumped to 26 alongside the keboola query_mode migration * fix(keboola): cutover hot-patches surfaced on agnes-dev Five small fixes that were applied as in-container hot-patches during agnes-dev cutover and need to be on the source-of-truth image so a fresh upgrade does not undo them. - app/api/sync.py: auto-discover gate considers the WHOLE registry (any source, any mode), not just rows where source matches and query_mode is local. After the v25→v26 keboola materialized migration an instance can have 30 materialized rows and zero local rows; the previous gate kept re-firing _discover_and_register_tables every scheduler tick, creating duplicate auto-discovered rows with the wrong bucket prefix every time. - app/api/admin.py: _discover_and_register_tables reassembles the bucket as <stage>.<bucket-id> (e.g. in.c-finance) instead of dropping the stage prefix; default query_mode for keboola is now materialized (the v26 contract); validator allows NULL source_query for keboola materialized rows (full-table export via Storage API export-async, no SQL needed). 
- cli/commands/admin.py: register-table mirrors the server validator (NULL source_query allowed for source_type=keboola); --bucket help text generalized to cover both BQ dataset and Keboola bucket id. - connectors/keboola/extractor.py: max_line_size=64 MiB on read_csv_auto so embedded JSON / SQL cells (kbc_component_configuration in particular) do not trip the default 2 MiB ceiling. - connectors/keboola/storage_api.py: GCP backend support — when the Storage API returns a manifest whose slice URLs are gs:// references with a gcsCredentials block, rewrite to the JSON REST download endpoint and authenticate with the issued OAuth bearer token; redact tokens in any surfaced error string. * test: align with new keboola materialized + auto-discover-gate contracts - test_admin_keboola_materialized: rename test_register_keboola_materialized_rejects_missing_source_query → test_register_keboola_materialized_accepts_missing_source_query. v25→v26 introduced 'keboola materialized with NULL source_query means full-table export via Storage API export-async' as the default registration shape; the rejection case is no longer the contract. - test_sync_filter: add list_all() to _StubRegistry. The auto-discover gate in _run_sync now keys off the WHOLE registry (not just local rows) so materialized-only Keboola instances do not re-trigger discovery on every tick. * feat(keboola): native parquet export — skip CSV roundtrip Storage API export-async accepts fileType={csv,parquet}. Switching the materialized sync to parquet eliminates the CSV → DuckDB COPY → parquet roundtrip that pinned a single uvicorn worker over 4 GiB on multi-GB tables (read_csv with all_varchar + max_line_size=64MB has to materialize the whole CSV in memory before COPY can stream out a parquet). Snowflake UNLOAD on Keboola's side already produces typed, self-contained parquet files; the extractor downloads them and renames into place. 
Two cases: - Single-file export (small table): file_info.url points at one signed URL; download_file streams chunks straight to .parquet.tmp and we're done. No DuckDB. - Sliced export (Snowflake UNLOAD respects MAX_FILE_SIZE — 16 MiB default — so anything larger arrives as N parquet slices): each slice is a complete parquet file with its own footer; naive concat would corrupt them. download_file_slices keeps the slices as separate files in a tempdir, then DuckDB COPY (SELECT * FROM read_parquet([slice0, slice1, ...])) merges them into one consolidated parquet. DuckDB streams row groups during this — peak memory bounded to one row group (~1 MiB) regardless of source size. The legacy CSV path stays as the explicit opt-in via source_query= '{"file_type":"csv"}' for projects whose backend can't UNLOAD parquet (none known today; cheap escape hatch). Backward-compat alias KeboolaStorageClient.export_table_to_csv kept. Also fixes a latent bug in download_file's gzip detection: previous heuristic flagged any unencrypted file as gzipped, which would have corrupted parquet downloads at gunzip time. Name-suffix-only now. * fix: tempdir leak cleanup, every 0m schedule, /sync/trigger body shapes Three small self-contained fixes uncovered during agnes-dev cutover. - connectors/keboola/extractor.py: tempfile.TemporaryDirectory now uses ignore_cleanup_errors=True so a worker death mid-write doesn't leave multi-GiB stale slice trees on the boot disk. (12 GiB seen after a disk-full crash where TemporaryDirectory's own cleanup also raised and got swallowed.) - src/scheduler.py: is_valid_schedule accepts 'every 0m' (interval=0 = always due). Force-resync of an errored row no longer requires waiting out the default 'every 1h' interval — admin can flip the schedule, trigger, then flip back. 
- app/api/sync.py: POST /api/sync/trigger accepts both ['table_id'] (legacy bare-array body) and {'tables': ['table_id']} (matches the response payload shape, more discoverable for clients building requests by hand). Malformed bodies return 422 with a structured detail; null/missing means 'sync everything' as before. Tests cover: tempdir cleanup on raise (sliced parquet path), is_valid_schedule + is_table_due 'every 0m' acceptance, and trigger body parametrized matrix (8 valid shapes + 6 rejection cases). * fix: targeted-trigger filter in materialized pass + auto-upgrade defer Two operational gaps observed during agnes-dev cutover, in the same sync-routing area. - _run_materialized_pass now takes a 'tables' arg and skips rows not in the target set with reason='not_in_target'. POST /api/sync/trigger with a body of tables previously only scoped the legacy extractor subprocess — the materialized pass kept iterating every due materialized row, so an admin asking to re-sync kbc_job re-ran every other due materialized row alongside it. Match on registry id OR name (admins commonly pass either form). tables=None preserves the no-filter behavior. - New GET /api/sync/status (public, no auth) returns {locked: bool} off _sync_lock.locked(). agnes-auto-upgrade.sh probes this before docker compose up -d and exits 0 with a 'deferred recreate' log line if a sync is in flight — the next 5-min cron tick retries. Pre-fix, an auto-upgrade triggered mid-sync would recreate the uvicorn worker and kill the in-flight extractor / Snowflake-UNLOAD download (observed when kbc_job's first 7-day retry got SIGKILLed). Connection failures in the probe fall through to the upgrade — being stuck on a wedged image is worse than interrupting a hypothetical sync. * fix: auto-discover protects admin overrides + surfaces drift Two real-world incidents on agnes-dev drove this: 1. kbc_job was registered manually with the correct (in.c-kbc_telemetry, kbc_job) coordinates. 
A naive auto-discover re-run would have inserted a SECOND kbc_job row at the slugified id 'in_c-keboola-storage_kbc_job' (where Keboola's discovery places it) — and that row's Storage API export-async 404s. 2. An earlier auto-discover bug stripped the stage prefix from bucket ids ('c-finance' instead of 'in.c-finance'), inserting 137 rows whose syncs all failed. Fix: - _discover_and_register_tables now builds a plan first (_build_keboola_discovery_plan) classifying each discovered table into one of new / existing_match / existing_drift / invalid, then executes only the 'new' bucket. Drift rows are reported with both sides of the disagreement plus drift_kind: - same_id_diff_coords: registry has the same id but different bucket / source_table (admin migrated coords inline). - name_collision: discovery's slugified id differs from any registry id, but the discovered .name matches an existing row's .name (case-insensitive). Catches the kbc_job case. - Bucket detection now prefers the API's authoritative bucket_id field (separate field on the Keboola tables.list response, normalised by KeboolaClient.discover_all_tables). Falls back to id-string parsing only when bucket_id is missing (older fallback path inside discover_all_tables). - Endpoint POST /api/admin/discover-and-register?dry_run=true returns the plan without writing — would_register, drift, invalid lists. Lets an operator audit before merging discovery with a registry that has admin overrides. Removed 'every 0m' from test_register_request_rejects_malformed_sync_schedule — the runtime started accepting it in the previous commit (force-resync override) and the validator follows suit. * feat(keboola): AGNES_TEMP_DIR routes tempfiles off overlayfs /tmp The container's /tmp lives on the boot disk's overlayfs (29 GiB on agnes-dev, shared with /var). Snowflake UNLOAD of a wide table writes slices into per-call /tmp tempdirs that fill multi-GiB / many-slice exports long before the dedicated data disk fills. 
agnes-dev hit 100% boot-disk while the 20 GiB data disk had 15 GiB free. connectors.keboola.storage_api.get_temp_root() reads AGNES_TEMP_DIR; mkdirs the target on first use; unset / empty / unwritable falls back to None (system tempdir, OSS-pre-fix behaviour). Both materialize_query (parquet path) and _extract_via_legacy (CSV fallback) and the sliced-CSV concat path in storage_api use the helper now. docker-compose.yml defaults AGNES_TEMP_DIR=/data/tmp on app, scheduler, and extract services. The data volume is the dedicated disk in production layouts and a plain docker volume in single-disk dev/laptop setups — same blast radius as the previous /tmp default on the latter, no regression.	2026-05-07 12:12:14 +02:00
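The `AGNES_TEMP_DIR` fallback rules above (unset, empty, or unwritable all yield `None`) can be sketched as follows. This is an assumption-laden illustration of the described behaviour, not the code in `connectors/keboola/storage_api.py`; in particular, the writability check via `os.access` is a guess at how "unwritable" is detected.

```python
import os
from typing import Optional


def get_temp_root() -> Optional[str]:
    """Return a usable temp root from AGNES_TEMP_DIR, or None to fall back."""
    root = os.environ.get("AGNES_TEMP_DIR", "").strip()
    if not root:
        # Unset or empty: let the caller use the system tempdir.
        return None
    try:
        # Create the target on first use.
        os.makedirs(root, exist_ok=True)
    except OSError:
        return None
    # Assumed check: treat a directory we cannot write to as unusable.
    if not os.access(root, os.W_OK):
        return None
    return root
```

A caller would pass the result straight through, for example `tempfile.TemporaryDirectory(dir=get_temp_root())`, since `dir=None` selects the system default.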
ZdenekSrotyr	103efb69f0	chore(cli-rename): replace stale `da` verbs in active code paths Bring admin UI, audit-log messages, code comments, and analyst-facing skill docs in line with the post-bootstrap CLI surface (`agnes pull`, `agnes push`, `agnes init`, `agnes snapshot create`). The legacy `_LEGACY_STRINGS` detection tuple in `app/api/claude_md.py` and the hook upgrade markers in `cli/lib/hooks.py` are intentionally left as-is — they exist precisely to flag pre-rewrite content for re-authoring. Strip "(folded from `da metrics list`)" / "(lifted from `da metrics show`)" / "Replaces the old `da analyst status`" docstring noise — the rename history is in CHANGELOG.md, not in module docstrings.	2026-05-04 21:10:43 +02:00
ZdenekSrotyr	e438170ade	merge: pull #174 (BQ materialize view fix + concurrency, 0.33.0) into bootstrap branch Brings in zs/materialize-sync-fix (PR #174): - BigQuery view materialize works (wrap admin SQL in bigquery_query()) - Per-table mutex + fcntl.flock for concurrent COPY corruption - Cost guardrail dry-run engages on materialized rows - Schema v23 -> v24 migration: rewrite source_query to BQ-native - Server-generated trivial source_query from bucket+source_table - Validator backtick relaxation for materialized rows - 0.33.0 release cut Conflict resolution: - CHANGELOG.md: keep our [Unreleased] (bootstrap rewrite content) ABOVE the new [0.33.0] section from #174. The bootstrap rewrite remains unreleased; it'll cut 0.34.0 (or later) when this PR merges to main. - tests/conftest.py: union — keep our analyst-bootstrap fixture re-export AND #174's bq_instance / stub_bq_extractor fixtures. - pyproject.toml auto-merged to 0.33.0 (matches the cut), correct. - src/db.py auto-merged: SCHEMA_VERSION = 24, _v23_to_v24_finalize added — no overlap with our work which left schema at v23. - CLAUDE.md auto-merged: schema-history paragraph extended with v24. Verified: 79/79 across CLI bootstrap suite + materialize suite + schema v24 migration tests pass locally on Python 3.13/macOS.	2026-05-04 20:53:00 +02:00
ZdenekSrotyr	c7c42de0f0	feat(sync): treat MaterializeInFlightError as 'skipped, in_flight' _run_materialized_pass distinguishes due-check skips from in-flight skips and never calls state.set_error for either. summary['skipped'] becomes a list of {table, reason} dicts; the end-of-pass log line breaks out the in_flight subcount. Hoists is_table_due to module-level import so test monkeypatching of the symbol intercepts the call (the previous local import made patches a no-op).	2026-05-04 18:11:38 +02:00
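The skip handling in c7c42de0f0 can be pictured with a small sketch. Everything here is a stand-in: `run_pass`, the exception class, and the `not_due` reason string are hypothetical, and only the `{table, reason}` shape and the `in_flight` reason come from the commit message.

```python
class MaterializeInFlightError(Exception):
    """Stand-in for the real exception raised when a table is already running."""


def run_pass(rows, is_due, materialize):
    """Run one materialized pass and report skips as {table, reason} dicts."""
    summary = {"materialized": [], "skipped": [], "errors": []}
    for row in rows:
        table = row["id"]
        if not is_due(row):
            # Due-check skip: not an error, nothing is recorded as failed.
            summary["skipped"].append({"table": table, "reason": "not_due"})
            continue
        try:
            materialize(row)
        except MaterializeInFlightError:
            # In-flight skip: another run holds the table, also not an error.
            summary["skipped"].append({"table": table, "reason": "in_flight"})
            continue
        except Exception as exc:
            summary["errors"].append({"table": table, "error": str(exc)})
            continue
        summary["materialized"].append(table)
    return summary
```

Passing `is_due` in as an argument sidesteps the patching problem the commit fixes by hoisting the import to module level.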
ZdenekSrotyr	1563b05f2e	refactor(cli): hard-cutover env vars + config dir to AGNES_* Task 0.5 of clean-analyst-bootstrap. Greenfield rewrite — no fallback, no aliases. Existing dev environments lose their cached PAT and must re-authenticate. Env var renames (hard cutover): - DA_CONFIG_DIR -> AGNES_CONFIG_DIR - DA_SERVER -> AGNES_SERVER - DA_SERVER_URL -> AGNES_SERVER_URL (test-only stale ref, not in spec) - DA_NO_UPDATE_CHECK -> AGNES_NO_UPDATE_CHECK - DA_LOCAL_DIR -> AGNES_LOCAL_DIR - DA_TOKEN -> AGNES_TOKEN - DA_STREAM_RETRIES -> AGNES_STREAM_RETRIES Config dir rename: ~/.config/da/ -> ~/.config/agnes/ (across code, comments, docstrings, error messages, install templates, dev scripts). Stale `da X` references in CLI source (and adjacent app/, tests/): swept docstrings, comments, help text, and error messages where the verb survives the rewrite (init, pull, push, catalog, status, diagnose, auth, admin, skills, query, schema, describe, explore, disk-info, snapshot, login, logout, whoami, server, setup) and replaced `da X` with `agnes X`. Intentionally kept `da sync`, `da fetch`, `da analyst`, `da metrics` — those verbs are removed in later tasks; the legacy strings will be detected by `_LEGACY_STRINGS` (added in Task 2). Test fixes: - TestCLIVersion now asserts output starts with `agnes ` (was `da `). Test results: 2675 passed, 25 skipped (full pytest run, excluding 9 pre-existing test_db.py / test_user_management.py / test_e2e_extract.py / test_cli_binary_rename.py failures unrelated to this rename).	2026-05-04 16:35:44 +02:00
ZdenekSrotyr	f0979f997a	fix(admin-api): reject backtick BQ-native source_query at register; surface materialize errors per-row E2E testing showed admin POSTs of materialized BQ rows whose source_query uses BigQuery-native backtick identifiers (`prj.ds.t`) silently no-op'd at the next sync tick — the materialize path runs the SQL through the DuckDB BQ extension's COPY which uses DuckDB's parser; backticks aren't recognized and the query either parse-errors or matches zero rows. No parquet lands at the canonical path and no error reaches an operator-visible surface. Two-part fix: 1. RegisterTableRequest's _check_mode_query_coherence model_validator now rejects any source_query containing a backtick with a 422 + actionable message pointing at the DuckDB equivalent (bq."dataset"."table"). Same check is applied in update_table on the merged record so PATCHes that flip a stored source_query to backtick form are also caught. Covers BQ AND Keboola materialized rows since both connectors funnel source_query through DuckDB's COPY. 2. _run_materialized_pass now persists per-row failures via the new SyncStateRepository.set_error / clear_error methods (existing sync_state.error / status columns — no schema migration). GET /api/admin/registry enriches each row with `last_sync_error` from a single batched SELECT against sync_state, so the admin UI / da admin status can show "this table failed last sync because: X" instead of operators having to trawl scheduler logs. Recovered rows have the error cleared automatically — update_sync's success path resets status='ok' / error=NULL on the upsert. The materialized-path test fixture's _materialized_payload helper is updated to use DuckDB-flavor SQL (the prior backtick example pre-dated the fix). 6 new tests cover register/update rejection on BQ + Keboola, the sync_state error persistence, and the registry response surface.	2026-05-01 22:51:02 +02:00
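The register-time backtick check from f0979f997a reduces to a simple rule, sketched below as a plain function. The name `check_source_query` and the error wording are hypothetical; the real check lives in a model validator on `RegisterTableRequest` and surfaces as a 422.

```python
from typing import Optional


def check_source_query(source_query: Optional[str]) -> Optional[str]:
    """Reject BigQuery-native backtick identifiers; pass everything else through."""
    if source_query is not None and "`" in source_query:
        # The query is parsed by DuckDB, which does not treat backticks
        # as identifier quotes.
        raise ValueError(
            "source_query must not contain backticks; "
            'use the DuckDB form, e.g. bq."dataset"."table"'
        )
    return source_query
```

Failing at register time replaces the silent no-op at the next sync tick that the commit describes.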
ZdenekSrotyr	16938ae7cb	fix(materialized): address 4 Devin Review findings on PR #152 Devin Review on commit `7052a235` flagged 4 real bugs in the Keboola materialized path. All four are fixed; 3 new regression tests pin the behavior so future refactors can't quietly regress. BUG_pr-review-job-3fbd31c9_0001 — _run_materialized_pass gated behind 'if bq_project:' app/api/sync.py:444-466 wrapped the entire materialized pass (which dispatches BOTH BigQuery AND Keboola rows by source_type) in a check for data_source.bigquery.project being non-empty. On Keboola-only instances this short-circuited and Keboola materialized rows sat in table_registry forever without their SQL being evaluated — the feature CHANGELOG advertised was dead code on the most common deployment shape. Fix: always run the materialized pass; the BQ branch's per-row try/except catches the typed BqAccessError(not_configured) the sentinel raises when no BQ project is set, so non-BQ instances incur a per-row error for any (hypothetical) BQ-tagged row but the Keboola path runs cleanly. Log line renamed 'Materialized BQ' → 'Materialized SQL' to match. BUG_pr-review-job-3fbd31c9_0004 — wrong config key 'url' instead of 'stack_url' app/api/sync.py:149 read get_value('data_source', 'keboola', 'url'), but the canonical config key documented in instance.yaml.example:111 and used by app/api/admin.py:1503 + 2359 is 'stack_url'. Production Keboola instances would always see an empty URL and fail with the 'not configured' error. The pre-existing test patched the wrong key too, so it passed without catching the mismatch. Fix: use stack_url in both sync.py and the test fixture. BUG_pr-review-job-3fbd31c9_0003 — no atomic write in Keboola materialize_query connectors/keboola/extractor.py wrote COPY directly to the final '<id>.parquet' path. A mid-COPY failure (network, disk full, extension crash) left a partial parquet that the orchestrator rebuild would later pick up and serve to analysts. 
BQ's materialize_query already uses a '<id>.parquet.tmp' staging path + os.replace() atomic swap (connectors/bigquery/extractor.py:370-445); Keboola now mirrors that pattern with the same try/except cleanup on COPY failure. BUG_pr-review-job-3fbd31c9_0002 — full file read into memory for MD5 Same file:60-62 used parquet_path.read_bytes() for the MD5 hash. Multi-GB Keboola materialized results would OOM on memory-constrained containers. BQ's version uses streaming 8 KiB-chunk hashing (connectors/bigquery/extractor.py:438-442); Keboola now mirrors it. Tests: - test_run_sync_runs_materialized_pass_on_keboola_only_instance — pins BUG_0001's fix; setting bigquery.project='' must NOT skip Keboola materialized dispatch - test_keboola_materialize_atomic_write_on_failure — pins BUG_0003; a mid-COPY RuntimeError leaves no .parquet AND no .parquet.tmp at the canonical path - test_keboola_materialize_uses_tmp_path_during_copy — documents the atomic-write contract: COPY targets .parquet.tmp, final swap to .parquet (no .tmp suffix on the result['path']) - existing test_run_materialized_pass_dispatches_keboola_to_keboola_extractor fixture updated: stack_url instead of url Full sweep: 2505 passed, 25 skipped, 0 failed (modulo 8 pre-existing internal_roles schema-migration failures called out in the task brief).	2026-05-01 20:58:17 +02:00
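The two patterns that 16938ae7cb mirrors from the BigQuery extractor, a `.tmp` staging path swapped in with `os.replace()` and chunked MD5 hashing, can be sketched as follows. The function names and the `produce` callback are hypothetical; the actual extractor code is not reproduced here.

```python
import hashlib
import os
from pathlib import Path
from typing import Callable


def write_atomically(final_path: Path, produce: Callable[[Path], None]) -> Path:
    """Write via '<name>.tmp', then swap into place; clean up on failure."""
    tmp_path = final_path.with_name(final_path.name + ".tmp")
    try:
        produce(tmp_path)                 # e.g. a COPY targeting the .tmp path
        os.replace(tmp_path, final_path)  # atomic rename on the same filesystem
    except BaseException:
        # Leave neither a partial final file nor a stale .tmp behind.
        tmp_path.unlink(missing_ok=True)
        raise
    return final_path


def md5_streaming(path: Path, chunk_size: int = 8 * 1024) -> str:
    """Hash a file in fixed-size chunks so memory use stays constant."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Readers of the final path see either the previous complete file or the new complete file, never a partial write, which is the property the BUG_0003 regression tests pin.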
ZdenekSrotyr	85d3810535	feat(materialized): query_mode='materialized' for BigQuery + Keboola — admin SELECT → parquet → analyst Closes the 'admin pre-stages a curated table/view for analysts' use case end-to-end across both supported source connectors. Backend (BigQuery + Keboola, schema v20): - schema v20 adds source_query TEXT to table_registry (renumbered from v19 after main's #150 RBAC migration also bumped to v19) - connectors/bigquery/extractor.py adds materialize_query(table_id, sql, , bq, output_dir, max_bytes=...) — BqAccess session, dry-run cost guardrail (default 10 GiB, configurable via data_source.bigquery.max_bytes_per_materialize), idempotent ATTACH, rows/bytes/md5 metadata for sync_state - connectors/keboola/access.py — new KeboolaAccess facade (parallel of BqAccess) wrapping ATTACH 'keboola://...' AS kbc - connectors/keboola/extractor.py adds materialize_query — same shape, no dry-run analog (Keboola Storage API has different cost model); legacy bucket-download path skips query_mode='materialized' rows - app/api/sync.py:_run_materialized_pass dispatches by source_type to the right materialize_query - app/api/admin.py: RegisterTableRequest accepts source_query; model_validator coheres mode↔source_query↔bucket; PUT preserves omitted fields; deprecation marks (Field(deprecated=True)) on sync_strategy + profile_after_sync (no extractor reads them; profile_after_sync becomes inert — bug from earlier work where /api/sync/trigger never honored the flag); _BQ_OPTIONAL_FIELD_DEFAULTS injects defaults into GET /server-config payload Operator + CLI surface: - da admin register-table --query / --query-mode materialized - scripts/smoke-test-materialized-bq.sh — end-to-end smoke for operators Tests (incl. 
spike + integration + regression): - test_db_migration_v20, test_table_registry_source_query - test_bq_materialize, test_bq_cost_guardrail, test_bq_init_extract_skips - test_keboola_access, test_keboola_extension_query_passthrough (lock-in for the DuckDB extension capability), test_keboola_materialize, test_keboola_init_extract_skips, test_keboola_materialized_e2e (skipped without KBC_TEST_ creds) - test_sync_trigger_materialized, test_sync_trigger_keboola_materialized - test_api_admin_materialized, test_cli_admin_materialized - test_admin_bq_register, test_admin_discover_bigquery, test_admin_keboola_materialized, test_admin_phase_c_deprecation, test_admin_put_preservation, test_materialized_e2e Cost: BQ uses bigquery_query() (jobs API, view-aware) — works on tables, views, materialized views uniformly. Keboola uses ATTACH+COPY parquet through the DuckDB extension.	2026-05-01 20:25:56 +02:00
minasarustamyan	d4ac84dd46	feat(rbac): drop dataset_permissions + users.role + is_public; v19 migration (#150) * feat(rbac): drop dataset_permissions + access_requests + users.role + is_public; v19 migration BREAKING. Unifies the data RBAC layer into the per-group resource_grants model. Before this PR the legacy data RBAC layer (dataset_permissions + is_public bypass) was de facto inactive: is_public had no API/UI/CLI surface, and its default of true meant that can_access_table always bypassed the check. Now every non-admin access requires an explicit resource_grants(group, "table", id) row. Schema v18 → v19 (src/db.py:_v18_to_v19_finalize): - DROP TABLE dataset_permissions, access_requests - DROP COLUMN users.role (NULL artifact since v13) - DROP COLUMN table_registry.is_public - Drops go through the table-rebuild idiom (rename → create new → INSERT … SELECT → drop old) because of DuckDB ALTER DROP COLUMN limitations on tables with historic FK constraints. The INSERT picks the intersection of columns, so test fixtures with a minimal pre-v19 schema migrate cleanly.
Runtime: - src/rbac.py:can_access_table → delegates to app.auth.access.can_access - DatasetPermissionRepository, AccessRequestRepository deleted - AGNES_ENABLE_TABLE_GRANTS env-gate in app/resource_types.py removed (TABLE is unconditionally enabled) API drop: - app/api/permissions.py, app/api/access_requests.py (entire files) - /admin/permissions web route + admin_permissions.html - "Request Access" modal in catalog.html + locked-row UI - ~10 if user.get("role") != "admin" checks replaced (the admin shortcut lives inside can_access_table) - /api/settings: drop permissions field from GET; PUT /api/settings/dataset gate switched to can_access(user_id, "table", dataset, conn) Auth: - app/auth/jwt.py:create_access_token: drop role parameter (the claim disappears from newly issued JWTs; old tokens remain valid, claim ignored) - app/api/users.py: drop role from CreateUserRequest / UpdateUserRequest (admin promotion = explicit add to Admin group via memberships API) - src/repositories/users.py: drop role from create() / update() CLI: - da admin set-role removed → hard-fail with replacement command - da admin add-user --role flag removed - da auth import-token --role flag removed - da auth whoami: drop "Role:" output - cli/config.py:save_token: role parameter now optional, no longer written (back-compat with old token.json files preserved; the field is ignored) Tests: - DELETE: test_permissions.py, test_permissions_api.py, test_access_requests_api.py - REWRITE: test_access_control.py (resource_grants flow), test_rbac.py (can_access_table over resource_grants), test_journey_rbac.py (drop access-request flow), test_resource_types.py (drop env-gate tests, drop is_public from helpers), test_v2_.py (drop role-based user dicts in favor of id-based + Admin group membership), test_settings_api.py (no permissions field, can_access gate) - TRIVIAL: ~30 files: drop role="admin" arg from UserRepository.create and 3rd positional role from create_access_token - NEW: test_v18_to_v19 migration test (test_db.py),
test_can_access_table_no_implicit_public (test_rbac.py), test_admin_set_role_returns_hardfail (test_cli_admin.py) - OpenAPI snapshot regenerated Docs: - CHANGELOG: BREAKING entry under [Unreleased] - CLAUDE.md: schema v18 → v19 - docs/architecture.md: schema table + RBAC section rewritten - docs/auth-google-oauth.md: admin promotion via da admin break-glass - cli/skills/security.md: completely rewritten for the group-based model - docs/TODO-rbac-data-enforcement.md: deleted (TODO completed) Test results: 2363 passed, 19 failed. The remaining failures are pre-existing Windows-specific issues (fcntl, charset) unrelated to this PR, verified with git stash pop. Plan: ~/.claude/plans/floofy-coalescing-parnas.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> chore(release): cut 0.27.0 --------- Co-authored-by: Minas Arustamyan <arustamyan.minas@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>	2026-04-30 22:02:16 +02:00
Vojtech	38f6b639d2	feat(observability): request_id end-to-end + dev debug toolbar + centralized logging (#136 ) Cuts release 0.20.0. ## Highlights - X-Request-ID header on every response + sanitized to [A-Za-z0-9_-] (CRLF log-forging mitigation) - Error pages (HTML + JSON 500) surface request_id for support tickets - Dev debug toolbar gated by DEBUG=1 — fastapi-debug-toolbar with custom DuckDBPanel - Centralized app.logging_config.setup_logging() replaces 23 scattered basicConfig calls - Telegram bot drops bot.log file — stdout only (BREAKING) ## Devin findings addressed - BUG_0001: .env.template no longer claims FastAPI debug=True - BUG_0002: subprocess extractor logs INFO to stderr again - ANALYSIS_0003: _wants_html no longer matches Accept: / (curl gets JSON as before) - BUG on b1c6ee9: HTML 500 page no longer leaks str(exc) in production - BUG on b13d2fe: 2 CLAUDE.md compliance flags (transform.py + ws_gateway) accepted as scope-limited logging refactor — follow-up to update CLAUDE.md if needed See CHANGELOG [0.20.0] for full notes.	2026-04-29 22:54:21 +02:00
ZdenekSrotyr	b7a1795834	feat(scheduler): re-wire sync_schedule + script.schedule; tune via env; OpenMetadata TLS (#135 ) Bundles 4 issues: - #79 — table_registry.sync_schedule honored at runtime (API-side filter + Pydantic validators) - #78 — script_registry.schedule honored via new POST /api/scripts/run-due (atomic claim, BackgroundTask exec, deploy-time safety validation) - #77 — sidecar JOBS env-driven (SCHEDULER_DATA_REFRESH_INTERVAL/HEALTH_CHECK_INTERVAL/SCRIPT_RUN_INTERVAL/TICK_SECONDS) - #89 — OpenMetadataClient verify=True default (BREAKING for self-signed) Cuts release 0.19.0. See CHANGELOG for full notes incl. Known Limitations.	2026-04-29 22:06:30 +02:00
ZdenekSrotyr	2e1dfb7553	feat(v2): claude-driven fetch primitives + 0.14.0 (#102 ) Replaces the BigQuery wrap-view pattern with a discovery + scoped-fetch toolkit driven by the analyst's Claude session. Adds /api/v2/{catalog,schema,sample,scan,scan/estimate}, da catalog/schema/describe/fetch/snapshot/disk-info CLI commands, sqlglot-backed WHERE validator, process-local quota tracker, agent rails skill (cli/skills/agnes-data-querying.md). BREAKING: BQ wrap views off by default — set data_source.bigquery.legacy_wrap_views=true for one cycle. Backward-compat field_validator on primary_key. Catalog cache now matches documented 300s TTL with RBAC fresh per request. Cuts release v0.14.0.	2026-04-29 01:07:19 +02:00
ZdenekSrotyr	e9d7af3cce	feat(rbac+marketplace): RBAC v13 + Claude Code marketplace + #81/#83/#44 hardening This squashes 13 commits from ma/staging plus a small docstring translation into a single coherent unit. Three workstreams. == RBAC v13 redesign == - Drops core.viewer/analyst/km_admin/admin hierarchy and the internal_roles / group_mappings / user_role_grants / plugin_access tables. - Replaced by user_group_members + resource_grants. Atomic v12→v13 backfill wrapped in BEGIN/COMMIT; ROLLBACK leaves schema_version at 12 for retry. - Two authorization primitives in app.auth.access: require_admin — Admin-group god-mode require_resource_access(rt, "{path}") — entity-scoped grants Single DB lookup per request; no session cache; no implies BFS. - /admin/access UI (single page) replaces /admin/role-mapping + /admin/plugin-access. CLI `da admin group/grant ` replaces `da admin role/mapping/grant-role/revoke-role/effective-roles`. - ResourceType.TABLE listing-only — admins can record table grants, runtime enforcement still flows through legacy dataset_permissions (migration plan in docs/TODO-rbac-data-enforcement.md). == Claude Code marketplace == - Aggregated /marketplace.zip + /marketplace.git/ (PAT-gated, RBAC-filtered, content-addressed cache via dulwich). - Admin god-mode dropped on the marketplace surface — admins curate their own view via grants like everyone else. - Bare-repo cache materializes per RBAC-filtered ETag; stale entries not pruned in this iteration (disclaimed in git_backend.py docstring). == #81 #83 #44 security/ops hardening == - #81 Group A — orchestrator ATTACH allow-listing (extension/url/alias). - #81 Group B — Keboola extractor 3-state exit codes: 0 success / 1 total fail / 2 PARTIAL fail Sync API logs PARTIAL FAILURE alert on exit 2. Operators with binary alerting must teach it the new partial signal. - #81 Group C — schema v10 view_ownership; rejects silent overwrite of a prior connector's view name on collision. 
- #81 Group D — extractor-side identifier validation. - #83 — Jira webhook fail-closed when JIRA_WEBHOOK_SECRET unset + path-traversal fix. - #44 — entire /api/scripts/* surface is admin-only (planted-script + sandbox-bypass risk closed). == Web UI polish + deploy fix == - /admin/access: live grant-count badges (no stale snapshot revert), shared-header CSS link added to /catalog and /admin/{tables,permissions}, per-resource-type colored stripes. - docker-compose.host-mount.yml: bind,rbind so dual-disk hosts don't silently shadow sub-mounts and write state to the wrong disk. == OSS vendor-neutralization (waves 1+2) == - scripts/grpn/ → scripts/ops/. Customer-specific identifiers (project IDs, internal hostnames, dev/prod VM IPs, brand names) replaced with placeholders across code, docs, Terraform, Caddyfile, OAuth probe, and planning docs. Downstream infra repos that copied scripts/grpn/agnes-tls-rotate.sh or agnes-auto-upgrade.sh must update the path. == Translation == - src/repositories/user_groups.py::ensure_system docstring translated from Czech to English for codebase consistency. Co-authored-by: Mina Rustamyan <mina@keboola.com>	2026-04-28 14:25:04 +02:00
ZdenekSrotyr	ef74ec010c	fix(ops): #81 Group B — Keboola partial-failure exit code 2 (squashed) (#99 ) Closes M14 from issue #81. Keboola extractor exits 0/1/2 (success/full-fail/partial). sync.py interprets exit 2 as PARTIAL FAILURE (data-quality alert, distinct from exit 1). Tests: tests/test_keboola_extractor_exit_codes.py — 14 cases including runtime mock subprocess (rc=0/1/2/124). Refs #81 Group B.	2026-04-27 21:52:46 +02:00
ZdenekSrotyr	6c53082295	feat: multi-instance deployment — all 14 must-have items from spec CalVer CI (release.yml) with stable/dev channels, health endpoint with version/channel/schema_version, JWT secret auto-generation with file persistence, smoke test script + Docker-in-CI, pre-migration snapshot, /api/admin/configure for headless setup, /api/admin/ discover-and-register, /setup wizard, OpenAPI snapshot test, custom connector mount support, CHANGELOG, migration safety tests, startup banner. 663 tests pass (6 new migration safety + 3 OpenAPI snapshot + 1 updated JWT test).	2026-04-10 11:57:42 +02:00
ZdenekSrotyr	53a9e838f9	feat: add graceful shutdown handler - Add close_system_db() function in src/db.py to cleanly close shared DB connection - Add lifespan context manager in app/main.py to trigger shutdown on app exit - Integrate lifespan into FastAPI app initialization - All API tests pass (77/77)	2026-04-09 07:03:45 +02:00
ZdenekSrotyr	3e3f84a00e	feat: dynamic login providers + profiler auto-trigger + refresh endpoint	2026-04-08 07:04:40 +02:00
ZdenekSrotyr	2b7348a773	fix: sync only extracts local tables, skips remote Was using list_by_source() which returns all tables including remote. Now uses list_local() to skip query_mode='remote' tables.	2026-03-31 15:35:49 +02:00
ZdenekSrotyr	8f3a342108	fix: sync logs via stderr for docker compose visibility	2026-03-31 14:05:01 +02:00
ZdenekSrotyr	7612385ed6	fix: extractor subprocess reads table configs via stdin, not DuckDB Subprocess cannot open system.duckdb (main process holds lock). Now main process reads table_registry and passes configs as JSON via stdin to subprocess. Subprocess never touches system.duckdb.	2026-03-31 13:57:02 +02:00
ZdenekSrotyr	4d1acd014a	refactor: remove legacy webapp + add missing tests + housekeeping Phase A: Close fixed issues (#7, #8, #9), add server/ user/ to .gitignore, increase extractor timeout to 30 min. Phase B: Add 10 new tests — access request lifecycle (4), CLI admin commands (5), sync subprocess trigger (1). 578 tests passing. Phase C: Delete entire webapp/ directory (24,800 lines) — legacy Flask app fully replaced by FastAPI app/. Fix auth providers to use app.instance_config instead of webapp.config. Update CLAUDE.md. Delete 6 webapp-only test files. Fix Jira service config imports.	2026-03-31 13:44:06 +02:00
ZdenekSrotyr	2d6a94fb6f	fix: DuckDB concurrency — WAL mode, subprocess sync, temp+rename Three-pronged fix for DuckDB lock conflicts: 1. WAL mode on system.duckdb — enables concurrent readers + writer 2. Sync trigger runs extractor as subprocess (not background task) — separate process = separate DuckDB connections, no lock conflict 3. Both extractor and orchestrator write to .tmp then atomic rename — avoids lock conflict with API reads on extract.duckdb/analytics.duckdb Fixes #9 permanently.	2026-03-31 13:19:57 +02:00
ZdenekSrotyr	1074d5ec49	feat: implement data access control — table-level permissions Schema v3: add is_public column to table_registry (default true). src/rbac.py: can_access_table() checks admin bypass, public flag, explicit permissions, wildcard bucket permissions. API enforcement: - manifest: filters tables by user access - download: 403 if no access - catalog: filters table list - query: validates referenced tables against allowed list New admin permissions API (/api/admin/permissions) for grant/revoke. 28 access control tests + 733 total tests passing.	2026-03-31 12:33:31 +02:00
ZdenekSrotyr	1bf97c725c	feat: wire orchestrator into API — replace DataSyncManager sync.py: _run_sync() now calls extractor + SyncOrchestrator.rebuild() data.py: parquet lookup searches /data/extracts/ first, legacy fallback catalog.py: list tables from DuckDB table_registry instead of src.config admin.py: discover-tables uses KeboolaClient directly, remove old TableRegistry dep	2026-03-30 20:16:33 +02:00
ZdenekSrotyr	1287e63ed9	feat: complete system — web UI, all API endpoints, governance, admin, CLI commands Major additions: - Web UI: Jinja2 templates in FastAPI (login, dashboard, catalog, corporate memory, admin) - API: catalog profiles/metrics, telegram verify/unlink/status, admin table registry CRUD - Corporate memory governance: approve/reject/mandate/revoke/edit/batch + audit log - Sync: real DataSyncManager trigger, sync-settings, table-subscriptions - CLI: setup (init/test/deploy/verify), server (logs/restart/deploy/backup), explore - Instance config integration (instance.yaml loaded at startup) - 140 tests passing (25 new)	2026-03-27 16:52:22 +01:00
ZdenekSrotyr	a3918d3833	feat: add FastAPI server with auth, RBAC, and all API endpoints - JWT auth with role-based access control (viewer/analyst/admin/km_admin) - Endpoints: health, sync manifest, data download, query, users CRUD, corporate memory, session/artifact upload - 18 API tests covering auth, RBAC, all endpoints	2026-03-27 15:19:18 +01:00
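
The 2026-05-01 20:58 entry above describes two fixes to the Keboola materialize path: writing to a '<id>.parquet.tmp' staging file followed by an os.replace() swap, and hashing the result in 8 KiB chunks instead of reading the whole file into memory. A minimal sketch of that pattern follows. It is not the repository's code: the names atomic_materialize, streaming_md5, and write_fn are hypothetical, and write_fn stands in for whatever performs the COPY.

```python
import hashlib
import os
from pathlib import Path
from typing import Callable

CHUNK_SIZE = 8 * 1024  # 8 KiB, the chunk size named in the commit message


def streaming_md5(path: Path, chunk_size: int = CHUNK_SIZE) -> str:
    """Hash a file chunk by chunk so memory use stays flat for multi-GB files."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        # iter(callable, sentinel) keeps reading until read() returns b"".
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def atomic_materialize(final_path: Path, write_fn: Callable[[Path], None]) -> dict:
    """Write via a .tmp staging file, then swap it into place atomically.

    write_fn is a hypothetical callback that produces the file at the
    path it is given (for example, by running a COPY ... TO statement).
    """
    tmp_path = final_path.with_name(final_path.name + ".tmp")
    try:
        write_fn(tmp_path)
    except Exception:
        # A mid-write failure must leave neither the staging file nor a
        # half-written final file behind.
        tmp_path.unlink(missing_ok=True)
        raise
    # os.replace is atomic when source and target are on the same filesystem.
    os.replace(tmp_path, final_path)
    return {
        "path": str(final_path),
        "bytes": final_path.stat().st_size,
        "md5": streaming_md5(final_path),
    }
```

Readers of the final path see either the previous file or the complete new one, never a partial write, which is the contract the test names in that entry describe.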

27 commits
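
The ef74ec010c entry describes a three-state exit-code contract for the Keboola extractor (0 success, 1 total failure, 2 partial failure), with the sync side logging a distinct PARTIAL FAILURE alert on exit 2. The sketch below shows one way a caller might map those codes. SyncOutcome and interpret_exit_code are hypothetical names, and treating every other code (such as a timeout's 124) as a full failure is an assumption, not something the log confirms.

```python
import logging
from enum import Enum

logger = logging.getLogger("sync")


class SyncOutcome(Enum):
    SUCCESS = "success"
    PARTIAL = "partial"
    FAILED = "failed"


def interpret_exit_code(returncode: int) -> SyncOutcome:
    """Map an extractor subprocess exit code to a sync outcome."""
    if returncode == 0:
        return SyncOutcome.SUCCESS
    if returncode == 2:
        # Some tables synced and some did not: a data-quality alert,
        # kept distinct from a total failure.
        logger.error("PARTIAL FAILURE: extractor exited with code 2")
        return SyncOutcome.PARTIAL
    # Assumption: exit 1 and any unrecognized code count as a full failure.
    logger.error("Extractor failed with exit code %d", returncode)
    return SyncOutcome.FAILED
```

A caller would pass subprocess.run(...).returncode to interpret_exit_code and branch on the result, so alerting that only distinguishes zero from non-zero can be taught the partial state explicitly.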