agnes-the-ai-analyst

Author	SHA1	Message	Date
ZdenekSrotyr	16938ae7cb	fix(materialized): address 4 Devin Review findings on PR #152 Devin Review on commit `7052a235` flagged 4 real bugs in the Keboola materialized path. All four are fixed; 3 new regression tests pin the behavior so future refactors can't quietly regress. BUG_pr-review-job-3fbd31c9_0001 — _run_materialized_pass gated behind 'if bq_project:' app/api/sync.py:444-466 wrapped the entire materialized pass (which dispatches BOTH BigQuery AND Keboola rows by source_type) in a check for data_source.bigquery.project being non-empty. On Keboola-only instances this short-circuited and Keboola materialized rows sat in table_registry forever without their SQL being evaluated — the feature CHANGELOG advertised was dead code on the most common deployment shape. Fix: always run the materialized pass; the BQ branch's per-row try/except catches the typed BqAccessError(not_configured) the sentinel raises when no BQ project is set, so non-BQ instances incur a per-row error for any (hypothetical) BQ-tagged row but the Keboola path runs cleanly. Log line renamed 'Materialized BQ' → 'Materialized SQL' to match. BUG_pr-review-job-3fbd31c9_0004 — wrong config key 'url' instead of 'stack_url' app/api/sync.py:149 read get_value('data_source', 'keboola', 'url'), but the canonical config key documented in instance.yaml.example:111 and used by app/api/admin.py:1503 + 2359 is 'stack_url'. Production Keboola instances would always see an empty URL and fail with the 'not configured' error. The pre-existing test patched the wrong key too, so it passed without catching the mismatch. Fix: use stack_url in both sync.py and the test fixture. BUG_pr-review-job-3fbd31c9_0003 — no atomic write in Keboola materialize_query connectors/keboola/extractor.py wrote COPY directly to the final '<id>.parquet' path. A mid-COPY failure (network, disk full, extension crash) left a partial parquet that the orchestrator rebuild would later pick up and serve to analysts. BQ's materialize_query already uses a '<id>.parquet.tmp' staging path + os.replace() atomic swap (connectors/bigquery/extractor.py:370-445); Keboola now mirrors that pattern with the same try/except cleanup on COPY failure. BUG_pr-review-job-3fbd31c9_0002 — full file read into memory for MD5 Same file:60-62 used parquet_path.read_bytes() for the MD5 hash. Multi-GB Keboola materialized results would OOM on memory-constrained containers. BQ's version uses streaming 8 KiB-chunk hashing (connectors/bigquery/extractor.py:438-442); Keboola now mirrors it. Tests: - test_run_sync_runs_materialized_pass_on_keboola_only_instance — pins BUG_0001's fix; setting bigquery.project='' must NOT skip Keboola materialized dispatch - test_keboola_materialize_atomic_write_on_failure — pins BUG_0003; a mid-COPY RuntimeError leaves no .parquet AND no .parquet.tmp at the canonical path - test_keboola_materialize_uses_tmp_path_during_copy — documents the atomic-write contract: COPY targets .parquet.tmp, final swap to .parquet (no .tmp suffix on the result['path']) - existing test_run_materialized_pass_dispatches_keboola_to_keboola_extractor fixture updated: stack_url instead of url Full sweep: 2505 passed, 25 skipped, 0 failed (modulo 8 pre-existing internal_roles schema-migration failures called out in the task brief).	2026-05-01 20:58:17 +02:00
ZdenekSrotyr	85d3810535	feat(materialized): query_mode='materialized' for BigQuery + Keboola — admin SELECT → parquet → analyst Closes the 'admin pre-stages a curated table/view for analysts' use case end-to-end across both supported source connectors. Backend (BigQuery + Keboola, schema v20): - schema v20 adds source_query TEXT to table_registry (renumbered from v19 after main's #150 RBAC migration also bumped to v19) - connectors/bigquery/extractor.py adds materialize_query(table_id, sql, *, bq, output_dir, max_bytes=...) — BqAccess session, dry-run cost guardrail (default 10 GiB, configurable via data_source.bigquery.max_bytes_per_materialize), idempotent ATTACH, rows/bytes/md5 metadata for sync_state - connectors/keboola/access.py — new KeboolaAccess facade (parallel of BqAccess) wrapping ATTACH 'keboola://...' AS kbc - connectors/keboola/extractor.py adds materialize_query — same shape, no dry-run analog (Keboola Storage API has different cost model); legacy bucket-download path skips query_mode='materialized' rows - app/api/sync.py:_run_materialized_pass dispatches by source_type to the right materialize_query - app/api/admin.py: RegisterTableRequest accepts source_query; model_validator coheres mode↔source_query↔bucket; PUT preserves omitted fields; deprecation marks (Field(deprecated=True)) on sync_strategy + profile_after_sync (no extractor reads them; profile_after_sync becomes inert — bug from earlier work where /api/sync/trigger never honored the flag); _BQ_OPTIONAL_FIELD_DEFAULTS injects defaults into GET /server-config payload Operator + CLI surface: - da admin register-table --query / --query-mode materialized - scripts/smoke-test-materialized-bq.sh — end-to-end smoke for operators Tests (incl. spike + integration + regression): - test_db_migration_v20, test_table_registry_source_query - test_bq_materialize, test_bq_cost_guardrail, test_bq_init_extract_skips - test_keboola_access, test_keboola_extension_query_passthrough (lock-in for the DuckDB extension capability), test_keboola_materialize, test_keboola_init_extract_skips, test_keboola_materialized_e2e (skipped without KBC_TEST_* creds) - test_sync_trigger_materialized, test_sync_trigger_keboola_materialized - test_api_admin_materialized, test_cli_admin_materialized - test_admin_bq_register, test_admin_discover_bigquery, test_admin_keboola_materialized, test_admin_phase_c_deprecation, test_admin_put_preservation, test_materialized_e2e Cost: BQ uses bigquery_query() (jobs API, view-aware) — works on tables, views, materialized views uniformly. Keboola uses ATTACH+COPY parquet through the DuckDB extension.	2026-05-01 20:25:56 +02:00
Vojtech	38f6b639d2	feat(observability): request_id end-to-end + dev debug toolbar + centralized logging (#136) Cuts release 0.20.0. ## Highlights - X-Request-ID header on every response + sanitized to [A-Za-z0-9_-] (CRLF log-forging mitigation) - Error pages (HTML + JSON 500) surface request_id for support tickets - Dev debug toolbar gated by DEBUG=1 — fastapi-debug-toolbar with custom DuckDBPanel - Centralized app.logging_config.setup_logging() replaces 23 scattered basicConfig calls - Telegram bot drops bot.log file — stdout only (BREAKING) ## Devin findings addressed - BUG_0001: .env.template no longer claims FastAPI debug=True - BUG_0002: subprocess extractor logs INFO to stderr again - ANALYSIS_0003: _wants_html no longer matches Accept: */* (curl gets JSON as before) - BUG on b1c6ee9: HTML 500 page no longer leaks str(exc) in production - BUG on b13d2fe: 2 CLAUDE.md compliance flags (transform.py + ws_gateway) accepted as scope-limited logging refactor — follow-up to update CLAUDE.md if needed See CHANGELOG [0.20.0] for full notes.	2026-04-29 22:54:21 +02:00
ZdenekSrotyr	ef74ec010c	fix(ops): #81 Group B — Keboola partial-failure exit code 2 (squashed) (#99) Closes M14 from issue #81. Keboola extractor exits 0/1/2 (success/full-fail/partial). sync.py interprets exit 2 as PARTIAL FAILURE (data-quality alert, distinct from exit 1). Tests: tests/test_keboola_extractor_exit_codes.py — 14 cases including runtime mock subprocess (rc=0/1/2/124). Refs #81 Group B.	2026-04-27 21:52:46 +02:00
ZdenekSrotyr	569cd90d75	fix(security): #81 Group D — extractor-side identifier validation (squashed) (#97) Closes M15 from issue #81 — SQL injection via attacker-controlled identifiers in connectors/keboola/extractor.py and connectors/bigquery/extractor.py. Lifted _validate_identifier from src/orchestrator.py into a new src/identifier_validation.py shared module (single source of truth for both layers). Two validator policies: - validate_identifier (strict, ^[a-zA-Z_][a-zA-Z0-9_]{0,63}$) for table_name — matches the orchestrator's rebuild-time check, so dashed names fail fast at extraction rather than being silently dropped. - validate_quoted_identifier (relaxed, accepts dashes/dots) for bucket/dataset/source_table — Keboola in.c-foo and BigQuery my-dataset are legitimate, just need to be safe inside `"..."`. Both extractors skip-and-continue on unsafe rows (logged + counted in failure stats); _extract_via_extension re-validates as defense-in-depth. 71/71 extractor + orchestrator tests pass. Refs #81 Group D.	2026-04-27 21:46:17 +02:00
ZdenekSrotyr	f25393871d	fix: escape single quotes in ATTACH TOKEN parameters - In src/orchestrator.py _attach_remote_extensions: escape token with '' before passing to ATTACH - In connectors/keboola/extractor.py _try_attach_extension: escape token with '' before passing to ATTACH Prevents SQL injection if token contains single quotes.	2026-04-09 07:00:13 +02:00
ZdenekSrotyr	e425d4baa5	fix: handle WAL files in atomic swap to prevent DB corruption Add _atomic_swap_db helper that removes stale WAL files before and after moving the temp DuckDB into place. Apply CHECKPOINT before close in both orchestrator and Keboola extractor so DuckDB flushes WAL before the swap.	2026-04-09 06:57:29 +02:00
ZdenekSrotyr	79443e0df4	fix: CSV all_varchar in legacy extractor, rewrite DEPLOYMENT.md from real deploy - Legacy extractor now uses read_csv(all_varchar=true) to avoid type inference errors (e.g. seniority column typed as DOUBLE with string values) - DEPLOYMENT.md rewritten based on actual dev VM deployment experience: deploy key setup, DuckDB write locking, env reload gotchas, bootstrap flow	2026-04-08 19:09:55 +02:00
ZdenekSrotyr	06e1cf0a8d	feat: generic _remote_attach contract for remote DuckDB extension views Extractors with remote tables now write a _remote_attach table into extract.duckdb so the orchestrator can re-ATTACH external extensions at query time. The mechanism is source-agnostic — any connector can use it. - Keboola extractor writes _remote_attach + creates views on kbc.* - Orchestrator reads _remote_attach, installs extension, reads token from env - Graceful degradation: missing token → warning, local tables still work	2026-04-08 18:10:12 +02:00
ZdenekSrotyr	2d6a94fb6f	fix: DuckDB concurrency — WAL mode, subprocess sync, temp+rename Three-pronged fix for DuckDB lock conflicts: 1. WAL mode on system.duckdb — enables concurrent readers + writer 2. Sync trigger runs extractor as subprocess (not background task) — separate process = separate DuckDB connections, no lock conflict 3. Both extractor and orchestrator write to .tmp then atomic rename — avoids lock conflict with API reads on extract.duckdb/analytics.duckdb Fixes #9 permanently.	2026-03-31 13:19:57 +02:00
ZdenekSrotyr	10d9280ab5	fix: extractor writes to temp file to avoid lock with orchestrator Writes extract.duckdb.tmp then renames atomically, avoiding DuckDB lock conflict when orchestrator holds a read connection on extract.duckdb.	2026-03-31 13:09:51 +02:00
ZdenekSrotyr	bd0b6d19c6	fix: legacy extractor constructs full Keboola table ID from bucket+source_table Was using tc['id'] which is the registry ID (e.g. 'circle'), not the full Keboola ID (e.g. 'in.c-finance.circle') needed by the API.	2026-03-31 12:06:38 +02:00
ZdenekSrotyr	0084f80ff6	fix: legacy extractor passes Path to export_table, not str Fixes 'str' object has no attribute 'parent' when Keboola DuckDB extension falls back to legacy client.	2026-03-31 12:03:16 +02:00
ZdenekSrotyr	18e5f0b6e8	feat: implement extract.duckdb contract — orchestrator + extractors Phase 0: extend table_registry schema (v1→v2 migration), add source_type/bucket/source_table/query_mode columns. Phase 1: SyncOrchestrator ATTACHes extract.duckdb files into master analytics.duckdb. Keboola extractor uses DuckDB extension with legacy client fallback. BigQuery extractor is remote-only via DuckDB BQ extension (no data download). 62 tests passing.	2026-03-30 20:12:56 +02:00
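
Several commits above (10d9280ab5, 2d6a94fb6f, 16938ae7cb) describe the same write pattern: stage output at a `.tmp` path, swap it into place with an atomic rename, and hash the result in fixed-size chunks rather than reading the whole file. A minimal Python sketch of that pattern follows. The names `write_atomically`, `md5_streaming`, and the `produce` callback are hypothetical and chosen for illustration; they are not the repository's functions, and the real extractors run a DuckDB COPY where this sketch calls `produce`.

```python
import hashlib
import os
from pathlib import Path
from typing import Callable


def md5_streaming(path: Path, chunk_size: int = 8192) -> str:
    """Hash a file in 8 KiB chunks so memory use stays constant."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_atomically(final_path: Path, produce: Callable[[Path], None]) -> str:
    """Write via '<name>.tmp', then swap into place; return the MD5 hex digest.

    `produce` is a hypothetical callback standing in for the COPY step.
    """
    tmp_path = final_path.with_name(final_path.name + ".tmp")
    try:
        produce(tmp_path)
    except Exception:
        # A failed write must leave neither a partial final file nor a .tmp.
        tmp_path.unlink(missing_ok=True)
        raise
    # os.replace is atomic when both paths are on the same filesystem.
    os.replace(tmp_path, final_path)
    return md5_streaming(final_path)
```

Because the staging file sits next to the final path, the rename stays on one filesystem, which is what makes the swap atomic; readers see either the old file or the complete new one.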

14 commits
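
Commits 569cd90d75 and f25393871d describe two injection defenses: a strict identifier pattern for table names, and doubling single quotes in ATTACH TOKEN values. The sketch below illustrates both under stated assumptions. The regex is copied from the 569cd90d75 message; the function names `is_strict_identifier` and `escape_sql_literal` are hypothetical, and the relaxed `validate_quoted_identifier` policy is omitted because the log does not give its pattern.

```python
import re

# Pattern quoted in the 569cd90d75 commit message (strict table_name policy).
_STRICT_IDENTIFIER = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]{0,63}$")


def is_strict_identifier(name: str) -> bool:
    """Return True if `name` is a letter or underscore plus up to 63 word chars.

    fullmatch is used so a trailing newline cannot slip past the `$` anchor.
    """
    return _STRICT_IDENTIFIER.fullmatch(name) is not None


def escape_sql_literal(value: str) -> str:
    """Double single quotes so the value is safe inside a '...' SQL literal."""
    return value.replace("'", "''")


if __name__ == "__main__":
    for candidate in ("orders_2024", "in.c-foo", "1abc"):
        print(candidate, is_strict_identifier(candidate))
    print(escape_sql_literal("tok'en"))
```

Dashed or dotted names such as `in.c-foo` fail the strict check by design; per the commit message, those go through the relaxed policy and are used only inside double-quoted identifiers.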