agnes-the-ai-analyst

Author	SHA1	Message	Date
ZdenekSrotyr	569cd90d75	fix(security): #81 Group D — extractor-side identifier validation (squashed) (#97) Closes M15 from issue #81 — SQL injection via attacker-controlled identifiers in connectors/keboola/extractor.py and connectors/bigquery/extractor.py. Lifted _validate_identifier from src/orchestrator.py into a new src/identifier_validation.py shared module (single source of truth for both layers). Two validator policies: - validate_identifier (strict, ^[a-zA-Z_][a-zA-Z0-9_]{0,63}$) for table_name — matches the orchestrator's rebuild-time check, so dashed names fail fast at extraction rather than being silently dropped. - validate_quoted_identifier (relaxed, accepts dashes/dots) for bucket/dataset/source_table — Keboola in.c-foo and BigQuery my-dataset are legitimate, just need to be safe inside `"..."`. Both extractors skip-and-continue on unsafe rows (logged + counted in failure stats); _extract_via_extension re-validates as defense-in-depth. 71/71 extractor + orchestrator tests pass. Refs #81 Group D.	2026-04-27 21:46:17 +02:00
ZdenekSrotyr	1488e01bf9	feat: add temp-file swap to BigQuery extractor Write to extract.duckdb.tmp, then atomically swap into place with WAL cleanup. Prevents lock conflicts with orchestrator holding read lock on existing database.	2026-04-09 07:00:19 +02:00
ZdenekSrotyr	1b219cabe9	fix: remove dead PRAGMA enable_wal code DuckDB has used WAL by default since v0.8, so this pragma is not valid DuckDB syntax. Removed obsolete try-except block that attempted to enable WAL on system database initialization.	2026-04-09 06:59:57 +02:00
ZdenekSrotyr	3ba207a7f8	feat: add _remote_attach to BigQuery extractor, support token-less ATTACH in orchestrator BigQuery extension handles auth via GOOGLE_APPLICATION_CREDENTIALS env var, so _remote_attach uses empty token_env. Orchestrator now supports both token-based (Keboola) and env-based (BigQuery) authentication modes.	2026-04-08 18:13:31 +02:00
ZdenekSrotyr	b502bd8bdd	refactor: delete old sync pipeline — 9,500 lines removed Phase 5 cleanup: remove all code replaced by extract.duckdb architecture. Deleted modules: - src/config.py (653) — replaced by DuckDB table_registry - src/parquet_manager.py (755) — replaced by DuckDB COPY TO - src/data_sync.py (734) — replaced by SyncOrchestrator - src/remote_query.py (636) — replaced by DuckDB BigQuery ATTACH - src/table_registry.py (464) — replaced by DuckDB repository - connectors/keboola/adapter.py (820) — replaced by extractor.py - connectors/bigquery/adapter.py (665) — replaced by extractor.py - connectors/bigquery/client.py (644) — replaced by DuckDB BQ extension Updated all imports in webapp, catalog_export, enricher, router, sync_settings_service, generate_sample_data. Kept keboola/client.py as fallback (removed src.config dependency). 704 tests passing.	2026-03-31 07:50:37 +02:00
ZdenekSrotyr	18e5f0b6e8	feat: implement extract.duckdb contract — orchestrator + extractors Phase 0: extend table_registry schema (v1→v2 migration), add source_type/bucket/source_table/query_mode columns. Phase 1: SyncOrchestrator ATTACHes extract.duckdb files into master analytics.duckdb. Keboola extractor uses DuckDB extension with legacy client fallback. BigQuery extractor is remote-only via DuckDB BQ extension (no data download). 62 tests passing.	2026-03-30 20:12:56 +02:00
Petr	f19ff10e1a	Fix: don't update last_sync when partitioned sync gets 0 new rows When BQ returns empty results (e.g., data not yet refreshed), the scheduler was marking sync as complete for the day. This meant the next 15-min tick would skip it ("none are due") and data would stay stale until the next day's scheduled run. Now: if partitioned sync processes partitions but gets 0 new rows, last_sync is NOT updated. The scheduler will retry on the next tick (15 min later) when data may be available.	2026-03-16 23:01:35 +01:00
Petr	8bb46a9e0a	Add per-partition streaming sync and hybrid query architecture Partitioned sync: iterates day-by-day instead of loading full dataset. Each partition: query BQ -> stream to disk -> free RAM. Peak ~50 MB. New helpers: _sync_single_partition, _cleanup_old_partitions, _generate_partition_dates. Config: added partition_column_type (DATE/TIMESTAMP/DATETIME), query_mode (local/remote/hybrid). DuckDB manager: hybrid architecture support (local Parquet + remote BQ tables). Data sync: skips remote tables, filters by query_mode. Tests: 113 passing (adapter, client, config, data_sync, duckdb_manager).	2026-03-12 13:20:41 +01:00
Petr	85c87ec375	Pass explicit bqstorage_client to to_arrow_iterable() for Storage API Without explicit bqstorage_client parameter, to_arrow_iterable() silently falls back to REST API pagination (~5K rows/sec). With explicit client, it uses parallel gRPC streams via BQ Storage API (~300K rows/sec). No temp table materialization - BQ already writes query results to an internal temp table automatically. We just tell the reader to use the fast gRPC path instead of slow HTTP pagination.	2026-03-12 10:51:44 +01:00
Petr	4f74543a12	Fix streaming: use RowIterator.to_arrow_iterable() not QueryJob QueryJob only has to_arrow(), not to_arrow_iterable(). Must call query_job.result() first to get RowIterator, which has the streaming to_arrow_iterable() method.	2026-03-11 20:15:35 +01:00
Petr	ee70da86c3	Stream BQ results to Parquet instead of loading into memory Replace to_arrow() (loads entire result into RAM) with to_arrow_iterable() (streams RecordBatches). Each batch is written directly to disk via ParquetWriter - constant memory regardless of table size. Prevents OOM on 8GB server for multi-million row tables.	2026-03-11 20:13:03 +01:00
Petr	a191ede28c	Add columns and row_filter to TableConfig for selective BQ export Propagate column selection and row filtering from data_description.md through the BigQuery adapter to the BQ client. This enables exporting only needed columns and applying date range filters at the SQL level, critical for large DataView tables (e.g., 412-col unit_economics).	2026-03-11 19:37:04 +01:00
Petr	e26e47a071	Add BQ Storage API fallback to REST when readsessions permission missing	2026-03-11 13:59:09 +01:00
Petr	758910463b	Add BigQuery data source adapter BigQuery connector that syncs BQ tables to local Parquet files via PyArrow (no CSV intermediate step). Supports full refresh, timestamp-based incremental (via incremental_column), and partition-based sync strategies. - connectors/bigquery/client.py: BQ API wrapper with ADC auth, parameterized queries, metadata cache, cross-project support (job project != data project) - connectors/bigquery/adapter.py: DataSource implementation with merge/dedup - src/config.py: Add incremental_column field to TableConfig - 72 unit tests (mocked, no GCP SDK required)	2026-03-11 13:56:12 +01:00

14 commits
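
Commit 569cd90d75 describes two identifier policies and a skip-and-continue rule for unsafe rows. The sketch below illustrates that idea and is not the contents of src/identifier_validation.py. The strict pattern is the one quoted in the commit message. The relaxed pattern, the boolean return convention, the row keys, and the helper filter_safe_rows are assumptions made for illustration.

```python
import re

# Strict policy: the pattern quoted in the commit message for table_name.
_STRICT_RE = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]{0,63}$")

# Relaxed policy: ASSUMED pattern. The commit only says dashes and dots are
# accepted; the exact character set and length limit are guesses.
_QUOTED_RE = re.compile(r"^[a-zA-Z0-9_][a-zA-Z0-9_.\-]{0,63}$")


def validate_identifier(name):
    """Return True if name is safe to use as an unquoted identifier."""
    return isinstance(name, str) and _STRICT_RE.fullmatch(name) is not None


def validate_quoted_identifier(name):
    """Return True if name is safe to place inside double quotes."""
    return isinstance(name, str) and _QUOTED_RE.fullmatch(name) is not None


def filter_safe_rows(rows):
    """Hypothetical helper: keep safe registry rows, count the skipped ones.

    Each row is assumed to be a dict with 'table_name' and 'bucket' keys.
    """
    safe, skipped = [], 0
    for row in rows:
        if validate_identifier(row["table_name"]) and validate_quoted_identifier(row["bucket"]):
            safe.append(row)
        else:
            skipped += 1  # skip-and-continue: one bad row does not abort the run
    return safe, skipped
```

Rejecting the double quote in the relaxed pattern is what keeps a value from closing the `"..."` wrapper early. A dashed table_name fails the strict check even though it would pass the relaxed one, which matches the fail-fast behaviour the commit describes.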
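
Commit 1488e01bf9 describes writing to extract.duckdb.tmp and then swapping it into place with WAL cleanup. A minimal sketch of that pattern follows, using only the standard library. The function name swap_into_place is hypothetical, and the `.wal` suffix for the sidecar file is an assumption about how DuckDB names its write-ahead log. It also assumes the writer connection has already been closed and that both paths are on the same filesystem.

```python
import os
from pathlib import Path


def swap_into_place(tmp_path, final_path):
    """Hypothetical helper: promote a freshly written temp database file."""
    tmp_path = Path(tmp_path)
    final_path = Path(final_path)

    # Drop a stale WAL left next to the old database so it cannot be
    # replayed against the new file (ASSUMED sidecar name: <db>.wal).
    stale_wal = Path(str(final_path) + ".wal")
    if stale_wal.exists():
        stale_wal.unlink()

    # os.replace overwrites the destination in a single rename step,
    # which is atomic on POSIX when both paths share a filesystem.
    os.replace(tmp_path, final_path)

    # Remove any WAL left behind by the temp file itself.
    tmp_wal = Path(str(tmp_path) + ".wal")
    if tmp_wal.exists():
        tmp_wal.unlink()
```

A call would look like `swap_into_place("extract.duckdb.tmp", "extract.duckdb")` once the extractor has finished writing. Readers that opened the old file keep their handle to it on POSIX systems, which fits the lock-conflict motivation given in the commit message.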