agnes-the-ai-analyst

Author	SHA1	Message	Date
Petr	8c6c162417	Fix: --sql not required when --stdin is used argparse was rejecting --stdin mode because --sql was required=True. Changed to required=False with runtime validation in main().	2026-03-21 12:17:02 +01:00
Petr	67df4acd73	Add --stdin JSON mode to avoid shell escaping nightmare Agent was failing 3x on SSH commands due to backticks (BQ table names) and single quotes (SQL string literals) getting mangled by nested shell interpretation (local -> SSH -> bash -> Python). New --stdin mode reads query spec as JSON from stdin via heredoc: cat <<'QUERY' | ssh alias 'bash remote_query.sh --stdin' {"register_bq": {"alias": "SELECT ... FROM `table` ..."}, "sql": "..."} QUERY Heredoc with <<'QUERY' (quoted) passes everything literally -- no escaping needed for backticks, quotes, or parentheses. Updated claude_md_template.txt to use --stdin as the primary method.	2026-03-21 12:15:50 +01:00
Petr	39763ea5a2	Fix: load instance.yaml without requiring webapp secrets Analysts don't have WEBAPP_SECRET_KEY, so load_instance_config() validation failed with noisy warnings. Now reads instance.yaml directly with yaml.safe_load, skipping secret validation.	2026-03-21 12:01:41 +01:00
Petr	d180b2014e	Step 28: Remote query architecture for local+remote table JOINs Add src/remote_query.py CLI module enabling the AI agent to run SQL queries spanning local Parquet tables and remote BigQuery tables in a single DuckDB session on the server. Two-phase protocol: BQ sub-queries (--register-bq) fetch filtered/aggregated data, then DuckDB SQL (--sql) joins everything. Safety: COUNT(*) pre-check, memory estimation (2GB cap), row limits (500K per BQ sub-query, 100K final result). Changes: - New src/remote_query.py with CLI, BQ registration, output formatting - Add bq_entity_type field to TableConfig (view vs table routing) - Extract create_local_views() from duckdb_manager.py for reuse - Update claude_md_template.txt with remote query agent instructions - Update example configs with remote_query section and docs - 52 new tests (42 remote_query + 10 bq_entity_type), all passing	2026-03-21 11:39:15 +01:00
Petr	fb63a72a98	Add data product discovery, fix remove-analyst script - client.py: add search_by_data_product() for OpenMetadata search API - catalog_export.py: prefer data product discovery over tag filtering (finds all 16 metrics in FoundryAIDataModel vs 3 with tag filter) - remove-analyst: fix GROUPS bash variable collision, improve messaging	2026-03-18 12:52:41 +01:00
Petr	ab99f0af92	Fix sync_schedule validation to accept multi-time daily format The scheduler.py already supported "daily HH:MM,HH:MM,HH:MM" format (commit `5f27d05`), but config.py validation regex only accepted single time "daily HH:MM", causing data-refresh to crash on startup. Also adds: - tests/test_config_sync_schedule.py (16 test cases) - Makefile with validate-config target for CI/CD integration	2026-03-17 13:21:14 +01:00
Petr	5f27d05894	Support multiple daily sync times (e.g., "daily 07:00,13:00,18:00") Scheduler now accepts comma-separated HH:MM times in daily schedules. Each time slot is independently evaluated - if any slot has passed and last_sync is before it, the table is marked as due. This lets tables sync multiple times per day to pick up data refreshes that happen throughout the day (e.g., Keboola pipelines running 3x/day).	2026-03-16 23:09:48 +01:00
Petr	ad525a96aa	Filter catalog metrics by configurable tag (e.g., AIAgent.FoundryAI) Add filter_tag support to catalog_export and webapp so only metrics with the required tag are exported to YAML and displayed in UI. Previously all 19+ metrics were exported regardless of relevance. - Add has_tag() helper to transformer module - catalog_export.py: filter_tag parameter from instance.yaml openmetadata config - webapp/app.py: filter metrics in _load_metrics_from_catalog() - 7 new tests (has_tag, filter_tag export, stale cleanup)	2026-03-16 22:03:53 +01:00
Petr	80c5b902e0	Add scheduled data sync and catalog refresh with systemd timers - New sync_schedule and profile_after_sync fields in TableConfig (formats: "every 15m", "every 1h", "daily 05:00") - New src/scheduler.py with schedule evaluation logic (is_table_due) - New --scheduled mode in data_sync.py: only syncs tables that are due, respects profile_after_sync flag, auto-restarts webapp after profiling - Systemd timer+service for data-refresh (every 15 min) - Systemd timer+service for catalog-refresh (every 15 min) - deploy.sh enables new timers automatically - Complete table config reference in data_description.md.example - 58 new scheduler tests	2026-03-15 02:16:31 +01:00
Petr	d9f3977028	URL-encode FQN in catalog header links (spaces -> %20)	2026-03-15 02:06:22 +01:00
Petr	60039c0af3	Add direct catalog URL to YAML header (metric/table entity links) Source line now links directly to the entity in OpenMetadata: - metrics: https://datacatalog.../metric/UniqueVisitors - tables: https://datacatalog.../table/bigquery.project.dataset.table	2026-03-15 02:03:27 +01:00
Petr	985f47cdb7	Add catalog export: generate YAML metrics and tables from OpenMetadata - New `connectors/openmetadata/transformer.py` with shared parsing logic for extracting categories, grain, dimensions, expressions from OM tags - New `src/catalog_export.py` script (python -m src.catalog_export) that fetches metrics/tables from OpenMetadata API and writes YAML files to /data/docs/metrics/ and /data/docs/tables/ for agent consumption - Refactor webapp/app.py to delegate to transformer (with inline fallback) - Add `fields` parameter to client.get_metrics() and get_metric_by_fqn() for fetching tags+owners in a single API call - Fix pre-existing mock bug in test_openmetadata_enricher (base_url) - 101 new tests (80 transformer + 21 export), all passing	2026-03-15 01:15:30 +01:00
Petr	be58e63394	Move profiler config to instance.yaml (KISS principle) Instead of hardcoded Python constants, load profiler settings from config: - instance.yaml: profiler section with all parameters - Defaults: fallback to sensible defaults if config not found - Centralized: all profiler tuning in one place, no code changes needed	2026-03-12 14:45:14 +01:00
Petr	c25278538c	Simplify profiler config: use single SAMPLE_SIZE parameter (KISS) Replace SAMPLE_THRESHOLD + SAMPLE_SIZE with single SAMPLE_SIZE: - If table > SAMPLE_SIZE: sample that many rows - Otherwise: use all rows Cleaner, easier to configure.	2026-03-12 14:43:23 +01:00
Petr	c5c24cb45b	Implement OpenMetadata catalog integration (Phase 1) Add OpenMetadata REST API connector and enricher to merge table/column metadata from OpenMetadata catalog at sync and query time. Changes: - connectors/openmetadata/client.py: HTTP client for OM API - connectors/openmetadata/enricher.py: Data enrichment with TTL cache - tests/test_openmetadata_*: Unit tests for client and enricher - src/config.py: Add catalog_fqn field to TableConfig - src/data_sync.py: Use enricher in _generate_schema_yaml (catalog > BQ API > data_description.md) - webapp/app.py: Initialize enricher, enrich catalog data with tags/tier/owners/url - config/instance.yaml.example: Document openmetadata section Features: - FQN auto-derivation: bigquery.{table.id} - TTL cache (default 1h) to avoid repeated API calls - Graceful degradation: disabled if token missing, silent on HTTP errors - Column description priority: catalog > BQ API > (none) - Table description priority: catalog > data_description.md	2026-03-12 14:07:13 +01:00
Petr	8bb46a9e0a	Add per-partition streaming sync and hybrid query architecture Partitioned sync: iterates day-by-day instead of loading full dataset. Each partition: query BQ -> stream to disk -> free RAM. Peak ~50 MB. New helpers: _sync_single_partition, _cleanup_old_partitions, _generate_partition_dates. Config: added partition_column_type (DATE/TIMESTAMP/DATETIME), query_mode (local/remote/hybrid). DuckDB manager: hybrid architecture support (local Parquet + remote BQ tables). Data sync: skips remote tables, filters by query_mode. Tests: 113 passing (adapter, client, config, data_sync, duckdb_manager).	2026-03-12 13:20:41 +01:00
Petr	d2e83ce9d0	Set DuckDB memory_limit=4GB in profiler to prevent OOM Server has 8GB RAM with other services running. DuckDB defaults to using all available memory, causing OOM killer when profiling large tables (22M rows, 39 cols triggered 7.5GB RSS -> killed).	2026-03-12 11:06:49 +01:00
Petr	a191ede28c	Add columns and row_filter to TableConfig for selective BQ export Propagate column selection and row filtering from data_description.md through the BigQuery adapter to the BQ client. This enables exporting only needed columns and applying date range filters at the SQL level, critical for large DataView tables (e.g., 412-col unit_economics).	2026-03-11 19:37:04 +01:00
Petr	758910463b	Add BigQuery data source adapter BigQuery connector that syncs BQ tables to local Parquet files via PyArrow (no CSV intermediate step). Supports full refresh, timestamp-based incremental (via incremental_column), and partition-based sync strategies. - connectors/bigquery/client.py: BQ API wrapper with ADC auth, parameterized queries, metadata cache, cross-project support (job project != data project) - connectors/bigquery/adapter.py: DataSource implementation with merge/dedup - src/config.py: Add incremental_column field to TableConfig - 72 unit tests (mocked, no GCP SDK required)	2026-03-11 13:56:12 +01:00
Petr	28543d98b1	Fix profiler file_size and catalog stats fallback - Profiler computes file_size_mb from actual parquet files when sync_state.json is absent (sample data / no-sync deployments) - Catalog header falls back to profiles.json for aggregate stats (tables count, total rows) when sync_state.json is missing	2026-03-10 22:12:46 +01:00
Petr	1be0dc5300	Add flat parquet fallback to profiler get_parquet_path Tries subfolder path first (Keboola-style layout), then falls back to flat path for simple deployments like sample data.	2026-03-10 22:09:14 +01:00
Petr	b99ec576ca	Add self-service data onboarding system Table Registry as central source of truth (JSON) with atomic writes, optimistic locking, audit logging, and data_description.md generation. Existing readers (config.py, profiler.py) need zero changes. Phase 1 - Discovery API: - discover_tables() on DataSource ABC + Keboola implementation - admin_required decorator with server-side recomputation - GET /api/admin/discover-tables endpoint Phase 2 - Table Registry: - src/table_registry.py with CRUD, validation, migration from MD - Admin API: register/update/unregister with version locking - DELETE cascade cleans up per-user subscriptions Phase 3 - Auto-Profiling: - profile_changed_tables() for incremental profiling - Non-fatal hook in sync_all() after successful sync Phase 4 - Per-Table Subscriptions: - table_mode (all/explicit) with per-table toggles - GET/POST /api/table-subscriptions endpoints - Subscription status in catalog and dashboard views Phase 5 - Smart Sync: - Python-generated rsync filter files (not shell YAML parsing) - sync_data.sh uses --filter="merge ..." for explicit mode Phase 6 - Admin UI: - /admin/tables with discovery, registration modal, registry mgmt - Vanilla JS, matching existing design system	2026-03-09 14:25:37 +01:00
Petr	266e8573d3	Extract Keboola into connectors/keboola module Move all Keboola-specific code out of src/ into connectors/keboola/: - git mv src/keboola_client.py -> connectors/keboola/client.py - Extract LocalKeboolaSource (855 lines) from data_sync.py -> connectors/keboola/adapter.py - Rename to KeboolaDataSource with full env var validation - Extend DataSource ABC with get_column_metadata() and get_source_name() - Add dynamic connector registry via importlib in create_data_source() - Refactor _generate_schema_yaml to use ABC methods (source_type, _schema_version: 2) - Remove src/adapters/ (redundant facade layer) - Remove Keboola validation from src/config.py (connector validates itself) - Add 14 tests for factory, ABC defaults, env validation, dynamic lookup	2026-03-09 12:22:16 +01:00
Petr	86edd27655	Extract Jira into connectors/jira module Move all Jira-specific code into a self-contained connector module: - 22 files moved via git mv (transform, service, webhook, scripts, systemd units, tests, docs, bin helper) - All imports updated to use connectors.jira.* paths - Jira is now conditional: auto-detected via JIRA_DOMAIN env var - Webapp registers Jira blueprint only when available - Health service monitors Jira timers only when enabled - Profiler loads Jira tables dynamically from filesystem - Sync settings uses config-driven dependency validation - Renamed keboola_platform_url -> custom_url in transform - Updated deploy.sh, sudoers-deploy, backfill_gap.sh paths - Fixed pytest.ini to skip live tests by default	2026-03-09 11:17:50 +01:00
Petr	c56905d34f	Initial commit: OSS data distribution platform Open-source AI data analyst platform extracted from internal repo. Includes data sync engine, Keboola adapter, Flask web portal, server deployment scripts, and configuration templates.	2026-03-08 23:31:28 +01:00
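
The multi-time daily schedule described in commits 5f27d05894 and ab99f0af92 can be illustrated with a minimal sketch. This is not the code from src/scheduler.py or src/config.py: the regex and the names `DAILY_RE`, `parse_daily_slots`, and `is_daily_due` are assumptions made for illustration. It only covers the rule stated in the commit message: a table is due if any slot has passed and last_sync is before it. For simplicity it looks only at slots on the current day.

```python
import re
from datetime import datetime, time
from typing import List, Optional

# Hypothetical validation pattern: "daily HH:MM" or "daily HH:MM,HH:MM,..."
DAILY_RE = re.compile(
    r"^daily ((?:[01]\d|2[0-3]):[0-5]\d(?:,(?:[01]\d|2[0-3]):[0-5]\d)*)$"
)


def parse_daily_slots(schedule: str) -> List[time]:
    """Return the time slots of a daily schedule, or raise ValueError."""
    match = DAILY_RE.match(schedule.strip())
    if match is None:
        raise ValueError(f"unsupported schedule: {schedule!r}")
    return [time.fromisoformat(part) for part in match.group(1).split(",")]


def is_daily_due(schedule: str, last_sync: Optional[datetime], now: datetime) -> bool:
    """Evaluate each slot independently against today's date.

    A slot marks the table as due when it has already passed and the
    last sync (if any) happened before it.
    """
    for slot in parse_daily_slots(schedule):
        slot_dt = datetime.combine(now.date(), slot)
        if slot_dt <= now and (last_sync is None or last_sync < slot_dt):
            return True
    return False
```

With the schedule "daily 07:00,13:00,18:00", a table last synced at 07:05 is due again at 13:30 on the same day, and is no longer due once it has synced at 13:05.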

125 commits
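
The --stdin behaviour from commits 67df4acd73 and 8c6c162417 (--sql declared with required=False, then validated at runtime) might look roughly like the sketch below. It is an assumption-laden illustration, not the contents of src/remote_query.py: the function names `build_parser` and `load_query_spec`, the help texts, and the sample table names are hypothetical. Only the --sql and --stdin flags and the JSON keys "register_bq" and "sql" come from the commit messages.

```python
import argparse
import io
import json
import sys
from typing import Any, Dict, List, Optional, TextIO


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="remote_query")
    # required=False: argparse must not reject --stdin runs that omit --sql.
    parser.add_argument("--sql", required=False, help="final DuckDB SQL")
    parser.add_argument("--stdin", action="store_true",
                        help="read the query spec as JSON from stdin")
    return parser


def load_query_spec(argv: List[str], stdin: Optional[TextIO] = None) -> Dict[str, Any]:
    """Return {"register_bq": {...}, "sql": "..."} from flags or stdin JSON."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.stdin:
        spec = json.load(stdin if stdin is not None else sys.stdin)
    else:
        spec = {"register_bq": {}, "sql": args.sql}
    # Runtime validation replaces argparse's required=True check.
    if not spec.get("sql"):
        parser.error('provide --sql, or --stdin with JSON containing "sql"')
    spec.setdefault("register_bq", {})
    return spec


# Backticks and single quotes arrive untouched because nothing is shell-parsed.
payload = """{"register_bq": {"orders": "SELECT id FROM `proj.ds.orders`"},
 "sql": "SELECT COUNT(*) FROM orders WHERE status = 'open'"}"""
spec = load_query_spec(["--stdin"], stdin=io.StringIO(payload))
```

Calling `load_query_spec([])` with neither flag exits through `parser.error`, which is the runtime check that stands in for required=True.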