agnes-the-ai-analyst

Author	SHA1	Message	Date
Petr	ad525a96aa	Filter catalog metrics by configurable tag (e.g., AIAgent.FoundryAI) Add filter_tag support to catalog_export and webapp so only metrics with the required tag are exported to YAML and displayed in UI. Previously all 19+ metrics were exported regardless of relevance. - Add has_tag() helper to transformer module - catalog_export.py: filter_tag parameter from instance.yaml openmetadata config - webapp/app.py: filter metrics in _load_metrics_from_catalog() - 7 new tests (has_tag, filter_tag export, stale cleanup)	2026-03-16 22:03:53 +01:00
Petr	ab1a93ed67	Strip HTML tags from OpenMetadata descriptions in YAML export OpenMetadata stores descriptions as rich HTML (<p>, <strong>, &nbsp;, etc.). Add strip_html() to transformer that converts to clean plain text for YAML files consumed by Claude Code agent. Applied to metric descriptions, table descriptions, and column descriptions. Webapp display dict keeps raw HTML since the modal renders it correctly.	2026-03-15 01:57:04 +01:00
Petr	985f47cdb7	Add catalog export: generate YAML metrics and tables from OpenMetadata - New `connectors/openmetadata/transformer.py` with shared parsing logic for extracting categories, grain, dimensions, expressions from OM tags - New `src/catalog_export.py` script (python -m src.catalog_export) that fetches metrics/tables from OpenMetadata API and writes YAML files to /data/docs/metrics/ and /data/docs/tables/ for agent consumption - Refactor webapp/app.py to delegate to transformer (with inline fallback) - Add `fields` parameter to client.get_metrics() and get_metric_by_fqn() for fetching tags+owners in a single API call - Fix pre-existing mock bug in test_openmetadata_enricher (base_url) - 101 new tests (80 transformer + 21 export), all passing	2026-03-15 01:15:30 +01:00
Petr	268fe07f91	Fix: Use correct OpenMetadata API field names for metrics OpenMetadata uses different field names than expected: - metricExpression instead of expression - metricType instead of type - unitOfMeasurement instead of unit - granularity instead of grain Remove 'fields' query parameter from /api/v1/metrics - returns 400 Bad Request when invalid field names are specified. Let API return full metric objects. Update parsing to extract metadata from proper OpenMetadata fields instead of relying on tags (tags are optional, fields are always present).	2026-03-12 15:16:24 +01:00
Petr	5fc9526627	Phase 2: Replace demo YAML metrics with OpenMetadata catalog data - Add get_metric_by_fqn() to OpenMetadataClient - Add get_metrics() to CatalogEnricher with TTL caching - Implement _parse_om_metric() to extract category/grain from OpenMetadata tags - Implement _load_metrics_from_catalog() to fetch and categorize metrics - Implement _build_om_metric_detail() to convert OpenMetadata format to MetricParser JSON - Add /api/catalog/metrics/<fqn> endpoint for metric detail modal - Update _load_metrics_data() to prefer catalog over YAML fallback - Update metric_modal.js to route catalog:{fqn} to catalog API endpoint - Delete 10 demo YAML files from docs/metrics/ - Replace metric tests with new unit tests for catalog parsing functions (19 tests) Catalog metrics provide single source of truth vs maintaining demo YAML files. UI remains unchanged - only data source changes from YAML to OpenMetadata catalog.	2026-03-12 15:10:42 +01:00
Petr	e2d3afade3	Log when catalog enrichment (tags/tier) are found	2026-03-12 14:34:45 +01:00
Petr	14d75d6229	Fix: correct OpenMetadata catalog URL path and add debug logging - Change catalog URL from /explore/{fqn} to /table/{fqn} - Add debug logging to see parsed tags, owners, tier from API response	2026-03-12 14:34:12 +01:00
Petr	a7faf70cb3	Allow self-signed certificates for OpenMetadata catalog (internal networks) OpenMetadata catalog uses self-signed HTTPS certificate on internal networks. Disable SSL verification in httpx client and suppress related warnings.	2026-03-12 14:12:44 +01:00
Petr	c5c24cb45b	Implement OpenMetadata catalog integration (Phase 1) Add OpenMetadata REST API connector and enricher to merge table/column metadata from OpenMetadata catalog at sync and query time. Changes: - connectors/openmetadata/client.py: HTTP client for OM API - connectors/openmetadata/enricher.py: Data enrichment with TTL cache - tests/test_openmetadata_*: Unit tests for client and enricher - src/config.py: Add catalog_fqn field to TableConfig - src/data_sync.py: Use enricher in _generate_schema_yaml (catalog > BQ API > data_description.md) - webapp/app.py: Initialize enricher, enrich catalog data with tags/tier/owners/url - config/instance.yaml.example: Document openmetadata section Features: - FQN auto-derivation: bigquery.{table.id} - TTL cache (default 1h) to avoid repeated API calls - Graceful degradation: disabled if token missing, silent on HTTP errors - Column description priority: catalog > BQ API > (none) - Table description priority: catalog > data_description.md	2026-03-12 14:07:13 +01:00
Petr	8bb46a9e0a	Add per-partition streaming sync and hybrid query architecture Partitioned sync: iterates day-by-day instead of loading full dataset. Each partition: query BQ -> stream to disk -> free RAM. Peak ~50 MB. New helpers: _sync_single_partition, _cleanup_old_partitions, _generate_partition_dates. Config: added partition_column_type (DATE/TIMESTAMP/DATETIME), query_mode (local/remote/hybrid). DuckDB manager: hybrid architecture support (local Parquet + remote BQ tables). Data sync: skips remote tables, filters by query_mode. Tests: 113 passing (adapter, client, config, data_sync, duckdb_manager).	2026-03-12 13:20:41 +01:00
Petr	85c87ec375	Pass explicit bqstorage_client to to_arrow_iterable() for Storage API Without explicit bqstorage_client parameter, to_arrow_iterable() silently falls back to REST API pagination (~5K rows/sec). With explicit client, it uses parallel gRPC streams via BQ Storage API (~300K rows/sec). No temp table materialization - BQ already writes query results to an internal temp table automatically. We just tell the reader to use the fast gRPC path instead of slow HTTP pagination.	2026-03-12 10:51:44 +01:00
Petr	4f74543a12	Fix streaming: use RowIterator.to_arrow_iterable() not QueryJob QueryJob only has to_arrow(), not to_arrow_iterable(). Must call query_job.result() first to get RowIterator, which has the streaming to_arrow_iterable() method.	2026-03-11 20:15:35 +01:00
Petr	ee70da86c3	Stream BQ results to Parquet instead of loading into memory Replace to_arrow() (loads entire result into RAM) with to_arrow_iterable() (streams RecordBatches). Each batch is written directly to disk via ParquetWriter - constant memory regardless of table size. Prevents OOM on 8GB server for multi-million row tables.	2026-03-11 20:13:03 +01:00
Petr	a191ede28c	Add columns and row_filter to TableConfig for selective BQ export Propagate column selection and row filtering from data_description.md through the BigQuery adapter to the BQ client. This enables exporting only needed columns and applying date range filters at the SQL level, critical for large DataView tables (e.g., 412-col unit_economics).	2026-03-11 19:37:04 +01:00
Petr	e26e47a071	Add BQ Storage API fallback to REST when readsessions permission missing	2026-03-11 13:59:09 +01:00
Petr	758910463b	Add BigQuery data source adapter BigQuery connector that syncs BQ tables to local Parquet files via PyArrow (no CSV intermediate step). Supports full refresh, timestamp-based incremental (via incremental_column), and partition-based sync strategies. - connectors/bigquery/client.py: BQ API wrapper with ADC auth, parameterized queries, metadata cache, cross-project support (job project != data project) - connectors/bigquery/adapter.py: DataSource implementation with merge/dedup - src/config.py: Add incremental_column field to TableConfig - 72 unit tests (mocked, no GCP SDK required)	2026-03-11 13:56:12 +01:00
Petr	b99ec576ca	Add self-service data onboarding system Table Registry as central source of truth (JSON) with atomic writes, optimistic locking, audit logging, and data_description.md generation. Existing readers (config.py, profiler.py) need zero changes. Phase 1 - Discovery API: - discover_tables() on DataSource ABC + Keboola implementation - admin_required decorator with server-side recomputation - GET /api/admin/discover-tables endpoint Phase 2 - Table Registry: - src/table_registry.py with CRUD, validation, migration from MD - Admin API: register/update/unregister with version locking - DELETE cascade cleans up per-user subscriptions Phase 3 - Auto-Profiling: - profile_changed_tables() for incremental profiling - Non-fatal hook in sync_all() after successful sync Phase 4 - Per-Table Subscriptions: - table_mode (all/explicit) with per-table toggles - GET/POST /api/table-subscriptions endpoints - Subscription status in catalog and dashboard views Phase 5 - Smart Sync: - Python-generated rsync filter files (not shell YAML parsing) - sync_data.sh uses --filter="merge ..." for explicit mode Phase 6 - Admin UI: - /admin/tables with discovery, registration modal, registry mgmt - Vanilla JS, matching existing design system	2026-03-09 14:25:37 +01:00
Petr	38b86127ed	Branding cleanup: remove Keboola-specific references from docs and config - server/deploy.sh: KEBOOLA_ENV_FILE -> SYNC_ENV_FILE - server/ws-gateway.service, notify-bot.service: remove Keboola from descriptions - .gitignore: generic comment for data directory - CLAUDE.md, README.md, ARCHITECTURE.md: update paths from src/adapters to connectors/ - docs/DATA_SOURCES.md: update custom connector guide to connectors/ pattern - connectors/jira/README.md: keboola-analyst -> data-analyst in config paths - dev_docs/desktop-app.md: KeboolaAnalyst -> DataAnalyst branding	2026-03-09 12:22:27 +01:00
Petr	266e8573d3	Extract Keboola into connectors/keboola module Move all Keboola-specific code out of src/ into connectors/keboola/: - git mv src/keboola_client.py -> connectors/keboola/client.py - Extract LocalKeboolaSource (855 lines) from data_sync.py -> connectors/keboola/adapter.py - Rename to KeboolaDataSource with full env var validation - Extend DataSource ABC with get_column_metadata() and get_source_name() - Add dynamic connector registry via importlib in create_data_source() - Refactor _generate_schema_yaml to use ABC methods (source_type, _schema_version: 2) - Remove src/adapters/ (redundant facade layer) - Remove Keboola validation from src/config.py (connector validates itself) - Add 14 tests for factory, ABC defaults, env validation, dynamic lookup	2026-03-09 12:22:16 +01:00
Petr	86edd27655	Extract Jira into connectors/jira module Move all Jira-specific code into a self-contained connector module: - 22 files moved via git mv (transform, service, webhook, scripts, systemd units, tests, docs, bin helper) - All imports updated to use connectors.jira.* paths - Jira is now conditional: auto-detected via JIRA_DOMAIN env var - Webapp registers Jira blueprint only when available - Health service monitors Jira timers only when enabled - Profiler loads Jira tables dynamically from filesystem - Sync settings uses config-driven dependency validation - Renamed keboola_platform_url -> custom_url in transform - Updated deploy.sh, sudoers-deploy, backfill_gap.sh paths - Fixed pytest.ini to skip live tests by default	2026-03-09 11:17:50 +01:00

20 commits
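
Illustration for commits ab1a93ed67 and ad525a96aa: the log names two transformer helpers, strip_html() and has_tag(), but does not show them. The sketch below is one possible shape for such helpers, not the repository's actual code. The signatures, the set of block-level tags, and the assumption that each tag is a dict carrying a "tagFQN" key are all hypothetical.

```python
from html.parser import HTMLParser

# Assumption: tags treated as word boundaries when flattening HTML to text.
_BLOCK_TAGS = {"p", "br", "div", "li"}


class _TextExtractor(HTMLParser):
    """Collects text content and drops the markup around it."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; and &amp;.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in _BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in _BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text):
    """Return plain text for a rich-HTML description (hypothetical signature)."""
    if not text:
        return ""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    # Turn non-breaking spaces into regular ones, then collapse whitespace.
    joined = "".join(parser.parts).replace("\xa0", " ")
    return " ".join(joined.split())


def has_tag(entity, required_tag):
    """True if the entity carries required_tag; no filter means keep everything."""
    if not required_tag:
        return True
    tags = entity.get("tags") or []
    return any(tag.get("tagFQN") == required_tag for tag in tags)


# Usage: keep only metrics carrying the configured filter tag.
metrics = [
    {"name": "revenue", "tags": [{"tagFQN": "AIAgent.FoundryAI"}]},
    {"name": "internal_debug", "tags": []},
    {"name": "untagged"},
]
exported = [m["name"] for m in metrics if has_tag(m, "AIAgent.FoundryAI")]
clean = strip_html("<p>Total <strong>revenue</strong>&nbsp;per day</p>")
```

Treating an empty filter as "keep everything" is a design choice made for this sketch, so that an instance with no filter_tag configured would still export every metric.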