agnes-the-ai-analyst

Author	SHA1	Message	Date
Petr	95358448e6	Add modular LLM connector for Corporate Memory Replace hardwired Anthropic API calls with a pluggable provider system. Each deployment configures its AI provider in instance.yaml — switching between Anthropic, LiteLLM, OpenRouter, or any OpenAI-compatible proxy is a config change, not a code change. New connectors/llm/ module: - StructuredExtractor Protocol with extract_json() interface - AnthropicExtractor: direct Anthropic SDK with retry + backoff - OpenAICompatExtractor: any OpenAI-compatible proxy with three-layer structured output fallback (json_schema -> json_object -> prompt) - Configurable structured_output policy (strict/json/auto) - Custom exception hierarchy (auth/rate_limit/timeout/format/refusal) - Zero secrets in logs: no API keys, prompts, or responses logged Reviewed by: Google Gemini, Claude Sonnet, OpenAI GPT-5.4. Security audit passed with all critical findings resolved.	2026-03-23 12:08:33 +01:00
Petr	84d14da611	Fix remote query UX: file-based stdin, ssh permissions, deprecation Session testing revealed 3 issues with remote queries: 1. CLAUDE.md template recommended `cat <<HEREDOC | ssh ...` but claude_settings.json had `cat` in deny list, causing 2-3 failed attempts per query. Replaced with file-based approach: Write tool creates JSON file, then `ssh ... < file` avoids the cat deny. 2. ssh/scp commands were not in the allow list, requiring manual approval for every remote query. Added both to allow list. 3. DuckDB fetch_arrow_table() emitted DeprecationWarning on every parquet export. Replaced with .arrow().read_all(). Also added instruction for proactive hybrid analysis when remote tables are available (agent was only using local data until asked).	2026-03-21 18:41:43 +01:00
Petr	8c6c162417	Fix: --sql not required when --stdin is used argparse was rejecting --stdin mode because --sql was required=True. Changed to required=False with runtime validation in main().	2026-03-21 12:17:02 +01:00
Petr	67df4acd73	Add --stdin JSON mode to avoid shell escaping nightmare Agent was failing 3x on SSH commands due to backticks (BQ table names) and single quotes (SQL string literals) getting mangled by nested shell interpretation (local -> SSH -> bash -> Python). New --stdin mode reads query spec as JSON from stdin via heredoc: cat <<'QUERY' | ssh alias 'bash remote_query.sh --stdin' {"register_bq": {"alias": "SELECT ... FROM `table` ..."}, "sql": "..."} QUERY Heredoc with <<'QUERY' (quoted) passes everything literally -- no escaping needed for backticks, quotes, or parentheses. Updated claude_md_template.txt to use --stdin as the primary method.	2026-03-21 12:15:50 +01:00
Petr	39763ea5a2	Fix: load instance.yaml without requiring webapp secrets Analysts don't have WEBAPP_SECRET_KEY, so load_instance_config() validation failed with noisy warnings. Now reads instance.yaml directly with yaml.safe_load, skipping secret validation.	2026-03-21 12:01:41 +01:00
Petr	dfec39722b	Fix remote_query.sh: use analyst-readable env file GCP OS Login doesn't honor /etc/group changes for SSH sessions, so analyst can't read /opt/data-analyst/.env even after usermod. Wrapper now reads .remote_query.env from scripts dir (dataread group), falls back to .env for admin users. The env file contains only non-secret BQ config (project ID, location, data dir).	2026-03-21 11:59:57 +01:00
Petr	dce8454894	Add remote_query.sh wrapper, fix analyst SSH permissions Analyst user (foundry_e_psimecek) couldn't access /opt/data-analyst/. Added to data-ops group on server. New scripts/remote_query.sh wrapper handles env setup (PYTHONPATH, CONFIG_DIR, .env) so agents use simple: ssh alias 'bash ~/server/scripts/remote_query.sh --sql "..." --format table' Updated claude_md_template.txt to use wrapper instead of raw commands.	2026-03-21 11:58:04 +01:00
Petr	ed5a5ec706	Fix: duckdb_manager CONFIG_DIR support for server deployment find_project_root() and parse_data_description() now check CONFIG_DIR env var first when looking for data_description.md. On server deployment, data_description.md lives in instance/config/ (CONFIG_DIR), not in the OSS repo's docs/ directory.	2026-03-21 11:40:55 +01:00
Petr	d180b2014e	Step 28: Remote query architecture for local+remote table JOINs Add src/remote_query.py CLI module enabling the AI agent to run SQL queries spanning local Parquet tables and remote BigQuery tables in a single DuckDB session on the server. Two-phase protocol: BQ sub-queries (--register-bq) fetch filtered/aggregated data, then DuckDB SQL (--sql) joins everything. Safety: COUNT(*) pre-check, memory estimation (2GB cap), row limits (500K per BQ sub-query, 100K final result). Changes: - New src/remote_query.py with CLI, BQ registration, output formatting - Add bq_entity_type field to TableConfig (view vs table routing) - Extract create_local_views() from duckdb_manager.py for reuse - Update claude_md_template.txt with remote query agent instructions - Update example configs with remote_query section and docs - 52 new tests (42 remote_query + 10 bq_entity_type), all passing	2026-03-21 11:39:15 +01:00
Petr	ed16122994	Use data_product config for metric discovery instead of filter_tag in webapp	2026-03-18 16:10:15 +01:00
Petr	e63c8747b5	Fix metric expression extraction: use 'code' field OpenMetadata stores SQL in metricExpression.code, not .expression. This caused all metric expressions to export as empty strings.	2026-03-18 13:01:23 +01:00
Petr	908d1f2247	Fix search_by_data_product: client-side filtering OpenMetadata search API ignores queryFilter for dataProducts field. Use type-specific index + client-side filtering by dataProducts membership instead. Correctly returns 16/32 metrics for FoundryAI.	2026-03-18 12:54:59 +01:00
Petr	fb63a72a98	Add data product discovery, fix remove-analyst script - client.py: add search_by_data_product() for OpenMetadata search API - catalog_export.py: prefer data product discovery over tag filtering (finds all 16 metrics in FoundryAIDataModel vs 3 with tag filter) - remove-analyst: fix GROUPS bash variable collision, improve messaging	2026-03-18 12:52:41 +01:00
Petr	ab99f0af92	Fix sync_schedule validation to accept multi-time daily format The scheduler.py already supported "daily HH:MM,HH:MM,HH:MM" format (commit `5f27d05`), but config.py validation regex only accepted single time "daily HH:MM", causing data-refresh to crash on startup. Also adds: - tests/test_config_sync_schedule.py (16 test cases) - Makefile with validate-config target for CI/CD integration	2026-03-17 13:21:14 +01:00
Petr	5f27d05894	Support multiple daily sync times (e.g., "daily 07:00,13:00,18:00") Scheduler now accepts comma-separated HH:MM times in daily schedules. Each time slot is independently evaluated - if any slot has passed and last_sync is before it, the table is marked as due. This lets tables sync multiple times per day to pick up data refreshes that happen throughout the day (e.g., Keboola pipelines running 3x/day).	2026-03-16 23:09:48 +01:00
Petr	f19ff10e1a	Fix: don't update last_sync when partitioned sync gets 0 new rows When BQ returns empty results (e.g., data not yet refreshed), the scheduler was marking sync as complete for the day. This meant the next 15-min tick would skip it ("none are due") and data would stay stale until the next day's scheduled run. Now: if partitioned sync processes partitions but gets 0 new rows, last_sync is NOT updated. The scheduler will retry on the next tick (15 min later) when data may be available.	2026-03-16 23:01:35 +01:00
Petr	6c0abf275b	Add cache busting to metric_modal.css include	2026-03-16 22:16:37 +01:00
Petr	9be22fdc82	Fix metric display: use displayName in list, render HTML in modal List view: - Show display_name ("M1 + VFM Operational") instead of name ("M1PlusVFMOperational") - Strip HTML and truncate description for clean list excerpts Modal detail: - Render original HTML from catalog instead of stripped plain text - Add .om-description CSS class for structured HTML (bold labels, lists, code) - Pass description_html alongside plain text description for backwards compat	2026-03-16 22:11:58 +01:00
Petr	ad525a96aa	Filter catalog metrics by configurable tag (e.g., AIAgent.FoundryAI) Add filter_tag support to catalog_export and webapp so only metrics with the required tag are exported to YAML and displayed in UI. Previously all 19+ metrics were exported regardless of relevance. - Add has_tag() helper to transformer module - catalog_export.py: filter_tag parameter from instance.yaml openmetadata config - webapp/app.py: filter metrics in _load_metrics_from_catalog() - 7 new tests (has_tag, filter_tag export, stale cleanup)	2026-03-16 22:03:53 +01:00
Petr	440662c8fe	Fix remove-analyst silent failure caused by set -e + pipefail The script was exiting silently on the GROUPS=$(groups ... | cut ...) line — set -eo pipefail caused bash to terminate the script before any echo output, making it appear to do nothing. Replace set -euo pipefail with set -u and explicit error handling. Admin scripts must always report what happened, never exit silently. Also: use id -nG instead of groups|cut pipe, add verification step after userdel, and log each operation for visibility.	2026-03-15 14:17:39 +01:00
Petr	2181d490e9	Fix systemd NAMESPACE failures caused by missing ReadWritePaths dirs data-refresh.service: use /tmp instead of /tmp/data_analyst_staging in ReadWritePaths — the subdirectory may not exist at service start, causing mount namespace setup to fail before any Exec* directive runs. deploy.sh: fix typo services/corporate-memory -> services/corporate_memory so the mkdir conditional actually matches the repo directory name. deploy.sh: add ReadWritePaths validation loop that auto-creates any missing directories listed in installed .service files before daemon-reload. This acts as a safety net against future NAMESPACE failures from new services.	2026-03-15 11:40:11 +01:00
Petr	80c5b902e0	Add scheduled data sync and catalog refresh with systemd timers - New sync_schedule and profile_after_sync fields in TableConfig (formats: "every 15m", "every 1h", "daily 05:00") - New src/scheduler.py with schedule evaluation logic (is_table_due) - New --scheduled mode in data_sync.py: only syncs tables that are due, respects profile_after_sync flag, auto-restarts webapp after profiling - Systemd timer+service for data-refresh (every 15 min) - Systemd timer+service for catalog-refresh (every 15 min) - deploy.sh enables new timers automatically - Complete table config reference in data_description.md.example - 58 new scheduler tests	2026-03-15 02:16:31 +01:00
Petr	d9f3977028	URL-encode FQN in catalog header links (spaces -> %20)	2026-03-15 02:06:22 +01:00
Petr	60039c0af3	Add direct catalog URL to YAML header (metric/table entity links) Source line now links directly to the entity in OpenMetadata: - metrics: https://datacatalog.../metric/UniqueVisitors - tables: https://datacatalog.../table/bigquery.project.dataset.table	2026-03-15 02:03:27 +01:00
Petr	ab1a93ed67	Strip HTML tags from OpenMetadata descriptions in YAML export OpenMetadata stores descriptions as rich HTML (<p>, <strong>, &nbsp;, etc.). Add strip_html() to transformer that converts to clean plain text for YAML files consumed by Claude Code agent. Applied to metric descriptions, table descriptions, and column descriptions. Webapp display dict keeps raw HTML since the modal renders it correctly.	2026-03-15 01:57:04 +01:00
Petr	985f47cdb7	Add catalog export: generate YAML metrics and tables from OpenMetadata - New `connectors/openmetadata/transformer.py` with shared parsing logic for extracting categories, grain, dimensions, expressions from OM tags - New `src/catalog_export.py` script (python -m src.catalog_export) that fetches metrics/tables from OpenMetadata API and writes YAML files to /data/docs/metrics/ and /data/docs/tables/ for agent consumption - Refactor webapp/app.py to delegate to transformer (with inline fallback) - Add `fields` parameter to client.get_metrics() and get_metric_by_fqn() for fetching tags+owners in a single API call - Fix pre-existing mock bug in test_openmetadata_enricher (base_url) - 101 new tests (80 transformer + 21 export), all passing	2026-03-15 01:15:30 +01:00
Petr	e17dd85504	Remove hardcoded Jira/Keboola references from sync_data.sh - Silent fallback when no sync settings exist (no 'Jira disabled' message) - Generic dataset exclude/include loop driven by sync_settings.yaml - Generic cleanup loop for disabled datasets - Replaces 100+ lines of hardcoded Jira/kbc_telemetry_expert blocks	2026-03-15 01:02:37 +01:00
Petr	49adbe26ec	Move server venv setup to bootstrap, remove cross-platform pip freeze sync Server venv is created during bootstrap via SSH (same package list, installed natively on Linux). Removes sync_data.sh section that copied pip freeze output across platforms (Windows/macOS freeze is incompatible with Linux).	2026-03-15 00:59:45 +01:00
Petr	22a1bb5847	Auto-restart sync_data.sh after self-update (exec replaces process)	2026-03-15 00:53:17 +01:00
Petr	6f9de274fb	Fix: CLAUDE.md template vars overwrote $SSH_HOST used by rsync The CLAUDE.md generation section reused SSH_HOST variable name to store the server IP, overwriting the SSH alias needed for rsync. Renamed to TMPL_SSH_ALIAS/TMPL_SERVER_HOST/TMPL_WEBAPP_URL to avoid collision.	2026-03-15 00:51:49 +01:00
Petr	508d92771f	Generate setup instructions from bootstrap.yaml (single source of truth) - Rewrite bootstrap.yaml as clean structured YAML with steps, commands, descriptions, conditions, and notes - Add _generate_setup_instructions() in app.py that reads YAML, substitutes placeholders, and produces clipboard-ready plain text - Replace 50-line hardcoded JS string builder with single tojson variable - All setup instructions now editable in one YAML file	2026-03-15 00:37:19 +01:00
Petr	85c07732b2	Fix dashboard stats: support flat sync_state.json format (no 'tables' wrapper) BigQuery adapter writes table entries at top level, not nested under 'tables'. Detect flat format by checking if values contain 'rows' key.	2026-03-15 00:26:10 +01:00
Petr	b3ba65be59	Add ssh_alias, ssh_key, project_dir, disabled_providers to instance.yaml.example	2026-03-15 00:15:20 +01:00
Petr	3ebb15cbab	Make project_dir, ssh_key configurable in Get Started UI Read server.project_dir from instance.yaml (default: 'data-analyst'). Replace hardcoded 'data-analyst' folder name and 'data_analyst_server' SSH key name in dashboard template with Jinja variables.	2026-03-15 00:12:46 +01:00
Petr	021c453ea6	Auto-create .sync_connection via printf command in bootstrap Replace 'Save this to .sync_connection' prose with actual printf command that Claude/user executes. Fix heredoc indentation in bootstrap.yaml.	2026-03-15 00:05:42 +01:00
Petr	b0e4749b0d	Replace hardcoded 'data-analyst' SSH alias with configurable $SSH_HOST Read SSH alias from .sync_connection file at script start (default: 'data-analyst' for backward compatibility). All 32 occurrences of hardcoded 'data-analyst:' and 'ssh data-analyst' replaced with $SSH_HOST.	2026-03-15 00:03:07 +01:00
Petr	2237334b05	Make CLAUDE.md template generic and instance-aware - Remove all Keboola-specific content (metric categories, MRR/ARR refs, corporate memory, hardcoded server IP) - Add {ssh_alias}, {server_host}, {webapp_url} placeholders - Bootstrap saves .sync_connection file with instance details - sync_data.sh reads .sync_connection to substitute all placeholders - Text instructions in dashboard include .sync_connection step	2026-03-14 23:57:58 +01:00
Petr	6728da63fb	Use ssh_alias and ssh_key from config in bootstrap text instructions Replace hardcoded 'data-analyst' and '~/.ssh/data_analyst_server' in the copyBootstrapInstructions JS function with values from instance config. Pass ssh_alias and ssh_key to dashboard template context.	2026-03-14 21:06:37 +01:00
Petr	13938bf72f	Read ssh_alias and ssh_key from instance.yaml for bootstrap instructions Config reads server.ssh_alias and server.ssh_key from instance.yaml (defaults: 'data-analyst' and '~/.ssh/data_analyst_server' for backward compat). App.py substitutes {ssh_alias} and {ssh_key} in bootstrap.yaml template.	2026-03-14 21:04:51 +01:00
Petr	140cbb3cee	Make bootstrap.yaml instance-agnostic with configurable SSH alias Add {ssh_alias} and {ssh_key} placeholders so each instance can use its own SSH config name (avoids conflicts when user has multiple instances). Remove Keboola-specific sync_settings and dataset references. Simplify to single download_server_data step (rsync with scp fallback). Handle SSH alias conflicts gracefully.	2026-03-14 20:58:26 +01:00
Petr	4206b06d92	Make deploy.sh data-source agnostic with --scripts-only flag - Add --scripts-only flag for quick script/docs deployment without restart - Replace hardcoded Keboola env vars with generic loop over all known vars (supports Keboola, BigQuery, OpenMetadata, and optional services) - Make data directories conditional (Jira, notifications, corporate memory created only when relevant code/config exists) - Enable timers only when their .timer files exist on disk - Use root:data-ops ownership (works without deploy user)	2026-03-14 20:38:43 +01:00
Petr	c2681ccc86	Add cache-busting with git commit hash for static assets Flask will now include git commit hash as URL parameter (v=abc1234) for metric_modal.js and other static assets. This ensures browser doesn't cache stale JavaScript when code changes. Cache invalidation based on actual git history rather than timestamps.	2026-03-12 15:37:29 +01:00
Petr	f6000cc867	Fix: URL-encode metric FQN in catalog modal request When metric FQN contains spaces (e.g. 'Active2 Customers'), JavaScript was creating invalid URLs with literal spaces. Now properly encoding FQN with encodeURIComponent() to convert spaces to %20 before sending to API. Flask automatically decodes the path parameter back to original FQN.	2026-03-12 15:20:10 +01:00
Petr	1bcd7e4080	Fix: URL-decode metric FQN in catalog endpoint FQN can contain spaces (e.g., 'Active2 Customers') which get URL-encoded as 'Active2%20Customers' in the path parameter. Need to decode before passing to OpenMetadata API.	2026-03-12 15:18:08 +01:00
Petr	268fe07f91	Fix: Use correct OpenMetadata API field names for metrics OpenMetadata uses different field names than expected: - metricExpression instead of expression - metricType instead of type - unitOfMeasurement instead of unit - granularity instead of grain Remove 'fields' query parameter from /api/v1/metrics - returns 400 Bad Request when invalid field names are specified. Let API return full metric objects. Update parsing to extract metadata from proper OpenMetadata fields instead of relying on tags (tags are optional, fields are always present).	2026-03-12 15:16:24 +01:00
Petr	da6d605ae0	Add sample metric YAML as fallback when OpenMetadata metrics unavailable The /api/v1/metrics endpoint may not be available in all OpenMetadata instances. This sample metric provides a fallback for demonstration purposes.	2026-03-12 15:14:04 +01:00
Petr	5fc9526627	Phase 2: Replace demo YAML metrics with OpenMetadata catalog data - Add get_metric_by_fqn() to OpenMetadataClient - Add get_metrics() to CatalogEnricher with TTL caching - Implement _parse_om_metric() to extract category/grain from OpenMetadata tags - Implement _load_metrics_from_catalog() to fetch and categorize metrics - Implement _build_om_metric_detail() to convert OpenMetadata format to MetricParser JSON - Add /api/catalog/metrics/<fqn> endpoint for metric detail modal - Update _load_metrics_data() to prefer catalog over YAML fallback - Update metric_modal.js to route catalog:{fqn} to catalog API endpoint - Delete 10 demo YAML files from docs/metrics/ - Replace metric tests with new unit tests for catalog parsing functions (19 tests) Catalog metrics provide single source of truth vs maintaining demo YAML files. UI remains unchanged - only data source changes from YAML to OpenMetadata catalog.	2026-03-12 15:10:42 +01:00
Petr	be58e63394	Move profiler config to instance.yaml (KISS principle) Instead of hardcoded Python constants, load profiler settings from config: - instance.yaml: profiler section with all parameters - Defaults: fallback to sensible defaults if config not found - Centralized: all profiler tuning in one place, no code changes needed	2026-03-12 14:45:14 +01:00
Petr	c25278538c	Simplify profiler config: use single SAMPLE_SIZE parameter (KISS) Replace SAMPLE_THRESHOLD + SAMPLE_SIZE with single SAMPLE_SIZE: - If table > SAMPLE_SIZE: sample that many rows - Otherwise: use all rows Cleaner, easier to configure.	2026-03-12 14:43:23 +01:00
Petr	e2d3afade3	Log when catalog enrichment (tags/tier) are found	2026-03-12 14:34:45 +01:00
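
The multi-time daily schedule described in commits 5f27d05894 and ab99f0af92 ("daily HH:MM,HH:MM,HH:MM", where a table is due if any slot has passed and last_sync is before it) can be sketched roughly as follows. This is a minimal illustration and not the code in src/scheduler.py: the function names `parse_daily_schedule` and `is_daily_due` and their signatures are hypothetical, and the sketch assumes naive datetimes and same-day slots only.

```python
from datetime import datetime, time
from typing import List, Optional


def parse_daily_schedule(schedule: str) -> List[time]:
    """Parse 'daily HH:MM[,HH:MM...]' into a sorted list of times (hypothetical helper)."""
    prefix, _, rest = schedule.strip().partition(" ")
    if prefix != "daily" or not rest.strip():
        raise ValueError(f"not a daily schedule: {schedule!r}")
    slots = []
    for part in rest.split(","):
        hh, mm = part.strip().split(":")
        # datetime.time validates the ranges (0-23, 0-59) and raises ValueError otherwise
        slots.append(time(int(hh), int(mm)))
    return sorted(slots)


def is_daily_due(schedule: str, now: datetime, last_sync: Optional[datetime]) -> bool:
    """Return True if any of today's slots has passed and last_sync is before that slot."""
    for slot in parse_daily_schedule(schedule):
        slot_dt = datetime.combine(now.date(), slot)
        if slot_dt <= now and (last_sync is None or last_sync < slot_dt):
            return True
    return False
```

Each slot is evaluated independently, so a sync that ran after the 07:00 slot does not suppress the 13:00 slot. Under the behaviour described in f19ff10e1a, a caller would leave last_sync unchanged when a partitioned sync returns 0 new rows, so the same slot stays due on the next 15-minute tick.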


206 commits