agnes-the-ai-analyst

Author	SHA1	Message	Date
ZdenekSrotyr	816168f96b	docs: add remote query implementation plan (5 tasks) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 11:02:04 +02:00
ZdenekSrotyr	eb68e6292d	docs: fix remote query spec after code review - Address read-only LOAD uncertainty with verification step + workaround - Clarify register_bq wraps BQ logic (not delegates to register_bq_table) - Use existing max_bq_registration_rows config key name - Apply SQL blocklist to both register_bq and final sql - Define connection lifecycle (caller owns, try/finally) - Fix CLI argument handling (optional positional + --sql flag) - Document concurrency safety (Unix inode semantics) - Handle missing google-cloud-bigquery gracefully Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 10:58:25 +02:00
ZdenekSrotyr	017cf07674	docs: add design spec for remote query (extension re-attach + two-phase BQ) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 10:52:39 +02:00
ZdenekSrotyr	344d744089	feat: add 10 starter pack metrics (revenue, usage, sales, operations)	2026-04-10 19:35:28 +02:00
ZdenekSrotyr	06ac937f8b	docs: add implementation plans for porting internal features Three independent plans following TDD approach: 1. Business metrics (10 tasks) — schema v4, repository, CLI, API, starter pack, profiler integration 2. Analyst bootstrap (4 tasks) — da analyst setup, CLAUDE.md template, freshness check 3. Metadata writer (4 tasks) — column metadata repo, CLI, API, Keboola push Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 19:08:55 +02:00
ZdenekSrotyr	c57e195932	docs: fix design spec after code review Addresses all Critical and Important issues found by reviewer: - Fix schema migration details (_V3_TO_V4_MIGRATIONS, _ensure_schema chain) - Add YAML-to-DuckDB field mapping table (table→table_name) - Remove unexplained src/metrics.py from new files - Fix API endpoint URLs (table/{id} → {table_id}, /api/data/tables → /api/catalog/tables) - Commit to da analyst as top-level command (not sub-sub-command) - Fix CLAUDE.local.md path to .claude/CLAUDE.local.md - Remove duplicate --upload-local flag (--upload-only already exists) - Detail profiler refactor call sites - Add metrics API deprecation plan for catalog endpoint - Use {metric_id:path} for slash-containing IDs - Add --force flag and resume behavior for bootstrap - Specify proposals directory path - Simplify da metrics add to --file import Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 18:58:39 +02:00
ZdenekSrotyr	1ce632bc0b	docs: add design spec for porting internal features to OSS Covers business metrics in DuckDB, analyst bootstrap flow, and metadata writer — based on comparison with internal repo. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 18:49:34 +02:00
ZdenekSrotyr	6c53082295	feat: multi-instance deployment — all 14 must-have items from spec CalVer CI (release.yml) with stable/dev channels, health endpoint with version/channel/schema_version, JWT secret auto-generation with file persistence, smoke test script + Docker-in-CI, pre-migration snapshot, /api/admin/configure for headless setup, /api/admin/ discover-and-register, /setup wizard, OpenAPI snapshot test, custom connector mount support, CHANGELOG, migration safety tests, startup banner. 663 tests pass (6 new migration safety + 3 OpenAPI snapshot + 1 updated JWT test).	2026-04-10 11:57:42 +02:00
ZdenekSrotyr	cce179f114	docs: add versioned tags per channel (dev-YYYY.MM.N, stable-YYYY.MM.N)	2026-04-10 06:44:25 +02:00
ZdenekSrotyr	4ea22232ef	docs: multi-instance deployment and versioning design spec	2026-04-09 21:14:21 +02:00
ZdenekSrotyr	c8e232e43e	docs: update stale v1 docs to v2 Docker/FastAPI/DuckDB architecture - CONFIGURATION.md: remove Flask/SendGrid/WEBAPP_SECRET_KEY references, update env vars to JWT_SECRET_KEY and SESSION_SECRET, point to config/.env.template and config/instance.yaml.example - disaster-recovery.md: rewrite for Docker volumes; cover GCP disk snapshot backup/restore and full VM rebuild; drop systemd/nginx/SSH - server.md: strip rsync, systemd, nginx, Linux group, and sudo sections; keep Docker Compose operations, log viewing, health checks, sync/admin CLI, and Jira webhook procedures	2026-04-09 18:44:25 +02:00
ZdenekSrotyr	22cfbfe5fb	docs: update references to deleted files - QUICKSTART.md: replace data_description.md.example copy step with note that tables are registered via the admin API or web UI - NOTIFICATIONS.md: replace examples/ section with planned-feature note - telegram_bot.md: remove examples/notifications/ rows from deployment table and example scripts section; note feature is planned - dev_docs/README.md: remove plan-corporate-memory.md entry - duckdb_manager.py: update comment from remote_query.py to query API endpoint	2026-04-09 17:15:19 +02:00
ZdenekSrotyr	988cdb4320	docs: add production deployment sections to DEPLOYMENT.md Add GHCR pre-built images, HTTPS/Caddy, multi-instance (Terraform + manual), and instance update procedures.	2026-04-09 16:41:26 +02:00
ZdenekSrotyr	53f39bb38d	chore: clean stale docs — rewrite architecture.md, remove old plans - architecture.md rewritten for v2 (FastAPI, DuckDB, Docker) — removed all Flask/rsync/SSH/systemd references - Deleted PLAN.md and REFACTORING_PLAN.md (completed, superseded) - auto-install.md replaced with redirect to DEPLOYMENT.md - Fixed absolute paths in superpowers plan doc	2026-04-09 09:06:13 +02:00
ZdenekSrotyr	1b219cabe9	fix: remove dead PRAGMA enable_wal code DuckDB has used WAL by default since v0.8, so this pragma is not valid DuckDB syntax. Removed obsolete try-except block that attempted to enable WAL on system database initialization.	2026-04-09 06:59:57 +02:00
ZdenekSrotyr	89154d043b	chore: clean repo for public release — fix references, remove drafts - Replace padak/tmp_oss → keboola/agnes-the-ai-analyst in all docs, infra, CLI - Replace your-org/ai-data-analyst → keboola/agnes-the-ai-analyst in README, Jira docs - Remove real GCP project ID from terraform.tfvars.example - Delete internal draft documents (dev_docs/draft/) - Update infra/main.tf to clone from main branch	2026-04-08 19:27:25 +02:00
ZdenekSrotyr	79443e0df4	fix: CSV all_varchar in legacy extractor, rewrite DEPLOYMENT.md from real deploy - Legacy extractor now uses read_csv(all_varchar=true) to avoid type inference errors (e.g. seniority column typed as DOUBLE with string values) - DEPLOYMENT.md rewritten based on actual dev VM deployment experience: deploy key setup, DuckDB write locking, env reload gotchas, bootstrap flow	2026-04-08 19:09:55 +02:00
ZdenekSrotyr	92fbb88c15	chore: Docker prod config (Python 3.13, no reload), fix utcnow deprecation, update docs	2026-04-08 12:10:47 +02:00
ZdenekSrotyr	1074d5ec49	feat: implement data access control — table-level permissions Schema v3: add is_public column to table_registry (default true). src/rbac.py: can_access_table() checks admin bypass, public flag, explicit permissions, wildcard bucket permissions. API enforcement: - manifest: filters tables by user access - download: 403 if no access - catalog: filters table list - query: validates referenced tables against allowed list New admin permissions API (/api/admin/permissions) for grant/revoke. 28 access control tests + 733 total tests passing.	2026-03-31 12:33:31 +02:00
ZdenekSrotyr	18e5f0b6e8	feat: implement extract.duckdb contract — orchestrator + extractors Phase 0: extend table_registry schema (v1→v2 migration), add source_type/bucket/source_table/query_mode columns. Phase 1: SyncOrchestrator ATTACHes extract.duckdb files into master analytics.duckdb. Keboola extractor uses DuckDB extension with legacy client fallback. BigQuery extractor is remote-only via DuckDB BQ extension (no data download). 62 tests passing.	2026-03-30 20:12:56 +02:00
ZdenekSrotyr	0b9720d090	docs: rewrite core refactoring spec v2 — simplified extract.duckdb contract	2026-03-30 19:24:19 +02:00
ZdenekSrotyr	9ee7b3bd09	docs: add core refactoring design spec — DuckDB-centric extract architecture	2026-03-30 18:15:52 +02:00
ZdenekSrotyr	1287e63ed9	feat: complete system — web UI, all API endpoints, governance, admin, CLI commands Major additions: - Web UI: Jinja2 templates in FastAPI (login, dashboard, catalog, corporate memory, admin) - API: catalog profiles/metrics, telegram verify/unlink/status, admin table registry CRUD - Corporate memory governance: approve/reject/mandate/revoke/edit/batch + audit log - Sync: real DataSyncManager trigger, sync-settings, table-subscriptions - CLI: setup (init/test/deploy/verify), server (logs/restart/deploy/backup), explore - Instance config integration (instance.yaml loaded at startup) - 140 tests passing (25 new)	2026-03-27 16:52:22 +01:00
ZdenekSrotyr	07b396bfe2	docs: add refactoring plan, design spec, and gitignore updates	2026-03-27 15:42:57 +01:00
Petr	1318b74ff1	Add Corporate Memory governance — Phase 1 (data model + admin API) Add admin curation layer between AI extraction and knowledge distribution. Admins (km_admin flag in instance.yaml) can approve, reject, mandate, and revoke knowledge items. Mandatory items distribute to all targeted users automatically. Three governance modes (configurable per instance): - mandatory_only: admin controls everything, no user voting - admin_curated: admin controls, users vote as feedback signal - hybrid: mandatory from admin + optional from user voting Three approval workflows: - review_queue: nothing published without admin approval - auto_publish: items go live immediately, admin intervenes retroactively - threshold: confidence-based auto-publish (Phase 5) Includes: - 9 admin action functions (approve/reject/mandate/revoke/edit/batch/...) - 11 new admin API endpoints under /api/corporate-memory/admin/ - Immutable audit log (audit.jsonl) - Audience targeting via groups - Automatic migration of existing items to "approved" status - km_admin_required auth decorator - 69 tests covering all governance logic - Backward compatible: no config = legacy wiki behavior	2026-03-23 19:15:33 +01:00
Petr	95358448e6	Add modular LLM connector for Corporate Memory Replace hardwired Anthropic API calls with a pluggable provider system. Each deployment configures its AI provider in instance.yaml — switching between Anthropic, LiteLLM, OpenRouter, or any OpenAI-compatible proxy is a config change, not a code change. New connectors/llm/ module: - StructuredExtractor Protocol with extract_json() interface - AnthropicExtractor: direct Anthropic SDK with retry + backoff - OpenAICompatExtractor: any OpenAI-compatible proxy with three-layer structured output fallback (json_schema -> json_object -> prompt) - Configurable structured_output policy (strict/json/auto) - Custom exception hierarchy (auth/rate_limit/timeout/format/refusal) - Zero secrets in logs: no API keys, prompts, or responses logged Reviewed by: Google Gemini, Claude Sonnet, OpenAI GPT-5.4. Security audit passed with all critical findings resolved.	2026-03-23 12:08:33 +01:00
Petr	84d14da611	Fix remote query UX: file-based stdin, ssh permissions, deprecation Session testing revealed 3 issues with remote queries: 1. CLAUDE.md template recommended `cat <<HEREDOC | ssh ...` but claude_settings.json had `cat` in deny list, causing 2-3 failed attempts per query. Replaced with file-based approach: Write tool creates JSON file, then `ssh ... < file` avoids the cat deny. 2. ssh/scp commands were not in the allow list, requiring manual approval for every remote query. Added both to allow list. 3. DuckDB fetch_arrow_table() emitted DeprecationWarning on every parquet export. Replaced with .arrow().read_all(). Also added instruction for proactive hybrid analysis when remote tables are available (agent was only using local data until asked).	2026-03-21 18:41:43 +01:00
Petr	67df4acd73	Add --stdin JSON mode to avoid shell escaping nightmare Agent was failing 3x on SSH commands due to backticks (BQ table names) and single quotes (SQL string literals) getting mangled by nested shell interpretation (local -> SSH -> bash -> Python). New --stdin mode reads query spec as JSON from stdin via heredoc: cat <<'QUERY' | ssh alias 'bash remote_query.sh --stdin' {"register_bq": {"alias": "SELECT ... FROM \`table\` ..."}, "sql": "..."} QUERY Heredoc with <<'QUERY' (quoted) passes everything literally -- no escaping needed for backticks, quotes, or parentheses. Updated claude_md_template.txt to use --stdin as the primary method.	2026-03-21 12:15:50 +01:00
Petr	dce8454894	Add remote_query.sh wrapper, fix analyst SSH permissions Analyst user (foundry_e_psimecek) couldn't access /opt/data-analyst/. Added to data-ops group on server. New scripts/remote_query.sh wrapper handles env setup (PYTHONPATH, CONFIG_DIR, .env) so agents use simple: ssh alias 'bash ~/server/scripts/remote_query.sh --sql "..." --format table' Updated claude_md_template.txt to use wrapper instead of raw commands.	2026-03-21 11:58:04 +01:00
Petr	d180b2014e	Step 28: Remote query architecture for local+remote table JOINs Add src/remote_query.py CLI module enabling the AI agent to run SQL queries spanning local Parquet tables and remote BigQuery tables in a single DuckDB session on the server. Two-phase protocol: BQ sub-queries (--register-bq) fetch filtered/aggregated data, then DuckDB SQL (--sql) joins everything. Safety: COUNT(*) pre-check, memory estimation (2GB cap), row limits (500K per BQ sub-query, 100K final result). Changes: - New src/remote_query.py with CLI, BQ registration, output formatting - Add bq_entity_type field to TableConfig (view vs table routing) - Extract create_local_views() from duckdb_manager.py for reuse - Update claude_md_template.txt with remote query agent instructions - Update example configs with remote_query section and docs - 52 new tests (42 remote_query + 10 bq_entity_type), all passing	2026-03-21 11:39:15 +01:00
Petr	49adbe26ec	Move server venv setup to bootstrap, remove cross-platform pip freeze sync Server venv is created during bootstrap via SSH (same package list, installed natively on Linux). Removes sync_data.sh section that copied pip freeze output across platforms (Windows/macOS freeze is incompatible with Linux).	2026-03-15 00:59:45 +01:00
Petr	508d92771f	Generate setup instructions from bootstrap.yaml (single source of truth) - Rewrite bootstrap.yaml as clean structured YAML with steps, commands, descriptions, conditions, and notes - Add _generate_setup_instructions() in app.py that reads YAML, substitutes placeholders, and produces clipboard-ready plain text - Replace 50-line hardcoded JS string builder with single tojson variable - All setup instructions now editable in one YAML file	2026-03-15 00:37:19 +01:00
Petr	021c453ea6	Auto-create .sync_connection via printf command in bootstrap Replace 'Save this to .sync_connection' prose with actual printf command that Claude/user executes. Fix heredoc indentation in bootstrap.yaml.	2026-03-15 00:05:42 +01:00
Petr	2237334b05	Make CLAUDE.md template generic and instance-aware - Remove all Keboola-specific content (metric categories, MRR/ARR refs, corporate memory, hardcoded server IP) - Add {ssh_alias}, {server_host}, {webapp_url} placeholders - Bootstrap saves .sync_connection file with instance details - sync_data.sh reads .sync_connection to substitute all placeholders - Text instructions in dashboard include .sync_connection step	2026-03-14 23:57:58 +01:00
Petr	140cbb3cee	Make bootstrap.yaml instance-agnostic with configurable SSH alias Add {ssh_alias} and {ssh_key} placeholders so each instance can use its own SSH config name (avoids conflicts when user has multiple instances). Remove Keboola-specific sync_settings and dataset references. Simplify to single download_server_data step (rsync with scp fallback). Handle SSH alias conflicts gracefully.	2026-03-14 20:58:26 +01:00
Petr	da6d605ae0	Add sample metric YAML as fallback when OpenMetadata metrics unavailable The /api/v1/metrics endpoint may not be available in all OpenMetadata instances. This sample metric provides a fallback for demonstration purposes.	2026-03-12 15:14:04 +01:00
Petr	5fc9526627	Phase 2: Replace demo YAML metrics with OpenMetadata catalog data - Add get_metric_by_fqn() to OpenMetadataClient - Add get_metrics() to CatalogEnricher with TTL caching - Implement _parse_om_metric() to extract category/grain from OpenMetadata tags - Implement _load_metrics_from_catalog() to fetch and categorize metrics - Implement _build_om_metric_detail() to convert OpenMetadata format to MetricParser JSON - Add /api/catalog/metrics/<fqn> endpoint for metric detail modal - Update _load_metrics_data() to prefer catalog over YAML fallback - Update metric_modal.js to route catalog:{fqn} to catalog API endpoint - Delete 10 demo YAML files from docs/metrics/ - Replace metric tests with new unit tests for catalog parsing functions (19 tests) Catalog metrics provide single source of truth vs maintaining demo YAML files. UI remains unchanged - only data source changes from YAML to OpenMetadata catalog.	2026-03-12 15:10:42 +01:00
Petr	c77a6f6c2e	Fix clipped annotation badges in theme-reference.html Remove overflow:hidden from mockup containers and reposition surface/text_primary badges that were cut off at edges.	2026-03-11 14:09:04 +01:00
Petr	d438438e33	Add configurable white-label theming via instance.yaml Extend theming from 3 CSS variables (primary colors only) to 14 configurable properties covering colors, fonts, borders, and shape. All values are optional with sensible defaults. - New _theme.html include replaces duplicated inline injection - Wire theme include into all 7 templates (base, login, dashboard, catalog, admin_tables, activity_center, corporate_memory) - Conditional font loading: skip default Inter when custom font_url set - Config.theme_overrides() classmethod generates CSS variable dict - Visual theme-reference.html guide for instance configurators - Document all theme keys in instance.yaml.example	2026-03-11 13:58:58 +01:00
Petr	49559fba1b	Remove hardcoded Jira and Telemetry cards from catalog These Keboola-specific data source cards don't belong in the OSS repo. The catalog now shows only dynamic content: Core Business Data (from data_description.md) and Business Metrics (from docs/metrics/*.yml). Also update auto-install.md with Business Metrics documentation, pipeline diagram, and expanded checklist.	2026-03-10 22:48:07 +01:00
Petr	5a84473213	Add dynamic Business Metrics with sample e-commerce definitions Replace hardcoded Keboola-specific metrics card in Data Catalog with dynamic Jinja template that renders whatever metric YAMLs exist in docs/metrics/. Add 10 sample e-commerce metric definitions across 4 categories (revenue, customers, marketing, support) that align with the sample data generator tables. Key changes: - MetricParser: new category colors + dynamic sql_* field discovery - _load_metrics_data(): scans docs/metrics//.yml with prod fallback - catalog.html: 240 lines hardcoded HTML -> 35 lines Jinja loop - metric_modal.js: regex-based category class removal, new categories - 21 tests validating YAML schema, parser, and loader	2026-03-10 22:38:44 +01:00
Petr	f685dc357f	Document Data Catalog and Profiler pipeline in auto-install guide - Add architecture diagram showing data flow from instance config through profiler to webapp - Explain folder_mapping dual purpose (catalog categories + file paths) - Add Step 6c for running the profiler - Document foreign_keys for relationship diagrams - Explain profiles.json fallback for catalog header stats - Expand checklist with profiler verification steps	2026-03-10 22:14:45 +01:00
Petr	7f61ae8772	Update auto-install docs with Data Catalog setup - Split Step 6 into 6a (Generate Parquet) and 6b (Configure Data Catalog) - Document data_description.md + instance.yaml catalog categories - Uncomment data_description.md symlink in Step 3c - Add Data Catalog verification to Step 6 checklist	2026-03-10 22:00:28 +01:00
Petr	302494b632	Add --format parquet using project's ParquetManager Generator now supports --format {csv,parquet,both}. Parquet mode uses src.parquet_manager.ParquetManager for snappy compression, proper column types (DATE, TIMESTAMP, DOUBLE), and metadata. No more ad-hoc pandas conversion needed on the server.	2026-03-10 21:46:20 +01:00
Petr	44bf43535b	Add sample data generator with 9 e-commerce tables Synthetic data generator for demo/testing without real data adapter: - 9 tables: customers, products, campaigns, web_sessions, web_leads, orders, order_items, payments, support_tickets - 4 size presets: xs (1MB), s (15MB), m (150MB), l (1.5GB) - Realistic patterns: seasonality, Pareto customer distribution, segment-based behavior, referential integrity - Deterministic output via --seed parameter Also: docs/sample-data.md, updated auto-install.md with Step 6, updated CLAUDE.md (email auth provider, dual-repo architecture)	2026-03-10 12:31:14 +01:00
Petr	879bc6c44f	docs	2026-03-10 11:43:11 +01:00
Petr	495940d6b8	Rewrite auto-install guide with dual-repo architecture Document the full end-to-end workflow: OSS repo (code) + private instance repo (config/secrets). Covers SSH key isolation per repo, symlink bridging, and ongoing deployment workflow.	2026-03-10 11:38:41 +01:00
Petr	a8a9efeb60	Update auto-install docs with steps 3-4 (config + email auth)	2026-03-10 10:43:39 +01:00
Petr	f2d3d156e3	Move standalone services from server/ to services/ Extract 4 self-contained services into services/ module: - server/telegram_bot/ -> services/telegram_bot/ - server/ws_gateway/ -> services/ws_gateway/ - server/corporate_memory/ -> services/corporate_memory/ - server/session_collector.py -> services/session_collector/ Each service now has its own systemd/ directory with .service and .timer files. deploy.sh updated to auto-discover service units from services//systemd/. server/ now contains only deployment infrastructure (deploy.sh, setup scripts, bin/ management tools, sudoers, nginx config). All imports updated: webapp/app.py, server/bin/ scripts, systemd ExecStart paths.	2026-03-09 12:54:30 +01:00
Petr	38b86127ed	Branding cleanup: remove Keboola-specific references from docs and config - server/deploy.sh: KEBOOLA_ENV_FILE -> SYNC_ENV_FILE - server/ws-gateway.service, notify-bot.service: remove Keboola from descriptions - .gitignore: generic comment for data directory - CLAUDE.md, README.md, ARCHITECTURE.md: update paths from src/adapters to connectors/ - docs/DATA_SOURCES.md: update custom connector guide to connectors/ pattern - connectors/jira/README.md: keboola-analyst -> data-analyst in config paths - dev_docs/desktop-app.md: KeboolaAnalyst -> DataAnalyst branding	2026-03-09 12:22:27 +01:00

53 commits