agnes-the-ai-analyst

Author	SHA1	Message	Date
Petr	dfec39722b	Fix remote_query.sh: use analyst-readable env file GCP OS Login doesn't honor /etc/group changes for SSH sessions, so analyst can't read /opt/data-analyst/.env even after usermod. Wrapper now reads .remote_query.env from scripts dir (dataread group), falls back to .env for admin users. The env file contains only non-secret BQ config (project ID, location, data dir).	2026-03-21 11:59:57 +01:00
Petr	dce8454894	Add remote_query.sh wrapper, fix analyst SSH permissions Analyst user (foundry_e_psimecek) couldn't access /opt/data-analyst/. Added to data-ops group on server. New scripts/remote_query.sh wrapper handles env setup (PYTHONPATH, CONFIG_DIR, .env) so agents use simple: ssh alias 'bash ~/server/scripts/remote_query.sh --sql "..." --format table' Updated claude_md_template.txt to use wrapper instead of raw commands.	2026-03-21 11:58:04 +01:00
Petr	ed5a5ec706	Fix: duckdb_manager CONFIG_DIR support for server deployment find_project_root() and parse_data_description() now check CONFIG_DIR env var first when looking for data_description.md. On server deployment, data_description.md lives in instance/config/ (CONFIG_DIR), not in the OSS repo's docs/ directory.	2026-03-21 11:40:55 +01:00
Petr	d180b2014e	Step 28: Remote query architecture for local+remote table JOINs Add src/remote_query.py CLI module enabling the AI agent to run SQL queries spanning local Parquet tables and remote BigQuery tables in a single DuckDB session on the server. Two-phase protocol: BQ sub-queries (--register-bq) fetch filtered/aggregated data, then DuckDB SQL (--sql) joins everything. Safety: COUNT(*) pre-check, memory estimation (2GB cap), row limits (500K per BQ sub-query, 100K final result). Changes: - New src/remote_query.py with CLI, BQ registration, output formatting - Add bq_entity_type field to TableConfig (view vs table routing) - Extract create_local_views() from duckdb_manager.py for reuse - Update claude_md_template.txt with remote query agent instructions - Update example configs with remote_query section and docs - 52 new tests (42 remote_query + 10 bq_entity_type), all passing	2026-03-21 11:39:15 +01:00
Petr	e17dd85504	Remove hardcoded Jira/Keboola references from sync_data.sh - Silent fallback when no sync settings exist (no 'Jira disabled' message) - Generic dataset exclude/include loop driven by sync_settings.yaml - Generic cleanup loop for disabled datasets - Replaces 100+ lines of hardcoded Jira/kbc_telemetry_expert blocks	2026-03-15 01:02:37 +01:00
Petr	49adbe26ec	Move server venv setup to bootstrap, remove cross-platform pip freeze sync Server venv is created during bootstrap via SSH (same package list, installed natively on Linux). Removes sync_data.sh section that copied pip freeze output across platforms (Windows/macOS freeze is incompatible with Linux).	2026-03-15 00:59:45 +01:00
Petr	22a1bb5847	Auto-restart sync_data.sh after self-update (exec replaces process)	2026-03-15 00:53:17 +01:00
Petr	6f9de274fb	Fix: CLAUDE.md template vars overwrote $SSH_HOST used by rsync The CLAUDE.md generation section reused SSH_HOST variable name to store the server IP, overwriting the SSH alias needed for rsync. Renamed to TMPL_SSH_ALIAS/TMPL_SERVER_HOST/TMPL_WEBAPP_URL to avoid collision.	2026-03-15 00:51:49 +01:00
Petr	b0e4749b0d	Replace hardcoded 'data-analyst' SSH alias with configurable $SSH_HOST Read SSH alias from .sync_connection file at script start (default: 'data-analyst' for backward compatibility). All 32 occurrences of hardcoded 'data-analyst:' and 'ssh data-analyst' replaced with $SSH_HOST.	2026-03-15 00:03:07 +01:00
Petr	2237334b05	Make CLAUDE.md template generic and instance-aware - Remove all Keboola-specific content (metric categories, MRR/ARR refs, corporate memory, hardcoded server IP) - Add {ssh_alias}, {server_host}, {webapp_url} placeholders - Bootstrap saves .sync_connection file with instance details - sync_data.sh reads .sync_connection to substitute all placeholders - Text instructions in dashboard include .sync_connection step	2026-03-14 23:57:58 +01:00
Petr	8bb46a9e0a	Add per-partition streaming sync and hybrid query architecture Partitioned sync: iterates day-by-day instead of loading full dataset. Each partition: query BQ -> stream to disk -> free RAM. Peak ~50 MB. New helpers: _sync_single_partition, _cleanup_old_partitions, _generate_partition_dates. Config: added partition_column_type (DATE/TIMESTAMP/DATETIME), query_mode (local/remote/hybrid). DuckDB manager: hybrid architecture support (local Parquet + remote BQ tables). Data sync: skips remote tables, filters by query_mode. Tests: 113 passing (adapter, client, config, data_sync, duckdb_manager).	2026-03-12 13:20:41 +01:00
Petr	468f56092b	Add standalone DuckDB-based data profiler script Zero-dependency profiler for Parquet/CSV files producing JSON profiles with column statistics, histograms, alerts, and sample data. Supports single files, directories, composite primary keys, and optional HTML report generation.	2026-03-11 15:12:04 +01:00
Petr	302494b632	Add --format parquet using project's ParquetManager Generator now supports --format {csv,parquet,both}. Parquet mode uses src.parquet_manager.ParquetManager for snappy compression, proper column types (DATE, TIMESTAMP, DOUBLE), and metadata. No more ad-hoc pandas conversion needed on the server.	2026-03-10 21:46:20 +01:00
Petr	44bf43535b	Add sample data generator with 9 e-commerce tables Synthetic data generator for demo/testing without real data adapter: - 9 tables: customers, products, campaigns, web_sessions, web_leads, orders, order_items, payments, support_tickets - 4 size presets: xs (1MB), s (15MB), m (150MB), l (1.5GB) - Realistic patterns: seasonality, Pareto customer distribution, segment-based behavior, referential integrity - Deterministic output via --seed parameter Also: docs/sample-data.md, updated auto-install.md with Step 6, updated CLAUDE.md (email auth provider, dual-repo architecture)	2026-03-10 12:31:14 +01:00
Petr	b99ec576ca	Add self-service data onboarding system Table Registry as central source of truth (JSON) with atomic writes, optimistic locking, audit logging, and data_description.md generation. Existing readers (config.py, profiler.py) need zero changes. Phase 1 - Discovery API: - discover_tables() on DataSource ABC + Keboola implementation - admin_required decorator with server-side recomputation - GET /api/admin/discover-tables endpoint Phase 2 - Table Registry: - src/table_registry.py with CRUD, validation, migration from MD - Admin API: register/update/unregister with version locking - DELETE cascade cleans up per-user subscriptions Phase 3 - Auto-Profiling: - profile_changed_tables() for incremental profiling - Non-fatal hook in sync_all() after successful sync Phase 4 - Per-Table Subscriptions: - table_mode (all/explicit) with per-table toggles - GET/POST /api/table-subscriptions endpoints - Subscription status in catalog and dashboard views Phase 5 - Smart Sync: - Python-generated rsync filter files (not shell YAML parsing) - sync_data.sh uses --filter="merge ..." for explicit mode Phase 6 - Admin UI: - /admin/tables with discovery, registration modal, registry mgmt - Vanilla JS, matching existing design system	2026-03-09 14:25:37 +01:00
Petr	15b513266d	Merge dev_scripts/ into scripts/ Move dev_run.py and test_sync.sh from dev_scripts/ to scripts/, eliminating the separate dev_scripts directory. Update scripts README with development scripts section.	2026-03-09 13:11:36 +01:00
Petr	86edd27655	Extract Jira into connectors/jira module Move all Jira-specific code into a self-contained connector module: - 22 files moved via git mv (transform, service, webhook, scripts, systemd units, tests, docs, bin helper) - All imports updated to use connectors.jira.* paths - Jira is now conditional: auto-detected via JIRA_DOMAIN env var - Webapp registers Jira blueprint only when available - Health service monitors Jira timers only when enabled - Profiler loads Jira tables dynamically from filesystem - Sync settings uses config-driven dependency validation - Renamed keboola_platform_url -> custom_url in transform - Updated deploy.sh, sudoers-deploy, backfill_gap.sh paths - Fixed pytest.ini to skip live tests by default	2026-03-09 11:17:50 +01:00
Petr	26c4e0934d	OSS cleanup: remove internal references, harden deployment, add config env interpolation Phase 1 - Internal reference cleanup: - Delete dev_docs/meetings/ (internal meeting notes/transcripts) - Replace hardcoded usernames (padak/matejkys/dasa) with deploy/generic - Replace "Internal AI Data Analyst" with "AI Data Analyst" - Replace keboola/internal_ai_data_analyst URLs with your-org/ai-data-analyst - Replace /tmp/keboola_load/ with /tmp/data_analyst_staging/ in dev_docs Phase 2 - Deployment hardening: - Tighten sudoers wildcards to explicit paths (visudo, sudoers cp) - setup.sh creates all groups (data-ops, dataread, data-private) and deploy user - webapp-setup.sh copies sudoers-webapp from repo instead of inline definition - deploy.sh conditional copy for data_description.md (not in git for OSS) - deploy.sh ownership changed to deploy:data-ops for /data/{scripts,docs,examples} Phase 3 - Config and misc: - Add ${ENV_VAR} interpolation to config/loader.py - Expand config/instance.yaml.example with all sections (admins, deployment, auth, etc.) - Create config/.env.template for secret values - Add MIT LICENSE - Fix .gitignore: add .venv/, docs/data_description.md - Fix README.md: CSV status Planned, remove metrics/, update license text - Translate Czech comments in requirements.txt to English - Fix test_account_service.py: mock username mapping instead of relying on instance config All 118 tests pass.	2026-03-09 07:59:57 +01:00
Petr	c56905d34f	Initial commit: OSS data distribution platform Open-source AI data analyst platform extracted from internal repo. Includes data sync engine, Keboola adapter, Flask web portal, server deployment scripts, and configuration templates.	2026-03-08 23:31:28 +01:00

19 commits