agnes-the-ai-analyst

Author	SHA1	Message	Date
ZdenekSrotyr	0dd8b13d62	infra: add fetch-env-from-secrets.sh for VM-side .env generation Reads JWT_SECRET_KEY and KEBOOLA_STORAGE_TOKEN from Secret Manager, combines with non-secret config, writes .env with chmod 600. Run as part of VM startup or manually for rotation.	2026-04-21 16:18:35 +02:00
ZdenekSrotyr	5ad96e5f86	infra: add bootstrap-gcp.sh for per-customer GCP setup Creates agnes-deploy SA with Terraform-scoped roles, GCS tfstate bucket, and generates a JSON key. Idempotent — safe to re-run. Expanded .gitignore to block *-key.json files from ever being committed.	2026-04-21 16:18:35 +02:00
ZdenekSrotyr	30106e6a49	feat: add standalone metric YAML → DuckDB migration script	2026-04-10 19:35:36 +02:00
ZdenekSrotyr	40cca627be	fix: address Devin review round 4 — bash arithmetic, CalVer max, docs - smoke-test.sh: replace ((PASS++)) with PASS=$((PASS + 1)) to avoid set -e abort when counter is 0 (bash returns exit 1 for ((0))) - CalVer: use max(N) from existing tags instead of count, safe when tags are deleted (e.g. deprecated version cleanup) - CLAUDE.md: update schema version from v2 to v3 663 tests pass.	2026-04-10 14:39:16 +02:00
ZdenekSrotyr	6c53082295	feat: multi-instance deployment — all 14 must-have items from spec CalVer CI (release.yml) with stable/dev channels, health endpoint with version/channel/schema_version, JWT secret auto-generation with file persistence, smoke test script + Docker-in-CI, pre-migration snapshot, /api/admin/configure for headless setup, /api/admin/discover-and-register, /setup wizard, OpenAPI snapshot test, custom connector mount support, CHANGELOG, migration safety tests, startup banner. 663 tests pass (6 new migration safety + 3 OpenAPI snapshot + 1 updated JWT test).	2026-04-10 11:57:42 +02:00
ZdenekSrotyr	5e0e4ceb9e	fix: rewrite Makefile and scripts/README.md Makefile simplified to four targets (test, dev, docker, lint) aligned with the current FastAPI/Docker architecture. scripts/README.md rewritten to list only the active and migration scripts that still exist.	2026-04-09 17:16:04 +02:00
ZdenekSrotyr	22cfbfe5fb	docs: update references to deleted files - QUICKSTART.md: replace data_description.md.example copy step with note that tables are registered via the admin API or web UI - NOTIFICATIONS.md: replace examples/ section with planned-feature note - telegram_bot.md: remove examples/notifications/ rows from deployment table and example scripts section; note feature is planned - dev_docs/README.md: remove plan-corporate-memory.md entry - duckdb_manager.py: update comment from remote_query.py to query API endpoint	2026-04-09 17:15:19 +02:00
ZdenekSrotyr	f3bd378b47	chore: remove 17 dead files from v1 architecture Removes unused scripts (collect_session, generate_user_sync_configs, standalone_profiler, remote_query, update, setup_views, test_sync, activate_venv, backfill_gap, sync_config_template), legacy config (data_description.md.example), llms.txt, completed planning docs (plan-rsync-fix, plan_parquet_types_fix, plan-corporate-memory), and notification examples/ directory.	2026-04-09 17:14:06 +02:00
ZdenekSrotyr	5ee12d78e7	refactor: final cleanup — delete legacy auth, clean deps, fix hash, migrate to uv - Delete root auth/ directory (legacy Flask providers, orphaned) - Clean requirements.txt: remove Flask, gunicorn, authlib, sendgrid, anthropic, openai, argon2-cffi (9 unused deps) - Fix hash computation in orchestrator: MD5 of parquet mtime+size (CLI sync now skips unchanged tables correctly) - Migrate pip → uv in CLAUDE.md, scripts/init.sh, pyproject.toml - Sync pyproject.toml dependencies with requirements.txt 578 tests passing.	2026-03-31 19:18:30 +02:00
ZdenekSrotyr	4d1acd014a	refactor: remove legacy webapp + add missing tests + housekeeping Phase A: Close fixed issues (#7, #8, #9), add server/ user/ to .gitignore, increase extractor timeout to 30 min. Phase B: Add 10 new tests — access request lifecycle (4), CLI admin commands (5), sync subprocess trigger (1). 578 tests passing. Phase C: Delete entire webapp/ directory (24,800 lines) — legacy Flask app fully replaced by FastAPI app/. Fix auth providers to use app.instance_config instead of webapp.config. Update CLAUDE.md. Delete 6 webapp-only test files. Fix Jira service config imports.	2026-03-31 13:44:06 +02:00
ZdenekSrotyr	b0eaef88cc	refactor: delete old server infra — 4,200 lines removed Remove all legacy deployment infrastructure replaced by Docker + Kamal: - server/ directory (deploy.sh, setup.sh, webapp-setup.sh, sudoers, nginx config, systemd units, bin scripts) - scripts/sync_data.sh (replaced by da sync + API) - All services/*/systemd/ files (replaced by docker-compose) - tests/test_deploy_guard.py and tests/test_sync_data.py 688 tests passing.	2026-03-31 08:06:41 +02:00
ZdenekSrotyr	b502bd8bdd	refactor: delete old sync pipeline — 9,500 lines removed Phase 5 cleanup: remove all code replaced by extract.duckdb architecture. Deleted modules: - src/config.py (653) — replaced by DuckDB table_registry - src/parquet_manager.py (755) — replaced by DuckDB COPY TO - src/data_sync.py (734) — replaced by SyncOrchestrator - src/remote_query.py (636) — replaced by DuckDB BigQuery ATTACH - src/table_registry.py (464) — replaced by DuckDB repository - connectors/keboola/adapter.py (820) — replaced by extractor.py - connectors/bigquery/adapter.py (665) — replaced by extractor.py - connectors/bigquery/client.py (644) — replaced by DuckDB BQ extension Updated all imports in webapp, catalog_export, enricher, router, sync_settings_service, generate_sample_data. Kept keboola/client.py as fallback (removed src.config dependency). 704 tests passing.	2026-03-31 07:50:37 +02:00
ZdenekSrotyr	8bc1fceb52	feat: add migration scripts for extract.duckdb transition migrate_registry_to_duckdb.py: imports tables from data_description.md or table_registry.json into DuckDB table_registry with source columns. migrate_parquets_to_extracts.py: copies parquets to /data/extracts/ and creates extract.duckdb with _meta + views.	2026-03-30 20:21:12 +02:00
ZdenekSrotyr	64acc8d731	feat: add JSON to DuckDB migration script with tests	2026-03-27 15:09:06 +01:00
Petr	dfec39722b	Fix remote_query.sh: use analyst-readable env file GCP OS Login doesn't honor /etc/group changes for SSH sessions, so analyst can't read /opt/data-analyst/.env even after usermod. Wrapper now reads .remote_query.env from scripts dir (dataread group), falls back to .env for admin users. The env file contains only non-secret BQ config (project ID, location, data dir).	2026-03-21 11:59:57 +01:00
Petr	dce8454894	Add remote_query.sh wrapper, fix analyst SSH permissions Analyst user (foundry_e_psimecek) couldn't access /opt/data-analyst/. Added to data-ops group on server. New scripts/remote_query.sh wrapper handles env setup (PYTHONPATH, CONFIG_DIR, .env) so agents use simple: ssh alias 'bash ~/server/scripts/remote_query.sh --sql "..." --format table' Updated claude_md_template.txt to use wrapper instead of raw commands.	2026-03-21 11:58:04 +01:00
Petr	ed5a5ec706	Fix: duckdb_manager CONFIG_DIR support for server deployment find_project_root() and parse_data_description() now check CONFIG_DIR env var first when looking for data_description.md. On server deployment, data_description.md lives in instance/config/ (CONFIG_DIR), not in the OSS repo's docs/ directory.	2026-03-21 11:40:55 +01:00
Petr	d180b2014e	Step 28: Remote query architecture for local+remote table JOINs Add src/remote_query.py CLI module enabling the AI agent to run SQL queries spanning local Parquet tables and remote BigQuery tables in a single DuckDB session on the server. Two-phase protocol: BQ sub-queries (--register-bq) fetch filtered/aggregated data, then DuckDB SQL (--sql) joins everything. Safety: COUNT(*) pre-check, memory estimation (2GB cap), row limits (500K per BQ sub-query, 100K final result). Changes: - New src/remote_query.py with CLI, BQ registration, output formatting - Add bq_entity_type field to TableConfig (view vs table routing) - Extract create_local_views() from duckdb_manager.py for reuse - Update claude_md_template.txt with remote query agent instructions - Update example configs with remote_query section and docs - 52 new tests (42 remote_query + 10 bq_entity_type), all passing	2026-03-21 11:39:15 +01:00
Petr	e17dd85504	Remove hardcoded Jira/Keboola references from sync_data.sh - Silent fallback when no sync settings exist (no 'Jira disabled' message) - Generic dataset exclude/include loop driven by sync_settings.yaml - Generic cleanup loop for disabled datasets - Replaces 100+ lines of hardcoded Jira/kbc_telemetry_expert blocks	2026-03-15 01:02:37 +01:00
Petr	49adbe26ec	Move server venv setup to bootstrap, remove cross-platform pip freeze sync Server venv is created during bootstrap via SSH (same package list, installed natively on Linux). Removes sync_data.sh section that copied pip freeze output across platforms (Windows/macOS freeze is incompatible with Linux).	2026-03-15 00:59:45 +01:00
Petr	22a1bb5847	Auto-restart sync_data.sh after self-update (exec replaces process)	2026-03-15 00:53:17 +01:00
Petr	6f9de274fb	Fix: CLAUDE.md template vars overwrote $SSH_HOST used by rsync The CLAUDE.md generation section reused SSH_HOST variable name to store the server IP, overwriting the SSH alias needed for rsync. Renamed to TMPL_SSH_ALIAS/TMPL_SERVER_HOST/TMPL_WEBAPP_URL to avoid collision.	2026-03-15 00:51:49 +01:00
Petr	b0e4749b0d	Replace hardcoded 'data-analyst' SSH alias with configurable $SSH_HOST Read SSH alias from .sync_connection file at script start (default: 'data-analyst' for backward compatibility). All 32 occurrences of hardcoded 'data-analyst:' and 'ssh data-analyst' replaced with $SSH_HOST.	2026-03-15 00:03:07 +01:00
Petr	2237334b05	Make CLAUDE.md template generic and instance-aware - Remove all Keboola-specific content (metric categories, MRR/ARR refs, corporate memory, hardcoded server IP) - Add {ssh_alias}, {server_host}, {webapp_url} placeholders - Bootstrap saves .sync_connection file with instance details - sync_data.sh reads .sync_connection to substitute all placeholders - Text instructions in dashboard include .sync_connection step	2026-03-14 23:57:58 +01:00
Petr	8bb46a9e0a	Add per-partition streaming sync and hybrid query architecture Partitioned sync: iterates day-by-day instead of loading full dataset. Each partition: query BQ -> stream to disk -> free RAM. Peak ~50 MB. New helpers: _sync_single_partition, _cleanup_old_partitions, _generate_partition_dates. Config: added partition_column_type (DATE/TIMESTAMP/DATETIME), query_mode (local/remote/hybrid). DuckDB manager: hybrid architecture support (local Parquet + remote BQ tables). Data sync: skips remote tables, filters by query_mode. Tests: 113 passing (adapter, client, config, data_sync, duckdb_manager).	2026-03-12 13:20:41 +01:00
Petr	468f56092b	Add standalone DuckDB-based data profiler script Zero-dependency profiler for Parquet/CSV files producing JSON profiles with column statistics, histograms, alerts, and sample data. Supports single files, directories, composite primary keys, and optional HTML report generation.	2026-03-11 15:12:04 +01:00
Petr	302494b632	Add --format parquet using project's ParquetManager Generator now supports --format {csv,parquet,both}. Parquet mode uses src.parquet_manager.ParquetManager for snappy compression, proper column types (DATE, TIMESTAMP, DOUBLE), and metadata. No more ad-hoc pandas conversion needed on the server.	2026-03-10 21:46:20 +01:00
Petr	44bf43535b	Add sample data generator with 9 e-commerce tables Synthetic data generator for demo/testing without real data adapter: - 9 tables: customers, products, campaigns, web_sessions, web_leads, orders, order_items, payments, support_tickets - 4 size presets: xs (1MB), s (15MB), m (150MB), l (1.5GB) - Realistic patterns: seasonality, Pareto customer distribution, segment-based behavior, referential integrity - Deterministic output via --seed parameter Also: docs/sample-data.md, updated auto-install.md with Step 6, updated CLAUDE.md (email auth provider, dual-repo architecture)	2026-03-10 12:31:14 +01:00
Petr	b99ec576ca	Add self-service data onboarding system Table Registry as central source of truth (JSON) with atomic writes, optimistic locking, audit logging, and data_description.md generation. Existing readers (config.py, profiler.py) need zero changes. Phase 1 - Discovery API: - discover_tables() on DataSource ABC + Keboola implementation - admin_required decorator with server-side recomputation - GET /api/admin/discover-tables endpoint Phase 2 - Table Registry: - src/table_registry.py with CRUD, validation, migration from MD - Admin API: register/update/unregister with version locking - DELETE cascade cleans up per-user subscriptions Phase 3 - Auto-Profiling: - profile_changed_tables() for incremental profiling - Non-fatal hook in sync_all() after successful sync Phase 4 - Per-Table Subscriptions: - table_mode (all/explicit) with per-table toggles - GET/POST /api/table-subscriptions endpoints - Subscription status in catalog and dashboard views Phase 5 - Smart Sync: - Python-generated rsync filter files (not shell YAML parsing) - sync_data.sh uses --filter="merge ..." for explicit mode Phase 6 - Admin UI: - /admin/tables with discovery, registration modal, registry mgmt - Vanilla JS, matching existing design system	2026-03-09 14:25:37 +01:00
Petr	15b513266d	Merge dev_scripts/ into scripts/ Move dev_run.py and test_sync.sh from dev_scripts/ to scripts/, eliminating the separate dev_scripts directory. Update scripts README with development scripts section.	2026-03-09 13:11:36 +01:00
Petr	86edd27655	Extract Jira into connectors/jira module Move all Jira-specific code into a self-contained connector module: - 22 files moved via git mv (transform, service, webhook, scripts, systemd units, tests, docs, bin helper) - All imports updated to use connectors.jira.* paths - Jira is now conditional: auto-detected via JIRA_DOMAIN env var - Webapp registers Jira blueprint only when available - Health service monitors Jira timers only when enabled - Profiler loads Jira tables dynamically from filesystem - Sync settings uses config-driven dependency validation - Renamed keboola_platform_url -> custom_url in transform - Updated deploy.sh, sudoers-deploy, backfill_gap.sh paths - Fixed pytest.ini to skip live tests by default	2026-03-09 11:17:50 +01:00
Petr	26c4e0934d	OSS cleanup: remove internal references, harden deployment, add config env interpolation Phase 1 - Internal reference cleanup: - Delete dev_docs/meetings/ (internal meeting notes/transcripts) - Replace hardcoded usernames (padak/matejkys/dasa) with deploy/generic - Replace "Internal AI Data Analyst" with "AI Data Analyst" - Replace keboola/internal_ai_data_analyst URLs with your-org/ai-data-analyst - Replace /tmp/keboola_load/ with /tmp/data_analyst_staging/ in dev_docs Phase 2 - Deployment hardening: - Tighten sudoers wildcards to explicit paths (visudo, sudoers cp) - setup.sh creates all groups (data-ops, dataread, data-private) and deploy user - webapp-setup.sh copies sudoers-webapp from repo instead of inline definition - deploy.sh conditional copy for data_description.md (not in git for OSS) - deploy.sh ownership changed to deploy:data-ops for /data/{scripts,docs,examples} Phase 3 - Config and misc: - Add ${ENV_VAR} interpolation to config/loader.py - Expand config/instance.yaml.example with all sections (admins, deployment, auth, etc.) - Create config/.env.template for secret values - Add MIT LICENSE - Fix .gitignore: add .venv/, docs/data_description.md - Fix README.md: CSV status Planned, remove metrics/, update license text - Translate Czech comments in requirements.txt to English - Fix test_account_service.py: mock username mapping instead of relying on instance config All 118 tests pass.	2026-03-09 07:59:57 +01:00
Petr	c56905d34f	Initial commit: OSS data distribution platform Open-source AI data analyst platform extracted from internal repo. Includes data sync engine, Keboola adapter, Flask web portal, server deployment scripts, and configuration templates.	2026-03-08 23:31:28 +01:00
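
Commit 40cca627be switches the CalVer build number from a count of existing tags to max(N) + 1, so that deleting a tag cannot cause a later release to reuse a number. The sketch below illustrates the difference. The tag layout "<prefix>.<N>" (for example "2026.04.3") and both function names are assumptions for illustration; the log does not show the format release.yml actually uses.

```python
import re


def next_build_number(tags, prefix):
    """Return max(N) + 1 over tags shaped like '<prefix>.<N>' (assumed layout)."""
    pattern = re.compile(rf"^{re.escape(prefix)}\.(\d+)$")
    numbers = [int(m.group(1)) for t in tags if (m := pattern.match(t))]
    # default=0 covers the first release for this prefix.
    return max(numbers, default=0) + 1


def next_build_number_by_count(tags, prefix):
    """The count-based variant, kept only to show why it is unsafe."""
    return sum(1 for t in tags if t.startswith(prefix + ".")) + 1


# Hypothetical tag list in which 2026.04.2 was deleted.
tags = ["2026.04.1", "2026.04.3", "2026.03.7"]
print(next_build_number(tags, "2026.04"))           # max-based: no collision
print(next_build_number_by_count(tags, "2026.04"))  # count-based: reuses .3
```

With the hypothetical tags above, the count-based variant proposes a number that already exists, while the max-based variant moves past the highest tag regardless of gaps.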

33 commits
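
Commit 5ee12d78e7 describes the orchestrator's change-detection hash as an MD5 of a parquet file's mtime and size, which lets the CLI sync skip unchanged tables. A minimal sketch of that idea follows. The function name, the "mtime:size" string layout, and the use of nanosecond mtime are assumptions; the log does not show the orchestrator's actual code.

```python
import hashlib
import os


def file_fingerprint(path):
    """Cheap change marker: MD5 over the file's mtime and size, not its contents."""
    st = os.stat(path)
    # Assumed layout "<mtime_ns>:<size>"; any stable encoding of the two values works.
    payload = f"{st.st_mtime_ns}:{st.st_size}".encode("utf-8")
    return hashlib.md5(payload).hexdigest()


def needs_sync(path, previous_fingerprint):
    """True when the file's fingerprint differs from the one recorded last time."""
    return file_fingerprint(path) != previous_fingerprint
```

Hashing metadata rather than file contents keeps the check cheap for large parquet files, at the cost of missing an edit that preserves both size and mtime. MD5 is used here only as a change marker, not for security.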