agnes-the-ai-analyst

Author	SHA1	Message	Date
ZdenekSrotyr	865d6d657e	fix: keboola client metadata_cache_path uses DATA_DIR instead of deleted config Fixes #7 — NameError: name 'config' is not defined	2026-03-31 11:57:57 +02:00
ZdenekSrotyr	04c5aecc58	fix: update Terraform for extract.duckdb architecture - Create /data/extracts instead of /data/src_data/parquet - Add admin_email variable for SEED_ADMIN_EMAIL	2026-03-31 09:49:32 +02:00
ZdenekSrotyr	e1e2d6d903	feat: add SEED_ADMIN_EMAIL for Docker test environments app/main.py: seed admin user on startup when SEED_ADMIN_EMAIL is set. docker-compose.test.yml: expose port 8000, add seed env var.	2026-03-31 09:48:12 +02:00
ZdenekSrotyr	617e724d21	feat: add E2E test suite — API, extractor, Docker tests/conftest.py: shared fixtures (e2e_env, seeded_app, create_mock_extract) tests/test_e2e_api.py: 11 tests — full sync flow, RBAC, table lifecycle tests/test_e2e_extract.py: 6 tests — Keboola/BQ/Jira pipelines, multi-source, corrupt handling tests/test_e2e_docker.py: 3 tests — Docker health + full flow (opt-in via -m docker) Fix admin update route (duplicate id kwarg, .dict() → .model_dump()). 705 tests passing.	2026-03-31 08:18:54 +02:00
ZdenekSrotyr	b0eaef88cc	refactor: delete old server infra — 4,200 lines removed Remove all legacy deployment infrastructure replaced by Docker + Kamal: - server/ directory (deploy.sh, setup.sh, webapp-setup.sh, sudoers, nginx config, systemd units, bin scripts) - scripts/sync_data.sh (replaced by da sync + API) - All services/*/systemd/ files (replaced by docker-compose) - tests/test_deploy_guard.py and tests/test_sync_data.py 688 tests passing.	2026-03-31 08:06:41 +02:00
ZdenekSrotyr	caa60a507d	feat: add centralized RBAC module — replace Linux group auth New src/rbac.py: Role enum, hierarchy, get_user_role(), has_role(), is_admin(), is_km_admin(), has_dataset_access(), set_user_role(). webapp/auth.py: admin_required + km_admin_required now use DuckDB roles instead of Linux groups (pwd.getpwnam + sudo/data-ops check). app/auth/dependencies.py: imports Role from src/rbac.py (single source). 11 RBAC tests passing.	2026-03-31 08:04:35 +02:00
ZdenekSrotyr	9fef90a729	docs: rewrite CLAUDE.md for extract.duckdb architecture Update project structure, architecture diagram, key implementation details, development commands, and extensibility docs. Add extract service to docker-compose.yml for one-shot extraction.	2026-03-31 07:52:44 +02:00
ZdenekSrotyr	b502bd8bdd	refactor: delete old sync pipeline — 9,500 lines removed Phase 5 cleanup: remove all code replaced by extract.duckdb architecture. Deleted modules: - src/config.py (653) — replaced by DuckDB table_registry - src/parquet_manager.py (755) — replaced by DuckDB COPY TO - src/data_sync.py (734) — replaced by SyncOrchestrator - src/remote_query.py (636) — replaced by DuckDB BigQuery ATTACH - src/table_registry.py (464) — replaced by DuckDB repository - connectors/keboola/adapter.py (820) — replaced by extractor.py - connectors/bigquery/adapter.py (665) — replaced by extractor.py - connectors/bigquery/client.py (644) — replaced by DuckDB BQ extension Updated all imports in webapp, catalog_export, enricher, router, sync_settings_service, generate_sample_data. Kept keboola/client.py as fallback (removed src.config dependency). 704 tests passing.	2026-03-31 07:50:37 +02:00
ZdenekSrotyr	9f20529f10	fix: resolve 7 preexisting test failures - Remove iCloud duplicate files (test_db 2.py, src/db 2.py) - Fix metrics expression fallback to top-level field in transformer + webapp - Fix sync_data.sh rsync exception pattern for $SSH_HOST variable - Fix deploy_guard cp regex to skip shell variable expansions - Update sudoers-deploy with missing root:data-ops rules - Update CRITICAL_DIRS ownership expectations to match deploy.sh reality 913 tests passing, 0 failures.	2026-03-30 20:36:00 +02:00
ZdenekSrotyr	e2a7ee21a2	fix: Jira extract_init handles empty parquet dirs gracefully DuckDB read_parquet glob fails when no files match. Skip view creation for tables without parquet files, create views only after first write.	2026-03-30 20:28:29 +02:00
ZdenekSrotyr	8bc1fceb52	feat: add migration scripts for extract.duckdb transition migrate_registry_to_duckdb.py: imports tables from data_description.md or table_registry.json into DuckDB table_registry with source columns. migrate_parquets_to_extracts.py: copies parquets to /data/extracts/ and creates extract.duckdb with _meta + views.	2026-03-30 20:21:12 +02:00
ZdenekSrotyr	e058c71777	feat: adapt Jira connector to extract.duckdb format - New extract_init.py: creates extract.duckdb with _meta + views for 6 entity types - Update default paths to /data/extracts/jira/data/ and /data/extracts/jira/raw/ - After parquet writes, update _meta table in extract.duckdb - Trigger SyncOrchestrator.rebuild_source("jira") after successful transform	2026-03-30 20:19:27 +02:00
ZdenekSrotyr	1bf97c725c	feat: wire orchestrator into API — replace DataSyncManager sync.py: _run_sync() now calls extractor + SyncOrchestrator.rebuild() data.py: parquet lookup searches /data/extracts/ first, legacy fallback catalog.py: list tables from DuckDB table_registry instead of src.config admin.py: discover-tables uses KeboolaClient directly, remove old TableRegistry dep	2026-03-30 20:16:33 +02:00
ZdenekSrotyr	18e5f0b6e8	feat: implement extract.duckdb contract — orchestrator + extractors Phase 0: extend table_registry schema (v1→v2 migration), add source_type/bucket/source_table/query_mode columns. Phase 1: SyncOrchestrator ATTACHes extract.duckdb files into master analytics.duckdb. Keboola extractor uses DuckDB extension with legacy client fallback. BigQuery extractor is remote-only via DuckDB BQ extension (no data download). 62 tests passing.	2026-03-30 20:12:56 +02:00
ZdenekSrotyr	0b9720d090	docs: rewrite core refactoring spec v2 — simplified extract.duckdb contract	2026-03-30 19:24:19 +02:00
ZdenekSrotyr	9ee7b3bd09	docs: add core refactoring design spec — DuckDB-centric extract architecture	2026-03-30 18:15:52 +02:00
ZdenekSrotyr	a4944dba4a	feat: auto-generate JWT secret in Terraform, remove manual variable	2026-03-30 16:03:19 +02:00
ZdenekSrotyr	b6a94add67	feat: add Terraform config for GCP deployment - GCE e2-small with Ubuntu 24.04 + Docker - Static IP, firewall rules, SSD boot disk - Startup script: installs Docker, clones repo, creates .env, starts compose - Outputs: IP, SSH command, API URL, bootstrap command, CLI setup - ~7$/month for always-on server	2026-03-30 15:55:26 +02:00
ZdenekSrotyr	7b0a161d3d	fix: handle timezone-naive timestamps in health check	2026-03-30 14:19:40 +02:00
ZdenekSrotyr	bca5e91826	feat: add bootstrap endpoint + deploy skill for AI agents - POST /auth/bootstrap — creates first admin, self-deactivates after - da setup bootstrap — CLI command for agent-driven setup - da setup verify — structured health check (JSON output for agents) - cli/skills/deploy.md — complete deployment guide for AI agents - 6 bootstrap tests including full agent deployment flow simulation - 156 total tests passing	2026-03-30 14:01:01 +02:00
ZdenekSrotyr	a74f69d6b1	chore: exclude CI workflow from push (needs workflow scope)	2026-03-27 17:41:27 +01:00
ZdenekSrotyr	0b91d4ac47	feat: complete web UI + auth providers + template compatibility All 7 web pages rendering (200): /login, /dashboard, /catalog, /corporate-memory, /corporate-memory/admin, /activity-center, /admin/tables All 13 API endpoints working (200): health, sync, data, query, users, memory, scripts, settings, telegram, admin, catalog Auth providers: Google OAuth, Password (argon2), Email magic link Cookie-based JWT auth for web UI after OAuth redirect FlexDict for Flask→FastAPI template compatibility 150 tests passing	2026-03-27 17:34:39 +01:00
ZdenekSrotyr	1a7939c594	feat: add auth providers (Google OAuth, Password, Email magic link) + web UI fixes - Google OAuth with authlib + auto user creation + cookie-based JWT - Password auth with argon2 hash + setup token flow - Email magic link with SMTP/SendGrid support - Cookie-based auth for web UI (after OAuth redirect) - Dashboard template compatibility (user_info, activity, desktop status) - 150 tests passing	2026-03-27 17:07:59 +01:00
ZdenekSrotyr	fb1e60d8e1	fix: fix TemplateResponse API for Starlette compatibility Use new TemplateResponse(request, name, context) signature. Add Flask compat shims (get_flashed_messages, url_for, session).	2026-03-27 16:59:04 +01:00
ZdenekSrotyr	1287e63ed9	feat: complete system — web UI, all API endpoints, governance, admin, CLI commands Major additions: - Web UI: Jinja2 templates in FastAPI (login, dashboard, catalog, corporate memory, admin) - API: catalog profiles/metrics, telegram verify/unlink/status, admin table registry CRUD - Corporate memory governance: approve/reject/mandate/revoke/edit/batch + audit log - Sync: real DataSyncManager trigger, sync-settings, table-subscriptions - CLI: setup (init/test/deploy/verify), server (logs/restart/deploy/backup), explore - Instance config integration (instance.yaml loaded at startup) - 140 tests passing (25 new)	2026-03-27 16:52:22 +01:00
ZdenekSrotyr	c5527ec153	fix: harden script sandbox and SQL query security Fixes found by E2E QA agent: - Script sandbox: block os, sys, socket, eval, exec, open, __import__, getattr, pathlib and 20+ other dangerous patterns - SQL query: block COPY, ATTACH, read_csv, semicolons, non-SELECT - Added 24 security tests covering all attack vectors	2026-03-27 16:11:05 +01:00
ZdenekSrotyr	07b396bfe2	docs: add refactoring plan, design spec, and gitignore updates	2026-03-27 15:42:57 +01:00
ZdenekSrotyr	e0ce91ddb9	feat: add dataset permissions, script execution, Kamal config, CI/CD - SyncSettingsRepository + DatasetPermissionRepository with RBAC - Script deploy/run/undeploy API with import sandboxing - User sync settings API with permission checks - 4 CLI skills (connectors, security, notifications, corporate-memory) - Kamal production + staging configs - GitHub Actions CI + deploy workflows - 91 total tests passing	2026-03-27 15:40:11 +01:00
ZdenekSrotyr	3701130a11	feat: add Docker, CLI tool, scheduler, and agent skills - Dockerfile (uv-based) + docker-compose.yml (3 services) - CLI tool 'da' with commands: auth, sync, query, status, admin, diagnose, skills - Scheduler sidecar service (replaces systemd timers) - pyproject.toml for uv distribution - Built-in skills (setup, troubleshoot) for AI agents - 17 CLI tests, 75 total tests passing	2026-03-27 15:30:03 +01:00
ZdenekSrotyr	a3918d3833	feat: add FastAPI server with auth, RBAC, and all API endpoints - JWT auth with role-based access control (viewer/analyst/admin/km_admin) - Endpoints: health, sync manifest, data download, query, users CRUD, corporate memory, session/artifact upload - 18 API tests covering auth, RBAC, all endpoints	2026-03-27 15:19:18 +01:00
ZdenekSrotyr	64acc8d731	feat: add JSON to DuckDB migration script with tests	2026-03-27 15:09:06 +01:00
ZdenekSrotyr	79b0b66f2e	feat: add DuckDB state layer with all repository classes - src/db.py: schema with 14 tables matching design spec - 7 repository classes: SyncState, Users, Knowledge, Audit, Telegram, PendingCode, Script, TableRegistry, Profiles - 37 tests covering all CRUD operations	2026-03-27 15:06:55 +01:00
ZdenekSrotyr	f76411c603	feat: add DuckDB state layer with schema management Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 13:55:54 +01:00
Petr	eb7e5bdf8f	Add data freshness indicators and remote table visibility to UI - Fix sync_state.json parsing: derive last_updated from table last_sync timestamps when root-level field is missing (flat format support) - Parse ALL YAML blocks from data_description.md (was only first block) - Show remote tables (daily_deal_traffic) in catalog with "Live" badge - Show per-table sync timestamps and Local/Live query mode badges - Add data freshness note to Business Metrics section - Dashboard: fix "Not yet synced" bug, show local/live table breakdown	2026-03-25 16:24:26 +01:00
Petr	a667b4e32f	Fix profiler crash for remote-only tables without primary_key Same issue as config.py - profiler's TableInfo and parser required primary_key and sync_strategy, breaking auto-profile after sync when daily_deal_traffic (remote-only) is in config.	2026-03-25 14:47:00 +01:00
Petr	4ebb3fc7b2	Fix data sync crash: make primary_key and sync_strategy optional Remote-only tables (query_mode="remote") like daily_deal_traffic don't need primary_key or sync_strategy. The parser used hard lookups (table_data["primary_key"]) causing KeyError and breaking all data sync since 2026-03-21. Changes: - TableConfig: default primary_key="" and sync_strategy="none" - Parser: use .get() with defaults instead of [] lookups - Validator: add "none" as valid sync_strategy	2026-03-25 14:43:22 +01:00
Petr	74ecf66f80	Increase knowledge item content limit from 500 to 1000 chars	2026-03-24 00:12:15 +01:00
Petr	0560bbc127	Rename Mandate button to Make Mandatory	2026-03-23 19:44:08 +01:00
Petr	e85d296b0a	Add Corporate Memory admin review queue UI (Phase 2) Admin page at /corporate-memory/admin with three tabs: - Review Queue: pending items with approve/mandate/reject + batch ops - All Items: status filter, promote/demote/revoke actions - Audit Log: filterable action history table Features: - Keyboard shortcuts (j/k navigate, a/r/m = approve/reject/mandate) - Inline mandate form (mandatory reason + audience targeting) - Toast notifications on action success/error - Pending count badge on main Corporate Memory page - Matches existing visual design (CSS variables, card styles)	2026-03-23 19:32:33 +01:00
Petr	1318b74ff1	Add Corporate Memory governance — Phase 1 (data model + admin API) Add admin curation layer between AI extraction and knowledge distribution. Admins (km_admin flag in instance.yaml) can approve, reject, mandate, and revoke knowledge items. Mandatory items distribute to all targeted users automatically. Three governance modes (configurable per instance): - mandatory_only: admin controls everything, no user voting - admin_curated: admin controls, users vote as feedback signal - hybrid: mandatory from admin + optional from user voting Three approval workflows: - review_queue: nothing published without admin approval - auto_publish: items go live immediately, admin intervenes retroactively - threshold: confidence-based auto-publish (Phase 5) Includes: - 9 admin action functions (approve/reject/mandate/revoke/edit/batch/...) - 11 new admin API endpoints under /api/corporate-memory/admin/ - Immutable audit log (audit.jsonl) - Audience targeting via groups - Automatic migration of existing items to "approved" status - km_admin_required auth decorator - 69 tests covering all governance logic - Backward compatible: no config = legacy wiki behavior	2026-03-23 19:15:33 +01:00
Petr	c04791b702	Suppress httpcore debug logging in LLM connector	2026-03-23 12:57:35 +01:00
Petr	f619fadc42	Fix SSL verification and suppress OpenAI SDK debug logging - Add verify_ssl config option for corporate proxies with self-signed certs - Suppress openai/httpx debug loggers that dump full request bodies (including prompt content) — security requirement	2026-03-23 12:56:04 +01:00
Petr	95358448e6	Add modular LLM connector for Corporate Memory Replace hardwired Anthropic API calls with a pluggable provider system. Each deployment configures its AI provider in instance.yaml — switching between Anthropic, LiteLLM, OpenRouter, or any OpenAI-compatible proxy is a config change, not a code change. New connectors/llm/ module: - StructuredExtractor Protocol with extract_json() interface - AnthropicExtractor: direct Anthropic SDK with retry + backoff - OpenAICompatExtractor: any OpenAI-compatible proxy with three-layer structured output fallback (json_schema -> json_object -> prompt) - Configurable structured_output policy (strict/json/auto) - Custom exception hierarchy (auth/rate_limit/timeout/format/refusal) - Zero secrets in logs: no API keys, prompts, or responses logged Reviewed by: Google Gemini, Claude Sonnet, OpenAI GPT-5.4. Security audit passed with all critical findings resolved.	2026-03-23 12:08:33 +01:00
Petr	84d14da611	Fix remote query UX: file-based stdin, ssh permissions, deprecation Session testing revealed 3 issues with remote queries: 1. CLAUDE.md template recommended `cat <<HEREDOC | ssh ...` but claude_settings.json had `cat` in deny list, causing 2-3 failed attempts per query. Replaced with file-based approach: Write tool creates JSON file, then `ssh ... < file` avoids the cat deny. 2. ssh/scp commands were not in the allow list, requiring manual approval for every remote query. Added both to allow list. 3. DuckDB fetch_arrow_table() emitted DeprecationWarning on every parquet export. Replaced with .arrow().read_all(). Also added instruction for proactive hybrid analysis when remote tables are available (agent was only using local data until asked).	2026-03-21 18:41:43 +01:00
Petr	8c6c162417	Fix: --sql not required when --stdin is used argparse was rejecting --stdin mode because --sql was required=True. Changed to required=False with runtime validation in main().	2026-03-21 12:17:02 +01:00
Petr	67df4acd73	Add --stdin JSON mode to avoid shell escaping nightmare Agent was failing 3x on SSH commands due to backticks (BQ table names) and single quotes (SQL string literals) getting mangled by nested shell interpretation (local -> SSH -> bash -> Python). New --stdin mode reads query spec as JSON from stdin via heredoc: cat <<'QUERY' | ssh alias 'bash remote_query.sh --stdin' {"register_bq": {"alias": "SELECT ... FROM `table` ..."}, "sql": "..."} QUERY Heredoc with <<'QUERY' (quoted) passes everything literally -- no escaping needed for backticks, quotes, or parentheses. Updated claude_md_template.txt to use --stdin as the primary method.	2026-03-21 12:15:50 +01:00
Petr	39763ea5a2	Fix: load instance.yaml without requiring webapp secrets Analysts don't have WEBAPP_SECRET_KEY, so load_instance_config() validation failed with noisy warnings. Now reads instance.yaml directly with yaml.safe_load, skipping secret validation.	2026-03-21 12:01:41 +01:00
Petr	dfec39722b	Fix remote_query.sh: use analyst-readable env file GCP OS Login doesn't honor /etc/group changes for SSH sessions, so analyst can't read /opt/data-analyst/.env even after usermod. Wrapper now reads .remote_query.env from scripts dir (dataread group), falls back to .env for admin users. The env file contains only non-secret BQ config (project ID, location, data dir).	2026-03-21 11:59:57 +01:00
Petr	dce8454894	Add remote_query.sh wrapper, fix analyst SSH permissions Analyst user (foundry_e_psimecek) couldn't access /opt/data-analyst/. Added to data-ops group on server. New scripts/remote_query.sh wrapper handles env setup (PYTHONPATH, CONFIG_DIR, .env) so agents use simple: ssh alias 'bash ~/server/scripts/remote_query.sh --sql "..." --format table' Updated claude_md_template.txt to use wrapper instead of raw commands.	2026-03-21 11:58:04 +01:00
Petr	ed5a5ec706	Fix: duckdb_manager CONFIG_DIR support for server deployment find_project_root() and parse_data_description() now check CONFIG_DIR env var first when looking for data_description.md. On server deployment, data_description.md lives in instance/config/ (CONFIG_DIR), not in the OSS repo's docs/ directory.	2026-03-21 11:40:55 +01:00


148 commits
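
Commits 67df4acd73 and 8c6c162417 describe a --stdin JSON mode in which --sql is no longer required by argparse and is validated at runtime instead. A minimal sketch of that pattern follows. It is not the repository's code: the function names build_parser and resolve_query and the error messages are assumptions made for illustration. Only the flag names --sql and --stdin and the JSON keys register_bq and sql come from the commit messages.

```python
import argparse
import json
import sys


def build_parser() -> argparse.ArgumentParser:
    """Build a parser where --sql is optional so that --stdin can be used alone."""
    parser = argparse.ArgumentParser(description="remote query (illustrative sketch)")
    # required=False: argparse must not reject invocations that only pass --stdin.
    parser.add_argument("--sql", required=False, default=None,
                        help="SQL text passed directly on the command line")
    parser.add_argument("--stdin", action="store_true",
                        help="read the query spec as JSON from standard input")
    return parser


def resolve_query(argv, stdin=None):
    """Return the query spec as a dict, validating the flags at runtime.

    `stdin` is a hypothetical injection point for tests; it defaults to sys.stdin.
    """
    parser = build_parser()
    args = parser.parse_args(argv)

    if args.stdin:
        # JSON arrives untouched by shell quoting, so backticks and single
        # quotes inside the SQL need no escaping.
        spec = json.load(stdin if stdin is not None else sys.stdin)
        if "sql" not in spec:
            parser.error('JSON spec on stdin must contain a "sql" key')
        return spec

    if not args.sql:
        # Runtime validation replaces the former required=True on --sql.
        parser.error("either --sql or --stdin is required")
    return {"sql": args.sql}


if __name__ == "__main__":
    print(json.dumps(resolve_query(sys.argv[1:])))
```

Saved under a hypothetical name such as sketch.py, it could be driven the way the commit message describes, with a quoted heredoc (<<'QUERY') piped into `python sketch.py --stdin`, so that the JSON reaches the parser byte for byte.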