agnes-the-ai-analyst

Author	SHA1	Message	Date
ZdenekSrotyr	2ad8828f8c	fix: stdin register_bq parsing, separate BQ SQL validation - cli/commands/query.py: --stdin mode now reads register_bq from the JSON payload and merges it into the register_bq option list, matching the documented {"register_bq": {...}, "sql": "..."} contract. - src/remote_query.py: add _validate_bq_sql() with a narrower blocklist (writes only); register_bq() now calls _validate_bq_sql() so legitimate BQ operations like INFORMATION_SCHEMA, CALL, IMPORT are not blocked. The final DuckDB execute() path still uses the full _validate_sql(). - tests/test_remote_query.py: add TestValidateBqSql covering allowed INFORMATION_SCHEMA queries and blocked write operations.	2026-04-11 19:31:39 +02:00
ZdenekSrotyr	f4129dc87d	fix: alias validation, url escaping, read-only CLI, blocklist comment Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-11 11:28:27 +02:00
ZdenekSrotyr	ed43feb4e6	feat: add POST /api/query/hybrid endpoint for two-phase BQ+DuckDB queries	2026-04-11 11:09:42 +02:00
ZdenekSrotyr	d605e7d95f	feat: add --register-bq and --stdin to da query for hybrid BQ+local queries Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-11 11:09:11 +02:00
ZdenekSrotyr	86bbb8fce4	feat: add RemoteQueryEngine with BQ registration and safety limits Two-phase query engine: Phase 1 registers BQ query results as DuckDB Arrow views (with COUNT pre-check, row/memory limits, Storage API fallback); Phase 2 executes validated SQL against DuckDB with result serialization and truncation. 25 tests covering all branches.	2026-04-11 11:07:08 +02:00
ZdenekSrotyr	0a69814fca	fix: re-attach remote extensions in get_analytics_db_readonly() Add _reattach_remote_extensions() helper that reads _remote_attach tables from attached extract.duckdb files and LOADs the corresponding DuckDB extensions, so BigQuery and other remote views resolve correctly in read-only analytics connections.	2026-04-11 11:04:04 +02:00
ZdenekSrotyr	4b4c071959	fix: async httpx in metadata push, guard access_token, add push test - Replace synchronous httpx.post() with async httpx.AsyncClient in push_metadata_to_source endpoint to avoid blocking the event loop - Guard data["access_token"] in CLI analyst setup with .get() and a clear error message on missing key - Add test_push_non_keboola_table_fails and test_push_keboola_table to TestMetadataAPI, covering 400/404 path and the happy path with mocked async httpx	2026-04-11 08:33:10 +02:00
ZdenekSrotyr	126d151413	fix: address code review — path injection, multi-table search, metrics import API, error handling - Validate view names with _SAFE_IDENTIFIER regex and check path traversal in _initialize_duckdb() - find_by_table() and get_table_map() now also search the tables[] array field - Add POST /api/admin/metrics/import endpoint for YAML file upload - Replace generic except in _connect_to_instance() with specific HTTPStatusError/TimeoutException handlers - Generate .claude/settings.json in _generate_claude_md() bootstrap - Update test_find_by_table and test_get_table_map to cover tables[] array lookups - Add test_import_metrics_yaml in TestMetricsAPI	2026-04-10 19:56:00 +02:00
ZdenekSrotyr	847b48f3af	feat: add da analyst status and returning-session freshness check	2026-04-10 19:44:07 +02:00
ZdenekSrotyr	d0cdfcf8c1	feat: add column metadata API with Keboola push support	2026-04-10 19:44:03 +02:00
ZdenekSrotyr	a3531f0ead	feat: add da analyst setup command with bootstrap flow	2026-04-10 19:43:36 +02:00
ZdenekSrotyr	89552a00f0	feat: add da admin metadata-show and metadata-apply commands	2026-04-10 19:42:47 +02:00
ZdenekSrotyr	bf90f06774	feat: add ColumnMetadataRepository with CRUD and proposal import	2026-04-10 19:41:53 +02:00
ZdenekSrotyr	488b79f4d1	fix: use SCHEMA_VERSION constant in e2e migration test Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 19:39:19 +02:00
ZdenekSrotyr	344d744089	feat: add 10 starter pack metrics (revenue, usage, sales, operations)	2026-04-10 19:35:28 +02:00
ZdenekSrotyr	5cf0df77fc	feat: add Metrics API endpoints (GET/POST/DELETE) with admin auth - New app/api/metrics.py: GET /api/metrics, GET /api/metrics/{id:path}, POST /api/admin/metrics (201), DELETE /api/admin/metrics/{id:path} - Add require_admin dependency to app/auth/dependencies.py - Register metrics_router in app/main.py before web_router - Deprecate GET /api/catalog/metrics/{path} with 301 redirect to new endpoint - 7 new tests in TestMetricsAPI covering CRUD, 404, RBAC, category filter	2026-04-10 19:32:13 +02:00
ZdenekSrotyr	5cddb5573a	feat: add da metrics CLI subcommand (list, show, import, export, validate) Implements Task 4 — five Typer commands under `da metrics`: - list/show use api_get() to query the server API - import/export/validate access DuckDB directly via MetricRepository and TableRegistryRepository (no server required) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-10 19:28:51 +02:00
ZdenekSrotyr	a65de8574e	feat: add import_from_yaml and export_to_yaml to MetricRepository Adds YAML-based bulk import/export to MetricRepository, supporting list-wrapped and plain-dict YAML formats, table→table_name field mapping, and sql_by_* → sql_variants collection (and reverse on export). All 24 tests pass. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-10 19:25:11 +02:00
ZdenekSrotyr	88d536ca29	feat: add MetricRepository with full CRUD and search for metric_definitions Implements MetricRepository following the table_registry pattern — raw SQL, dict returns, ON CONFLICT upsert, and json.dumps for sql_variants/validation. Includes 18 tests covering create, read, list, update, delete, find_by_table, find_by_synonym, and get_table_map.	2026-04-10 19:21:25 +02:00
ZdenekSrotyr	cc1445f7ed	fix: use SCHEMA_VERSION constant in v3-to-v4 migration test Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 19:18:57 +02:00
ZdenekSrotyr	bc394bd266	feat: schema migration v3→v4 with metric_definitions and column_metadata tables Add SCHEMA_VERSION = 4, _V3_TO_V4_MIGRATIONS list, and if current < 4 block in _ensure_schema(). Both new tables are also added to _SYSTEM_SCHEMA for fresh installs. Tests cover fresh install, all columns, and v3→v4 migration path.	2026-04-10 19:14:32 +02:00
ZdenekSrotyr	6c53082295	feat: multi-instance deployment — all 14 must-have items from spec CalVer CI (release.yml) with stable/dev channels, health endpoint with version/channel/schema_version, JWT secret auto-generation with file persistence, smoke test script + Docker-in-CI, pre-migration snapshot, /api/admin/configure for headless setup, /api/admin/ discover-and-register, /setup wizard, OpenAPI snapshot test, custom connector mount support, CHANGELOG, migration safety tests, startup banner. 663 tests pass (6 new migration safety + 3 OpenAPI snapshot + 1 updated JWT test).	2026-04-10 11:57:42 +02:00
ZdenekSrotyr	cf59abe6dd	fix: update tests to provide password after OAuth token bypass fix	2026-04-09 16:35:15 +02:00
ZdenekSrotyr	2043594670	fix: restrict script execution endpoints to analyst/admin roles deploy, run, and run-deployed require analyst; undeploy requires admin. Update test to use admin token for undeploy.	2026-04-09 16:31:42 +02:00
ZdenekSrotyr	449053bf8a	fix: enforce per-table access control on catalog profile endpoints Add can_access_table check to GET /api/catalog/profile/{table_name} and POST /api/catalog/profile/{table_name}/refresh, returning 403 for unauthorized tables. Update test_api_complete to cover new 403 behaviour and fix the existing 404 test to use admin token.	2026-04-09 16:30:24 +02:00
ZdenekSrotyr	ad6b3a96e4	fix: enforce role guards on admin web pages Add require_role(Role.ADMIN) to /admin/tables and /admin/permissions, and require_role(Role.KM_ADMIN) to /corporate-memory/admin so that non-admin users receive 403 instead of being served the page. Fix admin_cookie test fixture to supply a password_hash (required since the /auth/token endpoint blocks passwordless requests). Add analyst fixture and TestAdminRoleGuards tests verifying analysts get 403 and admins get 200 on the protected routes.	2026-04-09 16:30:13 +02:00
ZdenekSrotyr	3205a8d300	fix: block /auth/token for OAuth-only users without password_hash Users without a password_hash (Google OAuth / magic-link accounts) could obtain a JWT by simply posting their email to /auth/token. Add an else clause that rejects such requests with 401, directing them to their configured auth provider. Update and extend tests accordingly.	2026-04-09 16:29:47 +02:00
ZdenekSrotyr	55515266ea	fix: block DuckDB metadata functions and relative paths in query endpoint Add information_schema, duckdb_* introspection functions, pragma_* functions, and relative path traversal patterns to the SQL blocklist so users cannot enumerate schema metadata regardless of RBAC. Add six corresponding tests.	2026-04-09 16:29:11 +02:00
ZdenekSrotyr	afa84f6585	fix: web UI smoke tests — reset DuckDB singleton, get token via API	2026-04-09 07:18:17 +02:00
ZdenekSrotyr	5131816a5b	test: add missing coverage for web UI, Jira extract, instance config, and concurrent rebuild - tests/test_web_ui.py: smoke tests for all authenticated web pages (login, dashboard, catalog, corporate-memory, activity-center, admin/tables, admin/permissions) - tests/test_jira_service.py: unit tests for extract_init and update_meta in the Jira connector - tests/test_instance_config.py: verifies get_instance_name() returns a string when config file is absent - tests/test_orchestrator.py: concurrent rebuild test asserting rebuild succeeds while a read-only connection holds the analytics DB	2026-04-09 07:15:14 +02:00
ZdenekSrotyr	8df8183a9f	feat: add 50 MB upload size limit to session and artifact endpoints Rejects files exceeding MAX_UPLOAD_SIZE with HTTP 413 before writing to disk.	2026-04-09 07:14:16 +02:00
ZdenekSrotyr	8e9a0c367a	fix: replace os.environ direct assignment with monkeypatch.setenv in test fixtures Prevents environment variable leaking between tests. All DATA_DIR, JWT_SECRET_KEY, and SCRIPT_TIMEOUT assignments in fixtures now use monkeypatch.setenv() which auto-reverts after each test. Removes manual os.environ.pop() cleanup lines.	2026-04-09 07:11:36 +02:00
ZdenekSrotyr	53a9e838f9	feat: add graceful shutdown handler - Add close_system_db() function in src/db.py to cleanly close shared DB connection - Add lifespan context manager in app/main.py to trigger shutdown on app exit - Integrate lifespan into FastAPI app initialization - All API tests pass (77/77)	2026-04-09 07:03:45 +02:00
ZdenekSrotyr	1b3acce7e9	fix: replace substring table access check with word-boundary regex Replace substring matching with word-boundary regex in query endpoint's table access validation. Prevents false positives where short table names like 'id' would block any query containing the word. Uses re.escape() to safely handle special characters in table names. - Import re module at top - Use regex pattern with word boundaries (\b) for matching - Add tests to verify no false positives and proper blocking	2026-04-09 07:00:48 +02:00
ZdenekSrotyr	3e9c347cf1	fix: validate extract dir name in get_analytics_db_readonly to prevent SQL injection Adds _SAFE_IDENTIFIER regex guard before ATTACHing extract.duckdb files in the read-only analytics connection, matching the same fix already applied in the orchestrator. Adds test coverage for malicious directory names.	2026-04-09 06:57:31 +02:00
ZdenekSrotyr	e425d4baa5	fix: handle WAL files in atomic swap to prevent DB corruption Add _atomic_swap_db helper that removes stale WAL files before and after moving the temp DuckDB into place. Apply CHECKPOINT before close in both orchestrator and Keboola extractor so DuckDB flushes WAL before the swap.	2026-04-09 06:57:29 +02:00
ZdenekSrotyr	3321d2e266	security: reduce JWT expiry to 24h and add jti claim Tokens previously lasted 30 days with no revocation path. Expiry is now 24 hours and every token carries a unique jti (UUID hex) to support future revocation checks.	2026-04-09 06:57:23 +02:00
ZdenekSrotyr	23ae6a602c	security: harden query endpoint SQL blocklist and disable external access Expand blocked keywords to cover parquet_scan, read_csv_auto, query_table, iceberg_scan, delta_scan, call, URL schemes (http/https/s3/gcs), and additional file-scan functions. Set enable_external_access=false on the non-read-only analytics connection path. Add three new tests covering parquet_scan, read_csv_auto, and query_table blocking.	2026-04-09 06:54:58 +02:00
ZdenekSrotyr	4aa97c23d2	fix: raise RuntimeError on missing JWT_SECRET_KEY in non-test environments Prevents production deployments from silently using a hardcoded default secret. TESTING=1 still resolves to a built-in test key so the existing test suite is unaffected. Adds a test that verifies the RuntimeError is raised when neither JWT_SECRET_KEY nor TESTING is set.	2026-04-09 06:54:29 +02:00
ZdenekSrotyr	0d3ab5060c	fix: reject unsafe SQL identifiers in orchestrator Adds _validate_identifier() with ^[a-zA-Z_][a-zA-Z0-9_]{0,63}$ regex and applies it to source_name (directory names), table_name (_meta rows), and alias/extension (_remote_attach rows) before any SQL interpolation. Adds two tests covering SQL-injection directory names and malicious _meta entries.	2026-04-09 06:51:07 +02:00
ZdenekSrotyr	cb9c566d07	fix: rebuild_source delegates to full rebuild to preserve all source views _do_rebuild_source was creating a fresh temp DB with only one source, then atomically replacing analytics.duckdb — wiping views from every other source. Now it delegates to _do_rebuild so all extract dirs are re-attached in a single pass. Adds test_rebuild_source_preserves_other_sources to guard the regression.	2026-04-09 06:48:25 +02:00
ZdenekSrotyr	94c6b0f839	fix: require password verification when user has password_hash in /auth/token Previously the password check was gated on both user.password_hash and request.password being truthy, so an attacker could omit the password field (which defaults to "") and receive a valid JWT. Now any user with a stored hash must supply a non-empty password that passes argon2 verification. Adds six TestTokenEndpoint tests covering empty, missing, wrong, and correct password, plus no-hash user and unknown user cases.	2026-04-09 06:44:31 +02:00
ZdenekSrotyr	3ba207a7f8	feat: add _remote_attach to BigQuery extractor, support token-less ATTACH in orchestrator BigQuery extension handles auth via GOOGLE_APPLICATION_CREDENTIALS env var, so _remote_attach uses empty token_env. Orchestrator now supports both token-based (Keboola) and env-based (BigQuery) authentication modes.	2026-04-08 18:13:31 +02:00
ZdenekSrotyr	06e1cf0a8d	feat: generic _remote_attach contract for remote DuckDB extension views Extractors with remote tables now write a _remote_attach table into extract.duckdb so the orchestrator can re-ATTACH external extensions at query time. The mechanism is source-agnostic — any connector can use it. - Keboola extractor writes _remote_attach + creates views on kbc.* - Orchestrator reads _remote_attach, installs extension, reads token from env - Graceful degradation: missing token → warning, local tables still work	2026-04-08 18:10:12 +02:00
ZdenekSrotyr	05a1b452e9	security: harden query (read-only DB), uploads (path sanitization), scripts (AST validation)	2026-04-08 12:09:19 +02:00
ZdenekSrotyr	67a1e0bb45	feat: Jira webhook FastAPI adapter — replaces Flask Blueprint	2026-04-08 07:04:50 +02:00
ZdenekSrotyr	4d1acd014a	refactor: remove legacy webapp + add missing tests + housekeeping Phase A: Close fixed issues (#7, #8, #9), add server/ user/ to .gitignore, increase extractor timeout to 30 min. Phase B: Add 10 new tests — access request lifecycle (4), CLI admin commands (5), sync subprocess trigger (1). 578 tests passing. Phase C: Delete entire webapp/ directory (24,800 lines) — legacy Flask app fully replaced by FastAPI app/. Fix auth providers to use app.instance_config instead of webapp.config. Update CLAUDE.md. Delete 6 webapp-only test files. Fix Jira service config imports.	2026-03-31 13:44:06 +02:00
ZdenekSrotyr	1074d5ec49	feat: implement data access control — table-level permissions Schema v3: add is_public column to table_registry (default true). src/rbac.py: can_access_table() checks admin bypass, public flag, explicit permissions, wildcard bucket permissions. API enforcement: - manifest: filters tables by user access - download: 403 if no access - catalog: filters table list - query: validates referenced tables against allowed list New admin permissions API (/api/admin/permissions) for grant/revoke. 28 access control tests + 733 total tests passing.	2026-03-31 12:33:31 +02:00
ZdenekSrotyr	617e724d21	feat: add E2E test suite — API, extractor, Docker tests/conftest.py: shared fixtures (e2e_env, seeded_app, create_mock_extract) tests/test_e2e_api.py: 11 tests — full sync flow, RBAC, table lifecycle tests/test_e2e_extract.py: 6 tests — Keboola/BQ/Jira pipelines, multi-source, corrupt handling tests/test_e2e_docker.py: 3 tests — Docker health + full flow (opt-in via -m docker) Fix admin update route (duplicate id kwarg, .dict() → .model_dump()). 705 tests passing.	2026-03-31 08:18:54 +02:00
ZdenekSrotyr	b0eaef88cc	refactor: delete old server infra — 4,200 lines removed Remove all legacy deployment infrastructure replaced by Docker + Kamal: - server/ directory (deploy.sh, setup.sh, webapp-setup.sh, sudoers, nginx config, systemd units, bin scripts) - scripts/sync_data.sh (replaced by da sync + API) - All services/*/systemd/ files (replaced by docker-compose) - tests/test_deploy_guard.py and tests/test_sync_data.py 688 tests passing.	2026-03-31 08:06:41 +02:00
ZdenekSrotyr	caa60a507d	feat: add centralized RBAC module — replace Linux group auth New src/rbac.py: Role enum, hierarchy, get_user_role(), has_role(), is_admin(), is_km_admin(), has_dataset_access(), set_user_role(). webapp/auth.py: admin_required + km_admin_required now use DuckDB roles instead of Linux groups (pwd.getpwnam + sudo/data-ops check). app/auth/dependencies.py: imports Role from src/rbac.py (single source). 11 RBAC tests passing.	2026-03-31 08:04:35 +02:00
ZdenekSrotyr	b502bd8bdd	refactor: delete old sync pipeline — 9,500 lines removed Phase 5 cleanup: remove all code replaced by extract.duckdb architecture. Deleted modules: - src/config.py (653) — replaced by DuckDB table_registry - src/parquet_manager.py (755) — replaced by DuckDB COPY TO - src/data_sync.py (734) — replaced by SyncOrchestrator - src/remote_query.py (636) — replaced by DuckDB BigQuery ATTACH - src/table_registry.py (464) — replaced by DuckDB repository - connectors/keboola/adapter.py (820) — replaced by extractor.py - connectors/bigquery/adapter.py (665) — replaced by extractor.py - connectors/bigquery/client.py (644) — replaced by DuckDB BQ extension Updated all imports in webapp, catalog_export, enricher, router, sync_settings_service, generate_sample_data. Kept keboola/client.py as fallback (removed src.config dependency). 704 tests passing.	2026-03-31 07:50:37 +02:00
ZdenekSrotyr	9f20529f10	fix: resolve 7 preexisting test failures - Remove iCloud duplicate files (test_db 2.py, src/db 2.py) - Fix metrics expression fallback to top-level field in transformer + webapp - Fix sync_data.sh rsync exception pattern for $SSH_HOST variable - Fix deploy_guard cp regex to skip shell variable expansions - Update sudoers-deploy with missing root:data-ops rules - Update CRITICAL_DIRS ownership expectations to match deploy.sh reality 913 tests passing, 0 failures.	2026-03-30 20:36:00 +02:00
ZdenekSrotyr	18e5f0b6e8	feat: implement extract.duckdb contract — orchestrator + extractors Phase 0: extend table_registry schema (v1→v2 migration), add source_type/bucket/source_table/query_mode columns. Phase 1: SyncOrchestrator ATTACHes extract.duckdb files into master analytics.duckdb. Keboola extractor uses DuckDB extension with legacy client fallback. BigQuery extractor is remote-only via DuckDB BQ extension (no data download). 62 tests passing.	2026-03-30 20:12:56 +02:00
ZdenekSrotyr	bca5e91826	feat: add bootstrap endpoint + deploy skill for AI agents - POST /auth/bootstrap — creates first admin, self-deactivates after - da setup bootstrap — CLI command for agent-driven setup - da setup verify — structured health check (JSON output for agents) - cli/skills/deploy.md — complete deployment guide for AI agents - 6 bootstrap tests including full agent deployment flow simulation - 156 total tests passing	2026-03-30 14:01:01 +02:00
ZdenekSrotyr	1a7939c594	feat: add auth providers (Google OAuth, Password, Email magic link) + web UI fixes - Google OAuth with authlib + auto user creation + cookie-based JWT - Password auth with argon2 hash + setup token flow - Email magic link with SMTP/SendGrid support - Cookie-based auth for web UI (after OAuth redirect) - Dashboard template compatibility (user_info, activity, desktop status) - 150 tests passing	2026-03-27 17:07:59 +01:00
ZdenekSrotyr	1287e63ed9	feat: complete system — web UI, all API endpoints, governance, admin, CLI commands Major additions: - Web UI: Jinja2 templates in FastAPI (login, dashboard, catalog, corporate memory, admin) - API: catalog profiles/metrics, telegram verify/unlink/status, admin table registry CRUD - Corporate memory governance: approve/reject/mandate/revoke/edit/batch + audit log - Sync: real DataSyncManager trigger, sync-settings, table-subscriptions - CLI: setup (init/test/deploy/verify), server (logs/restart/deploy/backup), explore - Instance config integration (instance.yaml loaded at startup) - 140 tests passing (25 new)	2026-03-27 16:52:22 +01:00
ZdenekSrotyr	c5527ec153	fix: harden script sandbox and SQL query security Fixes found by E2E QA agent: - Script sandbox: block os, sys, socket, eval, exec, open, __import__, getattr, pathlib and 20+ other dangerous patterns - SQL query: block COPY, ATTACH, read_csv, semicolons, non-SELECT - Added 24 security tests covering all attack vectors	2026-03-27 16:11:05 +01:00
ZdenekSrotyr	e0ce91ddb9	feat: add dataset permissions, script execution, Kamal config, CI/CD - SyncSettingsRepository + DatasetPermissionRepository with RBAC - Script deploy/run/undeploy API with import sandboxing - User sync settings API with permission checks - 4 CLI skills (connectors, security, notifications, corporate-memory) - Kamal production + staging configs - GitHub Actions CI + deploy workflows - 91 total tests passing	2026-03-27 15:40:11 +01:00
ZdenekSrotyr	3701130a11	feat: add Docker, CLI tool, scheduler, and agent skills - Dockerfile (uv-based) + docker-compose.yml (3 services) - CLI tool 'da' with commands: auth, sync, query, status, admin, diagnose, skills - Scheduler sidecar service (replaces systemd timers) - pyproject.toml for uv distribution - Built-in skills (setup, troubleshoot) for AI agents - 17 CLI tests, 75 total tests passing	2026-03-27 15:30:03 +01:00
ZdenekSrotyr	a3918d3833	feat: add FastAPI server with auth, RBAC, and all API endpoints - JWT auth with role-based access control (viewer/analyst/admin/km_admin) - Endpoints: health, sync manifest, data download, query, users CRUD, corporate memory, session/artifact upload - 18 API tests covering auth, RBAC, all endpoints	2026-03-27 15:19:18 +01:00
ZdenekSrotyr	64acc8d731	feat: add JSON to DuckDB migration script with tests	2026-03-27 15:09:06 +01:00
ZdenekSrotyr	79b0b66f2e	feat: add DuckDB state layer with all repository classes - src/db.py: schema with 14 tables matching design spec - 7 repository classes: SyncState, Users, Knowledge, Audit, Telegram, PendingCode, Script, TableRegistry, Profiles - 37 tests covering all CRUD operations	2026-03-27 15:06:55 +01:00
ZdenekSrotyr	f76411c603	feat: add DuckDB state layer with schema management Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 13:55:54 +01:00
Petr	1318b74ff1	Add Corporate Memory governance — Phase 1 (data model + admin API) Add admin curation layer between AI extraction and knowledge distribution. Admins (km_admin flag in instance.yaml) can approve, reject, mandate, and revoke knowledge items. Mandatory items distribute to all targeted users automatically. Three governance modes (configurable per instance): - mandatory_only: admin controls everything, no user voting - admin_curated: admin controls, users vote as feedback signal - hybrid: mandatory from admin + optional from user voting Three approval workflows: - review_queue: nothing published without admin approval - auto_publish: items go live immediately, admin intervenes retroactively - threshold: confidence-based auto-publish (Phase 5) Includes: - 9 admin action functions (approve/reject/mandate/revoke/edit/batch/...) - 11 new admin API endpoints under /api/corporate-memory/admin/ - Immutable audit log (audit.jsonl) - Audience targeting via groups - Automatic migration of existing items to "approved" status - km_admin_required auth decorator - 69 tests covering all governance logic - Backward compatible: no config = legacy wiki behavior	2026-03-23 19:15:33 +01:00
Petr	95358448e6	Add modular LLM connector for Corporate Memory Replace hardwired Anthropic API calls with a pluggable provider system. Each deployment configures its AI provider in instance.yaml — switching between Anthropic, LiteLLM, OpenRouter, or any OpenAI-compatible proxy is a config change, not a code change. New connectors/llm/ module: - StructuredExtractor Protocol with extract_json() interface - AnthropicExtractor: direct Anthropic SDK with retry + backoff - OpenAICompatExtractor: any OpenAI-compatible proxy with three-layer structured output fallback (json_schema -> json_object -> prompt) - Configurable structured_output policy (strict/json/auto) - Custom exception hierarchy (auth/rate_limit/timeout/format/refusal) - Zero secrets in logs: no API keys, prompts, or responses logged Reviewed by: Google Gemini, Claude Sonnet, OpenAI GPT-5.4. Security audit passed with all critical findings resolved.	2026-03-23 12:08:33 +01:00
Petr	8c6c162417	Fix: --sql not required when --stdin is used argparse was rejecting --stdin mode because --sql was required=True. Changed to required=False with runtime validation in main().	2026-03-21 12:17:02 +01:00
Petr	d180b2014e	Step 28: Remote query architecture for local+remote table JOINs Add src/remote_query.py CLI module enabling the AI agent to run SQL queries spanning local Parquet tables and remote BigQuery tables in a single DuckDB session on the server. Two-phase protocol: BQ sub-queries (--register-bq) fetch filtered/aggregated data, then DuckDB SQL (--sql) joins everything. Safety: COUNT(*) pre-check, memory estimation (2GB cap), row limits (500K per BQ sub-query, 100K final result). Changes: - New src/remote_query.py with CLI, BQ registration, output formatting - Add bq_entity_type field to TableConfig (view vs table routing) - Extract create_local_views() from duckdb_manager.py for reuse - Update claude_md_template.txt with remote query agent instructions - Update example configs with remote_query section and docs - 52 new tests (42 remote_query + 10 bq_entity_type), all passing	2026-03-21 11:39:15 +01:00
Petr	ab99f0af92	Fix sync_schedule validation to accept multi-time daily format The scheduler.py already supported "daily HH:MM,HH:MM,HH:MM" format (commit `5f27d05`), but config.py validation regex only accepted single time "daily HH:MM", causing data-refresh to crash on startup. Also adds: - tests/test_config_sync_schedule.py (16 test cases) - Makefile with validate-config target for CI/CD integration	2026-03-17 13:21:14 +01:00
Petr	5f27d05894	Support multiple daily sync times (e.g., "daily 07:00,13:00,18:00") Scheduler now accepts comma-separated HH:MM times in daily schedules. Each time slot is independently evaluated - if any slot has passed and last_sync is before it, the table is marked as due. This lets tables sync multiple times per day to pick up data refreshes that happen throughout the day (e.g., Keboola pipelines running 3x/day).	2026-03-16 23:09:48 +01:00
Petr	ad525a96aa	Filter catalog metrics by configurable tag (e.g., AIAgent.FoundryAI) Add filter_tag support to catalog_export and webapp so only metrics with the required tag are exported to YAML and displayed in UI. Previously all 19+ metrics were exported regardless of relevance. - Add has_tag() helper to transformer module - catalog_export.py: filter_tag parameter from instance.yaml openmetadata config - webapp/app.py: filter metrics in _load_metrics_from_catalog() - 7 new tests (has_tag, filter_tag export, stale cleanup)	2026-03-16 22:03:53 +01:00
Petr	80c5b902e0	Add scheduled data sync and catalog refresh with systemd timers - New sync_schedule and profile_after_sync fields in TableConfig (formats: "every 15m", "every 1h", "daily 05:00") - New src/scheduler.py with schedule evaluation logic (is_table_due) - New --scheduled mode in data_sync.py: only syncs tables that are due, respects profile_after_sync flag, auto-restarts webapp after profiling - Systemd timer+service for data-refresh (every 15 min) - Systemd timer+service for catalog-refresh (every 15 min) - deploy.sh enables new timers automatically - Complete table config reference in data_description.md.example - 58 new scheduler tests	2026-03-15 02:16:31 +01:00
Petr	ab1a93ed67	Strip HTML tags from OpenMetadata descriptions in YAML export OpenMetadata stores descriptions as rich HTML (<p>, <strong>,  , etc.). Add strip_html() to transformer that converts to clean plain text for YAML files consumed by Claude Code agent. Applied to metric descriptions, table descriptions, and column descriptions. Webapp display dict keeps raw HTML since the modal renders it correctly.	2026-03-15 01:57:04 +01:00
Petr	985f47cdb7	Add catalog export: generate YAML metrics and tables from OpenMetadata - New `connectors/openmetadata/transformer.py` with shared parsing logic for extracting categories, grain, dimensions, expressions from OM tags - New `src/catalog_export.py` script (python -m src.catalog_export) that fetches metrics/tables from OpenMetadata API and writes YAML files to /data/docs/metrics/ and /data/docs/tables/ for agent consumption - Refactor webapp/app.py to delegate to transformer (with inline fallback) - Add `fields` parameter to client.get_metrics() and get_metric_by_fqn() for fetching tags+owners in a single API call - Fix pre-existing mock bug in test_openmetadata_enricher (base_url) - 101 new tests (80 transformer + 21 export), all passing	2026-03-15 01:15:30 +01:00
Petr	5fc9526627	Phase 2: Replace demo YAML metrics with OpenMetadata catalog data - Add get_metric_by_fqn() to OpenMetadataClient - Add get_metrics() to CatalogEnricher with TTL caching - Implement _parse_om_metric() to extract category/grain from OpenMetadata tags - Implement _load_metrics_from_catalog() to fetch and categorize metrics - Implement _build_om_metric_detail() to convert OpenMetadata format to MetricParser JSON - Add /api/catalog/metrics/<fqn> endpoint for metric detail modal - Update _load_metrics_data() to prefer catalog over YAML fallback - Update metric_modal.js to route catalog:{fqn} to catalog API endpoint - Delete 10 demo YAML files from docs/metrics/ - Replace metric tests with new unit tests for catalog parsing functions (19 tests) Catalog metrics provide single source of truth vs maintaining demo YAML files. UI remains unchanged - only data source changes from YAML to OpenMetadata catalog.	2026-03-12 15:10:42 +01:00
Petr	14d75d6229	Fix: correct OpenMetadata catalog URL path and add debug logging - Change catalog URL from /explore/{fqn} to /table/{fqn} - Add debug logging to see parsed tags, owners, tier from API response	2026-03-12 14:34:12 +01:00
Petr	c5c24cb45b	Implement OpenMetadata catalog integration (Phase 1) Add OpenMetadata REST API connector and enricher to merge table/column metadata from OpenMetadata catalog at sync and query time. Changes: - connectors/openmetadata/client.py: HTTP client for OM API - connectors/openmetadata/enricher.py: Data enrichment with TTL cache - tests/test_openmetadata_*: Unit tests for client and enricher - src/config.py: Add catalog_fqn field to TableConfig - src/data_sync.py: Use enricher in _generate_schema_yaml (catalog > BQ API > data_description.md) - webapp/app.py: Initialize enricher, enrich catalog data with tags/tier/owners/url - config/instance.yaml.example: Document openmetadata section Features: - FQN auto-derivation: bigquery.{table.id} - TTL cache (default 1h) to avoid repeated API calls - Graceful degradation: disabled if token missing, silent on HTTP errors - Column description priority: catalog > BQ API > (none) - Table description priority: catalog > data_description.md	2026-03-12 14:07:13 +01:00
Petr	8bb46a9e0a	Add per-partition streaming sync and hybrid query architecture Partitioned sync: iterates day-by-day instead of loading full dataset. Each partition: query BQ -> stream to disk -> free RAM. Peak ~50 MB. New helpers: _sync_single_partition, _cleanup_old_partitions, _generate_partition_dates. Config: added partition_column_type (DATE/TIMESTAMP/DATETIME), query_mode (local/remote/hybrid). DuckDB manager: hybrid architecture support (local Parquet + remote BQ tables). Data sync: skips remote tables, filters by query_mode. Tests: 113 passing (adapter, client, config, data_sync, duckdb_manager).	2026-03-12 13:20:41 +01:00
Petr	ee70da86c3	Stream BQ results to Parquet instead of loading into memory Replace to_arrow() (loads entire result into RAM) with to_arrow_iterable() (streams RecordBatches). Each batch is written directly to disk via ParquetWriter - constant memory regardless of table size. Prevents OOM on 8GB server for multi-million row tables.	2026-03-11 20:13:03 +01:00
Petr	a191ede28c	Add columns and row_filter to TableConfig for selective BQ export Propagate column selection and row filtering from data_description.md through the BigQuery adapter to the BQ client. This enables exporting only needed columns and applying date range filters at the SQL level, critical for large DataView tables (e.g., 412-col unit_economics).	2026-03-11 19:37:04 +01:00
Petr	758910463b	Add BigQuery data source adapter BigQuery connector that syncs BQ tables to local Parquet files via PyArrow (no CSV intermediate step). Supports full refresh, timestamp-based incremental (via incremental_column), and partition-based sync strategies. - connectors/bigquery/client.py: BQ API wrapper with ADC auth, parameterized queries, metadata cache, cross-project support (job project != data project) - connectors/bigquery/adapter.py: DataSource implementation with merge/dedup - src/config.py: Add incremental_column field to TableConfig - 72 unit tests (mocked, no GCP SDK required)	2026-03-11 13:56:12 +01:00
Petr	5a84473213	Add dynamic Business Metrics with sample e-commerce definitions Replace hardcoded Keboola-specific metrics card in Data Catalog with dynamic Jinja template that renders whatever metric YAMLs exist in docs/metrics/. Add 10 sample e-commerce metric definitions across 4 categories (revenue, customers, marketing, support) that align with the sample data generator tables. Key changes: - MetricParser: new category colors + dynamic sql_* field discovery - _load_metrics_data(): scans docs/metrics//.yml with prod fallback - catalog.html: 240 lines hardcoded HTML -> 35 lines Jinja loop - metric_modal.js: regex-based category class removal, new categories - 21 tests validating YAML schema, parser, and loader	2026-03-10 22:38:44 +01:00
Petr	302494b632	Add --format parquet using project's ParquetManager Generator now supports --format {csv,parquet,both}. Parquet mode uses src.parquet_manager.ParquetManager for snappy compression, proper column types (DATE, TIMESTAMP, DOUBLE), and metadata. No more ad-hoc pandas conversion needed on the server.	2026-03-10 21:46:20 +01:00
Petr	44bf43535b	Add sample data generator with 9 e-commerce tables Synthetic data generator for demo/testing without real data adapter: - 9 tables: customers, products, campaigns, web_sessions, web_leads, orders, order_items, payments, support_tickets - 4 size presets: xs (1MB), s (15MB), m (150MB), l (1.5GB) - Realistic patterns: seasonality, Pareto customer distribution, segment-based behavior, referential integrity - Deterministic output via --seed parameter Also: docs/sample-data.md, updated auto-install.md with Step 6, updated CLAUDE.md (email auth provider, dual-repo architecture)	2026-03-10 12:31:14 +01:00
Petr	f635195c80	Add multi-domain support and full-email username generation - Support comma-separated domains in auth.allowed_domain config - Use full email as system username (user@domain.com -> user_domain_com) to avoid collisions with reserved names and across domains - Update both auth providers (google, email) for multi-domain display - Add tests for username generation and update email auth tests	2026-03-10 10:50:01 +01:00
Petr	e2ab219171	Add email magic link authentication provider New pluggable auth provider that sends passwordless sign-in links. Works with domain restriction (same as Google OAuth). Falls back to showing the link in browser when SMTP is not configured (dev mode).	2026-03-10 10:39:19 +01:00
Petr	b99ec576ca	Add self-service data onboarding system Table Registry as central source of truth (JSON) with atomic writes, optimistic locking, audit logging, and data_description.md generation. Existing readers (config.py, profiler.py) need zero changes. Phase 1 - Discovery API: - discover_tables() on DataSource ABC + Keboola implementation - admin_required decorator with server-side recomputation - GET /api/admin/discover-tables endpoint Phase 2 - Table Registry: - src/table_registry.py with CRUD, validation, migration from MD - Admin API: register/update/unregister with version locking - DELETE cascade cleans up per-user subscriptions Phase 3 - Auto-Profiling: - profile_changed_tables() for incremental profiling - Non-fatal hook in sync_all() after successful sync Phase 4 - Per-Table Subscriptions: - table_mode (all/explicit) with per-table toggles - GET/POST /api/table-subscriptions endpoints - Subscription status in catalog and dashboard views Phase 5 - Smart Sync: - Python-generated rsync filter files (not shell YAML parsing) - sync_data.sh uses --filter="merge ..." for explicit mode Phase 6 - Admin UI: - /admin/tables with discovery, registration modal, registry mgmt - Vanilla JS, matching existing design system	2026-03-09 14:25:37 +01:00
Petr	86edd27655	Extract Jira into connectors/jira module Move all Jira-specific code into a self-contained connector module: - 22 files moved via git mv (transform, service, webhook, scripts, systemd units, tests, docs, bin helper) - All imports updated to use connectors.jira.* paths - Jira is now conditional: auto-detected via JIRA_DOMAIN env var - Webapp registers Jira blueprint only when available - Health service monitors Jira timers only when enabled - Profiler loads Jira tables dynamically from filesystem - Sync settings uses config-driven dependency validation - Renamed keboola_platform_url -> custom_url in transform - Updated deploy.sh, sudoers-deploy, backfill_gap.sh paths - Fixed pytest.ini to skip live tests by default	2026-03-09 11:17:50 +01:00
Petr	26c4e0934d	OSS cleanup: remove internal references, harden deployment, add config env interpolation Phase 1 - Internal reference cleanup: - Delete dev_docs/meetings/ (internal meeting notes/transcripts) - Replace hardcoded usernames (padak/matejkys/dasa) with deploy/generic - Replace "Internal AI Data Analyst" with "AI Data Analyst" - Replace keboola/internal_ai_data_analyst URLs with your-org/ai-data-analyst - Replace /tmp/keboola_load/ with /tmp/data_analyst_staging/ in dev_docs Phase 2 - Deployment hardening: - Tighten sudoers wildcards to explicit paths (visudo, sudoers cp) - setup.sh creates all groups (data-ops, dataread, data-private) and deploy user - webapp-setup.sh copies sudoers-webapp from repo instead of inline definition - deploy.sh conditional copy for data_description.md (not in git for OSS) - deploy.sh ownership changed to deploy:data-ops for /data/{scripts,docs,examples} Phase 3 - Config and misc: - Add ${ENV_VAR} interpolation to config/loader.py - Expand config/instance.yaml.example with all sections (admins, deployment, auth, etc.) - Create config/.env.template for secret values - Add MIT LICENSE - Fix .gitignore: add .venv/, docs/data_description.md - Fix README.md: CSV status Planned, remove metrics/, update license text - Translate Czech comments in requirements.txt to English - Fix test_account_service.py: mock username mapping instead of relying on instance config All 118 tests pass.	2026-03-09 07:59:57 +01:00
Petr	c56905d34f	Initial commit: OSS data distribution platform Open-source AI data analyst platform extracted from internal repo. Includes data sync engine, Keboola adapter, Flask web portal, server deployment scripts, and configuration templates.	2026-03-08 23:31:28 +01:00

1 2 3

140 commits