agnes-the-ai-analyst

Author	SHA1	Message	Date
ZdenekSrotyr	cb9c566d07	fix: rebuild_source delegates to full rebuild to preserve all source views _do_rebuild_source was creating a fresh temp DB with only one source, then atomically replacing analytics.duckdb — wiping views from every other source. Now it delegates to _do_rebuild so all extract dirs are re-attached in a single pass. Adds test_rebuild_source_preserves_other_sources to guard the regression.	2026-04-09 06:48:25 +02:00
ZdenekSrotyr	94c6b0f839	fix: require password verification when user has password_hash in /auth/token Previously the password check was gated on both user.password_hash and request.password being truthy, so an attacker could omit the password field (which defaults to "") and receive a valid JWT. Now any user with a stored hash must supply a non-empty password that passes argon2 verification. Adds six TestTokenEndpoint tests covering empty, missing, wrong, and correct password, plus no-hash user and unknown user cases.	2026-04-09 06:44:31 +02:00
ZdenekSrotyr	89154d043b	chore: clean repo for public release — fix references, remove drafts - Replace padak/tmp_oss → keboola/agnes-the-ai-analyst in all docs, infra, CLI - Replace your-org/ai-data-analyst → keboola/agnes-the-ai-analyst in README, Jira docs - Remove real GCP project ID from terraform.tfvars.example - Delete internal draft documents (dev_docs/draft/) - Update infra/main.tf to clone from main branch	2026-04-08 19:27:25 +02:00
ZdenekSrotyr	79443e0df4	fix: CSV all_varchar in legacy extractor, rewrite DEPLOYMENT.md from real deploy - Legacy extractor now uses read_csv(all_varchar=true) to avoid type inference errors (e.g. seniority column typed as DOUBLE with string values) - DEPLOYMENT.md rewritten based on actual dev VM deployment experience: deploy key setup, DuckDB write locking, env reload gotchas, bootstrap flow	2026-04-08 19:09:55 +02:00
ZdenekSrotyr	2635f77974	ci: add CI test suite + deploy pipeline - ci.yml: runs 607 tests + Docker build on push/PR - deploy.yml: tests → build → GHCR push → Kamal deploy on main	2026-04-08 18:24:05 +02:00
ZdenekSrotyr	cfa08c4b4c	chore: remove obsolete CI workflows (deploy-guard, deploy.yml.example) deploy-guard.yml referenced deleted tests and sudoers files. deploy.yml.example used legacy SSH-based deployment. Updated ci.yml and deploy.yml are in .gitignore (need workflow scope to push).	2026-04-08 18:16:48 +02:00
ZdenekSrotyr	3ba207a7f8	feat: add _remote_attach to BigQuery extractor, support token-less ATTACH in orchestrator BigQuery extension handles auth via GOOGLE_APPLICATION_CREDENTIALS env var, so _remote_attach uses empty token_env. Orchestrator now supports both token-based (Keboola) and env-based (BigQuery) authentication modes.	2026-04-08 18:13:31 +02:00
ZdenekSrotyr	06e1cf0a8d	feat: generic _remote_attach contract for remote DuckDB extension views Extractors with remote tables now write a _remote_attach table into extract.duckdb so the orchestrator can re-ATTACH external extensions at query time. The mechanism is source-agnostic — any connector can use it. - Keboola extractor writes _remote_attach + creates views on kbc.* - Orchestrator reads _remote_attach, installs extension, reads token from env - Graceful degradation: missing token → warning, local tables still work	2026-04-08 18:10:12 +02:00
ZdenekSrotyr	ee7d5630ef	fix: keep external_access enabled — views need read_parquet on local files File access attacks blocked by SQL blocklist instead of DuckDB pragma (pragma also blocks legitimate view resolution via read_parquet).	2026-04-08 12:33:05 +02:00
ZdenekSrotyr	f2f9a62803	fix: set enable_external_access=false AFTER ATTACHing extracts	2026-04-08 12:29:27 +02:00
ZdenekSrotyr	6efdf4ca64	fix: read-only analytics DB ATTACHes extract.duckdb files for view resolution	2026-04-08 12:27:12 +02:00
ZdenekSrotyr	a0f7e98f11	chore: add auth/ to gitignore (legacy dir, not tracked)	2026-04-08 12:12:37 +02:00
ZdenekSrotyr	92fbb88c15	chore: Docker prod config (Python 3.13, no reload), fix utcnow deprecation, update docs	2026-04-08 12:10:47 +02:00
ZdenekSrotyr	05a1b452e9	security: harden query (read-only DB), uploads (path sanitization), scripts (AST validation)	2026-04-08 12:09:19 +02:00
ZdenekSrotyr	224635b88d	security: fix auth (argon2, cookie, JWT), CORS, session middleware, pyproject.toml	2026-04-08 12:08:52 +02:00
ZdenekSrotyr	d5659d7091	fix: login page uses login_buttons format expected by template	2026-04-08 07:11:03 +02:00
ZdenekSrotyr	67a1e0bb45	feat: Jira webhook FastAPI adapter — replaces Flask Blueprint	2026-04-08 07:04:50 +02:00
ZdenekSrotyr	3e3f84a00e	feat: dynamic login providers + profiler auto-trigger + refresh endpoint	2026-04-08 07:04:40 +02:00
ZdenekSrotyr	4bad893cb8	feat: Docker services (ws-gateway, corporate-memory, session-collector) + scheduler auto-auth	2026-04-08 07:04:26 +02:00
ZdenekSrotyr	bae9619363	fix: restore authlib + argon2-cffi — needed by FastAPI auth providers Google OAuth uses authlib, password auth uses argon2. These were incorrectly removed as 'legacy' but are used by app/auth/providers/.	2026-03-31 19:23:24 +02:00
ZdenekSrotyr	5ee12d78e7	refactor: final cleanup — delete legacy auth, clean deps, fix hash, migrate to uv - Delete root auth/ directory (legacy Flask providers, orphaned) - Clean requirements.txt: remove Flask, gunicorn, authlib, sendgrid, anthropic, openai, argon2-cffi (9 unused deps) - Fix hash computation in orchestrator: MD5 of parquet mtime+size (CLI sync now skips unchanged tables correctly) - Migrate pip → uv in CLAUDE.md, scripts/init.sh, pyproject.toml - Sync pyproject.toml dependencies with requirements.txt 578 tests passing.	2026-03-31 19:18:30 +02:00
ZdenekSrotyr	2b7348a773	fix: sync only extracts local tables, skips remote Was using list_by_source() which returns all tables including remote. Now uses list_local() to skip query_mode='remote' tables.	2026-03-31 15:35:49 +02:00
ZdenekSrotyr	8f3a342108	fix: sync logs via stderr for docker compose visibility	2026-03-31 14:05:01 +02:00
ZdenekSrotyr	7612385ed6	fix: extractor subprocess reads table configs via stdin, not DuckDB Subprocess cannot open system.duckdb (main process holds lock). Now main process reads table_registry and passes configs as JSON via stdin to subprocess. Subprocess never touches system.duckdb.	2026-03-31 13:57:02 +02:00
ZdenekSrotyr	4d1acd014a	refactor: remove legacy webapp + add missing tests + housekeeping Phase A: Close fixed issues (#7, #8, #9), add server/ user/ to .gitignore, increase extractor timeout to 30 min. Phase B: Add 10 new tests — access request lifecycle (4), CLI admin commands (5), sync subprocess trigger (1). 578 tests passing. Phase C: Delete entire webapp/ directory (24,800 lines) — legacy Flask app fully replaced by FastAPI app/. Fix auth providers to use app.instance_config instead of webapp.config. Update CLAUDE.md. Delete 6 webapp-only test files. Fix Jira service config imports.	2026-03-31 13:44:06 +02:00
ZdenekSrotyr	6aee6cf454	fix: CLI sync downloads tables with empty hash (not yet computed) Empty server hash was matching empty local hash, skipping all tables. Now treats empty hash or missing local entry as 'needs download'.	2026-03-31 13:30:10 +02:00
ZdenekSrotyr	2d6a94fb6f	fix: DuckDB concurrency — WAL mode, subprocess sync, temp+rename Three-pronged fix for DuckDB lock conflicts: 1. WAL mode on system.duckdb — enables concurrent readers + writer 2. Sync trigger runs extractor as subprocess (not background task) — separate process = separate DuckDB connections, no lock conflict 3. Both extractor and orchestrator write to .tmp then atomic rename — avoids lock conflict with API reads on extract.duckdb/analytics.duckdb Fixes #9 permanently.	2026-03-31 13:19:57 +02:00
ZdenekSrotyr	10d9280ab5	fix: extractor writes to temp file to avoid lock with orchestrator Writes extract.duckdb.tmp then renames atomically, avoiding DuckDB lock conflict when orchestrator holds a read connection on extract.duckdb.	2026-03-31 13:09:51 +02:00
ZdenekSrotyr	675a29c1c7	fix: DuckDB connection pool — shared connection avoids lock conflicts Fixes #9 — background sync tasks could not access system.duckdb because FastAPI held an exclusive lock. Now uses single shared connection per DATA_DIR with cursor() for thread safety.	2026-03-31 13:01:04 +02:00
ZdenekSrotyr	04fa1402e4	feat: CLI admin commands — register-table, discover-and-register, list-tables da admin register-table: register single table da admin discover-and-register: auto-discover from Keboola API + bulk register da admin list-tables: show all registered tables Used to register all 142 Keboola tables on production.	2026-03-31 12:55:03 +02:00
ZdenekSrotyr	2e7d5d1fe9	feat: access request UI — catalog badges, request modal, admin approval page Backend: - access_requests table in DuckDB schema - AccessRequestRepository with create/approve/deny/list - API: POST/GET /api/access-requests (submit, my requests, pending, approve, deny) UI: - Catalog: lock icon on private tables, "Request Access" button + modal - Catalog: "Pending" badge for tables with pending requests - Admin permissions page (/admin/permissions): approve/deny requests, grant/revoke permissions, view all user permissions - Cross-navigation between admin/tables and admin/permissions 733 tests passing.	2026-03-31 12:45:29 +02:00
ZdenekSrotyr	1074d5ec49	feat: implement data access control — table-level permissions Schema v3: add is_public column to table_registry (default true). src/rbac.py: can_access_table() checks admin bypass, public flag, explicit permissions, wildcard bucket permissions. API enforcement: - manifest: filters tables by user access - download: 403 if no access - catalog: filters table list - query: validates referenced tables against allowed list New admin permissions API (/api/admin/permissions) for grant/revoke. 28 access control tests + 733 total tests passing.	2026-03-31 12:33:31 +02:00
ZdenekSrotyr	78f003f5b5	fix: reject empty table name in register-table endpoint Fixes #8 — empty name created orphaned record that couldn't be deleted.	2026-03-31 12:18:58 +02:00
ZdenekSrotyr	bd0b6d19c6	fix: legacy extractor constructs full Keboola table ID from bucket+source_table Was using tc['id'] which is the registry ID (e.g. 'circle'), not the full Keboola ID (e.g. 'in.c-finance.circle') needed by the API.	2026-03-31 12:06:38 +02:00
ZdenekSrotyr	0084f80ff6	fix: legacy extractor passes Path to export_table, not str Fixes 'str' object has no attribute 'parent' when Keboola DuckDB extension falls back to legacy client.	2026-03-31 12:03:16 +02:00
ZdenekSrotyr	865d6d657e	fix: keboola client metadata_cache_path uses DATA_DIR instead of deleted config Fixes #7 — NameError: name 'config' is not defined	2026-03-31 11:57:57 +02:00
ZdenekSrotyr	04c5aecc58	fix: update Terraform for extract.duckdb architecture - Create /data/extracts instead of /data/src_data/parquet - Add admin_email variable for SEED_ADMIN_EMAIL	2026-03-31 09:49:32 +02:00
ZdenekSrotyr	e1e2d6d903	feat: add SEED_ADMIN_EMAIL for Docker test environments app/main.py: seed admin user on startup when SEED_ADMIN_EMAIL is set. docker-compose.test.yml: expose port 8000, add seed env var.	2026-03-31 09:48:12 +02:00
ZdenekSrotyr	617e724d21	feat: add E2E test suite — API, extractor, Docker tests/conftest.py: shared fixtures (e2e_env, seeded_app, create_mock_extract) tests/test_e2e_api.py: 11 tests — full sync flow, RBAC, table lifecycle tests/test_e2e_extract.py: 6 tests — Keboola/BQ/Jira pipelines, multi-source, corrupt handling tests/test_e2e_docker.py: 3 tests — Docker health + full flow (opt-in via -m docker) Fix admin update route (duplicate id kwarg, .dict() → .model_dump()). 705 tests passing.	2026-03-31 08:18:54 +02:00
ZdenekSrotyr	b0eaef88cc	refactor: delete old server infra — 4,200 lines removed Remove all legacy deployment infrastructure replaced by Docker + Kamal: - server/ directory (deploy.sh, setup.sh, webapp-setup.sh, sudoers, nginx config, systemd units, bin scripts) - scripts/sync_data.sh (replaced by da sync + API) - All services/*/systemd/ files (replaced by docker-compose) - tests/test_deploy_guard.py and tests/test_sync_data.py 688 tests passing.	2026-03-31 08:06:41 +02:00
ZdenekSrotyr	caa60a507d	feat: add centralized RBAC module — replace Linux group auth New src/rbac.py: Role enum, hierarchy, get_user_role(), has_role(), is_admin(), is_km_admin(), has_dataset_access(), set_user_role(). webapp/auth.py: admin_required + km_admin_required now use DuckDB roles instead of Linux groups (pwd.getpwnam + sudo/data-ops check). app/auth/dependencies.py: imports Role from src/rbac.py (single source). 11 RBAC tests passing.	2026-03-31 08:04:35 +02:00
ZdenekSrotyr	9fef90a729	docs: rewrite CLAUDE.md for extract.duckdb architecture Update project structure, architecture diagram, key implementation details, development commands, and extensibility docs. Add extract service to docker-compose.yml for one-shot extraction.	2026-03-31 07:52:44 +02:00
ZdenekSrotyr	b502bd8bdd	refactor: delete old sync pipeline — 9,500 lines removed Phase 5 cleanup: remove all code replaced by extract.duckdb architecture. Deleted modules: - src/config.py (653) — replaced by DuckDB table_registry - src/parquet_manager.py (755) — replaced by DuckDB COPY TO - src/data_sync.py (734) — replaced by SyncOrchestrator - src/remote_query.py (636) — replaced by DuckDB BigQuery ATTACH - src/table_registry.py (464) — replaced by DuckDB repository - connectors/keboola/adapter.py (820) — replaced by extractor.py - connectors/bigquery/adapter.py (665) — replaced by extractor.py - connectors/bigquery/client.py (644) — replaced by DuckDB BQ extension Updated all imports in webapp, catalog_export, enricher, router, sync_settings_service, generate_sample_data. Kept keboola/client.py as fallback (removed src.config dependency). 704 tests passing.	2026-03-31 07:50:37 +02:00
ZdenekSrotyr	9f20529f10	fix: resolve 7 preexisting test failures - Remove iCloud duplicate files (test_db 2.py, src/db 2.py) - Fix metrics expression fallback to top-level field in transformer + webapp - Fix sync_data.sh rsync exception pattern for $SSH_HOST variable - Fix deploy_guard cp regex to skip shell variable expansions - Update sudoers-deploy with missing root:data-ops rules - Update CRITICAL_DIRS ownership expectations to match deploy.sh reality 913 tests passing, 0 failures.	2026-03-30 20:36:00 +02:00
ZdenekSrotyr	e2a7ee21a2	fix: Jira extract_init handles empty parquet dirs gracefully DuckDB read_parquet glob fails when no files match. Skip view creation for tables without parquet files, create views only after first write.	2026-03-30 20:28:29 +02:00
ZdenekSrotyr	8bc1fceb52	feat: add migration scripts for extract.duckdb transition migrate_registry_to_duckdb.py: imports tables from data_description.md or table_registry.json into DuckDB table_registry with source columns. migrate_parquets_to_extracts.py: copies parquets to /data/extracts/ and creates extract.duckdb with _meta + views.	2026-03-30 20:21:12 +02:00
ZdenekSrotyr	e058c71777	feat: adapt Jira connector to extract.duckdb format - New extract_init.py: creates extract.duckdb with _meta + views for 6 entity types - Update default paths to /data/extracts/jira/data/ and /data/extracts/jira/raw/ - After parquet writes, update _meta table in extract.duckdb - Trigger SyncOrchestrator.rebuild_source("jira") after successful transform	2026-03-30 20:19:27 +02:00
ZdenekSrotyr	1bf97c725c	feat: wire orchestrator into API — replace DataSyncManager sync.py: _run_sync() now calls extractor + SyncOrchestrator.rebuild() data.py: parquet lookup searches /data/extracts/ first, legacy fallback catalog.py: list tables from DuckDB table_registry instead of src.config admin.py: discover-tables uses KeboolaClient directly, remove old TableRegistry dep	2026-03-30 20:16:33 +02:00
ZdenekSrotyr	18e5f0b6e8	feat: implement extract.duckdb contract — orchestrator + extractors Phase 0: extend table_registry schema (v1→v2 migration), add source_type/bucket/source_table/query_mode columns. Phase 1: SyncOrchestrator ATTACHes extract.duckdb files into master analytics.duckdb. Keboola extractor uses DuckDB extension with legacy client fallback. BigQuery extractor is remote-only via DuckDB BQ extension (no data download). 62 tests passing.	2026-03-30 20:12:56 +02:00
ZdenekSrotyr	0b9720d090	docs: rewrite core refactoring spec v2 — simplified extract.duckdb contract	2026-03-30 19:24:19 +02:00
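
Several commits above (2d6a94fb6f, 10d9280ab5, cb9c566d07) rely on writing a database file to a `.tmp` path and then renaming it over the live file, so readers never see a half-written file. A minimal sketch of that pattern, assuming plain byte payloads rather than a real DuckDB file; the helper name `atomic_write` is hypothetical and does not come from the repository:

```python
import os
import tempfile
from pathlib import Path


def atomic_write(target: Path, payload: bytes) -> None:
    """Write payload to a sibling .tmp file, then swap it into place.

    Hypothetical helper for illustration; the repository's extractor and
    orchestrator build a DuckDB file at the temp path instead of raw bytes.
    """
    tmp = target.with_name(target.name + ".tmp")
    with open(tmp, "wb") as fh:
        fh.write(payload)
        fh.flush()
        os.fsync(fh.fileno())  # push the data to disk before the rename
    # os.replace overwrites the destination; on POSIX the rename is atomic
    # when both paths are on the same filesystem.
    os.replace(tmp, target)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        db = Path(d) / "extract.duckdb"
        atomic_write(db, b"v1")
        atomic_write(db, b"v2")  # replaces the previous file in one step
        print(db.read_bytes())
```

The temp file sits next to the target so the rename stays on one filesystem. As cb9c566d07 points out, the swap replaces the whole file, so the temp file has to contain everything the final file should hold.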


633 commits