Commit graph

912 commits

Author SHA1 Message Date
ZdenekSrotyr
2b7348a773 fix: sync only extracts local tables, skips remote
Was using list_by_source() which returns all tables including remote.
Now uses list_local() to skip query_mode='remote' tables.
2026-03-31 15:35:49 +02:00
ZdenekSrotyr
8f3a342108 fix: sync logs via stderr for docker compose visibility 2026-03-31 14:05:01 +02:00
ZdenekSrotyr
7612385ed6 fix: extractor subprocess reads table configs via stdin, not DuckDB
Subprocess cannot open system.duckdb (main process holds lock).
Now main process reads table_registry and passes configs as JSON
via stdin to subprocess. Subprocess never touches system.duckdb.
2026-03-31 13:57:02 +02:00
ZdenekSrotyr
4d1acd014a refactor: remove legacy webapp + add missing tests + housekeeping
Phase A: Close fixed issues (#7, #8, #9), add server/ user/ to
.gitignore, increase extractor timeout to 30 min.

Phase B: Add 10 new tests — access request lifecycle (4), CLI admin
commands (5), sync subprocess trigger (1). 578 tests passing.

Phase C: Delete entire webapp/ directory (24,800 lines) — legacy Flask
app fully replaced by FastAPI app/. Fix auth providers to use
app.instance_config instead of webapp.config. Update CLAUDE.md.

Delete 6 webapp-only test files. Fix Jira service config imports.
2026-03-31 13:44:06 +02:00
ZdenekSrotyr
6aee6cf454 fix: CLI sync downloads tables with empty hash (not yet computed)
Empty server hash was matching empty local hash, skipping all tables.
Now treats empty hash or missing local entry as 'needs download'.
2026-03-31 13:30:10 +02:00
ZdenekSrotyr
2d6a94fb6f fix: DuckDB concurrency — WAL mode, subprocess sync, temp+rename
Three-pronged fix for DuckDB lock conflicts:

1. WAL mode on system.duckdb — enables concurrent readers + writer
2. Sync trigger runs extractor as subprocess (not background task) —
   separate process = separate DuckDB connections, no lock conflict
3. Both extractor and orchestrator write to .tmp then atomic rename —
   avoids lock conflict with API reads on extract.duckdb/analytics.duckdb

Fixes #9 permanently.
2026-03-31 13:19:57 +02:00
ZdenekSrotyr
10d9280ab5 fix: extractor writes to temp file to avoid lock with orchestrator
Writes extract.duckdb.tmp then renames atomically, avoiding DuckDB lock
conflict when orchestrator holds a read connection on extract.duckdb.
2026-03-31 13:09:51 +02:00
ZdenekSrotyr
675a29c1c7 fix: DuckDB connection pool — shared connection avoids lock conflicts
Fixes #9 — background sync tasks could not access system.duckdb
because FastAPI held an exclusive lock. Now uses single shared
connection per DATA_DIR with cursor() for thread safety.
2026-03-31 13:01:04 +02:00
ZdenekSrotyr
04fa1402e4 feat: CLI admin commands — register-table, discover-and-register, list-tables
da admin register-table: register single table
da admin discover-and-register: auto-discover from Keboola API + bulk register
da admin list-tables: show all registered tables

Used to register all 142 Keboola tables on production.
2026-03-31 12:55:03 +02:00
ZdenekSrotyr
2e7d5d1fe9 feat: access request UI — catalog badges, request modal, admin approval page
Backend:
- access_requests table in DuckDB schema
- AccessRequestRepository with create/approve/deny/list
- API: POST/GET /api/access-requests (submit, my requests, pending, approve, deny)

UI:
- Catalog: lock icon on private tables, "Request Access" button + modal
- Catalog: "Pending" badge for tables with pending requests
- Admin permissions page (/admin/permissions): approve/deny requests,
  grant/revoke permissions, view all user permissions
- Cross-navigation between admin/tables and admin/permissions

733 tests passing.
2026-03-31 12:45:29 +02:00
ZdenekSrotyr
1074d5ec49 feat: implement data access control — table-level permissions
Schema v3: add is_public column to table_registry (default true).

src/rbac.py: can_access_table() checks admin bypass, public flag,
explicit permissions, wildcard bucket permissions.

API enforcement:
- manifest: filters tables by user access
- download: 403 if no access
- catalog: filters table list
- query: validates referenced tables against allowed list

New admin permissions API (/api/admin/permissions) for grant/revoke.

28 access control tests + 733 total tests passing.
2026-03-31 12:33:31 +02:00
ZdenekSrotyr
78f003f5b5 fix: reject empty table name in register-table endpoint
Fixes #8 — empty name created orphaned record that couldn't be deleted.
2026-03-31 12:18:58 +02:00
ZdenekSrotyr
bd0b6d19c6 fix: legacy extractor constructs full Keboola table ID from bucket+source_table
Was using tc['id'] which is the registry ID (e.g. 'circle'), not the
full Keboola ID (e.g. 'in.c-finance.circle') needed by the API.
2026-03-31 12:06:38 +02:00
ZdenekSrotyr
0084f80ff6 fix: legacy extractor passes Path to export_table, not str
Fixes 'str' object has no attribute 'parent' when Keboola DuckDB
extension falls back to legacy client.
2026-03-31 12:03:16 +02:00
ZdenekSrotyr
865d6d657e fix: keboola client metadata_cache_path uses DATA_DIR instead of deleted config
Fixes #7 — NameError: name 'config' is not defined
2026-03-31 11:57:57 +02:00
ZdenekSrotyr
04c5aecc58 fix: update Terraform for extract.duckdb architecture
- Create /data/extracts instead of /data/src_data/parquet
- Add admin_email variable for SEED_ADMIN_EMAIL
2026-03-31 09:49:32 +02:00
ZdenekSrotyr
e1e2d6d903 feat: add SEED_ADMIN_EMAIL for Docker test environments
app/main.py: seed admin user on startup when SEED_ADMIN_EMAIL is set.
docker-compose.test.yml: expose port 8000, add seed env var.
2026-03-31 09:48:12 +02:00
ZdenekSrotyr
617e724d21 feat: add E2E test suite — API, extractor, Docker
tests/conftest.py: shared fixtures (e2e_env, seeded_app, create_mock_extract)
tests/test_e2e_api.py: 11 tests — full sync flow, RBAC, table lifecycle
tests/test_e2e_extract.py: 6 tests — Keboola/BQ/Jira pipelines, multi-source, corrupt handling
tests/test_e2e_docker.py: 3 tests — Docker health + full flow (opt-in via -m docker)

Fix admin update route (duplicate id kwarg, .dict() → .model_dump()).

705 tests passing.
2026-03-31 08:18:54 +02:00
ZdenekSrotyr
b0eaef88cc refactor: delete old server infra — 4,200 lines removed
Remove all legacy deployment infrastructure replaced by Docker + Kamal:
- server/ directory (deploy.sh, setup.sh, webapp-setup.sh, sudoers,
  nginx config, systemd units, bin scripts)
- scripts/sync_data.sh (replaced by da sync + API)
- All services/*/systemd/ files (replaced by docker-compose)
- tests/test_deploy_guard.py and tests/test_sync_data.py

688 tests passing.
2026-03-31 08:06:41 +02:00
ZdenekSrotyr
caa60a507d feat: add centralized RBAC module — replace Linux group auth
New src/rbac.py: Role enum, hierarchy, get_user_role(), has_role(),
is_admin(), is_km_admin(), has_dataset_access(), set_user_role().

webapp/auth.py: admin_required + km_admin_required now use DuckDB
roles instead of Linux groups (pwd.getpwnam + sudo/data-ops check).

app/auth/dependencies.py: imports Role from src/rbac.py (single source).

11 RBAC tests passing.
2026-03-31 08:04:35 +02:00
ZdenekSrotyr
9fef90a729 docs: rewrite CLAUDE.md for extract.duckdb architecture
Update project structure, architecture diagram, key implementation
details, development commands, and extensibility docs.
Add extract service to docker-compose.yml for one-shot extraction.
2026-03-31 07:52:44 +02:00
ZdenekSrotyr
b502bd8bdd refactor: delete old sync pipeline — 9,500 lines removed
Phase 5 cleanup: remove all code replaced by extract.duckdb architecture.

Deleted modules:
- src/config.py (653) — replaced by DuckDB table_registry
- src/parquet_manager.py (755) — replaced by DuckDB COPY TO
- src/data_sync.py (734) — replaced by SyncOrchestrator
- src/remote_query.py (636) — replaced by DuckDB BigQuery ATTACH
- src/table_registry.py (464) — replaced by DuckDB repository
- connectors/keboola/adapter.py (820) — replaced by extractor.py
- connectors/bigquery/adapter.py (665) — replaced by extractor.py
- connectors/bigquery/client.py (644) — replaced by DuckDB BQ extension

Updated all imports in webapp, catalog_export, enricher, router,
sync_settings_service, generate_sample_data. Kept keboola/client.py
as fallback (removed src.config dependency).

704 tests passing.
2026-03-31 07:50:37 +02:00
ZdenekSrotyr
9f20529f10 fix: resolve 7 preexisting test failures
- Remove iCloud duplicate files (test_db 2.py, src/db 2.py)
- Fix metrics expression fallback to top-level field in transformer + webapp
- Fix sync_data.sh rsync exception pattern for $SSH_HOST variable
- Fix deploy_guard cp regex to skip shell variable expansions
- Update sudoers-deploy with missing root:data-ops rules
- Update CRITICAL_DIRS ownership expectations to match deploy.sh reality

913 tests passing, 0 failures.
2026-03-30 20:36:00 +02:00
ZdenekSrotyr
e2a7ee21a2 fix: Jira extract_init handles empty parquet dirs gracefully
DuckDB read_parquet glob fails when no files match. Skip view creation
for tables without parquet files, create views only after first write.
2026-03-30 20:28:29 +02:00
ZdenekSrotyr
8bc1fceb52 feat: add migration scripts for extract.duckdb transition
migrate_registry_to_duckdb.py: imports tables from data_description.md
or table_registry.json into DuckDB table_registry with source columns.
migrate_parquets_to_extracts.py: copies parquets to /data/extracts/
and creates extract.duckdb with _meta + views.
2026-03-30 20:21:12 +02:00
ZdenekSrotyr
e058c71777 feat: adapt Jira connector to extract.duckdb format
- New extract_init.py: creates extract.duckdb with _meta + views for 6 entity types
- Update default paths to /data/extracts/jira/data/ and /data/extracts/jira/raw/
- After parquet writes, update _meta table in extract.duckdb
- Trigger SyncOrchestrator.rebuild_source("jira") after successful transform
2026-03-30 20:19:27 +02:00
ZdenekSrotyr
1bf97c725c feat: wire orchestrator into API — replace DataSyncManager
sync.py: _run_sync() now calls extractor + SyncOrchestrator.rebuild()
data.py: parquet lookup searches /data/extracts/ first, legacy fallback
catalog.py: list tables from DuckDB table_registry instead of src.config
admin.py: discover-tables uses KeboolaClient directly, remove old TableRegistry dep
2026-03-30 20:16:33 +02:00
ZdenekSrotyr
18e5f0b6e8 feat: implement extract.duckdb contract — orchestrator + extractors
Phase 0: extend table_registry schema (v1→v2 migration), add
source_type/bucket/source_table/query_mode columns.

Phase 1: SyncOrchestrator ATTACHes extract.duckdb files into master
analytics.duckdb. Keboola extractor uses DuckDB extension with
legacy client fallback. BigQuery extractor is remote-only via
DuckDB BQ extension (no data download).

62 tests passing.
2026-03-30 20:12:56 +02:00
ZdenekSrotyr
0b9720d090 docs: rewrite core refactoring spec v2 — simplified extract.duckdb contract 2026-03-30 19:24:19 +02:00
ZdenekSrotyr
9ee7b3bd09 docs: add core refactoring design spec — DuckDB-centric extract architecture 2026-03-30 18:15:52 +02:00
ZdenekSrotyr
a4944dba4a feat: auto-generate JWT secret in Terraform, remove manual variable 2026-03-30 16:03:19 +02:00
ZdenekSrotyr
b6a94add67 feat: add Terraform config for GCP deployment
- GCE e2-small with Ubuntu 24.04 + Docker
- Static IP, firewall rules, SSD boot disk
- Startup script: installs Docker, clones repo, creates .env, starts compose
- Outputs: IP, SSH command, API URL, bootstrap command, CLI setup
- ~7$/month for always-on server
2026-03-30 15:55:26 +02:00
ZdenekSrotyr
7b0a161d3d fix: handle timezone-naive timestamps in health check 2026-03-30 14:19:40 +02:00
ZdenekSrotyr
bca5e91826 feat: add bootstrap endpoint + deploy skill for AI agents
- POST /auth/bootstrap — creates first admin, self-deactivates after
- da setup bootstrap — CLI command for agent-driven setup
- da setup verify — structured health check (JSON output for agents)
- cli/skills/deploy.md — complete deployment guide for AI agents
- 6 bootstrap tests including full agent deployment flow simulation
- 156 total tests passing
2026-03-30 14:01:01 +02:00
ZdenekSrotyr
a74f69d6b1 chore: exclude CI workflow from push (needs workflow scope) 2026-03-27 17:41:27 +01:00
ZdenekSrotyr
0b91d4ac47 feat: complete web UI + auth providers + template compatibility
All 7 web pages rendering (200):
  /login, /dashboard, /catalog, /corporate-memory,
  /corporate-memory/admin, /activity-center, /admin/tables

All 13 API endpoints working (200):
  health, sync, data, query, users, memory, scripts,
  settings, telegram, admin, catalog

Auth providers: Google OAuth, Password (argon2), Email magic link
Cookie-based JWT auth for web UI after OAuth redirect
FlexDict for Flask→FastAPI template compatibility
150 tests passing
2026-03-27 17:34:39 +01:00
ZdenekSrotyr
1a7939c594 feat: add auth providers (Google OAuth, Password, Email magic link) + web UI fixes
- Google OAuth with authlib + auto user creation + cookie-based JWT
- Password auth with argon2 hash + setup token flow
- Email magic link with SMTP/SendGrid support
- Cookie-based auth for web UI (after OAuth redirect)
- Dashboard template compatibility (user_info, activity, desktop status)
- 150 tests passing
2026-03-27 17:07:59 +01:00
ZdenekSrotyr
fb1e60d8e1 fix: fix TemplateResponse API for Starlette compatibility
Use new TemplateResponse(request, name, context) signature.
Add Flask compat shims (get_flashed_messages, url_for, session).
2026-03-27 16:59:04 +01:00
ZdenekSrotyr
1287e63ed9 feat: complete system — web UI, all API endpoints, governance, admin, CLI commands
Major additions:
- Web UI: Jinja2 templates in FastAPI (login, dashboard, catalog, corporate memory, admin)
- API: catalog profiles/metrics, telegram verify/unlink/status, admin table registry CRUD
- Corporate memory governance: approve/reject/mandate/revoke/edit/batch + audit log
- Sync: real DataSyncManager trigger, sync-settings, table-subscriptions
- CLI: setup (init/test/deploy/verify), server (logs/restart/deploy/backup), explore
- Instance config integration (instance.yaml loaded at startup)
- 140 tests passing (25 new)
2026-03-27 16:52:22 +01:00
ZdenekSrotyr
c5527ec153 fix: harden script sandbox and SQL query security
Fixes found by E2E QA agent:
- Script sandbox: block os, sys, socket, eval, exec, open, __import__,
  getattr, pathlib and 20+ other dangerous patterns
- SQL query: block COPY, ATTACH, read_csv, semicolons, non-SELECT
- Added 24 security tests covering all attack vectors
2026-03-27 16:11:05 +01:00
ZdenekSrotyr
07b396bfe2 docs: add refactoring plan, design spec, and gitignore updates 2026-03-27 15:42:57 +01:00
ZdenekSrotyr
e0ce91ddb9 feat: add dataset permissions, script execution, Kamal config, CI/CD
- SyncSettingsRepository + DatasetPermissionRepository with RBAC
- Script deploy/run/undeploy API with import sandboxing
- User sync settings API with permission checks
- 4 CLI skills (connectors, security, notifications, corporate-memory)
- Kamal production + staging configs
- GitHub Actions CI + deploy workflows
- 91 total tests passing
2026-03-27 15:40:11 +01:00
ZdenekSrotyr
3701130a11 feat: add Docker, CLI tool, scheduler, and agent skills
- Dockerfile (uv-based) + docker-compose.yml (3 services)
- CLI tool 'da' with commands: auth, sync, query, status, admin, diagnose, skills
- Scheduler sidecar service (replaces systemd timers)
- pyproject.toml for uv distribution
- Built-in skills (setup, troubleshoot) for AI agents
- 17 CLI tests, 75 total tests passing
2026-03-27 15:30:03 +01:00
ZdenekSrotyr
a3918d3833 feat: add FastAPI server with auth, RBAC, and all API endpoints
- JWT auth with role-based access control (viewer/analyst/admin/km_admin)
- Endpoints: health, sync manifest, data download, query, users CRUD,
  corporate memory, session/artifact upload
- 18 API tests covering auth, RBAC, all endpoints
2026-03-27 15:19:18 +01:00
ZdenekSrotyr
64acc8d731 feat: add JSON to DuckDB migration script with tests 2026-03-27 15:09:06 +01:00
ZdenekSrotyr
79b0b66f2e feat: add DuckDB state layer with all repository classes
- src/db.py: schema with 14 tables matching design spec
- 7 repository classes: SyncState, Users, Knowledge, Audit,
  Telegram, PendingCode, Script, TableRegistry, Profiles
- 37 tests covering all CRUD operations
2026-03-27 15:06:55 +01:00
ZdenekSrotyr
f76411c603 feat: add DuckDB state layer with schema management
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 13:55:54 +01:00
Petr
eb7e5bdf8f Add data freshness indicators and remote table visibility to UI
- Fix sync_state.json parsing: derive last_updated from table last_sync
  timestamps when root-level field is missing (flat format support)
- Parse ALL YAML blocks from data_description.md (was only first block)
- Show remote tables (daily_deal_traffic) in catalog with "Live" badge
- Show per-table sync timestamps and Local/Live query mode badges
- Add data freshness note to Business Metrics section
- Dashboard: fix "Not yet synced" bug, show local/live table breakdown
2026-03-25 16:24:26 +01:00
Petr
a667b4e32f Fix profiler crash for remote-only tables without primary_key
Same issue as config.py - profiler's TableInfo and parser required
primary_key and sync_strategy, breaking auto-profile after sync
when daily_deal_traffic (remote-only) is in config.
2026-03-25 14:47:00 +01:00
Petr
4ebb3fc7b2 Fix data sync crash: make primary_key and sync_strategy optional
Remote-only tables (query_mode="remote") like daily_deal_traffic
don't need primary_key or sync_strategy. The parser used hard
lookups (table_data["primary_key"]) causing KeyError and breaking
all data sync since 2026-03-21.

Changes:
- TableConfig: default primary_key="" and sync_strategy="none"
- Parser: use .get() with defaults instead of [] lookups
- Validator: add "none" as valid sync_strategy
2026-03-25 14:43:22 +01:00