ZdenekSrotyr
79443e0df4
fix: CSV all_varchar in legacy extractor, rewrite DEPLOYMENT.md from real deploy
...
- Legacy extractor now uses read_csv(all_varchar=true) to avoid type
inference errors (e.g. seniority column typed as DOUBLE with string values)
- DEPLOYMENT.md rewritten based on actual dev VM deployment experience:
deploy key setup, DuckDB write locking, env reload gotchas, bootstrap flow
2026-04-08 19:09:55 +02:00
ZdenekSrotyr
2635f77974
ci: add CI test suite + deploy pipeline
...
- ci.yml: runs 607 tests + Docker build on push/PR
- deploy.yml: tests → build → GHCR push → Kamal deploy on main
2026-04-08 18:24:05 +02:00
ZdenekSrotyr
cfa08c4b4c
chore: remove obsolete CI workflows (deploy-guard, deploy.yml.example)
...
deploy-guard.yml referenced deleted tests and sudoers files.
deploy.yml.example used legacy SSH-based deployment.
The updated ci.yml and deploy.yml are in .gitignore for now (pushing them needs the workflow scope).
2026-04-08 18:16:48 +02:00
ZdenekSrotyr
3ba207a7f8
feat: add _remote_attach to BigQuery extractor, support token-less ATTACH in orchestrator
...
BigQuery extension handles auth via GOOGLE_APPLICATION_CREDENTIALS env var,
so _remote_attach uses empty token_env. Orchestrator now supports both
token-based (Keboola) and env-based (BigQuery) authentication modes.
2026-04-08 18:13:31 +02:00
ZdenekSrotyr
06e1cf0a8d
feat: generic _remote_attach contract for remote DuckDB extension views
...
Extractors with remote tables now write a _remote_attach table into
extract.duckdb so the orchestrator can re-ATTACH external extensions
at query time. The mechanism is source-agnostic — any connector can use it.
- Keboola extractor writes _remote_attach + creates views on kbc.*
- Orchestrator reads _remote_attach, installs extension, reads token from env
- Graceful degradation: missing token → warning, local tables still work
2026-04-08 18:10:12 +02:00
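The two authentication modes and the missing-token fallback can be sketched roughly as below. This is an illustration only, not the orchestrator's code: the `RemoteAttach` row shape, every field name except `token_env`, and the `resolve_token` helper are assumptions made for the example.

```python
import logging
import os
from typing import Mapping, NamedTuple, Optional, Tuple

log = logging.getLogger("orchestrator")


class RemoteAttach(NamedTuple):
    """One row of a _remote_attach table (field names are hypothetical)."""
    alias: str
    extension: str
    url: str
    token_env: str  # empty string: the extension authenticates on its own


def resolve_token(
    row: RemoteAttach, env: Mapping[str, str] = os.environ
) -> Tuple[bool, Optional[str]]:
    """Return (should_attach, token) for a single remote source."""
    if not row.token_env:
        # Env-based mode, e.g. BigQuery via GOOGLE_APPLICATION_CREDENTIALS.
        return True, None
    token = env.get(row.token_env)
    if not token:
        # Graceful degradation: skip this source, local tables keep working.
        log.warning("%s not set; skipping remote source %s", row.token_env, row.alias)
        return False, None
    return True, token
```

A caller would ATTACH only the rows for which the first element is true, so one missing token never blocks the rest of the rebuild.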
ZdenekSrotyr
ee7d5630ef
fix: keep external_access enabled — views need read_parquet on local files
...
File access attacks blocked by SQL blocklist instead of DuckDB pragma
(pragma also blocks legitimate view resolution via read_parquet).
2026-04-08 12:33:05 +02:00
ZdenekSrotyr
f2f9a62803
fix: set enable_external_access=false AFTER ATTACHing extracts
2026-04-08 12:29:27 +02:00
ZdenekSrotyr
6efdf4ca64
fix: read-only analytics DB ATTACHes extract.duckdb files for view resolution
2026-04-08 12:27:12 +02:00
ZdenekSrotyr
a0f7e98f11
chore: add auth/ to gitignore (legacy dir, not tracked)
2026-04-08 12:12:37 +02:00
ZdenekSrotyr
92fbb88c15
chore: Docker prod config (Python 3.13, no reload), fix utcnow deprecation, update docs
2026-04-08 12:10:47 +02:00
ZdenekSrotyr
05a1b452e9
security: harden query (read-only DB), uploads (path sanitization), scripts (AST validation)
2026-04-08 12:09:19 +02:00
ZdenekSrotyr
224635b88d
security: fix auth (argon2, cookie, JWT), CORS, session middleware, pyproject.toml
2026-04-08 12:08:52 +02:00
ZdenekSrotyr
d5659d7091
fix: login page uses login_buttons format expected by template
2026-04-08 07:11:03 +02:00
ZdenekSrotyr
67a1e0bb45
feat: Jira webhook FastAPI adapter — replaces Flask Blueprint
2026-04-08 07:04:50 +02:00
ZdenekSrotyr
3e3f84a00e
feat: dynamic login providers + profiler auto-trigger + refresh endpoint
2026-04-08 07:04:40 +02:00
ZdenekSrotyr
4bad893cb8
feat: Docker services (ws-gateway, corporate-memory, session-collector) + scheduler auto-auth
2026-04-08 07:04:26 +02:00
ZdenekSrotyr
bae9619363
fix: restore authlib + argon2-cffi — needed by FastAPI auth providers
...
Google OAuth uses authlib, password auth uses argon2. These were
incorrectly removed as 'legacy' but are used by app/auth/providers/.
2026-03-31 19:23:24 +02:00
ZdenekSrotyr
5ee12d78e7
refactor: final cleanup — delete legacy auth, clean deps, fix hash, migrate to uv
...
- Delete root auth/ directory (legacy Flask providers, orphaned)
- Clean requirements.txt: remove Flask, gunicorn, authlib, sendgrid,
anthropic, openai, argon2-cffi (9 unused deps)
- Fix hash computation in orchestrator: MD5 of parquet mtime+size
(CLI sync now skips unchanged tables correctly)
- Migrate pip → uv in CLAUDE.md, scripts/init.sh, pyproject.toml
- Sync pyproject.toml dependencies with requirements.txt
578 tests passing.
2026-03-31 19:18:30 +02:00
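A hash built from parquet mtime and size might look like the sketch below. The commit names only the inputs, so the `mtime:size` key layout and the `parquet_fingerprint` name are assumptions.

```python
import hashlib
from pathlib import Path


def parquet_fingerprint(path: Path) -> str:
    """MD5 over the file's mtime and size; the file content is never read."""
    stat = path.stat()
    # The "mtime:size" layout is an assumption for this example.
    key = f"{stat.st_mtime_ns}:{stat.st_size}"
    return hashlib.md5(key.encode("ascii")).hexdigest()
```

The fingerprint is cheap because it depends only on `stat()` metadata, which is what lets a sync client skip unchanged tables without reading them.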
ZdenekSrotyr
2b7348a773
fix: sync only extracts local tables, skips remote
...
Was using list_by_source() which returns all tables including remote.
Now uses list_local() to skip query_mode='remote' tables.
2026-03-31 15:35:49 +02:00
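The filtering rule amounts to something like the following. The real `list_local()` is presumably a repository method; this standalone function over plain dicts is only a stand-in to show the `query_mode` check.

```python
from typing import Dict, List


def list_local(tables: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only tables that are extracted locally (query_mode != 'remote')."""
    return [t for t in tables if t.get("query_mode") != "remote"]
```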
ZdenekSrotyr
8f3a342108
fix: sync logs via stderr for docker compose visibility
2026-03-31 14:05:01 +02:00
ZdenekSrotyr
7612385ed6
fix: extractor subprocess reads table configs via stdin, not DuckDB
...
Subprocess cannot open system.duckdb (main process holds lock).
Now main process reads table_registry and passes configs as JSON
via stdin to subprocess. Subprocess never touches system.duckdb.
2026-03-31 13:57:02 +02:00
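Passing table configs over stdin can be sketched as below. `CHILD_CODE` is a stand-in for the real extractor entry point, and `run_extractor` and the 60-second timeout are hypothetical choices for the example.

```python
import json
import subprocess
import sys

# Stand-in for the extractor: reads JSON from stdin and reports how many
# table configs it received. It never opens any database file.
CHILD_CODE = "import json, sys; print(len(json.load(sys.stdin)))"


def run_extractor(table_configs: list) -> str:
    """Serialize configs in the parent and hand them to the child via stdin."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        input=json.dumps(table_configs),
        capture_output=True,
        text=True,
        check=True,   # raise if the child exits non-zero
        timeout=60,   # arbitrary value for this sketch
    )
    return proc.stdout.strip()
```

Only the parent process needs a handle on system.duckdb in this arrangement; the child works from the JSON payload alone.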
ZdenekSrotyr
4d1acd014a
refactor: remove legacy webapp + add missing tests + housekeeping
...
Phase A: Close fixed issues (#7, #8, #9), add server/ and user/ to
.gitignore, increase extractor timeout to 30 min.
Phase B: Add 10 new tests — access request lifecycle (4), CLI admin
commands (5), sync subprocess trigger (1). 578 tests passing.
Phase C: Delete entire webapp/ directory (24,800 lines) — legacy Flask
app fully replaced by FastAPI app/. Fix auth providers to use
app.instance_config instead of webapp.config. Update CLAUDE.md.
Delete 6 webapp-only test files. Fix Jira service config imports.
2026-03-31 13:44:06 +02:00
ZdenekSrotyr
6aee6cf454
fix: CLI sync downloads tables with empty hash (not yet computed)
...
Empty server hash was matching empty local hash, skipping all tables.
Now treats empty hash or missing local entry as 'needs download'.
2026-03-31 13:30:10 +02:00
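The corrected comparison can be expressed as a small predicate. The function name and signature are assumptions; only the rule itself comes from the commit message.

```python
from typing import Optional


def needs_download(server_hash: Optional[str], local_hash: Optional[str]) -> bool:
    """Decide whether the CLI should fetch a table."""
    # An empty or missing hash on either side carries no information,
    # so it must never count as "unchanged".
    if not server_hash or not local_hash:
        return True
    return server_hash != local_hash
```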
ZdenekSrotyr
2d6a94fb6f
fix: DuckDB concurrency — WAL mode, subprocess sync, temp+rename
...
Three-pronged fix for DuckDB lock conflicts:
1. WAL mode on system.duckdb — enables concurrent readers + writer
2. Sync trigger runs extractor as subprocess (not background task) —
separate process = separate DuckDB connections, no lock conflict
3. Both extractor and orchestrator write to .tmp then atomic rename —
avoids lock conflict with API reads on extract.duckdb/analytics.duckdb
Fixes #9 permanently.
2026-03-31 13:19:57 +02:00
ZdenekSrotyr
10d9280ab5
fix: extractor writes to temp file to avoid lock with orchestrator
...
Writes extract.duckdb.tmp then renames atomically, avoiding DuckDB lock
conflict when orchestrator holds a read connection on extract.duckdb.
2026-03-31 13:09:51 +02:00
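The temp-then-rename pattern looks roughly like this. The sketch writes raw bytes for brevity, whereas the extractor builds a DuckDB file at the temporary path; `write_atomically` is a hypothetical name.

```python
import os
from pathlib import Path


def write_atomically(target: Path, payload: bytes) -> None:
    """Write to '<name>.tmp' next to the target, then swap it into place."""
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_bytes(payload)
    # os.replace overwrites an existing target and is atomic when both
    # paths are on the same filesystem.
    os.replace(tmp, target)
```

Keeping the temporary file in the same directory as the target matters, because a rename across filesystems is no longer a single atomic operation.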
ZdenekSrotyr
675a29c1c7
fix: DuckDB connection pool — shared connection avoids lock conflicts
...
Fixes #9 — background sync tasks could not access system.duckdb
because FastAPI held an exclusive lock. Now uses single shared
connection per DATA_DIR with cursor() for thread safety.
2026-03-31 13:01:04 +02:00
ZdenekSrotyr
04fa1402e4
feat: CLI admin commands — register-table, discover-and-register, list-tables
...
da admin register-table: register single table
da admin discover-and-register: auto-discover from Keboola API + bulk register
da admin list-tables: show all registered tables
Used to register all 142 Keboola tables on production.
2026-03-31 12:55:03 +02:00
ZdenekSrotyr
2e7d5d1fe9
feat: access request UI — catalog badges, request modal, admin approval page
...
Backend:
- access_requests table in DuckDB schema
- AccessRequestRepository with create/approve/deny/list
- API: POST/GET /api/access-requests (submit, my requests, pending, approve, deny)
UI:
- Catalog: lock icon on private tables, "Request Access" button + modal
- Catalog: "Pending" badge for tables with pending requests
- Admin permissions page (/admin/permissions): approve/deny requests,
grant/revoke permissions, view all user permissions
- Cross-navigation between admin/tables and admin/permissions
733 tests passing.
2026-03-31 12:45:29 +02:00
ZdenekSrotyr
1074d5ec49
feat: implement data access control — table-level permissions
...
Schema v3: add is_public column to table_registry (default true).
src/rbac.py: can_access_table() checks admin bypass, public flag,
explicit permissions, wildcard bucket permissions.
API enforcement:
- manifest: filters tables by user access
- download: 403 if no access
- catalog: filters table list
- query: validates referenced tables against allowed list
New admin permissions API (/api/admin/permissions) for grant/revoke.
28 access control tests + 733 total tests passing.
2026-03-31 12:33:31 +02:00
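The check order described for can_access_table() could be sketched as follows. The `User` and `Table` shapes, the in-memory `grants` set, and the `<bucket>.*` wildcard format are assumptions; the real function in src/rbac.py reads from DuckDB and may have a different signature.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class User:
    email: str
    is_admin: bool = False
    # Grants are table IDs or "<bucket>.*" wildcards (format is an assumption).
    grants: Set[str] = field(default_factory=set)


@dataclass
class Table:
    table_id: str
    bucket: str
    is_public: bool = True  # matches the schema v3 default


def can_access_table(user: User, table: Table) -> bool:
    if user.is_admin:                      # 1. admin bypass
        return True
    if table.is_public:                    # 2. public flag
        return True
    if table.table_id in user.grants:      # 3. explicit permission
        return True
    return f"{table.bucket}.*" in user.grants  # 4. wildcard bucket permission
```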
ZdenekSrotyr
78f003f5b5
fix: reject empty table name in register-table endpoint
...
Fixes #8 — empty name created orphaned record that couldn't be deleted.
2026-03-31 12:18:58 +02:00
ZdenekSrotyr
bd0b6d19c6
fix: legacy extractor constructs full Keboola table ID from bucket+source_table
...
Was using tc['id'] which is the registry ID (e.g. 'circle'), not the
full Keboola ID (e.g. 'in.c-finance.circle') needed by the API.
2026-03-31 12:06:38 +02:00
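Building the full ID from the registry fields might look like this. The dot-joined format is inferred from the example in the commit message, and `full_table_id` is a hypothetical helper name.

```python
from typing import Mapping


def full_table_id(tc: Mapping[str, str]) -> str:
    """Join bucket and source_table into the ID the Keboola API expects."""
    return f"{tc['bucket']}.{tc['source_table']}"
```

With `{"id": "circle", "bucket": "in.c-finance", "source_table": "circle"}` this yields `in.c-finance.circle` rather than the bare registry ID.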
ZdenekSrotyr
0084f80ff6
fix: legacy extractor passes Path to export_table, not str
...
Fixes 'str' object has no attribute 'parent' when Keboola DuckDB
extension falls back to legacy client.
2026-03-31 12:03:16 +02:00
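The error arises because `str` has no `.parent`, while `pathlib.Path` does. A caller-side normalization could look like the sketch below; `prepare_export_path` is a hypothetical helper, not part of the project.

```python
from pathlib import Path
from typing import Union


def prepare_export_path(output: Union[str, Path]) -> Path:
    """Coerce to Path so that .parent is available, then ensure the dir exists."""
    path = Path(output)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path
```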
ZdenekSrotyr
865d6d657e
fix: keboola client metadata_cache_path uses DATA_DIR instead of deleted config
...
Fixes #7 — NameError: name 'config' is not defined
2026-03-31 11:57:57 +02:00
ZdenekSrotyr
04c5aecc58
fix: update Terraform for extract.duckdb architecture
...
- Create /data/extracts instead of /data/src_data/parquet
- Add admin_email variable for SEED_ADMIN_EMAIL
2026-03-31 09:49:32 +02:00
ZdenekSrotyr
e1e2d6d903
feat: add SEED_ADMIN_EMAIL for Docker test environments
...
app/main.py: seed admin user on startup when SEED_ADMIN_EMAIL is set.
docker-compose.test.yml: expose port 8000, add seed env var.
2026-03-31 09:48:12 +02:00
ZdenekSrotyr
617e724d21
feat: add E2E test suite — API, extractor, Docker
...
tests/conftest.py: shared fixtures (e2e_env, seeded_app, create_mock_extract)
tests/test_e2e_api.py: 11 tests — full sync flow, RBAC, table lifecycle
tests/test_e2e_extract.py: 6 tests — Keboola/BQ/Jira pipelines, multi-source, corrupt handling
tests/test_e2e_docker.py: 3 tests — Docker health + full flow (opt-in via -m docker)
Fix admin update route (duplicate id kwarg, .dict() → .model_dump()).
705 tests passing.
2026-03-31 08:18:54 +02:00
ZdenekSrotyr
b0eaef88cc
refactor: delete old server infra — 4,200 lines removed
...
Remove all legacy deployment infrastructure replaced by Docker + Kamal:
- server/ directory (deploy.sh, setup.sh, webapp-setup.sh, sudoers,
nginx config, systemd units, bin scripts)
- scripts/sync_data.sh (replaced by da sync + API)
- All services/*/systemd/ files (replaced by docker-compose)
- tests/test_deploy_guard.py and tests/test_sync_data.py
688 tests passing.
2026-03-31 08:06:41 +02:00
ZdenekSrotyr
caa60a507d
feat: add centralized RBAC module — replace Linux group auth
...
New src/rbac.py: Role enum, hierarchy, get_user_role(), has_role(),
is_admin(), is_km_admin(), has_dataset_access(), set_user_role().
webapp/auth.py: admin_required + km_admin_required now use DuckDB
roles instead of Linux groups (pwd.getpwnam + sudo/data-ops check).
app/auth/dependencies.py: imports Role from src/rbac.py (single source).
11 RBAC tests passing.
2026-03-31 08:04:35 +02:00
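A role hierarchy of this kind can be modelled with an ordered enum. The role names below (other than the admin and KM admin roles implied by the function names) and their ordering are assumptions, and the real module resolves a user's role from DuckDB rather than taking it as an argument.

```python
from enum import IntEnum


class Role(IntEnum):
    # Names and ordering are illustrative; higher value means more privilege.
    VIEWER = 1
    ANALYST = 2
    KM_ADMIN = 3
    ADMIN = 4


def has_role(user_role: Role, required: Role) -> bool:
    """A role satisfies every requirement at or below its own level."""
    return user_role >= required


def is_admin(user_role: Role) -> bool:
    return has_role(user_role, Role.ADMIN)


def is_km_admin(user_role: Role) -> bool:
    return has_role(user_role, Role.KM_ADMIN)
```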
ZdenekSrotyr
9fef90a729
docs: rewrite CLAUDE.md for extract.duckdb architecture
...
Update project structure, architecture diagram, key implementation
details, development commands, and extensibility docs.
Add extract service to docker-compose.yml for one-shot extraction.
2026-03-31 07:52:44 +02:00
ZdenekSrotyr
b502bd8bdd
refactor: delete old sync pipeline — 9,500 lines removed
...
Phase 5 cleanup: remove all code replaced by extract.duckdb architecture.
Deleted modules:
- src/config.py (653) — replaced by DuckDB table_registry
- src/parquet_manager.py (755) — replaced by DuckDB COPY TO
- src/data_sync.py (734) — replaced by SyncOrchestrator
- src/remote_query.py (636) — replaced by DuckDB BigQuery ATTACH
- src/table_registry.py (464) — replaced by DuckDB repository
- connectors/keboola/adapter.py (820) — replaced by extractor.py
- connectors/bigquery/adapter.py (665) — replaced by extractor.py
- connectors/bigquery/client.py (644) — replaced by DuckDB BQ extension
Updated all imports in webapp, catalog_export, enricher, router,
sync_settings_service, generate_sample_data. Kept keboola/client.py
as fallback (removed src.config dependency).
704 tests passing.
2026-03-31 07:50:37 +02:00
ZdenekSrotyr
9f20529f10
fix: resolve 7 preexisting test failures
...
- Remove iCloud duplicate files (test_db 2.py, src/db 2.py)
- Fix metrics expression fallback to top-level field in transformer + webapp
- Fix sync_data.sh rsync exception pattern for $SSH_HOST variable
- Fix deploy_guard cp regex to skip shell variable expansions
- Update sudoers-deploy with missing root:data-ops rules
- Update CRITICAL_DIRS ownership expectations to match deploy.sh reality
913 tests passing, 0 failures.
2026-03-30 20:36:00 +02:00
ZdenekSrotyr
e2a7ee21a2
fix: Jira extract_init handles empty parquet dirs gracefully
...
DuckDB read_parquet glob fails when no files match. Skip view creation
for tables without parquet files, create views only after first write.
2026-03-30 20:28:29 +02:00
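The guard can be sketched as a pre-check that runs before any view is created. The one-directory-per-table layout and the `tables_ready_for_views` name are assumptions for this example.

```python
from pathlib import Path
from typing import Iterable, List


def tables_ready_for_views(data_dir: Path, tables: Iterable[str]) -> List[str]:
    """Return only the tables that already have at least one parquet file."""
    ready = []
    for name in tables:
        table_dir = data_dir / name
        # Skip missing or empty directories so no glob is handed to
        # read_parquet without a matching file.
        if table_dir.is_dir() and any(table_dir.glob("*.parquet")):
            ready.append(name)
    return ready
```

Views for the skipped tables would then be created later, once the first parquet file has been written.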
ZdenekSrotyr
8bc1fceb52
feat: add migration scripts for extract.duckdb transition
...
migrate_registry_to_duckdb.py: imports tables from data_description.md
or table_registry.json into DuckDB table_registry with source columns.
migrate_parquets_to_extracts.py: copies parquets to /data/extracts/
and creates extract.duckdb with _meta + views.
2026-03-30 20:21:12 +02:00
ZdenekSrotyr
e058c71777
feat: adapt Jira connector to extract.duckdb format
...
- New extract_init.py: creates extract.duckdb with _meta + views for 6 entity types
- Update default paths to /data/extracts/jira/data/ and /data/extracts/jira/raw/
- After parquet writes, update _meta table in extract.duckdb
- Trigger SyncOrchestrator.rebuild_source("jira") after successful transform
2026-03-30 20:19:27 +02:00
ZdenekSrotyr
1bf97c725c
feat: wire orchestrator into API — replace DataSyncManager
...
sync.py: _run_sync() now calls extractor + SyncOrchestrator.rebuild()
data.py: parquet lookup searches /data/extracts/ first, legacy fallback
catalog.py: list tables from DuckDB table_registry instead of src.config
admin.py: discover-tables uses KeboolaClient directly, remove old TableRegistry dep
2026-03-30 20:16:33 +02:00
ZdenekSrotyr
18e5f0b6e8
feat: implement extract.duckdb contract — orchestrator + extractors
...
Phase 0: extend table_registry schema (v1→v2 migration), add
source_type/bucket/source_table/query_mode columns.
Phase 1: SyncOrchestrator ATTACHes extract.duckdb files into master
analytics.duckdb. Keboola extractor uses DuckDB extension with
legacy client fallback. BigQuery extractor is remote-only via
DuckDB BQ extension (no data download).
62 tests passing.
2026-03-30 20:12:56 +02:00
ZdenekSrotyr
0b9720d090
docs: rewrite core refactoring spec v2 — simplified extract.duckdb contract
2026-03-30 19:24:19 +02:00
ZdenekSrotyr
9ee7b3bd09
docs: add core refactoring design spec — DuckDB-centric extract architecture
2026-03-30 18:15:52 +02:00
ZdenekSrotyr
a4944dba4a
feat: auto-generate JWT secret in Terraform, remove manual variable
2026-03-30 16:03:19 +02:00
ZdenekSrotyr
b6a94add67
feat: add Terraform config for GCP deployment
...
- GCE e2-small with Ubuntu 24.04 + Docker
- Static IP, firewall rules, SSD boot disk
- Startup script: installs Docker, clones repo, creates .env, starts compose
- Outputs: IP, SSH command, API URL, bootstrap command, CLI setup
- ~$7/month for always-on server
2026-03-30 15:55:26 +02:00