Commit graph

112 commits

Author SHA1 Message Date
ZdenekSrotyr
a65de8574e feat: add import_from_yaml and export_to_yaml to MetricRepository
Adds YAML-based bulk import/export to MetricRepository, supporting
list-wrapped and plain-dict YAML formats, table→table_name field
mapping, and sql_by_* → sql_variants collection (and reverse on export).
All 24 tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 19:25:11 +02:00
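The field mapping described in a65de8574e could look roughly like the sketch below. It is not the repository's code: the function names are hypothetical, and keying sql_variants by the suffix after sql_by_ is an assumption. It works on already-parsed dicts, so no YAML library is needed.

```python
def normalize_metric_entry(entry: dict) -> dict:
    """Map one parsed YAML metric entry onto repository field names."""
    normalized = {}
    sql_variants = {}
    for key, value in entry.items():
        if key == "table":
            # table -> table_name, as described in the commit message
            normalized["table_name"] = value
        elif key.startswith("sql_by_"):
            # Assumption: the variant key is whatever follows "sql_by_"
            sql_variants[key[len("sql_by_"):]] = value
        else:
            normalized[key] = value
    if sql_variants:
        normalized["sql_variants"] = sql_variants
    return normalized


def denormalize_metric_entry(row: dict) -> dict:
    """Reverse mapping used on export (sql_variants -> sql_by_* keys)."""
    exported = {}
    for key, value in row.items():
        if key == "table_name":
            exported["table"] = value
        elif key == "sql_variants":
            for variant, sql in (value or {}).items():
                exported[f"sql_by_{variant}"] = sql
        else:
            exported[key] = value
    return exported
```

Under these assumptions an entry survives an import followed by an export unchanged.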
ZdenekSrotyr
88d536ca29 feat: add MetricRepository with full CRUD and search for metric_definitions
Implements MetricRepository following the table_registry pattern — raw SQL,
dict returns, ON CONFLICT upsert, and json.dumps for sql_variants/validation.
Includes 18 tests covering create, read, list, update, delete, find_by_table,
find_by_synonym, and get_table_map.
2026-04-10 19:21:25 +02:00
ZdenekSrotyr
bc394bd266 feat: schema migration v3→v4 with metric_definitions and column_metadata tables
Add SCHEMA_VERSION = 4, _V3_TO_V4_MIGRATIONS list, and if current < 4 block
in _ensure_schema(). Both new tables are also added to _SYSTEM_SCHEMA for
fresh installs. Tests cover fresh install, all columns, and v3→v4 migration path.
2026-04-10 19:14:32 +02:00
ZdenekSrotyr
dc8a9275e6 fix: address Devin review round 3 — retry exhaustion, discover path, WAL snapshot
- CalVer retry loop now exits with error if all 5 attempts fail
  (prevents pushing Docker image with unclaimed version tag)
- discover_tables endpoint reads data_source.keboola.url (consistent
  with configure_instance and _discover_and_register_tables)
- Pre-migration snapshot flushes WAL via CHECKPOINT before copying
  and copies .wal file if it still exists after flush

663 tests pass.
2026-04-10 14:11:17 +02:00
ZdenekSrotyr
6c53082295 feat: multi-instance deployment — all 14 must-have items from spec
CalVer CI (release.yml) with stable/dev channels, health endpoint
with version/channel/schema_version, JWT secret auto-generation with
file persistence, smoke test script + Docker-in-CI, pre-migration
snapshot, /api/admin/configure for headless setup, /api/admin/
discover-and-register, /setup wizard, OpenAPI snapshot test, custom
connector mount support, CHANGELOG, migration safety tests, startup
banner.

663 tests pass (6 new migration safety + 3 OpenAPI snapshot + 1
updated JWT test).
2026-04-10 11:57:42 +02:00
ZdenekSrotyr
471982d3f9 fix: route admin_edit through KnowledgeRepository.update instead of raw SQL
2026-04-09 18:42:52 +02:00
ZdenekSrotyr
30987eef16 fix: add union_by_name=true to read_parquet calls in profiler
Handles schema evolution across partitions when profiling tables
with multiple parquet files that may have different column sets.
2026-04-09 18:42:33 +02:00
ZdenekSrotyr
fa30298589 fix: use DATA_DIR env var instead of hardcoded /data paths
- services/telegram_bot/config.py: NOTIFICATIONS_DIR now uses DATA_DIR fallback
- src/profiler.py: DATA_DIR now uses main DATA_DIR env var instead of PROFILER_DATA_DIR
- services/telegram_bot/dispatch.py: WS_GATEWAY_SOCKET_PATH now uses WS_GATEWAY_SOCKET env var
2026-04-09 16:39:44 +02:00
ZdenekSrotyr
53a9e838f9 feat: add graceful shutdown handler
- Add close_system_db() function in src/db.py to cleanly close shared DB connection
- Add lifespan context manager in app/main.py to trigger shutdown on app exit
- Integrate lifespan into FastAPI app initialization
- All API tests pass (77/77)
2026-04-09 07:03:45 +02:00
ZdenekSrotyr
1b219cabe9 fix: remove dead PRAGMA enable_wal code
DuckDB has used WAL by default since v0.8, so this pragma is not
valid DuckDB syntax. Removed obsolete try-except block that attempted
to enable WAL on system database initialization.
2026-04-09 06:59:57 +02:00
ZdenekSrotyr
3e9c347cf1 fix: validate extract dir name in get_analytics_db_readonly to prevent SQL injection
Adds _SAFE_IDENTIFIER regex guard before ATTACHing extract.duckdb files in the
read-only analytics connection, matching the same fix already applied in the
orchestrator. Adds test coverage for malicious directory names.
2026-04-09 06:57:31 +02:00
ZdenekSrotyr
e425d4baa5 fix: handle WAL files in atomic swap to prevent DB corruption
Add _atomic_swap_db helper that removes stale WAL files before and after
moving the temp DuckDB into place. Apply CHECKPOINT before close in both
orchestrator and Keboola extractor so DuckDB flushes WAL before the swap.
2026-04-09 06:57:29 +02:00
ZdenekSrotyr
23ae6a602c security: harden query endpoint SQL blocklist and disable external access
Expand blocked keywords to cover parquet_scan, read_csv_auto, query_table,
iceberg_scan, delta_scan, call, URL schemes (http/https/s3/gcs), and
additional file-scan functions. Set enable_external_access=false on the
non-read-only analytics connection path. Add three new tests covering
parquet_scan, read_csv_auto, and query_table blocking.
2026-04-09 06:54:58 +02:00
ZdenekSrotyr
0d3ab5060c fix: reject unsafe SQL identifiers in orchestrator
Adds _validate_identifier() with ^[a-zA-Z_][a-zA-Z0-9_]{0,63}$ regex and
applies it to source_name (directory names), table_name (_meta rows), and
alias/extension (_remote_attach rows) before any SQL interpolation.
Adds two tests covering SQL-injection directory names and malicious _meta entries.
2026-04-09 06:51:07 +02:00
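A minimal sketch of the identifier guard from 0d3ab5060c, using the regex quoted in the commit message. The function name and the exception type are assumptions, not taken from the repository.

```python
import re

# Regex copied from the commit message: a leading letter or underscore,
# then up to 63 more word characters (64 characters in total).
_SAFE_IDENTIFIER = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]{0,63}$")


def validate_identifier(name: str) -> str:
    """Return name unchanged if it is safe to interpolate into SQL, else raise."""
    # fullmatch (not match) also rejects a trailing newline, which "$" alone
    # would let through.
    if not isinstance(name, str) or not _SAFE_IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsafe SQL identifier: {name!r}")
    return name
```

A caller would run every directory name, table name, and alias through this check before building an ATTACH or CREATE VIEW statement.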
ZdenekSrotyr
cb9c566d07 fix: rebuild_source delegates to full rebuild to preserve all source views
_do_rebuild_source was creating a fresh temp DB with only one source,
then atomically replacing analytics.duckdb — wiping views from every
other source. Now it delegates to _do_rebuild so all extract dirs are
re-attached in a single pass.

Adds test_rebuild_source_preserves_other_sources to guard the regression.
2026-04-09 06:48:25 +02:00
ZdenekSrotyr
3ba207a7f8 feat: add _remote_attach to BigQuery extractor, support token-less ATTACH in orchestrator
BigQuery extension handles auth via GOOGLE_APPLICATION_CREDENTIALS env var,
so _remote_attach uses empty token_env. Orchestrator now supports both
token-based (Keboola) and env-based (BigQuery) authentication modes.
2026-04-08 18:13:31 +02:00
ZdenekSrotyr
06e1cf0a8d feat: generic _remote_attach contract for remote DuckDB extension views
Extractors with remote tables now write a _remote_attach table into
extract.duckdb so the orchestrator can re-ATTACH external extensions
at query time. The mechanism is source-agnostic — any connector can use it.

- Keboola extractor writes _remote_attach + creates views on kbc.*
- Orchestrator reads _remote_attach, installs extension, reads token from env
- Graceful degradation: missing token → warning, local tables still work
2026-04-08 18:10:12 +02:00
ZdenekSrotyr
ee7d5630ef fix: keep external_access enabled — views need read_parquet on local files
File access attacks blocked by SQL blocklist instead of DuckDB pragma
(pragma also blocks legitimate view resolution via read_parquet).
2026-04-08 12:33:05 +02:00
ZdenekSrotyr
f2f9a62803 fix: set enable_external_access=false AFTER ATTACHing extracts
2026-04-08 12:29:27 +02:00
ZdenekSrotyr
6efdf4ca64 fix: read-only analytics DB ATTACHes extract.duckdb files for view resolution
2026-04-08 12:27:12 +02:00
ZdenekSrotyr
92fbb88c15 chore: Docker prod config (Python 3.13, no reload), fix utcnow deprecation, update docs
2026-04-08 12:10:47 +02:00
ZdenekSrotyr
05a1b452e9 security: harden query (read-only DB), uploads (path sanitization), scripts (AST validation)
2026-04-08 12:09:19 +02:00
ZdenekSrotyr
5ee12d78e7 refactor: final cleanup — delete legacy auth, clean deps, fix hash, migrate to uv
- Delete root auth/ directory (legacy Flask providers, orphaned)
- Clean requirements.txt: remove Flask, gunicorn, authlib, sendgrid,
  anthropic, openai, argon2-cffi (9 unused deps)
- Fix hash computation in orchestrator: MD5 of parquet mtime+size
  (CLI sync now skips unchanged tables correctly)
- Migrate pip → uv in CLAUDE.md, scripts/init.sh, pyproject.toml
- Sync pyproject.toml dependencies with requirements.txt

578 tests passing.
2026-03-31 19:18:30 +02:00
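The "MD5 of parquet mtime+size" fingerprint mentioned in 5ee12d78e7 might be computed along these lines. The function name and the exact string that gets hashed are assumptions. The point of the sketch is only that an unchanged file yields the same hash without its contents being read.

```python
import hashlib
import os


def file_fingerprint(path: str) -> str:
    """Cheap change detector: hash of modification time and size, not contents."""
    stat = os.stat(path)
    # Assumption: nanosecond mtime and byte size joined with ":"
    payload = f"{stat.st_mtime_ns}:{stat.st_size}".encode()
    return hashlib.md5(payload).hexdigest()
```

A sync client can compare this value with the one it stored last time and skip the download when they match.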
ZdenekSrotyr
4d1acd014a refactor: remove legacy webapp + add missing tests + housekeeping
Phase A: Close fixed issues (#7, #8, #9), add server/ user/ to
.gitignore, increase extractor timeout to 30 min.

Phase B: Add 10 new tests — access request lifecycle (4), CLI admin
commands (5), sync subprocess trigger (1). 578 tests passing.

Phase C: Delete entire webapp/ directory (24,800 lines) — legacy Flask
app fully replaced by FastAPI app/. Fix auth providers to use
app.instance_config instead of webapp.config. Update CLAUDE.md.

Delete 6 webapp-only test files. Fix Jira service config imports.
2026-03-31 13:44:06 +02:00
ZdenekSrotyr
2d6a94fb6f fix: DuckDB concurrency — WAL mode, subprocess sync, temp+rename
Three-pronged fix for DuckDB lock conflicts:

1. WAL mode on system.duckdb — enables concurrent readers + writer
2. Sync trigger runs extractor as subprocess (not background task) —
   separate process = separate DuckDB connections, no lock conflict
3. Both extractor and orchestrator write to .tmp then atomic rename —
   avoids lock conflict with API reads on extract.duckdb/analytics.duckdb

Fixes #9 permanently.
2026-03-31 13:19:57 +02:00
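Point 3 of 2d6a94fb6f (write to .tmp, then rename atomically) can be illustrated with the standard library alone. The helper name and the build callback are hypothetical. The sketch treats the database as a plain file and does not deal with DuckDB WAL sidecar files, which a later commit (e425d4baa5) handles.

```python
import os
from pathlib import Path
from typing import Callable


def atomic_publish(target: Path, build: Callable[[Path], None]) -> None:
    """Build a file next to target, then swap it into place in one step."""
    tmp = target.with_name(target.name + ".tmp")
    try:
        build(tmp)               # the writer only ever touches the temp file
        os.replace(tmp, target)  # atomic when both paths are on one filesystem
    finally:
        # If build() failed, remove the partial temp file; target is untouched.
        if tmp.exists():
            tmp.unlink()
```

Readers that open target see either the old file or the new one, never a half-written file.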
ZdenekSrotyr
675a29c1c7 fix: DuckDB connection pool — shared connection avoids lock conflicts
Fixes #9 — background sync tasks could not access system.duckdb
because FastAPI held an exclusive lock. Now uses single shared
connection per DATA_DIR with cursor() for thread safety.
2026-03-31 13:01:04 +02:00
ZdenekSrotyr
2e7d5d1fe9 feat: access request UI — catalog badges, request modal, admin approval page
Backend:
- access_requests table in DuckDB schema
- AccessRequestRepository with create/approve/deny/list
- API: POST/GET /api/access-requests (submit, my requests, pending, approve, deny)

UI:
- Catalog: lock icon on private tables, "Request Access" button + modal
- Catalog: "Pending" badge for tables with pending requests
- Admin permissions page (/admin/permissions): approve/deny requests,
  grant/revoke permissions, view all user permissions
- Cross-navigation between admin/tables and admin/permissions

733 tests passing.
2026-03-31 12:45:29 +02:00
ZdenekSrotyr
1074d5ec49 feat: implement data access control — table-level permissions
Schema v3: add is_public column to table_registry (default true).

src/rbac.py: can_access_table() checks admin bypass, public flag,
explicit permissions, wildcard bucket permissions.

API enforcement:
- manifest: filters tables by user access
- download: 403 if no access
- catalog: filters table list
- query: validates referenced tables against allowed list

New admin permissions API (/api/admin/permissions) for grant/revoke.

28 access control tests + 733 total tests passing.
2026-03-31 12:33:31 +02:00
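The check order listed for can_access_table() in 1074d5ec49 (admin bypass, public flag, explicit permission, wildcard bucket permission) could be sketched as follows. The User type, the role string, and the "bucket.table" / "bucket.*" grant format are assumptions made for illustration. The real function reads its data from DuckDB.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class User:
    role: str                                      # hypothetical: "admin" or "analyst"
    grants: Set[str] = field(default_factory=set)  # hypothetical grant strings


def can_access_table(user: User, bucket: str, table: str, is_public: bool) -> bool:
    """Evaluate access in the order given in the commit message."""
    if user.role == "admin":              # 1. admin bypass
        return True
    if is_public:                         # 2. public flag (is_public defaults to true)
        return True
    if f"{bucket}.{table}" in user.grants:  # 3. explicit permission
        return True
    return f"{bucket}.*" in user.grants     # 4. wildcard bucket permission
```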
ZdenekSrotyr
caa60a507d feat: add centralized RBAC module — replace Linux group auth
New src/rbac.py: Role enum, hierarchy, get_user_role(), has_role(),
is_admin(), is_km_admin(), has_dataset_access(), set_user_role().

webapp/auth.py: admin_required + km_admin_required now use DuckDB
roles instead of Linux groups (pwd.getpwnam + sudo/data-ops check).

app/auth/dependencies.py: imports Role from src/rbac.py (single source).

11 RBAC tests passing.
2026-03-31 08:04:35 +02:00
ZdenekSrotyr
b502bd8bdd refactor: delete old sync pipeline — 9,500 lines removed
Phase 5 cleanup: remove all code replaced by extract.duckdb architecture.

Deleted modules:
- src/config.py (653) — replaced by DuckDB table_registry
- src/parquet_manager.py (755) — replaced by DuckDB COPY TO
- src/data_sync.py (734) — replaced by SyncOrchestrator
- src/remote_query.py (636) — replaced by DuckDB BigQuery ATTACH
- src/table_registry.py (464) — replaced by DuckDB repository
- connectors/keboola/adapter.py (820) — replaced by extractor.py
- connectors/bigquery/adapter.py (665) — replaced by extractor.py
- connectors/bigquery/client.py (644) — replaced by DuckDB BQ extension

Updated all imports in webapp, catalog_export, enricher, router,
sync_settings_service, generate_sample_data. Kept keboola/client.py
as fallback (removed src.config dependency).

704 tests passing.
2026-03-31 07:50:37 +02:00
ZdenekSrotyr
18e5f0b6e8 feat: implement extract.duckdb contract — orchestrator + extractors
Phase 0: extend table_registry schema (v1→v2 migration), add
source_type/bucket/source_table/query_mode columns.

Phase 1: SyncOrchestrator ATTACHes extract.duckdb files into master
analytics.duckdb. Keboola extractor uses DuckDB extension with
legacy client fallback. BigQuery extractor is remote-only via
DuckDB BQ extension (no data download).

62 tests passing.
2026-03-30 20:12:56 +02:00
ZdenekSrotyr
e0ce91ddb9 feat: add dataset permissions, script execution, Kamal config, CI/CD
- SyncSettingsRepository + DatasetPermissionRepository with RBAC
- Script deploy/run/undeploy API with import sandboxing
- User sync settings API with permission checks
- 4 CLI skills (connectors, security, notifications, corporate-memory)
- Kamal production + staging configs
- GitHub Actions CI + deploy workflows
- 91 total tests passing
2026-03-27 15:40:11 +01:00
ZdenekSrotyr
79b0b66f2e feat: add DuckDB state layer with all repository classes
- src/db.py: schema with 14 tables matching design spec
- 9 repository classes: SyncState, Users, Knowledge, Audit,
  Telegram, PendingCode, Script, TableRegistry, Profiles
- 37 tests covering all CRUD operations
2026-03-27 15:06:55 +01:00
ZdenekSrotyr
f76411c603 feat: add DuckDB state layer with schema management
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 13:55:54 +01:00
Petr
a667b4e32f Fix profiler crash for remote-only tables without primary_key
Same issue as config.py - profiler's TableInfo and parser required
primary_key and sync_strategy, breaking auto-profile after sync
when daily_deal_traffic (remote-only) is in config.
2026-03-25 14:47:00 +01:00
Petr
4ebb3fc7b2 Fix data sync crash: make primary_key and sync_strategy optional
Remote-only tables (query_mode="remote") like daily_deal_traffic
don't need primary_key or sync_strategy. The parser used hard
lookups (table_data["primary_key"]) causing KeyError and breaking
all data sync since 2026-03-21.

Changes:
- TableConfig: default primary_key="" and sync_strategy="none"
- Parser: use .get() with defaults instead of [] lookups
- Validator: add "none" as valid sync_strategy
2026-03-25 14:43:22 +01:00
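The parser change in 4ebb3fc7b2 amounts to replacing hard lookups with .get() and defaults. A reduced sketch follows. The TableConfig fields other than primary_key and sync_strategy, and the "local" default for query_mode, are assumptions.

```python
from dataclasses import dataclass


@dataclass
class TableConfig:
    name: str
    query_mode: str = "local"    # assumption: default mode for illustration
    primary_key: str = ""        # default named in the commit message
    sync_strategy: str = "none"  # default named in the commit message


def parse_table(table_data: dict) -> TableConfig:
    """Build a TableConfig without raising KeyError on optional fields."""
    return TableConfig(
        name=table_data["name"],  # still required, so a hard lookup is fine
        query_mode=table_data.get("query_mode", "local"),
        primary_key=table_data.get("primary_key", ""),
        sync_strategy=table_data.get("sync_strategy", "none"),
    )
```

With this shape a remote-only entry such as daily_deal_traffic parses even though it declares neither field.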
Petr
84d14da611 Fix remote query UX: file-based stdin, ssh permissions, deprecation
Session testing revealed 3 issues with remote queries:

1. CLAUDE.md template recommended `cat <<HEREDOC | ssh ...` but
   claude_settings.json had `cat` in deny list, causing 2-3 failed
   attempts per query. Replaced with file-based approach: Write tool
   creates JSON file, then `ssh ... < file` avoids the cat deny.

2. ssh/scp commands were not in the allow list, requiring manual
   approval for every remote query. Added both to allow list.

3. DuckDB fetch_arrow_table() emitted DeprecationWarning on every
   parquet export. Replaced with .arrow().read_all().

Also added instruction for proactive hybrid analysis when remote
tables are available (agent was only using local data until asked).
2026-03-21 18:41:43 +01:00
Petr
8c6c162417 Fix: --sql not required when --stdin is used
argparse was rejecting --stdin mode because --sql was required=True.
Changed to required=False with runtime validation in main().
2026-03-21 12:17:02 +01:00
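The argparse fix in 8c6c162417 (drop required=True, validate at runtime) follows a common pattern, sketched here. Only the --sql and --stdin flags come from the commit message. The function name and the error text are assumptions.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="remote query CLI (sketch)")
    # --sql is no longer required=True, so --stdin can be used on its own.
    parser.add_argument("--sql", required=False, help="DuckDB SQL to run")
    parser.add_argument("--stdin", action="store_true",
                        help="read the query spec as JSON from stdin")
    args = parser.parse_args(argv)
    # Runtime validation replaces the declarative requirement.
    if not args.stdin and not args.sql:
        parser.error("--sql is required unless --stdin is used")
    return args
```

parser.error() prints the usage message and exits with status 2, the same behaviour argparse gives for a missing required option.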
Petr
67df4acd73 Add --stdin JSON mode to avoid shell escaping nightmare
Agent was failing 3x on SSH commands due to backticks (BQ table names)
and single quotes (SQL string literals) getting mangled by nested shell
interpretation (local -> SSH -> bash -> Python).

New --stdin mode reads query spec as JSON from stdin via heredoc:
  cat <<'QUERY' | ssh alias 'bash remote_query.sh --stdin'
  {"register_bq": {"alias": "SELECT ... FROM `table` ..."}, "sql": "..."}
  QUERY

Heredoc with <<'QUERY' (quoted) passes everything literally -- no
escaping needed for backticks, quotes, or parentheses.

Updated claude_md_template.txt to use --stdin as the primary method.
2026-03-21 12:15:50 +01:00
Petr
39763ea5a2 Fix: load instance.yaml without requiring webapp secrets
Analysts don't have WEBAPP_SECRET_KEY, so load_instance_config()
validation failed with noisy warnings. Now reads instance.yaml
directly with yaml.safe_load, skipping secret validation.
2026-03-21 12:01:41 +01:00
Petr
d180b2014e Step 28: Remote query architecture for local+remote table JOINs
Add src/remote_query.py CLI module enabling the AI agent to run SQL
queries spanning local Parquet tables and remote BigQuery tables in a
single DuckDB session on the server. Two-phase protocol: BQ sub-queries
(--register-bq) fetch filtered/aggregated data, then DuckDB SQL (--sql)
joins everything.

Safety: COUNT(*) pre-check, memory estimation (2GB cap), row limits
(500K per BQ sub-query, 100K final result).

Changes:
- New src/remote_query.py with CLI, BQ registration, output formatting
- Add bq_entity_type field to TableConfig (view vs table routing)
- Extract create_local_views() from duckdb_manager.py for reuse
- Update claude_md_template.txt with remote query agent instructions
- Update example configs with remote_query section and docs
- 52 new tests (42 remote_query + 10 bq_entity_type), all passing
2026-03-21 11:39:15 +01:00
Petr
fb63a72a98 Add data product discovery, fix remove-analyst script
- client.py: add search_by_data_product() for OpenMetadata search API
- catalog_export.py: prefer data product discovery over tag filtering
  (finds all 16 metrics in FoundryAIDataModel vs 3 with tag filter)
- remove-analyst: fix GROUPS bash variable collision, improve messaging
2026-03-18 12:52:41 +01:00
Petr
ab99f0af92 Fix sync_schedule validation to accept multi-time daily format
The scheduler.py already supported "daily HH:MM,HH:MM,HH:MM" format
(commit 5f27d05), but config.py validation regex only accepted single
time "daily HH:MM", causing data-refresh to crash on startup.

Also adds:
- tests/test_config_sync_schedule.py (16 test cases)
- Makefile with validate-config target for CI/CD integration
2026-03-17 13:21:14 +01:00
Petr
5f27d05894 Support multiple daily sync times (e.g., "daily 07:00,13:00,18:00")
Scheduler now accepts comma-separated HH:MM times in daily schedules.
Each time slot is independently evaluated - if any slot has passed and
last_sync is before it, the table is marked as due.

This lets tables sync multiple times per day to pick up data refreshes
that happen throughout the day (e.g., Keboola pipelines running 3x/day).
2026-03-16 23:09:48 +01:00
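The slot rule from 5f27d05894 ("if any slot has passed and last_sync is before it, the table is marked as due") can be written down as a small function. This is a sketch, not the scheduler's code: the function names are hypothetical, datetimes are naive, and only today's slots are considered.

```python
from datetime import datetime, time
from typing import List, Optional


def parse_daily_times(schedule: str) -> List[time]:
    """Parse 'daily HH:MM[,HH:MM...]' into a list of time slots."""
    prefix, _, times = schedule.partition(" ")
    if prefix != "daily" or not times:
        raise ValueError(f"not a daily schedule: {schedule!r}")
    return [datetime.strptime(part.strip(), "%H:%M").time()
            for part in times.split(",")]


def is_daily_due(schedule: str, last_sync: Optional[datetime], now: datetime) -> bool:
    """Due if any of today's slots has passed and last_sync predates it."""
    for slot in parse_daily_times(schedule):
        slot_at = datetime.combine(now.date(), slot)
        if slot_at <= now and (last_sync is None or last_sync < slot_at):
            return True
    return False
```

Because only today's slots are evaluated, a table that missed yesterday's last slot is picked up at today's first slot rather than immediately after midnight.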
Petr
ad525a96aa Filter catalog metrics by configurable tag (e.g., AIAgent.FoundryAI)
Add filter_tag support to catalog_export and webapp so only metrics
with the required tag are exported to YAML and displayed in UI.
Previously all 19+ metrics were exported regardless of relevance.

- Add has_tag() helper to transformer module
- catalog_export.py: filter_tag parameter from instance.yaml openmetadata config
- webapp/app.py: filter metrics in _load_metrics_from_catalog()
- 7 new tests (has_tag, filter_tag export, stale cleanup)
2026-03-16 22:03:53 +01:00
Petr
80c5b902e0 Add scheduled data sync and catalog refresh with systemd timers
- New sync_schedule and profile_after_sync fields in TableConfig
  (formats: "every 15m", "every 1h", "daily 05:00")
- New src/scheduler.py with schedule evaluation logic (is_table_due)
- New --scheduled mode in data_sync.py: only syncs tables that are due,
  respects profile_after_sync flag, auto-restarts webapp after profiling
- Systemd timer+service for data-refresh (every 15 min)
- Systemd timer+service for catalog-refresh (every 15 min)
- deploy.sh enables new timers automatically
- Complete table config reference in data_description.md.example
- 58 new scheduler tests
2026-03-15 02:16:31 +01:00
Petr
d9f3977028 URL-encode FQN in catalog header links (spaces -> %20)
2026-03-15 02:06:22 +01:00
Petr
60039c0af3 Add direct catalog URL to YAML header (metric/table entity links)
Source line now links directly to the entity in OpenMetadata:
- metrics: https://datacatalog.../metric/UniqueVisitors
- tables:  https://datacatalog.../table/bigquery.project.dataset.table
2026-03-15 02:03:27 +01:00
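The two catalog-link commits above (60039c0af3 and its follow-up d9f3977028) build an entity URL from a fully qualified name and percent-encode it. A sketch using the standard library follows. The function name is hypothetical and the base URL in the usage note is a placeholder, not the real catalog host.

```python
from urllib.parse import quote


def entity_url(base_url: str, entity_type: str, fqn: str) -> str:
    """Build <base>/<entity_type>/<fqn> with the FQN percent-encoded."""
    # safe="" encodes spaces as %20 and also encodes "/", while dots in an
    # FQN such as bigquery.project.dataset.table are left untouched.
    return f"{base_url.rstrip('/')}/{entity_type}/{quote(fqn, safe='')}"
```

For example, entity_url("https://catalog.example.com", "metric", "Unique Visitors") yields a path ending in /metric/Unique%20Visitors.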
Petr
985f47cdb7 Add catalog export: generate YAML metrics and tables from OpenMetadata
- New `connectors/openmetadata/transformer.py` with shared parsing logic
  for extracting categories, grain, dimensions, expressions from OM tags
- New `src/catalog_export.py` script (python -m src.catalog_export) that
  fetches metrics/tables from OpenMetadata API and writes YAML files to
  /data/docs/metrics/ and /data/docs/tables/ for agent consumption
- Refactor webapp/app.py to delegate to transformer (with inline fallback)
- Add `fields` parameter to client.get_metrics() and get_metric_by_fqn()
  for fetching tags+owners in a single API call
- Fix pre-existing mock bug in test_openmetadata_enricher (base_url)
- 101 new tests (80 transformer + 21 export), all passing
2026-03-15 01:15:30 +01:00
Petr
be58e63394 Move profiler config to instance.yaml (KISS principle)
Instead of hardcoded Python constants, load profiler settings from config:
- instance.yaml: profiler section with all parameters
- Defaults: fallback to sensible defaults if config not found
- Centralized: all profiler tuning in one place, no code changes needed
2026-03-12 14:45:14 +01:00
Petr
c25278538c Simplify profiler config: use single SAMPLE_SIZE parameter (KISS)
Replace SAMPLE_THRESHOLD + SAMPLE_SIZE with single SAMPLE_SIZE:
- If table > SAMPLE_SIZE: sample that many rows
- Otherwise: use all rows

Cleaner, easier to configure.
2026-03-12 14:43:23 +01:00
Petr
c5c24cb45b Implement OpenMetadata catalog integration (Phase 1)
Add OpenMetadata REST API connector and enricher to merge table/column metadata
from OpenMetadata catalog at sync and query time.

Changes:
- connectors/openmetadata/client.py: HTTP client for OM API
- connectors/openmetadata/enricher.py: Data enrichment with TTL cache
- tests/test_openmetadata_*: Unit tests for client and enricher
- src/config.py: Add catalog_fqn field to TableConfig
- src/data_sync.py: Use enricher in _generate_schema_yaml (catalog > BQ API > data_description.md)
- webapp/app.py: Initialize enricher, enrich catalog data with tags/tier/owners/url
- config/instance.yaml.example: Document openmetadata section

Features:
- FQN auto-derivation: bigquery.{table.id}
- TTL cache (default 1h) to avoid repeated API calls
- Graceful degradation: disabled if token missing, silent on HTTP errors
- Column description priority: catalog > BQ API > (none)
- Table description priority: catalog > data_description.md
2026-03-12 14:07:13 +01:00
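The "TTL cache (default 1h)" feature listed in c5c24cb45b can be approximated by a small wrapper around a fetch function. The class below is an illustrative sketch, not the enricher's implementation. The class name, the method name, and the injectable clock are assumptions.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Remember fetched values for ttl_seconds to avoid repeated API calls."""

    def __init__(self, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests do not need to sleep
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_fetch(self, key: Hashable, fetch: Callable[[], Any]) -> Any:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]    # still fresh: no call to fetch()
        value = fetch()      # missing or expired: fetch and store again
        self._entries[key] = (now, value)
        return value
```

A failed fetch raises before anything is stored, so errors are not cached in this sketch.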
Petr
8bb46a9e0a Add per-partition streaming sync and hybrid query architecture
Partitioned sync: iterates day-by-day instead of loading full dataset.
Each partition: query BQ -> stream to disk -> free RAM. Peak ~50 MB.
New helpers: _sync_single_partition, _cleanup_old_partitions, _generate_partition_dates.

Config: added partition_column_type (DATE/TIMESTAMP/DATETIME), query_mode (local/remote/hybrid).
DuckDB manager: hybrid architecture support (local Parquet + remote BQ tables).
Data sync: skips remote tables, filters by query_mode.

Tests: 113 passing (adapter, client, config, data_sync, duckdb_manager).
2026-03-12 13:20:41 +01:00
Petr
d2e83ce9d0 Set DuckDB memory_limit=4GB in profiler to prevent OOM
Server has 8GB RAM with other services running. DuckDB defaults to
using all available memory, causing OOM killer when profiling large
tables (22M rows, 39 cols triggered 7.5GB RSS -> killed).
2026-03-12 11:06:49 +01:00
Petr
a191ede28c Add columns and row_filter to TableConfig for selective BQ export
Propagate column selection and row filtering from data_description.md
through the BigQuery adapter to the BQ client. This enables exporting
only needed columns and applying date range filters at the SQL level,
critical for large DataView tables (e.g., 412-col unit_economics).
2026-03-11 19:37:04 +01:00
Petr
758910463b Add BigQuery data source adapter
BigQuery connector that syncs BQ tables to local Parquet files via PyArrow
(no CSV intermediate step). Supports full refresh, timestamp-based
incremental (via incremental_column), and partition-based sync strategies.

- connectors/bigquery/client.py: BQ API wrapper with ADC auth, parameterized
  queries, metadata cache, cross-project support (job project != data project)
- connectors/bigquery/adapter.py: DataSource implementation with merge/dedup
- src/config.py: Add incremental_column field to TableConfig
- 72 unit tests (mocked, no GCP SDK required)
2026-03-11 13:56:12 +01:00
Petr
28543d98b1 Fix profiler file_size and catalog stats fallback
- Profiler computes file_size_mb from actual parquet files when
  sync_state.json is absent (sample data / no-sync deployments)
- Catalog header falls back to profiles.json for aggregate stats
  (tables count, total rows) when sync_state.json is missing
2026-03-10 22:12:46 +01:00
Petr
1be0dc5300 Add flat parquet fallback to profiler get_parquet_path
Tries subfolder path first (Keboola-style layout), then falls back to
flat path for simple deployments like sample data.
2026-03-10 22:09:14 +01:00
Petr
b99ec576ca Add self-service data onboarding system
Table Registry as central source of truth (JSON) with atomic writes,
optimistic locking, audit logging, and data_description.md generation.
Existing readers (config.py, profiler.py) need zero changes.

Phase 1 - Discovery API:
  - discover_tables() on DataSource ABC + Keboola implementation
  - admin_required decorator with server-side recomputation
  - GET /api/admin/discover-tables endpoint

Phase 2 - Table Registry:
  - src/table_registry.py with CRUD, validation, migration from MD
  - Admin API: register/update/unregister with version locking
  - DELETE cascade cleans up per-user subscriptions

Phase 3 - Auto-Profiling:
  - profile_changed_tables() for incremental profiling
  - Non-fatal hook in sync_all() after successful sync

Phase 4 - Per-Table Subscriptions:
  - table_mode (all/explicit) with per-table toggles
  - GET/POST /api/table-subscriptions endpoints
  - Subscription status in catalog and dashboard views

Phase 5 - Smart Sync:
  - Python-generated rsync filter files (not shell YAML parsing)
  - sync_data.sh uses --filter="merge ..." for explicit mode

Phase 6 - Admin UI:
  - /admin/tables with discovery, registration modal, registry mgmt
  - Vanilla JS, matching existing design system
2026-03-09 14:25:37 +01:00
Petr
266e8573d3 Extract Keboola into connectors/keboola module
Move all Keboola-specific code out of src/ into connectors/keboola/:
- git mv src/keboola_client.py -> connectors/keboola/client.py
- Extract LocalKeboolaSource (855 lines) from data_sync.py -> connectors/keboola/adapter.py
- Rename to KeboolaDataSource with full env var validation
- Extend DataSource ABC with get_column_metadata() and get_source_name()
- Add dynamic connector registry via importlib in create_data_source()
- Refactor _generate_schema_yaml to use ABC methods (source_type, _schema_version: 2)
- Remove src/adapters/ (redundant facade layer)
- Remove Keboola validation from src/config.py (connector validates itself)
- Add 14 tests for factory, ABC defaults, env validation, dynamic lookup
2026-03-09 12:22:16 +01:00
Petr
86edd27655 Extract Jira into connectors/jira module
Move all Jira-specific code into a self-contained connector module:
- 22 files moved via git mv (transform, service, webhook, scripts,
  systemd units, tests, docs, bin helper)
- All imports updated to use connectors.jira.* paths
- Jira is now conditional: auto-detected via JIRA_DOMAIN env var
- Webapp registers Jira blueprint only when available
- Health service monitors Jira timers only when enabled
- Profiler loads Jira tables dynamically from filesystem
- Sync settings uses config-driven dependency validation
- Renamed keboola_platform_url -> custom_url in transform
- Updated deploy.sh, sudoers-deploy, backfill_gap.sh paths
- Fixed pytest.ini to skip live tests by default
2026-03-09 11:17:50 +01:00
Petr
c56905d34f Initial commit: OSS data distribution platform
Open-source AI data analyst platform extracted from internal repo.
Includes data sync engine, Keboola adapter, Flask web portal,
server deployment scripts, and configuration templates.
2026-03-08 23:31:28 +01:00