Commit graph

33 commits

Author SHA1 Message Date
ZdenekSrotyr
0dd8b13d62 infra: add fetch-env-from-secrets.sh for VM-side .env generation
Reads JWT_SECRET_KEY and KEBOOLA_STORAGE_TOKEN from Secret Manager,
combines with non-secret config, writes .env with chmod 600.
Run as part of VM startup or manually for rotation.
2026-04-21 16:18:35 +02:00
ZdenekSrotyr
5ad96e5f86 infra: add bootstrap-gcp.sh for per-customer GCP setup
Creates agnes-deploy SA with Terraform-scoped roles, GCS tfstate bucket,
and generates a JSON key. Idempotent — safe to re-run.

Expanded .gitignore to block *-key.json files from ever being committed.
2026-04-21 16:18:35 +02:00
ZdenekSrotyr
30106e6a49 feat: add standalone metric YAML → DuckDB migration script 2026-04-10 19:35:36 +02:00
ZdenekSrotyr
40cca627be fix: address Devin review round 4 — bash arithmetic, CalVer max, docs
- smoke-test.sh: replace ((PASS++)) with PASS=$((PASS + 1)) to avoid
  set -e abort when counter is 0 (bash returns exit 1 for ((0)))
- CalVer: use max(N) from existing tags instead of count, safe when
  tags are deleted (e.g. deprecated version cleanup)
- CLAUDE.md: update schema version from v2 to v3

663 tests pass.
2026-04-10 14:39:16 +02:00
ZdenekSrotyr
6c53082295 feat: multi-instance deployment — all 14 must-have items from spec
CalVer CI (release.yml) with stable/dev channels, health endpoint
with version/channel/schema_version, JWT secret auto-generation with
file persistence, smoke test script + Docker-in-CI, pre-migration
snapshot, /api/admin/configure for headless setup, /api/admin/
discover-and-register, /setup wizard, OpenAPI snapshot test, custom
connector mount support, CHANGELOG, migration safety tests, startup
banner.

663 tests pass (6 new migration safety + 3 OpenAPI snapshot + 1
updated JWT test).
2026-04-10 11:57:42 +02:00
ZdenekSrotyr
5e0e4ceb9e fix: rewrite Makefile and scripts/README.md
Makefile simplified to four targets (test, dev, docker, lint) aligned
with the current FastAPI/Docker architecture. scripts/README.md rewritten
to list only the active and migration scripts that still exist.
2026-04-09 17:16:04 +02:00
ZdenekSrotyr
22cfbfe5fb docs: update references to deleted files
- QUICKSTART.md: replace data_description.md.example copy step with
  note that tables are registered via the admin API or web UI
- NOTIFICATIONS.md: replace examples/ section with planned-feature note
- telegram_bot.md: remove examples/notifications/ rows from deployment
  table and example scripts section; note feature is planned
- dev_docs/README.md: remove plan-corporate-memory.md entry
- duckdb_manager.py: update comment from remote_query.py to query API endpoint
2026-04-09 17:15:19 +02:00
ZdenekSrotyr
f3bd378b47 chore: remove 17 dead files from v1 architecture
Removes unused scripts (collect_session, generate_user_sync_configs,
standalone_profiler, remote_query, update, setup_views, test_sync,
activate_venv, backfill_gap, sync_config_template), legacy config
(data_description.md.example), llms.txt, completed planning docs
(plan-rsync-fix, plan_parquet_types_fix, plan-corporate-memory), and
notification examples/ directory.
2026-04-09 17:14:06 +02:00
ZdenekSrotyr
5ee12d78e7 refactor: final cleanup — delete legacy auth, clean deps, fix hash, migrate to uv
- Delete root auth/ directory (legacy Flask providers, orphaned)
- Clean requirements.txt: remove Flask, gunicorn, authlib, sendgrid,
  anthropic, openai, argon2-cffi (9 unused deps)
- Fix hash computation in orchestrator: MD5 of parquet mtime+size
  (CLI sync now skips unchanged tables correctly)
- Migrate pip → uv in CLAUDE.md, scripts/init.sh, pyproject.toml
- Sync pyproject.toml dependencies with requirements.txt

578 tests passing.
2026-03-31 19:18:30 +02:00
ZdenekSrotyr
4d1acd014a refactor: remove legacy webapp + add missing tests + housekeeping
Phase A: Close fixed issues (#7, #8, #9), add server/ user/ to
.gitignore, increase extractor timeout to 30 min.

Phase B: Add 10 new tests — access request lifecycle (4), CLI admin
commands (5), sync subprocess trigger (1). 578 tests passing.

Phase C: Delete entire webapp/ directory (24,800 lines) — legacy Flask
app fully replaced by FastAPI app/. Fix auth providers to use
app.instance_config instead of webapp.config. Update CLAUDE.md.

Delete 6 webapp-only test files. Fix Jira service config imports.
2026-03-31 13:44:06 +02:00
ZdenekSrotyr
b0eaef88cc refactor: delete old server infra — 4,200 lines removed
Remove all legacy deployment infrastructure replaced by Docker + Kamal:
- server/ directory (deploy.sh, setup.sh, webapp-setup.sh, sudoers,
  nginx config, systemd units, bin scripts)
- scripts/sync_data.sh (replaced by da sync + API)
- All services/*/systemd/ files (replaced by docker-compose)
- tests/test_deploy_guard.py and tests/test_sync_data.py

688 tests passing.
2026-03-31 08:06:41 +02:00
ZdenekSrotyr
b502bd8bdd refactor: delete old sync pipeline — 9,500 lines removed
Phase 5 cleanup: remove all code replaced by extract.duckdb architecture.

Deleted modules:
- src/config.py (653) — replaced by DuckDB table_registry
- src/parquet_manager.py (755) — replaced by DuckDB COPY TO
- src/data_sync.py (734) — replaced by SyncOrchestrator
- src/remote_query.py (636) — replaced by DuckDB BigQuery ATTACH
- src/table_registry.py (464) — replaced by DuckDB repository
- connectors/keboola/adapter.py (820) — replaced by extractor.py
- connectors/bigquery/adapter.py (665) — replaced by extractor.py
- connectors/bigquery/client.py (644) — replaced by DuckDB BQ extension

Updated all imports in webapp, catalog_export, enricher, router,
sync_settings_service, generate_sample_data. Kept keboola/client.py
as fallback (removed src.config dependency).

704 tests passing.
2026-03-31 07:50:37 +02:00
ZdenekSrotyr
8bc1fceb52 feat: add migration scripts for extract.duckdb transition
migrate_registry_to_duckdb.py: imports tables from data_description.md
or table_registry.json into DuckDB table_registry with source columns.
migrate_parquets_to_extracts.py: copies parquets to /data/extracts/
and creates extract.duckdb with _meta + views.
2026-03-30 20:21:12 +02:00
ZdenekSrotyr
64acc8d731 feat: add JSON to DuckDB migration script with tests 2026-03-27 15:09:06 +01:00
Petr
dfec39722b Fix remote_query.sh: use analyst-readable env file
GCP OS Login doesn't honor /etc/group changes for SSH sessions,
so analyst can't read /opt/data-analyst/.env even after usermod.

Wrapper now reads .remote_query.env from scripts dir (dataread group),
falls back to .env for admin users. The env file contains only
non-secret BQ config (project ID, location, data dir).
2026-03-21 11:59:57 +01:00
Petr
dce8454894 Add remote_query.sh wrapper, fix analyst SSH permissions
Analyst user (foundry_e_psimecek) couldn't access /opt/data-analyst/.
Added to data-ops group on server.

New scripts/remote_query.sh wrapper handles env setup (PYTHONPATH,
CONFIG_DIR, .env) so agents use simple:
  ssh alias 'bash ~/server/scripts/remote_query.sh --sql "..." --format table'

Updated claude_md_template.txt to use wrapper instead of raw commands.
2026-03-21 11:58:04 +01:00
Petr
ed5a5ec706 Fix: duckdb_manager CONFIG_DIR support for server deployment
find_project_root() and parse_data_description() now check CONFIG_DIR
env var first when looking for data_description.md. On server deployment,
data_description.md lives in instance/config/ (CONFIG_DIR), not in the
OSS repo's docs/ directory.
2026-03-21 11:40:55 +01:00
Petr
d180b2014e Step 28: Remote query architecture for local+remote table JOINs
Add src/remote_query.py CLI module enabling the AI agent to run SQL
queries spanning local Parquet tables and remote BigQuery tables in a
single DuckDB session on the server. Two-phase protocol: BQ sub-queries
(--register-bq) fetch filtered/aggregated data, then DuckDB SQL (--sql)
joins everything.

Safety: COUNT(*) pre-check, memory estimation (2GB cap), row limits
(500K per BQ sub-query, 100K final result).

Changes:
- New src/remote_query.py with CLI, BQ registration, output formatting
- Add bq_entity_type field to TableConfig (view vs table routing)
- Extract create_local_views() from duckdb_manager.py for reuse
- Update claude_md_template.txt with remote query agent instructions
- Update example configs with remote_query section and docs
- 52 new tests (42 remote_query + 10 bq_entity_type), all passing
2026-03-21 11:39:15 +01:00
Petr
e17dd85504 Remove hardcoded Jira/Keboola references from sync_data.sh
- Silent fallback when no sync settings exist (no 'Jira disabled' message)
- Generic dataset exclude/include loop driven by sync_settings.yaml
- Generic cleanup loop for disabled datasets
- Replaces 100+ lines of hardcoded Jira/kbc_telemetry_expert blocks
2026-03-15 01:02:37 +01:00
Petr
49adbe26ec Move server venv setup to bootstrap, remove cross-platform pip freeze sync
Server venv is created during bootstrap via SSH (same package list,
installed natively on Linux). Removes sync_data.sh section that copied
pip freeze output across platforms (Windows/macOS freeze is incompatible
with Linux).
2026-03-15 00:59:45 +01:00
Petr
22a1bb5847 Auto-restart sync_data.sh after self-update (exec replaces process) 2026-03-15 00:53:17 +01:00
Petr
6f9de274fb Fix: CLAUDE.md template vars overwrote $SSH_HOST used by rsync
The CLAUDE.md generation section reused SSH_HOST variable name to store
the server IP, overwriting the SSH alias needed for rsync. Renamed to
TMPL_SSH_ALIAS/TMPL_SERVER_HOST/TMPL_WEBAPP_URL to avoid collision.
2026-03-15 00:51:49 +01:00
Petr
b0e4749b0d Replace hardcoded 'data-analyst' SSH alias with configurable $SSH_HOST
Read SSH alias from .sync_connection file at script start (default:
'data-analyst' for backward compatibility). All 32 occurrences of
hardcoded 'data-analyst:' and 'ssh data-analyst' replaced with $SSH_HOST.
2026-03-15 00:03:07 +01:00
Petr
2237334b05 Make CLAUDE.md template generic and instance-aware
- Remove all Keboola-specific content (metric categories, MRR/ARR refs,
  corporate memory, hardcoded server IP)
- Add {ssh_alias}, {server_host}, {webapp_url} placeholders
- Bootstrap saves .sync_connection file with instance details
- sync_data.sh reads .sync_connection to substitute all placeholders
- Text instructions in dashboard include .sync_connection step
2026-03-14 23:57:58 +01:00
Petr
8bb46a9e0a Add per-partition streaming sync and hybrid query architecture
Partitioned sync: iterates day-by-day instead of loading full dataset.
Each partition: query BQ -> stream to disk -> free RAM. Peak ~50 MB.
New helpers: _sync_single_partition, _cleanup_old_partitions, _generate_partition_dates.

Config: added partition_column_type (DATE/TIMESTAMP/DATETIME), query_mode (local/remote/hybrid).
DuckDB manager: hybrid architecture support (local Parquet + remote BQ tables).
Data sync: skips remote tables, filters by query_mode.

Tests: 113 passing (adapter, client, config, data_sync, duckdb_manager).
2026-03-12 13:20:41 +01:00
Petr
468f56092b Add standalone DuckDB-based data profiler script
Zero-dependency profiler for Parquet/CSV files producing JSON profiles
with column statistics, histograms, alerts, and sample data.
Supports single files, directories, composite primary keys, and
optional HTML report generation.
2026-03-11 15:12:04 +01:00
Petr
302494b632 Add --format parquet using project's ParquetManager
Generator now supports --format {csv,parquet,both}. Parquet mode
uses src.parquet_manager.ParquetManager for snappy compression,
proper column types (DATE, TIMESTAMP, DOUBLE), and metadata.
No more ad-hoc pandas conversion needed on the server.
2026-03-10 21:46:20 +01:00
Petr
44bf43535b Add sample data generator with 9 e-commerce tables
Synthetic data generator for demo/testing without real data adapter:
- 9 tables: customers, products, campaigns, web_sessions, web_leads,
  orders, order_items, payments, support_tickets
- 4 size presets: xs (1MB), s (15MB), m (150MB), l (1.5GB)
- Realistic patterns: seasonality, Pareto customer distribution,
  segment-based behavior, referential integrity
- Deterministic output via --seed parameter

Also: docs/sample-data.md, updated auto-install.md with Step 6,
updated CLAUDE.md (email auth provider, dual-repo architecture)
2026-03-10 12:31:14 +01:00
Petr
b99ec576ca Add self-service data onboarding system
Table Registry as central source of truth (JSON) with atomic writes,
optimistic locking, audit logging, and data_description.md generation.
Existing readers (config.py, profiler.py) need zero changes.

Phase 1 - Discovery API:
  - discover_tables() on DataSource ABC + Keboola implementation
  - admin_required decorator with server-side recomputation
  - GET /api/admin/discover-tables endpoint

Phase 2 - Table Registry:
  - src/table_registry.py with CRUD, validation, migration from MD
  - Admin API: register/update/unregister with version locking
  - DELETE cascade cleans up per-user subscriptions

Phase 3 - Auto-Profiling:
  - profile_changed_tables() for incremental profiling
  - Non-fatal hook in sync_all() after successful sync

Phase 4 - Per-Table Subscriptions:
  - table_mode (all/explicit) with per-table toggles
  - GET/POST /api/table-subscriptions endpoints
  - Subscription status in catalog and dashboard views

Phase 5 - Smart Sync:
  - Python-generated rsync filter files (not shell YAML parsing)
  - sync_data.sh uses --filter="merge ..." for explicit mode

Phase 6 - Admin UI:
  - /admin/tables with discovery, registration modal, registry mgmt
  - Vanilla JS, matching existing design system
2026-03-09 14:25:37 +01:00
Petr
15b513266d Merge dev_scripts/ into scripts/
Move dev_run.py and test_sync.sh from dev_scripts/ to scripts/,
eliminating the separate dev_scripts directory. Update scripts
README with development scripts section.
2026-03-09 13:11:36 +01:00
Petr
86edd27655 Extract Jira into connectors/jira module
Move all Jira-specific code into a self-contained connector module:
- 22 files moved via git mv (transform, service, webhook, scripts,
  systemd units, tests, docs, bin helper)
- All imports updated to use connectors.jira.* paths
- Jira is now conditional: auto-detected via JIRA_DOMAIN env var
- Webapp registers Jira blueprint only when available
- Health service monitors Jira timers only when enabled
- Profiler loads Jira tables dynamically from filesystem
- Sync settings uses config-driven dependency validation
- Renamed keboola_platform_url -> custom_url in transform
- Updated deploy.sh, sudoers-deploy, backfill_gap.sh paths
- Fixed pytest.ini to skip live tests by default
2026-03-09 11:17:50 +01:00
Petr
26c4e0934d OSS cleanup: remove internal references, harden deployment, add config env interpolation
Phase 1 - Internal reference cleanup:
- Delete dev_docs/meetings/ (internal meeting notes/transcripts)
- Replace hardcoded usernames (padak/matejkys/dasa) with deploy/generic
- Replace "Internal AI Data Analyst" with "AI Data Analyst"
- Replace keboola/internal_ai_data_analyst URLs with your-org/ai-data-analyst
- Replace /tmp/keboola_load/ with /tmp/data_analyst_staging/ in dev_docs

Phase 2 - Deployment hardening:
- Tighten sudoers wildcards to explicit paths (visudo, sudoers cp)
- setup.sh creates all groups (data-ops, dataread, data-private) and deploy user
- webapp-setup.sh copies sudoers-webapp from repo instead of inline definition
- deploy.sh conditional copy for data_description.md (not in git for OSS)
- deploy.sh ownership changed to deploy:data-ops for /data/{scripts,docs,examples}

Phase 3 - Config and misc:
- Add ${ENV_VAR} interpolation to config/loader.py
- Expand config/instance.yaml.example with all sections (admins, deployment, auth, etc.)
- Create config/.env.template for secret values
- Add MIT LICENSE
- Fix .gitignore: add .venv/, docs/data_description.md
- Fix README.md: CSV status Planned, remove metrics/, update license text
- Translate Czech comments in requirements.txt to English
- Fix test_account_service.py: mock username mapping instead of relying on instance config

All 118 tests pass.
2026-03-09 07:59:57 +01:00
Petr
c56905d34f Initial commit: OSS data distribution platform
Open-source AI data analyst platform extracted from internal repo.
Includes data sync engine, Keboola adapter, Flask web portal,
server deployment scripts, and configuration templates.
2026-03-08 23:31:28 +01:00