Commit graph

10 commits

Author SHA1 Message Date
Petr
758910463b Add BigQuery data source adapter
BigQuery connector that syncs BQ tables to local Parquet files via PyArrow
(no CSV intermediate step). Supports full refresh, timestamp-based
incremental (via incremental_column), and partition-based sync strategies.

- connectors/bigquery/client.py: BQ API wrapper with ADC auth, parameterized
  queries, metadata cache, cross-project support (job project != data project)
- connectors/bigquery/adapter.py: DataSource implementation with merge/dedup
- src/config.py: Add incremental_column field to TableConfig
- 72 unit tests (mocked, no GCP SDK required)
2026-03-11 13:56:12 +01:00
Petr
5a84473213 Add dynamic Business Metrics with sample e-commerce definitions
Replace hardcoded Keboola-specific metrics card in Data Catalog with
dynamic Jinja template that renders whatever metric YAMLs exist in
docs/metrics/. Add 10 sample e-commerce metric definitions across
4 categories (revenue, customers, marketing, support) that align
with the sample data generator tables.

Key changes:
- MetricParser: new category colors + dynamic sql_* field discovery
- _load_metrics_data(): scans docs/metrics/*/*.yml with prod fallback
- catalog.html: 240 lines hardcoded HTML -> 35 lines Jinja loop
- metric_modal.js: regex-based category class removal, new categories
- 21 tests validating YAML schema, parser, and loader
2026-03-10 22:38:44 +01:00
Petr
302494b632 Add --format parquet using project's ParquetManager
Generator now supports --format {csv,parquet,both}. Parquet mode
uses src.parquet_manager.ParquetManager for snappy compression,
proper column types (DATE, TIMESTAMP, DOUBLE), and metadata.
No more ad-hoc pandas conversion needed on the server.
2026-03-10 21:46:20 +01:00
Petr
44bf43535b Add sample data generator with 9 e-commerce tables
Synthetic data generator for demo/testing without real data adapter:
- 9 tables: customers, products, campaigns, web_sessions, web_leads,
  orders, order_items, payments, support_tickets
- 4 size presets: xs (1MB), s (15MB), m (150MB), l (1.5GB)
- Realistic patterns: seasonality, Pareto customer distribution,
  segment-based behavior, referential integrity
- Deterministic output via --seed parameter

Also: docs/sample-data.md, updated auto-install.md with Step 6,
updated CLAUDE.md (email auth provider, dual-repo architecture)
2026-03-10 12:31:14 +01:00
Petr
f635195c80 Add multi-domain support and full-email username generation
- Support comma-separated domains in auth.allowed_domain config
- Use full email as system username (user@domain.com -> user_domain_com)
  to avoid collisions with reserved names and across domains
- Update both auth providers (google, email) for multi-domain display
- Add tests for username generation and update email auth tests
2026-03-10 10:50:01 +01:00
Petr
e2ab219171 Add email magic link authentication provider
New pluggable auth provider that sends passwordless sign-in links.
Works with domain restriction (same as Google OAuth). Falls back to
showing the link in browser when SMTP is not configured (dev mode).
2026-03-10 10:39:19 +01:00
Petr
b99ec576ca Add self-service data onboarding system
Table Registry as central source of truth (JSON) with atomic writes,
optimistic locking, audit logging, and data_description.md generation.
Existing readers (config.py, profiler.py) need zero changes.

Phase 1 - Discovery API:
  - discover_tables() on DataSource ABC + Keboola implementation
  - admin_required decorator with server-side recomputation
  - GET /api/admin/discover-tables endpoint

Phase 2 - Table Registry:
  - src/table_registry.py with CRUD, validation, migration from MD
  - Admin API: register/update/unregister with version locking
  - DELETE cascade cleans up per-user subscriptions

Phase 3 - Auto-Profiling:
  - profile_changed_tables() for incremental profiling
  - Non-fatal hook in sync_all() after successful sync

Phase 4 - Per-Table Subscriptions:
  - table_mode (all/explicit) with per-table toggles
  - GET/POST /api/table-subscriptions endpoints
  - Subscription status in catalog and dashboard views

Phase 5 - Smart Sync:
  - Python-generated rsync filter files (not shell YAML parsing)
  - sync_data.sh uses --filter="merge ..." for explicit mode

Phase 6 - Admin UI:
  - /admin/tables with discovery, registration modal, registry mgmt
  - Vanilla JS, matching existing design system
2026-03-09 14:25:37 +01:00
Petr
86edd27655 Extract Jira into connectors/jira module
Move all Jira-specific code into a self-contained connector module:
- 22 files moved via git mv (transform, service, webhook, scripts,
  systemd units, tests, docs, bin helper)
- All imports updated to use connectors.jira.* paths
- Jira is now conditional: auto-detected via JIRA_DOMAIN env var
- Webapp registers Jira blueprint only when available
- Health service monitors Jira timers only when enabled
- Profiler loads Jira tables dynamically from filesystem
- Sync settings uses config-driven dependency validation
- Renamed keboola_platform_url -> custom_url in transform
- Updated deploy.sh, sudoers-deploy, backfill_gap.sh paths
- Fixed pytest.ini to skip live tests by default
2026-03-09 11:17:50 +01:00
Petr
26c4e0934d OSS cleanup: remove internal references, harden deployment, add config env interpolation
Phase 1 - Internal reference cleanup:
- Delete dev_docs/meetings/ (internal meeting notes/transcripts)
- Replace hardcoded usernames (padak/matejkys/dasa) with deploy/generic
- Replace "Internal AI Data Analyst" with "AI Data Analyst"
- Replace keboola/internal_ai_data_analyst URLs with your-org/ai-data-analyst
- Replace /tmp/keboola_load/ with /tmp/data_analyst_staging/ in dev_docs

Phase 2 - Deployment hardening:
- Tighten sudoers wildcards to explicit paths (visudo, sudoers cp)
- setup.sh creates all groups (data-ops, dataread, data-private) and deploy user
- webapp-setup.sh copies sudoers-webapp from repo instead of inline definition
- deploy.sh conditional copy for data_description.md (not in git for OSS)
- deploy.sh ownership changed to deploy:data-ops for /data/{scripts,docs,examples}

Phase 3 - Config and misc:
- Add ${ENV_VAR} interpolation to config/loader.py
- Expand config/instance.yaml.example with all sections (admins, deployment, auth, etc.)
- Create config/.env.template for secret values
- Add MIT LICENSE
- Fix .gitignore: add .venv/, docs/data_description.md
- Fix README.md: CSV status Planned, remove metrics/, update license text
- Translate Czech comments in requirements.txt to English
- Fix test_account_service.py: mock username mapping instead of relying on instance config

All 118 tests pass.
2026-03-09 07:59:57 +01:00
Petr
c56905d34f Initial commit: OSS data distribution platform
Open-source AI data analyst platform extracted from internal repo.
Includes data sync engine, Keboola adapter, Flask web portal,
server deployment scripts, and configuration templates.
2026-03-08 23:31:28 +01:00