Commit graph

51 commits

Author SHA1 Message Date
ZdenekSrotyr
caa60a507d feat: add centralized RBAC module — replace Linux group auth
New src/rbac.py: Role enum, hierarchy, get_user_role(), has_role(),
is_admin(), is_km_admin(), has_dataset_access(), set_user_role().

webapp/auth.py: admin_required + km_admin_required now use DuckDB
roles instead of Linux groups (pwd.getpwnam + sudo/data-ops check).

app/auth/dependencies.py: imports Role from src/rbac.py (single source).

11 RBAC tests passing.
2026-03-31 08:04:35 +02:00
ZdenekSrotyr
b502bd8bdd refactor: delete old sync pipeline — 9,500 lines removed
Phase 5 cleanup: remove all code replaced by extract.duckdb architecture.

Deleted modules:
- src/config.py (653) — replaced by DuckDB table_registry
- src/parquet_manager.py (755) — replaced by DuckDB COPY TO
- src/data_sync.py (734) — replaced by SyncOrchestrator
- src/remote_query.py (636) — replaced by DuckDB BigQuery ATTACH
- src/table_registry.py (464) — replaced by DuckDB repository
- connectors/keboola/adapter.py (820) — replaced by extractor.py
- connectors/bigquery/adapter.py (665) — replaced by extractor.py
- connectors/bigquery/client.py (644) — replaced by DuckDB BQ extension

Updated all imports in webapp, catalog_export, enricher, router,
sync_settings_service, generate_sample_data. Kept keboola/client.py
as fallback (removed src.config dependency).

704 tests passing.
2026-03-31 07:50:37 +02:00
ZdenekSrotyr
9f20529f10 fix: resolve 7 preexisting test failures
- Remove iCloud duplicate files (test_db 2.py, src/db 2.py)
- Fix metrics expression fallback to top-level field in transformer + webapp
- Fix sync_data.sh rsync exception pattern for $SSH_HOST variable
- Fix deploy_guard cp regex to skip shell variable expansions
- Update sudoers-deploy with missing root:data-ops rules
- Update CRITICAL_DIRS ownership expectations to match deploy.sh reality

913 tests passing, 0 failures.
2026-03-30 20:36:00 +02:00
Petr
eb7e5bdf8f Add data freshness indicators and remote table visibility to UI
- Fix sync_state.json parsing: derive last_updated from table last_sync
  timestamps when root-level field is missing (flat format support)
- Parse ALL YAML blocks from data_description.md (was only first block)
- Show remote tables (daily_deal_traffic) in catalog with "Live" badge
- Show per-table sync timestamps and Local/Live query mode badges
- Add data freshness note to Business Metrics section
- Dashboard: fix "Not yet synced" bug, show local/live table breakdown
2026-03-25 16:24:26 +01:00
Petr
0560bbc127 Rename Mandate button to Make Mandatory 2026-03-23 19:44:08 +01:00
Petr
e85d296b0a Add Corporate Memory admin review queue UI (Phase 2)
Admin page at /corporate-memory/admin with three tabs:
- Review Queue: pending items with approve/mandate/reject + batch ops
- All Items: status filter, promote/demote/revoke actions
- Audit Log: filterable action history table

Features:
- Keyboard shortcuts (j/k navigate, a/r/m = approve/reject/mandate)
- Inline mandate form (mandatory reason + audience targeting)
- Toast notifications on action success/error
- Pending count badge on main Corporate Memory page
- Matches existing visual design (CSS variables, card styles)
2026-03-23 19:32:33 +01:00
Petr
1318b74ff1 Add Corporate Memory governance — Phase 1 (data model + admin API)
Add admin curation layer between AI extraction and knowledge distribution.
Admins (km_admin flag in instance.yaml) can approve, reject, mandate, and
revoke knowledge items. Mandatory items distribute to all targeted users
automatically.

Three governance modes (configurable per instance):
- mandatory_only: admin controls everything, no user voting
- admin_curated: admin controls, users vote as feedback signal
- hybrid: mandatory from admin + optional from user voting

Three approval workflows:
- review_queue: nothing published without admin approval
- auto_publish: items go live immediately, admin intervenes retroactively
- threshold: confidence-based auto-publish (Phase 5)

Includes:
- 9 admin action functions (approve/reject/mandate/revoke/edit/batch/...)
- 11 new admin API endpoints under /api/corporate-memory/admin/
- Immutable audit log (audit.jsonl)
- Audience targeting via groups
- Automatic migration of existing items to "approved" status
- km_admin_required auth decorator
- 69 tests covering all governance logic
- Backward compatible: no config = legacy wiki behavior
2026-03-23 19:15:33 +01:00
Petr
ed16122994 Use data_product config for metric discovery instead of filter_tag in webapp 2026-03-18 16:10:15 +01:00
Petr
6c0abf275b Add cache busting to metric_modal.css include 2026-03-16 22:16:37 +01:00
Petr
9be22fdc82 Fix metric display: use displayName in list, render HTML in modal
List view:
- Show display_name ("M1 + VFM Operational") instead of name ("M1PlusVFMOperational")
- Strip HTML and truncate description for clean list excerpts

Modal detail:
- Render original HTML from catalog instead of stripped plain text
- Add .om-description CSS class for structured HTML (bold labels, lists, code)
- Pass description_html alongside plain text description for backwards compat
2026-03-16 22:11:58 +01:00
Petr
ad525a96aa Filter catalog metrics by configurable tag (e.g., AIAgent.FoundryAI)
Add filter_tag support to catalog_export and webapp so only metrics
with the required tag are exported to YAML and displayed in UI.
Previously all 19+ metrics were exported regardless of relevance.

- Add has_tag() helper to transformer module
- catalog_export.py: filter_tag parameter from instance.yaml openmetadata config
- webapp/app.py: filter metrics in _load_metrics_from_catalog()
- 7 new tests (has_tag, filter_tag export, stale cleanup)
2026-03-16 22:03:53 +01:00
Petr
985f47cdb7 Add catalog export: generate YAML metrics and tables from OpenMetadata
- New `connectors/openmetadata/transformer.py` with shared parsing logic
  for extracting categories, grain, dimensions, expressions from OM tags
- New `src/catalog_export.py` script (python -m src.catalog_export) that
  fetches metrics/tables from OpenMetadata API and writes YAML files to
  /data/docs/metrics/ and /data/docs/tables/ for agent consumption
- Refactor webapp/app.py to delegate to transformer (with inline fallback)
- Add `fields` parameter to client.get_metrics() and get_metric_by_fqn()
  for fetching tags+owners in a single API call
- Fix pre-existing mock bug in test_openmetadata_enricher (base_url)
- 101 new tests (80 transformer + 21 export), all passing
2026-03-15 01:15:30 +01:00
Petr
508d92771f Generate setup instructions from bootstrap.yaml (single source of truth)
- Rewrite bootstrap.yaml as clean structured YAML with steps, commands,
  descriptions, conditions, and notes
- Add _generate_setup_instructions() in app.py that reads YAML, substitutes
  placeholders, and produces clipboard-ready plain text
- Replace 50-line hardcoded JS string builder with single tojson variable
- All setup instructions now editable in one YAML file
2026-03-15 00:37:19 +01:00
Petr
85c07732b2 Fix dashboard stats: support flat sync_state.json format (no 'tables' wrapper)
BigQuery adapter writes table entries at top level, not nested under 'tables'.
Detect flat format by checking if values contain 'rows' key.
2026-03-15 00:26:10 +01:00
Petr
3ebb15cbab Make project_dir, ssh_key configurable in Get Started UI
Read server.project_dir from instance.yaml (default: 'data-analyst').
Replace hardcoded 'data-analyst' folder name and 'data_analyst_server'
SSH key name in dashboard template with Jinja variables.
2026-03-15 00:12:46 +01:00
Petr
021c453ea6 Auto-create .sync_connection via printf command in bootstrap
Replace 'Save this to .sync_connection' prose with actual printf command
that Claude/user executes. Fix heredoc indentation in bootstrap.yaml.
2026-03-15 00:05:42 +01:00
Petr
2237334b05 Make CLAUDE.md template generic and instance-aware
- Remove all Keboola-specific content (metric categories, MRR/ARR refs,
  corporate memory, hardcoded server IP)
- Add {ssh_alias}, {server_host}, {webapp_url} placeholders
- Bootstrap saves .sync_connection file with instance details
- sync_data.sh reads .sync_connection to substitute all placeholders
- Text instructions in dashboard include .sync_connection step
2026-03-14 23:57:58 +01:00
Petr
6728da63fb Use ssh_alias and ssh_key from config in bootstrap text instructions
Replace hardcoded 'data-analyst' and '~/.ssh/data_analyst_server' in
the copyBootstrapInstructions JS function with values from instance config.
Pass ssh_alias and ssh_key to dashboard template context.
2026-03-14 21:06:37 +01:00
Petr
13938bf72f Read ssh_alias and ssh_key from instance.yaml for bootstrap instructions
Config reads server.ssh_alias and server.ssh_key from instance.yaml
(defaults: 'data-analyst' and '~/.ssh/data_analyst_server' for backward compat).
App.py substitutes {ssh_alias} and {ssh_key} in bootstrap.yaml template.
2026-03-14 21:04:51 +01:00
Petr
c2681ccc86 Add cache-busting with git commit hash for static assets
Flask will now include git commit hash as URL parameter (v=abc1234)
for metric_modal.js and other static assets. This ensures browser
doesn't cache stale JavaScript when code changes.

Cache invalidation based on actual git history rather than timestamps.
2026-03-12 15:37:29 +01:00
Petr
f6000cc867 Fix: URL-encode metric FQN in catalog modal request
When metric FQN contains spaces (e.g. 'Active2 Customers'), JavaScript
was creating invalid URLs with literal spaces. Now properly encoding FQN
with encodeURIComponent() to convert spaces to %20 before sending to API.

Flask automatically decodes the path parameter back to original FQN.
2026-03-12 15:20:10 +01:00
Petr
1bcd7e4080 Fix: URL-decode metric FQN in catalog endpoint
FQN can contain spaces (e.g., 'Active2 Customers') which get URL-encoded
as 'Active2%20Customers' in the path parameter. Need to decode before
passing to OpenMetadata API.
2026-03-12 15:18:08 +01:00
Petr
268fe07f91 Fix: Use correct OpenMetadata API field names for metrics
OpenMetadata uses different field names than expected:
- metricExpression instead of expression
- metricType instead of type
- unitOfMeasurement instead of unit
- granularity instead of grain

Remove 'fields' query parameter from /api/v1/metrics - returns 400 Bad Request
when invalid field names are specified. Let API return full metric objects.

Update parsing to extract metadata from proper OpenMetadata fields instead
of relying on tags (tags are optional, fields are always present).
2026-03-12 15:16:24 +01:00
Petr
5fc9526627 Phase 2: Replace demo YAML metrics with OpenMetadata catalog data
- Add get_metric_by_fqn() to OpenMetadataClient
- Add get_metrics() to CatalogEnricher with TTL caching
- Implement _parse_om_metric() to extract category/grain from OpenMetadata tags
- Implement _load_metrics_from_catalog() to fetch and categorize metrics
- Implement _build_om_metric_detail() to convert OpenMetadata format to MetricParser JSON
- Add /api/catalog/metrics/<fqn> endpoint for metric detail modal
- Update _load_metrics_data() to prefer catalog over YAML fallback
- Update metric_modal.js to route catalog:{fqn} to catalog API endpoint
- Delete 10 demo YAML files from docs/metrics/
- Replace metric tests with new unit tests for catalog parsing functions (19 tests)

Catalog metrics provide single source of truth vs maintaining demo YAML files.
UI remains unchanged - only data source changes from YAML to OpenMetadata catalog.
2026-03-12 15:10:42 +01:00
Petr
14d75d6229 Fix: correct OpenMetadata catalog URL path and add debug logging
- Change catalog URL from /explore/{fqn} to /table/{fqn}
- Add debug logging to see parsed tags, owners, tier from API response
2026-03-12 14:34:12 +01:00
Petr
de66f6dd55 Fix: include partition fields in TableConfig for catalog enrichment
Pass partition_by, partition_granularity, partition_column_type, and
incremental_window_days from YAML to TableConfig to avoid validation errors
when sync_strategy='partitioned'
2026-03-12 14:29:05 +01:00
Petr
2d03a9b557 Display OpenMetadata catalog enrichment in table profile overview
- API endpoint /api/catalog/profile/ enriches response with catalog metadata (tier, owners, tags, url)
- renderOverview() template function displays 'Data Catalog' section with tier, owners, tags, and catalog link
- Graceful degradation: section only shown if catalog enrichment available
2026-03-12 14:28:02 +01:00
Petr
c5c24cb45b Implement OpenMetadata catalog integration (Phase 1)
Add OpenMetadata REST API connector and enricher to merge table/column metadata
from OpenMetadata catalog at sync and query time.

Changes:
- connectors/openmetadata/client.py: HTTP client for OM API
- connectors/openmetadata/enricher.py: Data enrichment with TTL cache
- tests/test_openmetadata_*: Unit tests for client and enricher
- src/config.py: Add catalog_fqn field to TableConfig
- src/data_sync.py: Use enricher in _generate_schema_yaml (catalog > BQ API > data_description.md)
- webapp/app.py: Initialize enricher, enrich catalog data with tags/tier/owners/url
- config/instance.yaml.example: Document openmetadata section

Features:
- FQN auto-derivation: bigquery.{table.id}
- TTL cache (default 1h) to avoid repeated API calls
- Graceful degradation: disabled if token missing, silent on HTTP errors
- Column description priority: catalog > BQ API > (none)
- Table description priority: catalog > data_description.md
2026-03-12 14:07:13 +01:00
Petr
d438438e33 Add configurable white-label theming via instance.yaml
Extend theming from 3 CSS variables (primary colors only) to 14
configurable properties covering colors, fonts, borders, and shape.
All values are optional with sensible defaults.

- New _theme.html include replaces duplicated inline injection
- Wire theme include into all 7 templates (base, login, dashboard,
  catalog, admin_tables, activity_center, corporate_memory)
- Conditional font loading: skip default Inter when custom font_url set
- Config.theme_overrides() classmethod generates CSS variable dict
- Visual theme-reference.html guide for instance configurators
- Document all theme keys in instance.yaml.example
2026-03-11 13:58:58 +01:00
Petr
eb5264b903 Make header logo configurable via instance.yaml logo_svg
Move hardcoded Keboola SVG logo from 4 templates into config.
Templates now use {{ config.LOGO_SVG | safe }}.
Default falls back to Keboola logo when not configured.
2026-03-11 13:08:26 +01:00
Petr
d3c7f7feea Fix activity-center 500 error: provide default data structure
The activity_center view was passing an empty dict but the template
expected nested keys (executive_summary, maturity_roadmap, etc).
Added _build_activity_data() that returns properly structured defaults.
2026-03-11 12:59:40 +01:00
Petr
91a05a2c2b Add auth.disabled_providers config to skip auth providers
Reads disabled_providers list from instance.yaml auth section.
Listed providers are skipped during auto-discovery.
2026-03-11 12:54:23 +01:00
Petr
954aa0f17e Add theme color support via instance.yaml
Allow instances to override primary CSS color variables
through theme section in instance.yaml config.
2026-03-11 00:42:10 +01:00
Petr
ad3b94c168 Add Business Metrics card to dashboard 2026-03-10 22:52:48 +01:00
Petr
34fde746e7 Remove hardcoded Jira and Telemetry cards from dashboard 2026-03-10 22:50:22 +01:00
Petr
49559fba1b Remove hardcoded Jira and Telemetry cards from catalog
These Keboola-specific data source cards don't belong in the OSS repo.
The catalog now shows only dynamic content: Core Business Data (from
data_description.md) and Business Metrics (from docs/metrics/*.yml).

Also update auto-install.md with Business Metrics documentation,
pipeline diagram, and expanded checklist.
2026-03-10 22:48:07 +01:00
Petr
5a84473213 Add dynamic Business Metrics with sample e-commerce definitions
Replace hardcoded Keboola-specific metrics card in Data Catalog with
dynamic Jinja template that renders whatever metric YAMLs exist in
docs/metrics/. Add 10 sample e-commerce metric definitions across
4 categories (revenue, customers, marketing, support) that align
with the sample data generator tables.

Key changes:
- MetricParser: new category colors + dynamic sql_* field discovery
- _load_metrics_data(): scans docs/metrics/*/*.yml with prod fallback
- catalog.html: 240 lines hardcoded HTML -> 35 lines Jinja loop
- metric_modal.js: regex-based category class removal, new categories
- 21 tests validating YAML schema, parser, and loader
2026-03-10 22:38:44 +01:00
Petr
28543d98b1 Fix profiler file_size and catalog stats fallback
- Profiler computes file_size_mb from actual parquet files when
  sync_state.json is absent (sample data / no-sync deployments)
- Catalog header falls back to profiles.json for aggregate stats
  (tables count, total rows) when sync_state.json is missing
2026-03-10 22:12:46 +01:00
Petr
1ac868d787 Improve setup instructions for robustness
- Check for existing SSH config entry before overwriting
- Use --no-perms --no-group in rsync (fixes macOS permission errors)
- Explicit mkdir instead of brace expansion (Claude Code compatibility)
- Gracefully handle missing server directories (empty server is OK)
- Conditional steps for setup_views.sh and CLAUDE.md template
2026-03-10 11:29:31 +01:00
Petr
fde1d6fc01 Move Claude Code setup to dashboard, remove step 5 from onboarding
- Get Started page now has 4 steps (folder, SSH key, pubkey, register)
- After account creation, dashboard shows prominent "Set up your local
  environment" CTA with claude command and Copy Setup Instructions
- CTA only visible when user hasn't synced yet (last_sync is empty)
- Bottom banner demoted to subtle secondary style for returning users
2026-03-10 11:18:56 +01:00
Petr
45454ab86a Redesign onboarding: compact single-screen layout with terminal block
- Merge steps 1-3 into a single dark terminal block with copy buttons
- Inline registration form with single-row layout for step 4
- Compact step 5 with Claude Code command and copy button on one line
- Full-width layout (960px) instead of narrow 640px column
- Everything fits on one screen without scrolling
2026-03-10 11:10:19 +01:00
Petr
9c4208bb89 Unify onboarding into single-column stepper with inline registration
Merge the two-column layout (setup steps + registration form) into one
unified flow. Step 4 now contains the registration form inline, creating
a natural top-to-bottom progression through the setup process.
2026-03-10 11:06:11 +01:00
Petr
21af1abb6e Fix setup instructions: add SSH key steps, fix clipboard on HTTP
- Add steps 2-4 (SSH key generation, copy pubkey, create account)
- Fix clipboard copy using textarea fallback for non-HTTPS contexts
- Generate simple plain-text Claude Code prompt instead of full YAML
- Show what Claude will do (SSH, rsync, DuckDB, CLAUDE.md)
2026-03-10 11:00:48 +01:00
Petr
f635195c80 Add multi-domain support and full-email username generation
- Support comma-separated domains in auth.allowed_domain config
- Use full email as system username (user@domain.com -> user_domain_com)
  to avoid collisions with reserved names and across domains
- Update both auth providers (google, email) for multi-domain display
- Add tests for username generation and update email auth tests
2026-03-10 10:50:01 +01:00
Petr
e2ab219171 Add email magic link authentication provider
New pluggable auth provider that sends passwordless sign-in links.
Works with domain restriction (same as Google OAuth). Falls back to
showing the link in browser when SMTP is not configured (dev mode).
2026-03-10 10:39:19 +01:00
Petr
b99ec576ca Add self-service data onboarding system
Table Registry as central source of truth (JSON) with atomic writes,
optimistic locking, audit logging, and data_description.md generation.
Existing readers (config.py, profiler.py) need zero changes.

Phase 1 - Discovery API:
  - discover_tables() on DataSource ABC + Keboola implementation
  - admin_required decorator with server-side recomputation
  - GET /api/admin/discover-tables endpoint

Phase 2 - Table Registry:
  - src/table_registry.py with CRUD, validation, migration from MD
  - Admin API: register/update/unregister with version locking
  - DELETE cascade cleans up per-user subscriptions

Phase 3 - Auto-Profiling:
  - profile_changed_tables() for incremental profiling
  - Non-fatal hook in sync_all() after successful sync

Phase 4 - Per-Table Subscriptions:
  - table_mode (all/explicit) with per-table toggles
  - GET/POST /api/table-subscriptions endpoints
  - Subscription status in catalog and dashboard views

Phase 5 - Smart Sync:
  - Python-generated rsync filter files (not shell YAML parsing)
  - sync_data.sh uses --filter="merge ..." for explicit mode

Phase 6 - Admin UI:
  - /admin/tables with discovery, registration modal, registry mgmt
  - Vanilla JS, matching existing design system
2026-03-09 14:25:37 +01:00
Petr
c6a711aa27 Extract pluggable auth provider system into auth/ package
Replace hardcoded Google OAuth + password auth registration with
auto-discovered auth providers. Each provider in auth/<name>/provider.py
implements AuthProvider ABC and is automatically registered at startup.

- auth/__init__.py: AuthProvider ABC + discover_providers() scanner
- auth/google/: Google OAuth provider (extracted from webapp/auth.py)
- auth/password/: Email/password provider (delegates to webapp/password_auth)
- auth/desktop/: Desktop JWT auth (API-only, not visible on login page)
- webapp/auth.py: stripped to core infra (login_required, /login, /logout)
- webapp/app.py: auto-discovery loop replaces manual blueprint registration
- login.html: dynamic provider buttons via Jinja loop
2026-03-09 13:02:08 +01:00
Petr
f2d3d156e3 Move standalone services from server/ to services/
Extract 4 self-contained services into services/ module:
- server/telegram_bot/ -> services/telegram_bot/
- server/ws_gateway/ -> services/ws_gateway/
- server/corporate_memory/ -> services/corporate_memory/
- server/session_collector.py -> services/session_collector/

Each service now has its own systemd/ directory with .service and .timer files.
deploy.sh updated to auto-discover service units from services/*/systemd/*.

server/ now contains only deployment infrastructure (deploy.sh, setup scripts,
bin/ management tools, sudoers, nginx config).

All imports updated: webapp/app.py, server/bin/ scripts, systemd ExecStart paths.
2026-03-09 12:54:30 +01:00
Petr
86edd27655 Extract Jira into connectors/jira module
Move all Jira-specific code into a self-contained connector module:
- 22 files moved via git mv (transform, service, webhook, scripts,
  systemd units, tests, docs, bin helper)
- All imports updated to use connectors.jira.* paths
- Jira is now conditional: auto-detected via JIRA_DOMAIN env var
- Webapp registers Jira blueprint only when available
- Health service monitors Jira timers only when enabled
- Profiler loads Jira tables dynamically from filesystem
- Sync settings uses config-driven dependency validation
- Renamed keboola_platform_url -> custom_url in transform
- Updated deploy.sh, sudoers-deploy, backfill_gap.sh paths
- Fixed pytest.ini to skip live tests by default
2026-03-09 11:17:50 +01:00
Petr
26c4e0934d OSS cleanup: remove internal references, harden deployment, add config env interpolation
Phase 1 - Internal reference cleanup:
- Delete dev_docs/meetings/ (internal meeting notes/transcripts)
- Replace hardcoded usernames (padak/matejkys/dasa) with deploy/generic
- Replace "Internal AI Data Analyst" with "AI Data Analyst"
- Replace keboola/internal_ai_data_analyst URLs with your-org/ai-data-analyst
- Replace /tmp/keboola_load/ with /tmp/data_analyst_staging/ in dev_docs

Phase 2 - Deployment hardening:
- Tighten sudoers wildcards to explicit paths (visudo, sudoers cp)
- setup.sh creates all groups (data-ops, dataread, data-private) and deploy user
- webapp-setup.sh copies sudoers-webapp from repo instead of inline definition
- deploy.sh conditional copy for data_description.md (not in git for OSS)
- deploy.sh ownership changed to deploy:data-ops for /data/{scripts,docs,examples}

Phase 3 - Config and misc:
- Add ${ENV_VAR} interpolation to config/loader.py
- Expand config/instance.yaml.example with all sections (admins, deployment, auth, etc.)
- Create config/.env.template for secret values
- Add MIT LICENSE
- Fix .gitignore: add .venv/, docs/data_description.md
- Fix README.md: CSV status Planned, remove metrics/, update license text
- Translate Czech comments in requirements.txt to English
- Fix test_account_service.py: mock username mapping instead of relying on instance config

All 118 tests pass.
2026-03-09 07:59:57 +01:00