Commit graph

85 commits

Author SHA1 Message Date
Petr
80c5b902e0 Add scheduled data sync and catalog refresh with systemd timers
- New sync_schedule and profile_after_sync fields in TableConfig
  (formats: "every 15m", "every 1h", "daily 05:00")
- New src/scheduler.py with schedule evaluation logic (is_table_due)
- New --scheduled mode in data_sync.py: only syncs tables that are due,
  respects profile_after_sync flag, auto-restarts webapp after profiling
- Systemd timer+service for data-refresh (every 15 min)
- Systemd timer+service for catalog-refresh (every 15 min)
- deploy.sh enables new timers automatically
- Complete table config reference in data_description.md.example
- 58 new scheduler tests
2026-03-15 02:16:31 +01:00
Petr
d9f3977028 URL-encode FQN in catalog header links (spaces -> %20) 2026-03-15 02:06:22 +01:00
Petr
60039c0af3 Add direct catalog URL to YAML header (metric/table entity links)
Source line now links directly to the entity in OpenMetadata:
- metrics: https://datacatalog.../metric/UniqueVisitors
- tables:  https://datacatalog.../table/bigquery.project.dataset.table
2026-03-15 02:03:27 +01:00
Petr
ab1a93ed67 Strip HTML tags from OpenMetadata descriptions in YAML export
OpenMetadata stores descriptions as rich HTML (<p>, <strong>, &nbsp;, etc.).
Add strip_html() to transformer that converts to clean plain text for YAML
files consumed by Claude Code agent. Applied to metric descriptions, table
descriptions, and column descriptions. Webapp display dict keeps raw HTML
since the modal renders it correctly.
2026-03-15 01:57:04 +01:00
Petr
985f47cdb7 Add catalog export: generate YAML metrics and tables from OpenMetadata
- New `connectors/openmetadata/transformer.py` with shared parsing logic
  for extracting categories, grain, dimensions, expressions from OM tags
- New `src/catalog_export.py` script (python -m src.catalog_export) that
  fetches metrics/tables from OpenMetadata API and writes YAML files to
  /data/docs/metrics/ and /data/docs/tables/ for agent consumption
- Refactor webapp/app.py to delegate to transformer (with inline fallback)
- Add `fields` parameter to client.get_metrics() and get_metric_by_fqn()
  for fetching tags+owners in a single API call
- Fix pre-existing mock bug in test_openmetadata_enricher (base_url)
- 101 new tests (80 transformer + 21 export), all passing
2026-03-15 01:15:30 +01:00
Petr
e17dd85504 Remove hardcoded Jira/Keboola references from sync_data.sh
- Silent fallback when no sync settings exist (no 'Jira disabled' message)
- Generic dataset exclude/include loop driven by sync_settings.yaml
- Generic cleanup loop for disabled datasets
- Replaces 100+ lines of hardcoded Jira/kbc_telemetry_expert blocks
2026-03-15 01:02:37 +01:00
Petr
49adbe26ec Move server venv setup to bootstrap, remove cross-platform pip freeze sync
Server venv is created during bootstrap via SSH (same package list,
installed natively on Linux). Removes sync_data.sh section that copied
pip freeze output across platforms (Windows/macOS freeze is incompatible
with Linux).
2026-03-15 00:59:45 +01:00
Petr
22a1bb5847 Auto-restart sync_data.sh after self-update (exec replaces process) 2026-03-15 00:53:17 +01:00
Petr
6f9de274fb Fix: CLAUDE.md template vars overwrote $SSH_HOST used by rsync
The CLAUDE.md generation section reused SSH_HOST variable name to store
the server IP, overwriting the SSH alias needed for rsync. Renamed to
TMPL_SSH_ALIAS/TMPL_SERVER_HOST/TMPL_WEBAPP_URL to avoid collision.
2026-03-15 00:51:49 +01:00
Petr
508d92771f Generate setup instructions from bootstrap.yaml (single source of truth)
- Rewrite bootstrap.yaml as clean structured YAML with steps, commands,
  descriptions, conditions, and notes
- Add _generate_setup_instructions() in app.py that reads YAML, substitutes
  placeholders, and produces clipboard-ready plain text
- Replace 50-line hardcoded JS string builder with single tojson variable
- All setup instructions now editable in one YAML file
2026-03-15 00:37:19 +01:00
Petr
85c07732b2 Fix dashboard stats: support flat sync_state.json format (no 'tables' wrapper)
BigQuery adapter writes table entries at top level, not nested under 'tables'.
Detect flat format by checking if values contain 'rows' key.
2026-03-15 00:26:10 +01:00
Petr
b3ba65be59 Add ssh_alias, ssh_key, project_dir, disabled_providers to instance.yaml.example 2026-03-15 00:15:20 +01:00
Petr
3ebb15cbab Make project_dir, ssh_key configurable in Get Started UI
Read server.project_dir from instance.yaml (default: 'data-analyst').
Replace hardcoded 'data-analyst' folder name and 'data_analyst_server'
SSH key name in dashboard template with Jinja variables.
2026-03-15 00:12:46 +01:00
Petr
021c453ea6 Auto-create .sync_connection via printf command in bootstrap
Replace 'Save this to .sync_connection' prose with actual printf command
that Claude/user executes. Fix heredoc indentation in bootstrap.yaml.
2026-03-15 00:05:42 +01:00
Petr
b0e4749b0d Replace hardcoded 'data-analyst' SSH alias with configurable $SSH_HOST
Read SSH alias from .sync_connection file at script start (default:
'data-analyst' for backward compatibility). All 32 occurrences of
hardcoded 'data-analyst:' and 'ssh data-analyst' replaced with $SSH_HOST.
2026-03-15 00:03:07 +01:00
Petr
2237334b05 Make CLAUDE.md template generic and instance-aware
- Remove all Keboola-specific content (metric categories, MRR/ARR refs,
  corporate memory, hardcoded server IP)
- Add {ssh_alias}, {server_host}, {webapp_url} placeholders
- Bootstrap saves .sync_connection file with instance details
- sync_data.sh reads .sync_connection to substitute all placeholders
- Text instructions in dashboard include .sync_connection step
2026-03-14 23:57:58 +01:00
Petr
6728da63fb Use ssh_alias and ssh_key from config in bootstrap text instructions
Replace hardcoded 'data-analyst' and '~/.ssh/data_analyst_server' in
the copyBootstrapInstructions JS function with values from instance config.
Pass ssh_alias and ssh_key to dashboard template context.
2026-03-14 21:06:37 +01:00
Petr
13938bf72f Read ssh_alias and ssh_key from instance.yaml for bootstrap instructions
Config reads server.ssh_alias and server.ssh_key from instance.yaml
(defaults: 'data-analyst' and '~/.ssh/data_analyst_server' for backward compat).
App.py substitutes {ssh_alias} and {ssh_key} in bootstrap.yaml template.
2026-03-14 21:04:51 +01:00
Petr
140cbb3cee Make bootstrap.yaml instance-agnostic with configurable SSH alias
Add {ssh_alias} and {ssh_key} placeholders so each instance can use
its own SSH config name (avoids conflicts when user has multiple instances).
Remove Keboola-specific sync_settings and dataset references.
Simplify to single download_server_data step (rsync with scp fallback).
Handle SSH alias conflicts gracefully.
2026-03-14 20:58:26 +01:00
Petr
4206b06d92 Make deploy.sh data-source agnostic with --scripts-only flag
- Add --scripts-only flag for quick script/docs deployment without restart
- Replace hardcoded Keboola env vars with generic loop over all known vars
  (supports Keboola, BigQuery, OpenMetadata, and optional services)
- Make data directories conditional (Jira, notifications, corporate memory
  created only when relevant code/config exists)
- Enable timers only when their .timer files exist on disk
- Use root:data-ops ownership (works without deploy user)
2026-03-14 20:38:43 +01:00
Petr
c2681ccc86 Add cache-busting with git commit hash for static assets
Flask will now include git commit hash as URL parameter (v=abc1234)
for metric_modal.js and other static assets. This ensures browser
doesn't cache stale JavaScript when code changes.

Cache invalidation based on actual git history rather than timestamps.
2026-03-12 15:37:29 +01:00
Petr
f6000cc867 Fix: URL-encode metric FQN in catalog modal request
When metric FQN contains spaces (e.g. 'Active2 Customers'), JavaScript
was creating invalid URLs with literal spaces. Now properly encoding FQN
with encodeURIComponent() to convert spaces to %20 before sending to API.

Flask automatically decodes the path parameter back to original FQN.
2026-03-12 15:20:10 +01:00
Petr
1bcd7e4080 Fix: URL-decode metric FQN in catalog endpoint
FQN can contain spaces (e.g., 'Active2 Customers') which get URL-encoded
as 'Active2%20Customers' in the path parameter. Need to decode before
passing to OpenMetadata API.
2026-03-12 15:18:08 +01:00
Petr
268fe07f91 Fix: Use correct OpenMetadata API field names for metrics
OpenMetadata uses different field names than expected:
- metricExpression instead of expression
- metricType instead of type
- unitOfMeasurement instead of unit
- granularity instead of grain

Remove 'fields' query parameter from /api/v1/metrics - returns 400 Bad Request
when invalid field names are specified. Let API return full metric objects.

Update parsing to extract metadata from proper OpenMetadata fields instead
of relying on tags (tags are optional, fields are always present).
2026-03-12 15:16:24 +01:00
Petr
da6d605ae0 Add sample metric YAML as fallback when OpenMetadata metrics unavailable
The /api/v1/metrics endpoint may not be available in all OpenMetadata instances.
This sample metric provides a fallback for demonstration purposes.
2026-03-12 15:14:04 +01:00
Petr
5fc9526627 Phase 2: Replace demo YAML metrics with OpenMetadata catalog data
- Add get_metric_by_fqn() to OpenMetadataClient
- Add get_metrics() to CatalogEnricher with TTL caching
- Implement _parse_om_metric() to extract category/grain from OpenMetadata tags
- Implement _load_metrics_from_catalog() to fetch and categorize metrics
- Implement _build_om_metric_detail() to convert OpenMetadata format to MetricParser JSON
- Add /api/catalog/metrics/<fqn> endpoint for metric detail modal
- Update _load_metrics_data() to prefer catalog over YAML fallback
- Update metric_modal.js to route catalog:{fqn} to catalog API endpoint
- Delete 10 demo YAML files from docs/metrics/
- Replace metric tests with new unit tests for catalog parsing functions (19 tests)

Catalog metrics provide single source of truth vs maintaining demo YAML files.
UI remains unchanged - only data source changes from YAML to OpenMetadata catalog.
2026-03-12 15:10:42 +01:00
Petr
be58e63394 Move profiler config to instance.yaml (KISS principle)
Instead of hardcoded Python constants, load profiler settings from config:
- instance.yaml: profiler section with all parameters
- Defaults: fallback to sensible defaults if config not found
- Centralized: all profiler tuning in one place, no code changes needed
2026-03-12 14:45:14 +01:00
Petr
c25278538c Simplify profiler config: use single SAMPLE_SIZE parameter (KISS)
Replace SAMPLE_THRESHOLD + SAMPLE_SIZE with single SAMPLE_SIZE:
- If table > SAMPLE_SIZE: sample that many rows
- Otherwise: use all rows

Cleaner, easier to configure.
2026-03-12 14:43:23 +01:00
Petr
e2d3afade3 Log when catalog enrichment (tags/tier) are found 2026-03-12 14:34:45 +01:00
Petr
14d75d6229 Fix: correct OpenMetadata catalog URL path and add debug logging
- Change catalog URL from /explore/{fqn} to /table/{fqn}
- Add debug logging to see parsed tags, owners, tier from API response
2026-03-12 14:34:12 +01:00
Petr
de66f6dd55 Fix: include partition fields in TableConfig for catalog enrichment
Pass partition_by, partition_granularity, partition_column_type, and
incremental_window_days from YAML to TableConfig to avoid validation errors
when sync_strategy='partitioned'
2026-03-12 14:29:05 +01:00
Petr
2d03a9b557 Display OpenMetadata catalog enrichment in table profile overview
- API endpoint /api/catalog/profile/ enriches response with catalog metadata (tier, owners, tags, url)
- renderOverview() template function displays 'Data Catalog' section with tier, owners, tags, and catalog link
- Graceful degradation: section only shown if catalog enrichment available
2026-03-12 14:28:02 +01:00
Petr
a7faf70cb3 Allow self-signed certificates for OpenMetadata catalog (internal networks)
OpenMetadata catalog uses self-signed HTTPS certificate on internal networks.
Disable SSL verification in httpx client and suppress related warnings.
2026-03-12 14:12:44 +01:00
Petr
c5c24cb45b Implement OpenMetadata catalog integration (Phase 1)
Add OpenMetadata REST API connector and enricher to merge table/column metadata
from OpenMetadata catalog at sync and query time.

Changes:
- connectors/openmetadata/client.py: HTTP client for OM API
- connectors/openmetadata/enricher.py: Data enrichment with TTL cache
- tests/test_openmetadata_*: Unit tests for client and enricher
- src/config.py: Add catalog_fqn field to TableConfig
- src/data_sync.py: Use enricher in _generate_schema_yaml (catalog > BQ API > data_description.md)
- webapp/app.py: Initialize enricher, enrich catalog data with tags/tier/owners/url
- config/instance.yaml.example: Document openmetadata section

Features:
- FQN auto-derivation: bigquery.{table.id}
- TTL cache (default 1h) to avoid repeated API calls
- Graceful degradation: disabled if token missing, silent on HTTP errors
- Column description priority: catalog > BQ API > (none)
- Table description priority: catalog > data_description.md
2026-03-12 14:07:13 +01:00
Petr
8bb46a9e0a Add per-partition streaming sync and hybrid query architecture
Partitioned sync: iterates day-by-day instead of loading full dataset.
Each partition: query BQ -> stream to disk -> free RAM. Peak ~50 MB.
New helpers: _sync_single_partition, _cleanup_old_partitions, _generate_partition_dates.

Config: added partition_column_type (DATE/TIMESTAMP/DATETIME), query_mode (local/remote/hybrid).
DuckDB manager: hybrid architecture support (local Parquet + remote BQ tables).
Data sync: skips remote tables, filters by query_mode.

Tests: 113 passing (adapter, client, config, data_sync, duckdb_manager).
2026-03-12 13:20:41 +01:00
Petr
d2e83ce9d0 Set DuckDB memory_limit=4GB in profiler to prevent OOM
Server has 8GB RAM with other services running. DuckDB defaults to
using all available memory, causing OOM killer when profiling large
tables (22M rows, 39 cols triggered 7.5GB RSS -> killed).
2026-03-12 11:06:49 +01:00
Petr
85c87ec375 Pass explicit bqstorage_client to to_arrow_iterable() for Storage API
Without explicit bqstorage_client parameter, to_arrow_iterable() silently
falls back to REST API pagination (~5K rows/sec). With explicit client,
it uses parallel gRPC streams via BQ Storage API (~300K rows/sec).

No temp table materialization - BQ already writes query results to an
internal temp table automatically. We just tell the reader to use the
fast gRPC path instead of slow HTTP pagination.
2026-03-12 10:51:44 +01:00
Petr
4f74543a12 Fix streaming: use RowIterator.to_arrow_iterable() not QueryJob
QueryJob only has to_arrow(), not to_arrow_iterable().
Must call query_job.result() first to get RowIterator,
which has the streaming to_arrow_iterable() method.
2026-03-11 20:15:35 +01:00
Petr
ee70da86c3 Stream BQ results to Parquet instead of loading into memory
Replace to_arrow() (loads entire result into RAM) with
to_arrow_iterable() (streams RecordBatches). Each batch is written
directly to disk via ParquetWriter - constant memory regardless
of table size. Prevents OOM on 8GB server for multi-million row tables.
2026-03-11 20:13:03 +01:00
Petr
a191ede28c Add columns and row_filter to TableConfig for selective BQ export
Propagate column selection and row filtering from data_description.md
through the BigQuery adapter to the BQ client. This enables exporting
only needed columns and applying date range filters at the SQL level,
critical for large DataView tables (e.g., 412-col unit_economics).
2026-03-11 19:37:04 +01:00
Petr
468f56092b Add standalone DuckDB-based data profiler script
Zero-dependency profiler for Parquet/CSV files producing JSON profiles
with column statistics, histograms, alerts, and sample data.
Supports single files, directories, composite primary keys, and
optional HTML report generation.
2026-03-11 15:12:04 +01:00
Petr
c77a6f6c2e Fix clipped annotation badges in theme-reference.html
Remove overflow:hidden from mockup containers and reposition
surface/text_primary badges that were cut off at edges.
2026-03-11 14:09:04 +01:00
Petr
e26e47a071 Add BQ Storage API fallback to REST when readsessions permission missing 2026-03-11 13:59:09 +01:00
Petr
d438438e33 Add configurable white-label theming via instance.yaml
Extend theming from 3 CSS variables (primary colors only) to 14
configurable properties covering colors, fonts, borders, and shape.
All values are optional with sensible defaults.

- New _theme.html include replaces duplicated inline injection
- Wire theme include into all 7 templates (base, login, dashboard,
  catalog, admin_tables, activity_center, corporate_memory)
- Conditional font loading: skip default Inter when custom font_url set
- Config.theme_overrides() classmethod generates CSS variable dict
- Visual theme-reference.html guide for instance configurators
- Document all theme keys in instance.yaml.example
2026-03-11 13:58:58 +01:00
Petr
758910463b Add BigQuery data source adapter
BigQuery connector that syncs BQ tables to local Parquet files via PyArrow
(no CSV intermediate step). Supports full refresh, timestamp-based
incremental (via incremental_column), and partition-based sync strategies.

- connectors/bigquery/client.py: BQ API wrapper with ADC auth, parameterized
  queries, metadata cache, cross-project support (job project != data project)
- connectors/bigquery/adapter.py: DataSource implementation with merge/dedup
- src/config.py: Add incremental_column field to TableConfig
- 72 unit tests (mocked, no GCP SDK required)
2026-03-11 13:56:12 +01:00
Petr
eb5264b903 Make header logo configurable via instance.yaml logo_svg
Move hardcoded Keboola SVG logo from 4 templates into config.
Templates now use {{ config.LOGO_SVG | safe }}.
Default falls back to Keboola logo when not configured.
2026-03-11 13:08:26 +01:00
Petr
d3c7f7feea Fix activity-center 500 error: provide default data structure
The activity_center view was passing an empty dict but the template
expected nested keys (executive_summary, maturity_roadmap, etc).
Added _build_activity_data() that returns properly structured defaults.
2026-03-11 12:59:40 +01:00
Petr
91a05a2c2b Add auth.disabled_providers config to skip auth providers
Reads disabled_providers list from instance.yaml auth section.
Listed providers are skipped during auto-discovery.
2026-03-11 12:54:23 +01:00
Petr
954aa0f17e Add theme color support via instance.yaml
Allow instances to override primary CSS color variables
through theme section in instance.yaml config.
2026-03-11 00:42:10 +01:00
Petr
e35e602c59 Update CLAUDE.md with metrics, table registry, password auth
Add docs/metrics/ to project structure, Business Metrics and Table
Registry patterns to implementation details, password auth provider
to extensibility section, fix sync command for returning users.
2026-03-10 23:05:03 +01:00