OpenMetadata search API ignores queryFilter for dataProducts field.
Use type-specific index + client-side filtering by dataProducts
membership instead. Correctly returns 16/32 metrics for FoundryAI.
- client.py: add search_by_data_product() for OpenMetadata search API
- catalog_export.py: prefer data product discovery over tag filtering
(finds all 16 metrics in FoundryAIDataModel vs 3 with tag filter)
- remove-analyst: fix GROUPS bash variable collision, improve messaging
The scheduler.py already supported "daily HH:MM,HH:MM,HH:MM" format
(commit 5f27d05), but config.py validation regex only accepted single
time "daily HH:MM", causing data-refresh to crash on startup.
Also adds:
- tests/test_config_sync_schedule.py (16 test cases)
- Makefile with validate-config target for CI/CD integration
Scheduler now accepts comma-separated HH:MM times in daily schedules.
Each time slot is independently evaluated - if any slot has passed and
last_sync is before it, the table is marked as due.
This lets tables sync multiple times per day to pick up data refreshes
that happen throughout the day (e.g., Keboola pipelines running 3x/day).
When BQ returns empty results (e.g., data not yet refreshed), the
scheduler was marking sync as complete for the day. This meant the
next 15-min tick would skip it ("none are due") and data would stay
stale until the next day's scheduled run.
Now: if partitioned sync processes partitions but gets 0 new rows,
last_sync is NOT updated. The scheduler will retry on the next tick
(15 min later) when data may be available.
List view:
- Show display_name ("M1 + VFM Operational") instead of name ("M1PlusVFMOperational")
- Strip HTML and truncate description for clean list excerpts
Modal detail:
- Render original HTML from catalog instead of stripped plain text
- Add .om-description CSS class for structured HTML (bold labels, lists, code)
- Pass description_html alongside plain text description for backwards compat
Add filter_tag support to catalog_export and webapp so only metrics
with the required tag are exported to YAML and displayed in UI.
Previously all 19+ metrics were exported regardless of relevance.
- Add has_tag() helper to transformer module
- catalog_export.py: filter_tag parameter from instance.yaml openmetadata config
- webapp/app.py: filter metrics in _load_metrics_from_catalog()
- 7 new tests (has_tag, filter_tag export, stale cleanup)
The script was exiting silently on the GROUPS=$(groups ... | cut ...)
line — set -eo pipefail caused bash to terminate the script before any
echo output, making it appear to do nothing.
Replace set -euo pipefail with set -u and explicit error handling.
Admin scripts must always report what happened, never exit silently.
Also: use id -nG instead of groups|cut pipe, add verification step
after userdel, and log each operation for visibility.
data-refresh.service: use /tmp instead of /tmp/data_analyst_staging in
ReadWritePaths — the subdirectory may not exist at service start, causing
mount namespace setup to fail before any Exec* directive runs.
deploy.sh: fix typo services/corporate-memory -> services/corporate_memory
so the mkdir conditional actually matches the repo directory name.
deploy.sh: add ReadWritePaths validation loop that auto-creates any missing
directories listed in installed .service files before daemon-reload. This
acts as a safety net against future NAMESPACE failures from new services.
- New sync_schedule and profile_after_sync fields in TableConfig
(formats: "every 15m", "every 1h", "daily 05:00")
- New src/scheduler.py with schedule evaluation logic (is_table_due)
- New --scheduled mode in data_sync.py: only syncs tables that are due,
respects profile_after_sync flag, auto-restarts webapp after profiling
- Systemd timer+service for data-refresh (every 15 min)
- Systemd timer+service for catalog-refresh (every 15 min)
- deploy.sh enables new timers automatically
- Complete table config reference in data_description.md.example
- 58 new scheduler tests
OpenMetadata stores descriptions as rich HTML (<p>, <strong>, , etc.).
Add strip_html() to transformer that converts to clean plain text for YAML
files consumed by Claude Code agent. Applied to metric descriptions, table
descriptions, and column descriptions. Webapp display dict keeps raw HTML
since the modal renders it correctly.
- New `connectors/openmetadata/transformer.py` with shared parsing logic
for extracting categories, grain, dimensions, expressions from OM tags
- New `src/catalog_export.py` script (python -m src.catalog_export) that
fetches metrics/tables from OpenMetadata API and writes YAML files to
/data/docs/metrics/ and /data/docs/tables/ for agent consumption
- Refactor webapp/app.py to delegate to transformer (with inline fallback)
- Add `fields` parameter to client.get_metrics() and get_metric_by_fqn()
for fetching tags+owners in a single API call
- Fix pre-existing mock bug in test_openmetadata_enricher (base_url)
- 101 new tests (80 transformer + 21 export), all passing
Server venv is created during bootstrap via SSH (same package list,
installed natively on Linux). Removes sync_data.sh section that copied
pip freeze output across platforms (Windows/macOS freeze is incompatible
with Linux).
The CLAUDE.md generation section reused SSH_HOST variable name to store
the server IP, overwriting the SSH alias needed for rsync. Renamed to
TMPL_SSH_ALIAS/TMPL_SERVER_HOST/TMPL_WEBAPP_URL to avoid collision.
- Rewrite bootstrap.yaml as clean structured YAML with steps, commands,
descriptions, conditions, and notes
- Add _generate_setup_instructions() in app.py that reads YAML, substitutes
placeholders, and produces clipboard-ready plain text
- Replace 50-line hardcoded JS string builder with single tojson variable
- All setup instructions now editable in one YAML file
Read server.project_dir from instance.yaml (default: 'data-analyst').
Replace hardcoded 'data-analyst' folder name and 'data_analyst_server'
SSH key name in dashboard template with Jinja variables.
Read SSH alias from .sync_connection file at script start (default:
'data-analyst' for backward compatibility). All 32 occurrences of
hardcoded 'data-analyst:' and 'ssh data-analyst' replaced with $SSH_HOST.
Replace hardcoded 'data-analyst' and '~/.ssh/data_analyst_server' in
the copyBootstrapInstructions JS function with values from instance config.
Pass ssh_alias and ssh_key to dashboard template context.
Config reads server.ssh_alias and server.ssh_key from instance.yaml
(defaults: 'data-analyst' and '~/.ssh/data_analyst_server' for backward compat).
App.py substitutes {ssh_alias} and {ssh_key} in bootstrap.yaml template.
Add {ssh_alias} and {ssh_key} placeholders so each instance can use
its own SSH config name (avoids conflicts when user has multiple instances).
Remove Keboola-specific sync_settings and dataset references.
Simplify to single download_server_data step (rsync with scp fallback).
Handle SSH alias conflicts gracefully.
- Add --scripts-only flag for quick script/docs deployment without restart
- Replace hardcoded Keboola env vars with generic loop over all known vars
(supports Keboola, BigQuery, OpenMetadata, and optional services)
- Make data directories conditional (Jira, notifications, corporate memory
created only when relevant code/config exists)
- Enable timers only when their .timer files exist on disk
- Use root:data-ops ownership (works without deploy user)
Flask will now include git commit hash as URL parameter (v=abc1234)
for metric_modal.js and other static assets. This ensures browser
doesn't cache stale JavaScript when code changes.
Cache invalidation based on actual git history rather than timestamps.
When metric FQN contains spaces (e.g. 'Active2 Customers'), JavaScript
was creating invalid URLs with literal spaces. Now properly encoding FQN
with encodeURIComponent() to convert spaces to %20 before sending to API.
Flask automatically decodes the path parameter back to original FQN.
FQN can contain spaces (e.g., 'Active2 Customers') which get URL-encoded
as 'Active2%20Customers' in the path parameter. Need to decode before
passing to OpenMetadata API.
OpenMetadata uses different field names than expected:
- metricExpression instead of expression
- metricType instead of type
- unitOfMeasurement instead of unit
- granularity instead of grain
Remove 'fields' query parameter from /api/v1/metrics - returns 400 Bad Request
when invalid field names are specified. Let API return full metric objects.
Update parsing to extract metadata from proper OpenMetadata fields instead
of relying on tags (tags are optional, fields are always present).
- Add get_metric_by_fqn() to OpenMetadataClient
- Add get_metrics() to CatalogEnricher with TTL caching
- Implement _parse_om_metric() to extract category/grain from OpenMetadata tags
- Implement _load_metrics_from_catalog() to fetch and categorize metrics
- Implement _build_om_metric_detail() to convert OpenMetadata format to MetricParser JSON
- Add /api/catalog/metrics/<fqn> endpoint for metric detail modal
- Update _load_metrics_data() to prefer catalog over YAML fallback
- Update metric_modal.js to route catalog:{fqn} to catalog API endpoint
- Delete 10 demo YAML files from docs/metrics/
- Replace metric tests with new unit tests for catalog parsing functions (19 tests)
Catalog metrics provide single source of truth vs maintaining demo YAML files.
UI remains unchanged - only data source changes from YAML to OpenMetadata catalog.
Instead of hardcoded Python constants, load profiler settings from config:
- instance.yaml: profiler section with all parameters
- Defaults: fallback to sensible defaults if config not found
- Centralized: all profiler tuning in one place, no code changes needed
Replace SAMPLE_THRESHOLD + SAMPLE_SIZE with single SAMPLE_SIZE:
- If table > SAMPLE_SIZE: sample that many rows
- Otherwise: use all rows
Cleaner, easier to configure.
Pass partition_by, partition_granularity, partition_column_type, and
incremental_window_days from YAML to TableConfig to avoid validation errors
when sync_strategy='partitioned'
- API endpoint /api/catalog/profile/ enriches response with catalog metadata (tier, owners, tags, url)
- renderOverview() template function displays 'Data Catalog' section with tier, owners, tags, and catalog link
- Graceful degradation: section only shown if catalog enrichment available
Add OpenMetadata REST API connector and enricher to merge table/column metadata
from OpenMetadata catalog at sync and query time.
Changes:
- connectors/openmetadata/client.py: HTTP client for OM API
- connectors/openmetadata/enricher.py: Data enrichment with TTL cache
- tests/test_openmetadata_*: Unit tests for client and enricher
- src/config.py: Add catalog_fqn field to TableConfig
- src/data_sync.py: Use enricher in _generate_schema_yaml (catalog > BQ API > data_description.md)
- webapp/app.py: Initialize enricher, enrich catalog data with tags/tier/owners/url
- config/instance.yaml.example: Document openmetadata section
Features:
- FQN auto-derivation: bigquery.{table.id}
- TTL cache (default 1h) to avoid repeated API calls
- Graceful degradation: disabled if token missing, silent on HTTP errors
- Column description priority: catalog > BQ API > (none)
- Table description priority: catalog > data_description.md
Server has 8GB RAM with other services running. DuckDB defaults to
using all available memory, causing OOM killer when profiling large
tables (22M rows, 39 cols triggered 7.5GB RSS -> killed).
Without explicit bqstorage_client parameter, to_arrow_iterable() silently
falls back to REST API pagination (~5K rows/sec). With explicit client,
it uses parallel gRPC streams via BQ Storage API (~300K rows/sec).
No temp table materialization - BQ already writes query results to an
internal temp table automatically. We just tell the reader to use the
fast gRPC path instead of slow HTTP pagination.
QueryJob only has to_arrow(), not to_arrow_iterable().
Must call query_job.result() first to get RowIterator,
which has the streaming to_arrow_iterable() method.
Replace to_arrow() (loads entire result into RAM) with
to_arrow_iterable() (streams RecordBatches). Each batch is written
directly to disk via ParquetWriter - constant memory regardless
of table size. Prevents OOM on 8GB server for multi-million row tables.