- services/telegram_bot/config.py: NOTIFICATIONS_DIR now uses DATA_DIR fallback
- src/profiler.py: DATA_DIR now uses main DATA_DIR env var instead of PROFILER_DATA_DIR
- services/telegram_bot/dispatch.py: WS_GATEWAY_SOCKET_PATH now uses WS_GATEWAY_SOCKET env var
- Add close_system_db() function in src/db.py to cleanly close shared DB connection
- Add lifespan context manager in app/main.py to trigger shutdown on app exit
- Integrate lifespan into FastAPI app initialization
- All API tests pass (77/77)
DuckDB has used WAL by default since v0.8, so this pragma is not
valid DuckDB syntax. Removed obsolete try-except block that attempted
to enable WAL on system database initialization.
Adds _SAFE_IDENTIFIER regex guard before ATTACHing extract.duckdb files in the
read-only analytics connection, matching the same fix already applied in the
orchestrator. Adds test coverage for malicious directory names.
Add _atomic_swap_db helper that removes stale WAL files before and after
moving the temp DuckDB into place. Apply CHECKPOINT before close in both
orchestrator and Keboola extractor so DuckDB flushes WAL before the swap.
Expand blocked keywords to cover parquet_scan, read_csv_auto, query_table,
iceberg_scan, delta_scan, call, URL schemes (http/https/s3/gcs), and
additional file-scan functions. Set enable_external_access=false on the
non-read-only analytics connection path. Add three new tests covering
parquet_scan, read_csv_auto, and query_table blocking.
Adds _validate_identifier() with ^[a-zA-Z_][a-zA-Z0-9_]{0,63}$ regex and
applies it to source_name (directory names), table_name (_meta rows), and
alias/extension (_remote_attach rows) before any SQL interpolation.
Adds two tests covering SQL-injection directory names and malicious _meta entries.
_do_rebuild_source was creating a fresh temp DB with only one source,
then atomically replacing analytics.duckdb — wiping views from every
other source. Now it delegates to _do_rebuild so all extract dirs are
re-attached in a single pass.
Adds test_rebuild_source_preserves_other_sources to guard the regression.
BigQuery extension handles auth via GOOGLE_APPLICATION_CREDENTIALS env var,
so _remote_attach uses empty token_env. Orchestrator now supports both
token-based (Keboola) and env-based (BigQuery) authentication modes.
Extractors with remote tables now write a _remote_attach table into
extract.duckdb so the orchestrator can re-ATTACH external extensions
at query time. The mechanism is source-agnostic — any connector can use it.
- Keboola extractor writes _remote_attach + creates views on kbc.*
- Orchestrator reads _remote_attach, installs extension, reads token from env
- Graceful degradation: missing token → warning, local tables still work
Three-pronged fix for DuckDB lock conflicts:
1. WAL mode on system.duckdb — enables concurrent readers + writer
2. Sync trigger runs extractor as subprocess (not background task) —
separate process = separate DuckDB connections, no lock conflict
3. Both extractor and orchestrator write to .tmp then atomic rename —
avoids lock conflict with API reads on extract.duckdb/analytics.duckdb
Fixes#9 permanently.
Fixes#9 — background sync tasks could not access system.duckdb
because FastAPI held an exclusive lock. Now uses single shared
connection per DATA_DIR with cursor() for thread safety.
Schema v3: add is_public column to table_registry (default true).
src/rbac.py: can_access_table() checks admin bypass, public flag,
explicit permissions, wildcard bucket permissions.
API enforcement:
- manifest: filters tables by user access
- download: 403 if no access
- catalog: filters table list
- query: validates referenced tables against allowed list
New admin permissions API (/api/admin/permissions) for grant/revoke.
28 access control tests + 733 total tests passing.
New src/rbac.py: Role enum, hierarchy, get_user_role(), has_role(),
is_admin(), is_km_admin(), has_dataset_access(), set_user_role().
webapp/auth.py: admin_required + km_admin_required now use DuckDB
roles instead of Linux groups (pwd.getpwnam + sudo/data-ops check).
app/auth/dependencies.py: imports Role from src/rbac.py (single source).
11 RBAC tests passing.
- SyncSettingsRepository + DatasetPermissionRepository with RBAC
- Script deploy/run/undeploy API with import sandboxing
- User sync settings API with permission checks
- 4 CLI skills (connectors, security, notifications, corporate-memory)
- Kamal production + staging configs
- GitHub Actions CI + deploy workflows
- 91 total tests passing
Same issue as config.py - profiler's TableInfo and parser required
primary_key and sync_strategy, breaking auto-profile after sync
when daily_deal_traffic (remote-only) is in config.
Remote-only tables (query_mode="remote") like daily_deal_traffic
don't need primary_key or sync_strategy. The parser used hard
lookups (table_data["primary_key"]) causing KeyError and breaking
all data sync since 2026-03-21.
Changes:
- TableConfig: default primary_key="" and sync_strategy="none"
- Parser: use .get() with defaults instead of [] lookups
- Validator: add "none" as valid sync_strategy
Session testing revealed 3 issues with remote queries:
1. CLAUDE.md template recommended `cat <<HEREDOC | ssh ...` but
claude_settings.json had `cat` in deny list, causing 2-3 failed
attempts per query. Replaced with file-based approach: Write tool
creates JSON file, then `ssh ... < file` avoids the cat deny.
2. ssh/scp commands were not in the allow list, requiring manual
approval for every remote query. Added both to allow list.
3. DuckDB fetch_arrow_table() emitted DeprecationWarning on every
parquet export. Replaced with .arrow().read_all().
Also added instruction for proactive hybrid analysis when remote
tables are available (agent was only using local data until asked).
Agent was failing 3x on SSH commands due to backticks (BQ table names)
and single quotes (SQL string literals) getting mangled by nested shell
interpretation (local -> SSH -> bash -> Python).
New --stdin mode reads query spec as JSON from stdin via heredoc:
cat <<'QUERY' | ssh alias 'bash remote_query.sh --stdin'
{"register_bq": {"alias": "SELECT ... FROM \`table\` ..."}, "sql": "..."}
QUERY
Heredoc with <<'QUERY' (quoted) passes everything literally -- no
escaping needed for backticks, quotes, or parentheses.
Updated claude_md_template.txt to use --stdin as the primary method.
Analysts don't have WEBAPP_SECRET_KEY, so load_instance_config()
validation failed with noisy warnings. Now reads instance.yaml
directly with yaml.safe_load, skipping secret validation.
Add src/remote_query.py CLI module enabling the AI agent to run SQL
queries spanning local Parquet tables and remote BigQuery tables in a
single DuckDB session on the server. Two-phase protocol: BQ sub-queries
(--register-bq) fetch filtered/aggregated data, then DuckDB SQL (--sql)
joins everything.
Safety: COUNT(*) pre-check, memory estimation (2GB cap), row limits
(500K per BQ sub-query, 100K final result).
Changes:
- New src/remote_query.py with CLI, BQ registration, output formatting
- Add bq_entity_type field to TableConfig (view vs table routing)
- Extract create_local_views() from duckdb_manager.py for reuse
- Update claude_md_template.txt with remote query agent instructions
- Update example configs with remote_query section and docs
- 52 new tests (42 remote_query + 10 bq_entity_type), all passing
- client.py: add search_by_data_product() for OpenMetadata search API
- catalog_export.py: prefer data product discovery over tag filtering
(finds all 16 metrics in FoundryAIDataModel vs 3 with tag filter)
- remove-analyst: fix GROUPS bash variable collision, improve messaging
The scheduler.py already supported "daily HH:MM,HH:MM,HH:MM" format
(commit 5f27d05), but config.py validation regex only accepted single
time "daily HH:MM", causing data-refresh to crash on startup.
Also adds:
- tests/test_config_sync_schedule.py (16 test cases)
- Makefile with validate-config target for CI/CD integration
Scheduler now accepts comma-separated HH:MM times in daily schedules.
Each time slot is independently evaluated - if any slot has passed and
last_sync is before it, the table is marked as due.
This lets tables sync multiple times per day to pick up data refreshes
that happen throughout the day (e.g., Keboola pipelines running 3x/day).
Add filter_tag support to catalog_export and webapp so only metrics
with the required tag are exported to YAML and displayed in UI.
Previously all 19+ metrics were exported regardless of relevance.
- Add has_tag() helper to transformer module
- catalog_export.py: filter_tag parameter from instance.yaml openmetadata config
- webapp/app.py: filter metrics in _load_metrics_from_catalog()
- 7 new tests (has_tag, filter_tag export, stale cleanup)
- New sync_schedule and profile_after_sync fields in TableConfig
(formats: "every 15m", "every 1h", "daily 05:00")
- New src/scheduler.py with schedule evaluation logic (is_table_due)
- New --scheduled mode in data_sync.py: only syncs tables that are due,
respects profile_after_sync flag, auto-restarts webapp after profiling
- Systemd timer+service for data-refresh (every 15 min)
- Systemd timer+service for catalog-refresh (every 15 min)
- deploy.sh enables new timers automatically
- Complete table config reference in data_description.md.example
- 58 new scheduler tests
- New `connectors/openmetadata/transformer.py` with shared parsing logic
for extracting categories, grain, dimensions, expressions from OM tags
- New `src/catalog_export.py` script (python -m src.catalog_export) that
fetches metrics/tables from OpenMetadata API and writes YAML files to
/data/docs/metrics/ and /data/docs/tables/ for agent consumption
- Refactor webapp/app.py to delegate to transformer (with inline fallback)
- Add `fields` parameter to client.get_metrics() and get_metric_by_fqn()
for fetching tags+owners in a single API call
- Fix pre-existing mock bug in test_openmetadata_enricher (base_url)
- 101 new tests (80 transformer + 21 export), all passing
Instead of hardcoded Python constants, load profiler settings from config:
- instance.yaml: profiler section with all parameters
- Defaults: fallback to sensible defaults if config not found
- Centralized: all profiler tuning in one place, no code changes needed
Replace SAMPLE_THRESHOLD + SAMPLE_SIZE with single SAMPLE_SIZE:
- If table > SAMPLE_SIZE: sample that many rows
- Otherwise: use all rows
Cleaner, easier to configure.
Add OpenMetadata REST API connector and enricher to merge table/column metadata
from OpenMetadata catalog at sync and query time.
Changes:
- connectors/openmetadata/client.py: HTTP client for OM API
- connectors/openmetadata/enricher.py: Data enrichment with TTL cache
- tests/test_openmetadata_*: Unit tests for client and enricher
- src/config.py: Add catalog_fqn field to TableConfig
- src/data_sync.py: Use enricher in _generate_schema_yaml (catalog > BQ API > data_description.md)
- webapp/app.py: Initialize enricher, enrich catalog data with tags/tier/owners/url
- config/instance.yaml.example: Document openmetadata section
Features:
- FQN auto-derivation: bigquery.{table.id}
- TTL cache (default 1h) to avoid repeated API calls
- Graceful degradation: disabled if token missing, silent on HTTP errors
- Column description priority: catalog > BQ API > (none)
- Table description priority: catalog > data_description.md
Server has 8GB RAM with other services running. DuckDB defaults to
using all available memory, causing OOM killer when profiling large
tables (22M rows, 39 cols triggered 7.5GB RSS -> killed).
Propagate column selection and row filtering from data_description.md
through the BigQuery adapter to the BQ client. This enables exporting
only needed columns and applying date range filters at the SQL level,
critical for large DataView tables (e.g., 412-col unit_economics).