Commit graph

51 commits

Author SHA1 Message Date
ZdenekSrotyr
c20da6d744 Remove dead Flask Blueprint from Jira connector
The Flask-based webhook endpoint at connectors/jira/webhook.py is no longer used.
FastAPI handles webhooks via app/api/jira_webhooks.py which is already integrated
into the application. This removes the redundant Flask code.
2026-04-09 07:13:20 +02:00
ZdenekSrotyr
1488e01bf9 feat: add temp-file swap to BigQuery extractor
Write to extract.duckdb.tmp, then atomically swap into place with WAL cleanup.
Prevents lock conflicts with orchestrator holding read lock on existing database.
2026-04-09 07:00:19 +02:00
ZdenekSrotyr
f25393871d fix: escape single quotes in ATTACH TOKEN parameters
- In src/orchestrator.py _attach_remote_extensions: escape token with '' before passing to ATTACH
- In connectors/keboola/extractor.py _try_attach_extension: escape token with '' before passing to ATTACH

Prevents SQL injection if token contains single quotes.
2026-04-09 07:00:13 +02:00
ZdenekSrotyr
1b219cabe9 fix: remove dead PRAGMA enable_wal code
DuckDB has used WAL by default since v0.8, so this pragma is not
valid DuckDB syntax. Removed obsolete try-except block that attempted
to enable WAL on system database initialization.
2026-04-09 06:59:57 +02:00
ZdenekSrotyr
e425d4baa5 fix: handle WAL files in atomic swap to prevent DB corruption
Add _atomic_swap_db helper that removes stale WAL files before and after
moving the temp DuckDB into place. Apply CHECKPOINT before close in both
orchestrator and Keboola extractor so DuckDB flushes WAL before the swap.
2026-04-09 06:57:29 +02:00
ZdenekSrotyr
89154d043b chore: clean repo for public release — fix references, remove drafts
- Replace padak/tmp_oss → keboola/agnes-the-ai-analyst in all docs, infra, CLI
- Replace your-org/ai-data-analyst → keboola/agnes-the-ai-analyst in README, Jira docs
- Remove real GCP project ID from terraform.tfvars.example
- Delete internal draft documents (dev_docs/draft/)
- Update infra/main.tf to clone from main branch
2026-04-08 19:27:25 +02:00
ZdenekSrotyr
79443e0df4 fix: CSV all_varchar in legacy extractor, rewrite DEPLOYMENT.md from real deploy
- Legacy extractor now uses read_csv(all_varchar=true) to avoid type
  inference errors (e.g. seniority column typed as DOUBLE with string values)
- DEPLOYMENT.md rewritten based on actual dev VM deployment experience:
  deploy key setup, DuckDB write locking, env reload gotchas, bootstrap flow
2026-04-08 19:09:55 +02:00
ZdenekSrotyr
3ba207a7f8 feat: add _remote_attach to BigQuery extractor, support token-less ATTACH in orchestrator
BigQuery extension handles auth via GOOGLE_APPLICATION_CREDENTIALS env var,
so _remote_attach uses empty token_env. Orchestrator now supports both
token-based (Keboola) and env-based (BigQuery) authentication modes.
2026-04-08 18:13:31 +02:00
ZdenekSrotyr
06e1cf0a8d feat: generic _remote_attach contract for remote DuckDB extension views
Extractors with remote tables now write a _remote_attach table into
extract.duckdb so the orchestrator can re-ATTACH external extensions
at query time. The mechanism is source-agnostic — any connector can use it.

- Keboola extractor writes _remote_attach + creates views on kbc.*
- Orchestrator reads _remote_attach, installs extension, reads token from env
- Graceful degradation: missing token → warning, local tables still work
2026-04-08 18:10:12 +02:00
ZdenekSrotyr
92fbb88c15 chore: Docker prod config (Python 3.13, no reload), fix utcnow deprecation, update docs 2026-04-08 12:10:47 +02:00
ZdenekSrotyr
67a1e0bb45 feat: Jira webhook FastAPI adapter — replaces Flask Blueprint 2026-04-08 07:04:50 +02:00
ZdenekSrotyr
4d1acd014a refactor: remove legacy webapp + add missing tests + housekeeping
Phase A: Close fixed issues (#7, #8, #9), add server/ user/ to
.gitignore, increase extractor timeout to 30 min.

Phase B: Add 10 new tests — access request lifecycle (4), CLI admin
commands (5), sync subprocess trigger (1). 578 tests passing.

Phase C: Delete entire webapp/ directory (24,800 lines) — legacy Flask
app fully replaced by FastAPI app/. Fix auth providers to use
app.instance_config instead of webapp.config. Update CLAUDE.md.

Delete 6 webapp-only test files. Fix Jira service config imports.
2026-03-31 13:44:06 +02:00
ZdenekSrotyr
2d6a94fb6f fix: DuckDB concurrency — WAL mode, subprocess sync, temp+rename
Three-pronged fix for DuckDB lock conflicts:

1. WAL mode on system.duckdb — enables concurrent readers + writer
2. Sync trigger runs extractor as subprocess (not background task) —
   separate process = separate DuckDB connections, no lock conflict
3. Both extractor and orchestrator write to .tmp then atomic rename —
   avoids lock conflict with API reads on extract.duckdb/analytics.duckdb

Fixes #9 permanently.
2026-03-31 13:19:57 +02:00
ZdenekSrotyr
10d9280ab5 fix: extractor writes to temp file to avoid lock with orchestrator
Writes extract.duckdb.tmp then renames atomically, avoiding DuckDB lock
conflict when orchestrator holds a read connection on extract.duckdb.
2026-03-31 13:09:51 +02:00
ZdenekSrotyr
bd0b6d19c6 fix: legacy extractor constructs full Keboola table ID from bucket+source_table
Was using tc['id'] which is the registry ID (e.g. 'circle'), not the
full Keboola ID (e.g. 'in.c-finance.circle') needed by the API.
2026-03-31 12:06:38 +02:00
ZdenekSrotyr
0084f80ff6 fix: legacy extractor passes Path to export_table, not str
Fixes 'str' object has no attribute 'parent' when Keboola DuckDB
extension falls back to legacy client.
2026-03-31 12:03:16 +02:00
ZdenekSrotyr
865d6d657e fix: keboola client metadata_cache_path uses DATA_DIR instead of deleted config
Fixes #7 — NameError: name 'config' is not defined
2026-03-31 11:57:57 +02:00
ZdenekSrotyr
b502bd8bdd refactor: delete old sync pipeline — 9,500 lines removed
Phase 5 cleanup: remove all code replaced by extract.duckdb architecture.

Deleted modules:
- src/config.py (653) — replaced by DuckDB table_registry
- src/parquet_manager.py (755) — replaced by DuckDB COPY TO
- src/data_sync.py (734) — replaced by SyncOrchestrator
- src/remote_query.py (636) — replaced by DuckDB BigQuery ATTACH
- src/table_registry.py (464) — replaced by DuckDB repository
- connectors/keboola/adapter.py (820) — replaced by extractor.py
- connectors/bigquery/adapter.py (665) — replaced by extractor.py
- connectors/bigquery/client.py (644) — replaced by DuckDB BQ extension

Updated all imports in webapp, catalog_export, enricher, router,
sync_settings_service, generate_sample_data. Kept keboola/client.py
as fallback (removed src.config dependency).

704 tests passing.
2026-03-31 07:50:37 +02:00
ZdenekSrotyr
9f20529f10 fix: resolve 7 preexisting test failures
- Remove iCloud duplicate files (test_db 2.py, src/db 2.py)
- Fix metrics expression fallback to top-level field in transformer + webapp
- Fix sync_data.sh rsync exception pattern for $SSH_HOST variable
- Fix deploy_guard cp regex to skip shell variable expansions
- Update sudoers-deploy with missing root:data-ops rules
- Update CRITICAL_DIRS ownership expectations to match deploy.sh reality

913 tests passing, 0 failures.
2026-03-30 20:36:00 +02:00
ZdenekSrotyr
e2a7ee21a2 fix: Jira extract_init handles empty parquet dirs gracefully
DuckDB read_parquet glob fails when no files match. Skip view creation
for tables without parquet files, create views only after first write.
2026-03-30 20:28:29 +02:00
ZdenekSrotyr
e058c71777 feat: adapt Jira connector to extract.duckdb format
- New extract_init.py: creates extract.duckdb with _meta + views for 6 entity types
- Update default paths to /data/extracts/jira/data/ and /data/extracts/jira/raw/
- After parquet writes, update _meta table in extract.duckdb
- Trigger SyncOrchestrator.rebuild_source("jira") after successful transform
2026-03-30 20:19:27 +02:00
ZdenekSrotyr
18e5f0b6e8 feat: implement extract.duckdb contract — orchestrator + extractors
Phase 0: extend table_registry schema (v1→v2 migration), add
source_type/bucket/source_table/query_mode columns.

Phase 1: SyncOrchestrator ATTACHes extract.duckdb files into master
analytics.duckdb. Keboola extractor uses DuckDB extension with
legacy client fallback. BigQuery extractor is remote-only via
DuckDB BQ extension (no data download).

62 tests passing.
2026-03-30 20:12:56 +02:00
Petr
c04791b702 Suppress httpcore debug logging in LLM connector 2026-03-23 12:57:35 +01:00
Petr
f619fadc42 Fix SSL verification and suppress OpenAI SDK debug logging
- Add verify_ssl config option for corporate proxies with self-signed certs
- Suppress openai/httpx debug loggers that dump full request bodies
  (including prompt content) — security requirement
2026-03-23 12:56:04 +01:00
Petr
95358448e6 Add modular LLM connector for Corporate Memory
Replace hardwired Anthropic API calls with a pluggable provider system.
Each deployment configures its AI provider in instance.yaml — switching
between Anthropic, LiteLLM, OpenRouter, or any OpenAI-compatible proxy
is a config change, not a code change.

New connectors/llm/ module:
- StructuredExtractor Protocol with extract_json() interface
- AnthropicExtractor: direct Anthropic SDK with retry + backoff
- OpenAICompatExtractor: any OpenAI-compatible proxy with three-layer
  structured output fallback (json_schema -> json_object -> prompt)
- Configurable structured_output policy (strict/json/auto)
- Custom exception hierarchy (auth/rate_limit/timeout/format/refusal)
- Zero secrets in logs: no API keys, prompts, or responses logged

Reviewed by: Google Gemini, Claude Sonnet, OpenAI GPT-5.4.
Security audit passed with all critical findings resolved.
2026-03-23 12:08:33 +01:00
Petr
ed16122994 Use data_product config for metric discovery instead of filter_tag in webapp 2026-03-18 16:10:15 +01:00
Petr
e63c8747b5 Fix metric expression extraction: use 'code' field
OpenMetadata stores SQL in metricExpression.code, not .expression.
This caused all metric expressions to export as empty strings.
2026-03-18 13:01:23 +01:00
Petr
908d1f2247 Fix search_by_data_product: client-side filtering
OpenMetadata search API ignores queryFilter for dataProducts field.
Use type-specific index + client-side filtering by dataProducts
membership instead. Correctly returns 16/32 metrics for FoundryAI.
2026-03-18 12:54:59 +01:00
Petr
fb63a72a98 Add data product discovery, fix remove-analyst script
- client.py: add search_by_data_product() for OpenMetadata search API
- catalog_export.py: prefer data product discovery over tag filtering
  (finds all 16 metrics in FoundryAIDataModel vs 3 with tag filter)
- remove-analyst: fix GROUPS bash variable collision, improve messaging
2026-03-18 12:52:41 +01:00
Petr
f19ff10e1a Fix: don't update last_sync when partitioned sync gets 0 new rows
When BQ returns empty results (e.g., data not yet refreshed), the
scheduler was marking sync as complete for the day. This meant the
next 15-min tick would skip it ("none are due") and data would stay
stale until the next day's scheduled run.

Now: if partitioned sync processes partitions but gets 0 new rows,
last_sync is NOT updated. The scheduler will retry on the next tick
(15 min later) when data may be available.
2026-03-16 23:01:35 +01:00
Petr
9be22fdc82 Fix metric display: use displayName in list, render HTML in modal
List view:
- Show display_name ("M1 + VFM Operational") instead of name ("M1PlusVFMOperational")
- Strip HTML and truncate description for clean list excerpts

Modal detail:
- Render original HTML from catalog instead of stripped plain text
- Add .om-description CSS class for structured HTML (bold labels, lists, code)
- Pass description_html alongside plain text description for backwards compat
2026-03-16 22:11:58 +01:00
Petr
ad525a96aa Filter catalog metrics by configurable tag (e.g., AIAgent.FoundryAI)
Add filter_tag support to catalog_export and webapp so only metrics
with the required tag are exported to YAML and displayed in UI.
Previously all 19+ metrics were exported regardless of relevance.

- Add has_tag() helper to transformer module
- catalog_export.py: filter_tag parameter from instance.yaml openmetadata config
- webapp/app.py: filter metrics in _load_metrics_from_catalog()
- 7 new tests (has_tag, filter_tag export, stale cleanup)
2026-03-16 22:03:53 +01:00
Petr
ab1a93ed67 Strip HTML tags from OpenMetadata descriptions in YAML export
OpenMetadata stores descriptions as rich HTML (<p>, <strong>, &nbsp;, etc.).
Add strip_html() to transformer that converts to clean plain text for YAML
files consumed by Claude Code agent. Applied to metric descriptions, table
descriptions, and column descriptions. Webapp display dict keeps raw HTML
since the modal renders it correctly.
2026-03-15 01:57:04 +01:00
Petr
985f47cdb7 Add catalog export: generate YAML metrics and tables from OpenMetadata
- New `connectors/openmetadata/transformer.py` with shared parsing logic
  for extracting categories, grain, dimensions, expressions from OM tags
- New `src/catalog_export.py` script (python -m src.catalog_export) that
  fetches metrics/tables from OpenMetadata API and writes YAML files to
  /data/docs/metrics/ and /data/docs/tables/ for agent consumption
- Refactor webapp/app.py to delegate to transformer (with inline fallback)
- Add `fields` parameter to client.get_metrics() and get_metric_by_fqn()
  for fetching tags+owners in a single API call
- Fix pre-existing mock bug in test_openmetadata_enricher (base_url)
- 101 new tests (80 transformer + 21 export), all passing
2026-03-15 01:15:30 +01:00
Petr
268fe07f91 Fix: Use correct OpenMetadata API field names for metrics
OpenMetadata uses different field names than expected:
- metricExpression instead of expression
- metricType instead of type
- unitOfMeasurement instead of unit
- granularity instead of grain

Remove 'fields' query parameter from /api/v1/metrics - returns 400 Bad Request
when invalid field names are specified. Let API return full metric objects.

Update parsing to extract metadata from proper OpenMetadata fields instead
of relying on tags (tags are optional, fields are always present).
2026-03-12 15:16:24 +01:00
Petr
5fc9526627 Phase 2: Replace demo YAML metrics with OpenMetadata catalog data
- Add get_metric_by_fqn() to OpenMetadataClient
- Add get_metrics() to CatalogEnricher with TTL caching
- Implement _parse_om_metric() to extract category/grain from OpenMetadata tags
- Implement _load_metrics_from_catalog() to fetch and categorize metrics
- Implement _build_om_metric_detail() to convert OpenMetadata format to MetricParser JSON
- Add /api/catalog/metrics/<fqn> endpoint for metric detail modal
- Update _load_metrics_data() to prefer catalog over YAML fallback
- Update metric_modal.js to route catalog:{fqn} to catalog API endpoint
- Delete 10 demo YAML files from docs/metrics/
- Replace metric tests with new unit tests for catalog parsing functions (19 tests)

Catalog metrics provide single source of truth vs maintaining demo YAML files.
UI remains unchanged - only data source changes from YAML to OpenMetadata catalog.
2026-03-12 15:10:42 +01:00
Petr
e2d3afade3 Log when catalog enrichment (tags/tier) are found 2026-03-12 14:34:45 +01:00
Petr
14d75d6229 Fix: correct OpenMetadata catalog URL path and add debug logging
- Change catalog URL from /explore/{fqn} to /table/{fqn}
- Add debug logging to see parsed tags, owners, tier from API response
2026-03-12 14:34:12 +01:00
Petr
a7faf70cb3 Allow self-signed certificates for OpenMetadata catalog (internal networks)
OpenMetadata catalog uses self-signed HTTPS certificate on internal networks.
Disable SSL verification in httpx client and suppress related warnings.
2026-03-12 14:12:44 +01:00
Petr
c5c24cb45b Implement OpenMetadata catalog integration (Phase 1)
Add OpenMetadata REST API connector and enricher to merge table/column metadata
from OpenMetadata catalog at sync and query time.

Changes:
- connectors/openmetadata/client.py: HTTP client for OM API
- connectors/openmetadata/enricher.py: Data enrichment with TTL cache
- tests/test_openmetadata_*: Unit tests for client and enricher
- src/config.py: Add catalog_fqn field to TableConfig
- src/data_sync.py: Use enricher in _generate_schema_yaml (catalog > BQ API > data_description.md)
- webapp/app.py: Initialize enricher, enrich catalog data with tags/tier/owners/url
- config/instance.yaml.example: Document openmetadata section

Features:
- FQN auto-derivation: bigquery.{table.id}
- TTL cache (default 1h) to avoid repeated API calls
- Graceful degradation: disabled if token missing, silent on HTTP errors
- Column description priority: catalog > BQ API > (none)
- Table description priority: catalog > data_description.md
2026-03-12 14:07:13 +01:00
Petr
8bb46a9e0a Add per-partition streaming sync and hybrid query architecture
Partitioned sync: iterates day-by-day instead of loading full dataset.
Each partition: query BQ -> stream to disk -> free RAM. Peak ~50 MB.
New helpers: _sync_single_partition, _cleanup_old_partitions, _generate_partition_dates.

Config: added partition_column_type (DATE/TIMESTAMP/DATETIME), query_mode (local/remote/hybrid).
DuckDB manager: hybrid architecture support (local Parquet + remote BQ tables).
Data sync: skips remote tables, filters by query_mode.

Tests: 113 passing (adapter, client, config, data_sync, duckdb_manager).
2026-03-12 13:20:41 +01:00
Petr
85c87ec375 Pass explicit bqstorage_client to to_arrow_iterable() for Storage API
Without explicit bqstorage_client parameter, to_arrow_iterable() silently
falls back to REST API pagination (~5K rows/sec). With explicit client,
it uses parallel gRPC streams via BQ Storage API (~300K rows/sec).

No temp table materialization - BQ already writes query results to an
internal temp table automatically. We just tell the reader to use the
fast gRPC path instead of slow HTTP pagination.
2026-03-12 10:51:44 +01:00
Petr
4f74543a12 Fix streaming: use RowIterator.to_arrow_iterable() not QueryJob
QueryJob only has to_arrow(), not to_arrow_iterable().
Must call query_job.result() first to get RowIterator,
which has the streaming to_arrow_iterable() method.
2026-03-11 20:15:35 +01:00
Petr
ee70da86c3 Stream BQ results to Parquet instead of loading into memory
Replace to_arrow() (loads entire result into RAM) with
to_arrow_iterable() (streams RecordBatches). Each batch is written
directly to disk via ParquetWriter - constant memory regardless
of table size. Prevents OOM on 8GB server for multi-million row tables.
2026-03-11 20:13:03 +01:00
Petr
a191ede28c Add columns and row_filter to TableConfig for selective BQ export
Propagate column selection and row filtering from data_description.md
through the BigQuery adapter to the BQ client. This enables exporting
only needed columns and applying date range filters at the SQL level,
critical for large DataView tables (e.g., 412-col unit_economics).
2026-03-11 19:37:04 +01:00
Petr
e26e47a071 Add BQ Storage API fallback to REST when readsessions permission missing 2026-03-11 13:59:09 +01:00
Petr
758910463b Add BigQuery data source adapter
BigQuery connector that syncs BQ tables to local Parquet files via PyArrow
(no CSV intermediate step). Supports full refresh, timestamp-based
incremental (via incremental_column), and partition-based sync strategies.

- connectors/bigquery/client.py: BQ API wrapper with ADC auth, parameterized
  queries, metadata cache, cross-project support (job project != data project)
- connectors/bigquery/adapter.py: DataSource implementation with merge/dedup
- src/config.py: Add incremental_column field to TableConfig
- 72 unit tests (mocked, no GCP SDK required)
2026-03-11 13:56:12 +01:00
Petr
b99ec576ca Add self-service data onboarding system
Table Registry as central source of truth (JSON) with atomic writes,
optimistic locking, audit logging, and data_description.md generation.
Existing readers (config.py, profiler.py) need zero changes.

Phase 1 - Discovery API:
  - discover_tables() on DataSource ABC + Keboola implementation
  - admin_required decorator with server-side recomputation
  - GET /api/admin/discover-tables endpoint

Phase 2 - Table Registry:
  - src/table_registry.py with CRUD, validation, migration from MD
  - Admin API: register/update/unregister with version locking
  - DELETE cascade cleans up per-user subscriptions

Phase 3 - Auto-Profiling:
  - profile_changed_tables() for incremental profiling
  - Non-fatal hook in sync_all() after successful sync

Phase 4 - Per-Table Subscriptions:
  - table_mode (all/explicit) with per-table toggles
  - GET/POST /api/table-subscriptions endpoints
  - Subscription status in catalog and dashboard views

Phase 5 - Smart Sync:
  - Python-generated rsync filter files (not shell YAML parsing)
  - sync_data.sh uses --filter="merge ..." for explicit mode

Phase 6 - Admin UI:
  - /admin/tables with discovery, registration modal, registry mgmt
  - Vanilla JS, matching existing design system
2026-03-09 14:25:37 +01:00
Petr
38b86127ed Branding cleanup: remove Keboola-specific references from docs and config
- server/deploy.sh: KEBOOLA_ENV_FILE -> SYNC_ENV_FILE
- server/ws-gateway.service, notify-bot.service: remove Keboola from descriptions
- .gitignore: generic comment for data directory
- CLAUDE.md, README.md, ARCHITECTURE.md: update paths from src/adapters to connectors/
- docs/DATA_SOURCES.md: update custom connector guide to connectors/ pattern
- connectors/jira/README.md: keboola-analyst -> data-analyst in config paths
- dev_docs/desktop-app.md: KeboolaAnalyst -> DataAnalyst branding
2026-03-09 12:22:27 +01:00
Petr
266e8573d3 Extract Keboola into connectors/keboola module
Move all Keboola-specific code out of src/ into connectors/keboola/:
- git mv src/keboola_client.py -> connectors/keboola/client.py
- Extract LocalKeboolaSource (855 lines) from data_sync.py -> connectors/keboola/adapter.py
- Rename to KeboolaDataSource with full env var validation
- Extend DataSource ABC with get_column_metadata() and get_source_name()
- Add dynamic connector registry via importlib in create_data_source()
- Refactor _generate_schema_yaml to use ABC methods (source_type, _schema_version: 2)
- Remove src/adapters/ (redundant facade layer)
- Remove Keboola validation from src/config.py (connector validates itself)
- Add 14 tests for factory, ABC defaults, env validation, dynamic lookup
2026-03-09 12:22:16 +01:00