Diagnostic + operator-facing documentation that closes the loop on the work in this PR.

`da diagnose` (via /api/health/detailed):
- New _check_bq_billing_project() helper. When data_source.type='bigquery' and BqProjects.billing == .data, surface a yellow warning: 'BigQuery billing project equals data project'. Hint includes the YAML field path + the /admin/server-config UI shortcut. Diagnose's overall status promotes warning → degraded so the CLI echoes it.
- Non-BQ instances (Keboola-only, etc.) skip the check.
- Implementation hooks into the existing /api/health/detailed surface: no new endpoint, no CLI changes.

config/instance.yaml.example documentation:
- data_source.bigquery.billing_project: USER_PROJECT_DENIED hint, /admin/server-config UI reference
- data_source.bigquery.legacy_wrap_views: analyst-side discipline note (use `da fetch` / `da query --remote`), issue #101 history, view-heavy deployment guidance
- data_source.bigquery.max_bytes_per_materialize: cost guardrail block (NEW: wasn't documented in .example before)
- ai.base_url: provider list + UI hint
- openmetadata + desktop: 'configurable via /admin/server-config UI' headers
- corporate_memory: leading note that the schema is editable via UI

Other docs:
- CHANGELOG.md: comprehensive Unreleased section
- CLAUDE.md: schema chain → v20 + Materialized SQL connector mode + per-connector tab UI mention
- README.md: mode-first source table summary
- docs/architecture.md: per-connector tab UI mention
- cli/skills/connectors.md: bootstrap rails (parallel to #154)
- docs/superpowers/plans/2026-05-01-admin-tables-form-cleanup.md: implementation plan archive (2515 lines)
- scripts/seed_dummy_tables.py: drop is_public after #150 RBAC migration (column gone)

Tests:
- test_diagnose_billing.py: 3 cases (BQ with billing==data warns, BQ with billing!=data clean, non-BQ skips)
Connectors — How to add a new data source
Existing Connectors
- Keboola (connectors/keboola/extractor.py): DuckDB Keboola extension, batch pull
- BigQuery (connectors/bigquery/extractor.py): DuckDB BQ extension, remote-only
- Jira (connectors/jira/): Webhook + incremental parquet transform
extract.duckdb Contract
Every connector produces the same output:
/data/extracts/{source_name}/
├── extract.duckdb ← _meta table + views
└── data/ ← parquet files (local sources only)
The _meta table must have columns:
- table_name VARCHAR: view name
- description VARCHAR
- rows BIGINT
- size_bytes BIGINT
- extracted_at TIMESTAMP
- query_mode VARCHAR: 'local' (data here) or 'remote' (query on demand)
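A new connector can be checked against this contract before it is wired into a sync. The following is a minimal sketch using only the standard library: `META_CONTRACT` mirrors the column list above, while `meta_contract_violations` and its input shape (a mapping of column name to type, for example collected from a `DESCRIBE _meta` query) are hypothetical helpers, not part of the project.

```python
# Column names and types copied from the _meta contract above.
META_CONTRACT = {
    "table_name": "VARCHAR",
    "description": "VARCHAR",
    "rows": "BIGINT",
    "size_bytes": "BIGINT",
    "extracted_at": "TIMESTAMP",
    "query_mode": "VARCHAR",
}


def meta_contract_violations(actual_columns: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the table conforms.

    `actual_columns` maps column name -> declared type (hypothetical input shape).
    Extra columns are tolerated; only missing or mistyped ones are reported.
    """
    problems = []
    for name, expected in META_CONTRACT.items():
        actual = actual_columns.get(name)
        if actual is None:
            problems.append(f"missing column: {name}")
        elif actual.upper() != expected:
            problems.append(f"{name}: expected {expected}, got {actual}")
    return problems
```

An empty result means every required column is present with the expected type.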
Adding a New Connector
- Create connectors/<name>/extractor.py:

    import duckdb
    from pathlib import Path

    def run(output_dir: str, table_configs: list[dict], **kwargs):
        output = Path(output_dir)
        data_dir = output / "data"
        data_dir.mkdir(parents=True, exist_ok=True)

        conn = duckdb.connect(str(output / "extract.duckdb"))
        # Create _meta table
        # For each table: COPY TO parquet, create view, insert _meta row
        conn.close()

- Register tables in DuckDB table_registry via admin API or migration script. Set source_type to your connector name.
- Add required env vars to .env and config/.env.template.
- The SyncOrchestrator (src/orchestrator.py) will auto-discover your extract.duckdb.
Configuration
- Instance-level config: config/instance.yaml (connection details)
- Table definitions: DuckDB table_registry table
- Credentials: environment variables
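Because credentials arrive through environment variables, a connector can fail fast when one is missing instead of failing midway through an extract. A minimal sketch follows; `load_credentials` and the variable name `MYSOURCE_API_TOKEN` in the usage note are hypothetical, not names the project defines.

```python
import os
from typing import Mapping


def load_credentials(
    required: list[str], env: Mapping[str, str] = os.environ
) -> dict[str, str]:
    """Return the required variables, or raise listing every missing one."""
    # Treat unset and empty values the same way: both are unusable credentials.
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {key: env[key] for key in required}
```

Calling `load_credentials(["MYSOURCE_API_TOKEN"])` at the top of `run()` would surface a misconfigured .env before any data is pulled.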
BigQuery: pick a mode
| Need | Mode | Why |
|---|---|---|
| Latency under 100 ms, table fits on disk | materialized | Local parquet, no BQ roundtrip |
| Table too large for analyst's disk, occasional ad-hoc query | remote | DuckDB BQ extension, no download |
| Table too large for disk AND analyst hits it constantly | materialized with aggregation/filter | Scheduled COPY of a slice |
| One-off subquery joined with local data | (no registry row) | Use da query --register-bq … for ad-hoc |
Cost: materialized runs once per sync_schedule regardless of how many analysts query it; remote runs once per analyst query. As a rough break-even, materialized is cheaper once the analyst queries per sync interval × bytes scanned per query exceed the bytes scanned by one COPY.
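The break-even can be worked through with a few lines of arithmetic. The sketch below compares bytes scanned only and ignores storage, latency, and pricing tiers; the function and parameter names are illustrative, not part of the CLI.

```python
GIB = 1024**3


def break_even_queries(bytes_per_remote_query: int, bytes_per_copy: int) -> float:
    """Remote queries per sync interval at which both modes scan the same bytes."""
    return bytes_per_copy / bytes_per_remote_query


def cheaper_mode(
    queries_per_sync_interval: float,
    bytes_per_remote_query: int,
    bytes_per_copy: int,
) -> str:
    """Pick the mode that scans fewer bytes over one sync interval."""
    remote_bytes = queries_per_sync_interval * bytes_per_remote_query
    # On an exact tie, stay remote: nothing is gained by scheduling a COPY.
    return "materialized" if bytes_per_copy < remote_bytes else "remote"
```

For example, if one COPY scans 10 GiB and each remote query scans 2 GiB, the break-even is 5 queries per sync interval; above that, materialized scans fewer bytes.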
Guardrail: data_source.bigquery.max_bytes_per_materialize (default 10 GiB) blocks the COPY when BQ's dry-run estimate exceeds the cap. Set it explicitly per environment in instance.yaml.
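The comparison behind the guardrail can be pictured as below. This only sketches the described behavior, assuming the dry-run estimate is already available as a byte count; `check_materialize_cap` is a hypothetical name, and the actual check in the codebase may differ.

```python
GIB = 1024**3
DEFAULT_MAX_BYTES_PER_MATERIALIZE = 10 * GIB  # default stated above


def check_materialize_cap(
    estimated_bytes: int,
    max_bytes: int = DEFAULT_MAX_BYTES_PER_MATERIALIZE,
) -> None:
    """Raise ValueError when the dry-run estimate exceeds the configured cap."""
    if estimated_bytes > max_bytes:
        raise ValueError(
            f"dry-run estimate {estimated_bytes / GIB:.2f} GiB "
            f"exceeds cap {max_bytes / GIB:.2f} GiB"
        )
```

An estimate exactly at the cap passes in this sketch; only estimates strictly above it are rejected.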
Register a materialized table:
da admin register-table orders_90d \
--source-type bigquery \
--query-mode materialized \
--query @docs/queries/orders_90d.sql \
--schedule "every 6h"
--query also accepts inline SQL.