Diagnostic + operator-facing documentation that closes the loop on the work in this PR.

`da diagnose` (via /api/health/detailed):
- New _check_bq_billing_project() helper. When data_source.type='bigquery' and BqProjects.billing == .data, surface a yellow warning: 'BigQuery billing project equals data project'. Hint includes the YAML field path + the /admin/server-config UI shortcut. Diagnose's overall status promotes warning → degraded so the CLI echoes it.
- Non-BQ instances (Keboola-only, etc.) skip the check.
- Implementation hooks into the existing /api/health/detailed surface: no new endpoint, no CLI changes.

config/instance.yaml.example documentation:
- data_source.bigquery.billing_project: USER_PROJECT_DENIED hint, /admin/server-config UI reference
- data_source.bigquery.legacy_wrap_views: analyst-side discipline note (use `da fetch` / `da query --remote`), issue #101 history, view-heavy deployment guidance
- data_source.bigquery.max_bytes_per_materialize: cost guardrail block (NEW: wasn't documented in .example before)
- ai.base_url: provider list + UI hint
- openmetadata + desktop: 'configurable via /admin/server-config UI' headers
- corporate_memory: leading note that the schema is editable via UI

Other docs:
- CHANGELOG.md: comprehensive Unreleased section
- CLAUDE.md: schema chain → v20 + Materialized SQL connector mode + per-connector tab UI mention
- README.md: mode-first source table summary
- docs/architecture.md: per-connector tab UI mention
- cli/skills/connectors.md: bootstrap rails (parallel to #154)
- docs/superpowers/plans/2026-05-01-admin-tables-form-cleanup.md: implementation plan archive (2515 lines)
- scripts/seed_dummy_tables.py: drop is_public after #150 RBAC migration (column gone)

Tests:
- test_diagnose_billing.py: 3 cases (BQ with billing==data warns, BQ with billing!=data clean, non-BQ skips)
78 lines
2.9 KiB
Markdown
# Connectors: How to add a new data source

## Existing Connectors
- **Keboola** (`connectors/keboola/extractor.py`): DuckDB Keboola extension, batch pull
- **BigQuery** (`connectors/bigquery/extractor.py`): DuckDB BQ extension, remote-only
- **Jira** (`connectors/jira/`): Webhook + incremental parquet transform

## extract.duckdb Contract

Every connector produces the same output:
```
/data/extracts/{source_name}/
├── extract.duckdb   ← _meta table + views
└── data/            ← parquet files (local sources only)
```

The `_meta` table must have columns:
- `table_name VARCHAR`: view name
- `description VARCHAR`
- `rows BIGINT`
- `size_bytes BIGINT`
- `extracted_at TIMESTAMP`
- `query_mode VARCHAR`: 'local' (data here) or 'remote' (query on demand)

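
The column list above can be written down as data and checked before a row is inserted. The following is a minimal sketch, not the project's own code: `META_COLUMNS`, `meta_ddl` and `validate_meta_row` are hypothetical names, and the set of accepted `query_mode` values is assumed to be exactly the two documented here.

```python
from datetime import datetime, timezone

# Column names and types copied from the contract above.
META_COLUMNS = [
    ("table_name", "VARCHAR"),
    ("description", "VARCHAR"),
    ("rows", "BIGINT"),
    ("size_bytes", "BIGINT"),
    ("extracted_at", "TIMESTAMP"),
    ("query_mode", "VARCHAR"),
]
# Assumption: only the two documented modes are valid in _meta.
QUERY_MODES = {"local", "remote"}


def meta_ddl() -> str:
    """Build a CREATE TABLE statement for _meta (hypothetical helper)."""
    cols = ", ".join(f"{name} {sql_type}" for name, sql_type in META_COLUMNS)
    return f"CREATE TABLE IF NOT EXISTS _meta ({cols})"


def validate_meta_row(row: dict) -> list[str]:
    """Return the problems found in a candidate _meta row (empty if none)."""
    expected = [name for name, _ in META_COLUMNS]
    problems = [f"missing column: {c}" for c in expected if c not in row]
    problems += [f"unexpected column: {c}" for c in row if c not in expected]
    if row.get("query_mode") not in QUERY_MODES:
        problems.append(f"query_mode must be one of {sorted(QUERY_MODES)}")
    return problems


# Illustrative row; the values are made up.
example_row = {
    "table_name": "orders",
    "description": "Example orders table",
    "rows": 1200,
    "size_bytes": 48_000,
    "extracted_at": datetime.now(timezone.utc),
    "query_mode": "local",
}
```

Calling `validate_meta_row(example_row)` returns an empty list; dropping a key or using an undocumented mode returns one message per problem.
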
## Adding a New Connector

1. Create `connectors/<name>/extractor.py`:
```python
from pathlib import Path

import duckdb


def run(output_dir: str, table_configs: list[dict], **kwargs):
    output = Path(output_dir)
    data_dir = output / "data"
    data_dir.mkdir(parents=True, exist_ok=True)

    conn = duckdb.connect(str(output / "extract.duckdb"))
    try:
        # Create the _meta table with the columns listed in the contract
        conn.execute("""
            CREATE TABLE IF NOT EXISTS _meta (
                table_name VARCHAR, description VARCHAR, rows BIGINT,
                size_bytes BIGINT, extracted_at TIMESTAMP, query_mode VARCHAR
            )
        """)
        # For each table: COPY TO parquet, create view, insert _meta row
    finally:
        # Close the connection even if extraction fails part-way
        conn.close()
```

2. Register tables in DuckDB `table_registry` via admin API or migration script.
   Set `source_type` to your connector name.

3. Add required env vars to `.env` and `config/.env.template`.

4. The SyncOrchestrator (`src/orchestrator.py`) will auto-discover your `extract.duckdb`.

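
To illustrate what auto-discovery in step 4 relies on, the sketch below walks the `/data/extracts/{source_name}/extract.duckdb` layout from the contract. It is not the SyncOrchestrator's actual code: `discover_extracts` and its `extracts_root` parameter are hypothetical, and the real orchestrator may apply further checks.

```python
from pathlib import Path


def discover_extracts(extracts_root: str) -> dict[str, Path]:
    """Map each source name to its extract.duckdb file (hypothetical helper)."""
    root = Path(extracts_root)
    found = {}
    # One directory per source: {extracts_root}/{source_name}/extract.duckdb
    for db_path in sorted(root.glob("*/extract.duckdb")):
        found[db_path.parent.name] = db_path
    return found
```

Under this assumption, a new connector is picked up as soon as it writes `extract.duckdb` into its own directory; a source directory without that file is skipped.
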
## Configuration
- Instance-level config: `config/instance.yaml` (connection details)
- Table definitions: DuckDB `table_registry` table
- Credentials: environment variables

## BigQuery: pick a mode

| Need | Mode | Why |
|------|------|-----|
| Latency under 100 ms, table fits on disk | `materialized` | Local parquet, no BQ roundtrip |
| Table too large for analyst's disk, occasional ad-hoc query | `remote` | DuckDB BQ extension, no download |
| Table too large for disk AND analyst hits it constantly | `materialized` with aggregation/filter | Scheduled COPY of a slice |
| One-off subquery joined with local data | (no registry row) | Use `da query --register-bq …` for ad-hoc |

Cost: `materialized` scans BigQuery once per `sync_schedule` run, regardless of how many analysts query the result; `remote` scans BigQuery once per analyst query. Within one sync interval, `materialized` is cheaper roughly when (number of remote queries × bytes scanned per query) exceeds the bytes scanned by one COPY.

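
The break-even rule above can be sketched as a small comparison. This is an illustration only: `cheaper_mode` is a hypothetical helper, the byte counts are made-up inputs, and the comparison looks at bytes scanned alone, ignoring storage and latency.

```python
GIB = 1024**3


def remote_bytes_scanned(queries_per_interval: int, bytes_per_query: int) -> int:
    """Bytes BigQuery scans in `remote` mode: every analyst query scans again."""
    return queries_per_interval * bytes_per_query


def cheaper_mode(queries_per_interval: int, bytes_per_query: int, bytes_per_copy: int) -> str:
    """Compare one sync interval of remote queries against a single COPY."""
    remote = remote_bytes_scanned(queries_per_interval, bytes_per_query)
    # `materialized` scans once per sync interval, however many analysts query it.
    return "materialized" if bytes_per_copy < remote else "remote"


# 40 analyst queries at 2 GiB each versus one 10 GiB COPY per interval.
busy = cheaper_mode(40, 2 * GIB, 10 * GIB)
# 3 analyst queries at 2 GiB each versus the same COPY.
quiet = cheaper_mode(3, 2 * GIB, 10 * GIB)
```
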
Guardrail: `data_source.bigquery.max_bytes_per_materialize` (default 10 GiB) blocks the COPY when BQ's dry-run estimate exceeds the cap. Set it explicitly per environment in `instance.yaml`.

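
The guardrail amounts to a single comparison before the COPY runs. The sketch below assumes the dry-run estimate is already available as an integer byte count; `check_materialize_cap` and `MaterializeTooLarge` are hypothetical names, not the project's API.

```python
# 10 GiB, the documented default for max_bytes_per_materialize.
DEFAULT_MAX_BYTES_PER_MATERIALIZE = 10 * 1024**3


class MaterializeTooLarge(Exception):
    """Hypothetical error for a dry-run estimate above the configured cap."""


def check_materialize_cap(
    estimated_bytes: int,
    max_bytes: int = DEFAULT_MAX_BYTES_PER_MATERIALIZE,
) -> None:
    """Raise before the COPY starts if the estimate exceeds the cap."""
    if estimated_bytes > max_bytes:
        raise MaterializeTooLarge(
            f"dry-run estimate {estimated_bytes} bytes exceeds cap {max_bytes} bytes"
        )
```
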
Register a materialized table:

```bash
da admin register-table orders_90d \
  --source-type bigquery \
  --query-mode materialized \
  --query @docs/queries/orders_90d.sql \
  --schedule "every 6h"
```

`--query` also accepts inline SQL.