agnes-the-ai-analyst/cli/skills/connectors.md
ZdenekSrotyr b627de8344 feat(diagnose) + docs: warn on USER_PROJECT_DENIED footgun + document all newly-exposed knobs
Diagnostic + operator-facing documentation that closes the loop on the work in this PR.

`da diagnose` (via /api/health/detailed):
  - New _check_bq_billing_project() helper. When data_source.type='bigquery' and BqProjects.billing == .data, surface a yellow warning: 'BigQuery billing project equals data project'. Hint includes the YAML field path + the /admin/server-config UI shortcut. Diagnose's overall status promotes warning → degraded so the CLI echoes it.
  - Non-BQ instances (Keboola-only, etc.) skip the check.
  - Implementation hooks into the existing /api/health/detailed surface — no new endpoint, no CLI changes.

config/instance.yaml.example documentation:
  - data_source.bigquery.billing_project: USER_PROJECT_DENIED hint, /admin/server-config UI reference
  - data_source.bigquery.legacy_wrap_views: analyst-side discipline note (use `da fetch` / `da query --remote`), issue #101 history, view-heavy deployment guidance
  - data_source.bigquery.max_bytes_per_materialize: cost guardrail block (NEW — wasn't documented in .example before)
  - ai.base_url: provider list + UI hint
  - openmetadata + desktop: 'configurable via /admin/server-config UI' headers
  - corporate_memory: leading note that the schema is editable via UI

Other docs:
  - CHANGELOG.md: comprehensive Unreleased section
  - CLAUDE.md: schema chain → v20 + Materialized SQL connector mode + per-connector tab UI mention
  - README.md: mode-first source table summary
  - docs/architecture.md: per-connector tab UI mention
  - cli/skills/connectors.md: bootstrap rails (parallel to #154)
  - docs/superpowers/plans/2026-05-01-admin-tables-form-cleanup.md: implementation plan archive (2515 lines)
  - scripts/seed_dummy_tables.py: drop is_public after #150 RBAC migration (column gone)

Tests:
  - test_diagnose_billing.py — 3 cases (BQ with billing==data warns, BQ with billing!=data clean, non-BQ skips)
2026-05-01 20:27:24 +02:00

# Connectors — How to add a new data source
## Existing Connectors
- **Keboola** (`connectors/keboola/extractor.py`) — DuckDB Keboola extension, batch pull
- **BigQuery** (`connectors/bigquery/extractor.py`) — DuckDB BQ extension, `remote` or `materialized` mode (see "BigQuery: pick a mode")
- **Jira** (`connectors/jira/`) — Webhook + incremental parquet transform
## extract.duckdb Contract
Every connector produces the same output:
```
/data/extracts/{source_name}/
├── extract.duckdb ← _meta table + views
└── data/ ← parquet files (local sources only)
```
The `_meta` table must have columns:
- `table_name VARCHAR` — view name
- `description VARCHAR`
- `rows BIGINT`
- `size_bytes BIGINT`
- `extracted_at TIMESTAMP`
- `query_mode VARCHAR` — 'local' (data here) or 'remote' (query on demand)
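
The contract above can be written down as a small schema check. The following is a minimal sketch, not the project's actual code: `META_COLUMNS`, `meta_ddl`, and `validate_meta_row` are hypothetical names, and only the column names, types, and the two `query_mode` values come from the list above.

```python
# Column names and types copied from the _meta contract above.
META_COLUMNS = {
    "table_name": "VARCHAR",
    "description": "VARCHAR",
    "rows": "BIGINT",
    "size_bytes": "BIGINT",
    "extracted_at": "TIMESTAMP",
    "query_mode": "VARCHAR",
}
QUERY_MODES = {"local", "remote"}


def meta_ddl() -> str:
    """Build a CREATE TABLE statement for _meta (hypothetical helper)."""
    cols = ", ".join(f"{name} {sql_type}" for name, sql_type in META_COLUMNS.items())
    return f"CREATE TABLE IF NOT EXISTS _meta ({cols})"


def validate_meta_row(row: dict) -> None:
    """Raise ValueError if a row does not match the contract."""
    missing = set(META_COLUMNS) - set(row)
    if missing:
        raise ValueError(f"missing _meta columns: {sorted(missing)}")
    if row["query_mode"] not in QUERY_MODES:
        raise ValueError(f"unknown query_mode: {row['query_mode']!r}")
```

Running a check like this before closing the connection would catch a connector that forgets a column or writes an unexpected `query_mode`.
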
## Adding a New Connector
1. Create `connectors/<name>/extractor.py`:
```python
import duckdb
from pathlib import Path


def run(output_dir: str, table_configs: list[dict], **kwargs):
    output = Path(output_dir)
    data_dir = output / "data"
    data_dir.mkdir(parents=True, exist_ok=True)

    conn = duckdb.connect(str(output / "extract.duckdb"))
    # Create _meta table
    # For each table: COPY TO parquet, create view, insert _meta row
    conn.close()
```
2. Register tables in DuckDB `table_registry` via admin API or migration script.
Set `source_type` to your connector name.
3. Add required env vars to `.env` and `config/.env.template`.
4. The SyncOrchestrator (`src/orchestrator.py`) auto-discovers your `extract.duckdb`.
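
The "COPY TO parquet, create view" comment in step 1 could expand to something like the sketch below. It only builds the two DuckDB statements for one local table and leaves out the `_meta` insert. `local_table_statements` and the quoting helpers are hypothetical names, and the `<table_name>.parquet` file naming is an assumption rather than something the repository prescribes.

```python
from pathlib import Path


def quote_ident(name: str) -> str:
    """Quote a SQL identifier, doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(value: str) -> str:
    """Quote a SQL string literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def local_table_statements(table_name: str, select_sql: str, data_dir: Path) -> list[str]:
    """Return the COPY and CREATE VIEW statements for one local table."""
    parquet_path = quote_literal((data_dir / f"{table_name}.parquet").as_posix())
    return [
        # Write the source rows to a parquet file under data/.
        f"COPY ({select_sql}) TO {parquet_path} (FORMAT PARQUET)",
        # Expose the parquet file as a view inside extract.duckdb.
        f"CREATE OR REPLACE VIEW {quote_ident(table_name)} AS "
        f"SELECT * FROM read_parquet({parquet_path})",
    ]
```

Inside `run()`, each returned statement would presumably be passed to `conn.execute(...)` in order, followed by the `_meta` row for that table.
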
## Configuration
- Instance-level config: `config/instance.yaml` (connection details)
- Table definitions: DuckDB `table_registry` table
- Credentials: environment variables
## BigQuery: pick a mode
| Need | Mode | Why |
|------|------|-----|
| Latency under 100 ms, table fits on disk | `materialized` | Local parquet, no BQ roundtrip |
| Table too large for analyst's disk, occasional ad-hoc query | `remote` | DuckDB BQ extension, no download |
| Table too large for disk AND analyst hits it constantly | `materialized` with aggregation/filter | Scheduled COPY of a slice |
| One-off subquery joined with local data | (no registry row) | Use `da query --register-bq …` for ad-hoc |
Cost: `materialized` runs once per `sync_schedule` interval regardless of how many analysts query it; `remote` runs once per analyst query. Materializing pays off roughly when the number of queries per sync interval × bytes scanned per query exceeds the bytes scanned by one COPY.
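
The break-even rule can be made concrete with a little arithmetic. This sketch assumes cost is proportional to bytes scanned and compares a single sync interval. `cheaper_mode` and its parameters are hypothetical, and the numbers in the usage lines are made up for illustration.

```python
def cheaper_mode(queries_per_sync: float, bytes_per_query: int, bytes_per_copy: int) -> str:
    """Compare bytes scanned per sync interval under each mode."""
    remote_bytes = queries_per_sync * bytes_per_query  # one BQ scan per analyst query
    materialized_bytes = bytes_per_copy                # one COPY per sync_schedule run
    return "materialized" if materialized_bytes < remote_bytes else "remote"


GIB = 1024**3
# 40 analyst queries of 2 GiB each per interval vs. a single 50 GiB COPY.
print(cheaper_mode(40, 2 * GIB, 50 * GIB))  # materialized
# 3 queries of 2 GiB each do not justify the same COPY.
print(cheaper_mode(3, 2 * GIB, 50 * GIB))   # remote
```
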
Guardrail: `data_source.bigquery.max_bytes_per_materialize` (default 10 GiB) blocks the COPY when BQ's dry-run estimate exceeds the cap. Set it explicitly per environment in `instance.yaml`.
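
The guardrail amounts to a single comparison before the COPY starts. A minimal sketch, assuming the parsed `instance.yaml` is available as a nested dict and the dry-run estimate is already known; `check_materialize_budget` and `MaterializeTooLarge` are hypothetical names, not the project's API.

```python
DEFAULT_MAX_BYTES = 10 * 1024**3  # 10 GiB, the documented default


class MaterializeTooLarge(Exception):
    """Raised when the dry-run estimate exceeds the configured cap."""


def check_materialize_budget(estimated_bytes: int, config: dict) -> None:
    """Raise if the estimate is above data_source.bigquery.max_bytes_per_materialize."""
    bq = config.get("data_source", {}).get("bigquery", {})
    cap = bq.get("max_bytes_per_materialize", DEFAULT_MAX_BYTES)
    if estimated_bytes > cap:
        raise MaterializeTooLarge(
            f"dry-run estimate {estimated_bytes} B exceeds cap {cap} B"
        )
```

With an empty config the default applies, so an 11 GiB estimate is rejected until the cap is raised explicitly for that environment.
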
Register a materialized table:
```bash
da admin register-table orders_90d \
--source-type bigquery \
--query-mode materialized \
--query @docs/queries/orders_90d.sql \
--schedule "every 6h"
```
`--query` also accepts inline SQL.