agnes-the-ai-analyst/cli/skills/connectors.md
ZdenekSrotyr b627de8344 feat(diagnose) + docs: warn on USER_PROJECT_DENIED footgun + document all newly-exposed knobs
Diagnostic + operator-facing documentation that closes the loop on the work in this PR.

`da diagnose` (via /api/health/detailed):
  - New _check_bq_billing_project() helper. When data_source.type='bigquery' and BqProjects.billing == .data, surface a yellow warning: 'BigQuery billing project equals data project'. Hint includes the YAML field path + the /admin/server-config UI shortcut. Diagnose's overall status promotes warning → degraded so the CLI echoes it.
  - Non-BQ instances (Keboola-only, etc.) skip the check.
  - Implementation hooks into the existing /api/health/detailed surface — no new endpoint, no CLI changes.

config/instance.yaml.example documentation:
  - data_source.bigquery.billing_project: USER_PROJECT_DENIED hint, /admin/server-config UI reference
  - data_source.bigquery.legacy_wrap_views: analyst-side discipline note (use `da fetch` / `da query --remote`), issue #101 history, view-heavy deployment guidance
  - data_source.bigquery.max_bytes_per_materialize: cost guardrail block (NEW — wasn't documented in .example before)
  - ai.base_url: provider list + UI hint
  - openmetadata + desktop: 'configurable via /admin/server-config UI' headers
  - corporate_memory: leading note that the schema is editable via UI

Other docs:
  - CHANGELOG.md: comprehensive Unreleased section
  - CLAUDE.md: schema chain → v20 + Materialized SQL connector mode + per-connector tab UI mention
  - README.md: mode-first source table summary
  - docs/architecture.md: per-connector tab UI mention
  - cli/skills/connectors.md: bootstrap rails (parallel to #154)
  - docs/superpowers/plans/2026-05-01-admin-tables-form-cleanup.md: implementation plan archive (2515 lines)
  - scripts/seed_dummy_tables.py: drop is_public after #150 RBAC migration (column gone)

Tests:
  - test_diagnose_billing.py — 3 cases (BQ with billing==data warns, BQ with billing!=data clean, non-BQ skips)
2026-05-01 20:27:24 +02:00


Connectors — How to add a new data source

Existing Connectors

  • Keboola (connectors/keboola/extractor.py) — DuckDB Keboola extension, batch pull
  • BigQuery (connectors/bigquery/extractor.py) — DuckDB BQ extension, remote or materialized (see "BigQuery: pick a mode")
  • Jira (connectors/jira/) — Webhook + incremental parquet transform

extract.duckdb Contract

Every connector produces the same output:

/data/extracts/{source_name}/
├── extract.duckdb          ← _meta table + views
└── data/                   ← parquet files (local sources only)

The _meta table must have columns:

  • table_name VARCHAR — view name
  • description VARCHAR
  • rows BIGINT
  • size_bytes BIGINT
  • extracted_at TIMESTAMP
  • query_mode VARCHAR — 'local' (data here) or 'remote' (query on demand)
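The column list above can be checked mechanically. The following is a minimal sketch, not the project's own validation code: `META_COLUMNS` is copied from the list, while `meta_contract_problems` and its input shape (a mapping of column name to declared type) are hypothetical names introduced for illustration.

```python
# Expected _meta columns, copied from the contract list above.
META_COLUMNS = {
    "table_name": "VARCHAR",
    "description": "VARCHAR",
    "rows": "BIGINT",
    "size_bytes": "BIGINT",
    "extracted_at": "TIMESTAMP",
    "query_mode": "VARCHAR",
}


def meta_contract_problems(actual: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the schema conforms.

    `actual` maps column name -> declared type (hypothetical input shape).
    Extra columns are tolerated; only missing or mistyped ones are reported.
    """
    problems = []
    for name, expected_type in META_COLUMNS.items():
        if name not in actual:
            problems.append(f"missing column: {name}")
        elif actual[name].upper() != expected_type:
            problems.append(f"{name}: expected {expected_type}, got {actual[name]}")
    return problems
```

In practice the `actual` mapping could be built from the first two columns (`column_name`, `column_type`) of DuckDB's `DESCRIBE _meta` output.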

Adding a New Connector

  1. Create connectors/<name>/extractor.py:

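    import duckdb
    from pathlib import Path
    
    def run(output_dir: str, table_configs: list[dict], **kwargs):
        output = Path(output_dir)
        data_dir = output / "data"
        data_dir.mkdir(parents=True, exist_ok=True)
    
        conn = duckdb.connect(str(output / "extract.duckdb"))
        try:
            # Create the _meta table required by the extract.duckdb contract
            conn.execute("""
                CREATE TABLE IF NOT EXISTS _meta (
                    table_name VARCHAR, description VARCHAR, rows BIGINT,
                    size_bytes BIGINT, extracted_at TIMESTAMP, query_mode VARCHAR
                )
            """)
            # For each entry in table_configs: COPY the data to parquet under
            # data_dir, create a view over the file, insert a _meta row
        finally:
            # Release the file lock even if extraction fails
            conn.close()
    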
    
  2. Register tables in DuckDB table_registry via admin API or migration script. Set source_type to your connector name.

  3. Add required env vars to .env and config/.env.template.

  4. The SyncOrchestrator (src/orchestrator.py) will auto-discover your extract.duckdb.
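The per-table work mentioned in step 1 (COPY to parquet, create a view, insert a `_meta` row) might look roughly like the sketch below. It only builds the SQL strings, so it runs without a database. The helper names and the `<table_name>.parquet` file naming are assumptions made for illustration, not conventions confirmed by this repository.

```python
def sql_literal(value: str) -> str:
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def sql_identifier(name: str) -> str:
    """Quote an identifier, doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_extract_statements(table_name: str, select_sql: str, data_dir: str) -> list[str]:
    """Return the three statements for one table, in execution order.

    Assumption: each table is written to <data_dir>/<table_name>.parquet.
    """
    parquet_path = f"{data_dir}/{table_name}.parquet"
    return [
        # 1. Materialize the source rows as a parquet file.
        f"COPY ({select_sql}) TO {sql_literal(parquet_path)} (FORMAT PARQUET)",
        # 2. Expose the file as a view named after the table.
        f"CREATE OR REPLACE VIEW {sql_identifier(table_name)} AS "
        f"SELECT * FROM read_parquet({sql_literal(parquet_path)})",
        # 3. Record the table in _meta; one placeholder per contract column.
        "INSERT INTO _meta VALUES (?, ?, ?, ?, ?, ?)",
    ]
```

Each statement would then be passed to `conn.execute(...)`, with the six `_meta` values (table_name, description, rows, size_bytes, extracted_at, query_mode) supplied as parameters to the last one.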

Configuration

  • Instance-level config: config/instance.yaml (connection details)
  • Table definitions: DuckDB table_registry table
  • Credentials: environment variables

BigQuery: pick a mode

| Need | Mode | Why |
|---|---|---|
| Latency under 100 ms, table fits on disk | materialized | Local parquet, no BQ roundtrip |
| Table too large for analyst's disk, occasional ad-hoc query | remote | DuckDB BQ extension, no download |
| Table too large for disk AND analyst hits it constantly | materialized with aggregation/filter | Scheduled COPY of a slice |
| One-off subquery joined with local data | (no registry row) | Use da query --register-bq … for ad-hoc |

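Cost: materialized runs once per sync_schedule regardless of how many analysts query it; remote runs once per analyst query. Materialized breaks even roughly when the bytes scanned by remote queries over one sync interval (query frequency × bytes scanned per query) equal the bytes scanned by one COPY; above that, materialized is cheaper.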
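The break-even rule can be written down as a small comparison. This is a rough sketch that only compares bytes scanned per sync interval; it ignores latency, local storage and pricing tiers, and the function and parameter names are hypothetical.

```python
GIB = 1024 ** 3


def cheaper_mode(queries_per_sync: int, bytes_per_remote_query: int, bytes_per_copy: int) -> str:
    """Pick the mode that scans fewer BigQuery bytes during one sync interval.

    remote:       every analyst query scans bytes_per_remote_query
    materialized: one COPY scans bytes_per_copy, however many queries follow
    """
    remote_total = queries_per_sync * bytes_per_remote_query
    return "materialized" if remote_total > bytes_per_copy else "remote"
```

For example, 20 remote queries of 2 GiB each between syncs scan 40 GiB, which is more than a single 30 GiB COPY, so `cheaper_mode(20, 2 * GIB, 30 * GIB)` favors materialized; with only 3 such queries it favors remote.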

Guardrail: data_source.bigquery.max_bytes_per_materialize (default 10 GiB) blocks the COPY when BQ's dry-run estimate exceeds the cap. Set it explicitly per environment in instance.yaml.
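The guardrail amounts to comparing a dry-run estimate with the configured cap before any COPY runs. The sketch below illustrates that check under stated assumptions: the exception and function names are hypothetical, and how the estimate is obtained from BigQuery is left out. Only the 10 GiB default comes from the text above.

```python
GIB = 1024 ** 3
DEFAULT_MAX_BYTES_PER_MATERIALIZE = 10 * GIB  # default stated above


class MaterializeTooLargeError(Exception):
    """Raised when the dry-run estimate exceeds the configured cap (hypothetical)."""


def check_materialize_budget(
    estimated_bytes: int, max_bytes: int = DEFAULT_MAX_BYTES_PER_MATERIALIZE
) -> None:
    """Block the COPY when the estimate exceeds the cap; equal to the cap is allowed."""
    if estimated_bytes > max_bytes:
        raise MaterializeTooLargeError(
            f"dry-run estimate {estimated_bytes} bytes exceeds "
            f"max_bytes_per_materialize={max_bytes}"
        )
```

Failing before the COPY starts means an oversized query costs nothing beyond the dry run.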

Register a materialized table:

da admin register-table orders_90d \
    --source-type bigquery \
    --query-mode materialized \
    --query @docs/queries/orders_90d.sql \
    --schedule "every 6h"

--query also accepts inline SQL.