# Connectors — How to add a new data source ## Existing Connectors - **Keboola** (`connectors/keboola/extractor.py`) — DuckDB Keboola extension, batch pull - **BigQuery** (`connectors/bigquery/extractor.py`) — DuckDB BQ extension, remote-only - **Jira** (`connectors/jira/`) — Webhook + incremental parquet transform ## extract.duckdb Contract Every connector produces the same output: ``` /data/extracts/{source_name}/ ├── extract.duckdb ← _meta table + views └── data/ ← parquet files (local sources only) ``` The `_meta` table must have columns: - `table_name VARCHAR` — view name - `description VARCHAR` - `rows BIGINT` - `size_bytes BIGINT` - `extracted_at TIMESTAMP` - `query_mode VARCHAR` — 'local' (data here) or 'remote' (query on demand) ## Adding a New Connector 1. Create `connectors//extractor.py`: ```python import duckdb from pathlib import Path def run(output_dir: str, table_configs: list[dict], **kwargs): output = Path(output_dir) data_dir = output / "data" data_dir.mkdir(parents=True, exist_ok=True) conn = duckdb.connect(str(output / "extract.duckdb")) # Create _meta table # For each table: COPY TO parquet, create view, insert _meta row conn.close() ``` 2. Register tables in DuckDB `table_registry` via admin API or migration script. Set `source_type` to your connector name. 3. Add required env vars to `.env` and `config/.env.template`. 4. The SyncOrchestrator (`src/orchestrator.py`) will auto-discover your extract.duckdb. ## Configuration - Instance-level config: `config/instance.yaml` (connection details) - Table definitions: DuckDB `table_registry` table - Credentials: environment variables ## BigQuery: pick a mode | Need | Mode | Why | |------|------|-----| | Latency under 100 ms, table fits on disk | `materialized` | Local parquet, no BQ roundtrip | | Table too large for analyst's disk, occasional ad-hoc query | `remote` | DuckDB BQ extension, no download | | Table too large for disk AND analyst hits it constantly | `materialized` with aggregation/filter | Scheduled COPY of a slice | | One-off subquery joined with local data | (no registry row) | `agnes snapshot create` a filtered slice and join locally, or run the join server-side via `agnes query --remote` | Cost: `materialized` runs once per `sync_schedule` regardless of how many analysts query it; `remote` runs once per analyst-query. The break-even is roughly query frequency × bytes scanned vs. one COPY × bytes scanned. Guardrail: `data_source.bigquery.max_bytes_per_materialize` (default 10 GiB) blocks the COPY when BQ's dry-run estimate exceeds the cap. Set it explicitly per environment in `instance.yaml`. Register a materialized table: ```bash agnes admin register-table orders_90d \ --source-type bigquery \ --query-mode materialized \ --query @docs/queries/orders_90d.sql \ --schedule "every 6h" ``` `--query` also accepts inline SQL.