The flag ran RemoteQueryEngine in-process on the caller's machine and
required local BigQuery credentials (BIGQUERY_PROJECT + ADC). Analysts
don't have those, so calling --register-bq from an analyst workspace
surfaced as a confusing not_configured error chain ("Could not load
static instance.yaml" + "BigQuery project not configured"). An agent
following CLAUDE.md's hybrid-queries guidance would land in exactly
that trap.
The underlying engine was originally designed server-side (commit
d180b201, "Step 28: Remote query architecture"); the CLI port (commit
d605e7d9) silently assumed parity with the server. Server-side hybrid
already exists as an admin-only POST /api/query/hybrid endpoint
(app/api/query_hybrid.py) and is untouched here.
Analysts combining local + remote data now have two documented paths:
agnes snapshot create a filtered slice and join locally, or run the
join server-side via agnes query --remote. CLAUDE.md, the agent skill,
docs/DATA_SOURCES.md, and connectors.md updated accordingly.
2.9 KiB
Connectors — How to add a new data source
Existing Connectors
- Keboola (
connectors/keboola/extractor.py) — DuckDB Keboola extension, batch pull - BigQuery (
connectors/bigquery/extractor.py) — DuckDB BQ extension, remote-only - Jira (
connectors/jira/) — Webhook + incremental parquet transform
extract.duckdb Contract
Every connector produces the same output:
/data/extracts/{source_name}/
├── extract.duckdb ← _meta table + views
└── data/ ← parquet files (local sources only)
The _meta table must have columns:
table_name VARCHAR— view namedescription VARCHARrows BIGINTsize_bytes BIGINTextracted_at TIMESTAMPquery_mode VARCHAR— 'local' (data here) or 'remote' (query on demand)
Adding a New Connector
-
Create
connectors/<name>/extractor.py:import duckdb from pathlib import Path def run(output_dir: str, table_configs: list[dict], **kwargs): output = Path(output_dir) data_dir = output / "data" data_dir.mkdir(parents=True, exist_ok=True) conn = duckdb.connect(str(output / "extract.duckdb")) # Create _meta table # For each table: COPY TO parquet, create view, insert _meta row conn.close() -
Register tables in DuckDB
table_registryvia admin API or migration script. Setsource_typeto your connector name. -
Add required env vars to
.envandconfig/.env.template. -
The SyncOrchestrator (
src/orchestrator.py) will auto-discover your extract.duckdb.
Configuration
- Instance-level config:
config/instance.yaml(connection details) - Table definitions: DuckDB
table_registrytable - Credentials: environment variables
BigQuery: pick a mode
| Need | Mode | Why |
|---|---|---|
| Latency under 100 ms, table fits on disk | materialized |
Local parquet, no BQ roundtrip |
| Table too large for analyst's disk, occasional ad-hoc query | remote |
DuckDB BQ extension, no download |
| Table too large for disk AND analyst hits it constantly | materialized with aggregation/filter |
Scheduled COPY of a slice |
| One-off subquery joined with local data | (no registry row) | agnes snapshot create a filtered slice and join locally, or run the join server-side via agnes query --remote |
Cost: materialized runs once per sync_schedule regardless of how many analysts query it; remote runs once per analyst-query. The break-even is roughly query frequency × bytes scanned vs. one COPY × bytes scanned.
Guardrail: data_source.bigquery.max_bytes_per_materialize (default 10 GiB) blocks the COPY when BQ's dry-run estimate exceeds the cap. Set it explicitly per environment in instance.yaml.
Register a materialized table:
agnes admin register-table orders_90d \
--source-type bigquery \
--query-mode materialized \
--query @docs/queries/orders_90d.sql \
--schedule "every 6h"
--query also accepts inline SQL.