Connectors — How to add a new data source
Existing Connectors
- Keboola (connectors/keboola/extractor.py): DuckDB Keboola extension, batch pull
- BigQuery (connectors/bigquery/extractor.py): DuckDB BQ extension, remote-only
- Jira (connectors/jira/): Webhook + incremental parquet transform
extract.duckdb Contract
Every connector produces the same output:
/data/extracts/{source_name}/
├── extract.duckdb ← _meta table + views
└── data/ ← parquet files (local sources only)
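Because every source follows the same layout, finding extracts can be done by walking the directory tree. The sketch below only illustrates that idea; `discover_extracts` is a hypothetical helper, not the SyncOrchestrator's actual code, and it assumes one subdirectory per source under the extracts root.

```python
from pathlib import Path


def discover_extracts(extracts_root: str) -> dict[str, Path]:
    """Map each source name to its extract.duckdb, skipping incomplete dirs."""
    found: dict[str, Path] = {}
    for source_dir in sorted(Path(extracts_root).iterdir()):
        db_file = source_dir / "extract.duckdb"
        # A source counts only if the contract file is actually present.
        if source_dir.is_dir() and db_file.is_file():
            found[source_dir.name] = db_file
    return found
```

A directory without `extract.duckdb` (for example a connector that has not run yet) is silently ignored in this sketch; a real implementation might log it instead.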
The _meta table must have columns:
- table_name VARCHAR: view name
- description VARCHAR
- rows BIGINT
- size_bytes BIGINT
- extracted_at TIMESTAMP
- query_mode VARCHAR: 'local' (data here) or 'remote' (query on demand)
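A connector's tests could check its `_meta` schema against this column list. The following is a minimal sketch: `REQUIRED_META_COLUMNS` and `meta_contract_problems` are hypothetical names introduced here, and the check assumes the actual column types are passed in as a plain name-to-type mapping.

```python
# Column names and types taken from the contract above.
REQUIRED_META_COLUMNS = {
    "table_name": "VARCHAR",
    "description": "VARCHAR",
    "rows": "BIGINT",
    "size_bytes": "BIGINT",
    "extracted_at": "TIMESTAMP",
    "query_mode": "VARCHAR",
}


def meta_contract_problems(actual_columns: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the contract holds."""
    problems = []
    for name, expected_type in REQUIRED_META_COLUMNS.items():
        actual_type = actual_columns.get(name)
        if actual_type is None:
            problems.append(f"missing column: {name}")
        elif actual_type.upper() != expected_type:
            problems.append(f"{name}: expected {expected_type}, got {actual_type}")
    return problems
```

The mapping could be built from the output of DuckDB's `DESCRIBE _meta`, which lists each column's name and type.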
Adding a New Connector
- Create connectors/<name>/extractor.py:

  ```python
  import duckdb
  from pathlib import Path

  def run(output_dir: str, table_configs: list[dict], **kwargs):
      output = Path(output_dir)
      data_dir = output / "data"
      data_dir.mkdir(parents=True, exist_ok=True)

      conn = duckdb.connect(str(output / "extract.duckdb"))
      # Create _meta table
      # For each table: COPY TO parquet, create view, insert _meta row
      conn.close()
  ```

- Register tables in DuckDB table_registry via admin API or migration script. Set source_type to your connector name.
- Add required env vars to .env and config/.env.template.
- The SyncOrchestrator (src/orchestrator.py) will auto-discover your extract.duckdb.
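The per-table work in the extractor (COPY to parquet, create a view, insert a `_meta` row) could be expressed as three SQL statements. This is a sketch under assumptions: `table_statements` and the quoting helpers are hypothetical, `source_sql` stands for whatever query reads from your source, and each table is assumed to land in a single `<table_name>.parquet` file.

```python
from pathlib import Path


def quote_ident(name: str) -> str:
    """Quote a SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(value: str) -> str:
    """Quote a SQL string literal, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def table_statements(table_name: str, source_sql: str, data_dir: str) -> list[str]:
    """Build the three statements for one table, in execution order."""
    parquet_path = (Path(data_dir) / f"{table_name}.parquet").as_posix()
    return [
        # 1. Write the source rows to a parquet file.
        f"COPY ({source_sql}) TO {quote_literal(parquet_path)} (FORMAT PARQUET)",
        # 2. Expose the parquet file as a view named after the table.
        f"CREATE OR REPLACE VIEW {quote_ident(table_name)} AS "
        f"SELECT * FROM read_parquet({quote_literal(parquet_path)})",
        # 3. Parameterized insert; the caller supplies the six _meta values.
        "INSERT INTO _meta (table_name, description, rows, size_bytes, "
        "extracted_at, query_mode) VALUES (?, ?, ?, ?, ?, ?)",
    ]
```

Inside `run`, the first two statements would go through `conn.execute(sql)` and the third through `conn.execute(sql, params)` once the row count and file size are known.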
Configuration
- Instance-level config: config/instance.yaml (connection details)
- Table definitions: DuckDB table_registry table
- Credentials: environment variables
BigQuery: pick a mode
| Need | Mode | Why |
|---|---|---|
| Latency under 100 ms, table fits on disk | materialized | Local parquet, no BQ roundtrip |
| Table too large for analyst's disk, occasional ad-hoc query | remote | DuckDB BQ extension, no download |
| Table too large for disk AND analyst hits it constantly | materialized with aggregation/filter | Scheduled COPY of a slice |
| One-off subquery joined with local data | (no registry row) | Use agnes query --register-bq … for ad-hoc |
Cost: materialized runs once per sync_schedule regardless of how many analysts query it; remote runs once per analyst-query. The break-even is roughly query frequency × bytes scanned vs. one COPY × bytes scanned.
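The break-even can be worked through with simple arithmetic. The sketch below compares bytes scanned per day for the two modes; the function names and the example numbers are illustrative assumptions, not measurements, and it assumes cost scales linearly with bytes scanned.

```python
GIB = 1024 ** 3


def remote_bytes(queries_per_day: int, bytes_per_query: int) -> int:
    """Remote mode scans BigQuery once per analyst query."""
    return queries_per_day * bytes_per_query


def materialized_bytes(syncs_per_day: int, bytes_per_copy: int) -> int:
    """Materialized mode scans BigQuery once per scheduled COPY."""
    return syncs_per_day * bytes_per_copy


def cheaper_mode(queries_per_day: int, bytes_per_query: int,
                 syncs_per_day: int, bytes_per_copy: int) -> str:
    """Pick the mode that scans fewer bytes per day; ties go to remote."""
    remote = remote_bytes(queries_per_day, bytes_per_query)
    materialized = materialized_bytes(syncs_per_day, bytes_per_copy)
    return "materialized" if materialized < remote else "remote"
```

For example, 40 analyst queries a day at 2 GiB each scan 80 GiB in remote mode, while a 5 GiB COPY four times a day scans 20 GiB, so `cheaper_mode(40, 2 * GIB, 4, 5 * GIB)` returns `"materialized"`. With a single query a day, the same table is cheaper left remote.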
Guardrail: data_source.bigquery.max_bytes_per_materialize (default 10 GiB) blocks the COPY when BQ's dry-run estimate exceeds the cap. Set it explicitly per environment in instance.yaml.
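The guardrail amounts to a size check before the COPY starts. A minimal sketch of that check follows, assuming the dry-run estimate is already available as a byte count; `MaterializeBlocked` and `check_materialize_budget` are hypothetical names, not the project's actual API.

```python
GIB = 1024 ** 3
DEFAULT_MAX_BYTES_PER_MATERIALIZE = 10 * GIB  # default cap stated above


class MaterializeBlocked(Exception):
    """Raised when the estimated scan size exceeds the configured cap."""


def check_materialize_budget(
    estimated_bytes: int,
    max_bytes: int = DEFAULT_MAX_BYTES_PER_MATERIALIZE,
) -> None:
    """Raise before the COPY runs if the estimate is strictly over the cap."""
    if estimated_bytes > max_bytes:
        raise MaterializeBlocked(
            f"dry-run estimate {estimated_bytes} bytes exceeds cap {max_bytes} bytes"
        )
```

An estimate exactly at the cap passes in this sketch, since the rule above blocks only when the estimate exceeds it.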
Register a materialized table:
agnes admin register-table orders_90d \
--source-type bigquery \
--query-mode materialized \
--query @docs/queries/orders_90d.sql \
--schedule "every 6h"
--query also accepts inline SQL.