ZdenekSrotyr 1563b05f2e refactor(cli): hard-cutover env vars + config dir to AGNES_*

Task 0.5 of clean-analyst-bootstrap. Greenfield rewrite — no fallback,
no aliases. Existing dev environments lose their cached PAT and must
re-authenticate.

Env var renames (hard cutover):
- DA_CONFIG_DIR    -> AGNES_CONFIG_DIR
- DA_SERVER        -> AGNES_SERVER
- DA_SERVER_URL    -> AGNES_SERVER_URL  (test-only stale ref, not in spec)
- DA_NO_UPDATE_CHECK -> AGNES_NO_UPDATE_CHECK
- DA_LOCAL_DIR     -> AGNES_LOCAL_DIR
- DA_TOKEN         -> AGNES_TOKEN
- DA_STREAM_RETRIES -> AGNES_STREAM_RETRIES

Config dir rename: ~/.config/da/ -> ~/.config/agnes/ (across code,
comments, docstrings, error messages, install templates, dev scripts).

Stale `da X` references in CLI source (and adjacent app/, tests/):
swept docstrings, comments, help text, and error messages where the
verb survives the rewrite (init, pull, push, catalog, status, diagnose,
auth, admin, skills, query, schema, describe, explore, disk-info,
snapshot, login, logout, whoami, server, setup) and replaced `da X`
with `agnes X`. Intentionally kept `da sync`, `da fetch`, `da analyst`,
`da metrics` — those verbs are removed in later tasks; the legacy
strings will be detected by `_LEGACY_STRINGS` (added in Task 2).

Test fixes:
- TestCLIVersion now asserts output starts with `agnes ` (was `da `).

Test results: 2675 passed, 25 skipped (full pytest run, excluding 9
pre-existing test_db.py / test_user_management.py / test_e2e_extract.py
/ test_cli_binary_rename.py failures unrelated to this rename).

2026-05-04 16:35:44 +02:00


Connectors — How to add a new data source

Existing Connectors

Keboola (connectors/keboola/extractor.py) — DuckDB Keboola extension, batch pull
BigQuery (connectors/bigquery/extractor.py) — DuckDB BQ extension, remote-only
Jira (connectors/jira/) — Webhook + incremental parquet transform

extract.duckdb Contract

Every connector produces the same output:

/data/extracts/{source_name}/
├── extract.duckdb          ← _meta table + views
└── data/                   ← parquet files (local sources only)
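
Given this layout, a consumer can locate every extract by looking for extract.duckdb one directory below the extracts root. The sketch below is illustrative only: find_extracts is a hypothetical helper, and the real SyncOrchestrator may discover extracts differently.

```python
from pathlib import Path


def find_extracts(root: str) -> dict[str, Path]:
    """Map each source_name to its extract.duckdb under root (hypothetical helper)."""
    # Directories without an extract.duckdb are skipped by the glob pattern.
    return {
        db.parent.name: db
        for db in sorted(Path(root).glob("*/extract.duckdb"))
    }
```

Calling find_extracts("/data/extracts") returns one entry per source directory that already contains an extract.duckdb.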

The _meta table must have columns:

table_name VARCHAR — view name
description VARCHAR
rows BIGINT
size_bytes BIGINT
extracted_at TIMESTAMP
query_mode VARCHAR — 'local' (data here) or 'remote' (query on demand)
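
One way to keep connector code aligned with these columns is a small record type that mirrors them and rejects unknown query_mode values. This is a sketch under assumptions: MetaRow, QUERY_MODES and INSERT_META_SQL are hypothetical names, not part of the codebase.

```python
from dataclasses import dataclass
from datetime import datetime

# The two modes named by the contract above.
QUERY_MODES = ("local", "remote")

# Positional placeholders, in the same order as the _meta columns.
INSERT_META_SQL = "INSERT INTO _meta VALUES (?, ?, ?, ?, ?, ?)"


@dataclass(frozen=True)
class MetaRow:
    """One row of the _meta table (hypothetical helper type)."""

    table_name: str
    description: str
    rows: int
    size_bytes: int
    extracted_at: datetime
    query_mode: str

    def __post_init__(self) -> None:
        if self.query_mode not in QUERY_MODES:
            raise ValueError(f"query_mode must be one of {QUERY_MODES}, got {self.query_mode!r}")

    def as_params(self) -> tuple:
        """Return the values in _meta column order."""
        return (
            self.table_name,
            self.description,
            self.rows,
            self.size_bytes,
            self.extracted_at,
            self.query_mode,
        )
```

With an open DuckDB connection, a row could then be written with conn.execute(INSERT_META_SQL, row.as_params()).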

Adding a New Connector

Create connectors/<name>/extractor.py:

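import duckdb
from pathlib import Path

def run(output_dir: str, table_configs: list[dict], **kwargs) -> None:
    """Write extract.duckdb (plus parquet files for local sources) to output_dir."""
    output = Path(output_dir)
    data_dir = output / "data"
    data_dir.mkdir(parents=True, exist_ok=True)

    conn = duckdb.connect(str(output / "extract.duckdb"))
    try:
        # Create the _meta table required by the extract.duckdb contract.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS _meta ("
            "table_name VARCHAR, description VARCHAR, rows BIGINT, "
            "size_bytes BIGINT, extracted_at TIMESTAMP, query_mode VARCHAR)"
        )
        # For each entry in table_configs: COPY TO parquet under data_dir,
        # create a view over it, and insert a _meta row.
    finally:
        conn.close()  # release the file even if extraction fails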

Register tables in DuckDB table_registry via admin API or migration script. Set source_type to your connector name.
Add required env vars to .env and config/.env.template.
The SyncOrchestrator (src/orchestrator.py) will auto-discover your extract.duckdb.
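
The per-table work in the extractor (COPY to parquet, then create a view) could be sketched as a function that builds the two DuckDB statements for one local table. Everything here is an assumption for illustration: table_statements and the quoting helpers are hypothetical, source_query stands for whatever SELECT your connector runs against its attached source, and the file name <table_name>.parquet is only one possible convention.

```python
from pathlib import Path


def quote_ident(name: str) -> str:
    """Quote a SQL identifier, doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(value: str) -> str:
    """Quote a SQL string literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def table_statements(data_dir: Path, table_name: str, source_query: str) -> list[str]:
    """Build the COPY and CREATE VIEW statements for one local table."""
    parquet = quote_literal((data_dir / f"{table_name}.parquet").as_posix())
    return [
        # Export the source rows to a parquet file under data/.
        f"COPY ({source_query}) TO {parquet} (FORMAT PARQUET)",
        # Expose that file as a view inside extract.duckdb.
        f"CREATE OR REPLACE VIEW {quote_ident(table_name)} AS "
        f"SELECT * FROM read_parquet({parquet})",
    ]
```

Each statement can be passed to conn.execute() in order; the matching _meta row is inserted afterwards with query_mode set to 'local'.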

Configuration

Instance-level config: config/instance.yaml (connection details)
Table definitions: DuckDB table_registry table
Credentials: environment variables

BigQuery: pick a mode

| Need | Mode | Why |
| --- | --- | --- |
| Latency under 100 ms, table fits on disk | `materialized` | Local parquet, no BQ roundtrip |
| Table too large for analyst's disk, occasional ad-hoc query | `remote` | DuckDB BQ extension, no download |
| Table too large for disk AND analyst hits it constantly | `materialized` with aggregation/filter | Scheduled COPY of a slice |
| One-off subquery joined with local data | (no registry row) | Use `agnes query --register-bq …` for ad-hoc |

Cost: materialized runs once per sync_schedule regardless of how many analysts query it; remote runs once per analyst query. Per sync interval, the break-even is roughly (number of queries × bytes scanned per query) for remote vs. (one COPY × bytes scanned) for materialized.
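
The comparison can be made concrete with a small calculation per sync interval. This is a rough sketch: the function names are hypothetical, and it compares bytes scanned only, ignoring storage, latency, and any pricing tiers.

```python
GIB = 1024 ** 3


def break_even_queries(bytes_per_copy: int, bytes_per_query: int) -> float:
    """Queries per sync interval at which remote scans as much as one COPY."""
    return bytes_per_copy / bytes_per_query


def cheaper_mode(queries_per_sync: float, bytes_per_query: int, bytes_per_copy: int) -> str:
    """Pick the mode that scans fewer bytes within one sync interval."""
    remote_bytes = queries_per_sync * bytes_per_query
    return "materialized" if remote_bytes > bytes_per_copy else "remote"
```

As an illustration with made-up numbers: if one COPY scans 20 GiB and each remote query scans 2 GiB, the break-even is 10 queries per sync interval. At 40 queries per interval materialized scans less; at 3 queries remote does.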

Guardrail: data_source.bigquery.max_bytes_per_materialize (default 10 GiB) blocks the COPY when BQ's dry-run estimate exceeds the cap. Set it explicitly per environment in instance.yaml.
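
The guardrail amounts to comparing the dry-run estimate against the configured cap before starting the COPY. The sketch below shows that check in isolation; check_materialize_budget and MaterializeTooLargeError are hypothetical names, and how the estimate is obtained from BigQuery is out of scope here.

```python
GIB = 1024 ** 3

# Default cap stated above: 10 GiB.
DEFAULT_MAX_BYTES_PER_MATERIALIZE = 10 * GIB


class MaterializeTooLargeError(RuntimeError):
    """Raised when the dry-run estimate exceeds the configured cap."""


def check_materialize_budget(
    estimated_bytes: int,
    max_bytes: int = DEFAULT_MAX_BYTES_PER_MATERIALIZE,
) -> None:
    """Block the COPY when the estimate is strictly above the cap."""
    if estimated_bytes > max_bytes:
        raise MaterializeTooLargeError(
            f"dry-run estimate {estimated_bytes} bytes exceeds cap {max_bytes} bytes"
        )
```

An estimate exactly at the cap passes in this sketch; whether the real check is inclusive is not stated above.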

agnes admin register-table orders_90d \
    --source-type bigquery \
    --query-mode materialized \
    --query @docs/queries/orders_90d.sql \
    --schedule "every 6h"

--query also accepts inline SQL.
