ZdenekSrotyr 79b55b6ff3 remove agnes query --register-bq from client CLI

The flag ran RemoteQueryEngine in-process on the caller's machine and
required local BigQuery credentials (BIGQUERY_PROJECT + ADC). Analysts
don't have those, so calling --register-bq from an analyst workspace
surfaced as a confusing not_configured error chain ("Could not load
static instance.yaml" + "BigQuery project not configured"). An agent
following CLAUDE.md's hybrid-queries guidance would land in exactly
that trap.

The underlying engine was originally designed server-side (commit
d180b201, "Step 28: Remote query architecture"); the CLI port (commit
d605e7d9) silently assumed parity with the server. Server-side hybrid
already exists as an admin-only POST /api/query/hybrid endpoint
(app/api/query_hybrid.py) and is untouched here.

Analysts combining local + remote data now have two documented paths:
agnes snapshot create a filtered slice and join locally, or run the
join server-side via agnes query --remote. CLAUDE.md, the agent skill,
docs/DATA_SOURCES.md, and connectors.md updated accordingly.

2026-05-12 18:18:13 +02:00


Data Sources

Overview

AI Data Analyst uses a connector system where each connector produces an extract.duckdb following a standard contract. The SyncOrchestrator auto-discovers and ATTACHes these into the master analytics.duckdb.

Configure the data source type in config/instance.yaml:

data_source:
  type: "keboola"  # Options: keboola, bigquery, csv

Table definitions are stored in the DuckDB table_registry table (not in config files). Register tables via the admin API, CLI, or web UI.

Query Modes

Each table has a query_mode that determines how data is accessed:

local: Data is downloaded to parquet files on the Agnes server. Suitable for tables that fit in local storage.
remote: Data stays in the external source; DuckDB extension ATTACHes at query time. Suitable for large tables where only query results are transferred.

Keboola Connector

Syncs tables from Keboola Storage API using the DuckDB Keboola extension.

Requirements

Keboola Storage API token with read access
DuckDB Keboola extension (auto-installed)

Configuration

In .env:

KEBOOLA_STORAGE_TOKEN=your-token-here
KEBOOLA_STACK_URL=https://connection.your-region.keboola.com
KEBOOLA_PROJECT_ID=12345

Or configure via the admin UI (/admin/tables) or CLI:

agnes admin register-table --source-type keboola --bucket "in.c-crm" --table "company" --query-mode local

How it works

The extractor (connectors/keboola/extractor.py) uses the DuckDB Keboola extension to download data
Produces extract.duckdb with _meta table + parquet files in /data/extracts/keboola/data/
The SyncOrchestrator ATTACHes extract.duckdb into analytics.duckdb and creates views
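
The attach-and-view step above can be sketched as a small statement builder. This is an illustration only: the helper names (quote_ident, attach_statements) are hypothetical, and the assumption that each view is named after its source table is not confirmed by this document. The SyncOrchestrator's real logic may differ.

```python
def quote_ident(name: str) -> str:
    """Quote a SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def attach_statements(source_name: str, extract_path: str, table_names: list) -> list:
    """Build the DuckDB statements that attach one extract and expose its tables.

    ASSUMPTION: one view per table, named after the table itself.
    """
    escaped_path = extract_path.replace("'", "''")
    statements = [
        f"ATTACH '{escaped_path}' AS {quote_ident(source_name)} (READ_ONLY)"
    ]
    for table in table_names:
        statements.append(
            f"CREATE OR REPLACE VIEW {quote_ident(table)} AS "
            f"SELECT * FROM {quote_ident(source_name)}.{quote_ident(table)}"
        )
    return statements


stmts = attach_statements(
    "keboola", "/data/extracts/keboola/extract.duckdb", ["company"]
)
for stmt in stmts:
    print(stmt)
```

The statements would be executed against the master analytics.duckdb connection; attaching read-only keeps the orchestrator from modifying a connector's output.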

Identifier validation

All Keboola table names, bucket names, and source table identifiers are validated against the _SAFE_QUOTED_IDENTIFIER regex before use. Invalid identifiers are skipped, and an error is logged for each one.
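
A minimal sketch of the skip-and-log behaviour follows. The pattern below is an assumption made for illustration and is not the project's actual _SAFE_QUOTED_IDENTIFIER; the helper name filter_safe_identifiers is likewise hypothetical.

```python
import logging
import re

logger = logging.getLogger("keboola.extractor.sketch")

# ASSUMPTION: an illustrative pattern, not the real _SAFE_QUOTED_IDENTIFIER.
# It accepts letters, digits, underscores, dots and hyphens.
ILLUSTRATIVE_SAFE_QUOTED = re.compile(r"[A-Za-z0-9_][A-Za-z0-9_.\-]*")


def filter_safe_identifiers(names):
    """Return the names that match the pattern; log and skip the rest."""
    safe = []
    for name in names:
        if ILLUSTRATIVE_SAFE_QUOTED.fullmatch(name):
            safe.append(name)
        else:
            logger.error("Skipping invalid identifier: %r", name)
    return safe


kept = filter_safe_identifiers(
    ["in.c-crm.company", 'bad"; DROP TABLE x; --', "orders"]
)
print(kept)
```

Skipping rather than raising lets one malformed name fail without aborting the sync of every other table.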

BigQuery Connector

Queries BigQuery tables on-demand using the DuckDB BigQuery extension (remote attach).

Requirements

Google Cloud project with BigQuery access
Application Default Credentials (ADC) configured

Configuration

In config/instance.yaml:

bigquery:
  project_id: "your-gcp-project"

BigQuery Adapter

Registers BigQuery tables and views as remote DuckDB views (no data download). Queries issued through the master analytics.duckdb are forwarded to BigQuery via the DuckDB BigQuery extension. See also agnes snapshot create for the analytical workflow that materializes filtered subsets locally.

Requirements

DuckDB BigQuery extension (auto-installed by the extractor on first run).
A GCP service account with bigquery.metadata.get on the dataset and bigquery.data.viewer (or finer) on the table; bigquery.jobs.create on the billing project for views and agnes snapshot create queries.
Credentials resolution: GCE metadata server first, then Application Default Credentials (gcloud auth application-default login or GOOGLE_APPLICATION_CREDENTIALS). See connectors/bigquery/auth.py.
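
The resolution order above is a first-match fallback chain. The sketch below shows only that ordering logic; the provider functions are hypothetical stand-ins that return fixed values, and nothing here reflects the actual contents of connectors/bigquery/auth.py.

```python
from typing import Callable, Optional, Sequence, Tuple

# A provider returns a credential, or None when it has nothing to offer.
Provider = Callable[[], Optional[str]]


def resolve_credentials(providers: Sequence[Tuple[str, Provider]]) -> Tuple[str, str]:
    """Try each provider in order; return (source_name, credential) for the first hit."""
    for name, provider in providers:
        credential = provider()
        if credential is not None:
            return name, credential
    raise RuntimeError("No BigQuery credentials available from any provider")


# Hypothetical stand-ins: real code would query the GCE metadata server and
# the Google auth library instead of returning fixed values.
def from_gce_metadata() -> Optional[str]:
    return None  # pretend we are not running on GCE


def from_adc() -> Optional[str]:
    return "adc-credential-placeholder"


source, credential = resolve_credentials(
    [("gce_metadata", from_gce_metadata), ("adc", from_adc)]
)
print(source)
```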

Configuration

In config/instance.yaml:

data_source:
  type: bigquery
  bigquery:
    project: my-data-project              # data + default billing project
    billing_project: my-billing-project   # optional override; needed when SA
                                          # lacks serviceusage.services.use on
                                          # the data project
    location: us

Registering BigQuery tables

There are two ways to register a table, both API-first (no manual table_registry SQL is needed).

Web UI — go to /admin/tables. With data_source.type: bigquery the page swaps the discovery panel for a "Register BigQuery table" button that opens a manual-entry modal: dataset, source table, view name, description, folder, optional sync schedule. Submit runs /api/admin/register-table/precheck first (round-trips bigquery.Client.get_table to confirm the table exists and the SA can see it), surfaces the row count + size + column count, then commits.

CLI — agnes admin register-table:

# Dry-run: validate + check the source exists, no DB write.
agnes admin register-table orders \
    --source-type bigquery \
    --bucket analytics \
    --source-table orders \
    --dry-run

# Commit
agnes admin register-table orders \
    --source-type bigquery \
    --bucket analytics \
    --source-table orders \
    --description "Order data from BQ"

The server forces query_mode=remote and profile_after_sync=false for BQ rows. A sync schedule (--sync-schedule) is accepted and stored but not yet evaluated by the scheduler; see issue #79, addressed in Milestone 3 of the admin-BQ-registration epic (#108).

Wildcard / sharded tables

Not supported in M1. The register endpoint rejects any source_table containing *. Tracked in #108 M3+.
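
The server-side rules for BigQuery rows (forced query_mode and profile_after_sync, plus the wildcard rejection) could look roughly like this. The function name and the request field names are assumptions modelled on the CLI flags, not the endpoint's real schema.

```python
def normalize_bq_registration(request: dict) -> dict:
    """Apply the BigQuery-specific registration rules (illustrative only)."""
    source_table = request.get("source_table", "")
    if "*" in source_table:
        # Wildcard / sharded tables are rejected outright.
        raise ValueError(
            f"wildcard source tables are not supported: {source_table!r}"
        )

    normalized = dict(request)
    # Caller-supplied values for these two fields are overridden for BigQuery rows.
    normalized["query_mode"] = "remote"
    normalized["profile_after_sync"] = False
    return normalized


row = normalize_bq_registration(
    {
        "name": "orders",
        "source_type": "bigquery",
        "bucket": "analytics",
        "source_table": "orders",
        "query_mode": "local",  # ignored: the server forces "remote"
    }
)
print(row["query_mode"], row["profile_after_sync"])
```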

Hybrid Queries

Server-side only. Admins can POST {sql, register_bq: {alias: bq_sql}} to /api/query/hybrid (app/api/query_hybrid.py); the BigQuery sub-queries run server-side, where BQ credentials live, and the join runs against the server's local parquet views in a single DuckDB session.
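
As a sketch of the request shape, the snippet below builds (but does not send) a POST to the hybrid endpoint using only the standard library. The host name, the bearer-token header, and the column names in the SQL are assumptions for illustration; only the path and the {sql, register_bq} body shape come from this document.

```python
import json
import urllib.request


def build_hybrid_request(base_url: str, token: str, sql: str, register_bq: dict):
    """Build a POST request for /api/query/hybrid without sending it."""
    body = json.dumps({"sql": sql, "register_bq": register_bq}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/query/hybrid",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # ASSUMPTION: bearer-token auth; the real scheme may differ.
            "Authorization": f"Bearer {token}",
        },
    )


req = build_hybrid_request(
    "https://agnes.example.com",  # hypothetical host
    "admin-token-placeholder",
    # The outer query joins a local view with the alias registered below.
    "SELECT c.name, r.total FROM company AS c "
    "JOIN recent_orders AS r ON c.id = r.company_id",
    {
        "recent_orders": (
            "SELECT company_id, SUM(amount) AS total "
            "FROM `my-data-project.analytics.orders` GROUP BY company_id"
        )
    },
)
print(req.get_method(), req.full_url)
```

Passing req to urllib.request.urlopen would send it; each key in register_bq becomes an alias that the outer sql can reference.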

Analysts who need to combine a local table with a remote one have two options: use agnes snapshot create to materialize a filtered slice of the remote table and join it locally, or run the join server-side via agnes query --remote. The earlier agnes query --register-bq flag, which ran in-process on the caller's machine, was removed because it required local BigQuery credentials that analysts don't have.

Jira Connector

Real-time webhook-based connector that updates parquet files incrementally.

How it works

Jira webhooks hit the /api/jira/webhook endpoint
The connector (connectors/jira/) processes webhook events and updates parquet files
Produces extract.duckdb with _meta table + incremental parquet data

Writing a Custom Connector

Create a new connector in connectors/<name>/extractor.py that produces the extract.duckdb contract:

/data/extracts/{source_name}/
├── extract.duckdb          ← _meta table + views
└── data/                   ← parquet files (local sources only)

Required: `_meta` table

CREATE TABLE _meta (
    table_name   VARCHAR,
    description  VARCHAR,
    rows         INTEGER,
    size_bytes   INTEGER,
    extracted_at TIMESTAMP,
    query_mode   VARCHAR   -- 'local' or 'remote'
);
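
One way a connector might assemble a _meta row before inserting it is sketched below. The MetaRow class and the INSERT statement are hypothetical helpers, not part of the codebase; the field order simply mirrors the column order of the table above.

```python
from dataclasses import astuple, dataclass
from datetime import datetime

VALID_QUERY_MODES = {"local", "remote"}

# Six positional placeholders, one per _meta column.
INSERT_META_SQL = "INSERT INTO _meta VALUES (?, ?, ?, ?, ?, ?)"


@dataclass(frozen=True)
class MetaRow:
    """One row of the _meta table, in column order."""

    table_name: str
    description: str
    rows: int
    size_bytes: int
    extracted_at: datetime
    query_mode: str

    def __post_init__(self):
        # Reject anything other than the two documented modes.
        if self.query_mode not in VALID_QUERY_MODES:
            raise ValueError(f"query_mode must be 'local' or 'remote', got {self.query_mode!r}")


row = MetaRow(
    table_name="orders",
    description="Order data",
    rows=1200,
    size_bytes=48_000,
    extracted_at=datetime(2026, 1, 1, 12, 0, 0),
    query_mode="local",
)
params = astuple(row)  # ready to pass alongside INSERT_META_SQL
print(params[0], params[-1])
```

With the DuckDB Python client, the tuple could be passed as the parameter list of a connection's execute call together with INSERT_META_SQL.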

Optional: `_remote_attach` table (for remote sources)

CREATE TABLE _remote_attach (
    alias     VARCHAR,  -- DuckDB alias used in views
    extension VARCHAR,  -- Extension name
    url       VARCHAR,  -- Connection URL
    token_env VARCHAR   -- Env-var name holding the auth token (NOT the token itself)
);
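
Because token_env stores only the name of an environment variable, the secret has to be looked up at attach time. A minimal sketch of that lookup follows; resolve_attach_token is a hypothetical helper, and the alias and extension values in the example row are illustrative.

```python
import os


def resolve_attach_token(token_env: str, environ=os.environ) -> str:
    """Look up the auth token named by a _remote_attach row's token_env column."""
    token = environ.get(token_env)
    if not token:
        raise RuntimeError(f"environment variable {token_env!r} is not set")
    return token


# Example row: (alias, extension, url, token_env). Values are illustrative.
attach_row = (
    "kbc",
    "keboola",
    "https://connection.your-region.keboola.com",
    "KEBOOLA_STORAGE_TOKEN",
)

# A plain dict stands in for the process environment in this example.
fake_env = {"KEBOOLA_STORAGE_TOKEN": "your-token-here"}
token = resolve_attach_token(attach_row[3], environ=fake_env)
print(len(token))
```

Keeping only the variable name in extract.duckdb means the file can be copied or inspected without exposing the credential.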

Identifier validation

Import shared validators from src/identifier_validation.py:

from src.identifier_validation import validate_identifier, validate_quoted_identifier

Use validate_identifier() for strict names (alphanumeric + underscore) and validate_quoted_identifier() for names that may contain dots/hyphens (e.g., Keboola-style in.c-crm.orders).
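
The difference between the two levels of strictness can be illustrated with stand-in functions. These are not the real validators from src/identifier_validation.py: their names, patterns, and boolean return type are assumptions chosen to match the description above.

```python
import re

# ASSUMPTION: illustrative patterns, not the project's actual rules.
STRICT_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
QUOTED_PATTERN = re.compile(r"[A-Za-z0-9_][A-Za-z0-9_.\-]*")


def is_strict_identifier(name: str) -> bool:
    """Alphanumeric plus underscore only; safe to use unquoted."""
    return STRICT_PATTERN.fullmatch(name) is not None


def is_quoted_identifier(name: str) -> bool:
    """Also allows dots and hyphens; must be quoted when used in SQL."""
    return QUOTED_PATTERN.fullmatch(name) is not None


for candidate in ("orders", "in.c-crm.orders", 'x"; DROP TABLE y'):
    print(candidate, is_strict_identifier(candidate), is_quoted_identifier(candidate))
```

A Keboola-style name such as in.c-crm.orders fails the strict check but passes the quoted one, while a name containing a quote character or whitespace fails both.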

The SyncOrchestrator auto-discovers connectors by scanning /data/extracts/*/extract.duckdb — no registration step needed beyond producing the correct output format.
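
The discovery scan can be approximated with a directory glob. This sketch only mirrors the path pattern stated above, using a temporary directory in place of /data/extracts; discover_extracts is a hypothetical name and the SyncOrchestrator's real implementation may differ.

```python
import tempfile
from pathlib import Path


def discover_extracts(extracts_root: Path) -> dict:
    """Map each source name to the path of its extract.duckdb under extracts_root."""
    return {
        path.parent.name: path
        for path in sorted(extracts_root.glob("*/extract.duckdb"))
    }


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for source in ("keboola", "bigquery"):
        (root / source).mkdir()
        (root / source / "extract.duckdb").touch()
    (root / "incomplete").mkdir()  # no extract.duckdb, so it is ignored

    found = discover_extracts(root)
    print(sorted(found))
```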

See connectors/keboola/ for a complete batch-pull reference implementation, or connectors/bigquery/ for a remote-attach example.
