The flag ran RemoteQueryEngine in-process on the caller's machine and
required local BigQuery credentials (BIGQUERY_PROJECT + ADC). Analysts
don't have those, so calling --register-bq from an analyst workspace
surfaced as a confusing not_configured error chain ("Could not load
static instance.yaml" + "BigQuery project not configured"). An agent
following CLAUDE.md's hybrid-queries guidance would land in exactly
that trap.
The underlying engine was originally designed server-side (commit
d180b201, "Step 28: Remote query architecture"); the CLI port (commit
d605e7d9) silently assumed parity with the server. Server-side hybrid
already exists as an admin-only POST /api/query/hybrid endpoint
(app/api/query_hybrid.py) and is untouched here.
Analysts combining local + remote data now have two documented paths:
use agnes snapshot create to materialize a filtered slice and join it
locally, or run the join server-side via agnes query --remote. CLAUDE.md,
the agent skill, docs/DATA_SOURCES.md, and connectors.md are updated
accordingly.
Data Sources
Overview
AI Data Analyst uses a connector system where each connector produces an extract.duckdb following a standard contract. The SyncOrchestrator auto-discovers and ATTACHes these into the master analytics.duckdb.
Configure the data source type in config/instance.yaml:
data_source:
  type: "keboola" # Options: keboola, bigquery, csv
Table definitions are stored in the DuckDB table_registry table (not in config files). Register tables via the admin API, CLI, or web UI.
Query Modes
Each table has a query_mode that determines how data is accessed:
- local: Data is downloaded to parquet files on the Agnes server. Suitable for tables that fit in local storage.
- remote: Data stays in the external source; DuckDB extension ATTACHes at query time. Suitable for large tables where only query results are transferred.
Keboola Connector
Syncs tables from Keboola Storage API using the DuckDB Keboola extension.
Requirements
- Keboola Storage API token with read access
- DuckDB Keboola extension (auto-installed)
Configuration
In .env:
KEBOOLA_STORAGE_TOKEN=your-token-here
KEBOOLA_STACK_URL=https://connection.your-region.keboola.com
KEBOOLA_PROJECT_ID=12345
Or configure via the admin UI (/admin/tables) or CLI:
agnes admin register-table --source-type keboola --bucket "in.c-crm" --table "company" --query-mode local
How it works
- The extractor (connectors/keboola/extractor.py) uses the DuckDB Keboola extension to download data
- Produces extract.duckdb with _meta table + parquet files in /data/extracts/keboola/data/
- The SyncOrchestrator ATTACHes extract.duckdb into analytics.duckdb and creates views
Identifier validation
All Keboola table names, bucket names, and source table identifiers are validated against the _SAFE_QUOTED_IDENTIFIER regex before use. Invalid identifiers are skipped and the error is logged.
BigQuery Connector
Queries BigQuery tables on-demand using the DuckDB BigQuery extension (remote attach).
Requirements
- Google Cloud project with BigQuery access
- Application Default Credentials (ADC) configured
Configuration
In config/instance.yaml:
bigquery:
  project_id: "your-gcp-project"
BigQuery Adapter
Registers BigQuery tables and views as remote DuckDB views (no data download). Queries
issued through the master analytics.duckdb are forwarded to BigQuery via the DuckDB
BigQuery extension. See also agnes snapshot create for the analytical workflow that materializes
filtered subsets locally.
Requirements
- DuckDB BigQuery extension (auto-installed by the extractor on first run).
- A GCP service account with bigquery.metadata.get on the dataset and bigquery.data.viewer (or finer) on the table; bigquery.jobs.create on the billing project for views and agnes snapshot create queries.
- Credentials resolution: GCE metadata server first, then Application Default Credentials (gcloud auth application-default login or GOOGLE_APPLICATION_CREDENTIALS). See connectors/bigquery/auth.py.
Configuration
In config/instance.yaml:
data_source:
type: bigquery
bigquery:
project: my-data-project # data + default billing project
billing_project: my-billing-project # optional override; needed when SA
# lacks serviceusage.services.use on
# the data project
location: us
Registering BigQuery tables
Two ways, both API-first (no manual table_registry SQL).
Web UI — go to /admin/tables. With data_source.type: bigquery the page
swaps the discovery panel for a "Register BigQuery table" button that opens a
manual-entry modal: dataset, source table, view name, description, folder,
optional sync schedule. Submit runs /api/admin/register-table/precheck first
(round-trips bigquery.Client.get_table to confirm the table exists and the SA
can see it), surfaces the row count + size + column count, then commits.
CLI — agnes admin register-table:
# Dry-run: validate + check the source exists, no DB write.
agnes admin register-table orders \
--source-type bigquery \
--bucket analytics \
--source-table orders \
--dry-run
# Commit
agnes admin register-table orders \
--source-type bigquery \
--bucket analytics \
--source-table orders \
--description "Order data from BQ"
The server forces query_mode=remote and profile_after_sync=false for BQ
rows. Sync schedule (--sync-schedule) is accepted and stored but not yet
evaluated by the scheduler — see issue #79; addressed in Milestone 3 of the
admin-BQ-registration epic (#108).
Wildcard / sharded tables
Not supported in M1. The register endpoint rejects any source_table containing
*. Tracked in #108 M3+.
Hybrid Queries
Server-side only. Admins can POST {sql, register_bq: {alias: bq_sql}} to
/api/query/hybrid (app/api/query_hybrid.py); the BigQuery sub-queries
run server-side, where BQ credentials live, and the join runs against the
server's local parquet views in a single DuckDB session.
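As a rough illustration of the request shape described above, the following Python sketch builds (but does not send) a POST to /api/query/hybrid. Only the endpoint path and the {sql, register_bq: {alias: bq_sql}} body come from this document; the base URL, the bearer-token header, and the column names in the SQL are assumptions made for the example.

```python
import json
import urllib.request

# Hypothetical values for illustration; substitute your own deployment's.
BASE_URL = "https://agnes.example.com"  # assumption: server base URL
ADMIN_TOKEN = "replace-me"              # assumption: bearer-token auth


def build_hybrid_request(sql: str, register_bq: dict[str, str]) -> urllib.request.Request:
    """Build (but do not send) a POST to /api/query/hybrid."""
    body = json.dumps({"sql": sql, "register_bq": register_bq}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/api/query/hybrid",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {ADMIN_TOKEN}",  # assumed auth scheme
        },
    )


# The alias "bq_orders" names the BigQuery sub-query so the outer SQL can join on it.
# Column names (id, name, company_id, amount) are made up for the example.
request = build_hybrid_request(
    sql="SELECT c.name, o.total FROM company c JOIN bq_orders o ON o.company_id = c.id",
    register_bq={
        "bq_orders": "SELECT company_id, SUM(amount) AS total "
                     "FROM analytics.orders GROUP BY company_id"
    },
)
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
```

Each key in register_bq becomes an alias that the outer sql can reference alongside local views, which is what lets the join run in one DuckDB session on the server.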
Analysts who need to combine a local table with a remote one should either
use agnes snapshot create to materialize a filtered slice of the remote table
and join it locally, or run the join server-side via agnes query --remote. The
earlier agnes query --register-bq flag (which ran in-process on the
caller's machine) was removed because it required local BigQuery
credentials that analysts don't have.
Jira Connector
Real-time webhook-based connector that updates parquet files incrementally.
How it works
- Jira webhooks hit /api/jira/webhook endpoint
- The connector (connectors/jira/) processes webhook events and updates parquet files
- Produces extract.duckdb with _meta table + incremental parquet data
Writing a Custom Connector
Create a new connector in connectors/<name>/extractor.py that produces the extract.duckdb contract:
/data/extracts/{source_name}/
├── extract.duckdb ← _meta table + views
└── data/ ← parquet files (local sources only)
Required: _meta table
CREATE TABLE _meta (
    table_name VARCHAR,
    description VARCHAR,
    rows INTEGER,
    size_bytes INTEGER,
    extracted_at TIMESTAMP,
    query_mode VARCHAR -- 'local' or 'remote'
);
Optional: _remote_attach table (for remote sources)
CREATE TABLE _remote_attach (
    alias VARCHAR,     -- DuckDB alias used in views
    extension VARCHAR, -- Extension name
    url VARCHAR,       -- Connection URL
    token_env VARCHAR  -- Env-var name holding the auth token (NOT the token itself)
);
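The token_env column stores only the name of an environment variable, so the secret is looked up at attach time and never written into extract.duckdb. A minimal Python sketch of that indirection follows; the RemoteAttach class, its resolve_token method, and the alias "kbc" are hypothetical names for illustration, not part of the codebase.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RemoteAttach:
    """One row of _remote_attach; field names mirror the columns above."""

    alias: str
    extension: str
    url: str
    token_env: str  # name of the env var, never the secret itself

    def resolve_token(self, environ=None) -> str:
        """Look up the secret at attach time; fail loudly if it is unset."""
        env = os.environ if environ is None else environ
        token = env.get(self.token_env, "")
        if not token:
            raise RuntimeError(f"environment variable {self.token_env} is not set")
        return token


row = RemoteAttach(
    alias="kbc",  # hypothetical alias
    extension="keboola",
    url="https://connection.your-region.keboola.com",
    token_env="KEBOOLA_STORAGE_TOKEN",
)
# A plain dict stands in for os.environ so the example runs anywhere.
token = row.resolve_token({"KEBOOLA_STORAGE_TOKEN": "your-token-here"})
```

Because the row holds only the variable name, copying or inspecting extract.duckdb does not expose credentials.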
Identifier validation
Import shared validators from src/identifier_validation.py:
from src.identifier_validation import validate_identifier, validate_quoted_identifier
Use validate_identifier() for strict names (alphanumeric + underscore) and validate_quoted_identifier() for names that may contain dots/hyphens (e.g., Keboola-style in.c-crm.orders).
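To make the difference between the two rules concrete, here is a small Python sketch with stand-in checks. The function names and both regex patterns are assumptions for illustration (including the choice to reject a leading digit in strict names); the real patterns and signatures live in src/identifier_validation.py and may differ.

```python
import re

# Assumed patterns, not the project's actual ones.
_STRICT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")        # alphanumeric + underscore
_QUOTED = re.compile(r"[A-Za-z0-9_][A-Za-z0-9_.\-]*")  # additionally allows dots and hyphens


def is_strict_identifier(name: str) -> bool:
    """Hypothetical stand-in for the strict check."""
    return _STRICT.fullmatch(name) is not None


def is_quoted_identifier(name: str) -> bool:
    """Hypothetical stand-in for the check applied to names that will be quoted."""
    return _QUOTED.fullmatch(name) is not None


strict_ok = is_strict_identifier("orders")
strict_rejects_dots = not is_strict_identifier("in.c-crm.orders")
quoted_ok = is_quoted_identifier("in.c-crm.orders")
quoted_rejects_injection = not is_quoted_identifier('x"; DROP TABLE t; --')
```

Using fullmatch rather than match means the whole string has to conform, so a valid prefix followed by a quote or semicolon is still rejected.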
The SyncOrchestrator auto-discovers connectors by scanning /data/extracts/*/extract.duckdb — no registration step needed beyond producing the correct output format.
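The discovery step can be pictured as a simple glob over the extracts directory. The sketch below shows that idea only; discover_extracts is a hypothetical helper, and the actual SyncOrchestrator logic may do more (for example, reading _meta and attaching each database).

```python
from pathlib import Path


def discover_extracts(extracts_root: Path) -> dict[str, Path]:
    """Map source name -> extract.duckdb path for each connector directory."""
    found: dict[str, Path] = {}
    # Only directories that actually contain extract.duckdb are picked up.
    for db_path in sorted(extracts_root.glob("*/extract.duckdb")):
        found[db_path.parent.name] = db_path
    return found
```

Calling discover_extracts(Path("/data/extracts")) would return one entry per connector that has produced output, keyed by its directory name; a directory without extract.duckdb is ignored.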
See connectors/keboola/ for a complete batch-pull reference implementation, or connectors/bigquery/ for a remote-attach example.