The flag ran RemoteQueryEngine in-process on the caller's machine and
required local BigQuery credentials (BIGQUERY_PROJECT + ADC). Analysts
don't have those, so calling --register-bq from an analyst workspace
surfaced as a confusing not_configured error chain ("Could not load
static instance.yaml" + "BigQuery project not configured"). An agent
following CLAUDE.md's hybrid-queries guidance would land in exactly
that trap.
The underlying engine was originally designed server-side (commit
d180b201, "Step 28: Remote query architecture"); the CLI port (commit
d605e7d9) silently assumed parity with the server. Server-side hybrid
already exists as an admin-only POST /api/query/hybrid endpoint
(app/api/query_hybrid.py) and is untouched here.
Analysts combining local + remote data now have two documented paths:
agnes snapshot create a filtered slice and join locally, or run the
join server-side via agnes query --remote. CLAUDE.md, the agent skill,
docs/DATA_SOURCES.md, and connectors.md updated accordingly.
214 lines
7.3 KiB
Markdown
214 lines
7.3 KiB
Markdown
# Data Sources
|
|
|
|
## Overview
|
|
|
|
AI Data Analyst uses a connector system where each connector produces an `extract.duckdb` following a standard contract. The SyncOrchestrator auto-discovers and ATTACHes these into the master `analytics.duckdb`.
|
|
|
|
Configure the data source type in `config/instance.yaml`:
|
|
|
|
```yaml
|
|
data_source:
|
|
type: "keboola" # Options: keboola, bigquery, csv
|
|
```
|
|
|
|
Table definitions are stored in the DuckDB `table_registry` table (not in config files). Register tables via the admin API, CLI, or web UI.
|
|
|
|
## Query Modes
|
|
|
|
Each table has a `query_mode` that determines how data is accessed:
|
|
|
|
- **`local`**: Data is downloaded to parquet files on the Agnes server. Suitable for tables that fit in local storage.
|
|
- **`remote`**: Data stays in the external source; DuckDB extension ATTACHes at query time. Suitable for large tables where only query results are transferred.
|
|
|
|
## Keboola Connector
|
|
|
|
Syncs tables from Keboola Storage API using the DuckDB Keboola extension.
|
|
|
|
### Requirements
|
|
|
|
- Keboola Storage API token with read access
|
|
- DuckDB Keboola extension (auto-installed)
|
|
|
|
### Configuration
|
|
|
|
In `.env`:
|
|
```
|
|
KEBOOLA_STORAGE_TOKEN=your-token-here
|
|
KEBOOLA_STACK_URL=https://connection.your-region.keboola.com
|
|
KEBOOLA_PROJECT_ID=12345
|
|
```
|
|
|
|
Or configure via the admin UI (`/admin/tables`) or CLI:
|
|
```bash
|
|
agnes admin register-table --source-type keboola --bucket "in.c-crm" --table "company" --query-mode local
|
|
```
|
|
|
|
### How it works
|
|
|
|
1. The extractor (`connectors/keboola/extractor.py`) uses the DuckDB Keboola extension to download data
|
|
2. Produces `extract.duckdb` with `_meta` table + parquet files in `/data/extracts/keboola/data/`
|
|
3. The SyncOrchestrator ATTACHes `extract.duckdb` into `analytics.duckdb` and creates views
|
|
|
|
### Identifier validation
|
|
|
|
All Keboola table names, bucket names, and source table identifiers are validated against `_SAFE_QUOTED_IDENTIFIER` regex before use. Invalid identifiers are skipped with error logging.
|
|
|
|
## BigQuery Connector
|
|
|
|
Queries BigQuery tables on-demand using the DuckDB BigQuery extension (remote attach).
|
|
|
|
### Requirements
|
|
|
|
- Google Cloud project with BigQuery access
|
|
- Application Default Credentials (ADC) configured
|
|
|
|
### Configuration
|
|
|
|
In `config/instance.yaml`:
|
|
```yaml
|
|
bigquery:
|
|
project_id: "your-gcp-project"
|
|
```
|
|
|
|
## BigQuery Adapter
|
|
|
|
Registers BigQuery tables and views as remote DuckDB views (no data download). Queries
|
|
issued through the master `analytics.duckdb` are forwarded to BigQuery via the DuckDB
|
|
BigQuery extension. See also `agnes snapshot create` for the analytical workflow that materializes
|
|
filtered subsets locally.
|
|
|
|
### Requirements
|
|
|
|
- DuckDB BigQuery extension (auto-installed by the extractor on first run).
|
|
- A GCP service account with `bigquery.metadata.get` on the dataset and
|
|
`bigquery.data.viewer` (or finer) on the table; `bigquery.jobs.create` on the
|
|
billing project for views and `agnes snapshot create` queries.
|
|
- Credentials resolution: GCE metadata server first, then Application Default
|
|
Credentials (`gcloud auth application-default login` or
|
|
`GOOGLE_APPLICATION_CREDENTIALS`). See `connectors/bigquery/auth.py`.
|
|
|
|
### Configuration
|
|
|
|
In `config/instance.yaml`:
|
|
|
|
```yaml
|
|
data_source:
|
|
type: bigquery
|
|
bigquery:
|
|
project: my-data-project # data + default billing project
|
|
billing_project: my-billing-project # optional override; needed when SA
|
|
# lacks serviceusage.services.use on
|
|
# the data project
|
|
location: us
|
|
```
|
|
|
|
### Registering BigQuery tables
|
|
|
|
Two ways, both API-first (no manual `table_registry` SQL).
|
|
|
|
**Web UI** — go to `/admin/tables`. With `data_source.type: bigquery` the page
|
|
swaps the discovery panel for a "Register BigQuery table" button that opens a
|
|
manual-entry modal: dataset, source table, view name, description, folder,
|
|
optional sync schedule. Submit runs `/api/admin/register-table/precheck` first
|
|
(round-trips `bigquery.Client.get_table` to confirm the table exists and the SA
|
|
can see it), surfaces the row count + size + column count, then commits.
|
|
|
|
**CLI** — `agnes admin register-table`:
|
|
|
|
```bash
|
|
# Dry-run: validate + check the source exists, no DB write.
|
|
agnes admin register-table orders \
|
|
--source-type bigquery \
|
|
--bucket analytics \
|
|
--source-table orders \
|
|
--dry-run
|
|
|
|
# Commit
|
|
agnes admin register-table orders \
|
|
--source-type bigquery \
|
|
--bucket analytics \
|
|
--source-table orders \
|
|
--description "Order data from BQ"
|
|
```
|
|
|
|
The server forces `query_mode=remote` and `profile_after_sync=false` for BQ
|
|
rows. Sync schedule (`--sync-schedule`) is accepted and stored but not yet
|
|
evaluated by the scheduler — see issue #79; addressed in Milestone 3 of the
|
|
admin-BQ-registration epic (#108).
|
|
|
|
### Wildcard / sharded tables
|
|
|
|
Not supported in M1. The register endpoint rejects any `source_table` containing
|
|
`*`. Tracked in #108 M3+.
|
|
|
|
### Hybrid Queries
|
|
|
|
Server-side only. Admins can POST `{sql, register_bq: {alias: bq_sql}}` to
|
|
`/api/query/hybrid` (`app/api/query_hybrid.py`); the BigQuery sub-queries
|
|
run server-side, where BQ credentials live, and the join runs against the
|
|
server's local parquet views in a single DuckDB session.
|
|
|
|
Analysts who need to combine a local table with a remote one should
|
|
`agnes snapshot create` a filtered slice of the remote table and join it
|
|
locally, or run the join server-side via `agnes query --remote`. The
|
|
earlier `agnes query --register-bq` flag (which ran in-process on the
|
|
caller's machine) was removed because it required local BigQuery
|
|
credentials that analysts don't have.
|
|
|
|
## Jira Connector
|
|
|
|
Real-time webhook-based connector that updates parquet files incrementally.
|
|
|
|
### How it works
|
|
|
|
1. Jira webhooks hit `/api/jira/webhook` endpoint
|
|
2. The connector (`connectors/jira/`) processes webhook events and updates parquet files
|
|
3. Produces `extract.duckdb` with `_meta` table + incremental parquet data
|
|
|
|
## Writing a Custom Connector
|
|
|
|
Create a new connector in `connectors/<name>/extractor.py` that produces the `extract.duckdb` contract:
|
|
|
|
```
|
|
/data/extracts/{source_name}/
|
|
├── extract.duckdb ← _meta table + views
|
|
└── data/ ← parquet files (local sources only)
|
|
```
|
|
|
|
### Required: `_meta` table
|
|
|
|
```sql
|
|
CREATE TABLE _meta (
|
|
table_name VARCHAR,
|
|
description VARCHAR,
|
|
rows INTEGER,
|
|
size_bytes INTEGER,
|
|
extracted_at TIMESTAMP,
|
|
query_mode VARCHAR -- 'local' or 'remote'
|
|
);
|
|
```
|
|
|
|
### Optional: `_remote_attach` table (for remote sources)
|
|
|
|
```sql
|
|
CREATE TABLE _remote_attach (
|
|
alias VARCHAR, -- DuckDB alias used in views
|
|
extension VARCHAR, -- Extension name
|
|
url VARCHAR, -- Connection URL
|
|
token_env VARCHAR -- Env-var name holding the auth token (NOT the token itself)
|
|
);
|
|
```
|
|
|
|
### Identifier validation
|
|
|
|
Import shared validators from `src/identifier_validation.py`:
|
|
|
|
```python
|
|
from src.identifier_validation import validate_identifier, validate_quoted_identifier
|
|
```
|
|
|
|
Use `validate_identifier()` for strict names (alphanumeric + underscore) and `validate_quoted_identifier()` for names that may contain dots/hyphens (e.g., Keboola-style `in.c-crm.orders`).
|
|
|
|
The SyncOrchestrator auto-discovers connectors by scanning `/data/extracts/*/extract.duckdb` — no registration step needed beyond producing the correct output format.
|
|
|
|
See `connectors/keboola/` for a complete batch-pull reference implementation, or `connectors/bigquery/` for a remote-attach example.
|