release: 0.47.0 — source-agnostic catalog metadata + cache discipline (#223)

## Summary

- Catalog enrichment for `query_mode='remote'` rows: `rows`, `size_bytes`, `partition_by`, `clustered_by` per table (BQ + Keboola providers).
- `/api/v2/schema/{id}` cache miss: 2 BQ jobs → 1 (-50%) via shared `fetch_bq_columns_full`.
- All four catalog/schema/sample/metadata caches flush on registry change; single-row re-warm scheduled.
- Automatic cache warmup at server startup (bounded concurrency, opt-out via `AGNES_SKIP_CACHE_WARMUP=1`).
- SSE-driven freshness toolbar on `/admin/tables` with progress bar, log, and per-row badge.
- New admin doc `docs/admin/query-modes.md` — single source of truth on `local` / `remote` / `materialized` choice.
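
The startup warmup described above (bounded concurrency, per-row isolation, env-var opt-out) could look roughly like the sketch below. Only `AGNES_SKIP_CACHE_WARMUP` and the `fresh` badge state come from this PR; `warm_all`, `warm_one`, the `max_concurrency` default, and the `error:` status string are hypothetical names chosen for illustration.

```python
import asyncio
import os


async def warm_all(table_ids, warm_one, max_concurrency=4):
    """Warm caches for each table id, at most `max_concurrency` at a time.

    `warm_one` is a hypothetical async callable that warms a single row.
    Returns a dict of table id -> status string.
    """
    # Opt-out switch named in the PR summary.
    if os.environ.get("AGNES_SKIP_CACHE_WARMUP") == "1":
        return {}

    sem = asyncio.Semaphore(max_concurrency)

    async def _guarded(table_id):
        async with sem:
            try:
                await warm_one(table_id)
                return table_id, "fresh"
            except Exception as exc:
                # One failing row must not abort the remaining rows.
                return table_id, f"error: {exc}"

    results = await asyncio.gather(*(_guarded(t) for t in table_ids))
    return dict(results)
```

In a FastAPI `lifespan`, a coroutine like this would be scheduled with `asyncio.create_task(...)` before `yield`, so the server starts accepting requests while warmup runs in the background.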

Closes #155.
Closes #156.

## Test plan

- [x] 65+ targeted tests pass across 11 new test modules + 3 modified ones.
- [x] No DB migration; no wire-break; `MIN_COMPAT_CLI_VERSION` unchanged.
- [ ] Reviewer: register a remote BQ table via `/admin/tables`, observe the toolbar populates within ~2 s and the per-row badge transitions warming → fresh.
- [ ] Reviewer: trigger `Re-warm all`, verify SSE log scrolls and `cacheWarmupBar` progresses.
- [ ] Reviewer: edit a registered row's bucket, verify `agnes schema <id>` returns updated columns immediately (no 1-hour staleness).
- [ ] Reviewer: confirm `agnes admin register-table --query-mode remote` prints the new IAM-smoke-check hint.

## Notable design decisions

- BigQuery's region-scoped `INFORMATION_SCHEMA.TABLE_STORAGE` is the only valid source for size and row counts (verified live 2026-05-07; a dataset-scoped variant doesn't exist). The region is resolved from `instance.yaml.data_source.bigquery.location`, then from `bq.client().get_dataset(...)`; if neither yields one, the lookup falls back to legacy `__TABLES__`.
- VIEW handling: TABLE_STORAGE returns no rows for views, so the lookup falls through to `__TABLES__` (also empty) and yields `TableMetadata(rows=None, size_bytes=None, partition_by=..., clustered_by=...)`. A null size signals analyst Claude to apply the existing CLAUDE.md guidance.
- `size_bytes` is `active_logical_bytes + long_term_logical_bytes` — full BQ scan reads both; reporting only active undercounts aged partitioned tables.
- Source-agnostic provider seam: per-source `connectors/<source>/metadata.py:fetch(MetadataRequest)`; dispatcher in `app/api/v2_catalog.py:_metadata_provider_for` lazily imports per source_type so a Keboola-only deployment doesn't pay the BQ-extension import cost.
- Warmup non-blocking: FastAPI `lifespan` schedules `asyncio.create_task(_warm_catalog_caches_bg)` before `yield`. Per-row failures isolated.
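
The size/rows fallback chain from the first three bullets can be sketched as follows. This is an illustration under assumptions, not the PR's code: the `TableMetadata` field types, `resolve_size`, and the two fetcher callables are hypothetical. The column names (`total_rows`, `active_logical_bytes`, `long_term_logical_bytes` in TABLE_STORAGE; `row_count`, `size_bytes` in `__TABLES__`) are the ones BigQuery exposes.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class TableMetadata:
    # Field names follow the PR text; the types are assumptions.
    rows: Optional[int]
    size_bytes: Optional[int]
    partition_by: Optional[str] = None
    clustered_by: Sequence[str] = ()


def resolve_size(
    fetch_table_storage: Callable[[], Optional[dict]],
    fetch_legacy_tables: Callable[[], Optional[dict]],
) -> TableMetadata:
    """Try TABLE_STORAGE first, then legacy __TABLES__, else report nulls."""
    row = fetch_table_storage()
    if row is not None:
        # A full scan reads both active and long-term storage, so sum them.
        size = row["active_logical_bytes"] + row["long_term_logical_bytes"]
        return TableMetadata(rows=row["total_rows"], size_bytes=size)

    legacy = fetch_legacy_tables()
    if legacy is not None:
        return TableMetadata(rows=legacy["row_count"], size_bytes=legacy["size_bytes"])

    # Views land here: neither source has a row for them.
    return TableMetadata(rows=None, size_bytes=None)
```

Passing the fetchers as callables keeps the fallback order testable without a live BigQuery client.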

## Out of scope

- Profile / column histograms / dimension cardinality for remote tables (separate issue).
- Onboarding nudge ("you have 0 remote tables, consider registering some BQ ones") — separate UX call.
- Provider plug-in registration via entry-points (the dispatch table is a hardcoded if-tree today; one line per future source).

## Release

Bumps `pyproject.toml` 0.46.1 → 0.47.0 (main shipped 0.46.0 + 0.46.1 during this PR — see commit `d98976ec`). New CHANGELOG section under `## [0.47.0] — 2026-05-07`.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

2026-05-07 18:33:55 +02:00

Query Modes — when to register a table as `local`, `remote`, or `materialized`

Source-agnostic guide to the three query_mode values Agnes supports. Pick the right mode at registration time and the analyst-side experience is fast, cost-aware, and predictable. Pick wrong and you'll either burn BQ scan budget on every query or spend hours waiting on syncs that didn't need to happen.

TL;DR — decision tree

Is the table small (< 1 GB) and updated daily-or-slower?
  └─ YES → query_mode: local       (sync to laptop, query offline)

Is the table the result of an aggregate SQL the operator controls?
  └─ YES → query_mode: materialized  (server runs SQL → parquet, distributed)

Otherwise:
  └─ query_mode: remote   (data stays in upstream; analyst queries on demand)
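
For readers who prefer code, the decision tree above can be written as a small function. This is a sketch only: `choose_query_mode` and its parameters are hypothetical, "1 GB" is taken as 10^9 bytes, and "daily-or-slower" is modelled as at most one update per day.

```python
GB = 10 ** 9  # assumption: the "< 1 GB" threshold means decimal gigabytes


def choose_query_mode(size_bytes, updates_per_day, is_operator_aggregate):
    """Return 'local', 'materialized', or 'remote', checked in the tree's order."""
    # Small and slow-changing: sync to the laptop and query offline.
    if size_bytes < 1 * GB and updates_per_day <= 1:
        return "local"
    # Result of an aggregate SQL the operator controls: server builds a parquet.
    if is_operator_aggregate:
        return "materialized"
    # Everything else stays upstream and is queried on demand.
    return "remote"
```

For example, `choose_query_mode(200_000_000, 1, False)` returns `"local"`, while a 50 GB table updated hourly returns `"remote"`.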

Three modes side-by-side

| Aspect | `local` | `materialized` | `remote` |
| --- | --- | --- | --- |
| Where the data lives | Analyst laptop (parquet) | Agnes server filesystem (parquet) | Upstream (BigQuery, Keboola, …) |
| Who runs the query | Analyst's local DuckDB | Analyst's local DuckDB | Upstream engine via DuckDB extension |
| Cost model | Free after sync | Free after each sync | Per-query scan cost on the analyst's first hit |
| Freshness | As fresh as last sync | As fresh as last scheduled run | Live |
| Scan limits | None (laptop disk) | None (server disk) | `bq_max_scan_bytes` cost gate (default 5 GiB) |
| Best for | Stable reference data, daily-updated facts | Aggregates, daily snapshots | Big tables, live data, residency-restricted |

Per-source-type reference

BigQuery — `query_mode: remote`

The most common use case for remote. Data stays in BQ; analysts query on demand via the Agnes server's service account.

IAM: the server's SA must have:

- `roles/bigquery.dataViewer` on the dataset (read access)
- `roles/bigquery.jobUser` on the billing project (run jobs)

If `data_source.bigquery.billing_project == data_source.bigquery.project`, also grant the SA the `serviceusage.services.use` permission; otherwise the BQ extension can return 403 `USER_PROJECT_DENIED` on the first query. The instance health check (`agnes diagnose`) surfaces this as an info-tier entry on `bq_config`.

Register via UI: `/admin/tables` → "Add table" → Source type `bigquery` → Mode `remote` → fill `dataset` (your BQ dataset name) and `source_table` (the BQ table id within that dataset).

Register via CLI:

agnes admin register-table sales_2024 \
    --source-type bigquery \
    --bucket dwh_base \
    --source-table sales_2024 \
    --query-mode remote

After registration, smoke-test the SA's access:

agnes query --remote "SELECT COUNT(*) FROM sales_2024"

A 403 here means the SA is missing dataViewer or jobUser; fix in IAM and re-test.

Cost guardrail: `bq_max_scan_bytes` (default 5 GiB) refuses queries whose pre-execution scan estimate exceeds the cap. It is configurable in `/admin/server-config`. When an analyst hits the cap, the response includes a hint to use `agnes snapshot create --where '<predicate>'` to materialize a scoped subset locally.
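
A minimal sketch of how such a scan gate might behave, assuming the scan estimate is already available as a byte count. `ScanCapExceeded` and `enforce_scan_cap` are hypothetical names; only the 5 GiB default and the `agnes snapshot create` hint come from the text above.

```python
GIB = 1024 ** 3
DEFAULT_BQ_MAX_SCAN_BYTES = 5 * GIB  # default stated in this guide


class ScanCapExceeded(Exception):
    """Raised when a query's estimated scan is larger than the cap."""


def enforce_scan_cap(estimated_bytes, cap=DEFAULT_BQ_MAX_SCAN_BYTES):
    """Return the estimate if it fits under the cap, otherwise raise with a hint."""
    if estimated_bytes > cap:
        raise ScanCapExceeded(
            f"estimated scan of {estimated_bytes} bytes exceeds cap of {cap} bytes; "
            "consider: agnes snapshot create --where '<predicate>'"
        )
    return estimated_bytes
```

Checking the estimate before execution is what keeps the refusal free: no bytes are scanned when the gate trips.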

BigQuery — `query_mode: materialized`

The server runs a scheduled SQL aggregate against BigQuery and writes the result to a parquet on the Agnes filesystem. Analysts get the parquet via agnes pull like any other local table.

Register via CLI:

agnes admin register-table monthly_kpis \
    --source-type bigquery \
    --bucket dwh_base \
    --source-table monthly_kpis \
    --query-mode materialized \
    --query @path/to/monthly_kpis.sql \
    --sync-schedule "daily 03:00"

Cost guardrail: `data_source.bigquery.max_bytes_per_materialize` (default 10 GiB; set `0` to disable) refuses materialize runs whose query plan exceeds the cap. This catches, for example, a typo'd WHERE clause that would otherwise scan a year of data.
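
The `0`-disables convention is easy to get backwards, so here is a small sketch of the intended check. The function name `materialize_allowed` is hypothetical; the 10 GiB default and the meaning of `0` come from the paragraph above.

```python
GIB = 1024 ** 3
DEFAULT_MAX_BYTES_PER_MATERIALIZE = 10 * GIB  # default stated in this guide


def materialize_allowed(planned_bytes, max_bytes=DEFAULT_MAX_BYTES_PER_MATERIALIZE):
    """Return True if a materialize run may proceed under the configured cap."""
    # A cap of 0 disables the guardrail entirely.
    if max_bytes == 0:
        return True
    # Otherwise refuse only runs whose plan exceeds the cap.
    return planned_bytes <= max_bytes
```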

Keboola — `query_mode: local` (the production path)

The Agnes server's Keboola DuckDB extension downloads the table to a parquet on the server filesystem; agnes pull distributes it to analyst laptops.

Setup: set `instance.yaml.data_source.type: keboola` and provide a Storage API token via the `KEBOOLA_STORAGE_TOKEN` env var (or whichever variable `instance.yaml.token_env` points at).

Register via CLI:

agnes admin register-table users \
    --source-type keboola \
    --bucket in.c-crm \
    --source-table users \
    --query-mode local

`query_mode: remote` for Keboola is architecturally supported via the `_remote_attach` mechanism (the orchestrator can ATTACH the Keboola DuckDB extension on demand, the same way it does for BQ), but it is not in active deployment use today. If you have an analyst workflow against a Keboola table that's too big to sync, file an issue: the architecture is in place, but the registration UX hasn't been polished.

Jira — `query_mode: local` only

Event-driven: webhooks update parquets incrementally. No remote or materialized mode for Jira today.

Worked examples

1. Big BigQuery fact table you query weekly: query_mode: remote. SA needs dataViewer + jobUser. Analyst uses agnes query --remote for one-off aggregates and agnes snapshot create for cross-week joins.

2. Daily Keboola dimension table: query_mode: local. Synced once a day by the scheduler; analyst's agnes pull picks it up.

3. Monthly KPI aggregate from a BQ data warehouse: `query_mode: materialized` + `--sync-schedule "0 3 1 * *"` (03:00 on the 1st of each month). The server runs your aggregate SQL once a month; analysts get a parquet of the result.
