## Summary
- Catalog enrichment for `query_mode='remote'` rows: `rows`, `size_bytes`, `partition_by`, `clustered_by` per table (BQ + Keboola providers).
- `/api/v2/schema/{id}` cache miss: 2 BQ jobs → 1 (-50%) via shared `fetch_bq_columns_full`.
- All four catalog/schema/sample/metadata caches flush on registry change; single-row re-warm scheduled.
- Automatic cache warmup at server startup (bounded concurrency, opt-out via `AGNES_SKIP_CACHE_WARMUP=1`).
- SSE-driven freshness toolbar on `/admin/tables` with progress bar, log, and per-row badge.
- New admin doc `docs/admin/query-modes.md` — single source of truth on `local` / `remote` / `materialized` choice.
Closes #155.
Closes #156.
## Test plan
- [x] 65+ targeted tests pass across 11 new test modules + 3 modified ones.
- [x] No DB migration; no wire-break; `MIN_COMPAT_CLI_VERSION` unchanged.
- [ ] Reviewer: register a remote BQ table via `/admin/tables`, observe the toolbar populates within ~2 s and the per-row badge transitions warming → fresh.
- [ ] Reviewer: trigger `Re-warm all`, verify SSE log scrolls and `cacheWarmupBar` progresses.
- [ ] Reviewer: edit a registered row's bucket, verify `agnes schema <id>` returns updated columns immediately (no 1-hour staleness).
- [ ] Reviewer: confirm `agnes admin register-table --query-mode remote` prints the new IAM-smoke-check hint.
## Notable design decisions
- BigQuery `INFORMATION_SCHEMA.TABLE_STORAGE` is the only valid scope for size+rows (verified live 2026-05-07; dataset-scoped doesn't exist). Region resolved from `instance.yaml.data_source.bigquery.location` → `bq.client().get_dataset(...)` → fall back to legacy `__TABLES__`.
- VIEW handling: `TABLE_STORAGE` returns no rows for views, so the lookup falls through to `__TABLES__` (also empty) and yields `TableMetadata(rows=None, size_bytes=None, partition_by=..., clustered_by=...)`. A null size signals analyst Claude to apply the existing CLAUDE.md guidance.
- `size_bytes` is `active_logical_bytes + long_term_logical_bytes` — full BQ scan reads both; reporting only active undercounts aged partitioned tables.
- Source-agnostic provider seam: per-source `connectors/<source>/metadata.py:fetch(MetadataRequest)`; dispatcher in `app/api/v2_catalog.py:_metadata_provider_for` lazily imports per source_type so a Keboola-only deployment doesn't pay the BQ-extension import cost.
- Warmup non-blocking: FastAPI `lifespan` schedules `asyncio.create_task(_warm_catalog_caches_bg)` before `yield`. Per-row failures isolated.
## Out of scope
- Profile / column histograms / dimension cardinality for remote tables (separate issue).
- Onboarding nudge ("you have 0 remote tables, consider registering some BQ ones") — separate UX call.
- Provider plug-in registration via entry-points (the dispatch table is a hardcoded if-tree today; one line per future source).
## Release
Bumps `pyproject.toml` 0.46.1 → 0.47.0 (main shipped 0.46.0 + 0.46.1 during this PR — see commit `d98976ec`). New CHANGELOG section under `## [0.47.0] — 2026-05-07`.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
# Query Modes: when to register a table as local, remote, or materialized
Source-agnostic guide to the three query_mode values Agnes supports. Pick the right mode at registration time and the analyst-side experience is fast, cost-aware, and predictable. Pick wrong and you'll either burn BQ scan budget on every query or spend hours waiting on syncs that didn't need to happen.
## TL;DR: decision tree

    Is the table small (< 1 GB) and updated daily-or-slower?
      └─ YES → query_mode: local (sync to laptop, query offline)
    Is the table the result of an aggregate SQL the operator controls?
      └─ YES → query_mode: materialized (server runs SQL → parquet, distributed)
    Otherwise:
      └─ query_mode: remote (data stays in upstream; analyst queries on demand)
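
The decision tree above can be written as a small function, which makes the order of the checks explicit. This is only a sketch: `choose_query_mode` and its parameters are hypothetical and not part of Agnes, "1 GB" is taken as 10^9 bytes, and an unknown size is assumed to count as "not small".

```python
from typing import Optional

ONE_GB = 10**9  # assumption: the tree's "1 GB" read as decimal gigabytes


def choose_query_mode(
    size_bytes: Optional[int],
    updated_daily_or_slower: bool,
    is_operator_aggregate: bool,
) -> str:
    """Walk the decision tree top to bottom and return a query_mode value."""
    # 1. Small and slow-moving tables are synced to the laptop.
    if size_bytes is not None and size_bytes < ONE_GB and updated_daily_or_slower:
        return "local"
    # 2. Operator-controlled aggregates are computed on the server.
    if is_operator_aggregate:
        return "materialized"
    # 3. Everything else stays upstream and is queried on demand.
    return "remote"
```

Since the checks run in order, a small, slow-moving aggregate resolves to `local`; the `materialized` branch is only reached when the first test fails.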
## Three modes side-by-side

| Aspect | `local` | `materialized` | `remote` |
|---|---|---|---|
| Where the data lives | Analyst laptop (parquet) | Agnes server filesystem (parquet) | Upstream (BigQuery, Keboola, …) |
| Who runs the query | Analyst's local DuckDB | Analyst's local DuckDB | Upstream engine via DuckDB extension |
| Cost model | Free after sync | Free after each sync | Per-query scan cost on the analyst's first hit |
| Freshness | As fresh as last sync | As fresh as last scheduled run | Live |
| Scan limits | None (laptop disk) | None (server disk) | bq_max_scan_bytes cost gate (default 5 GiB) |
| Best for | Stable reference data, daily-updated facts | Aggregates, daily snapshots | Big tables, live data, residency-restricted |
## Per-source-type reference

### BigQuery (`query_mode: remote`)
The most common use case for remote. Data stays in BQ; analysts query on demand via the Agnes server's service account.
IAM: the server's SA must have:
- `roles/bigquery.dataViewer` on the dataset (read access)
- `roles/bigquery.jobUser` on the billing project (run jobs)
If data_source.bigquery.billing_project == data_source.bigquery.project, set the SA's serviceusage.services.use permission too — the BQ extension can otherwise 403 USER_PROJECT_DENIED on the first query. The instance health check (agnes diagnose) surfaces this as an info-tier entry on bq_config.
Register via UI: /admin/tables → "Add table" → Source type bigquery → Mode remote → fill dataset (your BQ dataset name) + source_table (the BQ table id within that dataset).
Register via CLI:

    agnes admin register-table sales_2024 \
      --source-type bigquery \
      --bucket dwh_base \
      --source-table sales_2024 \
      --query-mode remote

After registration, smoke-test the SA's access:

    agnes query --remote "SELECT COUNT(*) FROM sales_2024"

A 403 here means the SA is missing dataViewer or jobUser; fix in IAM and re-test.
Cost guardrail: bq_max_scan_bytes (default 5 GiB) refuses queries whose pre-execution scan estimate exceeds the cap. Configurable in /admin/server-config. When an analyst hits the cap, the response includes a hint to use agnes snapshot create --where '<predicate>' to materialise a scoped subset locally.
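
The guardrail amounts to comparing a pre-execution scan estimate against the cap. The sketch below shows that comparison only. `check_scan_estimate` and `ScanCapExceeded` are hypothetical names, and how Agnes obtains the estimate is not described here (with the `google-cloud-bigquery` client, a dry-run job is one common way to get one), so the estimate is passed in as a plain number.

```python
GIB = 1024**3
DEFAULT_BQ_MAX_SCAN_BYTES = 5 * GIB  # default cap stated above


class ScanCapExceeded(Exception):
    """Hypothetical error raised when a query's scan estimate exceeds the cap."""


def check_scan_estimate(estimated_bytes, cap_bytes=DEFAULT_BQ_MAX_SCAN_BYTES):
    """Refuse the query if the estimate exceeds the cap; otherwise pass it through."""
    if estimated_bytes > cap_bytes:
        raise ScanCapExceeded(
            f"estimated scan of {estimated_bytes / GIB:.1f} GiB exceeds the "
            f"{cap_bytes / GIB:.1f} GiB cap; consider "
            "agnes snapshot create --where '<predicate>' "
            "to materialise a scoped subset locally"
        )
    return estimated_bytes
```

A query estimated at exactly the cap is allowed in this sketch, since the text says the gate refuses estimates that exceed it.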
### BigQuery (`query_mode: materialized`)
The server runs a scheduled SQL aggregate against BigQuery and writes the result to a parquet on the Agnes filesystem. Analysts get the parquet via agnes pull like any other local table.
Register via CLI:

    agnes admin register-table monthly_kpis \
      --source-type bigquery \
      --bucket dwh_base \
      --source-table monthly_kpis \
      --query-mode materialized \
      --query @path/to/monthly_kpis.sql \
      --sync-schedule "daily 03:00"

Cost guardrail: data_source.bigquery.max_bytes_per_materialize (default 10 GiB; set 0 to disable) refuses materialise runs whose query plan exceeds the cap. Catches a typo'd WHERE clause that would otherwise scan a year of data.
### Keboola (`query_mode: local`, the production path)
The Agnes server's Keboola DuckDB extension downloads the table to a parquet on the server filesystem; agnes pull distributes it to analyst laptops.
Setup: instance.yaml.data_source.type: keboola + Storage API token via KEBOOLA_STORAGE_TOKEN env var (or whatever instance.yaml.token_env points at).
Register via CLI:

    agnes admin register-table users \
      --source-type keboola \
      --bucket in.c-crm \
      --source-table users \
      --query-mode local

query_mode: remote for Keboola is architecturally supported via the _remote_attach mechanism (the orchestrator can ATTACH the Keboola DuckDB extension on demand the same way it does for BQ), but not in active deployment use today. If you have an analyst workflow against a Keboola table that's too big to sync, file an issue — the architecture is in place but the registration UX hasn't been polished.
### Jira (`query_mode: local` only)
Event-driven: webhooks update parquets incrementally. No remote or materialized mode for Jira today.
## Worked examples
1. Big BigQuery fact table you query weekly: query_mode: remote. SA needs dataViewer + jobUser. Analyst uses agnes query --remote for one-off aggregates and agnes snapshot create for cross-week joins.
2. Daily Keboola dimension table: query_mode: local. Synced once a day by the scheduler; analyst's agnes pull picks it up.
3. Monthly KPI aggregate from a BQ data warehouse: query_mode: materialized + --sync-schedule "0 3 1 * *" (3:00 on the 1st of each month). The server runs your aggregate SQL once a month; analysts get a parquet of the result.
## See also

- `docs/RBAC.md`: granting analysts access to a registered table.
- `config/instance.yaml.example`: the `data_source` config block.
- `agnes catalog --json`: inspect a registered table's mode + size hints.
- `agnes diagnose`: surface `bq_config` IAM issues and other health entries.