## Summary
- Catalog enrichment for `query_mode='remote'` rows: `rows`, `size_bytes`, `partition_by`, `clustered_by` per table (BQ + Keboola providers).
- `/api/v2/schema/{id}` cache miss: 2 BQ jobs → 1 (-50%) via shared `fetch_bq_columns_full`.
- All four catalog/schema/sample/metadata caches flush on registry change; single-row re-warm scheduled.
- Automatic cache warmup at server startup (bounded concurrency, opt-out via `AGNES_SKIP_CACHE_WARMUP=1`).
- SSE-driven freshness toolbar on `/admin/tables` with progress bar, log, and per-row badge.
- New admin doc `docs/admin/query-modes.md` — single source of truth on `local` / `remote` / `materialized` choice.
Closes #155.
Closes #156.
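The startup warmup in the summary (bounded concurrency, env-var opt-out, per-row failure isolation) could be sketched roughly as below. This is a minimal illustration, not the code in this PR: `warm_all`, `warm_one`, `max_concurrency`, and the `"failed"` / `"skipped"` status strings are hypothetical names. Only `AGNES_SKIP_CACHE_WARMUP` and the `fresh` badge state come from the description.

```python
import asyncio
import os
from typing import Awaitable, Callable, Dict, Iterable


async def warm_all(
    table_ids: Iterable[str],
    warm_one: Callable[[str], Awaitable[None]],  # hypothetical per-row warmer
    max_concurrency: int = 4,                    # assumed default bound
) -> Dict[str, str]:
    """Warm every table with at most `max_concurrency` rows in flight."""
    ids = list(table_ids)

    # Opt-out switch named in the PR summary.
    if os.environ.get("AGNES_SKIP_CACHE_WARMUP") == "1":
        return {tid: "skipped" for tid in ids}

    sem = asyncio.Semaphore(max_concurrency)
    status: Dict[str, str] = {}

    async def _guarded(tid: str) -> None:
        async with sem:  # bounds the number of concurrent warm calls
            try:
                await warm_one(tid)
                status[tid] = "fresh"
            except Exception:
                # One failing row must not abort the rest of the warmup.
                status[tid] = "failed"

    await asyncio.gather(*(_guarded(tid) for tid in ids))
    return status
```

In a FastAPI `lifespan`, a coroutine like this would be wrapped in `asyncio.create_task(...)` before `yield` so that startup does not wait for it.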
## Test plan
- [x] 65+ targeted tests pass across 11 new test modules + 3 modified ones.
- [x] No DB migration; no wire-break; `MIN_COMPAT_CLI_VERSION` unchanged.
- [ ] Reviewer: register a remote BQ table via `/admin/tables`, observe the toolbar populates within ~2 s and the per-row badge transitions warming → fresh.
- [ ] Reviewer: trigger `Re-warm all`, verify SSE log scrolls and `cacheWarmupBar` progresses.
- [ ] Reviewer: edit a registered row's bucket, verify `agnes schema <id>` returns updated columns immediately (no 1-hour staleness).
- [ ] Reviewer: confirm `agnes admin register-table --query-mode remote` prints the new IAM-smoke-check hint.
## Notable design decisions
- BigQuery `INFORMATION_SCHEMA.TABLE_STORAGE` is queried region-scoped, which is the only valid scope for size+rows (verified live 2026-05-07; a dataset-scoped variant doesn't exist). The region is resolved from `instance.yaml.data_source.bigquery.location`, then from `bq.client().get_dataset(...)`; if neither yields a region, the lookup falls back to legacy `__TABLES__`.
- VIEW handling: TABLE_STORAGE returns no rows for views, so the lookup falls through to `__TABLES__` (also empty) and yields `TableMetadata(rows=None, size_bytes=None, partition_by=..., clustered_by=...)`. A null size signals the analyst Claude to apply the existing CLAUDE.md guidance.
- `size_bytes` is `active_logical_bytes + long_term_logical_bytes` — full BQ scan reads both; reporting only active undercounts aged partitioned tables.
- Source-agnostic provider seam: per-source `connectors/<source>/metadata.py:fetch(MetadataRequest)`; dispatcher in `app/api/v2_catalog.py:_metadata_provider_for` lazily imports per source_type so a Keboola-only deployment doesn't pay the BQ-extension import cost.
- Warmup non-blocking: FastAPI `lifespan` schedules `asyncio.create_task(_warm_catalog_caches_bg)` before `yield`. Per-row failures isolated.
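The `size_bytes` and VIEW rules above can be illustrated with a small sketch. It assumes the storage lookup yields either `None` (views, or no match) or a mapping with the `TABLE_STORAGE` columns `total_rows`, `active_logical_bytes`, and `long_term_logical_bytes`, and that column rows have the dict shape asserted in the test module. `build_metadata` and the field types on `TableMetadata` are hypothetical; only the field names come from this description.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence, Tuple


@dataclass(frozen=True)
class TableMetadata:
    # Field names are from the PR text; the types are assumptions.
    rows: Optional[int]
    size_bytes: Optional[int]
    partition_by: Optional[str]
    clustered_by: Tuple[str, ...]


def build_metadata(
    storage_row: Optional[Mapping[str, int]],
    columns: Sequence[Mapping[str, object]],
) -> TableMetadata:
    """Combine one storage row (or None) with per-column info."""
    # First column flagged as the partitioning column, if any.
    partition_by = next(
        (c["name"] for c in columns if c["is_partitioning_column"]), None
    )
    # Clustering columns in their declared ordinal order.
    clustered = sorted(
        (c for c in columns if c["clustering_ordinal_position"] is not None),
        key=lambda c: c["clustering_ordinal_position"],
    )
    clustered_by = tuple(c["name"] for c in clustered)

    if storage_row is None:
        # Views: no storage row, so size and row count stay unknown.
        return TableMetadata(None, None, partition_by, clustered_by)

    # A full scan reads both active and long-term storage, so report the sum.
    size = (
        storage_row["active_logical_bytes"]
        + storage_row["long_term_logical_bytes"]
    )
    return TableMetadata(storage_row["total_rows"], size, partition_by, clustered_by)
```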
## Out of scope
- Profile / column histograms / dimension cardinality for remote tables (separate issue).
- Onboarding nudge ("you have 0 remote tables, consider registering some BQ ones") — separate UX call.
- Provider plug-in registration via entry-points (the dispatch table is a hardcoded if-tree today; one line per future source).
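For reference, a hardcoded if-tree with lazy per-source imports might look like the sketch below. It is an illustration under assumptions, not the body of `_metadata_provider_for`: the `source_type` strings, the `connectors.keboola.metadata` module path, and the injectable `import_module` parameter (added here so the sketch can be exercised without the real connectors) are all hypothetical.

```python
import importlib
from typing import Any, Callable, Optional


def metadata_provider_for(
    source_type: str,
    import_module: Callable[[str], Any] = importlib.import_module,
) -> Optional[Callable[..., Any]]:
    """Return the source's `fetch` callable, importing its module on demand."""
    # Each branch imports only its own connector, so a deployment that never
    # sees "bigquery" rows never pays for the BigQuery import.
    if source_type == "bigquery":
        return import_module("connectors.bigquery.metadata").fetch
    if source_type == "keboola":
        return import_module("connectors.keboola.metadata").fetch
    # Unknown source: no provider, hence no enrichment for that row.
    return None
```

Adding a future source would mean adding one more branch, which matches the "one line per future source" cost noted above.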
## Release
Bumps `pyproject.toml` 0.46.1 → 0.47.0 (main shipped 0.46.0 + 0.46.1 during this PR — see commit `d98976ec`). New CHANGELOG section under `## [0.47.0] — 2026-05-07`.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
111 lines
3.7 KiB
Python
"""Asserts that /api/v2/schema/{id} for a BQ row makes exactly ONE
bigquery_query() call on cache miss, down from two pre-#155.

Counts via a side-effect tracker on the mocked DuckDB session.
"""

from unittest.mock import MagicMock, patch
import pytest


def _mock_duckdb_session_returning(rows):
    """Build a context-manager mock that returns `rows` on .fetchall().

    Exposes `call_count` on the returned mock for assertion.
    """
    session = MagicMock()
    session.execute.return_value.fetchall.return_value = rows
    cm = MagicMock()
    cm.__enter__.return_value = session
    cm.__exit__.return_value = False
    return cm, session


def test_fetch_bq_columns_full_is_single_query():
    """The new shared helper makes exactly ONE call to bigquery_query."""
    from connectors.bigquery.access import fetch_bq_columns_full

    bq = MagicMock()
    bq.projects.data = "data-proj"
    bq.projects.billing = "billing-proj"
    cm, session = _mock_duckdb_session_returning([
        ("event_date", "DATE", "NO", "YES", None),
        ("country", "STRING", "YES", "NO", 1),
        ("user_id", "STRING", "NO", "NO", None),
    ])
    bq.duckdb_session.return_value = cm

    rows = fetch_bq_columns_full(bq, "dwh_base", "events")
    assert len(rows) == 3
    # Exactly one bigquery_query() call; no second round-trip.
    assert session.execute.call_count == 1
    first_call = session.execute.call_args_list[0]
    # Outer wrapper SQL is bigquery_query(?, ?, ?)
    assert "bigquery_query" in first_call.args[0]
    # Inner BQ SQL pulls all five columns we need at once.
    inner_sql = first_call.args[1][1]
    assert "column_name" in inner_sql
    assert "data_type" in inner_sql
    assert "is_nullable" in inner_sql
    assert "is_partitioning_column" in inner_sql
    assert "clustering_ordinal_position" in inner_sql


def test_fetch_bq_columns_full_returns_dicts():
    """Each row is a dict with the documented keys."""
    from connectors.bigquery.access import fetch_bq_columns_full

    bq = MagicMock()
    bq.projects.data = "data-proj"
    bq.projects.billing = "billing-proj"
    cm, _ = _mock_duckdb_session_returning([
        ("event_date", "DATE", "NO", "YES", None),
    ])
    bq.duckdb_session.return_value = cm

    rows = fetch_bq_columns_full(bq, "dwh_base", "events")
    assert rows == [{
        "name": "event_date",
        "type": "DATE",
        "nullable": False,
        "is_partitioning_column": True,
        "clustering_ordinal_position": None,
    }]


def test_fetch_bq_columns_full_returns_none_when_unconfigured():
    """Sentinel BqAccess (data project empty) → return None, no query."""
    from connectors.bigquery.access import fetch_bq_columns_full

    bq = MagicMock()
    bq.projects.data = ""  # sentinel
    rows = fetch_bq_columns_full(bq, "dwh_base", "events")
    assert rows is None
    bq.duckdb_session.assert_not_called()


def test_fetch_bq_columns_full_returns_none_on_unsafe_identifier():
    """Refuses to interpolate identifiers that fail validation."""
    from connectors.bigquery.access import fetch_bq_columns_full

    bq = MagicMock()
    bq.projects.data = "data-proj"
    rows = fetch_bq_columns_full(bq, "evil`; DROP--", "events")
    assert rows is None
    bq.duckdb_session.assert_not_called()


def test_fetch_bq_columns_full_returns_none_on_query_error():
    """BQ failure → log + None; never raises."""
    from connectors.bigquery.access import fetch_bq_columns_full

    bq = MagicMock()
    bq.projects.data = "data-proj"
    bq.projects.billing = "billing-proj"
    cm = MagicMock()
    cm.__enter__.return_value.execute.side_effect = RuntimeError("BQ down")
    cm.__exit__.return_value = False
    bq.duckdb_session.return_value = cm

    rows = fetch_bq_columns_full(bq, "dwh_base", "events")
    assert rows is None
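To make the contract these tests pin down easier to follow, here is a hypothetical helper that would satisfy the same behaviours: one query, dict rows, `None` when unconfigured, on an unsafe identifier, or on a query error. It is a sketch only and not the real `fetch_bq_columns_full`. The function name `fetch_columns_sketch`, the identifier regex, the two-placeholder wrapper SQL, and the exact inner SQL text are assumptions.

```python
import logging
import re
from typing import Any, Dict, List, Optional

logger = logging.getLogger(__name__)

# Assumed validation rule; the real helper's rule is not shown in the tests.
_SAFE_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def fetch_columns_sketch(
    bq: Any, dataset: str, table: str
) -> Optional[List[Dict[str, Any]]]:
    """Fetch column info for one table in a single BigQuery round-trip."""
    if not bq.projects.data:
        return None  # sentinel config: nothing to query
    if not (_SAFE_IDENT.match(dataset) and _SAFE_IDENT.match(table)):
        return None  # never interpolate unvalidated identifiers

    # All five fields come from INFORMATION_SCHEMA.COLUMNS in one statement.
    # The project id is assumed to come from trusted configuration.
    inner_sql = (
        "SELECT column_name, data_type, is_nullable, "
        "is_partitioning_column, clustering_ordinal_position "
        f"FROM `{bq.projects.data}.{dataset}.INFORMATION_SCHEMA.COLUMNS` "
        f"WHERE table_name = '{table}' ORDER BY ordinal_position"
    )
    try:
        with bq.duckdb_session() as session:
            raw = session.execute(
                "SELECT * FROM bigquery_query(?, ?)",  # assumed wrapper shape
                [bq.projects.billing, inner_sql],
            ).fetchall()
    except Exception:
        logger.exception("column fetch failed for %s.%s", dataset, table)
        return None

    # BigQuery reports these flags as "YES" / "NO" strings.
    return [
        {
            "name": name,
            "type": dtype,
            "nullable": nullable == "YES",
            "is_partitioning_column": partitioning == "YES",
            "clustering_ordinal_position": clustering,
        }
        for name, dtype, nullable, partitioning, clustering in raw
    ]
```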