agnes-the-ai-analyst/tests/test_v2_catalog_invalidation.py
ZdenekSrotyr aa5921da67
release: 0.47.0 — source-agnostic catalog metadata + cache discipline (#223)
## Summary

- Catalog enrichment for `query_mode='remote'` rows: `rows`, `size_bytes`, `partition_by`, `clustered_by` per table (BQ + Keboola providers).
- `/api/v2/schema/{id}` cache miss: 2 BQ jobs → 1 (-50%) via shared `fetch_bq_columns_full`.
- All four catalog/schema/sample/metadata caches flush on registry change; single-row re-warm scheduled.
- Automatic cache warmup at server startup (bounded concurrency, opt-out via `AGNES_SKIP_CACHE_WARMUP=1`).
- SSE-driven freshness toolbar on `/admin/tables` with progress bar, log, and per-row badge.
- New admin doc `docs/admin/query-modes.md` — single source of truth on `local` / `remote` / `materialized` choice.
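
The unified flush from the third bullet can be sketched roughly as below. This is an illustration only, not the code in `app/api/v2_catalog.py`: the `TTLCache` class, its `delete`/`clear` methods, the TTL values, and the choice to drop metadata and schema per key are assumptions. Only the `set`/`get` calls, the `"all"` and `"orders|10"` key shapes, and the name `invalidate_for_table` are taken from the test module.

```python
import time
from typing import Any, Optional


class TTLCache:
    """Hypothetical in-process TTL cache (not the project's real class)."""

    def __init__(self, ttl_seconds: float) -> None:
        self._ttl = ttl_seconds
        self._items: dict[str, tuple[float, Any]] = {}

    def set(self, key: str, value: Any) -> None:
        self._items[key] = (time.monotonic() + self._ttl, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            # Expired entries are dropped lazily on read.
            del self._items[key]
            return None
        return value

    def delete(self, key: str) -> None:
        self._items.pop(key, None)

    def clear(self) -> None:
        self._items.clear()


# Assumed TTLs: 5 minutes for the catalog listing, 1 hour for the rest.
table_rows_cache = TTLCache(300)
metadata_cache = TTLCache(3600)
schema_cache = TTLCache(3600)
sample_cache = TTLCache(3600)


def invalidate_for_table(table_id: str) -> None:
    """Flush every cache that could hold stale data for table_id."""
    table_rows_cache.clear()         # the listing lives under the single key "all"
    metadata_cache.delete(table_id)
    schema_cache.delete(table_id)
    sample_cache.clear()             # keys look like "orders|10"; no prefix delete
```

Clearing the sample cache whole trades a few extra cache misses for not having to track which `table|limit` keys exist.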

Closes #155.
Closes #156.

## Test plan

- [x] 65+ targeted tests pass across 11 new test modules + 3 modified ones.
- [x] No DB migration; no wire-break; `MIN_COMPAT_CLI_VERSION` unchanged.
- [ ] Reviewer: register a remote BQ table via `/admin/tables`, observe the toolbar populates within ~2 s and the per-row badge transitions warming → fresh.
- [ ] Reviewer: trigger `Re-warm all`, verify SSE log scrolls and `cacheWarmupBar` progresses.
- [ ] Reviewer: edit a registered row's bucket, verify `agnes schema <id>` returns updated columns immediately (no 1-hour staleness).
- [ ] Reviewer: confirm `agnes admin register-table --query-mode remote` prints the new IAM-smoke-check hint.

## Notable design decisions

- BigQuery `INFORMATION_SCHEMA.TABLE_STORAGE` can only be queried region-qualified for size+rows (verified live 2026-05-07; a dataset-scoped variant doesn't exist). Region is resolved from `instance.yaml.data_source.bigquery.location`, then `bq.client().get_dataset(...)`, with legacy `__TABLES__` as the last fallback.
- VIEW handling: `TABLE_STORAGE` returns no rows for views, so the lookup falls through to `__TABLES__` (also empty) and yields `TableMetadata(rows=None, size_bytes=None, partition_by=..., clustered_by=...)`. The null size signals analyst Claude to apply the existing CLAUDE.md guidance.
- `size_bytes` is `active_logical_bytes + long_term_logical_bytes` — full BQ scan reads both; reporting only active undercounts aged partitioned tables.
- Source-agnostic provider seam: per-source `connectors/<source>/metadata.py:fetch(MetadataRequest)`; dispatcher in `app/api/v2_catalog.py:_metadata_provider_for` lazily imports per source_type so a Keboola-only deployment doesn't pay the BQ-extension import cost.
- Warmup non-blocking: FastAPI `lifespan` schedules `asyncio.create_task(_warm_catalog_caches_bg)` before `yield`. Per-row failures isolated.
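
The non-blocking, bounded warmup in the last bullet might look something like the sketch below. It uses only the standard library, so the `lifespan` here is a plain async context manager with the signature FastAPI expects rather than a wired-up app. The names `warm_catalog_caches`, `make_lifespan`, `warm_one`, the limit of 4, and cancelling the task at shutdown are all assumptions; only `AGNES_SKIP_CACHE_WARMUP=1`, `asyncio.create_task` before `yield`, and per-row failure isolation come from this description.

```python
import asyncio
import os
from contextlib import asynccontextmanager
from typing import Awaitable, Callable, Iterable

MAX_CONCURRENT_WARMUPS = 4  # assumed bound; the real value is not stated here


async def warm_catalog_caches(
    table_ids: Iterable[str],
    warm_one: Callable[[str], Awaitable[None]],
    limit: int = MAX_CONCURRENT_WARMUPS,
) -> dict[str, str]:
    """Warm each table with at most `limit` in flight; return per-table status."""
    semaphore = asyncio.Semaphore(limit)
    status: dict[str, str] = {}

    async def guarded(table_id: str) -> None:
        async with semaphore:
            try:
                await warm_one(table_id)
                status[table_id] = "fresh"
            except Exception as exc:  # one failing row must not stop the rest
                status[table_id] = f"error: {exc}"

    await asyncio.gather(*(guarded(t) for t in table_ids))
    return status


def make_lifespan(table_ids, warm_one):
    """Build a FastAPI-style lifespan that starts warmup without awaiting it."""

    @asynccontextmanager
    async def lifespan(app):
        task = None
        if os.environ.get("AGNES_SKIP_CACHE_WARMUP") != "1":
            # Scheduled, not awaited: startup does not wait for the warmup.
            task = asyncio.create_task(warm_catalog_caches(table_ids, warm_one))
        yield  # the server handles requests here while warmup runs
        if task is not None and not task.done():
            task.cancel()  # assumed shutdown behaviour

    return lifespan
```

Under these assumptions the result would be passed as `FastAPI(lifespan=make_lifespan(ids, warm_one))`; the semaphore is what keeps a large registry from firing one remote job per table at once.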

## Out of scope

- Profile / column histograms / dimension cardinality for remote tables (separate issue).
- Onboarding nudge ("you have 0 remote tables, consider registering some BQ ones") — separate UX call.
- Provider plug-in registration via entry-points (the dispatch table is a hardcoded if-tree today; one line per future source).
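
For orientation, a hardcoded if-tree with lazy per-source imports, as described for `_metadata_provider_for`, could be sketched like this. The function below is a stand-in, not the real dispatcher: the `"bigquery"` source-type string, the concrete module paths filled into `connectors/<source>/metadata.py`, and returning `None` for unknown sources are assumptions.

```python
import importlib
from typing import Any, Callable, Optional


def metadata_provider_for(source_type: str) -> Optional[Callable[[Any], Any]]:
    """Return the fetch() of the matching connector, importing it on demand."""
    # Hardcoded if-tree: one branch per source. Module paths are assumed.
    if source_type == "bigquery":
        module_name = "connectors.bigquery.metadata"
    elif source_type == "keboola":
        module_name = "connectors.keboola.metadata"
    else:
        return None
    # The import happens only when a row of this source type is looked up,
    # so unused connectors are never loaded.
    module = importlib.import_module(module_name)
    return module.fetch
```

Adding a source in this shape means adding one `elif` branch; an entry-point registry would replace the branches with a lookup, which is the part left out of scope.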

## Release

Bumps `pyproject.toml` 0.46.1 → 0.47.0 (main shipped 0.46.0 + 0.46.1 during this PR — see commit `d98976ec`). New CHANGELOG section under `## [0.47.0] — 2026-05-07`.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-05-07 18:33:55 +02:00


"""Unified cache flush across all four catalog/schema/sample/metadata
caches on registry write."""
from unittest.mock import patch


def test_invalidate_flushes_all_four_caches():
    from app.api import v2_catalog, v2_schema, v2_sample
    from app.api._metadata_models import TableMetadata

    # Pre-populate.
    v2_catalog._table_rows_cache.set("all", ["fake_row"])
    v2_catalog._metadata_cache.set("orders", TableMetadata(rows=10))
    v2_schema._schema_cache.set("orders", {"columns": []})
    v2_sample._sample_cache.set("orders|10", [{"row": 1}])

    v2_catalog.invalidate_for_table("orders")

    assert v2_catalog._table_rows_cache.get("all") is None
    assert v2_catalog._metadata_cache.get("orders") is None
    assert v2_schema._schema_cache.get("orders") is None
    # Sample cache is cleared whole (we don't have prefix-invalidation).
    assert v2_sample._sample_cache.get("orders|10") is None


def test_invalidate_schedules_single_row_rewarm(monkeypatch):
    """After the flush, a background re-warm task is scheduled for the
    same table_id. Assert via patching create_task."""
    import asyncio
    from app.api import v2_catalog

    scheduled = []

    def fake_create_task(coro):
        # Close the coroutine so the test doesn't leak it un-awaited.
        coro.close()
        scheduled.append(coro)
        return None

    # Simulate a running event loop so the create_task branch is reached.
    monkeypatch.setattr(asyncio, "get_running_loop", lambda: object())
    monkeypatch.setattr(asyncio, "create_task", fake_create_task)

    v2_catalog.invalidate_for_table("orders")

    assert len(scheduled) == 1


def test_register_table_invalidates(seeded_app):
    """Registering a table flushes the rows cache so the next catalog
    request reflects it without waiting for the 5-min TTL."""
    from app.api import v2_catalog

    v2_catalog._table_rows_cache.set("all", [])
    client = seeded_app["client"]
    token = seeded_app["admin_token"]
    headers = {"Authorization": f"Bearer {token}"}

    client.post("/api/admin/register-table", json={
        "name": "new_t",
        "source_type": "keboola",
        "bucket": "in.c-x",
        "source_table": "t",
        "query_mode": "local",
    }, headers=headers)

    assert v2_catalog._table_rows_cache.get("all") is None


def test_update_table_invalidates(seeded_app):
    """Updating a registered table flushes the rows cache."""
    from app.api import v2_catalog

    client = seeded_app["client"]
    token = seeded_app["admin_token"]
    headers = {"Authorization": f"Bearer {token}"}

    client.post("/api/admin/register-table", json={
        "name": "u_t",
        "source_type": "keboola",
        "bucket": "in.c-x",
        "source_table": "t",
        "query_mode": "local",
    }, headers=headers)
    # Re-populate after registration so only the update can clear it.
    v2_catalog._table_rows_cache.set("all", ["pre-update"])

    client.put("/api/admin/registry/u_t", json={"description": "new"}, headers=headers)

    assert v2_catalog._table_rows_cache.get("all") is None


def test_unregister_table_invalidates(seeded_app):
    """Unregistering a table flushes the rows cache."""
    from app.api import v2_catalog

    client = seeded_app["client"]
    token = seeded_app["admin_token"]
    headers = {"Authorization": f"Bearer {token}"}

    client.post("/api/admin/register-table", json={
        "name": "d_t",
        "source_type": "keboola",
        "bucket": "in.c-x",
        "source_table": "t",
        "query_mode": "local",
    }, headers=headers)
    # Re-populate after registration so only the delete can clear it.
    v2_catalog._table_rows_cache.set("all", ["pre-delete"])

    client.delete("/api/admin/registry/d_t", headers=headers)

    assert v2_catalog._table_rows_cache.get("all") is None