CLAUDE.md rewritten (708 -> ~320 lines): four overlapping release sections collapsed to one, stale v1->v35 schema history dropped (it lives in CHANGELOG), marketplace endpoint internals and verbose process sections moved out or tightened. New focused docs: - docs/RELEASING.md - release process, deploy workflows, CI quirks (RELEASE_TEMPLATE.md folded in as an appendix) - docs/marketplace.md - marketplace ingestion + re-serving internals - docs/README.md - documentation index by audience, linked from README.md and CLAUDE.md Archived under docs/archive/: docs/superpowers/ (52 historical planning artifacts), HACKATHON.md, pd-ps-comments.md, security-audit-2026-04.md, future/NOTIFICATIONS.md. Removed the docs/auto-install.md stub. Fixed dangling links in connectors/jira/README.md and dev_docs/README.md, repointed code/doc references to archived paths.
121 KiB
Source-Agnostic Table Metadata Implementation Plan
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (
- [ ]) syntax for tracking.
Goal: Enrich /api/v2/catalog with cost-relevant per-table metadata (row count, size, partition, cluster) for query_mode='remote' rows via source-agnostic providers; add unified cache invalidation across all four catalog/schema/sample/metadata caches with single-row re-warm; halve BQ jobs on /api/v2/schema cache miss by consolidating two INFORMATION_SCHEMA.COLUMNS queries into one; auto-warm caches at server startup; surface progress via SSE on /admin/tables. Closes #155 + #156.
Architecture: Per-source metadata.py module exposing fetch(MetadataRequest) -> TableMetadata | None; dispatched from app/api/v2_catalog.py:_size_hint_for_row after identifier validation; results cached 15 min keyed by table_id. Cache invalidation owned by v2_catalog.invalidate_for_table flushing four TTLCaches and scheduling a single-row re-warm. Warmup runs as a FastAPI startup background task with bounded concurrency (default 4) and exposes a JSON status endpoint + SSE stream consumed by a new toolbar block on /admin/tables.
Tech Stack: Python 3.11+, FastAPI, DuckDB (with BigQuery extension), pytest, sse-starlette, vanilla JS (no framework) in admin templates.
Source spec: docs/superpowers/specs/2026-05-07-source-agnostic-table-metadata-spec.md (HEAD a6a660bc).
Branch: zs/issue-155-156-source-agnostic-metadata (worktree at /tmp/agnes-metadata).
File map
New:
app/api/_metadata_models.py— dataclassesMetadataRequest,TableMetadataconnectors/bigquery/metadata.py— BQ providerconnectors/keboola/metadata.py— Keboola providerapp/api/cache_warmup.py— warmup state + background task + endpointsdocs/admin/query-modes.md— admin doc pagetests/test_connectors_bigquery_metadata.pytests/test_connectors_keboola_metadata.pytests/test_v2_catalog_remote_metadata.pytests/test_v2_catalog_invalidation.pytests/test_cache_warmup.pytests/test_admin_tables_warmup_ui.pytests/test_v2_schema_columns_consolidation.py
Modified:
app/api/v2_catalog.py— provider dispatch + 15-min metadata cache + new response fields +invalidate_for_tableapp/api/v2_schema.py— split RBAC/cache from BQ work; rewire to sharedfetch_bq_columns_fullconnectors/bigquery/access.py— appendfetch_bq_columns_fullhelperconnectors/keboola/storage_api.py— appendget_table_infothin wrapperapp/api/admin.py— wireinvalidate_for_tableinto register/update/unregisterapp/main.py— registerwarm_catalog_cachesstartup eventcli/commands/admin.py— third post-register hint forquery_mode=remoteapp/web/templates/admin_tables.html— cache toolbar<section>, per-rowcol-statusbadge, EventSource JS,?doc link on query_mode fieldpyproject.toml— version bump to 0.46.0; addsse-starletteto dependencies if not presentCHANGELOG.md—## [0.46.0]section
Task 1: Shared dataclass models
Files:
-
Create:
app/api/_metadata_models.py -
Test:
tests/test_metadata_models.py -
Step 1.1: Write failing test
Create tests/test_metadata_models.py:
"""Sanity tests for the shared metadata dataclasses."""
from app.api._metadata_models import MetadataRequest, TableMetadata
def test_metadata_request_constructs():
req = MetadataRequest(
table_id="orders", bucket="dwh_base", source_table="orders_2024",
)
assert req.table_id == "orders"
assert req.bucket == "dwh_base"
assert req.source_table == "orders_2024"
def test_metadata_request_is_frozen():
"""Frozen so cache keys derived from a request are stable."""
req = MetadataRequest(table_id="x", bucket="b", source_table="t")
import dataclasses
try:
req.bucket = "other"
except dataclasses.FrozenInstanceError:
return
raise AssertionError("MetadataRequest should be frozen")
def test_table_metadata_all_fields_optional():
tm = TableMetadata()
assert tm.rows is None
assert tm.size_bytes is None
assert tm.partition_by is None
assert tm.clustered_by is None
def test_table_metadata_partial_population():
tm = TableMetadata(rows=100, size_bytes=2048)
assert tm.rows == 100
assert tm.size_bytes == 2048
assert tm.partition_by is None
assert tm.clustered_by is None
- Step 1.2: Run test to verify it fails
cd /tmp/agnes-metadata && python -m pytest tests/test_metadata_models.py -v
Expected: FAIL with ModuleNotFoundError: No module named 'app.api._metadata_models'.
- Step 1.3: Implement the module
Create app/api/_metadata_models.py:
"""Shared data shapes for source-agnostic table-metadata providers.
Lives under `app/api/` because the primary consumer is
`app/api/v2_catalog.py`. Connector-side providers in `connectors/<source>/`
import upward into this module — the inverse layering would force
`v2_catalog.py` to depend on `connectors/__init__.py`, which is the
wrong direction.
"""
from __future__ import annotations
from dataclasses import dataclass
@dataclass(frozen=True)
class MetadataRequest:
"""Narrow input passed to a metadata provider's `fetch()`.
`bucket` and `source_table` are pre-validated by the dispatcher
(`validate_quoted_identifier`) before construction, so the provider
can interpolate them into SQL/URL paths without re-checking. Frozen
so the (provider, request)-keyed cache lookup is stable.
"""
table_id: str
bucket: str
source_table: str
@dataclass
class TableMetadata:
"""Source-agnostic metadata bundle. Every field optional — providers
fill what they can cheaply get; callers tolerate `None`. Adding a new
field here is a non-breaking change: existing CLI consumers don't
even render `rough_size_hint` (verified `grep -rn rough_size_hint cli/`
is empty), let alone the new fields.
"""
rows: int | None = None
size_bytes: int | None = None
partition_by: str | None = None
clustered_by: list[str] | None = None
- Step 1.4: Run test to verify it passes
cd /tmp/agnes-metadata && python -m pytest tests/test_metadata_models.py -v
Expected: 4 passed.
- Step 1.5: Commit
cd /tmp/agnes-metadata
git add app/api/_metadata_models.py tests/test_metadata_models.py
git commit -m "feat(metadata): MetadataRequest + TableMetadata dataclasses"
Task 2: KeboolaStorageClient.get_table_info wrapper
Files:
-
Modify:
connectors/keboola/storage_api.py(append method) -
Test:
tests/test_keboola_storage_api.py(append test) -
Step 2.1: Write failing test
Append to tests/test_keboola_storage_api.py:
class TestGetTableInfo:
"""`get_table_info` is a thin wrapper around the existing _get path
so the metadata provider doesn't have to bleed `_get` out of the
module (#155)."""
def test_calls_storage_api_with_table_id(self, monkeypatch):
from connectors.keboola.storage_api import KeboolaStorageClient
captured = {}
def fake_get(self, path, **kwargs):
captured["path"] = path
return {"rowsCount": 100, "dataSizeBytes": 4096}
monkeypatch.setattr(KeboolaStorageClient, "_get", fake_get)
client = KeboolaStorageClient(
url="https://connection.keboola.com", token="tok"
)
info = client.get_table_info("in.c-orders.events")
assert captured["path"] == "/tables/in.c-orders.events"
assert info["rowsCount"] == 100
assert info["dataSizeBytes"] == 4096
def test_propagates_storage_api_error(self, monkeypatch):
from connectors.keboola.storage_api import (
KeboolaStorageClient, StorageApiError,
)
def fake_get(self, path, **kwargs):
raise StorageApiError("404 not found", status=404, body={})
monkeypatch.setattr(KeboolaStorageClient, "_get", fake_get)
client = KeboolaStorageClient(url="https://x", token="tok")
import pytest
with pytest.raises(StorageApiError):
client.get_table_info("missing.table")
- Step 2.2: Run test to verify it fails
cd /tmp/agnes-metadata && python -m pytest tests/test_keboola_storage_api.py::TestGetTableInfo -v
Expected: FAIL with AttributeError: 'KeboolaStorageClient' object has no attribute 'get_table_info'.
- Step 2.3: Implement the wrapper
Find the line in connectors/keboola/storage_api.py immediately after the export_table_async method (around line 286) and append the new method to the KeboolaStorageClient class:
def get_table_info(self, table_id: str) -> dict:
"""GET /v2/storage/tables/{table_id} — full table metadata.
Storage API guarantees `rowsCount` + `dataSizeBytes` on success.
Other fields (`columns`, `primaryKey`, ...) are present but not
consumed by the metadata provider today. Raises `StorageApiError`
on 4xx/5xx — caller decides whether to soften to `None`.
"""
return self._get(f"/tables/{table_id}")
- Step 2.4: Run test to verify it passes
cd /tmp/agnes-metadata && python -m pytest tests/test_keboola_storage_api.py::TestGetTableInfo -v
Expected: 2 passed.
- Step 2.5: Commit
cd /tmp/agnes-metadata
git add connectors/keboola/storage_api.py tests/test_keboola_storage_api.py
git commit -m "feat(keboola): KeboolaStorageClient.get_table_info thin wrapper"
Task 3: Combined fetch_bq_columns_full helper + v2_schema rewire
This task halves the BQ job count on /api/v2/schema/{id} cache miss (2 jobs → 1) by collapsing the duplicate INFORMATION_SCHEMA.COLUMNS queries.
Files:
-
Modify:
connectors/bigquery/access.py(append helper) -
Modify:
app/api/v2_schema.py(rewire_fetch_bq_schemaand_fetch_bq_table_options) -
Test:
tests/test_v2_schema_columns_consolidation.py(new) -
Step 3.1: Write failing test for the consolidation
Create tests/test_v2_schema_columns_consolidation.py:
"""Asserts that /api/v2/schema/{id} for a BQ row makes exactly ONE
bigquery_query() call on cache miss, down from two pre-#155.
Counts via a side-effect tracker on the mocked DuckDB session.
"""
from unittest.mock import MagicMock, patch
import pytest
def _mock_duckdb_session_returning(rows):
"""Build a context-manager mock that returns `rows` on .fetchall().
Exposes `call_count` on the returned mock for assertion.
"""
session = MagicMock()
session.execute.return_value.fetchall.return_value = rows
cm = MagicMock()
cm.__enter__.return_value = session
cm.__exit__.return_value = False
return cm, session
def test_fetch_bq_columns_full_is_single_query():
"""The new shared helper makes exactly ONE call to bigquery_query."""
from connectors.bigquery.access import fetch_bq_columns_full
bq = MagicMock()
bq.projects.data = "data-proj"
bq.projects.billing = "billing-proj"
cm, session = _mock_duckdb_session_returning([
("event_date", "DATE", "NO", "YES", None),
("country", "STRING", "YES", "NO", 1),
("user_id", "STRING", "NO", "NO", None),
])
bq.duckdb_session.return_value = cm
rows = fetch_bq_columns_full(bq, "dwh_base", "events")
assert len(rows) == 3
# Exactly one bigquery_query() call — no second round-trip.
assert session.execute.call_count == 1
first_call = session.execute.call_args_list[0]
# Outer wrapper SQL is bigquery_query(?, ?, ?)
assert "bigquery_query" in first_call.args[0]
# Inner BQ SQL pulls all five columns we need at once.
inner_sql = first_call.args[1][1]
assert "column_name" in inner_sql
assert "data_type" in inner_sql
assert "is_nullable" in inner_sql
assert "is_partitioning_column" in inner_sql
assert "clustering_ordinal_position" in inner_sql
def test_fetch_bq_columns_full_returns_dicts():
"""Each row is a dict with the documented keys."""
from connectors.bigquery.access import fetch_bq_columns_full
bq = MagicMock()
bq.projects.data = "data-proj"
bq.projects.billing = "billing-proj"
cm, _ = _mock_duckdb_session_returning([
("event_date", "DATE", "NO", "YES", None),
])
bq.duckdb_session.return_value = cm
rows = fetch_bq_columns_full(bq, "dwh_base", "events")
assert rows == [{
"name": "event_date",
"type": "DATE",
"nullable": False,
"is_partitioning_column": True,
"clustering_ordinal_position": None,
}]
def test_fetch_bq_columns_full_returns_none_when_unconfigured():
"""Sentinel BqAccess (data project empty) → return None, no query."""
from connectors.bigquery.access import fetch_bq_columns_full
bq = MagicMock()
bq.projects.data = "" # sentinel
rows = fetch_bq_columns_full(bq, "dwh_base", "events")
assert rows is None
bq.duckdb_session.assert_not_called()
def test_fetch_bq_columns_full_returns_none_on_unsafe_identifier():
"""Refuses to interpolate identifiers that fail validation."""
from connectors.bigquery.access import fetch_bq_columns_full
bq = MagicMock()
bq.projects.data = "data-proj"
rows = fetch_bq_columns_full(bq, "evil`; DROP--", "events")
assert rows is None
bq.duckdb_session.assert_not_called()
def test_fetch_bq_columns_full_returns_none_on_query_error():
"""BQ failure → log + None; never raises."""
from connectors.bigquery.access import fetch_bq_columns_full
bq = MagicMock()
bq.projects.data = "data-proj"
bq.projects.billing = "billing-proj"
cm = MagicMock()
cm.__enter__.return_value.execute.side_effect = RuntimeError("BQ down")
cm.__exit__.return_value = False
bq.duckdb_session.return_value = cm
rows = fetch_bq_columns_full(bq, "dwh_base", "events")
assert rows is None
- Step 3.2: Run test to verify it fails
cd /tmp/agnes-metadata && python -m pytest tests/test_v2_schema_columns_consolidation.py -v
Expected: FAIL with ImportError: cannot import name 'fetch_bq_columns_full' from 'connectors.bigquery.access'.
- Step 3.3: Implement
fetch_bq_columns_full
Append to connectors/bigquery/access.py (location: end of file or just before any module-level _BILLING_PROJECT_RE block — pick whichever your editor lands on):
def fetch_bq_columns_full(bq, dataset: str, table: str) -> list[dict] | None:
"""Single round-trip to INFORMATION_SCHEMA.COLUMNS pulling everything
both v2_schema and the metadata provider need.
Returns one dict per column with the keys ``name``, ``type``,
``nullable``, ``is_partitioning_column``, ``clustering_ordinal_position``.
Consumers project the fields they care about.
Best-effort: returns ``None`` on any failure (sentinel-unconfigured,
unsafe identifier, BQ query exception). Does NOT raise. Mirrors the
failure posture of `app/api/v2_schema.py:_fetch_bq_table_options`,
which it replaces.
Replaces two BQ jobs (one for column list + one for partition/cluster)
with one — half the on-demand cost on each `/api/v2/schema/{id}`
cache miss (issue #155 / spec).
"""
from src.identifier_validation import validate_quoted_identifier
if not bq.projects.data:
return None
if not (validate_quoted_identifier(bq.projects.data, "BQ project")
and validate_quoted_identifier(dataset, "BQ dataset")
and validate_quoted_identifier(table, "BQ source_table")):
return None
bq_sql = (
f"SELECT column_name, data_type, is_nullable, "
f" is_partitioning_column, clustering_ordinal_position "
f"FROM `{bq.projects.data}.{dataset}.INFORMATION_SCHEMA.COLUMNS` "
f"WHERE table_name = ? ORDER BY ordinal_position"
)
try:
with bq.duckdb_session() as conn:
rows = conn.execute(
"SELECT * FROM bigquery_query(?, ?, ?)",
[bq.projects.billing, bq_sql, table],
).fetchall()
except Exception as e:
logger.warning(
"BQ COLUMNS fetch failed for %s.%s.%s: %s",
bq.projects.data, dataset, table, e,
)
return None
return [
{
"name": r[0],
"type": r[1],
"nullable": (r[2] or "").upper() == "YES",
"is_partitioning_column": (r[3] or "").upper() == "YES",
"clustering_ordinal_position": r[4],
}
for r in rows
]
If logger isn't already imported at the top of connectors/bigquery/access.py, add import logging; logger = logging.getLogger(__name__) near the top — but check first; the file likely already has it.
- Step 3.4: Run consolidation tests to verify they pass
cd /tmp/agnes-metadata && python -m pytest tests/test_v2_schema_columns_consolidation.py -v
Expected: 5 passed.
- Step 3.5: Rewire v2_schema._fetch_bq_schema
Edit app/api/v2_schema.py — replace the body of _fetch_bq_schema (function starting around line 33). The function currently builds and runs its own INFORMATION_SCHEMA.COLUMNS query; rewire it to call fetch_bq_columns_full and project the column-list shape.
Locate def _fetch_bq_schema(...) and replace its full body with:
def _fetch_bq_schema(bq, dataset: str, table: str) -> list[dict]:
"""Fetch column list via the shared `fetch_bq_columns_full` helper.
Pre-#155 this had its own INFORMATION_SCHEMA.COLUMNS query; consolidating
with `_fetch_bq_table_options` (now also delegating to the same helper)
halves the BQ job count on cache miss. Returns the schema-endpoint
column shape: name / type / nullable / description.
"""
from connectors.bigquery.access import fetch_bq_columns_full, BqAccessError
# Surface "BQ not configured" as the structured 500 BqAccessError
# (with hint), not a misleading empty-list. Mirrors pre-refactor
# behavior — see Devin BUG_0002 in the original docstring.
if not bq.projects.data:
bq.client() # raises BqAccessError(not_configured); endpoint catches it
rows = fetch_bq_columns_full(bq, dataset, table)
if rows is None:
# Identifier validation refused, or BQ raised. The legacy code
# path raised `ValueError("unsafe BQ identifier in registry")` for
# the validation case; the BQ-error case raised via translate_bq_error.
# The new helper conflates both into None; reproduce the legacy
# error-surfacing here so callers get the same exceptions.
from src.identifier_validation import validate_quoted_identifier
if not (validate_quoted_identifier(bq.projects.data, "BQ project")
and validate_quoted_identifier(dataset, "BQ dataset")
and validate_quoted_identifier(table, "BQ source_table")):
raise ValueError("unsafe BQ identifier in registry — refusing to query")
# Otherwise it's a BQ-side error; raise the same shape the legacy
# code raised (HTTP 502 via the endpoint's translate_bq_error).
from connectors.bigquery.access import translate_bq_error
raise translate_bq_error(
RuntimeError("BQ INFORMATION_SCHEMA.COLUMNS query failed"),
bq.projects, bad_request_status="upstream_error",
)
return [
{
"name": r["name"],
"type": r["type"],
"nullable": r["nullable"],
"description": "",
}
for r in rows
]
- Step 3.6: Rewire v2_schema._fetch_bq_table_options
Same file. Replace the full body of _fetch_bq_table_options (around line 85):
def _fetch_bq_table_options(bq, dataset: str, table: str) -> dict:
"""Best-effort fetch of partition/cluster info via the shared
`fetch_bq_columns_full` helper.
Returns ``{}`` on ANY failure (best-effort). Same load-bearing
contract as before: the /schema endpoint must keep returning 200
with empty partition info when this fails.
"""
from connectors.bigquery.access import fetch_bq_columns_full
rows = fetch_bq_columns_full(bq, dataset, table)
if not rows:
return {}
partition_by = next(
(r["name"] for r in rows if r["is_partitioning_column"]),
None,
)
clustered_rows = [r for r in rows if r["clustering_ordinal_position"] is not None]
clustered_rows.sort(key=lambda r: r["clustering_ordinal_position"])
clustered_by = [r["name"] for r in clustered_rows]
return {"partition_by": partition_by, "clustered_by": clustered_by}
- Step 3.7: Run existing schema tests to verify no regression
cd /tmp/agnes-metadata && python -m pytest tests/test_v2_schema.py -v
Expected: all existing tests pass (no behavior change for /api/v2/schema/{id} consumers — same response shape).
If a test fails, the most likely causes are: (a) the test mocks _fetch_bq_schema or _fetch_bq_table_options directly with the old signature — patch the test to mock fetch_bq_columns_full instead; (b) a test relies on the old error-surfacing path — check the docstring of step 3.5 and ensure the rewire matches.
- Step 3.8: Run consolidation tests once more for full green
cd /tmp/agnes-metadata && python -m pytest tests/test_v2_schema_columns_consolidation.py tests/test_v2_schema.py -v
Expected: all green.
- Step 3.9: Commit
cd /tmp/agnes-metadata
git add connectors/bigquery/access.py app/api/v2_schema.py tests/test_v2_schema_columns_consolidation.py
git commit -m "perf(v2_schema): consolidate two INFORMATION_SCHEMA.COLUMNS queries to one
/api/v2/schema/{id} cache-miss path was making two BQ jobs against the
same view with the same predicate — one for column list, one for
partition/cluster. New shared connectors/bigquery/access.py:fetch_bq_columns_full
returns one resultset; both v2_schema._fetch_bq_schema and
_fetch_bq_table_options delegate. Same response shape, half the BQ jobs
on cache miss.
Same helper consumed by the upcoming metadata provider's partition/cluster
path — zero SQL duplication across the two consumers.
Refs: docs/superpowers/specs/2026-05-07-source-agnostic-table-metadata-spec.md"
Task 4: Split build_schema for warmup re-use
The warmup background task needs to populate _schema_cache without an authenticated user. Today build_schema mixes RBAC + cache + BQ work; extract the BQ-and-cache half so warmup can call it directly.
Files:
-
Modify:
app/api/v2_schema.py -
Test:
tests/test_v2_schema.py(new test asserting the split) -
Step 4.1: Write failing test
Append to tests/test_v2_schema.py (or wherever existing v2_schema tests live):
class TestBuildSchemaUncached:
"""The uncached entry point exists for warmup, which has no user
context. RBAC + cache check live in `build_schema`; the BQ work +
cache write live in `build_schema_uncached`."""
def test_uncached_function_exists_and_does_not_take_user(self):
"""Signature: build_schema_uncached(conn, table_id, *, bq)"""
from app.api.v2_schema import build_schema_uncached
import inspect
sig = inspect.signature(build_schema_uncached)
params = list(sig.parameters)
assert "user" not in params, (
"build_schema_uncached should NOT require a user — that's "
"the whole point of the split (warmup has no user)."
)
assert "table_id" in params
assert "bq" in params
def test_build_schema_delegates_to_uncached(self, monkeypatch):
"""build_schema should call build_schema_uncached after RBAC+cache check."""
from app.api import v2_schema
called_with = {}
def fake_uncached(conn, table_id, *, bq):
called_with["table_id"] = table_id
return {"table_id": table_id, "columns": []}
monkeypatch.setattr(v2_schema, "build_schema_uncached", fake_uncached)
# Bypass the cache + RBAC for this assertion — both are tested elsewhere.
monkeypatch.setattr(v2_schema._schema_cache, "get", lambda k, default=None: None)
monkeypatch.setattr(v2_schema, "can_access_table", lambda u, tid, c: True)
# Synthetic registry row.
from unittest.mock import MagicMock
repo_mock = MagicMock()
repo_mock.get.return_value = {"id": "x", "source_type": "bigquery"}
monkeypatch.setattr(v2_schema, "TableRegistryRepository", lambda c: repo_mock)
v2_schema.build_schema(
conn=MagicMock(), user={"id": "u"}, table_id="x", bq=MagicMock(),
)
assert called_with["table_id"] == "x"
- Step 4.2: Run test to verify it fails
cd /tmp/agnes-metadata && python -m pytest tests/test_v2_schema.py::TestBuildSchemaUncached -v
Expected: FAIL with ImportError: cannot import name 'build_schema_uncached'.
- Step 4.3: Perform the split
Open app/api/v2_schema.py. Locate def build_schema(...) (around line 143). Refactor as:
def build_schema(
conn: duckdb.DuckDBPyConnection,
user: dict,
table_id: str,
*,
bq: BqAccess,
) -> dict:
# RBAC + existence check MUST run before cache lookup — otherwise an
# unauthorized user can read cached schema fetched by an authorized one.
repo = TableRegistryRepository(conn)
row = repo.get(table_id)
if not row:
raise NotFound(table_id)
if not can_access_table(user, table_id, conn):
raise PermissionError(table_id)
cache_key = f"{table_id}"
cached = _schema_cache.get(cache_key)
if cached is not None:
return cached
return build_schema_uncached(conn, table_id, bq=bq)
def build_schema_uncached(
conn: duckdb.DuckDBPyConnection,
table_id: str,
*,
bq: BqAccess,
) -> dict:
"""Build the schema response and populate `_schema_cache`. **Skips
RBAC and cache-hit short-circuit** — call only from contexts where
those are either unnecessary (warmup) or already enforced upstream
(`build_schema` above). The BQ work is the same; the entry-point
contract is what differs.
"""
repo = TableRegistryRepository(conn)
row = repo.get(table_id)
if not row:
raise NotFound(table_id)
cache_key = f"{table_id}"
source_type = row.get("source_type") or ""
if source_type == "bigquery":
dataset = row.get("bucket") or ""
source_table = row.get("source_table") or table_id
columns = _fetch_bq_schema(bq, dataset, source_table)
opts = _fetch_bq_table_options(bq, dataset, source_table)
payload = {
"table_id": table_id,
"source_type": source_type,
"sql_flavor": "bigquery",
"columns": columns,
"partition_by": opts.get("partition_by"),
"clustered_by": opts.get("clustered_by"),
}
else:
# Local sources (Keboola/Jira parquet) — schema lives in extract.duckdb.
# Preserve whatever the existing code path did. Copy the original
# else-branch from build_schema verbatim. (Inspect the file before
# this edit to extract the original lines after `if source_type ==
# 'bigquery'`.)
payload = _build_schema_for_local_row(conn, table_id, row)
_schema_cache.set(cache_key, payload)
return payload
Important: the existing build_schema body has logic for non-BQ rows after the if source_type == "bigquery": block. Read that block (lines ~165-210 in current file) and either keep it inline in build_schema_uncached after the if/else, or extract it into a helper _build_schema_for_local_row(conn, table_id, row) as the snippet above suggests. Pick whichever requires the smaller diff against the existing code.
- Step 4.4: Run tests
cd /tmp/agnes-metadata && python -m pytest tests/test_v2_schema.py -v
Expected: all green (existing tests + the new TestBuildSchemaUncached class).
- Step 4.5: Commit
cd /tmp/agnes-metadata
git add app/api/v2_schema.py tests/test_v2_schema.py
git commit -m "refactor(v2_schema): extract build_schema_uncached for warmup re-use
build_schema currently mixes RBAC + cache check + BQ work in one body.
Warmup will need to populate _schema_cache without a user — extract the
BQ-and-cache half so warmup can call it directly. build_schema keeps
the RBAC + cache-hit short-circuit at the top, then delegates.
Behavior unchanged for live /api/v2/schema/{id} consumers.
Refs: docs/superpowers/specs/2026-05-07-source-agnostic-table-metadata-spec.md"
Task 5: Provider scaffold + dispatcher
Files:
-
Create:
connectors/bigquery/metadata.py(stub returning None) -
Create:
connectors/keboola/metadata.py(stub returning None) -
Modify:
app/api/v2_catalog.py(add_metadata_provider_forand_build_metadata_request) -
Test:
tests/test_v2_catalog_dispatcher.py -
Step 5.1: Write failing tests
Create tests/test_v2_catalog_dispatcher.py:
"""Dispatch + identifier-validation gate for the source-agnostic
metadata providers."""
from app.api._metadata_models import MetadataRequest
def test_dispatcher_returns_bq_provider_for_bigquery():
from app.api.v2_catalog import _metadata_provider_for
from connectors.bigquery import metadata as bq_meta
fn = _metadata_provider_for("bigquery")
assert fn is bq_meta.fetch
def test_dispatcher_returns_keboola_provider_for_keboola():
from app.api.v2_catalog import _metadata_provider_for
from connectors.keboola import metadata as kb_meta
fn = _metadata_provider_for("keboola")
assert fn is kb_meta.fetch
def test_dispatcher_returns_none_for_unknown_source():
from app.api.v2_catalog import _metadata_provider_for
assert _metadata_provider_for("jira") is None
assert _metadata_provider_for("") is None
assert _metadata_provider_for("snowflake") is None
def test_build_metadata_request_for_valid_row():
from app.api.v2_catalog import _build_metadata_request
req = _build_metadata_request({
"id": "orders",
"bucket": "dwh_base",
"source_table": "orders_2024",
})
assert isinstance(req, MetadataRequest)
assert req.table_id == "orders"
assert req.bucket == "dwh_base"
assert req.source_table == "orders_2024"
def test_build_metadata_request_rejects_unsafe_bucket():
from app.api.v2_catalog import _build_metadata_request
req = _build_metadata_request({
"id": "x",
"bucket": "evil`; DROP--",
"source_table": "t",
})
assert req is None
def test_build_metadata_request_falls_back_to_id_when_source_table_missing():
"""Some legacy Keboola registry rows have empty source_table; the row id
is the table name in that case (mirrors v2_schema:168 behavior)."""
from app.api.v2_catalog import _build_metadata_request
req = _build_metadata_request({
"id": "orders",
"bucket": "in.c-crm",
"source_table": "",
})
assert req is not None
assert req.source_table == "orders"
def test_stub_providers_return_none():
"""Providers don't have their real bodies yet — stubs return None
so the catalog endpoint stays 200 while we wire the rest."""
from connectors.bigquery import metadata as bq_meta
from connectors.keboola import metadata as kb_meta
req = MetadataRequest(table_id="x", bucket="b", source_table="t")
assert bq_meta.fetch(req) is None
assert kb_meta.fetch(req) is None
- Step 5.2: Run tests to verify they fail
cd /tmp/agnes-metadata && python -m pytest tests/test_v2_catalog_dispatcher.py -v
Expected: multiple FAIL — modules don't exist, dispatcher not defined.
- Step 5.3: Create the provider stubs
Create connectors/bigquery/metadata.py:
"""BigQuery metadata provider — populates `TableMetadata` for a remote
BQ-backed registry row.
Stub: returns None pending the full implementation in Task 7.
"""
from __future__ import annotations
from app.api._metadata_models import MetadataRequest, TableMetadata
def fetch(req: MetadataRequest) -> TableMetadata | None:
return None
Create connectors/keboola/metadata.py:
"""Keboola metadata provider — populates `TableMetadata` for a Keboola
registry row via the Storage API.
Stub: returns None pending the full implementation in Task 6.
"""
from __future__ import annotations
from app.api._metadata_models import MetadataRequest, TableMetadata
def fetch(req: MetadataRequest) -> TableMetadata | None:
return None
- Step 5.4: Add dispatcher + request builder to v2_catalog
Edit app/api/v2_catalog.py. Add at the top, after the existing imports:
from app.api._metadata_models import MetadataRequest, TableMetadata
from src.identifier_validation import validate_quoted_identifier
Then, after the _table_rows_cache definition (around line 26), add:
def _metadata_provider_for(source_type: str):
"""Lazy-import dispatch for source-specific metadata providers.
Lazy because connector modules are heavy (BQ extension, google-cloud
client, etc.) and a Keboola-only deployment shouldn't pay the BQ
import cost. Returns ``None`` for unknown source types — the caller
treats that as "no metadata enrichment available" and falls through.
"""
if source_type == "bigquery":
from connectors.bigquery import metadata as m
return m.fetch
if source_type == "keboola":
from connectors.keboola import metadata as m
return m.fetch
return None
def _build_metadata_request(row: dict) -> MetadataRequest | None:
"""Construct a validated MetadataRequest from a registry row.
Pre-validates the identifiers via `validate_quoted_identifier` before
constructing the request — providers can then interpolate
`req.bucket` / `req.source_table` into SQL/URL paths without
re-checking. Returns ``None`` when validation fails; provider is not
dispatched for that row.
"""
bucket = row.get("bucket") or ""
source_table = row.get("source_table") or row.get("id") or ""
if not bucket or not source_table:
return None
if not (validate_quoted_identifier(bucket, "bucket")
and validate_quoted_identifier(source_table, "source_table")):
return None
return MetadataRequest(
table_id=row["id"], bucket=bucket, source_table=source_table,
)
- Step 5.5: Run tests to verify they pass
cd /tmp/agnes-metadata && python -m pytest tests/test_v2_catalog_dispatcher.py -v
Expected: 7 passed.
- Step 5.6: Commit
cd /tmp/agnes-metadata
git add connectors/bigquery/metadata.py connectors/keboola/metadata.py \
app/api/v2_catalog.py tests/test_v2_catalog_dispatcher.py
git commit -m "feat(catalog): metadata-provider scaffold + dispatcher with identifier validation
Lazy-imported per-source-type providers; dispatcher in v2_catalog.py.
_build_metadata_request validates bucket + source_table before
constructing a MetadataRequest — providers can interpolate them safely.
Providers ship as stubs returning None; real bodies land in Tasks 6-7.
Refs: docs/superpowers/specs/2026-05-07-source-agnostic-table-metadata-spec.md"
Task 6: Keboola provider implementation
Files:
-
Modify:
connectors/keboola/metadata.py(replace stub with real impl) -
Test:
tests/test_connectors_keboola_metadata.py -
Step 6.1: Write failing tests
Create tests/test_connectors_keboola_metadata.py:
"""Keboola metadata provider — happy + unconfigured + api-error paths."""
from unittest.mock import MagicMock, patch
import pytest
from app.api._metadata_models import MetadataRequest, TableMetadata
@pytest.fixture
def req():
return MetadataRequest(
table_id="orders", bucket="in.c-crm", source_table="orders",
)
def test_happy_path_returns_populated_metadata(req, monkeypatch):
from connectors.keboola import metadata
from connectors.keboola.client import KeboolaClient
# KeboolaClient(token=None, url=None) reads env vars; pretend they're set.
monkeypatch.setenv("KEBOOLA_STACK_URL", "https://connection.keboola.com")
monkeypatch.setenv("KEBOOLA_STORAGE_TOKEN", "tok")
with patch("connectors.keboola.metadata.KeboolaStorageClient") as MockStorage:
instance = MockStorage.return_value
instance.get_table_info.return_value = {
"rowsCount": 1234,
"dataSizeBytes": 500_000,
"primaryKey": ["id"],
}
result = metadata.fetch(req)
assert result == TableMetadata(
rows=1234,
size_bytes=500_000,
partition_by=None,
clustered_by=None,
)
def test_returns_none_when_unconfigured(req, monkeypatch):
"""No KEBOOLA_STACK_URL / KEBOOLA_STORAGE_TOKEN env → return None."""
from connectors.keboola import metadata
monkeypatch.delenv("KEBOOLA_STACK_URL", raising=False)
monkeypatch.delenv("KEBOOLA_STORAGE_TOKEN", raising=False)
assert metadata.fetch(req) is None
def test_returns_none_on_storage_api_error(req, monkeypatch):
"""`StorageApiError` from get_table_info → log + return None."""
from connectors.keboola import metadata
from connectors.keboola.storage_api import StorageApiError
monkeypatch.setenv("KEBOOLA_STACK_URL", "https://x.keboola.com")
monkeypatch.setenv("KEBOOLA_STORAGE_TOKEN", "tok")
with patch("connectors.keboola.metadata.KeboolaStorageClient") as MockStorage:
instance = MockStorage.return_value
instance.get_table_info.side_effect = StorageApiError(
"404 not found", status=404, body={},
)
assert metadata.fetch(req) is None
def test_table_id_uses_bucket_dot_source_table(req, monkeypatch):
"""Storage API path is `<bucket>.<source_table>`."""
from connectors.keboola import metadata
monkeypatch.setenv("KEBOOLA_STACK_URL", "https://x.keboola.com")
monkeypatch.setenv("KEBOOLA_STORAGE_TOKEN", "tok")
with patch("connectors.keboola.metadata.KeboolaStorageClient") as MockStorage:
instance = MockStorage.return_value
instance.get_table_info.return_value = {
"rowsCount": 0, "dataSizeBytes": 0,
}
metadata.fetch(req)
instance.get_table_info.assert_called_once_with("in.c-crm.orders")
- Step 6.2: Run tests to verify they fail
cd /tmp/agnes-metadata && python -m pytest tests/test_connectors_keboola_metadata.py -v
Expected: 4 FAIL (stub returns None for happy-path; assertion fails).
- Step 6.3: Replace stub with real implementation
Open connectors/keboola/metadata.py and replace the entire file:
"""Keboola metadata provider — populates `TableMetadata` for a Keboola
registry row via the Storage API.
Reuses `KeboolaClient(token=None, url=None)` to inherit the existing
env-var fallback path (`KEBOOLA_STACK_URL` + `KEBOOLA_STORAGE_TOKEN`),
which is the same hierarchy `connectors/keboola/extractor.py` and
`connectors/keboola/client.py` already use. **Does NOT introduce a third
token-resolution helper.**
"""
from __future__ import annotations
import logging
from app.api._metadata_models import MetadataRequest, TableMetadata
from connectors.keboola.client import KeboolaClient
from connectors.keboola.storage_api import (
KeboolaStorageClient, StorageApiError,
)
logger = logging.getLogger(__name__)
def fetch(req: MetadataRequest) -> TableMetadata | None:
"""Return Keboola Storage API metadata for the given table, or None.
Keboola has no BigQuery-style partition/cluster concept; primaryKey is
conceptually different (uniqueness, not physical layout), so
`partition_by` and `clustered_by` are left None.
"""
# Construct a KeboolaClient with no explicit token/url — the
# constructor reads the standard env vars. Side-effect-free except
# for setting `.token` and `.url` (verified
# connectors/keboola/client.py:90-99). When a future refactor extracts
# `_resolve_keboola_credentials()` as a standalone helper, switch
# here to call that directly.
creds = KeboolaClient(token=None, url=None)
if not creds.url or not creds.token:
return None # not configured — same posture as BQ sentinel
table_id = f"{req.bucket}.{req.source_table}"
try:
storage = KeboolaStorageClient(url=creds.url, token=creds.token)
info = storage.get_table_info(table_id)
except (StorageApiError, ValueError) as e:
logger.warning("Keboola metadata fetch failed for %s: %s", table_id, e)
return None
return TableMetadata(
rows=info.get("rowsCount"),
size_bytes=info.get("dataSizeBytes"),
partition_by=None,
clustered_by=None,
)
- Step 6.4: Run tests to verify they pass
cd /tmp/agnes-metadata && python -m pytest tests/test_connectors_keboola_metadata.py -v
Expected: 4 passed.
- Step 6.5: Commit
cd /tmp/agnes-metadata
git add connectors/keboola/metadata.py tests/test_connectors_keboola_metadata.py
git commit -m "feat(keboola): metadata provider returning rows + size_bytes
Reuses KeboolaClient(token=None, url=None) for env-var credential
fallback — no new token-resolver helper invented. partition_by /
clustered_by stay None (Keboola has no equivalent concept).
Refs: docs/superpowers/specs/2026-05-07-source-agnostic-table-metadata-spec.md"
Task 7: BigQuery provider implementation
Files:
-
Modify:
connectors/bigquery/metadata.py(replace stub with real impl) -
Test:
tests/test_connectors_bigquery_metadata.py -
Step 7.1: Write failing tests
Create tests/test_connectors_bigquery_metadata.py:
"""BigQuery metadata provider — 5 paths from spec test plan:
happy / sentinel / VIEW / region-typo / both-paths-fail."""
from unittest.mock import MagicMock, patch
import pytest
from app.api._metadata_models import MetadataRequest, TableMetadata
@pytest.fixture
def req():
return MetadataRequest(
table_id="orders", bucket="dwh_base", source_table="orders_2024",
)
def _bq_with_session(table_storage_rows=None, columns_rows=None,
table_storage_raises=None, columns_raises=None,
legacy_tables_rows=None, legacy_tables_raises=None,
projects_data="data-proj", projects_billing="billing-proj"):
"""Build a mock `BqAccess` whose `duckdb_session()` returns a context
manager whose `.execute(...)` dispatches based on which inner SQL
string is being run.
The 3 SQL shapes we route on:
- INFORMATION_SCHEMA.TABLE_STORAGE → table_storage_rows / raises
- INFORMATION_SCHEMA.COLUMNS → columns_rows / raises
- __TABLES__ → legacy_tables_rows / raises
"""
bq = MagicMock()
bq.projects.data = projects_data
bq.projects.billing = projects_billing
def execute(outer_sql, params):
inner_sql = params[1] if len(params) > 1 else ""
if "TABLE_STORAGE" in inner_sql:
if table_storage_raises:
raise table_storage_raises
return MagicMock(
fetchone=lambda: table_storage_rows[0] if table_storage_rows else None,
fetchall=lambda: table_storage_rows or [],
)
if "INFORMATION_SCHEMA.COLUMNS" in inner_sql:
if columns_raises:
raise columns_raises
return MagicMock(
fetchall=lambda: columns_rows or [],
)
if "__TABLES__" in inner_sql:
if legacy_tables_raises:
raise legacy_tables_raises
return MagicMock(
fetchone=lambda: legacy_tables_rows[0] if legacy_tables_rows else None,
)
raise AssertionError(f"unexpected SQL: {inner_sql[:80]}")
session = MagicMock()
session.execute.side_effect = execute
cm = MagicMock()
cm.__enter__.return_value = session
cm.__exit__.return_value = False
bq.duckdb_session.return_value = cm
return bq
def test_happy_path_returns_full_metadata(req, monkeypatch):
"""TABLE_STORAGE returns rows+size, COLUMNS returns partition+cluster."""
from connectors.bigquery import metadata
from app.instance_config import get_value
monkeypatch.setattr(
"app.instance_config.get_value",
lambda key, default=None: "us-central1" if "location" in key else default,
raising=False,
)
bq = _bq_with_session(
table_storage_rows=[(1234567, 5_000_000)],
columns_rows=[
("event_date", "DATE", "NO", "YES", None),
("country", "STRING", "YES", "NO", 1),
("user_id", "STRING", "NO", "NO", None),
],
)
with patch("connectors.bigquery.metadata.get_bq_access", return_value=bq):
result = metadata.fetch(req)
assert result == TableMetadata(
rows=1234567,
size_bytes=5_000_000,
partition_by="event_date",
clustered_by=["country"],
)
def test_sentinel_unconfigured_returns_none_no_query(req, monkeypatch):
"""`bq.projects.data == ''` → return None before any query."""
from connectors.bigquery import metadata
bq = _bq_with_session(projects_data="")
with patch("connectors.bigquery.metadata.get_bq_access", return_value=bq):
assert metadata.fetch(req) is None
bq.duckdb_session.assert_not_called()
def test_view_path_returns_metadata_with_null_rows_size(req, monkeypatch):
"""VIEW: TABLE_STORAGE empty + __TABLES__ empty → rows/size = None;
partition + cluster from COLUMNS still surface."""
from connectors.bigquery import metadata
monkeypatch.setattr(
"app.instance_config.get_value",
lambda key, default=None: "us-central1" if "location" in key else default,
raising=False,
)
bq = _bq_with_session(
table_storage_rows=[], # view → no row
legacy_tables_rows=[], # view also absent from __TABLES__
columns_rows=[
("event_date", "DATE", "NO", "YES", None),
],
)
with patch("connectors.bigquery.metadata.get_bq_access", return_value=bq):
result = metadata.fetch(req)
assert result is not None
assert result.rows is None
assert result.size_bytes is None
assert result.partition_by == "event_date"
def test_region_typo_falls_through_to_legacy_tables(req, monkeypatch):
"""TABLE_STORAGE raises (typo'd region) → fall through to __TABLES__."""
from connectors.bigquery import metadata
monkeypatch.setattr(
"app.instance_config.get_value",
lambda key, default=None: "us-central" if "location" in key else default, # typo!
raising=False,
)
bq = _bq_with_session(
table_storage_raises=RuntimeError("Not found: ..."),
legacy_tables_rows=[(100, 2048)],
columns_rows=[("event_date", "DATE", "NO", "YES", None)],
)
with patch("connectors.bigquery.metadata.get_bq_access", return_value=bq):
result = metadata.fetch(req)
assert result is not None
assert result.rows == 100
assert result.size_bytes == 2048
def test_both_paths_fail_returns_metadata_with_partition_only(req, monkeypatch):
"""Both TABLE_STORAGE and __TABLES__ fail → rows/size None, partition still fills."""
from connectors.bigquery import metadata
monkeypatch.setattr(
"app.instance_config.get_value",
lambda key, default=None: "us-central1" if "location" in key else default,
raising=False,
)
bq = _bq_with_session(
table_storage_raises=RuntimeError("BQ down"),
legacy_tables_raises=RuntimeError("BQ still down"),
columns_rows=[("event_date", "DATE", "NO", "YES", None)],
)
with patch("connectors.bigquery.metadata.get_bq_access", return_value=bq):
result = metadata.fetch(req)
assert result is not None
assert result.rows is None
assert result.size_bytes is None
assert result.partition_by == "event_date"
def test_bq_access_error_returns_none(req):
"""get_bq_access() raises BqAccessError → return None gracefully."""
from connectors.bigquery import metadata
from connectors.bigquery.access import BqAccessError
with patch(
"connectors.bigquery.metadata.get_bq_access",
side_effect=BqAccessError("not_configured"),
):
assert metadata.fetch(req) is None
- Step 7.2: Run tests to verify they fail
cd /tmp/agnes-metadata && python -m pytest tests/test_connectors_bigquery_metadata.py -v
Expected: 6 FAIL (stub returns None unconditionally).
- Step 7.3: Implement the BQ provider
Replace connectors/bigquery/metadata.py entirely:
"""BigQuery metadata provider — populates `TableMetadata` for a remote
BQ-backed registry row.
Two queries (different INFORMATION_SCHEMA scopes — TABLE_STORAGE is
region-scoped, COLUMNS is dataset-scoped, can't be combined):
1. INFORMATION_SCHEMA.TABLE_STORAGE — total_rows + active+long_term
bytes. Region-portable per Google's docs; only valid via
`<project>.region-<region>.INFORMATION_SCHEMA.TABLE_STORAGE`
(verified live 2026-05-07; dataset-scoped TABLE_STORAGE doesn't
exist).
2. INFORMATION_SCHEMA.COLUMNS — partition_by + clustered_by. Reuses
the consolidated `fetch_bq_columns_full` helper that v2_schema also
calls; one shared shape, one round-trip.
Region resolution chain: `instance.yaml.data_source.bigquery.location` →
`bq.bigquery_client().get_dataset(...)` → fall back to legacy `__TABLES__`
(dataset-scoped, no region required).
VIEW handling: TABLE_STORAGE returns no rows for entries whose
`table_type='VIEW'`; the legacy `__TABLES__` fallback also doesn't list
views. The provider returns `TableMetadata(rows=None, size_bytes=None,
partition_by=<from COLUMNS>, clustered_by=<from COLUMNS>)` — analyst
Claude reads `null` size and applies the existing CLAUDE.md guidance.
`size_bytes` reports `active_logical_bytes + long_term_logical_bytes`
(a full BQ scan reads both — reporting only active undercounts aged
partitioned tables).
"""
from __future__ import annotations
import logging
from app.api._metadata_models import MetadataRequest, TableMetadata
from connectors.bigquery.access import (
BqAccessError, fetch_bq_columns_full, get_bq_access,
)
logger = logging.getLogger(__name__)
def fetch(req: MetadataRequest) -> TableMetadata | None:
try:
bq = get_bq_access()
except BqAccessError:
return None
if not bq.projects.data:
return None
rows_size = _fetch_rows_and_size(bq, req)
columns = fetch_bq_columns_full(bq, req.bucket, req.source_table)
part_clust = _derive_partition_cluster(columns) if columns else None
if rows_size is None and part_clust is None:
return None
return TableMetadata(
rows=(rows_size or {}).get("rows"),
size_bytes=(rows_size or {}).get("size_bytes"),
partition_by=(part_clust or {}).get("partition_by"),
clustered_by=(part_clust or {}).get("clustered_by"),
)
def _derive_partition_cluster(columns: list[dict]) -> dict | None:
"""Mirror v2_schema._fetch_bq_table_options derivations from the
shared columns-full result."""
if not columns:
return None
partition_by = next(
(c["name"] for c in columns if c["is_partitioning_column"]),
None,
)
clustered = sorted(
(c for c in columns if c["clustering_ordinal_position"] is not None),
key=lambda c: c["clustering_ordinal_position"],
)
clustered_by = [c["name"] for c in clustered]
return {"partition_by": partition_by, "clustered_by": clustered_by}
def _fetch_rows_and_size(bq, req: MetadataRequest) -> dict | None:
"""Resolve rows + size_bytes via TABLE_STORAGE → __TABLES__ fallthrough.
See module docstring + spec Open Question §1 for view-path nuance.
"""
location = _resolve_bq_location(bq, req)
if location:
result = _fetch_via_table_storage(bq, req, location)
if result is not None:
return result
# TABLE_STORAGE returned None despite having a location: could
# be a typo in `data_source.bigquery.location`, a multi-region
# dataset operator misclassified, the table is a VIEW, or a
# transient permission gap. Try __TABLES__ before giving up.
return _fetch_via_legacy_tables(bq, req)
def _resolve_bq_location(bq, req: MetadataRequest) -> str | None:
"""instance.yaml.location → REST get_dataset → None."""
from app.instance_config import get_value
cfg_location = (get_value("data_source.bigquery.location") or "").strip()
if cfg_location:
return cfg_location
try:
ds = bq.bigquery_client().get_dataset(
f"{bq.projects.data}.{req.bucket}"
)
return ds.location
except Exception as e:
logger.warning(
"BQ dataset.get failed for %s.%s — falling back to __TABLES__: %s",
bq.projects.data, req.bucket, e,
)
return None
def _fetch_via_table_storage(bq, req: MetadataRequest, location: str) -> dict | None:
"""Region-scoped INFORMATION_SCHEMA.TABLE_STORAGE — preferred path.
`validate_quoted_identifier` accepts `us-central1`, `europe-west1`,
`EU`, `us` etc. (regex `^[a-zA-Z0-9_][a-zA-Z0-9_.\\-]{0,127}$`).
Refuses anything that could break out of the backtick-quoted path.
Returns None on no-row (table is a VIEW, or different region than
configured) — caller decides whether to fall through.
`size_bytes` is `active + long_term` logical bytes (a full BQ scan
reads both; reporting only active undercounts aged partitioned tables).
"""
from src.identifier_validation import validate_quoted_identifier
if not validate_quoted_identifier(location, "BQ region"):
return None
try:
bq_sql = (
f"SELECT total_rows, "
f"IFNULL(active_logical_bytes, 0) + IFNULL(long_term_logical_bytes, 0) "
f"FROM `{bq.projects.data}.region-{location}.INFORMATION_SCHEMA.TABLE_STORAGE` "
f"WHERE table_schema = ? AND table_name = ?"
)
with bq.duckdb_session() as conn:
row = conn.execute(
"SELECT * FROM bigquery_query(?, ?, ?, ?)",
[bq.projects.billing, bq_sql, req.bucket, req.source_table],
).fetchone()
except Exception as e:
logger.warning(
"BQ TABLE_STORAGE fetch failed for %s.%s.%s: %s",
bq.projects.data, req.bucket, req.source_table, e,
)
return None
if row is None:
return None # VIEW or wrong region
rows_, size_bytes = row
return {
"rows": int(rows_) if rows_ is not None else None,
"size_bytes": int(size_bytes) if size_bytes is not None else None,
}
def _fetch_via_legacy_tables(bq, req: MetadataRequest) -> dict | None:
"""Last-resort dataset-scoped __TABLES__ — works without region."""
try:
bq_sql = (
f"SELECT row_count, size_bytes "
f"FROM `{bq.projects.data}.{req.bucket}.__TABLES__` "
f"WHERE table_id = ?"
)
with bq.duckdb_session() as conn:
row = conn.execute(
"SELECT * FROM bigquery_query(?, ?, ?)",
[bq.projects.billing, bq_sql, req.source_table],
).fetchone()
except Exception as e:
logger.warning(
"BQ __TABLES__ fetch failed for %s.%s.%s: %s",
bq.projects.data, req.bucket, req.source_table, e,
)
return None
if row is None:
return None
rows_, size_bytes = row
return {
"rows": int(rows_) if rows_ is not None else None,
"size_bytes": int(size_bytes) if size_bytes is not None else None,
}
- Step 7.4: Run tests to verify they pass
cd /tmp/agnes-metadata && python -m pytest tests/test_connectors_bigquery_metadata.py -v
Expected: 6 passed.
- Step 7.5: Commit
cd /tmp/agnes-metadata
git add connectors/bigquery/metadata.py tests/test_connectors_bigquery_metadata.py
git commit -m "feat(bigquery): metadata provider with TABLE_STORAGE + COLUMNS + __TABLES__ fallback
Resolves rows + size_bytes via region-scoped INFORMATION_SCHEMA.TABLE_STORAGE
(verified the only valid scope on 2026-05-07; dataset-scoped doesn't exist).
size_bytes is active + long_term logical bytes — full scan reads both.
Region resolved from data_source.bigquery.location → REST get_dataset →
fall back to legacy __TABLES__ on TABLE_STORAGE failure (region typo,
VIEW, IAM gap). Partition + cluster derive from the shared
fetch_bq_columns_full helper introduced in Task 3.
VIEW handling: TABLE_STORAGE empty → fall through to __TABLES__ empty →
TableMetadata(rows=None, size_bytes=None, partition_by=..., clustered_by=...).
The null size signals analyst Claude to use the existing 'treat as
potentially large' guidance.
Refs: docs/superpowers/specs/2026-05-07-source-agnostic-table-metadata-spec.md"
Task 8: v2_catalog wiring — _size_hint_for_row + response shape + cache
Files:
-
Modify:
app/api/v2_catalog.py -
Test:
tests/test_v2_catalog_remote_metadata.py -
Step 8.1: Write failing tests
Create tests/test_v2_catalog_remote_metadata.py:
"""Catalog endpoint integration: per-table metadata enrichment for
remote rows."""
from unittest.mock import patch
from app.api._metadata_models import TableMetadata
def test_remote_row_includes_metadata_fields(seeded_app, monkeypatch):
"""Catalog response for a query_mode='remote' BQ row carries the four
new fields populated by the provider."""
c = seeded_app["client"]
token = seeded_app["admin_token"]
fake_meta = TableMetadata(
rows=10000, size_bytes=2_000_000,
partition_by="event_date", clustered_by=["country", "platform"],
)
# Register a synthetic remote BQ row in the test instance's registry.
seeded_app["register_table"](
id="orders", source_type="bigquery", bucket="dwh_base",
source_table="orders_2024", query_mode="remote",
)
with patch(
"connectors.bigquery.metadata.fetch", return_value=fake_meta,
):
r = c.get(
"/api/v2/catalog",
headers={"Authorization": f"Bearer {token}"},
)
assert r.status_code == 200, r.text
tables = r.json()["tables"]
orders = next(t for t in tables if t["id"] == "orders")
assert orders["rows"] == 10000
assert orders["size_bytes"] == 2_000_000
assert orders["partition_by"] == "event_date"
assert orders["clustered_by"] == ["country", "platform"]
# Existing fields still present.
assert orders["query_mode"] == "remote"
def test_local_row_unaffected_by_provider_dispatch(seeded_app):
"""query_mode='local' rows take the parquet-stat path; provider not called."""
c = seeded_app["client"]
token = seeded_app["admin_token"]
seeded_app["register_table"](
id="users", source_type="keboola", bucket="in.c-crm",
source_table="users", query_mode="local",
)
with patch("connectors.keboola.metadata.fetch") as mock_fetch:
r = c.get(
"/api/v2/catalog",
headers={"Authorization": f"Bearer {token}"},
)
assert r.status_code == 200, r.text
mock_fetch.assert_not_called()
def test_provider_failure_returns_null_metadata(seeded_app):
"""Provider returns None → row appears with null new fields, not
a 500. Catalog endpoint must stay 200."""
c = seeded_app["client"]
token = seeded_app["admin_token"]
seeded_app["register_table"](
id="broken", source_type="bigquery", bucket="dwh_base",
source_table="broken_t", query_mode="remote",
)
with patch(
"connectors.bigquery.metadata.fetch", return_value=None,
):
r = c.get(
"/api/v2/catalog",
headers={"Authorization": f"Bearer {token}"},
)
assert r.status_code == 200, r.text
tables = r.json()["tables"]
broken = next(t for t in tables if t["id"] == "broken")
assert broken["rows"] is None
assert broken["size_bytes"] is None
assert broken["partition_by"] is None
assert broken["clustered_by"] is None
def test_cache_hit_does_not_call_provider_twice(seeded_app):
"""First call invokes provider; second within 15 min hits cache."""
c = seeded_app["client"]
token = seeded_app["admin_token"]
seeded_app["register_table"](
id="orders", source_type="bigquery", bucket="dwh_base",
source_table="orders_2024", query_mode="remote",
)
fake_meta = TableMetadata(rows=1, size_bytes=2)
with patch(
"connectors.bigquery.metadata.fetch", return_value=fake_meta,
) as mock_fetch:
c.get("/api/v2/catalog", headers={"Authorization": f"Bearer {token}"})
c.get("/api/v2/catalog", headers={"Authorization": f"Bearer {token}"})
assert mock_fetch.call_count == 1
The seeded_app["register_table"] helper may need to be added to conftest.py if it doesn't exist. Quick check + extension:
# In tests/conftest.py — add to seeded_app fixture
def register_table(*, id, source_type, bucket, source_table, query_mode, **extra):
from src.repositories.table_registry import TableRegistryRepository
repo = TableRegistryRepository(seeded_app["conn"])
repo.register(
id=id, name=id, source_type=source_type, bucket=bucket,
source_table=source_table, query_mode=query_mode, **extra,
)
seeded_app["register_table"] = register_table
- Step 8.2: Run test to verify it fails
cd /tmp/agnes-metadata && python -m pytest tests/test_v2_catalog_remote_metadata.py -v
Expected: FAIL — response doesn't include the new fields yet.
- Step 8.3: Wire up the dispatch + cache + response shape
Edit app/api/v2_catalog.py:
A. Add a _metadata_cache near the existing _table_rows_cache (around line 26):
# Per-table cached TableMetadata. 15-min TTL — long enough to amortise
# across an analyst session, short enough that a freshly-registered
# remote table shows real numbers within a coffee break (the cache-bust
# path in `invalidate_for_table` accelerates this for the common admin-
# verifies-registration flow).
_metadata_cache = TTLCache(maxsize=512, ttl_seconds=900)
B. Rename the existing function _materialized_size_hint to _size_hint_for_row and extend it. Find the function (around line 68) and replace its full body:
def _size_hint_for_row(row: dict) -> dict:
"""Resolve the per-row metadata bundle the catalog response surfaces.
Renamed from `_materialized_size_hint` (which always also handled
`local` rows; the old name was misleading). Returns a dict with up
to four keys: `rough_size_hint`, `rows`, `size_bytes`, `partition_by`,
`clustered_by`. Missing keys are reported as `null` in the response.
Branches:
- `local` / `materialized` → existing on-disk parquet stat (cheap).
- `remote` → dispatch to the per-source-type provider; cache the
TableMetadata for 15 min.
"""
table_id = row["id"]
source_type = row.get("source_type") or ""
query_mode = row.get("query_mode") or "local"
if query_mode in ("local", "materialized"):
return {"rough_size_hint": _materialized_parquet_size_bucket(
table_id, source_type, query_mode,
)}
if query_mode != "remote":
return {"rough_size_hint": None}
# Cache lookup (per-row TableMetadata).
cached = _metadata_cache.get(table_id)
if cached is None:
cached = _resolve_remote_metadata(row)
if cached is not None:
_metadata_cache.set(table_id, cached)
if cached is None:
return {"rough_size_hint": None}
return {
"rough_size_hint": _bucket_size(cached.size_bytes) if cached.size_bytes else None,
"rows": cached.rows,
"size_bytes": cached.size_bytes,
"partition_by": cached.partition_by,
"clustered_by": cached.clustered_by,
}
def _materialized_parquet_size_bucket(
table_id: str, source_type: str, query_mode: str,
) -> str | None:
"""Size hint for rows whose data is on the server filesystem
(the old `_materialized_size_hint` body)."""
if not source_type:
return None
try:
path = (
Path(_get_data_dir()) / "extracts" / source_type / "data"
/ f"{table_id}.parquet"
)
if not path.exists():
return None
return _bucket_size(path.stat().st_size)
except Exception:
return None
def _resolve_remote_metadata(row: dict) -> "TableMetadata | None":
"""Provider dispatch for a remote row. Returns None on any failure."""
source_type = row.get("source_type") or ""
provider = _metadata_provider_for(source_type)
if provider is None:
return None
req = _build_metadata_request(row)
if req is None:
return None
try:
return provider(req)
except Exception:
# Defense in depth — providers are documented as never-raises,
# but a regression would otherwise 500 the whole catalog.
return None
C. Update build_catalog (around line 94) to merge the new fields into each visible row's payload. Find the loop body that builds visible.append(...) and replace the appended dict construction:
visible = []
for r in rows:
if not can_access_table(user, r["id"], conn):
continue
hint = _size_hint_for_row(r)
visible.append({
"id": r["id"],
"name": r.get("name") or r["id"],
"description": r.get("description") or "",
"source_type": r.get("source_type") or "",
"query_mode": r.get("query_mode") or "local",
"sql_flavor": _flavor_for(r.get("source_type") or ""),
"where_examples": _examples_for(r.get("source_type") or ""),
"fetch_via": _fetch_hint(r["id"], r.get("source_type") or ""),
"rough_size_hint": hint.get("rough_size_hint"),
"rows": hint.get("rows"),
"size_bytes": hint.get("size_bytes"),
"partition_by": hint.get("partition_by"),
"clustered_by": hint.get("clustered_by"),
})
D. Verify the existing from pathlib import Path and _get_data_dir import are still in scope; if not, add them.
- Step 8.4: Run tests to verify they pass
cd /tmp/agnes-metadata && python -m pytest tests/test_v2_catalog_remote_metadata.py tests/test_v2_catalog_dispatcher.py -v
Expected: all green.
- Step 8.5: Run existing v2_catalog tests for regression
cd /tmp/agnes-metadata && python -m pytest tests/test_v2_catalog.py -v
Expected: all green (response is additive — no field renamed or removed).
- Step 8.6: Commit
cd /tmp/agnes-metadata
git add app/api/v2_catalog.py tests/test_v2_catalog_remote_metadata.py
git commit -m "feat(catalog): enrich remote rows with rows / size_bytes / partition_by / clustered_by
_size_hint_for_row (renamed from _materialized_size_hint) now dispatches
to the per-source-type metadata provider for query_mode='remote' rows
and caches the TableMetadata for 15 minutes. Catalog response gains
four new optional fields; existing CLI consumers that read only
rough_size_hint stay unchanged.
Refs: docs/superpowers/specs/2026-05-07-source-agnostic-table-metadata-spec.md"
Task 9: Unified cache invalidation
Files:
-
Modify:
app/api/v2_catalog.py(addinvalidate_for_table) -
Modify:
app/api/admin.py(call from register/update/unregister) -
Test:
tests/test_v2_catalog_invalidation.py -
Step 9.1: Write failing test
Create tests/test_v2_catalog_invalidation.py:
"""Unified cache flush across all four catalog/schema/sample/metadata
caches on registry write."""
from unittest.mock import patch
def test_invalidate_flushes_all_four_caches():
from app.api import v2_catalog, v2_schema, v2_sample
from app.api._metadata_models import TableMetadata
# Pre-populate.
v2_catalog._table_rows_cache.set("all", ["fake_row"])
v2_catalog._metadata_cache.set("orders", TableMetadata(rows=10))
v2_schema._schema_cache.set("orders", {"columns": []})
v2_sample._sample_cache.set("orders|10", [{"row": 1}])
v2_catalog.invalidate_for_table("orders")
assert v2_catalog._table_rows_cache.get("all") is None
assert v2_catalog._metadata_cache.get("orders") is None
assert v2_schema._schema_cache.get("orders") is None
# Sample cache is cleared whole (we don't have prefix-invalidation).
assert v2_sample._sample_cache.get("orders|10") is None
def test_invalidate_schedules_single_row_rewarm(monkeypatch):
"""After the flush, a background re-warm task is scheduled for the
same table_id. Assert via patching create_task."""
import asyncio
from app.api import v2_catalog
scheduled = []
def fake_create_task(coro):
# Drain the coroutine so the test doesn't leak it.
coro.close()
scheduled.append(coro)
return None
monkeypatch.setattr(asyncio, "create_task", fake_create_task)
v2_catalog.invalidate_for_table("orders")
assert len(scheduled) == 1
def test_register_table_invalidates(seeded_app):
"""Registering a table flushes the rows cache so the next catalog
request reflects it without waiting for the 5-min TTL."""
from app.api import v2_catalog
v2_catalog._table_rows_cache.set("all", [])
seeded_app["register_table"](
id="new_t", source_type="keboola", bucket="in.c-x",
source_table="t", query_mode="local",
)
assert v2_catalog._table_rows_cache.get("all") is None
def test_update_table_invalidates(seeded_app):
from app.api import v2_catalog
seeded_app["register_table"](
id="u_t", source_type="keboola", bucket="in.c-x",
source_table="t", query_mode="local",
)
v2_catalog._table_rows_cache.set("all", ["pre-update"])
seeded_app["http_put"]("/api/admin/registry/u_t", {"description": "new"})
assert v2_catalog._table_rows_cache.get("all") is None
def test_unregister_table_invalidates(seeded_app):
from app.api import v2_catalog
seeded_app["register_table"](
id="d_t", source_type="keboola", bucket="in.c-x",
source_table="t", query_mode="local",
)
v2_catalog._table_rows_cache.set("all", ["pre-delete"])
seeded_app["http_delete"]("/api/admin/registry/d_t")
assert v2_catalog._table_rows_cache.get("all") is None
If seeded_app["http_put"] or ["http_delete"] aren't already defined in tests/conftest.py, add thin wrappers that call c.put(...) / c.delete(...) with the admin token.
- Step 9.2: Run tests to verify they fail
cd /tmp/agnes-metadata && python -m pytest tests/test_v2_catalog_invalidation.py -v
Expected: FAIL with AttributeError: module 'app.api.v2_catalog' has no attribute 'invalidate_for_table'.
- Step 9.3: Add
invalidate_for_tableto v2_catalog
Append to app/api/v2_catalog.py (anywhere after _metadata_cache is defined):
def invalidate_for_table(table_id: str) -> None:
"""Drop every per-table cache so the next /api/v2/* request reflects
the just-registered / updated / unregistered row immediately. Owned
by the catalog module so admin.py doesn't need to know which caches
exist.
Imports v2_schema and v2_sample lazily — keeps catalog tests from
pulling in BQ-extension imports they don't need.
"""
import asyncio
from app.api import v2_schema, v2_sample
_table_rows_cache.clear()
_metadata_cache.invalidate(table_id)
v2_schema._schema_cache.invalidate(table_id)
# Sample cache key is `f"{table_id}|{n}"`; clearing the whole sample
# cache is heavier than precise invalidation, but registry-change
# frequency (handful per day on a typical instance) doesn't justify
# adding a prefix-invalidation primitive to TTLCache.
v2_sample._sample_cache.clear()
# Schedule a single-row re-warm so admins editing a registry row
# see fresh data within a couple of seconds rather than waiting for
# the next analyst to trigger a miss. Fire-and-forget; failures
# log + skip inside the coroutine.
try:
asyncio.create_task(_rewarm_one_row(table_id))
except RuntimeError:
# No running event loop (e.g. called from a sync test). Skip
# the re-warm — the next live request will populate via miss.
pass
async def _rewarm_one_row(table_id: str) -> None:
"""Background single-row re-warm. Imports cache_warmup lazily to
avoid a circular import at module load."""
try:
from app.api.cache_warmup import warm_one_table
await warm_one_table(table_id)
except Exception:
import logging
logging.getLogger(__name__).warning(
"single-row re-warm failed for %s — next live request will populate",
table_id,
)
The cache_warmup.warm_one_table function will be defined in Task 10. For now it doesn't exist; the lazy import + outer try means the test for the schedule-task can run without it, falling through to the warning path. The unit test test_invalidate_schedules_single_row_rewarm patches asyncio.create_task so the inner coroutine never runs.
- Step 9.4: Wire the helper into admin.py
Edit app/api/admin.py. Find the success-return paths in:
register_table(around line 1037 / inner success branch)update_table(around line 2486)unregister_table(the DELETE handler around line 2538)
For each, add a single line just before the function returns:
from app.api.v2_catalog import invalidate_for_table
invalidate_for_table(table_id)
(The import inside the function avoids a circular import at module load.)
For register_table, the table_id may be derived from result["id"] or similar — adapt to the actual variable name in scope. For unregister_table, the path parameter is table_id. For update_table, also table_id.
- Step 9.5: Run tests
cd /tmp/agnes-metadata && python -m pytest tests/test_v2_catalog_invalidation.py -v
Expected: all green.
- Step 9.6: Commit
cd /tmp/agnes-metadata
git add app/api/v2_catalog.py app/api/admin.py tests/test_v2_catalog_invalidation.py
git commit -m "feat(catalog): unified invalidate_for_table flushes 4 caches + schedules rewarm
invalidate_for_table flushes _table_rows_cache + _metadata_cache +
_schema_cache + _sample_cache, then schedules a single-row re-warm so
admin-edited rows reflect fresh data within ~1s rather than waiting for
the next analyst miss. Wired into register_table / update_table /
unregister_table.
Pre-fix none of the four caches were invalidated on registry change —
admin registers a table, agnes catalog doesn't show the new row for up
to 5 min; admin updates a row's bucket, agnes schema returns the OLD
column list for up to 1h.
Refs: docs/superpowers/specs/2026-05-07-source-agnostic-table-metadata-spec.md"
Task 10: Cache warmup framework — state + bg task + endpoints
Files:
-
Create:
app/api/cache_warmup.py -
Modify:
pyproject.toml(addsse-starlettedep) -
Test:
tests/test_cache_warmup.py -
Step 10.1: Add sse-starlette dependency
Edit pyproject.toml. Find the dependencies = [ block under [project] and append:
"sse-starlette>=2.0",
Run:
cd /tmp/agnes-metadata && uv pip install -e ".[dev]"
- Step 10.2: Write failing tests
Create tests/test_cache_warmup.py:
"""Cache warmup framework — state, bg task, endpoints."""
import asyncio
from unittest.mock import patch
import pytest
def test_warmup_run_state_starts_empty():
from app.api.cache_warmup import WARMUP_STATE
assert WARMUP_STATE is None or WARMUP_STATE.completed_at is not None
@pytest.mark.asyncio
async def test_warmup_skips_when_env_set(monkeypatch):
"""AGNES_SKIP_CACHE_WARMUP=1 → background warmup is a no-op."""
monkeypatch.setenv("AGNES_SKIP_CACHE_WARMUP", "1")
from app.api import cache_warmup
with patch.object(cache_warmup, "_warm_catalog_caches_bg") as mock_bg:
cache_warmup.maybe_schedule_startup_warmup()
# Either not called, or called and short-circuits internally.
# Exact assertion depends on implementation; pick the simpler one.
mock_bg.assert_not_called()
@pytest.mark.asyncio
async def test_warmup_runs_one_per_remote_row():
"""`_warm_catalog_caches_bg` calls `_warm_one` once per remote row."""
from app.api import cache_warmup
# Stub the registry to return 3 remote BQ rows + 1 local row.
fake_rows = [
{"id": "r1", "query_mode": "remote", "source_type": "bigquery"},
{"id": "r2", "query_mode": "remote", "source_type": "bigquery"},
{"id": "r3", "query_mode": "remote", "source_type": "bigquery"},
{"id": "L1", "query_mode": "local", "source_type": "keboola"},
]
warmed = []
async def fake_warm_one(row, state, sem):
warmed.append(row["id"])
with patch.object(cache_warmup, "_list_remote_rows", return_value=fake_rows[:3]):
with patch.object(cache_warmup, "_warm_one", fake_warm_one):
await cache_warmup._warm_catalog_caches_bg(trigger="manual")
assert sorted(warmed) == ["r1", "r2", "r3"]
@pytest.mark.asyncio
async def test_warmup_failure_isolated_to_one_row():
from app.api import cache_warmup
async def fake_warm_one(row, state, sem):
if row["id"] == "fail":
raise RuntimeError("BQ down for this row")
with patch.object(cache_warmup, "_list_remote_rows", return_value=[
{"id": "ok1", "query_mode": "remote", "source_type": "bigquery"},
{"id": "fail", "query_mode": "remote", "source_type": "bigquery"},
{"id": "ok2", "query_mode": "remote", "source_type": "bigquery"},
]):
with patch.object(cache_warmup, "_warm_one", fake_warm_one):
await cache_warmup._warm_catalog_caches_bg(trigger="manual")
state = cache_warmup.WARMUP_STATE
assert state.total == 3
# Other rows still ran; the bad one is recorded but doesn't blow up
# the gather. Note: this asserts on the gather behavior, not on
# `_warm_one`'s internal exception handling — that lives in
# the real implementation, not the stub here. If the implementation
# catches inside `_warm_one`, the gather completes fully.
@pytest.mark.asyncio
async def test_run_endpoint_idempotent_on_concurrent_invocation(seeded_app):
"""Two POST /run calls in flight return the same run_id."""
c = seeded_app["client"]
token = seeded_app["admin_token"]
headers = {"Authorization": f"Bearer {token}"}
# First POST starts a run.
r1 = c.post("/api/admin/cache-warmup/run", headers=headers)
assert r1.status_code == 200
body1 = r1.json()
# Second POST while first is in-flight: returns existing run_id.
r2 = c.post("/api/admin/cache-warmup/run", headers=headers)
body2 = r2.json()
if body2.get("status") == "already_running":
assert body2["run_id"] == body1.get("run_id") or "run_id" in body1
# Otherwise the first run completed before the second arrived; that's
# also OK — idempotency only matters under true concurrency.
def test_status_endpoint_before_first_run(seeded_app):
"""GET /status returns {state: never_run} before any warmup."""
from app.api import cache_warmup
cache_warmup.WARMUP_STATE = None # reset for this test
c = seeded_app["client"]
token = seeded_app["admin_token"]
r = c.get(
"/api/admin/cache-warmup/status",
headers={"Authorization": f"Bearer {token}"},
)
assert r.status_code == 200
assert r.json() == {"state": "never_run"}
- Step 10.3: Run tests to verify they fail
cd /tmp/agnes-metadata && python -m pytest tests/test_cache_warmup.py -v
Expected: ImportError for app.api.cache_warmup.
- Step 10.4: Implement the warmup module
Create app/api/cache_warmup.py:
"""Cache warmup framework — populates catalog/schema/metadata caches at
container startup so the first analyst hits warm caches.
Bounded concurrency (4 by default). Exposes:
- GET /api/admin/cache-warmup/status — JSON snapshot
- POST /api/admin/cache-warmup/run — manual trigger (idempotent)
- GET /api/admin/cache-warmup/stream — Server-Sent Events
"""
from __future__ import annotations
import asyncio
import json
import logging
import os
import time
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Literal
from uuid import uuid4
from fastapi import APIRouter, Depends
from sse_starlette.sse import EventSourceResponse
from app.auth.dependencies import _get_db
from app.auth.access import require_admin
logger = logging.getLogger(__name__)
router = APIRouter()
@dataclass
class WarmupRowState:
table_id: str
status: Literal["pending", "warming", "fresh", "error"]
started_at: str | None = None
completed_at: str | None = None
duration_ms: int | None = None
error: str | None = None
last_warmed_at: str | None = None
@dataclass
class WarmupRunState:
run_id: str
trigger: Literal["startup", "manual", "registry_change"]
started_at: str
completed_at: str | None = None
total: int = 0
completed: int = 0
failed: int = 0
rows: dict[str, WarmupRowState] = field(default_factory=dict)
_subscribers: list[asyncio.Queue] = field(default_factory=list, repr=False)
WARMUP_STATE: WarmupRunState | None = None
_RUN_LOCK = asyncio.Lock()
def _now_iso() -> str:
return datetime.now(timezone.utc).isoformat()
def maybe_schedule_startup_warmup() -> None:
"""Called from app/main.py FastAPI startup event."""
if os.environ.get("AGNES_SKIP_CACHE_WARMUP") == "1":
logger.info("cache warmup skipped (AGNES_SKIP_CACHE_WARMUP=1)")
return
try:
asyncio.create_task(_warm_catalog_caches_bg(trigger="startup"))
except RuntimeError:
logger.warning("no running event loop — startup warmup skipped")
async def _warm_catalog_caches_bg(trigger: str = "startup") -> None:
"""Walk registry, warm metadata + schema caches for every remote row."""
global WARMUP_STATE
async with _RUN_LOCK:
# Re-check inside the lock — another caller might have completed
# a run while we were waiting.
if WARMUP_STATE and WARMUP_STATE.completed_at is None:
return
run_id = uuid4().hex[:8]
state = WarmupRunState(
run_id=run_id, trigger=trigger, started_at=_now_iso(),
)
WARMUP_STATE = state
rows = _list_remote_rows()
state.total = len(rows)
for r in rows:
state.rows[r["id"]] = WarmupRowState(
table_id=r["id"], status="pending",
)
_broadcast(state, {"event": "start", "data": {
"run_id": run_id, "trigger": trigger, "total": state.total,
}})
sem = asyncio.Semaphore(int(os.environ.get("AGNES_WARMUP_CONCURRENCY", "4")))
await asyncio.gather(
*(_warm_one(r, state, sem) for r in rows), return_exceptions=True,
)
state.completed_at = _now_iso()
_broadcast(state, {"event": "complete", "data": {
"run_id": run_id, "total": state.total,
"completed": state.completed, "failed": state.failed,
}})
logger.info(
"cache warmup complete: run_id=%s total=%d ok=%d fail=%d",
run_id, state.total, state.completed, state.failed,
)
def _list_remote_rows() -> list[dict]:
"""Snapshot of registry rows that need a warmup pass."""
from src.db import get_system_db
from src.repositories.table_registry import TableRegistryRepository
conn = get_system_db()
rows = TableRegistryRepository(conn).list_all()
return [
r for r in rows
if r.get("query_mode") == "remote"
and r.get("source_type") == "bigquery"
]
async def _warm_one(
row: dict, state: WarmupRunState, sem: asyncio.Semaphore,
) -> None:
async with sem:
rs = state.rows[row["id"]]
rs.status = "warming"
rs.started_at = _now_iso()
_broadcast(state, {"event": "row", "data": asdict(rs)})
t0 = time.monotonic()
try:
await asyncio.to_thread(_warm_metadata_sync, row)
await asyncio.to_thread(_warm_schema_sync, row)
rs.status = "fresh"
rs.last_warmed_at = _now_iso()
state.completed += 1
except Exception as e:
rs.status = "error"
rs.error = str(e)
state.failed += 1
logger.warning("cache warmup row=%s failed: %s", row["id"], e)
finally:
rs.completed_at = _now_iso()
rs.duration_ms = int((time.monotonic() - t0) * 1000)
_broadcast(state, {"event": "row", "data": asdict(rs)})
def _warm_metadata_sync(row: dict) -> None:
"""Trigger metadata cache populate via the catalog's normal path."""
from app.api.v2_catalog import _size_hint_for_row
_size_hint_for_row(row)
def _warm_schema_sync(row: dict) -> None:
"""Trigger schema cache populate via build_schema_uncached."""
from app.api.v2_schema import build_schema_uncached
from connectors.bigquery.access import get_bq_access
from src.db import get_system_db
bq = get_bq_access()
build_schema_uncached(get_system_db(), row["id"], bq=bq)
async def warm_one_table(table_id: str) -> None:
"""Single-row re-warm — invoked by `invalidate_for_table` after a
registry change. Does NOT update WARMUP_STATE (small change shouldn't
overwrite the last full run's status); just refreshes the caches."""
from src.db import get_system_db
from src.repositories.table_registry import TableRegistryRepository
conn = get_system_db()
row = TableRegistryRepository(conn).get(table_id)
if not row or row.get("query_mode") != "remote":
return
try:
await asyncio.to_thread(_warm_metadata_sync, row)
await asyncio.to_thread(_warm_schema_sync, row)
except Exception as e:
logger.warning("single-row warmup failed for %s: %s", table_id, e)
def _broadcast(state: WarmupRunState, event: dict) -> None:
"""Send an event to every SSE subscriber. Dead queues are pruned."""
dead = []
for q in state._subscribers:
try:
q.put_nowait(event)
except asyncio.QueueFull:
dead.append(q)
for q in dead:
state._subscribers.remove(q)
def _serialize_state(state: WarmupRunState) -> dict:
return {
"run_id": state.run_id,
"trigger": state.trigger,
"started_at": state.started_at,
"completed_at": state.completed_at,
"total": state.total,
"completed": state.completed,
"failed": state.failed,
"rows": {tid: asdict(rs) for tid, rs in state.rows.items()},
}
# ─── Endpoints ────────────────────────────────────────────────────────
@router.get("/api/admin/cache-warmup/status")
async def warmup_status(user: dict = Depends(require_admin)):
if WARMUP_STATE is None:
return {"state": "never_run"}
return _serialize_state(WARMUP_STATE)
@router.post("/api/admin/cache-warmup/run")
async def warmup_run(user: dict = Depends(require_admin)):
if WARMUP_STATE and WARMUP_STATE.completed_at is None:
return {"run_id": WARMUP_STATE.run_id, "status": "already_running"}
asyncio.create_task(_warm_catalog_caches_bg(trigger="manual"))
# Brief sleep so WARMUP_STATE is populated by the time we return.
await asyncio.sleep(0)
return {
"run_id": WARMUP_STATE.run_id if WARMUP_STATE else None,
"status": "started",
}
@router.get("/api/admin/cache-warmup/stream")
async def warmup_stream(user: dict = Depends(require_admin)):
async def gen():
q: asyncio.Queue = asyncio.Queue(maxsize=256)
if WARMUP_STATE:
WARMUP_STATE._subscribers.append(q)
# Replay current snapshot.
if WARMUP_STATE:
yield {"event": "snapshot", "data": json.dumps(
_serialize_state(WARMUP_STATE)
)}
try:
while True:
ev = await asyncio.wait_for(q.get(), timeout=30.0)
yield {"event": ev["event"], "data": json.dumps(ev["data"])}
if ev["event"] == "complete":
return
except asyncio.TimeoutError:
return
finally:
if WARMUP_STATE and q in WARMUP_STATE._subscribers:
WARMUP_STATE._subscribers.remove(q)
return EventSourceResponse(gen())
- Step 10.5: Register the router in app/main.py
Edit app/main.py. Find the section where routers are included (look for app.include_router(...) calls). Add:
from app.api.cache_warmup import router as cache_warmup_router
app.include_router(cache_warmup_router)
- Step 10.6: Run tests
cd /tmp/agnes-metadata && python -m pytest tests/test_cache_warmup.py -v
Expected: all green.
- Step 10.7: Commit
cd /tmp/agnes-metadata
git add app/api/cache_warmup.py app/main.py pyproject.toml \
tests/test_cache_warmup.py
git commit -m "feat(cache-warmup): startup background warmup + status / run / stream endpoints
- WarmupRunState / WarmupRowState dataclasses + module-level singleton.
- _warm_catalog_caches_bg walks remote BQ rows with bounded concurrency
(Semaphore(4), tunable via AGNES_WARMUP_CONCURRENCY).
- Per-row failures isolated; never blow up the gather.
- maybe_schedule_startup_warmup honors AGNES_SKIP_CACHE_WARMUP=1.
- warm_one_table for the single-row re-warm called from
invalidate_for_table.
- GET /api/admin/cache-warmup/status — JSON snapshot.
- POST /api/admin/cache-warmup/run — manual trigger, idempotent under
concurrent invocation (locks + checks completed_at).
- GET /api/admin/cache-warmup/stream — Server-Sent Events stream of
start / row / complete events.
Refs: docs/superpowers/specs/2026-05-07-source-agnostic-table-metadata-spec.md"
Task 11: FastAPI startup event hook
Files:
-
Modify:
app/main.py -
Test:
tests/test_main_startup_warmup.py -
Step 11.1: Write failing test
Create tests/test_main_startup_warmup.py:
"""The FastAPI startup event schedules `maybe_schedule_startup_warmup`
without blocking readiness."""
from unittest.mock import patch
def test_startup_event_calls_warmup_scheduler():
from app.main import app
# FastAPI startup events are functions on `app.router.on_startup`.
handler_names = [h.__name__ for h in app.router.on_startup]
assert any("warm" in n.lower() for n in handler_names), (
"Expected a startup handler with 'warm' in its name; "
f"found: {handler_names}"
)
def test_health_check_returns_immediately_during_warmup(seeded_app):
"""/api/health doesn't await warmup; readiness is fire-and-forget."""
c = seeded_app["client"]
r = c.get("/api/health")
assert r.status_code == 200
- Step 11.2: Run test to verify it fails
cd /tmp/agnes-metadata && python -m pytest tests/test_main_startup_warmup.py -v
Expected: first test FAIL, second probably passes.
- Step 11.3: Add the startup hook
Edit app/main.py. Locate the existing FastAPI app instantiation (app = FastAPI(...)). Add a startup event handler nearby:
@app.on_event("startup")
async def warm_catalog_caches_on_startup():
"""Schedule a background warmup of the v2 catalog/schema/metadata
caches. Fire-and-forget; readiness is not blocked. Failures inside
the background task are logged + swallowed.
"""
from app.api.cache_warmup import maybe_schedule_startup_warmup
maybe_schedule_startup_warmup()
If the file already uses lifespan handlers (newer FastAPI pattern), add the call into the lifespan function instead. Inspect the file first.
- Step 11.4: Run tests to verify they pass
cd /tmp/agnes-metadata && python -m pytest tests/test_main_startup_warmup.py -v
Expected: both pass.
- Step 11.5: Commit
cd /tmp/agnes-metadata
git add app/main.py tests/test_main_startup_warmup.py
git commit -m "feat(main): startup event hook for catalog cache warmup
Fire-and-forget — readiness is not blocked. Per-row warmup runs in
background with bounded concurrency. Honors AGNES_SKIP_CACHE_WARMUP=1
opt-out for dev/test instances.
Refs: docs/superpowers/specs/2026-05-07-source-agnostic-table-metadata-spec.md"
Task 12: CLI post-register hint for query_mode=remote
Files:
-
Modify:
cli/commands/admin.py(extendregister_tablepost-success hint) -
Test:
tests/test_cli_admin.py(append test) -
Step 12.1: Write failing test
Append to tests/test_cli_admin.py:
class TestRegisterTableHints:
"""The CLI prints helpful follow-up hints after a successful
register-table call. v0.46 adds a third hint for query_mode=remote
pointing at the IAM verify-your-SA smoke check."""
def test_remote_register_emits_iam_verify_hint(self):
with patch("cli.commands.admin.api_post", return_value=_resp(201, {"id": "t"})):
result = runner.invoke(app, [
"admin", "register-table", "orders",
"--source-type", "bigquery",
"--bucket", "dwh_base",
"--source-table", "orders",
"--query-mode", "remote",
])
assert result.exit_code == 0
assert "agnes query --remote" in result.output
assert "query-modes.md" in result.output
def test_local_register_does_not_emit_remote_hint(self):
with patch("cli.commands.admin.api_post", return_value=_resp(201, {"id": "t"})):
result = runner.invoke(app, [
"admin", "register-table", "users",
"--source-type", "keboola",
"--bucket", "in.c-crm",
"--source-table", "users",
"--query-mode", "local",
])
assert result.exit_code == 0
assert "agnes query --remote" not in result.output
- Step 12.2: Run test to verify it fails
cd /tmp/agnes-metadata && python -m pytest tests/test_cli_admin.py::TestRegisterTableHints -v
Expected: first FAIL.
- Step 12.3: Add the hint
Edit cli/commands/admin.py. Find the success branch in register_table (search for the existing two hints — Next: run agnes setup first-sync and register-table does not auto-grant). Add a third hint conditional on query_mode == "remote":
# Third hint: BQ-remote rows can fail at first analyst query if the
# SA lacks dataViewer/jobUser. Pointing at the smoke command
# surfaces the failure at registration time, not 30 minutes later.
if query_mode == "remote":
typer.echo(
f" Note: this is a remote-query table. Verify the SA can read it:\n"
f" agnes query --remote \"SELECT COUNT(*) FROM {name}\"\n"
f" If it 403s, see docs/admin/query-modes.md → \"BigQuery → IAM\"."
)
- Step 12.4: Run tests
cd /tmp/agnes-metadata && python -m pytest tests/test_cli_admin.py::TestRegisterTableHints -v
Expected: 2 passed.
- Step 12.5: Commit
cd /tmp/agnes-metadata
git add cli/commands/admin.py tests/test_cli_admin.py
git commit -m "feat(cli): third post-register hint for query_mode=remote
Points at agnes query --remote SELECT COUNT(*) as the smoke check for
SA IAM. Surfaces 403 USER_PROJECT_DENIED at register time, not when
the first analyst hits the table 30 minutes later. Cross-references the
new docs/admin/query-modes.md page (Task 13).
Refs: docs/superpowers/specs/2026-05-07-source-agnostic-table-metadata-spec.md"
Task 13: docs/admin/query-modes.md
Files:
-
Create:
docs/admin/query-modes.md -
Test: smoke test linking via
tests/test_docs_links.py(optional) -
Step 13.1: Write the doc
Create docs/admin/query-modes.md with this exact content:
# Query Modes — when to register a table as `local`, `remote`, or `materialized`
Source-agnostic guide to the three `query_mode` values Agnes supports. Pick the right mode at registration time and the analyst-side experience is fast, cost-aware, and predictable. Pick wrong and you'll either burn BQ scan budget on every query or spend hours waiting on syncs that didn't need to happen.
## TL;DR — decision tree
```
Is the table small (< 1 GB) and updated daily-or-slower?
└─ YES → query_mode: local (sync to laptop, query offline)
Is the table the result of an aggregate SQL the operator controls?
└─ YES → query_mode: materialized (server runs SQL → parquet, distributed)
Otherwise:
└─ query_mode: remote (data stays in upstream; analyst queries on demand)
```
## Three modes side-by-side
| Aspect | `local` | `materialized` | `remote` |
|---|---|---|---|
| Where the data lives | Analyst laptop (parquet) | Agnes server filesystem (parquet) | Upstream (BigQuery, Keboola, …) |
| Who runs the query | Analyst's local DuckDB | Analyst's local DuckDB | Upstream engine via DuckDB extension |
| Cost model | Free after sync | Free after each sync | Per-query scan cost on the analyst's first hit |
| Freshness | As fresh as last sync | As fresh as last scheduled run | Live |
| Scan limits | None (laptop disk) | None (server disk) | `bq_max_scan_bytes` cost gate (default 5 GiB) |
| Best for | Stable reference data, daily-updated facts | Aggregates, daily snapshots | Big tables, live data, residency-restricted |
## Per-source-type reference
### BigQuery — `query_mode: remote`
The most common use case for `remote`. Data stays in BQ; analysts query on demand via the Agnes server's service account.
**IAM:** the server's SA must have:
- `roles/bigquery.dataViewer` on the dataset (read access)
- `roles/bigquery.jobUser` on the *billing* project (run jobs)
If `data_source.bigquery.billing_project == data_source.bigquery.project`, set the SA's `serviceusage.services.use` permission too — the BQ extension can otherwise 403 USER_PROJECT_DENIED on the first query. The instance health check (`agnes diagnose`) surfaces this as an `info`-tier entry on `bq_config`.
**Register via UI:** `/admin/tables` → "Add table" → Source type `bigquery` → Mode `remote` → fill `dataset` (your BQ dataset name) + `source_table` (the BQ table id within that dataset).
**Register via CLI:**
```bash
agnes admin register-table sales_2024 \
--source-type bigquery \
--bucket dwh_base \
--source-table sales_2024 \
--query-mode remote
```
After registration, smoke-test the SA's access:
```bash
agnes query --remote "SELECT COUNT(*) FROM sales_2024"
```
A 403 here means the SA is missing `dataViewer` or `jobUser`; fix in IAM and re-test.
**Cost guardrail:** `bq_max_scan_bytes` (default 5 GiB) refuses queries whose pre-execution scan estimate exceeds the cap. Configurable in `/admin/server-config`. When an analyst hits the cap, the response includes a hint to use `agnes snapshot create --where '<predicate>'` to materialise a scoped subset locally.
### BigQuery — `query_mode: materialized`
The server runs a scheduled SQL aggregate against BigQuery and writes the result to a parquet on the Agnes filesystem. Analysts get the parquet via `agnes pull` like any other local table.
**Register via CLI:**
```bash
agnes admin register-table monthly_kpis \
--source-type bigquery \
--bucket dwh_base \
--source-table monthly_kpis \
--query-mode materialized \
--query @path/to/monthly_kpis.sql \
--sync-schedule "daily 03:00"
```
**Cost guardrail:** `data_source.bigquery.max_bytes_per_materialize` (default 10 GiB; set `0` to disable) refuses materialise runs whose query plan exceeds the cap. Catches a typo'd `WHERE` clause that would otherwise scan a year of data.
### Keboola — `query_mode: local` (the production path)
The Agnes server's Keboola DuckDB extension downloads the table to a parquet on the server filesystem; `agnes pull` distributes it to analyst laptops.
**Setup:** `instance.yaml.data_source.type: keboola` + Storage API token via `KEBOOLA_STORAGE_TOKEN` env var (or whatever `instance.yaml.token_env` points at).
**Register via CLI:**
```bash
agnes admin register-table users \
--source-type keboola \
--bucket in.c-crm \
--source-table users \
--query-mode local
```
**`query_mode: remote` for Keboola** is architecturally supported via the `_remote_attach` mechanism (the orchestrator can ATTACH the Keboola DuckDB extension on demand the same way it does for BQ), but **not in active deployment use today**. If you have an analyst workflow against a Keboola table that's too big to sync, file an issue — the architecture is in place but the registration UX hasn't been polished.
### Jira — `query_mode: local` only
Event-driven: webhooks update parquets incrementally. No `remote` or `materialized` mode for Jira today.
## Worked examples
**1. Big BigQuery fact table you query weekly:** `query_mode: remote`. SA needs `dataViewer` + `jobUser`. Analyst uses `agnes query --remote` for one-off aggregates and `agnes snapshot create` for cross-week joins.
**2. Daily Keboola dimension table:** `query_mode: local`. Synced once a day by the scheduler; analyst's `agnes pull` picks it up.
**3. Monthly KPI aggregate from a BQ datawarehouse:** `query_mode: materialized` + `--sync-schedule "0 3 1 * *"` (3:00 on the 1st of each month). The server runs your aggregate SQL once a month; analysts get a parquet of the result.
## See also
- `docs/RBAC.md` — granting analysts access to a registered table.
- `config/instance.yaml.example` — the `data_source` config block.
- `agnes catalog --json` — inspect a registered table's mode + size hints.
- `agnes diagnose` — surface `bq_config` IAM issues and other health entries.
- Step 13.2: Quick smoke check that the doc builds without dead links
cd /tmp/agnes-metadata && python -c "
from pathlib import Path
content = Path('docs/admin/query-modes.md').read_text()
# Check the cross-referenced files exist.
for ref in ['docs/RBAC.md', 'config/instance.yaml.example']:
assert Path(ref).exists(), f'broken cross-ref: {ref}'
print('docs OK')
"
Expected: docs OK.
- Step 13.3: Commit
cd /tmp/agnes-metadata
git add docs/admin/query-modes.md
git commit -m "docs(admin): query-modes.md — when to register a table as local/remote/materialized
Source-agnostic decision tree + per-source-type reference (BigQuery,
Keboola, Jira). Three worked examples. Cross-references to RBAC.md +
instance.yaml.example.
Linked from the admin UI tooltip (Task 14) and the CLI post-register
hint (Task 12).
Refs: docs/superpowers/specs/2026-05-07-source-agnostic-table-metadata-spec.md"
Task 14: Admin UI integration on /admin/tables
Files:
-
Modify:
app/web/templates/admin_tables.html -
Test:
tests/test_admin_tables_warmup_ui.py -
Step 14.1: Write failing test
Create tests/test_admin_tables_warmup_ui.py:
"""Smoke test that /admin/tables HTML contains the cache toolbar markup,
the EventSource wiring, and the per-row col-status slot."""
def test_cache_toolbar_present(seeded_app):
c = seeded_app["client"]
token = seeded_app["admin_token"]
r = c.get(
"/admin/tables", headers={"Authorization": f"Bearer {token}"},
)
assert r.status_code == 200, r.text
body = r.text
assert 'id="cacheWarmupCard"' in body
assert "Re-warm all" in body
assert "/api/admin/cache-warmup/stream" in body
assert "EventSource" in body
def test_query_mode_doc_link_present(seeded_app):
c = seeded_app["client"]
token = seeded_app["admin_token"]
r = c.get(
"/admin/tables", headers={"Authorization": f"Bearer {token}"},
)
assert r.status_code == 200
assert "query-modes.md" in r.text or "/docs/admin/query-modes" in r.text
def test_col_status_th_present_in_renderer(seeded_app):
"""The renderRegistryListing JS still emits <th class='col-status'>
so the per-row badge slot exists."""
c = seeded_app["client"]
token = seeded_app["admin_token"]
r = c.get("/admin/tables", headers={"Authorization": f"Bearer {token}"})
assert 'class="col-status"' in r.text
- Step 14.2: Run test to verify it fails
cd /tmp/agnes-metadata && python -m pytest tests/test_admin_tables_warmup_ui.py -v
Expected: 2 of 3 fail (toolbar absent, EventSource not yet wired). Third probably passes.
- Step 14.3: Add the cache toolbar
<section>toadmin_tables.html
Edit app/web/templates/admin_tables.html. Locate the section between the page header and the per-source-type table listings — search for bqTableListing to find the area. Insert the new card just before the listings:
<section id="cacheWarmupCard" class="card" style="margin-bottom: 20px;">
<header class="card-header" style="display: flex; justify-content: space-between; align-items: center;">
<h2>Cache freshness</h2>
<button class="btn btn-secondary" id="cacheWarmupRunBtn" onclick="cacheWarmupRun()">
Re-warm all
</button>
</header>
<div class="card-body">
<div id="cacheWarmupProgress" style="margin-bottom: 8px;">
<span id="cacheWarmupSummary">Loading…</span>
</div>
<progress id="cacheWarmupBar" max="100" value="0" style="width: 100%; display: none;"></progress>
<details style="margin-top: 8px;">
<summary style="cursor: pointer; user-select: none;">Show log</summary>
<pre id="cacheWarmupLog" style="background: #0a0a0a; color: #dcdcdc; font-family: ui-monospace, Menlo, monospace; font-size: 12px; padding: 8px; max-height: 240px; overflow-y: auto; margin-top: 8px; border-radius: 4px;"></pre>
</details>
</div>
</section>
- Step 14.4: Add the JS for live updates
Inside the existing <script> block at the bottom of admin_tables.html, append:
// ── Cache warmup toolbar (issue #155 / #156) ────────────────
let cacheWarmupSource = null;
function cacheWarmupInit() {
cacheWarmupRefreshSnapshot();
cacheWarmupOpenStream();
}
function cacheWarmupRefreshSnapshot() {
fetch('/api/admin/cache-warmup/status')
.then(function(r) { return r.json(); })
.then(function(state) { cacheWarmupRender(state); })
.catch(function() { /* silent */ });
}
function cacheWarmupOpenStream() {
try {
cacheWarmupSource = new EventSource('/api/admin/cache-warmup/stream');
cacheWarmupSource.addEventListener('start', cacheWarmupOnStart);
cacheWarmupSource.addEventListener('row', cacheWarmupOnRow);
cacheWarmupSource.addEventListener('complete', cacheWarmupOnComplete);
cacheWarmupSource.addEventListener('snapshot', function(e) {
cacheWarmupRender(JSON.parse(e.data));
});
cacheWarmupSource.onerror = function() {
// Polling fallback — close + retry every 3 s.
if (cacheWarmupSource) {
cacheWarmupSource.close();
cacheWarmupSource = null;
}
setTimeout(cacheWarmupRefreshSnapshot, 3000);
};
} catch (e) {
// EventSource unsupported; poll instead.
setInterval(cacheWarmupRefreshSnapshot, 3000);
}
}
function cacheWarmupRender(state) {
var summary = document.getElementById('cacheWarmupSummary');
var bar = document.getElementById('cacheWarmupBar');
var btn = document.getElementById('cacheWarmupRunBtn');
if (!summary) return;
if (!state || state.state === 'never_run') {
summary.textContent = 'No cache warmup yet — click Re-warm all to start.';
bar.style.display = 'none';
btn.disabled = false;
return;
}
var inProgress = state.completed_at === null || state.completed_at === undefined;
var pct = state.total > 0 ? Math.round((state.completed * 100) / state.total) : 0;
summary.textContent = inProgress
? state.completed + ' / ' + state.total + ' fresh — running…'
: 'Last run: ' + state.completed + ' ok, ' + state.failed + ' errors';
bar.style.display = 'block';
bar.value = pct;
btn.disabled = inProgress;
// Per-row badges.
if (state.rows) {
for (var tid in state.rows) {
cacheWarmupSetRowBadge(tid, state.rows[tid]);
}
}
}
function cacheWarmupOnStart(e) {
var data = JSON.parse(e.data);
var log = document.getElementById('cacheWarmupLog');
log.textContent = '';
cacheWarmupAppendLog(data.run_id ? '[' + nowHHMMSS() + '] start trigger=' + data.trigger + ' total=' + data.total : '');
cacheWarmupRefreshSnapshot();
}
function cacheWarmupOnRow(e) {
var rs = JSON.parse(e.data);
cacheWarmupAppendLog(
'[' + nowHHMMSS() + '] ' + rs.status.padEnd(7) + rs.table_id +
(rs.duration_ms ? ' (' + (rs.duration_ms / 1000).toFixed(1) + ' s)' : '') +
(rs.error ? ' ' + rs.error : '')
);
cacheWarmupSetRowBadge(rs.table_id, rs);
cacheWarmupRefreshSnapshot();
}
function cacheWarmupOnComplete(e) {
var data = JSON.parse(e.data);
cacheWarmupAppendLog(
'[' + nowHHMMSS() + '] complete total=' + data.total +
' ok=' + data.completed + ' fail=' + data.failed
);
cacheWarmupRefreshSnapshot();
}
function cacheWarmupAppendLog(line) {
var log = document.getElementById('cacheWarmupLog');
if (!log) return;
log.textContent += line + '\n';
log.scrollTop = log.scrollHeight;
}
function cacheWarmupSetRowBadge(tableId, rs) {
// Per-row badge in the col-status slot. Selector: any <tr>
// whose first <td class="col-id"> text matches tableId.
document.querySelectorAll('tr').forEach(function(tr) {
var idCell = tr.querySelector('td.col-id');
if (!idCell || idCell.textContent.trim() !== tableId) return;
var statusCell = tr.querySelector('td.col-status');
if (!statusCell) return;
var color = {fresh: '#10B77F', warming: '#0073D1', pending: '#9CA3AF', error: '#EA580C'}[rs.status] || '#9CA3AF';
var label = rs.status === 'fresh' ? 'fresh' : rs.status;
statusCell.innerHTML = '<span style="display:inline-block;padding:2px 6px;border-radius:3px;font-size:11px;background:' + color + ';color:white;" title="' + (rs.error || '') + '">' + label + '</span>';
});
}
function nowHHMMSS() {
var d = new Date();
return d.toTimeString().slice(0, 8);
}
function cacheWarmupRun() {
var btn = document.getElementById('cacheWarmupRunBtn');
btn.disabled = true;
fetch('/api/admin/cache-warmup/run', {method: 'POST'})
.then(function(r) { return r.json(); })
.then(function() { /* SSE stream picks up the new run */ })
.catch(function() { btn.disabled = false; });
}
// Init on page load.
document.addEventListener('DOMContentLoaded', cacheWarmupInit);
- Step 14.5: Add
?icon next toquery_modefield in the edit modal
Find the edit-modal block in admin_tables.html (search for editBqQueryMode or similar). Next to the query_mode field's <label>, append:
<a href="/docs/admin/query-modes.md" target="_blank" title="When to use which mode" style="margin-left: 6px; text-decoration: none;">?</a>
If the docs aren't served at /docs/admin/..., link to the GitHub URL instead (look at recent edits — cli/lib/setup_instructions.py may have a server-side doc base URL).
- Step 14.6: Run smoke tests
cd /tmp/agnes-metadata && python -m pytest tests/test_admin_tables_warmup_ui.py -v
Expected: 3 passed.
- Step 14.7: Manual visual check (optional but recommended)
If a dev instance is available, browse to /admin/tables, register a remote BQ table, click "Re-warm all", and verify the log scrolls + per-row badges update.
- Step 14.8: Commit
cd /tmp/agnes-metadata
git add app/web/templates/admin_tables.html tests/test_admin_tables_warmup_ui.py
git commit -m "feat(admin/tables): cache freshness toolbar + per-row badges + SSE log
Adds a single cache panel above the per-source-type listings:
- Re-warm all button → POST /api/admin/cache-warmup/run.
- Progress bar + summary line.
- Collapsible terminal-style log fed by EventSource on
/api/admin/cache-warmup/stream (polling fallback at 3 s on SSE error).
- Per-row badge in the col-status slot updates live as warmup events
arrive.
- ? icon next to the query_mode field in the edit modal links to
docs/admin/query-modes.md.
Refs: docs/superpowers/specs/2026-05-07-source-agnostic-table-metadata-spec.md"
Task 15: CHANGELOG + version bump
Files:
-
Modify:
CHANGELOG.md -
Modify:
pyproject.toml -
Step 15.1: Bump version
Edit pyproject.toml. Find:
version = "0.45.x"
Change to:
version = "0.46.0"
- Step 15.2: Add CHANGELOG entry
Edit CHANGELOG.md. Find ## [Unreleased] near the top. Replace with:
## [Unreleased]
## [0.46.0] — 2026-05-07
Catalog metadata enrichment + cache discipline + automatic warmup.
Closes #155 + #156.
### Added
- **`/api/v2/catalog` returns four new optional fields per row** — `rows`,
`size_bytes`, `partition_by`, `clustered_by` — populated by per-source-type
metadata providers (`connectors/bigquery/metadata.py`,
`connectors/keboola/metadata.py`). For `query_mode='remote'` BigQuery rows,
`size_bytes` is `active_logical_bytes + long_term_logical_bytes` (a full
scan reads both); region resolved from `data_source.bigquery.location` →
`bq_client.get_dataset(...)` → fall back to legacy `__TABLES__`.
Existing CLI consumers reading only `rough_size_hint` are unaffected.
- **Automatic cache warmup at startup.** FastAPI startup event schedules
a background task that walks BQ remote rows and pre-populates
`_metadata_cache` + `_schema_cache` with bounded concurrency (default 4,
tunable via `AGNES_WARMUP_CONCURRENCY`). Doesn't block readiness;
per-row failures logged + skipped. Opt-out via `AGNES_SKIP_CACHE_WARMUP=1`.
- **Three new admin endpoints under `/api/admin/cache-warmup/*`:**
- `GET /status` — JSON snapshot of the latest run.
- `POST /run` — manual trigger, idempotent under concurrent invocation.
- `GET /stream` — Server-Sent Events with `start` / `row` / `complete`
events for live UI updates.
- **`/admin/tables` cache freshness panel.** Toolbar above the per-source-type
listings with progress bar + "Re-warm all" button + collapsible
terminal-style log fed by SSE (polling fallback at 3 s). Per-row badge
in the existing `col-status` column updates live (fresh / warming /
pending / error).
- **`docs/admin/query-modes.md`** — source-agnostic admin reference for
registering tables as `local` / `remote` / `materialized`. Decision
tree, per-source-type IAM + setup, three worked examples. Linked from
the `?` icon next to the `query_mode` field in the admin UI edit modal
and from the third post-register hint in `agnes admin register-table`.
- **`agnes admin register-table` post-register hint** for `query_mode=remote`:
points at `agnes query --remote "SELECT COUNT(*)..."` as the IAM smoke
check so a missing `dataViewer` / `jobUser` surfaces at registration
time, not 30 minutes later.
### Changed
- **`/api/v2/schema/{id}` cache miss now does 1 BQ job instead of 2.**
`connectors/bigquery/access.py:fetch_bq_columns_full` collapses what
used to be `_fetch_bq_schema` + `_fetch_bq_table_options` into a single
`INFORMATION_SCHEMA.COLUMNS` query (same view, same predicate, just a
combined SELECT list). The metadata provider's partition/cluster path
shares the same helper — zero SQL duplication across the two consumers.
- **All four catalog/schema/sample/metadata caches are flushed on registry
change.** `app/api/v2_catalog.py:invalidate_for_table` is wired into
`POST /api/admin/register-table`, `PUT /api/admin/registry/{id}`, and
`DELETE /api/admin/registry/{id}`. After a registry write, a single-row
re-warm task is scheduled in the background so the admin's verification
request hits warm caches within ~1 s instead of waiting for the next
analyst miss. Pre-fix none of the caches were invalidated — admin
registers a table, `agnes catalog` doesn't show the new row for up to
5 min; admin updates a row's bucket, `agnes schema` returns the OLD
column list for up to 1 hour.
- **`v2_schema.build_schema` split into RBAC-aware outer + RBAC-naive
inner (`build_schema_uncached`).** Live endpoint behavior unchanged;
warmup uses the inner entry point to populate `_schema_cache` without
a user context.
### Internal
- New shared dataclass module `app/api/_metadata_models.py` with
`MetadataRequest` (frozen) + `TableMetadata` for source-agnostic
provider input/output.
- New `connectors/keboola/storage_api.py:KeboolaStorageClient.get_table_info`
thin wrapper — keeps `_get` private to the module.
- New env vars (operator-facing tuning, no required setup change):
- `AGNES_SKIP_CACHE_WARMUP` — opt-out of startup warmup.
- `AGNES_WARMUP_CONCURRENCY` — default 4, max parallel BQ
INFORMATION_SCHEMA jobs during a warmup pass.
- New runtime dependency: `sse-starlette>=2.0` (Server-Sent Events
responses for the cache-warmup stream).
- Tests added: `test_metadata_models`, `test_v2_schema_columns_consolidation`,
`test_v2_catalog_dispatcher`, `test_connectors_bigquery_metadata`,
`test_connectors_keboola_metadata`, `test_v2_catalog_remote_metadata`,
`test_v2_catalog_invalidation`, `test_cache_warmup`,
`test_main_startup_warmup`, `test_admin_tables_warmup_ui`.
- Step 15.3: Smoke-check version + changelog consistency
cd /tmp/agnes-metadata && grep '^version' pyproject.toml
cd /tmp/agnes-metadata && head -20 CHANGELOG.md
Expected: pyproject shows 0.46.0; CHANGELOG has ## [0.46.0] — 2026-05-07 directly after ## [Unreleased].
- Step 15.4: Run the full targeted test set
cd /tmp/agnes-metadata && python -m pytest \
tests/test_metadata_models.py \
tests/test_keboola_storage_api.py::TestGetTableInfo \
tests/test_v2_schema_columns_consolidation.py \
tests/test_v2_schema.py \
tests/test_v2_catalog_dispatcher.py \
tests/test_connectors_bigquery_metadata.py \
tests/test_connectors_keboola_metadata.py \
tests/test_v2_catalog_remote_metadata.py \
tests/test_v2_catalog_invalidation.py \
tests/test_cache_warmup.py \
tests/test_main_startup_warmup.py \
tests/test_admin_tables_warmup_ui.py \
tests/test_cli_admin.py::TestRegisterTableHints \
-v
Expected: all green.
- Step 15.5: Commit
cd /tmp/agnes-metadata
git add CHANGELOG.md pyproject.toml
git commit -m "release: 0.46.0 — source-agnostic catalog metadata + cache discipline
PR closes #155 + #156.
Highlights:
- Catalog response gains rows / size_bytes / partition_by / clustered_by.
- /api/v2/schema cache miss: 2 BQ jobs → 1 (-50%).
- All 4 catalog/schema/sample/metadata caches flush on registry change.
- Automatic startup warmup with bounded concurrency.
- /admin/tables shows cache freshness toolbar + per-row badges + SSE log.
- docs/admin/query-modes.md as the single source of truth for mode choice.
No DB migration; no wire break; MIN_COMPAT_CLI_VERSION unchanged.
Refs: docs/superpowers/specs/2026-05-07-source-agnostic-table-metadata-spec.md"
Task 16: Final verification + PR
- Step 16.1: Push branch
cd /tmp/agnes-metadata && git push -u origin zs/issue-155-156-source-agnostic-metadata
- Step 16.2: Open PR with auto-close keywords PER ISSUE
gh pr create --title "release: 0.46.0 — source-agnostic catalog metadata + cache discipline" \
--body "$(cat <<'EOF'
## Summary
- Catalog enrichment for `query_mode='remote'` rows: rows + size_bytes + partition_by + clustered_by.
- `/api/v2/schema` cache miss: 2 BQ jobs → 1 (-50%) via shared `fetch_bq_columns_full`.
- All four catalog/schema/sample/metadata caches flush on registry change; single-row re-warm scheduled.
- Automatic cache warmup at server startup; SSE-driven freshness toolbar on `/admin/tables`.
Closes #155.
Closes #156.
## Test plan
- [x] Targeted: 50+ unit + integration tests across 12 new test modules.
- [x] No DB migration; no wire-break; MIN_COMPAT_CLI_VERSION unchanged.
- [ ] Reviewer: register a remote BQ table, observe `/admin/tables` toolbar populates within 2 s.
- [ ] Reviewer: trigger `Re-warm all`, verify SSE log scrolls and per-row badges update.
- [ ] Reviewer: edit a registered row's bucket, verify `agnes schema <id>` returns updated columns immediately (no 1-hour staleness).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
EOF
)"
Per the user's feedback_github_close_keyword_per_issue memory: each issue gets its own Closes #N keyword on its own line. A list after one keyword only auto-closes the first.
Self-Review
After writing this plan, I checked it against the spec end-to-end:
Spec coverage:
- ✅ Provider pattern + dispatcher → Tasks 5, 6, 7
- ✅ Identifier validation in dispatcher → Task 5
- ✅ COLUMNS query consolidation → Task 3
- ✅
build_schemaRBAC/cache split → Task 4 - ✅ v2_catalog wiring + 15-min metadata cache → Task 8
- ✅ Unified
invalidate_for_table→ Task 9 - ✅ Cache warmup framework + endpoints → Task 10
- ✅ FastAPI startup hook → Task 11
- ✅ CLI post-register hint → Task 12
- ✅ docs/admin/query-modes.md → Task 13
- ✅ Admin UI integration → Task 14
- ✅ CHANGELOG + version bump → Task 15
Placeholder scan: none of the "TBD", "TODO", "implement later" patterns. Every step has actual code or an exact command.
Type consistency: MetadataRequest / TableMetadata / WarmupRunState / WarmupRowState field names referenced consistently across tasks. fetch_bq_columns_full signature consistent between Task 3 and Task 7.
One known nit: Task 9's _rewarm_one_row references cache_warmup.warm_one_table which is defined in Task 10. The lazy import inside the function + outer try/except handles the during-development gap, but if a subagent runs Tasks 9 and 10 strictly in order without the lazy import, the test in Task 9 still passes because it patches asyncio.create_task. Documented in Task 9 step 9.3.
Ready for execution.