2026-05-14 18:54:22 +00:00

Source-Agnostic Table Metadata Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Enrich /api/v2/catalog with cost-relevant per-table metadata (row count, size, partition, cluster) for query_mode='remote' rows via source-agnostic providers; add unified cache invalidation across all four catalog/schema/sample/metadata caches with single-row re-warm; halve BQ jobs on /api/v2/schema cache miss by consolidating two INFORMATION_SCHEMA.COLUMNS queries into one; auto-warm caches at server startup; surface progress via SSE on /admin/tables. Closes #155 + #156.

Architecture: Per-source metadata.py module exposing fetch(MetadataRequest) -> TableMetadata | None; dispatched from app/api/v2_catalog.py:_size_hint_for_row after identifier validation; results cached 15 min keyed by table_id. Cache invalidation owned by v2_catalog.invalidate_for_table flushing four TTLCaches and scheduling a single-row re-warm. Warmup runs as a FastAPI startup background task with bounded concurrency (default 4) and exposes a JSON status endpoint + SSE stream consumed by a new toolbar block on /admin/tables.
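
The caching and invalidation described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the `TTLCache` class shown here, its `pop` method, and the `rewarm` callback are assumptions, and the real caches may be keyed differently from plain `table_id`.

```python
import time
from typing import Any, Callable


class TTLCache:
    """Tiny TTL cache; `clock` is injectable so expiry can be tested."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._items: dict[str, tuple[float, Any]] = {}

    def get(self, key: str, default: Any = None) -> Any:
        entry = self._items.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # lazily evict the expired entry
            return default
        return value

    def set(self, key: str, value: Any) -> None:
        self._items[key] = (self._clock() + self._ttl, value)

    def pop(self, key: str) -> None:
        self._items.pop(key, None)


def invalidate_for_table(table_id: str, caches: list[TTLCache],
                         rewarm: Callable[[str], None]) -> None:
    """Drop `table_id` from every cache, then re-warm only that row."""
    for cache in caches:
        cache.pop(table_id)
    rewarm(table_id)  # hypothetical hook; the plan schedules this in the background
```

Passing the caches in as a list keeps the flush logic in one place, which is the point of a single owner for invalidation: adding another cache later means extending the list rather than touching every admin endpoint.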

Tech Stack: Python 3.11+, FastAPI, DuckDB (with BigQuery extension), pytest, sse-starlette, vanilla JS (no framework) in admin templates.

Source spec: docs/superpowers/specs/2026-05-07-source-agnostic-table-metadata-spec.md (HEAD a6a660bc).

Branch: zs/issue-155-156-source-agnostic-metadata (worktree at /tmp/agnes-metadata).

File map

New:

app/api/_metadata_models.py — dataclasses MetadataRequest, TableMetadata
connectors/bigquery/metadata.py — BQ provider
connectors/keboola/metadata.py — Keboola provider
app/api/cache_warmup.py — warmup state + background task + endpoints
docs/admin/query-modes.md — admin doc page
tests/test_connectors_bigquery_metadata.py
tests/test_connectors_keboola_metadata.py
tests/test_v2_catalog_remote_metadata.py
tests/test_v2_catalog_invalidation.py
tests/test_cache_warmup.py
tests/test_admin_tables_warmup_ui.py
tests/test_v2_schema_columns_consolidation.py

Modified:

app/api/v2_catalog.py — provider dispatch + 15-min metadata cache + new response fields + invalidate_for_table
app/api/v2_schema.py — split RBAC/cache from BQ work; rewire to shared fetch_bq_columns_full
connectors/bigquery/access.py — append fetch_bq_columns_full helper
connectors/keboola/storage_api.py — append get_table_info thin wrapper
app/api/admin.py — wire invalidate_for_table into register/update/unregister
app/main.py — register warm_catalog_caches startup event
cli/commands/admin.py — third post-register hint for query_mode=remote
app/web/templates/admin_tables.html — cache toolbar <section>, per-row col-status badge, EventSource JS, ? doc link on query_mode field
pyproject.toml — version bump to 0.46.0; add sse-starlette to dependencies if not present
CHANGELOG.md — ## [0.46.0] section

Task 1: Shared dataclass models

Files:

Create: app/api/_metadata_models.py
Test: tests/test_metadata_models.py

Step 1.1: Write failing test

Create tests/test_metadata_models.py:

"""Sanity tests for the shared metadata dataclasses."""

from app.api._metadata_models import MetadataRequest, TableMetadata


def test_metadata_request_constructs():
    req = MetadataRequest(
        table_id="orders", bucket="dwh_base", source_table="orders_2024",
    )
    assert req.table_id == "orders"
    assert req.bucket == "dwh_base"
    assert req.source_table == "orders_2024"


def test_metadata_request_is_frozen():
    """Frozen so cache keys derived from a request are stable."""
    import dataclasses

    import pytest

    req = MetadataRequest(table_id="x", bucket="b", source_table="t")
    # Assigning to any field of a frozen dataclass raises FrozenInstanceError.
    with pytest.raises(dataclasses.FrozenInstanceError):
        req.bucket = "other"


def test_table_metadata_all_fields_optional():
    tm = TableMetadata()
    assert tm.rows is None
    assert tm.size_bytes is None
    assert tm.partition_by is None
    assert tm.clustered_by is None


def test_table_metadata_partial_population():
    tm = TableMetadata(rows=100, size_bytes=2048)
    assert tm.rows == 100
    assert tm.size_bytes == 2048
    assert tm.partition_by is None
    assert tm.clustered_by is None

Step 1.2: Run test to verify it fails

cd /tmp/agnes-metadata && python -m pytest tests/test_metadata_models.py -v

Expected: FAIL with ModuleNotFoundError: No module named 'app.api._metadata_models'.

Step 1.3: Implement the module

Create app/api/_metadata_models.py:

"""Shared data shapes for source-agnostic table-metadata providers.

Lives under `app/api/` because the primary consumer is
`app/api/v2_catalog.py`. Connector-side providers in `connectors/<source>/`
import upward into this module — the inverse layering would force
`v2_catalog.py` to depend on `connectors/__init__.py`, which is the
wrong direction.
"""

from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataRequest:
    """Narrow input passed to a metadata provider's `fetch()`.

    `bucket` and `source_table` are pre-validated by the dispatcher
    (`validate_quoted_identifier`) before construction, so the provider
    can interpolate them into SQL/URL paths without re-checking. Frozen
    so the (provider, request)-keyed cache lookup is stable.
    """
    table_id: str
    bucket: str
    source_table: str


@dataclass
class TableMetadata:
    """Source-agnostic metadata bundle. Every field optional — providers
    fill what they can cheaply get; callers tolerate `None`. Adding a new
    field here is a non-breaking change: existing CLI consumers don't
    even render `rough_size_hint` (verified `grep -rn rough_size_hint cli/`
    is empty), let alone the new fields.
    """
    rows: int | None = None
    size_bytes: int | None = None
    partition_by: str | None = None
    clustered_by: list[str] | None = None

Step 1.4: Run test to verify it passes

cd /tmp/agnes-metadata && python -m pytest tests/test_metadata_models.py -v

Expected: 4 passed.
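
As a quick illustration of why `MetadataRequest` is frozen: a frozen dataclass with the default `eq=True` is hashable, so two equal requests resolve to the same dictionary entry. The sketch below copies the two dataclasses from Step 1.3 so it runs on its own; the `cache` dict and the `asdict` response shaping are illustrative assumptions, not part of the plan.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class MetadataRequest:  # copied from Step 1.3 so the sketch is self-contained
    table_id: str
    bucket: str
    source_table: str


@dataclass
class TableMetadata:  # copied from Step 1.3
    rows: int | None = None
    size_bytes: int | None = None
    partition_by: str | None = None
    clustered_by: list[str] | None = None


req = MetadataRequest(table_id="orders", bucket="dwh_base", source_table="orders_2024")

# frozen=True plus the default eq=True generates __hash__, so an equal
# request built later finds the same entry.
cache = {req: TableMetadata(rows=100)}
same = MetadataRequest("orders", "dwh_base", "orders_2024")
hit = cache[same]

# Hypothetical response shaping: fields the provider left unset stay None.
payload = asdict(hit)
```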

Step 1.5: Commit

cd /tmp/agnes-metadata
git add app/api/_metadata_models.py tests/test_metadata_models.py
git commit -m "feat(metadata): MetadataRequest + TableMetadata dataclasses"

Task 2: KeboolaStorageClient.get_table_info wrapper

Files:

Modify: connectors/keboola/storage_api.py (append method)
Test: tests/test_keboola_storage_api.py (append test)

Step 2.1: Write failing test

Append to tests/test_keboola_storage_api.py:

class TestGetTableInfo:
    """`get_table_info` is a thin wrapper around the existing _get path
    so the metadata provider doesn't have to bleed `_get` out of the
    module (#155)."""

    def test_calls_storage_api_with_table_id(self, monkeypatch):
        from connectors.keboola.storage_api import KeboolaStorageClient

        captured = {}

        def fake_get(self, path, **kwargs):
            captured["path"] = path
            return {"rowsCount": 100, "dataSizeBytes": 4096}

        monkeypatch.setattr(KeboolaStorageClient, "_get", fake_get)

        client = KeboolaStorageClient(
            url="https://connection.keboola.com", token="tok"
        )
        info = client.get_table_info("in.c-orders.events")
        assert captured["path"] == "/tables/in.c-orders.events"
        assert info["rowsCount"] == 100
        assert info["dataSizeBytes"] == 4096

    def test_propagates_storage_api_error(self, monkeypatch):
        from connectors.keboola.storage_api import (
            KeboolaStorageClient, StorageApiError,
        )

        def fake_get(self, path, **kwargs):
            raise StorageApiError("404 not found", status=404, body={})

        monkeypatch.setattr(KeboolaStorageClient, "_get", fake_get)

        client = KeboolaStorageClient(url="https://x", token="tok")
        import pytest
        with pytest.raises(StorageApiError):
            client.get_table_info("missing.table")

Step 2.2: Run test to verify it fails

cd /tmp/agnes-metadata && python -m pytest tests/test_keboola_storage_api.py::TestGetTableInfo -v

Expected: FAIL with AttributeError: 'KeboolaStorageClient' object has no attribute 'get_table_info'.

Step 2.3: Implement the wrapper

Find the line in connectors/keboola/storage_api.py immediately after the export_table_async method (around line 286) and append the new method to the KeboolaStorageClient class:

    def get_table_info(self, table_id: str) -> dict:
        """GET /v2/storage/tables/{table_id} — full table metadata.

        Storage API guarantees `rowsCount` + `dataSizeBytes` on success.
        Other fields (`columns`, `primaryKey`, ...) are present but not
        consumed by the metadata provider today. Raises `StorageApiError`
        on 4xx/5xx — caller decides whether to soften to `None`.
        """
        return self._get(f"/tables/{table_id}")

Step 2.4: Run test to verify it passes

cd /tmp/agnes-metadata && python -m pytest tests/test_keboola_storage_api.py::TestGetTableInfo -v

Expected: 2 passed.
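
The docstring leaves it to the caller to decide whether a `StorageApiError` is softened to `None`. A minimal sketch of what that caller-side softening might look like is below. Everything here is a stand-in: `metadata_from_table_info` is a hypothetical name, `StorageApiError` is redefined locally because its real constructor is not shown in this plan, and `TableMetadata` is a trimmed copy of the Task 1 dataclass.

```python
from __future__ import annotations

from dataclasses import dataclass


class StorageApiError(Exception):
    """Local stand-in for the connector's error type."""


@dataclass
class TableMetadata:  # trimmed copy of the Task 1 dataclass
    rows: int | None = None
    size_bytes: int | None = None


def metadata_from_table_info(client, table_id: str) -> TableMetadata | None:
    """Hypothetical helper: call get_table_info and soften API errors to None."""
    try:
        info = client.get_table_info(table_id)
    except StorageApiError:
        # Best-effort: a missing or forbidden table yields "no metadata".
        return None
    return TableMetadata(
        rows=info.get("rowsCount"),
        size_bytes=info.get("dataSizeBytes"),
    )
```

Keeping the wrapper itself strict and softening only at the call site means other callers of `get_table_info` can still see the original error.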

Step 2.5: Commit

cd /tmp/agnes-metadata
git add connectors/keboola/storage_api.py tests/test_keboola_storage_api.py
git commit -m "feat(keboola): KeboolaStorageClient.get_table_info thin wrapper"

Task 3: Combined `fetch_bq_columns_full` helper + v2_schema rewire

This task halves the BQ job count on /api/v2/schema/{id} cache miss (2 jobs → 1) by collapsing the duplicate INFORMATION_SCHEMA.COLUMNS queries.

Files:

Modify: connectors/bigquery/access.py (append helper)
Modify: app/api/v2_schema.py (rewire _fetch_bq_schema and _fetch_bq_table_options)
Test: tests/test_v2_schema_columns_consolidation.py (new)

Step 3.1: Write failing test for the consolidation

Create tests/test_v2_schema_columns_consolidation.py:

"""Asserts that the shared `fetch_bq_columns_full` helper behind
/api/v2/schema/{id} issues exactly ONE bigquery_query() call per
invocation (the pre-#155 schema path ran two queries), and covers its
return shape and failure modes.

Counts via `session.execute.call_count` on the mocked DuckDB session.
"""

from unittest.mock import MagicMock


def _mock_duckdb_session_returning(rows):
    """Build a context-manager mock whose session returns `rows` on .fetchall().

    Returns `(cm, session)`; assert on `session.execute.call_count`.
    """
    session = MagicMock()
    session.execute.return_value.fetchall.return_value = rows
    cm = MagicMock()
    cm.__enter__.return_value = session
    cm.__exit__.return_value = False
    return cm, session


def test_fetch_bq_columns_full_is_single_query():
    """The new shared helper makes exactly ONE call to bigquery_query."""
    from connectors.bigquery.access import fetch_bq_columns_full

    bq = MagicMock()
    bq.projects.data = "data-proj"
    bq.projects.billing = "billing-proj"
    cm, session = _mock_duckdb_session_returning([
        ("event_date", "DATE", "NO", "YES", None),
        ("country", "STRING", "YES", "NO", 1),
        ("user_id", "STRING", "NO", "NO", None),
    ])
    bq.duckdb_session.return_value = cm

    rows = fetch_bq_columns_full(bq, "dwh_base", "events")
    assert len(rows) == 3
    # Exactly one bigquery_query() call — no second round-trip.
    assert session.execute.call_count == 1
    first_call = session.execute.call_args_list[0]
    # Outer wrapper SQL is bigquery_query(?, ?, ?)
    assert "bigquery_query" in first_call.args[0]
    # Inner BQ SQL pulls all five columns we need at once.
    inner_sql = first_call.args[1][1]
    assert "column_name" in inner_sql
    assert "data_type" in inner_sql
    assert "is_nullable" in inner_sql
    assert "is_partitioning_column" in inner_sql
    assert "clustering_ordinal_position" in inner_sql


def test_fetch_bq_columns_full_returns_dicts():
    """Each row is a dict with the documented keys."""
    from connectors.bigquery.access import fetch_bq_columns_full

    bq = MagicMock()
    bq.projects.data = "data-proj"
    bq.projects.billing = "billing-proj"
    cm, _ = _mock_duckdb_session_returning([
        ("event_date", "DATE", "NO", "YES", None),
    ])
    bq.duckdb_session.return_value = cm

    rows = fetch_bq_columns_full(bq, "dwh_base", "events")
    assert rows == [{
        "name": "event_date",
        "type": "DATE",
        "nullable": False,
        "is_partitioning_column": True,
        "clustering_ordinal_position": None,
    }]


def test_fetch_bq_columns_full_returns_none_when_unconfigured():
    """Sentinel BqAccess (data project empty) → return None, no query."""
    from connectors.bigquery.access import fetch_bq_columns_full

    bq = MagicMock()
    bq.projects.data = ""  # sentinel
    rows = fetch_bq_columns_full(bq, "dwh_base", "events")
    assert rows is None
    bq.duckdb_session.assert_not_called()


def test_fetch_bq_columns_full_returns_none_on_unsafe_identifier():
    """Refuses to interpolate identifiers that fail validation."""
    from connectors.bigquery.access import fetch_bq_columns_full

    bq = MagicMock()
    bq.projects.data = "data-proj"
    rows = fetch_bq_columns_full(bq, "evil`; DROP--", "events")
    assert rows is None
    bq.duckdb_session.assert_not_called()


def test_fetch_bq_columns_full_returns_none_on_query_error():
    """BQ failure → log + None; never raises."""
    from connectors.bigquery.access import fetch_bq_columns_full

    bq = MagicMock()
    bq.projects.data = "data-proj"
    bq.projects.billing = "billing-proj"
    cm = MagicMock()
    cm.__enter__.return_value.execute.side_effect = RuntimeError("BQ down")
    cm.__exit__.return_value = False
    bq.duckdb_session.return_value = cm

    rows = fetch_bq_columns_full(bq, "dwh_base", "events")
    assert rows is None

Step 3.2: Run test to verify it fails

cd /tmp/agnes-metadata && python -m pytest tests/test_v2_schema_columns_consolidation.py -v

Expected: FAIL with ImportError: cannot import name 'fetch_bq_columns_full' from 'connectors.bigquery.access'.

Step 3.3: Implement fetch_bq_columns_full

Append to connectors/bigquery/access.py, either at the end of the file or just before the module-level _BILLING_PROJECT_RE block if one exists; the position does not affect behavior:

def fetch_bq_columns_full(bq, dataset: str, table: str) -> list[dict] | None:
    """Single round-trip to INFORMATION_SCHEMA.COLUMNS pulling everything
    both v2_schema and the metadata provider need.

    Returns one dict per column with the keys ``name``, ``type``,
    ``nullable``, ``is_partitioning_column``, ``clustering_ordinal_position``.
    Consumers project the fields they care about.

    Best-effort: returns ``None`` on any failure (sentinel-unconfigured,
    unsafe identifier, BQ query exception). Does NOT raise. Mirrors the
    failure posture of `app/api/v2_schema.py:_fetch_bq_table_options`,
    which it replaces.

    Replaces two BQ jobs (one for column list + one for partition/cluster)
    with one — half the on-demand cost on each `/api/v2/schema/{id}`
    cache miss (issue #155 / spec).
    """
    from src.identifier_validation import validate_quoted_identifier

    if not bq.projects.data:
        return None

    if not (validate_quoted_identifier(bq.projects.data, "BQ project")
            and validate_quoted_identifier(dataset, "BQ dataset")
            and validate_quoted_identifier(table, "BQ source_table")):
        return None

    bq_sql = (
        f"SELECT column_name, data_type, is_nullable, "
        f"       is_partitioning_column, clustering_ordinal_position "
        f"FROM `{bq.projects.data}.{dataset}.INFORMATION_SCHEMA.COLUMNS` "
        f"WHERE table_name = ? ORDER BY ordinal_position"
    )
    try:
        with bq.duckdb_session() as conn:
            rows = conn.execute(
                "SELECT * FROM bigquery_query(?, ?, ?)",
                [bq.projects.billing, bq_sql, table],
            ).fetchall()
    except Exception as e:
        logger.warning(
            "BQ COLUMNS fetch failed for %s.%s.%s: %s",
            bq.projects.data, dataset, table, e,
        )
        return None

    return [
        {
            "name": r[0],
            "type": r[1],
            "nullable": (r[2] or "").upper() == "YES",
            "is_partitioning_column": (r[3] or "").upper() == "YES",
            "clustering_ordinal_position": r[4],
        }
        for r in rows
    ]

If connectors/bigquery/access.py does not already define a module-level logger, add import logging and logger = logging.getLogger(__name__) near the top. Check first; the file likely already has one.

Step 3.4: Run consolidation tests to verify they pass

cd /tmp/agnes-metadata && python -m pytest tests/test_v2_schema_columns_consolidation.py -v

Expected: 5 passed.
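
The helper relies on `validate_quoted_identifier` to refuse values such as "evil`; DROP--" before they are interpolated into backtick-quoted SQL. Its implementation lives in src/identifier_validation.py and is not shown in this plan. The sketch below only illustrates the general allow-list idea; the name `is_safe_identifier` and the exact character set are assumptions, not the project's rules.

```python
import re

# Assumed allow-list: letters, digits, underscore, dot and hyphen.
# Backticks, quotes, semicolons and whitespace all fall outside it.
_SAFE_IDENTIFIER = re.compile(r"[A-Za-z0-9_.\-]+")


def is_safe_identifier(value: str) -> bool:
    """Return True only if the whole string matches the allow-list."""
    return bool(_SAFE_IDENTIFIER.fullmatch(value or ""))
```

An allow-list with `fullmatch` rejects anything unexpected by default, which is generally safer than trying to enumerate dangerous characters.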

Step 3.5: Rewire v2_schema._fetch_bq_schema

Edit app/api/v2_schema.py — replace the body of _fetch_bq_schema (function starting around line 33). The function currently builds and runs its own INFORMATION_SCHEMA.COLUMNS query; rewire it to call fetch_bq_columns_full and project the column-list shape.

Locate def _fetch_bq_schema(...) and replace its full body with:

def _fetch_bq_schema(bq, dataset: str, table: str) -> list[dict]:
    """Fetch column list via the shared `fetch_bq_columns_full` helper.

    Pre-#155 this had its own INFORMATION_SCHEMA.COLUMNS query; consolidating
    with `_fetch_bq_table_options` (now also delegating to the same helper)
    halves the BQ job count on cache miss. Returns the schema-endpoint
    column shape: name / type / nullable / description.
    """
    from connectors.bigquery.access import fetch_bq_columns_full, BqAccessError

    # Surface "BQ not configured" as the structured 500 BqAccessError
    # (with hint), not a misleading empty-list. Mirrors pre-refactor
    # behavior — see Devin BUG_0002 in the original docstring.
    if not bq.projects.data:
        bq.client()  # raises BqAccessError(not_configured); endpoint catches it

    rows = fetch_bq_columns_full(bq, dataset, table)
    if rows is None:
        # Identifier validation refused, or BQ raised. The legacy code
        # path raised `ValueError("unsafe BQ identifier in registry")` for
        # the validation case; the BQ-error case raised via translate_bq_error.
        # The new helper conflates both into None; reproduce the legacy
        # error-surfacing here so callers get the same exceptions.
        from src.identifier_validation import validate_quoted_identifier
        if not (validate_quoted_identifier(bq.projects.data, "BQ project")
                and validate_quoted_identifier(dataset, "BQ dataset")
                and validate_quoted_identifier(table, "BQ source_table")):
            raise ValueError("unsafe BQ identifier in registry — refusing to query")
        # Otherwise it's a BQ-side error; raise the same shape the legacy
        # code raised (HTTP 502 via the endpoint's translate_bq_error).
        from connectors.bigquery.access import translate_bq_error
        raise translate_bq_error(
            RuntimeError("BQ INFORMATION_SCHEMA.COLUMNS query failed"),
            bq.projects, bad_request_status="upstream_error",
        )

    return [
        {
            "name": r["name"],
            "type": r["type"],
            "nullable": r["nullable"],
            "description": "",
        }
        for r in rows
    ]

Step 3.6: Rewire v2_schema._fetch_bq_table_options

Same file. Replace the full body of _fetch_bq_table_options (around line 85):

def _fetch_bq_table_options(bq, dataset: str, table: str) -> dict:
    """Best-effort fetch of partition/cluster info via the shared
    `fetch_bq_columns_full` helper.

    Returns ``{}`` on ANY failure (best-effort). Same load-bearing
    contract as before: the /schema endpoint must keep returning 200
    with empty partition info when this fails.
    """
    from connectors.bigquery.access import fetch_bq_columns_full

    rows = fetch_bq_columns_full(bq, dataset, table)
    if not rows:
        return {}

    partition_by = next(
        (r["name"] for r in rows if r["is_partitioning_column"]),
        None,
    )
    clustered_rows = [r for r in rows if r["clustering_ordinal_position"] is not None]
    clustered_rows.sort(key=lambda r: r["clustering_ordinal_position"])
    clustered_by = [r["name"] for r in clustered_rows]
    return {"partition_by": partition_by, "clustered_by": clustered_by}

Step 3.7: Run existing schema tests to verify no regression

cd /tmp/agnes-metadata && python -m pytest tests/test_v2_schema.py -v

Expected: all existing tests pass (no behavior change for /api/v2/schema/{id} consumers — same response shape).

If a test fails, the most likely causes are: (a) the test mocks _fetch_bq_schema or _fetch_bq_table_options directly with the old signature — patch the test to mock fetch_bq_columns_full instead; (b) a test relies on the old error-surfacing path — check the docstring of step 3.5 and ensure the rewire matches.

Step 3.8: Run consolidation tests once more for full green

cd /tmp/agnes-metadata && python -m pytest tests/test_v2_schema_columns_consolidation.py tests/test_v2_schema.py -v

Expected: all green.
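
One thing to verify before committing: as written, _fetch_bq_schema (Step 3.5) and _fetch_bq_table_options (Step 3.6) each call fetch_bq_columns_full on their own, so a caller that invokes both for the same table would still issue two queries unless the rows are fetched once and shared, or memoized somewhere not shown here. A minimal sketch of the fetch-once shape follows. `schema_parts`, `project_columns`, `project_table_options`, and the `fetch_rows` callable are hypothetical names, and the error surfacing from Step 3.5 is left out for brevity.

```python
def project_columns(rows: list[dict]) -> list[dict]:
    """Schema-endpoint column shape, as in Step 3.5."""
    return [
        {"name": r["name"], "type": r["type"], "nullable": r["nullable"], "description": ""}
        for r in rows
    ]


def project_table_options(rows: list[dict]) -> dict:
    """Partition/cluster shape, as in Step 3.6."""
    partition_by = next((r["name"] for r in rows if r["is_partitioning_column"]), None)
    clustered = sorted(
        (r for r in rows if r["clustering_ordinal_position"] is not None),
        key=lambda r: r["clustering_ordinal_position"],
    )
    return {"partition_by": partition_by, "clustered_by": [r["name"] for r in clustered]}


def schema_parts(fetch_rows, dataset: str, table: str) -> tuple[list[dict], dict]:
    """Hypothetical caller: one fetch, two projections over the same rows."""
    rows = fetch_rows(dataset, table) or []  # None (failure) is treated as empty here
    return project_columns(rows), project_table_options(rows)
```

A call-count assertion at this level, rather than only on the helper, is what would confirm the two-jobs-to-one claim for the endpoint as a whole.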

Step 3.9: Commit

cd /tmp/agnes-metadata
git add connectors/bigquery/access.py app/api/v2_schema.py tests/test_v2_schema_columns_consolidation.py
git commit -m "perf(v2_schema): consolidate two INFORMATION_SCHEMA.COLUMNS queries to one

/api/v2/schema/{id} cache-miss path was making two BQ jobs against the
same view with the same predicate — one for column list, one for
partition/cluster. New shared connectors/bigquery/access.py:fetch_bq_columns_full
returns one resultset; both v2_schema._fetch_bq_schema and
_fetch_bq_table_options delegate. Same response shape, half the BQ jobs
on cache miss.

Same helper consumed by the upcoming metadata provider's partition/cluster
path — zero SQL duplication across the two consumers.

Refs: docs/superpowers/specs/2026-05-07-source-agnostic-table-metadata-spec.md"

Task 4: Split `build_schema` for warmup re-use

The warmup background task needs to populate _schema_cache without an authenticated user. Today build_schema mixes RBAC + cache + BQ work; extract the BQ-and-cache half so warmup can call it directly.
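
For context on the caller this split serves: the Architecture section describes warmup as a background task with bounded concurrency (default 4). One way to bound it is an asyncio.Semaphore around a thread hand-off, sketched below. This is an assumption about shape only: `warm_all` and `warm_one` are hypothetical names, and `warm_one` stands in for a blocking call such as the extracted uncached builder.

```python
import asyncio
from typing import Callable, Iterable


async def warm_all(
    table_ids: Iterable[str],
    warm_one: Callable[[str], object],
    concurrency: int = 4,
) -> dict[str, str]:
    """Run blocking `warm_one(table_id)` per id, at most `concurrency` at a time."""
    semaphore = asyncio.Semaphore(concurrency)
    status: dict[str, str] = {}

    async def _warm(table_id: str) -> None:
        async with semaphore:
            try:
                # warm_one blocks on a remote round-trip, so keep it off the event loop.
                await asyncio.to_thread(warm_one, table_id)
                status[table_id] = "ok"
            except Exception as exc:  # one bad table must not abort the rest
                status[table_id] = f"error: {exc}"

    await asyncio.gather(*(_warm(t) for t in table_ids))
    return status
```

The per-table status dict is the kind of state a JSON status endpoint or an SSE stream could report from, though the actual warmup state model is defined later in the plan.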

Files:

Modify: app/api/v2_schema.py
Test: tests/test_v2_schema.py (new test asserting the split)

Step 4.1: Write failing test

Append to tests/test_v2_schema.py (or wherever existing v2_schema tests live):

class TestBuildSchemaUncached:
    """The uncached entry point exists for warmup, which has no user
    context. RBAC + cache check live in `build_schema`; the BQ work +
    cache write live in `build_schema_uncached`."""

    def test_uncached_function_exists_and_does_not_take_user(self):
        """Signature: build_schema_uncached(conn, table_id, *, bq)"""
        from app.api.v2_schema import build_schema_uncached
        import inspect
        sig = inspect.signature(build_schema_uncached)
        params = list(sig.parameters)
        assert "user" not in params, (
            "build_schema_uncached should NOT require a user — that's "
            "the whole point of the split (warmup has no user)."
        )
        assert "table_id" in params
        assert "bq" in params

    def test_build_schema_delegates_to_uncached(self, monkeypatch):
        """build_schema should call build_schema_uncached after RBAC+cache check."""
        from app.api import v2_schema

        called_with = {}
        def fake_uncached(conn, table_id, *, bq):
            called_with["table_id"] = table_id
            return {"table_id": table_id, "columns": []}

        monkeypatch.setattr(v2_schema, "build_schema_uncached", fake_uncached)
        # Bypass the cache + RBAC for this assertion — both are tested elsewhere.
        monkeypatch.setattr(v2_schema._schema_cache, "get", lambda k, default=None: None)
        monkeypatch.setattr(v2_schema, "can_access_table", lambda u, tid, c: True)

        # Synthetic registry row.
        from unittest.mock import MagicMock
        repo_mock = MagicMock()
        repo_mock.get.return_value = {"id": "x", "source_type": "bigquery"}
        monkeypatch.setattr(v2_schema, "TableRegistryRepository", lambda c: repo_mock)

        v2_schema.build_schema(
            conn=MagicMock(), user={"id": "u"}, table_id="x", bq=MagicMock(),
        )
        assert called_with["table_id"] == "x"

Step 4.2: Run test to verify it fails

cd /tmp/agnes-metadata && python -m pytest tests/test_v2_schema.py::TestBuildSchemaUncached -v

Expected: FAIL with ImportError: cannot import name 'build_schema_uncached'.

Step 4.3: Perform the split

Open app/api/v2_schema.py. Locate def build_schema(...) (around line 143). Refactor as:

def build_schema(
    conn: duckdb.DuckDBPyConnection,
    user: dict,
    table_id: str,
    *,
    bq: BqAccess,
) -> dict:
    # RBAC + existence check MUST run before cache lookup — otherwise an
    # unauthorized user can read cached schema fetched by an authorized one.
    repo = TableRegistryRepository(conn)
    row = repo.get(table_id)
    if not row:
        raise NotFound(table_id)

    if not can_access_table(user, table_id, conn):
        raise PermissionError(table_id)

    cache_key = f"{table_id}"
    cached = _schema_cache.get(cache_key)
    if cached is not None:
        return cached

    return build_schema_uncached(conn, table_id, bq=bq)


def build_schema_uncached(
    conn: duckdb.DuckDBPyConnection,
    table_id: str,
    *,
    bq: BqAccess,
) -> dict:
    """Build the schema response and populate `_schema_cache`. **Skips
    RBAC and cache-hit short-circuit** — call only from contexts where
    those are either unnecessary (warmup) or already enforced upstream
    (`build_schema` above). The BQ work is the same; the entry-point
    contract is what differs.
    """
    repo = TableRegistryRepository(conn)
    row = repo.get(table_id)
    if not row:
        raise NotFound(table_id)

    cache_key = f"{table_id}"
    source_type = row.get("source_type") or ""
    if source_type == "bigquery":
        dataset = row.get("bucket") or ""
        source_table = row.get("source_table") or table_id
        columns = _fetch_bq_schema(bq, dataset, source_table)
        opts = _fetch_bq_table_options(bq, dataset, source_table)
        payload = {
            "table_id": table_id,
            "source_type": source_type,
            "sql_flavor": "bigquery",
            "columns": columns,
            "partition_by": opts.get("partition_by"),
            "clustered_by": opts.get("clustered_by"),
        }
    else:
        # Local sources (Keboola/Jira parquet) — schema lives in extract.duckdb.
        # Preserve whatever the existing code path did. Copy the original
        # else-branch from build_schema verbatim. (Inspect the file before
        # this edit to extract the original lines after `if source_type ==
        # 'bigquery'`.)
        payload = _build_schema_for_local_row(conn, table_id, row)

    _schema_cache.set(cache_key, payload)
    return payload

Important: the existing build_schema body has logic for non-BQ rows after the if source_type == "bigquery": block. Read that block (lines ~165-210 in current file) and either keep it inline in build_schema_uncached after the if/else, or extract it into a helper _build_schema_for_local_row(conn, table_id, row) as the snippet above suggests. Pick whichever requires the smaller diff against the existing code.

Step 4.4: Run tests

cd /tmp/agnes-metadata && python -m pytest tests/test_v2_schema.py -v

Expected: all green (existing tests + the new TestBuildSchemaUncached class).
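
The ordering rule from Step 4.3 (RBAC before cache lookup) is easy to get wrong during a refactor, so here is a toy model of it. `read_schema`, `can_access`, and `build` are hypothetical stand-ins, and a plain dict replaces `_schema_cache`.

```python
def read_schema(user, table_id, *, cache: dict, can_access, build):
    """Hypothetical reader: authorize first, then consult the shared cache."""
    if not can_access(user, table_id):
        # Checked before the cache so a warm entry never leaks to an
        # unauthorized caller.
        raise PermissionError(table_id)
    if table_id in cache:
        return cache[table_id]
    cache[table_id] = build(table_id)
    return cache[table_id]
```

Because warmup fills the same cache with no user at all, this ordering matters even more after the split: every cached entry may have been produced without any access check.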

Step 4.5: Commit

cd /tmp/agnes-metadata
git add app/api/v2_schema.py tests/test_v2_schema.py
git commit -m "refactor(v2_schema): extract build_schema_uncached for warmup re-use

build_schema currently mixes RBAC + cache check + BQ work in one body.
Warmup will need to populate _schema_cache without a user — extract the
BQ-and-cache half so warmup can call it directly. build_schema keeps
the RBAC + cache-hit short-circuit at the top, then delegates.

Behavior unchanged for live /api/v2/schema/{id} consumers.

Refs: docs/superpowers/specs/2026-05-07-source-agnostic-table-metadata-spec.md"

Task 5: Provider scaffold + dispatcher

Files:

Create: connectors/bigquery/metadata.py (stub returning None)
Create: connectors/keboola/metadata.py (stub returning None)
Modify: app/api/v2_catalog.py (add _metadata_provider_for and _build_metadata_request)
Test: tests/test_v2_catalog_dispatcher.py

Step 5.1: Write failing tests

Create tests/test_v2_catalog_dispatcher.py:

"""Dispatch + identifier-validation gate for the source-agnostic
metadata providers."""

from app.api._metadata_models import MetadataRequest


def test_dispatcher_returns_bq_provider_for_bigquery():
    from app.api.v2_catalog import _metadata_provider_for
    from connectors.bigquery import metadata as bq_meta
    fn = _metadata_provider_for("bigquery")
    assert fn is bq_meta.fetch


def test_dispatcher_returns_keboola_provider_for_keboola():
    from app.api.v2_catalog import _metadata_provider_for
    from connectors.keboola import metadata as kb_meta
    fn = _metadata_provider_for("keboola")
    assert fn is kb_meta.fetch


def test_dispatcher_returns_none_for_unknown_source():
    from app.api.v2_catalog import _metadata_provider_for
    assert _metadata_provider_for("jira") is None
    assert _metadata_provider_for("") is None
    assert _metadata_provider_for("snowflake") is None


def test_build_metadata_request_for_valid_row():
    from app.api.v2_catalog import _build_metadata_request
    req = _build_metadata_request({
        "id": "orders",
        "bucket": "dwh_base",
        "source_table": "orders_2024",
    })
    assert isinstance(req, MetadataRequest)
    assert req.table_id == "orders"
    assert req.bucket == "dwh_base"
    assert req.source_table == "orders_2024"


def test_build_metadata_request_rejects_unsafe_bucket():
    from app.api.v2_catalog import _build_metadata_request
    req = _build_metadata_request({
        "id": "x",
        "bucket": "evil`; DROP--",
        "source_table": "t",
    })
    assert req is None


def test_build_metadata_request_falls_back_to_id_when_source_table_missing():
    """Some legacy Keboola registry rows have empty source_table; the row id
    is the table name in that case (mirrors v2_schema:168 behavior)."""
    from app.api.v2_catalog import _build_metadata_request
    req = _build_metadata_request({
        "id": "orders",
        "bucket": "in.c-crm",
        "source_table": "",
    })
    assert req is not None
    assert req.source_table == "orders"


def test_stub_providers_return_none():
    """Providers don't have their real bodies yet — stubs return None
    so the catalog endpoint stays 200 while we wire the rest."""
    from connectors.bigquery import metadata as bq_meta
    from connectors.keboola import metadata as kb_meta
    req = MetadataRequest(table_id="x", bucket="b", source_table="t")
    assert bq_meta.fetch(req) is None
    assert kb_meta.fetch(req) is None

Step 5.2: Run tests to verify they fail

cd /tmp/agnes-metadata && python -m pytest tests/test_v2_catalog_dispatcher.py -v

Expected: multiple FAIL — modules don't exist, dispatcher not defined.

Step 5.3: Create the provider stubs

Create connectors/bigquery/metadata.py:

"""BigQuery metadata provider — populates `TableMetadata` for a remote
BQ-backed registry row.

Stub: returns None pending the full implementation in Task 7.
"""

from __future__ import annotations

from app.api._metadata_models import MetadataRequest, TableMetadata


def fetch(req: MetadataRequest) -> TableMetadata | None:
    return None

Create connectors/keboola/metadata.py:

"""Keboola metadata provider — populates `TableMetadata` for a Keboola
registry row via the Storage API.

Stub: returns None pending the full implementation in Task 6.
"""

from __future__ import annotations

from app.api._metadata_models import MetadataRequest, TableMetadata


def fetch(req: MetadataRequest) -> TableMetadata | None:
    return None

Step 5.4: Add dispatcher + request builder to v2_catalog

Edit app/api/v2_catalog.py. Add at the top, after the existing imports:

from app.api._metadata_models import MetadataRequest, TableMetadata
from src.identifier_validation import validate_quoted_identifier

Then, after the _table_rows_cache definition (around line 26), add:

def _metadata_provider_for(source_type: str):
    """Lazy-import dispatch for source-specific metadata providers.

    Lazy because connector modules are heavy (BQ extension, google-cloud
    client, etc.) and a Keboola-only deployment shouldn't pay the BQ
    import cost. Returns ``None`` for unknown source types — the caller
    treats that as "no metadata enrichment available" and falls through.
    """
    if source_type == "bigquery":
        from connectors.bigquery import metadata as m
        return m.fetch
    if source_type == "keboola":
        from connectors.keboola import metadata as m
        return m.fetch
    return None


def _build_metadata_request(row: dict) -> MetadataRequest | None:
    """Construct a validated MetadataRequest from a registry row.

    Pre-validates the identifiers via `validate_quoted_identifier` before
    constructing the request — providers can then interpolate
    `req.bucket` / `req.source_table` into SQL/URL paths without
    re-checking. Returns ``None`` when validation fails; provider is not
    dispatched for that row.
    """
    bucket = row.get("bucket") or ""
    source_table = row.get("source_table") or row.get("id") or ""
    if not bucket or not source_table:
        return None
    if not (validate_quoted_identifier(bucket, "bucket")
            and validate_quoted_identifier(source_table, "source_table")):
        return None
    return MetadataRequest(
        table_id=row["id"], bucket=bucket, source_table=source_table,
    )
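
To make the validation contract concrete, here is a minimal sketch of the kind of check `_build_metadata_request` relies on. It assumes the regex quoted in the BigQuery provider docstring later in this plan; `is_safe_identifier` is a hypothetical stand-in, and the real `validate_quoted_identifier` may have a different signature or return type.

```python
import re

# Hypothetical stand-in for src.identifier_validation.validate_quoted_identifier.
# The pattern is the one quoted in this plan; the real helper may differ.
_SAFE_IDENTIFIER = re.compile(r"[a-zA-Z0-9_][a-zA-Z0-9_.\-]{0,127}")


def is_safe_identifier(value: str) -> bool:
    """Return True when `value` can sit inside a backtick-quoted path."""
    # fullmatch avoids the trailing-newline loophole of `$` with re.match.
    return _SAFE_IDENTIFIER.fullmatch(value) is not None


for candidate in ("dwh_base", "in.c-crm", "orders_2024", "evil`; DROP--", ""):
    print(f"{candidate!r}: {is_safe_identifier(candidate)}")
```

Keboola-style buckets such as `in.c-crm` pass because dots and hyphens are allowed after the first character, while anything containing a backtick, space, or semicolon is rejected before a provider ever sees it.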

Step 5.5: Run tests to verify they pass

cd /tmp/agnes-metadata && python -m pytest tests/test_v2_catalog_dispatcher.py -v

Expected: 7 passed.

Step 5.6: Commit

cd /tmp/agnes-metadata
git add connectors/bigquery/metadata.py connectors/keboola/metadata.py \
        app/api/v2_catalog.py tests/test_v2_catalog_dispatcher.py
git commit -m "feat(catalog): metadata-provider scaffold + dispatcher with identifier validation

Lazy-imported per-source-type providers; dispatcher in v2_catalog.py.
_build_metadata_request validates bucket + source_table before
constructing a MetadataRequest — providers can interpolate them safely.
Providers ship as stubs returning None; real bodies land in Tasks 6-7.

Refs: docs/superpowers/specs/2026-05-07-source-agnostic-table-metadata-spec.md"

Task 6: Keboola provider implementation

Files:

Modify: connectors/keboola/metadata.py (replace stub with real impl)
Test: tests/test_connectors_keboola_metadata.py

Step 6.1: Write failing tests

Create tests/test_connectors_keboola_metadata.py:

"""Keboola metadata provider — happy + unconfigured + api-error paths."""

from unittest.mock import patch

import pytest

from app.api._metadata_models import MetadataRequest, TableMetadata


@pytest.fixture
def req():
    return MetadataRequest(
        table_id="orders", bucket="in.c-crm", source_table="orders",
    )


def test_happy_path_returns_populated_metadata(req, monkeypatch):
    from connectors.keboola import metadata
    # KeboolaClient(token=None, url=None) reads env vars; pretend they're set.
    monkeypatch.setenv("KEBOOLA_STACK_URL", "https://connection.keboola.com")
    monkeypatch.setenv("KEBOOLA_STORAGE_TOKEN", "tok")

    with patch("connectors.keboola.metadata.KeboolaStorageClient") as MockStorage:
        instance = MockStorage.return_value
        instance.get_table_info.return_value = {
            "rowsCount": 1234,
            "dataSizeBytes": 500_000,
            "primaryKey": ["id"],
        }
        result = metadata.fetch(req)

    assert result == TableMetadata(
        rows=1234,
        size_bytes=500_000,
        partition_by=None,
        clustered_by=None,
    )


def test_returns_none_when_unconfigured(req, monkeypatch):
    """No KEBOOLA_STACK_URL / KEBOOLA_STORAGE_TOKEN env → return None."""
    from connectors.keboola import metadata
    monkeypatch.delenv("KEBOOLA_STACK_URL", raising=False)
    monkeypatch.delenv("KEBOOLA_STORAGE_TOKEN", raising=False)
    assert metadata.fetch(req) is None


def test_returns_none_on_storage_api_error(req, monkeypatch):
    """`StorageApiError` from get_table_info → log + return None."""
    from connectors.keboola import metadata
    from connectors.keboola.storage_api import StorageApiError
    monkeypatch.setenv("KEBOOLA_STACK_URL", "https://x.keboola.com")
    monkeypatch.setenv("KEBOOLA_STORAGE_TOKEN", "tok")

    with patch("connectors.keboola.metadata.KeboolaStorageClient") as MockStorage:
        instance = MockStorage.return_value
        instance.get_table_info.side_effect = StorageApiError(
            "404 not found", status=404, body={},
        )
        assert metadata.fetch(req) is None


def test_table_id_uses_bucket_dot_source_table(req, monkeypatch):
    """Storage API path is `<bucket>.<source_table>`."""
    from connectors.keboola import metadata
    monkeypatch.setenv("KEBOOLA_STACK_URL", "https://x.keboola.com")
    monkeypatch.setenv("KEBOOLA_STORAGE_TOKEN", "tok")

    with patch("connectors.keboola.metadata.KeboolaStorageClient") as MockStorage:
        instance = MockStorage.return_value
        instance.get_table_info.return_value = {
            "rowsCount": 0, "dataSizeBytes": 0,
        }
        metadata.fetch(req)
        instance.get_table_info.assert_called_once_with("in.c-crm.orders")

Step 6.2: Run tests to verify they fail

cd /tmp/agnes-metadata && python -m pytest tests/test_connectors_keboola_metadata.py -v

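Expected: 3 FAIL, 1 passed. The three tests that patch connectors.keboola.metadata.KeboolaStorageClient fail because the stub module does not define that name yet; test_returns_none_when_unconfigured already passes, since the stub returns None.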
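
When reading the red-phase output, it helps to know how `unittest.mock.patch` treats a target that does not exist yet. The sketch below uses a throwaway module named `stub_metadata` (a hypothetical stand-in for the stubbed provider) to show that entering `patch(...)` on a missing attribute raises `AttributeError`, so such tests fail at the patch call rather than at an assertion.

```python
import sys
import types
from unittest.mock import patch

# Hypothetical stand-in for the stubbed provider module: it defines fetch()
# but does not import KeboolaStorageClient yet.
stub = types.ModuleType("stub_metadata")
stub.fetch = lambda req: None
sys.modules["stub_metadata"] = stub


def try_patch(target: str) -> str:
    """Report whether `patch(target)` can be entered."""
    try:
        with patch(target):
            return "patched"
    except AttributeError:
        return "AttributeError"


print(try_patch("stub_metadata.fetch"))                 # attribute exists
print(try_patch("stub_metadata.KeboolaStorageClient"))  # attribute missing
```

Once Step 6.3 imports `KeboolaStorageClient` into `connectors/keboola/metadata.py`, the same patch target resolves and the tests reach their assertions.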

Step 6.3: Replace stub with real implementation

Open connectors/keboola/metadata.py and replace the entire file:

"""Keboola metadata provider — populates `TableMetadata` for a Keboola
registry row via the Storage API.

Reuses `KeboolaClient(token=None, url=None)` to inherit the existing
env-var fallback path (`KEBOOLA_STACK_URL` + `KEBOOLA_STORAGE_TOKEN`),
which is the same hierarchy `connectors/keboola/extractor.py` and
`connectors/keboola/client.py` already use. **Does NOT introduce a third
token-resolution helper.**
"""

from __future__ import annotations

import logging

from app.api._metadata_models import MetadataRequest, TableMetadata
from connectors.keboola.client import KeboolaClient
from connectors.keboola.storage_api import (
    KeboolaStorageClient, StorageApiError,
)

logger = logging.getLogger(__name__)


def fetch(req: MetadataRequest) -> TableMetadata | None:
    """Return Keboola Storage API metadata for the given table, or None.

    Keboola has no BigQuery-style partition/cluster concept; primaryKey is
    conceptually different (uniqueness, not physical layout), so
    `partition_by` and `clustered_by` are left None.
    """
    # Construct a KeboolaClient with no explicit token/url — the
    # constructor reads the standard env vars. Side-effect-free except
    # for setting `.token` and `.url` (verified
    # connectors/keboola/client.py:90-99). When a future refactor extracts
    # `_resolve_keboola_credentials()` as a standalone helper, switch
    # here to call that directly.
    creds = KeboolaClient(token=None, url=None)
    if not creds.url or not creds.token:
        return None  # not configured — same posture as BQ sentinel

    table_id = f"{req.bucket}.{req.source_table}"
    try:
        storage = KeboolaStorageClient(url=creds.url, token=creds.token)
        info = storage.get_table_info(table_id)
    except (StorageApiError, ValueError) as e:
        logger.warning("Keboola metadata fetch failed for %s: %s", table_id, e)
        return None

    return TableMetadata(
        rows=info.get("rowsCount"),
        size_bytes=info.get("dataSizeBytes"),
        partition_by=None,
        clustered_by=None,
    )
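
The payload-to-metadata mapping above can be exercised in isolation. This is a sketch only: `SketchTableMetadata` is a hypothetical stand-in for `TableMetadata` (field names taken from the tests in this plan, defaults assumed to be `None`), and the sample payloads are illustrative rather than captured Storage API responses.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SketchTableMetadata:
    """Hypothetical stand-in for app.api._metadata_models.TableMetadata."""
    rows: Optional[int] = None
    size_bytes: Optional[int] = None
    partition_by: Optional[str] = None
    clustered_by: Optional[List[str]] = None


def map_table_info(info: dict) -> SketchTableMetadata:
    """Map a table-detail payload the way fetch() does: two keys, rest None."""
    return SketchTableMetadata(
        rows=info.get("rowsCount"),
        size_bytes=info.get("dataSizeBytes"),
    )


full = map_table_info(
    {"rowsCount": 1234, "dataSizeBytes": 500_000, "primaryKey": ["id"]}
)
partial = map_table_info({"primaryKey": ["id"]})  # counters absent
print(full)
print(partial)
```

Because the mapping uses `dict.get`, a payload without `rowsCount` or `dataSizeBytes` yields `None` fields instead of raising, which matches the provider's "return what is known" posture.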

Step 6.4: Run tests to verify they pass

cd /tmp/agnes-metadata && python -m pytest tests/test_connectors_keboola_metadata.py -v

Expected: 4 passed.

Step 6.5: Commit

cd /tmp/agnes-metadata
git add connectors/keboola/metadata.py tests/test_connectors_keboola_metadata.py
git commit -m "feat(keboola): metadata provider returning rows + size_bytes

Reuses KeboolaClient(token=None, url=None) for env-var credential
fallback — no new token-resolver helper invented. partition_by /
clustered_by stay None (Keboola has no equivalent concept).

Refs: docs/superpowers/specs/2026-05-07-source-agnostic-table-metadata-spec.md"

Task 7: BigQuery provider implementation

Files:

Modify: connectors/bigquery/metadata.py (replace stub with real impl)
Test: tests/test_connectors_bigquery_metadata.py

Step 7.1: Write failing tests

Create tests/test_connectors_bigquery_metadata.py:

"""BigQuery metadata provider: the 5 paths from the spec test plan
(happy / sentinel / VIEW / region-typo / both-paths-fail) plus the
BqAccessError path."""

from unittest.mock import MagicMock, patch

import pytest

from app.api._metadata_models import MetadataRequest, TableMetadata


@pytest.fixture
def req():
    return MetadataRequest(
        table_id="orders", bucket="dwh_base", source_table="orders_2024",
    )


def _bq_with_session(table_storage_rows=None, columns_rows=None,
                     table_storage_raises=None, columns_raises=None,
                     legacy_tables_rows=None, legacy_tables_raises=None,
                     projects_data="data-proj", projects_billing="billing-proj"):
    """Build a mock `BqAccess` whose `duckdb_session()` returns a context
    manager whose `.execute(...)` dispatches based on which inner SQL
    string is being run.

    The 3 SQL shapes we route on:
      - INFORMATION_SCHEMA.TABLE_STORAGE  → table_storage_rows / raises
      - INFORMATION_SCHEMA.COLUMNS         → columns_rows / raises
      - __TABLES__                         → legacy_tables_rows / raises
    """
    bq = MagicMock()
    bq.projects.data = projects_data
    bq.projects.billing = projects_billing

    def execute(outer_sql, params):
        inner_sql = params[1] if len(params) > 1 else ""
        if "TABLE_STORAGE" in inner_sql:
            if table_storage_raises:
                raise table_storage_raises
            return MagicMock(
                fetchone=lambda: table_storage_rows[0] if table_storage_rows else None,
                fetchall=lambda: table_storage_rows or [],
            )
        if "INFORMATION_SCHEMA.COLUMNS" in inner_sql:
            if columns_raises:
                raise columns_raises
            return MagicMock(
                fetchall=lambda: columns_rows or [],
            )
        if "__TABLES__" in inner_sql:
            if legacy_tables_raises:
                raise legacy_tables_raises
            return MagicMock(
                fetchone=lambda: legacy_tables_rows[0] if legacy_tables_rows else None,
            )
        raise AssertionError(f"unexpected SQL: {inner_sql[:80]}")

    session = MagicMock()
    session.execute.side_effect = execute
    cm = MagicMock()
    cm.__enter__.return_value = session
    cm.__exit__.return_value = False
    bq.duckdb_session.return_value = cm
    return bq


def test_happy_path_returns_full_metadata(req, monkeypatch):
    """TABLE_STORAGE returns rows+size, COLUMNS returns partition+cluster."""
    from connectors.bigquery import metadata

    monkeypatch.setattr(
        "app.instance_config.get_value",
        lambda key, default=None: "us-central1" if "location" in key else default,
        raising=False,
    )

    bq = _bq_with_session(
        table_storage_rows=[(1234567, 5_000_000)],
        columns_rows=[
            ("event_date", "DATE", "NO", "YES", None),
            ("country", "STRING", "YES", "NO", 1),
            ("user_id", "STRING", "NO", "NO", None),
        ],
    )
    with patch("connectors.bigquery.metadata.get_bq_access", return_value=bq):
        result = metadata.fetch(req)
    assert result == TableMetadata(
        rows=1234567,
        size_bytes=5_000_000,
        partition_by="event_date",
        clustered_by=["country"],
    )


def test_sentinel_unconfigured_returns_none_no_query(req, monkeypatch):
    """`bq.projects.data == ''` → return None before any query."""
    from connectors.bigquery import metadata
    bq = _bq_with_session(projects_data="")
    with patch("connectors.bigquery.metadata.get_bq_access", return_value=bq):
        assert metadata.fetch(req) is None
    bq.duckdb_session.assert_not_called()


def test_view_path_returns_metadata_with_null_rows_size(req, monkeypatch):
    """VIEW: TABLE_STORAGE empty + __TABLES__ empty → rows/size = None;
    partition + cluster from COLUMNS still surface."""
    from connectors.bigquery import metadata
    monkeypatch.setattr(
        "app.instance_config.get_value",
        lambda key, default=None: "us-central1" if "location" in key else default,
        raising=False,
    )
    bq = _bq_with_session(
        table_storage_rows=[],   # view → no row
        legacy_tables_rows=[],   # view also absent from __TABLES__
        columns_rows=[
            ("event_date", "DATE", "NO", "YES", None),
        ],
    )
    with patch("connectors.bigquery.metadata.get_bq_access", return_value=bq):
        result = metadata.fetch(req)
    assert result is not None
    assert result.rows is None
    assert result.size_bytes is None
    assert result.partition_by == "event_date"


def test_region_typo_falls_through_to_legacy_tables(req, monkeypatch):
    """TABLE_STORAGE raises (typo'd region) → fall through to __TABLES__."""
    from connectors.bigquery import metadata
    monkeypatch.setattr(
        "app.instance_config.get_value",
        lambda key, default=None: "us-central" if "location" in key else default,  # typo!
        raising=False,
    )
    bq = _bq_with_session(
        table_storage_raises=RuntimeError("Not found: ..."),
        legacy_tables_rows=[(100, 2048)],
        columns_rows=[("event_date", "DATE", "NO", "YES", None)],
    )
    with patch("connectors.bigquery.metadata.get_bq_access", return_value=bq):
        result = metadata.fetch(req)
    assert result is not None
    assert result.rows == 100
    assert result.size_bytes == 2048


def test_both_paths_fail_returns_metadata_with_partition_only(req, monkeypatch):
    """Both TABLE_STORAGE and __TABLES__ fail → rows/size None, partition still fills."""
    from connectors.bigquery import metadata
    monkeypatch.setattr(
        "app.instance_config.get_value",
        lambda key, default=None: "us-central1" if "location" in key else default,
        raising=False,
    )
    bq = _bq_with_session(
        table_storage_raises=RuntimeError("BQ down"),
        legacy_tables_raises=RuntimeError("BQ still down"),
        columns_rows=[("event_date", "DATE", "NO", "YES", None)],
    )
    with patch("connectors.bigquery.metadata.get_bq_access", return_value=bq):
        result = metadata.fetch(req)
    assert result is not None
    assert result.rows is None
    assert result.size_bytes is None
    assert result.partition_by == "event_date"


def test_bq_access_error_returns_none(req):
    """get_bq_access() raises BqAccessError → return None gracefully."""
    from connectors.bigquery import metadata
    from connectors.bigquery.access import BqAccessError
    with patch(
        "connectors.bigquery.metadata.get_bq_access",
        side_effect=BqAccessError("not_configured"),
    ):
        assert metadata.fetch(req) is None
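
The routing trick inside `_bq_with_session` is easier to follow in a reduced form. The sketch below is an illustration, not part of the test file: `make_session` is a hypothetical helper that keeps only the two essentials, a `side_effect` function that inspects the inner SQL and a `MagicMock` configured as a context manager.

```python
from unittest.mock import MagicMock


def make_session(routes: dict):
    """Build a context manager whose session routes execute() by SQL substring.

    `routes` maps a substring of the inner SQL to the rows that the
    returned cursor's fetchall() / fetchone() should produce.
    """
    def execute(outer_sql, params):
        # The inner BigQuery SQL is the second bound parameter.
        inner_sql = params[1] if len(params) > 1 else ""
        for marker, rows in routes.items():
            if marker in inner_sql:
                return MagicMock(
                    fetchall=lambda: rows,
                    fetchone=lambda: rows[0] if rows else None,
                )
        raise AssertionError(f"unexpected SQL: {inner_sql[:80]}")

    session = MagicMock()
    session.execute.side_effect = execute
    cm = MagicMock()
    cm.__enter__.return_value = session   # `with cm as conn` yields session
    cm.__exit__.return_value = False      # do not swallow exceptions
    return cm


cm = make_session({"TABLE_STORAGE": [(100, 2048)], "__TABLES__": []})
with cm as conn:
    hit = conn.execute("SELECT ...", ["billing", "SELECT 1 FROM x.TABLE_STORAGE"]).fetchone()
    miss = conn.execute("SELECT ...", ["billing", "SELECT 1 FROM d.__TABLES__"]).fetchone()
print(hit, miss)
```

Raising `AssertionError` for unrouted SQL is a deliberate choice: a provider that starts issuing a new query shape fails loudly in tests instead of silently receiving a default mock.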

Step 7.2: Run tests to verify they fail

cd /tmp/agnes-metadata && python -m pytest tests/test_connectors_bigquery_metadata.py -v

Expected: 6 FAIL (every test patches connectors.bigquery.metadata.get_bq_access, which the stub module does not define yet).

Step 7.3: Implement the BQ provider

Replace connectors/bigquery/metadata.py entirely:

"""BigQuery metadata provider — populates `TableMetadata` for a remote
BQ-backed registry row.

Two queries (different INFORMATION_SCHEMA scopes — TABLE_STORAGE is
region-scoped, COLUMNS is dataset-scoped, can't be combined):

  1. INFORMATION_SCHEMA.TABLE_STORAGE — total_rows + active+long_term
     bytes. Region-portable per Google's docs; only valid via
     `<project>.region-<region>.INFORMATION_SCHEMA.TABLE_STORAGE`
     (verified live 2026-05-07; dataset-scoped TABLE_STORAGE doesn't
     exist).

  2. INFORMATION_SCHEMA.COLUMNS — partition_by + clustered_by. Reuses
     the consolidated `fetch_bq_columns_full` helper that v2_schema also
     calls; one shared shape, one round-trip.

Region resolution chain: `instance.yaml.data_source.bigquery.location` →
`bq.bigquery_client().get_dataset(...)` → fall back to legacy `__TABLES__`
(dataset-scoped, no region required).

VIEW handling: TABLE_STORAGE returns no rows for entries whose
`table_type='VIEW'`; the legacy `__TABLES__` fallback also doesn't list
views. The provider returns `TableMetadata(rows=None, size_bytes=None,
partition_by=<from COLUMNS>, clustered_by=<from COLUMNS>)` — analyst
Claude reads `null` size and applies the existing CLAUDE.md guidance.

`size_bytes` reports `active_logical_bytes + long_term_logical_bytes`
(a full BQ scan reads both — reporting only active undercounts aged
partitioned tables).
"""

from __future__ import annotations

import logging

from app.api._metadata_models import MetadataRequest, TableMetadata
from connectors.bigquery.access import (
    BqAccessError, fetch_bq_columns_full, get_bq_access,
)

logger = logging.getLogger(__name__)


def fetch(req: MetadataRequest) -> TableMetadata | None:
    try:
        bq = get_bq_access()
    except BqAccessError:
        return None

    if not bq.projects.data:
        return None

    rows_size = _fetch_rows_and_size(bq, req)
    columns = fetch_bq_columns_full(bq, req.bucket, req.source_table)
    part_clust = _derive_partition_cluster(columns) if columns else None

    if rows_size is None and part_clust is None:
        return None

    return TableMetadata(
        rows=(rows_size or {}).get("rows"),
        size_bytes=(rows_size or {}).get("size_bytes"),
        partition_by=(part_clust or {}).get("partition_by"),
        clustered_by=(part_clust or {}).get("clustered_by"),
    )


def _derive_partition_cluster(columns: list[dict]) -> dict | None:
    """Mirror v2_schema._fetch_bq_table_options derivations from the
    shared columns-full result."""
    if not columns:
        return None
    partition_by = next(
        (c["name"] for c in columns if c["is_partitioning_column"]),
        None,
    )
    clustered = sorted(
        (c for c in columns if c["clustering_ordinal_position"] is not None),
        key=lambda c: c["clustering_ordinal_position"],
    )
    clustered_by = [c["name"] for c in clustered]
    return {"partition_by": partition_by, "clustered_by": clustered_by}


def _fetch_rows_and_size(bq, req: MetadataRequest) -> dict | None:
    """Resolve rows + size_bytes via TABLE_STORAGE → __TABLES__ fallthrough.

    See module docstring + spec Open Question §1 for view-path nuance.
    """
    location = _resolve_bq_location(bq, req)
    if location:
        result = _fetch_via_table_storage(bq, req, location)
        if result is not None:
            return result
        # TABLE_STORAGE returned None despite having a location: this
        # could be a typo in `data_source.bigquery.location`, a
        # multi-region dataset the operator misclassified, a VIEW, or a
        # transient permission gap. Try __TABLES__ before giving up.
    return _fetch_via_legacy_tables(bq, req)


def _resolve_bq_location(bq, req: MetadataRequest) -> str | None:
    """instance.yaml.location → REST get_dataset → None."""
    from app.instance_config import get_value
    cfg_location = (get_value("data_source.bigquery.location") or "").strip()
    if cfg_location:
        return cfg_location
    try:
        ds = bq.bigquery_client().get_dataset(
            f"{bq.projects.data}.{req.bucket}"
        )
        return ds.location
    except Exception as e:
        logger.warning(
            "BQ dataset.get failed for %s.%s — falling back to __TABLES__: %s",
            bq.projects.data, req.bucket, e,
        )
        return None


def _fetch_via_table_storage(bq, req: MetadataRequest, location: str) -> dict | None:
    """Region-scoped INFORMATION_SCHEMA.TABLE_STORAGE — preferred path.

    `validate_quoted_identifier` accepts `us-central1`, `europe-west1`,
    `EU`, `us` etc. (regex `^[a-zA-Z0-9_][a-zA-Z0-9_.\\-]{0,127}$`).
    Refuses anything that could break out of the backtick-quoted path.

    Returns None on no-row (table is a VIEW, or different region than
    configured) — caller decides whether to fall through.

    `size_bytes` is `active + long_term` logical bytes (a full BQ scan
    reads both; reporting only active undercounts aged partitioned tables).
    """
    from src.identifier_validation import validate_quoted_identifier
    if not validate_quoted_identifier(location, "BQ region"):
        return None
    try:
        bq_sql = (
            f"SELECT total_rows, "
            f"IFNULL(active_logical_bytes, 0) + IFNULL(long_term_logical_bytes, 0) "
            f"FROM `{bq.projects.data}.region-{location}.INFORMATION_SCHEMA.TABLE_STORAGE` "
            f"WHERE table_schema = ? AND table_name = ?"
        )
        with bq.duckdb_session() as conn:
            row = conn.execute(
                "SELECT * FROM bigquery_query(?, ?, ?, ?)",
                [bq.projects.billing, bq_sql, req.bucket, req.source_table],
            ).fetchone()
    except Exception as e:
        logger.warning(
            "BQ TABLE_STORAGE fetch failed for %s.%s.%s: %s",
            bq.projects.data, req.bucket, req.source_table, e,
        )
        return None
    if row is None:
        return None  # VIEW or wrong region
    rows_, size_bytes = row
    return {
        "rows": int(rows_) if rows_ is not None else None,
        "size_bytes": int(size_bytes) if size_bytes is not None else None,
    }


def _fetch_via_legacy_tables(bq, req: MetadataRequest) -> dict | None:
    """Last-resort dataset-scoped __TABLES__ — works without region."""
    try:
        bq_sql = (
            f"SELECT row_count, size_bytes "
            f"FROM `{bq.projects.data}.{req.bucket}.__TABLES__` "
            f"WHERE table_id = ?"
        )
        with bq.duckdb_session() as conn:
            row = conn.execute(
                "SELECT * FROM bigquery_query(?, ?, ?)",
                [bq.projects.billing, bq_sql, req.source_table],
            ).fetchone()
    except Exception as e:
        logger.warning(
            "BQ __TABLES__ fetch failed for %s.%s.%s: %s",
            bq.projects.data, req.bucket, req.source_table, e,
        )
        return None
    if row is None:
        return None
    rows_, size_bytes = row
    return {
        "rows": int(rows_) if rows_ is not None else None,
        "size_bytes": int(size_bytes) if size_bytes is not None else None,
    }
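
As a worked example of the partition/cluster derivation, the sketch below runs the same logic as `_derive_partition_cluster` on hand-written column dicts. The row shape is an assumption: it presumes `fetch_bq_columns_full` has already normalised `is_partitioning_column` to a bool, since BigQuery's `INFORMATION_SCHEMA.COLUMNS` reports it as the strings `YES` / `NO` and a raw `"NO"` would be truthy.

```python
def derive_partition_cluster(columns):
    """Same derivation as _derive_partition_cluster, on plain dicts."""
    if not columns:
        return None
    # First column flagged as the partitioning column, if any.
    partition_by = next(
        (c["name"] for c in columns if c["is_partitioning_column"]), None,
    )
    # Clustering columns ordered by their ordinal position.
    clustered = sorted(
        (c for c in columns if c["clustering_ordinal_position"] is not None),
        key=lambda c: c["clustering_ordinal_position"],
    )
    return {
        "partition_by": partition_by,
        "clustered_by": [c["name"] for c in clustered],
    }


# Assumed row shape: is_partitioning_column already normalised to a bool.
columns = [
    {"name": "event_date", "is_partitioning_column": True, "clustering_ordinal_position": None},
    {"name": "platform", "is_partitioning_column": False, "clustering_ordinal_position": 2},
    {"name": "country", "is_partitioning_column": False, "clustering_ordinal_position": 1},
]
print(derive_partition_cluster(columns))
```

Note that a table with columns but no clustering yields `clustered_by == []` rather than `None`, so consumers that distinguish "unclustered" from "unknown" should expect both values.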

Step 7.4: Run tests to verify they pass

cd /tmp/agnes-metadata && python -m pytest tests/test_connectors_bigquery_metadata.py -v

Expected: 6 passed.

Step 7.5: Commit

cd /tmp/agnes-metadata
git add connectors/bigquery/metadata.py tests/test_connectors_bigquery_metadata.py
git commit -m "feat(bigquery): metadata provider with TABLE_STORAGE + COLUMNS + __TABLES__ fallback

Resolves rows + size_bytes via region-scoped INFORMATION_SCHEMA.TABLE_STORAGE
(verified the only valid scope on 2026-05-07; dataset-scoped doesn't exist).
size_bytes is active + long_term logical bytes — full scan reads both.

Region resolved from data_source.bigquery.location → REST get_dataset →
fall back to legacy __TABLES__ on TABLE_STORAGE failure (region typo,
VIEW, IAM gap). Partition + cluster derive from the shared
fetch_bq_columns_full helper introduced in Task 3.

VIEW handling: TABLE_STORAGE empty → fall through to __TABLES__ empty →
TableMetadata(rows=None, size_bytes=None, partition_by=..., clustered_by=...).
The null size signals analyst Claude to use the existing 'treat as
potentially large' guidance.

Refs: docs/superpowers/specs/2026-05-07-source-agnostic-table-metadata-spec.md"

Task 8: v2_catalog wiring — _size_hint_for_row + response shape + cache

Files:

Modify: app/api/v2_catalog.py
Test: tests/test_v2_catalog_remote_metadata.py

Step 8.1: Write failing tests

Create tests/test_v2_catalog_remote_metadata.py:

"""Catalog endpoint integration: per-table metadata enrichment for
remote rows."""

from unittest.mock import patch

from app.api._metadata_models import TableMetadata


def test_remote_row_includes_metadata_fields(seeded_app, monkeypatch):
    """Catalog response for a query_mode='remote' BQ row carries the four
    new fields populated by the provider."""
    c = seeded_app["client"]
    token = seeded_app["admin_token"]

    fake_meta = TableMetadata(
        rows=10000, size_bytes=2_000_000,
        partition_by="event_date", clustered_by=["country", "platform"],
    )

    # Register a synthetic remote BQ row in the test instance's registry.
    seeded_app["register_table"](
        id="orders", source_type="bigquery", bucket="dwh_base",
        source_table="orders_2024", query_mode="remote",
    )

    with patch(
        "connectors.bigquery.metadata.fetch", return_value=fake_meta,
    ):
        r = c.get(
            "/api/v2/catalog",
            headers={"Authorization": f"Bearer {token}"},
        )
    assert r.status_code == 200, r.text
    tables = r.json()["tables"]
    orders = next(t for t in tables if t["id"] == "orders")
    assert orders["rows"] == 10000
    assert orders["size_bytes"] == 2_000_000
    assert orders["partition_by"] == "event_date"
    assert orders["clustered_by"] == ["country", "platform"]
    # Existing fields still present.
    assert orders["query_mode"] == "remote"


def test_local_row_unaffected_by_provider_dispatch(seeded_app):
    """query_mode='local' rows take the parquet-stat path; provider not called."""
    c = seeded_app["client"]
    token = seeded_app["admin_token"]
    seeded_app["register_table"](
        id="users", source_type="keboola", bucket="in.c-crm",
        source_table="users", query_mode="local",
    )

    with patch("connectors.keboola.metadata.fetch") as mock_fetch:
        r = c.get(
            "/api/v2/catalog",
            headers={"Authorization": f"Bearer {token}"},
        )
    assert r.status_code == 200, r.text
    mock_fetch.assert_not_called()


def test_provider_failure_returns_null_metadata(seeded_app):
    """Provider returns None → row appears with null new fields, not
    a 500. Catalog endpoint must stay 200."""
    c = seeded_app["client"]
    token = seeded_app["admin_token"]
    seeded_app["register_table"](
        id="broken", source_type="bigquery", bucket="dwh_base",
        source_table="broken_t", query_mode="remote",
    )

    with patch(
        "connectors.bigquery.metadata.fetch", return_value=None,
    ):
        r = c.get(
            "/api/v2/catalog",
            headers={"Authorization": f"Bearer {token}"},
        )
    assert r.status_code == 200, r.text
    tables = r.json()["tables"]
    broken = next(t for t in tables if t["id"] == "broken")
    assert broken["rows"] is None
    assert broken["size_bytes"] is None
    assert broken["partition_by"] is None
    assert broken["clustered_by"] is None


def test_cache_hit_does_not_call_provider_twice(seeded_app):
    """First call invokes provider; second within 15 min hits cache."""
    c = seeded_app["client"]
    token = seeded_app["admin_token"]
    seeded_app["register_table"](
        id="orders", source_type="bigquery", bucket="dwh_base",
        source_table="orders_2024", query_mode="remote",
    )

    fake_meta = TableMetadata(rows=1, size_bytes=2)
    with patch(
        "connectors.bigquery.metadata.fetch", return_value=fake_meta,
    ) as mock_fetch:
        c.get("/api/v2/catalog", headers={"Authorization": f"Bearer {token}"})
        c.get("/api/v2/catalog", headers={"Authorization": f"Bearer {token}"})
    assert mock_fetch.call_count == 1

If the seeded_app fixture does not yet expose a register_table helper, add one in tests/conftest.py:

# In tests/conftest.py — add to seeded_app fixture
def register_table(*, id, source_type, bucket, source_table, query_mode, **extra):
    from src.repositories.table_registry import TableRegistryRepository
    repo = TableRegistryRepository(seeded_app["conn"])
    repo.register(
        id=id, name=id, source_type=source_type, bucket=bucket,
        source_table=source_table, query_mode=query_mode, **extra,
    )
seeded_app["register_table"] = register_table

Step 8.2: Run tests to verify they fail

cd /tmp/agnes-metadata && python -m pytest tests/test_v2_catalog_remote_metadata.py -v

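Expected: FAIL for the tests that read the new fields or count provider calls, because the response doesn't include the new fields yet. test_local_row_unaffected_by_provider_dispatch may already pass, since nothing dispatches to a provider before the wiring lands.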

Step 8.3: Wire up the dispatch + cache + response shape

Edit app/api/v2_catalog.py:

A. Add a _metadata_cache near the existing _table_rows_cache (around line 26):

# Per-table cached TableMetadata. 15-min TTL — long enough to amortise
# across an analyst session, short enough that a freshly-registered
# remote table shows real numbers within a coffee break (the cache-bust
# path in `invalidate_for_table` accelerates this for the common admin-
# verifies-registration flow).
_metadata_cache = TTLCache(maxsize=512, ttl_seconds=900)

B. Rename the existing function _materialized_size_hint to _size_hint_for_row and extend it. Find the function (around line 68) and replace it in full with the three functions below; the old body moves into _materialized_parquet_size_bucket:

def _size_hint_for_row(row: dict) -> dict:
    """Resolve the per-row metadata bundle the catalog response surfaces.

    Renamed from `_materialized_size_hint`, a misleading name because the
    function has always handled `local` rows too. Returns a dict with up
    to five keys: `rough_size_hint`, `rows`, `size_bytes`, `partition_by`,
    `clustered_by`. Missing keys are reported as `null` in the response.

    Branches:
      - `local` / `materialized` → existing on-disk parquet stat (cheap).
      - `remote` → dispatch to the per-source-type provider; cache the
        TableMetadata for 15 min.
    """
    table_id = row["id"]
    source_type = row.get("source_type") or ""
    query_mode = row.get("query_mode") or "local"

    if query_mode in ("local", "materialized"):
        return {"rough_size_hint": _materialized_parquet_size_bucket(
            table_id, source_type, query_mode,
        )}

    if query_mode != "remote":
        return {"rough_size_hint": None}

    # Cache lookup (per-row TableMetadata).
    cached = _metadata_cache.get(table_id)
    if cached is None:
        cached = _resolve_remote_metadata(row)
        if cached is not None:
            _metadata_cache.set(table_id, cached)

    if cached is None:
        return {"rough_size_hint": None}

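    return {
        # `is not None` so a genuinely empty table (0 bytes) still gets a
        # bucket instead of looking like "size unknown".
        "rough_size_hint": (
            _bucket_size(cached.size_bytes)
            if cached.size_bytes is not None else None
        ),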
        "rows": cached.rows,
        "size_bytes": cached.size_bytes,
        "partition_by": cached.partition_by,
        "clustered_by": cached.clustered_by,
    }


def _materialized_parquet_size_bucket(
    table_id: str, source_type: str, query_mode: str,
) -> str | None:
    """Size hint for rows whose data is on the server filesystem
    (the old `_materialized_size_hint` body)."""
    if not source_type:
        return None
    try:
        path = (
            Path(_get_data_dir()) / "extracts" / source_type / "data"
            / f"{table_id}.parquet"
        )
        if not path.exists():
            return None
        return _bucket_size(path.stat().st_size)
    except Exception:
        return None


def _resolve_remote_metadata(row: dict) -> "TableMetadata | None":
    """Provider dispatch for a remote row. Returns None on any failure."""
    source_type = row.get("source_type") or ""
    provider = _metadata_provider_for(source_type)
    if provider is None:
        return None
    req = _build_metadata_request(row)
    if req is None:
        return None
    try:
        return provider(req)
    except Exception:
        # Defense in depth — providers are documented as never-raises,
        # but a regression would otherwise 500 the whole catalog.
        return None

C. Update build_catalog (around line 94) to merge the new fields into each visible row's payload. Find the loop body that builds visible.append(...) and replace the appended dict construction:

    visible = []
    for r in rows:
        if not can_access_table(user, r["id"], conn):
            continue
        hint = _size_hint_for_row(r)
        visible.append({
            "id": r["id"],
            "name": r.get("name") or r["id"],
            "description": r.get("description") or "",
            "source_type": r.get("source_type") or "",
            "query_mode": r.get("query_mode") or "local",
            "sql_flavor": _flavor_for(r.get("source_type") or ""),
            "where_examples": _examples_for(r.get("source_type") or ""),
            "fetch_via": _fetch_hint(r["id"], r.get("source_type") or ""),
            "rough_size_hint": hint.get("rough_size_hint"),
            "rows": hint.get("rows"),
            "size_bytes": hint.get("size_bytes"),
            "partition_by": hint.get("partition_by"),
            "clustered_by": hint.get("clustered_by"),
        })
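Because the new fields are optional and may be `null`, a consumer should treat every one of them as possibly absent. A minimal sketch of such a consumer follows; `format_size` and `describe_row` are hypothetical helpers, not part of the Agnes CLI.

```python
from typing import Optional


def format_size(size_bytes: Optional[int]) -> str:
    """Human-readable size; 'unknown' when the field is null or missing."""
    if size_bytes is None:
        return "unknown"
    size = float(size_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} TiB"


def describe_row(row: dict) -> str:
    """One-line summary built only from optional catalog fields."""
    parts = [row["id"], f"size={format_size(row.get('size_bytes'))}"]
    if row.get("rows") is not None:
        parts.append(f"rows={row['rows']}")
    if row.get("partition_by"):
        parts.append(f"partitioned by {row['partition_by']}")
    return " | ".join(parts)
```

A row from an older server, carrying only `id` and `rough_size_hint`, still produces a valid summary with `size=unknown`.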

D. Verify the existing from pathlib import Path and _get_data_dir import are still in scope; if not, add them.

Step 8.4: Run tests to verify they pass

cd /tmp/agnes-metadata && python -m pytest tests/test_v2_catalog_remote_metadata.py tests/test_v2_catalog_dispatcher.py -v

Expected: all green.

Step 8.5: Run existing v2_catalog tests for regression

cd /tmp/agnes-metadata && python -m pytest tests/test_v2_catalog.py -v

Expected: all green (response is additive — no field renamed or removed).

Step 8.6: Commit

cd /tmp/agnes-metadata
git add app/api/v2_catalog.py tests/test_v2_catalog_remote_metadata.py
git commit -m "feat(catalog): enrich remote rows with rows / size_bytes / partition_by / clustered_by

_size_hint_for_row (renamed from _materialized_size_hint) now dispatches
to the per-source-type metadata provider for query_mode='remote' rows
and caches the TableMetadata for 15 minutes. Catalog response gains
four new optional fields; existing CLI consumers that read only
rough_size_hint stay unchanged.

Refs: docs/superpowers/specs/2026-05-07-source-agnostic-table-metadata-spec.md"

Task 9: Unified cache invalidation

Files:

Modify: app/api/v2_catalog.py (add invalidate_for_table)
Modify: app/api/admin.py (call from register/update/unregister)
Test: tests/test_v2_catalog_invalidation.py
Step 9.1: Write failing test

Create tests/test_v2_catalog_invalidation.py:

"""Unified cache flush across all four catalog/schema/sample/metadata
caches on registry write."""

from unittest.mock import patch


def test_invalidate_flushes_all_four_caches():
    from app.api import v2_catalog, v2_schema, v2_sample
    from app.api._metadata_models import TableMetadata

    # Pre-populate.
    v2_catalog._table_rows_cache.set("all", ["fake_row"])
    v2_catalog._metadata_cache.set("orders", TableMetadata(rows=10))
    v2_schema._schema_cache.set("orders", {"columns": []})
    v2_sample._sample_cache.set("orders|10", [{"row": 1}])

    v2_catalog.invalidate_for_table("orders")

    assert v2_catalog._table_rows_cache.get("all") is None
    assert v2_catalog._metadata_cache.get("orders") is None
    assert v2_schema._schema_cache.get("orders") is None
    # Sample cache is cleared whole (we don't have prefix-invalidation).
    assert v2_sample._sample_cache.get("orders|10") is None


def test_invalidate_schedules_single_row_rewarm(monkeypatch):
    """After the flush, a background re-warm task is scheduled for the
    same table_id. Assert via patching create_task."""
    import asyncio
    from app.api import v2_catalog

    scheduled = []

    def fake_create_task(coro):
        # Close the coroutine so the test doesn't leak an un-awaited one.
        coro.close()
        scheduled.append(coro)
        return None

    monkeypatch.setattr(asyncio, "create_task", fake_create_task)
    v2_catalog.invalidate_for_table("orders")
    assert len(scheduled) == 1


def test_register_table_invalidates(seeded_app):
    """Registering a table flushes the rows cache so the next catalog
    request reflects it without waiting for the 5-min TTL."""
    from app.api import v2_catalog
    v2_catalog._table_rows_cache.set("all", [])

    seeded_app["register_table"](
        id="new_t", source_type="keboola", bucket="in.c-x",
        source_table="t", query_mode="local",
    )
    assert v2_catalog._table_rows_cache.get("all") is None


def test_update_table_invalidates(seeded_app):
    from app.api import v2_catalog
    seeded_app["register_table"](
        id="u_t", source_type="keboola", bucket="in.c-x",
        source_table="t", query_mode="local",
    )
    v2_catalog._table_rows_cache.set("all", ["pre-update"])
    seeded_app["http_put"]("/api/admin/registry/u_t", {"description": "new"})
    assert v2_catalog._table_rows_cache.get("all") is None


def test_unregister_table_invalidates(seeded_app):
    from app.api import v2_catalog
    seeded_app["register_table"](
        id="d_t", source_type="keboola", bucket="in.c-x",
        source_table="t", query_mode="local",
    )
    v2_catalog._table_rows_cache.set("all", ["pre-delete"])
    seeded_app["http_delete"]("/api/admin/registry/d_t")
    assert v2_catalog._table_rows_cache.get("all") is None

If seeded_app["http_put"] or ["http_delete"] aren't already defined in tests/conftest.py, add thin wrappers that call c.put(...) / c.delete(...) with the admin token.
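The wrappers can be as small as the sketch below. It assumes the fixture already has a test client and an admin token, as the other tests in this plan do; the factory name `make_registry_http_helpers` is hypothetical.

```python
from typing import Any, Callable, Dict


def make_registry_http_helpers(client: Any, admin_token: str) -> Dict[str, Callable]:
    """Build the http_put / http_delete entries for the seeded_app dict."""
    headers = {"Authorization": f"Bearer {admin_token}"}

    def http_put(path: str, body: dict):
        # JSON body plus the admin bearer token on every call.
        return client.put(path, json=body, headers=headers)

    def http_delete(path: str):
        return client.delete(path, headers=headers)

    return {"http_put": http_put, "http_delete": http_delete}
```

Inside the fixture, the returned dict can be merged into the existing `seeded_app` mapping so the tests above can call `seeded_app["http_put"](...)` directly.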

Step 9.2: Run tests to verify they fail

cd /tmp/agnes-metadata && python -m pytest tests/test_v2_catalog_invalidation.py -v

Expected: FAIL with AttributeError: module 'app.api.v2_catalog' has no attribute 'invalidate_for_table'.

Step 9.3: Add invalidate_for_table to v2_catalog

Append to app/api/v2_catalog.py (anywhere after _metadata_cache is defined):

def invalidate_for_table(table_id: str) -> None:
    """Drop every per-table cache so the next /api/v2/* request reflects
    the just-registered / updated / unregistered row immediately. Owned
    by the catalog module so admin.py doesn't need to know which caches
    exist.

    Imports v2_schema and v2_sample lazily — keeps catalog tests from
    pulling in BQ-extension imports they don't need.
    """
    import asyncio
    from app.api import v2_schema, v2_sample

    _table_rows_cache.clear()
    _metadata_cache.invalidate(table_id)
    v2_schema._schema_cache.invalidate(table_id)
    # Sample cache key is `f"{table_id}|{n}"`; clearing the whole sample
    # cache is heavier than precise invalidation, but registry-change
    # frequency (handful per day on a typical instance) doesn't justify
    # adding a prefix-invalidation primitive to TTLCache.
    v2_sample._sample_cache.clear()

    # Schedule a single-row re-warm so admins editing a registry row
    # see fresh data within a couple of seconds rather than waiting for
    # the next analyst to trigger a miss. Fire-and-forget; failures
    # log + skip inside the coroutine.
    coro = _rewarm_one_row(table_id)
    try:
        asyncio.create_task(coro)
    except RuntimeError:
        # No running event loop (e.g. called from a sync test). Close the
        # coroutine so it is not reported as "never awaited", and skip
        # the re-warm; the next live request will populate via miss.
        coro.close()


async def _rewarm_one_row(table_id: str) -> None:
    """Background single-row re-warm. Imports cache_warmup lazily to
    avoid a circular import at module load."""
    try:
        from app.api.cache_warmup import warm_one_table
        await warm_one_table(table_id)
    except Exception:
        import logging
        logging.getLogger(__name__).warning(
            "single-row re-warm failed for %s — next live request will populate",
            table_id,
        )

The cache_warmup.warm_one_table function is defined in Task 10 and does not exist yet. Because the import is lazy and sits inside the outer try, a re-warm that runs before Task 10 lands only reaches the warning path. The unit test test_invalidate_schedules_single_row_rewarm patches asyncio.create_task, so in that test the inner coroutine never runs at all.

Step 9.4: Wire the helper into admin.py

Edit app/api/admin.py. Find the success-return paths in:

register_table (around line 1037 / inner success branch)
update_table (around line 2486)
unregister_table (the DELETE handler around line 2538)

For each, add these two lines just before the function returns:

    from app.api.v2_catalog import invalidate_for_table
    invalidate_for_table(table_id)

(The import inside the function avoids a circular import at module load.)

For register_table, the table_id may be derived from result["id"] or similar — adapt to the actual variable name in scope. For unregister_table, the path parameter is table_id. For update_table, also table_id.

Step 9.5: Run tests

cd /tmp/agnes-metadata && python -m pytest tests/test_v2_catalog_invalidation.py -v

Expected: all green.

Step 9.6: Commit

cd /tmp/agnes-metadata
git add app/api/v2_catalog.py app/api/admin.py tests/test_v2_catalog_invalidation.py
git commit -m "feat(catalog): unified invalidate_for_table flushes 4 caches + schedules rewarm

invalidate_for_table flushes _table_rows_cache + _metadata_cache +
_schema_cache + _sample_cache, then schedules a single-row re-warm so
admin-edited rows reflect fresh data within a couple of seconds rather
than waiting for the next analyst miss. Wired into register_table /
update_table / unregister_table.

Pre-fix none of the four caches were invalidated on registry change —
admin registers a table, agnes catalog doesn't show the new row for up
to 5 min; admin updates a row's bucket, agnes schema returns the OLD
column list for up to 1h.

Refs: docs/superpowers/specs/2026-05-07-source-agnostic-table-metadata-spec.md"

Task 10: Cache warmup framework — state + bg task + endpoints

Files:

Create: app/api/cache_warmup.py
Modify: pyproject.toml (add sse-starlette dep)
Test: tests/test_cache_warmup.py
Step 10.1: Add sse-starlette dependency

Edit pyproject.toml. Find the dependencies = [ block under [project] and append:

    "sse-starlette>=2.0",

Run:

cd /tmp/agnes-metadata && uv pip install -e ".[dev]"

Step 10.2: Write failing tests

Create tests/test_cache_warmup.py:

"""Cache warmup framework — state, bg task, endpoints."""

import asyncio
from unittest.mock import patch

import pytest


def test_warmup_run_state_starts_empty():
    from app.api.cache_warmup import WARMUP_STATE
    assert WARMUP_STATE is None or WARMUP_STATE.completed_at is not None


@pytest.mark.asyncio
async def test_warmup_skips_when_env_set(monkeypatch):
    """AGNES_SKIP_CACHE_WARMUP=1 → background warmup is a no-op."""
    monkeypatch.setenv("AGNES_SKIP_CACHE_WARMUP", "1")
    from app.api import cache_warmup

    with patch.object(cache_warmup, "_warm_catalog_caches_bg") as mock_bg:
        cache_warmup.maybe_schedule_startup_warmup()
    # The env check returns before the coroutine function is ever called,
    # so no warmup task is created.
    mock_bg.assert_not_called()


@pytest.mark.asyncio
async def test_warmup_runs_one_per_remote_row():
    """`_warm_catalog_caches_bg` calls `_warm_one` once per remote row."""
    from app.api import cache_warmup

    # Three remote BQ rows plus one local row. The stub below returns only
    # the remote ones, as the real `_list_remote_rows` filter would.
    fake_rows = [
        {"id": "r1", "query_mode": "remote", "source_type": "bigquery"},
        {"id": "r2", "query_mode": "remote", "source_type": "bigquery"},
        {"id": "r3", "query_mode": "remote", "source_type": "bigquery"},
        {"id": "L1", "query_mode": "local", "source_type": "keboola"},
    ]
    warmed = []

    async def fake_warm_one(row, state, sem):
        warmed.append(row["id"])

    with patch.object(cache_warmup, "_list_remote_rows", return_value=fake_rows[:3]):
        with patch.object(cache_warmup, "_warm_one", fake_warm_one):
            await cache_warmup._warm_catalog_caches_bg(trigger="manual")

    assert sorted(warmed) == ["r1", "r2", "r3"]


@pytest.mark.asyncio
async def test_warmup_failure_isolated_to_one_row():
    from app.api import cache_warmup

    async def fake_warm_one(row, state, sem):
        if row["id"] == "fail":
            raise RuntimeError("BQ down for this row")

    with patch.object(cache_warmup, "_list_remote_rows", return_value=[
        {"id": "ok1", "query_mode": "remote", "source_type": "bigquery"},
        {"id": "fail", "query_mode": "remote", "source_type": "bigquery"},
        {"id": "ok2", "query_mode": "remote", "source_type": "bigquery"},
    ]):
        with patch.object(cache_warmup, "_warm_one", fake_warm_one):
            await cache_warmup._warm_catalog_caches_bg(trigger="manual")

    state = cache_warmup.WARMUP_STATE
    assert state.total == 3
    # Other rows still ran; the bad one is recorded but doesn't blow up
    # the gather. Note: this asserts on the gather behavior, not on
    # `_warm_one`'s internal exception handling — that lives in
    # the real implementation, not the stub here. If the implementation
    # catches inside `_warm_one`, the gather completes fully.


@pytest.mark.asyncio
async def test_run_endpoint_idempotent_on_concurrent_invocation(seeded_app):
    """Two POST /run calls in flight return the same run_id."""
    c = seeded_app["client"]
    token = seeded_app["admin_token"]
    headers = {"Authorization": f"Bearer {token}"}

    # First POST starts a run.
    r1 = c.post("/api/admin/cache-warmup/run", headers=headers)
    assert r1.status_code == 200
    body1 = r1.json()
    # Second POST while first is in-flight: returns existing run_id.
    r2 = c.post("/api/admin/cache-warmup/run", headers=headers)
    body2 = r2.json()
    if body2.get("status") == "already_running":
        assert body2["run_id"] == body1["run_id"]
    # Otherwise the first run completed before the second arrived; that's
    # also OK — idempotency only matters under true concurrency.


def test_status_endpoint_before_first_run(seeded_app):
    """GET /status returns {state: never_run} before any warmup."""
    from app.api import cache_warmup
    cache_warmup.WARMUP_STATE = None  # reset for this test

    c = seeded_app["client"]
    token = seeded_app["admin_token"]
    r = c.get(
        "/api/admin/cache-warmup/status",
        headers={"Authorization": f"Bearer {token}"},
    )
    assert r.status_code == 200
    assert r.json() == {"state": "never_run"}
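The isolation that `test_warmup_failure_isolated_to_one_row` relies on comes from two asyncio primitives used together: a `Semaphore` to cap the number of rows in flight, and `gather(..., return_exceptions=True)` so one raising row does not cancel the rest. The self-contained sketch below shows only that mechanism. `warm_all` and its fake workload are hypothetical stand-ins, not the module from Step 10.4.

```python
import asyncio
from typing import Dict, List, Tuple


async def warm_all(ids: List[str], concurrency: int = 4) -> Tuple[Dict[str, str], int]:
    """Run one stand-in warm job per id, at most `concurrency` at a time.

    Returns the per-id outcome and the peak number of jobs in flight.
    """
    sem = asyncio.Semaphore(concurrency)
    in_flight = 0
    peak = 0

    async def warm(table_id: str) -> str:
        nonlocal in_flight, peak
        async with sem:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                await asyncio.sleep(0)  # stand-in for the real I/O
                if table_id == "fail":
                    raise RuntimeError("upstream down for this row")
                return "fresh"
            finally:
                in_flight -= 1

    # return_exceptions=True turns a raised exception into a result value,
    # so the other jobs run to completion.
    results = await asyncio.gather(
        *(warm(i) for i in ids), return_exceptions=True,
    )
    outcome = {
        i: ("error" if isinstance(r, Exception) else r)
        for i, r in zip(ids, results)
    }
    return outcome, peak
```

`gather` returns results in the order the awaitables were passed, which is what makes the `zip(ids, results)` pairing safe.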

Step 10.3: Run tests to verify they fail

cd /tmp/agnes-metadata && python -m pytest tests/test_cache_warmup.py -v

Expected: ImportError for app.api.cache_warmup.

Step 10.4: Implement the warmup module

Create app/api/cache_warmup.py:

"""Cache warmup framework — populates catalog/schema/metadata caches at
container startup so the first analyst hits warm caches.

Bounded concurrency (4 by default). Exposes:
  - GET /api/admin/cache-warmup/status — JSON snapshot
  - POST /api/admin/cache-warmup/run — manual trigger (idempotent)
  - GET /api/admin/cache-warmup/stream — Server-Sent Events
"""

from __future__ import annotations

import asyncio
import json
import logging
import os
import time
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Literal
from uuid import uuid4

from fastapi import APIRouter, Depends
from sse_starlette.sse import EventSourceResponse

from app.auth.dependencies import _get_db
from app.auth.access import require_admin

logger = logging.getLogger(__name__)
router = APIRouter()


@dataclass
class WarmupRowState:
    table_id: str
    status: Literal["pending", "warming", "fresh", "error"]
    started_at: str | None = None
    completed_at: str | None = None
    duration_ms: int | None = None
    error: str | None = None
    last_warmed_at: str | None = None


@dataclass
class WarmupRunState:
    run_id: str
    trigger: Literal["startup", "manual", "registry_change"]
    started_at: str
    completed_at: str | None = None
    total: int = 0
    completed: int = 0
    failed: int = 0
    rows: dict[str, WarmupRowState] = field(default_factory=dict)
    _subscribers: list[asyncio.Queue] = field(default_factory=list, repr=False)


WARMUP_STATE: WarmupRunState | None = None
_RUN_LOCK = asyncio.Lock()


def _now_iso() -> str:
    return datetime.now(timezone.utc).isoformat()


def maybe_schedule_startup_warmup() -> None:
    """Called from app/main.py FastAPI startup event."""
    if os.environ.get("AGNES_SKIP_CACHE_WARMUP") == "1":
        logger.info("cache warmup skipped (AGNES_SKIP_CACHE_WARMUP=1)")
        return
    coro = _warm_catalog_caches_bg(trigger="startup")
    try:
        asyncio.create_task(coro)
    except RuntimeError:
        # Close the coroutine so it is not reported as "never awaited".
        coro.close()
        logger.warning("no running event loop; startup warmup skipped")


async def _warm_catalog_caches_bg(trigger: str = "startup") -> None:
    """Walk registry, warm metadata + schema caches for every remote row."""
    global WARMUP_STATE
    async with _RUN_LOCK:
        # Re-check inside the lock: another caller might have started a
        # run while we were waiting.
        if WARMUP_STATE and WARMUP_STATE.completed_at is None:
            return

        run_id = uuid4().hex[:8]
        state = WarmupRunState(
            run_id=run_id, trigger=trigger, started_at=_now_iso(),
        )
        WARMUP_STATE = state

    rows = _list_remote_rows()
    state.total = len(rows)
    for r in rows:
        state.rows[r["id"]] = WarmupRowState(
            table_id=r["id"], status="pending",
        )
    _broadcast(state, {"event": "start", "data": {
        "run_id": run_id, "trigger": trigger, "total": state.total,
    }})

    sem = asyncio.Semaphore(int(os.environ.get("AGNES_WARMUP_CONCURRENCY", "4")))
    await asyncio.gather(
        *(_warm_one(r, state, sem) for r in rows), return_exceptions=True,
    )

    state.completed_at = _now_iso()
    _broadcast(state, {"event": "complete", "data": {
        "run_id": run_id, "total": state.total,
        "completed": state.completed, "failed": state.failed,
    }})
    logger.info(
        "cache warmup complete: run_id=%s total=%d ok=%d fail=%d",
        run_id, state.total, state.completed, state.failed,
    )


def _list_remote_rows() -> list[dict]:
    """Snapshot of registry rows that need a warmup pass."""
    from src.db import get_system_db
    from src.repositories.table_registry import TableRegistryRepository
    conn = get_system_db()
    rows = TableRegistryRepository(conn).list_all()
    return [
        r for r in rows
        if r.get("query_mode") == "remote"
        and r.get("source_type") == "bigquery"
    ]


async def _warm_one(
    row: dict, state: WarmupRunState, sem: asyncio.Semaphore,
) -> None:
    async with sem:
        rs = state.rows[row["id"]]
        rs.status = "warming"
        rs.started_at = _now_iso()
        _broadcast(state, {"event": "row", "data": asdict(rs)})
        t0 = time.monotonic()
        try:
            await asyncio.to_thread(_warm_metadata_sync, row)
            await asyncio.to_thread(_warm_schema_sync, row)
            rs.status = "fresh"
            rs.last_warmed_at = _now_iso()
            state.completed += 1
        except Exception as e:
            rs.status = "error"
            rs.error = str(e)
            state.failed += 1
            logger.warning("cache warmup row=%s failed: %s", row["id"], e)
        finally:
            rs.completed_at = _now_iso()
            rs.duration_ms = int((time.monotonic() - t0) * 1000)
            _broadcast(state, {"event": "row", "data": asdict(rs)})


def _warm_metadata_sync(row: dict) -> None:
    """Trigger metadata cache populate via the catalog's normal path."""
    from app.api.v2_catalog import _size_hint_for_row
    _size_hint_for_row(row)


def _warm_schema_sync(row: dict) -> None:
    """Trigger schema cache populate via build_schema_uncached."""
    from app.api.v2_schema import build_schema_uncached
    from connectors.bigquery.access import get_bq_access
    from src.db import get_system_db
    bq = get_bq_access()
    build_schema_uncached(get_system_db(), row["id"], bq=bq)


async def warm_one_table(table_id: str) -> None:
    """Single-row re-warm — invoked by `invalidate_for_table` after a
    registry change. Does NOT update WARMUP_STATE (small change shouldn't
    overwrite the last full run's status); just refreshes the caches."""
    from src.db import get_system_db
    from src.repositories.table_registry import TableRegistryRepository
    conn = get_system_db()
    row = TableRegistryRepository(conn).get(table_id)
    if not row or row.get("query_mode") != "remote":
        return
    try:
        await asyncio.to_thread(_warm_metadata_sync, row)
        await asyncio.to_thread(_warm_schema_sync, row)
    except Exception as e:
        logger.warning("single-row warmup failed for %s: %s", table_id, e)


def _broadcast(state: WarmupRunState, event: dict) -> None:
    """Send an event to every SSE subscriber. Dead queues are pruned."""
    dead = []
    for q in state._subscribers:
        try:
            q.put_nowait(event)
        except asyncio.QueueFull:
            dead.append(q)
    for q in dead:
        state._subscribers.remove(q)


def _serialize_state(state: WarmupRunState) -> dict:
    return {
        "run_id": state.run_id,
        "trigger": state.trigger,
        "started_at": state.started_at,
        "completed_at": state.completed_at,
        "total": state.total,
        "completed": state.completed,
        "failed": state.failed,
        "rows": {tid: asdict(rs) for tid, rs in state.rows.items()},
    }


# ─── Endpoints ────────────────────────────────────────────────────────


@router.get("/api/admin/cache-warmup/status")
async def warmup_status(user: dict = Depends(require_admin)):
    if WARMUP_STATE is None:
        return {"state": "never_run"}
    return _serialize_state(WARMUP_STATE)


@router.post("/api/admin/cache-warmup/run")
async def warmup_run(user: dict = Depends(require_admin)):
    if WARMUP_STATE and WARMUP_STATE.completed_at is None:
        return {"run_id": WARMUP_STATE.run_id, "status": "already_running"}
    asyncio.create_task(_warm_catalog_caches_bg(trigger="manual"))
    # Yield to the event loop once so the new task runs far enough to set
    # WARMUP_STATE before we build the response.
    await asyncio.sleep(0)
    return {
        "run_id": WARMUP_STATE.run_id if WARMUP_STATE else None,
        "status": "started",
    }


@router.get("/api/admin/cache-warmup/stream")
async def warmup_stream(user: dict = Depends(require_admin)):
    async def gen():
        # Capture the run once. A later run rebinds WARMUP_STATE, and the
        # cleanup below must remove the queue from the state it joined.
        state = WARMUP_STATE
        q: asyncio.Queue = asyncio.Queue(maxsize=256)
        if state:
            state._subscribers.append(q)
            # Replay current snapshot.
            yield {"event": "snapshot", "data": json.dumps(
                _serialize_state(state)
            )}
        try:
            while True:
                ev = await asyncio.wait_for(q.get(), timeout=30.0)
                yield {"event": ev["event"], "data": json.dumps(ev["data"])}
                if ev["event"] == "complete":
                    return
        except asyncio.TimeoutError:
            return
        finally:
            if state and q in state._subscribers:
                state._subscribers.remove(q)

    return EventSourceResponse(gen())
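The stream endpoint and `_broadcast` together form a small fan-out: every SSE client owns a bounded queue, and a publisher that finds a queue full drops that subscriber instead of blocking the warmup. The sketch below isolates the pattern in one class so it can be read and tested on its own. `Broadcaster` is a hypothetical helper, not something this plan adds.

```python
import asyncio
from typing import List


class Broadcaster:
    """Hypothetical helper isolating the subscriber fan-out pattern."""

    def __init__(self) -> None:
        self._subscribers: List[asyncio.Queue] = []

    def subscribe(self, maxsize: int = 256) -> asyncio.Queue:
        q: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
        self._subscribers.append(q)
        return q

    def unsubscribe(self, q: asyncio.Queue) -> None:
        if q in self._subscribers:
            self._subscribers.remove(q)

    def publish(self, event: dict) -> int:
        """Deliver to every subscriber; return how many remain subscribed."""
        dead = []
        for q in self._subscribers:
            try:
                q.put_nowait(event)
            except asyncio.QueueFull:
                dead.append(q)  # slow consumer: drop it rather than block
        for q in dead:
            self._subscribers.remove(q)
        return len(self._subscribers)
```

The trade-off is that a client which falls 256 events behind silently stops receiving updates. For a progress toolbar that can re-fetch the status snapshot, that is usually acceptable.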

Step 10.5: Register the router in app/main.py

Edit app/main.py. Find the section where routers are included (look for app.include_router(...) calls). Add:

from app.api.cache_warmup import router as cache_warmup_router
app.include_router(cache_warmup_router)

Step 10.6: Run tests

cd /tmp/agnes-metadata && python -m pytest tests/test_cache_warmup.py -v

Expected: all green.

Step 10.7: Commit

cd /tmp/agnes-metadata
git add app/api/cache_warmup.py app/main.py pyproject.toml \
        tests/test_cache_warmup.py
git commit -m "feat(cache-warmup): startup background warmup + status / run / stream endpoints

- WarmupRunState / WarmupRowState dataclasses + module-level singleton.
- _warm_catalog_caches_bg walks remote BQ rows with bounded concurrency
  (Semaphore(4), tunable via AGNES_WARMUP_CONCURRENCY).
- Per-row failures isolated; never blow up the gather.
- maybe_schedule_startup_warmup honors AGNES_SKIP_CACHE_WARMUP=1.
- warm_one_table for the single-row re-warm called from
  invalidate_for_table.
- GET /api/admin/cache-warmup/status — JSON snapshot.
- POST /api/admin/cache-warmup/run — manual trigger, idempotent under
  concurrent invocation (locks + checks completed_at).
- GET /api/admin/cache-warmup/stream — Server-Sent Events stream of
  start / row / complete events.

Refs: docs/superpowers/specs/2026-05-07-source-agnostic-table-metadata-spec.md"

Task 11: FastAPI startup event hook

Files:

Modify: app/main.py
Test: tests/test_main_startup_warmup.py
Step 11.1: Write failing test

Create tests/test_main_startup_warmup.py:

"""The FastAPI startup event schedules `maybe_schedule_startup_warmup`
without blocking readiness."""

from unittest.mock import patch


def test_startup_event_calls_warmup_scheduler():
    from app.main import app
    # FastAPI startup events are functions on `app.router.on_startup`.
    handler_names = [h.__name__ for h in app.router.on_startup]
    assert any("warm" in n.lower() for n in handler_names), (
        "Expected a startup handler with 'warm' in its name; "
        f"found: {handler_names}"
    )


def test_health_check_returns_immediately_during_warmup(seeded_app):
    """/api/health doesn't await warmup; readiness is fire-and-forget."""
    c = seeded_app["client"]
    r = c.get("/api/health")
    assert r.status_code == 200

Step 11.2: Run test to verify it fails

cd /tmp/agnes-metadata && python -m pytest tests/test_main_startup_warmup.py -v

Expected: the first test fails; the second is expected to pass already, since it only exercises /api/health.

Step 11.3: Add the startup hook

Edit app/main.py. Locate the existing FastAPI app instantiation (app = FastAPI(...)). Add a startup event handler nearby:

@app.on_event("startup")
async def warm_catalog_caches_on_startup():
    """Schedule a background warmup of the v2 catalog/schema/metadata
    caches. Fire-and-forget; readiness is not blocked. Failures inside
    the background task are logged + swallowed.
    """
    from app.api.cache_warmup import maybe_schedule_startup_warmup
    maybe_schedule_startup_warmup()

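If the file already uses a lifespan handler (the newer FastAPI pattern), add the call inside the lifespan function instead. In that case the on_startup assertion in Step 11.1 needs adapting, because code run from a lifespan function is not listed in app.router.on_startup. Inspect the file first.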
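For the lifespan variant, a minimal sketch could look like the following. It uses only `contextlib` so it runs without FastAPI installed; the factory name `make_lifespan` is hypothetical, and how it is wired into the real app/main.py depends on what that file already contains.

```python
from contextlib import asynccontextmanager
from typing import Any, AsyncIterator, Callable


def make_lifespan(schedule_warmup: Callable[[], None]):
    """Build a lifespan context manager that fires the scheduler once."""

    @asynccontextmanager
    async def lifespan(app: Any) -> AsyncIterator[None]:
        # Runs before the app starts serving. The scheduler only creates
        # a background task, so startup is not blocked on the warmup.
        schedule_warmup()
        yield
        # Shutdown-side cleanup would go here; the warmup needs none.

    return lifespan
```

Under these assumptions the app would be constructed with `FastAPI(lifespan=make_lifespan(maybe_schedule_startup_warmup))`, merged with whatever the existing lifespan function already does.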

Step 11.4: Run tests to verify they pass

cd /tmp/agnes-metadata && python -m pytest tests/test_main_startup_warmup.py -v

Expected: both pass.

Step 11.5: Commit

cd /tmp/agnes-metadata
git add app/main.py tests/test_main_startup_warmup.py
git commit -m "feat(main): startup event hook for catalog cache warmup

Fire-and-forget — readiness is not blocked. Per-row warmup runs in
background with bounded concurrency. Honors AGNES_SKIP_CACHE_WARMUP=1
opt-out for dev/test instances.

Refs: docs/superpowers/specs/2026-05-07-source-agnostic-table-metadata-spec.md"

Task 12: CLI post-register hint for query_mode=remote

Files:

Modify: cli/commands/admin.py (extend register_table post-success hint)
Test: tests/test_cli_admin.py (append test)
Step 12.1: Write failing test

Append to tests/test_cli_admin.py:

class TestRegisterTableHints:
    """The CLI prints helpful follow-up hints after a successful
    register-table call. v0.46 adds a third hint for query_mode=remote
    pointing at the IAM verify-your-SA smoke check."""

    def test_remote_register_emits_iam_verify_hint(self):
        with patch("cli.commands.admin.api_post", return_value=_resp(201, {"id": "t"})):
            result = runner.invoke(app, [
                "admin", "register-table", "orders",
                "--source-type", "bigquery",
                "--bucket", "dwh_base",
                "--source-table", "orders",
                "--query-mode", "remote",
            ])
        assert result.exit_code == 0
        assert "agnes query --remote" in result.output
        assert "query-modes.md" in result.output

    def test_local_register_does_not_emit_remote_hint(self):
        with patch("cli.commands.admin.api_post", return_value=_resp(201, {"id": "t"})):
            result = runner.invoke(app, [
                "admin", "register-table", "users",
                "--source-type", "keboola",
                "--bucket", "in.c-crm",
                "--source-table", "users",
                "--query-mode", "local",
            ])
        assert result.exit_code == 0
        assert "agnes query --remote" not in result.output

Step 12.2: Run test to verify it fails

cd /tmp/agnes-metadata && python -m pytest tests/test_cli_admin.py::TestRegisterTableHints -v

Expected: the first test fails; the second already passes because no remote hint is emitted yet.

Step 12.3: Add the hint

Edit cli/commands/admin.py. Find the success branch in register_table (search for the existing two hints — Next: run agnes setup first-sync and register-table does not auto-grant). Add a third hint conditional on query_mode == "remote":

        # Third hint: BQ-remote rows can fail at first analyst query if the
        # SA lacks dataViewer/jobUser. Pointing at the smoke command
        # surfaces the failure at registration time, not 30 minutes later.
        if query_mode == "remote":
            typer.echo(
                f"  Note: this is a remote-query table. Verify the SA can read it:\n"
                f"    agnes query --remote \"SELECT COUNT(*) FROM {name}\"\n"
                f"  If it 403s, see docs/admin/query-modes.md → \"BigQuery → IAM\"."
            )
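If the hint text grows, one option is to build it in a pure function and keep the `typer.echo` call as a thin wrapper, which lets the wording be tested without invoking the CLI. A sketch under that assumption follows; `remote_register_hint` is a hypothetical name.

```python
from typing import List


def remote_register_hint(name: str, query_mode: str) -> List[str]:
    """Return the extra hint lines for a remote table, or [] otherwise."""
    if query_mode != "remote":
        return []
    return [
        "  Note: this is a remote-query table. Verify the SA can read it:",
        f'    agnes query --remote "SELECT COUNT(*) FROM {name}"',
        '  If it 403s, see docs/admin/query-modes.md → "BigQuery → IAM".',
    ]
```

The caller would then echo `"\n".join(lines)` only when the list is non-empty.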

Step 12.4: Run tests

cd /tmp/agnes-metadata && python -m pytest tests/test_cli_admin.py::TestRegisterTableHints -v

Expected: 2 passed.

Step 12.5: Commit

cd /tmp/agnes-metadata
git add cli/commands/admin.py tests/test_cli_admin.py
git commit -m "feat(cli): third post-register hint for query_mode=remote

Points at agnes query --remote SELECT COUNT(*) as the smoke check for
SA IAM. Surfaces 403 USER_PROJECT_DENIED at register time, not when
the first analyst hits the table 30 minutes later. Cross-references the
new docs/admin/query-modes.md page (Task 13).

Refs: docs/superpowers/specs/2026-05-07-source-agnostic-table-metadata-spec.md"

Task 13: docs/admin/query-modes.md

Files:

Create: docs/admin/query-modes.md
Test: smoke test linking via tests/test_docs_links.py (optional)
Step 13.1: Write the doc

Create docs/admin/query-modes.md with this exact content:

# Query Modes — when to register a table as `local`, `remote`, or `materialized`

Source-agnostic guide to the three `query_mode` values Agnes supports. Pick the right mode at registration time and the analyst-side experience is fast, cost-aware, and predictable. Pick wrong and you'll either burn BQ scan budget on every query or spend hours waiting on syncs that didn't need to happen.

## TL;DR — decision tree

```
Is the table small (< 1 GB) and updated daily-or-slower?
  └─ YES → query_mode: local       (sync to laptop, query offline)

Is the table the result of an aggregate SQL the operator controls?
  └─ YES → query_mode: materialized  (server runs SQL → parquet, distributed)

Otherwise:
  └─ query_mode: remote   (data stays in upstream; analyst queries on demand)
```
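
The same tree can be read as a small function. This is only an illustrative sketch: `suggest_query_mode` and its parameters are hypothetical names, not part of the Agnes CLI or API, and the "1 GB" threshold is assumed to mean 10^9 bytes.

```python
ONE_GB = 10 ** 9  # assumption: "1 GB" in the tree is read as 10^9 bytes


def suggest_query_mode(
    size_bytes: int,
    updates_daily_or_slower: bool,
    is_operator_aggregate: bool,
) -> str:
    """Walk the decision tree top to bottom and return a query_mode value."""
    # Small and slow-moving tables are cheapest to sync once and query offline.
    if size_bytes < ONE_GB and updates_daily_or_slower:
        return "local"
    # Operator-controlled aggregates are computed server-side into parquet.
    if is_operator_aggregate:
        return "materialized"
    # Everything else stays upstream and is queried on demand.
    return "remote"
```

The branch order matters: a small daily table that is also an aggregate resolves to `local`, because the first question in the tree is asked first.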

## Three modes side-by-side

| Aspect | `local` | `materialized` | `remote` |
|---|---|---|---|
| Where the data lives | Analyst laptop (parquet) | Agnes server filesystem (parquet) | Upstream (BigQuery, Keboola, …) |
| Who runs the query | Analyst's local DuckDB | Analyst's local DuckDB | Upstream engine via DuckDB extension |
| Cost model | Free after sync | Free after each sync | Per-query scan cost on the analyst's first hit |
| Freshness | As fresh as last sync | As fresh as last scheduled run | Live |
| Scan limits | None (laptop disk) | None (server disk) | `bq_max_scan_bytes` cost gate (default 5 GiB) |
| Best for | Stable reference data, daily-updated facts | Aggregates, daily snapshots | Big tables, live data, residency-restricted |

## Per-source-type reference

### BigQuery — `query_mode: remote`

The most common use case for `remote`. Data stays in BQ; analysts query on demand via the Agnes server's service account.

**IAM:** the server's SA must have:
- `roles/bigquery.dataViewer` on the dataset (read access)
- `roles/bigquery.jobUser` on the *billing* project (run jobs)

If `data_source.bigquery.billing_project == data_source.bigquery.project`, grant the SA the `serviceusage.services.use` permission as well; otherwise the BQ extension can return 403 USER_PROJECT_DENIED on the first query. The instance health check (`agnes diagnose`) surfaces this as an `info`-tier entry on `bq_config`.

**Register via UI:** `/admin/tables` → "Add table" → Source type `bigquery` → Mode `remote` → fill `dataset` (your BQ dataset name) + `source_table` (the BQ table id within that dataset).

**Register via CLI:**

```bash
agnes admin register-table sales_2024 \
    --source-type bigquery \
    --bucket dwh_base \
    --source-table sales_2024 \
    --query-mode remote
```

After registration, smoke-test the SA's access:

```bash
agnes query --remote "SELECT COUNT(*) FROM sales_2024"
```

A 403 here means the SA is missing `dataViewer` or `jobUser`; fix in IAM and re-test.

**Cost guardrail:** `bq_max_scan_bytes` (default 5 GiB) refuses queries whose pre-execution scan estimate exceeds the cap. Configurable in `/admin/server-config`. When an analyst hits the cap, the response includes a hint to use `agnes snapshot create --where '<predicate>'` to materialise a scoped subset locally.
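
As a rough illustration of how such a gate behaves, the sketch below compares a scan estimate against the cap and builds the hint. The function name `check_scan_estimate` and the hint wording are assumptions made for this example; only the 5 GiB default and the `agnes snapshot create --where` suggestion come from the text above.

```python
from typing import Optional, Tuple

GIB = 1024 ** 3
DEFAULT_MAX_SCAN_BYTES = 5 * GIB  # documented default for bq_max_scan_bytes


def check_scan_estimate(
    estimated_bytes: int,
    max_scan_bytes: int = DEFAULT_MAX_SCAN_BYTES,
) -> Tuple[bool, Optional[str]]:
    """Return (allowed, hint). The hint is None when the query may run."""
    # Only estimates strictly above the cap are refused.
    if estimated_bytes <= max_scan_bytes:
        return True, None
    hint = (
        f"Estimated scan of {estimated_bytes / GIB:.1f} GiB exceeds the "
        f"{max_scan_bytes / GIB:.1f} GiB cap. Consider "
        "`agnes snapshot create --where '<predicate>'` to materialise "
        "a scoped subset locally."
    )
    return False, hint
```

Calling `check_scan_estimate(6 * GIB)` with the default cap returns `False` plus a hint, while an estimate at or below the cap returns `(True, None)`.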

### BigQuery — `query_mode: materialized`

The server runs a scheduled SQL aggregate against BigQuery and writes the result to a parquet on the Agnes filesystem. Analysts get the parquet via `agnes pull` like any other local table.

**Register via CLI:**

```bash
agnes admin register-table monthly_kpis \
    --source-type bigquery \
    --bucket dwh_base \
    --source-table monthly_kpis \
    --query-mode materialized \
    --query @path/to/monthly_kpis.sql \
    --sync-schedule "daily 03:00"
```

**Cost guardrail:** `data_source.bigquery.max_bytes_per_materialize` (default 10 GiB; set `0` to disable) refuses materialise runs whose query plan exceeds the cap. Catches a typo'd `WHERE` clause that would otherwise scan a year of data.

### Keboola — `query_mode: local` (the production path)

The Agnes server's Keboola DuckDB extension downloads the table to a parquet on the server filesystem; `agnes pull` distributes it to analyst laptops.

**Setup:** `instance.yaml.data_source.type: keboola` + Storage API token via `KEBOOLA_STORAGE_TOKEN` env var (or whatever `instance.yaml.token_env` points at).

**Register via CLI:**

```bash
agnes admin register-table users \
    --source-type keboola \
    --bucket in.c-crm \
    --source-table users \
    --query-mode local
```

**`query_mode: remote` for Keboola** is architecturally supported via the `_remote_attach` mechanism (the orchestrator can ATTACH the Keboola DuckDB extension on demand the same way it does for BQ), but **not in active deployment use today**. If you have an analyst workflow against a Keboola table that's too big to sync, file an issue — the architecture is in place but the registration UX hasn't been polished.

### Jira — `query_mode: local` only

Event-driven: webhooks update parquets incrementally. No `remote` or `materialized` mode for Jira today.

## Worked examples

**1. Big BigQuery fact table you query weekly:** `query_mode: remote`. SA needs `dataViewer` + `jobUser`. Analyst uses `agnes query --remote` for one-off aggregates and `agnes snapshot create` for cross-week joins.

**2. Daily Keboola dimension table:** `query_mode: local`. Synced once a day by the scheduler; analyst's `agnes pull` picks it up.

**3. Monthly KPI aggregate from a BQ datawarehouse:** `query_mode: materialized` + `--sync-schedule "0 3 1 * *"` (3:00 on the 1st of each month). The server runs your aggregate SQL once a month; analysts get a parquet of the result.

## See also

- `docs/RBAC.md` — granting analysts access to a registered table.
- `config/instance.yaml.example` — the `data_source` config block.
- `agnes catalog --json` — inspect a registered table's mode + size hints.
- `agnes diagnose` — surface `bq_config` IAM issues and other health entries.

Step 13.2: Quick smoke check that the doc builds without dead links

cd /tmp/agnes-metadata && python -c "
from pathlib import Path
content = Path('docs/admin/query-modes.md').read_text()
# Check the cross-referenced files exist.
for ref in ['docs/RBAC.md', 'config/instance.yaml.example']:
    assert Path(ref).exists(), f'broken cross-ref: {ref}'
print('docs OK')
"

Expected: docs OK.
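
The smoke check above hardcodes two cross-references. If the optional tests/test_docs_links.py is written, one possible approach is to extract path-like references from the markdown itself. The sketch below assumes references appear in backticks and start with `docs/` or `config/`; the helper names `referenced_paths` and `missing_refs` are hypothetical.

```python
import re
from pathlib import Path
from typing import List

# Assumption: cross-references are backtick-quoted and rooted at docs/ or config/.
PATH_REF = re.compile(r"`((?:docs|config)/[\w./-]+)`")


def referenced_paths(markdown: str) -> List[str]:
    """Return the unique repo-relative paths mentioned in the markdown, sorted."""
    return sorted(set(PATH_REF.findall(markdown)))


def missing_refs(markdown: str, root: Path) -> List[str]:
    """Return referenced paths that do not exist under the given repo root."""
    return [ref for ref in referenced_paths(markdown) if not (root / ref).exists()]
```

A test could then assert `missing_refs(Path('docs/admin/query-modes.md').read_text(), Path('.')) == []` from the repository root.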

Step 13.3: Commit

cd /tmp/agnes-metadata
git add docs/admin/query-modes.md
git commit -m "docs(admin): query-modes.md — when to register a table as local/remote/materialized

Source-agnostic decision tree + per-source-type reference (BigQuery,
Keboola, Jira). Three worked examples. Cross-references to RBAC.md +
instance.yaml.example.

Linked from the admin UI tooltip (Task 14) and the CLI post-register
hint (Task 12).

Refs: docs/superpowers/specs/2026-05-07-source-agnostic-table-metadata-spec.md"

Task 14: Admin UI integration on /admin/tables

Files:

Modify: app/web/templates/admin_tables.html
Test: tests/test_admin_tables_warmup_ui.py
Step 14.1: Write failing test

Create tests/test_admin_tables_warmup_ui.py:

"""Smoke test that /admin/tables HTML contains the cache toolbar markup,
the EventSource wiring, and the per-row col-status slot."""


def test_cache_toolbar_present(seeded_app):
    c = seeded_app["client"]
    token = seeded_app["admin_token"]
    r = c.get(
        "/admin/tables", headers={"Authorization": f"Bearer {token}"},
    )
    assert r.status_code == 200, r.text
    body = r.text
    assert 'id="cacheWarmupCard"' in body
    assert "Re-warm all" in body
    assert "/api/admin/cache-warmup/stream" in body
    assert "EventSource" in body


def test_query_mode_doc_link_present(seeded_app):
    c = seeded_app["client"]
    token = seeded_app["admin_token"]
    r = c.get(
        "/admin/tables", headers={"Authorization": f"Bearer {token}"},
    )
    assert r.status_code == 200
    assert "query-modes.md" in r.text or "/docs/admin/query-modes" in r.text


def test_col_status_th_present_in_renderer(seeded_app):
    """The renderRegistryListing JS still emits <th class='col-status'>
    so the per-row badge slot exists."""
    c = seeded_app["client"]
    token = seeded_app["admin_token"]
    r = c.get("/admin/tables", headers={"Authorization": f"Bearer {token}"})
    assert 'class="col-status"' in r.text

Step 14.2: Run test to verify it fails

cd /tmp/agnes-metadata && python -m pytest tests/test_admin_tables_warmup_ui.py -v

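Expected: 2 of 3 fail. test_cache_toolbar_present fails because the toolbar markup and EventSource wiring are absent, and test_query_mode_doc_link_present fails because the doc link is not added until Step 14.5. test_col_status_th_present_in_renderer probably passes already.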

Step 14.3: Add the cache toolbar <section> to admin_tables.html

Edit app/web/templates/admin_tables.html. Locate the section between the page header and the per-source-type table listings — search for bqTableListing to find the area. Insert the new card just before the listings:

<section id="cacheWarmupCard" class="card" style="margin-bottom: 20px;">
    <header class="card-header" style="display: flex; justify-content: space-between; align-items: center;">
        <h2>Cache freshness</h2>
        <button class="btn btn-secondary" id="cacheWarmupRunBtn" onclick="cacheWarmupRun()">
            Re-warm all
        </button>
    </header>
    <div class="card-body">
        <div id="cacheWarmupProgress" style="margin-bottom: 8px;">
            <span id="cacheWarmupSummary">Loading…</span>
        </div>
        <progress id="cacheWarmupBar" max="100" value="0" style="width: 100%; display: none;"></progress>
        <details style="margin-top: 8px;">
            <summary style="cursor: pointer; user-select: none;">Show log</summary>
            <pre id="cacheWarmupLog" style="background: #0a0a0a; color: #dcdcdc; font-family: ui-monospace, Menlo, monospace; font-size: 12px; padding: 8px; max-height: 240px; overflow-y: auto; margin-top: 8px; border-radius: 4px;"></pre>
        </details>
    </div>
</section>

Step 14.4: Add the JS for live updates

Inside the existing <script> block at the bottom of admin_tables.html, append:

    // ── Cache warmup toolbar (issue #155 / #156) ────────────────
    let cacheWarmupSource = null;

    function cacheWarmupInit() {
        cacheWarmupRefreshSnapshot();
        cacheWarmupOpenStream();
    }

    function cacheWarmupRefreshSnapshot() {
        fetch('/api/admin/cache-warmup/status')
            .then(function(r) { return r.json(); })
            .then(function(state) { cacheWarmupRender(state); })
            .catch(function() { /* silent */ });
    }

    function cacheWarmupOpenStream() {
        try {
            cacheWarmupSource = new EventSource('/api/admin/cache-warmup/stream');
            cacheWarmupSource.addEventListener('start', cacheWarmupOnStart);
            cacheWarmupSource.addEventListener('row', cacheWarmupOnRow);
            cacheWarmupSource.addEventListener('complete', cacheWarmupOnComplete);
            cacheWarmupSource.addEventListener('snapshot', function(e) {
                cacheWarmupRender(JSON.parse(e.data));
            });
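            cacheWarmupSource.onerror = function() {
                // Fallback: close the broken stream, then refresh the
                // snapshot and try to reconnect after 3 s. Repeats on
                // every failure, so the UI keeps updating while SSE is down.
                if (cacheWarmupSource) {
                    cacheWarmupSource.close();
                    cacheWarmupSource = null;
                }
                setTimeout(function() {
                    cacheWarmupRefreshSnapshot();
                    cacheWarmupOpenStream();
                }, 3000);
            };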
        } catch (e) {
            // EventSource unsupported; poll instead.
            setInterval(cacheWarmupRefreshSnapshot, 3000);
        }
    }

    function cacheWarmupRender(state) {
        var summary = document.getElementById('cacheWarmupSummary');
        var bar = document.getElementById('cacheWarmupBar');
        var btn = document.getElementById('cacheWarmupRunBtn');
        if (!summary) return;

        if (!state || state.state === 'never_run') {
            summary.textContent = 'No cache warmup yet — click Re-warm all to start.';
            bar.style.display = 'none';
            btn.disabled = false;
            return;
        }
        var inProgress = state.completed_at === null || state.completed_at === undefined;
        var pct = state.total > 0 ? Math.round((state.completed * 100) / state.total) : 0;
        summary.textContent = inProgress
            ? state.completed + ' / ' + state.total + ' fresh — running…'
            : 'Last run: ' + state.completed + ' ok, ' + state.failed + ' errors';
        bar.style.display = 'block';
        bar.value = pct;
        btn.disabled = inProgress;

        // Per-row badges.
        if (state.rows) {
            for (var tid in state.rows) {
                cacheWarmupSetRowBadge(tid, state.rows[tid]);
            }
        }
    }

    function cacheWarmupOnStart(e) {
        var data = JSON.parse(e.data);
        var log = document.getElementById('cacheWarmupLog');
        log.textContent = '';
        cacheWarmupAppendLog(data.run_id ? '[' + nowHHMMSS() + '] start  trigger=' + data.trigger + ' total=' + data.total : '');
        cacheWarmupRefreshSnapshot();
    }

    function cacheWarmupOnRow(e) {
        var rs = JSON.parse(e.data);
        cacheWarmupAppendLog(
            '[' + nowHHMMSS() + '] ' + rs.status.padEnd(8) + rs.table_id +
            (rs.duration_ms ? '  (' + (rs.duration_ms / 1000).toFixed(1) + ' s)' : '') +
            (rs.error ? '  ' + rs.error : '')
        );
        cacheWarmupSetRowBadge(rs.table_id, rs);
        cacheWarmupRefreshSnapshot();
    }

    function cacheWarmupOnComplete(e) {
        var data = JSON.parse(e.data);
        cacheWarmupAppendLog(
            '[' + nowHHMMSS() + '] complete total=' + data.total +
            ' ok=' + data.completed + ' fail=' + data.failed
        );
        cacheWarmupRefreshSnapshot();
    }

    function cacheWarmupAppendLog(line) {
        var log = document.getElementById('cacheWarmupLog');
        if (!log) return;
        log.textContent += line + '\n';
        log.scrollTop = log.scrollHeight;
    }

    function cacheWarmupSetRowBadge(tableId, rs) {
        // Per-row badge in the col-status slot. Selector: any <tr>
        // whose first <td class="col-id"> text matches tableId.
        document.querySelectorAll('tr').forEach(function(tr) {
            var idCell = tr.querySelector('td.col-id');
            if (!idCell || idCell.textContent.trim() !== tableId) return;
            var statusCell = tr.querySelector('td.col-status');
            if (!statusCell) return;
            var color = {fresh: '#10B77F', warming: '#0073D1', pending: '#9CA3AF', error: '#EA580C'}[rs.status] || '#9CA3AF';
            // Build the badge with DOM APIs so rs.status and rs.error are
            // never parsed as HTML (error text comes from upstream failures).
            var badge = document.createElement('span');
            badge.style.cssText = 'display:inline-block;padding:2px 6px;border-radius:3px;font-size:11px;color:white;background:' + color + ';';
            badge.title = rs.error || '';
            badge.textContent = rs.status;
            statusCell.textContent = '';
            statusCell.appendChild(badge);
        });
    }

    function nowHHMMSS() {
        var d = new Date();
        return d.toTimeString().slice(0, 8);
    }

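    function cacheWarmupRun() {
        var btn = document.getElementById('cacheWarmupRunBtn');
        btn.disabled = true;
        fetch('/api/admin/cache-warmup/run', {method: 'POST'})
            .then(function(r) {
                // fetch() only rejects on network failure, so treat HTTP
                // error statuses as failures too and re-enable the button.
                if (!r.ok) throw new Error('HTTP ' + r.status);
                return r.json();
            })
            .then(function() { /* SSE stream picks up the new run */ })
            .catch(function() { btn.disabled = false; });
    }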

    // Init on page load.
    document.addEventListener('DOMContentLoaded', cacheWarmupInit);

Step 14.5: Add ? icon next to query_mode field in the edit modal

Find the edit-modal block in admin_tables.html (search for editBqQueryMode or similar). Next to the query_mode field's <label>, append:

<a href="/docs/admin/query-modes.md" target="_blank" title="When to use which mode" style="margin-left: 6px; text-decoration: none;">?</a>

If the docs aren't served at /docs/admin/..., link to the GitHub URL instead (look at recent edits — cli/lib/setup_instructions.py may have a server-side doc base URL).

Step 14.6: Run smoke tests

cd /tmp/agnes-metadata && python -m pytest tests/test_admin_tables_warmup_ui.py -v

Expected: 3 passed.

Step 14.7: Manual visual check (optional but recommended)

If a dev instance is available, browse to /admin/tables, register a remote BQ table, click "Re-warm all", and verify the log scrolls + per-row badges update.
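
Without a browser, the stream can also be inspected from a script. The following is a minimal sketch of an SSE line parser, assuming the stream carries JSON payloads under the `start` / `row` / `complete` event names used by the JS above; `iter_sse_events` is a hypothetical helper, not part of the codebase.

```python
import json
from typing import Dict, Iterable, Iterator, List, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[str, Dict]]:
    """Yield (event_name, decoded_json_payload) pairs from raw SSE text lines."""
    event: str = "message"  # SSE default when no "event:" field is sent
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates one event; dispatch it if it has data.
            if data:
                yield event, json.loads("\n".join(data))
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, typically a keep-alive ping
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].lstrip(" "))
```

Feeding it the lines of an authenticated streaming GET on /api/admin/cache-warmup/stream (for example from an HTTP client's line iterator) should show one `row` event per warmed table, assuming the endpoint behaves as described in Task 10.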

Step 14.8: Commit

cd /tmp/agnes-metadata
git add app/web/templates/admin_tables.html tests/test_admin_tables_warmup_ui.py
git commit -m "feat(admin/tables): cache freshness toolbar + per-row badges + SSE log

Adds a single cache panel above the per-source-type listings:
- Re-warm all button → POST /api/admin/cache-warmup/run.
- Progress bar + summary line.
- Collapsible terminal-style log fed by EventSource on
  /api/admin/cache-warmup/stream (polling fallback at 3 s on SSE error).
- Per-row badge in the col-status slot updates live as warmup events
  arrive.
- ? icon next to the query_mode field in the edit modal links to
  docs/admin/query-modes.md.

Refs: docs/superpowers/specs/2026-05-07-source-agnostic-table-metadata-spec.md"

Task 15: CHANGELOG + version bump

Files:

Modify: CHANGELOG.md
Modify: pyproject.toml
Step 15.1: Bump version

Edit pyproject.toml. Find:

version = "0.45.x"

Change to:

version = "0.46.0"

Step 15.2: Add CHANGELOG entry

Edit CHANGELOG.md. Find ## [Unreleased] near the top. Replace with:

## [Unreleased]

## [0.46.0] — 2026-05-07

Catalog metadata enrichment + cache discipline + automatic warmup.
Closes #155 + #156.

### Added

- **`/api/v2/catalog` returns four new optional fields per row** — `rows`,
  `size_bytes`, `partition_by`, `clustered_by` — populated by per-source-type
  metadata providers (`connectors/bigquery/metadata.py`,
  `connectors/keboola/metadata.py`). For `query_mode='remote'` BigQuery rows,
  `size_bytes` is `active_logical_bytes + long_term_logical_bytes` (a full
  scan reads both); region resolved from `data_source.bigquery.location` →
  `bq_client.get_dataset(...)` → fall back to legacy `__TABLES__`.
  Existing CLI consumers reading only `rough_size_hint` are unaffected.
- **Automatic cache warmup at startup.** FastAPI startup event schedules
  a background task that walks BQ remote rows and pre-populates
  `_metadata_cache` + `_schema_cache` with bounded concurrency (default 4,
  tunable via `AGNES_WARMUP_CONCURRENCY`). Doesn't block readiness;
  per-row failures logged + skipped. Opt-out via `AGNES_SKIP_CACHE_WARMUP=1`.
- **Three new admin endpoints under `/api/admin/cache-warmup/*`:**
  - `GET /status` — JSON snapshot of the latest run.
  - `POST /run` — manual trigger, idempotent under concurrent invocation.
  - `GET /stream` — Server-Sent Events with `start` / `row` / `complete`
    events for live UI updates.
- **`/admin/tables` cache freshness panel.** Toolbar above the per-source-type
  listings with progress bar + "Re-warm all" button + collapsible
  terminal-style log fed by SSE (polling fallback at 3 s). Per-row badge
  in the existing `col-status` column updates live (fresh / warming /
  pending / error).
- **`docs/admin/query-modes.md`** — source-agnostic admin reference for
  registering tables as `local` / `remote` / `materialized`. Decision
  tree, per-source-type IAM + setup, three worked examples. Linked from
  the `?` icon next to the `query_mode` field in the admin UI edit modal
  and from the third post-register hint in `agnes admin register-table`.
- **`agnes admin register-table` post-register hint** for `query_mode=remote`:
  points at `agnes query --remote "SELECT COUNT(*)..."` as the IAM smoke
  check so a missing `dataViewer` / `jobUser` surfaces at registration
  time, not 30 minutes later.

### Changed

- **`/api/v2/schema/{id}` cache miss now does 1 BQ job instead of 2.**
  `connectors/bigquery/access.py:fetch_bq_columns_full` collapses what
  used to be `_fetch_bq_schema` + `_fetch_bq_table_options` into a single
  `INFORMATION_SCHEMA.COLUMNS` query (same view, same predicate, just a
  combined SELECT list). The metadata provider's partition/cluster path
  shares the same helper — zero SQL duplication across the two consumers.
- **All four catalog/schema/sample/metadata caches are flushed on registry
  change.** `app/api/v2_catalog.py:invalidate_for_table` is wired into
  `POST /api/admin/register-table`, `PUT /api/admin/registry/{id}`, and
  `DELETE /api/admin/registry/{id}`. After a registry write, a single-row
  re-warm task is scheduled in the background so the admin's verification
  request hits warm caches within ~1 s instead of waiting for the next
  analyst miss. Pre-fix none of the caches were invalidated — admin
  registers a table, `agnes catalog` doesn't show the new row for up to
  5 min; admin updates a row's bucket, `agnes schema` returns the OLD
  column list for up to 1 hour.
- **`v2_schema.build_schema` split into RBAC-aware outer + RBAC-naive
  inner (`build_schema_uncached`).** Live endpoint behavior unchanged;
  warmup uses the inner entry point to populate `_schema_cache` without
  a user context.

### Internal

- New shared dataclass module `app/api/_metadata_models.py` with
  `MetadataRequest` (frozen) + `TableMetadata` for source-agnostic
  provider input/output.
- New `connectors/keboola/storage_api.py:KeboolaStorageClient.get_table_info`
  thin wrapper — keeps `_get` private to the module.
- New env vars (operator-facing tuning, no required setup change):
  - `AGNES_SKIP_CACHE_WARMUP` — opt-out of startup warmup.
  - `AGNES_WARMUP_CONCURRENCY` — default 4, max parallel BQ
    INFORMATION_SCHEMA jobs during a warmup pass.
- New runtime dependency: `sse-starlette>=2.0` (Server-Sent Events
  responses for the cache-warmup stream).
- Tests added: `test_metadata_models`, `test_v2_schema_columns_consolidation`,
  `test_v2_catalog_dispatcher`, `test_connectors_bigquery_metadata`,
  `test_connectors_keboola_metadata`, `test_v2_catalog_remote_metadata`,
  `test_v2_catalog_invalidation`, `test_cache_warmup`,
  `test_main_startup_warmup`, `test_admin_tables_warmup_ui`.

Step 15.3: Smoke-check version + changelog consistency

cd /tmp/agnes-metadata && grep '^version' pyproject.toml
cd /tmp/agnes-metadata && head -20 CHANGELOG.md

Expected: pyproject shows 0.46.0; CHANGELOG has ## [0.46.0] — 2026-05-07 directly after ## [Unreleased].
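
The same consistency check could be automated rather than eyeballed. This is a sketch under the assumption that pyproject.toml carries a single top-level `version = "..."` line and that release headings follow the `## [x.y.z]` form shown above; the helper names are hypothetical.

```python
import re
from typing import Optional


def pyproject_version(text: str) -> Optional[str]:
    """Return the first `version = "..."` value found at the start of a line."""
    match = re.search(r'^version\s*=\s*"([^"]+)"', text, re.MULTILINE)
    return match.group(1) if match else None


def first_release_heading(changelog: str) -> Optional[str]:
    """Return the first `## [x.y.z]` heading that is not [Unreleased]."""
    for match in re.finditer(r"^## \[([^\]]+)\]", changelog, re.MULTILINE):
        if match.group(1).lower() != "unreleased":
            return match.group(1)
    return None


def versions_agree(pyproject_text: str, changelog_text: str) -> bool:
    """True when the package version equals the newest released changelog entry."""
    version = pyproject_version(pyproject_text)
    return version is not None and version == first_release_heading(changelog_text)
```

Run against the two files after Steps 15.1 and 15.2, `versions_agree` should return `True`; a forgotten bump in either file makes it return `False`.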

Step 15.4: Run the full targeted test set

cd /tmp/agnes-metadata && python -m pytest \
    tests/test_metadata_models.py \
    tests/test_keboola_storage_api.py::TestGetTableInfo \
    tests/test_v2_schema_columns_consolidation.py \
    tests/test_v2_schema.py \
    tests/test_v2_catalog_dispatcher.py \
    tests/test_connectors_bigquery_metadata.py \
    tests/test_connectors_keboola_metadata.py \
    tests/test_v2_catalog_remote_metadata.py \
    tests/test_v2_catalog_invalidation.py \
    tests/test_cache_warmup.py \
    tests/test_main_startup_warmup.py \
    tests/test_admin_tables_warmup_ui.py \
    tests/test_cli_admin.py::TestRegisterTableHints \
    -v

Expected: all green.

Step 15.5: Commit

cd /tmp/agnes-metadata
git add CHANGELOG.md pyproject.toml
git commit -m "release: 0.46.0 — source-agnostic catalog metadata + cache discipline

PR closes #155 + #156.

Highlights:
- Catalog response gains rows / size_bytes / partition_by / clustered_by.
- /api/v2/schema cache miss: 2 BQ jobs → 1 (-50%).
- All 4 catalog/schema/sample/metadata caches flush on registry change.
- Automatic startup warmup with bounded concurrency.
- /admin/tables shows cache freshness toolbar + per-row badges + SSE log.
- docs/admin/query-modes.md as the single source of truth for mode choice.

No DB migration; no wire break; MIN_COMPAT_CLI_VERSION unchanged.

Refs: docs/superpowers/specs/2026-05-07-source-agnostic-table-metadata-spec.md"

Task 16: Final verification + PR

Step 16.1: Push branch

cd /tmp/agnes-metadata && git push -u origin zs/issue-155-156-source-agnostic-metadata

Step 16.2: Open PR with auto-close keywords PER ISSUE

gh pr create --title "release: 0.46.0 — source-agnostic catalog metadata + cache discipline" \
  --body "$(cat <<'EOF'
## Summary

- Catalog enrichment for `query_mode='remote'` rows: rows + size_bytes + partition_by + clustered_by.
- `/api/v2/schema` cache miss: 2 BQ jobs → 1 (-50%) via shared `fetch_bq_columns_full`.
- All four catalog/schema/sample/metadata caches flush on registry change; single-row re-warm scheduled.
- Automatic cache warmup at server startup; SSE-driven freshness toolbar on `/admin/tables`.

Closes #155.
Closes #156.

## Test plan

- [x] Targeted: 50+ unit + integration tests across 10 new test modules.
- [x] No DB migration; no wire-break; MIN_COMPAT_CLI_VERSION unchanged.
- [ ] Reviewer: register a remote BQ table, observe `/admin/tables` toolbar populates within 2 s.
- [ ] Reviewer: trigger `Re-warm all`, verify SSE log scrolls and per-row badges update.
- [ ] Reviewer: edit a registered row's bucket, verify `agnes schema <id>` returns updated columns immediately (no 1-hour staleness).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
EOF
)"

Per the user's feedback_github_close_keyword_per_issue memory: each issue gets its own Closes #N keyword on its own line. A list after one keyword only auto-closes the first.

Self-Review

After writing this plan, I checked it against the spec end-to-end:

Spec coverage:

✅ Provider pattern + dispatcher → Tasks 5, 6, 7
✅ Identifier validation in dispatcher → Task 5
✅ COLUMNS query consolidation → Task 3
✅ build_schema RBAC/cache split → Task 4
✅ v2_catalog wiring + 15-min metadata cache → Task 8
✅ Unified invalidate_for_table → Task 9
✅ Cache warmup framework + endpoints → Task 10
✅ FastAPI startup hook → Task 11
✅ CLI post-register hint → Task 12
✅ docs/admin/query-modes.md → Task 13
✅ Admin UI integration → Task 14
✅ CHANGELOG + version bump → Task 15

Placeholder scan: no "TBD", "TODO", or "implement later" patterns remain. Every step has actual code or an exact command.

Type consistency: MetadataRequest / TableMetadata / WarmupRunState / WarmupRowState field names referenced consistently across tasks. fetch_bq_columns_full signature consistent between Task 3 and Task 7.

One known nit: Task 9's _rewarm_one_row references cache_warmup.warm_one_table, which is not defined until Task 10. The lazy import inside the function, wrapped in an outer try/except, covers that gap during development. Even without the lazy import, a subagent running Tasks 9 and 10 strictly in order would still see the Task 9 test pass, because that test patches asyncio.create_task. Documented in Task 9 step 9.3.
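
For readers unfamiliar with the pattern, a lazy import guarded by an outer try/except looks roughly like the sketch below. It is not the Task 9 code: the `module_name` parameter, the boolean return value, and the assumption that `warm_one_table` is an async function taking a table id are all introduced here for illustration.

```python
import asyncio
import importlib
import logging

logger = logging.getLogger(__name__)


async def rewarm_one_row(table_id: str, module_name: str = "cache_warmup") -> bool:
    """Best-effort re-warm: never raises, returns True only on success."""
    try:
        # Lazy import: resolved at call time, so this module still imports
        # cleanly while the warmup module does not exist yet.
        module = importlib.import_module(module_name)
        await module.warm_one_table(table_id)  # assumed async signature
        return True
    except Exception:
        # Covers ImportError during development and any per-row warm failure.
        logger.warning("re-warm skipped for %s", table_id, exc_info=True)
        return False
```

Because failures are swallowed and logged, a registry write that schedules this coroutine is not affected when the warmup module is missing or a single row fails to warm.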

Ready for execution.
