release: 0.50.0 — persistent BQ metadata cache + scheduled refresh; catalog never blocks on BigQuery
Since 0.47.0 GET /api/v2/catalog enriched each remote BigQuery row by fetching INFORMATION_SCHEMA.TABLE_STORAGE + COLUMNS through the DuckDB BigQuery extension *inside the request*. On cold caches that fanned out to O(N) sequential BQ jobs-API roundtrips — easily 90 s+ on partitioned / view-backed tables — and reliably blew the CLI's 30 s httpx ReadTimeout. Reproduced with py-spy: three AnyIO worker threads stuck inside connectors/bigquery/metadata._fetch_via_legacy_tables.

Refactor: enrichment is read exclusively from a new persistent bq_metadata_cache DuckDB table (schema v40), populated by a scheduler-driven refresh job at SCHEDULER_BQ_METADATA_REFRESH_INTERVAL (default 4 h). Cold catalog response on a fresh container is now tens of milliseconds with metadata_freshness=never_fetched for unwarmed rows.

New surface:
- POST /api/admin/run-bq-metadata-refresh (scheduler-driven, full)
- POST /api/v2/metadata-cache/refresh?table=<id> (admin, single)
- GET /api/v2/metadata-cache/status (auth, non-admin)
- metadata_freshness field per catalog row

Removed (internal API): v2_catalog._size_hint_for_row, _resolve_remote_metadata, _metadata_provider_for, _build_metadata_request, _materialized_size_hint, in-memory _metadata_cache. Response shape unchanged for external consumers.

991 tests passing; 2 pre-existing failures (test_db v3→v4 ladder, test_cli_binary_rename) unrelated to this change.
This commit is contained in:
parent
183ee44bad
commit
b3841f5b6c
16 changed files with 1158 additions and 356 deletions
24
CHANGELOG.md
@@ -10,6 +10,30 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C

 ## [Unreleased]

+## [0.50.0] — 2026-05-11
+
+### Fixed
+
+- **`GET /api/v2/catalog` no longer hangs on cold cache.** Since 0.47.0 the catalog endpoint enriched each remote BigQuery row by fetching `INFORMATION_SCHEMA.TABLE_STORAGE` + `COLUMNS` through the DuckDB BigQuery extension inside the request. On cold caches that fanned out to O(N) sequential BQ jobs-API roundtrips — easily 90 s+ on partitioned / view-backed tables — and reliably exceeded the CLI's 30 s `httpx.ReadTimeout`. Enrichment now reads exclusively from a persistent `bq_metadata_cache` DuckDB table, populated by a scheduler-driven refresh job. First call after a fresh container start returns in tens of milliseconds with `metadata_freshness: never_fetched` for rows the scheduler hasn't reached yet; subsequent ticks fill the cache. Closes the cold-start outage class entirely.
+
+### Added
+
+- **Persistent BigQuery metadata cache (`bq_metadata_cache`, schema v40).** Holds `rows`, `size_bytes`, `partition_by`, `clustered_by`, `refreshed_at`, plus an `error_at` / `error_msg` pair that preserves the last successful row across transient provider failures so analyst tooling keeps seeing last-known-good numbers.
+- **`POST /api/admin/run-bq-metadata-refresh`** — scheduler-driven full refresh of every remote BigQuery row in the registry. Bounded concurrency via `AGNES_BQ_METADATA_REFRESH_CONCURRENCY` (default 4).
+- **`POST /api/v2/metadata-cache/refresh?table=<id>`** — operator on-demand single-row refresh (admin-gated), for use right after a registry edit when waiting for the next scheduled tick is too long.
+- **`GET /api/v2/metadata-cache/status`** — non-admin endpoint surfacing per-row `refreshed_at`, `error_at`, `error_msg`, and `freshness` (`fresh` / `stale` / `never_fetched` / `error`) so CLI / Claude Code can decide whether to trust the catalog's `rows` and `size_bytes`.
+- **`metadata_freshness` field** in every `/api/v2/catalog` row. `not_applicable` for `local` / `materialized` rows where the BQ cache concept doesn't apply.
+- **Scheduler job `bq-metadata-refresh`** running at `SCHEDULER_BQ_METADATA_REFRESH_INTERVAL` (default `4 * 60 * 60` seconds = 4 h). Tunable per deployment; the catalog request path is independent of the value.
+
+### Changed
+
+- **BREAKING (internal API):** removed `app.api.v2_catalog._size_hint_for_row`, `_resolve_remote_metadata`, `_metadata_provider_for`, `_build_metadata_request`, `_materialized_size_hint`, and the in-memory `_metadata_cache` (`TTLCache`). Catalog responses still expose the same enrichment fields (`rows`, `size_bytes`, `partition_by`, `clustered_by`); the new `metadata_freshness` field is additive. External consumers that read the response shape are unaffected.
+- `app.api.cache_warmup._warm_metadata_sync` now refreshes the persistent cache via `bq_metadata_refresh.refresh_one` instead of priming an in-memory TTL cache. The existing `/api/admin/cache-warmup/*` endpoints and admin-tables SSE wiring continue to work.
+
+### Internal
+
+- Schema v40 migration `_V39_TO_V40_MIGRATIONS` adds the new table; existing instances pick it up on next start. Empty cache is treated as `never_fetched` by the catalog, never as an error.
+
 ## [0.49.1] — 2026-05-11

 ### Added
310
app/api/bq_metadata_refresh.py
Normal file
@@ -0,0 +1,310 @@
+"""BigQuery metadata cache refresh — owner of the ``bq_metadata_cache``
+write path.
+
+Three endpoints share this module:
+
+- ``POST /api/admin/run-bq-metadata-refresh`` — called by the scheduler
+  container (auth: shared scheduler token resolves to a synthetic admin
+  user). Walks remote rows in ``table_registry``, fetches each via the
+  BigQuery metadata provider, UPSERTs into ``bq_metadata_cache``.
+
+- ``POST /api/v2/metadata-cache/refresh?table=<id>`` — admin-gated, for
+  operator on-demand refresh of a single row (e.g. after editing the
+  registry entry's ``bucket`` / ``source_table``).
+
+- ``GET /api/v2/metadata-cache/status`` — auth required, NOT admin-only.
+  Returns per-row freshness so analyst tooling (CLI / Claude Code) can
+  decide whether to trust the cached numbers or wait for a refresh.
+
+Why this lives outside the catalog endpoint
+-------------------------------------------
+Earlier releases inlined a per-row BigQuery fetch into ``GET /api/v2/catalog``.
+On cold caches that became O(N) sequential BQ jobs API roundtrips inside
+one HTTP request — easily 90 s+ on partitioned tables — and reliably blew
+the CLI's 30 s ``httpx.ReadTimeout``. Moving the fetch off the hot path
+into a scheduled refresh job (default every 4 h, configurable via
+``SCHEDULER_BQ_METADATA_REFRESH_INTERVAL``) keeps the catalog response
+under tens of milliseconds even at first boot, at the cost of metadata
+being up to one refresh-interval stale. The freshness field surfaces
+that explicitly.
+"""
+
+from __future__ import annotations
+
+import asyncio
+import logging
+import os
+import time
+from datetime import datetime, timedelta, timezone
+from typing import Any, Optional
+
+import duckdb
+from fastapi import APIRouter, Depends, HTTPException, Query
+
+from app.auth.access import require_admin
+from app.auth.dependencies import _get_db, get_current_user
+from src.repositories.bq_metadata_cache import BqMetadataCacheRepository
+from src.repositories.table_registry import TableRegistryRepository
+
+logger = logging.getLogger(__name__)
+router = APIRouter()
+
+
+# ─── Freshness thresholds ──────────────────────────────────────────────────
+
+
+def _scheduler_interval_seconds() -> int:
+    """Return the scheduler's configured refresh interval, mirroring
+    ``services/scheduler/__main__.py``. We re-read the env var instead
+    of importing the scheduler module because the scheduler runs in a
+    sibling container and is not on the app's import path.
+    """
+    raw = os.environ.get("SCHEDULER_BQ_METADATA_REFRESH_INTERVAL")
+    if raw is None or raw == "":
+        return 4 * 60 * 60  # 4 h default
+    try:
+        value = int(raw)
+    except (TypeError, ValueError):
+        return 4 * 60 * 60
+    return value if value > 0 else 4 * 60 * 60
+
+
+def _fresh_threshold_seconds() -> int:
+    """A row is ``fresh`` when refreshed within this window.
+
+    Two refresh intervals: one refresh might fail (network blip, BQ
+    throttle); the analyst should keep seeing the last-known-good row
+    as ``fresh`` until two consecutive refreshes have passed without
+    success. Beyond that, the response surfaces ``stale`` so the
+    consumer knows the numbers might be outdated.
+    """
+    return 2 * _scheduler_interval_seconds()
+
+
+def compute_freshness(
+    cache_row: Optional[dict[str, Any]],
+    *,
+    now: Optional[datetime] = None,
+    fresh_threshold: Optional[int] = None,
+) -> str:
+    """Classify a cache row's freshness.
+
+    - ``never_fetched``: no row, or no successful refresh yet.
+    - ``fresh``: refreshed within the threshold.
+    - ``stale``: refreshed earlier than the threshold.
+    - ``error``: most recent attempt failed and there is no prior success
+      (success row is preserved across errors — analyst keeps using
+      last-known-good numbers).
+    """
+    if cache_row is None:
+        return "never_fetched"
+    refreshed_at = cache_row.get("refreshed_at")
+    error_at = cache_row.get("error_at")
+    if refreshed_at is None:
+        return "error" if error_at is not None else "never_fetched"
+    threshold = fresh_threshold if fresh_threshold is not None else _fresh_threshold_seconds()
+    cutoff = (now or datetime.now(timezone.utc)) - timedelta(seconds=threshold)
+    # DuckDB returns naive datetimes for TIMESTAMP columns; treat as UTC.
+    if refreshed_at.tzinfo is None:
+        refreshed_at = refreshed_at.replace(tzinfo=timezone.utc)
+    return "fresh" if refreshed_at >= cutoff else "stale"
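
The freshness rules above can be exercised in isolation. What follows is a minimal standalone sketch, not the module's code: `classify` is a hypothetical restatement of `compute_freshness` that takes `now` explicitly, and the threshold assumes the default 4 h interval (so an 8 h "fresh" window). The sample rows are invented for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: default 4 h scheduler interval, doubled for the "fresh" window.
FRESH_THRESHOLD_SECONDS = 2 * 4 * 60 * 60


def classify(
    cache_row: Optional[dict],
    now: datetime,
    threshold: int = FRESH_THRESHOLD_SECONDS,
) -> str:
    """Hypothetical restatement of the freshness rules, for illustration only."""
    if cache_row is None:
        return "never_fetched"
    refreshed_at = cache_row.get("refreshed_at")
    if refreshed_at is None:
        # No success on record: an error timestamp separates the two cases.
        return "error" if cache_row.get("error_at") is not None else "never_fetched"
    if refreshed_at.tzinfo is None:
        # Naive timestamps are interpreted as UTC.
        refreshed_at = refreshed_at.replace(tzinfo=timezone.utc)
    cutoff = now - timedelta(seconds=threshold)
    return "fresh" if refreshed_at >= cutoff else "stale"


now = datetime(2026, 5, 11, 12, 0, tzinfo=timezone.utc)
examples = {
    "no row": None,
    "failed before any success": {"refreshed_at": None, "error_at": now},
    "refreshed 1 h ago": {"refreshed_at": now - timedelta(hours=1)},
    "refreshed 9 h ago (naive)": {"refreshed_at": datetime(2026, 5, 11, 3, 0)},
    "recent success, later error": {
        "refreshed_at": now - timedelta(hours=2),
        "error_at": now - timedelta(minutes=5),
    },
}
results = {label: classify(row, now) for label, row in examples.items()}
for label, outcome in results.items():
    print(f"{label}: {outcome}")
```

Note the last case: once a successful refresh exists, a later `error_at` does not change the label, which is how last-known-good numbers stay visible as `fresh` or `stale` rather than flipping to `error`.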
+
+
+# ─── Single-row refresh primitive ──────────────────────────────────────────
+
+
+def refresh_one(conn: duckdb.DuckDBPyConnection, row: dict[str, Any]) -> dict[str, Any]:
+    """Fetch BQ metadata for one row and UPSERT the result.
+
+    Synchronous; safe to call from an anyio thread. Returns a small
+    outcome dict for the caller (counts, audit).
+
+    Failures are absorbed: the cache row's prior success is preserved
+    (``error_at`` + ``error_msg`` set, ``refreshed_at`` left alone).
+    """
+    from app.api._metadata_models import MetadataRequest
+    from connectors.bigquery import metadata as bq_metadata
+    from src.identifier_validation import validate_quoted_identifier
+
+    table_id = row["id"]
+    bucket = row.get("bucket") or ""
+    source_table = row.get("source_table") or table_id
+    repo = BqMetadataCacheRepository(conn)
+
+    if not (
+        validate_quoted_identifier(bucket, "bucket")
+        and validate_quoted_identifier(source_table, "source_table")
+    ):
+        repo.mark_error(table_id, "invalid bucket/source_table identifier")
+        return {"table_id": table_id, "status": "error", "error": "invalid identifier"}
+
+    req = MetadataRequest(
+        table_id=table_id, bucket=bucket, source_table=source_table,
+    )
+    try:
+        result = bq_metadata.fetch(req)
+    except Exception as e:
+        # bq_metadata.fetch is documented as never-raises, but defense in
+        # depth: catch any regression so one bad row doesn't kill the
+        # whole scheduler tick.
+        msg = f"{type(e).__name__}: {e}"
+        logger.warning("bq metadata refresh failed for %s: %s", table_id, msg)
+        repo.mark_error(table_id, msg)
+        return {"table_id": table_id, "status": "error", "error": msg}
+
+    if result is None:
+        repo.mark_error(table_id, "provider returned no data")
+        return {"table_id": table_id, "status": "no_data"}
+
+    repo.upsert_success(
+        table_id,
+        rows=result.rows,
+        size_bytes=result.size_bytes,
+        partition_by=result.partition_by,
+        clustered_by=result.clustered_by,
+    )
+    return {
+        "table_id": table_id,
+        "status": "ok",
+        "rows": result.rows,
+        "size_bytes": result.size_bytes,
+    }
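
The "failures are absorbed" contract is easier to see with a toy store. The repository class is not part of this diff, so the sketch below uses a hypothetical in-memory stand-in (`InMemoryCache`) whose two methods are named after the calls made in `refresh_one`; its field layout, and the choice to clear the error pair on success, are assumptions rather than the real `BqMetadataCacheRepository` behaviour.

```python
from datetime import datetime, timezone
from typing import Any


class InMemoryCache:
    """Hypothetical stand-in for the cache repository (not the real class)."""

    def __init__(self) -> None:
        self.rows: dict[str, dict[str, Any]] = {}

    def _row(self, table_id: str) -> dict[str, Any]:
        # Create an empty record on first touch.
        return self.rows.setdefault(
            table_id,
            {"table_id": table_id, "rows": None, "size_bytes": None,
             "refreshed_at": None, "error_at": None, "error_msg": None},
        )

    def upsert_success(self, table_id: str, *, rows: int, size_bytes: int) -> None:
        row = self._row(table_id)
        row.update(rows=rows, size_bytes=size_bytes,
                   refreshed_at=datetime.now(timezone.utc))
        # Assumption: a success clears the previous error marker.
        row.update(error_at=None, error_msg=None)

    def mark_error(self, table_id: str, msg: str) -> None:
        row = self._row(table_id)
        # Only the error pair changes; the last successful values stay put.
        row.update(error_at=datetime.now(timezone.utc), error_msg=msg)


cache = InMemoryCache()
cache.upsert_success("orders", rows=1_000, size_bytes=2_048)
cache.mark_error("orders", "TimeoutError: BQ throttled")
after_error = dict(cache.rows["orders"])

cache.mark_error("new_table", "invalid bucket/source_table identifier")
never_succeeded = dict(cache.rows["new_table"])

print(after_error["rows"], after_error["error_msg"])
print(never_succeeded["refreshed_at"], never_succeeded["error_msg"])
```

Under these assumptions, `orders` keeps its row count after the failed attempt, while `new_table` has an error but no `refreshed_at`, which is the combination that `compute_freshness` reports as `error`.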
+
+
+def _list_remote_bq_rows(conn: duckdb.DuckDBPyConnection) -> list[dict[str, Any]]:
+    rows = TableRegistryRepository(conn).list_all()
+    return [
+        r for r in rows
+        if r.get("query_mode") == "remote" and r.get("source_type") == "bigquery"
+    ]
+
+
+def _refresh_concurrency() -> int:
+    raw = os.environ.get("AGNES_BQ_METADATA_REFRESH_CONCURRENCY", "4")
+    try:
+        value = int(raw)
+    except (TypeError, ValueError):
+        return 4
+    return value if value > 0 else 4
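
Both `_scheduler_interval_seconds` and `_refresh_concurrency` follow the same pattern: read an environment variable, and fall back to a default when it is unset, empty, non-numeric, or not positive. A minimal sketch of that pattern as one function follows; `positive_int_env` is a hypothetical name and does not exist in the codebase.

```python
import os


def positive_int_env(name: str, default: int) -> int:
    """Hypothetical helper: read a positive integer from the environment,
    falling back to `default` when the variable is unset, empty,
    non-numeric, or not strictly positive."""
    raw = os.environ.get(name)
    if raw is None or raw == "":
        return default
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value > 0 else default


VAR = "AGNES_BQ_METADATA_REFRESH_CONCURRENCY"
DEFAULT_CONCURRENCY = 4

observed = {}
for raw in ("8", "", "0", "-2", "four"):
    os.environ[VAR] = raw
    observed[raw] = positive_int_env(VAR, DEFAULT_CONCURRENCY)

# Finally, the unset case.
del os.environ[VAR]
observed["<unset>"] = positive_int_env(VAR, DEFAULT_CONCURRENCY)
print(observed)
```

Silently falling back, rather than raising, matches the diff's apparent intent that a mistyped setting should not stop the refresh from running.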
+
+
+# ─── Endpoints ─────────────────────────────────────────────────────────────
+
+
+@router.post("/api/admin/run-bq-metadata-refresh")
+async def run_bq_metadata_refresh(
+    user: dict = Depends(require_admin),
+    conn: duckdb.DuckDBPyConnection = Depends(_get_db),
+):
+    """Refresh metadata for every remote BQ row in the registry.
+
+    Called by the scheduler at ``SCHEDULER_BQ_METADATA_REFRESH_INTERVAL``
+    (default 4 h). Idempotent — running twice in quick succession is
+    safe but wasteful; the scheduler enforces the interval.
+
+    Bounded concurrency (default 4, override via
+    ``AGNES_BQ_METADATA_REFRESH_CONCURRENCY``) so a deployment with
+    many remote tables doesn't fan out to dozens of parallel BQ jobs.
+    """
+    from src.db import get_system_db
+
+    rows = _list_remote_bq_rows(conn)
+    sem = asyncio.Semaphore(_refresh_concurrency())
+
+    async def _one(row: dict[str, Any]) -> dict[str, Any]:
+        async with sem:
+            # Each refresh_one call wants its own cursor; the singleton
+            # connection accessor returns a fresh cursor each call.
+            return await asyncio.to_thread(refresh_one, get_system_db(), row)
+
+    t0 = time.monotonic()
+    results = await asyncio.gather(
+        *(_one(r) for r in rows), return_exceptions=True,
+    )
+    duration_ms = int((time.monotonic() - t0) * 1000)
+
+    succeeded = sum(
+        1 for r in results if isinstance(r, dict) and r.get("status") == "ok"
+    )
+    no_data = sum(
+        1 for r in results if isinstance(r, dict) and r.get("status") == "no_data"
+    )
+    failed = sum(
+        1 for r in results
+        if isinstance(r, Exception)
+        or (isinstance(r, dict) and r.get("status") == "error")
+    )
+
+    logger.info(
+        "bq metadata refresh: total=%d ok=%d no_data=%d failed=%d duration_ms=%d",
+        len(rows), succeeded, no_data, failed, duration_ms,
+    )
+    return {
+        "total": len(rows),
+        "succeeded": succeeded,
+        "no_data": no_data,
+        "failed": failed,
+        "duration_ms": duration_ms,
+    }
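
The bounded fan-out used by this handler (a semaphore around `asyncio.to_thread`, collected with `gather(..., return_exceptions=True)`) can be tried without BigQuery or DuckDB. In the sketch below, `blocking_fetch` is a hypothetical stand-in for the synchronous per-table refresh, the failure on ids ending in `7` is simulated, and `peak` records the highest number of calls observed running at once.

```python
import asyncio
import threading
import time

LIMIT = 4  # mirrors the documented default concurrency

_lock = threading.Lock()
_active = 0
peak = 0


def blocking_fetch(table_id: str) -> dict:
    """Stand-in for a synchronous per-table refresh (hypothetical)."""
    global _active, peak
    with _lock:
        _active += 1
        peak = max(peak, _active)
    try:
        time.sleep(0.02)  # simulate a slow remote call
        if table_id.endswith("7"):
            raise RuntimeError("simulated provider failure")
        return {"table_id": table_id, "status": "ok"}
    finally:
        with _lock:
            _active -= 1


async def refresh_all(table_ids: list[str]) -> list:
    sem = asyncio.Semaphore(LIMIT)

    async def one(table_id: str):
        # The semaphore caps how many worker threads are busy at once.
        async with sem:
            return await asyncio.to_thread(blocking_fetch, table_id)

    # return_exceptions=True keeps one failure from cancelling the batch.
    return await asyncio.gather(*(one(t) for t in table_ids), return_exceptions=True)


results = asyncio.run(refresh_all([f"t{i}" for i in range(20)]))
succeeded = sum(1 for r in results if isinstance(r, dict) and r["status"] == "ok")
failed = sum(1 for r in results if isinstance(r, Exception))
print(f"ok={succeeded} failed={failed} peak_concurrency={peak}")
```

Because exceptions come back as values in the result list, the caller has to count them explicitly, which is what the `isinstance(r, Exception)` branch in the handler's `failed` tally does.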
+
+
+@router.post("/api/v2/metadata-cache/refresh")
+async def refresh_one_table(
+    table: str = Query(..., description="Registry table_id to refresh"),
+    user: dict = Depends(require_admin),
+    conn: duckdb.DuckDBPyConnection = Depends(_get_db),
+):
+    """Operator on-demand refresh of one row.
+
+    Useful right after editing the registry row (so the catalog reflects
+    new ``bucket`` / ``source_table`` immediately) or after an upstream
+    BQ schema change that the operator wants reflected before the next
+    scheduled tick.
+    """
+    from src.db import get_system_db
+
+    row = TableRegistryRepository(conn).get(table)
+    if not row:
+        raise HTTPException(status_code=404, detail=f"Unknown table_id: {table}")
+    if row.get("query_mode") != "remote" or row.get("source_type") != "bigquery":
+        raise HTTPException(
+            status_code=400,
+            detail="Manual metadata refresh is only meaningful for remote BigQuery tables",
+        )
+    return await asyncio.to_thread(refresh_one, get_system_db(), row)
+
+
+@router.get("/api/v2/metadata-cache/status")
+def metadata_cache_status(
+    user: dict = Depends(get_current_user),
+    conn: duckdb.DuckDBPyConnection = Depends(_get_db),
+):
+    """Per-table cache status. Non-admin — analyst tools rely on this to
+    decide whether to trust the catalog's ``rows`` / ``size_bytes`` or
+    treat the table as opaque until the next refresh.
+    """
+    cache_rows = BqMetadataCacheRepository(conn).list_all()
+    threshold = _fresh_threshold_seconds()
+    now = datetime.now(timezone.utc)
+    interval = _scheduler_interval_seconds()
+    tables = []
+    for r in cache_rows:
+        refreshed_at = r.get("refreshed_at")
+        error_at = r.get("error_at")
+        tables.append({
+            "table_id": r["table_id"],
+            "refreshed_at": refreshed_at.isoformat() if refreshed_at else None,
+            "rows": r.get("rows"),
+            "size_bytes": r.get("size_bytes"),
+            "partition_by": r.get("partition_by"),
+            "clustered_by": r.get("clustered_by") or [],
+            "error_at": error_at.isoformat() if error_at else None,
+            "error_msg": r.get("error_msg"),
+            "freshness": compute_freshness(r, now=now, fresh_threshold=threshold),
+        })
+    return {
+        "scheduler_interval_seconds": interval,
+        "fresh_threshold_seconds": threshold,
+        "server_time": now.isoformat(),
+        "tables": tables,
+    }
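
The status endpoint exists so that analyst tooling can decide whether to trust the cached numbers. One way a client might apply that is sketched below. The policy itself (accept `fresh`, optionally accept `stale`, ignore everything else) is an assumption, `usable_size_bytes` is a hypothetical name, and the payload values are invented; only the key names come from the handler's return value.

```python
from typing import Any, Optional


def usable_size_bytes(entry: dict[str, Any], *, allow_stale: bool = False) -> Optional[int]:
    """Hypothetical client policy: return size_bytes only when the
    freshness label says the number can be relied on."""
    accepted = {"fresh", "stale"} if allow_stale else {"fresh"}
    if entry.get("freshness") not in accepted:
        return None
    return entry.get("size_bytes")


# Sample payload shaped like the handler's return value (values invented).
status = {
    "scheduler_interval_seconds": 14400,
    "fresh_threshold_seconds": 28800,
    "server_time": "2026-05-11T12:00:00+00:00",
    "tables": [
        {"table_id": "orders", "size_bytes": 5_000_000, "freshness": "fresh"},
        {"table_id": "events", "size_bytes": 9_000_000_000, "freshness": "stale"},
        {"table_id": "new_view", "size_bytes": None, "freshness": "never_fetched"},
        {"table_id": "broken", "size_bytes": None, "freshness": "error"},
    ],
}

strict = {t["table_id"]: usable_size_bytes(t) for t in status["tables"]}
lenient = {
    t["table_id"]: usable_size_bytes(t, allow_stale=True) for t in status["tables"]
}
print(strict)
print(lenient)
```

A cost-estimating client would likely prefer the strict form, while a UI that only displays approximate sizes could reasonably use the lenient one and mark the value as possibly outdated.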
@@ -159,9 +159,16 @@ async def _warm_one(


 def _warm_metadata_sync(row: dict) -> None:
-    """Trigger metadata cache populate via the catalog's normal path."""
-    from app.api.v2_catalog import _size_hint_for_row
-    _size_hint_for_row(row)
+    """Refresh the persistent ``bq_metadata_cache`` row.
+
+    Pre-0.50 this called ``v2_catalog._size_hint_for_row`` to populate
+    an in-memory TTL cache. The in-memory cache is gone — metadata now
+    lives in DuckDB, owned by ``app/api/bq_metadata_refresh.refresh_one``
+    (the same primitive the scheduler-driven refresh uses).
+    """
+    from app.api.bq_metadata_refresh import refresh_one
+    from src.db import get_system_db
+    refresh_one(get_system_db(), row)


 def _warm_schema_sync(row: dict) -> None:
@@ -1,18 +1,32 @@
-"""GET /api/v2/catalog — list tables visible to caller (spec §3.1)."""
+"""GET /api/v2/catalog — list tables visible to caller (spec §3.1).
+
+History note
+------------
+0.47.0 enriched remote rows with BigQuery metadata (rows / size_bytes /
+partition_by / clustered_by) by fetching from BQ *inside the request*
+through a per-table TTL cache. On a cold cache that fanned out to O(N)
+sequential BQ jobs API roundtrips and reliably exceeded the CLI's 30 s
+``httpx.ReadTimeout`` against partitioned tables. This module now reads
+those fields exclusively from the persistent ``bq_metadata_cache`` table
+(populated by ``app/api/bq_metadata_refresh.py`` on a scheduler tick).
+The request path never calls BQ.
+"""

 from __future__ import annotations

+from datetime import datetime, timezone
 from pathlib import Path
-from fastapi import APIRouter, Depends
-import duckdb
+from typing import Any

-from app.auth.dependencies import get_current_user, _get_db
+import duckdb
+from fastapi import APIRouter, Depends
+
+from app.api.v2_cache import TTLCache
+from app.auth.dependencies import _get_db, get_current_user
 from app.utils import get_data_dir as _get_data_dir
 from src.rbac import can_access_table
+from src.repositories.bq_metadata_cache import BqMetadataCacheRepository
 from src.repositories.table_registry import TableRegistryRepository
-from app.api.v2_cache import TTLCache
-from app.api._metadata_models import MetadataRequest, TableMetadata
-from src.identifier_validation import validate_quoted_identifier

 router = APIRouter(prefix="/api/v2", tags=["v2"])

@@ -27,51 +41,6 @@ router = APIRouter(prefix="/api/v2", tags=["v2"])
 _table_rows_cache = TTLCache(maxsize=1, ttl_seconds=300)
 _TABLE_ROWS_KEY = "all"

-# Per-table cached TableMetadata. 15-min TTL — long enough to amortise
-# across an analyst session, short enough that a freshly-registered
-# remote table shows real numbers within a coffee break (the cache-bust
-# path in `invalidate_for_table` accelerates this for the common admin-
-# verifies-registration flow).
-_metadata_cache = TTLCache(maxsize=512, ttl_seconds=900)
-
-
-def _metadata_provider_for(source_type: str):
-    """Lazy-import dispatch for source-specific metadata providers.
-
-    Lazy because connector modules are heavy (BQ extension, google-cloud
-    client, etc.) and a Keboola-only deployment shouldn't pay the BQ
-    import cost. Returns ``None`` for unknown source types — the caller
-    treats that as "no metadata enrichment available" and falls through.
-    """
-    if source_type == "bigquery":
-        from connectors.bigquery import metadata as m
-        return m.fetch
-    if source_type == "keboola":
-        from connectors.keboola import metadata as m
-        return m.fetch
-    return None
-
-
-def _build_metadata_request(row: dict) -> MetadataRequest | None:
-    """Construct a validated MetadataRequest from a registry row.
-
-    Pre-validates the identifiers via `validate_quoted_identifier` before
-    constructing the request — providers can then interpolate
-    `req.bucket` / `req.source_table` into SQL/URL paths without
-    re-checking. Returns ``None`` when validation fails; provider is not
-    dispatched for that row.
-    """
-    bucket = row.get("bucket") or ""
-    source_table = row.get("source_table") or row.get("id") or ""
-    if not bucket or not source_table:
-        return None
-    if not (validate_quoted_identifier(bucket, "bucket")
-            and validate_quoted_identifier(source_table, "source_table")):
-        return None
-    return MetadataRequest(
-        table_id=row["id"], bucket=bucket, source_table=source_table,
-    )
-

 def _flavor_for(source_type: str) -> str:
     return "bigquery" if source_type == "bigquery" else "duckdb"
@@ -112,56 +81,11 @@ def _bucket_size(byte_count: int) -> str:
     return "very_large"


-def _size_hint_for_row(row: dict) -> dict:
-    """Resolve the per-row metadata bundle the catalog response surfaces.
-
-    Renamed from `_materialized_size_hint` (which always also handled
-    `local` rows; the old name was misleading). Returns a dict with up
-    to four keys: `rough_size_hint`, `rows`, `size_bytes`, `partition_by`,
-    `clustered_by`. Missing keys are reported as `null` in the response.
-
-    Branches:
-    - `local` / `materialized` → existing on-disk parquet stat (cheap).
-    - `remote` → dispatch to the per-source-type provider; cache the
-      TableMetadata for 15 min.
-    """
-    table_id = row["id"]
-    source_type = row.get("source_type") or ""
-    query_mode = row.get("query_mode") or "local"
-
-    if query_mode in ("local", "materialized"):
-        return {"rough_size_hint": _materialized_parquet_size_bucket(
-            table_id, source_type, query_mode,
-        )}
-
-    if query_mode != "remote":
-        return {"rough_size_hint": None}
-
-    # Cache lookup (per-row TableMetadata).
-    cached = _metadata_cache.get(table_id)
-    if cached is None:
-        cached = _resolve_remote_metadata(row)
-        if cached is not None:
-            _metadata_cache.set(table_id, cached)
-
-    if cached is None:
-        return {"rough_size_hint": None}
-
-    return {
-        "rough_size_hint": _bucket_size(cached.size_bytes) if cached.size_bytes is not None else None,
-        "rows": cached.rows,
-        "size_bytes": cached.size_bytes,
-        "partition_by": cached.partition_by,
-        "clustered_by": cached.clustered_by,
-    }
-
-
 def _materialized_parquet_size_bucket(
     table_id: str, source_type: str, query_mode: str,
 ) -> str | None:
     """Size hint for rows whose data is on the server filesystem
-    (the old `_materialized_size_hint` body). Renamed for clarity now
-    that the new dispatcher is the entry point.
+    (``local`` or ``materialized``). Cheap ``Path.stat()``; never blocks.

     Layout matches the v2 extract.duckdb contract:
         ${DATA_DIR}/extracts/<source_type>/data/<table_id>.parquet
@@ -182,21 +106,64 @@ def _materialized_parquet_size_bucket(
     return None


-def _resolve_remote_metadata(row: dict) -> "TableMetadata | None":
-    """Provider dispatch for a remote row. Returns None on any failure."""
+def _hint_for_row(
+    row: dict[str, Any],
+    bq_cache_index: dict[str, dict[str, Any]],
+) -> dict[str, Any]:
+    """Resolve the per-row metadata bundle the catalog response surfaces.
+
+    Branches:
+    - ``local`` / ``materialized`` → on-disk parquet ``stat()`` (cheap).
+    - ``remote`` (BigQuery) → pre-computed row from ``bq_metadata_cache``,
+      populated by the scheduler-driven refresh. Never touches BQ here.
+
+    Always returns ``metadata_freshness`` (``fresh`` / ``stale`` /
+    ``never_fetched`` / ``error`` / ``not_applicable``) so AI consumers can
+    decide whether to trust ``rows`` / ``size_bytes`` or treat them as
+    advisory.
+    """
+    table_id = row["id"]
     source_type = row.get("source_type") or ""
-    provider = _metadata_provider_for(source_type)
-    if provider is None:
-        return None
-    req = _build_metadata_request(row)
-    if req is None:
-        return None
-    try:
-        return provider(req)
-    except Exception:
-        # Defense in depth — providers are documented as never-raises,
-        # but a regression would otherwise 500 the whole catalog.
-        return None
+    query_mode = row.get("query_mode") or "local"
+
+    if query_mode in ("local", "materialized"):
+        return {
+            "rough_size_hint": _materialized_parquet_size_bucket(
+                table_id, source_type, query_mode,
+            ),
+            "metadata_freshness": "not_applicable",
+        }
+
+    if query_mode != "remote":
+        return {
+            "rough_size_hint": None,
+            "metadata_freshness": "not_applicable",
+        }
+
+    # Remote: read from the persistent cache; never call BQ here.
+    from app.api.bq_metadata_refresh import compute_freshness
+    cache_row = bq_cache_index.get(table_id)
+    freshness = compute_freshness(cache_row)
+
+    if cache_row is None:
+        return {
+            "rough_size_hint": None,
+            "rows": None,
+            "size_bytes": None,
+            "partition_by": None,
+            "clustered_by": [],
+            "metadata_freshness": freshness,
+        }
+
+    size_bytes = cache_row.get("size_bytes")
+    return {
+        "rough_size_hint": _bucket_size(size_bytes) if size_bytes is not None else None,
+        "rows": cache_row.get("rows"),
+        "size_bytes": size_bytes,
+        "partition_by": cache_row.get("partition_by"),
+        "clustered_by": cache_row.get("clustered_by") or [],
+        "metadata_freshness": freshness,
+    }
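
The remote branch above depends on a lookup table built once per request, so each catalog row costs a single dictionary access instead of a database or network call. The sketch below isolates that idea. `index_by_table_id` and `remote_fields` are hypothetical names, the sample row is invented, and `rough_size_hint` and `metadata_freshness` are left out to keep it short.

```python
from typing import Any


def index_by_table_id(cache_rows: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Build the lookup once so each catalog row costs one dict access."""
    return {r["table_id"]: r for r in cache_rows}


def remote_fields(table_id: str, index: dict[str, dict[str, Any]]) -> dict[str, Any]:
    """Hypothetical helper: shape the enrichment fields for one remote row."""
    cached = index.get(table_id)
    if cached is None:
        # Unwarmed row: same keys, null values, empty clustering list.
        return {"rows": None, "size_bytes": None,
                "partition_by": None, "clustered_by": []}
    return {
        "rows": cached.get("rows"),
        "size_bytes": cached.get("size_bytes"),
        "partition_by": cached.get("partition_by"),
        # A stored NULL is normalised to an empty list for consumers.
        "clustered_by": cached.get("clustered_by") or [],
    }


cache_rows = [
    {"table_id": "orders", "rows": 1200, "size_bytes": 4096,
     "partition_by": "created_at", "clustered_by": None},
]
index = index_by_table_id(cache_rows)
warm = remote_fields("orders", index)
cold = remote_fields("events", index)
print(warm)
print(cold)
```

Returning the same keys for warm and cold rows means a consumer never has to distinguish a missing key from a null value.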


 def invalidate_for_table(table_id: str) -> None:
@@ -205,14 +172,14 @@ def invalidate_for_table(table_id: str) -> None:
     by the catalog module so admin.py doesn't need to know which caches
     exist.

-    Imports v2_schema and v2_sample lazily — keeps catalog tests from
-    pulling in BQ-extension imports they don't need.
+    The persistent ``bq_metadata_cache`` row is NOT invalidated here —
+    the scheduler-driven refresh owns that lifecycle. Admins who need
+    an immediate refresh after a registry edit should hit
+    ``POST /api/v2/metadata-cache/refresh?table=<id>``.
     """
-    import asyncio
-    from app.api import v2_schema, v2_sample
+    from app.api import v2_sample, v2_schema

     _table_rows_cache.clear()
-    _metadata_cache.invalidate(table_id)
     v2_schema._schema_cache.invalidate(table_id)
     # Sample cache key is `f"{table_id}|{n}"`; clearing the whole sample
     # cache is heavier than precise invalidation, but registry-change
@@ -220,36 +187,6 @@ def invalidate_for_table(table_id: str) -> None:
     # adding a prefix-invalidation primitive to TTLCache.
     v2_sample._sample_cache.clear()

-    # Schedule a single-row re-warm so admins editing a registry row
-    # see fresh data within a couple of seconds rather than waiting for
-    # the next analyst to trigger a miss. Fire-and-forget; failures
-    # log + skip inside the coroutine.
-    try:
-        loop = asyncio.get_running_loop()
-    except RuntimeError:
-        loop = None
-    if loop is not None:
-        # Running inside an async context (production FastAPI path).
-        asyncio.create_task(_rewarm_one_row(table_id))
-    # No running event loop (e.g. called from a sync test or a sync
-    # handler thread). Skip re-warm — the next live request will
-    # populate via miss.
-
-
-async def _rewarm_one_row(table_id: str) -> None:
-    """Background single-row re-warm. Imports cache_warmup lazily to
-    avoid a circular import at module load (cache_warmup.py is created
-    in Task 10; until then, this function logs a warning and returns)."""
-    try:
-        from app.api.cache_warmup import warm_one_table
-        await warm_one_table(table_id)
-    except Exception:
-        import logging
-        logging.getLogger(__name__).warning(
-            "single-row re-warm failed for %s — next live request will populate",
-            table_id,
-        )
-

 def build_catalog(conn: duckdb.DuckDBPyConnection, user: dict) -> dict:
     rows = _table_rows_cache.get(_TABLE_ROWS_KEY)
@@ -258,6 +195,12 @@ def build_catalog(conn: duckdb.DuckDBPyConnection, user: dict) -> dict:
         rows = repo.list_all()
         _table_rows_cache.set(_TABLE_ROWS_KEY, rows)

+    # One DB read for all remote-row metadata. Indexed by table_id so the
+    # per-row loop below stays O(N).
+    bq_cache_index: dict[str, dict[str, Any]] = {
+        r["table_id"]: r for r in BqMetadataCacheRepository(conn).list_all()
+    }
+
     # RBAC is enforced fresh per request. Revoking a user's access to a
     # table takes effect on their next call to this endpoint, not after the
     # cache TTL expires.
@@ -265,7 +208,7 @@ def build_catalog(conn: duckdb.DuckDBPyConnection, user: dict) -> dict:
     for r in rows:
         if not can_access_table(user, r["id"], conn):
             continue
-        hint = _size_hint_for_row(r)
+        hint = _hint_for_row(r, bq_cache_index)
         visible.append({
             "id": r["id"],
             "name": r.get("name") or r["id"],
@@ -279,7 +222,8 @@ def build_catalog(conn: duckdb.DuckDBPyConnection, user: dict) -> dict:
             "rows": hint.get("rows"),
             "size_bytes": hint.get("size_bytes"),
             "partition_by": hint.get("partition_by"),
-            "clustered_by": hint.get("clustered_by"),
+            "clustered_by": hint.get("clustered_by") or [],
+            "metadata_freshness": hint.get("metadata_freshness"),
         })

     return {
@@ -294,12 +238,6 @@ def catalog(
|
|||
conn: duckdb.DuckDBPyConnection = Depends(_get_db),
|
||||
):
|
||||
# Plain ``def`` so FastAPI auto-offloads to the anyio thread pool —
|
||||
- # build_catalog now calls `_size_hint_for_row` for every visible row,
- # which does sync `Path.stat()` / `Path.exists()` on the data volume
- # (local/materialized) or provider dispatch (remote). On local FS
- # that's microseconds, but on a network-mounted DATA_DIR (NFS / CIFS /
- # GCS-FUSE) those calls can block. Plain ``def`` means each request
- # runs on its own thread; the event loop stays free for non-catalog
- # traffic. Mirrors the Tier 1 conversion of /api/query, /api/v2/scan,
- # /api/v2/sample, /api/v2/schema — Devin Review on PR #188.
+ # the request path is pure local I/O (DuckDB reads + filesystem
+ # stat()) and uses a sync DuckDB cursor.
|
||||
return build_catalog(conn, user)
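The "one DB read, indexed by `table_id`" pattern in `build_catalog` can be illustrated with a small standalone sketch. The body of `_hint_for_row` is not shown in this diff, so the helper names below (`index_by_table_id`, `hint_for_remote_row`, `freshness_of`) are hypothetical and only cover the remote-row lookup. They assume a missing cache row yields `None` fields plus whatever freshness label the supplied callable returns for `None`.

```python
from typing import Any, Callable, Optional

CacheRow = dict[str, Any]

# Enrichment fields the catalog response copies from the cache row.
HINT_FIELDS = ("rows", "size_bytes", "partition_by", "clustered_by")


def index_by_table_id(cache_rows: list[CacheRow]) -> dict[str, CacheRow]:
    """Build the lookup once so each per-row access is a dict hit."""
    return {row["table_id"]: row for row in cache_rows}


def hint_for_remote_row(
    registry_row: dict[str, Any],
    cache_index: dict[str, CacheRow],
    freshness_of: Callable[[Optional[CacheRow]], str],
) -> dict[str, Any]:
    """Hypothetical stand-in for the remote branch of `_hint_for_row`.

    Never touches BigQuery: an unwarmed table simply has no cache row.
    """
    cached = cache_index.get(registry_row["id"])
    source = cached or {}
    hint = {field: source.get(field) for field in HINT_FIELDS}
    hint["metadata_freshness"] = freshness_of(cached)
    return hint
```

Because the index is built before the loop, the per-request cost is one `SELECT` plus N dictionary lookups, regardless of how many remote tables are registered.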
|
||||
|
|
|
|||
|
|
@@ -117,6 +117,7 @@ from app.api.welcome import router as welcome_router
|
|||
from app.api.claude_md import router as claude_md_router
|
||||
from app.api.news import router as news_router
|
||||
from app.api.cache_warmup import router as cache_warmup_router
|
||||
from app.api.bq_metadata_refresh import router as bq_metadata_refresh_router
|
||||
from app.marketplace_server.router import router as marketplace_server_router
|
||||
from app.marketplace_server.git_router import make_git_wsgi_app
|
||||
from app.web.router import router as web_router
|
||||
|
|
@@ -598,6 +599,7 @@ def create_app() -> FastAPI:
|
|||
app.include_router(claude_md_router)
|
||||
app.include_router(news_router)
|
||||
app.include_router(cache_warmup_router)
|
||||
app.include_router(bq_metadata_refresh_router)
|
||||
app.include_router(marketplace_server_router)
|
||||
|
||||
# Git smart-HTTP endpoint for Claude Code: /marketplace.git/*
|
||||
|
|
|
|||
|
|
@@ -1,6 +1,6 @@
|
|||
[project]
|
||||
name = "agnes-the-ai-analyst"
|
||||
- version = "0.49.1"
+ version = "0.50.0"
|
||||
description = "Agnes — AI Data Analyst platform for AI analytical systems"
|
||||
requires-python = ">=3.11,<3.14"
|
||||
license = "MIT"
|
||||
|
|
|
|||
|
|
@@ -90,6 +90,14 @@ _DEFAULTS = {
|
|||
"SCHEDULER_VERIFICATION_DETECTOR_INTERVAL": 15 * 60,
|
||||
"SCHEDULER_USAGE_PROCESSOR_INTERVAL": 10 * 60,
|
||||
"SCHEDULER_CORPORATE_MEMORY_INTERVAL": 17 * 60,
|
||||
# BigQuery metadata refresh: walks remote registry rows and updates the
|
||||
# persistent ``bq_metadata_cache``. Default 4 h — long enough that the
|
||||
# cumulative BQ jobs API cost stays negligible on a typical 10–50-table
|
||||
# registry, short enough that operator-edited tables show real numbers
|
||||
# within an analyst's working day. Hot reads of ``/api/v2/catalog`` go
|
||||
# to DuckDB, never to BQ, so this can be tuned freely without touching
|
||||
# request-path latency.
|
||||
"SCHEDULER_BQ_METADATA_REFRESH_INTERVAL": 4 * 60 * 60,
|
||||
}
|
||||
|
||||
|
||||
|
|
@@ -149,8 +157,9 @@ def build_jobs() -> list[tuple[str, str, str, str, int]]:
|
|||
verify = _read_positive_int("SCHEDULER_VERIFICATION_DETECTOR_INTERVAL")
|
||||
usage = _read_positive_int("SCHEDULER_USAGE_PROCESSOR_INTERVAL")
|
||||
corpmem = _read_positive_int("SCHEDULER_CORPORATE_MEMORY_INTERVAL")
|
||||
+ bqmeta = _read_positive_int("SCHEDULER_BQ_METADATA_REFRESH_INTERVAL")
|
||||
tick = _read_positive_int("SCHEDULER_TICK_SECONDS")
|
||||
- smallest = min(refresh, health, scripts, sess, verify, usage, corpmem)
+ smallest = min(refresh, health, scripts, sess, verify, usage, corpmem, bqmeta)
|
||||
if tick > smallest:
|
||||
raise ValueError(
|
||||
f"SCHEDULER_TICK_SECONDS={tick} must be <= the smallest job "
|
||||
|
|
@@ -193,6 +202,14 @@ def build_jobs() -> list[tuple[str, str, str, str, int]]:
|
|||
# to review_error so admin can retry. Cheap (one indexed
|
||||
# SELECT + N small UPDATEs); short timeout sufficient.
|
||||
("store-reap-stuck-reviews", "every 15m", "/api/admin/run-reap-stuck-reviews", "POST", 60),
|
||||
# BigQuery metadata refresh — keeps ``bq_metadata_cache`` warm so
|
||||
# ``GET /api/v2/catalog`` never has to call BQ at request time.
|
||||
# 30-min timeout is generous; on a 10-table dev registry the
|
||||
# observed full refresh ran in ~7 min when two view-backed rows
|
||||
# took 7 min each. Bounded concurrency
|
||||
# (``AGNES_BQ_METADATA_REFRESH_CONCURRENCY``, default 4) caps the
|
||||
# tail.
|
||||
("bq-metadata-refresh", _seconds_to_schedule(bqmeta), "/api/admin/run-bq-metadata-refresh", "POST", 1800),
|
||||
]
|
||||
|
||||
_running = True
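The scheduler hunks above rely on two helpers whose bodies are outside this diff: `_seconds_to_schedule` and the tick-vs-smallest-interval check. The sketch below is a guess at their behaviour, inferred only from the schedule strings that appear in the tests (`"every 4h"`, `"every 2h"`, `"every 5m"`). The function names, the `"every Ns"` fallback, and the exact error wording are assumptions, not the project's code.

```python
def seconds_to_schedule_sketch(seconds: int) -> str:
    """Render an interval in seconds as an 'every <n><unit>' schedule string.

    Picks the largest unit that divides the interval evenly. The
    seconds fallback is an assumption; the diff only shows h and m.
    """
    if seconds <= 0:
        raise ValueError(f"interval must be positive, got {seconds}")
    if seconds % 3600 == 0:
        return f"every {seconds // 3600}h"
    if seconds % 60 == 0:
        return f"every {seconds // 60}m"
    return f"every {seconds}s"


def validate_tick_sketch(tick: int, intervals: dict[str, int]) -> None:
    """Reject a scheduler tick that is coarser than the shortest job interval.

    A tick longer than the smallest interval would make that job fire
    less often than configured.
    """
    smallest = min(intervals.values())
    if tick > smallest:
        raise ValueError(
            f"SCHEDULER_TICK_SECONDS={tick} must be <= the smallest job "
            f"interval ({smallest}s)"
        )
```

With the 4 h default, `seconds_to_schedule_sketch(4 * 60 * 60)` gives `"every 4h"`. Adding `bqmeta` to the `min(...)` call matters only if an operator sets the refresh interval below every other job's interval.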
|
||||
|
|
|
|||
58
src/db.py
|
|
@@ -40,7 +40,7 @@ def _maybe_instrument(con, db_tag: str):
|
|||
|
||||
_SAFE_IDENTIFIER = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]{0,63}$")
|
||||
|
||||
- SCHEMA_VERSION = 39
+ SCHEMA_VERSION = 40
|
||||
|
||||
_SYSTEM_SCHEMA = """
|
||||
CREATE TABLE IF NOT EXISTS schema_version (
|
||||
|
|
@@ -640,6 +640,38 @@ CREATE INDEX IF NOT EXISTS idx_store_submissions_entity ON store_submissions(ent
|
|||
-- (reproduced with N=2 against 3 rows during /admin/store/submissions
|
||||
-- paging). Submissions table is admin-only and bounded by upload
|
||||
-- volume, so the index buys little; dropping it sidesteps the bug.
|
||||
|
||||
-- v40: persistent metadata cache for remote sources (BigQuery initially).
|
||||
-- Replaces the per-request, in-memory `_metadata_cache` in v2_catalog.py
|
||||
-- that turned every cold-cache /api/v2/catalog into a sequence of N×3 BQ
|
||||
-- jobs API calls (one TABLE_STORAGE + COLUMNS pair per remote row) — long
|
||||
-- enough on view-backed or partitioned tables (>>30 s) to blow the CLI's
|
||||
-- httpx 30 s read timeout. Now refresh is driven exclusively by the
|
||||
-- scheduler (default every 4 h, `SCHEDULER_BQ_METADATA_REFRESH_INTERVAL`),
|
||||
-- and the catalog endpoint just reads this table — no BQ at request time.
|
||||
--
|
||||
-- Columns:
|
||||
-- table_id — registry.id; PK and join key with table_registry.
|
||||
-- rows / size_bytes / partition_by / clustered_by — last successful
|
||||
-- provider result. NULL when the table has never
|
||||
-- been fetched, or fetch failed before any success.
|
||||
-- clustered_by stored as JSON array of column names.
|
||||
-- refreshed_at — wall-clock of the last successful fetch. Used by
|
||||
-- the catalog response to compute metadata_freshness
|
||||
-- (`fresh` if < 2× scheduler interval old, `stale`
|
||||
-- otherwise, `never_fetched` if NULL).
|
||||
-- error_at / error_msg — last failure timestamp + redacted message.
|
||||
-- NULL after the next successful refresh.
|
||||
CREATE TABLE IF NOT EXISTS bq_metadata_cache (
|
||||
table_id VARCHAR PRIMARY KEY,
|
||||
rows BIGINT,
|
||||
size_bytes BIGINT,
|
||||
partition_by VARCHAR,
|
||||
clustered_by JSON,
|
||||
refreshed_at TIMESTAMP,
|
||||
error_at TIMESTAMP,
|
||||
error_msg VARCHAR
|
||||
);
|
||||
"""
|
||||
|
||||
|
||||
|
|
@@ -2489,6 +2521,27 @@ _V38_TO_V39_MIGRATIONS = [
|
|||
]
|
||||
|
||||
|
||||
# v40: bq_metadata_cache table. Existing DBs get an empty table; the next
|
||||
# scheduler tick (or app startup warmup) populates it. The catalog endpoint
|
||||
# treats absence-of-row as `metadata_freshness: never_fetched` and returns
|
||||
# NULL for the optional fields rather than failing — analyst tooling already
|
||||
# tolerates NULL rows / size_bytes from the pre-0.47 contract.
|
||||
_V39_TO_V40_MIGRATIONS = [
|
||||
"""
|
||||
CREATE TABLE IF NOT EXISTS bq_metadata_cache (
|
||||
table_id VARCHAR PRIMARY KEY,
|
||||
rows BIGINT,
|
||||
size_bytes BIGINT,
|
||||
partition_by VARCHAR,
|
||||
clustered_by JSON,
|
||||
refreshed_at TIMESTAMP,
|
||||
error_at TIMESTAMP,
|
||||
error_msg VARCHAR
|
||||
)
|
||||
""",
|
||||
]
|
||||
|
||||
|
||||
_V33_TO_V34_MIGRATIONS = [
|
||||
# DuckDB blocks DROP COLUMN while indexes reference the table
|
||||
# ("Dependency Error: Cannot alter entry … because there are entries
|
||||
|
|
@@ -2882,6 +2935,9 @@ def _ensure_schema(conn: duckdb.DuckDBPyConnection) -> None:
|
|||
if current < 39:
|
||||
for sql in _V38_TO_V39_MIGRATIONS:
|
||||
conn.execute(sql)
|
||||
if current < 40:
|
||||
for sql in _V39_TO_V40_MIGRATIONS:
|
||||
conn.execute(sql)
|
||||
conn.execute(
|
||||
"UPDATE schema_version SET version = ?, applied_at = current_timestamp",
|
||||
[SCHEMA_VERSION],
|
||||
|
|
|
|||
120
src/repositories/bq_metadata_cache.py
Normal file
|
|
@@ -0,0 +1,120 @@
|
|||
"""Repository for the persistent BigQuery metadata cache.
|
||||
|
||||
Backs the v40 ``bq_metadata_cache`` table. Reads are called from the
|
||||
hot path (``/api/v2/catalog``); writes only from the scheduler-driven
|
||||
refresh job in ``app/api/bq_metadata_refresh.py`` and from operator-
|
||||
triggered single-row refreshes via ``/api/v2/metadata-cache/refresh``.
|
||||
|
||||
clustered_by is stored as a JSON array of column-name strings and
|
||||
returned to callers as a list (decoded here, never raw JSON).
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
from datetime import datetime, timezone
|
||||
from typing import Any, Optional
|
||||
|
||||
import duckdb
|
||||
|
||||
|
||||
def _decode_clustered_by(stored: Any) -> Optional[list[str]]:
|
||||
if stored is None:
|
||||
return None
|
||||
if isinstance(stored, list):
|
||||
return [str(x) for x in stored]
|
||||
if isinstance(stored, str):
|
||||
try:
|
||||
parsed = json.loads(stored)
|
||||
except json.JSONDecodeError:
|
||||
return None
|
||||
return [str(x) for x in parsed] if isinstance(parsed, list) else None
|
||||
return None
|
||||
|
||||
|
||||
def _row_to_dict(conn: duckdb.DuckDBPyConnection, row: tuple) -> dict[str, Any]:
|
||||
columns = [desc[0] for desc in conn.description]
|
||||
out: dict[str, Any] = dict(zip(columns, row))
|
||||
out["clustered_by"] = _decode_clustered_by(out.get("clustered_by"))
|
||||
return out
|
||||
|
||||
|
||||
class BqMetadataCacheRepository:
|
||||
def __init__(self, conn: duckdb.DuckDBPyConnection):
|
||||
self.conn = conn
|
||||
|
||||
def get(self, table_id: str) -> Optional[dict[str, Any]]:
|
||||
result = self.conn.execute(
|
||||
"SELECT * FROM bq_metadata_cache WHERE table_id = ?",
|
||||
[table_id],
|
||||
).fetchone()
|
||||
if not result:
|
||||
return None
|
||||
return _row_to_dict(self.conn, result)
|
||||
|
||||
def list_all(self) -> list[dict[str, Any]]:
|
||||
results = self.conn.execute(
|
||||
"SELECT * FROM bq_metadata_cache ORDER BY table_id"
|
||||
).fetchall()
|
||||
if not results:
|
||||
return []
|
||||
columns = [desc[0] for desc in self.conn.description]
|
||||
out: list[dict[str, Any]] = []
|
||||
for r in results:
|
||||
row = dict(zip(columns, r))
|
||||
row["clustered_by"] = _decode_clustered_by(row.get("clustered_by"))
|
||||
out.append(row)
|
||||
return out
|
||||
|
||||
def upsert_success(
|
||||
self,
|
||||
table_id: str,
|
||||
*,
|
||||
rows: Optional[int],
|
||||
size_bytes: Optional[int],
|
||||
partition_by: Optional[str],
|
||||
clustered_by: Optional[list[str]],
|
||||
) -> None:
|
||||
"""Record a successful refresh. Clears any prior error_at/error_msg."""
|
||||
now = datetime.now(timezone.utc)
|
||||
clustered_json = (
|
||||
json.dumps(list(clustered_by)) if clustered_by is not None else None
|
||||
)
|
||||
self.conn.execute(
|
||||
"""INSERT INTO bq_metadata_cache
|
||||
(table_id, rows, size_bytes, partition_by, clustered_by,
|
||||
refreshed_at, error_at, error_msg)
|
||||
VALUES (?, ?, ?, ?, ?, ?, NULL, NULL)
|
||||
ON CONFLICT (table_id) DO UPDATE SET
|
||||
rows = excluded.rows,
|
||||
size_bytes = excluded.size_bytes,
|
||||
partition_by = excluded.partition_by,
|
||||
clustered_by = excluded.clustered_by,
|
||||
refreshed_at = excluded.refreshed_at,
|
||||
error_at = NULL,
|
||||
error_msg = NULL""",
|
||||
[table_id, rows, size_bytes, partition_by, clustered_json, now],
|
||||
)
|
||||
|
||||
def mark_error(self, table_id: str, error_msg: str) -> None:
|
||||
"""Record a failed refresh. Preserves the prior success row (if any)
|
||||
so analyst Claude keeps using last-known-good rows + size_bytes while
|
||||
the next scheduled retry attempts to recover."""
|
||||
now = datetime.now(timezone.utc)
|
||||
truncated = (error_msg or "")[:512] # bound storage
|
||||
self.conn.execute(
|
||||
"""INSERT INTO bq_metadata_cache
|
||||
(table_id, rows, size_bytes, partition_by, clustered_by,
|
||||
refreshed_at, error_at, error_msg)
|
||||
VALUES (?, NULL, NULL, NULL, NULL, NULL, ?, ?)
|
||||
ON CONFLICT (table_id) DO UPDATE SET
|
||||
error_at = excluded.error_at,
|
||||
error_msg = excluded.error_msg""",
|
||||
[table_id, now, truncated],
|
||||
)
|
||||
|
||||
def delete(self, table_id: str) -> None:
|
||||
"""Drop a row — used by admin endpoints when a table is unregistered."""
|
||||
self.conn.execute(
|
||||
"DELETE FROM bq_metadata_cache WHERE table_id = ?", [table_id]
|
||||
)
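The asymmetry between `upsert_success` and `mark_error` is the core of the "last-known-good" behaviour: the error upsert only touches `error_at` and `error_msg` on conflict, so earlier numbers survive. The sketch below reproduces that with the standard-library `sqlite3` module instead of DuckDB, purely so it runs without extra dependencies. SQLite's `ON CONFLICT ... DO UPDATE` with `excluded.` is assumed to be close enough for illustration. The table is trimmed to the columns needed, and timestamps are stored as ISO strings.

```python
import sqlite3
from datetime import datetime, timezone


def make_conn() -> sqlite3.Connection:
    """In-memory stand-in for the cache table (subset of the v40 columns)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE bq_metadata_cache ("
        ' table_id TEXT PRIMARY KEY, "rows" INTEGER, size_bytes INTEGER,'
        " refreshed_at TEXT, error_at TEXT, error_msg TEXT)"
    )
    return conn


def upsert_success(conn, table_id, rows, size_bytes):
    """Overwrite the metadata and clear any previous error."""
    now = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT INTO bq_metadata_cache"
        ' (table_id, "rows", size_bytes, refreshed_at, error_at, error_msg)'
        " VALUES (?, ?, ?, ?, NULL, NULL)"
        " ON CONFLICT (table_id) DO UPDATE SET"
        ' "rows" = excluded."rows", size_bytes = excluded.size_bytes,'
        " refreshed_at = excluded.refreshed_at,"
        " error_at = NULL, error_msg = NULL",
        [table_id, rows, size_bytes, now],
    )


def mark_error(conn, table_id, error_msg):
    """Record a failure; on conflict only the error columns change."""
    now = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT INTO bq_metadata_cache (table_id, error_at, error_msg)"
        " VALUES (?, ?, ?)"
        " ON CONFLICT (table_id) DO UPDATE SET"
        " error_at = excluded.error_at, error_msg = excluded.error_msg",
        [table_id, now, (error_msg or "")[:512]],  # bound storage, as in the repo
    )


def get(conn, table_id):
    """Return (rows, size_bytes, error_msg) or None when the row is absent."""
    return conn.execute(
        'SELECT "rows", size_bytes, error_msg FROM bq_metadata_cache'
        " WHERE table_id = ?",
        [table_id],
    ).fetchone()
```

A success, then a failure, then another success walks the row through the three states that the repository tests further down assert on: populated, populated with an error attached, and populated with the error cleared.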
|
||||
|
|
@@ -84,7 +84,6 @@ def _reset_module_caches():
|
|||
try:
|
||||
from app.api import v2_catalog as _vc
|
||||
_vc._table_rows_cache.clear()
|
||||
- _vc._metadata_cache.clear()
|
||||
except (ImportError, AttributeError):
|
||||
pass
|
||||
try:
|
||||
|
|
@@ -96,7 +95,6 @@ def _reset_module_caches():
|
|||
try:
|
||||
from app.api import v2_catalog as _vc
|
||||
_vc._table_rows_cache.clear()
|
||||
- _vc._metadata_cache.clear()
|
||||
except (ImportError, AttributeError):
|
||||
pass
|
||||
try:
|
||||
|
|
|
|||
160
tests/test_bq_metadata_cache_repo.py
Normal file
|
|
@@ -0,0 +1,160 @@
|
|||
"""Repository + freshness tests for the persistent BQ metadata cache."""
|
||||
|
||||
from datetime import datetime, timedelta, timezone
|
||||
|
||||
import pytest
|
||||
|
||||
from src.repositories.bq_metadata_cache import BqMetadataCacheRepository
|
||||
|
||||
|
||||
def test_upsert_success_inserts_then_updates(seeded_app):
|
||||
from src.db import get_system_db
|
||||
conn = get_system_db()
|
||||
try:
|
||||
repo = BqMetadataCacheRepository(conn)
|
||||
repo.upsert_success(
|
||||
"orders", rows=10, size_bytes=2048,
|
||||
partition_by="event_date", clustered_by=["country"],
|
||||
)
|
||||
row = repo.get("orders")
|
||||
assert row is not None
|
||||
assert row["rows"] == 10
|
||||
assert row["size_bytes"] == 2048
|
||||
assert row["partition_by"] == "event_date"
|
||||
assert row["clustered_by"] == ["country"]
|
||||
assert row["refreshed_at"] is not None
|
||||
assert row["error_at"] is None
|
||||
|
||||
# Update with new numbers; refreshed_at advances.
|
||||
first_refresh = row["refreshed_at"]
|
||||
repo.upsert_success(
|
||||
"orders", rows=20, size_bytes=4096,
|
||||
partition_by=None, clustered_by=[],
|
||||
)
|
||||
row2 = repo.get("orders")
|
||||
assert row2["rows"] == 20
|
||||
assert row2["partition_by"] is None
|
||||
assert row2["clustered_by"] == []
|
||||
assert row2["refreshed_at"] >= first_refresh
|
||||
finally:
|
||||
conn.close()
|
||||
|
||||
|
||||
def test_mark_error_preserves_prior_success(seeded_app):
|
||||
"""After a successful refresh, a subsequent failure must keep the
|
||||
rows/size_bytes columns untouched — analyst Claude keeps using the
|
||||
last-known-good numbers while the next scheduled retry attempts to
|
||||
recover."""
|
||||
from src.db import get_system_db
|
||||
conn = get_system_db()
|
||||
try:
|
||||
repo = BqMetadataCacheRepository(conn)
|
||||
repo.upsert_success(
|
||||
"orders", rows=100, size_bytes=1000,
|
||||
partition_by=None, clustered_by=None,
|
||||
)
|
||||
repo.mark_error("orders", "BQ timeout")
|
||||
row = repo.get("orders")
|
||||
assert row["rows"] == 100, "prior success must be preserved across error"
|
||||
assert row["size_bytes"] == 1000
|
||||
assert row["error_at"] is not None
|
||||
assert row["error_msg"] == "BQ timeout"
|
||||
# Subsequent success clears the error.
|
||||
repo.upsert_success(
|
||||
"orders", rows=200, size_bytes=2000,
|
||||
partition_by=None, clustered_by=None,
|
||||
)
|
||||
row2 = repo.get("orders")
|
||||
assert row2["rows"] == 200
|
||||
assert row2["error_at"] is None
|
||||
assert row2["error_msg"] is None
|
||||
finally:
|
||||
conn.close()
|
||||
|
||||
|
||||
def test_mark_error_truncates_long_messages(seeded_app):
|
||||
from src.db import get_system_db
|
||||
conn = get_system_db()
|
||||
try:
|
||||
repo = BqMetadataCacheRepository(conn)
|
||||
repo.mark_error("orders", "x" * 2000)
|
||||
row = repo.get("orders")
|
||||
assert len(row["error_msg"]) == 512
|
||||
finally:
|
||||
conn.close()
|
||||
|
||||
|
||||
def test_list_all_orders_by_table_id(seeded_app):
|
||||
from src.db import get_system_db
|
||||
conn = get_system_db()
|
||||
try:
|
||||
repo = BqMetadataCacheRepository(conn)
|
||||
repo.upsert_success(
|
||||
"zeta", rows=1, size_bytes=1, partition_by=None, clustered_by=None,
|
||||
)
|
||||
repo.upsert_success(
|
||||
"alpha", rows=2, size_bytes=2, partition_by=None, clustered_by=None,
|
||||
)
|
||||
rows = repo.list_all()
|
||||
ids = [r["table_id"] for r in rows]
|
||||
assert ids == sorted(ids)
|
||||
finally:
|
||||
conn.close()
|
||||
|
||||
|
||||
def test_delete_removes_row(seeded_app):
|
||||
from src.db import get_system_db
|
||||
conn = get_system_db()
|
||||
try:
|
||||
repo = BqMetadataCacheRepository(conn)
|
||||
repo.upsert_success(
|
||||
"orders", rows=1, size_bytes=1, partition_by=None, clustered_by=None,
|
||||
)
|
||||
repo.delete("orders")
|
||||
assert repo.get("orders") is None
|
||||
finally:
|
||||
conn.close()
|
||||
|
||||
|
||||
# ─── compute_freshness ────────────────────────────────────────────────────
|
||||
|
||||
|
||||
def test_freshness_never_fetched_for_missing_row():
|
||||
from app.api.bq_metadata_refresh import compute_freshness
|
||||
assert compute_freshness(None) == "never_fetched"
|
||||
|
||||
|
||||
def test_freshness_never_fetched_for_no_refresh_no_error():
|
||||
from app.api.bq_metadata_refresh import compute_freshness
|
||||
row = {"refreshed_at": None, "error_at": None}
|
||||
assert compute_freshness(row) == "never_fetched"
|
||||
|
||||
|
||||
def test_freshness_error_when_only_error_present():
|
||||
from app.api.bq_metadata_refresh import compute_freshness
|
||||
row = {
|
||||
"refreshed_at": None,
|
||||
"error_at": datetime.now(timezone.utc),
|
||||
}
|
||||
assert compute_freshness(row) == "error"
|
||||
|
||||
|
||||
def test_freshness_fresh_within_threshold():
|
||||
from app.api.bq_metadata_refresh import compute_freshness
|
||||
now = datetime.now(timezone.utc)
|
||||
row = {
|
||||
"refreshed_at": now - timedelta(seconds=60),
|
||||
"error_at": None,
|
||||
}
|
||||
# 1-minute-old row with a 1-hour threshold ⇒ fresh.
|
||||
assert compute_freshness(row, now=now, fresh_threshold=3600) == "fresh"
|
||||
|
||||
|
||||
def test_freshness_stale_beyond_threshold():
|
||||
from app.api.bq_metadata_refresh import compute_freshness
|
||||
now = datetime.now(timezone.utc)
|
||||
row = {
|
||||
"refreshed_at": now - timedelta(hours=10),
|
||||
"error_at": None,
|
||||
}
|
||||
assert compute_freshness(row, now=now, fresh_threshold=3600) == "stale"
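The five freshness tests above pin down the observable contract of `compute_freshness` without showing its body, which lives in `app/api/bq_metadata_refresh.py` and is not part of this excerpt. A minimal sketch that would satisfy those tests is given below. The name `compute_freshness_sketch`, the default threshold of twice the 4 h interval, the handling of naive timestamps as UTC, and the choice to ignore `error_at` once a successful refresh exists are all assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Optional

# Assumed default: 2x the 4 h scheduler interval, per the v40 schema comment.
DEFAULT_FRESH_THRESHOLD = 2 * 4 * 60 * 60


def compute_freshness_sketch(
    row: Optional[dict[str, Any]],
    *,
    now: Optional[datetime] = None,
    fresh_threshold: int = DEFAULT_FRESH_THRESHOLD,
) -> str:
    """Classify a cache row as never_fetched, error, fresh, or stale."""
    if row is None:
        return "never_fetched"

    refreshed_at = row.get("refreshed_at")
    if refreshed_at is None:
        # No success yet: distinguish "tried and failed" from "never tried".
        return "error" if row.get("error_at") is not None else "never_fetched"

    if now is None:
        now = datetime.now(timezone.utc)
    if refreshed_at.tzinfo is None:
        # Assumption: naive timestamps read back from the DB are UTC.
        refreshed_at = refreshed_at.replace(tzinfo=timezone.utc)

    age_seconds = (now - refreshed_at).total_seconds()
    return "fresh" if age_seconds < fresh_threshold else "stale"
```

Under this sketch a row that last succeeded 10 hours ago with a 1-hour threshold is `stale` even if a later attempt failed, which matches the intent of keeping last-known-good numbers visible.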
|
||||
211
tests/test_bq_metadata_refresh_endpoint.py
Normal file
|
|
@@ -0,0 +1,211 @@
|
|||
"""End-to-end tests for the three bq_metadata_refresh endpoints."""
|
||||
|
||||
from unittest.mock import patch
|
||||
|
||||
from app.api._metadata_models import TableMetadata
|
||||
|
||||
|
||||
def _register_remote(seeded_app, table_id: str):
|
||||
from src.db import get_system_db
|
||||
from src.repositories.table_registry import TableRegistryRepository
|
||||
conn = get_system_db()
|
||||
try:
|
||||
TableRegistryRepository(conn).register(
|
||||
name=table_id,
|
||||
id=table_id,
|
||||
source_type="bigquery",
|
||||
bucket="dwh_base",
|
||||
source_table=table_id,
|
||||
query_mode="remote",
|
||||
)
|
||||
finally:
|
||||
conn.close()
|
||||
|
||||
|
||||
# ─── POST /api/admin/run-bq-metadata-refresh ──────────────────────────────
|
||||
|
||||
|
||||
def test_run_refresh_walks_remote_rows_and_upserts(seeded_app):
|
||||
from src.db import get_system_db
|
||||
from src.repositories.bq_metadata_cache import BqMetadataCacheRepository
|
||||
|
||||
_register_remote(seeded_app, "a_remote")
|
||||
_register_remote(seeded_app, "b_remote")
|
||||
|
||||
fake = TableMetadata(
|
||||
rows=5, size_bytes=512, partition_by="d", clustered_by=["c"],
|
||||
)
|
||||
|
||||
c = seeded_app["client"]
|
||||
token = seeded_app["admin_token"]
|
||||
with patch("connectors.bigquery.metadata.fetch", return_value=fake):
|
||||
r = c.post(
|
||||
"/api/admin/run-bq-metadata-refresh",
|
||||
headers={"Authorization": f"Bearer {token}"},
|
||||
)
|
||||
assert r.status_code == 200, r.text
|
||||
body = r.json()
|
||||
assert body["total"] >= 2
|
||||
assert body["succeeded"] >= 2
|
||||
assert body["failed"] == 0
|
||||
|
||||
conn = get_system_db()
|
||||
try:
|
||||
repo = BqMetadataCacheRepository(conn)
|
||||
for tid in ("a_remote", "b_remote"):
|
||||
row = repo.get(tid)
|
||||
assert row is not None
|
||||
assert row["rows"] == 5
|
||||
assert row["size_bytes"] == 512
|
||||
assert row["partition_by"] == "d"
|
||||
assert row["clustered_by"] == ["c"]
|
||||
finally:
|
||||
conn.close()
|
||||
|
||||
|
||||
def test_run_refresh_marks_error_on_provider_failure(seeded_app):
|
||||
from src.db import get_system_db
|
||||
from src.repositories.bq_metadata_cache import BqMetadataCacheRepository
|
||||
|
||||
_register_remote(seeded_app, "boom")
|
||||
|
||||
c = seeded_app["client"]
|
||||
token = seeded_app["admin_token"]
|
||||
with patch(
|
||||
"connectors.bigquery.metadata.fetch",
|
||||
side_effect=RuntimeError("BQ throttle"),
|
||||
):
|
||||
r = c.post(
|
||||
"/api/admin/run-bq-metadata-refresh",
|
||||
headers={"Authorization": f"Bearer {token}"},
|
||||
)
|
||||
assert r.status_code == 200
|
||||
body = r.json()
|
||||
assert body["failed"] >= 1
|
||||
|
||||
conn = get_system_db()
|
||||
try:
|
||||
row = BqMetadataCacheRepository(conn).get("boom")
|
||||
assert row is not None
|
||||
assert row["error_at"] is not None
|
||||
assert "BQ throttle" in (row["error_msg"] or "")
|
||||
finally:
|
||||
conn.close()
|
||||
|
||||
|
||||
def test_run_refresh_requires_admin(seeded_app):
|
||||
c = seeded_app["client"]
|
||||
# No Authorization header → 401.
|
||||
r = c.post("/api/admin/run-bq-metadata-refresh")
|
||||
assert r.status_code == 401
|
||||
|
||||
|
||||
# ─── POST /api/v2/metadata-cache/refresh?table= ───────────────────────────
|
||||
|
||||
|
||||
def test_refresh_one_table_endpoint(seeded_app):
|
||||
from src.db import get_system_db
|
||||
from src.repositories.bq_metadata_cache import BqMetadataCacheRepository
|
||||
|
||||
_register_remote(seeded_app, "single")
|
||||
|
||||
c = seeded_app["client"]
|
||||
token = seeded_app["admin_token"]
|
||||
fake = TableMetadata(rows=99, size_bytes=999)
|
||||
with patch("connectors.bigquery.metadata.fetch", return_value=fake):
|
||||
r = c.post(
|
||||
"/api/v2/metadata-cache/refresh?table=single",
|
||||
headers={"Authorization": f"Bearer {token}"},
|
||||
)
|
||||
assert r.status_code == 200, r.text
|
||||
assert r.json()["status"] == "ok"
|
||||
|
||||
conn = get_system_db()
|
||||
try:
|
||||
row = BqMetadataCacheRepository(conn).get("single")
|
||||
assert row["rows"] == 99
|
||||
finally:
|
||||
conn.close()
|
||||
|
||||
|
||||
def test_refresh_one_table_unknown_id_returns_404(seeded_app):
|
||||
c = seeded_app["client"]
|
||||
token = seeded_app["admin_token"]
|
||||
r = c.post(
|
||||
"/api/v2/metadata-cache/refresh?table=does_not_exist",
|
||||
headers={"Authorization": f"Bearer {token}"},
|
||||
)
|
||||
assert r.status_code == 404
|
||||
|
||||
|
||||
def test_refresh_one_table_rejects_non_remote(seeded_app):
|
||||
from src.db import get_system_db
|
||||
from src.repositories.table_registry import TableRegistryRepository
|
||||
|
||||
conn = get_system_db()
|
||||
try:
|
||||
TableRegistryRepository(conn).register(
|
||||
name="local_t",
|
||||
id="local_t",
|
||||
source_type="keboola",
|
||||
bucket="in.c-x",
|
||||
source_table="t",
|
||||
query_mode="local",
|
||||
)
|
||||
finally:
|
||||
conn.close()
|
||||
|
||||
c = seeded_app["client"]
|
||||
token = seeded_app["admin_token"]
|
||||
r = c.post(
|
||||
"/api/v2/metadata-cache/refresh?table=local_t",
|
||||
headers={"Authorization": f"Bearer {token}"},
|
||||
)
|
||||
assert r.status_code == 400
|
||||
|
||||
|
||||
# ─── GET /api/v2/metadata-cache/status ────────────────────────────────────
|
||||
|
||||
|
||||
def test_status_endpoint_returns_per_row_freshness(seeded_app):
|
||||
from src.db import get_system_db
|
||||
from src.repositories.bq_metadata_cache import BqMetadataCacheRepository
|
||||
|
||||
conn = get_system_db()
|
||||
try:
|
||||
BqMetadataCacheRepository(conn).upsert_success(
|
||||
"orders", rows=1, size_bytes=1, partition_by=None, clustered_by=None,
|
||||
)
|
||||
finally:
|
||||
conn.close()
|
||||
|
||||
c = seeded_app["client"]
|
||||
token = seeded_app["admin_token"]
|
||||
r = c.get(
|
||||
"/api/v2/metadata-cache/status",
|
||||
headers={"Authorization": f"Bearer {token}"},
|
||||
)
|
||||
assert r.status_code == 200, r.text
|
||||
body = r.json()
|
||||
assert "scheduler_interval_seconds" in body
|
||||
assert "fresh_threshold_seconds" in body
|
||||
assert body["fresh_threshold_seconds"] == 2 * body["scheduler_interval_seconds"]
|
||||
orders = next(t for t in body["tables"] if t["table_id"] == "orders")
|
||||
assert orders["freshness"] == "fresh"
|
||||
|
||||
|
||||
def test_status_endpoint_does_not_require_admin(seeded_app):
|
||||
"""Non-admin analyst tools (CLI, Claude Code) need this surface."""
|
||||
c = seeded_app["client"]
|
||||
# No token at all → 401 (auth still required, just not admin).
|
||||
r = c.get("/api/v2/metadata-cache/status")
|
||||
assert r.status_code == 401
|
||||
# Any authenticated user works — seeded_app's admin_token is the
|
||||
# easiest valid bearer; downgrade once the test harness exposes a
|
||||
# plain-user token.
|
||||
token = seeded_app["admin_token"]
|
||||
r = c.get(
|
||||
"/api/v2/metadata-cache/status",
|
||||
headers={"Authorization": f"Bearer {token}"},
|
||||
)
|
||||
assert r.status_code == 200
|
||||
|
|
@@ -18,9 +18,17 @@ def test_build_jobs_uses_documented_defaults(monkeypatch):
|
|||
assert jobs["health-check"] == "every 5m"
|
||||
assert jobs["script-runner"] == "every 1m"
|
||||
assert jobs["marketplaces"] == "daily 03:00"
|
||||
assert jobs["bq-metadata-refresh"] == "every 4h"
|
||||
assert resolved_tick_seconds() == 30
|
||||
|
||||
|
||||
def test_build_jobs_honors_bq_metadata_env_override(monkeypatch):
|
||||
monkeypatch.setenv("SCHEDULER_BQ_METADATA_REFRESH_INTERVAL", "7200") # 2h
|
||||
from services.scheduler.__main__ import build_jobs
|
||||
jobs = {name: schedule for name, schedule, *_ in build_jobs()}
|
||||
assert jobs["bq-metadata-refresh"] == "every 2h"
|
||||
|
||||
|
||||
def test_build_jobs_honors_env_overrides(monkeypatch):
|
||||
monkeypatch.setenv("SCHEDULER_DATA_REFRESH_INTERVAL", "1800") # 30m
|
||||
monkeypatch.setenv("SCHEDULER_HEALTH_CHECK_INTERVAL", "60") # 1m
|
||||
|
|
|
|||
|
|
@@ -1,71 +0,0 @@
|
|||
"""Dispatch + identifier-validation gate for the source-agnostic
|
||||
metadata providers."""
|
||||
|
||||
from app.api._metadata_models import MetadataRequest
|
||||
|
||||
|
||||
def test_dispatcher_returns_bq_provider_for_bigquery():
|
||||
from app.api.v2_catalog import _metadata_provider_for
|
||||
from connectors.bigquery import metadata as bq_meta
|
||||
fn = _metadata_provider_for("bigquery")
|
||||
assert fn is bq_meta.fetch
|
||||
|
||||
|
||||
def test_dispatcher_returns_keboola_provider_for_keboola():
|
||||
from app.api.v2_catalog import _metadata_provider_for
|
||||
from connectors.keboola import metadata as kb_meta
|
||||
fn = _metadata_provider_for("keboola")
|
||||
assert fn is kb_meta.fetch
|
||||
|
||||
|
||||
def test_dispatcher_returns_none_for_unknown_source():
|
||||
from app.api.v2_catalog import _metadata_provider_for
|
||||
assert _metadata_provider_for("jira") is None
|
||||
assert _metadata_provider_for("") is None
|
||||
assert _metadata_provider_for("snowflake") is None
|
||||
|
||||
|
||||
def test_build_metadata_request_for_valid_row():
|
||||
from app.api.v2_catalog import _build_metadata_request
|
||||
req = _build_metadata_request({
|
||||
"id": "orders",
|
||||
"bucket": "dwh_base",
|
||||
"source_table": "orders_2024",
|
||||
})
|
||||
assert isinstance(req, MetadataRequest)
|
||||
assert req.table_id == "orders"
|
||||
assert req.bucket == "dwh_base"
|
||||
assert req.source_table == "orders_2024"
|
||||
|
||||
|
||||
def test_build_metadata_request_rejects_unsafe_bucket():
|
||||
from app.api.v2_catalog import _build_metadata_request
|
||||
req = _build_metadata_request({
|
||||
"id": "x",
|
||||
"bucket": "evil`; DROP--",
|
||||
"source_table": "t",
|
||||
})
|
||||
assert req is None
|
||||
|
||||
|
||||
def test_build_metadata_request_falls_back_to_id_when_source_table_missing():
|
||||
"""Some legacy Keboola registry rows have empty source_table; the row id
|
||||
is the table name in that case (mirrors v2_schema:168 behavior)."""
|
||||
from app.api.v2_catalog import _build_metadata_request
|
||||
req = _build_metadata_request({
|
||||
"id": "orders",
|
||||
"bucket": "in.c-crm",
|
||||
"source_table": "",
|
||||
})
|
||||
assert req is not None
|
||||
assert req.source_table == "orders"
|
||||
|
||||
|
||||
def test_stub_providers_return_none():
|
||||
"""Providers don't have their real bodies yet — stubs return None
|
||||
so the catalog endpoint stays 200 while we wire the rest."""
|
||||
from connectors.bigquery import metadata as bq_meta
|
||||
from connectors.keboola import metadata as kb_meta
|
||||
req = MetadataRequest(table_id="x", bucket="b", source_table="t")
|
||||
assert bq_meta.fetch(req) is None
|
||||
assert kb_meta.fetch(req) is None
|
||||
|
|
@@ -1,47 +1,58 @@
- """Unified cache flush across all four catalog/schema/sample/metadata
- caches on registry write."""
+ """Unified cache flush across the three in-memory catalog/schema/sample
+ caches on registry write.

- from unittest.mock import patch
+ Post-0.50: the persistent ``bq_metadata_cache`` is intentionally NOT
+ invalidated here. That table's lifecycle is owned by the scheduler-
+ driven refresh — admins who need an immediate refresh after editing a
+ remote row hit ``POST /api/v2/metadata-cache/refresh?table=<id>``
+ explicitly. Auto-invalidation on every registry edit would re-introduce
+ the request-path BQ fan-out the refactor exists to avoid.
+ """

+ from src.db import get_system_db
+ from src.repositories.bq_metadata_cache import BqMetadataCacheRepository
|
||||
|
||||
|
||||
- def test_invalidate_flushes_all_four_caches():
+ def test_invalidate_flushes_three_in_memory_caches():
|
||||
from app.api import v2_catalog, v2_schema, v2_sample
|
||||
- from app.api._metadata_models import TableMetadata
|
||||
|
||||
# Pre-populate.
|
||||
v2_catalog._table_rows_cache.set("all", ["fake_row"])
|
||||
- v2_catalog._metadata_cache.set("orders", TableMetadata(rows=10))
|
||||
v2_schema._schema_cache.set("orders", {"columns": []})
|
||||
v2_sample._sample_cache.set("orders|10", [{"row": 1}])
|
||||
|
||||
v2_catalog.invalidate_for_table("orders")
|
||||
|
||||
assert v2_catalog._table_rows_cache.get("all") is None
|
||||
- assert v2_catalog._metadata_cache.get("orders") is None
|
||||
assert v2_schema._schema_cache.get("orders") is None
|
||||
# Sample cache is cleared whole (we don't have prefix-invalidation).
|
||||
assert v2_sample._sample_cache.get("orders|10") is None
|
||||
|
||||
|
||||
- def test_invalidate_schedules_single_row_rewarm(monkeypatch):
- """After the flush, a background re-warm task is scheduled for the
- same table_id. Assert via patching create_task."""
- import asyncio
+ def test_invalidate_does_not_touch_persistent_bq_cache():
+ """The persistent cache survives registry-row invalidations; only an
+ explicit ``POST /api/v2/metadata-cache/refresh`` (or the scheduled
+ refresh) should change it."""
  from app.api import v2_catalog

- scheduled = []
+ conn = get_system_db()
+ try:
+ BqMetadataCacheRepository(conn).upsert_success(
+ "survives_invalidate",
+ rows=42, size_bytes=4096, partition_by=None, clustered_by=None,
+ )
+ finally:
+ conn.close()

- def fake_create_task(coro):
- # Drain the coroutine so the test doesn't leak it.
- coro.close()
- scheduled.append(coro)
- return None
+ v2_catalog.invalidate_for_table("survives_invalidate")

- # Simulate a running event loop so the create_task branch is reached.
- monkeypatch.setattr(asyncio, "get_running_loop", lambda: object())
- monkeypatch.setattr(asyncio, "create_task", fake_create_task)
- v2_catalog.invalidate_for_table("orders")
- assert len(scheduled) == 1
+ conn = get_system_db()
+ try:
+ row = BqMetadataCacheRepository(conn).get("survives_invalidate")
+ finally:
+ conn.close()
+ assert row is not None
+ assert row["rows"] == 42
|
||||
|
||||
|
||||
def test_register_table_invalidates(seeded_app):
|
||||
|
|
|
|||
|
|
@@ -1,10 +1,16 @@
|
|||
"""Catalog endpoint integration: per-table metadata enrichment for
|
||||
remote rows."""
|
||||
remote rows.
|
||||
|
||||
Post-0.50 the catalog endpoint reads enrichment fields exclusively from
|
||||
the persistent ``bq_metadata_cache`` table (populated by the scheduler-
|
||||
driven refresh in ``app/api/bq_metadata_refresh.py``). These tests
|
||||
pre-seed cache rows and verify the catalog response shape; they do NOT
|
||||
mock ``connectors.bigquery.metadata.fetch`` because that path is no
|
||||
longer reachable from the catalog request.
|
||||
"""
|
||||
|
||||
from unittest.mock import patch
|
||||
|
||||
from app.api._metadata_models import TableMetadata
|
||||
|
||||
|
||||
def _register_table(seeded_app, **kwargs):
|
||||
"""Register a table into the test DB using TableRegistryRepository."""
|
||||
|
|
@@ -13,38 +19,60 @@ def _register_table(seeded_app, **kwargs):
|
|||
conn = get_system_db()
|
||||
try:
|
||||
repo = TableRegistryRepository(conn)
|
||||
# `name` defaults to `id` if not supplied
|
||||
name = kwargs.pop("name", kwargs.get("id"))
|
||||
repo.register(name=name, **kwargs)
|
||||
finally:
|
||||
conn.close()
|
||||
|
||||
|
||||
def test_remote_row_includes_metadata_fields(seeded_app, monkeypatch):
|
||||
"""Catalog response for a query_mode='remote' BQ row carries the four
|
||||
new fields populated by the provider."""
|
||||
# Reset catalog row cache so this test's registered table is visible.
|
||||
def _seed_cache_row(
|
||||
table_id: str,
|
||||
*,
|
||||
rows=None,
|
||||
size_bytes=None,
|
||||
partition_by=None,
|
||||
clustered_by=None,
|
||||
):
|
||||
"""Insert a successful refresh row into bq_metadata_cache."""
|
||||
from src.db import get_system_db
|
||||
from src.repositories.bq_metadata_cache import BqMetadataCacheRepository
|
||||
conn = get_system_db()
|
||||
try:
|
||||
BqMetadataCacheRepository(conn).upsert_success(
|
||||
table_id,
|
||||
rows=rows,
|
||||
size_bytes=size_bytes,
|
||||
partition_by=partition_by,
|
||||
clustered_by=clustered_by,
|
||||
)
|
||||
finally:
|
||||
conn.close()
|
||||
|
||||
|
||||
def _reset_catalog_caches():
|
||||
from app.api import v2_catalog
|
||||
v2_catalog._table_rows_cache.clear()
|
||||
v2_catalog._metadata_cache.clear()
|
||||
|
||||
|
||||
def test_remote_row_includes_metadata_fields(seeded_app):
|
||||
"""Catalog response for a query_mode='remote' BQ row carries the four
|
||||
enrichment fields read from the persistent cache."""
|
||||
_reset_catalog_caches()
|
||||
|
||||
c = seeded_app["client"]
|
||||
token = seeded_app["admin_token"]
|
||||
|
||||
fake_meta = TableMetadata(
|
||||
rows=10000, size_bytes=2_000_000,
|
||||
partition_by="event_date", clustered_by=["country", "platform"],
|
||||
)
|
||||
|
||||
_register_table(
|
||||
seeded_app,
|
||||
id="orders", source_type="bigquery", bucket="dwh_base",
|
||||
source_table="orders_2024", query_mode="remote",
|
||||
)
|
||||
_seed_cache_row(
|
||||
"orders",
|
||||
rows=10000, size_bytes=2_000_000,
|
||||
partition_by="event_date", clustered_by=["country", "platform"],
|
||||
)
|
||||
|
||||
with patch(
|
||||
"connectors.bigquery.metadata.fetch", return_value=fake_meta,
|
||||
):
|
||||
r = c.get(
|
||||
"/api/v2/catalog",
|
||||
headers={"Authorization": f"Bearer {token}"},
|
||||
|
|
@@ -56,15 +84,46 @@ def test_remote_row_includes_metadata_fields(seeded_app, monkeypatch):
|
|||
assert orders["size_bytes"] == 2_000_000
|
||||
assert orders["partition_by"] == "event_date"
|
||||
assert orders["clustered_by"] == ["country", "platform"]
|
||||
# Existing fields still present.
|
||||
assert orders["query_mode"] == "remote"
|
||||
assert orders["metadata_freshness"] == "fresh"
|
||||
|
||||
|
||||
def test_local_row_unaffected_by_provider_dispatch(seeded_app):
|
||||
"""query_mode='local' rows take the parquet-stat path; provider not called."""
|
||||
from app.api import v2_catalog
|
||||
v2_catalog._table_rows_cache.clear()
|
||||
v2_catalog._metadata_cache.clear()
|
||||
def test_remote_row_with_no_cache_returns_null_fields(seeded_app):
|
||||
"""Catalog response for a remote row with no cache entry — first boot
|
||||
before scheduler tick — returns null enrichment fields and
|
||||
metadata_freshness='never_fetched'. MUST stay 200; MUST NOT call BQ."""
|
||||
_reset_catalog_caches()
|
||||
|
||||
c = seeded_app["client"]
|
||||
token = seeded_app["admin_token"]
|
||||
_register_table(
|
||||
seeded_app,
|
||||
id="cold_t", source_type="bigquery", bucket="dwh_base",
|
||||
source_table="cold_t", query_mode="remote",
|
||||
)
|
||||
|
||||
# Patch the BQ provider so we can prove the request path never reaches it.
|
||||
with patch("connectors.bigquery.metadata.fetch") as mock_fetch:
|
||||
r = c.get(
|
||||
"/api/v2/catalog",
|
||||
headers={"Authorization": f"Bearer {token}"},
|
||||
)
|
||||
assert r.status_code == 200, r.text
|
||||
mock_fetch.assert_not_called()
|
||||
|
||||
tables = r.json()["tables"]
|
||||
cold = next(t for t in tables if t["id"] == "cold_t")
|
||||
assert cold["rows"] is None
|
||||
assert cold["size_bytes"] is None
|
||||
assert cold["partition_by"] is None
|
||||
assert cold["clustered_by"] == []
|
||||
assert cold["metadata_freshness"] == "never_fetched"
|
||||
|
||||
|
||||
def test_local_row_metadata_freshness_is_not_applicable(seeded_app):
|
||||
"""query_mode='local' rows take the parquet-stat path; the freshness
|
||||
field signals that the BQ cache concept doesn't apply."""
|
||||
_reset_catalog_caches()
|
||||
|
||||
c = seeded_app["client"]
|
||||
token = seeded_app["admin_token"]
|
||||
|
|
@@ -74,74 +133,31 @@ def test_local_row_unaffected_by_provider_dispatch(seeded_app):
|
|||
source_table="users", query_mode="local",
|
||||
)
|
||||
|
||||
with patch("connectors.keboola.metadata.fetch") as mock_fetch:
|
||||
r = c.get(
|
||||
"/api/v2/catalog",
|
||||
headers={"Authorization": f"Bearer {token}"},
|
||||
)
|
||||
assert r.status_code == 200, r.text
|
||||
mock_fetch.assert_not_called()
|
||||
|
||||
|
||||
def test_provider_failure_returns_null_metadata(seeded_app):
|
||||
"""Provider returns None → row appears with null new fields, not
|
||||
a 500. Catalog endpoint must stay 200."""
|
||||
from app.api import v2_catalog
|
||||
v2_catalog._table_rows_cache.clear()
|
||||
v2_catalog._metadata_cache.clear()
|
||||
|
||||
c = seeded_app["client"]
|
||||
token = seeded_app["admin_token"]
|
||||
_register_table(
|
||||
seeded_app,
|
||||
id="broken", source_type="bigquery", bucket="dwh_base",
|
||||
source_table="broken_t", query_mode="remote",
|
||||
)
|
||||
|
||||
with patch(
|
||||
"connectors.bigquery.metadata.fetch", return_value=None,
|
||||
):
|
||||
r = c.get(
|
||||
"/api/v2/catalog",
|
||||
headers={"Authorization": f"Bearer {token}"},
|
||||
)
|
||||
assert r.status_code == 200, r.text
|
||||
tables = r.json()["tables"]
|
||||
broken = next(t for t in tables if t["id"] == "broken")
|
||||
assert broken["rows"] is None
|
||||
assert broken["size_bytes"] is None
|
||||
assert broken["partition_by"] is None
|
||||
assert broken["clustered_by"] is None
|
||||
users = next(t for t in tables if t["id"] == "users")
|
||||
assert users["metadata_freshness"] == "not_applicable"
|
||||
|
||||
|
||||
def test_zero_size_bytes_reports_small_not_unknown(seeded_app):
|
||||
"""Devin Review #1 regression: `if cached.size_bytes:` is falsy when
|
||||
`size_bytes == 0` (genuinely empty table) — that wrongly emitted
|
||||
`rough_size_hint=None` ("unknown") instead of `"small"` (the bucket
|
||||
`_bucket_size(0)` returns).
|
||||
|
||||
Fix in `_size_hint_for_row`: distinguish "size known to be zero" from
|
||||
"size is unknown" with `is not None`."""
|
||||
from app.api import v2_catalog
|
||||
v2_catalog._table_rows_cache.clear()
|
||||
v2_catalog._metadata_cache.clear()
|
||||
"""Devin Review #1 regression preserved across the refactor: a cache
|
||||
row with size_bytes=0 must surface rough_size_hint='small', not None.
|
||||
"""
|
||||
_reset_catalog_caches()
|
||||
|
||||
c = seeded_app["client"]
|
||||
token = seeded_app["admin_token"]
|
||||
|
||||
fake_meta = TableMetadata(
|
||||
rows=0, size_bytes=0, partition_by=None, clustered_by=[],
|
||||
)
|
||||
|
||||
_register_table(
|
||||
seeded_app,
|
||||
id="empty_t", source_type="bigquery", bucket="dwh_base",
|
||||
source_table="empty_t", query_mode="remote",
|
||||
)
|
||||
_seed_cache_row("empty_t", rows=0, size_bytes=0, clustered_by=[])
|
||||
|
||||
with patch(
|
||||
"connectors.bigquery.metadata.fetch", return_value=fake_meta,
|
||||
):
|
||||
r = c.get(
|
||||
"/api/v2/catalog",
|
||||
headers={"Authorization": f"Bearer {token}"},
|
||||
|
|
@@ -149,18 +165,15 @@ def test_zero_size_bytes_reports_small_not_unknown(seeded_app):
|
|||
assert r.status_code == 200, r.text
|
||||
tables = r.json()["tables"]
|
||||
empty = next(t for t in tables if t["id"] == "empty_t")
|
||||
# The whole point of this test: 0 bytes is NOT "unknown".
|
||||
assert empty["size_bytes"] == 0
|
||||
assert empty["rough_size_hint"] == "small", (
|
||||
f"size_bytes=0 should bucket to 'small', got {empty['rough_size_hint']}"
|
||||
)
|
||||
assert empty["rough_size_hint"] == "small"
|
||||
|
||||
|
||||
def test_cache_hit_does_not_call_provider_twice(seeded_app):
|
||||
"""First call invokes provider; second within 15 min hits cache."""
|
||||
from app.api import v2_catalog
|
||||
v2_catalog._table_rows_cache.clear()
|
||||
v2_catalog._metadata_cache.clear()
|
||||
def test_catalog_request_never_calls_bq(seeded_app):
|
||||
"""The whole point of the refactor: even with a cold cache and a
|
||||
remote BQ row in the registry, GET /api/v2/catalog MUST NOT touch
|
||||
the BQ provider. Regressing this re-introduces the >90 s hang."""
|
||||
_reset_catalog_caches()
|
||||
|
||||
c = seeded_app["client"]
|
||||
token = seeded_app["admin_token"]
|
||||
|
|
@@ -170,10 +183,8 @@ def test_cache_hit_does_not_call_provider_twice(seeded_app):
|
|||
source_table="orders_2024", query_mode="remote",
|
||||
)
|
||||
|
||||
fake_meta = TableMetadata(rows=1, size_bytes=2)
|
||||
with patch(
|
||||
"connectors.bigquery.metadata.fetch", return_value=fake_meta,
|
||||
) as mock_fetch:
|
||||
with patch("connectors.bigquery.metadata.fetch") as mock_fetch:
|
||||
c.get("/api/v2/catalog", headers={"Authorization": f"Bearer {token}"})
|
||||
c.get("/api/v2/catalog", headers={"Authorization": f"Bearer {token}"})
|
||||
assert mock_fetch.call_count == 1
|
||||
|
||||
mock_fetch.assert_not_called()
|
||||
|
|
|
|||
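The invariant these tests pin down (the catalog request reads enrichment only from the persistent cache and never reaches the BigQuery provider) can be illustrated with a minimal, self-contained sketch. Everything named here is a hypothetical stand-in and an assumption for illustration: `enrich_row`, the dict-backed cache, and the mock provider are not the project's `v2_catalog` code or `BqMetadataCacheRepository`. Only the field names and the `fresh` / `never_fetched` values are taken from the assertions in the diff.

```python
from unittest.mock import MagicMock

# Hypothetical field defaults for a row the scheduler has not reached yet.
# The values mirror the assertions in the cold-cache test of the diff.
COLD_FIELDS = {
    "rows": None,
    "size_bytes": None,
    "partition_by": None,
    "clustered_by": [],
}


def enrich_row(table_id, cache, provider):
    """Return enrichment fields for one remote row (illustrative only).

    `provider` is accepted but deliberately never called: a cache miss
    yields null fields instead of a request-path fetch.
    """
    cached = cache.get(table_id)
    if cached is None:
        return {**COLD_FIELDS, "metadata_freshness": "never_fetched"}
    return {**cached, "metadata_freshness": "fresh"}


# A dict stands in for the persistent cache table; a MagicMock for the provider.
provider = MagicMock(name="bq_metadata_fetch")
cache = {
    "orders": {
        "rows": 10000,
        "size_bytes": 2_000_000,
        "partition_by": "event_date",
        "clustered_by": ["country", "platform"],
    }
}

warm = enrich_row("orders", cache, provider)
cold = enrich_row("cold_t", cache, provider)

# Same style of check as the tests in the diff: the provider is untouched.
provider.assert_not_called()
```

The design point is that a cache miss is a valid, cheap answer rather than a trigger for a fetch, which is why the cold row reports `never_fetched` instead of blocking.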