release: 0.50.0 — persistent BQ metadata cache + scheduled refresh; catalog never blocks on BigQuery

Since 0.47.0 GET /api/v2/catalog enriched each remote BigQuery row by fetching INFORMATION_SCHEMA.TABLE_STORAGE + COLUMNS through the DuckDB BigQuery extension *inside the request*. On cold caches that fanned out to O(N) sequential BQ jobs-API roundtrips — easily 90 s+ on partitioned / view-backed tables — and reliably blew the CLI's 30 s httpx ReadTimeout. Reproduced with py-spy: three AnyIO worker threads stuck inside connectors/bigquery/metadata._fetch_via_legacy_tables.

Refactor: enrichment is read exclusively from a new persistent bq_metadata_cache DuckDB table (schema v40), populated by a scheduler-driven refresh job at SCHEDULER_BQ_METADATA_REFRESH_INTERVAL (default 4 h). Cold catalog response on a fresh container is now tens of milliseconds with metadata_freshness=never_fetched for unwarmed rows.

New surface:
- POST /api/admin/run-bq-metadata-refresh (scheduler-driven, full)
- POST /api/v2/metadata-cache/refresh?table=<id> (admin, single)
- GET /api/v2/metadata-cache/status (auth, non-admin)
- metadata_freshness field per catalog row

Removed (internal API): v2_catalog._size_hint_for_row, _resolve_remote_metadata, _metadata_provider_for, _build_metadata_request, _materialized_size_hint, in-memory _metadata_cache.

Response shape unchanged for external consumers.

991 tests passing; 2 pre-existing failures (test_db v3→v4 ladder, test_cli_binary_rename) unrelated to this change.
parent 183ee44bad
commit b3841f5b6c
16 changed files with 1158 additions and 356 deletions
24 CHANGELOG.md
@@ -10,6 +10,30 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C
 
 ## [Unreleased]
 
+## [0.50.0] — 2026-05-11
+
+### Fixed
+
+- **`GET /api/v2/catalog` no longer hangs on cold cache.** Since 0.47.0 the catalog endpoint enriched each remote BigQuery row by fetching `INFORMATION_SCHEMA.TABLE_STORAGE` + `COLUMNS` through the DuckDB BigQuery extension inside the request. On cold caches that fanned out to O(N) sequential BQ jobs-API roundtrips — easily 90 s+ on partitioned / view-backed tables — and reliably exceeded the CLI's 30 s `httpx.ReadTimeout`. Enrichment now reads exclusively from a persistent `bq_metadata_cache` DuckDB table, populated by a scheduler-driven refresh job. First call after a fresh container start returns in tens of milliseconds with `metadata_freshness: never_fetched` for rows the scheduler hasn't reached yet; subsequent ticks fill the cache. Closes the cold-start outage class entirely.
+
+### Added
+
+- **Persistent BigQuery metadata cache (`bq_metadata_cache`, schema v40).** Holds `rows`, `size_bytes`, `partition_by`, `clustered_by`, `refreshed_at`, plus a `error_at` / `error_msg` pair that preserves the last successful row across transient provider failures so analyst tooling keeps seeing last-known-good numbers.
+- **`POST /api/admin/run-bq-metadata-refresh`** — scheduler-driven full refresh of every remote BigQuery row in the registry. Bounded concurrency via `AGNES_BQ_METADATA_REFRESH_CONCURRENCY` (default 4).
+- **`POST /api/v2/metadata-cache/refresh?table=<id>`** — operator on-demand single-row refresh (admin-gated), for use right after a registry edit when waiting for the next scheduled tick is too long.
+- **`GET /api/v2/metadata-cache/status`** — non-admin endpoint surfacing per-row `refreshed_at`, `error_at`, `error_msg`, and `freshness` (`fresh` / `stale` / `never_fetched` / `error`) so CLI / Claude Code can decide whether to trust the catalog's `rows` and `size_bytes`.
+- **`metadata_freshness` field** in every `/api/v2/catalog` row. `not_applicable` for `local` / `materialized` rows where the BQ cache concept doesn't apply.
+- **Scheduler job `bq-metadata-refresh`** running at `SCHEDULER_BQ_METADATA_REFRESH_INTERVAL` (default `4 * 60 * 60` seconds = 4 h). Tunable per deployment; the catalog request path is independent of the value.
+
+### Changed
+
+- **BREAKING (internal API):** removed `app.api.v2_catalog._size_hint_for_row`, `_resolve_remote_metadata`, `_metadata_provider_for`, `_build_metadata_request`, `_materialized_size_hint`, and the in-memory `_metadata_cache` (`TTLCache`). Catalog responses still expose the same enrichment fields (`rows`, `size_bytes`, `partition_by`, `clustered_by`); the new `metadata_freshness` field is additive. External consumers that read the response shape are unaffected.
+- `app.api.cache_warmup._warm_metadata_sync` now refreshes the persistent cache via `bq_metadata_refresh.refresh_one` instead of priming an in-memory TTL cache. The existing `/api/admin/cache-warmup/*` endpoints and admin-tables SSE wiring continue to work.
+
+### Internal
+
+- Schema v40 migration `_V39_TO_V40_MIGRATIONS` adds the new table; existing instances pick it up on next start. Empty cache is treated as `never_fetched` by the catalog, never as an error.
+
 ## [0.49.1] — 2026-05-11
 
 ### Added
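The changelog entry above says consumers can use `metadata_freshness` to decide whether to trust `rows` and `size_bytes`. A minimal consumer-side sketch of that decision follows. The function name `usable_size_bytes` and the `allow_stale` policy flag are hypothetical and not part of this commit; only the field names and freshness values come from the changelog.

```python
from typing import Any, Optional


def usable_size_bytes(
    row: dict[str, Any], *, allow_stale: bool = True
) -> Optional[int]:
    """Return ``size_bytes`` only when the row's freshness makes it usable.

    Hypothetical helper: ``fresh`` is always accepted, ``stale`` only when
    the caller opts in, and every other state (``never_fetched``, ``error``,
    ``not_applicable``) yields ``None``.
    """
    freshness = row.get("metadata_freshness")
    if freshness == "fresh":
        return row.get("size_bytes")
    if freshness == "stale" and allow_stale:
        return row.get("size_bytes")
    return None
```

Treating `stale` as usable by default is one possible policy, on the reasoning that last-known-good numbers are usually better than none; a stricter client can pass `allow_stale=False`.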
310 app/api/bq_metadata_refresh.py
Normal file
@@ -0,0 +1,310 @@
+"""BigQuery metadata cache refresh — owner of the ``bq_metadata_cache``
+write path.
+
+Three endpoints share this module:
+
+- ``POST /api/admin/run-bq-metadata-refresh`` — called by the scheduler
+  container (auth: shared scheduler token resolves to a synthetic admin
+  user). Walks remote rows in ``table_registry``, fetches each via the
+  BigQuery metadata provider, UPSERTs into ``bq_metadata_cache``.
+
+- ``POST /api/v2/metadata-cache/refresh?table=<id>`` — admin-gated, for
+  operator on-demand refresh of a single row (e.g. after editing the
+  registry entry's ``bucket`` / ``source_table``).
+
+- ``GET /api/v2/metadata-cache/status`` — auth required, NOT admin-only.
+  Returns per-row freshness so analyst tooling (CLI / Claude Code) can
+  decide whether to trust the cached numbers or wait for a refresh.
+
+Why this lives outside the catalog endpoint
+-------------------------------------------
+Earlier releases inlined a per-row BigQuery fetch into ``GET /api/v2/catalog``.
+On cold caches that became O(N) sequential BQ jobs API roundtrips inside
+one HTTP request — easily 90 s+ on partitioned tables — and reliably blew
+the CLI's 30 s ``httpx.ReadTimeout``. Moving the fetch off the hot path
+into a scheduled refresh job (default every 4 h, configurable via
+``SCHEDULER_BQ_METADATA_REFRESH_INTERVAL``) keeps the catalog response
+under tens of milliseconds even at first boot, at the cost of metadata
+being up to one refresh-interval stale. The freshness field surfaces
+that explicitly.
+"""
+
+from __future__ import annotations
+
+import asyncio
+import logging
+import os
+import time
+from datetime import datetime, timedelta, timezone
+from typing import Any, Optional
+
+import duckdb
+from fastapi import APIRouter, Depends, HTTPException, Query
+
+from app.auth.access import require_admin
+from app.auth.dependencies import _get_db, get_current_user
+from src.repositories.bq_metadata_cache import BqMetadataCacheRepository
+from src.repositories.table_registry import TableRegistryRepository
+
+logger = logging.getLogger(__name__)
+router = APIRouter()
+
+
+# ─── Freshness thresholds ──────────────────────────────────────────────────
+
+
+def _scheduler_interval_seconds() -> int:
+    """Return the scheduler's configured refresh interval, mirroring
+    ``services/scheduler/__main__.py``. We re-read the env var instead
+    of importing the scheduler module because the scheduler runs in a
+    sibling container and is not on the app's import path.
+    """
+    raw = os.environ.get("SCHEDULER_BQ_METADATA_REFRESH_INTERVAL")
+    if raw is None or raw == "":
+        return 4 * 60 * 60  # 4 h default
+    try:
+        value = int(raw)
+    except (TypeError, ValueError):
+        return 4 * 60 * 60
+    return value if value > 0 else 4 * 60 * 60
+
+
+def _fresh_threshold_seconds() -> int:
+    """A row is ``fresh`` when refreshed within this window.
+
+    Two refresh intervals: one refresh might fail (network blip, BQ
+    throttle); the analyst should keep seeing the last-known-good row
+    as ``fresh`` until two consecutive refreshes have passed without
+    success. Beyond that, the response surfaces ``stale`` so the
+    consumer knows the numbers might be outdated.
+    """
+    return 2 * _scheduler_interval_seconds()
+
+
+def compute_freshness(
+    cache_row: Optional[dict[str, Any]],
+    *,
+    now: Optional[datetime] = None,
+    fresh_threshold: Optional[int] = None,
+) -> str:
+    """Classify a cache row's freshness.
+
+    - ``never_fetched``: no row, or no successful refresh yet.
+    - ``fresh``: refreshed within the threshold.
+    - ``stale``: refreshed earlier than the threshold.
+    - ``error``: most recent attempt failed and there is no prior success
+      (success row is preserved across errors — analyst keeps using
+      last-known-good numbers).
+    """
+    if cache_row is None:
+        return "never_fetched"
+    refreshed_at = cache_row.get("refreshed_at")
+    error_at = cache_row.get("error_at")
+    if refreshed_at is None:
+        return "error" if error_at is not None else "never_fetched"
+    threshold = fresh_threshold if fresh_threshold is not None else _fresh_threshold_seconds()
+    cutoff = (now or datetime.now(timezone.utc)) - timedelta(seconds=threshold)
+    # DuckDB returns naive datetimes for TIMESTAMP columns; treat as UTC.
+    if refreshed_at.tzinfo is None:
+        refreshed_at = refreshed_at.replace(tzinfo=timezone.utc)
+    return "fresh" if refreshed_at >= cutoff else "stale"
+
+
+# ─── Single-row refresh primitive ──────────────────────────────────────────
+
+
+def refresh_one(conn: duckdb.DuckDBPyConnection, row: dict[str, Any]) -> dict[str, Any]:
+    """Fetch BQ metadata for one row and UPSERT the result.
+
+    Synchronous; safe to call from an anyio thread. Returns a small
+    outcome dict for the caller (counts, audit).
+
+    Failures are absorbed: the cache row's prior success is preserved
+    (``error_at`` + ``error_msg`` set, ``refreshed_at`` left alone).
+    """
+    from app.api._metadata_models import MetadataRequest
+    from connectors.bigquery import metadata as bq_metadata
+    from src.identifier_validation import validate_quoted_identifier
+
+    table_id = row["id"]
+    bucket = row.get("bucket") or ""
+    source_table = row.get("source_table") or table_id
+    repo = BqMetadataCacheRepository(conn)
+
+    if not (
+        validate_quoted_identifier(bucket, "bucket")
+        and validate_quoted_identifier(source_table, "source_table")
+    ):
+        repo.mark_error(table_id, "invalid bucket/source_table identifier")
+        return {"table_id": table_id, "status": "error", "error": "invalid identifier"}
+
+    req = MetadataRequest(
+        table_id=table_id, bucket=bucket, source_table=source_table,
+    )
+    try:
+        result = bq_metadata.fetch(req)
+    except Exception as e:
+        # bq_metadata.fetch is documented as never-raises, but defense in
+        # depth: catch any regression so one bad row doesn't kill the
+        # whole scheduler tick.
+        msg = f"{type(e).__name__}: {e}"
+        logger.warning("bq metadata refresh failed for %s: %s", table_id, msg)
+        repo.mark_error(table_id, msg)
+        return {"table_id": table_id, "status": "error", "error": msg}
+
+    if result is None:
+        repo.mark_error(table_id, "provider returned no data")
+        return {"table_id": table_id, "status": "no_data"}
+
+    repo.upsert_success(
+        table_id,
+        rows=result.rows,
+        size_bytes=result.size_bytes,
+        partition_by=result.partition_by,
+        clustered_by=result.clustered_by,
+    )
+    return {
+        "table_id": table_id,
+        "status": "ok",
+        "rows": result.rows,
+        "size_bytes": result.size_bytes,
+    }
+
+
+def _list_remote_bq_rows(conn: duckdb.DuckDBPyConnection) -> list[dict[str, Any]]:
+    rows = TableRegistryRepository(conn).list_all()
+    return [
+        r for r in rows
+        if r.get("query_mode") == "remote" and r.get("source_type") == "bigquery"
+    ]
+
+
+def _refresh_concurrency() -> int:
+    raw = os.environ.get("AGNES_BQ_METADATA_REFRESH_CONCURRENCY", "4")
+    try:
+        value = int(raw)
+    except (TypeError, ValueError):
+        return 4
+    return value if value > 0 else 4
+
+
+# ─── Endpoints ─────────────────────────────────────────────────────────────
+
+
+@router.post("/api/admin/run-bq-metadata-refresh")
+async def run_bq_metadata_refresh(
+    user: dict = Depends(require_admin),
+    conn: duckdb.DuckDBPyConnection = Depends(_get_db),
+):
+    """Refresh metadata for every remote BQ row in the registry.
+
+    Called by the scheduler at ``SCHEDULER_BQ_METADATA_REFRESH_INTERVAL``
+    (default 4 h). Idempotent — running twice in quick succession is
+    safe but wasteful; the scheduler enforces the interval.
+
+    Bounded concurrency (default 4, override via
+    ``AGNES_BQ_METADATA_REFRESH_CONCURRENCY``) so a deployment with
+    many remote tables doesn't fan out to dozens of parallel BQ jobs.
+    """
+    from src.db import get_system_db
+
+    rows = _list_remote_bq_rows(conn)
+    sem = asyncio.Semaphore(_refresh_concurrency())
+
+    async def _one(row: dict[str, Any]) -> dict[str, Any]:
+        async with sem:
+            # Each refresh_one call wants its own cursor; the singleton
+            # connection accessor returns a fresh cursor each call.
+            return await asyncio.to_thread(refresh_one, get_system_db(), row)
+
+    t0 = time.monotonic()
+    results = await asyncio.gather(
+        *(_one(r) for r in rows), return_exceptions=True,
+    )
+    duration_ms = int((time.monotonic() - t0) * 1000)
+
+    succeeded = sum(
+        1 for r in results if isinstance(r, dict) and r.get("status") == "ok"
+    )
+    no_data = sum(
+        1 for r in results if isinstance(r, dict) and r.get("status") == "no_data"
+    )
+    failed = sum(
+        1 for r in results
+        if isinstance(r, Exception)
+        or (isinstance(r, dict) and r.get("status") == "error")
+    )
+
+    logger.info(
+        "bq metadata refresh: total=%d ok=%d no_data=%d failed=%d duration_ms=%d",
+        len(rows), succeeded, no_data, failed, duration_ms,
+    )
+    return {
+        "total": len(rows),
+        "succeeded": succeeded,
+        "no_data": no_data,
+        "failed": failed,
+        "duration_ms": duration_ms,
+    }
+
+
+@router.post("/api/v2/metadata-cache/refresh")
+async def refresh_one_table(
+    table: str = Query(..., description="Registry table_id to refresh"),
+    user: dict = Depends(require_admin),
+    conn: duckdb.DuckDBPyConnection = Depends(_get_db),
+):
+    """Operator on-demand refresh of one row.
+
+    Useful right after editing the registry row (so the catalog reflects
+    new ``bucket`` / ``source_table`` immediately) or after an upstream
+    BQ schema change that the operator wants reflected before the next
+    scheduled tick.
+    """
+    from src.db import get_system_db
+
+    row = TableRegistryRepository(conn).get(table)
+    if not row:
+        raise HTTPException(status_code=404, detail=f"Unknown table_id: {table}")
+    if row.get("query_mode") != "remote" or row.get("source_type") != "bigquery":
+        raise HTTPException(
+            status_code=400,
+            detail="Manual metadata refresh is only meaningful for remote BigQuery tables",
+        )
+    return await asyncio.to_thread(refresh_one, get_system_db(), row)
+
+
+@router.get("/api/v2/metadata-cache/status")
+def metadata_cache_status(
+    user: dict = Depends(get_current_user),
+    conn: duckdb.DuckDBPyConnection = Depends(_get_db),
+):
+    """Per-table cache status. Non-admin — analyst tools rely on this to
+    decide whether to trust the catalog's ``rows`` / ``size_bytes`` or
+    treat the table as opaque until the next refresh.
+    """
+    cache_rows = BqMetadataCacheRepository(conn).list_all()
+    threshold = _fresh_threshold_seconds()
+    now = datetime.now(timezone.utc)
+    interval = _scheduler_interval_seconds()
+    tables = []
+    for r in cache_rows:
+        refreshed_at = r.get("refreshed_at")
+        error_at = r.get("error_at")
+        tables.append({
+            "table_id": r["table_id"],
+            "refreshed_at": refreshed_at.isoformat() if refreshed_at else None,
+            "rows": r.get("rows"),
+            "size_bytes": r.get("size_bytes"),
+            "partition_by": r.get("partition_by"),
+            "clustered_by": r.get("clustered_by") or [],
+            "error_at": error_at.isoformat() if error_at else None,
+            "error_msg": r.get("error_msg"),
+            "freshness": compute_freshness(r, now=now, fresh_threshold=threshold),
+        })
+    return {
+        "scheduler_interval_seconds": interval,
+        "fresh_threshold_seconds": threshold,
+        "server_time": now.isoformat(),
+        "tables": tables,
+    }
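The full-refresh endpoint in the file above bounds its fan-out with a semaphore, runs each synchronous refresh in a worker thread, and tallies outcomes from `asyncio.gather(..., return_exceptions=True)`. The sketch below isolates that pattern so it can be run without DuckDB or BigQuery. `run_bounded`, `tally`, and `fake_refresh` are hypothetical names; `fake_refresh` is only a stand-in for `refresh_one` and simulates its three outcomes.

```python
import asyncio
from typing import Any, Callable


async def run_bounded(
    rows: list[dict[str, Any]],
    worker: Callable[[dict[str, Any]], dict[str, Any]],
    limit: int,
) -> list[Any]:
    """Run a blocking ``worker`` over ``rows`` with at most ``limit`` in flight."""
    sem = asyncio.Semaphore(limit)

    async def _one(row: dict[str, Any]) -> dict[str, Any]:
        async with sem:
            # The worker is synchronous, so it runs in a thread and the
            # event loop stays free while it blocks.
            return await asyncio.to_thread(worker, row)

    # return_exceptions=True turns a raised exception into a result value,
    # so one failing row cannot cancel the rest of the batch.
    return await asyncio.gather(*(_one(r) for r in rows), return_exceptions=True)


def tally(results: list[Any]) -> dict[str, int]:
    """Count outcomes the same way the endpoint's summary does."""
    ok = sum(1 for r in results if isinstance(r, dict) and r.get("status") == "ok")
    no_data = sum(
        1 for r in results if isinstance(r, dict) and r.get("status") == "no_data"
    )
    failed = sum(
        1 for r in results
        if isinstance(r, Exception)
        or (isinstance(r, dict) and r.get("status") == "error")
    )
    return {"total": len(results), "succeeded": ok, "no_data": no_data, "failed": failed}


def fake_refresh(row: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical stand-in for ``refresh_one``; no BigQuery involved."""
    if row["id"] == "boom":
        raise RuntimeError("simulated provider failure")
    if row["id"] == "empty":
        return {"table_id": row["id"], "status": "no_data"}
    return {"table_id": row["id"], "status": "ok"}
```

Calling `asyncio.run(run_bounded(rows, fake_refresh, limit=2))` and passing the result to `tally` gives the same four counters the endpoint returns, minus `duration_ms`.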
@@ -159,9 +159,16 @@ async def _warm_one(
 
 
 def _warm_metadata_sync(row: dict) -> None:
-    """Trigger metadata cache populate via the catalog's normal path."""
-    from app.api.v2_catalog import _size_hint_for_row
-    _size_hint_for_row(row)
+    """Refresh the persistent ``bq_metadata_cache`` row.
+
+    Pre-0.50 this called ``v2_catalog._size_hint_for_row`` to populate
+    an in-memory TTL cache. The in-memory cache is gone — metadata now
+    lives in DuckDB, owned by ``app/api/bq_metadata_refresh.refresh_one``
+    (the same primitive the scheduler-driven refresh uses).
+    """
+    from app.api.bq_metadata_refresh import refresh_one
+    from src.db import get_system_db
+    refresh_one(get_system_db(), row)
 
 
 def _warm_schema_sync(row: dict) -> None:
@@ -1,18 +1,32 @@
-"""GET /api/v2/catalog — list tables visible to caller (spec §3.1)."""
+"""GET /api/v2/catalog — list tables visible to caller (spec §3.1).
+
+History note
+------------
+0.47.0 enriched remote rows with BigQuery metadata (rows / size_bytes /
+partition_by / clustered_by) by fetching from BQ *inside the request*
+through a per-table TTL cache. On a cold cache that fanned out to O(N)
+sequential BQ jobs API roundtrips and reliably exceeded the CLI's 30 s
+``httpx.ReadTimeout`` against partitioned tables. This module now reads
+those fields exclusively from the persistent ``bq_metadata_cache`` table
+(populated by ``app/api/bq_metadata_refresh.py`` on a scheduler tick).
+The request path never calls BQ.
+"""
+
 from __future__ import annotations
 
 from datetime import datetime, timezone
 from pathlib import Path
-from fastapi import APIRouter, Depends
-import duckdb
+from typing import Any
 
-from app.auth.dependencies import get_current_user, _get_db
+import duckdb
+from fastapi import APIRouter, Depends
+
+from app.api.v2_cache import TTLCache
+from app.auth.dependencies import _get_db, get_current_user
 from app.utils import get_data_dir as _get_data_dir
 from src.rbac import can_access_table
+from src.repositories.bq_metadata_cache import BqMetadataCacheRepository
 from src.repositories.table_registry import TableRegistryRepository
-from app.api.v2_cache import TTLCache
-from app.api._metadata_models import MetadataRequest, TableMetadata
-from src.identifier_validation import validate_quoted_identifier
 
 router = APIRouter(prefix="/api/v2", tags=["v2"])
 
@@ -27,51 +41,6 @@ router = APIRouter(prefix="/api/v2", tags=["v2"])
 _table_rows_cache = TTLCache(maxsize=1, ttl_seconds=300)
 _TABLE_ROWS_KEY = "all"
 
-# Per-table cached TableMetadata. 15-min TTL — long enough to amortise
-# across an analyst session, short enough that a freshly-registered
-# remote table shows real numbers within a coffee break (the cache-bust
-# path in `invalidate_for_table` accelerates this for the common admin-
-# verifies-registration flow).
-_metadata_cache = TTLCache(maxsize=512, ttl_seconds=900)
-
-
-def _metadata_provider_for(source_type: str):
-    """Lazy-import dispatch for source-specific metadata providers.
-
-    Lazy because connector modules are heavy (BQ extension, google-cloud
-    client, etc.) and a Keboola-only deployment shouldn't pay the BQ
-    import cost. Returns ``None`` for unknown source types — the caller
-    treats that as "no metadata enrichment available" and falls through.
-    """
-    if source_type == "bigquery":
-        from connectors.bigquery import metadata as m
-        return m.fetch
-    if source_type == "keboola":
-        from connectors.keboola import metadata as m
-        return m.fetch
-    return None
-
-
-def _build_metadata_request(row: dict) -> MetadataRequest | None:
-    """Construct a validated MetadataRequest from a registry row.
-
-    Pre-validates the identifiers via `validate_quoted_identifier` before
-    constructing the request — providers can then interpolate
-    `req.bucket` / `req.source_table` into SQL/URL paths without
-    re-checking. Returns ``None`` when validation fails; provider is not
-    dispatched for that row.
-    """
-    bucket = row.get("bucket") or ""
-    source_table = row.get("source_table") or row.get("id") or ""
-    if not bucket or not source_table:
-        return None
-    if not (validate_quoted_identifier(bucket, "bucket")
-            and validate_quoted_identifier(source_table, "source_table")):
-        return None
-    return MetadataRequest(
-        table_id=row["id"], bucket=bucket, source_table=source_table,
-    )
-
 
 def _flavor_for(source_type: str) -> str:
     return "bigquery" if source_type == "bigquery" else "duckdb"
@@ -112,56 +81,11 @@ def _bucket_size(byte_count: int) -> str:
     return "very_large"
 
 
-def _size_hint_for_row(row: dict) -> dict:
-    """Resolve the per-row metadata bundle the catalog response surfaces.
-
-    Renamed from `_materialized_size_hint` (which always also handled
-    `local` rows; the old name was misleading). Returns a dict with up
-    to four keys: `rough_size_hint`, `rows`, `size_bytes`, `partition_by`,
-    `clustered_by`. Missing keys are reported as `null` in the response.
-
-    Branches:
-    - `local` / `materialized` → existing on-disk parquet stat (cheap).
-    - `remote` → dispatch to the per-source-type provider; cache the
-      TableMetadata for 15 min.
-    """
-    table_id = row["id"]
-    source_type = row.get("source_type") or ""
-    query_mode = row.get("query_mode") or "local"
-
-    if query_mode in ("local", "materialized"):
-        return {"rough_size_hint": _materialized_parquet_size_bucket(
-            table_id, source_type, query_mode,
-        )}
-
-    if query_mode != "remote":
-        return {"rough_size_hint": None}
-
-    # Cache lookup (per-row TableMetadata).
-    cached = _metadata_cache.get(table_id)
-    if cached is None:
-        cached = _resolve_remote_metadata(row)
-        if cached is not None:
-            _metadata_cache.set(table_id, cached)
-
-    if cached is None:
-        return {"rough_size_hint": None}
-
-    return {
-        "rough_size_hint": _bucket_size(cached.size_bytes) if cached.size_bytes is not None else None,
-        "rows": cached.rows,
-        "size_bytes": cached.size_bytes,
-        "partition_by": cached.partition_by,
-        "clustered_by": cached.clustered_by,
-    }
-
-
 def _materialized_parquet_size_bucket(
     table_id: str, source_type: str, query_mode: str,
 ) -> str | None:
     """Size hint for rows whose data is on the server filesystem
-    (the old `_materialized_size_hint` body). Renamed for clarity now
-    that the new dispatcher is the entry point.
+    (``local`` or ``materialized``). Cheap ``Path.stat()``; never blocks.
 
     Layout matches the v2 extract.duckdb contract:
         ${DATA_DIR}/extracts/<source_type>/data/<table_id>.parquet
@@ -182,21 +106,64 @@ def _materialized_parquet_size_bucket(
     return None
 
 
-def _resolve_remote_metadata(row: dict) -> "TableMetadata | None":
-    """Provider dispatch for a remote row. Returns None on any failure."""
+def _hint_for_row(
+    row: dict[str, Any],
+    bq_cache_index: dict[str, dict[str, Any]],
+) -> dict[str, Any]:
+    """Resolve the per-row metadata bundle the catalog response surfaces.
+
+    Branches:
+    - ``local`` / ``materialized`` → on-disk parquet ``stat()`` (cheap).
+    - ``remote`` (BigQuery) → pre-computed row from ``bq_metadata_cache``,
+      populated by the scheduler-driven refresh. Never touches BQ here.
+
+    Always returns ``metadata_freshness`` (``fresh`` / ``stale`` /
+    ``never_fetched`` / ``error`` / ``not_applicable``) so AI consumers can
+    decide whether to trust ``rows`` / ``size_bytes`` or treat them as
+    advisory.
+    """
+    table_id = row["id"]
     source_type = row.get("source_type") or ""
-    provider = _metadata_provider_for(source_type)
-    if provider is None:
-        return None
-    req = _build_metadata_request(row)
-    if req is None:
-        return None
-    try:
-        return provider(req)
-    except Exception:
-        # Defense in depth — providers are documented as never-raises,
-        # but a regression would otherwise 500 the whole catalog.
-        return None
+    query_mode = row.get("query_mode") or "local"
+
+    if query_mode in ("local", "materialized"):
+        return {
+            "rough_size_hint": _materialized_parquet_size_bucket(
+                table_id, source_type, query_mode,
+            ),
+            "metadata_freshness": "not_applicable",
+        }
+
+    if query_mode != "remote":
+        return {
+            "rough_size_hint": None,
+            "metadata_freshness": "not_applicable",
+        }
+
+    # Remote: read from the persistent cache; never call BQ here.
+    from app.api.bq_metadata_refresh import compute_freshness
+    cache_row = bq_cache_index.get(table_id)
+    freshness = compute_freshness(cache_row)
+
+    if cache_row is None:
+        return {
+            "rough_size_hint": None,
+            "rows": None,
+            "size_bytes": None,
+            "partition_by": None,
+            "clustered_by": [],
+            "metadata_freshness": freshness,
+        }
+
+    size_bytes = cache_row.get("size_bytes")
+    return {
+        "rough_size_hint": _bucket_size(size_bytes) if size_bytes is not None else None,
+        "rows": cache_row.get("rows"),
+        "size_bytes": size_bytes,
+        "partition_by": cache_row.get("partition_by"),
+        "clustered_by": cache_row.get("clustered_by") or [],
+        "metadata_freshness": freshness,
+    }
 
 
 def invalidate_for_table(table_id: str) -> None:
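`_hint_for_row` above receives a `bq_cache_index` keyed by table id, but the hunks shown here do not include the code that builds it. One plausible construction, sketched under that assumption, is a single bulk read of the cache rows turned into a dict, so each catalog row costs one dictionary lookup instead of one query. `build_cache_index` and `cached_fields` are hypothetical names; the `table_id` key matches what the status endpoint reads from each cache row.

```python
from typing import Any


def build_cache_index(
    cache_rows: list[dict[str, Any]],
) -> dict[str, dict[str, Any]]:
    """Key cache rows by ``table_id`` (assumed shape of ``bq_cache_index``)."""
    return {r["table_id"]: r for r in cache_rows}


def cached_fields(
    table_id: str, index: dict[str, dict[str, Any]]
) -> dict[str, Any]:
    """Return the enrichment fields for one table, with null defaults on a miss."""
    cache_row = index.get(table_id)
    if cache_row is None:
        # A miss is not an error: the scheduler has not reached this row yet.
        return {"rows": None, "size_bytes": None, "partition_by": None, "clustered_by": []}
    return {
        "rows": cache_row.get("rows"),
        "size_bytes": cache_row.get("size_bytes"),
        "partition_by": cache_row.get("partition_by"),
        "clustered_by": cache_row.get("clustered_by") or [],
    }
```

Building the index once per request keeps the catalog path at one cache-table read regardless of how many remote rows the registry holds.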
@@ -205,14 +172,14 @@ def invalidate_for_table(table_id: str) -> None:
     by the catalog module so admin.py doesn't need to know which caches
     exist.
 
-    Imports v2_schema and v2_sample lazily — keeps catalog tests from
-    pulling in BQ-extension imports they don't need.
+    The persistent ``bq_metadata_cache`` row is NOT invalidated here —
+    the scheduler-driven refresh owns that lifecycle. Admins who need
+    an immediate refresh after a registry edit should hit
+    ``POST /api/v2/metadata-cache/refresh?table=<id>``.
     """
-    import asyncio
-    from app.api import v2_schema, v2_sample
+    from app.api import v2_sample, v2_schema
 
     _table_rows_cache.clear()
-    _metadata_cache.invalidate(table_id)
     v2_schema._schema_cache.invalidate(table_id)
     # Sample cache key is `f"{table_id}|{n}"`; clearing the whole sample
     # cache is heavier than precise invalidation, but registry-change
@@ -220,36 +187,6 @@ def invalidate_for_table(table_id: str) -> None:
|
||||||
# adding a prefix-invalidation primitive to TTLCache.
|
# adding a prefix-invalidation primitive to TTLCache.
|
||||||
v2_sample._sample_cache.clear()
|
v2_sample._sample_cache.clear()
|
||||||
|
|
||||||
# Schedule a single-row re-warm so admins editing a registry row
|
|
||||||
# see fresh data within a couple of seconds rather than waiting for
|
|
||||||
# the next analyst to trigger a miss. Fire-and-forget; failures
|
|
||||||
# log + skip inside the coroutine.
|
|
||||||
try:
|
|
||||||
loop = asyncio.get_running_loop()
|
|
||||||
except RuntimeError:
|
|
||||||
loop = None
|
|
||||||
if loop is not None:
|
|
||||||
# Running inside an async context (production FastAPI path).
|
|
||||||
asyncio.create_task(_rewarm_one_row(table_id))
|
|
||||||
# No running event loop (e.g. called from a sync test or a sync
|
|
||||||
# handler thread). Skip re-warm — the next live request will
|
|
||||||
# populate via miss.
|
|
||||||
|
|
||||||
|
|
||||||
async def _rewarm_one_row(table_id: str) -> None:
|
|
||||||
"""Background single-row re-warm. Imports cache_warmup lazily to
|
|
||||||
avoid a circular import at module load (cache_warmup.py is created
|
|
||||||
in Task 10; until then, this function logs a warning and returns)."""
|
|
||||||
try:
|
|
||||||
from app.api.cache_warmup import warm_one_table
|
|
||||||
await warm_one_table(table_id)
|
|
||||||
except Exception:
|
|
||||||
import logging
|
|
||||||
logging.getLogger(__name__).warning(
|
|
||||||
"single-row re-warm failed for %s — next live request will populate",
|
|
||||||
table_id,
|
|
||||||
)
|
|
||||||
|
|
||||||
|
|
||||||
def build_catalog(conn: duckdb.DuckDBPyConnection, user: dict) -> dict:
|
def build_catalog(conn: duckdb.DuckDBPyConnection, user: dict) -> dict:
|
||||||
rows = _table_rows_cache.get(_TABLE_ROWS_KEY)
|
rows = _table_rows_cache.get(_TABLE_ROWS_KEY)
|
||||||
|
|
@@ -258,6 +195,12 @@ def build_catalog(conn: duckdb.DuckDBPyConnection, user: dict) -> dict:
|
||||||
rows = repo.list_all()
|
rows = repo.list_all()
|
||||||
_table_rows_cache.set(_TABLE_ROWS_KEY, rows)
|
_table_rows_cache.set(_TABLE_ROWS_KEY, rows)
|
||||||
|
|
||||||
|
# One DB read for all remote-row metadata. Indexed by table_id so the
|
||||||
|
# per-row loop below stays O(N).
|
||||||
|
bq_cache_index: dict[str, dict[str, Any]] = {
|
||||||
|
r["table_id"]: r for r in BqMetadataCacheRepository(conn).list_all()
|
||||||
|
}
|
||||||
|
|
||||||
# RBAC is enforced fresh per request. Revoking a user's access to a
|
# RBAC is enforced fresh per request. Revoking a user's access to a
|
||||||
# table takes effect on their next call to this endpoint, not after the
|
# table takes effect on their next call to this endpoint, not after the
|
||||||
# cache TTL expires.
|
# cache TTL expires.
|
||||||
|
|
@@ -265,7 +208,7 @@ def build_catalog(conn: duckdb.DuckDBPyConnection, user: dict) -> dict:
|
||||||
for r in rows:
|
for r in rows:
|
||||||
if not can_access_table(user, r["id"], conn):
|
if not can_access_table(user, r["id"], conn):
|
||||||
continue
|
continue
|
||||||
hint = _size_hint_for_row(r)
|
hint = _hint_for_row(r, bq_cache_index)
|
||||||
visible.append({
|
visible.append({
|
||||||
"id": r["id"],
|
"id": r["id"],
|
||||||
"name": r.get("name") or r["id"],
|
"name": r.get("name") or r["id"],
|
||||||
|
|
@@ -279,7 +222,8 @@ def build_catalog(conn: duckdb.DuckDBPyConnection, user: dict) -> dict:
|
||||||
"rows": hint.get("rows"),
|
"rows": hint.get("rows"),
|
||||||
"size_bytes": hint.get("size_bytes"),
|
"size_bytes": hint.get("size_bytes"),
|
||||||
"partition_by": hint.get("partition_by"),
|
"partition_by": hint.get("partition_by"),
|
||||||
"clustered_by": hint.get("clustered_by"),
|
"clustered_by": hint.get("clustered_by") or [],
|
||||||
|
"metadata_freshness": hint.get("metadata_freshness"),
|
||||||
})
|
})
|
||||||
|
|
||||||
return {
|
return {
|
||||||
|
|
@@ -294,12 +238,6 @@ def catalog(
|
||||||
conn: duckdb.DuckDBPyConnection = Depends(_get_db),
|
conn: duckdb.DuckDBPyConnection = Depends(_get_db),
|
||||||
):
|
):
|
||||||
# Plain ``def`` so FastAPI auto-offloads to the anyio thread pool —
|
# Plain ``def`` so FastAPI auto-offloads to the anyio thread pool —
|
||||||
# build_catalog now calls `_size_hint_for_row` for every visible row,
|
# the request path is pure local I/O (DuckDB reads + filesystem
|
||||||
# which does sync `Path.stat()` / `Path.exists()` on the data volume
|
# stat()) and uses a sync DuckDB cursor.
|
||||||
# (local/materialized) or provider dispatch (remote). On local FS
|
|
||||||
# that's microseconds, but on a network-mounted DATA_DIR (NFS / CIFS /
|
|
||||||
# GCS-FUSE) those calls can block. Plain ``def`` means each request
|
|
||||||
# runs on its own thread; the event loop stays free for non-catalog
|
|
||||||
# traffic. Mirrors the Tier 1 conversion of /api/query, /api/v2/scan,
|
|
||||||
# /api/v2/sample, /api/v2/schema — Devin Review on PR #188.
|
|
||||||
return build_catalog(conn, user)
|
return build_catalog(conn, user)
|
||||||
|
|
|
||||||
|
|
@@ -117,6 +117,7 @@ from app.api.welcome import router as welcome_router
|
||||||
from app.api.claude_md import router as claude_md_router
|
from app.api.claude_md import router as claude_md_router
|
||||||
from app.api.news import router as news_router
|
from app.api.news import router as news_router
|
||||||
from app.api.cache_warmup import router as cache_warmup_router
|
from app.api.cache_warmup import router as cache_warmup_router
|
||||||
|
from app.api.bq_metadata_refresh import router as bq_metadata_refresh_router
|
||||||
from app.marketplace_server.router import router as marketplace_server_router
|
from app.marketplace_server.router import router as marketplace_server_router
|
||||||
from app.marketplace_server.git_router import make_git_wsgi_app
|
from app.marketplace_server.git_router import make_git_wsgi_app
|
||||||
from app.web.router import router as web_router
|
from app.web.router import router as web_router
|
||||||
|
|
@@ -598,6 +599,7 @@ def create_app() -> FastAPI:
|
||||||
app.include_router(claude_md_router)
|
app.include_router(claude_md_router)
|
||||||
app.include_router(news_router)
|
app.include_router(news_router)
|
||||||
app.include_router(cache_warmup_router)
|
app.include_router(cache_warmup_router)
|
||||||
|
app.include_router(bq_metadata_refresh_router)
|
||||||
app.include_router(marketplace_server_router)
|
app.include_router(marketplace_server_router)
|
||||||
|
|
||||||
# Git smart-HTTP endpoint for Claude Code: /marketplace.git/*
|
# Git smart-HTTP endpoint for Claude Code: /marketplace.git/*
|
||||||
|
|
|
||||||
|
|
@@ -1,6 +1,6 @@
|
||||||
[project]
|
[project]
|
||||||
name = "agnes-the-ai-analyst"
|
name = "agnes-the-ai-analyst"
|
||||||
version = "0.49.1"
|
version = "0.50.0"
|
||||||
description = "Agnes — AI Data Analyst platform for AI analytical systems"
|
description = "Agnes — AI Data Analyst platform for AI analytical systems"
|
||||||
requires-python = ">=3.11,<3.14"
|
requires-python = ">=3.11,<3.14"
|
||||||
license = "MIT"
|
license = "MIT"
|
||||||
|
|
|
||||||
|
|
@@ -90,6 +90,14 @@ _DEFAULTS = {
|
||||||
"SCHEDULER_VERIFICATION_DETECTOR_INTERVAL": 15 * 60,
|
"SCHEDULER_VERIFICATION_DETECTOR_INTERVAL": 15 * 60,
|
||||||
"SCHEDULER_USAGE_PROCESSOR_INTERVAL": 10 * 60,
|
"SCHEDULER_USAGE_PROCESSOR_INTERVAL": 10 * 60,
|
||||||
"SCHEDULER_CORPORATE_MEMORY_INTERVAL": 17 * 60,
|
"SCHEDULER_CORPORATE_MEMORY_INTERVAL": 17 * 60,
|
||||||
|
# BigQuery metadata refresh: walks remote registry rows and updates the
|
||||||
|
# persistent ``bq_metadata_cache``. Default 4 h — long enough that the
|
||||||
|
# cumulative BQ jobs API cost stays negligible on a typical 10–50-table
|
||||||
|
# registry, short enough that operator-edited tables show real numbers
|
||||||
|
# within an analyst's working day. Hot reads of ``/api/v2/catalog`` go
|
||||||
|
# to DuckDB, never to BQ, so this can be tuned freely without touching
|
||||||
|
# request-path latency.
|
||||||
|
"SCHEDULER_BQ_METADATA_REFRESH_INTERVAL": 4 * 60 * 60,
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
|
|
@@ -149,8 +157,9 @@ def build_jobs() -> list[tuple[str, str, str, str, int]]:
|
||||||
verify = _read_positive_int("SCHEDULER_VERIFICATION_DETECTOR_INTERVAL")
|
verify = _read_positive_int("SCHEDULER_VERIFICATION_DETECTOR_INTERVAL")
|
||||||
usage = _read_positive_int("SCHEDULER_USAGE_PROCESSOR_INTERVAL")
|
usage = _read_positive_int("SCHEDULER_USAGE_PROCESSOR_INTERVAL")
|
||||||
corpmem = _read_positive_int("SCHEDULER_CORPORATE_MEMORY_INTERVAL")
|
corpmem = _read_positive_int("SCHEDULER_CORPORATE_MEMORY_INTERVAL")
|
||||||
|
bqmeta = _read_positive_int("SCHEDULER_BQ_METADATA_REFRESH_INTERVAL")
|
||||||
tick = _read_positive_int("SCHEDULER_TICK_SECONDS")
|
tick = _read_positive_int("SCHEDULER_TICK_SECONDS")
|
||||||
smallest = min(refresh, health, scripts, sess, verify, usage, corpmem)
|
smallest = min(refresh, health, scripts, sess, verify, usage, corpmem, bqmeta)
|
||||||
if tick > smallest:
|
if tick > smallest:
|
||||||
raise ValueError(
|
raise ValueError(
|
||||||
f"SCHEDULER_TICK_SECONDS={tick} must be <= the smallest job "
|
f"SCHEDULER_TICK_SECONDS={tick} must be <= the smallest job "
|
||||||
|
|
@@ -193,6 +202,14 @@ def build_jobs() -> list[tuple[str, str, str, str, int]]:
|
||||||
# to review_error so admin can retry. Cheap (one indexed
|
# to review_error so admin can retry. Cheap (one indexed
|
||||||
# SELECT + N small UPDATEs); short timeout sufficient.
|
# SELECT + N small UPDATEs); short timeout sufficient.
|
||||||
("store-reap-stuck-reviews", "every 15m", "/api/admin/run-reap-stuck-reviews", "POST", 60),
|
("store-reap-stuck-reviews", "every 15m", "/api/admin/run-reap-stuck-reviews", "POST", 60),
|
||||||
|
# BigQuery metadata refresh — keeps ``bq_metadata_cache`` warm so
|
||||||
|
# ``GET /api/v2/catalog`` never has to call BQ at request time.
|
||||||
|
# 30-min timeout is generous; on a 10-table dev registry the
|
||||||
|
# observed full refresh ran in ~7 min when two view-backed rows
|
||||||
|
# took 7 min each. Bounded concurrency
|
||||||
|
# (``AGNES_BQ_METADATA_REFRESH_CONCURRENCY``, default 4) caps the
|
||||||
|
# tail.
|
||||||
|
("bq-metadata-refresh", _seconds_to_schedule(bqmeta), "/api/admin/run-bq-metadata-refresh", "POST", 1800),
|
||||||
]
|
]
|
||||||
|
|
||||||
_running = True
|
_running = True
|
||||||
|
|
|
||||||
58
src/db.py
|
|
@@ -40,7 +40,7 @@ def _maybe_instrument(con, db_tag: str):
|
||||||
|
|
||||||
_SAFE_IDENTIFIER = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]{0,63}$")
|
_SAFE_IDENTIFIER = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]{0,63}$")
|
||||||
|
|
||||||
SCHEMA_VERSION = 39
|
SCHEMA_VERSION = 40
|
||||||
|
|
||||||
_SYSTEM_SCHEMA = """
|
_SYSTEM_SCHEMA = """
|
||||||
CREATE TABLE IF NOT EXISTS schema_version (
|
CREATE TABLE IF NOT EXISTS schema_version (
|
||||||
|
|
@@ -640,6 +640,38 @@ CREATE INDEX IF NOT EXISTS idx_store_submissions_entity ON store_submissions(ent
|
||||||
-- (reproduced with N=2 against 3 rows during /admin/store/submissions
|
-- (reproduced with N=2 against 3 rows during /admin/store/submissions
|
||||||
-- paging). Submissions table is admin-only and bounded by upload
|
-- paging). Submissions table is admin-only and bounded by upload
|
||||||
-- volume, so the index buys little; dropping it sidesteps the bug.
|
-- volume, so the index buys little; dropping it sidesteps the bug.
|
||||||
|
|
||||||
|
-- v40: persistent metadata cache for remote sources (BigQuery initially).
|
||||||
|
-- Replaces the per-request, in-memory `_metadata_cache` in v2_catalog.py
|
||||||
|
-- that turned every cold-cache /api/v2/catalog into a sequence of N×3 BQ
|
||||||
|
-- jobs API calls (one TABLE_STORAGE + COLUMNS pair per remote row) — long
|
||||||
|
-- enough on view-backed or partitioned tables (>>30 s) to blow the CLI's
|
||||||
|
-- httpx 30 s read timeout. Now refresh is driven exclusively by the
|
||||||
|
-- scheduler (default every 4 h, `SCHEDULER_BQ_METADATA_REFRESH_INTERVAL`),
|
||||||
|
-- and the catalog endpoint just reads this table — no BQ at request time.
|
||||||
|
--
|
||||||
|
-- Columns:
|
||||||
|
-- table_id — registry.id; PK and join key with table_registry.
|
||||||
|
-- rows / size_bytes / partition_by / clustered_by — last successful
|
||||||
|
-- provider result. NULL when the table has never
|
||||||
|
-- been fetched, or fetch failed before any success.
|
||||||
|
-- clustered_by stored as JSON array of column names.
|
||||||
|
-- refreshed_at — wall-clock of the last successful fetch. Used by
|
||||||
|
-- the catalog response to compute metadata_freshness
|
||||||
|
-- (`fresh` if < 2× scheduler interval old, `stale`
|
||||||
|
-- otherwise; `never_fetched` if NULL and error_at is NULL,
-- `error` if NULL and error_at is set).
|
||||||
|
-- error_at / error_msg — last failure timestamp + redacted message.
|
||||||
|
-- NULL after the next successful refresh.
|
||||||
|
CREATE TABLE IF NOT EXISTS bq_metadata_cache (
|
||||||
|
table_id VARCHAR PRIMARY KEY,
|
||||||
|
rows BIGINT,
|
||||||
|
size_bytes BIGINT,
|
||||||
|
partition_by VARCHAR,
|
||||||
|
clustered_by JSON,
|
||||||
|
refreshed_at TIMESTAMP,
|
||||||
|
error_at TIMESTAMP,
|
||||||
|
error_msg VARCHAR
|
||||||
|
);
|
||||||
"""
|
"""
|
||||||
|
|
||||||
|
|
||||||
|
|
@@ -2489,6 +2521,27 @@ _V38_TO_V39_MIGRATIONS = [
|
||||||
]
|
]
|
||||||
|
|
||||||
|
|
||||||
|
# v40: bq_metadata_cache table. Existing DBs get an empty table; the next
|
||||||
|
# scheduler tick (or app startup warmup) populates it. The catalog endpoint
|
||||||
|
# treats absence-of-row as `metadata_freshness: never_fetched` and returns
|
||||||
|
# NULL for the optional fields rather than failing — analyst tooling already
|
||||||
|
# tolerates NULL rows / size_bytes from the pre-0.47 contract.
|
||||||
|
_V39_TO_V40_MIGRATIONS = [
|
||||||
|
"""
|
||||||
|
CREATE TABLE IF NOT EXISTS bq_metadata_cache (
|
||||||
|
table_id VARCHAR PRIMARY KEY,
|
||||||
|
rows BIGINT,
|
||||||
|
size_bytes BIGINT,
|
||||||
|
partition_by VARCHAR,
|
||||||
|
clustered_by JSON,
|
||||||
|
refreshed_at TIMESTAMP,
|
||||||
|
error_at TIMESTAMP,
|
||||||
|
error_msg VARCHAR
|
||||||
|
)
|
||||||
|
""",
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
_V33_TO_V34_MIGRATIONS = [
|
_V33_TO_V34_MIGRATIONS = [
|
||||||
# DuckDB blocks DROP COLUMN while indexes reference the table
|
# DuckDB blocks DROP COLUMN while indexes reference the table
|
||||||
# ("Dependency Error: Cannot alter entry … because there are entries
|
# ("Dependency Error: Cannot alter entry … because there are entries
|
||||||
|
|
@@ -2882,6 +2935,9 @@ def _ensure_schema(conn: duckdb.DuckDBPyConnection) -> None:
|
||||||
if current < 39:
|
if current < 39:
|
||||||
for sql in _V38_TO_V39_MIGRATIONS:
|
for sql in _V38_TO_V39_MIGRATIONS:
|
||||||
conn.execute(sql)
|
conn.execute(sql)
|
||||||
|
if current < 40:
|
||||||
|
for sql in _V39_TO_V40_MIGRATIONS:
|
||||||
|
conn.execute(sql)
|
||||||
conn.execute(
|
conn.execute(
|
||||||
"UPDATE schema_version SET version = ?, applied_at = current_timestamp",
|
"UPDATE schema_version SET version = ?, applied_at = current_timestamp",
|
||||||
[SCHEMA_VERSION],
|
[SCHEMA_VERSION],
|
||||||
|
|
|
||||||
120
src/repositories/bq_metadata_cache.py
Normal file
|
|
@@ -0,0 +1,120 @@
|
||||||
|
"""Repository for the persistent BigQuery metadata cache.
|
||||||
|
|
||||||
|
Backs the v40 ``bq_metadata_cache`` table. Reads are called from the
|
||||||
|
hot path (``/api/v2/catalog``); writes only from the scheduler-driven
|
||||||
|
refresh job in ``app/api/bq_metadata_refresh.py`` and from operator-
|
||||||
|
triggered single-row refreshes via ``/api/v2/metadata-cache/refresh``.
|
||||||
|
|
||||||
|
clustered_by is stored as a JSON array of column-name strings and
|
||||||
|
returned to callers as a list (decoded here, never raw JSON).
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import json
|
||||||
|
from datetime import datetime, timezone
|
||||||
|
from typing import Any, Optional
|
||||||
|
|
||||||
|
import duckdb
|
||||||
|
|
||||||
|
|
||||||
|
def _decode_clustered_by(stored: Any) -> Optional[list[str]]:
|
||||||
|
if stored is None:
|
||||||
|
return None
|
||||||
|
if isinstance(stored, list):
|
||||||
|
return [str(x) for x in stored]
|
||||||
|
if isinstance(stored, str):
|
||||||
|
try:
|
||||||
|
parsed = json.loads(stored)
|
||||||
|
except json.JSONDecodeError:
|
||||||
|
return None
|
||||||
|
return [str(x) for x in parsed] if isinstance(parsed, list) else None
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
def _row_to_dict(conn: duckdb.DuckDBPyConnection, row: tuple) -> dict[str, Any]:
|
||||||
|
columns = [desc[0] for desc in conn.description]
|
||||||
|
out: dict[str, Any] = dict(zip(columns, row))
|
||||||
|
out["clustered_by"] = _decode_clustered_by(out.get("clustered_by"))
|
||||||
|
return out
|
||||||
|
|
||||||
|
|
||||||
|
class BqMetadataCacheRepository:
|
||||||
|
def __init__(self, conn: duckdb.DuckDBPyConnection):
|
||||||
|
self.conn = conn
|
||||||
|
|
||||||
|
def get(self, table_id: str) -> Optional[dict[str, Any]]:
|
||||||
|
result = self.conn.execute(
|
||||||
|
"SELECT * FROM bq_metadata_cache WHERE table_id = ?",
|
||||||
|
[table_id],
|
||||||
|
).fetchone()
|
||||||
|
if not result:
|
||||||
|
return None
|
||||||
|
return _row_to_dict(self.conn, result)
|
||||||
|
|
||||||
|
def list_all(self) -> list[dict[str, Any]]:
|
||||||
|
results = self.conn.execute(
|
||||||
|
"SELECT * FROM bq_metadata_cache ORDER BY table_id"
|
||||||
|
).fetchall()
|
||||||
|
if not results:
|
||||||
|
return []
|
||||||
|
columns = [desc[0] for desc in self.conn.description]
|
||||||
|
out: list[dict[str, Any]] = []
|
||||||
|
for r in results:
|
||||||
|
row = dict(zip(columns, r))
|
||||||
|
row["clustered_by"] = _decode_clustered_by(row.get("clustered_by"))
|
||||||
|
out.append(row)
|
||||||
|
return out
|
||||||
|
|
||||||
|
def upsert_success(
|
||||||
|
self,
|
||||||
|
table_id: str,
|
||||||
|
*,
|
||||||
|
rows: Optional[int],
|
||||||
|
size_bytes: Optional[int],
|
||||||
|
partition_by: Optional[str],
|
||||||
|
clustered_by: Optional[list[str]],
|
||||||
|
) -> None:
|
||||||
|
"""Record a successful refresh. Clears any prior error_at/error_msg."""
|
||||||
|
now = datetime.now(timezone.utc)
|
||||||
|
clustered_json = (
|
||||||
|
json.dumps(list(clustered_by)) if clustered_by is not None else None
|
||||||
|
)
|
||||||
|
self.conn.execute(
|
||||||
|
"""INSERT INTO bq_metadata_cache
|
||||||
|
(table_id, rows, size_bytes, partition_by, clustered_by,
|
||||||
|
refreshed_at, error_at, error_msg)
|
||||||
|
VALUES (?, ?, ?, ?, ?, ?, NULL, NULL)
|
||||||
|
ON CONFLICT (table_id) DO UPDATE SET
|
||||||
|
rows = excluded.rows,
|
||||||
|
size_bytes = excluded.size_bytes,
|
||||||
|
partition_by = excluded.partition_by,
|
||||||
|
clustered_by = excluded.clustered_by,
|
||||||
|
refreshed_at = excluded.refreshed_at,
|
||||||
|
error_at = NULL,
|
||||||
|
error_msg = NULL""",
|
||||||
|
[table_id, rows, size_bytes, partition_by, clustered_json, now],
|
||||||
|
)
|
||||||
|
|
||||||
|
def mark_error(self, table_id: str, error_msg: str) -> None:
|
||||||
|
"""Record a failed refresh. Preserves the prior success row (if any)
|
||||||
|
so analyst Claude keeps using last-known-good rows + size_bytes while
|
||||||
|
the next scheduled retry attempts to recover."""
|
||||||
|
now = datetime.now(timezone.utc)
|
||||||
|
truncated = (error_msg or "")[:512] # bound storage
|
||||||
|
self.conn.execute(
|
||||||
|
"""INSERT INTO bq_metadata_cache
|
||||||
|
(table_id, rows, size_bytes, partition_by, clustered_by,
|
||||||
|
refreshed_at, error_at, error_msg)
|
||||||
|
VALUES (?, NULL, NULL, NULL, NULL, NULL, ?, ?)
|
||||||
|
ON CONFLICT (table_id) DO UPDATE SET
|
||||||
|
error_at = excluded.error_at,
|
||||||
|
error_msg = excluded.error_msg""",
|
||||||
|
[table_id, now, truncated],
|
||||||
|
)
|
||||||
|
|
||||||
|
def delete(self, table_id: str) -> None:
|
||||||
|
"""Drop a row — used by admin endpoints when a table is unregistered."""
|
||||||
|
self.conn.execute(
|
||||||
|
"DELETE FROM bq_metadata_cache WHERE table_id = ?", [table_id]
|
||||||
|
)
|
||||||
|
|
@@ -84,7 +84,6 @@ def _reset_module_caches():
|
||||||
try:
|
try:
|
||||||
from app.api import v2_catalog as _vc
|
from app.api import v2_catalog as _vc
|
||||||
_vc._table_rows_cache.clear()
|
_vc._table_rows_cache.clear()
|
||||||
_vc._metadata_cache.clear()
|
|
||||||
except (ImportError, AttributeError):
|
except (ImportError, AttributeError):
|
||||||
pass
|
pass
|
||||||
try:
|
try:
|
||||||
|
|
@@ -96,7 +95,6 @@ def _reset_module_caches():
|
||||||
try:
|
try:
|
||||||
from app.api import v2_catalog as _vc
|
from app.api import v2_catalog as _vc
|
||||||
_vc._table_rows_cache.clear()
|
_vc._table_rows_cache.clear()
|
||||||
_vc._metadata_cache.clear()
|
|
||||||
except (ImportError, AttributeError):
|
except (ImportError, AttributeError):
|
||||||
pass
|
pass
|
||||||
try:
|
try:
|
||||||
|
|
|
||||||
160
tests/test_bq_metadata_cache_repo.py
Normal file
|
|
@@ -0,0 +1,160 @@
|
||||||
|
"""Repository + freshness tests for the persistent BQ metadata cache."""
|
||||||
|
|
||||||
|
from datetime import datetime, timedelta, timezone
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
from src.repositories.bq_metadata_cache import BqMetadataCacheRepository
|
||||||
|
|
||||||
|
|
||||||
|
def test_upsert_success_inserts_then_updates(seeded_app):
|
||||||
|
from src.db import get_system_db
|
||||||
|
conn = get_system_db()
|
||||||
|
try:
|
||||||
|
repo = BqMetadataCacheRepository(conn)
|
||||||
|
repo.upsert_success(
|
||||||
|
"orders", rows=10, size_bytes=2048,
|
||||||
|
partition_by="event_date", clustered_by=["country"],
|
||||||
|
)
|
||||||
|
row = repo.get("orders")
|
||||||
|
assert row is not None
|
||||||
|
assert row["rows"] == 10
|
||||||
|
assert row["size_bytes"] == 2048
|
||||||
|
assert row["partition_by"] == "event_date"
|
||||||
|
assert row["clustered_by"] == ["country"]
|
||||||
|
assert row["refreshed_at"] is not None
|
||||||
|
assert row["error_at"] is None
|
||||||
|
|
||||||
|
# Update with new numbers; refreshed_at advances.
|
||||||
|
first_refresh = row["refreshed_at"]
|
||||||
|
repo.upsert_success(
|
||||||
|
"orders", rows=20, size_bytes=4096,
|
||||||
|
partition_by=None, clustered_by=[],
|
||||||
|
)
|
||||||
|
row2 = repo.get("orders")
|
||||||
|
assert row2["rows"] == 20
|
||||||
|
assert row2["partition_by"] is None
|
||||||
|
assert row2["clustered_by"] == []
|
||||||
|
assert row2["refreshed_at"] >= first_refresh
|
||||||
|
finally:
|
||||||
|
conn.close()
|
||||||
|
|
||||||
|
|
||||||
|
def test_mark_error_preserves_prior_success(seeded_app):
|
||||||
|
"""After a successful refresh, a subsequent failure must keep the
|
||||||
|
rows/size_bytes columns untouched — analyst Claude keeps using the
|
||||||
|
last-known-good numbers while the next scheduled retry attempts to
|
||||||
|
recover."""
|
||||||
|
from src.db import get_system_db
|
||||||
|
conn = get_system_db()
|
||||||
|
try:
|
||||||
|
repo = BqMetadataCacheRepository(conn)
|
||||||
|
repo.upsert_success(
|
||||||
|
"orders", rows=100, size_bytes=1000,
|
||||||
|
partition_by=None, clustered_by=None,
|
||||||
|
)
|
||||||
|
repo.mark_error("orders", "BQ timeout")
|
||||||
|
row = repo.get("orders")
|
||||||
|
assert row["rows"] == 100, "prior success must be preserved across error"
|
||||||
|
assert row["size_bytes"] == 1000
|
||||||
|
assert row["error_at"] is not None
|
||||||
|
assert row["error_msg"] == "BQ timeout"
|
||||||
|
# Subsequent success clears the error.
|
||||||
|
repo.upsert_success(
|
||||||
|
"orders", rows=200, size_bytes=2000,
|
||||||
|
partition_by=None, clustered_by=None,
|
||||||
|
)
|
||||||
|
row2 = repo.get("orders")
|
||||||
|
assert row2["rows"] == 200
|
||||||
|
assert row2["error_at"] is None
|
||||||
|
assert row2["error_msg"] is None
|
||||||
|
finally:
|
||||||
|
conn.close()
|
||||||
|
|
||||||
|
|
||||||
|
def test_mark_error_truncates_long_messages(seeded_app):
|
||||||
|
from src.db import get_system_db
|
||||||
|
conn = get_system_db()
|
||||||
|
try:
|
||||||
|
repo = BqMetadataCacheRepository(conn)
|
||||||
|
repo.mark_error("orders", "x" * 2000)
|
||||||
|
row = repo.get("orders")
|
||||||
|
assert len(row["error_msg"]) == 512
|
||||||
|
finally:
|
||||||
|
conn.close()
|
||||||
|
|
||||||
|
|
||||||
|
def test_list_all_orders_by_table_id(seeded_app):
|
||||||
|
from src.db import get_system_db
|
||||||
|
conn = get_system_db()
|
||||||
|
try:
|
||||||
|
repo = BqMetadataCacheRepository(conn)
|
||||||
|
repo.upsert_success(
|
||||||
|
"zeta", rows=1, size_bytes=1, partition_by=None, clustered_by=None,
|
||||||
|
)
|
||||||
|
repo.upsert_success(
|
||||||
|
"alpha", rows=2, size_bytes=2, partition_by=None, clustered_by=None,
|
||||||
|
)
|
||||||
|
rows = repo.list_all()
|
||||||
|
ids = [r["table_id"] for r in rows]
|
||||||
|
assert ids == sorted(ids)
|
||||||
|
finally:
|
||||||
|
conn.close()
|
||||||
|
|
||||||
|
|
||||||
|
def test_delete_removes_row(seeded_app):
|
||||||
|
from src.db import get_system_db
|
||||||
|
conn = get_system_db()
|
||||||
|
try:
|
||||||
|
repo = BqMetadataCacheRepository(conn)
|
||||||
|
repo.upsert_success(
|
||||||
|
"orders", rows=1, size_bytes=1, partition_by=None, clustered_by=None,
|
||||||
|
)
|
||||||
|
repo.delete("orders")
|
||||||
|
assert repo.get("orders") is None
|
||||||
|
finally:
|
||||||
|
conn.close()
|
||||||
|
|
||||||
|
|
||||||
|
# ─── compute_freshness ────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
|
||||||
|
def test_freshness_never_fetched_for_missing_row():
|
||||||
|
from app.api.bq_metadata_refresh import compute_freshness
|
||||||
|
assert compute_freshness(None) == "never_fetched"
|
||||||
|
|
||||||
|
|
||||||
|
def test_freshness_never_fetched_for_no_refresh_no_error():
|
||||||
|
from app.api.bq_metadata_refresh import compute_freshness
|
||||||
|
row = {"refreshed_at": None, "error_at": None}
|
||||||
|
assert compute_freshness(row) == "never_fetched"
|
||||||
|
|
||||||
|
|
||||||
|
def test_freshness_error_when_only_error_present():
|
||||||
|
from app.api.bq_metadata_refresh import compute_freshness
|
||||||
|
row = {
|
||||||
|
"refreshed_at": None,
|
||||||
|
"error_at": datetime.now(timezone.utc),
|
||||||
|
}
|
||||||
|
assert compute_freshness(row) == "error"
|
||||||
|
|
||||||
|
|
||||||
|
def test_freshness_fresh_within_threshold():
|
||||||
|
from app.api.bq_metadata_refresh import compute_freshness
|
||||||
|
now = datetime.now(timezone.utc)
|
||||||
|
row = {
|
||||||
|
"refreshed_at": now - timedelta(seconds=60),
|
||||||
|
"error_at": None,
|
||||||
|
}
|
||||||
|
# 1-minute-old row with a 1-hour threshold ⇒ fresh.
|
||||||
|
assert compute_freshness(row, now=now, fresh_threshold=3600) == "fresh"
|
||||||
|
|
||||||
|
|
||||||
|
def test_freshness_stale_beyond_threshold():
|
||||||
|
from app.api.bq_metadata_refresh import compute_freshness
|
||||||
|
now = datetime.now(timezone.utc)
|
||||||
|
row = {
|
||||||
|
"refreshed_at": now - timedelta(hours=10),
|
||||||
|
"error_at": None,
|
||||||
|
}
|
||||||
|
assert compute_freshness(row, now=now, fresh_threshold=3600) == "stale"
|
||||||
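The freshness tests above pin down four outcomes (`never_fetched`, `error`, `fresh`, `stale`), but the body of `compute_freshness` is not part of this excerpt. The following is a sketch that satisfies those tests, not the actual implementation. The name `compute_freshness_sketch`, the default threshold of twice the 4 h scheduler interval, and the handling of naive timestamps as UTC are all assumptions made for illustration.

```python
from datetime import datetime, timezone
from typing import Any, Optional

# Assumption: "fresh" means younger than 2x the default 4 h scheduler interval.
DEFAULT_FRESH_THRESHOLD = 2 * 4 * 60 * 60


def compute_freshness_sketch(
    row: Optional[dict[str, Any]],
    *,
    now: Optional[datetime] = None,
    fresh_threshold: int = DEFAULT_FRESH_THRESHOLD,
) -> str:
    """Classify a cache row as never_fetched, error, fresh, or stale."""
    if row is None:
        return "never_fetched"

    refreshed_at = row.get("refreshed_at")
    if refreshed_at is None:
        # No success on record: distinguish "tried and failed" from "never tried".
        return "error" if row.get("error_at") is not None else "never_fetched"

    # Assumption: a naive timestamp read back from the database is UTC.
    if refreshed_at.tzinfo is None:
        refreshed_at = refreshed_at.replace(tzinfo=timezone.utc)

    current = now if now is not None else datetime.now(timezone.utc)
    age_seconds = (current - refreshed_at).total_seconds()
    return "fresh" if age_seconds < fresh_threshold else "stale"
```

A row that has both a prior success and a later error is classified purely by the age of the success here; whether the real function does the same is not shown in the diff.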
211
tests/test_bq_metadata_refresh_endpoint.py
Normal file
|
|
@@ -0,0 +1,211 @@
|
||||||
|
"""End-to-end tests for the three bq_metadata_refresh endpoints."""
|
||||||
|
|
||||||
|
from unittest.mock import patch
|
||||||
|
|
||||||
|
from app.api._metadata_models import TableMetadata
|
||||||
|
|
||||||
|
|
||||||
|
def _register_remote(seeded_app, table_id: str):
|
||||||
|
from src.db import get_system_db
|
||||||
|
from src.repositories.table_registry import TableRegistryRepository
|
||||||
|
conn = get_system_db()
|
||||||
|
try:
|
||||||
|
TableRegistryRepository(conn).register(
|
||||||
|
name=table_id,
|
||||||
|
id=table_id,
|
||||||
|
source_type="bigquery",
|
||||||
|
bucket="dwh_base",
|
||||||
|
source_table=table_id,
|
||||||
|
query_mode="remote",
|
||||||
|
)
|
||||||
|
finally:
|
||||||
|
conn.close()
|
||||||
|
|
||||||
|
|
||||||
|
# ─── POST /api/admin/run-bq-metadata-refresh ──────────────────────────────
|
||||||
|
|
||||||
|
|
||||||
|
def test_run_refresh_walks_remote_rows_and_upserts(seeded_app):
|
||||||
|
from src.db import get_system_db
|
||||||
|
from src.repositories.bq_metadata_cache import BqMetadataCacheRepository
|
||||||
|
|
||||||
|
_register_remote(seeded_app, "a_remote")
|
||||||
|
_register_remote(seeded_app, "b_remote")
|
||||||
|
|
||||||
|
fake = TableMetadata(
|
||||||
|
rows=5, size_bytes=512, partition_by="d", clustered_by=["c"],
|
||||||
|
)
|
||||||
|
|
||||||
|
c = seeded_app["client"]
|
||||||
|
token = seeded_app["admin_token"]
|
||||||
|
with patch("connectors.bigquery.metadata.fetch", return_value=fake):
|
||||||
|
r = c.post(
|
||||||
|
"/api/admin/run-bq-metadata-refresh",
|
||||||
|
headers={"Authorization": f"Bearer {token}"},
|
||||||
|
)
|
||||||
|
assert r.status_code == 200, r.text
|
||||||
|
body = r.json()
|
||||||
|
assert body["total"] >= 2
|
||||||
|
assert body["succeeded"] >= 2
|
||||||
|
assert body["failed"] == 0
|
||||||
|
|
||||||
|
conn = get_system_db()
|
||||||
|
try:
|
||||||
|
repo = BqMetadataCacheRepository(conn)
|
||||||
|
for tid in ("a_remote", "b_remote"):
|
||||||
|
row = repo.get(tid)
|
||||||
|
assert row is not None
|
||||||
|
assert row["rows"] == 5
|
||||||
|
assert row["size_bytes"] == 512
|
||||||
|
assert row["partition_by"] == "d"
|
||||||
|
assert row["clustered_by"] == ["c"]
|
||||||
|
finally:
|
||||||
|
conn.close()
|
||||||
|
|
||||||
|
|
||||||
|
def test_run_refresh_marks_error_on_provider_failure(seeded_app):
|
||||||
|
from src.db import get_system_db
|
||||||
|
from src.repositories.bq_metadata_cache import BqMetadataCacheRepository
|
||||||
|
|
||||||
|
_register_remote(seeded_app, "boom")
|
||||||
|
|
||||||
|
c = seeded_app["client"]
|
||||||
|
token = seeded_app["admin_token"]
|
||||||
|
with patch(
|
||||||
|
"connectors.bigquery.metadata.fetch",
|
||||||
|
side_effect=RuntimeError("BQ throttle"),
|
||||||
|
):
|
||||||
|
r = c.post(
|
||||||
|
"/api/admin/run-bq-metadata-refresh",
|
||||||
|
headers={"Authorization": f"Bearer {token}"},
|
||||||
|
)
|
||||||
|
assert r.status_code == 200
|
||||||
|
body = r.json()
|
||||||
|
assert body["failed"] >= 1
|
||||||
|
|
||||||
|
conn = get_system_db()
|
||||||
|
try:
|
||||||
|
row = BqMetadataCacheRepository(conn).get("boom")
|
||||||
|
assert row is not None
|
||||||
|
assert row["error_at"] is not None
|
||||||
|
assert "BQ throttle" in (row["error_msg"] or "")
|
||||||
|
finally:
|
||||||
|
conn.close()
|
||||||
|
|
||||||
|
|
||||||
|
def test_run_refresh_requires_admin(seeded_app):
|
||||||
|
c = seeded_app["client"]
|
||||||
|
# No Authorization header → 401.
|
||||||
|
r = c.post("/api/admin/run-bq-metadata-refresh")
|
||||||
|
assert r.status_code == 401
|
||||||
|
|
||||||
|
|
||||||
|
+# ─── POST /api/v2/metadata-cache/refresh?table= ───────────────────────────
+
+
+def test_refresh_one_table_endpoint(seeded_app):
+    from src.db import get_system_db
+    from src.repositories.bq_metadata_cache import BqMetadataCacheRepository
+
+    _register_remote(seeded_app, "single")
+
+    c = seeded_app["client"]
+    token = seeded_app["admin_token"]
+    fake = TableMetadata(rows=99, size_bytes=999)
+    with patch("connectors.bigquery.metadata.fetch", return_value=fake):
+        r = c.post(
+            "/api/v2/metadata-cache/refresh?table=single",
+            headers={"Authorization": f"Bearer {token}"},
+        )
+    assert r.status_code == 200, r.text
+    assert r.json()["status"] == "ok"
+
+    conn = get_system_db()
+    try:
+        row = BqMetadataCacheRepository(conn).get("single")
+        assert row["rows"] == 99
+    finally:
+        conn.close()
+
+
+def test_refresh_one_table_unknown_id_returns_404(seeded_app):
+    c = seeded_app["client"]
+    token = seeded_app["admin_token"]
+    r = c.post(
+        "/api/v2/metadata-cache/refresh?table=does_not_exist",
+        headers={"Authorization": f"Bearer {token}"},
+    )
+    assert r.status_code == 404
+
+
+def test_refresh_one_table_rejects_non_remote(seeded_app):
+    from src.db import get_system_db
+    from src.repositories.table_registry import TableRegistryRepository
+
+    conn = get_system_db()
+    try:
+        TableRegistryRepository(conn).register(
+            name="local_t",
+            id="local_t",
+            source_type="keboola",
+            bucket="in.c-x",
+            source_table="t",
+            query_mode="local",
+        )
+    finally:
+        conn.close()
+
+    c = seeded_app["client"]
+    token = seeded_app["admin_token"]
+    r = c.post(
+        "/api/v2/metadata-cache/refresh?table=local_t",
+        headers={"Authorization": f"Bearer {token}"},
+    )
+    assert r.status_code == 400
+
+
+# ─── GET /api/v2/metadata-cache/status ────────────────────────────────────
+
+
+def test_status_endpoint_returns_per_row_freshness(seeded_app):
+    from src.db import get_system_db
+    from src.repositories.bq_metadata_cache import BqMetadataCacheRepository
+
+    conn = get_system_db()
+    try:
+        BqMetadataCacheRepository(conn).upsert_success(
+            "orders", rows=1, size_bytes=1, partition_by=None, clustered_by=None,
+        )
+    finally:
+        conn.close()
+
+    c = seeded_app["client"]
+    token = seeded_app["admin_token"]
+    r = c.get(
+        "/api/v2/metadata-cache/status",
+        headers={"Authorization": f"Bearer {token}"},
+    )
+    assert r.status_code == 200, r.text
+    body = r.json()
+    assert "scheduler_interval_seconds" in body
+    assert "fresh_threshold_seconds" in body
+    assert body["fresh_threshold_seconds"] == 2 * body["scheduler_interval_seconds"]
+    orders = next(t for t in body["tables"] if t["table_id"] == "orders")
+    assert orders["freshness"] == "fresh"
+
+
+def test_status_endpoint_does_not_require_admin(seeded_app):
+    """Non-admin analyst tools (CLI, Claude Code) need this surface."""
+    c = seeded_app["client"]
+    # No token at all → 401 (auth still required, just not admin).
+    r = c.get("/api/v2/metadata-cache/status")
+    assert r.status_code == 401
+    # Any authenticated user works — seeded_app's admin_token is the
+    # easiest valid bearer; downgrade once the test harness exposes a
+    # plain-user token.
+    token = seeded_app["admin_token"]
+    r = c.get(
+        "/api/v2/metadata-cache/status",
+        headers={"Authorization": f"Bearer {token}"},
+    )
+    assert r.status_code == 200
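The status tests above pin three freshness labels (`fresh`, `never_fetched`, `not_applicable`) and a threshold of twice the scheduler interval. A minimal sketch of how such a classification could work is shown below. It is not the code from this commit: `classify_freshness`, the `fetched_at` parameter, and the `stale` label are hypothetical names chosen for illustration, and the real implementation may also account for error rows.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def classify_freshness(
    query_mode: str,
    fetched_at: Optional[datetime],
    scheduler_interval_seconds: int,
    now: Optional[datetime] = None,
) -> str:
    """Hypothetical helper: map a cache row's age to a freshness label."""
    # Local rows never use the BigQuery metadata cache.
    if query_mode != "remote":
        return "not_applicable"
    # No cache row yet, e.g. first boot before the scheduler has ticked.
    if fetched_at is None:
        return "never_fetched"
    now = now or datetime.now(timezone.utc)
    # The status test asserts fresh_threshold == 2 * scheduler interval.
    fresh_threshold = timedelta(seconds=2 * scheduler_interval_seconds)
    # "stale" is an assumed label; the tests only pin the other three.
    return "fresh" if now - fetched_at <= fresh_threshold else "stale"
```

With the default 4 h interval this would treat a row as fresh for up to 8 h, so a single missed scheduler tick does not flip the label.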
@@ -18,9 +18,17 @@ def test_build_jobs_uses_documented_defaults(monkeypatch):
     assert jobs["health-check"] == "every 5m"
     assert jobs["script-runner"] == "every 1m"
     assert jobs["marketplaces"] == "daily 03:00"
+    assert jobs["bq-metadata-refresh"] == "every 4h"
     assert resolved_tick_seconds() == 30


+def test_build_jobs_honors_bq_metadata_env_override(monkeypatch):
+    monkeypatch.setenv("SCHEDULER_BQ_METADATA_REFRESH_INTERVAL", "7200")  # 2h
+    from services.scheduler.__main__ import build_jobs
+    jobs = {name: schedule for name, schedule, *_ in build_jobs()}
+    assert jobs["bq-metadata-refresh"] == "every 2h"
+
+
 def test_build_jobs_honors_env_overrides(monkeypatch):
     monkeypatch.setenv("SCHEDULER_DATA_REFRESH_INTERVAL", "1800")  # 30m
     monkeypatch.setenv("SCHEDULER_HEALTH_CHECK_INTERVAL", "60")  # 1m
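The scheduler tests above expect an interval given in seconds (for example `7200`) to surface as a label such as `every 2h`. One way that mapping could be written is sketched below. `interval_to_schedule` is a hypothetical name, and the fallback to seconds is an assumption; the actual formatter in `services.scheduler` is not shown in this diff.

```python
def interval_to_schedule(seconds: int) -> str:
    """Hypothetical helper: render an interval in seconds as 'every N<unit>'."""
    if seconds <= 0:
        raise ValueError("interval must be positive")
    # Prefer the largest unit that divides the interval exactly.
    if seconds % 3600 == 0:
        return f"every {seconds // 3600}h"
    if seconds % 60 == 0:
        return f"every {seconds // 60}m"
    # Assumed fallback for intervals that are not whole minutes.
    return f"every {seconds}s"
```

Under these assumptions `interval_to_schedule(7200)` yields `every 2h` and `interval_to_schedule(1800)` yields `every 30m`, matching the labels the tests assert.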
@@ -1,71 +0,0 @@
-"""Dispatch + identifier-validation gate for the source-agnostic
-metadata providers."""
-
-from app.api._metadata_models import MetadataRequest
-
-
-def test_dispatcher_returns_bq_provider_for_bigquery():
-    from app.api.v2_catalog import _metadata_provider_for
-    from connectors.bigquery import metadata as bq_meta
-    fn = _metadata_provider_for("bigquery")
-    assert fn is bq_meta.fetch
-
-
-def test_dispatcher_returns_keboola_provider_for_keboola():
-    from app.api.v2_catalog import _metadata_provider_for
-    from connectors.keboola import metadata as kb_meta
-    fn = _metadata_provider_for("keboola")
-    assert fn is kb_meta.fetch
-
-
-def test_dispatcher_returns_none_for_unknown_source():
-    from app.api.v2_catalog import _metadata_provider_for
-    assert _metadata_provider_for("jira") is None
-    assert _metadata_provider_for("") is None
-    assert _metadata_provider_for("snowflake") is None
-
-
-def test_build_metadata_request_for_valid_row():
-    from app.api.v2_catalog import _build_metadata_request
-    req = _build_metadata_request({
-        "id": "orders",
-        "bucket": "dwh_base",
-        "source_table": "orders_2024",
-    })
-    assert isinstance(req, MetadataRequest)
-    assert req.table_id == "orders"
-    assert req.bucket == "dwh_base"
-    assert req.source_table == "orders_2024"
-
-
-def test_build_metadata_request_rejects_unsafe_bucket():
-    from app.api.v2_catalog import _build_metadata_request
-    req = _build_metadata_request({
-        "id": "x",
-        "bucket": "evil`; DROP--",
-        "source_table": "t",
-    })
-    assert req is None
-
-
-def test_build_metadata_request_falls_back_to_id_when_source_table_missing():
-    """Some legacy Keboola registry rows have empty source_table; the row id
-    is the table name in that case (mirrors v2_schema:168 behavior)."""
-    from app.api.v2_catalog import _build_metadata_request
-    req = _build_metadata_request({
-        "id": "orders",
-        "bucket": "in.c-crm",
-        "source_table": "",
-    })
-    assert req is not None
-    assert req.source_table == "orders"
-
-
-def test_stub_providers_return_none():
-    """Providers don't have their real bodies yet — stubs return None
-    so the catalog endpoint stays 200 while we wire the rest."""
-    from connectors.bigquery import metadata as bq_meta
-    from connectors.keboola import metadata as kb_meta
-    req = MetadataRequest(table_id="x", bucket="b", source_table="t")
-    assert bq_meta.fetch(req) is None
-    assert kb_meta.fetch(req) is None
@@ -1,47 +1,58 @@
-"""Unified cache flush across all four catalog/schema/sample/metadata
-caches on registry write."""
+"""Unified cache flush across the three in-memory catalog/schema/sample
+caches on registry write.

-from unittest.mock import patch
+Post-0.50: the persistent ``bq_metadata_cache`` is intentionally NOT
+invalidated here. That table's lifecycle is owned by the scheduler-
+driven refresh — admins who need an immediate refresh after editing a
+remote row hit ``POST /api/v2/metadata-cache/refresh?table=<id>``
+explicitly. Auto-invalidation on every registry edit would re-introduce
+the request-path BQ fan-out the refactor exists to avoid.
+"""
+
+from src.db import get_system_db
+from src.repositories.bq_metadata_cache import BqMetadataCacheRepository


-def test_invalidate_flushes_all_four_caches():
+def test_invalidate_flushes_three_in_memory_caches():
     from app.api import v2_catalog, v2_schema, v2_sample
-    from app.api._metadata_models import TableMetadata

     # Pre-populate.
     v2_catalog._table_rows_cache.set("all", ["fake_row"])
-    v2_catalog._metadata_cache.set("orders", TableMetadata(rows=10))
     v2_schema._schema_cache.set("orders", {"columns": []})
     v2_sample._sample_cache.set("orders|10", [{"row": 1}])

     v2_catalog.invalidate_for_table("orders")

     assert v2_catalog._table_rows_cache.get("all") is None
-    assert v2_catalog._metadata_cache.get("orders") is None
     assert v2_schema._schema_cache.get("orders") is None
     # Sample cache is cleared whole (we don't have prefix-invalidation).
     assert v2_sample._sample_cache.get("orders|10") is None


-def test_invalidate_schedules_single_row_rewarm(monkeypatch):
-    """After the flush, a background re-warm task is scheduled for the
-    same table_id. Assert via patching create_task."""
-    import asyncio
+def test_invalidate_does_not_touch_persistent_bq_cache():
+    """The persistent cache survives registry-row invalidations; only an
+    explicit ``POST /api/v2/metadata-cache/refresh`` (or the scheduled
+    refresh) should change it."""
     from app.api import v2_catalog

-    scheduled = []
+    conn = get_system_db()
+    try:
+        BqMetadataCacheRepository(conn).upsert_success(
+            "survives_invalidate",
+            rows=42, size_bytes=4096, partition_by=None, clustered_by=None,
+        )
+    finally:
+        conn.close()

-    def fake_create_task(coro):
-        # Drain the coroutine so the test doesn't leak it.
-        coro.close()
-        scheduled.append(coro)
-        return None
+    v2_catalog.invalidate_for_table("survives_invalidate")

-    # Simulate a running event loop so the create_task branch is reached.
-    monkeypatch.setattr(asyncio, "get_running_loop", lambda: object())
-    monkeypatch.setattr(asyncio, "create_task", fake_create_task)
-    v2_catalog.invalidate_for_table("orders")
-    assert len(scheduled) == 1
+    conn = get_system_db()
+    try:
+        row = BqMetadataCacheRepository(conn).get("survives_invalidate")
+    finally:
+        conn.close()
+    assert row is not None
+    assert row["rows"] == 42


 def test_register_table_invalidates(seeded_app):
@@ -1,10 +1,16 @@
 """Catalog endpoint integration: per-table metadata enrichment for
-remote rows."""
+remote rows.
+
+Post-0.50 the catalog endpoint reads enrichment fields exclusively from
+the persistent ``bq_metadata_cache`` table (populated by the scheduler-
+driven refresh in ``app/api/bq_metadata_refresh.py``). These tests
+pre-seed cache rows and verify the catalog response shape; they do NOT
+mock ``connectors.bigquery.metadata.fetch`` because that path is no
+longer reachable from the catalog request.
+"""

 from unittest.mock import patch

-from app.api._metadata_models import TableMetadata
-

 def _register_table(seeded_app, **kwargs):
     """Register a table into the test DB using TableRegistryRepository."""
@@ -13,38 +19,60 @@ def _register_table(seeded_app, **kwargs):
     conn = get_system_db()
     try:
         repo = TableRegistryRepository(conn)
-        # `name` defaults to `id` if not supplied
         name = kwargs.pop("name", kwargs.get("id"))
         repo.register(name=name, **kwargs)
     finally:
         conn.close()


-def test_remote_row_includes_metadata_fields(seeded_app, monkeypatch):
-    """Catalog response for a query_mode='remote' BQ row carries the four
-    new fields populated by the provider."""
-    # Reset catalog row cache so this test's registered table is visible.
+def _seed_cache_row(
+    table_id: str,
+    *,
+    rows=None,
+    size_bytes=None,
+    partition_by=None,
+    clustered_by=None,
+):
+    """Insert a successful refresh row into bq_metadata_cache."""
+    from src.db import get_system_db
+    from src.repositories.bq_metadata_cache import BqMetadataCacheRepository
+    conn = get_system_db()
+    try:
+        BqMetadataCacheRepository(conn).upsert_success(
+            table_id,
+            rows=rows,
+            size_bytes=size_bytes,
+            partition_by=partition_by,
+            clustered_by=clustered_by,
+        )
+    finally:
+        conn.close()
+
+
+def _reset_catalog_caches():
     from app.api import v2_catalog
     v2_catalog._table_rows_cache.clear()
-    v2_catalog._metadata_cache.clear()
+
+
+def test_remote_row_includes_metadata_fields(seeded_app):
+    """Catalog response for a query_mode='remote' BQ row carries the four
+    enrichment fields read from the persistent cache."""
+    _reset_catalog_caches()

     c = seeded_app["client"]
     token = seeded_app["admin_token"]

-    fake_meta = TableMetadata(
-        rows=10000, size_bytes=2_000_000,
-        partition_by="event_date", clustered_by=["country", "platform"],
-    )
-
     _register_table(
         seeded_app,
         id="orders", source_type="bigquery", bucket="dwh_base",
         source_table="orders_2024", query_mode="remote",
     )
+    _seed_cache_row(
+        "orders",
+        rows=10000, size_bytes=2_000_000,
+        partition_by="event_date", clustered_by=["country", "platform"],
+    )

-    with patch(
-        "connectors.bigquery.metadata.fetch", return_value=fake_meta,
-    ):
     r = c.get(
         "/api/v2/catalog",
         headers={"Authorization": f"Bearer {token}"},
@@ -56,15 +84,46 @@ def test_remote_row_includes_metadata_fields(seeded_app, monkeypatch):
     assert orders["size_bytes"] == 2_000_000
     assert orders["partition_by"] == "event_date"
     assert orders["clustered_by"] == ["country", "platform"]
-    # Existing fields still present.
     assert orders["query_mode"] == "remote"
+    assert orders["metadata_freshness"] == "fresh"


-def test_local_row_unaffected_by_provider_dispatch(seeded_app):
-    """query_mode='local' rows take the parquet-stat path; provider not called."""
-    from app.api import v2_catalog
-    v2_catalog._table_rows_cache.clear()
-    v2_catalog._metadata_cache.clear()
+def test_remote_row_with_no_cache_returns_null_fields(seeded_app):
+    """Catalog response for a remote row with no cache entry — first boot
+    before scheduler tick — returns null enrichment fields and
+    metadata_freshness='never_fetched'. MUST stay 200; MUST NOT call BQ."""
+    _reset_catalog_caches()
+
+    c = seeded_app["client"]
+    token = seeded_app["admin_token"]
+    _register_table(
+        seeded_app,
+        id="cold_t", source_type="bigquery", bucket="dwh_base",
+        source_table="cold_t", query_mode="remote",
+    )
+
+    # Patch the BQ provider so we can prove the request path never reaches it.
+    with patch("connectors.bigquery.metadata.fetch") as mock_fetch:
+        r = c.get(
+            "/api/v2/catalog",
+            headers={"Authorization": f"Bearer {token}"},
+        )
+    assert r.status_code == 200, r.text
+    mock_fetch.assert_not_called()
+
+    tables = r.json()["tables"]
+    cold = next(t for t in tables if t["id"] == "cold_t")
+    assert cold["rows"] is None
+    assert cold["size_bytes"] is None
+    assert cold["partition_by"] is None
+    assert cold["clustered_by"] == []
+    assert cold["metadata_freshness"] == "never_fetched"
+
+
+def test_local_row_metadata_freshness_is_not_applicable(seeded_app):
+    """query_mode='local' rows take the parquet-stat path; the freshness
+    field signals that the BQ cache concept doesn't apply."""
+    _reset_catalog_caches()

     c = seeded_app["client"]
     token = seeded_app["admin_token"]
@@ -74,74 +133,31 @@ def test_local_row_unaffected_by_provider_dispatch(seeded_app):
         source_table="users", query_mode="local",
     )

-    with patch("connectors.keboola.metadata.fetch") as mock_fetch:
-        r = c.get(
-            "/api/v2/catalog",
-            headers={"Authorization": f"Bearer {token}"},
-        )
-    assert r.status_code == 200, r.text
-    mock_fetch.assert_not_called()
-
-
-def test_provider_failure_returns_null_metadata(seeded_app):
-    """Provider returns None → row appears with null new fields, not
-    a 500. Catalog endpoint must stay 200."""
-    from app.api import v2_catalog
-    v2_catalog._table_rows_cache.clear()
-    v2_catalog._metadata_cache.clear()
-
-    c = seeded_app["client"]
-    token = seeded_app["admin_token"]
-    _register_table(
-        seeded_app,
-        id="broken", source_type="bigquery", bucket="dwh_base",
-        source_table="broken_t", query_mode="remote",
-    )
-
-    with patch(
-        "connectors.bigquery.metadata.fetch", return_value=None,
-    ):
     r = c.get(
         "/api/v2/catalog",
         headers={"Authorization": f"Bearer {token}"},
     )
     assert r.status_code == 200, r.text
     tables = r.json()["tables"]
-    broken = next(t for t in tables if t["id"] == "broken")
-    assert broken["rows"] is None
-    assert broken["size_bytes"] is None
-    assert broken["partition_by"] is None
-    assert broken["clustered_by"] is None
+    users = next(t for t in tables if t["id"] == "users")
+    assert users["metadata_freshness"] == "not_applicable"


 def test_zero_size_bytes_reports_small_not_unknown(seeded_app):
-    """Devin Review #1 regression: `if cached.size_bytes:` is falsy when
-    `size_bytes == 0` (genuinely empty table) — that wrongly emitted
-    `rough_size_hint=None` ("unknown") instead of `"small"` (the bucket
-    `_bucket_size(0)` returns).
-
-    Fix in `_size_hint_for_row`: distinguish "size known to be zero" from
-    "size is unknown" with `is not None`."""
-    from app.api import v2_catalog
-    v2_catalog._table_rows_cache.clear()
-    v2_catalog._metadata_cache.clear()
+    """Devin Review #1 regression preserved across the refactor: a cache
+    row with size_bytes=0 must surface rough_size_hint='small', not None.
+    """
+    _reset_catalog_caches()

     c = seeded_app["client"]
     token = seeded_app["admin_token"]

-    fake_meta = TableMetadata(
-        rows=0, size_bytes=0, partition_by=None, clustered_by=[],
-    )
-
     _register_table(
         seeded_app,
         id="empty_t", source_type="bigquery", bucket="dwh_base",
         source_table="empty_t", query_mode="remote",
     )
+    _seed_cache_row("empty_t", rows=0, size_bytes=0, clustered_by=[])

-    with patch(
-        "connectors.bigquery.metadata.fetch", return_value=fake_meta,
-    ):
     r = c.get(
         "/api/v2/catalog",
         headers={"Authorization": f"Bearer {token}"},
@@ -149,18 +165,15 @@ def test_zero_size_bytes_reports_small_not_unknown(seeded_app):
     assert r.status_code == 200, r.text
     tables = r.json()["tables"]
     empty = next(t for t in tables if t["id"] == "empty_t")
-    # The whole point of this test: 0 bytes is NOT "unknown".
     assert empty["size_bytes"] == 0
-    assert empty["rough_size_hint"] == "small", (
-        f"size_bytes=0 should bucket to 'small', got {empty['rough_size_hint']}"
-    )
+    assert empty["rough_size_hint"] == "small"


-def test_cache_hit_does_not_call_provider_twice(seeded_app):
-    """First call invokes provider; second within 15 min hits cache."""
-    from app.api import v2_catalog
-    v2_catalog._table_rows_cache.clear()
-    v2_catalog._metadata_cache.clear()
+def test_catalog_request_never_calls_bq(seeded_app):
+    """The whole point of the refactor: even with a cold cache and a
+    remote BQ row in the registry, GET /api/v2/catalog MUST NOT touch
+    the BQ provider. Regressing this re-introduces the >90 s hang."""
+    _reset_catalog_caches()

     c = seeded_app["client"]
     token = seeded_app["admin_token"]
@@ -170,10 +183,8 @@ def test_cache_hit_does_not_call_provider_twice(seeded_app):
         source_table="orders_2024", query_mode="remote",
     )

-    fake_meta = TableMetadata(rows=1, size_bytes=2)
-    with patch(
-        "connectors.bigquery.metadata.fetch", return_value=fake_meta,
-    ) as mock_fetch:
+    with patch("connectors.bigquery.metadata.fetch") as mock_fetch:
         c.get("/api/v2/catalog", headers={"Authorization": f"Bearer {token}"})
         c.get("/api/v2/catalog", headers={"Authorization": f"Bearer {token}"})
-    assert mock_fetch.call_count == 1
+
+    mock_fetch.assert_not_called()