agnes-the-ai-analyst/app/api/_metadata_models.py
ZdenekSrotyr b6cdd68e8d feat(catalog): entity_type + validated where_examples + view-aware cost-guard + scheduler hygiene
Three behavioural improvements driven by the sub-agent end-to-end test
findings, plus scheduler tweaks to prevent the post-deploy contention
burst we measured.

CATALOG (catalog-side bugs the test agents tripped on):
  - new entity_type field per remote row (BASE TABLE / VIEW /
    MATERIALIZED VIEW). For views, rows + size_bytes return null
    instead of the misleading 0 that __TABLES__ reports.
  - where_examples now validates against the table's actual schema
    (cached known_columns from refresh). The pre-fix behavior
    blindly advertised `country_code = 'CZ'` on tables with no
    country_code column — the sub-agent tests reliably hit this on
    unit_economics.
  - new known_columns + entity_type columns on bq_metadata_cache;
    populated by bq_metadata_refresh.refresh_one from the same
    fetch_bq_columns_full call (no extra BQ roundtrip) plus a
    cheap INFORMATION_SCHEMA.TABLES lookup for table_type.
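
The where_examples validation described above can be sketched as follows. This is a minimal illustration, not the catalog's actual code: the template list, the `filter_where_examples` name, and the "leading identifier is the column" heuristic are all assumptions made for the example.

```python
from __future__ import annotations

import re

# Hypothetical generic templates; the real template list is not shown here.
GENERIC_WHERE_EXAMPLES = [
    "country_code = 'CZ'",
    "event_date >= '2026-01-01'",
]

# Assumption: each template is a simple `<column> <op> <value>` predicate,
# so the leading identifier is the column it references.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def _referenced_column(predicate: str) -> str | None:
    """Return the leading identifier of a simple predicate, if any."""
    match = _IDENT.match(predicate.strip())
    return match.group(0) if match else None


def filter_where_examples(
    templates: list[str], known_columns: list[str] | None
) -> list[str]:
    """Keep only templates whose column exists in the cached schema.

    A cold row (no cached known_columns yet) yields no examples rather
    than advertising predicates that cannot be validated.
    """
    if not known_columns:
        return []
    known = {name.lower() for name in known_columns}
    kept = []
    for template in templates:
        column = _referenced_column(template)
        if column is not None and column.lower() in known:
            kept.append(template)
    return kept
```

With this shape, a table that has no country_code column simply never receives the `country_code = 'CZ'` example.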

QUERY COST-GUARD:
  - remote_scan_too_large suggestion now names views explicitly:
    `Target(s) <ids> are VIEW or MATERIALIZED VIEW. BigQuery does
    not push LIMIT into the view body — SELECT * FROM <view>
    LIMIT 1 still runs the full underlying scan.` Programmatic
    consumers get a new view_targets field on the error detail.
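
A rough sketch of how a view-aware error detail could be assembled. The function name, the dict layout, and the exact suggestion wording are assumptions; only the `remote_scan_too_large` code and the `view_targets` field name come from the notes above.

```python
from __future__ import annotations

VIEW_TYPES = frozenset({"VIEW", "MATERIALIZED VIEW"})


def build_scan_too_large_detail(entity_types: dict[str, str | None]) -> dict:
    """Build an error detail from a {table_id: entity_type} mapping.

    Hypothetical helper: the real cost-guard payload may carry more fields.
    """
    # Sort so the message and the field are deterministic across calls.
    view_targets = sorted(
        table_id
        for table_id, entity_type in entity_types.items()
        if entity_type in VIEW_TYPES
    )
    detail: dict = {"error": "remote_scan_too_large", "view_targets": view_targets}
    if view_targets:
        detail["suggestion"] = (
            f"Target(s) {', '.join(view_targets)} are VIEW or MATERIALIZED VIEW. "
            "BigQuery does not push LIMIT into the view body; "
            "SELECT * FROM <view> LIMIT 1 still runs the full underlying scan."
        )
    return detail
```

Keeping `view_targets` as a list even when it is empty lets programmatic consumers branch on the field without parsing the suggestion text.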

SCHEDULER HYGIENE (the post-deploy 1-minute window where
concurrent parquet downloads dropped to ~1 MB/s):
  - SCHEDULER_STARTUP_GRACE_SECONDS (default 60) holds the first
    tick so the burst doesn't overlap cache_warmup writes.
  - SCHEDULER_BQ_METADATA_INITIAL_OFFSET_MAX_SECONDS (default 900)
    randomises bq-metadata-refresh's first-fire offset.
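
The grace and jitter resolution might look roughly like this. The two environment variable names and their defaults are taken from the notes above; the helper names, the fallback on unparsable values, and the uniform distribution are assumptions.

```python
from __future__ import annotations

import os
import random
from typing import Mapping


def _env_int(name: str, default: int, env: Mapping[str, str] | None = None) -> int:
    """Read a non-negative integer setting, falling back to the default."""
    source = os.environ if env is None else env
    raw = source.get(name)
    try:
        value = int(raw) if raw is not None else default
    except ValueError:
        # Assumption: an unparsable value falls back instead of crashing startup.
        value = default
    return max(0, value)


def resolve_startup_grace(env: Mapping[str, str] | None = None) -> int:
    """Seconds to hold the first scheduler tick after startup."""
    return _env_int("SCHEDULER_STARTUP_GRACE_SECONDS", 60, env)


def resolve_initial_offset(
    env: Mapping[str, str] | None = None, rng: random.Random | None = None
) -> float:
    """Random first-fire offset for bq-metadata-refresh, in seconds."""
    max_offset = _env_int("SCHEDULER_BQ_METADATA_INITIAL_OFFSET_MAX_SECONDS", 900, env)
    generator = rng if rng is not None else random.Random()
    return generator.uniform(0, max_offset)
```

Passing `env` and `rng` explicitly keeps the resolution testable without patching `os.environ` or the global random state.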

TESTS:
  - test_bq_metadata_cache_repo: entity_type + known_columns round-trip
  - test_v2_catalog_remote_metadata: where_examples validation, views
    return null rows/size_bytes, cold rows have empty examples
  - test_api_query_guardrail: VIEW-aware suggestion text + view_targets
  - test_connectors_bigquery_metadata: entity_type lookup mock + new
    fields in TableMetadata expectations
  - test_scheduler_sidecar: grace + jitter env-var resolution
2026-05-12 10:37:35 +02:00


"""Shared data shapes for source-agnostic table-metadata providers.

Lives under `app/api/` because the primary consumer is
`app/api/v2_catalog.py`. Connector-side providers in `connectors/<source>/`
import upward into this module; the inverse layering would force
`v2_catalog.py` to depend on `connectors/__init__.py`, which is the
wrong direction.
"""
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataRequest:
    """Narrow input passed to a metadata provider's `fetch()`.

    `bucket` and `source_table` are pre-validated by the dispatcher
    (`validate_quoted_identifier`) before construction, so the provider
    can interpolate them into SQL/URL paths without re-checking. Frozen
    so the (provider, request)-keyed cache lookup is stable.
    """

    table_id: str
    bucket: str
    source_table: str


@dataclass
class TableMetadata:
    """Source-agnostic metadata bundle.

    Every field is optional: providers fill what they can cheaply get;
    callers tolerate `None`. Adding a new field here is a non-breaking
    change: existing CLI consumers don't even render `rough_size_hint`
    (verified `grep -rn rough_size_hint cli/` is empty), let alone the
    new fields.

    ``entity_type`` for BigQuery mirrors INFORMATION_SCHEMA.TABLES.table_type
    (``BASE TABLE`` / ``VIEW`` / ``MATERIALIZED VIEW`` / ``EXTERNAL`` /
    ``SNAPSHOT`` / ``CLONE``). Catalog uses it to hide misleading
    ``rows=0, size_bytes=0`` for VIEWs (which __TABLES__ reports as zero)
    and to inject a "LIMIT doesn't push into view body" hint into
    cost-guard errors when a remote query targets a VIEW.

    ``known_columns`` is the list of column names from the same refresh
    that populated this row. The catalog endpoint filters generic
    ``where_examples`` templates against this list and drops example
    predicates that reference columns the table doesn't have.
    """

    rows: int | None = None
    size_bytes: int | None = None
    partition_by: str | None = None
    clustered_by: list[str] | None = None
    entity_type: str | None = None
    known_columns: list[str] | None = None
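
As an illustration of how a caller might use `entity_type` to hide the misleading zeros for views, here is a minimal sketch. It re-declares a trimmed copy of `TableMetadata` so the snippet runs on its own; `hide_view_stats` is a hypothetical helper name, not something defined in this module.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass
class TableMetadata:
    # Trimmed copy of the dataclass above, repeated so the sketch is standalone.
    rows: int | None = None
    size_bytes: int | None = None
    entity_type: str | None = None


_VIEW_TYPES = frozenset({"VIEW", "MATERIALIZED VIEW"})


def hide_view_stats(meta: TableMetadata) -> TableMetadata:
    """Return a copy with rows/size_bytes nulled when the entity is a view.

    Base tables and rows with an unknown entity_type pass through unchanged.
    """
    if meta.entity_type in _VIEW_TYPES:
        return replace(meta, rows=None, size_bytes=None)
    return meta


view = hide_view_stats(TableMetadata(rows=0, size_bytes=0, entity_type="VIEW"))
table = hide_view_stats(TableMetadata(rows=10, size_bytes=2048, entity_type="BASE TABLE"))
```

Returning a copy via `dataclasses.replace` leaves the cached row untouched, so the raw values remain available to any code that needs them.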