Three behavioural improvements driven by the sub-agent end-to-end test
findings, plus scheduler tweaks to prevent the post-deploy contention
burst we measured.

CATALOG (catalog-side bugs the test agents tripped on):
  - new entity_type field per remote row (BASE TABLE / VIEW /
    MATERIALIZED VIEW). For views, rows and size_bytes return null
    instead of the misleading 0 that __TABLES__ reports.
  - where_examples is now validated against the table's actual schema
    (known_columns, cached at refresh). Before the fix, the catalog
    blindly advertised `country_code = 'CZ'` on tables with no
    country_code column; the sub-agent tests reliably hit this on
    unit_economics.
  - new known_columns + entity_type columns on bq_metadata_cache,
    populated by bq_metadata_refresh.refresh_one from the same
    fetch_bq_columns_full call (no extra BQ roundtrip) plus a
    cheap INFORMATION_SCHEMA.TABLES lookup for table_type.
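
The two catalog behaviours described above (dropping example predicates for missing columns, and masking row/size stats for views) could be sketched roughly as follows. This is an illustration only, not the code in the catalog endpoint: the helper names `filter_where_examples` and `display_stats`, the `WHERE_TEMPLATES` list, and the "leading identifier is the column" convention are all assumptions made for the example. Whether MATERIALIZED VIEW rows are masked too is not stated in the commit, so the sketch masks plain VIEW only.

```python
from __future__ import annotations

import re

# Hypothetical generic templates; the real list is not shown in this commit.
WHERE_TEMPLATES = [
    "country_code = 'CZ'",
    "event_date >= '2024-01-01'",
]

# Assumption: every template begins with the column it filters on.
_LEADING_IDENT = re.compile(r"\s*([A-Za-z_][A-Za-z0-9_]*)")


def filter_where_examples(templates, known_columns):
    """Keep only templates whose leading column exists on the table."""
    if not known_columns:
        # Cold cache row: nothing to validate against, so advertise nothing.
        return []
    known = {name.lower() for name in known_columns}
    kept = []
    for template in templates:
        match = _LEADING_IDENT.match(template)
        if match and match.group(1).lower() in known:
            kept.append(template)
    return kept


def display_stats(entity_type, rows, size_bytes):
    """Return (rows, size_bytes), replaced by (None, None) for a VIEW."""
    if entity_type == "VIEW":
        # __TABLES__ reports 0 for views, which reads as "empty table".
        return None, None
    return rows, size_bytes
```

With `known_columns=["event_date", "revenue"]`, only the `event_date` template survives, and a row whose `entity_type` is `VIEW` shows null stats instead of zeros.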

QUERY COST-GUARD:
  - remote_scan_too_large suggestion now names views explicitly:
    `Target(s) <ids> are VIEW or MATERIALIZED VIEW. BigQuery does
    not push LIMIT into the view body — SELECT * FROM <view>
    LIMIT 1 still runs the full underlying scan.` Programmatic
    consumers get a new view_targets field on the error detail.
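
A rough sketch of how the VIEW-aware error detail might be assembled is shown here. The function name `build_scan_too_large_detail`, the dict layout, and the `entity_types` lookup are hypothetical, and the suggestion wording is abbreviated rather than the exact production string; only the `view_targets` field name and the `remote_scan_too_large` error come from the commit message.

```python
from __future__ import annotations

VIEW_ENTITY_TYPES = frozenset({"VIEW", "MATERIALIZED VIEW"})


def build_scan_too_large_detail(targets, entity_types):
    """Build an error detail for a rejected remote scan.

    `targets` lists the table ids the query touches; `entity_types` maps
    table id -> cached entity_type (absent or None when unknown).
    """
    view_targets = [t for t in targets if entity_types.get(t) in VIEW_ENTITY_TYPES]
    detail = {"error": "remote_scan_too_large", "view_targets": view_targets}
    if view_targets:
        ids = ", ".join(view_targets)
        # Abbreviated wording; the real message is quoted in the commit text.
        detail["suggestion"] = (
            f"Target(s) {ids} are VIEW or MATERIALIZED VIEW. "
            "LIMIT is not pushed into the view body."
        )
    return detail
```

Keeping `view_targets` as a list even when it is empty means programmatic consumers can iterate over it without checking for the key first; that choice is an assumption of this sketch.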

SCHEDULER HYGIENE (the post-deploy 1-minute window where
concurrent parquet downloads dropped to ~1 MB/s):
  - SCHEDULER_STARTUP_GRACE_SECONDS (default 60) holds the first
    tick so the burst doesn't overlap cache_warmup writes.
  - SCHEDULER_BQ_METADATA_INITIAL_OFFSET_MAX_SECONDS (default 900)
    randomises bq-metadata-refresh's first-fire offset.
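
The grace and jitter settings above might be resolved along these lines. This is a minimal sketch under stated assumptions: the helper names `resolve_seconds`, `initial_offset`, and `startup_plan` are hypothetical, falling back to the default on an invalid value is a guess at the behaviour, and how the scheduler combines the two delays is not described in the commit, so they are returned separately.

```python
from __future__ import annotations

import os
import random


def resolve_seconds(name, default, env=None):
    """Read a non-negative integer number of seconds from the environment.

    Falls back to `default` when the variable is unset, empty,
    not an integer, or negative.
    """
    env = os.environ if env is None else env
    try:
        value = int(env.get(name, ""))
    except ValueError:
        return default
    return value if value >= 0 else default


def initial_offset(max_seconds, rng=None):
    """Pick a uniformly random first-fire offset in [0, max_seconds]."""
    if rng is None:
        rng = random.Random()
    return rng.uniform(0, max_seconds)


def startup_plan(env=None, rng=None):
    """Resolve both knobs; defaults (60, 900) are the ones in the commit."""
    grace = resolve_seconds("SCHEDULER_STARTUP_GRACE_SECONDS", 60, env)
    offset_max = resolve_seconds(
        "SCHEDULER_BQ_METADATA_INITIAL_OFFSET_MAX_SECONDS", 900, env
    )
    return {
        "first_tick_delay": grace,
        "bq_metadata_first_fire_offset": initial_offset(offset_max, rng),
    }
```

Passing `env` and `rng` explicitly keeps the resolution testable without touching the process environment or relying on real randomness.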

TESTS:
  - test_bq_metadata_cache_repo: entity_type + known_columns round-trip
  - test_v2_catalog_remote_metadata: where_examples validation, views
    return null rows/size_bytes, cold rows have empty examples
  - test_api_query_guardrail: VIEW-aware suggestion text + view_targets
  - test_connectors_bigquery_metadata: entity_type lookup mock + new
    fields in TableMetadata expectations
  - test_scheduler_sidecar: grace + jitter env-var resolution
"""Shared data shapes for source-agnostic table-metadata providers.

Lives under `app/api/` because the primary consumer is
`app/api/v2_catalog.py`. Connector-side providers in `connectors/<source>/`
import upward into this module; the inverse layering would force
`v2_catalog.py` to depend on `connectors/__init__.py`, which is the
wrong direction.
"""

from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataRequest:
    """Narrow input passed to a metadata provider's `fetch()`.

    `bucket` and `source_table` are pre-validated by the dispatcher
    (`validate_quoted_identifier`) before construction, so the provider
    can interpolate them into SQL/URL paths without re-checking. Frozen
    so the (provider, request)-keyed cache lookup is stable.
    """
    table_id: str
    bucket: str
    source_table: str


@dataclass
class TableMetadata:
    """Source-agnostic metadata bundle. Every field optional: providers
    fill what they can cheaply get; callers tolerate `None`. Adding a new
    field here is a non-breaking change: existing CLI consumers don't
    even render `rough_size_hint` (verified `grep -rn rough_size_hint cli/`
    is empty), let alone the new fields.

    ``entity_type`` for BigQuery mirrors INFORMATION_SCHEMA.TABLES.table_type
    (``BASE TABLE`` / ``VIEW`` / ``MATERIALIZED VIEW`` / ``EXTERNAL`` /
    ``SNAPSHOT`` / ``CLONE``). Catalog uses it to hide misleading
    ``rows=0, size_bytes=0`` for VIEWs (which __TABLES__ reports as zero)
    and to inject a "LIMIT doesn't push into view body" hint into
    cost-guard errors when a remote query targets a VIEW.

    ``known_columns`` is the list of column names from the same refresh
    that populated this row. Catalog endpoint filters generic
    ``where_examples`` templates against this list, dropping example
    predicates that reference columns the table doesn't have.
    """
    rows: int | None = None
    size_bytes: int | None = None
    partition_by: str | None = None
    clustered_by: list[str] | None = None
    entity_type: str | None = None
    known_columns: list[str] | None = None
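
The docstring's point about `MetadataRequest` being frozen can be illustrated with a small standalone sketch. The class is redeclared with the same fields so the snippet runs on its own; the `cache` dict, the provider name `"bigquery"`, and the sample identifiers are hypothetical and stand in for whatever cache the dispatcher actually uses.

```python
from __future__ import annotations

from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class MetadataRequest:
    # Same fields as the module above, repeated so this runs standalone.
    table_id: str
    bucket: str
    source_table: str


# Hypothetical in-memory cache keyed by (provider name, request).
cache: dict[tuple[str, MetadataRequest], dict] = {}

first = MetadataRequest("orders", "sales", "orders_v1")
cache[("bigquery", first)] = {"rows": 1200}

# An equal request built later hashes to the same key, so the lookup hits.
again = MetadataRequest("orders", "sales", "orders_v1")
hit = cache.get(("bigquery", again))

# Mutation is rejected, so a key cannot drift after it has been stored.
try:
    first.bucket = "other"  # type: ignore[misc]
    mutated = True
except FrozenInstanceError:
    mutated = False
```

A non-frozen dataclass with the default `eq=True` is unhashable, so it could not be used as a dict key at all; `frozen=True` gives both hashability and immutability.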