* docs(spec): #134 unify BigQuery access behind BqAccess facade Brainstorm output for issue #134. Captures: - root cause (incl. correction of the issue's hypothesis about commit 33a9964) - BqAccess facade API + project resolution rules - error contract — typed BqAccessError mapped to HTTP 502 for upstream BQ failures, 500 for deployment/config bugs - migration plan for v2_scan, v2_sample, RemoteQueryEngine - test rewrite eliminating _bq_client_factory injection point - E2E verification protocol on agnes-development as success criterion * docs(spec): #134 revise after first review Incorporates code-reviewer findings: Must-fix: - Add v2_schema (2 copies of INSTALL/LOAD/SECRET dance) to migration scope. - Reframe v2_scan headline: missing try/except around BQ calls is the actual cause of bare 500s, not project resolution (which 33a9964 fixed). - List two more deferred call sites (extractor.py, register_bq_table) with explicit rationale. Important: - Drop billing != data clause from cross_project_forbidden heuristic; rely only on 'serviceusage' substring. billing != data is normal for cross-project setup, was over-classifying. - Split bq_bad_request into _user (400) and _server (502) variants; add sql_origin parameter to translate_bq_error so call sites declare whether SQL contains user input. - Add @functools.cache to BqAccess.from_config; document tests bypass via dependency_overrides. - Replace monkey-patched-classmethod test pattern with BqAccess(client_factory=...) injection at construction time. Cleaner than today's _bq_client_factory and 1:1 migration shape. - Keep BqProjects.data (reviewer assumed registry has source_project; it doesn't). Multi-project explicitly listed as non-goal with note. Nice-to-have: - Add 'Implementation strategy' section: 2 staged commits (bug fix alone is revertable; refactor follows). - Extend E2E protocol to cover all three endpoints, not just /sample. - Note removal of stale docstring at src/remote_query.py:204. * docs(spec): #134 revision 3 — incorporates second-round review Must-fix from second review: - v2_schema split into two migration cases: _fetch_bq_schema translates errors via translate_bq_error; _fetch_bq_table_options preserves its swallow-all 'except Exception → return {}' so /schema doesn't 502 on partition-info failures. - RemoteQueryEngine.__init__ now resolves BqAccess lazily (in _get_bq_client, not in __init__). Without this, ~7 DuckDB-only tests in test_remote_query.py would suddenly fail with not_configured. - translate_bq_error pass-through for BqAccessError is now load-bearing (clause 1, before any Google-API branch). bq.client() raises BqAccessError for bq_lib_missing/auth_failed; without explicit pass-through those fall to 'unknown' and re-raise as bare 500. - Commit 1 now emits the SAME structured response shape as commit 2 to avoid contract churn between commits. - BIGQUERY_PROJECT env-var precedence is BREAKING for env-only deployments — flagged in CHANGELOG ### Changed. Editorial: - sql_origin renamed to bad_request_status with values 'client_error' / 'upstream_error' (clearer about what the parameter actually decides). bq_bad_request_user/_server kinds collapsed to bq_bad_request (400) and bq_upstream_error (502). - CLI (cli/commands/query.py) noted as external RemoteQueryEngine caller; unaffected because new bq_access kwarg has default None. - Added unit/integration tests for the new contracts: test_translate_passes_through_BqAccessError, test_v2_scan_returns_500_on_bq_lib_missing, test_v2_schema_returns_200_with_empty_partition_on_bq_failure, test_resolve_succeeds_after_config_set. - E2E protocol now covers /schema as the fourth endpoint. - Documented functools.cache-doesn't-cache-exceptions semantics and fixture nullcontext-doesn't-close caveat for nested sessions. * docs(spec): #134 revision 4 — incorporates third-round review Third reviewer verdict: 'implementation-ready with two trivial edits'; explicitly noted prior rounds did the heavy lifting. Edits: 1. get_bq_access() module-level function instead of @classmethod @functools.cache from_config. Removes the classmethod-cache stacking footgun (different Python versions wrap differently) and gives FastAPI's dependency introspection a clean function signature. Drops the 'Do not subclass BqAccess' caveat that no longer applies. 2. Commit 1 strategy explicitly: wrap _fetch_bq_sample (v2_sample), _bq_dry_run_bytes + _run_bq_scan (v2_scan), and _fetch_bq_schema (v2_schema strict block). Do NOT touch _fetch_bq_table_options swallow-all in commit 1 — preserved as-is, then migrated (still preserved) in commit 2. All three endpoints emit the same structured body shape so client parsers see one consistent contract throughout the staged rollout. No more half-rolled-out window where /sample is bare 500 while /scan is structured 502. * docs(plan): #134 implementation plan — Phase 1 (atomic bug fix) + Phase 2 (BqAccess refactor) + Phase 3 (verification) Bite-sized TDD tasks. 3 phases, 16 tasks total: Phase 1 (Commit 1) — atomic bug fix across all four v2 endpoints: Tasks 1.1-1.5 wrap _fetch_bq_sample, _bq_dry_run_bytes, _run_bq_scan, _fetch_bq_schema with structured 502/400 try/except. _fetch_bq_table_options preserved untouched. CHANGELOG Fixed entries. Phase 2 (Commit 2) — BqAccess facade extraction + migration: Tasks 2.1-2.5 build connectors/bigquery/access.py bottom-up (BqProjects, BqAccessError, translate_bq_error, default factories, BqAccess class, get_bq_access module-level cached). Task 2.6 adds conftest.py fixture. Tasks 2.7-2.9 migrate v2_scan, v2_sample, v2_schema to BqAccess. Tasks 2.10-2.11 migrate RemoteQueryEngine + tests (lazy bq_access, drop _bq_client_factory). Task 2.12 CHANGELOG Changed BREAKING + Internal. Phase 3 — Verification: 3.1 full pytest. 3.2 squash into two PR-shape commits. 3.3 manual E2E on agnes-development per spec protocol → close #134. Self-review table maps spec sections to implementing tasks; no gaps. * fix(v2): #134 structured 502/400 on BQ errors across /scan, /scan/estimate, /sample, /schema Wraps the BigQuery call sites in v2_scan, v2_sample, and v2_schema (strict block only) with try/except for google.api_core exceptions, translating to HTTPException with a structured body shape: {error, message, details}. Fixes Pavel's report (#134) where these endpoints returned bare HTTP 500 with no body when the SA on agnes-development hit cross-project Forbidden on serviceusage.services.use. Also fixes /sample's missing billing_project fallback (the bug 33a9964 fixed for /scan never landed here). Status code split: - /scan, /scan/estimate: BadRequest -> 400 (bq_bad_request) since SQL is user-derived from req.select/where/order_by. - /sample, /schema: BadRequest -> 502 (bq_upstream_error) since SQL is server-constructed from validated identifiers. - All Forbidden -> 502 with cross_project_forbidden if 'serviceusage' in error message (with hint pointing at data_source.bigquery.billing_project), else bq_forbidden. Body shape matches what the upcoming BqAccess refactor (next commit) will produce, so client-side parsers see one consistent contract throughout the staged rollout. _fetch_bq_table_options preserved exactly as-is — its swallow-all-and-return-empty contract is intentional and survives into the refactor; /schema continues to return 200 with empty partition info when partition queries fail. Outer wraps in scan_endpoint, scan_estimate_endpoint, sample, and schema endpoints exist only to make the test pattern (monkeypatching whole _fetch_* functions) work, and are tagged TODO(#134 Phase 2) for removal once BqAccess centralizes translation. * refactor(bq): #134 BqAccess facade — unify v2_scan, v2_sample, v2_schema, RemoteQueryEngine Extracts the duplicated BigQuery-access pattern (project resolution + client construction + DuckDB-extension session + Google-API error translation) into connectors/bigquery/access.py. Migrates four call sites to use it: - app/api/v2_scan.py — _bq_dry_run_bytes, _run_bq_scan - app/api/v2_sample.py — _fetch_bq_sample - app/api/v2_schema.py — _fetch_bq_schema (strict translation), _fetch_bq_table_options (preserves swallow-all best-effort contract) - src/remote_query.py — RemoteQueryEngine, lazy bq_access kwarg The new module exposes: - BqProjects (frozen dataclass: billing + data project IDs) - BqAccessError (typed exception with HTTP_STATUS class mapping) - BqAccess (facade with injectable client_factory/duckdb_session_factory for tests; defaults call the real google-cloud-bigquery + DuckDB extension) - get_bq_access (module-level @functools.cache; FastAPI Depends target) - translate_bq_error (Google API exception → BqAccessError mapper, with BqAccessError pass-through, 'serviceusage'-substring heuristic for cross_project_forbidden, and bad_request_status param distinguishing user-derived (400) from server-constructed (502) SQL) - _default_client_factory, _default_duckdb_session_factory RemoteQueryEngine.__init__ no longer accepts _bq_client_factory; tests migrate to bq_access=BqAccess(projects, client_factory=...). DuckDB-only RemoteQueryEngine tests need no changes — bq_access defaults to None and get_bq_access() is only invoked on first BQ call (lazy resolution). BqAccessError raised internally is translated to RemoteQueryError( error_type="bq_error") in _get_bq_client to preserve the engine's existing public contract — CLI and /api/query/hybrid callers see no change. Endpoint tests (test_v2_scan, test_v2_scan_estimate, test_v2_sample, test_v2_schema) migrate from monkey-patching whole _fetch_* functions to using the new bq_access fixture in tests/conftest.py — which exercises the REAL translation path through BqAccess + translate_bq_error, closing the test gap flagged in Task 1.1's review. Side-effect behavior change: v2_sample's FROM clause now uses the data project (instance.yaml data_source.bigquery.project), not the conflated billing_project from Phase 1. Documented in CHANGELOG ### Internal. BREAKING for deployments combining BIGQUERY_PROJECT env var with data_source.bigquery.project in instance.yaml — env var now overrides data project too. See CHANGELOG ### Changed. Two known-duplicate BQ-access sites (connectors/bigquery/extractor.py, scripts/duckdb_manager.register_bq_table) explicitly out of scope; tracked as follow-up. Removed stale docstring at the previous src/remote_query.py:204 that referenced scripts.duckdb_manager._create_bq_client as the default BQ client factory (RemoteQueryEngine never actually used that function). Test counts: tests/test_bq_access.py +27 (new), tests/test_v2_*.py + tests/test_remote_query.py migrated to bq_access fixture (counts unchanged or +1-2 per file). Full suite: 2086 passed, 8 pre-existing failures (DB migration tests with unrelated internal_roles DependencyException — not introduced by this PR). * fix(bq_access): translate DefaultCredentialsError to BqAccessError(auth_failed) CI on PR #138 caught: bigquery.Client(...) resolves Application Default Credentials at construction time; without ADC (CI without SA key, dev laptop without 'gcloud auth application-default login') it raises google.auth.exceptions.DefaultCredentialsError synchronously. Pre-fix _default_client_factory only caught ImportError, so DefaultCredentialsError propagated as raw exception — and from production endpoints would surface as bare 500 (the exact failure mode #134 sets out to fix). Now translates to BqAccessError(kind='auth_failed', details.hint='Run gcloud auth application-default login...'). Endpoint catch chain returns HTTP 502 with structured body. Adds unit test test_raises_auth_failed_on_default_credentials_error. Third-round spec review flagged this case in passing; the fix didn't land. CI's auth-less environment surfaced it. * fix(bq_access): get_bq_access() returns sentinel instead of raising when not configured Devin BUG_0001 on PR #138 review: 'get_bq_access() as FastAPI Depends breaks all v2 endpoints for non-BigQuery instances'. Pre-fix: get_bq_access() raised BqAccessError(not_configured) when neither BIGQUERY_PROJECT env nor data_source.bigquery.project was set. Because FastAPI resolves Depends() BEFORE the endpoint body runs, this exception fires during dep-injection — the endpoint's try/except BqAccessError clause never gets a chance to catch it. Result: every v2 request on Keboola-only or CSV-only instances returned bare HTTP 500, even for local-source tables that never touch BigQuery. Fix: get_bq_access() now returns a sentinel BqAccess with empty BqProjects and factories that raise BqAccessError(not_configured) on actual use. Construction succeeds, FastAPI's dep-injection cleanly yields the sentinel, the endpoint runs. The local-source code path in build_sample / build_schema / etc. never calls bq.client() or bq.duckdb_session() (it reads parquet directly), so non-BQ tables return 200 as before. Only when an endpoint actually tries to query BQ (source_type == 'bigquery') does the sentinel raise — and the endpoint's existing except BqAccessError catches it normally, returning structured 502 with hint. Test get_bq_access::test_raises_not_configured_when_neither_set renamed and rewritten to test_returns_sentinel_when_neither_set: asserts BqAccess is returned, then asserts client() and duckdb_session() each raise BqAccessError(not_configured) on call. Test test_does_not_cache_exceptions removed (no longer applicable) and replaced with test_sentinel_is_cached_per_process documenting the operator-restart-on-config-change contract. * docs(spec+plan): #134 genericize customer-specific tokens (CLAUDE.md OSS rule) Devin BUG_0001/0002 round 3 on PR #138: spec and plan docs contained customer-specific deployment hostnames, deployment names, and a GCP project ID that violated CLAUDE.md's vendor-agnostic OSS rule ('Nothing customer-specific belongs in code, configuration defaults, comments, docs, commit messages, PR titles, or PR bodies'). Replacements: agnes-development.groupondev.com -> <your-agnes-host> agnes-development -> <your-dev-instance> prj-grp-dataview-prod-1ff9 -> <your-data-project> s1_session_landings -> <bq_table_id> E2E verification semantics unchanged — operators still run the same four curls + config flip + retry, just substituting their own host / deployment name / project / table. * fix(bq_access): hook get_bq_access.cache_clear into instance_config.reset_cache Devin ANALYSIS_0004 on PR #138: get_bq_access is @functools.cache'd at process level, so it captures BigQuery project IDs at first call and ignores subsequent instance.yaml changes. Pre-Phase-2 the v2 endpoints re-read get_value() on every request, so admin /api/admin/server-config saves (which call instance_config.reset_cache()) hot-reloaded the BQ project. Without this fix, my refactor silently regresses that contract — operators editing instance.yaml via the admin UI would see no effect on v2 endpoints until container restart. instance_config.reset_cache() now also calls connectors.bigquery.access.get_bq_access.cache_clear() (lazy import, swallowed if connectors module isn't loaded — keeps instance_config usable in isolated unit tests). Adds test_instance_config_reset_cache_invalidates_get_bq_access as regression guard. Updates CHANGELOG Internal entry to mention the hot-reload contract + the not-configured sentinel behavior (round-3 fix from Devin BUG_0001 was previously only in commit message). * fix(bq_access): surface not_configured before identifier validation + plan path genericize Devin BUG_0001 + BUG_0002 round 5 on PR #138. BUG_0001 (plan doc): personal filesystem path violated CLAUDE.md vendor-agnostic rule. Replaced with '<worktree-root>' placeholder. BUG_0002 (sentinel error path): when get_bq_access() returns the sentinel BqAccess (BQ not configured), the empty bq.projects.data was reaching validate_quoted_identifier first and raising ValueError -> endpoint mapped to HTTP 400 'unsafe_identifier' instead of structured 500 'not_configured' with hint. Each fetch helper now checks 'if not bq.projects.data: bq.client()' as the first step, which triggers the sentinel's BqAccessError(not_configured). Endpoint catches the typed error and returns HTTP 500 with hint pointing at data_source.bigquery.project. Best-effort _fetch_bq_table_options returns {} silently in this case (preserves the swallow-all contract). * fix(bq_access): classify DuckDB-native exceptions from bigquery_query() via string match Devin ANALYSIS on PR #138 review (latest round). The DuckDB bigquery extension is a C++ plugin making its own HTTP calls — when BQ returns 403, it throws duckdb.IOException with the BQ error embedded as text, not gax.Forbidden. translate_bq_error's isinstance checks would miss these, falling to case 7 → bare 500 in production for v2_scan, v2_sample, and v2_schema (the bigquery_query() paths). Fix: last-resort string-match heuristic before the re-raise. 'Forbidden' / '403' / 'Bad Request' / '400' in the lowercased message classifies via the same kind hierarchy. The 'serviceusage' substring still distinguishes cross_project_forbidden from bq_forbidden. Specific enough that random exceptions without HTTP-error keywords still re-raise. Adds 4 unit tests covering the new heuristic + the 'don't swallow random exceptions' invariant. * chore(release): cut 0.22.0 PR #138 contains issue #134 user-visible behavior changes: - BREAKING: BIGQUERY_PROJECT env var now overrides instance.yaml data_source.bigquery.project for v2 endpoints (previously RemoteQueryEngine billing only). - Fixed: structured 502/400 on /api/v2/sample, /scan, /scan/estimate, /schema when BigQuery raises Forbidden/BadRequest (was bare 500). - Internal: BqAccess facade refactor unifying four duplicate BQ-access call sites; instance_config.reset_cache() now invalidates BqAccess cache too so admin server-config saves hot-reload BQ project IDs. Bumps to 0.22.0 because PR #137 merged first and took 0.21.0.
604 lines
39 KiB
Markdown
604 lines
39 KiB
Markdown
# Issue #134 — Unify BigQuery access behind `BqAccess`, fix v2_sample + 502 contract
|
||
|
||
**Date:** 2026-04-29 (revision 4 — incorporates three rounds of code review)
|
||
**Issue:** [keboola/agnes-the-ai-analyst#134](https://github.com/keboola/agnes-the-ai-analyst/issues/134)
|
||
**Branch:** `fix/134-bq-access-unify`
|
||
|
||
## Problem
|
||
|
||
Issue #134 reports that v2 BigQuery endpoints on `<your-dev-instance>` (0.18.0) return HTTP 500 with no body:
|
||
|
||
- `POST /api/v2/scan/estimate`
|
||
- `POST /api/v2/scan`
|
||
- `GET /api/v2/sample/{table_id}`
|
||
|
||
`POST /api/query/hybrid` against the same tables works correctly.
|
||
|
||
### Root cause analysis
|
||
|
||
Two distinct bugs cause the symptom:
|
||
|
||
#### Bug A — `v2_scan.py` has correct project resolution but no error translation
|
||
|
||
Commit `33a9964` added the `billing_project` parameter to `app/api/v2_scan.py`. Today both endpoints (`scan_endpoint` line 385, `scan_estimate_endpoint` line 221) read `data_source.bigquery.billing_project` from `instance.yaml`, fall back to `project`, and pass it to the BQ client constructor.
|
||
|
||
**However, `_bq_dry_run_bytes` (lines 43-55) and `_run_bq_scan` (lines 266-282) have no `try/except` at all.** The endpoint-level handlers in `scan_endpoint` and `scan_estimate_endpoint` catch `WhereValidationError`, `QuotaExceededError`, `FileNotFoundError`, `PermissionError`, `ValueError` — but `google.api_core.exceptions.Forbidden` and `BadRequest` propagate as bare HTTP 500 with no body.
|
||
|
||
**This is the headline cause of the v2_scan/estimate 500s.** When the SA on `<your-dev-instance>` lacks `serviceusage.services.use` on whatever project resolves to billing, BQ raises `Forbidden`, which propagates uncaught. The config fix landed in `33a9964`; the error translation didn't.
|
||
|
||
#### Bug B — `v2_sample.py` is missing the billing_project split entirely
|
||
|
||
`app/api/v2_sample.py:104` reads only `data_source.bigquery.project`, never `billing_project`. It then passes that project to `bigquery_query()` as the billing target. This is the same bug `33a9964` fixed in `v2_scan.py`, just in a sibling file that didn't get the same patch.
|
||
|
||
`v2_sample.py` also has no structured error handling — it catches only `FileNotFoundError` (404) and `PermissionError` (403). Anything else (Google API errors, identifier `ValueError`, `ImportError`) bubbles up as bare HTTP 500.
|
||
|
||
#### Bug C — `v2_schema.py` has the same shape (one strict block, one best-effort block)
|
||
|
||
`app/api/v2_schema.py` contains **two separate blocks** of the INSTALL/LOAD/SECRET/`bigquery_query()` dance with **different error semantics**:
|
||
|
||
- `_fetch_bq_schema` (lines 48-73): hard-required for the endpoint to return a schema. Today swallows nothing; would benefit from structured translation.
|
||
- `_fetch_bq_table_options` (lines 90-129): best-effort partition/cluster info. Wraps everything in `try/except Exception → return {}` and a `logger.warning`. Endpoint returns successfully even if partition info fails.
|
||
|
||
Schema reportedly works for Pavel today because both queries hit `INFORMATION_SCHEMA`, which doesn't trip the `serviceusage` permission check on the billing project. But the same code path is one query change away from the same 500.
|
||
|
||
#### Operator-config dimension (out of scope for code fix)
|
||
|
||
If `instance.yaml` on `<your-dev-instance>` does not set `data_source.bigquery.billing_project`, the fallback `or project_id` puts the call right back in the broken state. The fix surfaces this to the operator via a structured error body containing a `hint` pointing at the missing config key.
|
||
|
||
### Five+ duplicate code paths today
|
||
|
||
The BQ-access pattern is duplicated across:
|
||
|
||
| File | Function | Shape | Status in this PR |
|
||
|---|---|---|---|
|
||
| `app/api/v2_scan.py` | `_bq_dry_run_bytes`, `_run_bq_scan` | `bigquery.Client` (Python SDK) | **In scope** |
|
||
| `app/api/v2_sample.py` | `_fetch_bq_sample` | DuckDB `bigquery_query()` | **In scope** |
|
||
| `app/api/v2_schema.py` | `_fetch_bq_schema` | DuckDB `bigquery_query()` | **In scope (strict translation)** |
|
||
| `app/api/v2_schema.py` | `_fetch_bq_table_options` | DuckDB `bigquery_query()` | **In scope (preserve swallow-all `except Exception → {}`)** |
|
||
| `src/remote_query.py` | `RemoteQueryEngine._get_bq_client` | `bigquery.Client` (Python SDK) | **In scope** |
|
||
| `connectors/bigquery/extractor.py` | sync-time extractor | mixed (`ATTACH 'project=...'` + `bigquery_query()`) | **Deferred** (see below) |
|
||
| `scripts/duckdb_manager.py` | `register_bq_table` | `bigquery.Client` (Python SDK) | **Deferred** (see below) |
|
||
|
||
**Deferred sites — rationale:**
|
||
|
||
- **`extractor.py`** runs at sync time, async, behind the scheduler. Errors surface in logs / `sync_history`, not as HTTP responses. Different lifecycle, different control flow (uses `ATTACH` not `bigquery.Client.query`). Migrating it doubles the PR size for benefit not in #134's scope. Track as follow-up issue.
|
||
- **`register_bq_table`** is admin-only, runs once at table registration time (M1 from #108). Its project resolution is `bq_project or BIGQUERY_PROJECT env` — no `instance.yaml` fallback, different semantics from the runtime path. Different concern. Track as follow-up issue.
|
||
|
||
The stale docstring at `src/remote_query.py:204` claims `_bq_client_factory` defaults to `scripts.duckdb_manager._create_bq_client`. It doesn't — `_get_bq_client` constructs `_bq_module.Client(project=project)` inline at line 450. Will self-correct when `_bq_client_factory` is removed.
|
||
|
||
## Goals
|
||
|
||
1. **Fix the v2_sample billing_project bug** so cross-project BQ reads work when the operator sets `billing_project`.
|
||
2. **Fix the v2_scan/estimate error translation** so cross-project Forbidden surfaces as a structured 502 instead of bare 500.
|
||
3. **Translate Google API errors into structured responses** with actionable bodies across all three v2 endpoints (the user / CLI gets an error shape they can reason about, not bare 500).
|
||
4. **Eliminate four duplicate BQ-access call sites** behind a single facade so the fix lives in one place. (Two deferred — see above.)
|
||
5. **Preserve test invasiveness** — existing tests that mock the BQ client must remain straightforward to write, ideally cleaner than today's `_bq_client_factory` injection point.
|
||
|
||
## Non-goals
|
||
|
||
- Changing the operator-facing config schema. `data_source.bigquery.billing_project` already exists; we route everything through it.
|
||
- Auto-detecting cross-project misconfiguration at startup (rejected for scope; would require a real BQ call at boot).
|
||
- Touching the `/api/query/hybrid` endpoint behavior — `RemoteQueryEngine` internals change, but the HTTP contract does not.
|
||
- Migrating `extractor.py` or `register_bq_table` to `BqAccess` (deferred — see above).
|
||
- Adding per-table multi-project support. Today's `table_registry` schema has no `source_project` column; every BQ table uses `instance.yaml`'s `project` as the data project. Future multi-project is a separate feature; spec notes the constraint so it can be lifted cleanly later.
|
||
- Schema migration / data migration of any kind.
|
||
- Hot-reload of `instance.yaml`. `get_bq_access()` is `@functools.cache`'d at process level. Changing config requires a container restart. (Today's behavior is identical — `instance.yaml` is loaded once at boot.)
|
||
|
||
## Design
|
||
|
||
### Architecture — new module `connectors/bigquery/access.py`
|
||
|
||
```
|
||
connectors/bigquery/
|
||
├── auth.py (existing — get_metadata_token, unchanged)
|
||
├── extractor.py (existing — unchanged in this PR)
|
||
└── access.py (NEW)
|
||
├── BqProjects (frozen dataclass)
|
||
├── BqAccessError (typed exception with HTTP_STATUS class mapping)
|
||
├── BqAccess (facade with injectable factories)
|
||
├── get_bq_access() -> BqAccess (module-level, @functools.cache; FastAPI Depends target)
|
||
├── translate_bq_error(e, projects, *, bad_request_status) -> BqAccessError
|
||
├── _default_client_factory (real bigquery.Client construction)
|
||
└── _default_duckdb_session_factory (real INSTALL/LOAD/SECRET dance)
|
||
```
|
||
|
||
### `BqAccess` public API
|
||
|
||
```python
|
||
@dataclass(frozen=True)
|
||
class BqProjects:
|
||
billing: str # billing/quota target — used as `project=` and `quota_project_id=`
|
||
data: str # data project for FROM clauses (today: instance.yaml `project`).
|
||
# Note: locked to a single project per instance until table_registry
|
||
# grows a per-table source_project column. See "Non-goals".
|
||
|
||
|
||
class BqAccessError(Exception):
|
||
HTTP_STATUS = {
|
||
"not_configured": 500, # admin/config bug — page on-call
|
||
"bq_lib_missing": 500, # deployment bug
|
||
"auth_failed": 502, # GCP metadata server unreachable
|
||
"cross_project_forbidden": 502, # SA lacks serviceusage.services.use on billing project
|
||
"bq_forbidden": 502, # other Forbidden from BQ
|
||
"bq_bad_request": 400, # 400 from BQ when caller flagged it as client-derived
|
||
"bq_upstream_error": 502, # all other upstream BQ failures (incl. server-derived BadRequest)
|
||
}
|
||
|
||
def __init__(self, kind: str, message: str, details: dict | None = None):
|
||
self.kind = kind
|
||
self.message = message
|
||
self.details = details or {}
|
||
super().__init__(message)
|
||
|
||
|
||
class BqAccess:
|
||
"""Single entry point for BigQuery access — config resolution, client construction,
|
||
DuckDB-extension session, and error translation. Stateless after construction.
|
||
|
||
Factories are injectable for tests:
|
||
bq = BqAccess(
|
||
BqProjects(billing="test-billing", data="test-data"),
|
||
client_factory=lambda projects: mock_client,
|
||
)
|
||
"""
|
||
|
||
def __init__(
|
||
self,
|
||
projects: BqProjects,
|
||
*,
|
||
client_factory: Callable[[BqProjects], "bigquery.Client"] | None = None,
|
||
duckdb_session_factory: Callable[[BqProjects], "AbstractContextManager"] | None = None,
|
||
):
|
||
self._projects = projects
|
||
self._client_factory = client_factory or _default_client_factory
|
||
self._duckdb_session_factory = duckdb_session_factory or _default_duckdb_session_factory
|
||
|
||
@property
|
||
def projects(self) -> BqProjects: ...
|
||
|
||
def client(self) -> "bigquery.Client":
|
||
"""Construct (or retrieve from injected factory) a BigQuery client billed to
|
||
projects.billing. Raises BqAccessError(kind='bq_lib_missing') if
|
||
google-cloud-bigquery is not installed; raises BqAccessError(kind='auth_failed')
|
||
on credential resolution failure."""
|
||
|
||
@contextmanager
|
||
def duckdb_session(self) -> Iterator["duckdb.DuckDBPyConnection"]:
|
||
"""Yield in-memory DuckDB conn with bigquery extension loaded + SECRET set
|
||
from get_metadata_token(). Auto-cleanup. Translates INSTALL/LOAD/SECRET failures
|
||
and metadata-token failures to BqAccessError(kind='auth_failed' or 'bq_lib_missing')."""
|
||
|
||
|
||
@functools.cache
|
||
def get_bq_access() -> "BqAccess":
|
||
"""Module-level factory used as the FastAPI Depends target.
|
||
|
||
Resolves projects from BIGQUERY_PROJECT env → instance.yaml billing_project →
|
||
instance.yaml project. Returns a BqAccess instance with default factories.
|
||
|
||
Process-cached: config is loaded at boot and doesn't change at runtime. Hot-reload
|
||
of instance.yaml is explicitly out of scope. Note that functools.cache does NOT
|
||
cache exceptions, so a failed call (BqAccessError(kind='not_configured')) is retried
|
||
on the next invocation — useful when the operator fixes config and restarts a
|
||
request without restarting the container.
|
||
|
||
Tests inject via FastAPI's dependency_overrides[get_bq_access] = lambda: bq, or
|
||
construct BqAccess(...) directly for non-endpoint code (e.g. RemoteQueryEngine).
|
||
Tests do NOT mutate this cache.
|
||
|
||
Module-level (not BqAccess.from_config classmethod) to avoid the
|
||
@classmethod + @functools.cache stacking footgun and to give FastAPI's
|
||
dependency introspection a clean function signature."""
|
||
|
||
|
||
def translate_bq_error(
|
||
e: Exception,
|
||
projects: BqProjects,
|
||
*,
|
||
bad_request_status: Literal["client_error", "upstream_error"],
|
||
) -> BqAccessError:
|
||
"""Convert Google API exceptions into a typed BqAccessError.
|
||
|
||
Mapping (FIRST match wins):
|
||
1. BqAccessError -> pass through unchanged (this is critical:
|
||
bq.client() and bq.duckdb_session() can raise
|
||
BqAccessError directly for bq_lib_missing /
|
||
auth_failed, and those must round-trip through
|
||
translate_bq_error without being reclassified
|
||
as 'unknown' and re-raised).
|
||
2. Forbidden + 'serviceusage' in str(e).lower()
|
||
-> cross_project_forbidden (with hint)
|
||
3. Forbidden -> bq_forbidden
|
||
4. BadRequest, bad_request_status='client_error'
|
||
-> bq_bad_request (HTTP 400)
|
||
5. BadRequest, bad_request_status='upstream_error'
|
||
-> bq_upstream_error (HTTP 502)
|
||
6. GoogleAPICallError (other) -> bq_upstream_error
|
||
7. Anything else -> RE-RAISED unchanged (don't swallow programmer errors)
|
||
|
||
`bad_request_status` MUST be supplied by the caller. It distinguishes:
|
||
- 'client_error': SQL contains user input (select/where/order_by/limit). BQ rejecting
|
||
it is plausibly the user's fault — return 400 with the BQ message.
|
||
- 'upstream_error': SQL is fully built server-side from validated identifiers.
|
||
BQ rejecting it is server-side corruption — return 502."""
|
||
```
|
||
|
||
### Project resolution rules (single source of truth)
|
||
|
||
`get_bq_access` resolves projects in this order (matching today's `RemoteQueryEngine._get_bq_client` behavior):
|
||
|
||
1. `BIGQUERY_PROJECT` env var → if set, used as **both** billing and data (legacy override).
|
||
2. `data_source.bigquery.billing_project` from `instance.yaml` → billing.
|
||
3. `data_source.bigquery.project` from `instance.yaml` → data, and billing if (2) is unset.
|
||
|
||
If neither (1) nor (3) yields a value: `BqAccessError(kind='not_configured', details={"hint": "set data_source.bigquery.project in instance.yaml"})`.
|
||
|
||
> **BREAKING for env-only deployments.** Today `RemoteQueryEngine._get_bq_client` uses `BIGQUERY_PROJECT` *only* as the billing project; data project for FROM clauses comes from elsewhere (e.g. SQL strings the user wrote). After this change, `BIGQUERY_PROJECT` sets both billing and data, **overriding** `data_source.bigquery.project` for FROM-clause construction in `v2_scan` / `v2_sample` / `v2_schema`. Deployments that combine env-var-for-billing + yaml-for-data must migrate by setting `data_source.bigquery.billing_project` in `instance.yaml` and clearing the env var. Flagged in CHANGELOG.
|
||
|
||
### Cross-project Forbidden detection
|
||
|
||
`'serviceusage' in str(e).lower()` is the only reliable signal of the cross-project quota issue. Other cross-project failures (revoked SA, table-level ACL, dataset-level deny) degrade to generic `bq_forbidden` — still 502, still structured body, just less specific in the hint. `billing != data` is the **normal** cross-project setup, not a signal of failure; using it to classify would over-trigger and misdirect operators.
|
||
|
||
`billing != data` MAY enrich `details.hint` ("Note: billing and data projects differ — verify SA has both BQ Read on data project AND serviceusage.services.use on billing project") but does not alter `kind`.
|
||
|
||
### Status code mapping rationale
|
||
|
||
- **400** for `bq_bad_request` — BQ rejecting a SQL string built from user input is plausibly the user's fault. Returns it back to the user as a 4xx with the BQ message.
|
||
- **502** for `bq_forbidden`, `cross_project_forbidden`, `bq_upstream_error`, `auth_failed` — upstream BQ refused or was unreachable. Operationally distinguishable from 500 in dashboards: "integration with BQ broken" vs "Agnes itself broken".
|
||
- **500** for `not_configured`, `bq_lib_missing` — deployment/admin-config bugs that should page on-call, not transient upstream errors.
|
||
|
||
### Migration of four call sites
|
||
|
||
#### A. `app/api/v2_scan.py`
|
||
|
||
`_bq_dry_run_bytes` and `_run_bq_scan` change signature from `(project: str, sql: str)` to `(bq: BqAccess, sql: str)`. Body (note: `bq.client()` is called *outside* the `try/except` so a `BqAccessError` from it still propagates correctly through the endpoint's `except BqAccessError` clause, even before the translator's pass-through would catch it):
|
||
|
||
```python
|
||
def _bq_dry_run_bytes(bq: BqAccess, sql: str) -> int:
|
||
from google.cloud import bigquery
|
||
client = bq.client() # may raise BqAccessError(bq_lib_missing/auth_failed); propagates as-is
|
||
try:
|
||
job = client.query(
|
||
sql, job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False),
|
||
)
|
||
return int(job.total_bytes_processed or 0)
|
||
except Exception as e:
|
||
raise translate_bq_error(e, bq.projects, bad_request_status="client_error")
|
||
```
|
||
|
||
`_run_bq_scan` mirrors the same shape with `bad_request_status="client_error"` (SQL contains user's `select`/`where`/`order_by`).
|
||
|
||
`scan_endpoint`, `scan_estimate_endpoint`, `estimate`, and `run_scan` lose the `project_id` and `billing_project` parameters in favor of `bq: BqAccess`. Endpoints inject via FastAPI:
|
||
|
||
```python
|
||
@router.post("/scan")
|
||
async def scan_endpoint(
|
||
raw: dict,
|
||
user: dict = Depends(get_current_user),
|
||
conn: duckdb.DuckDBPyConnection = Depends(_get_db),
|
||
bq: BqAccess = Depends(get_bq_access),
|
||
):
|
||
try:
|
||
ipc = run_scan(conn, user, raw, bq=bq, quota=...)
|
||
return Response(...)
|
||
except WhereValidationError as e:
|
||
raise HTTPException(status_code=400, detail=...)
|
||
except QuotaExceededError as e:
|
||
raise HTTPException(status_code=429, detail=...)
|
||
except BqAccessError as e:
|
||
raise HTTPException(
|
||
status_code=BqAccessError.HTTP_STATUS.get(e.kind, 500),
|
||
detail={"error": e.kind, "message": e.message, "details": e.details},
|
||
)
|
||
```
|
||
|
||
`_build_bq_sql(table_row, project_id, req)` keeps `project_id` as a parameter (it's the data project for the FROM clause). Call sites pass `bq.projects.data`. **Forward-compat note** in code comments: when `table_registry` grows a per-table `source_project` column, callers should prefer `table_row.get('source_project') or bq.projects.data`. `tests/test_v2_scan.py:206-207` imports `_build_bq_sql` directly — the signature is unchanged, no test update needed.
|
||
|
||
#### B. `app/api/v2_sample.py`
|
||
|
||
`_fetch_bq_sample` changes signature to `(bq: BqAccess, dataset: str, table: str, n: int)`:
|
||
|
||
```python
|
||
def _fetch_bq_sample(bq: BqAccess, dataset: str, table: str, n: int) -> list[dict]:
|
||
from src.identifier_validation import validate_quoted_identifier
|
||
if not (validate_quoted_identifier(bq.projects.data, "BQ project")
|
||
and validate_quoted_identifier(dataset, "BQ dataset")
|
||
and validate_quoted_identifier(table, "BQ source_table")):
|
||
raise ValueError("unsafe BQ identifier in registry — refusing to query")
|
||
|
||
bq_sql = f"SELECT * FROM `{bq.projects.data}.{dataset}.{table}` LIMIT {int(n)}"
|
||
with bq.duckdb_session() as conn:
|
||
try:
|
||
df = conn.execute(
|
||
"SELECT * FROM bigquery_query(?, ?)",
|
||
[bq.projects.billing, bq_sql],
|
||
).fetchdf()
|
||
return df.to_dict(orient="records")
|
||
except Exception as e:
|
||
raise translate_bq_error(e, bq.projects, bad_request_status="upstream_error")
|
||
```
|
||
|
||
`build_sample` and `sample` endpoint signatures lose `project_id` for `bq: BqAccess = Depends(get_bq_access)`. Endpoint catch chain adds `BqAccessError` and `ValueError → 400 (kind='unsafe_identifier')`.
|
||
|
||
#### C. `app/api/v2_schema.py` — two blocks, two semantics
|
||
|
||
The two blocks have **different error contracts** — preserve them.
|
||
|
||
**`_fetch_bq_schema` (lines 48-73)** — strict; the endpoint can't return without it. Migrate to `bq.duckdb_session()` + `translate_bq_error(..., bad_request_status="upstream_error")`. Endpoint wraps in `try/except BqAccessError` and returns the structured 502.
|
||
|
||
**`_fetch_bq_table_options` (lines 90-129)** — best-effort; current code has `except Exception as e: logger.warning(...); return {}`. **Preserve this contract.** Migrate to use `bq.duckdb_session()` for the connection but keep the outer `try/except Exception → return {}`. Do NOT translate via `translate_bq_error`; the `/schema` endpoint must continue to return successfully with empty partition info when BQ is unreachable, not 502.
|
||
|
||
```python
|
||
def _fetch_bq_table_options(bq: BqAccess, dataset: str, table: str) -> dict:
|
||
from src.identifier_validation import validate_quoted_identifier
|
||
if not (validate_quoted_identifier(bq.projects.data, "BQ project")
|
||
and validate_quoted_identifier(dataset, "BQ dataset")
|
||
and validate_quoted_identifier(table, "BQ source_table")):
|
||
return {}
|
||
|
||
try:
|
||
with bq.duckdb_session() as conn:
|
||
bq_sql = (
|
||
f"SELECT column_name, is_partitioning_column, clustering_ordinal_position "
|
||
f"FROM `{bq.projects.data}.{dataset}.INFORMATION_SCHEMA.COLUMNS` "
|
||
f"WHERE table_name = ? "
|
||
f"ORDER BY clustering_ordinal_position NULLS LAST"
|
||
)
|
||
rows = conn.execute(
|
||
"SELECT * FROM bigquery_query(?, ?, ?)",
|
||
[bq.projects.billing, bq_sql, table],
|
||
).fetchall()
|
||
if not rows:
|
||
return {}
|
||
partition_by = next(
|
||
(r[0] for r in rows if (r[1] or "").upper() == "YES"),
|
||
None,
|
||
)
|
||
clustered_by = [r[0] for r in rows if r[2] is not None]
|
||
return {"partition_by": partition_by, "clustered_by": clustered_by}
|
||
except Exception as e:
|
||
logger.warning("BQ table options fetch failed for %s.%s.%s: %s",
|
||
bq.projects.data, dataset, table, e)
|
||
return {}
|
||
```
|
||
|
||
Endpoint signature gains `bq: BqAccess = Depends(get_bq_access)`.
|
||
|
||
#### D. `src/remote_query.py` — lazy `BqAccess` construction
|
||
|
||
`RemoteQueryEngine.__init__` signature changes:
|
||
|
||
```python
|
||
def __init__(
|
||
self,
|
||
..., # existing args
|
||
bq_access: BqAccess | None = None,
|
||
):
|
||
...
|
||
self._bq = bq_access # may stay None — resolved lazily
|
||
```
|
||
|
||
**Lazy resolution is critical.** Many existing tests in `tests/test_remote_query.py` (lines 106, 148, 196, 417, 520, 529, 538) construct `RemoteQueryEngine(analytics_conn)` for DuckDB-only paths that never touch BQ. After this change, eager `get_bq_access()` at construction would fail those tests with `not_configured` in any environment without `instance.yaml` configured for BQ. Resolve only when actually needed:
|
||
|
||
```python
|
||
def _get_bq_client(self):
|
||
if self._bq is None:
|
||
from connectors.bigquery.access import get_bq_access
|
||
self._bq = get_bq_access() # may raise BqAccessError; that's fine
|
||
return self._bq.client()
|
||
```
|
||
|
||
`_bq_client_factory` parameter, the docstring at line 204 (which referenced the stale `scripts.duckdb_manager._create_bq_client` default), and lines 407-450 of `_get_bq_client` all delete. The fallback chain logic moves to `get_bq_access`.
|
||
|
||
**External caller note:** `cli/commands/query.py:120` constructs `RemoteQueryEngine(conn, **engine_kwargs)`. The new `bq_access` kwarg has default `None`, so the CLI continues to work via the lazy `get_bq_access()` path. No CLI change needed.
|
||
|
||
### Test rewrite
|
||
|
||
The existing `_bq_client_factory` injection point in `RemoteQueryEngine` and tests like `tests/test_remote_query.py` currently look like:
|
||
|
||
```python
|
||
engine = RemoteQueryEngine(_bq_client_factory=lambda project: mock_client)
|
||
```
|
||
|
||
Migrates to direct `BqAccess` injection — no monkey-patching, no classmethod gymnastics:
|
||
|
||
```python
|
||
def test_remote_query_x():
|
||
bq = BqAccess(
|
||
BqProjects(billing="test-billing", data="test-data"),
|
||
client_factory=lambda projects: mock_client,
|
||
)
|
||
engine = RemoteQueryEngine(..., bq_access=bq)
|
||
engine.execute(...)
|
||
```
|
||
|
||
DuckDB-only `RemoteQueryEngine` tests (the ~7 sites listed above) need NO change — `bq_access=None` defaults preserve today's behavior; `get_bq_access()` is never called.
|
||
|
||
For FastAPI endpoint tests (`tests/test_v2_*.py`), use FastAPI's `dependency_overrides`:
|
||
|
||
```python
|
||
def test_v2_scan_x(client):
|
||
bq = BqAccess(
|
||
BqProjects(billing="test-billing", data="test-data"),
|
||
client_factory=lambda projects: mock_client,
|
||
)
|
||
app.dependency_overrides[get_bq_access] = lambda: bq
|
||
try:
|
||
response = client.post("/api/v2/scan", json={...})
|
||
assert response.status_code == 200
|
||
finally:
|
||
app.dependency_overrides.clear()
|
||
```
|
||
|
||
Optional shared fixture in `tests/conftest.py`:
|
||
|
||
```python
|
||
@pytest.fixture
|
||
def bq_access():
|
||
"""Build a BqAccess with pluggable factories. Returns a callable that overrides
|
||
the FastAPI Depends and yields the BqAccess instance."""
|
||
def _build(*, client=None, duckdb_conn=None,
|
||
billing="test-billing", data="test-data"):
|
||
bq = BqAccess(
|
||
BqProjects(billing=billing, data=data),
|
||
client_factory=(lambda projects: client) if client else None,
|
||
duckdb_session_factory=(
|
||
lambda projects: contextlib.nullcontext(duckdb_conn)
|
||
) if duckdb_conn else None,
|
||
)
|
||
app.dependency_overrides[get_bq_access] = lambda: bq
|
||
return bq
|
||
|
||
yield _build
|
||
app.dependency_overrides.clear()
|
||
```
|
||
|
||
> **Fixture caveat for nested sessions.** `contextlib.nullcontext(duckdb_conn)` does NOT close the conn on `__exit__`. The production path closes it via `_default_duckdb_session_factory`. Tests that exercise multiple sequential `bq.duckdb_session()` calls within a single test function will see the same conn object both times — fine for assertion purposes, but won't catch close-and-reopen regressions. The unit test `test_duckdb_session_closes_on_exit` covers the close behavior on the production factory; the fixture is for endpoint integration tests that don't care.
|
||
|
||
### Tests
|
||
|
||
#### Unit tests — `tests/test_bq_access.py` (new)
|
||
|
||
| Test | Asserts |
|
||
|---|---|
|
||
| `test_resolve_env_var_wins` | `BIGQUERY_PROJECT=foo` overrides `instance.yaml`; both billing+data = foo |
|
||
| `test_resolve_billing_falls_back_to_project` | unset `billing_project` → both billing and data = `project` |
|
||
| `test_resolve_billing_distinct_from_project` | both set → billing and data differ |
|
||
| `test_resolve_raises_when_neither_set` | `BqAccessError(kind='not_configured')` with hint |
|
||
| `test_resolve_succeeds_after_config_set` | call once → raises; set config; call again → succeeds (functools.cache doesn't cache exceptions) |
|
||
| `test_get_bq_access_is_cached` | two successful calls return the same instance |
|
||
| `test_translate_forbidden_serviceusage` | `gax.Forbidden('serviceusage.services.use')` → `kind='cross_project_forbidden'` + hint |
|
||
| `test_translate_forbidden_no_serviceusage_diff_projects` | `gax.Forbidden('table-level perm denied')` + billing≠data → `kind='bq_forbidden'` (NOT cross_project) |
|
||
| `test_translate_forbidden_same_project` | `gax.Forbidden` + billing==data → `kind='bq_forbidden'` |
|
||
| `test_translate_bad_request_client_error` | `gax.BadRequest`, `bad_request_status='client_error'` → `kind='bq_bad_request'`, status 400 |
|
||
| `test_translate_bad_request_upstream_error` | `gax.BadRequest`, `bad_request_status='upstream_error'` → `kind='bq_upstream_error'`, status 502 |
|
||
| `test_translate_passes_through_BqAccessError` | `BqAccessError('bq_lib_missing', ...)` in → identical out (CRITICAL — guards against bq.client() raising it inside try/except) |
|
||
| `test_translate_unknown_reraises` | `RuntimeError("oops")` is re-raised, NOT silently wrapped |
|
||
| `test_client_uses_billing_as_quota_project` | `_default_client_factory` constructs with `quota_project_id=projects.billing` |
|
||
| `test_default_client_factory_raises_bq_lib_missing_on_importerror` | mock missing google-cloud-bigquery → `BqAccessError(bq_lib_missing)` |
|
||
| `test_default_duckdb_session_raises_auth_failed_on_metadata_error` | mock `get_metadata_token` raising `BQMetadataAuthError` → `BqAccessError(auth_failed)` |
|
||
| `test_duckdb_session_closes_on_exit` | mock token + duckdb conn, assert `conn.close()` called |
|
||
| `test_duckdb_session_closes_on_exception` | exception inside `with` block still triggers `conn.close()` |
|
||
| `test_injected_client_factory_overrides_default` | `BqAccess(..., client_factory=...)` skips `_default_client_factory` |
|
||
|
||
#### Integration tests — extending `tests/test_v2_*.py`
|
||
|
||
| Test | Asserts |
|
||
|---|---|
|
||
| `test_v2_scan_returns_502_on_bq_forbidden_serviceusage` | mock client raises `gax.Forbidden('...serviceusage...')`; response 502 + body `error=cross_project_forbidden` + hint mentions `billing_project` |
|
||
| `test_v2_scan_returns_400_on_bq_bad_request` | mock client raises `gax.BadRequest('invalid syntax')`; response 400 + body `error=bq_bad_request` |
|
||
| `test_v2_scan_estimate_returns_502_on_bq_forbidden` | same pattern for `/scan/estimate` |
|
||
| `test_v2_scan_returns_500_on_bq_lib_missing` | mock client raises `BqAccessError(bq_lib_missing)`; response 500 + structured body |
|
||
| `test_v2_sample_returns_502_on_bq_forbidden` | mock duckdb_session raises via `bigquery_query`; response 502 + structured body |
|
||
| `test_v2_sample_returns_400_on_unsafe_identifier` | registry row with backtick in `source_table` → 400 + body `error=unsafe_identifier` |
|
||
| `test_v2_sample_returns_404_on_unknown_table` | unchanged behavior (regression guard) |
|
||
| `test_v2_sample_returns_403_on_unauthorized` | unchanged behavior (regression guard) |
|
||
| `test_v2_schema_returns_502_on_bq_forbidden` | strict block (`_fetch_bq_schema`) failure → 502 |
|
||
| `test_v2_schema_returns_200_with_empty_partition_on_bq_failure` | best-effort block (`_fetch_bq_table_options`) failure → 200, schema returned, partition_by/clustered_by absent. Regression guard for the swallow-all preservation. |
|
||
| `test_v2_schema_returns_200_on_success` | regression guard (the existing happy path) |
|
||
|
||
#### E2E manual verification (post-deploy on `<your-dev-instance>`)
|
||
|
||
For each of `/sample`, `/scan/estimate`, `/scan`, `/schema`:
|
||
|
||
```bash
|
||
PAT=...
|
||
|
||
# Pre-deploy reference (current production behavior — bare 500 with no body for /sample/scan*)
|
||
curl -k -i -H "Authorization: Bearer $PAT" \
|
||
https://<your-agnes-host>/api/v2/sample/<bq_table_id>?n=2
|
||
|
||
curl -k -i -X POST -H "Authorization: Bearer $PAT" -H "Content-Type: application/json" \
|
||
https://<your-agnes-host>/api/v2/scan/estimate \
|
||
-d '{"table_id":"<bq_table_id>","select":["event_date"],"where":"event_date = DATE \"2026-04-21\""}'
|
||
|
||
curl -k -i -X POST -H "Authorization: Bearer $PAT" -H "Content-Type: application/json" \
|
||
https://<your-agnes-host>/api/v2/scan \
|
||
-d '{"table_id":"<bq_table_id>","select":["event_date"],"where":"event_date = DATE \"2026-04-21\"","limit":50}'
|
||
|
||
curl -k -i -H "Authorization: Bearer $PAT" \
|
||
https://<your-agnes-host>/api/v2/schema/<bq_table_id>
|
||
|
||
# Post-deploy, BEFORE fixing instance.yaml — expect:
|
||
# /sample, /scan, /scan/estimate: 502 + structured JSON body with
|
||
# error=cross_project_forbidden and a hint.
|
||
# /schema: 200 (because INFORMATION_SCHEMA queries don't fail today; if
|
||
# something else fails on the strict block, expect 502).
|
||
|
||
# Operator action: set data_source.bigquery.billing_project in instance.yaml,
|
||
# restart the container.
|
||
|
||
# Post-config-fix — expect 200 on all four endpoints.
|
||
```
|
||
|
||
This four-endpoint × three-state matrix is the success criterion for closing #134. Without it, "fixed" is unverifiable.
|
||
|
||
## Implementation strategy — staged commits
|
||
|
||
Per first-round review: stage as **two commits** so the user-visible bug fix is independently reviewable / revertable from the refactor. Per second-round review: **both commits emit the same structured response shape** so client-side parsers (CLI, UI) don't see contract churn between commits.
|
||
|
||
**Commit 1 — Minimal bug fix (revertable, atomic across all three v2 endpoints):**
|
||
- `app/api/v2_sample.py`:
|
||
- Read `billing_project` with same fallback as `v2_scan.py:385`; pass to `bigquery_query()`.
|
||
- Wrap `_fetch_bq_sample` in `try/except google.api_core.exceptions.*` translating to `HTTPException` with the structured body shape (see below).
|
||
- `app/api/v2_scan.py`: wrap `_bq_dry_run_bytes` and `_run_bq_scan` in the same `try/except` shape.
|
||
- `app/api/v2_schema.py`: wrap `_fetch_bq_schema` (strict block) in the same `try/except`. **Do NOT touch `_fetch_bq_table_options`** — its `except Exception → return {}` swallow-all is preserved unchanged in commit 1, then migrated to use `bq.duckdb_session()` in commit 2 (still preserved).
|
||
|
||
All three endpoints emit the **same structured body shape** that commit 2 will produce, so client-side parsers (CLI, UI) see one consistent contract throughout the rollout:
|
||
|
||
```python
|
||
except gax.Forbidden as e:
|
||
kind = "cross_project_forbidden" if "serviceusage" in str(e).lower() else "bq_forbidden"
|
||
raise HTTPException(
|
||
status_code=502,
|
||
detail={"error": kind, "message": str(e), "details": {...}},
|
||
)
|
||
except gax.BadRequest as e:
|
||
# v2_scan: bad_request_status='client_error' → kind='bq_bad_request', 400
|
||
# v2_sample, v2_schema: bad_request_status='upstream_error' → kind='bq_upstream_error', 502
|
||
raise HTTPException(
|
||
status_code=400 if <user_derived> else 502,
|
||
detail={"error": "bq_bad_request" if <user_derived> else "bq_upstream_error",
|
||
"message": str(e), "details": {}},
|
||
)
|
||
```
|
||
|
||
- Tests: regression tests for the structured body shape on all three endpoints.
|
||
|
||
This commit alone closes the user-visible part of #134 atomically — no half-rolled-out window where `/sample` returns bare 500 while `/scan` returns structured 502. If commit 2 needs another review round, commit 1 still ships and the response contract is forward-compatible.
|
||
|
||
**Commit 2 — `BqAccess` extraction + migration:**
|
||
- Create `connectors/bigquery/access.py` with the design above.
|
||
- Migrate `v2_scan`, `v2_sample`, `v2_schema` (both blocks, with separate semantics), `RemoteQueryEngine` (lazy `bq_access`) to `BqAccess`.
|
||
- Remove `_bq_client_factory` from `RemoteQueryEngine.__init__` and the stale docstring at `src/remote_query.py:204`.
|
||
- Migrate tests to the new `BqAccess(client_factory=...)` + `dependency_overrides` pattern.
|
||
- Delete inline `try/except gax.*` blocks added in commit 1; route through `translate_bq_error` instead.
|
||
|
||
## Risks
|
||
|
||
1. **Test rewrite breaks something subtle.** `tests/test_remote_query.py` and possibly `tests/test_duckdb_manager.py` have many `_bq_client_factory` call sites. The new fixture pattern must cover every shape they exercise. Mitigation: convert tests one-by-one in commit 2, run pytest after each, before deleting the old injection point. DuckDB-only tests that don't touch BQ are protected by lazy `bq_access` resolution in `RemoteQueryEngine`.
|
||
2. **Cross-project Forbidden detection heuristic is narrow but principled.** Relies on Google's error message containing `'serviceusage'` (case-insensitive). False positives are unlikely (the substring is specific). False **negatives** are possible — those degrade to `bq_forbidden` with a generic message, still a 502 with structured body. Acceptable.
|
||
3. **`get_bq_access()` is `@functools.cache`'d.** Cheap and process-lifetime-safe (config is loaded at boot and immutable). Tests use `dependency_overrides` and direct `BqAccess(...)` construction, never the cached path — no cache invalidation needed in tests. Hot-reload of `instance.yaml` is explicitly out of scope.
|
||
4. **`bq_bad_request → 400` could leak BQ error messages.** BQ's `BadRequest` text typically describes the SQL problem. We surface it in `details.message`. Operators who don't want this can filter at a reverse-proxy layer; this matches behavior of any 4xx-with-detail elsewhere in the app.
|
||
5. **`BIGQUERY_PROJECT` env-var precedence is BREAKING for env-only deployments.** Deployments that combine env-var-for-billing + yaml-for-data must migrate. See the project-resolution rules section. Flag in CHANGELOG and release notes.
|
||
6. **Two duplicate sites left behind (`extractor.py`, `register_bq_table`).** Explicit follow-up issue should be filed at PR-merge time.
|
||
7. **`translate_bq_error` pass-through ordering is load-bearing.** `bq.client()` and `bq.duckdb_session()` raise `BqAccessError` directly for `bq_lib_missing` / `auth_failed`. The translator's first clause MUST be `if isinstance(e, BqAccessError): return e`. Otherwise those typed errors fall through to "unknown" and get re-raised as bare 500. Unit-tested via `test_translate_passes_through_BqAccessError`.
|
||
8. **In-memory DB reload cost on each `duckdb_session()` (pre-existing).** Every BQ-via-DuckDB call runs `INSTALL bigquery` fresh. If the extension binary isn't already cached on disk, this is an HTTPS download. Today's code has the same cost; this PR doesn't fix it. Future optimization: long-lived in-memory conn with extension pre-loaded, behind a thread-safe pool.
|
||
|
||
## CHANGELOG entry (for the implementation PR)
|
||
|
||
Under `## [Unreleased]`:
|
||
|
||
**`### Fixed`**
|
||
- v2 `/sample` endpoint: BigQuery cross-project queries now respect `data_source.bigquery.billing_project` from `instance.yaml` (mirrors v2 `/scan` fix from `33a9964`). Closes #134 for `/sample`.
|
||
- v2 `/scan`, `/scan/estimate`: BigQuery upstream errors no longer return bare HTTP 500 with empty body. `Forbidden` from BQ now returns HTTP 502 with structured JSON body (`{"error": "cross_project_forbidden", "message": "...", "details": {"hint": "..."}}`); `BadRequest` on user-derived SQL returns HTTP 400 with `kind=bq_bad_request`. Closes #134 for `/scan*`.
|
||
- v2 `/schema`: same error translation applied to the strict path (`_fetch_bq_schema`); the best-effort partition-info path (`_fetch_bq_table_options`) preserves its swallow-all-and-return-empty behavior, so `/schema` still returns 200 with empty partition info if BQ partition-info queries fail.
|
||
|
||
**`### Changed`**
|
||
- **BREAKING for deployments using `BIGQUERY_PROJECT` env var alongside `data_source.bigquery.project` in `instance.yaml`.** The env var now sets BOTH billing and data project, overriding `data_source.bigquery.project` for FROM-clause construction in `v2_scan` / `v2_sample` / `v2_schema`. Migrate by clearing `BIGQUERY_PROJECT` and using `data_source.bigquery.billing_project` + `data_source.bigquery.project` in `instance.yaml`. (Previously `BIGQUERY_PROJECT` only affected `RemoteQueryEngine` billing.)
|
||
|
||
**`### Internal`**
|
||
- New shared module `connectors/bigquery/access.py` — `BqAccess` facade unifies BQ project resolution, client construction, DuckDB-extension session, and Google-API error translation across `v2_scan`, `v2_sample`, `v2_schema`, and `RemoteQueryEngine`.
|
||
- **Internal API change:** `RemoteQueryEngine.__init__` no longer accepts `_bq_client_factory`. Callers that injected it migrate to `RemoteQueryEngine(..., bq_access=BqAccess(projects, client_factory=...))`. The CLI (`cli/commands/query.py`) is unaffected — it never injected the factory and the new `bq_access` kwarg defaults to `None` (lazy `get_bq_access()` on first BQ call).
|
||
- Stale docstring at `src/remote_query.py:204` (referencing `scripts.duckdb_manager._create_bq_client` as the default factory) removed.
|
||
- Two known-duplicate BQ-access sites (`connectors/bigquery/extractor.py`, `scripts/duckdb_manager.register_bq_table`) explicitly out of scope; tracked as follow-up.
|