agnes-the-ai-analyst/docs/superpowers/specs/2026-04-29-issue-134-bq-access-unify-design.md
ZdenekSrotyr 83adf01bde
fix(v2): #134 BigQuery cross-project errors return structured 502/400 + BqAccess facade (#138)
* docs(spec): #134 unify BigQuery access behind BqAccess facade

Brainstorm output for issue #134. Captures:
- root cause (incl. correction of the issue's hypothesis about commit 33a9964)
- BqAccess facade API + project resolution rules
- error contract — typed BqAccessError mapped to HTTP 502 for upstream
  BQ failures, 500 for deployment/config bugs
- migration plan for v2_scan, v2_sample, RemoteQueryEngine
- test rewrite eliminating _bq_client_factory injection point
- E2E verification protocol on agnes-development as success criterion

* docs(spec): #134 revise after first review

Incorporates code-reviewer findings:

Must-fix:
- Add v2_schema (2 copies of INSTALL/LOAD/SECRET dance) to migration scope.
- Reframe v2_scan headline: missing try/except around BQ calls is the
  actual cause of bare 500s, not project resolution (which 33a9964 fixed).
- List two more deferred call sites (extractor.py, register_bq_table)
  with explicit rationale.

Important:
- Drop billing != data clause from cross_project_forbidden heuristic;
  rely only on 'serviceusage' substring. billing != data is normal
  for cross-project setup, was over-classifying.
- Split bq_bad_request into _user (400) and _server (502) variants;
  add sql_origin parameter to translate_bq_error so call sites declare
  whether SQL contains user input.
- Add @functools.cache to BqAccess.from_config; document tests bypass
  via dependency_overrides.
- Replace monkey-patched-classmethod test pattern with
  BqAccess(client_factory=...) injection at construction time. Cleaner
  than today's _bq_client_factory and 1:1 migration shape.
- Keep BqProjects.data (reviewer assumed registry has source_project;
  it doesn't). Multi-project explicitly listed as non-goal with note.

Nice-to-have:
- Add 'Implementation strategy' section: 2 staged commits (bug fix
  alone is revertable; refactor follows).
- Extend E2E protocol to cover all three endpoints, not just /sample.
- Note removal of stale docstring at src/remote_query.py:204.

* docs(spec): #134 revision 3 — incorporates second-round review

Must-fix from second review:
- v2_schema split into two migration cases: _fetch_bq_schema translates
  errors via translate_bq_error; _fetch_bq_table_options preserves its
  swallow-all 'except Exception → return {}' so /schema doesn't 502 on
  partition-info failures.
- RemoteQueryEngine.__init__ now resolves BqAccess lazily (in
  _get_bq_client, not in __init__). Without this, ~7 DuckDB-only tests
  in test_remote_query.py would suddenly fail with not_configured.
- translate_bq_error pass-through for BqAccessError is now load-bearing
  (clause 1, before any Google-API branch). bq.client() raises BqAccessError
  for bq_lib_missing/auth_failed; without explicit pass-through those
  fall to 'unknown' and re-raise as bare 500.
- Commit 1 now emits the SAME structured response shape as commit 2 to
  avoid contract churn between commits.
- BIGQUERY_PROJECT env-var precedence is BREAKING for env-only deployments
  — flagged in CHANGELOG ### Changed.

Editorial:
- sql_origin renamed to bad_request_status with values 'client_error' /
  'upstream_error' (clearer about what the parameter actually decides).
  bq_bad_request_user/_server kinds collapsed to bq_bad_request (400)
  and bq_upstream_error (502).
- CLI (cli/commands/query.py) noted as external RemoteQueryEngine caller;
  unaffected because new bq_access kwarg has default None.
- Added unit/integration tests for the new contracts:
  test_translate_passes_through_BqAccessError,
  test_v2_scan_returns_500_on_bq_lib_missing,
  test_v2_schema_returns_200_with_empty_partition_on_bq_failure,
  test_resolve_succeeds_after_config_set.
- E2E protocol now covers /schema as the fourth endpoint.
- Documented functools.cache-doesn't-cache-exceptions semantics and
  fixture nullcontext-doesn't-close caveat for nested sessions.

* docs(spec): #134 revision 4 — incorporates third-round review

Third reviewer verdict: 'implementation-ready with two trivial edits';
explicitly noted prior rounds did the heavy lifting.

Edits:
1. get_bq_access() module-level function instead of @classmethod
   @functools.cache from_config. Removes the classmethod-cache stacking
   footgun (different Python versions wrap differently) and gives FastAPI's
   dependency introspection a clean function signature. Drops the
   'Do not subclass BqAccess' caveat that no longer applies.

2. Commit 1 strategy explicitly: wrap _fetch_bq_sample (v2_sample),
   _bq_dry_run_bytes + _run_bq_scan (v2_scan), and _fetch_bq_schema
   (v2_schema strict block). Do NOT touch _fetch_bq_table_options swallow-all
   in commit 1 — preserved as-is, then migrated (still preserved) in commit 2.
   All three endpoints emit the same structured body shape so client parsers
   see one consistent contract throughout the staged rollout. No more
   half-rolled-out window where /sample is bare 500 while /scan is
   structured 502.

* docs(plan): #134 implementation plan — Phase 1 (atomic bug fix) + Phase 2 (BqAccess refactor) + Phase 3 (verification)

Bite-sized TDD tasks. 3 phases, 16 tasks total:

Phase 1 (Commit 1) — atomic bug fix across all four v2 endpoints:
  Tasks 1.1-1.5 wrap _fetch_bq_sample, _bq_dry_run_bytes, _run_bq_scan,
  _fetch_bq_schema with structured 502/400 try/except. _fetch_bq_table_options
  preserved untouched. CHANGELOG Fixed entries.

Phase 2 (Commit 2) — BqAccess facade extraction + migration:
  Tasks 2.1-2.5 build connectors/bigquery/access.py bottom-up
  (BqProjects, BqAccessError, translate_bq_error, default factories,
  BqAccess class, get_bq_access module-level cached). Task 2.6 adds
  conftest.py fixture. Tasks 2.7-2.9 migrate v2_scan, v2_sample, v2_schema
  to BqAccess. Tasks 2.10-2.11 migrate RemoteQueryEngine + tests
  (lazy bq_access, drop _bq_client_factory). Task 2.12 CHANGELOG
  Changed BREAKING + Internal.

Phase 3 — Verification:
  3.1 full pytest. 3.2 squash into two PR-shape commits. 3.3 manual
  E2E on agnes-development per spec protocol → close #134.

Self-review table maps spec sections to implementing tasks; no gaps.

* fix(v2): #134 structured 502/400 on BQ errors across /scan, /scan/estimate, /sample, /schema

Wraps the BigQuery call sites in v2_scan, v2_sample, and v2_schema (strict
block only) with try/except for google.api_core exceptions, translating to
HTTPException with a structured body shape: {error, message, details}.

Fixes Pavel's report (#134) where these endpoints returned bare HTTP 500
with no body when the SA on agnes-development hit cross-project Forbidden
on serviceusage.services.use.

Also fixes /sample's missing billing_project fallback (the bug 33a9964
fixed for /scan never landed here).

Status code split:
  - /scan, /scan/estimate: BadRequest -> 400 (bq_bad_request) since SQL is
    user-derived from req.select/where/order_by.
  - /sample, /schema: BadRequest -> 502 (bq_upstream_error) since SQL is
    server-constructed from validated identifiers.
  - All Forbidden -> 502 with cross_project_forbidden if 'serviceusage' in
    error message (with hint pointing at data_source.bigquery.billing_project),
    else bq_forbidden.

Body shape matches what the upcoming BqAccess refactor (next commit) will
produce, so client-side parsers see one consistent contract throughout
the staged rollout.

_fetch_bq_table_options preserved exactly as-is — its swallow-all-and-return-empty
contract is intentional and survives into the refactor; /schema continues to
return 200 with empty partition info when partition queries fail.

Outer wraps in scan_endpoint, scan_estimate_endpoint, sample, and schema
endpoints exist only to make the test pattern (monkeypatching whole
_fetch_* functions) work, and are tagged TODO(#134 Phase 2) for removal
once BqAccess centralizes translation.

* refactor(bq): #134 BqAccess facade — unify v2_scan, v2_sample, v2_schema, RemoteQueryEngine

Extracts the duplicated BigQuery-access pattern (project resolution +
client construction + DuckDB-extension session + Google-API error
translation) into connectors/bigquery/access.py. Migrates four
call sites to use it:

- app/api/v2_scan.py — _bq_dry_run_bytes, _run_bq_scan
- app/api/v2_sample.py — _fetch_bq_sample
- app/api/v2_schema.py — _fetch_bq_schema (strict translation),
  _fetch_bq_table_options (preserves swallow-all best-effort contract)
- src/remote_query.py — RemoteQueryEngine, lazy bq_access kwarg

The new module exposes:
- BqProjects (frozen dataclass: billing + data project IDs)
- BqAccessError (typed exception with HTTP_STATUS class mapping)
- BqAccess (facade with injectable client_factory/duckdb_session_factory
  for tests; defaults call the real google-cloud-bigquery + DuckDB extension)
- get_bq_access (module-level @functools.cache; FastAPI Depends target)
- translate_bq_error (Google API exception → BqAccessError mapper, with
  BqAccessError pass-through, 'serviceusage'-substring heuristic for
  cross_project_forbidden, and bad_request_status param distinguishing
  user-derived (400) from server-constructed (502) SQL)
- _default_client_factory, _default_duckdb_session_factory

RemoteQueryEngine.__init__ no longer accepts _bq_client_factory; tests
migrate to bq_access=BqAccess(projects, client_factory=...). DuckDB-only
RemoteQueryEngine tests need no changes — bq_access defaults to None and
get_bq_access() is only invoked on first BQ call (lazy resolution).
BqAccessError raised internally is translated to RemoteQueryError(
error_type="bq_error") in _get_bq_client to preserve the engine's
existing public contract — CLI and /api/query/hybrid callers see no change.

Endpoint tests (test_v2_scan, test_v2_scan_estimate, test_v2_sample,
test_v2_schema) migrate from monkey-patching whole _fetch_* functions
to using the new bq_access fixture in tests/conftest.py — which
exercises the REAL translation path through BqAccess + translate_bq_error,
closing the test gap flagged in Task 1.1's review.

Side-effect behavior change: v2_sample's FROM clause now uses the data
project (instance.yaml data_source.bigquery.project), not the conflated
billing_project from Phase 1. Documented in CHANGELOG ### Internal.

BREAKING for deployments combining BIGQUERY_PROJECT env var with
data_source.bigquery.project in instance.yaml — env var now overrides
data project too. See CHANGELOG ### Changed.

Two known-duplicate BQ-access sites (connectors/bigquery/extractor.py,
scripts/duckdb_manager.register_bq_table) explicitly out of scope;
tracked as follow-up.

Removed stale docstring at the previous src/remote_query.py:204
that referenced scripts.duckdb_manager._create_bq_client as the default
BQ client factory (RemoteQueryEngine never actually used that function).

Test counts: tests/test_bq_access.py +27 (new), tests/test_v2_*.py +
tests/test_remote_query.py migrated to bq_access fixture (counts unchanged
or +1-2 per file). Full suite: 2086 passed, 8 pre-existing failures
(DB migration tests with unrelated internal_roles DependencyException —
not introduced by this PR).

* fix(bq_access): translate DefaultCredentialsError to BqAccessError(auth_failed)

CI on PR #138 caught: bigquery.Client(...) resolves Application Default
Credentials at construction time; without ADC (CI without SA key, dev
laptop without 'gcloud auth application-default login') it raises
google.auth.exceptions.DefaultCredentialsError synchronously.

Pre-fix _default_client_factory only caught ImportError, so DefaultCredentialsError
propagated as raw exception — and from production endpoints would surface
as bare 500 (the exact failure mode #134 sets out to fix).

Now translates to BqAccessError(kind='auth_failed', details.hint='Run
gcloud auth application-default login...'). Endpoint catch chain returns
HTTP 502 with structured body. Adds unit test
test_raises_auth_failed_on_default_credentials_error.

Third-round spec review flagged this case in passing; the fix didn't land.
CI's auth-less environment surfaced it.

* fix(bq_access): get_bq_access() returns sentinel instead of raising when not configured

Devin BUG_0001 on PR #138 review: 'get_bq_access() as FastAPI Depends
breaks all v2 endpoints for non-BigQuery instances'.

Pre-fix: get_bq_access() raised BqAccessError(not_configured) when
neither BIGQUERY_PROJECT env nor data_source.bigquery.project was set.
Because FastAPI resolves Depends() BEFORE the endpoint body runs, this
exception fires during dep-injection — the endpoint's try/except
BqAccessError clause never gets a chance to catch it. Result: every
v2 request on Keboola-only or CSV-only instances returned bare HTTP
500, even for local-source tables that never touch BigQuery.

Fix: get_bq_access() now returns a sentinel BqAccess with empty
BqProjects and factories that raise BqAccessError(not_configured)
on actual use. Construction succeeds, FastAPI's dep-injection cleanly
yields the sentinel, the endpoint runs. The local-source code path
in build_sample / build_schema / etc. never calls bq.client() or
bq.duckdb_session() (it reads parquet directly), so non-BQ tables
return 200 as before. Only when an endpoint actually tries to query
BQ (source_type == 'bigquery') does the sentinel raise — and the
endpoint's existing except BqAccessError catches it normally,
returning structured 502 with hint.

Test get_bq_access::test_raises_not_configured_when_neither_set
renamed and rewritten to test_returns_sentinel_when_neither_set:
asserts BqAccess is returned, then asserts client() and
duckdb_session() each raise BqAccessError(not_configured) on call.

Test test_does_not_cache_exceptions removed (no longer applicable)
and replaced with test_sentinel_is_cached_per_process documenting
the operator-restart-on-config-change contract.

* docs(spec+plan): #134 genericize customer-specific tokens (CLAUDE.md OSS rule)

Devin BUG_0001/0002 round 3 on PR #138: spec and plan docs contained
customer-specific deployment hostnames, deployment names, and a GCP
project ID that violated CLAUDE.md's vendor-agnostic OSS rule
('Nothing customer-specific belongs in code, configuration defaults,
comments, docs, commit messages, PR titles, or PR bodies').

Replacements:
  agnes-development.groupondev.com -> <your-agnes-host>
  agnes-development                -> <your-dev-instance>
  prj-grp-dataview-prod-1ff9       -> <your-data-project>
  s1_session_landings              -> <bq_table_id>

E2E verification semantics unchanged — operators still run the same
four curls + config flip + retry, just substituting their own host /
deployment name / project / table.

* fix(bq_access): hook get_bq_access.cache_clear into instance_config.reset_cache

Devin ANALYSIS_0004 on PR #138: get_bq_access is @functools.cache'd at
process level, so it captures BigQuery project IDs at first call and
ignores subsequent instance.yaml changes. Pre-Phase-2 the v2 endpoints
re-read get_value() on every request, so admin /api/admin/server-config
saves (which call instance_config.reset_cache()) hot-reloaded the BQ
project. Without this fix, my refactor silently regresses that contract
— operators editing instance.yaml via the admin UI would see no effect
on v2 endpoints until container restart.

instance_config.reset_cache() now also calls
connectors.bigquery.access.get_bq_access.cache_clear() (lazy import,
swallowed if connectors module isn't loaded — keeps instance_config
usable in isolated unit tests).

Adds test_instance_config_reset_cache_invalidates_get_bq_access as
regression guard. Updates CHANGELOG Internal entry to mention the
hot-reload contract + the not-configured sentinel behavior (round-3
fix from Devin BUG_0001 was previously only in commit message).

* fix(bq_access): surface not_configured before identifier validation + plan path genericize

Devin BUG_0001 + BUG_0002 round 5 on PR #138.

BUG_0001 (plan doc): personal filesystem path violated CLAUDE.md
vendor-agnostic rule. Replaced with '<worktree-root>' placeholder.

BUG_0002 (sentinel error path): when get_bq_access() returns the sentinel
BqAccess (BQ not configured), the empty bq.projects.data was reaching
validate_quoted_identifier first and raising ValueError -> endpoint
mapped to HTTP 400 'unsafe_identifier' instead of structured 500
'not_configured' with hint.

Each fetch helper now checks 'if not bq.projects.data: bq.client()' as
the first step, which triggers the sentinel's BqAccessError(not_configured).
Endpoint catches the typed error and returns HTTP 500 with hint pointing
at data_source.bigquery.project. Best-effort _fetch_bq_table_options
returns {} silently in this case (preserves the swallow-all contract).

* fix(bq_access): classify DuckDB-native exceptions from bigquery_query() via string match

Devin ANALYSIS on PR #138 review (latest round). The DuckDB bigquery
extension is a C++ plugin making its own HTTP calls — when BQ returns
403, it throws duckdb.IOException with the BQ error embedded as text,
not gax.Forbidden. translate_bq_error's isinstance checks would miss
these, falling to case 7 → bare 500 in production for v2_scan, v2_sample,
and v2_schema (the bigquery_query() paths).

Fix: last-resort string-match heuristic before the re-raise. 'Forbidden'
/ '403' / 'Bad Request' / '400' in the lowercased message classifies via
the same kind hierarchy. The 'serviceusage' substring still distinguishes
cross_project_forbidden from bq_forbidden. Specific enough that random
exceptions without HTTP-error keywords still re-raise.

Adds 4 unit tests covering the new heuristic + the 'don't swallow random
exceptions' invariant.

* chore(release): cut 0.22.0

PR #138 contains issue #134 user-visible behavior changes:
- BREAKING: BIGQUERY_PROJECT env var now overrides instance.yaml
  data_source.bigquery.project for v2 endpoints (previously
  RemoteQueryEngine billing only).
- Fixed: structured 502/400 on /api/v2/sample, /scan, /scan/estimate,
  /schema when BigQuery raises Forbidden/BadRequest (was bare 500).
- Internal: BqAccess facade refactor unifying four duplicate BQ-access
  call sites; instance_config.reset_cache() now invalidates BqAccess
  cache too so admin server-config saves hot-reload BQ project IDs.

Bumps to 0.22.0 because PR #137 merged first and took 0.21.0.
2026-04-30 10:11:20 +02:00

604 lines
39 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Issue #134 — Unify BigQuery access behind `BqAccess`, fix v2_sample + 502 contract
**Date:** 2026-04-29 (revision 4 — incorporates three rounds of code review)
**Issue:** [keboola/agnes-the-ai-analyst#134](https://github.com/keboola/agnes-the-ai-analyst/issues/134)
**Branch:** `fix/134-bq-access-unify`
## Problem
Issue #134 reports that v2 BigQuery endpoints on `<your-dev-instance>` (0.18.0) return HTTP 500 with no body:
- `POST /api/v2/scan/estimate`
- `POST /api/v2/scan`
- `GET /api/v2/sample/{table_id}`
`POST /api/query/hybrid` against the same tables works correctly.
### Root cause analysis
Two distinct bugs cause the symptom:
#### Bug A — `v2_scan.py` has correct project resolution but no error translation
Commit `33a9964` added the `billing_project` parameter to `app/api/v2_scan.py`. Today both endpoints (`scan_endpoint` line 385, `scan_estimate_endpoint` line 221) read `data_source.bigquery.billing_project` from `instance.yaml`, fall back to `project`, and pass it to the BQ client constructor.
**However, `_bq_dry_run_bytes` (lines 43-55) and `_run_bq_scan` (lines 266-282) have no `try/except` at all.** The endpoint-level handlers in `scan_endpoint` and `scan_estimate_endpoint` catch `WhereValidationError`, `QuotaExceededError`, `FileNotFoundError`, `PermissionError`, `ValueError` — but `google.api_core.exceptions.Forbidden` and `BadRequest` propagate as bare HTTP 500 with no body.
**This is the headline cause of the v2_scan/estimate 500s.** When the SA on `<your-dev-instance>` lacks `serviceusage.services.use` on whatever project resolves to billing, BQ raises `Forbidden`, which propagates uncaught. The config fix landed in `33a9964`; the error translation didn't.
#### Bug B — `v2_sample.py` is missing the billing_project split entirely
`app/api/v2_sample.py:104` reads only `data_source.bigquery.project`, never `billing_project`. It then passes that project to `bigquery_query()` as the billing target. This is the same bug `33a9964` fixed in `v2_scan.py`, just in a sibling file that didn't get the same patch.
`v2_sample.py` also has no structured error handling — it catches only `FileNotFoundError` (404) and `PermissionError` (403). Anything else (Google API errors, identifier `ValueError`, `ImportError`) bubbles up as bare HTTP 500.
#### Bug C — `v2_schema.py` has the same shape (one strict block, one best-effort block)
`app/api/v2_schema.py` contains **two separate blocks** of the INSTALL/LOAD/SECRET/`bigquery_query()` dance with **different error semantics**:
- `_fetch_bq_schema` (lines 48-73): hard-required for the endpoint to return a schema. Today swallows nothing; would benefit from structured translation.
- `_fetch_bq_table_options` (lines 90-129): best-effort partition/cluster info. Wraps everything in `try/except Exception → return {}` and a `logger.warning`. Endpoint returns successfully even if partition info fails.
Schema reportedly works for Pavel today because both queries hit `INFORMATION_SCHEMA`, which doesn't trip the `serviceusage` permission check on the billing project. But the same code path is one query change away from the same 500.
#### Operator-config dimension (out of scope for code fix)
If `instance.yaml` on `<your-dev-instance>` does not set `data_source.bigquery.billing_project`, the fallback `or project_id` puts the call right back in the broken state. The fix surfaces this to the operator via a structured error body containing a `hint` pointing at the missing config key.
### Five+ duplicate code paths today
The BQ-access pattern is duplicated across:
| File | Function | Shape | Status in this PR |
|---|---|---|---|
| `app/api/v2_scan.py` | `_bq_dry_run_bytes`, `_run_bq_scan` | `bigquery.Client` (Python SDK) | **In scope** |
| `app/api/v2_sample.py` | `_fetch_bq_sample` | DuckDB `bigquery_query()` | **In scope** |
| `app/api/v2_schema.py` | `_fetch_bq_schema` | DuckDB `bigquery_query()` | **In scope (strict translation)** |
| `app/api/v2_schema.py` | `_fetch_bq_table_options` | DuckDB `bigquery_query()` | **In scope (preserve swallow-all `except Exception → {}`)** |
| `src/remote_query.py` | `RemoteQueryEngine._get_bq_client` | `bigquery.Client` (Python SDK) | **In scope** |
| `connectors/bigquery/extractor.py` | sync-time extractor | mixed (`ATTACH 'project=...'` + `bigquery_query()`) | **Deferred** (see below) |
| `scripts/duckdb_manager.py` | `register_bq_table` | `bigquery.Client` (Python SDK) | **Deferred** (see below) |
**Deferred sites — rationale:**
- **`extractor.py`** runs at sync time, async, behind the scheduler. Errors surface in logs / `sync_history`, not as HTTP responses. Different lifecycle, different control flow (uses `ATTACH` not `bigquery.Client.query`). Migrating it doubles the PR size for benefit not in #134's scope. Track as follow-up issue.
- **`register_bq_table`** is admin-only, runs once at table registration time (M1 from #108). Its project resolution is `bq_project or BIGQUERY_PROJECT env` — no `instance.yaml` fallback, different semantics from the runtime path. Different concern. Track as follow-up issue.
The stale docstring at `src/remote_query.py:204` claims `_bq_client_factory` defaults to `scripts.duckdb_manager._create_bq_client`. It doesn't — `_get_bq_client` constructs `_bq_module.Client(project=project)` inline at line 450. Will self-correct when `_bq_client_factory` is removed.
## Goals
1. **Fix the v2_sample billing_project bug** so cross-project BQ reads work when the operator sets `billing_project`.
2. **Fix the v2_scan/estimate error translation** so cross-project Forbidden surfaces as a structured 502 instead of bare 500.
3. **Translate Google API errors into structured responses** with actionable bodies across all three v2 endpoints (the user / CLI gets an error shape they can reason about, not bare 500).
4. **Eliminate four duplicate BQ-access call sites** behind a single facade so the fix lives in one place. (Two deferred — see above.)
5. **Preserve test invasiveness** — existing tests that mock the BQ client must remain straightforward to write, ideally cleaner than today's `_bq_client_factory` injection point.
## Non-goals
- Changing the operator-facing config schema. `data_source.bigquery.billing_project` already exists; we route everything through it.
- Auto-detecting cross-project misconfiguration at startup (rejected for scope; would require a real BQ call at boot).
- Touching the `/api/query/hybrid` endpoint behavior — `RemoteQueryEngine` internals change, but the HTTP contract does not.
- Migrating `extractor.py` or `register_bq_table` to `BqAccess` (deferred — see above).
- Adding per-table multi-project support. Today's `table_registry` schema has no `source_project` column; every BQ table uses `instance.yaml`'s `project` as the data project. Future multi-project is a separate feature; spec notes the constraint so it can be lifted cleanly later.
- Schema migration / data migration of any kind.
- Hot-reload of `instance.yaml`. `get_bq_access()` is `@functools.cache`'d at process level. Changing config requires a container restart. (Today's behavior is identical — `instance.yaml` is loaded once at boot.)
## Design
### Architecture — new module `connectors/bigquery/access.py`
```
connectors/bigquery/
├── auth.py (existing — get_metadata_token, unchanged)
├── extractor.py (existing — unchanged in this PR)
└── access.py (NEW)
├── BqProjects (frozen dataclass)
├── BqAccessError (typed exception with HTTP_STATUS class mapping)
├── BqAccess (facade with injectable factories)
├── get_bq_access() -> BqAccess (module-level, @functools.cache; FastAPI Depends target)
├── translate_bq_error(e, projects, *, bad_request_status) -> BqAccessError
├── _default_client_factory (real bigquery.Client construction)
└── _default_duckdb_session_factory (real INSTALL/LOAD/SECRET dance)
```
### `BqAccess` public API
```python
@dataclass(frozen=True)
class BqProjects:
billing: str # billing/quota target — used as `project=` and `quota_project_id=`
data: str # data project for FROM clauses (today: instance.yaml `project`).
# Note: locked to a single project per instance until table_registry
# grows a per-table source_project column. See "Non-goals".
class BqAccessError(Exception):
HTTP_STATUS = {
"not_configured": 500, # admin/config bug — page on-call
"bq_lib_missing": 500, # deployment bug
"auth_failed": 502, # GCP metadata server unreachable
"cross_project_forbidden": 502, # SA lacks serviceusage.services.use on billing project
"bq_forbidden": 502, # other Forbidden from BQ
"bq_bad_request": 400, # 400 from BQ when caller flagged it as client-derived
"bq_upstream_error": 502, # all other upstream BQ failures (incl. server-derived BadRequest)
}
def __init__(self, kind: str, message: str, details: dict | None = None):
self.kind = kind
self.message = message
self.details = details or {}
super().__init__(message)
class BqAccess:
"""Single entry point for BigQuery access — config resolution, client construction,
DuckDB-extension session, and error translation. Stateless after construction.
Factories are injectable for tests:
bq = BqAccess(
BqProjects(billing="test-billing", data="test-data"),
client_factory=lambda projects: mock_client,
)
"""
def __init__(
self,
projects: BqProjects,
*,
client_factory: Callable[[BqProjects], "bigquery.Client"] | None = None,
duckdb_session_factory: Callable[[BqProjects], "AbstractContextManager"] | None = None,
):
self._projects = projects
self._client_factory = client_factory or _default_client_factory
self._duckdb_session_factory = duckdb_session_factory or _default_duckdb_session_factory
@property
def projects(self) -> BqProjects: ...
def client(self) -> "bigquery.Client":
"""Construct (or retrieve from injected factory) a BigQuery client billed to
projects.billing. Raises BqAccessError(kind='bq_lib_missing') if
google-cloud-bigquery is not installed; raises BqAccessError(kind='auth_failed')
on credential resolution failure."""
@contextmanager
def duckdb_session(self) -> Iterator["duckdb.DuckDBPyConnection"]:
"""Yield in-memory DuckDB conn with bigquery extension loaded + SECRET set
from get_metadata_token(). Auto-cleanup. Translates INSTALL/LOAD/SECRET failures
and metadata-token failures to BqAccessError(kind='auth_failed' or 'bq_lib_missing')."""
@functools.cache
def get_bq_access() -> "BqAccess":
"""Module-level factory used as the FastAPI Depends target.
Resolves projects from BIGQUERY_PROJECT env → instance.yaml billing_project →
instance.yaml project. Returns a BqAccess instance with default factories.
Process-cached: config is loaded at boot and doesn't change at runtime. Hot-reload
of instance.yaml is explicitly out of scope. Note that functools.cache does NOT
cache exceptions, so a failed call (BqAccessError(kind='not_configured')) is retried
on the next invocation — useful when the operator fixes config and restarts a
request without restarting the container.
Tests inject via FastAPI's dependency_overrides[get_bq_access] = lambda: bq, or
construct BqAccess(...) directly for non-endpoint code (e.g. RemoteQueryEngine).
Tests do NOT mutate this cache.
Module-level (not BqAccess.from_config classmethod) to avoid the
@classmethod + @functools.cache stacking footgun and to give FastAPI's
dependency introspection a clean function signature."""
def translate_bq_error(
e: Exception,
projects: BqProjects,
*,
bad_request_status: Literal["client_error", "upstream_error"],
) -> BqAccessError:
"""Convert Google API exceptions into a typed BqAccessError.
Mapping (FIRST match wins):
1. BqAccessError -> pass through unchanged (this is critical:
bq.client() and bq.duckdb_session() can raise
BqAccessError directly for bq_lib_missing /
auth_failed, and those must round-trip through
translate_bq_error without being reclassified
as 'unknown' and re-raised).
2. Forbidden + 'serviceusage' in str(e).lower()
-> cross_project_forbidden (with hint)
3. Forbidden -> bq_forbidden
4. BadRequest, bad_request_status='client_error'
-> bq_bad_request (HTTP 400)
5. BadRequest, bad_request_status='upstream_error'
-> bq_upstream_error (HTTP 502)
6. GoogleAPICallError (other) -> bq_upstream_error
7. Anything else -> RE-RAISED unchanged (don't swallow programmer errors)
`bad_request_status` MUST be supplied by the caller. It distinguishes:
- 'client_error': SQL contains user input (select/where/order_by/limit). BQ rejecting
it is plausibly the user's fault — return 400 with the BQ message.
- 'upstream_error': SQL is fully built server-side from validated identifiers.
BQ rejecting it is server-side corruption — return 502."""
```
### Project resolution rules (single source of truth)
`get_bq_access` resolves projects in this order (matching today's `RemoteQueryEngine._get_bq_client` behavior):
1. `BIGQUERY_PROJECT` env var → if set, used as **both** billing and data (legacy override).
2. `data_source.bigquery.billing_project` from `instance.yaml` → billing.
3. `data_source.bigquery.project` from `instance.yaml` → data, and billing if (2) is unset.
If neither (1) nor (3) yields a value: `BqAccessError(kind='not_configured', details={"hint": "set data_source.bigquery.project in instance.yaml"})`.
> **BREAKING for env-only deployments.** Today `RemoteQueryEngine._get_bq_client` uses `BIGQUERY_PROJECT` *only* as the billing project; data project for FROM clauses comes from elsewhere (e.g. SQL strings the user wrote). After this change, `BIGQUERY_PROJECT` sets both billing and data, **overriding** `data_source.bigquery.project` for FROM-clause construction in `v2_scan` / `v2_sample` / `v2_schema`. Deployments that combine env-var-for-billing + yaml-for-data must migrate by setting `data_source.bigquery.billing_project` in `instance.yaml` and clearing the env var. Flagged in CHANGELOG.
### Cross-project Forbidden detection
`'serviceusage' in str(e).lower()` is the only reliable signal of the cross-project quota issue. Other cross-project failures (revoked SA, table-level ACL, dataset-level deny) degrade to generic `bq_forbidden` — still 502, still structured body, just less specific in the hint. `billing != data` is the **normal** cross-project setup, not a signal of failure; using it to classify would over-trigger and misdirect operators.
`billing != data` MAY enrich `details.hint` ("Note: billing and data projects differ — verify SA has both BQ Read on data project AND serviceusage.services.use on billing project") but does not alter `kind`.
### Status code mapping rationale
- **400** for `bq_bad_request` — BQ rejecting a SQL string built from user input is plausibly the user's fault. Returns it back to the user as a 4xx with the BQ message.
- **502** for `bq_forbidden`, `cross_project_forbidden`, `bq_upstream_error`, `auth_failed` — upstream BQ refused or was unreachable. Operationally distinguishable from 500 in dashboards: "integration with BQ broken" vs "Agnes itself broken".
- **500** for `not_configured`, `bq_lib_missing` — deployment/admin-config bugs that should page on-call, not transient upstream errors.
### Migration of four call sites
#### A. `app/api/v2_scan.py`
`_bq_dry_run_bytes` and `_run_bq_scan` change signature from `(project: str, sql: str)` to `(bq: BqAccess, sql: str)`. Body (note: `bq.client()` is called *outside* the `try/except` so a `BqAccessError` from it still propagates correctly through the endpoint's `except BqAccessError` clause, even before the translator's pass-through would catch it):
```python
def _bq_dry_run_bytes(bq: BqAccess, sql: str) -> int:
from google.cloud import bigquery
client = bq.client() # may raise BqAccessError(bq_lib_missing/auth_failed); propagates as-is
try:
job = client.query(
sql, job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False),
)
return int(job.total_bytes_processed or 0)
except Exception as e:
raise translate_bq_error(e, bq.projects, bad_request_status="client_error")
```
`_run_bq_scan` mirrors the same shape with `bad_request_status="client_error"` (SQL contains user's `select`/`where`/`order_by`).
`scan_endpoint`, `scan_estimate_endpoint`, `estimate`, and `run_scan` lose the `project_id` and `billing_project` parameters in favor of `bq: BqAccess`. Endpoints inject via FastAPI:
```python
@router.post("/scan")
async def scan_endpoint(
raw: dict,
user: dict = Depends(get_current_user),
conn: duckdb.DuckDBPyConnection = Depends(_get_db),
bq: BqAccess = Depends(get_bq_access),
):
try:
ipc = run_scan(conn, user, raw, bq=bq, quota=...)
return Response(...)
except WhereValidationError as e:
raise HTTPException(status_code=400, detail=...)
except QuotaExceededError as e:
raise HTTPException(status_code=429, detail=...)
except BqAccessError as e:
raise HTTPException(
status_code=BqAccessError.HTTP_STATUS.get(e.kind, 500),
detail={"error": e.kind, "message": e.message, "details": e.details},
)
```
`_build_bq_sql(table_row, project_id, req)` keeps `project_id` as a parameter (it's the data project for the FROM clause). Call sites pass `bq.projects.data`. **Forward-compat note** in code comments: when `table_registry` grows a per-table `source_project` column, callers should prefer `table_row.get('source_project') or bq.projects.data`. `tests/test_v2_scan.py:206-207` imports `_build_bq_sql` directly — the signature is unchanged, no test update needed.
#### B. `app/api/v2_sample.py`
`_fetch_bq_sample` changes signature to `(bq: BqAccess, dataset: str, table: str, n: int)`:
```python
def _fetch_bq_sample(bq: BqAccess, dataset: str, table: str, n: int) -> list[dict]:
from src.identifier_validation import validate_quoted_identifier
if not (validate_quoted_identifier(bq.projects.data, "BQ project")
and validate_quoted_identifier(dataset, "BQ dataset")
and validate_quoted_identifier(table, "BQ source_table")):
raise ValueError("unsafe BQ identifier in registry — refusing to query")
bq_sql = f"SELECT * FROM `{bq.projects.data}.{dataset}.{table}` LIMIT {int(n)}"
with bq.duckdb_session() as conn:
try:
df = conn.execute(
"SELECT * FROM bigquery_query(?, ?)",
[bq.projects.billing, bq_sql],
).fetchdf()
return df.to_dict(orient="records")
except Exception as e:
raise translate_bq_error(e, bq.projects, bad_request_status="upstream_error")
```
`build_sample` and `sample` endpoint signatures lose `project_id` for `bq: BqAccess = Depends(get_bq_access)`. Endpoint catch chain adds `BqAccessError` and `ValueError → 400 (kind='unsafe_identifier')`.
#### C. `app/api/v2_schema.py` — two blocks, two semantics
The two blocks have **different error contracts** — preserve them.
**`_fetch_bq_schema` (lines 48-73)** — strict; the endpoint can't return without it. Migrate to `bq.duckdb_session()` + `translate_bq_error(..., bad_request_status="upstream_error")`. Endpoint wraps in `try/except BqAccessError` and returns the structured 502.
**`_fetch_bq_table_options` (lines 90-129)** — best-effort; current code has `except Exception as e: logger.warning(...); return {}`. **Preserve this contract.** Migrate to use `bq.duckdb_session()` for the connection but keep the outer `try/except Exception → return {}`. Do NOT translate via `translate_bq_error`; the `/schema` endpoint must continue to return successfully with empty partition info when BQ is unreachable, not 502.
```python
def _fetch_bq_table_options(bq: BqAccess, dataset: str, table: str) -> dict:
from src.identifier_validation import validate_quoted_identifier
if not (validate_quoted_identifier(bq.projects.data, "BQ project")
and validate_quoted_identifier(dataset, "BQ dataset")
and validate_quoted_identifier(table, "BQ source_table")):
return {}
try:
with bq.duckdb_session() as conn:
bq_sql = (
f"SELECT column_name, is_partitioning_column, clustering_ordinal_position "
f"FROM `{bq.projects.data}.{dataset}.INFORMATION_SCHEMA.COLUMNS` "
f"WHERE table_name = ? "
f"ORDER BY clustering_ordinal_position NULLS LAST"
)
rows = conn.execute(
"SELECT * FROM bigquery_query(?, ?, ?)",
[bq.projects.billing, bq_sql, table],
).fetchall()
if not rows:
return {}
partition_by = next(
(r[0] for r in rows if (r[1] or "").upper() == "YES"),
None,
)
clustered_by = [r[0] for r in rows if r[2] is not None]
return {"partition_by": partition_by, "clustered_by": clustered_by}
except Exception as e:
logger.warning("BQ table options fetch failed for %s.%s.%s: %s",
bq.projects.data, dataset, table, e)
return {}
```
Endpoint signature gains `bq: BqAccess = Depends(get_bq_access)`.
#### D. `src/remote_query.py` — lazy `BqAccess` construction
`RemoteQueryEngine.__init__` signature changes:
```python
def __init__(
self,
..., # existing args
bq_access: BqAccess | None = None,
):
...
self._bq = bq_access # may stay None — resolved lazily
```
**Lazy resolution is critical.** Many existing tests in `tests/test_remote_query.py` (lines 106, 148, 196, 417, 520, 529, 538) construct `RemoteQueryEngine(analytics_conn)` for DuckDB-only paths that never touch BQ. After this change, eager `get_bq_access()` at construction would fail those tests with `not_configured` in any environment without `instance.yaml` configured for BQ. Resolve only when actually needed:
```python
def _get_bq_client(self):
if self._bq is None:
from connectors.bigquery.access import get_bq_access
self._bq = get_bq_access() # may raise BqAccessError; that's fine
return self._bq.client()
```
`_bq_client_factory` parameter, the docstring at line 204 (which referenced the stale `scripts.duckdb_manager._create_bq_client` default), and lines 407-450 of `_get_bq_client` all delete. The fallback chain logic moves to `get_bq_access`.
**External caller note:** `cli/commands/query.py:120` constructs `RemoteQueryEngine(conn, **engine_kwargs)`. The new `bq_access` kwarg has default `None`, so the CLI continues to work via the lazy `get_bq_access()` path. No CLI change needed.
### Test rewrite
The existing `_bq_client_factory` injection point in `RemoteQueryEngine` and tests like `tests/test_remote_query.py` currently look like:
```python
engine = RemoteQueryEngine(_bq_client_factory=lambda project: mock_client)
```
Migrates to direct `BqAccess` injection — no monkey-patching, no classmethod gymnastics:
```python
def test_remote_query_x():
bq = BqAccess(
BqProjects(billing="test-billing", data="test-data"),
client_factory=lambda projects: mock_client,
)
engine = RemoteQueryEngine(..., bq_access=bq)
engine.execute(...)
```
DuckDB-only `RemoteQueryEngine` tests (the ~7 sites listed above) need NO change — `bq_access=None` defaults preserve today's behavior; `get_bq_access()` is never called.
For FastAPI endpoint tests (`tests/test_v2_*.py`), use FastAPI's `dependency_overrides`:
```python
def test_v2_scan_x(client):
bq = BqAccess(
BqProjects(billing="test-billing", data="test-data"),
client_factory=lambda projects: mock_client,
)
app.dependency_overrides[get_bq_access] = lambda: bq
try:
response = client.post("/api/v2/scan", json={...})
assert response.status_code == 200
finally:
app.dependency_overrides.clear()
```
Optional shared fixture in `tests/conftest.py`:
```python
@pytest.fixture
def bq_access():
"""Build a BqAccess with pluggable factories. Returns a callable that overrides
the FastAPI Depends and yields the BqAccess instance."""
def _build(*, client=None, duckdb_conn=None,
billing="test-billing", data="test-data"):
bq = BqAccess(
BqProjects(billing=billing, data=data),
client_factory=(lambda projects: client) if client else None,
duckdb_session_factory=(
lambda projects: contextlib.nullcontext(duckdb_conn)
) if duckdb_conn else None,
)
app.dependency_overrides[get_bq_access] = lambda: bq
return bq
yield _build
app.dependency_overrides.clear()
```
> **Fixture caveat for nested sessions.** `contextlib.nullcontext(duckdb_conn)` does NOT close the conn on `__exit__`. The production path closes it via `_default_duckdb_session_factory`. Tests that exercise multiple sequential `bq.duckdb_session()` calls within a single test function will see the same conn object both times — fine for assertion purposes, but won't catch close-and-reopen regressions. The unit test `test_duckdb_session_closes_on_exit` covers the close behavior on the production factory; the fixture is for endpoint integration tests that don't care.
### Tests
#### Unit tests — `tests/test_bq_access.py` (new)
| Test | Asserts |
|---|---|
| `test_resolve_env_var_wins` | `BIGQUERY_PROJECT=foo` overrides `instance.yaml`; both billing+data = foo |
| `test_resolve_billing_falls_back_to_project` | unset `billing_project` → both billing and data = `project` |
| `test_resolve_billing_distinct_from_project` | both set → billing and data differ |
| `test_resolve_raises_when_neither_set` | `BqAccessError(kind='not_configured')` with hint |
| `test_resolve_succeeds_after_config_set` | call once → raises; set config; call again → succeeds (functools.cache doesn't cache exceptions) |
| `test_get_bq_access_is_cached` | two successful calls return the same instance |
| `test_translate_forbidden_serviceusage` | `gax.Forbidden('serviceusage.services.use')``kind='cross_project_forbidden'` + hint |
| `test_translate_forbidden_no_serviceusage_diff_projects` | `gax.Forbidden('table-level perm denied')` + billing≠data → `kind='bq_forbidden'` (NOT cross_project) |
| `test_translate_forbidden_same_project` | `gax.Forbidden` + billing==data → `kind='bq_forbidden'` |
| `test_translate_bad_request_client_error` | `gax.BadRequest`, `bad_request_status='client_error'``kind='bq_bad_request'`, status 400 |
| `test_translate_bad_request_upstream_error` | `gax.BadRequest`, `bad_request_status='upstream_error'``kind='bq_upstream_error'`, status 502 |
| `test_translate_passes_through_BqAccessError` | `BqAccessError('bq_lib_missing', ...)` in → identical out (CRITICAL — guards against bq.client() raising it inside try/except) |
| `test_translate_unknown_reraises` | `RuntimeError("oops")` is re-raised, NOT silently wrapped |
| `test_client_uses_billing_as_quota_project` | `_default_client_factory` constructs with `quota_project_id=projects.billing` |
| `test_default_client_factory_raises_bq_lib_missing_on_importerror` | mock missing google-cloud-bigquery → `BqAccessError(bq_lib_missing)` |
| `test_default_duckdb_session_raises_auth_failed_on_metadata_error` | mock `get_metadata_token` raising `BQMetadataAuthError``BqAccessError(auth_failed)` |
| `test_duckdb_session_closes_on_exit` | mock token + duckdb conn, assert `conn.close()` called |
| `test_duckdb_session_closes_on_exception` | exception inside `with` block still triggers `conn.close()` |
| `test_injected_client_factory_overrides_default` | `BqAccess(..., client_factory=...)` skips `_default_client_factory` |
#### Integration tests — extending `tests/test_v2_*.py`
| Test | Asserts |
|---|---|
| `test_v2_scan_returns_502_on_bq_forbidden_serviceusage` | mock client raises `gax.Forbidden('...serviceusage...')`; response 502 + body `error=cross_project_forbidden` + hint mentions `billing_project` |
| `test_v2_scan_returns_400_on_bq_bad_request` | mock client raises `gax.BadRequest('invalid syntax')`; response 400 + body `error=bq_bad_request` |
| `test_v2_scan_estimate_returns_502_on_bq_forbidden` | same pattern for `/scan/estimate` |
| `test_v2_scan_returns_500_on_bq_lib_missing` | mock client raises `BqAccessError(bq_lib_missing)`; response 500 + structured body |
| `test_v2_sample_returns_502_on_bq_forbidden` | mock duckdb_session raises via `bigquery_query`; response 502 + structured body |
| `test_v2_sample_returns_400_on_unsafe_identifier` | registry row with backtick in `source_table` → 400 + body `error=unsafe_identifier` |
| `test_v2_sample_returns_404_on_unknown_table` | unchanged behavior (regression guard) |
| `test_v2_sample_returns_403_on_unauthorized` | unchanged behavior (regression guard) |
| `test_v2_schema_returns_502_on_bq_forbidden` | strict block (`_fetch_bq_schema`) failure → 502 |
| `test_v2_schema_returns_200_with_empty_partition_on_bq_failure` | best-effort block (`_fetch_bq_table_options`) failure → 200, schema returned, partition_by/clustered_by absent. Regression guard for the swallow-all preservation. |
| `test_v2_schema_returns_200_on_success` | regression guard (the existing happy path) |
#### E2E manual verification (post-deploy on `<your-dev-instance>`)
For each of `/sample`, `/scan/estimate`, `/scan`, `/schema`:
```bash
PAT=...
# Pre-deploy reference (current production behavior — bare 500 with no body for /sample/scan*)
curl -k -i -H "Authorization: Bearer $PAT" \
https://<your-agnes-host>/api/v2/sample/<bq_table_id>?n=2
curl -k -i -X POST -H "Authorization: Bearer $PAT" -H "Content-Type: application/json" \
https://<your-agnes-host>/api/v2/scan/estimate \
-d '{"table_id":"<bq_table_id>","select":["event_date"],"where":"event_date = DATE \"2026-04-21\""}'
curl -k -i -X POST -H "Authorization: Bearer $PAT" -H "Content-Type: application/json" \
https://<your-agnes-host>/api/v2/scan \
-d '{"table_id":"<bq_table_id>","select":["event_date"],"where":"event_date = DATE \"2026-04-21\"","limit":50}'
curl -k -i -H "Authorization: Bearer $PAT" \
https://<your-agnes-host>/api/v2/schema/<bq_table_id>
# Post-deploy, BEFORE fixing instance.yaml — expect:
# /sample, /scan, /scan/estimate: 502 + structured JSON body with
# error=cross_project_forbidden and a hint.
# /schema: 200 (because INFORMATION_SCHEMA queries don't fail today; if
# something else fails on the strict block, expect 502).
# Operator action: set data_source.bigquery.billing_project in instance.yaml,
# restart the container.
# Post-config-fix — expect 200 on all four endpoints.
```
This four-endpoint × three-state matrix is the success criterion for closing #134. Without it, "fixed" is unverifiable.
## Implementation strategy — staged commits
Per first-round review: stage as **two commits** so the user-visible bug fix is independently reviewable / revertable from the refactor. Per second-round review: **both commits emit the same structured response shape** so client-side parsers (CLI, UI) don't see contract churn between commits.
**Commit 1 — Minimal bug fix (revertable, atomic across all three v2 endpoints):**
- `app/api/v2_sample.py`:
- Read `billing_project` with same fallback as `v2_scan.py:385`; pass to `bigquery_query()`.
- Wrap `_fetch_bq_sample` in `try/except google.api_core.exceptions.*` translating to `HTTPException` with the structured body shape (see below).
- `app/api/v2_scan.py`: wrap `_bq_dry_run_bytes` and `_run_bq_scan` in the same `try/except` shape.
- `app/api/v2_schema.py`: wrap `_fetch_bq_schema` (strict block) in the same `try/except`. **Do NOT touch `_fetch_bq_table_options`** — its `except Exception → return {}` swallow-all is preserved unchanged in commit 1, then migrated to use `bq.duckdb_session()` in commit 2 (still preserved).
All three endpoints emit the **same structured body shape** that commit 2 will produce, so client-side parsers (CLI, UI) see one consistent contract throughout the rollout:
```python
except gax.Forbidden as e:
kind = "cross_project_forbidden" if "serviceusage" in str(e).lower() else "bq_forbidden"
raise HTTPException(
status_code=502,
detail={"error": kind, "message": str(e), "details": {...}},
)
except gax.BadRequest as e:
# v2_scan: bad_request_status='client_error' → kind='bq_bad_request', 400
# v2_sample, v2_schema: bad_request_status='upstream_error' → kind='bq_upstream_error', 502
raise HTTPException(
status_code=400 if <user_derived> else 502,
detail={"error": "bq_bad_request" if <user_derived> else "bq_upstream_error",
"message": str(e), "details": {}},
)
```
- Tests: regression tests for the structured body shape on all three endpoints.
This commit alone closes the user-visible part of #134 atomically — no half-rolled-out window where `/sample` returns bare 500 while `/scan` returns structured 502. If commit 2 needs another review round, commit 1 still ships and the response contract is forward-compatible.
**Commit 2 — `BqAccess` extraction + migration:**
- Create `connectors/bigquery/access.py` with the design above.
- Migrate `v2_scan`, `v2_sample`, `v2_schema` (both blocks, with separate semantics), `RemoteQueryEngine` (lazy `bq_access`) to `BqAccess`.
- Remove `_bq_client_factory` from `RemoteQueryEngine.__init__` and the stale docstring at `src/remote_query.py:204`.
- Migrate tests to the new `BqAccess(client_factory=...)` + `dependency_overrides` pattern.
- Delete inline `try/except gax.*` blocks added in commit 1; route through `translate_bq_error` instead.
## Risks
1. **Test rewrite breaks something subtle.** `tests/test_remote_query.py` and possibly `tests/test_duckdb_manager.py` have many `_bq_client_factory` call sites. The new fixture pattern must cover every shape they exercise. Mitigation: convert tests one-by-one in commit 2, run pytest after each, before deleting the old injection point. DuckDB-only tests that don't touch BQ are protected by lazy `bq_access` resolution in `RemoteQueryEngine`.
2. **Cross-project Forbidden detection heuristic is narrow but principled.** Relies on Google's error message containing `'serviceusage'` (case-insensitive). False positives are unlikely (the substring is specific). False **negatives** are possible — those degrade to `bq_forbidden` with a generic message, still a 502 with structured body. Acceptable.
3. **`get_bq_access()` is `@functools.cache`'d.** Cheap and process-lifetime-safe (config is loaded at boot and immutable). Tests use `dependency_overrides` and direct `BqAccess(...)` construction, never the cached path — no cache invalidation needed in tests. Hot-reload of `instance.yaml` is explicitly out of scope.
4. **`bq_bad_request → 400` could leak BQ error messages.** BQ's `BadRequest` text typically describes the SQL problem. We surface it in `details.message`. Operators who don't want this can filter at a reverse-proxy layer; this matches behavior of any 4xx-with-detail elsewhere in the app.
5. **`BIGQUERY_PROJECT` env-var precedence is BREAKING for env-only deployments.** Deployments that combine env-var-for-billing + yaml-for-data must migrate. See the project-resolution rules section. Flag in CHANGELOG and release notes.
6. **Two duplicate sites left behind (`extractor.py`, `register_bq_table`).** Explicit follow-up issue should be filed at PR-merge time.
7. **`translate_bq_error` pass-through ordering is load-bearing.** `bq.client()` and `bq.duckdb_session()` raise `BqAccessError` directly for `bq_lib_missing` / `auth_failed`. The translator's first clause MUST be `if isinstance(e, BqAccessError): return e`. Otherwise those typed errors fall through to "unknown" and get re-raised as bare 500. Unit-tested via `test_translate_passes_through_BqAccessError`.
8. **In-memory DB reload cost on each `duckdb_session()` (pre-existing).** Every BQ-via-DuckDB call runs `INSTALL bigquery` fresh. If the extension binary isn't already cached on disk, this is an HTTPS download. Today's code has the same cost; this PR doesn't fix it. Future optimization: long-lived in-memory conn with extension pre-loaded, behind a thread-safe pool.
## CHANGELOG entry (for the implementation PR)
Under `## [Unreleased]`:
**`### Fixed`**
- v2 `/sample` endpoint: BigQuery cross-project queries now respect `data_source.bigquery.billing_project` from `instance.yaml` (mirrors v2 `/scan` fix from `33a9964`). Closes #134 for `/sample`.
- v2 `/scan`, `/scan/estimate`: BigQuery upstream errors no longer return bare HTTP 500 with empty body. `Forbidden` from BQ now returns HTTP 502 with structured JSON body (`{"error": "cross_project_forbidden", "message": "...", "details": {"hint": "..."}}`); `BadRequest` on user-derived SQL returns HTTP 400 with `kind=bq_bad_request`. Closes #134 for `/scan*`.
- v2 `/schema`: same error translation applied to the strict path (`_fetch_bq_schema`); the best-effort partition-info path (`_fetch_bq_table_options`) preserves its swallow-all-and-return-empty behavior, so `/schema` still returns 200 with empty partition info if BQ partition-info queries fail.
**`### Changed`**
- **BREAKING for deployments using `BIGQUERY_PROJECT` env var alongside `data_source.bigquery.project` in `instance.yaml`.** The env var now sets BOTH billing and data project, overriding `data_source.bigquery.project` for FROM-clause construction in `v2_scan` / `v2_sample` / `v2_schema`. Migrate by clearing `BIGQUERY_PROJECT` and using `data_source.bigquery.billing_project` + `data_source.bigquery.project` in `instance.yaml`. (Previously `BIGQUERY_PROJECT` only affected `RemoteQueryEngine` billing.)
**`### Internal`**
- New shared module `connectors/bigquery/access.py``BqAccess` facade unifies BQ project resolution, client construction, DuckDB-extension session, and Google-API error translation across `v2_scan`, `v2_sample`, `v2_schema`, and `RemoteQueryEngine`.
- **Internal API change:** `RemoteQueryEngine.__init__` no longer accepts `_bq_client_factory`. Callers that injected it migrate to `RemoteQueryEngine(..., bq_access=BqAccess(projects, client_factory=...))`. The CLI (`cli/commands/query.py`) is unaffected — it never injected the factory and the new `bq_access` kwarg defaults to `None` (lazy `get_bq_access()` on first BQ call).
- Stale docstring at `src/remote_query.py:204` (referencing `scripts.duckdb_manager._create_bq_client` as the default factory) removed.
- Two known-duplicate BQ-access sites (`connectors/bigquery/extractor.py`, `scripts/duckdb_manager.register_bq_table`) explicitly out of scope; tracked as follow-up.