agnes-the-ai-analyst

Author	SHA1	Message	Date
ZdenekSrotyr	c3e82972c8	feat(bq): decouple table_registry bucket from BQ dataset name (#343 ) (#346 ) * feat(bq): decouple table_registry bucket from BQ dataset name (#343) Adds optional `bq_fqn` column (schema v51) carrying the fully-qualified BigQuery path (project.dataset.table) so the rebuild path no longer has to reconstruct it from the dual-purpose `bucket` field (which is also a UX/RBAC label). - Schema v51 migration + _SYSTEM_SCHEMA carry the nullable column; rows without it keep using the legacy bucket+source_table+ remote_attach.project path (backwards compat). - BQ extractor honors bq_fqn per row when present: dataset/table override on same-project rows; cross-project VIEW path works via bigquery_query(billing, ...); cross-project BASE TABLE skipped with a clear warning (multi-ATTACH per project deferred to follow-up). - Orchestrator pre-pass detects drift between extract.duckdb _remote_attach.url and overlay data_source.bigquery.project, calls rebuild_from_registry to regenerate when they differ. Closes the operational hazard where /admin/server-config edits silently left the on-disk extract pointing at the old project until the next manual sync. - Startup config check warns when project ≠ billing_project without location set (the on-disk symptom is "provider returned no data" silently in metadata cache), and when a warehouse-like data project has no billing_project override (silent 403 serviceusage path). - _resolve_bq_location warning now points at the location config key explicitly so operators see the actionable fix in the log. - POST /api/admin/register-table and PUT /api/admin/registry/{id} accept bq_fqn; malformed values rejected at the API boundary (422). - 25 tests covering parse_bq_fqn matrix, extractor override paths (same-project + cross-project VIEW + cross-project BASE TABLE skip), orchestrator drift sync, startup-validator heuristic, admin models. UI surface for bq_fqn input in /admin/tables intentionally omitted from this PR (3.5k-line template change) — admins can register through the REST API or `agnes admin` CLI in the meantime. Multi-project ATTACH support is the same scope deferral as the cross-project BASE TABLE skip; both ride a follow-up PR. * review fixes: abstract CHANGELOG, merge duplicate Changed, bump docs schema version - CHANGELOG.md: remove customer-specific hostname + incident date range from the orchestrator drift-sync entry (vendor-agnostic OSS rule), fold the entry into the existing [Unreleased] ### Changed section instead of opening a duplicate heading. - docs/architecture.md: bump 'Current schema version' from 19 to 51 to match SCHEMA_VERSION (per agnes-orchestrator skill rule #4). * review fixes: vendor-agnostic test fixture + Schema v51 internal bullet - tests/test_bq_fqn.py: replace customer GCP project ID with generic 'my-warehouse-project' placeholder (vendor-agnostic OSS rule). Test asserts on the warehouse-like heuristic, not the literal project name, so the rename is behavior-neutral. - CHANGELOG.md: add explicit '\\Schema v51\\' bullet under `### Internal` naming the new version + summarizing the additive nullable column (matches the convention from v47/v48 bullets). * fix(bq): cross-project _detect_table_type bills against extractor project Addresses Devin review on #346 — pre-fix _detect_table_type passed the data project as BOTH the FROM-clause target AND the bigquery_query() first arg (billing project). For cross-project bq_fqn rows where fqn_project != project_id, the data SA holds bigquery.dataViewer on fqn_project but the serviceusage.services.use permission only on project_id, so the call 403'd. init_extract's broad except Exception swallowed the error and silently skipped the row, meaning the cross-project VIEW path at extractor.py:~696 — the PR's primary cross-project use case — never executed. - Add optional billing_project kwarg to _detect_table_type; defaults to project for backwards compat (same-project callers unaffected). - Update the init_extract call site to pass billing_project=project_id explicitly. Same-project rows (fqn_project == project_id) are a no-op; cross-project rows now route billing to the project where the SA actually has services.use. - 2 new tests in TestDetectTableTypeBilling cover (a) explicit billing_project routing to bigquery_query 1st arg + data project staying in FROM, and (b) the backwards-compat default. Plus test_cross_project_detect_call_bills_against_extractor_project pins the call-site wiring — captures the (project, billing_project) pair the extractor passes for a cross-project bq_fqn row. * release: 0.54.29 — bq_fqn decoupling + marketplace refactor + setup-script UX Accumulated [Unreleased] content from #342 (flea marketplace refactor), #344 (setup script step-2 cwd check), and #346 (this PR — bq_fqn column + orchestrator drift sync + startup config check). Schema v51.	2026-05-19 11:17:32 +00:00
ZdenekSrotyr	a48524509a	docs: consolidate and de-clutter the documentation tree (#306 ) CLAUDE.md rewritten (708 -> ~320 lines): four overlapping release sections collapsed to one, stale v1->v35 schema history dropped (it lives in CHANGELOG), marketplace endpoint internals and verbose process sections moved out or tightened. New focused docs: - docs/RELEASING.md - release process, deploy workflows, CI quirks (RELEASE_TEMPLATE.md folded in as an appendix) - docs/marketplace.md - marketplace ingestion + re-serving internals - docs/README.md - documentation index by audience, linked from README.md and CLAUDE.md Archived under docs/archive/: docs/superpowers/ (52 historical planning artifacts), HACKATHON.md, pd-ps-comments.md, security-audit-2026-04.md, future/NOTIFICATIONS.md. Removed the docs/auto-install.md stub. Fixed dangling links in connectors/jira/README.md and dev_docs/README.md, repointed code/doc references to archived paths.	2026-05-14 18:54:22 +00:00
ZdenekSrotyr	aa5921da67	release: 0.47.0 — source-agnostic catalog metadata + cache discipline (#223 ) ## Summary - Catalog enrichment for `query_mode='remote'` rows: `rows`, `size_bytes`, `partition_by`, `clustered_by` per table (BQ + Keboola providers). - `/api/v2/schema/{id}` cache miss: 2 BQ jobs → 1 (-50%) via shared `fetch_bq_columns_full`. - All four catalog/schema/sample/metadata caches flush on registry change; single-row re-warm scheduled. - Automatic cache warmup at server startup (bounded concurrency, opt-out via `AGNES_SKIP_CACHE_WARMUP=1`). - SSE-driven freshness toolbar on `/admin/tables` with progress bar, log, and per-row badge. - New admin doc `docs/admin/query-modes.md` — single source of truth on `local` / `remote` / `materialized` choice. Closes #155. Closes #156. ## Test plan - [x] 65+ targeted tests pass across 11 new test modules + 3 modified ones. - [x] No DB migration; no wire-break; `MIN_COMPAT_CLI_VERSION` unchanged. - [ ] Reviewer: register a remote BQ table via `/admin/tables`, observe the toolbar populates within ~2 s and the per-row badge transitions warming → fresh. - [ ] Reviewer: trigger `Re-warm all`, verify SSE log scrolls and `cacheWarmupBar` progresses. - [ ] Reviewer: edit a registered row's bucket, verify `agnes schema <id>` returns updated columns immediately (no 1-hour staleness). - [ ] Reviewer: confirm `agnes admin register-table --query-mode remote` prints the new IAM-smoke-check hint. ## Notable design decisions - BigQuery `INFORMATION_SCHEMA.TABLE_STORAGE` is the only valid scope for size+rows (verified live 2026-05-07; dataset-scoped doesn't exist). Region resolved from `instance.yaml.data_source.bigquery.location` → `bq.client().get_dataset(...)` → fall back to legacy `__TABLES__`. - VIEW handling: TABLE_STORAGE returns no rows for views, fall through to `__TABLES__` (also empty) → `TableMetadata(rows=None, size_bytes=None, partition_by=..., clustered_by=...)`. Null size signals analyst Claude to apply existing CLAUDE.md guidance. - `size_bytes` is `active_logical_bytes + long_term_logical_bytes` — full BQ scan reads both; reporting only active undercounts aged partitioned tables. - Source-agnostic provider seam: per-source `connectors/<source>/metadata.py:fetch(MetadataRequest)`; dispatcher in `app/api/v2_catalog.py:_metadata_provider_for` lazily imports per source_type so a Keboola-only deployment doesn't pay the BQ-extension import cost. - Warmup non-blocking: FastAPI `lifespan` schedules `asyncio.create_task(_warm_catalog_caches_bg)` before `yield`. Per-row failures isolated. ## Out of scope - Profile / column histograms / dimension cardinality for remote tables (separate issue). - Onboarding nudge ("you have 0 remote tables, consider registering some BQ ones") — separate UX call. - Provider plug-in registration via entry-points (the dispatch table is a hardcoded if-tree today; one line per future source). ## Release Bumps `pyproject.toml` 0.46.1 → 0.47.0 (main shipped 0.46.0 + 0.46.1 during this PR — see commit `d98976ec`). New CHANGELOG section under `## [0.47.0] — 2026-05-07`. 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- devin-review-badge-begin --> --- <a href="https://app.devin.ai/review/keboola/agnes-the-ai-analyst/pull/223" target="_blank"> <picture> <source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1"> <img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open in Devin Review"> </picture> </a> <!-- devin-review-badge-end -->	2026-05-07 18:33:55 +02:00
ZdenekSrotyr	e5645fd280	fix: devil's advocate R1 — chunked probe, parse-error heuristic narrow, pool settings refresh, content-length sanity, multi-project skip R1 adversarial review surfaced 5 issues, all addressed: #1 chunked download silently disabled in non-Caddy deployments (HEAD on GET-only FastAPI route returns 405). _probe_range_support now falls back to GET with Range: bytes=0-0 when HEAD fails — works against both Caddy file_server (HEAD-friendly) and dev FastAPI direct (GET-only). #2 parse-error fallback heuristic too broad — matched on Unrecognized name / Function not found / No matching signature / Invalid cast, which BQ surfaces for ordinary user-column typos. That triggered slow ATTACH-catalog retry on every typo (2× latency tax). Narrowed to just 'Syntax error' / 'syntax error' which are the genuine DuckDB-vs-BQ dialect mismatch markers. #3 apply_bq_session_settings was only run on fresh-built pool entries, not on reuse. An operator's /admin/server-config change to bq_query _timeout_ms wouldn't propagate to long-lived pooled sessions until restart. Fixed: re-apply on every pool acquire (idempotent + fail-soft). #4 content-length sanity bound — a misconfigured proxy returning a wildly inflated Content-Length would cause overlapping chunked Range requests against the actual file → corrupt assembled output (caught by manifest hash check, but only after wasted bandwidth). Cap at 100 GiB; above that, drop to single-stream. #5 rewriter assumed every BQ row resolves under the single bq.projects.data project. Bucket containing '.' suggests a project- qualified bucket (multi-project deployment); rewriter would silently target the wrong project. Conservative skip with regression test.	2026-05-06 13:50:46 +02:00
ZdenekSrotyr	14db85f506	fix(bq): map 'Response too large' to its own error class instead of generic bad_request translate_bq_error previously mapped BQ's responseTooLarge failure mode to bq_bad_request (HTTP 400 with the raw upstream message). The user- facing implication ('your SQL has a syntax error') is wrong -- the root cause is query shape (BQ refused to return the result inline because it exceeded the response size limit), and the actionable remediation is 'narrow the WHERE clause, aggregate further, or use a materialized table'. Add bq_response_too_large as a first-class BqAccessError kind (also 400) with a canonical hint message; original BQ message preserved in details for operator debugging. Detection is substring-based on 'response too large' and fires before the generic BadRequest path so the dedicated mapping always wins. Affects every BQ-touching path since they all share translate_bq_error -- /api/query, /api/v2/{scan,sample,schema}, materialize.	2026-05-06 13:09:31 +02:00
ZdenekSrotyr	83209f32b0	perf(bq): pool DuckDB BQ extension sessions to amortize INSTALL/LOAD/ATTACH cost Each BqAccess.duckdb_session() acquire previously created a fresh in-memory DuckDB conn and ran INSTALL bigquery; LOAD bigquery; CREATE SECRET; ATTACH on it -- costing ~0.5 s per request even before any BQ work. Add a process-local pool (deque + lock) of pre-warmed sessions; acquire reuses a warm entry when available, refreshing the auth SECRET so a long-lived pool entry doesn't keep a stale GCE metadata token past its TTL. Liveness probe (cheap SELECT 1) drops broken entries before handing them to callers. On exception inside the with-block the conn is closed instead of returned to pool (session may carry dirty state). Pool size is data_source.bigquery.session_pool_size (default 4; sentinel 0 disables pooling). Process-cached, not fork-safe (single uvicorn worker is the supported deployment shape per CLAUDE.md). All call sites get faster automatically: /api/query, /api/v2/{scan, sample,schema}, materialize, the orchestrator's remote-attach, and the BQ dry-run cap-guard.	2026-05-06 13:06:25 +02:00
Vojtech Rysanek	32c8ea601a	fix(bigquery): apply bq_query_timeout_ms on every BQ-extension attach + surface silent failures The DuckDB BigQuery extension defaults bq_query_timeout_ms to 90 s, which is too tight for analyst-scale queries against view-backed BQ datasets. Agnes already has apply_bq_session_settings() that bumps it to 600 s (configurable via data_source.bigquery.query_timeout_ms), but two regressions let the 90 s default leak through to live queries: 1. apply_bq_session_settings() swallowed every Exception silently. If the BigQuery extension wasn't loaded on the connection yet, or the installed extension version didn't recognise the setting, the SET would fail and the function would return without surfacing the problem. Operators saw 90 s timeouts on 'agnes query --remote' with no log line explaining why. 2. The call sites in src/db.py:_reattach_remote_extensions and src/orchestrator.py:_remote_attach only invoked apply_bq_session_settings on the metadata-token branch (token_env empty, the BqAccess contract). The token-based and no-auth branches ran ATTACH against the BigQuery extension without ever applying the timeout setting — so any BQ source registered with an explicit token_env, or with no auth env at all, fell back to the 90 s default. Fix: - apply_bq_session_settings now logs WARNING on each failure path (instance_config import error, non-numeric value, SET execution failure, readback error). It also verifies the setting actually landed via SELECT current_setting('bq_query_timeout_ms') and logs WARNING when the readback disagrees with the requested value, which catches the silent-ignore case some extension versions exhibit. - Both _reattach_remote_extensions (src/db.py) and _remote_attach (src/orchestrator.py) now call apply_bq_session_settings on every branch that ATTACHes a BigQuery alias, not only the metadata-token branch. Idempotent: calling it twice on the metadata-token path is a no-op SET. Tests: - Extended the _RecordingConn fixture to support .fetchone() so the readback assertion path works. Updated existing call-shape assertions to expect the SELECT current_setting readback alongside the SET. Added two new tests covering the WARNING surfaces for SET failure and readback mismatch — regression guards for the silent- fallback bug this PR addresses. - Full BQ-touching suite (398 tests) passes.	2026-05-06 11:24:14 +04:00
ZdenekSrotyr	4f04235502	feat(bigquery): bq_query_timeout_ms knob; default 600s (was 90s) DuckDB BigQuery extension defaults `bq_query_timeout_ms` to 90 s, which is too tight for analyst-scale queries against view-backed BQ datasets. `agnes query --remote` HTTP 400'd with `Binder Error: Query execution exceeded the timeout. Job ID: ...` whenever the underlying BQ job ran longer than 90 s, even though the job itself was healthy. Add `data_source.bigquery.query_timeout_ms` (default 600 000 ms = 10 min, sentinel 0 falls through to the extension default). Applied via `SET bq_query_timeout_ms` after every `LOAD bigquery` on every BQ-touching DuckDB session: orchestrator's `_remote_attach` ATTACH path, BqAccess session factory, and the standalone extractor. Configurable via `/admin/server-config` UI. Fail-soft: extension versions that don't recognise the setting silently keep the default rather than poisoning the session.	2026-05-05 16:40:40 +02:00
ZdenekSrotyr	83adf01bde	fix(v2): #134 BigQuery cross-project errors return structured 502/400 + BqAccess facade (#138 ) * docs(spec): #134 unify BigQuery access behind BqAccess facade Brainstorm output for issue #134. Captures: - root cause (incl. correction of the issue's hypothesis about commit 33a9964) - BqAccess facade API + project resolution rules - error contract — typed BqAccessError mapped to HTTP 502 for upstream BQ failures, 500 for deployment/config bugs - migration plan for v2_scan, v2_sample, RemoteQueryEngine - test rewrite eliminating _bq_client_factory injection point - E2E verification protocol on agnes-development as success criterion * docs(spec): #134 revise after first review Incorporates code-reviewer findings: Must-fix: - Add v2_schema (2 copies of INSTALL/LOAD/SECRET dance) to migration scope. - Reframe v2_scan headline: missing try/except around BQ calls is the actual cause of bare 500s, not project resolution (which 33a9964 fixed). - List two more deferred call sites (extractor.py, register_bq_table) with explicit rationale. Important: - Drop billing != data clause from cross_project_forbidden heuristic; rely only on 'serviceusage' substring. billing != data is normal for cross-project setup, was over-classifying. - Split bq_bad_request into _user (400) and _server (502) variants; add sql_origin parameter to translate_bq_error so call sites declare whether SQL contains user input. - Add @functools.cache to BqAccess.from_config; document tests bypass via dependency_overrides. - Replace monkey-patched-classmethod test pattern with BqAccess(client_factory=...) injection at construction time. Cleaner than today's _bq_client_factory and 1:1 migration shape. - Keep BqProjects.data (reviewer assumed registry has source_project; it doesn't). Multi-project explicitly listed as non-goal with note. Nice-to-have: - Add 'Implementation strategy' section: 2 staged commits (bug fix alone is revertable; refactor follows). - Extend E2E protocol to cover all three endpoints, not just /sample. - Note removal of stale docstring at src/remote_query.py:204. * docs(spec): #134 revision 3 — incorporates second-round review Must-fix from second review: - v2_schema split into two migration cases: _fetch_bq_schema translates errors via translate_bq_error; _fetch_bq_table_options preserves its swallow-all 'except Exception → return {}' so /schema doesn't 502 on partition-info failures. - RemoteQueryEngine.__init__ now resolves BqAccess lazily (in _get_bq_client, not in __init__). Without this, ~7 DuckDB-only tests in test_remote_query.py would suddenly fail with not_configured. - translate_bq_error pass-through for BqAccessError is now load-bearing (clause 1, before any Google-API branch). bq.client() raises BqAccessError for bq_lib_missing/auth_failed; without explicit pass-through those fall to 'unknown' and re-raise as bare 500. - Commit 1 now emits the SAME structured response shape as commit 2 to avoid contract churn between commits. - BIGQUERY_PROJECT env-var precedence is BREAKING for env-only deployments — flagged in CHANGELOG ### Changed. Editorial: - sql_origin renamed to bad_request_status with values 'client_error' / 'upstream_error' (clearer about what the parameter actually decides). bq_bad_request_user/_server kinds collapsed to bq_bad_request (400) and bq_upstream_error (502). - CLI (cli/commands/query.py) noted as external RemoteQueryEngine caller; unaffected because new bq_access kwarg has default None. - Added unit/integration tests for the new contracts: test_translate_passes_through_BqAccessError, test_v2_scan_returns_500_on_bq_lib_missing, test_v2_schema_returns_200_with_empty_partition_on_bq_failure, test_resolve_succeeds_after_config_set. - E2E protocol now covers /schema as the fourth endpoint. - Documented functools.cache-doesn't-cache-exceptions semantics and fixture nullcontext-doesn't-close caveat for nested sessions. * docs(spec): #134 revision 4 — incorporates third-round review Third reviewer verdict: 'implementation-ready with two trivial edits'; explicitly noted prior rounds did the heavy lifting. Edits: 1. get_bq_access() module-level function instead of @classmethod @functools.cache from_config. Removes the classmethod-cache stacking footgun (different Python versions wrap differently) and gives FastAPI's dependency introspection a clean function signature. Drops the 'Do not subclass BqAccess' caveat that no longer applies. 2. Commit 1 strategy explicitly: wrap _fetch_bq_sample (v2_sample), _bq_dry_run_bytes + _run_bq_scan (v2_scan), and _fetch_bq_schema (v2_schema strict block). Do NOT touch _fetch_bq_table_options swallow-all in commit 1 — preserved as-is, then migrated (still preserved) in commit 2. All three endpoints emit the same structured body shape so client parsers see one consistent contract throughout the staged rollout. No more half-rolled-out window where /sample is bare 500 while /scan is structured 502. * docs(plan): #134 implementation plan — Phase 1 (atomic bug fix) + Phase 2 (BqAccess refactor) + Phase 3 (verification) Bite-sized TDD tasks. 3 phases, 16 tasks total: Phase 1 (Commit 1) — atomic bug fix across all four v2 endpoints: Tasks 1.1-1.5 wrap _fetch_bq_sample, _bq_dry_run_bytes, _run_bq_scan, _fetch_bq_schema with structured 502/400 try/except. _fetch_bq_table_options preserved untouched. CHANGELOG Fixed entries. Phase 2 (Commit 2) — BqAccess facade extraction + migration: Tasks 2.1-2.5 build connectors/bigquery/access.py bottom-up (BqProjects, BqAccessError, translate_bq_error, default factories, BqAccess class, get_bq_access module-level cached). Task 2.6 adds conftest.py fixture. Tasks 2.7-2.9 migrate v2_scan, v2_sample, v2_schema to BqAccess. Tasks 2.10-2.11 migrate RemoteQueryEngine + tests (lazy bq_access, drop _bq_client_factory). Task 2.12 CHANGELOG Changed BREAKING + Internal. Phase 3 — Verification: 3.1 full pytest. 3.2 squash into two PR-shape commits. 3.3 manual E2E on agnes-development per spec protocol → close #134. Self-review table maps spec sections to implementing tasks; no gaps. * fix(v2): #134 structured 502/400 on BQ errors across /scan, /scan/estimate, /sample, /schema Wraps the BigQuery call sites in v2_scan, v2_sample, and v2_schema (strict block only) with try/except for google.api_core exceptions, translating to HTTPException with a structured body shape: {error, message, details}. Fixes Pavel's report (#134) where these endpoints returned bare HTTP 500 with no body when the SA on agnes-development hit cross-project Forbidden on serviceusage.services.use. Also fixes /sample's missing billing_project fallback (the bug 33a9964 fixed for /scan never landed here). Status code split: - /scan, /scan/estimate: BadRequest -> 400 (bq_bad_request) since SQL is user-derived from req.select/where/order_by. - /sample, /schema: BadRequest -> 502 (bq_upstream_error) since SQL is server-constructed from validated identifiers. - All Forbidden -> 502 with cross_project_forbidden if 'serviceusage' in error message (with hint pointing at data_source.bigquery.billing_project), else bq_forbidden. Body shape matches what the upcoming BqAccess refactor (next commit) will produce, so client-side parsers see one consistent contract throughout the staged rollout. _fetch_bq_table_options preserved exactly as-is — its swallow-all-and-return-empty contract is intentional and survives into the refactor; /schema continues to return 200 with empty partition info when partition queries fail. Outer wraps in scan_endpoint, scan_estimate_endpoint, sample, and schema endpoints exist only to make the test pattern (monkeypatching whole _fetch_* functions) work, and are tagged TODO(#134 Phase 2) for removal once BqAccess centralizes translation. * refactor(bq): #134 BqAccess facade — unify v2_scan, v2_sample, v2_schema, RemoteQueryEngine Extracts the duplicated BigQuery-access pattern (project resolution + client construction + DuckDB-extension session + Google-API error translation) into connectors/bigquery/access.py. Migrates four call sites to use it: - app/api/v2_scan.py — _bq_dry_run_bytes, _run_bq_scan - app/api/v2_sample.py — _fetch_bq_sample - app/api/v2_schema.py — _fetch_bq_schema (strict translation), _fetch_bq_table_options (preserves swallow-all best-effort contract) - src/remote_query.py — RemoteQueryEngine, lazy bq_access kwarg The new module exposes: - BqProjects (frozen dataclass: billing + data project IDs) - BqAccessError (typed exception with HTTP_STATUS class mapping) - BqAccess (facade with injectable client_factory/duckdb_session_factory for tests; defaults call the real google-cloud-bigquery + DuckDB extension) - get_bq_access (module-level @functools.cache; FastAPI Depends target) - translate_bq_error (Google API exception → BqAccessError mapper, with BqAccessError pass-through, 'serviceusage'-substring heuristic for cross_project_forbidden, and bad_request_status param distinguishing user-derived (400) from server-constructed (502) SQL) - _default_client_factory, _default_duckdb_session_factory RemoteQueryEngine.__init__ no longer accepts _bq_client_factory; tests migrate to bq_access=BqAccess(projects, client_factory=...). DuckDB-only RemoteQueryEngine tests need no changes — bq_access defaults to None and get_bq_access() is only invoked on first BQ call (lazy resolution). BqAccessError raised internally is translated to RemoteQueryError( error_type="bq_error") in _get_bq_client to preserve the engine's existing public contract — CLI and /api/query/hybrid callers see no change. Endpoint tests (test_v2_scan, test_v2_scan_estimate, test_v2_sample, test_v2_schema) migrate from monkey-patching whole _fetch_* functions to using the new bq_access fixture in tests/conftest.py — which exercises the REAL translation path through BqAccess + translate_bq_error, closing the test gap flagged in Task 1.1's review. Side-effect behavior change: v2_sample's FROM clause now uses the data project (instance.yaml data_source.bigquery.project), not the conflated billing_project from Phase 1. Documented in CHANGELOG ### Internal. BREAKING for deployments combining BIGQUERY_PROJECT env var with data_source.bigquery.project in instance.yaml — env var now overrides data project too. See CHANGELOG ### Changed. Two known-duplicate BQ-access sites (connectors/bigquery/extractor.py, scripts/duckdb_manager.register_bq_table) explicitly out of scope; tracked as follow-up. Removed stale docstring at the previous src/remote_query.py:204 that referenced scripts.duckdb_manager._create_bq_client as the default BQ client factory (RemoteQueryEngine never actually used that function). Test counts: tests/test_bq_access.py +27 (new), tests/test_v2_.py + tests/test_remote_query.py migrated to bq_access fixture (counts unchanged or +1-2 per file). Full suite: 2086 passed, 8 pre-existing failures (DB migration tests with unrelated internal_roles DependencyException — not introduced by this PR). fix(bq_access): translate DefaultCredentialsError to BqAccessError(auth_failed) CI on PR #138 caught: bigquery.Client(...) resolves Application Default Credentials at construction time; without ADC (CI without SA key, dev laptop without 'gcloud auth application-default login') it raises google.auth.exceptions.DefaultCredentialsError synchronously. Pre-fix _default_client_factory only caught ImportError, so DefaultCredentialsError propagated as raw exception — and from production endpoints would surface as bare 500 (the exact failure mode #134 sets out to fix). Now translates to BqAccessError(kind='auth_failed', details.hint='Run gcloud auth application-default login...'). Endpoint catch chain returns HTTP 502 with structured body. Adds unit test test_raises_auth_failed_on_default_credentials_error. Third-round spec review flagged this case in passing; the fix didn't land. CI's auth-less environment surfaced it. * fix(bq_access): get_bq_access() returns sentinel instead of raising when not configured Devin BUG_0001 on PR #138 review: 'get_bq_access() as FastAPI Depends breaks all v2 endpoints for non-BigQuery instances'. Pre-fix: get_bq_access() raised BqAccessError(not_configured) when neither BIGQUERY_PROJECT env nor data_source.bigquery.project was set. Because FastAPI resolves Depends() BEFORE the endpoint body runs, this exception fires during dep-injection — the endpoint's try/except BqAccessError clause never gets a chance to catch it. Result: every v2 request on Keboola-only or CSV-only instances returned bare HTTP 500, even for local-source tables that never touch BigQuery. Fix: get_bq_access() now returns a sentinel BqAccess with empty BqProjects and factories that raise BqAccessError(not_configured) on actual use. Construction succeeds, FastAPI's dep-injection cleanly yields the sentinel, the endpoint runs. The local-source code path in build_sample / build_schema / etc. never calls bq.client() or bq.duckdb_session() (it reads parquet directly), so non-BQ tables return 200 as before. Only when an endpoint actually tries to query BQ (source_type == 'bigquery') does the sentinel raise — and the endpoint's existing except BqAccessError catches it normally, returning structured 502 with hint. Test get_bq_access::test_raises_not_configured_when_neither_set renamed and rewritten to test_returns_sentinel_when_neither_set: asserts BqAccess is returned, then asserts client() and duckdb_session() each raise BqAccessError(not_configured) on call. Test test_does_not_cache_exceptions removed (no longer applicable) and replaced with test_sentinel_is_cached_per_process documenting the operator-restart-on-config-change contract. * docs(spec+plan): #134 genericize customer-specific tokens (CLAUDE.md OSS rule) Devin BUG_0001/0002 round 3 on PR #138: spec and plan docs contained customer-specific deployment hostnames, deployment names, and a GCP project ID that violated CLAUDE.md's vendor-agnostic OSS rule ('Nothing customer-specific belongs in code, configuration defaults, comments, docs, commit messages, PR titles, or PR bodies'). Replacements: agnes-development.groupondev.com -> <your-agnes-host> agnes-development -> <your-dev-instance> prj-grp-dataview-prod-1ff9 -> <your-data-project> s1_session_landings -> <bq_table_id> E2E verification semantics unchanged — operators still run the same four curls + config flip + retry, just substituting their own host / deployment name / project / table. * fix(bq_access): hook get_bq_access.cache_clear into instance_config.reset_cache Devin ANALYSIS_0004 on PR #138: get_bq_access is @functools.cache'd at process level, so it captures BigQuery project IDs at first call and ignores subsequent instance.yaml changes. Pre-Phase-2 the v2 endpoints re-read get_value() on every request, so admin /api/admin/server-config saves (which call instance_config.reset_cache()) hot-reloaded the BQ project. Without this fix, my refactor silently regresses that contract — operators editing instance.yaml via the admin UI would see no effect on v2 endpoints until container restart. instance_config.reset_cache() now also calls connectors.bigquery.access.get_bq_access.cache_clear() (lazy import, swallowed if connectors module isn't loaded — keeps instance_config usable in isolated unit tests). Adds test_instance_config_reset_cache_invalidates_get_bq_access as regression guard. Updates CHANGELOG Internal entry to mention the hot-reload contract + the not-configured sentinel behavior (round-3 fix from Devin BUG_0001 was previously only in commit message). * fix(bq_access): surface not_configured before identifier validation + plan path genericize Devin BUG_0001 + BUG_0002 round 5 on PR #138. BUG_0001 (plan doc): personal filesystem path violated CLAUDE.md vendor-agnostic rule. Replaced with '<worktree-root>' placeholder. BUG_0002 (sentinel error path): when get_bq_access() returns the sentinel BqAccess (BQ not configured), the empty bq.projects.data was reaching validate_quoted_identifier first and raising ValueError -> endpoint mapped to HTTP 400 'unsafe_identifier' instead of structured 500 'not_configured' with hint. Each fetch helper now checks 'if not bq.projects.data: bq.client()' as the first step, which triggers the sentinel's BqAccessError(not_configured). Endpoint catches the typed error and returns HTTP 500 with hint pointing at data_source.bigquery.project. Best-effort _fetch_bq_table_options returns {} silently in this case (preserves the swallow-all contract). * fix(bq_access): classify DuckDB-native exceptions from bigquery_query() via string match Devin ANALYSIS on PR #138 review (latest round). The DuckDB bigquery extension is a C++ plugin making its own HTTP calls — when BQ returns 403, it throws duckdb.IOException with the BQ error embedded as text, not gax.Forbidden. translate_bq_error's isinstance checks would miss these, falling to case 7 → bare 500 in production for v2_scan, v2_sample, and v2_schema (the bigquery_query() paths). Fix: last-resort string-match heuristic before the re-raise. 'Forbidden' / '403' / 'Bad Request' / '400' in the lowercased message classifies via the same kind hierarchy. The 'serviceusage' substring still distinguishes cross_project_forbidden from bq_forbidden. Specific enough that random exceptions without HTTP-error keywords still re-raise. Adds 4 unit tests covering the new heuristic + the 'don't swallow random exceptions' invariant. * chore(release): cut 0.22.0 PR #138 contains issue #134 user-visible behavior changes: - BREAKING: BIGQUERY_PROJECT env var now overrides instance.yaml data_source.bigquery.project for v2 endpoints (previously RemoteQueryEngine billing only). - Fixed: structured 502/400 on /api/v2/sample, /scan, /scan/estimate, /schema when BigQuery raises Forbidden/BadRequest (was bare 500). - Internal: BqAccess facade refactor unifying four duplicate BQ-access call sites; instance_config.reset_cache() now invalidates BqAccess cache too so admin server-config saves hot-reload BQ project IDs. Bumps to 0.22.0 because PR #137 merged first and took 0.21.0.	2026-04-30 10:11:20 +02:00

9 commits