From b2d54126dc5cdfc8f97af81388fd80f34f4962ea Mon Sep 17 00:00:00 2001 From: ZdenekSrotyr Date: Sun, 3 May 2026 19:48:44 +0200 Subject: [PATCH] docs(query): #160 align CLAUDE.md/skills/CHANGELOG with new --remote behavior + cost guardrail MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Fixes the rails docs that PR #154 over-promised. The reporter (#160) tried `da query --remote` against a VIEW row and saw a catalog error; the previous version of the docs said this would work as a one-shot server-side execution. Now it actually does (see prior commits), but the docs also need to acknowledge the new cost guardrail and the registry-gated direct-bq path. Touched files: - **CLAUDE.md** (root, "Querying Agnes data — agent rails"): the `da query --remote` bullet under "Choose the right tool" now spells out the BASE TABLE vs VIEW/MATERIALIZED_VIEW pushdown asymmetry + the 5 GiB scan cap + the registry-gating of direct bq.* paths. "When NOT to use `da fetch`" decision matrix updated with a separate row for VIEW aggregates so analysts see why the cap might trip. - **config/claude_md_template.txt** (PR #154's analyst CLAUDE.md): three-patterns table caveat for the cost guardrail. - **cli/skills/agnes-data-querying.md**: `When NOT to use da fetch` matrix updated with the same VIEW caveat + registry-gating note. - **cli/skills/agnes-table-registration.md:121**: replaced the example that suggested raw `bq.".."` syntax (now blocked by the RBAC patch) with the registered-name form. - **CHANGELOG.md**: full Unreleased entry with Added (Test Connection endpoint + cost-cap server-config knob + placeholder UI), Fixed (the five #160-class fixes: VIEW resolution, RBAC patch, blocklist, bigquery_query() blocking, CLI render, hybrid endpoint detail flattening), Changed (BREAKING legacy_wrap_views removal + quota relocation). 140 tests pass across the issue-affected files. --- CHANGELOG.md | 74 ++++++++++++++++++++++++++ CLAUDE.md | 21 ++++++-- cli/skills/agnes-data-querying.md | 5 +- cli/skills/agnes-table-registration.md | 2 +- 4 files changed, 95 insertions(+), 7 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 65db123..c89d3f9 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -10,6 +10,80 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C ## [Unreleased] +### Added +- **`/admin/server-config` BQ test connection**: admin-only `POST + /api/admin/bigquery/test-connection` runs a 10s-timeout `SELECT 1` + against BigQuery via the saved `data_source.bigquery` config and + returns typed structured feedback (`200 ok` / `400 not_configured` / + `502 cross_project_forbidden` / `504 timeout`). The + /admin/server-config UI gets a "Test BigQuery connection" button next + to the data_source Save button; on failure the inline result uses the + same structured shape as the CLI renderer so operators see the same + hint format admins do. +- **`api.query.bq_max_scan_bytes` server-config knob** (default 5 GiB): + caps the BigQuery scan that `da query --remote` will issue against + `query_mode='remote'` BQ rows. Exceeded queries are rejected with a + structured `400 remote_scan_too_large` detail naming the bytes, + tables, and a `da fetch` suggestion. Quota usage is recorded against + the same daily byte cap as `/api/v2/scan`. +- **`data_source.bigquery.billing_project` placeholder UI**: the admin + form now shows `(defaults to )` greyed under an empty + billing_project input, surfacing the access.py:339-340 fallback rule + directly in the UI. + +### Fixed +- **`da query --remote` against `query_mode='remote'` BigQuery rows + whose underlying entity is a `VIEW` or `MATERIALIZED_VIEW`** now + resolves correctly (issue #160). The BQ extractor creates a master + view via the catalog path (`bq."".""`) for + `BASE TABLE` (Storage Read API; predicate pushdown) and via + `bigquery_query()` for `VIEW`/`MATERIALIZED_VIEW` (jobs API). Other + BQ entity types (`EXTERNAL`, `SNAPSHOT`, `CLONE`) are logged + skipped + at extraction with no `_meta` row, so the orchestrator doesn't strand + a registered name with a non-existent inner view. +- **Direct `bq."".""` references in `/api/query` + are now registry-gated**: unregistered paths return 403 + `bq_path_not_registered`; registered paths are subject to the same + per-name grant check as registered names. Closes a pre-existing RBAC + bypass where direct catalog-path syntax skipped the master-view + forbidden-table check entirely. Quoted catalog tokens + (`"bq"."ds"."tbl"`) are caught by the same regex. +- **`bigquery_query()` direct calls in user SQL are now blocked** by the + `/api/query` keyword blocklist. Closes a pre-existing function-call + bypass that ran arbitrary BQ jobs API calls against any reachable + dataset, ignoring the registry. Wrap views internal to the BQ + extractor still use `bigquery_query()` inside their `CREATE VIEW` + body — those run via DuckDB's view resolution at query time, never + via user-submitted SQL, so the blocklist doesn't break them. +- **CLI commands (`da query --remote`, `da query --register-bq`, + `da fetch`, `da schema`, etc.) pretty-print structured BigQuery + errors** — `cross_project_forbidden`, `bq_forbidden`, `auth_failed`, + `not_configured`, `remote_scan_too_large`, `bq_path_not_registered`, + etc. — instead of dumping the truncated JSON body. The hint that + explains how to fix `USER_PROJECT_DENIED` (set + `data_source.bigquery.billing_project` in /admin/server-config) is + now actually visible to the operator. +- **`/api/query/hybrid` now returns dict `detail`** for typed errors + (was flattening to `f"BQ '{alias}': {error_type}: {message}"`), so + the new CLI renderer surfaces the structured shape consistently + across both endpoints. + +### Changed +- **BREAKING (config-only): `data_source.bigquery.legacy_wrap_views` + removed**. The flag was opt-in for the wrap-view behavior that is now + the default. Keys still present in operator overlays are silently + ignored — no action required. Operators who previously set + `legacy_wrap_views: false` (the prior default) get the new behavior + for VIEW / MATERIALIZED_VIEW rows: a master view is created (via the + BQ jobs API), and `da query --remote` works against the registered + name. The cost concern that motivated the prior default is now + addressed by the server-side guardrail (see Added). +- **Quota tracker relocated**: `_build_quota_tracker` and + `_quota_singleton` moved from `app/api/v2_scan.py` to + `app/api/v2_quota.py` (their natural home). `v2_scan.py` re-exports + the function for backwards compat; existing test sites that call + `v2_scan._build_quota_tracker()` keep working. + ## [0.31.0] — 2026-05-04 ### Added diff --git a/CLAUDE.md b/CLAUDE.md index f16c162..fe9a0e0 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -203,7 +203,13 @@ Tables in `da catalog` have a `query_mode`: - **`remote`** (typically BigQuery): the parquet does NOT exist on the laptop. You MUST either: 1. **`da fetch`** a filtered subset → query the local snapshot, OR - 2. **`da query --remote`** for one-shot server-side execution, OR + 2. **`da query --remote`** for one-shot server-side execution. Works on + all `query_mode='remote'` rows regardless of upstream BQ entity type + (BASE TABLE → Storage Read API with predicate pushdown; VIEW / + MATERIALIZED_VIEW → BQ jobs API, no pushdown). Cost-guarded by a + 5 GiB scan cap (configurable in /admin/server-config). Direct + `bq.""."
"` paths are registry-gated — unregistered + paths return 403 `bq_path_not_registered`. 3. **`da query --register-bq`** for hybrid joins (rarely needed). ### `da fetch` workflow (preferred for remote tables) @@ -257,10 +263,17 @@ in your `da query` calls — there's no `--where` on local since fetch is implic ### When NOT to use `da fetch` -- Single aggregate on remote table (`SELECT COUNT(*) FROM remote`): +- Single aggregate on remote BASE TABLE (`SELECT COUNT(*) FROM remote`): use `da query --remote "SELECT COUNT(*) FROM web_sessions_example"`. - No materialization needed; cheap. -- Throwaway exploration with raw BQ syntax: `da query --remote`. + Storage Read API pushes the COUNT into BQ — cheap, no materialization. +- Single aggregate on remote VIEW/MATERIALIZED_VIEW: same syntax works + (#160), but the BQ jobs API can't push WHERE/COUNT into the view body. + Cost guardrail (default 5 GiB) catches expensive scans → 400 + `remote_scan_too_large` with `da fetch` suggestion. Pivot to + `da fetch --where ''` if the cap is hit. +- Throwaway exploration: `da query --remote "SELECT … FROM "`. + Direct `bq.""."
"` paths are now registry-gated — register + first or use the catalog id. - Cross-table JOIN with both tables remote: combine `da fetch` for one side + `da query --remote` for the other; full cross-remote JOIN requires more thought (see #101 for design space). diff --git a/cli/skills/agnes-data-querying.md b/cli/skills/agnes-data-querying.md index 1acb931..f675fb3 100644 --- a/cli/skills/agnes-data-querying.md +++ b/cli/skills/agnes-data-querying.md @@ -101,8 +101,9 @@ For `source_type=keboola` / `source_type=jira` (local), use **DuckDB SQL** in yo | Scenario | Use instead | |----------|------------| -| Single aggregate on remote table (`SELECT COUNT(*)`) | `da query --remote "SELECT COUNT(*) FROM web_sessions_example"` — cheap, no fetch needed | -| Throwaway exploration with raw BQ syntax | `da query --remote` — one-shot, no snapshot | +| Single aggregate on remote BASE TABLE (`SELECT COUNT(*)`) | `da query --remote "SELECT COUNT(*) FROM web_sessions_example"` — cheap, no fetch needed (Storage Read API pushes the COUNT into BQ) | +| Single aggregate on remote VIEW/MATERIALIZED_VIEW | Same syntax works (#160) but the BQ jobs API can't push WHERE/COUNT into the view body. Cost guardrail (default 5 GiB) catches expensive scans → 400 `remote_scan_too_large` with `da fetch` suggestion. Pivot to `da fetch --where ''` if rejected. | +| Throwaway exploration with raw BQ syntax | `da query --remote "SELECT … FROM "` — direct `bq.""."
"` paths are now registry-gated (403 `bq_path_not_registered` if not registered). Register first or use the catalog id. | | Cross-table JOIN with both remote | Use `da fetch` for one side + `da query --remote` for the other; full cross-remote JOIN needs design (see #101) | ## When the table you need isn't in `da catalog` diff --git a/cli/skills/agnes-table-registration.md b/cli/skills/agnes-table-registration.md index 141f861..9118ca1 100644 --- a/cli/skills/agnes-table-registration.md +++ b/cli/skills/agnes-table-registration.md @@ -118,7 +118,7 @@ Returns `204 No Content` on success, `404` if the id doesn't exist. **The underl ## When NOT to register -- The user wants to inspect a table once, doesn't intend to share it: use `da query --remote "SELECT … FROM \`..
\`"` instead. +- The user wants to inspect a table once, doesn't intend to share it: register the row once with `query_mode='remote'` (admin-only, ~30s) and query it via `da query --remote "SELECT … FROM "`. Direct `bq.""."
"` syntax is now registry-gated — unregistered paths return 403 `bq_path_not_registered` (closes the pre-existing RBAC + cost-cap bypass). - The data lives in a third source not yet supported by a connector: implement the connector first (see `connectors.md` skill), then register. - The dataset already has a registered "parent" view that exposes the rows you want: register-table is for distinct catalog entities, not for slicing existing ones — slice with `da fetch --where`.