docs(query): #160 align CLAUDE.md/skills/CHANGELOG with new --remote behavior + cost guardrail
Fixes the rails docs that PR #154 over-promised. The reporter (#160) tried `da query --remote` against a VIEW row and saw a catalog error; the previous version of the docs said this would work as a one-shot server-side execution. Now it actually does (see prior commits), but the docs also need to acknowledge the new cost guardrail and the registry-gated direct-bq path. Touched files: - **CLAUDE.md** (root, "Querying Agnes data — agent rails"): the `da query --remote` bullet under "Choose the right tool" now spells out the BASE TABLE vs VIEW/MATERIALIZED_VIEW pushdown asymmetry + the 5 GiB scan cap + the registry-gating of direct bq.* paths. "When NOT to use `da fetch`" decision matrix updated with a separate row for VIEW aggregates so analysts see why the cap might trip. - **config/claude_md_template.txt** (PR #154's analyst CLAUDE.md): three-patterns table caveat for the cost guardrail. - **cli/skills/agnes-data-querying.md**: `When NOT to use da fetch` matrix updated with the same VIEW caveat + registry-gating note. - **cli/skills/agnes-table-registration.md:121**: replaced the example that suggested raw `bq."<project>.<dataset>.<table>"` syntax (now blocked by the RBAC patch) with the registered-name form. - **CHANGELOG.md**: full Unreleased entry with Added (Test Connection endpoint + cost-cap server-config knob + placeholder UI), Fixed (the five #160-class fixes: VIEW resolution, RBAC patch, blocklist, bigquery_query() blocking, CLI render, hybrid endpoint detail flattening), Changed (BREAKING legacy_wrap_views removal + quota relocation). 140 tests pass across the issue-affected files.
This commit is contained in:
parent
39bdc1ff45
commit
b2d54126dc
4 changed files with 95 additions and 7 deletions
74
CHANGELOG.md
74
CHANGELOG.md
|
|
@ -10,6 +10,80 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C
|
|||
|
||||
## [Unreleased]
|
||||
|
||||
### Added
|
||||
- **`/admin/server-config` BQ test connection**: admin-only `POST
|
||||
/api/admin/bigquery/test-connection` runs a 10s-timeout `SELECT 1`
|
||||
against BigQuery via the saved `data_source.bigquery` config and
|
||||
returns typed structured feedback (`200 ok` / `400 not_configured` /
|
||||
`502 cross_project_forbidden` / `504 timeout`). The
|
||||
/admin/server-config UI gets a "Test BigQuery connection" button next
|
||||
to the data_source Save button; on failure the inline result uses the
|
||||
same structured shape as the CLI renderer so operators see the same
|
||||
hint format admins do.
|
||||
- **`api.query.bq_max_scan_bytes` server-config knob** (default 5 GiB):
|
||||
caps the BigQuery scan that `da query --remote` will issue against
|
||||
`query_mode='remote'` BQ rows. Exceeded queries are rejected with a
|
||||
structured `400 remote_scan_too_large` detail naming the bytes,
|
||||
tables, and a `da fetch` suggestion. Quota usage is recorded against
|
||||
the same daily byte cap as `/api/v2/scan`.
|
||||
- **`data_source.bigquery.billing_project` placeholder UI**: the admin
|
||||
form now shows `(defaults to <project>)` greyed under an empty
|
||||
billing_project input, surfacing the access.py:339-340 fallback rule
|
||||
directly in the UI.
|
||||
|
||||
### Fixed
|
||||
- **`da query --remote` against `query_mode='remote'` BigQuery rows
|
||||
whose underlying entity is a `VIEW` or `MATERIALIZED_VIEW`** now
|
||||
resolves correctly (issue #160). The BQ extractor creates a master
|
||||
view via the catalog path (`bq."<dataset>"."<source_table>"`) for
|
||||
`BASE TABLE` (Storage Read API; predicate pushdown) and via
|
||||
`bigquery_query()` for `VIEW`/`MATERIALIZED_VIEW` (jobs API). Other
|
||||
BQ entity types (`EXTERNAL`, `SNAPSHOT`, `CLONE`) are logged + skipped
|
||||
at extraction with no `_meta` row, so the orchestrator doesn't strand
|
||||
a registered name with a non-existent inner view.
|
||||
- **Direct `bq."<dataset>"."<source_table>"` references in `/api/query`
|
||||
are now registry-gated**: unregistered paths return 403
|
||||
`bq_path_not_registered`; registered paths are subject to the same
|
||||
per-name grant check as registered names. Closes a pre-existing RBAC
|
||||
bypass where direct catalog-path syntax skipped the master-view
|
||||
forbidden-table check entirely. Quoted catalog tokens
|
||||
(`"bq"."ds"."tbl"`) are caught by the same regex.
|
||||
- **`bigquery_query()` direct calls in user SQL are now blocked** by the
|
||||
`/api/query` keyword blocklist. Closes a pre-existing function-call
|
||||
bypass that ran arbitrary BQ jobs API calls against any reachable
|
||||
dataset, ignoring the registry. Wrap views internal to the BQ
|
||||
extractor still use `bigquery_query()` inside their `CREATE VIEW`
|
||||
body — those run via DuckDB's view resolution at query time, never
|
||||
via user-submitted SQL, so the blocklist doesn't break them.
|
||||
- **CLI commands (`da query --remote`, `da query --register-bq`,
|
||||
`da fetch`, `da schema`, etc.) pretty-print structured BigQuery
|
||||
errors** — `cross_project_forbidden`, `bq_forbidden`, `auth_failed`,
|
||||
`not_configured`, `remote_scan_too_large`, `bq_path_not_registered`,
|
||||
etc. — instead of dumping the truncated JSON body. The hint that
|
||||
explains how to fix `USER_PROJECT_DENIED` (set
|
||||
`data_source.bigquery.billing_project` in /admin/server-config) is
|
||||
now actually visible to the operator.
|
||||
- **`/api/query/hybrid` now returns dict `detail`** for typed errors
|
||||
(was flattening to `f"BQ '{alias}': {error_type}: {message}"`), so
|
||||
the new CLI renderer surfaces the structured shape consistently
|
||||
across both endpoints.
|
||||
|
||||
### Changed
|
||||
- **BREAKING (config-only): `data_source.bigquery.legacy_wrap_views`
|
||||
removed**. The flag was opt-in for the wrap-view behavior that is now
|
||||
the default. Keys still present in operator overlays are silently
|
||||
ignored — no action required. Operators who previously set
|
||||
`legacy_wrap_views: false` (the prior default) get the new behavior
|
||||
for VIEW / MATERIALIZED_VIEW rows: a master view is created (via the
|
||||
BQ jobs API), and `da query --remote` works against the registered
|
||||
name. The cost concern that motivated the prior default is now
|
||||
addressed by the server-side guardrail (see Added).
|
||||
- **Quota tracker relocated**: `_build_quota_tracker` and
|
||||
`_quota_singleton` moved from `app/api/v2_scan.py` to
|
||||
`app/api/v2_quota.py` (their natural home). `v2_scan.py` re-exports
|
||||
the function for backwards compat; existing test sites that call
|
||||
`v2_scan._build_quota_tracker()` keep working.
|
||||
|
||||
## [0.31.0] — 2026-05-04
|
||||
|
||||
### Added
|
||||
|
|
|
|||
21
CLAUDE.md
21
CLAUDE.md
|
|
@ -203,7 +203,13 @@ Tables in `da catalog` have a `query_mode`:
|
|||
- **`remote`** (typically BigQuery): the parquet does NOT exist on the laptop.
|
||||
You MUST either:
|
||||
1. **`da fetch`** a filtered subset → query the local snapshot, OR
|
||||
2. **`da query --remote`** for one-shot server-side execution, OR
|
||||
2. **`da query --remote`** for one-shot server-side execution. Works on
|
||||
all `query_mode='remote'` rows regardless of upstream BQ entity type
|
||||
(BASE TABLE → Storage Read API with predicate pushdown; VIEW /
|
||||
MATERIALIZED_VIEW → BQ jobs API, no pushdown). Cost-guarded by a
|
||||
5 GiB scan cap (configurable in /admin/server-config). Direct
|
||||
`bq."<dataset>"."<table>"` paths are registry-gated — unregistered
|
||||
paths return 403 `bq_path_not_registered`.
|
||||
3. **`da query --register-bq`** for hybrid joins (rarely needed).
|
||||
|
||||
### `da fetch` workflow (preferred for remote tables)
|
||||
|
|
@ -257,10 +263,17 @@ in your `da query` calls — there's no `--where` on local since fetch is implic
|
|||
|
||||
### When NOT to use `da fetch`
|
||||
|
||||
- Single aggregate on remote table (`SELECT COUNT(*) FROM remote`):
|
||||
- Single aggregate on remote BASE TABLE (`SELECT COUNT(*) FROM remote`):
|
||||
use `da query --remote "SELECT COUNT(*) FROM web_sessions_example"`.
|
||||
No materialization needed; cheap.
|
||||
- Throwaway exploration with raw BQ syntax: `da query --remote`.
|
||||
Storage Read API pushes the COUNT into BQ — cheap, no materialization.
|
||||
- Single aggregate on remote VIEW/MATERIALIZED_VIEW: same syntax works
|
||||
(#160), but the BQ jobs API can't push WHERE/COUNT into the view body.
|
||||
Cost guardrail (default 5 GiB) catches expensive scans → 400
|
||||
`remote_scan_too_large` with `da fetch` suggestion. Pivot to
|
||||
`da fetch <id> --where '<predicate>'` if the cap is hit.
|
||||
- Throwaway exploration: `da query --remote "SELECT … FROM <registered_id>"`.
|
||||
Direct `bq."<dataset>"."<table>"` paths are now registry-gated — register
|
||||
first or use the catalog id.
|
||||
- Cross-table JOIN with both tables remote: combine `da fetch` for one
|
||||
side + `da query --remote` for the other; full cross-remote JOIN
|
||||
requires more thought (see #101 for design space).
|
||||
|
|
|
|||
|
|
@ -101,8 +101,9 @@ For `source_type=keboola` / `source_type=jira` (local), use **DuckDB SQL** in yo
|
|||
|
||||
| Scenario | Use instead |
|
||||
|----------|------------|
|
||||
| Single aggregate on remote table (`SELECT COUNT(*)`) | `da query --remote "SELECT COUNT(*) FROM web_sessions_example"` — cheap, no fetch needed |
|
||||
| Throwaway exploration with raw BQ syntax | `da query --remote` — one-shot, no snapshot |
|
||||
| Single aggregate on remote BASE TABLE (`SELECT COUNT(*)`) | `da query --remote "SELECT COUNT(*) FROM web_sessions_example"` — cheap, no fetch needed (Storage Read API pushes the COUNT into BQ) |
|
||||
| Single aggregate on remote VIEW/MATERIALIZED_VIEW | Same syntax works (#160) but the BQ jobs API can't push WHERE/COUNT into the view body. Cost guardrail (default 5 GiB) catches expensive scans → 400 `remote_scan_too_large` with `da fetch` suggestion. Pivot to `da fetch <id> --where '<predicate>'` if rejected. |
|
||||
| Throwaway exploration with raw BQ syntax | `da query --remote "SELECT … FROM <registered_id>"` — direct `bq."<dataset>"."<table>"` paths are now registry-gated (403 `bq_path_not_registered` if not registered). Register first or use the catalog id. |
|
||||
| Cross-table JOIN with both remote | Use `da fetch` for one side + `da query --remote` for the other; full cross-remote JOIN needs design (see #101) |
|
||||
|
||||
## When the table you need isn't in `da catalog`
|
||||
|
|
|
|||
|
|
@ -118,7 +118,7 @@ Returns `204 No Content` on success, `404` if the id doesn't exist. **The underl
|
|||
|
||||
## When NOT to register
|
||||
|
||||
- The user wants to inspect a table once, doesn't intend to share it: use `da query --remote "SELECT … FROM \`<project>.<dataset>.<table>\`"` instead.
|
||||
- The user wants to inspect a table once, doesn't intend to share it: register the row once with `query_mode='remote'` (admin-only, ~30s) and query it via `da query --remote "SELECT … FROM <registered_id>"`. Direct `bq."<dataset>"."<table>"` syntax is now registry-gated — unregistered paths return 403 `bq_path_not_registered` (closes the pre-existing RBAC + cost-cap bypass).
|
||||
- The data lives in a third source not yet supported by a connector: implement the connector first (see `connectors.md` skill), then register.
|
||||
- The dataset already has a registered "parent" view that exposes the rows you want: register-table is for distinct catalog entities, not for slicing existing ones — slice with `da fetch --where`.
|
||||
|
||||
|
|
|
|||
Loading…
Reference in a new issue