Adds Pass 3 to `_bq_guardrail_inputs` that scans user SQL for full
backtick paths `<project>.<dataset>.<table>` and gates them
identically to the `bq."<dataset>"."<table>"` pass:
- Project must match the configured BigQuery data project
(`get_bq_access().projects.data`). Mismatch → HTTP 403
`bq_path_cross_project`.
- Path must point at a registered row. Unregistered → HTTP 403
`bq_path_not_registered`.
- Non-admin caller must hold a grant on the registered row's id.
Missing grant → HTTP 403 `bq_path_access_denied`.
Pre-fix, full backtick paths bypassed Agnes RBAC entirely — only the
service account scope limited reach. Post-fix the boundary matches
what `agnes catalog`-driven flows already enforce. Admin still
bypasses the per-id grant check but cannot bypass registration or
project match.
Pass 3 also seeds `dry_run_set` for resolved registered paths so the
cost-cap dry-run runs against the same physical table the user named
— composing cleanly with the Layer 2 fail-fast fallback.
When BQ rejects the rewritten dry-run SQL with `bq_bad_request`, the
cap-guard now retries with the user's ORIGINAL SQL instead of building
a synthetic `SELECT * FROM <table>` per registered table. The
synthetic path threw away user filters / projections / partition
predicates and routinely ballooned the estimate to "full table size",
falsely tripping `remote_scan_too_large` on legitimate narrow queries
(typical issue #201 trace: rewriter corrupts a backtick path → BQ
parse error → synthetic over-estimate → 400).
Behaviour:
- Rewritten SQL succeeds: same as before (issue #171 single-dry-run).
- Rewritten SQL parse-errors, original SQL succeeds: use original
estimate. Common case for users submitting BQ-native input.
- Both fail with `bq_bad_request`: HTTP 400 `remote_estimate_failed`
with a hint pointing at `agnes catalog` / BQ-native syntax. No
silent over-estimate.
- Non-parse BQ error (forbidden, upstream): still 502 as before.
This is a behaviour change for clients matching error kinds — failure
to estimate scan size now surfaces as `remote_estimate_failed`
instead of being masked behind `remote_scan_too_large` from the
synthetic path.
Replaces the existing `test_guardrail_falls_back_to_per_table_estimate_on_bq_parse_error`
(which pinned the old contract) with `test_fallback_tries_original_sql_first`
and `test_fallback_fails_fast_on_pure_duckdb_syntax`.
`agnes query --remote` corrupted user SQL when the request contained a
full BigQuery backtick path (`<project>.<dataset>.<table>`) whose
table segment matched a registered bare-name alias. The bare-name
rewriter used `\b` word-boundary matching against the lower-cased SQL;
both `.` and `` ` `` are non-word characters, so the regex fired
INSIDE the user's backtick path and produced malformed nested-backtick
SQL that BigQuery rejected at parse time.
Fix:
- Add `_mask_backticks(sql)` helper: replace each `…` segment with
spaces of equal length, preserving offsets so word-boundary
searches find positions only outside backticks.
- `_bq_guardrail_inputs` (bare-name pass + forbidden-table pass)
searches against the masked SQL.
- `_rewrite_bq_table_refs_to_native` Pass 1 splits the SQL on
`(\`[^\`]*\`)` and rewrites only the outside-backtick chunks. Pass
2 (`bq."ds"."tbl"` → backtick form) is unchanged — its prefix can't
appear inside backticks.
Adds three regressions covering the rewrite + guardrail paths.
The iterative bare-name rewriter (one re.sub per name, longest-first)
was vulnerable to cross-contamination when the GCP project ID contained
a registered table name as a hyphen-delimited word.
Concrete repro:
project = 'my-ue-project'
registered = ['orders', 'ue']
user SQL = 'SELECT * FROM orders JOIN ue ON ...'
iter 1 (orders): produces 'FROM `my-ue-project.fin.orders` JOIN ue ...'
iter 2 (ue): '\bue\b' matches 'ue' INSIDE 'my-ue-project' (hyphen
creates word boundary on both sides) — corrupts
the iter-1 path
Fallback at query.py:576 caught the resulting BQ parse error and fell
back to per-table SELECT * estimate, so impact was over-estimation,
not fail-open — but the #171 partition-pruning fix silently degraded
to pre-fix behavior whenever a project name shared a hyphen-segment
with a registered table.
Fix: single re.sub call with an alternation regex sorted longest-first.
Single-pass means each source position is processed exactly once, so
freshly-inserted backticked text from one match isn't re-scanned by
later names in the alternation.
Regression test
test_rewrite_helper_does_not_corrupt_when_project_id_contains_registered_name
covers the exact Devin repro.
Closes#171. The /api/query cost guardrail used to dry-run a synthetic
`SELECT * FROM <table>` for each registered remote-BQ row referenced
by the user SQL — which made BigQuery estimate a full table scan, with
column projection, predicate pushdown, and partition pruning all
disabled. Narrow queries on big partitioned/clustered tables (the
documented happy path for `agnes query --remote`) hit ~30,000×
over-estimates and got rejected with 400 `remote_scan_too_large` even
when BQ's own dry-run reported single-digit MB.
Pavel's report on #171 traced the root cause and proposed the fix:
rewrite the user SQL to BQ-native syntax and dry-run it as a single
job, exactly the way `bq query --dry_run` works.
Implementation:
- New helper _rewrite_user_sql_for_bq_dry_run rewrites bare registered
names (word-boundary, case-insensitive, longest-first to avoid prefix
collisions) + bq."<ds>"."<tbl>" forms to backticked
`<project>.<ds>.<tbl>` paths.
- _bq_quota_and_cap_guard runs ONE dry-run on the rewritten SQL. Cap
check uses the real estimate.
- Fallback path: if BQ rejects with bq_bad_request (e.g. DuckDB-only
syntax like ::INT casts), the guard falls back to the pre-fix
per-table SELECT * approach so non-portable queries still get a
(loose) cap estimate instead of fail-opening. Non-parse BQ errors
(forbidden, upstream) still propagate as 502.
- _bq_guardrail_inputs now also returns name_lookups so the rewriter
has the (registered_name, bucket, source_table) mapping it needs.
- Per-table breakdown is unavailable from a composite dry-run; total
bytes are pinned to dry_run_set[0] for the post-flight
record_bytes(sum(...)) call to keep returning the right total.
Tests (7 new, 3 existing still pass):
- dry-run receives rewritten user SQL with WHERE clause intact (the
load-bearing assertion for #171)
- single dry-run per request even with multiple registered tables
(JOIN, UNION) referenced
- fallback to per-table SELECT * on bq_bad_request
- non-parse BQ errors (forbidden) still 502
- rewriter unit tests: bare + bq.path in same SQL, longest-name-wins
on prefix collision, case-insensitive bare-name match
Caught by my own broader test scope after Devin fixes — three test files
asserted on user-visible strings that were renamed by the bootstrap PR
but the assertions weren't updated:
- tests/test_api_query_guardrail.py:110 — asserted `da fetch in suggestion`
on /api/query 400 response. Renamed to `agnes snapshot create`.
- tests/test_query_materialized_error_message.py:56 — asserted `da sync`
in materialized-not-yet error detail. Renamed to `agnes pull`.
- tests/test_cli_error_render.py:71 — fixture data + assertion both
carried `da fetch`. Updated to `agnes snapshot create`.
Plus an actual content miss: docs/setup/claude_settings.json (a template
shipped to operators) still installed `da sync` / `da sync --upload-only`
hooks. The companion test file (tests/test_setup_hooks_template.py) was
asserting that legacy state. Updated both:
- Template hooks: `agnes pull --quiet` / `agnes push --quiet`
- Test assertions + function name match the new commands
The headline implementation for issue #160. POST /api/query now gates
direct `bq."<dataset>"."<source_table>"` references behind the registry
and bounds the BQ scan cost behind a configurable cap. Wired through
the same singleton QuotaTracker as /api/v2/scan so daily-byte budgets
are shared across both BQ-touching paths.
Changes in app/api/query.py:
- Add module-level `BQ_PATH` regex matching the 16 syntax variants
verified empirically (fully-quoted, unquoted, mixed quoting,
case-insensitive, inside CTE bodies, multi-path, …).
- Add `bigquery_query` to the SQL keyword blocklist. Closes the
pre-existing function-call backdoor where a user could run an
arbitrary BQ jobs API call against any reachable dataset, bypassing
the registry and RBAC. Wrap views internal to the BQ extractor still
use bigquery_query() — but those run via DuckDB view resolution at
query time, not via user-submitted SQL, so the blocklist doesn't
break them.
- Add `_bq_guardrail_inputs` helper: walks user SQL twice — once for
bare-name matches against accessible registered remote-BQ names
(contributes to dry_run_set), once for direct `bq.X.Y` matches
(gated against `find_by_bq_path` lookups, returns 403 with
structured detail on miss or grant violation).
- Add `_enforce_remote_bq_quota_and_cap` helper: pre-flight
`check_daily_budget` (over-cap → 429), then `with quota.acquire(...)`
wraps a per-path BQ dry-run, sums bytes, raises 400
`remote_scan_too_large` when total > cap.
- Cap default 5 GiB; configurable via `api.query.bq_max_scan_bytes`
in /admin/server-config (next phase wires the UI).
- Post-flight `record_bytes` against the user's daily counter.
- Module-level imports of `_bq_dry_run_bytes`, `_build_quota_tracker`,
`get_bq_access` so tests can monkeypatch via `app.api.query.<name>`.
Tests:
- All 23 RED tests from the previous commit now pass (regex matrix,
blocklist with detail-string assertion, RBAC unregistered/admin-bypass,
guardrail dry-run-called/over-cap-rejected, quota pre-flight 429).
- mock_dry_run fixture stubs both `_bq_dry_run_bytes` and `get_bq_access`
so guardrail tests don't require a live BQ project.
- Quota test uses `admin1` (the seeded_app fixture's actual user id, not
`admin`).
Smoke: 887 passed across query/bq/admin/extractor/registry/quota
domains. No regressions.
5 new test files for the upcoming /api/query pre-flight block (next
commit). All failing for the right reason on the current codebase:
tests/test_query_bq_regex.py (8 + 1 + 7 + 1 = 17 cases)
Pure unit test of `BQ_PATH` regex constant (not yet imported from
app.api.query). Verifies the 16-case matrix from spec §4.3.1:
positive matches for fully-quoted / unquoted / mixed quoting / case
variants / inside CTE bodies / multiple paths in one statement;
negative for bare registered names / 2-part bq.col / prefix that
contains bq / middle-component bq / quoted bare names; documented
string-literal false-positive accepted.
tests/test_query_bigquery_query_blocked.py (3 cases)
POST /api/query with bigquery_query() function call must hit the
canonical blocklist rejection ("Only single SELECT queries are
allowed"). Today the blocklist passes all 3 — confirmed RED via
detail-string assertion.
tests/test_api_query_rbac_bq_path.py (4 cases)
Direct bq."<ds>"."<tbl>" references must be registry-gated:
unregistered → 403 bq_path_not_registered; registered + admin →
bypass per-name grant; case-insensitive lookup; string-literal
containing bq.X.Y → 403 (strict-deny).
tests/test_api_query_guardrail.py (3 cases)
Cost guardrail: SQL referencing a registered remote BQ row invokes
_bq_dry_run_bytes (verified via call-counter side effect); over-cap
dry-run returns 400 remote_scan_too_large with bytes/tables/suggestion
in detail; non-BQ queries skip the dry-run entirely.
tests/test_api_query_quota.py (3 cases)
Daily-byte quota check_daily_budget pre-flight (over-cap → 429
before dry-run); record_bytes post-flight on the shared singleton
v2_quota tracker; non-BQ queries leave the counter alone.
RED breakdown: 16 ImportError (BQ_PATH not yet defined) + 7 assertion
failures = 23 fully-RED. 6 tests pass for regression-green reasons
(use `if r.status_code == 403:` patterns where current code returns
400 for unrelated reasons). They serve as anti-regression guards once
the implementation lands and remain green throughout — documented per
spec §6 Phase 1 RED-discipline notes.