From 500db8cd3c4fe2abb76fbe8e27e8f4ff1f3ecfb9 Mon Sep 17 00:00:00 2001 From: ZdenekSrotyr Date: Mon, 4 May 2026 21:08:21 +0200 Subject: [PATCH] fix(query-guardrail): dry-run user SQL not synthetic SELECT * (#171) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Closes #171. The /api/query cost guardrail used to dry-run a synthetic `SELECT * FROM ` for each registered remote-BQ row referenced by the user SQL — which made BigQuery estimate a full table scan, with column projection, predicate pushdown, and partition pruning all disabled. Narrow queries on big partitioned/clustered tables (the documented happy path for `agnes query --remote`) hit ~30,000× over-estimates and got rejected with 400 `remote_scan_too_large` even when BQ's own dry-run reported single-digit MB. Pavel's report on #171 traced the root cause and proposed the fix: rewrite the user SQL to BQ-native syntax and dry-run it as a single job, exactly the way `bq query --dry_run` works. Implementation: - New helper _rewrite_user_sql_for_bq_dry_run rewrites bare registered names (word-boundary, case-insensitive, longest-first to avoid prefix collisions) + bq.""."" forms to backticked `..` paths. - _bq_quota_and_cap_guard runs ONE dry-run on the rewritten SQL. Cap check uses the real estimate. - Fallback path: if BQ rejects with bq_bad_request (e.g. DuckDB-only syntax like ::INT casts), the guard falls back to the pre-fix per-table SELECT * approach so non-portable queries still get a (loose) cap estimate instead of fail-opening. Non-parse BQ errors (forbidden, upstream) still propagate as 502. - _bq_guardrail_inputs now also returns name_lookups so the rewriter has the (registered_name, bucket, source_table) mapping it needs. - Per-table breakdown is unavailable from a composite dry-run; total bytes are pinned to dry_run_set[0] for the post-flight record_bytes(sum(...)) call to keep returning the right total. Tests (7 new, 3 existing still pass): - dry-run receives rewritten user SQL with WHERE clause intact (the load-bearing assertion for #171) - single dry-run per request even with multiple registered tables (JOIN, UNION) referenced - fallback to per-table SELECT * on bq_bad_request - non-parse BQ errors (forbidden) still 502 - rewriter unit tests: bare + bq.path in same SQL, longest-name-wins on prefix collision, case-insensitive bare-name match --- CHANGELOG.md | 1 + app/api/query.py | 180 +++++++++++++++++++--- tests/test_api_query_guardrail.py | 248 ++++++++++++++++++++++++++++++ 3 files changed, 410 insertions(+), 19 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index e4666cb..7d34942 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -38,6 +38,7 @@ End-to-end clean-analyst-bootstrap rewrite. The web `/setup?role=analyst` page n - `agnes snapshot create` (formerly `da fetch`) no longer materializes an empty `user/duckdb/analytics.duckdb` when run before any `agnes pull`. Friendly hint redirects to `agnes pull`. - Workspace `agnes status` reads from the canonical `server/parquet/` and `user/duckdb/analytics.duckdb` paths (was reading legacy `data/parquet/`, `data/metadata/last_sync.json`). - `agnes init` and `agnes pull` errors now use the `cli/error_render.py` typed-error renderer (added in 0.32.0), so analyst-facing error UX matches the structured shape `agnes query --remote` already produces. +- **`agnes query --remote` no longer over-rejects narrow queries on partitioned/clustered BigQuery tables.** Closes #171. Pre-fix the `/api/query` cost guardrail dry-ran a synthetic `SELECT * FROM
` per registered remote-BQ row referenced by the user SQL, which forced BQ to estimate "full table scan" — column projection, predicate pushdown, and partition pruning were all ignored, producing scan-byte estimates up to ~30,000× larger than the actual query would scan. Narrow queries on big partitioned tables (the documented happy-path use case) were rejected with 400 `remote_scan_too_large` even when BQ's own dry-run reported single-digit MB. Now the guardrail rewrites the user SQL from DuckDB-flavor (bare registered names + `bq."".""`) to BQ-native (`` `..` ``) and runs ONE dry-run on the EXACT user SQL — partition pruning, column projection, and predicate pushdown all engage. Cap check uses the real estimate. Fallback: if BQ rejects the rewritten SQL with `bq_bad_request` (DuckDB-only syntax that doesn't translate, e.g. `::INT` casts), the guardrail falls back to the pre-fix per-table SELECT * estimate so a non-portable query still gets bounded; non-parse errors (forbidden / upstream) propagate as 502. Helpers exported as `_rewrite_user_sql_for_bq_dry_run` (test seam). - **Windows: `agnes` CLI no longer crashes on cs-CZ / non-UTF-8 consoles.** Two failure modes addressed (originally reported in #172 against the pre-rename `da` CLI; ported and broadened here): (1) `agnes pull` and any other Rich-progress-bar codepath crashed with `UnicodeEncodeError` because cp1250 / cp1252 cannot encode Rich's Braille spinner glyphs — `cli/main.py` now reconfigures `sys.stdout` / `sys.stderr` to UTF-8 with `errors="replace"` at import time when `sys.platform == "win32"`. (2) `agnes skills list` and `agnes skills show` crashed with `UnicodeDecodeError` reading skill markdown that contains em-dashes / accents — every `Path.read_text()` / `Path.write_text()` / `open()` call site in `cli/` (including ones not touched by #172, since several files were renamed in the bootstrap rewrite) now passes `encoding="utf-8"` explicitly. Defensive: also covers JSON / YAML config files that were ASCII-only in practice but were one non-ASCII value away from the same failure mode. - `agnes snapshot create … --estimate` in a pre-init directory no longer leaks an httpx `ConnectError` traceback to stderr. The estimate-guard fix (3d587681) let `--estimate` reach `api_post_json`, but the existing `except V2ClientError` clause didn't catch transport-layer errors when no server was configured (defaulted to `http://localhost:8000`). Now also catches `httpx.HTTPError` and renders the friendly hint `Run \`agnes init …\` first`. - `agnes push` now reads Claude Code session jsonls from `~/.claude/projects//` (where Claude Code actually writes them), instead of `/user/sessions/` (which the SessionEnd hook never populated — the previous code uploaded an empty list every time). Encoding logic in `cli/lib/claude_sessions.py` probes both Claude Code variants — older `/`→`-` and newer all-non-alphanumeric→`-` — and unions the result, so users who have upgraded Claude Code mid-project see sessions from both encoded dirs. Falls back to `/user/sessions/` for back-compat. diff --git a/app/api/query.py b/app/api/query.py index af97653..891c066 100644 --- a/app/api/query.py +++ b/app/api/query.py @@ -150,7 +150,7 @@ async def execute_query( raise HTTPException(status_code=403, detail=f"Access denied to table '{table}'") # ---- #160 BQ remote-row guardrail + RBAC patch ------------------- - dry_run_set, blocked_bq_path = _bq_guardrail_inputs( + dry_run_set, name_lookups, blocked_bq_path = _bq_guardrail_inputs( request.sql, sql_lower, conn, user, allowed, ) if blocked_bq_path is not None: @@ -172,7 +172,10 @@ async def execute_query( user_id = user.get("email") or user.get("id") or "anon" guard = ( _bq_quota_and_cap_guard( - user_id=user_id, dry_run_set=dry_run_set, sql=request.sql, + user_id=user_id, + dry_run_set=dry_run_set, + name_lookups=name_lookups, + sql=request.sql, ) if dry_run_set else contextlib.nullcontext() @@ -314,11 +317,19 @@ def _bq_guardrail_inputs( ): """Two-pass scan over user SQL for the upcoming BQ guardrail + RBAC patch. - Returns a tuple `(dry_run_set, blocked_bq_path)`: + Returns a tuple `(dry_run_set, name_lookups, blocked_bq_path)`: - `dry_run_set` is a list of `(bucket, source_table, est_bytes)` triples identifying every BigQuery row the request will scan. The caller dry-runs - each and bills the sum against the user's daily quota. + the rewritten user SQL once and distributes the total here for quota + bookkeeping. + + - `name_lookups` is a list of `(registered_name, bucket, source_table)` + triples — only the bare-name matches from pass 1, NOT the direct + `bq."".""` matches. Issue #171 fix: the cap-guard rewrites + these name → ``\\`..\\``` when building + the BQ-native SQL for dry-run, so partition pruning + column projection + + predicate pushdown all engage. - `blocked_bq_path` is a structured-detail dict for the caller to raise HTTPException(403) with, when user SQL contains a direct @@ -342,6 +353,7 @@ def _bq_guardrail_inputs( # what shows in `agnes catalog`), so the regex match below uses `name`, # but the access gate uses `id`. dry_run: list = [] + name_lookups: list = [] seen_paths: set = set() accessible_set = set(allowed) if allowed is not None else None for r in repo.list_by_source("bigquery"): @@ -363,6 +375,11 @@ def _bq_guardrail_inputs( if key not in seen_paths: seen_paths.add(key) dry_run.append((bucket, source_table, 0)) # bytes filled at dry-run + # Record the (name, bucket, source_table) mapping separately so the + # cap-guard's SQL rewriter can find every occurrence — even if the + # user references the same physical table under two registered + # names (rare but possible: aliased catalog rows). + name_lookups.append((str(name), bucket, source_table)) # 2. Direct bq.. pass: every match must point at a registered # row. Run BEFORE adding to dry_run so unregistered paths fail-fast. @@ -372,7 +389,7 @@ def _bq_guardrail_inputs( source_table_raw = m.group(2).strip('"') row = repo.find_by_bq_path(bucket_raw, source_table_raw) if row is None: - return [], { + return [], [], { "reason": "bq_path_not_registered", "path": f'bq."{bucket_raw}"."{source_table_raw}"', "hint": ( @@ -387,7 +404,7 @@ def _bq_guardrail_inputs( # Devin Review iter #3. if not is_admin: if accessible_set is None or row["id"] not in accessible_set: - return [], { + return [], [], { "reason": "bq_path_access_denied", "path": f'bq."{bucket_raw}"."{source_table_raw}"', "registered_as": row["name"], @@ -401,11 +418,69 @@ def _bq_guardrail_inputs( seen_paths.add(key) dry_run.append((bucket, source_table, 0)) - return dry_run, None + return dry_run, name_lookups, None + + +def _rewrite_user_sql_for_bq_dry_run( + sql: str, name_lookups: list, project: str, +) -> str: + """Rewrite user SQL from DuckDB-flavor to BQ-native so a single + `_bq_dry_run_bytes` call can estimate scan size for the EXACT query + the user submitted (issue #171). + + Two transformations: + + 1. Each registered remote-BQ name (word-boundary, case-insensitive) + → ``\\`..\\````. Sorted longest-first + so a longer name (`unit_economics_summary`) is rewritten before a + shorter one that's a prefix (`unit_economics`) and we don't end up + with a partially-rewritten identifier. + + 2. ``bq."".""`` (and the unquoted variant) → ``\\`..\\````. + + The rewrite is regex-only (no SQL parser): a registered name appearing + inside a string literal (e.g. an `IN (...)` value or a `LIKE` pattern) + will also be rewritten. This is acceptable because (a) it's vanishingly + rare to have a string literal exactly matching a registered table name, + and (b) when it does happen the dry-run errors out and the caller falls + back to the per-table SELECT * estimate (current behavior, no regression). + + CTE shadowing: a `WITH unit_economics AS (...)` followed by `FROM + unit_economics` would also rewrite the `FROM` reference. BQ then treats + the CTE as unreferenced (legal) and the dry-run estimates the rewritten + physical table — likely an over-estimate. Same fallback path covers this. + """ + out = sql + # Pass 1: bare-name rewrite, longest names first. + for name, bucket, source_table in sorted( + name_lookups, key=lambda t: -len(t[0]) + ): + target = f"`{project}.{bucket}.{source_table}`" + out = re.sub( + r"\b" + re.escape(name) + r"\b", + target, + out, + flags=re.IGNORECASE, + ) + + # Pass 2: bq."ds"."tbl" / bq.ds.tbl → `..`. + def _bq_path_repl(m: re.Match) -> str: + ds = m.group(1).strip('"') + tbl = m.group(2).strip('"') + return f"`{project}.{ds}.{tbl}`" + + out = BQ_PATH.sub(_bq_path_repl, out) + return out @contextlib.contextmanager -def _bq_quota_and_cap_guard(*, user_id: str, dry_run_set: list, sql: str): +def _bq_quota_and_cap_guard( + *, + user_id: str, + dry_run_set: list, + name_lookups: list, + sql: str, +): """Pre-flight check + dry-run + cap enforcement for /api/query BQ paths. Context-manager shape (Devin Review #5 on PR #168). Earlier implementation @@ -419,16 +494,36 @@ def _bq_quota_and_cap_guard(*, user_id: str, dry_run_set: list, sql: str): The caller's `with` block holds the slot through both dry-run AND the subsequent `analytics.execute(...)` until the body exits. + Issue #171 fix: dry-run runs ONCE on the user's actual SQL (translated + to BQ-native via `_rewrite_user_sql_for_bq_dry_run`). Pre-fix the + pre-check did N dry-runs of synthetic ``SELECT * FROM
`` per + referenced table — which ignored WHERE filters, column projection, and + partition pruning, over-estimating scan size up to ~30,000× on + partitioned/clustered tables and rejecting narrow queries that BQ + itself would dry-run as a few MB. + + Fallback: if BQ rejects the rewritten SQL with a parse-level + ``client_error`` (e.g. DuckDB-only syntax like ``::INT`` casts that + don't translate to BQ), fall back to the pre-#171 per-table + SELECT * approach so the cap-guard still functions — over-estimate + is preferred over fail-open. Forbidden / upstream errors still + propagate as HTTP 502. + + Flow: 1. `check_daily_budget` — over-cap users get 429 BEFORE any BQ work. 2. `quota.acquire(user_id)` opened — concurrent-slot held throughout. - 3. Dry-run each `(bucket, source_table)` via `_bq_dry_run_bytes`. - 4. If sum > cap → 400 `remote_scan_too_large`. + 3. Single dry-run of rewritten user SQL → `total_bytes`. + On parse error, fall back to per-table SELECT * → sum. + 4. If total > cap → 400 `remote_scan_too_large`. 5. Yield. Caller runs `analytics.execute(...)` + `record_bytes(...)`. 6. On exit, slot released. Mutates `dry_run_set` in place: the third tuple element (bytes) is - populated with the per-path dry-run result so the caller can sum and - record the bytes against the user's quota post-flight. + populated so the caller can sum and record bytes against the user's + quota post-flight. Single-dry-run path puts `total_bytes` on the first + entry and zero on the rest (BQ doesn't expose per-table bytes for a + composite query); the caller's `sum(b for _, _, b in dry_run_set)` + still equals `total_bytes`. """ quota = _build_quota_tracker() try: @@ -464,19 +559,66 @@ def _bq_quota_and_cap_guard(*, user_id: str, dry_run_set: list, sql: str): # Devin Review #2 on PR #168. try: with quota.acquire(user_id): + project = bq.projects.data + rewritten_sql = _rewrite_user_sql_for_bq_dry_run( + sql, name_lookups, project, + ) + + # Try the single-dry-run path first (issue #171). Falls back + # to the per-table SELECT * approach only on BQ parse errors + # (kind="bq_bad_request" — DuckDB-only syntax that BQ can't + # translate). All other BQ errors propagate as 502 below. total_bytes = 0 - for i, (bucket, source_table, _) in enumerate(dry_run_set): - bq_sql = f"SELECT * FROM `{bq.projects.data}.{bucket}.{source_table}`" - try: - est = _bq_dry_run_bytes(bq, bq_sql) - except BqAccessError as exc: + used_fallback = False + try: + total_bytes = _bq_dry_run_bytes(bq, rewritten_sql) + except BqAccessError as exc: + if exc.kind == "bq_bad_request": + logger.warning( + "BQ dry-run rejected the rewritten SQL " + "(kind=%s, message=%s). Falling back to per-table " + "SELECT * estimate; the cap check will over-estimate " + "scan bytes for this query. Consider rewriting to " + "BQ-native syntax for a tight pre-check.", + exc.kind, exc.message, + ) + used_fallback = True + else: raise HTTPException(status_code=502, detail={ "kind": exc.kind, "message": exc.message, **(exc.details or {}), }) - dry_run_set[i] = (bucket, source_table, est) - total_bytes += est + + if used_fallback: + # Pre-#171 path: estimate per registered table from a + # synthetic SELECT *. Over-estimates partitioned scans but + # never under-estimates, so the cap still bounds risk. + for i, (bucket, source_table, _) in enumerate(dry_run_set): + fallback_sql = ( + f"SELECT * FROM `{project}.{bucket}.{source_table}`" + ) + try: + est = _bq_dry_run_bytes(bq, fallback_sql) + except BqAccessError as exc: + raise HTTPException(status_code=502, detail={ + "kind": exc.kind, + "message": exc.message, + **(exc.details or {}), + }) + dry_run_set[i] = (bucket, source_table, est) + total_bytes += est + else: + # Single-dry-run path. Distribute the total to dry_run_set + # so the caller's `record_bytes(sum(...))` stays correct. + # Per-table breakdown is unavailable from a composite + # dry-run; pin total to entry 0, zero the rest. + if dry_run_set: + b0, t0, _ = dry_run_set[0] + dry_run_set[0] = (b0, t0, total_bytes) + for i in range(1, len(dry_run_set)): + bi, ti, _ = dry_run_set[i] + dry_run_set[i] = (bi, ti, 0) if cap_bytes > 0 and total_bytes > cap_bytes: tables = [f"{b}.{t}" for b, t, _ in dry_run_set] diff --git a/tests/test_api_query_guardrail.py b/tests/test_api_query_guardrail.py index 078bf48..bc6acc0 100644 --- a/tests/test_api_query_guardrail.py +++ b/tests/test_api_query_guardrail.py @@ -136,3 +136,251 @@ def test_no_bq_row_reference_skips_dry_run(seeded_app, monkeypatch): ) assert state["calls"] == 0, \ f"guardrail must skip dry-run on non-BQ queries; got {state['calls']} calls" + + +# --------------------------------------------------------------------------- +# Issue #171: pre-check used to dry-run synthetic SELECT * per registered +# table → 30,000× over-estimate on partitioned/clustered tables. Fix: rewrite +# user SQL from DuckDB-flavor (bare names + `bq..`) to BQ-native +# (\\`..\\`) and run a SINGLE dry-run on the user's actual +# SQL, so partition pruning, column projection, and predicate pushdown all +# count toward the cap check. +# --------------------------------------------------------------------------- + + +def test_guardrail_dry_runs_rewritten_user_sql_not_synthetic_select_star( + seeded_app, mock_dry_run, monkeypatch, +): + """The dry-run must receive the USER's SQL with bare table names rewritten + to backticked paths — not a synthetic ``SELECT * FROM
``. + + This is the load-bearing assertion for issue #171: if the pre-check sees + only the table name it can't prune partitions or project columns, and + the estimate balloons to "full table size" instead of "what BQ would + actually scan." + """ + _register_bq_remote_row("ue", "finance", "ue") + captured = {"sql": None} + + def capturing_fake(_bq, sql): + captured["sql"] = sql + return 1024 # tiny — pass the cap + + monkeypatch.setattr( + "app.api.query._bq_dry_run_bytes", capturing_fake, raising=False, + ) + + c = seeded_app["client"] + token = seeded_app["admin_token"] + user_sql = ( + "SELECT order_id FROM ue " + "WHERE event_date = DATE '2026-04-30' AND country = 'CZ'" + ) + c.post("/api/query", json={"sql": user_sql}, headers=_auth(token)) + + sent = captured["sql"] + assert sent is not None, "dry-run never invoked" + # User-side filters must survive the rewrite — that's the whole point of + # the fix, partition pruning + predicate pushdown only engage in the BQ + # planner if the WHERE clause reaches it. + assert "event_date" in sent, f"WHERE clause stripped from dry-run SQL: {sent!r}" + assert "country" in sent, f"WHERE clause stripped from dry-run SQL: {sent!r}" + # Bare name `ue` must have been rewritten to a backticked + # `.finance.ue` path (project comes from the test stub + # `_FakeProjects.data = "test-data-prj"`). + assert "`test-data-prj.finance.ue`" in sent, ( + f"bare-name rewrite failed; sent SQL: {sent!r}" + ) + # Pre-#171 path emitted `SELECT * FROM`; the new path forwards the + # user SELECT clause untouched. + assert "SELECT order_id" in sent, ( + f"pre-check is still using synthetic SELECT *; sent SQL: {sent!r}" + ) + + +def test_guardrail_invokes_dry_run_exactly_once_per_request( + seeded_app, mock_dry_run, monkeypatch, +): + """Single dry-run path: even when the user references multiple registered + tables in one query (a JOIN, a UNION, …), only ONE dry-run fires. + + Pre-#171 the pre-check ran N dry-runs (one synthetic SELECT * per table) + and summed. Now BQ does the joining for us in a single dry-run — cheaper + AND more accurate (joins/filters/projections apply across both sides). + """ + _register_bq_remote_row("orders", "finance", "orders") + _register_bq_remote_row("traffic", "marketing", "traffic") + + state = {"call_count": 0, "last_sql": None} + + def counting_fake(_bq, sql): + state["call_count"] += 1 + state["last_sql"] = sql + return 100 # tiny + + monkeypatch.setattr( + "app.api.query._bq_dry_run_bytes", counting_fake, raising=False, + ) + + c = seeded_app["client"] + token = seeded_app["admin_token"] + c.post( + "/api/query", + json={ + "sql": ( + "SELECT o.id, t.views FROM orders o " + "JOIN traffic t ON o.date = t.date" + ), + }, + headers=_auth(token), + ) + assert state["call_count"] == 1, ( + f"single-dry-run path expected; got {state['call_count']} calls" + ) + # Both bare names rewritten in the same SQL. + assert "`test-data-prj.finance.orders`" in state["last_sql"] + assert "`test-data-prj.marketing.traffic`" in state["last_sql"] + + +def test_guardrail_falls_back_to_per_table_estimate_on_bq_parse_error( + seeded_app, mock_dry_run, monkeypatch, +): + """When BQ rejects the rewritten SQL with ``bq_bad_request`` (DuckDB-only + syntax that doesn't translate — e.g. ``::INT`` casts, ``STRPOS``, …), + the cap-guard falls back to the pre-#171 per-table SELECT * approach + so a non-portable query still gets a (loose) cap estimate instead of + fail-opening. + """ + from connectors.bigquery.access import BqAccessError + + _register_bq_remote_row("ue", "finance", "ue") + + state = {"calls": []} + + def fake_dry_run(_bq, sql): + state["calls"].append(sql) + # First call (rewritten user SQL) → BQ parse error. + if len(state["calls"]) == 1: + raise BqAccessError("bq_bad_request", "Syntax error: unexpected '::'") + # Second call (fallback per-table SELECT *) → small bytes, pass cap. + return 4096 + + monkeypatch.setattr( + "app.api.query._bq_dry_run_bytes", fake_dry_run, raising=False, + ) + + c = seeded_app["client"] + token = seeded_app["admin_token"] + # SQL with DuckDB-only `::INT` cast that BQ would reject. + r = c.post( + "/api/query", + json={"sql": "SELECT order_id::INT FROM ue WHERE country = 'CZ'"}, + headers=_auth(token), + ) + + # Two dry-runs (rewritten + fallback per-table) before the (failed) + # execute. Status will be a downstream error from analytics.execute() + # since `::INT` doesn't work in DuckDB either against a remote view — + # but the GUARDRAIL must have completed without 5xx-ing. + assert len(state["calls"]) == 2, ( + f"expected 1 rewritten + 1 fallback dry-run, got {len(state['calls'])}: " + f"{state['calls']}" + ) + assert "::" in state["calls"][0], "first call should be the rewritten user SQL" + assert state["calls"][1].startswith("SELECT * FROM"), ( + "second call should be the per-table fallback" + ) + # Whatever HTTP status comes back must NOT be 502 from the guard's + # transport-error path — fallback must absorb the bq_bad_request. + assert r.status_code != 502, r.json() + + +def test_guardrail_propagates_502_on_non_parse_bq_errors( + seeded_app, mock_dry_run, monkeypatch, +): + """Forbidden / upstream-error from BQ on the dry-run still maps to 502; + fallback only kicks in for parse errors. Important so a misconfigured + SA doesn't silently fall back to a stale-metadata estimate.""" + from connectors.bigquery.access import BqAccessError + + _register_bq_remote_row("ue", "finance", "ue") + + def always_forbidden(_bq, _sql): + raise BqAccessError("bq_forbidden", "Permission denied", details={}) + + monkeypatch.setattr( + "app.api.query._bq_dry_run_bytes", always_forbidden, raising=False, + ) + + c = seeded_app["client"] + token = seeded_app["admin_token"] + r = c.post( + "/api/query", + json={"sql": "SELECT count(*) FROM ue"}, + headers=_auth(token), + ) + assert r.status_code == 502, r.json() + detail = r.json().get("detail", {}) + if isinstance(detail, dict): + assert detail.get("kind") == "bq_forbidden" + + +def test_rewrite_helper_handles_bare_name_and_bq_path_in_same_sql(): + """Direct unit-test of the rewriter so the exact regex behavior is + pinned: both bare names AND ``bq..`` references in the same + SQL are translated, and longer names win over shorter prefixes. + """ + from app.api.query import _rewrite_user_sql_for_bq_dry_run + + rewritten = _rewrite_user_sql_for_bq_dry_run( + sql=( + 'SELECT a.id, b.col ' + 'FROM ue a JOIN bq."finance"."traffic" b ON a.date = b.date' + ), + name_lookups=[("ue", "finance", "ue")], + project="data-prj", + ) + assert "`data-prj.finance.ue`" in rewritten + assert "`data-prj.finance.traffic`" in rewritten + # Original duckdb-flavor `bq."ds"."t"` form should have been replaced — + # if it's still in the output, the BQ.path pass missed it. + assert 'bq."finance"."traffic"' not in rewritten + + +def test_rewrite_helper_longer_name_wins_over_prefix(): + """When two registered names share a prefix (`unit_economics`, + `unit_economics_summary`), the longer one must rewrite first so the + shorter one's regex doesn't eat the prefix and leave junk like + ``\\`...ue\\`_summary`` behind. + """ + from app.api.query import _rewrite_user_sql_for_bq_dry_run + + rewritten = _rewrite_user_sql_for_bq_dry_run( + sql="SELECT * FROM unit_economics_summary", + name_lookups=[ + ("unit_economics", "fin", "ue"), + ("unit_economics_summary", "fin", "ue_summary"), + ], + project="p", + ) + assert "`p.fin.ue_summary`" in rewritten + # If the shorter name had eaten the prefix we'd see `p.fin.ue`_summary + # (broken token). Assert that doesn't happen. + assert "`p.fin.ue`" not in rewritten + + +def test_rewrite_helper_is_case_insensitive_on_bare_names(): + """Bare-name match in `_bq_guardrail_inputs` is case-insensitive (it + runs against `sql_lower`). The rewriter must match the same set of + occurrences on the original-case SQL or we'd silently leave some + references untranslated and dry-run on a half-rewritten SQL. + """ + from app.api.query import _rewrite_user_sql_for_bq_dry_run + + rewritten = _rewrite_user_sql_for_bq_dry_run( + sql="SELECT * FROM UE WHERE Ue.id IS NOT NULL", + name_lookups=[("ue", "fin", "ue")], + project="p", + ) + assert "`p.fin.ue` WHERE `p.fin.ue`.id" in rewritten or \ + rewritten.lower().count("`p.fin.ue`") == 2