fix(query-guardrail): dry-run user SQL not synthetic SELECT * (#171)

Closes #171. The /api/query cost guardrail used to dry-run a synthetic
`SELECT * FROM <table>` for each registered remote-BQ row referenced
by the user SQL — which made BigQuery estimate a full table scan, with
column projection, predicate pushdown, and partition pruning all
disabled. Narrow queries on big partitioned/clustered tables (the
documented happy path for `agnes query --remote`) hit ~30,000×
over-estimates and got rejected with 400 `remote_scan_too_large` even
when BQ's own dry-run reported single-digit MB.

Pavel's report on #171 traced the root cause and proposed the fix:
rewrite the user SQL to BQ-native syntax and dry-run it as a single
job, exactly the way `bq query --dry_run` works.

Implementation:
- New helper _rewrite_user_sql_for_bq_dry_run rewrites bare registered
  names (word-boundary, case-insensitive, longest-first to avoid prefix
  collisions) + bq."<ds>"."<tbl>" forms to backticked
  `<project>.<ds>.<tbl>` paths.
- _bq_quota_and_cap_guard runs ONE dry-run on the rewritten SQL. Cap
  check uses the real estimate.
- Fallback path: if BQ rejects with bq_bad_request (e.g. DuckDB-only
  syntax like ::INT casts), the guard falls back to the pre-fix
  per-table SELECT * approach so non-portable queries still get a
  (loose) cap estimate instead of fail-opening. Non-parse BQ errors
  (forbidden, upstream) still propagate as 502.
- _bq_guardrail_inputs now also returns name_lookups so the rewriter
  has the (registered_name, bucket, source_table) mapping it needs.
- Per-table breakdown is unavailable from a composite dry-run; total
  bytes are pinned to dry_run_set[0] for the post-flight
  record_bytes(sum(...)) call to keep returning the right total.

Tests (7 new, 3 existing still pass):
- dry-run receives rewritten user SQL with WHERE clause intact (the
  load-bearing assertion for #171)
- single dry-run per request even with multiple registered tables
  (JOIN, UNION) referenced
- fallback to per-table SELECT * on bq_bad_request
- non-parse BQ errors (forbidden) still 502
- rewriter unit tests: bare + bq.path in same SQL, longest-name-wins
  on prefix collision, case-insensitive bare-name match
This commit is contained in:
ZdenekSrotyr 2026-05-04 21:08:21 +02:00
parent bd462187e8
commit 500db8cd3c
3 changed files with 410 additions and 19 deletions

View file

@ -38,6 +38,7 @@ End-to-end clean-analyst-bootstrap rewrite. The web `/setup?role=analyst` page n
- `agnes snapshot create` (formerly `da fetch`) no longer materializes an empty `user/duckdb/analytics.duckdb` when run before any `agnes pull`. Friendly hint redirects to `agnes pull`.
- Workspace `agnes status` reads from the canonical `server/parquet/` and `user/duckdb/analytics.duckdb` paths (was reading legacy `data/parquet/`, `data/metadata/last_sync.json`).
- `agnes init` and `agnes pull` errors now use the `cli/error_render.py` typed-error renderer (added in 0.32.0), so analyst-facing error UX matches the structured shape `agnes query --remote` already produces.
- **`agnes query --remote` no longer over-rejects narrow queries on partitioned/clustered BigQuery tables.** Closes #171. Pre-fix the `/api/query` cost guardrail dry-ran a synthetic `SELECT * FROM <table>` per registered remote-BQ row referenced by the user SQL, which forced BQ to estimate "full table scan" — column projection, predicate pushdown, and partition pruning were all ignored, producing scan-byte estimates up to ~30,000× larger than the actual query would scan. Narrow queries on big partitioned tables (the documented happy-path use case) were rejected with 400 `remote_scan_too_large` even when BQ's own dry-run reported single-digit MB. Now the guardrail rewrites the user SQL from DuckDB-flavor (bare registered names + `bq."<ds>"."<tbl>"`) to BQ-native (`` `<project>.<ds>.<tbl>` ``) and runs ONE dry-run on the EXACT user SQL — partition pruning, column projection, and predicate pushdown all engage. Cap check uses the real estimate. Fallback: if BQ rejects the rewritten SQL with `bq_bad_request` (DuckDB-only syntax that doesn't translate, e.g. `::INT` casts), the guardrail falls back to the pre-fix per-table SELECT * estimate so a non-portable query still gets bounded; non-parse errors (forbidden / upstream) propagate as 502. Helpers exported as `_rewrite_user_sql_for_bq_dry_run` (test seam).
- **Windows: `agnes` CLI no longer crashes on cs-CZ / non-UTF-8 consoles.** Two failure modes addressed (originally reported in #172 against the pre-rename `da` CLI; ported and broadened here): (1) `agnes pull` and any other Rich-progress-bar codepath crashed with `UnicodeEncodeError` because cp1250 / cp1252 cannot encode Rich's Braille spinner glyphs — `cli/main.py` now reconfigures `sys.stdout` / `sys.stderr` to UTF-8 with `errors="replace"` at import time when `sys.platform == "win32"`. (2) `agnes skills list` and `agnes skills show` crashed with `UnicodeDecodeError` reading skill markdown that contains em-dashes / accents — every `Path.read_text()` / `Path.write_text()` / `open()` call site in `cli/` (including ones not touched by #172, since several files were renamed in the bootstrap rewrite) now passes `encoding="utf-8"` explicitly. Defensive: also covers JSON / YAML config files that were ASCII-only in practice but were one non-ASCII value away from the same failure mode.
- `agnes snapshot create … --estimate` in a pre-init directory no longer leaks an httpx `ConnectError` traceback to stderr. The estimate-guard fix (3d587681) let `--estimate` reach `api_post_json`, but the existing `except V2ClientError` clause didn't catch transport-layer errors when no server was configured (defaulted to `http://localhost:8000`). Now also catches `httpx.HTTPError` and renders the friendly hint `Run \`agnes init …\` first`.
- `agnes push` now reads Claude Code session jsonls from `~/.claude/projects/<encoded-cwd>/` (where Claude Code actually writes them), instead of `<workspace>/user/sessions/` (which the SessionEnd hook never populated — the previous code uploaded an empty list every time). Encoding logic in `cli/lib/claude_sessions.py` probes both Claude Code variants — older `/`→`-` and newer all-non-alphanumeric→`-` — and unions the result, so users who have upgraded Claude Code mid-project see sessions from both encoded dirs. Falls back to `<workspace>/user/sessions/` for back-compat.

View file

@ -150,7 +150,7 @@ async def execute_query(
raise HTTPException(status_code=403, detail=f"Access denied to table '{table}'")
# ---- #160 BQ remote-row guardrail + RBAC patch -------------------
dry_run_set, blocked_bq_path = _bq_guardrail_inputs(
dry_run_set, name_lookups, blocked_bq_path = _bq_guardrail_inputs(
request.sql, sql_lower, conn, user, allowed,
)
if blocked_bq_path is not None:
@ -172,7 +172,10 @@ async def execute_query(
user_id = user.get("email") or user.get("id") or "anon"
guard = (
_bq_quota_and_cap_guard(
user_id=user_id, dry_run_set=dry_run_set, sql=request.sql,
user_id=user_id,
dry_run_set=dry_run_set,
name_lookups=name_lookups,
sql=request.sql,
)
if dry_run_set
else contextlib.nullcontext()
@ -314,11 +317,19 @@ def _bq_guardrail_inputs(
):
"""Two-pass scan over user SQL for the upcoming BQ guardrail + RBAC patch.
Returns a tuple `(dry_run_set, blocked_bq_path)`:
Returns a tuple `(dry_run_set, name_lookups, blocked_bq_path)`:
- `dry_run_set` is a list of `(bucket, source_table, est_bytes)` triples
identifying every BigQuery row the request will scan. The caller dry-runs
each and bills the sum against the user's daily quota.
the rewritten user SQL once and distributes the total here for quota
bookkeeping.
- `name_lookups` is a list of `(registered_name, bucket, source_table)`
triples only the bare-name matches from pass 1, NOT the direct
`bq."<ds>"."<tbl>"` matches. Issue #171 fix: the cap-guard rewrites
these name ``\\`<project>.<bucket>.<source_table>\\``` when building
the BQ-native SQL for dry-run, so partition pruning + column projection
+ predicate pushdown all engage.
- `blocked_bq_path` is a structured-detail dict for the caller to raise
HTTPException(403) with, when user SQL contains a direct
@ -342,6 +353,7 @@ def _bq_guardrail_inputs(
# what shows in `agnes catalog`), so the regex match below uses `name`,
# but the access gate uses `id`.
dry_run: list = []
name_lookups: list = []
seen_paths: set = set()
accessible_set = set(allowed) if allowed is not None else None
for r in repo.list_by_source("bigquery"):
@ -363,6 +375,11 @@ def _bq_guardrail_inputs(
if key not in seen_paths:
seen_paths.add(key)
dry_run.append((bucket, source_table, 0)) # bytes filled at dry-run
# Record the (name, bucket, source_table) mapping separately so the
# cap-guard's SQL rewriter can find every occurrence — even if the
# user references the same physical table under two registered
# names (rare but possible: aliased catalog rows).
name_lookups.append((str(name), bucket, source_table))
# 2. Direct bq.<ds>.<tbl> pass: every match must point at a registered
# row. Run BEFORE adding to dry_run so unregistered paths fail-fast.
@ -372,7 +389,7 @@ def _bq_guardrail_inputs(
source_table_raw = m.group(2).strip('"')
row = repo.find_by_bq_path(bucket_raw, source_table_raw)
if row is None:
return [], {
return [], [], {
"reason": "bq_path_not_registered",
"path": f'bq."{bucket_raw}"."{source_table_raw}"',
"hint": (
@ -387,7 +404,7 @@ def _bq_guardrail_inputs(
# Devin Review iter #3.
if not is_admin:
if accessible_set is None or row["id"] not in accessible_set:
return [], {
return [], [], {
"reason": "bq_path_access_denied",
"path": f'bq."{bucket_raw}"."{source_table_raw}"',
"registered_as": row["name"],
@ -401,11 +418,69 @@ def _bq_guardrail_inputs(
seen_paths.add(key)
dry_run.append((bucket, source_table, 0))
return dry_run, None
return dry_run, name_lookups, None
def _rewrite_user_sql_for_bq_dry_run(
sql: str, name_lookups: list, project: str,
) -> str:
"""Rewrite user SQL from DuckDB-flavor to BQ-native so a single
`_bq_dry_run_bytes` call can estimate scan size for the EXACT query
the user submitted (issue #171).
Two transformations:
1. Each registered remote-BQ name (word-boundary, case-insensitive)
``\\`<project>.<bucket>.<source_table>\\````. Sorted longest-first
so a longer name (`unit_economics_summary`) is rewritten before a
shorter one that's a prefix (`unit_economics`) and we don't end up
with a partially-rewritten identifier.
2. ``bq."<ds>"."<tbl>"`` (and the unquoted variant) ``\\`<project>.<ds>.<tbl>\\````.
The rewrite is regex-only (no SQL parser): a registered name appearing
inside a string literal (e.g. an `IN (...)` value or a `LIKE` pattern)
will also be rewritten. This is acceptable because (a) it's vanishingly
rare to have a string literal exactly matching a registered table name,
and (b) when it does happen the dry-run errors out and the caller falls
back to the per-table SELECT * estimate (current behavior, no regression).
CTE shadowing: a `WITH unit_economics AS (...)` followed by `FROM
unit_economics` would also rewrite the `FROM` reference. BQ then treats
the CTE as unreferenced (legal) and the dry-run estimates the rewritten
physical table likely an over-estimate. Same fallback path covers this.
"""
out = sql
# Pass 1: bare-name rewrite, longest names first.
for name, bucket, source_table in sorted(
name_lookups, key=lambda t: -len(t[0])
):
target = f"`{project}.{bucket}.{source_table}`"
out = re.sub(
r"\b" + re.escape(name) + r"\b",
target,
out,
flags=re.IGNORECASE,
)
# Pass 2: bq."ds"."tbl" / bq.ds.tbl → `<project>.<ds>.<tbl>`.
def _bq_path_repl(m: re.Match) -> str:
ds = m.group(1).strip('"')
tbl = m.group(2).strip('"')
return f"`{project}.{ds}.{tbl}`"
out = BQ_PATH.sub(_bq_path_repl, out)
return out
@contextlib.contextmanager
def _bq_quota_and_cap_guard(*, user_id: str, dry_run_set: list, sql: str):
def _bq_quota_and_cap_guard(
*,
user_id: str,
dry_run_set: list,
name_lookups: list,
sql: str,
):
"""Pre-flight check + dry-run + cap enforcement for /api/query BQ paths.
Context-manager shape (Devin Review #5 on PR #168). Earlier implementation
@ -419,16 +494,36 @@ def _bq_quota_and_cap_guard(*, user_id: str, dry_run_set: list, sql: str):
The caller's `with` block holds the slot through both dry-run AND the
subsequent `analytics.execute(...)` until the body exits.
Issue #171 fix: dry-run runs ONCE on the user's actual SQL (translated
to BQ-native via `_rewrite_user_sql_for_bq_dry_run`). Pre-fix the
pre-check did N dry-runs of synthetic ``SELECT * FROM <table>`` per
referenced table which ignored WHERE filters, column projection, and
partition pruning, over-estimating scan size up to ~30,000× on
partitioned/clustered tables and rejecting narrow queries that BQ
itself would dry-run as a few MB.
Fallback: if BQ rejects the rewritten SQL with a parse-level
``client_error`` (e.g. DuckDB-only syntax like ``::INT`` casts that
don't translate to BQ), fall back to the pre-#171 per-table
SELECT * approach so the cap-guard still functions over-estimate
is preferred over fail-open. Forbidden / upstream errors still
propagate as HTTP 502.
Flow:
1. `check_daily_budget` over-cap users get 429 BEFORE any BQ work.
2. `quota.acquire(user_id)` opened concurrent-slot held throughout.
3. Dry-run each `(bucket, source_table)` via `_bq_dry_run_bytes`.
4. If sum > cap 400 `remote_scan_too_large`.
3. Single dry-run of rewritten user SQL `total_bytes`.
On parse error, fall back to per-table SELECT * sum.
4. If total > cap 400 `remote_scan_too_large`.
5. Yield. Caller runs `analytics.execute(...)` + `record_bytes(...)`.
6. On exit, slot released.
Mutates `dry_run_set` in place: the third tuple element (bytes) is
populated with the per-path dry-run result so the caller can sum and
record the bytes against the user's quota post-flight.
populated so the caller can sum and record bytes against the user's
quota post-flight. Single-dry-run path puts `total_bytes` on the first
entry and zero on the rest (BQ doesn't expose per-table bytes for a
composite query); the caller's `sum(b for _, _, b in dry_run_set)`
still equals `total_bytes`.
"""
quota = _build_quota_tracker()
try:
@ -464,19 +559,66 @@ def _bq_quota_and_cap_guard(*, user_id: str, dry_run_set: list, sql: str):
# Devin Review #2 on PR #168.
try:
with quota.acquire(user_id):
project = bq.projects.data
rewritten_sql = _rewrite_user_sql_for_bq_dry_run(
sql, name_lookups, project,
)
# Try the single-dry-run path first (issue #171). Falls back
# to the per-table SELECT * approach only on BQ parse errors
# (kind="bq_bad_request" — DuckDB-only syntax that BQ can't
# translate). All other BQ errors propagate as 502 below.
total_bytes = 0
for i, (bucket, source_table, _) in enumerate(dry_run_set):
bq_sql = f"SELECT * FROM `{bq.projects.data}.{bucket}.{source_table}`"
try:
est = _bq_dry_run_bytes(bq, bq_sql)
except BqAccessError as exc:
used_fallback = False
try:
total_bytes = _bq_dry_run_bytes(bq, rewritten_sql)
except BqAccessError as exc:
if exc.kind == "bq_bad_request":
logger.warning(
"BQ dry-run rejected the rewritten SQL "
"(kind=%s, message=%s). Falling back to per-table "
"SELECT * estimate; the cap check will over-estimate "
"scan bytes for this query. Consider rewriting to "
"BQ-native syntax for a tight pre-check.",
exc.kind, exc.message,
)
used_fallback = True
else:
raise HTTPException(status_code=502, detail={
"kind": exc.kind,
"message": exc.message,
**(exc.details or {}),
})
dry_run_set[i] = (bucket, source_table, est)
total_bytes += est
if used_fallback:
# Pre-#171 path: estimate per registered table from a
# synthetic SELECT *. Over-estimates partitioned scans but
# never under-estimates, so the cap still bounds risk.
for i, (bucket, source_table, _) in enumerate(dry_run_set):
fallback_sql = (
f"SELECT * FROM `{project}.{bucket}.{source_table}`"
)
try:
est = _bq_dry_run_bytes(bq, fallback_sql)
except BqAccessError as exc:
raise HTTPException(status_code=502, detail={
"kind": exc.kind,
"message": exc.message,
**(exc.details or {}),
})
dry_run_set[i] = (bucket, source_table, est)
total_bytes += est
else:
# Single-dry-run path. Distribute the total to dry_run_set
# so the caller's `record_bytes(sum(...))` stays correct.
# Per-table breakdown is unavailable from a composite
# dry-run; pin total to entry 0, zero the rest.
if dry_run_set:
b0, t0, _ = dry_run_set[0]
dry_run_set[0] = (b0, t0, total_bytes)
for i in range(1, len(dry_run_set)):
bi, ti, _ = dry_run_set[i]
dry_run_set[i] = (bi, ti, 0)
if cap_bytes > 0 and total_bytes > cap_bytes:
tables = [f"{b}.{t}" for b, t, _ in dry_run_set]

View file

@ -136,3 +136,251 @@ def test_no_bq_row_reference_skips_dry_run(seeded_app, monkeypatch):
)
assert state["calls"] == 0, \
f"guardrail must skip dry-run on non-BQ queries; got {state['calls']} calls"
# ---------------------------------------------------------------------------
# Issue #171: pre-check used to dry-run synthetic SELECT * per registered
# table → 30,000× over-estimate on partitioned/clustered tables. Fix: rewrite
# user SQL from DuckDB-flavor (bare names + `bq.<ds>.<tbl>`) to BQ-native
# (\\`<project>.<ds>.<tbl>\\`) and run a SINGLE dry-run on the user's actual
# SQL, so partition pruning, column projection, and predicate pushdown all
# count toward the cap check.
# ---------------------------------------------------------------------------
def test_guardrail_dry_runs_rewritten_user_sql_not_synthetic_select_star(
seeded_app, mock_dry_run, monkeypatch,
):
"""The dry-run must receive the USER's SQL with bare table names rewritten
to backticked paths not a synthetic ``SELECT * FROM <table>``.
This is the load-bearing assertion for issue #171: if the pre-check sees
only the table name it can't prune partitions or project columns, and
the estimate balloons to "full table size" instead of "what BQ would
actually scan."
"""
_register_bq_remote_row("ue", "finance", "ue")
captured = {"sql": None}
def capturing_fake(_bq, sql):
captured["sql"] = sql
return 1024 # tiny — pass the cap
monkeypatch.setattr(
"app.api.query._bq_dry_run_bytes", capturing_fake, raising=False,
)
c = seeded_app["client"]
token = seeded_app["admin_token"]
user_sql = (
"SELECT order_id FROM ue "
"WHERE event_date = DATE '2026-04-30' AND country = 'CZ'"
)
c.post("/api/query", json={"sql": user_sql}, headers=_auth(token))
sent = captured["sql"]
assert sent is not None, "dry-run never invoked"
# User-side filters must survive the rewrite — that's the whole point of
# the fix, partition pruning + predicate pushdown only engage in the BQ
# planner if the WHERE clause reaches it.
assert "event_date" in sent, f"WHERE clause stripped from dry-run SQL: {sent!r}"
assert "country" in sent, f"WHERE clause stripped from dry-run SQL: {sent!r}"
# Bare name `ue` must have been rewritten to a backticked
# `<project>.finance.ue` path (project comes from the test stub
# `_FakeProjects.data = "test-data-prj"`).
assert "`test-data-prj.finance.ue`" in sent, (
f"bare-name rewrite failed; sent SQL: {sent!r}"
)
# Pre-#171 path emitted `SELECT * FROM`; the new path forwards the
# user SELECT clause untouched.
assert "SELECT order_id" in sent, (
f"pre-check is still using synthetic SELECT *; sent SQL: {sent!r}"
)
def test_guardrail_invokes_dry_run_exactly_once_per_request(
seeded_app, mock_dry_run, monkeypatch,
):
"""Single dry-run path: even when the user references multiple registered
tables in one query (a JOIN, a UNION, ), only ONE dry-run fires.
Pre-#171 the pre-check ran N dry-runs (one synthetic SELECT * per table)
and summed. Now BQ does the joining for us in a single dry-run cheaper
AND more accurate (joins/filters/projections apply across both sides).
"""
_register_bq_remote_row("orders", "finance", "orders")
_register_bq_remote_row("traffic", "marketing", "traffic")
state = {"call_count": 0, "last_sql": None}
def counting_fake(_bq, sql):
state["call_count"] += 1
state["last_sql"] = sql
return 100 # tiny
monkeypatch.setattr(
"app.api.query._bq_dry_run_bytes", counting_fake, raising=False,
)
c = seeded_app["client"]
token = seeded_app["admin_token"]
c.post(
"/api/query",
json={
"sql": (
"SELECT o.id, t.views FROM orders o "
"JOIN traffic t ON o.date = t.date"
),
},
headers=_auth(token),
)
assert state["call_count"] == 1, (
f"single-dry-run path expected; got {state['call_count']} calls"
)
# Both bare names rewritten in the same SQL.
assert "`test-data-prj.finance.orders`" in state["last_sql"]
assert "`test-data-prj.marketing.traffic`" in state["last_sql"]
def test_guardrail_falls_back_to_per_table_estimate_on_bq_parse_error(
seeded_app, mock_dry_run, monkeypatch,
):
"""When BQ rejects the rewritten SQL with ``bq_bad_request`` (DuckDB-only
syntax that doesn't translate — e.g. ``::INT`` casts, ``STRPOS``, …),
the cap-guard falls back to the pre-#171 per-table SELECT * approach
so a non-portable query still gets a (loose) cap estimate instead of
fail-opening.
"""
from connectors.bigquery.access import BqAccessError
_register_bq_remote_row("ue", "finance", "ue")
state = {"calls": []}
def fake_dry_run(_bq, sql):
state["calls"].append(sql)
# First call (rewritten user SQL) → BQ parse error.
if len(state["calls"]) == 1:
raise BqAccessError("bq_bad_request", "Syntax error: unexpected '::'")
# Second call (fallback per-table SELECT *) → small bytes, pass cap.
return 4096
monkeypatch.setattr(
"app.api.query._bq_dry_run_bytes", fake_dry_run, raising=False,
)
c = seeded_app["client"]
token = seeded_app["admin_token"]
# SQL with DuckDB-only `::INT` cast that BQ would reject.
r = c.post(
"/api/query",
json={"sql": "SELECT order_id::INT FROM ue WHERE country = 'CZ'"},
headers=_auth(token),
)
# Two dry-runs (rewritten + fallback per-table) before the (failed)
# execute. Status will be a downstream error from analytics.execute()
# since `::INT` doesn't work in DuckDB either against a remote view —
# but the GUARDRAIL must have completed without 5xx-ing.
assert len(state["calls"]) == 2, (
f"expected 1 rewritten + 1 fallback dry-run, got {len(state['calls'])}: "
f"{state['calls']}"
)
assert "::" in state["calls"][0], "first call should be the rewritten user SQL"
assert state["calls"][1].startswith("SELECT * FROM"), (
"second call should be the per-table fallback"
)
# Whatever HTTP status comes back must NOT be 502 from the guard's
# transport-error path — fallback must absorb the bq_bad_request.
assert r.status_code != 502, r.json()
def test_guardrail_propagates_502_on_non_parse_bq_errors(
seeded_app, mock_dry_run, monkeypatch,
):
"""Forbidden / upstream-error from BQ on the dry-run still maps to 502;
fallback only kicks in for parse errors. Important so a misconfigured
SA doesn't silently fall back to a stale-metadata estimate."""
from connectors.bigquery.access import BqAccessError
_register_bq_remote_row("ue", "finance", "ue")
def always_forbidden(_bq, _sql):
raise BqAccessError("bq_forbidden", "Permission denied", details={})
monkeypatch.setattr(
"app.api.query._bq_dry_run_bytes", always_forbidden, raising=False,
)
c = seeded_app["client"]
token = seeded_app["admin_token"]
r = c.post(
"/api/query",
json={"sql": "SELECT count(*) FROM ue"},
headers=_auth(token),
)
assert r.status_code == 502, r.json()
detail = r.json().get("detail", {})
if isinstance(detail, dict):
assert detail.get("kind") == "bq_forbidden"
def test_rewrite_helper_handles_bare_name_and_bq_path_in_same_sql():
"""Direct unit-test of the rewriter so the exact regex behavior is
pinned: both bare names AND ``bq.<ds>.<tbl>`` references in the same
SQL are translated, and longer names win over shorter prefixes.
"""
from app.api.query import _rewrite_user_sql_for_bq_dry_run
rewritten = _rewrite_user_sql_for_bq_dry_run(
sql=(
'SELECT a.id, b.col '
'FROM ue a JOIN bq."finance"."traffic" b ON a.date = b.date'
),
name_lookups=[("ue", "finance", "ue")],
project="data-prj",
)
assert "`data-prj.finance.ue`" in rewritten
assert "`data-prj.finance.traffic`" in rewritten
# Original duckdb-flavor `bq."ds"."t"` form should have been replaced —
# if it's still in the output, the BQ.path pass missed it.
assert 'bq."finance"."traffic"' not in rewritten
def test_rewrite_helper_longer_name_wins_over_prefix():
"""When two registered names share a prefix (`unit_economics`,
`unit_economics_summary`), the longer one must rewrite first so the
shorter one's regex doesn't eat the prefix and leave junk like
``\\`...ue\\`_summary`` behind.
"""
from app.api.query import _rewrite_user_sql_for_bq_dry_run
rewritten = _rewrite_user_sql_for_bq_dry_run(
sql="SELECT * FROM unit_economics_summary",
name_lookups=[
("unit_economics", "fin", "ue"),
("unit_economics_summary", "fin", "ue_summary"),
],
project="p",
)
assert "`p.fin.ue_summary`" in rewritten
# If the shorter name had eaten the prefix we'd see `p.fin.ue`_summary
# (broken token). Assert that doesn't happen.
assert "`p.fin.ue`" not in rewritten
def test_rewrite_helper_is_case_insensitive_on_bare_names():
"""Bare-name match in `_bq_guardrail_inputs` is case-insensitive (it
runs against `sql_lower`). The rewriter must match the same set of
occurrences on the original-case SQL or we'd silently leave some
references untranslated and dry-run on a half-rewritten SQL.
"""
from app.api.query import _rewrite_user_sql_for_bq_dry_run
rewritten = _rewrite_user_sql_for_bq_dry_run(
sql="SELECT * FROM UE WHERE Ue.id IS NOT NULL",
name_lookups=[("ue", "fin", "ue")],
project="p",
)
assert "`p.fin.ue` WHERE `p.fin.ue`.id" in rewritten or \
rewritten.lower().count("`p.fin.ue`") == 2