diff --git a/CHANGELOG.md b/CHANGELOG.md
index e4666cb..7d34942 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -38,6 +38,7 @@ End-to-end clean-analyst-bootstrap rewrite. The web `/setup?role=analyst` page n
- `agnes snapshot create` (formerly `da fetch`) no longer materializes an empty `user/duckdb/analytics.duckdb` when run before any `agnes pull`. Friendly hint redirects to `agnes pull`.
- Workspace `agnes status` reads from the canonical `server/parquet/` and `user/duckdb/analytics.duckdb` paths (was reading legacy `data/parquet/`, `data/metadata/last_sync.json`).
- `agnes init` and `agnes pull` errors now use the `cli/error_render.py` typed-error renderer (added in 0.32.0), so analyst-facing error UX matches the structured shape `agnes query --remote` already produces.
+- **`agnes query --remote` no longer over-rejects narrow queries on partitioned/clustered BigQuery tables.** Closes #171. Pre-fix the `/api/query` cost guardrail dry-ran a synthetic `SELECT * FROM
` per registered remote-BQ row referenced by the user SQL, which forced BQ to estimate "full table scan" — column projection, predicate pushdown, and partition pruning were all ignored, producing scan-byte estimates up to ~30,000× larger than the actual query would scan. Narrow queries on big partitioned tables (the documented happy-path use case) were rejected with 400 `remote_scan_too_large` even when BQ's own dry-run reported single-digit MB. Now the guardrail rewrites the user SQL from DuckDB-flavor (bare registered names + `bq."".""`) to BQ-native (`` `..` ``) and runs ONE dry-run on the EXACT user SQL — partition pruning, column projection, and predicate pushdown all engage. Cap check uses the real estimate. Fallback: if BQ rejects the rewritten SQL with `bq_bad_request` (DuckDB-only syntax that doesn't translate, e.g. `::INT` casts), the guardrail falls back to the pre-fix per-table SELECT * estimate so a non-portable query still gets bounded; non-parse errors (forbidden / upstream) propagate as 502. Helpers exported as `_rewrite_user_sql_for_bq_dry_run` (test seam).
- **Windows: `agnes` CLI no longer crashes on cs-CZ / non-UTF-8 consoles.** Two failure modes addressed (originally reported in #172 against the pre-rename `da` CLI; ported and broadened here): (1) `agnes pull` and any other Rich-progress-bar codepath crashed with `UnicodeEncodeError` because cp1250 / cp1252 cannot encode Rich's Braille spinner glyphs — `cli/main.py` now reconfigures `sys.stdout` / `sys.stderr` to UTF-8 with `errors="replace"` at import time when `sys.platform == "win32"`. (2) `agnes skills list` and `agnes skills show` crashed with `UnicodeDecodeError` reading skill markdown that contains em-dashes / accents — every `Path.read_text()` / `Path.write_text()` / `open()` call site in `cli/` (including ones not touched by #172, since several files were renamed in the bootstrap rewrite) now passes `encoding="utf-8"` explicitly. Defensive: also covers JSON / YAML config files that were ASCII-only in practice but were one non-ASCII value away from the same failure mode.
- `agnes snapshot create … --estimate` in a pre-init directory no longer leaks an httpx `ConnectError` traceback to stderr. The estimate-guard fix (3d587681) let `--estimate` reach `api_post_json`, but the existing `except V2ClientError` clause didn't catch transport-layer errors when no server was configured (defaulted to `http://localhost:8000`). Now also catches `httpx.HTTPError` and renders the friendly hint `Run \`agnes init …\` first`.
- `agnes push` now reads Claude Code session jsonls from `~/.claude/projects//` (where Claude Code actually writes them), instead of `/user/sessions/` (which the SessionEnd hook never populated — the previous code uploaded an empty list every time). Encoding logic in `cli/lib/claude_sessions.py` probes both Claude Code variants — older `/`→`-` and newer all-non-alphanumeric→`-` — and unions the result, so users who have upgraded Claude Code mid-project see sessions from both encoded dirs. Falls back to `/user/sessions/` for back-compat.
diff --git a/app/api/query.py b/app/api/query.py
index af97653..891c066 100644
--- a/app/api/query.py
+++ b/app/api/query.py
@@ -150,7 +150,7 @@ async def execute_query(
raise HTTPException(status_code=403, detail=f"Access denied to table '{table}'")
# ---- #160 BQ remote-row guardrail + RBAC patch -------------------
- dry_run_set, blocked_bq_path = _bq_guardrail_inputs(
+ dry_run_set, name_lookups, blocked_bq_path = _bq_guardrail_inputs(
request.sql, sql_lower, conn, user, allowed,
)
if blocked_bq_path is not None:
@@ -172,7 +172,10 @@ async def execute_query(
user_id = user.get("email") or user.get("id") or "anon"
guard = (
_bq_quota_and_cap_guard(
- user_id=user_id, dry_run_set=dry_run_set, sql=request.sql,
+ user_id=user_id,
+ dry_run_set=dry_run_set,
+ name_lookups=name_lookups,
+ sql=request.sql,
)
if dry_run_set
else contextlib.nullcontext()
@@ -314,11 +317,19 @@ def _bq_guardrail_inputs(
):
"""Two-pass scan over user SQL for the upcoming BQ guardrail + RBAC patch.
- Returns a tuple `(dry_run_set, blocked_bq_path)`:
+ Returns a tuple `(dry_run_set, name_lookups, blocked_bq_path)`:
- `dry_run_set` is a list of `(bucket, source_table, est_bytes)` triples
identifying every BigQuery row the request will scan. The caller dry-runs
- each and bills the sum against the user's daily quota.
+ the rewritten user SQL once and distributes the total here for quota
+ bookkeeping.
+
+ - `name_lookups` is a list of `(registered_name, bucket, source_table)`
+ triples — only the bare-name matches from pass 1, NOT the direct
+ `bq."".""` matches. Issue #171 fix: the cap-guard rewrites
+ these name → ``\\`..\\``` when building
+ the BQ-native SQL for dry-run, so partition pruning + column projection
+ + predicate pushdown all engage.
- `blocked_bq_path` is a structured-detail dict for the caller to raise
HTTPException(403) with, when user SQL contains a direct
@@ -342,6 +353,7 @@ def _bq_guardrail_inputs(
# what shows in `agnes catalog`), so the regex match below uses `name`,
# but the access gate uses `id`.
dry_run: list = []
+ name_lookups: list = []
seen_paths: set = set()
accessible_set = set(allowed) if allowed is not None else None
for r in repo.list_by_source("bigquery"):
@@ -363,6 +375,11 @@ def _bq_guardrail_inputs(
if key not in seen_paths:
seen_paths.add(key)
dry_run.append((bucket, source_table, 0)) # bytes filled at dry-run
+ # Record the (name, bucket, source_table) mapping separately so the
+ # cap-guard's SQL rewriter can find every occurrence — even if the
+ # user references the same physical table under two registered
+ # names (rare but possible: aliased catalog rows).
+ name_lookups.append((str(name), bucket, source_table))
# 2. Direct bq.. pass: every match must point at a registered
# row. Run BEFORE adding to dry_run so unregistered paths fail-fast.
@@ -372,7 +389,7 @@ def _bq_guardrail_inputs(
source_table_raw = m.group(2).strip('"')
row = repo.find_by_bq_path(bucket_raw, source_table_raw)
if row is None:
- return [], {
+ return [], [], {
"reason": "bq_path_not_registered",
"path": f'bq."{bucket_raw}"."{source_table_raw}"',
"hint": (
@@ -387,7 +404,7 @@ def _bq_guardrail_inputs(
# Devin Review iter #3.
if not is_admin:
if accessible_set is None or row["id"] not in accessible_set:
- return [], {
+ return [], [], {
"reason": "bq_path_access_denied",
"path": f'bq."{bucket_raw}"."{source_table_raw}"',
"registered_as": row["name"],
@@ -401,11 +418,69 @@ def _bq_guardrail_inputs(
seen_paths.add(key)
dry_run.append((bucket, source_table, 0))
- return dry_run, None
+ return dry_run, name_lookups, None
+
+
+def _rewrite_user_sql_for_bq_dry_run(
+ sql: str, name_lookups: list, project: str,
+) -> str:
+ """Rewrite user SQL from DuckDB-flavor to BQ-native so a single
+ `_bq_dry_run_bytes` call can estimate scan size for the EXACT query
+ the user submitted (issue #171).
+
+ Two transformations:
+
+ 1. Each registered remote-BQ name (word-boundary, case-insensitive)
+ → ``\\`..\\````. Sorted longest-first
+ so a longer name (`unit_economics_summary`) is rewritten before a
+ shorter one that's a prefix (`unit_economics`) and we don't end up
+ with a partially-rewritten identifier.
+
+ 2. ``bq."".""`` (and the unquoted variant) → ``\\`..\\````.
+
+ The rewrite is regex-only (no SQL parser): a registered name appearing
+ inside a string literal (e.g. an `IN (...)` value or a `LIKE` pattern)
+ will also be rewritten. This is acceptable because (a) it's vanishingly
+ rare to have a string literal exactly matching a registered table name,
+ and (b) when it does happen the dry-run errors out and the caller falls
+ back to the per-table SELECT * estimate (current behavior, no regression).
+
+ CTE shadowing: a `WITH unit_economics AS (...)` followed by `FROM
+ unit_economics` would also rewrite the `FROM` reference. BQ then treats
+ the CTE as unreferenced (legal) and the dry-run estimates the rewritten
+ physical table — likely an over-estimate. Same fallback path covers this.
+ """
+ out = sql
+ # Pass 1: bare-name rewrite, longest names first.
+ for name, bucket, source_table in sorted(
+ name_lookups, key=lambda t: -len(t[0])
+ ):
+ target = f"`{project}.{bucket}.{source_table}`"
+ out = re.sub(
+ r"\b" + re.escape(name) + r"\b",
+ target,
+ out,
+ flags=re.IGNORECASE,
+ )
+
+ # Pass 2: bq."ds"."tbl" / bq.ds.tbl → `..`.
+ def _bq_path_repl(m: re.Match) -> str:
+ ds = m.group(1).strip('"')
+ tbl = m.group(2).strip('"')
+ return f"`{project}.{ds}.{tbl}`"
+
+ out = BQ_PATH.sub(_bq_path_repl, out)
+ return out
@contextlib.contextmanager
-def _bq_quota_and_cap_guard(*, user_id: str, dry_run_set: list, sql: str):
+def _bq_quota_and_cap_guard(
+ *,
+ user_id: str,
+ dry_run_set: list,
+ name_lookups: list,
+ sql: str,
+):
"""Pre-flight check + dry-run + cap enforcement for /api/query BQ paths.
Context-manager shape (Devin Review #5 on PR #168). Earlier implementation
@@ -419,16 +494,36 @@ def _bq_quota_and_cap_guard(*, user_id: str, dry_run_set: list, sql: str):
The caller's `with` block holds the slot through both dry-run AND the
subsequent `analytics.execute(...)` until the body exits.
+ Issue #171 fix: dry-run runs ONCE on the user's actual SQL (translated
+ to BQ-native via `_rewrite_user_sql_for_bq_dry_run`). Pre-fix the
+ pre-check did N dry-runs of synthetic ``SELECT * FROM `` per
+ referenced table — which ignored WHERE filters, column projection, and
+ partition pruning, over-estimating scan size up to ~30,000× on
+ partitioned/clustered tables and rejecting narrow queries that BQ
+ itself would dry-run as a few MB.
+
+ Fallback: if BQ rejects the rewritten SQL with a parse-level
+ ``client_error`` (e.g. DuckDB-only syntax like ``::INT`` casts that
+ don't translate to BQ), fall back to the pre-#171 per-table
+ SELECT * approach so the cap-guard still functions — over-estimate
+ is preferred over fail-open. Forbidden / upstream errors still
+ propagate as HTTP 502.
+
+ Flow:
1. `check_daily_budget` — over-cap users get 429 BEFORE any BQ work.
2. `quota.acquire(user_id)` opened — concurrent-slot held throughout.
- 3. Dry-run each `(bucket, source_table)` via `_bq_dry_run_bytes`.
- 4. If sum > cap → 400 `remote_scan_too_large`.
+ 3. Single dry-run of rewritten user SQL → `total_bytes`.
+ On parse error, fall back to per-table SELECT * → sum.
+ 4. If total > cap → 400 `remote_scan_too_large`.
5. Yield. Caller runs `analytics.execute(...)` + `record_bytes(...)`.
6. On exit, slot released.
Mutates `dry_run_set` in place: the third tuple element (bytes) is
- populated with the per-path dry-run result so the caller can sum and
- record the bytes against the user's quota post-flight.
+ populated so the caller can sum and record bytes against the user's
+ quota post-flight. Single-dry-run path puts `total_bytes` on the first
+ entry and zero on the rest (BQ doesn't expose per-table bytes for a
+ composite query); the caller's `sum(b for _, _, b in dry_run_set)`
+ still equals `total_bytes`.
"""
quota = _build_quota_tracker()
try:
@@ -464,19 +559,66 @@ def _bq_quota_and_cap_guard(*, user_id: str, dry_run_set: list, sql: str):
# Devin Review #2 on PR #168.
try:
with quota.acquire(user_id):
+ project = bq.projects.data
+ rewritten_sql = _rewrite_user_sql_for_bq_dry_run(
+ sql, name_lookups, project,
+ )
+
+ # Try the single-dry-run path first (issue #171). Falls back
+ # to the per-table SELECT * approach only on BQ parse errors
+ # (kind="bq_bad_request" — DuckDB-only syntax that BQ can't
+ # translate). All other BQ errors propagate as 502 below.
total_bytes = 0
- for i, (bucket, source_table, _) in enumerate(dry_run_set):
- bq_sql = f"SELECT * FROM `{bq.projects.data}.{bucket}.{source_table}`"
- try:
- est = _bq_dry_run_bytes(bq, bq_sql)
- except BqAccessError as exc:
+ used_fallback = False
+ try:
+ total_bytes = _bq_dry_run_bytes(bq, rewritten_sql)
+ except BqAccessError as exc:
+ if exc.kind == "bq_bad_request":
+ logger.warning(
+ "BQ dry-run rejected the rewritten SQL "
+ "(kind=%s, message=%s). Falling back to per-table "
+ "SELECT * estimate; the cap check will over-estimate "
+ "scan bytes for this query. Consider rewriting to "
+ "BQ-native syntax for a tight pre-check.",
+ exc.kind, exc.message,
+ )
+ used_fallback = True
+ else:
raise HTTPException(status_code=502, detail={
"kind": exc.kind,
"message": exc.message,
**(exc.details or {}),
})
- dry_run_set[i] = (bucket, source_table, est)
- total_bytes += est
+
+ if used_fallback:
+ # Pre-#171 path: estimate per registered table from a
+ # synthetic SELECT *. Over-estimates partitioned scans but
+ # never under-estimates, so the cap still bounds risk.
+ for i, (bucket, source_table, _) in enumerate(dry_run_set):
+ fallback_sql = (
+ f"SELECT * FROM `{project}.{bucket}.{source_table}`"
+ )
+ try:
+ est = _bq_dry_run_bytes(bq, fallback_sql)
+ except BqAccessError as exc:
+ raise HTTPException(status_code=502, detail={
+ "kind": exc.kind,
+ "message": exc.message,
+ **(exc.details or {}),
+ })
+ dry_run_set[i] = (bucket, source_table, est)
+ total_bytes += est
+ else:
+ # Single-dry-run path. Distribute the total to dry_run_set
+ # so the caller's `record_bytes(sum(...))` stays correct.
+ # Per-table breakdown is unavailable from a composite
+ # dry-run; pin total to entry 0, zero the rest.
+ if dry_run_set:
+ b0, t0, _ = dry_run_set[0]
+ dry_run_set[0] = (b0, t0, total_bytes)
+ for i in range(1, len(dry_run_set)):
+ bi, ti, _ = dry_run_set[i]
+ dry_run_set[i] = (bi, ti, 0)
if cap_bytes > 0 and total_bytes > cap_bytes:
tables = [f"{b}.{t}" for b, t, _ in dry_run_set]
diff --git a/tests/test_api_query_guardrail.py b/tests/test_api_query_guardrail.py
index 078bf48..bc6acc0 100644
--- a/tests/test_api_query_guardrail.py
+++ b/tests/test_api_query_guardrail.py
@@ -136,3 +136,251 @@ def test_no_bq_row_reference_skips_dry_run(seeded_app, monkeypatch):
)
assert state["calls"] == 0, \
f"guardrail must skip dry-run on non-BQ queries; got {state['calls']} calls"
+
+
+# ---------------------------------------------------------------------------
+# Issue #171: pre-check used to dry-run synthetic SELECT * per registered
+# table → 30,000× over-estimate on partitioned/clustered tables. Fix: rewrite
+# user SQL from DuckDB-flavor (bare names + `bq..`) to BQ-native
+# (\\`..\\`) and run a SINGLE dry-run on the user's actual
+# SQL, so partition pruning, column projection, and predicate pushdown all
+# count toward the cap check.
+# ---------------------------------------------------------------------------
+
+
+def test_guardrail_dry_runs_rewritten_user_sql_not_synthetic_select_star(
+ seeded_app, mock_dry_run, monkeypatch,
+):
+ """The dry-run must receive the USER's SQL with bare table names rewritten
+ to backticked paths — not a synthetic ``SELECT * FROM ``.
+
+ This is the load-bearing assertion for issue #171: if the pre-check sees
+ only the table name it can't prune partitions or project columns, and
+ the estimate balloons to "full table size" instead of "what BQ would
+ actually scan."
+ """
+ _register_bq_remote_row("ue", "finance", "ue")
+ captured = {"sql": None}
+
+ def capturing_fake(_bq, sql):
+ captured["sql"] = sql
+ return 1024 # tiny — pass the cap
+
+ monkeypatch.setattr(
+ "app.api.query._bq_dry_run_bytes", capturing_fake, raising=False,
+ )
+
+ c = seeded_app["client"]
+ token = seeded_app["admin_token"]
+ user_sql = (
+ "SELECT order_id FROM ue "
+ "WHERE event_date = DATE '2026-04-30' AND country = 'CZ'"
+ )
+ c.post("/api/query", json={"sql": user_sql}, headers=_auth(token))
+
+ sent = captured["sql"]
+ assert sent is not None, "dry-run never invoked"
+ # User-side filters must survive the rewrite — that's the whole point of
+ # the fix, partition pruning + predicate pushdown only engage in the BQ
+ # planner if the WHERE clause reaches it.
+ assert "event_date" in sent, f"WHERE clause stripped from dry-run SQL: {sent!r}"
+ assert "country" in sent, f"WHERE clause stripped from dry-run SQL: {sent!r}"
+ # Bare name `ue` must have been rewritten to a backticked
+ # `.finance.ue` path (project comes from the test stub
+ # `_FakeProjects.data = "test-data-prj"`).
+ assert "`test-data-prj.finance.ue`" in sent, (
+ f"bare-name rewrite failed; sent SQL: {sent!r}"
+ )
+ # Pre-#171 path emitted `SELECT * FROM`; the new path forwards the
+ # user SELECT clause untouched.
+ assert "SELECT order_id" in sent, (
+ f"pre-check is still using synthetic SELECT *; sent SQL: {sent!r}"
+ )
+
+
+def test_guardrail_invokes_dry_run_exactly_once_per_request(
+ seeded_app, mock_dry_run, monkeypatch,
+):
+ """Single dry-run path: even when the user references multiple registered
+ tables in one query (a JOIN, a UNION, …), only ONE dry-run fires.
+
+ Pre-#171 the pre-check ran N dry-runs (one synthetic SELECT * per table)
+ and summed. Now BQ does the joining for us in a single dry-run — cheaper
+ AND more accurate (joins/filters/projections apply across both sides).
+ """
+ _register_bq_remote_row("orders", "finance", "orders")
+ _register_bq_remote_row("traffic", "marketing", "traffic")
+
+ state = {"call_count": 0, "last_sql": None}
+
+ def counting_fake(_bq, sql):
+ state["call_count"] += 1
+ state["last_sql"] = sql
+ return 100 # tiny
+
+ monkeypatch.setattr(
+ "app.api.query._bq_dry_run_bytes", counting_fake, raising=False,
+ )
+
+ c = seeded_app["client"]
+ token = seeded_app["admin_token"]
+ c.post(
+ "/api/query",
+ json={
+ "sql": (
+ "SELECT o.id, t.views FROM orders o "
+ "JOIN traffic t ON o.date = t.date"
+ ),
+ },
+ headers=_auth(token),
+ )
+ assert state["call_count"] == 1, (
+ f"single-dry-run path expected; got {state['call_count']} calls"
+ )
+ # Both bare names rewritten in the same SQL.
+ assert "`test-data-prj.finance.orders`" in state["last_sql"]
+ assert "`test-data-prj.marketing.traffic`" in state["last_sql"]
+
+
+def test_guardrail_falls_back_to_per_table_estimate_on_bq_parse_error(
+ seeded_app, mock_dry_run, monkeypatch,
+):
+ """When BQ rejects the rewritten SQL with ``bq_bad_request`` (DuckDB-only
+ syntax that doesn't translate — e.g. ``::INT`` casts, ``STRPOS``, …),
+ the cap-guard falls back to the pre-#171 per-table SELECT * approach
+ so a non-portable query still gets a (loose) cap estimate instead of
+ fail-opening.
+ """
+ from connectors.bigquery.access import BqAccessError
+
+ _register_bq_remote_row("ue", "finance", "ue")
+
+ state = {"calls": []}
+
+ def fake_dry_run(_bq, sql):
+ state["calls"].append(sql)
+ # First call (rewritten user SQL) → BQ parse error.
+ if len(state["calls"]) == 1:
+ raise BqAccessError("bq_bad_request", "Syntax error: unexpected '::'")
+ # Second call (fallback per-table SELECT *) → small bytes, pass cap.
+ return 4096
+
+ monkeypatch.setattr(
+ "app.api.query._bq_dry_run_bytes", fake_dry_run, raising=False,
+ )
+
+ c = seeded_app["client"]
+ token = seeded_app["admin_token"]
+ # SQL with DuckDB-only `::INT` cast that BQ would reject.
+ r = c.post(
+ "/api/query",
+ json={"sql": "SELECT order_id::INT FROM ue WHERE country = 'CZ'"},
+ headers=_auth(token),
+ )
+
+ # Two dry-runs (rewritten + fallback per-table) before the (failed)
+ # execute. Status will be a downstream error from analytics.execute()
+ # since `::INT` doesn't work in DuckDB either against a remote view —
+ # but the GUARDRAIL must have completed without 5xx-ing.
+ assert len(state["calls"]) == 2, (
+ f"expected 1 rewritten + 1 fallback dry-run, got {len(state['calls'])}: "
+ f"{state['calls']}"
+ )
+ assert "::" in state["calls"][0], "first call should be the rewritten user SQL"
+ assert state["calls"][1].startswith("SELECT * FROM"), (
+ "second call should be the per-table fallback"
+ )
+ # Whatever HTTP status comes back must NOT be 502 from the guard's
+ # transport-error path — fallback must absorb the bq_bad_request.
+ assert r.status_code != 502, r.json()
+
+
+def test_guardrail_propagates_502_on_non_parse_bq_errors(
+ seeded_app, mock_dry_run, monkeypatch,
+):
+ """Forbidden / upstream-error from BQ on the dry-run still maps to 502;
+ fallback only kicks in for parse errors. Important so a misconfigured
+ SA doesn't silently fall back to a stale-metadata estimate."""
+ from connectors.bigquery.access import BqAccessError
+
+ _register_bq_remote_row("ue", "finance", "ue")
+
+ def always_forbidden(_bq, _sql):
+ raise BqAccessError("bq_forbidden", "Permission denied", details={})
+
+ monkeypatch.setattr(
+ "app.api.query._bq_dry_run_bytes", always_forbidden, raising=False,
+ )
+
+ c = seeded_app["client"]
+ token = seeded_app["admin_token"]
+ r = c.post(
+ "/api/query",
+ json={"sql": "SELECT count(*) FROM ue"},
+ headers=_auth(token),
+ )
+ assert r.status_code == 502, r.json()
+ detail = r.json().get("detail", {})
+ if isinstance(detail, dict):
+ assert detail.get("kind") == "bq_forbidden"
+
+
+def test_rewrite_helper_handles_bare_name_and_bq_path_in_same_sql():
+ """Direct unit-test of the rewriter so the exact regex behavior is
+ pinned: both bare names AND ``bq..`` references in the same
+ SQL are translated, and longer names win over shorter prefixes.
+ """
+ from app.api.query import _rewrite_user_sql_for_bq_dry_run
+
+ rewritten = _rewrite_user_sql_for_bq_dry_run(
+ sql=(
+ 'SELECT a.id, b.col '
+ 'FROM ue a JOIN bq."finance"."traffic" b ON a.date = b.date'
+ ),
+ name_lookups=[("ue", "finance", "ue")],
+ project="data-prj",
+ )
+ assert "`data-prj.finance.ue`" in rewritten
+ assert "`data-prj.finance.traffic`" in rewritten
+ # Original duckdb-flavor `bq."ds"."t"` form should have been replaced —
+ # if it's still in the output, the BQ.path pass missed it.
+ assert 'bq."finance"."traffic"' not in rewritten
+
+
+def test_rewrite_helper_longer_name_wins_over_prefix():
+ """When two registered names share a prefix (`unit_economics`,
+ `unit_economics_summary`), the longer one must rewrite first so the
+ shorter one's regex doesn't eat the prefix and leave junk like
+ ``\\`...ue\\`_summary`` behind.
+ """
+ from app.api.query import _rewrite_user_sql_for_bq_dry_run
+
+ rewritten = _rewrite_user_sql_for_bq_dry_run(
+ sql="SELECT * FROM unit_economics_summary",
+ name_lookups=[
+ ("unit_economics", "fin", "ue"),
+ ("unit_economics_summary", "fin", "ue_summary"),
+ ],
+ project="p",
+ )
+ assert "`p.fin.ue_summary`" in rewritten
+ # If the shorter name had eaten the prefix we'd see `p.fin.ue`_summary
+ # (broken token). Assert that doesn't happen.
+ assert "`p.fin.ue`" not in rewritten
+
+
+def test_rewrite_helper_is_case_insensitive_on_bare_names():
+ """Bare-name match in `_bq_guardrail_inputs` is case-insensitive (it
+ runs against `sql_lower`). The rewriter must match the same set of
+ occurrences on the original-case SQL or we'd silently leave some
+ references untranslated and dry-run on a half-rewritten SQL.
+ """
+ from app.api.query import _rewrite_user_sql_for_bq_dry_run
+
+ rewritten = _rewrite_user_sql_for_bq_dry_run(
+ sql="SELECT * FROM UE WHERE Ue.id IS NOT NULL",
+ name_lookups=[("ue", "fin", "ue")],
+ project="p",
+ )
+ assert "`p.fin.ue` WHERE `p.fin.ue`.id" in rewritten or \
+ rewritten.lower().count("`p.fin.ue`") == 2