fix(query-guardrail): dry-run user SQL not synthetic SELECT * (#171)
Closes #171. The /api/query cost guardrail used to dry-run a synthetic `SELECT * FROM <table>` for each registered remote-BQ row referenced by the user SQL — which made BigQuery estimate a full table scan, with column projection, predicate pushdown, and partition pruning all disabled. Narrow queries on big partitioned/clustered tables (the documented happy path for `agnes query --remote`) hit ~30,000× over-estimates and got rejected with 400 `remote_scan_too_large` even when BQ's own dry-run reported single-digit MB. Pavel's report on #171 traced the root cause and proposed the fix: rewrite the user SQL to BQ-native syntax and dry-run it as a single job, exactly the way `bq query --dry_run` works. Implementation: - New helper _rewrite_user_sql_for_bq_dry_run rewrites bare registered names (word-boundary, case-insensitive, longest-first to avoid prefix collisions) + bq."<ds>"."<tbl>" forms to backticked `<project>.<ds>.<tbl>` paths. - _bq_quota_and_cap_guard runs ONE dry-run on the rewritten SQL. Cap check uses the real estimate. - Fallback path: if BQ rejects with bq_bad_request (e.g. DuckDB-only syntax like ::INT casts), the guard falls back to the pre-fix per-table SELECT * approach so non-portable queries still get a (loose) cap estimate instead of fail-opening. Non-parse BQ errors (forbidden, upstream) still propagate as 502. - _bq_guardrail_inputs now also returns name_lookups so the rewriter has the (registered_name, bucket, source_table) mapping it needs. - Per-table breakdown is unavailable from a composite dry-run; total bytes are pinned to dry_run_set[0] for the post-flight record_bytes(sum(...)) call to keep returning the right total. Tests (7 new, 3 existing still pass): - dry-run receives rewritten user SQL with WHERE clause intact (the load-bearing assertion for #171) - single dry-run per request even with multiple registered tables (JOIN, UNION) referenced - fallback to per-table SELECT * on bq_bad_request - non-parse BQ errors (forbidden) still 502 - rewriter unit tests: bare + bq.path in same SQL, longest-name-wins on prefix collision, case-insensitive bare-name match
This commit is contained in:
parent
bd462187e8
commit
500db8cd3c
3 changed files with 410 additions and 19 deletions
|
|
@ -38,6 +38,7 @@ End-to-end clean-analyst-bootstrap rewrite. The web `/setup?role=analyst` page n
|
|||
- `agnes snapshot create` (formerly `da fetch`) no longer materializes an empty `user/duckdb/analytics.duckdb` when run before any `agnes pull`. Friendly hint redirects to `agnes pull`.
|
||||
- Workspace `agnes status` reads from the canonical `server/parquet/` and `user/duckdb/analytics.duckdb` paths (was reading legacy `data/parquet/`, `data/metadata/last_sync.json`).
|
||||
- `agnes init` and `agnes pull` errors now use the `cli/error_render.py` typed-error renderer (added in 0.32.0), so analyst-facing error UX matches the structured shape `agnes query --remote` already produces.
|
||||
- **`agnes query --remote` no longer over-rejects narrow queries on partitioned/clustered BigQuery tables.** Closes #171. Pre-fix the `/api/query` cost guardrail dry-ran a synthetic `SELECT * FROM <table>` per registered remote-BQ row referenced by the user SQL, which forced BQ to estimate "full table scan" — column projection, predicate pushdown, and partition pruning were all ignored, producing scan-byte estimates up to ~30,000× larger than the actual query would scan. Narrow queries on big partitioned tables (the documented happy-path use case) were rejected with 400 `remote_scan_too_large` even when BQ's own dry-run reported single-digit MB. Now the guardrail rewrites the user SQL from DuckDB-flavor (bare registered names + `bq."<ds>"."<tbl>"`) to BQ-native (`` `<project>.<ds>.<tbl>` ``) and runs ONE dry-run on the EXACT user SQL — partition pruning, column projection, and predicate pushdown all engage. Cap check uses the real estimate. Fallback: if BQ rejects the rewritten SQL with `bq_bad_request` (DuckDB-only syntax that doesn't translate, e.g. `::INT` casts), the guardrail falls back to the pre-fix per-table SELECT * estimate so a non-portable query still gets bounded; non-parse errors (forbidden / upstream) propagate as 502. Helpers exported as `_rewrite_user_sql_for_bq_dry_run` (test seam).
|
||||
- **Windows: `agnes` CLI no longer crashes on cs-CZ / non-UTF-8 consoles.** Two failure modes addressed (originally reported in #172 against the pre-rename `da` CLI; ported and broadened here): (1) `agnes pull` and any other Rich-progress-bar codepath crashed with `UnicodeEncodeError` because cp1250 / cp1252 cannot encode Rich's Braille spinner glyphs — `cli/main.py` now reconfigures `sys.stdout` / `sys.stderr` to UTF-8 with `errors="replace"` at import time when `sys.platform == "win32"`. (2) `agnes skills list` and `agnes skills show` crashed with `UnicodeDecodeError` reading skill markdown that contains em-dashes / accents — every `Path.read_text()` / `Path.write_text()` / `open()` call site in `cli/` (including ones not touched by #172, since several files were renamed in the bootstrap rewrite) now passes `encoding="utf-8"` explicitly. Defensive: also covers JSON / YAML config files that were ASCII-only in practice but were one non-ASCII value away from the same failure mode.
|
||||
- `agnes snapshot create … --estimate` in a pre-init directory no longer leaks an httpx `ConnectError` traceback to stderr. The estimate-guard fix (3d587681) let `--estimate` reach `api_post_json`, but the existing `except V2ClientError` clause didn't catch transport-layer errors when no server was configured (defaulted to `http://localhost:8000`). Now also catches `httpx.HTTPError` and renders the friendly hint `Run \`agnes init …\` first`.
|
||||
- `agnes push` now reads Claude Code session jsonls from `~/.claude/projects/<encoded-cwd>/` (where Claude Code actually writes them), instead of `<workspace>/user/sessions/` (which the SessionEnd hook never populated — the previous code uploaded an empty list every time). Encoding logic in `cli/lib/claude_sessions.py` probes both Claude Code variants — older `/`→`-` and newer all-non-alphanumeric→`-` — and unions the result, so users who have upgraded Claude Code mid-project see sessions from both encoded dirs. Falls back to `<workspace>/user/sessions/` for back-compat.
|
||||
|
|
|
|||
180
app/api/query.py
180
app/api/query.py
|
|
@ -150,7 +150,7 @@ async def execute_query(
|
|||
raise HTTPException(status_code=403, detail=f"Access denied to table '{table}'")
|
||||
|
||||
# ---- #160 BQ remote-row guardrail + RBAC patch -------------------
|
||||
dry_run_set, blocked_bq_path = _bq_guardrail_inputs(
|
||||
dry_run_set, name_lookups, blocked_bq_path = _bq_guardrail_inputs(
|
||||
request.sql, sql_lower, conn, user, allowed,
|
||||
)
|
||||
if blocked_bq_path is not None:
|
||||
|
|
@ -172,7 +172,10 @@ async def execute_query(
|
|||
user_id = user.get("email") or user.get("id") or "anon"
|
||||
guard = (
|
||||
_bq_quota_and_cap_guard(
|
||||
user_id=user_id, dry_run_set=dry_run_set, sql=request.sql,
|
||||
user_id=user_id,
|
||||
dry_run_set=dry_run_set,
|
||||
name_lookups=name_lookups,
|
||||
sql=request.sql,
|
||||
)
|
||||
if dry_run_set
|
||||
else contextlib.nullcontext()
|
||||
|
|
@ -314,11 +317,19 @@ def _bq_guardrail_inputs(
|
|||
):
|
||||
"""Two-pass scan over user SQL for the upcoming BQ guardrail + RBAC patch.
|
||||
|
||||
Returns a tuple `(dry_run_set, blocked_bq_path)`:
|
||||
Returns a tuple `(dry_run_set, name_lookups, blocked_bq_path)`:
|
||||
|
||||
- `dry_run_set` is a list of `(bucket, source_table, est_bytes)` triples
|
||||
identifying every BigQuery row the request will scan. The caller dry-runs
|
||||
each and bills the sum against the user's daily quota.
|
||||
the rewritten user SQL once and distributes the total here for quota
|
||||
bookkeeping.
|
||||
|
||||
- `name_lookups` is a list of `(registered_name, bucket, source_table)`
|
||||
triples — only the bare-name matches from pass 1, NOT the direct
|
||||
`bq."<ds>"."<tbl>"` matches. Issue #171 fix: the cap-guard rewrites
|
||||
these name → ``\\`<project>.<bucket>.<source_table>\\``` when building
|
||||
the BQ-native SQL for dry-run, so partition pruning + column projection
|
||||
+ predicate pushdown all engage.
|
||||
|
||||
- `blocked_bq_path` is a structured-detail dict for the caller to raise
|
||||
HTTPException(403) with, when user SQL contains a direct
|
||||
|
|
@ -342,6 +353,7 @@ def _bq_guardrail_inputs(
|
|||
# what shows in `agnes catalog`), so the regex match below uses `name`,
|
||||
# but the access gate uses `id`.
|
||||
dry_run: list = []
|
||||
name_lookups: list = []
|
||||
seen_paths: set = set()
|
||||
accessible_set = set(allowed) if allowed is not None else None
|
||||
for r in repo.list_by_source("bigquery"):
|
||||
|
|
@ -363,6 +375,11 @@ def _bq_guardrail_inputs(
|
|||
if key not in seen_paths:
|
||||
seen_paths.add(key)
|
||||
dry_run.append((bucket, source_table, 0)) # bytes filled at dry-run
|
||||
# Record the (name, bucket, source_table) mapping separately so the
|
||||
# cap-guard's SQL rewriter can find every occurrence — even if the
|
||||
# user references the same physical table under two registered
|
||||
# names (rare but possible: aliased catalog rows).
|
||||
name_lookups.append((str(name), bucket, source_table))
|
||||
|
||||
# 2. Direct bq.<ds>.<tbl> pass: every match must point at a registered
|
||||
# row. Run BEFORE adding to dry_run so unregistered paths fail-fast.
|
||||
|
|
@ -372,7 +389,7 @@ def _bq_guardrail_inputs(
|
|||
source_table_raw = m.group(2).strip('"')
|
||||
row = repo.find_by_bq_path(bucket_raw, source_table_raw)
|
||||
if row is None:
|
||||
return [], {
|
||||
return [], [], {
|
||||
"reason": "bq_path_not_registered",
|
||||
"path": f'bq."{bucket_raw}"."{source_table_raw}"',
|
||||
"hint": (
|
||||
|
|
@ -387,7 +404,7 @@ def _bq_guardrail_inputs(
|
|||
# Devin Review iter #3.
|
||||
if not is_admin:
|
||||
if accessible_set is None or row["id"] not in accessible_set:
|
||||
return [], {
|
||||
return [], [], {
|
||||
"reason": "bq_path_access_denied",
|
||||
"path": f'bq."{bucket_raw}"."{source_table_raw}"',
|
||||
"registered_as": row["name"],
|
||||
|
|
@ -401,11 +418,69 @@ def _bq_guardrail_inputs(
|
|||
seen_paths.add(key)
|
||||
dry_run.append((bucket, source_table, 0))
|
||||
|
||||
return dry_run, None
|
||||
return dry_run, name_lookups, None
|
||||
|
||||
|
||||
def _rewrite_user_sql_for_bq_dry_run(
|
||||
sql: str, name_lookups: list, project: str,
|
||||
) -> str:
|
||||
"""Rewrite user SQL from DuckDB-flavor to BQ-native so a single
|
||||
`_bq_dry_run_bytes` call can estimate scan size for the EXACT query
|
||||
the user submitted (issue #171).
|
||||
|
||||
Two transformations:
|
||||
|
||||
1. Each registered remote-BQ name (word-boundary, case-insensitive)
|
||||
→ ``\\`<project>.<bucket>.<source_table>\\````. Sorted longest-first
|
||||
so a longer name (`unit_economics_summary`) is rewritten before a
|
||||
shorter one that's a prefix (`unit_economics`) and we don't end up
|
||||
with a partially-rewritten identifier.
|
||||
|
||||
2. ``bq."<ds>"."<tbl>"`` (and the unquoted variant) → ``\\`<project>.<ds>.<tbl>\\````.
|
||||
|
||||
The rewrite is regex-only (no SQL parser): a registered name appearing
|
||||
inside a string literal (e.g. an `IN (...)` value or a `LIKE` pattern)
|
||||
will also be rewritten. This is acceptable because (a) it's vanishingly
|
||||
rare to have a string literal exactly matching a registered table name,
|
||||
and (b) when it does happen the dry-run errors out and the caller falls
|
||||
back to the per-table SELECT * estimate (current behavior, no regression).
|
||||
|
||||
CTE shadowing: a `WITH unit_economics AS (...)` followed by `FROM
|
||||
unit_economics` would also rewrite the `FROM` reference. BQ then treats
|
||||
the CTE as unreferenced (legal) and the dry-run estimates the rewritten
|
||||
physical table — likely an over-estimate. Same fallback path covers this.
|
||||
"""
|
||||
out = sql
|
||||
# Pass 1: bare-name rewrite, longest names first.
|
||||
for name, bucket, source_table in sorted(
|
||||
name_lookups, key=lambda t: -len(t[0])
|
||||
):
|
||||
target = f"`{project}.{bucket}.{source_table}`"
|
||||
out = re.sub(
|
||||
r"\b" + re.escape(name) + r"\b",
|
||||
target,
|
||||
out,
|
||||
flags=re.IGNORECASE,
|
||||
)
|
||||
|
||||
# Pass 2: bq."ds"."tbl" / bq.ds.tbl → `<project>.<ds>.<tbl>`.
|
||||
def _bq_path_repl(m: re.Match) -> str:
|
||||
ds = m.group(1).strip('"')
|
||||
tbl = m.group(2).strip('"')
|
||||
return f"`{project}.{ds}.{tbl}`"
|
||||
|
||||
out = BQ_PATH.sub(_bq_path_repl, out)
|
||||
return out
|
||||
|
||||
|
||||
@contextlib.contextmanager
|
||||
def _bq_quota_and_cap_guard(*, user_id: str, dry_run_set: list, sql: str):
|
||||
def _bq_quota_and_cap_guard(
|
||||
*,
|
||||
user_id: str,
|
||||
dry_run_set: list,
|
||||
name_lookups: list,
|
||||
sql: str,
|
||||
):
|
||||
"""Pre-flight check + dry-run + cap enforcement for /api/query BQ paths.
|
||||
|
||||
Context-manager shape (Devin Review #5 on PR #168). Earlier implementation
|
||||
|
|
@ -419,16 +494,36 @@ def _bq_quota_and_cap_guard(*, user_id: str, dry_run_set: list, sql: str):
|
|||
The caller's `with` block holds the slot through both dry-run AND the
|
||||
subsequent `analytics.execute(...)` until the body exits.
|
||||
|
||||
Issue #171 fix: dry-run runs ONCE on the user's actual SQL (translated
|
||||
to BQ-native via `_rewrite_user_sql_for_bq_dry_run`). Pre-fix the
|
||||
pre-check did N dry-runs of synthetic ``SELECT * FROM <table>`` per
|
||||
referenced table — which ignored WHERE filters, column projection, and
|
||||
partition pruning, over-estimating scan size up to ~30,000× on
|
||||
partitioned/clustered tables and rejecting narrow queries that BQ
|
||||
itself would dry-run as a few MB.
|
||||
|
||||
Fallback: if BQ rejects the rewritten SQL with a parse-level
|
||||
``client_error`` (e.g. DuckDB-only syntax like ``::INT`` casts that
|
||||
don't translate to BQ), fall back to the pre-#171 per-table
|
||||
SELECT * approach so the cap-guard still functions — over-estimate
|
||||
is preferred over fail-open. Forbidden / upstream errors still
|
||||
propagate as HTTP 502.
|
||||
|
||||
Flow:
|
||||
1. `check_daily_budget` — over-cap users get 429 BEFORE any BQ work.
|
||||
2. `quota.acquire(user_id)` opened — concurrent-slot held throughout.
|
||||
3. Dry-run each `(bucket, source_table)` via `_bq_dry_run_bytes`.
|
||||
4. If sum > cap → 400 `remote_scan_too_large`.
|
||||
3. Single dry-run of rewritten user SQL → `total_bytes`.
|
||||
On parse error, fall back to per-table SELECT * → sum.
|
||||
4. If total > cap → 400 `remote_scan_too_large`.
|
||||
5. Yield. Caller runs `analytics.execute(...)` + `record_bytes(...)`.
|
||||
6. On exit, slot released.
|
||||
|
||||
Mutates `dry_run_set` in place: the third tuple element (bytes) is
|
||||
populated with the per-path dry-run result so the caller can sum and
|
||||
record the bytes against the user's quota post-flight.
|
||||
populated so the caller can sum and record bytes against the user's
|
||||
quota post-flight. Single-dry-run path puts `total_bytes` on the first
|
||||
entry and zero on the rest (BQ doesn't expose per-table bytes for a
|
||||
composite query); the caller's `sum(b for _, _, b in dry_run_set)`
|
||||
still equals `total_bytes`.
|
||||
"""
|
||||
quota = _build_quota_tracker()
|
||||
try:
|
||||
|
|
@ -464,19 +559,66 @@ def _bq_quota_and_cap_guard(*, user_id: str, dry_run_set: list, sql: str):
|
|||
# Devin Review #2 on PR #168.
|
||||
try:
|
||||
with quota.acquire(user_id):
|
||||
project = bq.projects.data
|
||||
rewritten_sql = _rewrite_user_sql_for_bq_dry_run(
|
||||
sql, name_lookups, project,
|
||||
)
|
||||
|
||||
# Try the single-dry-run path first (issue #171). Falls back
|
||||
# to the per-table SELECT * approach only on BQ parse errors
|
||||
# (kind="bq_bad_request" — DuckDB-only syntax that BQ can't
|
||||
# translate). All other BQ errors propagate as 502 below.
|
||||
total_bytes = 0
|
||||
for i, (bucket, source_table, _) in enumerate(dry_run_set):
|
||||
bq_sql = f"SELECT * FROM `{bq.projects.data}.{bucket}.{source_table}`"
|
||||
try:
|
||||
est = _bq_dry_run_bytes(bq, bq_sql)
|
||||
except BqAccessError as exc:
|
||||
used_fallback = False
|
||||
try:
|
||||
total_bytes = _bq_dry_run_bytes(bq, rewritten_sql)
|
||||
except BqAccessError as exc:
|
||||
if exc.kind == "bq_bad_request":
|
||||
logger.warning(
|
||||
"BQ dry-run rejected the rewritten SQL "
|
||||
"(kind=%s, message=%s). Falling back to per-table "
|
||||
"SELECT * estimate; the cap check will over-estimate "
|
||||
"scan bytes for this query. Consider rewriting to "
|
||||
"BQ-native syntax for a tight pre-check.",
|
||||
exc.kind, exc.message,
|
||||
)
|
||||
used_fallback = True
|
||||
else:
|
||||
raise HTTPException(status_code=502, detail={
|
||||
"kind": exc.kind,
|
||||
"message": exc.message,
|
||||
**(exc.details or {}),
|
||||
})
|
||||
dry_run_set[i] = (bucket, source_table, est)
|
||||
total_bytes += est
|
||||
|
||||
if used_fallback:
|
||||
# Pre-#171 path: estimate per registered table from a
|
||||
# synthetic SELECT *. Over-estimates partitioned scans but
|
||||
# never under-estimates, so the cap still bounds risk.
|
||||
for i, (bucket, source_table, _) in enumerate(dry_run_set):
|
||||
fallback_sql = (
|
||||
f"SELECT * FROM `{project}.{bucket}.{source_table}`"
|
||||
)
|
||||
try:
|
||||
est = _bq_dry_run_bytes(bq, fallback_sql)
|
||||
except BqAccessError as exc:
|
||||
raise HTTPException(status_code=502, detail={
|
||||
"kind": exc.kind,
|
||||
"message": exc.message,
|
||||
**(exc.details or {}),
|
||||
})
|
||||
dry_run_set[i] = (bucket, source_table, est)
|
||||
total_bytes += est
|
||||
else:
|
||||
# Single-dry-run path. Distribute the total to dry_run_set
|
||||
# so the caller's `record_bytes(sum(...))` stays correct.
|
||||
# Per-table breakdown is unavailable from a composite
|
||||
# dry-run; pin total to entry 0, zero the rest.
|
||||
if dry_run_set:
|
||||
b0, t0, _ = dry_run_set[0]
|
||||
dry_run_set[0] = (b0, t0, total_bytes)
|
||||
for i in range(1, len(dry_run_set)):
|
||||
bi, ti, _ = dry_run_set[i]
|
||||
dry_run_set[i] = (bi, ti, 0)
|
||||
|
||||
if cap_bytes > 0 and total_bytes > cap_bytes:
|
||||
tables = [f"{b}.{t}" for b, t, _ in dry_run_set]
|
||||
|
|
|
|||
|
|
@ -136,3 +136,251 @@ def test_no_bq_row_reference_skips_dry_run(seeded_app, monkeypatch):
|
|||
)
|
||||
assert state["calls"] == 0, \
|
||||
f"guardrail must skip dry-run on non-BQ queries; got {state['calls']} calls"
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Issue #171: pre-check used to dry-run synthetic SELECT * per registered
|
||||
# table → 30,000× over-estimate on partitioned/clustered tables. Fix: rewrite
|
||||
# user SQL from DuckDB-flavor (bare names + `bq.<ds>.<tbl>`) to BQ-native
|
||||
# (\\`<project>.<ds>.<tbl>\\`) and run a SINGLE dry-run on the user's actual
|
||||
# SQL, so partition pruning, column projection, and predicate pushdown all
|
||||
# count toward the cap check.
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def test_guardrail_dry_runs_rewritten_user_sql_not_synthetic_select_star(
|
||||
seeded_app, mock_dry_run, monkeypatch,
|
||||
):
|
||||
"""The dry-run must receive the USER's SQL with bare table names rewritten
|
||||
to backticked paths — not a synthetic ``SELECT * FROM <table>``.
|
||||
|
||||
This is the load-bearing assertion for issue #171: if the pre-check sees
|
||||
only the table name it can't prune partitions or project columns, and
|
||||
the estimate balloons to "full table size" instead of "what BQ would
|
||||
actually scan."
|
||||
"""
|
||||
_register_bq_remote_row("ue", "finance", "ue")
|
||||
captured = {"sql": None}
|
||||
|
||||
def capturing_fake(_bq, sql):
|
||||
captured["sql"] = sql
|
||||
return 1024 # tiny — pass the cap
|
||||
|
||||
monkeypatch.setattr(
|
||||
"app.api.query._bq_dry_run_bytes", capturing_fake, raising=False,
|
||||
)
|
||||
|
||||
c = seeded_app["client"]
|
||||
token = seeded_app["admin_token"]
|
||||
user_sql = (
|
||||
"SELECT order_id FROM ue "
|
||||
"WHERE event_date = DATE '2026-04-30' AND country = 'CZ'"
|
||||
)
|
||||
c.post("/api/query", json={"sql": user_sql}, headers=_auth(token))
|
||||
|
||||
sent = captured["sql"]
|
||||
assert sent is not None, "dry-run never invoked"
|
||||
# User-side filters must survive the rewrite — that's the whole point of
|
||||
# the fix, partition pruning + predicate pushdown only engage in the BQ
|
||||
# planner if the WHERE clause reaches it.
|
||||
assert "event_date" in sent, f"WHERE clause stripped from dry-run SQL: {sent!r}"
|
||||
assert "country" in sent, f"WHERE clause stripped from dry-run SQL: {sent!r}"
|
||||
# Bare name `ue` must have been rewritten to a backticked
|
||||
# `<project>.finance.ue` path (project comes from the test stub
|
||||
# `_FakeProjects.data = "test-data-prj"`).
|
||||
assert "`test-data-prj.finance.ue`" in sent, (
|
||||
f"bare-name rewrite failed; sent SQL: {sent!r}"
|
||||
)
|
||||
# Pre-#171 path emitted `SELECT * FROM`; the new path forwards the
|
||||
# user SELECT clause untouched.
|
||||
assert "SELECT order_id" in sent, (
|
||||
f"pre-check is still using synthetic SELECT *; sent SQL: {sent!r}"
|
||||
)
|
||||
|
||||
|
||||
def test_guardrail_invokes_dry_run_exactly_once_per_request(
|
||||
seeded_app, mock_dry_run, monkeypatch,
|
||||
):
|
||||
"""Single dry-run path: even when the user references multiple registered
|
||||
tables in one query (a JOIN, a UNION, …), only ONE dry-run fires.
|
||||
|
||||
Pre-#171 the pre-check ran N dry-runs (one synthetic SELECT * per table)
|
||||
and summed. Now BQ does the joining for us in a single dry-run — cheaper
|
||||
AND more accurate (joins/filters/projections apply across both sides).
|
||||
"""
|
||||
_register_bq_remote_row("orders", "finance", "orders")
|
||||
_register_bq_remote_row("traffic", "marketing", "traffic")
|
||||
|
||||
state = {"call_count": 0, "last_sql": None}
|
||||
|
||||
def counting_fake(_bq, sql):
|
||||
state["call_count"] += 1
|
||||
state["last_sql"] = sql
|
||||
return 100 # tiny
|
||||
|
||||
monkeypatch.setattr(
|
||||
"app.api.query._bq_dry_run_bytes", counting_fake, raising=False,
|
||||
)
|
||||
|
||||
c = seeded_app["client"]
|
||||
token = seeded_app["admin_token"]
|
||||
c.post(
|
||||
"/api/query",
|
||||
json={
|
||||
"sql": (
|
||||
"SELECT o.id, t.views FROM orders o "
|
||||
"JOIN traffic t ON o.date = t.date"
|
||||
),
|
||||
},
|
||||
headers=_auth(token),
|
||||
)
|
||||
assert state["call_count"] == 1, (
|
||||
f"single-dry-run path expected; got {state['call_count']} calls"
|
||||
)
|
||||
# Both bare names rewritten in the same SQL.
|
||||
assert "`test-data-prj.finance.orders`" in state["last_sql"]
|
||||
assert "`test-data-prj.marketing.traffic`" in state["last_sql"]
|
||||
|
||||
|
||||
def test_guardrail_falls_back_to_per_table_estimate_on_bq_parse_error(
|
||||
seeded_app, mock_dry_run, monkeypatch,
|
||||
):
|
||||
"""When BQ rejects the rewritten SQL with ``bq_bad_request`` (DuckDB-only
|
||||
syntax that doesn't translate — e.g. ``::INT`` casts, ``STRPOS``, …),
|
||||
the cap-guard falls back to the pre-#171 per-table SELECT * approach
|
||||
so a non-portable query still gets a (loose) cap estimate instead of
|
||||
fail-opening.
|
||||
"""
|
||||
from connectors.bigquery.access import BqAccessError
|
||||
|
||||
_register_bq_remote_row("ue", "finance", "ue")
|
||||
|
||||
state = {"calls": []}
|
||||
|
||||
def fake_dry_run(_bq, sql):
|
||||
state["calls"].append(sql)
|
||||
# First call (rewritten user SQL) → BQ parse error.
|
||||
if len(state["calls"]) == 1:
|
||||
raise BqAccessError("bq_bad_request", "Syntax error: unexpected '::'")
|
||||
# Second call (fallback per-table SELECT *) → small bytes, pass cap.
|
||||
return 4096
|
||||
|
||||
monkeypatch.setattr(
|
||||
"app.api.query._bq_dry_run_bytes", fake_dry_run, raising=False,
|
||||
)
|
||||
|
||||
c = seeded_app["client"]
|
||||
token = seeded_app["admin_token"]
|
||||
# SQL with DuckDB-only `::INT` cast that BQ would reject.
|
||||
r = c.post(
|
||||
"/api/query",
|
||||
json={"sql": "SELECT order_id::INT FROM ue WHERE country = 'CZ'"},
|
||||
headers=_auth(token),
|
||||
)
|
||||
|
||||
# Two dry-runs (rewritten + fallback per-table) before the (failed)
|
||||
# execute. Status will be a downstream error from analytics.execute()
|
||||
# since `::INT` doesn't work in DuckDB either against a remote view —
|
||||
# but the GUARDRAIL must have completed without 5xx-ing.
|
||||
assert len(state["calls"]) == 2, (
|
||||
f"expected 1 rewritten + 1 fallback dry-run, got {len(state['calls'])}: "
|
||||
f"{state['calls']}"
|
||||
)
|
||||
assert "::" in state["calls"][0], "first call should be the rewritten user SQL"
|
||||
assert state["calls"][1].startswith("SELECT * FROM"), (
|
||||
"second call should be the per-table fallback"
|
||||
)
|
||||
# Whatever HTTP status comes back must NOT be 502 from the guard's
|
||||
# transport-error path — fallback must absorb the bq_bad_request.
|
||||
assert r.status_code != 502, r.json()
|
||||
|
||||
|
||||
def test_guardrail_propagates_502_on_non_parse_bq_errors(
|
||||
seeded_app, mock_dry_run, monkeypatch,
|
||||
):
|
||||
"""Forbidden / upstream-error from BQ on the dry-run still maps to 502;
|
||||
fallback only kicks in for parse errors. Important so a misconfigured
|
||||
SA doesn't silently fall back to a stale-metadata estimate."""
|
||||
from connectors.bigquery.access import BqAccessError
|
||||
|
||||
_register_bq_remote_row("ue", "finance", "ue")
|
||||
|
||||
def always_forbidden(_bq, _sql):
|
||||
raise BqAccessError("bq_forbidden", "Permission denied", details={})
|
||||
|
||||
monkeypatch.setattr(
|
||||
"app.api.query._bq_dry_run_bytes", always_forbidden, raising=False,
|
||||
)
|
||||
|
||||
c = seeded_app["client"]
|
||||
token = seeded_app["admin_token"]
|
||||
r = c.post(
|
||||
"/api/query",
|
||||
json={"sql": "SELECT count(*) FROM ue"},
|
||||
headers=_auth(token),
|
||||
)
|
||||
assert r.status_code == 502, r.json()
|
||||
detail = r.json().get("detail", {})
|
||||
if isinstance(detail, dict):
|
||||
assert detail.get("kind") == "bq_forbidden"
|
||||
|
||||
|
||||
def test_rewrite_helper_handles_bare_name_and_bq_path_in_same_sql():
|
||||
"""Direct unit-test of the rewriter so the exact regex behavior is
|
||||
pinned: both bare names AND ``bq.<ds>.<tbl>`` references in the same
|
||||
SQL are translated, and longer names win over shorter prefixes.
|
||||
"""
|
||||
from app.api.query import _rewrite_user_sql_for_bq_dry_run
|
||||
|
||||
rewritten = _rewrite_user_sql_for_bq_dry_run(
|
||||
sql=(
|
||||
'SELECT a.id, b.col '
|
||||
'FROM ue a JOIN bq."finance"."traffic" b ON a.date = b.date'
|
||||
),
|
||||
name_lookups=[("ue", "finance", "ue")],
|
||||
project="data-prj",
|
||||
)
|
||||
assert "`data-prj.finance.ue`" in rewritten
|
||||
assert "`data-prj.finance.traffic`" in rewritten
|
||||
# Original duckdb-flavor `bq."ds"."t"` form should have been replaced —
|
||||
# if it's still in the output, the BQ.path pass missed it.
|
||||
assert 'bq."finance"."traffic"' not in rewritten
|
||||
|
||||
|
||||
def test_rewrite_helper_longer_name_wins_over_prefix():
|
||||
"""When two registered names share a prefix (`unit_economics`,
|
||||
`unit_economics_summary`), the longer one must rewrite first so the
|
||||
shorter one's regex doesn't eat the prefix and leave junk like
|
||||
``\\`...ue\\`_summary`` behind.
|
||||
"""
|
||||
from app.api.query import _rewrite_user_sql_for_bq_dry_run
|
||||
|
||||
rewritten = _rewrite_user_sql_for_bq_dry_run(
|
||||
sql="SELECT * FROM unit_economics_summary",
|
||||
name_lookups=[
|
||||
("unit_economics", "fin", "ue"),
|
||||
("unit_economics_summary", "fin", "ue_summary"),
|
||||
],
|
||||
project="p",
|
||||
)
|
||||
assert "`p.fin.ue_summary`" in rewritten
|
||||
# If the shorter name had eaten the prefix we'd see `p.fin.ue`_summary
|
||||
# (broken token). Assert that doesn't happen.
|
||||
assert "`p.fin.ue`" not in rewritten
|
||||
|
||||
|
||||
def test_rewrite_helper_is_case_insensitive_on_bare_names():
|
||||
"""Bare-name match in `_bq_guardrail_inputs` is case-insensitive (it
|
||||
runs against `sql_lower`). The rewriter must match the same set of
|
||||
occurrences on the original-case SQL or we'd silently leave some
|
||||
references untranslated and dry-run on a half-rewritten SQL.
|
||||
"""
|
||||
from app.api.query import _rewrite_user_sql_for_bq_dry_run
|
||||
|
||||
rewritten = _rewrite_user_sql_for_bq_dry_run(
|
||||
sql="SELECT * FROM UE WHERE Ue.id IS NOT NULL",
|
||||
name_lookups=[("ue", "fin", "ue")],
|
||||
project="p",
|
||||
)
|
||||
assert "`p.fin.ue` WHERE `p.fin.ue`.id" in rewritten or \
|
||||
rewritten.lower().count("`p.fin.ue`") == 2
|
||||
|
|
|
|||
Loading…
Reference in a new issue