fix(query-guardrail): single-pass alternation regex (Devin Review on query.py:464)

The iterative bare-name rewriter (one re.sub per name, longest-first)
was vulnerable to cross-contamination when the GCP project ID contained
a registered table name as a hyphen-delimited word.

Concrete repro:
  project        = 'my-ue-project'
  registered     = ['orders', 'ue']
  user SQL       = 'SELECT * FROM orders JOIN ue ON ...'
  iter 1 (orders): produces 'FROM `my-ue-project.fin.orders` JOIN ue ...'
  iter 2 (ue):     '\bue\b' matches 'ue' INSIDE 'my-ue-project' (hyphen
                   creates word boundary on both sides) — corrupts
                   the iter-1 path

Fallback at query.py:576 caught the resulting BQ parse error and fell
back to per-table SELECT * estimate, so impact was over-estimation,
not fail-open — but the #171 partition-pruning fix silently degraded
to pre-fix behavior whenever a project name shared a hyphen-segment
with a registered table.

Fix: single re.sub call with an alternation regex sorted longest-first.
Single-pass means each source position is processed exactly once, so
freshly-inserted backticked text from one match isn't re-scanned by
later names in the alternation.

Regression test
test_rewrite_helper_does_not_corrupt_when_project_id_contains_registered_name
covers the exact Devin repro.
This commit is contained in:
ZdenekSrotyr 2026-05-04 22:51:33 +02:00
parent c432e90f62
commit 5915f92eaa
3 changed files with 89 additions and 15 deletions

View file

@ -39,6 +39,7 @@ End-to-end clean-analyst-bootstrap rewrite. The web `/setup` page now produces a
- `agnes snapshot create` (formerly `da fetch`) no longer materializes an empty `user/duckdb/analytics.duckdb` when run before any `agnes pull`. Friendly hint redirects to `agnes pull`.
- Workspace `agnes status` reads from the canonical `server/parquet/` and `user/duckdb/analytics.duckdb` paths (was reading legacy `data/parquet/`, `data/metadata/last_sync.json`).
- `agnes init` and `agnes pull` errors now use the `cli/error_render.py` typed-error renderer (added in 0.32.0), so analyst-facing error UX matches the structured shape `agnes query --remote` already produces.
- **`agnes query --remote` SQL rewriter no longer corrupts output when the GCP project ID contains a registered table name as a hyphen-delimited word** (Devin Review on `query.py:464`). The previous iterative rewrite (one `re.sub(\b<name>\b, ...)` per registered name) was vulnerable to cross-contamination: e.g. project `my-ue-project` + registered `orders` + registered `ue` → iter 1 rewrites `orders` to `\`my-ue-project.fin.orders\``, iter 2's `\bue\b` then matches the `ue` INSIDE `my-ue-project` and corrupts the iter-1 path. Fix: replaced the iteration with a SINGLE `re.sub` whose alternation regex (sorted longest-first) handles every name in one pass, so freshly-inserted backticked text isn't re-scanned. The fallback at `query.py:576` (per-table SELECT * on BQ parse error) caught the corrupted output as `bq_bad_request` so impact was over-estimation rather than fail-open, but the partition-pruning benefit of #171 is now preserved for projects whose IDs share a hyphen-segment with a registered table name. Regression test in `tests/test_api_query_guardrail.py::test_rewrite_helper_does_not_corrupt_when_project_id_contains_registered_name`.
- **BigQuery materialize TTL reclaim is no longer dead code** (Devin Review on `extractor.py:166`). `_try_acquire_file_lock` used to call `open(lock_path, mode="w")` BEFORE checking the lock-file mtime, which truncated the file and refreshed mtime to *now* on every invocation. The subsequent `time.time() - lock_path.stat().st_mtime` always saw age ~0, so `age > TTL` never fired, and `materialize.lock_ttl_seconds` was a silently no-op config knob. Fix: stat the lock path BEFORE any `open()` to read the real pre-probe mtime; if older than TTL, unlink (forcing a fresh inode for the next `open + flock`); only then probe. Two regression tests added: `test_stale_held_lock_is_reclaimed_despite_live_holder` exercises the full reclaim path with a still-living fcntl holder, `test_failed_probe_does_not_self_refresh_lock_mtime` pins that a failed acquisition doesn't pathologically loop. Residual cross-process risk (a genuinely overrunning materialize past TTL races a fresh attempt) is documented in the helper docstring; in-process `threading.Lock` keyed on `table_id` blocks the single-process race.
- **`agnes init --token X` now correctly uses the explicit token in the verify call**, even when `~/.config/agnes/token.json` already holds a stale token from a prior install. Pre-fix `cli.config.get_token()` read the on-disk file first and only fell back to env vars, so step 2 (PAT-verify) ran with the stale token and failed with a confusing 401 — even though the `--token` arg was valid (Devin Review on `init.py:99`). Fix: a `ContextVar`-based override in `cli.config` short-circuits `get_token()` before the file read; `_override_server_env` (used by both `agnes init` and `agnes pull`'s `run_pull`) sets it for the duration of the call. Async-safe (each task sees its own override) and leak-proof (resets on context exit).
- **`agnes status` sessions counter now reads the same source as `agnes push`** — `~/.claude/projects/<encoded-cwd>/` (Claude Code's actual write path) with the legacy `<workspace>/user/sessions/` as a fallback, via `cli.lib.claude_sessions.list_session_files()`. Pre-fix the counter only checked the legacy dir and always reported 0 in workspaces bootstrapped with `agnes init` (since Claude Code never writes there).

View file

@ -431,12 +431,22 @@ def _rewrite_user_sql_for_bq_dry_run(
Two transformations:
1. Each registered remote-BQ name (word-boundary, case-insensitive)
``\\`<project>.<bucket>.<source_table>\\````. Sorted longest-first
so a longer name (`unit_economics_summary`) is rewritten before a
shorter one that's a prefix (`unit_economics`) and we don't end up
with a partially-rewritten identifier.
``\\`<project>.<bucket>.<source_table>\\````. A SINGLE re.sub call
with an alternation regex sorted longest-first replaces every
occurrence in one pass important to avoid cross-contamination
(Devin Review on query.py:464). The previous iterative approach
(one re.sub per name, longest-first) corrupted output when the
project ID contained a registered table name as a hyphen-delimited
word: Pass 1 iter N's `\\bname\\b` regex would match INSIDE the
backticked replacement text from a prior iter. Concrete repro:
project = `my-ue-project`, registered names `orders` + `ue`, SQL
`FROM orders JOIN ue` after iter 1 (orders): the backticked path
contains `my-ue-project`, then iter 2 (ue) matches the `ue` inside
it. Single-pass alternation processes each source position exactly
once, so the freshly-inserted backticked text isn't re-scanned.
2. ``bq."<ds>"."<tbl>"`` (and the unquoted variant) ``\\`<project>.<ds>.<tbl>\\````.
Distinct pattern from Pass 1, no overlap, separate re.sub.
The rewrite is regex-only (no SQL parser): a registered name appearing
inside a string literal (e.g. an `IN (...)` value or a `LIKE` pattern)
@ -451,17 +461,32 @@ def _rewrite_user_sql_for_bq_dry_run(
physical table likely an over-estimate. Same fallback path covers this.
"""
out = sql
# Pass 1: bare-name rewrite, longest names first.
for name, bucket, source_table in sorted(
name_lookups, key=lambda t: -len(t[0])
):
target = f"`{project}.{bucket}.{source_table}`"
out = re.sub(
r"\b" + re.escape(name) + r"\b",
target,
out,
flags=re.IGNORECASE,
)
# Pass 1: bare-name rewrite. Build a single alternation regex sorted
# longest-first, with a function-replacement that looks the matched
# name up in a case-insensitive dict. Single-pass means freshly
# inserted backticked text isn't re-scanned, fixing the
# project-ID-contains-name corruption (Devin Review on query.py:464).
if name_lookups:
# Map name (lower-cased) → backticked target. Names are
# case-insensitive on the input side per the existing helper
# contract (see test_rewrite_helper_is_case_insensitive_on_bare_names).
name_to_target: dict[str, str] = {}
for name, bucket, source_table in name_lookups:
name_to_target[name.lower()] = f"`{project}.{bucket}.{source_table}`"
# Alternation pattern, longest-first. Longer match wins at any
# given position because Python's re tries alternatives
# left-to-right and stops at the first match — pinning longest
# entries to the front preserves the prefix-collision invariant
# exercised by test_rewrite_helper_longer_name_wins_over_prefix.
sorted_names = sorted(name_to_target.keys(), key=len, reverse=True)
pattern = r"\b(" + "|".join(re.escape(n) for n in sorted_names) + r")\b"
def _name_repl(m: re.Match) -> str:
return name_to_target[m.group(1).lower()]
out = re.sub(pattern, _name_repl, out, flags=re.IGNORECASE)
# Pass 2: bq."ds"."tbl" / bq.ds.tbl → `<project>.<ds>.<tbl>`.
def _bq_path_repl(m: re.Match) -> str:

View file

@ -369,6 +369,54 @@ def test_rewrite_helper_longer_name_wins_over_prefix():
assert "`p.fin.ue`" not in rewritten
def test_rewrite_helper_does_not_corrupt_when_project_id_contains_registered_name():
"""Regression for Devin Review on query.py:464.
Pre-fix the rewriter ran one `re.sub(\\bname\\b, ...)` per registered
table, longest-first. When the GCP project ID contained a registered
table name as a hyphen-delimited word (e.g. project=`my-ue-project`,
registered name=`ue`), iter N's `\\b` regex would match INSIDE the
backticked replacement text from a PRIOR iter, corrupting the output.
Concrete trace:
- SQL: ``FROM orders JOIN ue ON ...``
- Iter 1 (orders): produces ``FROM `my-ue-project.fin.orders` JOIN ue ON``
- Iter 2 (ue): `\\bue\\b` matches `ue` inside `my-ue-project` (hyphen =
word boundary on both sides) corrupts the iter-1 path.
Post-fix: single `re.sub` with an alternation regex processes each
source position exactly once. Freshly-inserted backticked text is
NOT re-scanned by subsequent name patterns.
"""
from app.api.query import _rewrite_user_sql_for_bq_dry_run
rewritten = _rewrite_user_sql_for_bq_dry_run(
sql="SELECT * FROM orders JOIN ue ON orders.id = ue.id",
name_lookups=[
("orders", "fin", "orders"),
("ue", "analytics", "ue_metrics"),
],
project="my-ue-project",
)
# Both names rewritten exactly once. Critically, the orders path is
# NOT corrupted by a stray rewrite of `ue` inside `my-ue-project`.
assert "`my-ue-project.fin.orders`" in rewritten
assert "`my-ue-project.analytics.ue_metrics`" in rewritten
# The corruption signature: the orders path would contain a nested
# backtick-fenced ue path. Pinning this absence is the load-bearing
# assertion — it fails on the pre-fix iterative rewriter.
assert "`my-`my-ue-project.analytics.ue_metrics`-project" not in rewritten
# Bare `ue` outside backticks (the JOIN clause) should be rewritten.
# The 2nd `ue.id` was already rewritten by the same single-pass call.
# No `\\bue\\b` survives outside backticks.
import re as _re
bare_ue_matches = _re.findall(r"(?<!\\.)\\bue\\b(?![.`])", rewritten)
assert not bare_ue_matches, f"unrewritten bare `ue` left: {bare_ue_matches!r}"
def test_rewrite_helper_is_case_insensitive_on_bare_names():
"""Bare-name match in `_bq_guardrail_inputs` is case-insensitive (it
runs against `sql_lower`). The rewriter must match the same set of