diff --git a/CHANGELOG.md b/CHANGELOG.md index dae2c4f..92c0d6a 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -39,6 +39,7 @@ End-to-end clean-analyst-bootstrap rewrite. The web `/setup` page now produces a - `agnes snapshot create` (formerly `da fetch`) no longer materializes an empty `user/duckdb/analytics.duckdb` when run before any `agnes pull`. Friendly hint redirects to `agnes pull`. - Workspace `agnes status` reads from the canonical `server/parquet/` and `user/duckdb/analytics.duckdb` paths (was reading legacy `data/parquet/`, `data/metadata/last_sync.json`). - `agnes init` and `agnes pull` errors now use the `cli/error_render.py` typed-error renderer (added in 0.32.0), so analyst-facing error UX matches the structured shape `agnes query --remote` already produces. +- **`agnes query --remote` SQL rewriter no longer corrupts output when the GCP project ID contains a registered table name as a hyphen-delimited word** (Devin Review on `query.py:464`). The previous iterative rewrite (one `re.sub(\b\b, ...)` per registered name) was vulnerable to cross-contamination: e.g. project `my-ue-project` + registered `orders` + registered `ue` → iter 1 rewrites `orders` to `\`my-ue-project.fin.orders\``, iter 2's `\bue\b` then matches the `ue` INSIDE `my-ue-project` and corrupts the iter-1 path. Fix: replaced the iteration with a SINGLE `re.sub` whose alternation regex (sorted longest-first) handles every name in one pass, so freshly-inserted backticked text isn't re-scanned. The fallback at `query.py:576` (per-table SELECT * on BQ parse error) caught the corrupted output as `bq_bad_request` so impact was over-estimation rather than fail-open, but the partition-pruning benefit of #171 is now preserved for projects whose IDs share a hyphen-segment with a registered table name. Regression test in `tests/test_api_query_guardrail.py::test_rewrite_helper_does_not_corrupt_when_project_id_contains_registered_name`. - **BigQuery materialize TTL reclaim is no longer dead code** (Devin Review on `extractor.py:166`). `_try_acquire_file_lock` used to call `open(lock_path, mode="w")` BEFORE checking the lock-file mtime, which truncated the file and refreshed mtime to *now* on every invocation. The subsequent `time.time() - lock_path.stat().st_mtime` always saw age ~0, so `age > TTL` never fired, and `materialize.lock_ttl_seconds` was a silently no-op config knob. Fix: stat the lock path BEFORE any `open()` to read the real pre-probe mtime; if older than TTL, unlink (forcing a fresh inode for the next `open + flock`); only then probe. Two regression tests added: `test_stale_held_lock_is_reclaimed_despite_live_holder` exercises the full reclaim path with a still-living fcntl holder, `test_failed_probe_does_not_self_refresh_lock_mtime` pins that a failed acquisition doesn't pathologically loop. Residual cross-process risk (a genuinely overrunning materialize past TTL races a fresh attempt) is documented in the helper docstring; in-process `threading.Lock` keyed on `table_id` blocks the single-process race. - **`agnes init --token X` now correctly uses the explicit token in the verify call**, even when `~/.config/agnes/token.json` already holds a stale token from a prior install. Pre-fix `cli.config.get_token()` read the on-disk file first and only fell back to env vars, so step 2 (PAT-verify) ran with the stale token and failed with a confusing 401 — even though the `--token` arg was valid (Devin Review on `init.py:99`). Fix: a `ContextVar`-based override in `cli.config` short-circuits `get_token()` before the file read; `_override_server_env` (used by both `agnes init` and `agnes pull`'s `run_pull`) sets it for the duration of the call. Async-safe (each task sees its own override) and leak-proof (resets on context exit). - **`agnes status` sessions counter now reads the same source as `agnes push`** — `~/.claude/projects//` (Claude Code's actual write path) with the legacy `/user/sessions/` as a fallback, via `cli.lib.claude_sessions.list_session_files()`. Pre-fix the counter only checked the legacy dir and always reported 0 in workspaces bootstrapped with `agnes init` (since Claude Code never writes there). diff --git a/app/api/query.py b/app/api/query.py index 891c066..df6cc50 100644 --- a/app/api/query.py +++ b/app/api/query.py @@ -431,12 +431,22 @@ def _rewrite_user_sql_for_bq_dry_run( Two transformations: 1. Each registered remote-BQ name (word-boundary, case-insensitive) - → ``\\`..\\````. Sorted longest-first - so a longer name (`unit_economics_summary`) is rewritten before a - shorter one that's a prefix (`unit_economics`) and we don't end up - with a partially-rewritten identifier. + → ``\\`..\\````. A SINGLE re.sub call + with an alternation regex sorted longest-first replaces every + occurrence in one pass — important to avoid cross-contamination + (Devin Review on query.py:464). The previous iterative approach + (one re.sub per name, longest-first) corrupted output when the + project ID contained a registered table name as a hyphen-delimited + word: Pass 1 iter N's `\\bname\\b` regex would match INSIDE the + backticked replacement text from a prior iter. Concrete repro: + project = `my-ue-project`, registered names `orders` + `ue`, SQL + `FROM orders JOIN ue` → after iter 1 (orders): the backticked path + contains `my-ue-project`, then iter 2 (ue) matches the `ue` inside + it. Single-pass alternation processes each source position exactly + once, so the freshly-inserted backticked text isn't re-scanned. 2. ``bq."".""`` (and the unquoted variant) → ``\\`..\\````. + Distinct pattern from Pass 1, no overlap, separate re.sub. The rewrite is regex-only (no SQL parser): a registered name appearing inside a string literal (e.g. an `IN (...)` value or a `LIKE` pattern) @@ -451,17 +461,32 @@ def _rewrite_user_sql_for_bq_dry_run( physical table — likely an over-estimate. Same fallback path covers this. """ out = sql - # Pass 1: bare-name rewrite, longest names first. - for name, bucket, source_table in sorted( - name_lookups, key=lambda t: -len(t[0]) - ): - target = f"`{project}.{bucket}.{source_table}`" - out = re.sub( - r"\b" + re.escape(name) + r"\b", - target, - out, - flags=re.IGNORECASE, - ) + + # Pass 1: bare-name rewrite. Build a single alternation regex sorted + # longest-first, with a function-replacement that looks the matched + # name up in a case-insensitive dict. Single-pass means freshly + # inserted backticked text isn't re-scanned, fixing the + # project-ID-contains-name corruption (Devin Review on query.py:464). + if name_lookups: + # Map name (lower-cased) → backticked target. Names are + # case-insensitive on the input side per the existing helper + # contract (see test_rewrite_helper_is_case_insensitive_on_bare_names). + name_to_target: dict[str, str] = {} + for name, bucket, source_table in name_lookups: + name_to_target[name.lower()] = f"`{project}.{bucket}.{source_table}`" + + # Alternation pattern, longest-first. Longer match wins at any + # given position because Python's re tries alternatives + # left-to-right and stops at the first match — pinning longest + # entries to the front preserves the prefix-collision invariant + # exercised by test_rewrite_helper_longer_name_wins_over_prefix. + sorted_names = sorted(name_to_target.keys(), key=len, reverse=True) + pattern = r"\b(" + "|".join(re.escape(n) for n in sorted_names) + r")\b" + + def _name_repl(m: re.Match) -> str: + return name_to_target[m.group(1).lower()] + + out = re.sub(pattern, _name_repl, out, flags=re.IGNORECASE) # Pass 2: bq."ds"."tbl" / bq.ds.tbl → `..`. def _bq_path_repl(m: re.Match) -> str: diff --git a/tests/test_api_query_guardrail.py b/tests/test_api_query_guardrail.py index bc6acc0..f0df7b4 100644 --- a/tests/test_api_query_guardrail.py +++ b/tests/test_api_query_guardrail.py @@ -369,6 +369,54 @@ def test_rewrite_helper_longer_name_wins_over_prefix(): assert "`p.fin.ue`" not in rewritten +def test_rewrite_helper_does_not_corrupt_when_project_id_contains_registered_name(): + """Regression for Devin Review on query.py:464. + + Pre-fix the rewriter ran one `re.sub(\\bname\\b, ...)` per registered + table, longest-first. When the GCP project ID contained a registered + table name as a hyphen-delimited word (e.g. project=`my-ue-project`, + registered name=`ue`), iter N's `\\b` regex would match INSIDE the + backticked replacement text from a PRIOR iter, corrupting the output. + + Concrete trace: + - SQL: ``FROM orders JOIN ue ON ...`` + - Iter 1 (orders): produces ``FROM `my-ue-project.fin.orders` JOIN ue ON`` + - Iter 2 (ue): `\\bue\\b` matches `ue` inside `my-ue-project` (hyphen = + word boundary on both sides) → corrupts the iter-1 path. + + Post-fix: single `re.sub` with an alternation regex processes each + source position exactly once. Freshly-inserted backticked text is + NOT re-scanned by subsequent name patterns. + """ + from app.api.query import _rewrite_user_sql_for_bq_dry_run + + rewritten = _rewrite_user_sql_for_bq_dry_run( + sql="SELECT * FROM orders JOIN ue ON orders.id = ue.id", + name_lookups=[ + ("orders", "fin", "orders"), + ("ue", "analytics", "ue_metrics"), + ], + project="my-ue-project", + ) + + # Both names rewritten exactly once. Critically, the orders path is + # NOT corrupted by a stray rewrite of `ue` inside `my-ue-project`. + assert "`my-ue-project.fin.orders`" in rewritten + assert "`my-ue-project.analytics.ue_metrics`" in rewritten + + # The corruption signature: the orders path would contain a nested + # backtick-fenced ue path. Pinning this absence is the load-bearing + # assertion — it fails on the pre-fix iterative rewriter. + assert "`my-`my-ue-project.analytics.ue_metrics`-project" not in rewritten + + # Bare `ue` outside backticks (the JOIN clause) should be rewritten. + # The 2nd `ue.id` was already rewritten by the same single-pass call. + # No `\\bue\\b` survives outside backticks. + import re as _re + bare_ue_matches = _re.findall(r"(?