diff --git a/CHANGELOG.md b/CHANGELOG.md index 91c2c91..0b942c1 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -38,6 +38,9 @@ End-to-end clean-analyst-bootstrap rewrite. The web `/setup?role=analyst` page n - `agnes snapshot create` (formerly `da fetch`) no longer materializes an empty `user/duckdb/analytics.duckdb` when run before any `agnes pull`. Friendly hint redirects to `agnes pull`. - Workspace `agnes status` reads from the canonical `server/parquet/` and `user/duckdb/analytics.duckdb` paths (was reading legacy `data/parquet/`, `data/metadata/last_sync.json`). - `agnes init` and `agnes pull` errors now use the `cli/error_render.py` typed-error renderer (added in 0.32.0), so analyst-facing error UX matches the structured shape `agnes query --remote` already produces. +- **`agnes init --token X` now correctly uses the explicit token in the verify call**, even when `~/.config/agnes/token.json` already holds a stale token from a prior install. Pre-fix `cli.config.get_token()` read the on-disk file first and only fell back to env vars, so step 2 (PAT-verify) ran with the stale token and failed with a confusing 401 — even though the `--token` arg was valid (Devin Review on `init.py:99`). Fix: a `ContextVar`-based override in `cli.config` short-circuits `get_token()` before the file read; `_override_server_env` (used by both `agnes init` and `agnes pull`'s `run_pull`) sets it for the duration of the call. Async-safe (each task sees its own override) and leak-proof (resets on context exit). +- **`agnes status` sessions counter now reads the same source as `agnes push`** — `~/.claude/projects//` (Claude Code's actual write path) with the legacy `/user/sessions/` as a fallback, via `cli.lib.claude_sessions.list_session_files()`. Pre-fix the counter only checked the legacy dir and always reported 0 in workspaces bootstrapped with `agnes init` (since Claude Code never writes there). +- **BigQuery materialize lock-reclaim docstring** at `connectors/bigquery/extractor.py:_try_acquire_file_lock` corrected: a still-running holder's `fcntl.flock` does NOT block the post-unlink reacquisition (new file = new inode = independent lock). The in-process `threading.Lock` keyed on `table_id` is the actual concurrency guard; cross-process protection (two schedulers on one workspace) relies on operators not running multiple concurrent schedulers AND on the TTL being well above the longest plausible COPY (24 h default). Documenting the residual risk so it isn't masked by a misleading "we're safe" comment (Devin Review on extractor.py:111). - **`agnes pull` now re-downloads parquets when the local file is missing, even if the recorded hash matches the server.** Pre-fix the download set was computed from `sync_state.json` hash equality alone — if the parquet had been deleted (manual `rm`, disk cleanup, a different workspace sharing the same global `~/.config/agnes/sync_state.json` writing one workspace's parquets while another reads sync_state and assumes "I already have these"), the hash-equal check would short-circuit the download and the next DuckDB view rebuild would fail on a missing file. Now the existence check on `/server/parquet/.parquet` runs alongside the hash compare; missing file → forced re-download regardless of hash. - **`agnes query --remote` no longer over-rejects narrow queries on partitioned/clustered BigQuery tables.** Closes #171. Pre-fix the `/api/query` cost guardrail dry-ran a synthetic `SELECT * FROM ` per registered remote-BQ row referenced by the user SQL, which forced BQ to estimate "full table scan" — column projection, predicate pushdown, and partition pruning were all ignored, producing scan-byte estimates up to ~30,000× larger than the actual query would scan. Narrow queries on big partitioned tables (the documented happy-path use case) were rejected with 400 `remote_scan_too_large` even when BQ's own dry-run reported single-digit MB. Now the guardrail rewrites the user SQL from DuckDB-flavor (bare registered names + `bq."".""`) to BQ-native (`` `..` ``) and runs ONE dry-run on the EXACT user SQL — partition pruning, column projection, and predicate pushdown all engage. Cap check uses the real estimate. Fallback: if BQ rejects the rewritten SQL with `bq_bad_request` (DuckDB-only syntax that doesn't translate, e.g. `::INT` casts), the guardrail falls back to the pre-fix per-table SELECT * estimate so a non-portable query still gets bounded; non-parse errors (forbidden / upstream) propagate as 502. Helpers exported as `_rewrite_user_sql_for_bq_dry_run` (test seam). - **Windows: `agnes` CLI no longer crashes on cs-CZ / non-UTF-8 consoles.** Two failure modes addressed (originally reported in #172 against the pre-rename `da` CLI; ported and broadened here): (1) `agnes pull` and any other Rich-progress-bar codepath crashed with `UnicodeEncodeError` because cp1250 / cp1252 cannot encode Rich's Braille spinner glyphs — `cli/main.py` now reconfigures `sys.stdout` / `sys.stderr` to UTF-8 with `errors="replace"` at import time when `sys.platform == "win32"`. (2) `agnes skills list` and `agnes skills show` crashed with `UnicodeDecodeError` reading skill markdown that contains em-dashes / accents — every `Path.read_text()` / `Path.write_text()` / `open()` call site in `cli/` (including ones not touched by #172, since several files were renamed in the bootstrap rewrite) now passes `encoding="utf-8"` explicitly. Defensive: also covers JSON / YAML config files that were ASCII-only in practice but were one non-ASCII value away from the same failure mode. diff --git a/cli/commands/status.py b/cli/commands/status.py index 9647c05..3fff5a7 100644 --- a/cli/commands/status.py +++ b/cli/commands/status.py @@ -39,8 +39,11 @@ def status( if db_path.exists(): last_synced = datetime.fromtimestamp(db_path.stat().st_mtime, tz=timezone.utc).isoformat() - sessions_dir = workspace / "user" / "sessions" - session_count = len(list(sessions_dir.glob("*.jsonl"))) if sessions_dir.exists() else 0 + # Sessions live in ~/.claude/projects// (where Claude Code + # writes them), with `/user/sessions/` as a legacy fallback. + # The helper unions both — same source of truth as `agnes push`. + from cli.lib.claude_sessions import list_session_files + session_count = len(list_session_files(workspace)) info = { "workspace": str(workspace), diff --git a/cli/config.py b/cli/config.py index e20c425..4e90798 100644 --- a/cli/config.py +++ b/cli/config.py @@ -2,8 +2,45 @@ import json import os +from contextlib import contextmanager +from contextvars import ContextVar from pathlib import Path -from typing import Optional +from typing import Iterator, Optional + + +# In-process override for `get_token()`. Used by `agnes init --token X` and +# `agnes auth import-token` to force a specific token for the duration of a +# scoped block, EVEN WHEN `~/.config/agnes/token.json` already holds a +# different (possibly stale) token. Without this override, `get_token()` +# reads the on-disk token first and the explicit `--token` argument is +# silently ignored — the bug Devin Review caught at cli/commands/init.py:99. +# +# A ContextVar is used (not a plain global) so concurrent callers — async +# tasks, threads — each see their own override, and a leaked override in +# one task can't corrupt another. `_token_override.set(...)` returns a +# token used to reset; the `_with_token_override` context manager scopes it. +_token_override: ContextVar[Optional[str]] = ContextVar( + "agnes_cli_token_override", default=None, +) + + +@contextmanager +def _with_token_override(token: Optional[str]) -> Iterator[None]: + """Set `_token_override` for the duration of the block. + + `get_token()` checks the override BEFORE reading `token.json`, so any + in-block call returns the supplied token regardless of on-disk state. + Restores the prior override (if any) on exit so nested overrides nest + correctly. + """ + if not token: + yield + return + reset_token = _token_override.set(token) + try: + yield + finally: + _token_override.reset(reset_token) def _config_dir() -> Path: @@ -18,6 +55,12 @@ def get_server_url() -> str: def get_token() -> Optional[str]: + # In-process override wins over BOTH the on-disk file and the env var. + # Set by `_with_token_override(...)`; used by `agnes init --token X` + # to force the explicit arg through the verify call even when a stale + # `~/.config/agnes/token.json` exists. + if (override := _token_override.get()) is not None: + return override token_file = _config_dir() / "token.json" if token_file.exists(): data = json.loads(token_file.read_text(encoding="utf-8")) diff --git a/cli/lib/pull.py b/cli/lib/pull.py index 79e8fe9..c2eadc5 100644 --- a/cli/lib/pull.py +++ b/cli/lib/pull.py @@ -67,28 +67,34 @@ _SAFE_ID_RE = re.compile(r"^[a-zA-Z0-9_\-]{1,128}$") @contextmanager def _override_server_env(server_url: str, token: str) -> Iterator[None]: - """Set AGNES_SERVER / AGNES_TOKEN for the duration of the call. + """Set AGNES_SERVER + scoped token override for the duration of the call. - `cli.config.get_server_url` / `get_token` already honor these env vars, - which is the same mechanism used in production. Restores prior values - on exit so the caller's environment isn't mutated permanently. + `cli.config.get_server_url` honors `AGNES_SERVER`, so the server URL is + swapped via env-var. The TOKEN override is routed through + `cli.config._with_token_override` (a ContextVar), which is checked by + `get_token()` BEFORE the on-disk `~/.config/agnes/token.json`. This is + load-bearing: `agnes init --token NEW` runs the verify call in step 2 + while the file still holds an OLD token from a prior install — without + the override, the verify uses the stale on-disk token and fails 401. - Caveats: - - **Token override is honored only when no `~/.config/agnes/token.json` - exists.** `get_token()` reads the file first and only falls back to - `AGNES_TOKEN`. `agnes init` writes `token.json` before calling - `run_pull` so the values agree in production; isolated tests/callers - that pass a different token must clear the on-disk token first. - - **Not safe for concurrent invocation in the same process** — env-var - swap is global. Single-threaded use only. + `AGNES_TOKEN` env var is also set as a back-compat hint for any code + path that bypasses `get_token()` (none in `cli/` at last audit, but + third-party hooks may), but the contextvar is the authoritative source. + + Restores prior values on exit so the caller's environment isn't + mutated permanently. Not safe for concurrent invocation across threads; + single-threaded use only. """ + from cli.config import _with_token_override + prev_server = os.environ.get("AGNES_SERVER") prev_token = os.environ.get("AGNES_TOKEN") os.environ["AGNES_SERVER"] = server_url if token: os.environ["AGNES_TOKEN"] = token try: - yield + with _with_token_override(token): + yield finally: if prev_server is None: os.environ.pop("AGNES_SERVER", None) diff --git a/connectors/bigquery/extractor.py b/connectors/bigquery/extractor.py index 5c8cc2b..86dae0a 100644 --- a/connectors/bigquery/extractor.py +++ b/connectors/bigquery/extractor.py @@ -106,10 +106,19 @@ def _try_acquire_file_lock(lock_path: Path): Stale-lock reclaim: if the lock_path exists and its mtime is older than the configured TTL, log a warning and unlink before retrying. - A live holder still wins the second flock attempt (kernel-level - flock isn't tied to mtime), so the reclaim doesn't break correctness - — it just unblocks the case where a holder process was hard-killed - before the kernel released the lock.""" + + Caveat: ``lock_path.unlink()`` + the subsequent ``open()`` creates a + NEW inode — fcntl.flock keys on inode, so a still-running holder's + lock on the (now-orphan) old inode does NOT block the new acquisition. + A genuine overrunning materialize past TTL therefore CAN race a + fresh attempt and both can write to ``.parquet.tmp``. The + in-process ``threading.Lock`` keyed on ``table_id`` blocks that race + within one scheduler process; cross-process protection (two schedulers + on the same workspace) relies on operators not running multiple + concurrent schedulers AND on the TTL being well above the longest + plausible COPY (24 h default). If real corruption surfaces in + production, the next iteration should attach a pid to the lock file + and skip reclaim while the holder pid is alive.""" lock_path.parent.mkdir(parents=True, exist_ok=True) def _try_open_and_flock(): diff --git a/tests/test_cli_init.py b/tests/test_cli_init.py index 8258c4a..ad11ea8 100644 --- a/tests/test_cli_init.py +++ b/tests/test_cli_init.py @@ -224,3 +224,95 @@ def test_init_manifest_unauthorized_when_pull_records_manifest_error(tmp_path, m output = result.output + (result.stderr or "") assert "Traceback" not in output assert ("manifest_unauthorized" in output) or ("Manifest fetch failed" in output) + + +def test_init_uses_explicit_token_arg_not_stale_disk_token(tmp_path, monkeypatch): + """Regression for Devin Review finding on init.py:99. + + Repro: a prior `agnes init` left a stale token in + `~/.config/agnes/token.json`. The new run passes a fresh token via + `--token`. Pre-fix, step 2's PAT-verify call read the on-disk token + first and only fell back to the env var — so the explicit `--token` + arg was silently ignored, the verify ran with the stale token, and + init failed 401 with a confusing 'token expired' error even though + the supplied token was valid. + + Fix: a ContextVar-based override (set by `_override_server_env`) + short-circuits `get_token()` BEFORE the on-disk read. + """ + import json + from unittest.mock import MagicMock + + cfg_dir = tmp_path / "_cfg" + cfg_dir.mkdir() + monkeypatch.setenv("AGNES_CONFIG_DIR", str(cfg_dir)) + + # Seed a stale token on disk — this is what the bug exposed: the verify + # call would prefer this over the --token arg. + token_file = cfg_dir / "token.json" + token_file.write_text(json.dumps({ + "access_token": "STALE-DO-NOT-USE", + "email": "old@example.com", + }), encoding="utf-8") + + captured = {"verify_token": None} + + def _api_get(path, *args, **kwargs): + # Verify endpoint: snapshot whatever token cli.config.get_token() + # returns at the moment of the call. If the override is wired + # correctly, this will be the --token arg, not the stale disk + # value. + if path == "/api/catalog/tables": + from cli.config import get_token + captured["verify_token"] = get_token() + resp = MagicMock() + resp.status_code = 200 + if path == "/api/welcome": + resp.json.return_value = {"content": "# Test\n"} + elif path == "/api/sync/manifest": + resp.json.return_value = {"tables": {}} + elif path == "/api/memory/bundle": + resp.json.return_value = {"mandatory": [], "approved": []} + else: + resp.json.return_value = [] + return resp + + monkeypatch.setattr("cli.commands.init.api_get", _api_get, raising=False) + monkeypatch.setattr("cli.lib.pull.api_get", _api_get, raising=False) + + result = runner.invoke(init_app, [ + "--server-url", "http://x", + "--token", "FRESH-PAT-FROM-USER", + "--workspace", str(tmp_path / "ws"), + "--force", + ]) + + assert captured["verify_token"] == "FRESH-PAT-FROM-USER", ( + "Step 2 verify call must use the explicit --token arg, " + f"not the stale on-disk token. Got: {captured['verify_token']!r}" + ) + output = result.output + (result.stderr or "") + assert "Traceback" not in output + + +def test_token_override_contextvar_does_not_leak_outside_block(): + """The override must be scoped to the `with` block — leaking it would + poison subsequent `get_token()` calls (e.g. a long-running daemon + that runs `agnes init` once and then `agnes pull` later in the same + process).""" + from cli.config import _with_token_override, get_token + import os + + # Sandbox AGNES_CONFIG_DIR so the test's own config dir doesn't muddy + # the assertion (get_token would fall through to AGNES_TOKEN env or + # to None depending on host state). + prior_env = os.environ.pop("AGNES_TOKEN", None) + try: + with _with_token_override("INSIDE"): + assert get_token() == "INSIDE" + # Outside the block: override cleared, falls through to file/env. + # Without a config file or AGNES_TOKEN set, returns None. + assert get_token() != "INSIDE" + finally: + if prior_env is not None: + os.environ["AGNES_TOKEN"] = prior_env