fix(devin-review): stale-token override + status sessions counter + lock comment

Three Devin Review findings on PR #173 addressed in one commit since they're in adjacent code paths: 1. cli/commands/init.py:99 (\u{1F534}): `agnes init --token NEW` ran step 2 verify against the OLD on-disk token because `get_token()` read `~/.config/agnes/token.json` before the env var, and `_override_server_env` only set the env var. So `agnes init --force` on a machine with a stale token.json failed 401 with a confusing 'token expired' even though the --token arg was valid. Fix: ContextVar-based override in `cli.config._token_override` checked by `get_token()` BEFORE the on-disk read. `_with_token_override` context manager scopes the override. `_override_server_env` now also sets the contextvar via `_with_token_override(token)`, so both env var and contextvar carry the override (env for back-compat with anything bypassing get_token; contextvar is the authoritative source). Async-safe (each task sees its own override) and leak-proof (resets on context exit). 2 new tests: regression on stale-disk-token + scope leak guard. 2. cli/commands/status.py:43 (\u{1F7E1}): sessions_pending_upload only checked legacy `<workspace>/user/sessions/` and always reported 0 in workspaces bootstrapped with `agnes init` (Claude Code writes to `~/.claude/projects/`, not the legacy path). Same bug we fixed for `agnes push` in 08e49591. Fix: route through `cli.lib.claude_sessions.list_session_files()` so status and push agree on what counts as a pending session. 3. connectors/bigquery/extractor.py:111 (\u{1F7E1}): docstring claimed "a live holder still wins the second flock attempt" — incorrect on Linux. After `unlink()` + `open()`, the new file is a new inode; fcntl.flock keys per-inode, so the old holder's lock does NOT block the new acquisition. In a genuine TTL-overrun scenario two writers CAN race the parquet.tmp. Fix: documentation only. Comment now honestly describes the inode-recreation behavior, names the threading.Lock as the actual in-process guard, and flags pid-gating as the next-iteration fix if real corruption surfaces. The 24h default TTL is well above typical COPY durations so the practical risk is low. Tests: 17/17 across test_cli_init.py + test_lib_pull.py + the broader regression set.
2026-05-04 21:26:30 +02:00 · 2026-05-04 21:26:30 +02:00 · 8784f10a6b
commit 8784f10a6b
parent 8233c3e3f9
6 changed files with 176 additions and 20 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@ -38,6 +38,9 @@ End-to-end clean-analyst-bootstrap rewrite. The web `/setup?role=analyst` page n
 - `agnes snapshot create` (formerly `da fetch`) no longer materializes an empty `user/duckdb/analytics.duckdb` when run before any `agnes pull`. Friendly hint redirects to `agnes pull`.
 - Workspace `agnes status` reads from the canonical `server/parquet/` and `user/duckdb/analytics.duckdb` paths (was reading legacy `data/parquet/`, `data/metadata/last_sync.json`).
 - `agnes init` and `agnes pull` errors now use the `cli/error_render.py` typed-error renderer (added in 0.32.0), so analyst-facing error UX matches the structured shape `agnes query --remote` already produces.
 - **`agnes init --token X` now correctly uses the explicit token in the verify call**, even when `~/.config/agnes/token.json` already holds a stale token from a prior install. Pre-fix `cli.config.get_token()` read the on-disk file first and only fell back to env vars, so step 2 (PAT-verify) ran with the stale token and failed with a confusing 401 — even though the `--token` arg was valid (Devin Review on `init.py:99`). Fix: a `ContextVar`-based override in `cli.config` short-circuits `get_token()` before the file read; `_override_server_env` (used by both `agnes init` and `agnes pull`'s `run_pull`) sets it for the duration of the call. Async-safe (each task sees its own override) and leak-proof (resets on context exit).
 - **`agnes status` sessions counter now reads the same source as `agnes push`** — `~/.claude/projects/<encoded-cwd>/` (Claude Code's actual write path) with the legacy `<workspace>/user/sessions/` as a fallback, via `cli.lib.claude_sessions.list_session_files()`. Pre-fix the counter only checked the legacy dir and always reported 0 in workspaces bootstrapped with `agnes init` (since Claude Code never writes there).
 - **BigQuery materialize lock-reclaim docstring** at `connectors/bigquery/extractor.py:_try_acquire_file_lock` corrected: a still-running holder's `fcntl.flock` does NOT block the post-unlink reacquisition (new file = new inode = independent lock). The in-process `threading.Lock` keyed on `table_id` is the actual concurrency guard; cross-process protection (two schedulers on one workspace) relies on operators not running multiple concurrent schedulers AND on the TTL being well above the longest plausible COPY (24 h default). Documenting the residual risk so it isn't masked by a misleading "we're safe" comment (Devin Review on extractor.py:111).
 - **`agnes pull` now re-downloads parquets when the local file is missing, even if the recorded hash matches the server.** Pre-fix the download set was computed from `sync_state.json` hash equality alone — if the parquet had been deleted (manual `rm`, disk cleanup, a different workspace sharing the same global `~/.config/agnes/sync_state.json` writing one workspace's parquets while another reads sync_state and assumes "I already have these"), the hash-equal check would short-circuit the download and the next DuckDB view rebuild would fail on a missing file. Now the existence check on `<workspace>/server/parquet/<tid>.parquet` runs alongside the hash compare; missing file → forced re-download regardless of hash.
 - **`agnes query --remote` no longer over-rejects narrow queries on partitioned/clustered BigQuery tables.** Closes #171. Pre-fix the `/api/query` cost guardrail dry-ran a synthetic `SELECT * FROM <table>` per registered remote-BQ row referenced by the user SQL, which forced BQ to estimate "full table scan" — column projection, predicate pushdown, and partition pruning were all ignored, producing scan-byte estimates up to ~30,000× larger than the actual query would scan. Narrow queries on big partitioned tables (the documented happy-path use case) were rejected with 400 `remote_scan_too_large` even when BQ's own dry-run reported single-digit MB. Now the guardrail rewrites the user SQL from DuckDB-flavor (bare registered names + `bq."<ds>"."<tbl>"`) to BQ-native (`` `<project>.<ds>.<tbl>` ``) and runs ONE dry-run on the EXACT user SQL — partition pruning, column projection, and predicate pushdown all engage. Cap check uses the real estimate. Fallback: if BQ rejects the rewritten SQL with `bq_bad_request` (DuckDB-only syntax that doesn't translate, e.g. `::INT` casts), the guardrail falls back to the pre-fix per-table SELECT * estimate so a non-portable query still gets bounded; non-parse errors (forbidden / upstream) propagate as 502. Helpers exported as `_rewrite_user_sql_for_bq_dry_run` (test seam).
 - **Windows: `agnes` CLI no longer crashes on cs-CZ / non-UTF-8 consoles.** Two failure modes addressed (originally reported in #172 against the pre-rename `da` CLI; ported and broadened here): (1) `agnes pull` and any other Rich-progress-bar codepath crashed with `UnicodeEncodeError` because cp1250 / cp1252 cannot encode Rich's Braille spinner glyphs — `cli/main.py` now reconfigures `sys.stdout` / `sys.stderr` to UTF-8 with `errors="replace"` at import time when `sys.platform == "win32"`. (2) `agnes skills list` and `agnes skills show` crashed with `UnicodeDecodeError` reading skill markdown that contains em-dashes / accents — every `Path.read_text()` / `Path.write_text()` / `open()` call site in `cli/` (including ones not touched by #172, since several files were renamed in the bootstrap rewrite) now passes `encoding="utf-8"` explicitly. Defensive: also covers JSON / YAML config files that were ASCII-only in practice but were one non-ASCII value away from the same failure mode.
--- a/cli/commands/status.py
+++ b/cli/commands/status.py
@ -39,8 +39,11 @@ def status(
    if db_path.exists():
        last_synced = datetime.fromtimestamp(db_path.stat().st_mtime, tz=timezone.utc).isoformat()
-    sessions_dir = workspace / "user" / "sessions"
+    # Sessions live in ~/.claude/projects/<encoded-cwd>/ (where Claude Code
-    session_count = len(list(sessions_dir.glob("*.jsonl"))) if sessions_dir.exists() else 0
+    # writes them), with `<workspace>/user/sessions/` as a legacy fallback.
    # The helper unions both — same source of truth as `agnes push`.
    from cli.lib.claude_sessions import list_session_files
    session_count = len(list_session_files(workspace))
    info = {
        "workspace": str(workspace),
--- a/cli/config.py
+++ b/cli/config.py
@ -2,8 +2,45 @@
 import json
 import os
 from contextlib import contextmanager
 from contextvars import ContextVar
 from pathlib import Path
-from typing import Optional
+from typing import Iterator, Optional
 # In-process override for `get_token()`. Used by `agnes init --token X` and
 # `agnes auth import-token` to force a specific token for the duration of a
 # scoped block, EVEN WHEN `~/.config/agnes/token.json` already holds a
 # different (possibly stale) token. Without this override, `get_token()`
 # reads the on-disk token first and the explicit `--token` argument is
 # silently ignored — the bug Devin Review caught at cli/commands/init.py:99.
 #
 # A ContextVar is used (not a plain global) so concurrent callers — async
 # tasks, threads — each see their own override, and a leaked override in
 # one task can't corrupt another. `_token_override.set(...)` returns a
 # token used to reset; the `_with_token_override` context manager scopes it.
 _token_override: ContextVar[Optional[str]] = ContextVar(
    "agnes_cli_token_override", default=None,
 )
@contextmanager
 def _with_token_override(token: Optional[str]) -> Iterator[None]:
    """Set `_token_override` for the duration of the block.
    `get_token()` checks the override BEFORE reading `token.json`, so any
    in-block call returns the supplied token regardless of on-disk state.
    Restores the prior override (if any) on exit so nested overrides nest
    correctly.
    """
    if not token:
        yield
        return
    reset_token = _token_override.set(token)
    try:
        yield
    finally:
        _token_override.reset(reset_token)
 def _config_dir() -> Path:
@ -18,6 +55,12 @@ def get_server_url() -> str:
 def get_token() -> Optional[str]:
    # In-process override wins over BOTH the on-disk file and the env var.
    # Set by `_with_token_override(...)`; used by `agnes init --token X`
    # to force the explicit arg through the verify call even when a stale
    # `~/.config/agnes/token.json` exists.
    if (override := _token_override.get()) is not None:
        return override
    token_file = _config_dir() / "token.json"
    if token_file.exists():
        data = json.loads(token_file.read_text(encoding="utf-8"))
--- a/cli/lib/pull.py
+++ b/cli/lib/pull.py
@ -67,28 +67,34 @@ _SAFE_ID_RE = re.compile(r"^[a-zA-Z0-9_\-]{1,128}$")
@contextmanager
 def _override_server_env(server_url: str, token: str) -> Iterator[None]:
-    """Set AGNES_SERVER / AGNES_TOKEN for the duration of the call.
+    """Set AGNES_SERVER + scoped token override for the duration of the call.
-    `cli.config.get_server_url` / `get_token` already honor these env vars,
+    `cli.config.get_server_url` honors `AGNES_SERVER`, so the server URL is
-    which is the same mechanism used in production. Restores prior values
+    swapped via env-var. The TOKEN override is routed through
-    on exit so the caller's environment isn't mutated permanently.
+    `cli.config._with_token_override` (a ContextVar), which is checked by
    `get_token()` BEFORE the on-disk `~/.config/agnes/token.json`. This is
    load-bearing: `agnes init --token NEW` runs the verify call in step 2
    while the file still holds an OLD token from a prior install — without
    the override, the verify uses the stale on-disk token and fails 401.
-    Caveats:
+    `AGNES_TOKEN` env var is also set as a back-compat hint for any code
-    - **Token override is honored only when no `~/.config/agnes/token.json`
+    path that bypasses `get_token()` (none in `cli/` at last audit, but
-      exists.** `get_token()` reads the file first and only falls back to
+    third-party hooks may), but the contextvar is the authoritative source.
-      `AGNES_TOKEN`. `agnes init` writes `token.json` before calling
+
-      `run_pull` so the values agree in production; isolated tests/callers
+    Restores prior values on exit so the caller's environment isn't
-      that pass a different token must clear the on-disk token first.
+    mutated permanently. Not safe for concurrent invocation across threads;
-    - **Not safe for concurrent invocation in the same process** — env-var
+    single-threaded use only.
      swap is global. Single-threaded use only.
    """
    from cli.config import _with_token_override
    prev_server = os.environ.get("AGNES_SERVER")
    prev_token = os.environ.get("AGNES_TOKEN")
    os.environ["AGNES_SERVER"] = server_url
    if token:
        os.environ["AGNES_TOKEN"] = token
    try:
-        yield
+        with _with_token_override(token):
            yield
    finally:
        if prev_server is None:
            os.environ.pop("AGNES_SERVER", None)
--- a/connectors/bigquery/extractor.py
+++ b/connectors/bigquery/extractor.py
@ -106,10 +106,19 @@ def _try_acquire_file_lock(lock_path: Path):
    Stale-lock reclaim: if the lock_path exists and its mtime is older
    than the configured TTL, log a warning and unlink before retrying.
-    A live holder still wins the second flock attempt (kernel-level
+
-    flock isn't tied to mtime), so the reclaim doesn't break correctness
+    Caveat: ``lock_path.unlink()`` + the subsequent ``open()`` creates a
-    — it just unblocks the case where a holder process was hard-killed
+    NEW inode — fcntl.flock keys on inode, so a still-running holder's
-    before the kernel released the lock."""
+    lock on the (now-orphan) old inode does NOT block the new acquisition.
    A genuine overrunning materialize past TTL therefore CAN race a
    fresh attempt and both can write to ``<id>.parquet.tmp``. The
    in-process ``threading.Lock`` keyed on ``table_id`` blocks that race
    within one scheduler process; cross-process protection (two schedulers
    on the same workspace) relies on operators not running multiple
    concurrent schedulers AND on the TTL being well above the longest
    plausible COPY (24 h default). If real corruption surfaces in
    production, the next iteration should attach a pid to the lock file
    and skip reclaim while the holder pid is alive."""
    lock_path.parent.mkdir(parents=True, exist_ok=True)
    def _try_open_and_flock():
--- a/tests/test_cli_init.py
+++ b/tests/test_cli_init.py
@ -224,3 +224,95 @@ def test_init_manifest_unauthorized_when_pull_records_manifest_error(tmp_path, m
    output = result.output + (result.stderr or "")
    assert "Traceback" not in output
    assert ("manifest_unauthorized" in output) or ("Manifest fetch failed" in output)
 def test_init_uses_explicit_token_arg_not_stale_disk_token(tmp_path, monkeypatch):
    """Regression for Devin Review finding on init.py:99.
    Repro: a prior `agnes init` left a stale token in
    `~/.config/agnes/token.json`. The new run passes a fresh token via
    `--token`. Pre-fix, step 2's PAT-verify call read the on-disk token
    first and only fell back to the env var — so the explicit `--token`
    arg was silently ignored, the verify ran with the stale token, and
    init failed 401 with a confusing 'token expired' error even though
    the supplied token was valid.
    Fix: a ContextVar-based override (set by `_override_server_env`)
    short-circuits `get_token()` BEFORE the on-disk read.
    """
    import json
    from unittest.mock import MagicMock
    cfg_dir = tmp_path / "_cfg"
    cfg_dir.mkdir()
    monkeypatch.setenv("AGNES_CONFIG_DIR", str(cfg_dir))
    # Seed a stale token on disk — this is what the bug exposed: the verify
    # call would prefer this over the --token arg.
    token_file = cfg_dir / "token.json"
    token_file.write_text(json.dumps({
        "access_token": "STALE-DO-NOT-USE",
        "email": "old@example.com",
    }), encoding="utf-8")
    captured = {"verify_token": None}
    def _api_get(path, *args, **kwargs):
        # Verify endpoint: snapshot whatever token cli.config.get_token()
        # returns at the moment of the call. If the override is wired
        # correctly, this will be the --token arg, not the stale disk
        # value.
        if path == "/api/catalog/tables":
            from cli.config import get_token
            captured["verify_token"] = get_token()
        resp = MagicMock()
        resp.status_code = 200
        if path == "/api/welcome":
            resp.json.return_value = {"content": "# Test\n"}
        elif path == "/api/sync/manifest":
            resp.json.return_value = {"tables": {}}
        elif path == "/api/memory/bundle":
            resp.json.return_value = {"mandatory": [], "approved": []}
        else:
            resp.json.return_value = []
        return resp
    monkeypatch.setattr("cli.commands.init.api_get", _api_get, raising=False)
    monkeypatch.setattr("cli.lib.pull.api_get", _api_get, raising=False)
    result = runner.invoke(init_app, [
        "--server-url", "http://x",
        "--token", "FRESH-PAT-FROM-USER",
        "--workspace", str(tmp_path / "ws"),
        "--force",
    ])
    assert captured["verify_token"] == "FRESH-PAT-FROM-USER", (
        "Step 2 verify call must use the explicit --token arg, "
        f"not the stale on-disk token. Got: {captured['verify_token']!r}"
    )
    output = result.output + (result.stderr or "")
    assert "Traceback" not in output
 def test_token_override_contextvar_does_not_leak_outside_block():
    """The override must be scoped to the `with` block — leaking it would
    poison subsequent `get_token()` calls (e.g. a long-running daemon
    that runs `agnes init` once and then `agnes pull` later in the same
    process)."""
    from cli.config import _with_token_override, get_token
    import os
    # Sandbox AGNES_CONFIG_DIR so the test's own config dir doesn't muddy
    # the assertion (get_token would fall through to AGNES_TOKEN env or
    # to None depending on host state).
    prior_env = os.environ.pop("AGNES_TOKEN", None)
    try:
        with _with_token_override("INSIDE"):
            assert get_token() == "INSIDE"
        # Outside the block: override cleared, falls through to file/env.
        # Without a config file or AGNES_TOKEN set, returns None.
        assert get_token() != "INSIDE"
    finally:
        if prior_env is not None:
            os.environ["AGNES_TOKEN"] = prior_env