fix(devin-review): stale-token override + status sessions counter + lock comment

Three Devin Review findings on PR #173 addressed in one commit since
they're in adjacent code paths:

1. cli/commands/init.py:99 (🔴): `agnes init --token NEW` ran
   step 2 verify against the OLD on-disk token because `get_token()`
   read `~/.config/agnes/token.json` before the env var, and
   `_override_server_env` only set the env var. So `agnes init --force`
   on a machine with a stale token.json failed 401 with a confusing
   'token expired' even though the --token arg was valid.

   Fix: ContextVar-based override in `cli.config._token_override`
   checked by `get_token()` BEFORE the on-disk read.
   `_with_token_override` context manager scopes the override.
   `_override_server_env` now also sets the contextvar via
   `_with_token_override(token)`, so both env var and contextvar
   carry the override (env for back-compat with anything bypassing
   get_token; contextvar is the authoritative source).
   Async-safe (each task sees its own override) and leak-proof
   (resets on context exit).
   2 new tests: regression on stale-disk-token + scope leak guard.

2. cli/commands/status.py:43 (🟡): sessions_pending_upload only
   checked legacy `<workspace>/user/sessions/` and always reported 0
   in workspaces bootstrapped with `agnes init` (Claude Code writes
   to `~/.claude/projects/`, not the legacy path). Same bug we fixed
   for `agnes push` in 08e49591.

   Fix: route through `cli.lib.claude_sessions.list_session_files()`
   so status and push agree on what counts as a pending session.

3. connectors/bigquery/extractor.py:111 (🟡): docstring claimed
   "a live holder still wins the second flock attempt" — incorrect on
   Linux. After `unlink()` + `open()`, the new file is a new inode;
   fcntl.flock keys per-inode, so the old holder's lock does NOT block
   the new acquisition. In a genuine TTL-overrun scenario two writers
   CAN race the parquet.tmp.

   Fix: documentation only. Comment now honestly describes the
   inode-recreation behavior, names the threading.Lock as the actual
   in-process guard, and flags pid-gating as the next-iteration fix
   if real corruption surfaces. The 24h default TTL is well above
   typical COPY durations so the practical risk is low.

Tests: 17/17 across test_cli_init.py + test_lib_pull.py + the broader
regression set.
ZdenekSrotyr 2026-05-04 21:26:30 +02:00
parent 8233c3e3f9
commit 8784f10a6b
6 changed files with 176 additions and 20 deletions


@@ -38,6 +38,9 @@ End-to-end clean-analyst-bootstrap rewrite. The web `/setup?role=analyst` page n
- `agnes snapshot create` (formerly `da fetch`) no longer materializes an empty `user/duckdb/analytics.duckdb` when run before any `agnes pull`. Friendly hint redirects to `agnes pull`.
- Workspace `agnes status` reads from the canonical `server/parquet/` and `user/duckdb/analytics.duckdb` paths (was reading legacy `data/parquet/`, `data/metadata/last_sync.json`).
- `agnes init` and `agnes pull` errors now use the `cli/error_render.py` typed-error renderer (added in 0.32.0), so analyst-facing error UX matches the structured shape `agnes query --remote` already produces.
- **`agnes init --token X` now correctly uses the explicit token in the verify call**, even when `~/.config/agnes/token.json` already holds a stale token from a prior install. Pre-fix `cli.config.get_token()` read the on-disk file first and only fell back to env vars, so step 2 (PAT-verify) ran with the stale token and failed with a confusing 401 — even though the `--token` arg was valid (Devin Review on `init.py:99`). Fix: a `ContextVar`-based override in `cli.config` short-circuits `get_token()` before the file read; `_override_server_env` (used by both `agnes init` and `agnes pull`'s `run_pull`) sets it for the duration of the call. Async-safe (each task sees its own override) and leak-proof (resets on context exit).
- **`agnes status` sessions counter now reads the same source as `agnes push`** — `~/.claude/projects/<encoded-cwd>/` (Claude Code's actual write path) with the legacy `<workspace>/user/sessions/` as a fallback, via `cli.lib.claude_sessions.list_session_files()`. Pre-fix the counter only checked the legacy dir and always reported 0 in workspaces bootstrapped with `agnes init` (since Claude Code never writes there).
- **BigQuery materialize lock-reclaim docstring** at `connectors/bigquery/extractor.py:_try_acquire_file_lock` corrected: a still-running holder's `fcntl.flock` does NOT block the post-unlink reacquisition (new file = new inode = independent lock). The in-process `threading.Lock` keyed on `table_id` is the actual concurrency guard; cross-process protection (two schedulers on one workspace) relies on operators not running multiple concurrent schedulers AND on the TTL being well above the longest plausible COPY (24 h default). Documenting the residual risk so it isn't masked by a misleading "we're safe" comment (Devin Review on extractor.py:111).
- **`agnes pull` now re-downloads parquets when the local file is missing, even if the recorded hash matches the server.** Pre-fix the download set was computed from `sync_state.json` hash equality alone — if the parquet had been deleted (manual `rm`, disk cleanup, a different workspace sharing the same global `~/.config/agnes/sync_state.json` writing one workspace's parquets while another reads sync_state and assumes "I already have these"), the hash-equal check would short-circuit the download and the next DuckDB view rebuild would fail on a missing file. Now the existence check on `<workspace>/server/parquet/<tid>.parquet` runs alongside the hash compare; missing file → forced re-download regardless of hash.
- **`agnes query --remote` no longer over-rejects narrow queries on partitioned/clustered BigQuery tables.** Closes #171. Pre-fix the `/api/query` cost guardrail dry-ran a synthetic `SELECT * FROM <table>` per registered remote-BQ row referenced by the user SQL, which forced BQ to estimate "full table scan" — column projection, predicate pushdown, and partition pruning were all ignored, producing scan-byte estimates up to ~30,000× larger than the actual query would scan. Narrow queries on big partitioned tables (the documented happy-path use case) were rejected with 400 `remote_scan_too_large` even when BQ's own dry-run reported single-digit MB. Now the guardrail rewrites the user SQL from DuckDB-flavor (bare registered names + `bq."<ds>"."<tbl>"`) to BQ-native (`` `<project>.<ds>.<tbl>` ``) and runs ONE dry-run on the EXACT user SQL — partition pruning, column projection, and predicate pushdown all engage. Cap check uses the real estimate. Fallback: if BQ rejects the rewritten SQL with `bq_bad_request` (DuckDB-only syntax that doesn't translate, e.g. `::INT` casts), the guardrail falls back to the pre-fix per-table SELECT * estimate so a non-portable query still gets bounded; non-parse errors (forbidden / upstream) propagate as 502. Helpers exported as `_rewrite_user_sql_for_bq_dry_run` (test seam).
- **Windows: `agnes` CLI no longer crashes on cs-CZ / non-UTF-8 consoles.** Two failure modes addressed (originally reported in #172 against the pre-rename `da` CLI; ported and broadened here): (1) `agnes pull` and any other Rich-progress-bar codepath crashed with `UnicodeEncodeError` because cp1250 / cp1252 cannot encode Rich's Braille spinner glyphs — `cli/main.py` now reconfigures `sys.stdout` / `sys.stderr` to UTF-8 with `errors="replace"` at import time when `sys.platform == "win32"`. (2) `agnes skills list` and `agnes skills show` crashed with `UnicodeDecodeError` reading skill markdown that contains em-dashes / accents — every `Path.read_text()` / `Path.write_text()` / `open()` call site in `cli/` (including ones not touched by #172, since several files were renamed in the bootstrap rewrite) now passes `encoding="utf-8"` explicitly. Defensive: also covers JSON / YAML config files that were ASCII-only in practice but were one non-ASCII value away from the same failure mode.

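The "async-safe" property claimed for the token override follows from how `contextvars` interacts with asyncio: each task runs in a copy of the context taken when the task is created. A minimal sketch of that behavior, using a hypothetical `_override` variable and `worker` coroutine rather than the real `cli.config` names:

```python
import asyncio
from contextvars import ContextVar
from typing import List, Optional, Tuple

# Hypothetical stand-in for cli.config._token_override (illustration only).
_override: ContextVar[Optional[str]] = ContextVar("demo_token_override", default=None)


async def worker(token: str, delay: float) -> Tuple[str, Optional[str]]:
    """Set an override, yield to the event loop, then read it back."""
    _override.set(token)
    await asyncio.sleep(delay)  # lets the other task run in between
    return token, _override.get()


async def main() -> Tuple[List[Tuple[str, Optional[str]]], Optional[str]]:
    # gather() wraps each coroutine in a Task; every Task gets its own
    # copy of the current context, so the two set() calls do not collide.
    results = await asyncio.gather(worker("A", 0.02), worker("B", 0.01))
    # The parent context never saw either set() call.
    return list(results), _override.get()


results, outer = asyncio.run(main())
print(results, outer)
```

Each worker reads back its own value even though the two tasks interleave, and the caller's context stays at the default.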

@@ -39,8 +39,11 @@ def status(
 if db_path.exists():
 last_synced = datetime.fromtimestamp(db_path.stat().st_mtime, tz=timezone.utc).isoformat()
-sessions_dir = workspace / "user" / "sessions"
-session_count = len(list(sessions_dir.glob("*.jsonl"))) if sessions_dir.exists() else 0
+# Sessions live in ~/.claude/projects/<encoded-cwd>/ (where Claude Code
+# writes them), with `<workspace>/user/sessions/` as a legacy fallback.
+# The helper unions both: same source of truth as `agnes push`.
+from cli.lib.claude_sessions import list_session_files
+session_count = len(list_session_files(workspace))
 info = {
 "workspace": str(workspace),

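To make the counter change concrete, here is a rough sketch of why the legacy-only count reports 0 while a union over both directories does not. Everything in it is hypothetical: `encode_cwd` is a placeholder encoding, and `list_session_files_sketch` takes an explicit `claude_projects` root for testability, which is not the signature of the real `list_session_files(workspace)`.

```python
import re
import tempfile
from pathlib import Path
from typing import List


def encode_cwd(workspace: Path) -> str:
    # ASSUMPTION: placeholder encoding for illustration only; the real
    # directory naming is whatever Claude Code and the helper agree on.
    return re.sub(r"[^A-Za-z0-9]", "-", str(workspace))


def list_session_files_sketch(workspace: Path, claude_projects: Path) -> List[Path]:
    """Union of *.jsonl files from the Claude Code dir and the legacy dir."""
    candidates = [
        claude_projects / encode_cwd(workspace),  # primary write path
        workspace / "user" / "sessions",          # legacy fallback
    ]
    files: List[Path] = []
    for directory in candidates:
        if directory.is_dir():
            files.extend(sorted(directory.glob("*.jsonl")))
    return files


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    workspace = root / "ws"
    workspace.mkdir()
    projects = root / "claude_projects"
    session_dir = projects / encode_cwd(workspace)
    session_dir.mkdir(parents=True)
    (session_dir / "a.jsonl").write_text("{}\n", encoding="utf-8")
    (session_dir / "b.jsonl").write_text("{}\n", encoding="utf-8")

    # Pre-fix logic: only the legacy directory is consulted.
    legacy_dir = workspace / "user" / "sessions"
    legacy_count = len(list(legacy_dir.glob("*.jsonl"))) if legacy_dir.exists() else 0
    # Post-fix logic: both locations are consulted.
    union_count = len(list_session_files_sketch(workspace, projects))

print(legacy_count, union_count)  # 0 2
```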

@@ -2,8 +2,45 @@
import json
import os
from contextlib import contextmanager
from contextvars import ContextVar
from pathlib import Path
from typing import Optional
from typing import Iterator, Optional
# In-process override for `get_token()`. Used by `agnes init --token X` and
# `agnes auth import-token` to force a specific token for the duration of a
# scoped block, EVEN WHEN `~/.config/agnes/token.json` already holds a
# different (possibly stale) token. Without this override, `get_token()`
# reads the on-disk token first and the explicit `--token` argument is
# silently ignored — the bug Devin Review caught at cli/commands/init.py:99.
#
# A ContextVar is used (not a plain global) so concurrent callers — async
# tasks, threads — each see their own override, and a leaked override in
# one task can't corrupt another. `_token_override.set(...)` returns a
# token used to reset; the `_with_token_override` context manager scopes it.
_token_override: ContextVar[Optional[str]] = ContextVar(
"agnes_cli_token_override", default=None,
)
@contextmanager
def _with_token_override(token: Optional[str]) -> Iterator[None]:
"""Set `_token_override` for the duration of the block.
`get_token()` checks the override BEFORE reading `token.json`, so any
in-block call returns the supplied token regardless of on-disk state.
Restores the prior override (if any) on exit so nested overrides nest
correctly.
"""
if not token:
yield
return
reset_token = _token_override.set(token)
try:
yield
finally:
_token_override.reset(reset_token)
def _config_dir() -> Path:
@@ -18,6 +55,12 @@ def get_server_url() -> str:
def get_token() -> Optional[str]:
# In-process override wins over BOTH the on-disk file and the env var.
# Set by `_with_token_override(...)`; used by `agnes init --token X`
# to force the explicit arg through the verify call even when a stale
# `~/.config/agnes/token.json` exists.
if (override := _token_override.get()) is not None:
return override
token_file = _config_dir() / "token.json"
if token_file.exists():
data = json.loads(token_file.read_text(encoding="utf-8"))

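The lookup order this hunk establishes (override, then `token.json`, then environment) can be exercised in isolation. The sketch below is a simplified stand-in, not the project's module: `with_override` and `get_token` are hypothetical, `get_token` takes the config directory as a parameter, and reading an `access_token` key with an `AGNES_TOKEN` fallback is an assumption based on the test fixture and commit message.

```python
import json
import os
import tempfile
from contextlib import contextmanager
from contextvars import ContextVar
from pathlib import Path
from typing import Iterator, Optional

# Hypothetical stand-in for cli.config._token_override.
_override: ContextVar[Optional[str]] = ContextVar("demo_override", default=None)


@contextmanager
def with_override(token: Optional[str]) -> Iterator[None]:
    """Scope an override to the block; an empty token is a no-op."""
    if not token:
        yield
        return
    reset = _override.set(token)
    try:
        yield
    finally:
        _override.reset(reset)  # restores whatever was set before


def get_token(config_dir: Path) -> Optional[str]:
    # 1. In-process override wins.
    if (override := _override.get()) is not None:
        return override
    # 2. On-disk token file (ASSUMED layout: {"access_token": ...}).
    token_file = config_dir / "token.json"
    if token_file.exists():
        return json.loads(token_file.read_text(encoding="utf-8")).get("access_token")
    # 3. Environment fallback.
    return os.environ.get("AGNES_TOKEN")


with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp)
    (cfg / "token.json").write_text(
        json.dumps({"access_token": "STALE"}), encoding="utf-8"
    )
    before = get_token(cfg)
    with with_override("FRESH"):
        inside = get_token(cfg)
        with with_override("NESTED"):
            nested = get_token(cfg)
        after_nested = get_token(cfg)
    after = get_token(cfg)

print(before, inside, nested, after_nested, after)
# STALE FRESH NESTED FRESH STALE
```

Because `reset()` restores the previous value rather than clearing it, nested overrides unwind in order, which is the "nest correctly" behavior the docstring describes.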

@@ -67,28 +67,34 @@ _SAFE_ID_RE = re.compile(r"^[a-zA-Z0-9_\-]{1,128}$")
 @contextmanager
 def _override_server_env(server_url: str, token: str) -> Iterator[None]:
-"""Set AGNES_SERVER / AGNES_TOKEN for the duration of the call.
+"""Set AGNES_SERVER + scoped token override for the duration of the call.
-`cli.config.get_server_url` / `get_token` already honor these env vars,
-which is the same mechanism used in production. Restores prior values
-on exit so the caller's environment isn't mutated permanently.
+`cli.config.get_server_url` honors `AGNES_SERVER`, so the server URL is
+swapped via env-var. The TOKEN override is routed through
+`cli.config._with_token_override` (a ContextVar), which is checked by
+`get_token()` BEFORE the on-disk `~/.config/agnes/token.json`. This is
+load-bearing: `agnes init --token NEW` runs the verify call in step 2
+while the file still holds an OLD token from a prior install; without
+the override, the verify uses the stale on-disk token and fails 401.
-Caveats:
-- **Token override is honored only when no `~/.config/agnes/token.json`
-exists.** `get_token()` reads the file first and only falls back to
-`AGNES_TOKEN`. `agnes init` writes `token.json` before calling
-`run_pull` so the values agree in production; isolated tests/callers
-that pass a different token must clear the on-disk token first.
-- **Not safe for concurrent invocation in the same process**: env-var
-swap is global. Single-threaded use only.
+`AGNES_TOKEN` env var is also set as a back-compat hint for any code
+path that bypasses `get_token()` (none in `cli/` at last audit, but
+third-party hooks may), but the contextvar is the authoritative source.
+Restores prior values on exit so the caller's environment isn't
+mutated permanently. Not safe for concurrent invocation across threads;
+single-threaded use only.
 """
+from cli.config import _with_token_override
 prev_server = os.environ.get("AGNES_SERVER")
 prev_token = os.environ.get("AGNES_TOKEN")
 os.environ["AGNES_SERVER"] = server_url
 if token:
 os.environ["AGNES_TOKEN"] = token
 try:
-yield
+with _with_token_override(token):
+yield
 finally:
 if prev_server is None:
 os.environ.pop("AGNES_SERVER", None)

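The env-var half of `_override_server_env` follows a common save/set/restore pattern. A generic sketch of that pattern, with a hypothetical `temp_env` helper and made-up `DEMO_*` variable names (not the project's code), shows that the prior values come back even when the block raises:

```python
import os
from contextlib import contextmanager
from typing import Iterator, Optional


@contextmanager
def temp_env(**overrides: Optional[str]) -> Iterator[None]:
    """Temporarily set environment variables, restoring prior state on exit."""
    previous = {name: os.environ.get(name) for name in overrides}
    for name, value in overrides.items():
        if value:  # skip empty/None, similar to the `if token:` guard above
            os.environ[name] = value
    try:
        yield
    finally:
        for name, old in previous.items():
            if old is None:
                os.environ.pop(name, None)  # variable did not exist before
            else:
                os.environ[name] = old


os.environ["DEMO_SERVER"] = "http://old"
os.environ.pop("DEMO_TOKEN", None)

try:
    with temp_env(DEMO_SERVER="http://new", DEMO_TOKEN="fresh"):
        inside = (os.environ["DEMO_SERVER"], os.environ["DEMO_TOKEN"])
        raise RuntimeError("simulated failure inside the block")
except RuntimeError:
    pass

after = (os.environ.get("DEMO_SERVER"), os.environ.get("DEMO_TOKEN"))
os.environ.pop("DEMO_SERVER", None)  # clean up the demo variable
print(inside, after)
```

`os.environ` is process-global, so this pattern is only safe when one caller at a time uses it, which matches the single-threaded caveat in the docstring.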

@@ -106,10 +106,19 @@ def _try_acquire_file_lock(lock_path: Path):
 Stale-lock reclaim: if the lock_path exists and its mtime is older
 than the configured TTL, log a warning and unlink before retrying.
-A live holder still wins the second flock attempt (kernel-level
-flock isn't tied to mtime), so the reclaim doesn't break correctness;
-it just unblocks the case where a holder process was hard-killed
-before the kernel released the lock."""
+Caveat: ``lock_path.unlink()`` + the subsequent ``open()`` creates a
+NEW inode; fcntl.flock keys on inode, so a still-running holder's
+lock on the (now-orphan) old inode does NOT block the new acquisition.
+A genuine overrunning materialize past TTL therefore CAN race a
+fresh attempt and both can write to ``<id>.parquet.tmp``. The
+in-process ``threading.Lock`` keyed on ``table_id`` blocks that race
+within one scheduler process; cross-process protection (two schedulers
+on the same workspace) relies on operators not running multiple
+concurrent schedulers AND on the TTL being well above the longest
+plausible COPY (24 h default). If real corruption surfaces in
+production, the next iteration should attach a pid to the lock file
+and skip reclaim while the holder pid is alive."""
 lock_path.parent.mkdir(parents=True, exist_ok=True)
 def _try_open_and_flock():

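The inode behavior described in the corrected docstring can be reproduced with a few lines on a Unix system (`fcntl` is not available on Windows). This is an illustrative sketch only: `try_flock` is a hypothetical helper, not the project's `_try_acquire_file_lock`, and it uses a temporary directory in place of a real lock path.

```python
import fcntl
import os
import tempfile
from pathlib import Path
from typing import Optional


def try_flock(path: Path) -> Optional[int]:
    """Open `path` and try a non-blocking exclusive flock.

    Returns the file descriptor on success, or None if the lock is held.
    """
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        return None
    return fd


with tempfile.TemporaryDirectory() as tmp:
    lock_path = Path(tmp) / "table.lock"

    holder = try_flock(lock_path)       # first writer takes the lock
    blocked = try_flock(lock_path)      # same file, same inode: refused
    old_inode = os.fstat(holder).st_ino

    lock_path.unlink()                  # the stale-lock "reclaim" step
    reclaimer = try_flock(lock_path)    # path now names a brand-new inode
    new_inode = os.fstat(reclaimer).st_ino

    # Both descriptors hold an exclusive lock at the same time,
    # each on a different inode.
    both_hold = holder is not None and reclaimer is not None

    os.close(holder)
    os.close(reclaimer)

print(blocked, old_inode != new_inode, both_hold)
```

The second attempt is refused while the path still points at the held file, but after `unlink()` the new `open()` creates a fresh inode and the lock is granted even though the original holder never released anything.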

@@ -224,3 +224,95 @@ def test_init_manifest_unauthorized_when_pull_records_manifest_error(tmp_path, m
output = result.output + (result.stderr or "")
assert "Traceback" not in output
assert ("manifest_unauthorized" in output) or ("Manifest fetch failed" in output)
def test_init_uses_explicit_token_arg_not_stale_disk_token(tmp_path, monkeypatch):
"""Regression for Devin Review finding on init.py:99.
Repro: a prior `agnes init` left a stale token in
`~/.config/agnes/token.json`. The new run passes a fresh token via
`--token`. Pre-fix, step 2's PAT-verify call read the on-disk token
first and only fell back to the env var, so the explicit `--token`
arg was silently ignored, the verify ran with the stale token, and
init failed 401 with a confusing 'token expired' error even though
the supplied token was valid.
Fix: a ContextVar-based override (set by `_override_server_env`)
short-circuits `get_token()` BEFORE the on-disk read.
"""
import json
from unittest.mock import MagicMock
cfg_dir = tmp_path / "_cfg"
cfg_dir.mkdir()
monkeypatch.setenv("AGNES_CONFIG_DIR", str(cfg_dir))
# Seed a stale token on disk — this is what the bug exposed: the verify
# call would prefer this over the --token arg.
token_file = cfg_dir / "token.json"
token_file.write_text(json.dumps({
"access_token": "STALE-DO-NOT-USE",
"email": "old@example.com",
}), encoding="utf-8")
captured = {"verify_token": None}
def _api_get(path, *args, **kwargs):
# Verify endpoint: snapshot whatever token cli.config.get_token()
# returns at the moment of the call. If the override is wired
# correctly, this will be the --token arg, not the stale disk
# value.
if path == "/api/catalog/tables":
from cli.config import get_token
captured["verify_token"] = get_token()
resp = MagicMock()
resp.status_code = 200
if path == "/api/welcome":
resp.json.return_value = {"content": "# Test\n"}
elif path == "/api/sync/manifest":
resp.json.return_value = {"tables": {}}
elif path == "/api/memory/bundle":
resp.json.return_value = {"mandatory": [], "approved": []}
else:
resp.json.return_value = []
return resp
monkeypatch.setattr("cli.commands.init.api_get", _api_get, raising=False)
monkeypatch.setattr("cli.lib.pull.api_get", _api_get, raising=False)
result = runner.invoke(init_app, [
"--server-url", "http://x",
"--token", "FRESH-PAT-FROM-USER",
"--workspace", str(tmp_path / "ws"),
"--force",
])
assert captured["verify_token"] == "FRESH-PAT-FROM-USER", (
"Step 2 verify call must use the explicit --token arg, "
f"not the stale on-disk token. Got: {captured['verify_token']!r}"
)
output = result.output + (result.stderr or "")
assert "Traceback" not in output
def test_token_override_contextvar_does_not_leak_outside_block():
"""The override must be scoped to the `with` block — leaking it would
poison subsequent `get_token()` calls (e.g. a long-running daemon
that runs `agnes init` once and then `agnes pull` later in the same
process)."""
from cli.config import _with_token_override, get_token
import os
# Sandbox AGNES_CONFIG_DIR so the test's own config dir doesn't muddy
# the assertion (get_token would fall through to AGNES_TOKEN env or
# to None depending on host state).
prior_env = os.environ.pop("AGNES_TOKEN", None)
try:
with _with_token_override("INSIDE"):
assert get_token() == "INSIDE"
# Outside the block: override cleared, falls through to file/env.
# Without a config file or AGNES_TOKEN set, returns None.
assert get_token() != "INSIDE"
finally:
if prior_env is not None:
os.environ["AGNES_TOKEN"] = prior_env