fix(devin-review): stale-token override + status sessions counter + lock comment
Three Devin Review findings on PR #173 addressed in one commit since
they're in adjacent code paths:
1. cli/commands/init.py:99 (\u{1F534}): `agnes init --token NEW` ran
step 2 verify against the OLD on-disk token because `get_token()`
read `~/.config/agnes/token.json` before the env var, and
`_override_server_env` only set the env var. So `agnes init --force`
on a machine with a stale token.json failed 401 with a confusing
'token expired' even though the --token arg was valid.
Fix: ContextVar-based override in `cli.config._token_override`
checked by `get_token()` BEFORE the on-disk read.
`_with_token_override` context manager scopes the override.
`_override_server_env` now also sets the contextvar via
`_with_token_override(token)`, so both env var and contextvar
carry the override (env for back-compat with anything bypassing
get_token; contextvar is the authoritative source).
Async-safe (each task sees its own override) and leak-proof
(resets on context exit).
2 new tests: regression on stale-disk-token + scope leak guard.
2. cli/commands/status.py:43 (\u{1F7E1}): sessions_pending_upload only
checked legacy `<workspace>/user/sessions/` and always reported 0
in workspaces bootstrapped with `agnes init` (Claude Code writes
to `~/.claude/projects/`, not the legacy path). Same bug we fixed
for `agnes push` in 08e49591.
Fix: route through `cli.lib.claude_sessions.list_session_files()`
so status and push agree on what counts as a pending session.
3. connectors/bigquery/extractor.py:111 (\u{1F7E1}): docstring claimed
"a live holder still wins the second flock attempt" — incorrect on
Linux. After `unlink()` + `open()`, the new file is a new inode;
fcntl.flock keys per-inode, so the old holder's lock does NOT block
the new acquisition. In a genuine TTL-overrun scenario two writers
CAN race the parquet.tmp.
Fix: documentation only. Comment now honestly describes the
inode-recreation behavior, names the threading.Lock as the actual
in-process guard, and flags pid-gating as the next-iteration fix
if real corruption surfaces. The 24h default TTL is well above
typical COPY durations so the practical risk is low.
Tests: 17/17 across test_cli_init.py + test_lib_pull.py + the broader
regression set.
This commit is contained in:
parent
8233c3e3f9
commit
8784f10a6b
6 changed files with 176 additions and 20 deletions
|
|
@ -38,6 +38,9 @@ End-to-end clean-analyst-bootstrap rewrite. The web `/setup?role=analyst` page n
|
|||
- `agnes snapshot create` (formerly `da fetch`) no longer materializes an empty `user/duckdb/analytics.duckdb` when run before any `agnes pull`. Friendly hint redirects to `agnes pull`.
|
||||
- Workspace `agnes status` reads from the canonical `server/parquet/` and `user/duckdb/analytics.duckdb` paths (was reading legacy `data/parquet/`, `data/metadata/last_sync.json`).
|
||||
- `agnes init` and `agnes pull` errors now use the `cli/error_render.py` typed-error renderer (added in 0.32.0), so analyst-facing error UX matches the structured shape `agnes query --remote` already produces.
|
||||
- **`agnes init --token X` now correctly uses the explicit token in the verify call**, even when `~/.config/agnes/token.json` already holds a stale token from a prior install. Pre-fix `cli.config.get_token()` read the on-disk file first and only fell back to env vars, so step 2 (PAT-verify) ran with the stale token and failed with a confusing 401 — even though the `--token` arg was valid (Devin Review on `init.py:99`). Fix: a `ContextVar`-based override in `cli.config` short-circuits `get_token()` before the file read; `_override_server_env` (used by both `agnes init` and `agnes pull`'s `run_pull`) sets it for the duration of the call. Async-safe (each task sees its own override) and leak-proof (resets on context exit).
|
||||
- **`agnes status` sessions counter now reads the same source as `agnes push`** — `~/.claude/projects/<encoded-cwd>/` (Claude Code's actual write path) with the legacy `<workspace>/user/sessions/` as a fallback, via `cli.lib.claude_sessions.list_session_files()`. Pre-fix the counter only checked the legacy dir and always reported 0 in workspaces bootstrapped with `agnes init` (since Claude Code never writes there).
|
||||
- **BigQuery materialize lock-reclaim docstring** at `connectors/bigquery/extractor.py:_try_acquire_file_lock` corrected: a still-running holder's `fcntl.flock` does NOT block the post-unlink reacquisition (new file = new inode = independent lock). The in-process `threading.Lock` keyed on `table_id` is the actual concurrency guard; cross-process protection (two schedulers on one workspace) relies on operators not running multiple concurrent schedulers AND on the TTL being well above the longest plausible COPY (24 h default). Documenting the residual risk so it isn't masked by a misleading "we're safe" comment (Devin Review on extractor.py:111).
|
||||
- **`agnes pull` now re-downloads parquets when the local file is missing, even if the recorded hash matches the server.** Pre-fix the download set was computed from `sync_state.json` hash equality alone — if the parquet had been deleted (manual `rm`, disk cleanup, a different workspace sharing the same global `~/.config/agnes/sync_state.json` writing one workspace's parquets while another reads sync_state and assumes "I already have these"), the hash-equal check would short-circuit the download and the next DuckDB view rebuild would fail on a missing file. Now the existence check on `<workspace>/server/parquet/<tid>.parquet` runs alongside the hash compare; missing file → forced re-download regardless of hash.
|
||||
- **`agnes query --remote` no longer over-rejects narrow queries on partitioned/clustered BigQuery tables.** Closes #171. Pre-fix the `/api/query` cost guardrail dry-ran a synthetic `SELECT * FROM <table>` per registered remote-BQ row referenced by the user SQL, which forced BQ to estimate "full table scan" — column projection, predicate pushdown, and partition pruning were all ignored, producing scan-byte estimates up to ~30,000× larger than the actual query would scan. Narrow queries on big partitioned tables (the documented happy-path use case) were rejected with 400 `remote_scan_too_large` even when BQ's own dry-run reported single-digit MB. Now the guardrail rewrites the user SQL from DuckDB-flavor (bare registered names + `bq."<ds>"."<tbl>"`) to BQ-native (`` `<project>.<ds>.<tbl>` ``) and runs ONE dry-run on the EXACT user SQL — partition pruning, column projection, and predicate pushdown all engage. Cap check uses the real estimate. Fallback: if BQ rejects the rewritten SQL with `bq_bad_request` (DuckDB-only syntax that doesn't translate, e.g. `::INT` casts), the guardrail falls back to the pre-fix per-table SELECT * estimate so a non-portable query still gets bounded; non-parse errors (forbidden / upstream) propagate as 502. Helpers exported as `_rewrite_user_sql_for_bq_dry_run` (test seam).
|
||||
- **Windows: `agnes` CLI no longer crashes on cs-CZ / non-UTF-8 consoles.** Two failure modes addressed (originally reported in #172 against the pre-rename `da` CLI; ported and broadened here): (1) `agnes pull` and any other Rich-progress-bar codepath crashed with `UnicodeEncodeError` because cp1250 / cp1252 cannot encode Rich's Braille spinner glyphs — `cli/main.py` now reconfigures `sys.stdout` / `sys.stderr` to UTF-8 with `errors="replace"` at import time when `sys.platform == "win32"`. (2) `agnes skills list` and `agnes skills show` crashed with `UnicodeDecodeError` reading skill markdown that contains em-dashes / accents — every `Path.read_text()` / `Path.write_text()` / `open()` call site in `cli/` (including ones not touched by #172, since several files were renamed in the bootstrap rewrite) now passes `encoding="utf-8"` explicitly. Defensive: also covers JSON / YAML config files that were ASCII-only in practice but were one non-ASCII value away from the same failure mode.
|
||||
|
|
|
|||
|
|
@ -39,8 +39,11 @@ def status(
|
|||
if db_path.exists():
|
||||
last_synced = datetime.fromtimestamp(db_path.stat().st_mtime, tz=timezone.utc).isoformat()
|
||||
|
||||
sessions_dir = workspace / "user" / "sessions"
|
||||
session_count = len(list(sessions_dir.glob("*.jsonl"))) if sessions_dir.exists() else 0
|
||||
# Sessions live in ~/.claude/projects/<encoded-cwd>/ (where Claude Code
|
||||
# writes them), with `<workspace>/user/sessions/` as a legacy fallback.
|
||||
# The helper unions both — same source of truth as `agnes push`.
|
||||
from cli.lib.claude_sessions import list_session_files
|
||||
session_count = len(list_session_files(workspace))
|
||||
|
||||
info = {
|
||||
"workspace": str(workspace),
|
||||
|
|
|
|||
|
|
@ -2,8 +2,45 @@
|
|||
|
||||
import json
|
||||
import os
|
||||
from contextlib import contextmanager
|
||||
from contextvars import ContextVar
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
from typing import Iterator, Optional
|
||||
|
||||
|
||||
# In-process override for `get_token()`. Used by `agnes init --token X` and
|
||||
# `agnes auth import-token` to force a specific token for the duration of a
|
||||
# scoped block, EVEN WHEN `~/.config/agnes/token.json` already holds a
|
||||
# different (possibly stale) token. Without this override, `get_token()`
|
||||
# reads the on-disk token first and the explicit `--token` argument is
|
||||
# silently ignored — the bug Devin Review caught at cli/commands/init.py:99.
|
||||
#
|
||||
# A ContextVar is used (not a plain global) so concurrent callers — async
|
||||
# tasks, threads — each see their own override, and a leaked override in
|
||||
# one task can't corrupt another. `_token_override.set(...)` returns a
|
||||
# token used to reset; the `_with_token_override` context manager scopes it.
|
||||
_token_override: ContextVar[Optional[str]] = ContextVar(
|
||||
"agnes_cli_token_override", default=None,
|
||||
)
|
||||
|
||||
|
||||
@contextmanager
|
||||
def _with_token_override(token: Optional[str]) -> Iterator[None]:
|
||||
"""Set `_token_override` for the duration of the block.
|
||||
|
||||
`get_token()` checks the override BEFORE reading `token.json`, so any
|
||||
in-block call returns the supplied token regardless of on-disk state.
|
||||
Restores the prior override (if any) on exit so nested overrides nest
|
||||
correctly.
|
||||
"""
|
||||
if not token:
|
||||
yield
|
||||
return
|
||||
reset_token = _token_override.set(token)
|
||||
try:
|
||||
yield
|
||||
finally:
|
||||
_token_override.reset(reset_token)
|
||||
|
||||
|
||||
def _config_dir() -> Path:
|
||||
|
|
@ -18,6 +55,12 @@ def get_server_url() -> str:
|
|||
|
||||
|
||||
def get_token() -> Optional[str]:
|
||||
# In-process override wins over BOTH the on-disk file and the env var.
|
||||
# Set by `_with_token_override(...)`; used by `agnes init --token X`
|
||||
# to force the explicit arg through the verify call even when a stale
|
||||
# `~/.config/agnes/token.json` exists.
|
||||
if (override := _token_override.get()) is not None:
|
||||
return override
|
||||
token_file = _config_dir() / "token.json"
|
||||
if token_file.exists():
|
||||
data = json.loads(token_file.read_text(encoding="utf-8"))
|
||||
|
|
|
|||
|
|
@ -67,28 +67,34 @@ _SAFE_ID_RE = re.compile(r"^[a-zA-Z0-9_\-]{1,128}$")
|
|||
|
||||
@contextmanager
|
||||
def _override_server_env(server_url: str, token: str) -> Iterator[None]:
|
||||
"""Set AGNES_SERVER / AGNES_TOKEN for the duration of the call.
|
||||
"""Set AGNES_SERVER + scoped token override for the duration of the call.
|
||||
|
||||
`cli.config.get_server_url` / `get_token` already honor these env vars,
|
||||
which is the same mechanism used in production. Restores prior values
|
||||
on exit so the caller's environment isn't mutated permanently.
|
||||
`cli.config.get_server_url` honors `AGNES_SERVER`, so the server URL is
|
||||
swapped via env-var. The TOKEN override is routed through
|
||||
`cli.config._with_token_override` (a ContextVar), which is checked by
|
||||
`get_token()` BEFORE the on-disk `~/.config/agnes/token.json`. This is
|
||||
load-bearing: `agnes init --token NEW` runs the verify call in step 2
|
||||
while the file still holds an OLD token from a prior install — without
|
||||
the override, the verify uses the stale on-disk token and fails 401.
|
||||
|
||||
Caveats:
|
||||
- **Token override is honored only when no `~/.config/agnes/token.json`
|
||||
exists.** `get_token()` reads the file first and only falls back to
|
||||
`AGNES_TOKEN`. `agnes init` writes `token.json` before calling
|
||||
`run_pull` so the values agree in production; isolated tests/callers
|
||||
that pass a different token must clear the on-disk token first.
|
||||
- **Not safe for concurrent invocation in the same process** — env-var
|
||||
swap is global. Single-threaded use only.
|
||||
`AGNES_TOKEN` env var is also set as a back-compat hint for any code
|
||||
path that bypasses `get_token()` (none in `cli/` at last audit, but
|
||||
third-party hooks may), but the contextvar is the authoritative source.
|
||||
|
||||
Restores prior values on exit so the caller's environment isn't
|
||||
mutated permanently. Not safe for concurrent invocation across threads;
|
||||
single-threaded use only.
|
||||
"""
|
||||
from cli.config import _with_token_override
|
||||
|
||||
prev_server = os.environ.get("AGNES_SERVER")
|
||||
prev_token = os.environ.get("AGNES_TOKEN")
|
||||
os.environ["AGNES_SERVER"] = server_url
|
||||
if token:
|
||||
os.environ["AGNES_TOKEN"] = token
|
||||
try:
|
||||
yield
|
||||
with _with_token_override(token):
|
||||
yield
|
||||
finally:
|
||||
if prev_server is None:
|
||||
os.environ.pop("AGNES_SERVER", None)
|
||||
|
|
|
|||
|
|
@ -106,10 +106,19 @@ def _try_acquire_file_lock(lock_path: Path):
|
|||
|
||||
Stale-lock reclaim: if the lock_path exists and its mtime is older
|
||||
than the configured TTL, log a warning and unlink before retrying.
|
||||
A live holder still wins the second flock attempt (kernel-level
|
||||
flock isn't tied to mtime), so the reclaim doesn't break correctness
|
||||
— it just unblocks the case where a holder process was hard-killed
|
||||
before the kernel released the lock."""
|
||||
|
||||
Caveat: ``lock_path.unlink()`` + the subsequent ``open()`` creates a
|
||||
NEW inode — fcntl.flock keys on inode, so a still-running holder's
|
||||
lock on the (now-orphan) old inode does NOT block the new acquisition.
|
||||
A genuine overrunning materialize past TTL therefore CAN race a
|
||||
fresh attempt and both can write to ``<id>.parquet.tmp``. The
|
||||
in-process ``threading.Lock`` keyed on ``table_id`` blocks that race
|
||||
within one scheduler process; cross-process protection (two schedulers
|
||||
on the same workspace) relies on operators not running multiple
|
||||
concurrent schedulers AND on the TTL being well above the longest
|
||||
plausible COPY (24 h default). If real corruption surfaces in
|
||||
production, the next iteration should attach a pid to the lock file
|
||||
and skip reclaim while the holder pid is alive."""
|
||||
lock_path.parent.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
def _try_open_and_flock():
|
||||
|
|
|
|||
|
|
@ -224,3 +224,95 @@ def test_init_manifest_unauthorized_when_pull_records_manifest_error(tmp_path, m
|
|||
output = result.output + (result.stderr or "")
|
||||
assert "Traceback" not in output
|
||||
assert ("manifest_unauthorized" in output) or ("Manifest fetch failed" in output)
|
||||
|
||||
|
||||
def test_init_uses_explicit_token_arg_not_stale_disk_token(tmp_path, monkeypatch):
|
||||
"""Regression for Devin Review finding on init.py:99.
|
||||
|
||||
Repro: a prior `agnes init` left a stale token in
|
||||
`~/.config/agnes/token.json`. The new run passes a fresh token via
|
||||
`--token`. Pre-fix, step 2's PAT-verify call read the on-disk token
|
||||
first and only fell back to the env var — so the explicit `--token`
|
||||
arg was silently ignored, the verify ran with the stale token, and
|
||||
init failed 401 with a confusing 'token expired' error even though
|
||||
the supplied token was valid.
|
||||
|
||||
Fix: a ContextVar-based override (set by `_override_server_env`)
|
||||
short-circuits `get_token()` BEFORE the on-disk read.
|
||||
"""
|
||||
import json
|
||||
from unittest.mock import MagicMock
|
||||
|
||||
cfg_dir = tmp_path / "_cfg"
|
||||
cfg_dir.mkdir()
|
||||
monkeypatch.setenv("AGNES_CONFIG_DIR", str(cfg_dir))
|
||||
|
||||
# Seed a stale token on disk — this is what the bug exposed: the verify
|
||||
# call would prefer this over the --token arg.
|
||||
token_file = cfg_dir / "token.json"
|
||||
token_file.write_text(json.dumps({
|
||||
"access_token": "STALE-DO-NOT-USE",
|
||||
"email": "old@example.com",
|
||||
}), encoding="utf-8")
|
||||
|
||||
captured = {"verify_token": None}
|
||||
|
||||
def _api_get(path, *args, **kwargs):
|
||||
# Verify endpoint: snapshot whatever token cli.config.get_token()
|
||||
# returns at the moment of the call. If the override is wired
|
||||
# correctly, this will be the --token arg, not the stale disk
|
||||
# value.
|
||||
if path == "/api/catalog/tables":
|
||||
from cli.config import get_token
|
||||
captured["verify_token"] = get_token()
|
||||
resp = MagicMock()
|
||||
resp.status_code = 200
|
||||
if path == "/api/welcome":
|
||||
resp.json.return_value = {"content": "# Test\n"}
|
||||
elif path == "/api/sync/manifest":
|
||||
resp.json.return_value = {"tables": {}}
|
||||
elif path == "/api/memory/bundle":
|
||||
resp.json.return_value = {"mandatory": [], "approved": []}
|
||||
else:
|
||||
resp.json.return_value = []
|
||||
return resp
|
||||
|
||||
monkeypatch.setattr("cli.commands.init.api_get", _api_get, raising=False)
|
||||
monkeypatch.setattr("cli.lib.pull.api_get", _api_get, raising=False)
|
||||
|
||||
result = runner.invoke(init_app, [
|
||||
"--server-url", "http://x",
|
||||
"--token", "FRESH-PAT-FROM-USER",
|
||||
"--workspace", str(tmp_path / "ws"),
|
||||
"--force",
|
||||
])
|
||||
|
||||
assert captured["verify_token"] == "FRESH-PAT-FROM-USER", (
|
||||
"Step 2 verify call must use the explicit --token arg, "
|
||||
f"not the stale on-disk token. Got: {captured['verify_token']!r}"
|
||||
)
|
||||
output = result.output + (result.stderr or "")
|
||||
assert "Traceback" not in output
|
||||
|
||||
|
||||
def test_token_override_contextvar_does_not_leak_outside_block():
|
||||
"""The override must be scoped to the `with` block — leaking it would
|
||||
poison subsequent `get_token()` calls (e.g. a long-running daemon
|
||||
that runs `agnes init` once and then `agnes pull` later in the same
|
||||
process)."""
|
||||
from cli.config import _with_token_override, get_token
|
||||
import os
|
||||
|
||||
# Sandbox AGNES_CONFIG_DIR so the test's own config dir doesn't muddy
|
||||
# the assertion (get_token would fall through to AGNES_TOKEN env or
|
||||
# to None depending on host state).
|
||||
prior_env = os.environ.pop("AGNES_TOKEN", None)
|
||||
try:
|
||||
with _with_token_override("INSIDE"):
|
||||
assert get_token() == "INSIDE"
|
||||
# Outside the block: override cleared, falls through to file/env.
|
||||
# Without a config file or AGNES_TOKEN set, returns None.
|
||||
assert get_token() != "INSIDE"
|
||||
finally:
|
||||
if prior_env is not None:
|
||||
os.environ["AGNES_TOKEN"] = prior_env
|
||||
|
|
|
|||
Loading…
Reference in a new issue