From 0a16e8f44e4e4c0ef3d040431cec33425943822f Mon Sep 17 00:00:00 2001 From: ZdenekSrotyr <139972147+ZdenekSrotyr@users.noreply.github.com> Date: Tue, 19 May 2026 16:53:56 +0200 Subject: [PATCH] feat(diagnose): role-aware overall status (#345 B) (#355) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Reading `Overall: degraded` on a fresh analyst install — driven by server-side operator warnings (stale tables, session-pipeline cadence, BQ billing-project config) that the analyst can't act on — erodes trust in the install. The role-aware headline routes operator-only warnings through a secondary line so they're not invisible, but they no longer drive the headline an analyst sees. Server-side (`app/api/health.py`): - Per-check `audience: "analyst" | "operator"` tag on every entry in `/api/health/detailed` services dict. - New top-level `caller_role` field (derived from `user.is_admin`) so the client knows which aggregation to display. - New top-level `overall_analyst` field — analyst-only aggregation available to clients that don't want to recompute it. Client-side (`cli/commands/diagnose.py`): - When the server reports `caller_role`, analyst aggregations exclude audience=operator checks from the headline. Analyst-side warnings AND server-side errors still escalate (errors are universal). - Secondary line surfaces operator warning count so they're visible: "Overall: healthy (analyst-side); 2 operator-side warnings". - Admin/operator role auto-promotes to full aggregation; analysts can manually opt in via `--include-operator-checks` flag. - Legacy servers without `caller_role` keep the pre-#345-B full aggregation — no silent regression against older deployments. Audience defaults (`_AUDIENCE` map in health.py): - analyst: duckdb_state - operator: db_schema, data, users, bq_config, session_pipeline Tests: 4 new in TestAnalystAudienceFilter (analyst-only filtering, admin auto-promote, --include-operator-checks opt-in, legacy server fallback). 26/26 diagnose + health tests pass. --- CHANGELOG.md | 3 ++ app/api/health.py | 59 +++++++++++++++++------ cli/commands/diagnose.py | 62 +++++++++++++++++++++--- tests/test_cli_diagnose.py | 98 ++++++++++++++++++++++++++++++++++++++ 4 files changed, 202 insertions(+), 20 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index b262171..7063106 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -10,6 +10,9 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C ## [Unreleased] +### Changed +- `agnes diagnose` is now role-aware. A fresh analyst install no longer reports `Overall: degraded` just because the server has operator-side warnings (stale tables, session-pipeline cadence, BQ billing-project config) that the analyst can't act on. Server (`/api/health/detailed`) tags every check with `audience: "analyst" | "operator"` plus a top-level `caller_role` derived from `user.is_admin` and an `overall_analyst` aggregation. Client excludes operator checks from the headline for analyst callers, surfaces operator warning count on a secondary line so they stay visible, auto-promotes admin/operator callers to the full aggregation, and lets analysts opt in via `--include-operator-checks`. Legacy servers (no `caller_role`) keep the pre-#345-B full aggregation — no silent regression. Closes #345 B. + ### Added - `AGNES_MARKETPLACE_URL` env override for `agnes refresh-marketplace --bootstrap`. Pre-fix the marketplace endpoint was hardcoded to `{server_host}/marketplace.git/`, which broke deployments that serve the marketplace from a different host than the API (reverse-proxy split, CDN-fronted marketplace). When set, the env var is parsed via `urlparse`; missing scheme or host fails fast with a clear error (operator misconfiguration surfaces immediately). The PAT injection / strip behavior is preserved on the override path. Default behavior unchanged when the env var is empty / unset. Closes #345 A. diff --git a/app/api/health.py b/app/api/health.py index af10706..1f242e1 100644 --- a/app/api/health.py +++ b/app/api/health.py @@ -422,24 +422,55 @@ async def health_check_detailed( except Exception as e: checks["session_pipeline"] = {"status": "unknown", "detail": str(e)} - # Aggregate to overall status. `info` and `unknown` surface in the - # response but never escalate the headline (issue #178). `warning` - # promotes to `degraded`; `error` (or a schema mismatch when the - # caller asked for it) promotes to `unhealthy`. - overall = "healthy" - for check in checks.values(): - if check.get("status") == "error": - overall = "unhealthy" - break - if check.get("status") == "warning": - overall = "degraded" - # Schema mismatch only escalates when the caller asked for the check + # Per-check audience tagging (issue #345 B): which checks does an + # analyst care about vs. which are operator-only? An analyst seeing + # `Overall: degraded` because the scheduler is behind on session + # extraction has no way to act on it — the warning is real, but it's + # not their warning. ``audience`` lets clients compute a role-aware + # headline without dropping checks from the response payload. + _AUDIENCE = { + "duckdb_state": "analyst", # query path; visible everywhere + "db_schema": "operator", # schema migration sanity + "data": "operator", # stale tables — admin scheduler + "users": "operator", # admin-side ops only + "bq_config": "operator", # cluster-level config + "session_pipeline": "operator", # admin-driven cadence + } + for name, check in checks.items(): + check.setdefault("audience", _AUDIENCE.get(name, "operator")) + + def _aggregate(items) -> str: + out = "healthy" + for check in items: + if check.get("status") == "error": + return "unhealthy" + if check.get("status") == "warning": + out = "degraded" + return out + + # Schema mismatch escalates only when the caller asked for the check # — otherwise the absent key is treated as "not asserted". - if "db_schema" in checks and checks["db_schema"].get("db_schema") != "ok": - overall = "unhealthy" + def _apply_schema_escalation(status: str) -> str: + if "db_schema" in checks and checks["db_schema"].get("db_schema") != "ok": + return "unhealthy" + return status + + overall = _apply_schema_escalation(_aggregate(checks.values())) + overall_analyst = _aggregate( + c for c in checks.values() if c.get("audience") == "analyst" + ) + # db_schema is audience=operator, so it never escalates the analyst + # headline — that's the intent. + + # Caller role surface: clients pick the headline they should display + # based on their own privilege. Admins/operators see ``overall``; + # analysts see ``overall_analyst`` and an optional flag to opt in. + caller_role = "admin" if _user.get("is_admin") else "analyst" return { "status": overall, + "overall_analyst": overall_analyst, + "caller_role": caller_role, "version": os.environ.get("AGNES_VERSION", "dev"), "channel": os.environ.get("RELEASE_CHANNEL", "dev"), "image_tag": os.environ.get("AGNES_TAG", "unknown"), diff --git a/cli/commands/diagnose.py b/cli/commands/diagnose.py index 05d7e96..c902161 100644 --- a/cli/commands/diagnose.py +++ b/cli/commands/diagnose.py @@ -2,6 +2,7 @@ import json from pathlib import Path +from typing import Optional import typer @@ -28,6 +29,20 @@ def diagnose( "operator is verifying a migration." ), ), + include_operator_checks: bool = typer.Option( + False, + "--include-operator-checks", + help=( + "Aggregate the headline status across operator-side checks " + "(stale tables, session-pipeline cadence, BQ billing config) " + "in addition to analyst-side ones. Default off when the caller " + "is an analyst — those checks aren't actionable from a fresh " + "analyst install and reading `Overall: degraded` on first run " + "erodes trust in the install (issue #345 B). Admins/operators " + "auto-promote to the full headline based on the server-reported " + "caller_role." + ), + ), ): """Run comprehensive system diagnostics. AI-agent friendly output.""" # If a subcommand was invoked (e.g. `agnes diagnose system`), defer to it @@ -36,18 +51,26 @@ def diagnose( return checks = [] + # ``caller_role`` is present only on servers shipping the + # role-aware health fields (issue #345 B). Legacy servers don't + # ship it; absent role disables audience filtering so we don't + # regress against an older server with the full-aggregation + # contract the rest of the CLI was written against. + caller_role: Optional[str] = None # 1. API reachability try: resp = api_get("/api/health") health = resp.json() - checks.append({"name": "api", "status": "ok", "latency_ms": resp.elapsed.total_seconds() * 1000}) + checks.append({"name": "api", "status": "ok", "audience": "analyst", "latency_ms": resp.elapsed.total_seconds() * 1000}) # Detailed health (auth required) for service-level checks try: params = {"include": "schema"} if include_schema else None resp_d = api_get("/api/health/detailed", params=params) detailed = resp_d.json() + if "caller_role" in detailed: + caller_role = detailed["caller_role"] for svc_name, svc_data in detailed.get("services", {}).items(): check = {"name": svc_name, "status": svc_data.get("status", "unknown")} check.update({k: v for k, v in svc_data.items() if k != "status"}) @@ -56,20 +79,32 @@ def diagnose( # Auth may not be configured — minimal reachability is sufficient pass except Exception as e: - checks.append({"name": "api", "status": "error", "detail": str(e)}) + checks.append({"name": "api", "status": "error", "audience": "analyst", "detail": str(e)}) # Issue #244: detect silently-broken capture-session by comparing # observed SessionStart files against the uploaded-log entries. # Adds one entry to `checks` with status ok / warning / info. try: - checks.append(capture_session_health(Path.cwd())) + cap = capture_session_health(Path.cwd()) + cap.setdefault("audience", "analyst") + checks.append(cap) except Exception as e: - checks.append({"name": "capture-session", "status": "info", "detail": f"health check failed: {e}"}) + checks.append({"name": "capture-session", "status": "info", "audience": "analyst", "detail": f"health check failed: {e}"}) # Determine overall — `info` and `unknown` surface in the per-check # output but never promote the headline (issue #178). + # + # Audience-aware headline (issue #345 B): when the server reports a + # ``caller_role``, analysts see analyst-only aggregation by default; + # operators auto-promote to the full headline; analysts can manually + # opt in via ``--include-operator-checks``. Legacy servers that don't + # ship ``caller_role`` keep the original full-aggregation behaviour + # — no analyst-only filtering until the server tags checks. + role_aware = caller_role is not None + operator_mode = (not role_aware) or include_operator_checks or caller_role != "analyst" + relevant = checks if operator_mode else [c for c in checks if c.get("audience") == "analyst"] overall = "healthy" - for c in checks: + for c in relevant: if c["status"] == "error": overall = "unhealthy" break @@ -93,6 +128,7 @@ def diagnose( result = { "overall": overall, + "caller_role": caller_role, "checks": checks, "suggested_actions": actions, } @@ -100,7 +136,21 @@ def diagnose( if as_json: typer.echo(json.dumps(result, indent=2)) else: - typer.echo(f"Overall: {overall}") + # When analysts are filtered to analyst-only aggregation, surface + # any operator-side warnings as a secondary line so they're not + # invisible — they just don't get to drive the headline. + operator_warns = [ + c for c in checks + if c.get("audience") == "operator" and c.get("status") in ("warning", "error") + ] + if not operator_mode and operator_warns: + typer.echo( + f"Overall: {overall} (analyst-side); " + f"{len(operator_warns)} operator-side " + f"{'warning' if len(operator_warns) == 1 else 'warnings'}" + ) + else: + typer.echo(f"Overall: {overall}") for c in checks: detail = "" if "detail" in c: diff --git a/tests/test_cli_diagnose.py b/tests/test_cli_diagnose.py index e993a15..87adad2 100644 --- a/tests/test_cli_diagnose.py +++ b/tests/test_cli_diagnose.py @@ -96,3 +96,101 @@ class TestDiagnoseJson: data = json.loads(result.output) api_check = next(c for c in data["checks"] if c["name"] == "api") assert "latency_ms" in api_check + + +class TestAnalystAudienceFilter: + """Issue #345 B — analysts shouldn't see ``Overall: degraded`` on fresh + install just because the server has operator-level warnings (stale + tables, session-pipeline behind). The role-aware headline lets the + server tag each check with ``audience`` and report ``caller_role``; + the CLI filters the overall when caller is analyst. + """ + + def test_analyst_sees_healthy_when_only_operator_checks_warn(self): + """Analyst role + operator-side warning → ``Overall: healthy`` + with a secondary line surfacing the operator warning count. + Pre-fix behaviour was ``Overall: degraded`` even though the + analyst can't act on the stale-tables warning.""" + health = { + "caller_role": "analyst", + "services": { + "duckdb_state": {"status": "ok", "audience": "analyst"}, + "data": { + "status": "warning", + "audience": "operator", + "stale_tables": ["orders"], + }, + "session_pipeline": { + "status": "warning", + "audience": "operator", + "detail": "verification-detector behind by ~8.8h", + }, + }, + } + with patch("cli.commands.diagnose.api_get", return_value=_resp(200, health)): + result = runner.invoke(app, ["diagnose"]) + assert result.exit_code == 0 + # Headline is analyst-side healthy with a secondary count line + assert "healthy (analyst-side)" in result.output + assert "2 operator-side warnings" in result.output + # Per-check rows still show the warnings — we just don't escalate + assert "[warning] data" in result.output + + def test_admin_caller_role_aggregates_full_set(self): + """Admin/operator role auto-promotes to the full aggregation — + operator-side warnings DO escalate the headline.""" + health = { + "caller_role": "admin", + "services": { + "duckdb_state": {"status": "ok", "audience": "analyst"}, + "data": { + "status": "warning", + "audience": "operator", + "stale_tables": ["orders"], + }, + }, + } + with patch("cli.commands.diagnose.api_get", return_value=_resp(200, health)): + result = runner.invoke(app, ["diagnose"]) + assert result.exit_code == 0 + assert "degraded" in result.output.lower() + # No analyst-side qualifier — admin sees the full headline directly + assert "analyst-side" not in result.output + + def test_analyst_with_include_operator_checks_flag(self): + """``--include-operator-checks`` lets an analyst opt in to the + full aggregation when they actually want to see the operator + warnings drive the headline (e.g. when paging an operator).""" + health = { + "caller_role": "analyst", + "services": { + "duckdb_state": {"status": "ok", "audience": "analyst"}, + "data": { + "status": "warning", + "audience": "operator", + "stale_tables": ["orders"], + }, + }, + } + with patch("cli.commands.diagnose.api_get", return_value=_resp(200, health)): + result = runner.invoke(app, ["diagnose", "--include-operator-checks"]) + assert result.exit_code == 0 + assert "degraded" in result.output.lower() + assert "analyst-side" not in result.output + + def test_legacy_server_response_keeps_full_aggregation(self): + """When the server doesn't ship ``caller_role`` (older deploy), + the CLI must NOT silently start filtering — that would regress + diagnoses against any pre-#345-B server. Test_diagnose_warning_service + above already covers the same shape; this test makes the contract + explicit.""" + health = { + "services": { + "duckdb": {"status": "warning", "stale_tables": ["x"]}, + }, + # No caller_role → no role-aware filtering. + } + with patch("cli.commands.diagnose.api_get", return_value=_resp(200, health)): + result = runner.invoke(app, ["diagnose"]) + assert result.exit_code == 0 + assert "degraded" in result.output.lower()