Reading `Overall: degraded` on a fresh analyst install — driven by server-side operator warnings (stale tables, session-pipeline cadence, BQ billing-project config) that the analyst can't act on — erodes trust in the install. The role-aware headline routes operator-only warnings through a secondary line so they're not invisible, but they no longer drive the headline an analyst sees. Server-side (`app/api/health.py`): - Per-check `audience: "analyst" | "operator"` tag on every entry in `/api/health/detailed` services dict. - New top-level `caller_role` field (derived from `user.is_admin`) so the client knows which aggregation to display. - New top-level `overall_analyst` field — analyst-only aggregation available to clients that don't want to recompute it. Client-side (`cli/commands/diagnose.py`): - When the server reports `caller_role`, analyst aggregations exclude audience=operator checks from the headline. Analyst-side warnings AND server-side errors still escalate (errors are universal). - Secondary line surfaces operator warning count so they're visible: "Overall: healthy (analyst-side); 2 operator-side warnings". - Admin/operator role auto-promotes to full aggregation; analysts can manually opt in via `--include-operator-checks` flag. - Legacy servers without `caller_role` keep the pre-#345-B full aggregation — no silent regression against older deployments. Audience defaults (`_AUDIENCE` map in health.py): - analyst: duckdb_state - operator: db_schema, data, users, bq_config, session_pipeline Tests: 4 new in TestAnalystAudienceFilter (analyst-only filtering, admin auto-promote, --include-operator-checks opt-in, legacy server fallback). 26/26 diagnose + health tests pass.
This commit is contained in:
parent
82a4b663d7
commit
0a16e8f44e
4 changed files with 202 additions and 20 deletions
|
|
@ -10,6 +10,9 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C
|
||||||
|
|
||||||
## [Unreleased]
|
## [Unreleased]
|
||||||
|
|
||||||
|
### Changed
|
||||||
|
- `agnes diagnose` is now role-aware. A fresh analyst install no longer reports `Overall: degraded` just because the server has operator-side warnings (stale tables, session-pipeline cadence, BQ billing-project config) that the analyst can't act on. Server (`/api/health/detailed`) tags every check with `audience: "analyst" | "operator"` plus a top-level `caller_role` derived from `user.is_admin` and an `overall_analyst` aggregation. Client excludes operator checks from the headline for analyst callers, surfaces operator warning count on a secondary line so they stay visible, auto-promotes admin/operator callers to the full aggregation, and lets analysts opt in via `--include-operator-checks`. Legacy servers (no `caller_role`) keep the pre-#345-B full aggregation — no silent regression. Closes #345 B.
|
||||||
|
|
||||||
### Added
|
### Added
|
||||||
- `AGNES_MARKETPLACE_URL` env override for `agnes refresh-marketplace --bootstrap`. Pre-fix the marketplace endpoint was hardcoded to `{server_host}/marketplace.git/`, which broke deployments that serve the marketplace from a different host than the API (reverse-proxy split, CDN-fronted marketplace). When set, the env var is parsed via `urlparse`; missing scheme or host fails fast with a clear error (operator misconfiguration surfaces immediately). The PAT injection / strip behavior is preserved on the override path. Default behavior unchanged when the env var is empty / unset. Closes #345 A.
|
- `AGNES_MARKETPLACE_URL` env override for `agnes refresh-marketplace --bootstrap`. Pre-fix the marketplace endpoint was hardcoded to `{server_host}/marketplace.git/`, which broke deployments that serve the marketplace from a different host than the API (reverse-proxy split, CDN-fronted marketplace). When set, the env var is parsed via `urlparse`; missing scheme or host fails fast with a clear error (operator misconfiguration surfaces immediately). The PAT injection / strip behavior is preserved on the override path. Default behavior unchanged when the env var is empty / unset. Closes #345 A.
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@ -422,24 +422,55 @@ async def health_check_detailed(
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
checks["session_pipeline"] = {"status": "unknown", "detail": str(e)}
|
checks["session_pipeline"] = {"status": "unknown", "detail": str(e)}
|
||||||
|
|
||||||
# Aggregate to overall status. `info` and `unknown` surface in the
|
# Per-check audience tagging (issue #345 B): which checks does an
|
||||||
# response but never escalate the headline (issue #178). `warning`
|
# analyst care about vs. which are operator-only? An analyst seeing
|
||||||
# promotes to `degraded`; `error` (or a schema mismatch when the
|
# `Overall: degraded` because the scheduler is behind on session
|
||||||
# caller asked for it) promotes to `unhealthy`.
|
# extraction has no way to act on it — the warning is real, but it's
|
||||||
overall = "healthy"
|
# not their warning. ``audience`` lets clients compute a role-aware
|
||||||
for check in checks.values():
|
# headline without dropping checks from the response payload.
|
||||||
|
_AUDIENCE = {
|
||||||
|
"duckdb_state": "analyst", # query path; visible everywhere
|
||||||
|
"db_schema": "operator", # schema migration sanity
|
||||||
|
"data": "operator", # stale tables — admin scheduler
|
||||||
|
"users": "operator", # admin-side ops only
|
||||||
|
"bq_config": "operator", # cluster-level config
|
||||||
|
"session_pipeline": "operator", # admin-driven cadence
|
||||||
|
}
|
||||||
|
for name, check in checks.items():
|
||||||
|
check.setdefault("audience", _AUDIENCE.get(name, "operator"))
|
||||||
|
|
||||||
|
def _aggregate(items) -> str:
|
||||||
|
out = "healthy"
|
||||||
|
for check in items:
|
||||||
if check.get("status") == "error":
|
if check.get("status") == "error":
|
||||||
overall = "unhealthy"
|
return "unhealthy"
|
||||||
break
|
|
||||||
if check.get("status") == "warning":
|
if check.get("status") == "warning":
|
||||||
overall = "degraded"
|
out = "degraded"
|
||||||
# Schema mismatch only escalates when the caller asked for the check
|
return out
|
||||||
|
|
||||||
|
# Schema mismatch escalates only when the caller asked for the check
|
||||||
# — otherwise the absent key is treated as "not asserted".
|
# — otherwise the absent key is treated as "not asserted".
|
||||||
|
def _apply_schema_escalation(status: str) -> str:
|
||||||
if "db_schema" in checks and checks["db_schema"].get("db_schema") != "ok":
|
if "db_schema" in checks and checks["db_schema"].get("db_schema") != "ok":
|
||||||
overall = "unhealthy"
|
return "unhealthy"
|
||||||
|
return status
|
||||||
|
|
||||||
|
overall = _apply_schema_escalation(_aggregate(checks.values()))
|
||||||
|
overall_analyst = _aggregate(
|
||||||
|
c for c in checks.values() if c.get("audience") == "analyst"
|
||||||
|
)
|
||||||
|
# db_schema is audience=operator, so it never escalates the analyst
|
||||||
|
# headline — that's the intent.
|
||||||
|
|
||||||
|
# Caller role surface: clients pick the headline they should display
|
||||||
|
# based on their own privilege. Admins/operators see ``overall``;
|
||||||
|
# analysts see ``overall_analyst`` and an optional flag to opt in.
|
||||||
|
caller_role = "admin" if _user.get("is_admin") else "analyst"
|
||||||
|
|
||||||
return {
|
return {
|
||||||
"status": overall,
|
"status": overall,
|
||||||
|
"overall_analyst": overall_analyst,
|
||||||
|
"caller_role": caller_role,
|
||||||
"version": os.environ.get("AGNES_VERSION", "dev"),
|
"version": os.environ.get("AGNES_VERSION", "dev"),
|
||||||
"channel": os.environ.get("RELEASE_CHANNEL", "dev"),
|
"channel": os.environ.get("RELEASE_CHANNEL", "dev"),
|
||||||
"image_tag": os.environ.get("AGNES_TAG", "unknown"),
|
"image_tag": os.environ.get("AGNES_TAG", "unknown"),
|
||||||
|
|
|
||||||
|
|
@ -2,6 +2,7 @@
|
||||||
|
|
||||||
import json
|
import json
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
|
from typing import Optional
|
||||||
|
|
||||||
import typer
|
import typer
|
||||||
|
|
||||||
|
|
@ -28,6 +29,20 @@ def diagnose(
|
||||||
"operator is verifying a migration."
|
"operator is verifying a migration."
|
||||||
),
|
),
|
||||||
),
|
),
|
||||||
|
include_operator_checks: bool = typer.Option(
|
||||||
|
False,
|
||||||
|
"--include-operator-checks",
|
||||||
|
help=(
|
||||||
|
"Aggregate the headline status across operator-side checks "
|
||||||
|
"(stale tables, session-pipeline cadence, BQ billing config) "
|
||||||
|
"in addition to analyst-side ones. Default off when the caller "
|
||||||
|
"is an analyst — those checks aren't actionable from a fresh "
|
||||||
|
"analyst install and reading `Overall: degraded` on first run "
|
||||||
|
"erodes trust in the install (issue #345 B). Admins/operators "
|
||||||
|
"auto-promote to the full headline based on the server-reported "
|
||||||
|
"caller_role."
|
||||||
|
),
|
||||||
|
),
|
||||||
):
|
):
|
||||||
"""Run comprehensive system diagnostics. AI-agent friendly output."""
|
"""Run comprehensive system diagnostics. AI-agent friendly output."""
|
||||||
# If a subcommand was invoked (e.g. `agnes diagnose system`), defer to it
|
# If a subcommand was invoked (e.g. `agnes diagnose system`), defer to it
|
||||||
|
|
@ -36,18 +51,26 @@ def diagnose(
|
||||||
return
|
return
|
||||||
|
|
||||||
checks = []
|
checks = []
|
||||||
|
# ``caller_role`` is present only on servers shipping the
|
||||||
|
# role-aware health fields (issue #345 B). Legacy servers don't
|
||||||
|
# ship it; absent role disables audience filtering so we don't
|
||||||
|
# regress against an older server with the full-aggregation
|
||||||
|
# contract the rest of the CLI was written against.
|
||||||
|
caller_role: Optional[str] = None
|
||||||
|
|
||||||
# 1. API reachability
|
# 1. API reachability
|
||||||
try:
|
try:
|
||||||
resp = api_get("/api/health")
|
resp = api_get("/api/health")
|
||||||
health = resp.json()
|
health = resp.json()
|
||||||
checks.append({"name": "api", "status": "ok", "latency_ms": resp.elapsed.total_seconds() * 1000})
|
checks.append({"name": "api", "status": "ok", "audience": "analyst", "latency_ms": resp.elapsed.total_seconds() * 1000})
|
||||||
|
|
||||||
# Detailed health (auth required) for service-level checks
|
# Detailed health (auth required) for service-level checks
|
||||||
try:
|
try:
|
||||||
params = {"include": "schema"} if include_schema else None
|
params = {"include": "schema"} if include_schema else None
|
||||||
resp_d = api_get("/api/health/detailed", params=params)
|
resp_d = api_get("/api/health/detailed", params=params)
|
||||||
detailed = resp_d.json()
|
detailed = resp_d.json()
|
||||||
|
if "caller_role" in detailed:
|
||||||
|
caller_role = detailed["caller_role"]
|
||||||
for svc_name, svc_data in detailed.get("services", {}).items():
|
for svc_name, svc_data in detailed.get("services", {}).items():
|
||||||
check = {"name": svc_name, "status": svc_data.get("status", "unknown")}
|
check = {"name": svc_name, "status": svc_data.get("status", "unknown")}
|
||||||
check.update({k: v for k, v in svc_data.items() if k != "status"})
|
check.update({k: v for k, v in svc_data.items() if k != "status"})
|
||||||
|
|
@ -56,20 +79,32 @@ def diagnose(
|
||||||
# Auth may not be configured — minimal reachability is sufficient
|
# Auth may not be configured — minimal reachability is sufficient
|
||||||
pass
|
pass
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
checks.append({"name": "api", "status": "error", "detail": str(e)})
|
checks.append({"name": "api", "status": "error", "audience": "analyst", "detail": str(e)})
|
||||||
|
|
||||||
# Issue #244: detect silently-broken capture-session by comparing
|
# Issue #244: detect silently-broken capture-session by comparing
|
||||||
# observed SessionStart files against the uploaded-log entries.
|
# observed SessionStart files against the uploaded-log entries.
|
||||||
# Adds one entry to `checks` with status ok / warning / info.
|
# Adds one entry to `checks` with status ok / warning / info.
|
||||||
try:
|
try:
|
||||||
checks.append(capture_session_health(Path.cwd()))
|
cap = capture_session_health(Path.cwd())
|
||||||
|
cap.setdefault("audience", "analyst")
|
||||||
|
checks.append(cap)
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
checks.append({"name": "capture-session", "status": "info", "detail": f"health check failed: {e}"})
|
checks.append({"name": "capture-session", "status": "info", "audience": "analyst", "detail": f"health check failed: {e}"})
|
||||||
|
|
||||||
# Determine overall — `info` and `unknown` surface in the per-check
|
# Determine overall — `info` and `unknown` surface in the per-check
|
||||||
# output but never promote the headline (issue #178).
|
# output but never promote the headline (issue #178).
|
||||||
|
#
|
||||||
|
# Audience-aware headline (issue #345 B): when the server reports a
|
||||||
|
# ``caller_role``, analysts see analyst-only aggregation by default;
|
||||||
|
# operators auto-promote to the full headline; analysts can manually
|
||||||
|
# opt in via ``--include-operator-checks``. Legacy servers that don't
|
||||||
|
# ship ``caller_role`` keep the original full-aggregation behaviour
|
||||||
|
# — no analyst-only filtering until the server tags checks.
|
||||||
|
role_aware = caller_role is not None
|
||||||
|
operator_mode = (not role_aware) or include_operator_checks or caller_role != "analyst"
|
||||||
|
relevant = checks if operator_mode else [c for c in checks if c.get("audience") == "analyst"]
|
||||||
overall = "healthy"
|
overall = "healthy"
|
||||||
for c in checks:
|
for c in relevant:
|
||||||
if c["status"] == "error":
|
if c["status"] == "error":
|
||||||
overall = "unhealthy"
|
overall = "unhealthy"
|
||||||
break
|
break
|
||||||
|
|
@ -93,12 +128,27 @@ def diagnose(
|
||||||
|
|
||||||
result = {
|
result = {
|
||||||
"overall": overall,
|
"overall": overall,
|
||||||
|
"caller_role": caller_role,
|
||||||
"checks": checks,
|
"checks": checks,
|
||||||
"suggested_actions": actions,
|
"suggested_actions": actions,
|
||||||
}
|
}
|
||||||
|
|
||||||
if as_json:
|
if as_json:
|
||||||
typer.echo(json.dumps(result, indent=2))
|
typer.echo(json.dumps(result, indent=2))
|
||||||
|
else:
|
||||||
|
# When analysts are filtered to analyst-only aggregation, surface
|
||||||
|
# any operator-side warnings as a secondary line so they're not
|
||||||
|
# invisible — they just don't get to drive the headline.
|
||||||
|
operator_warns = [
|
||||||
|
c for c in checks
|
||||||
|
if c.get("audience") == "operator" and c.get("status") in ("warning", "error")
|
||||||
|
]
|
||||||
|
if not operator_mode and operator_warns:
|
||||||
|
typer.echo(
|
||||||
|
f"Overall: {overall} (analyst-side); "
|
||||||
|
f"{len(operator_warns)} operator-side "
|
||||||
|
f"{'warning' if len(operator_warns) == 1 else 'warnings'}"
|
||||||
|
)
|
||||||
else:
|
else:
|
||||||
typer.echo(f"Overall: {overall}")
|
typer.echo(f"Overall: {overall}")
|
||||||
for c in checks:
|
for c in checks:
|
||||||
|
|
|
||||||
|
|
@ -96,3 +96,101 @@ class TestDiagnoseJson:
|
||||||
data = json.loads(result.output)
|
data = json.loads(result.output)
|
||||||
api_check = next(c for c in data["checks"] if c["name"] == "api")
|
api_check = next(c for c in data["checks"] if c["name"] == "api")
|
||||||
assert "latency_ms" in api_check
|
assert "latency_ms" in api_check
|
||||||
|
|
||||||
|
|
||||||
|
class TestAnalystAudienceFilter:
|
||||||
|
"""Issue #345 B — analysts shouldn't see ``Overall: degraded`` on fresh
|
||||||
|
install just because the server has operator-level warnings (stale
|
||||||
|
tables, session-pipeline behind). The role-aware headline lets the
|
||||||
|
server tag each check with ``audience`` and report ``caller_role``;
|
||||||
|
the CLI filters the overall when caller is analyst.
|
||||||
|
"""
|
||||||
|
|
||||||
|
def test_analyst_sees_healthy_when_only_operator_checks_warn(self):
|
||||||
|
"""Analyst role + operator-side warning → ``Overall: healthy``
|
||||||
|
with a secondary line surfacing the operator warning count.
|
||||||
|
Pre-fix behaviour was ``Overall: degraded`` even though the
|
||||||
|
analyst can't act on the stale-tables warning."""
|
||||||
|
health = {
|
||||||
|
"caller_role": "analyst",
|
||||||
|
"services": {
|
||||||
|
"duckdb_state": {"status": "ok", "audience": "analyst"},
|
||||||
|
"data": {
|
||||||
|
"status": "warning",
|
||||||
|
"audience": "operator",
|
||||||
|
"stale_tables": ["orders"],
|
||||||
|
},
|
||||||
|
"session_pipeline": {
|
||||||
|
"status": "warning",
|
||||||
|
"audience": "operator",
|
||||||
|
"detail": "verification-detector behind by ~8.8h",
|
||||||
|
},
|
||||||
|
},
|
||||||
|
}
|
||||||
|
with patch("cli.commands.diagnose.api_get", return_value=_resp(200, health)):
|
||||||
|
result = runner.invoke(app, ["diagnose"])
|
||||||
|
assert result.exit_code == 0
|
||||||
|
# Headline is analyst-side healthy with a secondary count line
|
||||||
|
assert "healthy (analyst-side)" in result.output
|
||||||
|
assert "2 operator-side warnings" in result.output
|
||||||
|
# Per-check rows still show the warnings — we just don't escalate
|
||||||
|
assert "[warning] data" in result.output
|
||||||
|
|
||||||
|
def test_admin_caller_role_aggregates_full_set(self):
|
||||||
|
"""Admin/operator role auto-promotes to the full aggregation —
|
||||||
|
operator-side warnings DO escalate the headline."""
|
||||||
|
health = {
|
||||||
|
"caller_role": "admin",
|
||||||
|
"services": {
|
||||||
|
"duckdb_state": {"status": "ok", "audience": "analyst"},
|
||||||
|
"data": {
|
||||||
|
"status": "warning",
|
||||||
|
"audience": "operator",
|
||||||
|
"stale_tables": ["orders"],
|
||||||
|
},
|
||||||
|
},
|
||||||
|
}
|
||||||
|
with patch("cli.commands.diagnose.api_get", return_value=_resp(200, health)):
|
||||||
|
result = runner.invoke(app, ["diagnose"])
|
||||||
|
assert result.exit_code == 0
|
||||||
|
assert "degraded" in result.output.lower()
|
||||||
|
# No analyst-side qualifier — admin sees the full headline directly
|
||||||
|
assert "analyst-side" not in result.output
|
||||||
|
|
||||||
|
def test_analyst_with_include_operator_checks_flag(self):
|
||||||
|
"""``--include-operator-checks`` lets an analyst opt in to the
|
||||||
|
full aggregation when they actually want to see the operator
|
||||||
|
warnings drive the headline (e.g. when paging an operator)."""
|
||||||
|
health = {
|
||||||
|
"caller_role": "analyst",
|
||||||
|
"services": {
|
||||||
|
"duckdb_state": {"status": "ok", "audience": "analyst"},
|
||||||
|
"data": {
|
||||||
|
"status": "warning",
|
||||||
|
"audience": "operator",
|
||||||
|
"stale_tables": ["orders"],
|
||||||
|
},
|
||||||
|
},
|
||||||
|
}
|
||||||
|
with patch("cli.commands.diagnose.api_get", return_value=_resp(200, health)):
|
||||||
|
result = runner.invoke(app, ["diagnose", "--include-operator-checks"])
|
||||||
|
assert result.exit_code == 0
|
||||||
|
assert "degraded" in result.output.lower()
|
||||||
|
assert "analyst-side" not in result.output
|
||||||
|
|
||||||
|
def test_legacy_server_response_keeps_full_aggregation(self):
|
||||||
|
"""When the server doesn't ship ``caller_role`` (older deploy),
|
||||||
|
the CLI must NOT silently start filtering — that would regress
|
||||||
|
diagnoses against any pre-#345-B server. Test_diagnose_warning_service
|
||||||
|
above already covers the same shape; this test makes the contract
|
||||||
|
explicit."""
|
||||||
|
health = {
|
||||||
|
"services": {
|
||||||
|
"duckdb": {"status": "warning", "stale_tables": ["x"]},
|
||||||
|
},
|
||||||
|
# No caller_role → no role-aware filtering.
|
||||||
|
}
|
||||||
|
with patch("cli.commands.diagnose.api_get", return_value=_resp(200, health)):
|
||||||
|
result = runner.invoke(app, ["diagnose"])
|
||||||
|
assert result.exit_code == 0
|
||||||
|
assert "degraded" in result.output.lower()
|
||||||
|
|
|
||||||
Loading…
Reference in a new issue