agnes-the-ai-analyst/src/store_guardrails/llm_review.py
Vojtech a694a30a5e
fix(store): surface review failures + harden publish gate (#316)
* fix(store): surface review failures + harden publish gate

Four independent fixes to the flea-market submission pipeline, all surfaced
by an admin upload that landed at status='approved' without an LLM review.

1. LLM truncation no longer pins submissions in review_error.
   - Raised MAX_RESPONSE_TOKENS 2500 → 6000 in llm_review.py
   - Added one-shot retry-with-doubled-budget in anthropic_provider.py
     (capped at 4× initial)

2. Flea detail page surfaces the latest submission's failure verdict even
   when a previously-approved version is still serving (deferred-promotion
   path). The _quarantine_banner gate widened from `visibility != approved`
   to also fire on `blocked_inline / blocked_llm / review_error`, with copy
   that distinguishes the v2+ edit case ("Latest edit failed review —
   previously approved version (vN) keeps serving") from the initial-upload
   quarantine wording.

3. Restore button + endpoint no longer allow restoring a version that was
   never approved. Added StoreEntitiesRepository.get_with_version_approvals
   joining store_submissions, gated the UI button on submission_status in
   ('approved', None), rendered status pills for non-restorable rows, and
   added a 400 version_not_approved guard in POST /restore.

4. **BREAKING (operator-facing)**: publish gate is now fail-CLOSED on
   misconfig. The previous get_guardrails_enabled() silently fell back to
   "disabled, auto-approve everything" when guardrails.enabled=true in YAML
   but no ANTHROPIC_API_KEY was in env. Split into:
     - get_guardrails_enabled()              (intent — YAML)
     - get_guardrails_llm_provider_ready()   (readiness — env)
   Three-state matrix:
     enabled=false                       → auto-approve (unchanged)
     enabled=true + ready=true           → normal pipeline (unchanged)
     enabled=true + ready=false (NEW)    → submissions hold at pending_llm
                                           awaiting admin retry or override
                                           (was: silent auto-approve)
   Admin "Retry review" eligibility broadened to include pending_llm.
   Boot-time WARNING banner surfaces the misconfig in app/main.py.
   docs/STORE_GUARDRAILS.md updated with the three-state matrix.
   Operators relying on the auto-fallback for local-dev no-LLM setups must
   now explicitly set `guardrails.enabled: false` in instance.yaml.

Tests: 4623 passed. Added TestPublishGateFailClosed (4 tests) and
TestRestoreVersion::test_restore_rejects_* (3 tests). conftest.py adds an
autouse fixture defaulting guardrails OFF so legacy tests don't need to
know about the new toggle.

* fix(store): admin override promotes v2+ edits to current

The override handler at app/api/admin.py:3708 only flipped submission
status → 'overridden' and entity visibility → 'approved'. Under the v37+
deferred-promotion model that's insufficient for v2+ edits / restores:
the new bundle sits in versions/v<N>/plugin/ and the entity row stays at
the prior approved version_no + hash + on-disk live bundle. Installers
kept getting the OLD bytes the admin had just intended to replace.

Mirror the runner.run_llm_review auto-approval branch: look up the
submission's version_hash in entity.version_history, and if its `n`
differs from entity.version_no, promote_version + _swap_live_to_version.
Initial v1 overrides are unaffected — the loop finds n=1 == version_no
and skips promotion.

Tests:
- test_override_v2_edit_promotes_to_current: stage v1 approved + v2
  blocked_llm; override the v2 sub; assert entity.version_no=2,
  entity.version flips off the v1 hash, and the live plugin/ dir
  mirrors versions/v2/plugin/.
- test_override_v1_initial_upload_no_promote: regression guard so the
  promote loop doesn't accidentally bump a v1 override.

Audit log gains a promoted_to_version_no field on the override action.

* fix(store): retry/rescan review staged bundle; override forward-only

Two adversarial-review findings from a Codex pass on the publish-gate
work.

C1. Admin retry + rescan were passing live `plugin/` to the LLM. For a
v2+ submission held at `pending_llm` / `blocked_llm` / `review_error`,
live still holds the prior approved version's bytes — so the LLM
reviewed the WRONG bytes, and the runner's hash-match promotion in
`run_llm_review` would then advance the entity to staged bytes that
were never actually reviewed. Resolve the staged
`<entity>/versions/v<N>/plugin/` from the submission's
`version_history` entry, with a fall-back to live for legacy pre-v37
rows that never seeded a versions/ dir. Helpers
`_submission_plugin_dir` and `_version_no_for_submission` added to
`app/api/store.py` so override / retry / rescan share one path.

H1. Override's promote loop used `target != current`, which would
silently demote the live bundle when admin overrode a stale v2
submission while v3 was already approved + live. Changed to
`target > current` so override flips status + visibility on the row
regardless, but on-disk promotion only fires forward. Same `>`
defensive guard applied in `runner.run_llm_review` so a late LLM
verdict racing with a newer approval can't demote either.

Tests:
- TestAdminRetryReviewsStagedBundle::test_retry_v2_blocked_passes_staged_dir_not_live
- TestAdminRetryReviewsStagedBundle::test_rescan_v2_blocked_passes_staged_dir_not_live
- TestOverrideForwardOnly::test_override_stale_v2_does_not_demote_when_v3_current

* review polish: CHANGELOG drift, override eligibility, defensive copy

Three small additions on top of the retry/rescan staged-bundle fix:

1. CHANGELOG: the PR's bullets had drifted into the released
   [0.54.17] section during rebase (context-match landed them next
   to already-released content). Moved them up to [Unreleased] where
   they belong; [0.54.17] now holds only what was actually released
   (refresh-marketplace ls-remote, /me/activity hero, CI sharding +
   workflow polish).

2. app/api/admin.py: admin override eligibility now accepts
   pending_llm alongside blocked_inline + blocked_llm + review_error.
   Closes a UX gap from the new fail-CLOSED behavior: under
   enabled-but-not-ready, a known-good submission would otherwise
   sit indefinitely until the admin set credentials AND clicked
   Retry. Override already routes through version_history (and is
   now forward-only on promote), so it stays safe for v2+ deferred-
   promotion submissions.

3. src/repositories/store_entities.py: get_with_version_approvals
   defensively copies each version_history entry before annotating
   with submission_status. self.get() re-parses JSON each call today
   so this is belt-and-suspenders against any future caching layer
   leaking the annotated key into a subsequent plain get() call.

Tests: 112 passed (focused on test_store_entity_versions +
test_admin_store_submissions, covering the retry/rescan staged-
bundle fix the author shipped + this polish).

---------

Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
2026-05-15 15:52:07 +02:00

223 lines
8.8 KiB
Python

"""LLM security review — agentic verdict over the uploaded bundle.
Mirrors the corporate-memory extraction pattern: builds a prompt, calls
``StructuredExtractor.extract_json`` against a strict JSON schema, parses
the result. The model tier is configurable per-instance via
``guardrails.review_model`` (Haiku / Sonnet / Opus) — see
``app/instance_config.get_guardrails_review_model``.
Cost cap: single-shot, MAX_REVIEW_BYTES content payload, max_tokens
budget tuned for the schema. Retries inside the extractor handle
transient errors; a hard timeout (configured at the connector level)
bounds wall-clock cost.
"""
from __future__ import annotations
import logging
from pathlib import Path
from typing import Any, Dict, Optional
from connectors.llm.anthropic_provider import AnthropicExtractor
from connectors.llm.exceptions import (
LLMError,
)
from .prompts import (
REVIEW_JSON_SCHEMA,
SYSTEM_PROMPT,
build_review_prompt,
)
logger = logging.getLogger(__name__)
# Bound the response budget. The schema's two arrays (findings +
# content_quality.issues) are individually capped at maxItems=20, but
# each item is ~120-180 tokens (severity/category/file/explanation/
# fix_hint or file/field/issue/hint). A bundle with many weak
# descriptions can easily hit 4-5k output tokens. Stay generous on
# Haiku/Sonnet — output cost is negligible compared to the cost of a
# truncated verdict pinning the submission in `review_error`.
MAX_RESPONSE_TOKENS = 6000
def review_bundle(
plugin_dir: Path,
*,
type_: str,
name: str,
version: str,
description: Optional[str],
api_key: str,
model: str,
) -> Dict[str, Any]:
"""Run the LLM review against the baked plugin tree.
Returns a dict with the schema:
``{risk_level, summary, findings[], template_placeholders_found,
reviewed_by_model, error}``
``error`` is set only when the LLM call itself failed — the runner
surfaces this as ``status='review_error'`` and exposes a retry path
in the admin UI. On success ``error`` is None and the schema fields
are populated by the model.
"""
user_payload = build_review_prompt(
plugin_dir,
type_=type_,
name=name,
version=version,
description=description,
)
extractor = AnthropicExtractor(api_key=api_key, model=model)
try:
# Pass SYSTEM_PROMPT via the SDK's separate ``system=`` parameter
# so a crafted README inside the uploaded bundle cannot override
# the reviewer rules. The user-content payload wraps the bundle
# files in <bundle>...</bundle> sentinels per the trust-boundary
# paragraph in SYSTEM_PROMPT.
result = extractor.extract_json(
prompt=user_payload,
system=SYSTEM_PROMPT,
max_tokens=MAX_RESPONSE_TOKENS,
json_schema=REVIEW_JSON_SCHEMA,
schema_name="store_guardrails_review",
)
except LLMError as e:
# Bubble up as a structured error the runner can persist into
# store_submissions.llm_findings — operators see the failure
# type without having to grep logs.
logger.warning(
"LLM guardrail review failed for %s/%s@%s: %s",
type_, name, version, type(e).__name__,
)
return {
"risk_level": None,
"summary": None,
"findings": [],
"template_placeholders_found": 0,
"reviewed_by_model": model,
"error": f"{type(e).__name__}: {e}",
}
except Exception as e:
# Catch-all: AnthropicExtractor only translates the four transient
# error classes into LLMError. Anything else (BadRequestError on
# schema mismatch, NotFoundError on a stale model ID, library
# bugs) would otherwise bubble through BackgroundTasks and land
# the request in the unhandled-error path while the submission
# row stays stuck in 'pending_llm' forever. Convert here so the
# admin queue surfaces a real review_error with a retry button.
logger.exception(
"LLM guardrail review unexpected error for %s/%s@%s",
type_, name, version,
)
return {
"risk_level": None,
"summary": None,
"findings": [],
"template_placeholders_found": 0,
"reviewed_by_model": model,
"error": f"{type(e).__name__}: {e}",
}
# Defensive: ensure the keys we rely on exist with sane defaults even
# if the model returns the optional ones empty. ``risk_level`` is
# special-cased — defaulting it to "medium" would silently look like
# a model decision and trigger an implicit block. Surface as an error
# so the runner persists `status='review_error'` and the admin sees
# a retry button.
risk_level = result.get("risk_level")
content_quality = _normalize_content_quality(result.get("content_quality"))
if not risk_level:
return {
"risk_level": None,
"summary": result.get("summary") or "",
"findings": result.get("findings") or [],
"template_placeholders_found": int(result.get("template_placeholders_found") or 0),
"content_quality": content_quality,
"reviewed_by_model": model,
"error": "missing_risk_level",
}
return {
"risk_level": risk_level,
"summary": result.get("summary") or "",
"findings": result.get("findings") or [],
"template_placeholders_found": int(result.get("template_placeholders_found") or 0),
"content_quality": content_quality,
"reviewed_by_model": model,
"error": None,
}
def _normalize_content_quality(value: Any) -> Dict[str, Any]:
"""Coerce the model's content_quality output to a stable shape.
Missing or malformed content_quality is treated as pass — keeps
backwards compatibility with older recorded verdicts and ensures a
weird LLM response can't accidentally block all submissions. The
safe-by-default-on-empty stance is intentional: hard blocking is
the mechanical tier's job; the LLM tier is the substantive
judgement layer.
The verdict is treated as an aggregate of the evidence: if the
model said `fail` with empty issues we downgrade to `pass` (no
visible reason to block); if it said `pass` with non-empty issues
we promote to `fail` (defense in depth — a compromised or
prompt-injected model that flipped the verdict without zeroing the
issues would otherwise sneak through). #277 LOW #2.
"""
if not isinstance(value, dict):
return {"verdict": "pass", "issues": []}
verdict = value.get("verdict")
if verdict not in {"pass", "fail"}:
verdict = "pass"
issues_raw = value.get("issues") or []
issues: list = []
if isinstance(issues_raw, list):
for item in issues_raw:
if not isinstance(item, dict):
continue
issues.append({
"file": str(item.get("file") or ""),
"field": str(item.get("field") or "frontmatter.description"),
"issue": str(item.get("issue") or ""),
"hint": str(item.get("hint") or ""),
})
# If verdict claims fail but no issues were enumerated, downgrade to
# pass — we'd otherwise block a submission with no rendered reason.
if verdict == "fail" and not issues:
verdict = "pass"
# Symmetric defense: if the model emitted issues but said pass,
# promote to fail. The verdict must aggregate the evidence; a
# compromised or prompt-injected model that flips the verdict
# without zeroing the issues list would otherwise sneak through.
# Issue #277 LOW finding #2.
if verdict == "pass" and issues:
verdict = "fail"
return {"verdict": verdict, "issues": issues}
def is_safe(verdict: Dict[str, Any]) -> bool:
"""Decide whether a review verdict permits publication.
Pass condition: ``risk_level IN ('safe','low')`` AND no individual
finding has severity ``high|critical`` AND ``content_quality.verdict
== 'pass'``. We intentionally let ``medium`` findings through with a
low-risk verdict — the model uses medium for "would benefit from
review but no immediate exploit". Operators escalate to
Sonnet/Opus if they want a stricter floor. Content quality is a
hard gate: weak descriptions block, no severity scale.
"""
if verdict.get("error"):
return False
risk = (verdict.get("risk_level") or "").lower()
if risk not in {"safe", "low"}:
return False
for finding in verdict.get("findings") or []:
sev = (finding.get("severity") or "").lower()
if sev in {"high", "critical"}:
return False
content_quality = verdict.get("content_quality") or {}
if (content_quality.get("verdict") or "pass").lower() == "fail":
return False
return True