agnes-the-ai-analyst/tests/test_store_guardrails_content.py
Vojtech fb6e930bc9
feat(store-guardrails): per-component description quality + plain-language UX (#276)
* feat(store-guardrails): enforce per-component description quality

Two-tier hard guardrail on flea-market submissions. Empty / placeholder /
single-word descriptions now block before any LLM call; vague-but-passes-
floor descriptions block on the substantive LLM review layer.

Tier 1 — inline mechanical check (src/store_guardrails/content_check.py).
Walks the baked plugin tree, evaluates each component (plugin manifest,
agents, skills, commands) plus the submission-level form description
against a 60-char / 25-char (commands) / 5-distinct-word / 200-char-body
floor with a placeholder denylist (TODO, TBD, {{var}}, etc.). Floors
calibrated against real ecosystem norms: Claude / superpowers /
compound-engineering skill packs cluster 150–220 chars, npm / Docker /
VS Code at 100–120. InlineResult.passed now ANDs in content.status.

Tier 2 — LLM review extension (prompts.py + llm_review.py). System
prompt gains a content-quality criterion; REVIEW_JSON_SCHEMA carries a
content_quality {verdict, issues[]} object alongside the existing
security findings. is_safe() requires content_quality.verdict == 'pass'.
Single LLM call covers both dimensions. MAX_RESPONSE_TOKENS bumped
2000 → 2500 for the extra payload. Verdicts missing content_quality
treated as pass (backwards compat with already-recorded rows).

Submitter UX:
- /store/new wizard now carries a "Before you upload — what passes
  review" collapsible disclosure on both step 1 and step 2 with the
  bar + patterns that work. Live char counter on the description
  field. Per-component preview table (green/red dots from the new
  summarize_for_preview helper) renders after the ZIP /preview round
  trip, scoping each finding to its file.
- New /store/examples page with rejected/passes pairs for skill /
  agent / plugin / command plus a "Why these limits" research table.
  Anchored sections (#skill / #agent / #plugin / #command) so the
  rejection banner can deep-link by component_type.
- Quarantine banner _content_findings.html groups findings by file
  (one "See <type> example ↗" per component, not per field) and
  translates field codes (frontmatter.description / body / etc.) to
  plain-English labels. _content_howto_fix.html surfaces a static
  "Re-upload as new version" + "See examples" action row beneath any
  content failure on the entity detail page.
- _parse_frontmatter moved to src/store_guardrails/_frontmatter.py so
  the new check module shares the parser without inverting the
  app → src dependency direction.

Tests:
- New tests/test_store_guardrails_content.py (29 cases) covering
  every failure code per component type plus submission-level checks
  and the summarize_components / summarize_for_preview helpers.
- Extended test_store_guardrails_inline.py for the new
  InlineResult.content field + aggregate behaviour.
- Extended test_store_guardrails_llm.py for the new
  content_quality verdict pathways (fail blocks, missing field passes).
- Backfilled fixture descriptions across test_store_api.py,
  test_store_entity_versions.py, test_store_put_atomic.py,
  test_admin_store_submissions.py, test_marketplace_api.py,
  test_marketplace_v32_endpoints.py so existing happy-path tests
  clear the new 60-char floor.

* fix(content-guardrail): align agents walker with preview + drop import-time .format()

Two cleanups from the takeover review on #276 (vr/guardrails-content).

1) `_iter_components` for agents now skips files lacking frontmatter
   (no `name` AND no `description`). Pre-fix the walker greedily
   evaluated every `*.md` under `agents/` — `agents/README.md` and
   helper docs got flagged as "frontmatter.description empty"
   rejections. Worse: `summarize_for_preview` for `type=agent` ALREADY
   filters the same shape, so the upload preview gave a green dot
   while the post-bake check gave a red rejection on submit. Two new
   regression tests in TestAgentsWalkerSkipsNonAgentFiles pin both
   shapes (README + _NOTES.md) so the preview/check parity stays
   aligned.

2) `body_too_short` hints now use the same runtime-kwarg substitution
   pattern as every other hint in the table. Pre-fix the skill +
   agent body_too_short hints called `.format(min_chars=_MIN_BODY_CHARS)`
   at module-load time, but the call site `_hint_for(type_,
   "body_too_short")` didn't pass `min_chars=`, so the format() was
   just baking the constant at import. Cosmetic inconsistency; pass
   `min_chars=_MIN_BODY_CHARS` at the call site instead and let
   `_hint_for` do the substitution like it does for `too_short`.

Verified end-to-end:
- New TestAgentsWalkerSkipsNonAgentFiles cases fail on the unfixed
  walker (verified by reverting to the pre-fix file and re-running);
  pass cleanly after the fix.
- Full content-guardrail suite: 25/25 (23 existing + 2 new).
- Full pytest: 4189 passed, 25 skipped.

* release: 0.53.5 — content guardrail (flea-market submitter UX) + catalog ENTITY column + BQ hint dispatch

Bundles three threads landed in [Unreleased]:
- Vojta's flea-market content guardrail (two-tier mechanical + LLM)
- Zdeněk's `agnes catalog` ENTITY column replacement for FLAVOR
- Zdeněk's `/api/query` remote_estimate_failed hint dispatch fix

Plus the takeover hygiene from #276 review (agents walker preview/check
parity + body_too_short hint runtime kwarg consistency) and the
backslash-escape fix follow-up to v0.53.4 #275.

No DB migration; no API change. Patch upgrade lands transparently.
Upload form's new "Before you upload" disclosure + per-component preview
table appear on the next dev-VM auto-pull. Quarantine banner now groups
findings by file with "See <type> example ↗" deep-links to the new
/store/examples reference page.

---------

Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
2026-05-12 21:48:27 +02:00

302 lines
12 KiB
Python

"""Content-guardrail tests — the mechanical per-component description check.
Exercises every failure-mode code on a synthetic baked plugin tree:
empty, placeholder_text, too_short, low_word_count, body_too_short.
Also verifies the aggregate ``InlineResult.passed`` flips false when the
content tier fails even with manifest + security passing.
"""
from __future__ import annotations
import json
import shutil
import tempfile
from pathlib import Path
import pytest
from src.store_guardrails import run_inline_checks
from src.store_guardrails.content_check import (
check as content_check,
check_submission_description,
summarize_components,
summarize_for_preview,
)
_OK_DESC = "Use when validating per-component description guardrails end to end"
_OK_BODY = "Body content explaining what this component does, when to use it, and the constraints. " * 4
@pytest.fixture
def plugin_dir():
d = Path(tempfile.mkdtemp(prefix="agnes_content_test_"))
yield d
shutil.rmtree(d, ignore_errors=True)
def _write_skill(plugin_dir: Path, *, description: str = _OK_DESC, body: str = _OK_BODY) -> None:
target = plugin_dir / "skills" / "test-skill"
target.mkdir(parents=True, exist_ok=True)
(target / "SKILL.md").write_text(
f"---\nname: test-skill\ndescription: {description}\n---\n\n{body}\n",
encoding="utf-8",
)
def _write_agent(plugin_dir: Path, *, name: str = "reviewer", description: str = _OK_DESC, body: str = _OK_BODY) -> None:
target = plugin_dir / "agents"
target.mkdir(parents=True, exist_ok=True)
(target / f"{name}.md").write_text(
f"---\nname: {name}\ndescription: {description}\n---\n\n{body}\n",
encoding="utf-8",
)
def _write_plugin_json(plugin_dir: Path, *, description: str = _OK_DESC) -> None:
target = plugin_dir / ".claude-plugin"
target.mkdir(parents=True, exist_ok=True)
(target / "plugin.json").write_text(
json.dumps({"name": "test-plugin", "description": description, "version": "0.1.0"}),
encoding="utf-8",
)
def _write_command(plugin_dir: Path, *, name: str = "run", description: str = "Run the test suite and report failures") -> None:
target = plugin_dir / "commands"
target.mkdir(parents=True, exist_ok=True)
(target / f"{name}.md").write_text(
f"---\nname: {name}\ndescription: {description}\n---\n\nrun it\n",
encoding="utf-8",
)
# ---------------------------------------------------------------------------
# Component-level failure codes
# ---------------------------------------------------------------------------
class TestSkillDescriptions:
def test_empty_description_fails(self, plugin_dir):
_write_skill(plugin_dir, description="")
out = content_check(plugin_dir)
assert out["status"] == "fail"
codes = {i["code"] for i in out["issues"]}
assert "empty" in codes
def test_todo_literal_fails_as_placeholder(self, plugin_dir):
_write_skill(plugin_dir, description="TODO")
out = content_check(plugin_dir)
codes = {i["code"] for i in out["issues"]}
assert "placeholder_text" in codes
def test_todo_prefix_fails(self, plugin_dir):
_write_skill(plugin_dir, description="TODO add the real description later")
out = content_check(plugin_dir)
codes = {i["code"] for i in out["issues"]}
assert "placeholder_text" in codes
def test_short_description_fails(self, plugin_dir):
_write_skill(plugin_dir, description="too short here") # 14 chars
out = content_check(plugin_dir)
codes = {i["code"] for i in out["issues"]}
assert "too_short" in codes
def test_unfilled_jinja_placeholder_fails(self, plugin_dir):
_write_skill(plugin_dir, description="Use when {{my_skill}} fires")
out = content_check(plugin_dir)
codes = {i["code"] for i in out["issues"]}
assert "placeholder_text" in codes
def test_length_floor_takes_precedence_over_word_count(self, plugin_dir):
# 4 words but well under 30 chars.
_write_skill(plugin_dir, description="foo bar baz quux")
out = content_check(plugin_dir)
codes = {i["code"] for i in out["issues"]}
assert "too_short" in codes
def test_low_distinct_words_fails(self, plugin_dir):
# 80 chars clears the length floor but only 1 distinct word
# after stripping punctuation — low_word_count fires.
_write_skill(plugin_dir, description=("foo " * 20).strip())
out = content_check(plugin_dir)
codes = {i["code"] for i in out["issues"]}
assert "low_word_count" in codes
def test_body_too_short_fails(self, plugin_dir):
_write_skill(plugin_dir, body="short body")
out = content_check(plugin_dir)
codes = {(i["field"], i["code"]) for i in out["issues"]}
assert ("body", "body_too_short") in codes
def test_well_formed_skill_passes(self, plugin_dir):
_write_skill(plugin_dir)
out = content_check(plugin_dir)
assert out["status"] == "pass"
assert out["issues"] == []
class TestPluginAndAgentDescriptions:
def test_plugin_description_empty_fails(self, plugin_dir):
_write_plugin_json(plugin_dir, description="")
out = content_check(plugin_dir)
files = {i["file"] for i in out["issues"]}
assert ".claude-plugin/plugin.json" in files
def test_one_bad_agent_among_many_is_isolated(self, plugin_dir):
_write_plugin_json(plugin_dir)
_write_agent(plugin_dir, name="good_one", description=_OK_DESC)
_write_agent(plugin_dir, name="bad_one", description="")
out = content_check(plugin_dir)
assert out["status"] == "fail"
# Only the bad agent's file shows in issues.
files = {i["file"] for i in out["issues"]}
assert "agents/bad_one.md" in files
assert "agents/good_one.md" not in files
def test_plugin_passes_when_descriptions_all_strong(self, plugin_dir):
_write_plugin_json(plugin_dir)
_write_agent(plugin_dir)
_write_skill(plugin_dir)
_write_command(plugin_dir)
out = content_check(plugin_dir)
assert out["status"] == "pass"
class TestCommands:
def test_command_short_description_fails(self, plugin_dir):
_write_command(plugin_dir, description="run") # 3 chars
out = content_check(plugin_dir)
codes = {i["code"] for i in out["issues"]}
assert "too_short" in codes
def test_command_lower_floor_still_enforced(self, plugin_dir):
# 38 chars + 6 distinct words — clears the 25/5 command floor.
_write_command(plugin_dir, description="Run tests, format output, report failures clearly")
out = content_check(plugin_dir)
assert out["status"] == "pass"
# ---------------------------------------------------------------------------
# Submission-level description
# ---------------------------------------------------------------------------
class TestSubmissionDescription:
def test_empty_submission_description_fails(self):
out = check_submission_description("")
assert out["status"] == "fail"
assert out["issues"][0]["code"] == "empty"
assert out["issues"][0]["file"] == "<submission>"
def test_placeholder_submission_description_fails(self):
out = check_submission_description("TBD")
codes = {i["code"] for i in out["issues"]}
assert "placeholder_text" in codes
def test_strong_submission_description_passes(self):
out = check_submission_description(_OK_DESC)
assert out["status"] == "pass"
assert out["issues"] == []
# ---------------------------------------------------------------------------
# Aggregation — InlineResult.passed flips on content failure
# ---------------------------------------------------------------------------
class TestInlineAggregate:
def test_content_failure_blocks_passed(self, plugin_dir):
# Skill with frontmatter description = TODO. Manifest + security
# pass; content fails; aggregate must be False.
_write_skill(plugin_dir, description="TODO")
r = run_inline_checks(plugin_dir, type_="skill", description=_OK_DESC)
assert r.manifest["status"] == "pass"
assert r.static_security["status"] == "pass"
assert r.content["status"] == "fail"
assert not r.passed
def test_submission_desc_failure_merges_into_content(self, plugin_dir):
_write_skill(plugin_dir)
r = run_inline_checks(plugin_dir, type_="skill", description="")
assert r.content["status"] == "fail"
files = {i["file"] for i in r.content["issues"]}
assert "<submission>" in files
def test_clean_bundle_passes(self, plugin_dir):
_write_skill(plugin_dir)
r = run_inline_checks(plugin_dir, type_="skill", description=_OK_DESC)
assert r.passed
# ---------------------------------------------------------------------------
# summarize_components + summarize_for_preview
# ---------------------------------------------------------------------------
class TestSummaries:
def test_summarize_components_baked_plugin_tree(self, plugin_dir):
_write_plugin_json(plugin_dir)
_write_agent(plugin_dir)
_write_skill(plugin_dir)
rows = summarize_components(plugin_dir)
types = {r["type"] for r in rows}
assert types == {"plugin", "agent", "skill"}
for r in rows:
assert r["ok"] is True
def test_summarize_for_preview_skill(self, plugin_dir):
# Single SKILL.md at root — preview should locate it without the
# `skills/<name>/` wrapper.
(plugin_dir / "SKILL.md").write_text(
f"---\nname: probe\ndescription: {_OK_DESC}\n---\n\n{_OK_BODY}\n",
encoding="utf-8",
)
rows = summarize_for_preview(plugin_dir, "skill")
assert len(rows) == 1
assert rows[0]["type"] == "skill"
assert rows[0]["ok"] is True
def test_summarize_for_preview_marks_bad_descriptions(self, plugin_dir):
(plugin_dir / "SKILL.md").write_text(
"---\nname: probe\ndescription: TODO\n---\n\n" + _OK_BODY + "\n",
encoding="utf-8",
)
rows = summarize_for_preview(plugin_dir, "skill")
assert len(rows) == 1
assert rows[0]["ok"] is False
codes = {i["code"] for i in rows[0]["issues"]}
assert "placeholder_text" in codes
class TestAgentsWalkerSkipsNonAgentFiles:
"""`agents/README.md` (and other helper files without frontmatter)
must not be evaluated as a missing-description agent. Pre-fix the
`_iter_components` walker greedily evaluated every `*.md` under
`agents/`, which gave a green dot in the upload preview (preview
walker correctly filtered) but a red rejection on submit (check
walker did not). Pin the parity here so the two stay aligned."""
def test_readme_under_agents_is_skipped(self, plugin_dir):
# One real agent + one README (no frontmatter at all).
_write_agent(plugin_dir, name="reviewer")
(plugin_dir / "agents" / "README.md").write_text(
"# How to author agents in this plugin\n\nA few notes for contributors.\n",
encoding="utf-8",
)
result = content_check(plugin_dir)
# README must NOT generate any issue. The lone real agent passes
# the floor, so the whole plugin passes.
assert result["status"] == "pass", result["issues"]
def test_helper_md_without_frontmatter_is_skipped(self, plugin_dir):
_write_agent(plugin_dir, name="reviewer")
(plugin_dir / "agents" / "_NOTES.md").write_text(
"Some helper notes — not an agent. No frontmatter, no agent shape.\n",
encoding="utf-8",
)
rows = summarize_components(plugin_dir)
types_files = {(r["type"], r["file"]) for r in rows}
# Only the real agent should appear; _NOTES.md is filtered out.
assert ("agent", "agents/reviewer.md") in types_files
assert ("agent", "agents/_NOTES.md") not in types_files