agnes-the-ai-analyst/app/web/templates/store_examples.html
Vojtech fb6e930bc9
feat(store-guardrails): per-component description quality + plain-language UX (#276)
* feat(store-guardrails): enforce per-component description quality

Two-tier hard guardrail on flea-market submissions. Empty / placeholder /
single-word descriptions now block before any LLM call; vague-but-passes-
floor descriptions block on the substantive LLM review layer.

Tier 1 — inline mechanical check (src/store_guardrails/content_check.py).
Walks the baked plugin tree, evaluates each component (plugin manifest,
agents, skills, commands) plus the submission-level form description
against a 60-char / 25-char (commands) / 5-distinct-word / 200-char-body
floor with a placeholder denylist (TODO, TBD, {{var}}, etc.). Floors
calibrated against real ecosystem norms: Claude / superpowers /
compound-engineering skill packs cluster 150–220 chars, npm / Docker /
VS Code at 100–120. InlineResult.passed now ANDs in content.status.

Tier 2 — LLM review extension (prompts.py + llm_review.py). System
prompt gains a content-quality criterion; REVIEW_JSON_SCHEMA carries a
content_quality {verdict, issues[]} object alongside the existing
security findings. is_safe() requires content_quality.verdict == 'pass'.
Single LLM call covers both dimensions. MAX_RESPONSE_TOKENS bumped
2000 → 2500 for the extra payload. Verdicts missing content_quality
treated as pass (backwards compat with already-recorded rows).

Submitter UX:
- /store/new wizard now carries a "Before you upload — what passes
  review" collapsible disclosure on both step 1 and step 2 with the
  bar + patterns that work. Live char counter on the description
  field. Per-component preview table (green/red dots from the new
  summarize_for_preview helper) renders after the ZIP /preview round
  trip, scoping each finding to its file.
- New /store/examples page with rejected/passes pairs for skill /
  agent / plugin / command plus a "Why these limits" research table.
  Anchored sections (#skill / #agent / #plugin / #command) so the
  rejection banner can deep-link by component_type.
- Quarantine banner _content_findings.html groups findings by file
  (one "See <type> example ↗" per component, not per field) and
  translates field codes (frontmatter.description / body / etc.) to
  plain-English labels. _content_howto_fix.html surfaces a static
  "Re-upload as new version" + "See examples" action row beneath any
  content failure on the entity detail page.
- _parse_frontmatter moved to src/store_guardrails/_frontmatter.py so
  the new check module shares the parser without inverting the
  app → src dependency direction.

Tests:
- New tests/test_store_guardrails_content.py (29 cases) covering
  every failure code per component type plus submission-level checks
  and the summarize_components / summarize_for_preview helpers.
- Extended test_store_guardrails_inline.py for the new
  InlineResult.content field + aggregate behaviour.
- Extended test_store_guardrails_llm.py for the new
  content_quality verdict pathways (fail blocks, missing field passes).
- Backfilled fixture descriptions across test_store_api.py,
  test_store_entity_versions.py, test_store_put_atomic.py,
  test_admin_store_submissions.py, test_marketplace_api.py,
  test_marketplace_v32_endpoints.py so existing happy-path tests
  clear the new 60-char floor.

* fix(content-guardrail): align agents walker with preview + drop import-time .format()

Two cleanups from the takeover review on #276 (vr/guardrails-content).

1) `_iter_components` for agents now skips files lacking frontmatter
   (no `name` AND no `description`). Pre-fix the walker greedily
   evaluated every `*.md` under `agents/` — `agents/README.md` and
   helper docs got flagged as "frontmatter.description empty"
   rejections. Worse: `summarize_for_preview` for `type=agent` ALREADY
   filters the same shape, so the upload preview gave a green dot
   while the post-bake check gave a red rejection on submit. Two new
   regression tests in TestAgentsWalkerSkipsNonAgentFiles pin both
   shapes (README + _NOTES.md) so the preview/check parity stays
   aligned.

2) `body_too_short` hints now use the same runtime-kwarg substitution
   pattern as every other hint in the table. Pre-fix the skill +
   agent body_too_short hints called `.format(min_chars=_MIN_BODY_CHARS)`
   at module-load time, but the call site `_hint_for(type_,
   "body_too_short")` didn't pass `min_chars=`, so the format() was
   just baking the constant at import. Cosmetic inconsistency; pass
   `min_chars=_MIN_BODY_CHARS` at the call site instead and let
   `_hint_for` do the substitution like it does for `too_short`.

Verified end-to-end:
- New TestAgentsWalkerSkipsNonAgentFiles cases fail on the unfixed
  walker (verified by reverting to the pre-fix file and re-running);
  pass cleanly after the fix.
- Full content-guardrail suite: 25/25 (23 existing + 2 new).
- Full pytest: 4189 passed, 25 skipped.

* release: 0.53.5 — content guardrail (flea-market submitter UX) + catalog ENTITY column + BQ hint dispatch

Bundles three threads landed in [Unreleased]:
- Vojta's flea-market content guardrail (two-tier mechanical + LLM)
- Zdeněk's `agnes catalog` ENTITY column replacement for FLAVOR
- Zdeněk's `/api/query` remote_estimate_failed hint dispatch fix

Plus the takeover hygiene from #276 review (agents walker preview/check
parity + body_too_short hint runtime kwarg consistency) and the
backslash-escape fix follow-up to v0.53.4 #275.

No DB migration; no API change. Patch upgrade lands transparently.
Upload form's new "Before you upload" disclosure + per-component preview
table appear on the next dev-VM auto-pull. Quarantine banner now groups
findings by file with "See <type> example ↗" deep-links to the new
/store/examples reference page.

---------

Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
2026-05-12 21:48:27 +02:00

339 lines
12 KiB
HTML
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

{% extends "base.html" %}
{% block title %}Submission examples — {{ config.INSTANCE_NAME }}{% endblock %}
{% block content %}
<style>
.examples-page { max-width: 980px; }
.examples-page h1 {
font-size: 26px; font-weight: 700; margin: 0 0 6px;
color: var(--text-primary, #111827);
}
.examples-page .sub {
font-size: 14px; color: var(--text-secondary, #6b7280);
margin-bottom: 24px; line-height: 1.55;
}
.ex-section {
background: var(--surface, #fff);
border: 1px solid var(--border, #e5e7eb);
border-radius: 10px;
padding: 20px 22px;
margin-bottom: 18px;
}
.ex-section h2 {
margin: 0 0 4px; font-size: 17px; font-weight: 600;
color: var(--text-primary, #111827);
}
.ex-section .why {
font-size: 13px; color: var(--text-secondary, #6b7280);
margin-bottom: 12px; line-height: 1.6;
}
.ex-block {
font-family: var(--font-mono); font-size: 12.5px;
background: #f6f8fa;
border: 1px solid var(--border, #e5e7eb);
border-radius: 8px;
padding: 12px 14px;
line-height: 1.5;
overflow-x: auto;
white-space: pre-wrap;
word-break: break-word;
color: #1f2937;
}
.ex-label {
display: inline-block; font-size: 11px; font-weight: 600;
padding: 2px 8px; border-radius: 4px;
background: rgba(16, 185, 129, 0.12); color: #047857;
text-transform: uppercase; letter-spacing: 0.5px;
margin-bottom: 6px;
}
.ex-label.bad { background: rgba(220, 38, 38, 0.10); color: #b91c1c; }
.ex-compare { display: grid; grid-template-columns: 1fr 1fr; gap: 14px; }
@media (max-width: 720px) { .ex-compare { grid-template-columns: 1fr; } }
.ex-tips {
background: var(--background, #f9fafb);
border-left: 3px solid var(--primary, #0073D1);
padding: 10px 14px; border-radius: 4px;
font-size: 13px; margin-top: 12px; line-height: 1.55;
}
.ex-tips strong { color: var(--text-primary, #111827); }
.top-actions { margin-bottom: 16px; }
.top-actions a {
color: var(--primary, #0073D1); text-decoration: none;
font-size: 13px;
}
</style>
<div class="examples-page page-shell">
<div class="top-actions">
<a href="/store/new">← Back to upload</a>
</div>
<h1>Submission examples</h1>
<p class="sub">
Each component (plugin, agent, skill, command) needs a description
that names the trigger condition AND the action. Skills are the
strictest case because Claude reads the description verbatim when
deciding whether to invoke the skill. Examples below show the
minimum bar plus what a strong submission looks like.
</p>
<!-- ───── Why these limits ───────────────────────────────────────── -->
<div class="ex-section">
<h2>Why these limits?</h2>
<div class="why">
Descriptions that are too short don't give the assistant enough
to go on. Below is the minimum bar plus what a well-written
submission usually looks like.
</div>
<table style="width:100%; border-collapse: collapse; font-size: 13px;">
<thead>
<tr style="background: var(--background, #f9fafb); text-align: left;">
<th style="padding: 8px 10px; border-bottom: 1px solid var(--border, #e5e7eb);">Field</th>
<th style="padding: 8px 10px; border-bottom: 1px solid var(--border, #e5e7eb);">Minimum</th>
<th style="padding: 8px 10px; border-bottom: 1px solid var(--border, #e5e7eb);">Recommended</th>
<th style="padding: 8px 10px; border-bottom: 1px solid var(--border, #e5e7eb);">What it does</th>
</tr>
</thead>
<tbody>
<tr>
<td style="padding: 8px 10px; border-bottom: 1px solid var(--border, #e5e7eb);">Skill / agent / plugin description</td>
<td style="padding: 8px 10px; border-bottom: 1px solid var(--border, #e5e7eb);"><strong>60 chars</strong> · 5 distinct words</td>
<td style="padding: 8px 10px; border-bottom: 1px solid var(--border, #e5e7eb);">120220 chars (one full sentence)</td>
<td style="padding: 8px 10px; border-bottom: 1px solid var(--border, #e5e7eb);">Tells the assistant when to use the component and what it does. Showed on the marketplace tile so others can pick it.</td>
</tr>
<tr>
<td style="padding: 8px 10px; border-bottom: 1px solid var(--border, #e5e7eb);">Command description</td>
<td style="padding: 8px 10px; border-bottom: 1px solid var(--border, #e5e7eb);"><strong>25 chars</strong> · 5 distinct words</td>
<td style="padding: 8px 10px; border-bottom: 1px solid var(--border, #e5e7eb);">40100 chars</td>
<td style="padding: 8px 10px; border-bottom: 1px solid var(--border, #e5e7eb);">Commands are one-verb actions ("run tests", "format code"). A short clear sentence is enough.</td>
</tr>
<tr>
<td style="padding: 8px 10px;">Skill / agent content body</td>
<td style="padding: 8px 10px;"><strong>200 chars</strong></td>
<td style="padding: 8px 10px;">5002000 chars</td>
<td style="padding: 8px 10px;">The body explains what the component does once used: inputs it expects, outputs it produces, edge cases. 200 chars is the bare-minimum "one paragraph" floor.</td>
</tr>
</tbody>
</table>
<div class="ex-tips" style="margin-top: 12px;">
<strong>Two-step review.</strong> First we check length and word
count to catch placeholders like <code>TODO</code> or
<code>description</code>. Submissions that pass the length check
then go to a substantive reviewer that judges whether the
description is genuinely useful or just padded filler that hit
the character count by accident.
</div>
</div>
<!-- ───── Skill ──────────────────────────────────────────────────── -->
<div class="ex-section" id="skill">
<h2>Skill</h2>
<div class="why">
<code>skills/&lt;name&gt;/SKILL.md</code> with YAML frontmatter and a
body. The frontmatter <code>description</code> IS the trigger
string Claude reads; the body explains how the skill works once
invoked.
</div>
<div class="ex-compare">
<div>
<span class="ex-label bad">Rejected</span>
<pre class="ex-block">---
name: code-review
description: A reviewer skill
---
Reviews code.
</pre>
<div class="ex-tips">
<strong>Why it fails:</strong> description is 15 chars
(floor 60), restates the name, body is 13 chars
(floor 200). Claude can't decide when to invoke it.
</div>
</div>
<div>
<span class="ex-label">Passes</span>
<pre class="ex-block">---
name: code-review
description: Use when reviewing pull requests to flag missing tests,
weak assertions, brittle implementation-coupled tests, and edge
cases the implementation forgot.
---
# Code review skill
Run this skill against the diff of an open pull request. It walks
the changed files and surfaces three categories of issues:
1. **Missing tests** — new functions / endpoints / migrations
without corresponding test coverage.
2. **Weak assertions** — tests that exist but only assert truthy
values, status codes, or shape without verifying the actual
behavior contract.
3. **Brittle coupling** — tests that depend on private state,
internal call counts, or implementation details that will break
on refactor without catching real regressions.
## Inputs
The skill expects to be invoked from a git repo with a PR branch
checked out. It reads `git diff <base>..HEAD` to scope the review.
## Output
Markdown comment grouped by file, with line refs and one fix
suggestion per finding.
</pre>
<div class="ex-tips">
<strong>Why it passes:</strong> description names WHEN (PR
review) and WHAT it does, body explains inputs/outputs.
</div>
</div>
</div>
</div>
<!-- ───── Agent ──────────────────────────────────────────────────── -->
<div class="ex-section" id="agent">
<h2>Agent</h2>
<div class="why">
Single <code>.md</code> file with YAML frontmatter. The
<code>description</code> drives dispatch — when a parent agent
decides which subagent to spawn, this is the string it reads.
</div>
<div class="ex-compare">
<div>
<span class="ex-label bad">Rejected</span>
<pre class="ex-block">---
name: debugger
description: A debugger
---
Helps with debugging.
</pre>
<div class="ex-tips">
<strong>Why it fails:</strong> description is 11 chars
(floor 60), body is 22 chars (floor 200), neither names a
dispatch criterion.
</div>
</div>
<div>
<span class="ex-label">Passes</span>
<pre class="ex-block">---
name: debugger
description: Diagnoses test failures, runtime errors, and unexpected
behavior. Use when a build is failing, a stack trace lands in the
conversation, or the user describes a symptom without a known
root cause.
---
# Debugger subagent
Triage flow:
1. Reproduce the failure deterministically (capture the exact
command + working directory + relevant env).
2. Reduce — strip the case down to the smallest input that still
fails. Confirm the reduction reproduces.
3. Bisect or trace, depending on whether the regression is recent
or the behavior was never correct.
4. Propose ONE root-cause hypothesis and a minimal fix. Don't
shotgun-fix multiple symptoms.
Report the hypothesis + fix path back to the dispatcher; do not
apply the fix yourself unless explicitly asked.
</pre>
<div class="ex-tips">
<strong>Why it passes:</strong> description names WHAT it
does (diagnose) AND WHEN to dispatch (test failures, stack
traces, symptoms).
</div>
</div>
</div>
</div>
<!-- ───── Plugin ─────────────────────────────────────────────────── -->
<div class="ex-section" id="plugin">
<h2>Plugin</h2>
<div class="why">
<code>.claude-plugin/plugin.json</code> at the bundle root. The
<code>description</code> is the marketplace tile copy — first
thing a user sees when browsing.
</div>
<div class="ex-compare">
<div>
<span class="ex-label bad">Rejected</span>
<pre class="ex-block">{
"name": "review-tools",
"description": "Tools for code review",
"version": "0.1.0"
}
</pre>
<div class="ex-tips">
<strong>Why it fails:</strong> 20 chars (floor 60),
generic enough to apply to any plugin.
</div>
</div>
<div>
<span class="ex-label">Passes</span>
<pre class="ex-block">{
"name": "review-tools",
"description": "Code review automation: PR diff analysis, test
coverage gaps, and a configurable rule pack for Rails / Python /
TypeScript projects.",
"version": "0.1.0"
}
</pre>
<div class="ex-tips">
<strong>Why it passes:</strong> names the audience (PR
reviewers), the value (gaps + rule pack), and the scope
(3 named languages).
</div>
</div>
</div>
</div>
<!-- ───── Command ────────────────────────────────────────────────── -->
<div class="ex-section" id="command">
<h2>Command</h2>
<div class="why">
<code>commands/&lt;name&gt;.md</code>. Shown in <code>/help</code>
and slash-command lists. Lower 20-char floor since commands tend
to be one-verb actions.
</div>
<div class="ex-compare">
<div>
<span class="ex-label bad">Rejected</span>
<pre class="ex-block">---
name: run-tests
description: Runs tests
---
</pre>
<div class="ex-tips">
<strong>Why it fails:</strong> 10 chars (floor 25),
restates the name.
</div>
</div>
<div>
<span class="ex-label">Passes</span>
<pre class="ex-block">---
name: run-tests
description: Run the project test suite and print failures grouped
by file with the first 5 lines of the traceback inline.
---
</pre>
<div class="ex-tips">
<strong>Why it passes:</strong> states the action, the
shape of the output, and the trimming behavior.
</div>
</div>
</div>
</div>
<div class="top-actions" style="margin-top: 20px;">
<a href="/store/new">← Back to upload</a>
</div>
</div>
{% endblock %}