agnes-the-ai-analyst/src/marketplace_asset_validation.py
minasarustamyan dc5e0e0d11
Marketplace UX overhaul: rich plugin/skill/agent detail + filename rename (#251)
* Rename agnes-metadata.json to marketplace-metadata.json

Curated marketplace enrichment file (.claude-plugin/agnes-metadata.json)
becomes marketplace-metadata.json. Clean cut, no fallback — curators of
upstream marketplace repos must rename the file on their side.

Python API renames mirror the file rename: read_agnes_metadata →
read_marketplace_metadata, AGNES_METADATA_REL → MARKETPLACE_METADATA_REL,
AGNES_METADATA_MAX_BYTES → MARKETPLACE_METADATA_MAX_BYTES. Synth Claude
Code marketplace strip rule (.agnes/** + the metadata file) follows the
new filename.

* Marketplace detail polish: window cover + 715:310 aspect + helper alignment

- Plugin & item (skill/agent) detail hero: 160x160 square cover replaced
  with a macOS-style window frame (3 traffic-light dots + titlebar label
  showing the entity name). Body is constrained to 715:310 so curator-
  uploaded covers no longer crop to a square. Window is 380px wide; meta
  column and absolutely-positioned top-right install/remove actions stay
  put. Fallback when no cover_photo_url (translucent gradient + PL/SK/AG
  initials) is unchanged, just inside the window body.

- Inner skill/agent cards in the plugin detail's Internal structure
  section adopt the same 715:310 aspect (was fixed 78px tall). No window
  chrome on inner cards — just the matching proportions so covers read
  consistently across hero, grid tiles, and listing cards.

- Curated nested item helper text ("This skill is part of ... — add the
  bundle to your stack to use it") now stacks UNDER the "Open parent
  plugin" button instead of being a side-by-side flex sibling in the
  actions-row. Added align-self: flex-end so the 260px helper box
  anchors at the right edge of the 300px actions column, matching the
  button's right edge.

* Marketplace My tab: surface the same category + type filters as Flea

- Frontend: mp-cat-row and mp-type-row now show on tab=my (previously
  hidden — type was flea-only, category was flea/curated-only). Curated
  browse stays plugin-only and continues to hide the type pills.
  fetchOne() sends the `type` param for tab=my too, so the items
  endpoint's existing my-branch filter actually receives it.

- Backend categories endpoint, tab=my branch: when the type filter is
  set to skill/agent, skip counting curated subscriptions. Curated
  plugins are always type='plugin', so they wouldn't survive the items
  endpoint's type filter; including them in the category counts made
  the pill numbers overstate what users could actually see in the
  grid. type=None or type='plugin' keeps the previous behaviour.

- CHANGELOG entry under [Unreleased].

* Marketplace plugin detail: render rich content from marketplace-metadata.json

Adds five optional plugin-level fields to marketplace-metadata.json and
renders them on the curated plugin detail page + listing card:

* display_name — friendly h1 / listing-card name / mac-window titlebar
  label (overrides the technical plugin id)
* tagline — punchy 1-line value prop for the hero subtitle and the
  listing card description (replacing the verbose marketplace.json
  description on cards)
* description — multi-paragraph markdown body, server-side rendered
  through markdown-it-py and sanitized through nh3 with a
  description-scoped allowlist (no iframes / no raw HTML / no
  javascript: links). Powers the "What it does" panel.
* use_cases[] — {title, description, prompt} entries that render as a
  3-column "When to use it" card grid; each card shows the literal
  prompt as a code chip so users can copy-paste into Claude Code.
* sample_interaction — {user, assistant} dialog rendered in a Claude
  Code-style dark Catppuccin Mocha transcript panel: monospace user
  row with a green ">" prompt indicator + sans-serif assistant body
  with markdown formatting (peach bold, yellow italic, pink inline
  code, mantle-dark fenced code blocks).

All five fields are optional; UI sections only render when populated,
so plugins without enrichment look identical to before. Fields are
read on-demand from the working tree (cached by mtime per marketplace
slug) so curator edits land at the next request without waiting for
a sync cycle — same pattern as the existing inner-skill/agent
enrichment path. No DB schema bump.

Skill / agent rich-content rendering is deferred to a later phase
(needs a source-of-truth decision: extend plugin.yml? LLM-generate
from SKILL.md / agent.md?). The schema accepts the same fields at
skill/agent level today for forward compatibility but the UI ignores
them for now.

Also: stripped a stale `background-color: var(--bg)` from the global
`code` rule in style.css (was making inline code visually disappear
on the page background).

* Skill / agent detail: render rich content from marketplace-metadata.json

Brings the skill/agent detail pages to parity with the plugin detail
page. Same rich-content schema (display_name, tagline, description as
markdown, use_cases[], sample_interaction) plus two per-item additions:

* invocation — curator-provided literal command string. When set,
  overrides the computed "<manifest_name>:<inner_name>" chip and
  cleanly supports both "/" skill prefix and "@" agent prefix (the
  hardcoded "/" in the chip markup is hidden when the curator provides
  the invocation, so /grpn-eng:query <q> and @grpn-eng:cto-architect
  both render correctly).
* when_to_use — markdown disambiguation block ("Use this for X. For
  similar Y, see /other-skill") rendered into a new "When to use this"
  panel below the Example section.

Skill / agent category is now per-item overridable in
marketplace-metadata.json. When absent, the API keeps the parent
plugin's category as the badge so existing items don't lose their
category until curators opt in to per-item categorization.

The new "Example" Q&A panel uses the same Claude Code-style dark
Catppuccin Mocha transcript treatment as the plugin detail —
monospace user row with a green ">" prompt indicator + sans-serif
assistant body with markdown formatting.

All new fields are optional and read on-demand from the working tree.
Skills / agents whose marketplace-metadata.json doesn't carry rich
content render exactly the same way they did before (frontmatter
description + computed slash command + cover from existing v32
enrichment). No DB schema bump.

* Fix TypeError in skill / agent detail when curator sets per-item category

`curated_skill_detail` and `curated_agent_detail` were passing both
`**parent` (from `_curated_inner_parent_fields`, which returns the
parent plugin's category as a fallback) and `**enrichment` (from
`_curated_inner_enrichment`, which returns the per-item category
override when the curator set one) into `InnerDetailResponse(...)`.

Python function-call kwargs unpacking with overlapping keys raises
`TypeError: got multiple values for keyword argument 'category'`
— it doesn't merge like a literal dict does. The bug only surfaced
when the marketplace-metadata.json carried a `category` field at
skill / agent level (curator opting into per-item categorization);
items without that override hit the endpoint cleanly because only
parent provided the key.

Fix: build `merged = {**parent, **enrichment}` first (literal-dict
syntax DOES merge, with the right-hand-side winning) and unpack the
merged dict. Curator override still wins via the merge order, and
the same pattern is future-proof for any other field that lands in
both layers later.
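
A minimal sketch of the failure and the fix described above. `InnerDetail` is a hypothetical stand-in for `InnerDetailResponse`, and the dict contents are illustrative rather than real resolver output:

```python
from dataclasses import dataclass


@dataclass
class InnerDetail:
    """Hypothetical stand-in for InnerDetailResponse (not the real model)."""
    name: str
    category: str = ""


parent = {"name": "query", "category": "engineering"}  # parent-plugin fallback
enrichment = {"category": "data"}                      # per-item curator override

# Unpacking both dicts into one call repeats the `category` keyword,
# which Python rejects with a TypeError instead of merging.
try:
    InnerDetail(**parent, **enrichment)
    raised = False
except TypeError:
    raised = True

# A dict display merges instead; the right-hand side wins on overlap.
merged = {**parent, **enrichment}
detail = InnerDetail(**merged)
```

Because the merge order is explicit, swapping the two operands would make the parent value win, so the order doubles as the precedence rule.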

Plus a regression test in test_marketplace_metadata.py asserting
that the inner-resolver carries `category` for downstream merging.

* Marketplace detail: tolerate partial curator JSON

Server constructed UseCase / SampleInteraction via raw dict indexing
(uc["title"], sample["assistant"]), so a curator commit missing any
required Pydantic field crashed the whole plugin / skill / agent detail
endpoint with a 500. Route both constructions through _safe_use_case /
_safe_sample_interaction helpers — partial input silently drops the
malformed card / section instead of breaking the page.

Regression test in test_marketplace_api.py covers the three shapes:
use_case missing a key, use_case with an empty string, and
sample_interaction with only user (no assistant). Sibling rich fields
still render.
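
A rough sketch of the tolerant-construction idea, under stated assumptions: the real `_safe_use_case` is not shown in this message, so `UseCase` below is a plain dataclass standing in for the Pydantic model, `safe_use_case` is a hypothetical name, and the required keys are taken from the `{title, description, prompt}` shape described earlier:

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class UseCase:
    """Hypothetical stand-in for the Pydantic UseCase model."""
    title: str
    description: str
    prompt: str


def safe_use_case(raw: Any) -> Optional[UseCase]:
    """Return a UseCase, or None when any required field is missing or blank."""
    if not isinstance(raw, dict):
        return None
    fields = {}
    for key in ("title", "description", "prompt"):
        value = raw.get(key)
        if not isinstance(value, str) or not value.strip():
            return None  # drop the whole card rather than raise
        fields[key] = value
    return UseCase(**fields)


raw_cases = [
    {"title": "Triage", "description": "Sort incoming bugs", "prompt": "/triage"},
    {"title": "No prompt", "description": "Missing a key"},
    {"title": "", "description": "Empty title", "prompt": "/x"},
]
# Malformed cards are filtered out; well-formed siblings still render.
use_cases = [uc for uc in map(safe_use_case, raw_cases) if uc is not None]
```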

* Address PR-251 review (must-fixes + S2/S3 polish) + release-cut 0.50.0

Five must-fixes from the review pass (3 from @cvrysanek's two-stage
review, 2 from my independent pass), plus the 0.50.0 release-cut as the
last commit on this PR, per the "Release-cut belongs to the PR" rule
added to CLAUDE.md in v0.49.1.

Must-fixes
----------

1. Cache eviction: bounded LRU instead of per-marketplace predicate.
   The previous predicate (`k[0] == marketplace_id and k[1] != mtime_ns`)
   only swept stale entries for the CURRENT marketplace; with N>100
   distinct marketplaces each holding one mtime key, the cap silently
   failed and memory grew linearly. Replaced with OrderedDict-backed
   bounded LRU at cap=256, drop oldest insert on overflow.
   Cache stress test pinned in test_marketplace_metadata.py.

2. Render CPU cap: per-field byte cap on description / when_to_use /
   sample_interaction.assistant via MARKETPLACE_METADATA_FIELD_MAX_BYTES
   (= 64 KiB). Without this, a 1 MiB curator markdown body × QPS =
   curator-controlled CPU burn through pure-Python markdown-it-py.
   Truncation respects UTF-8 boundaries and logs a warning so the
   curator sees the cap fire on the next sync. Test for cap +
   UTF-8-boundary preservation.

3. Inner-detail bypassed the metadata cache. _curated_inner_enrichment,
   _curated_inner_cover, and curated_detail all called
   read_marketplace_metadata directly, defeating the mtime cache the
   plugin listing already shared. Routed all three through
   _read_metadata_cached so skill/agent detail requests re-parse at most
   once per marketplace per mtime instead of once per request.

4. Truthy-vs-presence trap in plugin/inner enrichment merge. API-layer
   writers used `if resolved.get(k):` which silently dropped any
   future falsy-but-valid resolver field (bool featured=False, int
   priority=0, str category=''). Switched to presence check
   (`if k in resolved`) so the resolver is the authority on field
   presence; `{**parent, **enrichment}` merge respects whatever the
   resolver decided to ship.

5. Vendor-agnostic OSS cleanup. Removed operator-specific token
   references (/grpn-eng:, @grpn-eng:, .foundryai/) from
   src/marketplace_metadata.py docstring, app/web/templates/
   marketplace_item_detail.html JS comment, docs/curated-marketplace-
   format.md, and tests/test_marketplace_metadata.py fixtures. Replaced
   with generic /my-plugin:tool / @my-agent:role / .example/ placeholders.

CHANGELOG
---------
- New "### Fixed (PR #251 follow-ups)" section documenting all 4
  code-side must-fixes
- New "### Internal" section noting the vendor cleanup + new tests
- BREAKING bullet for the file rename now covers operator-side
  migration: running instances see plugin enrichment disappear from
  the UI until upstream curator renames + nightly sync overwrites the
  working tree; POST /api/marketplaces/{id}/sync forces refresh sooner
- Stripped /grpn-eng: leaks from the existing skill/agent rich-content
  bullet

Tests
-----
128 targeted tests pass (test_marketplace_metadata, test_marketplace_api,
test_marketplace, test_markdown_render, test_marketplace_synth_strip,
test_marketplace_filter). New tests added:
- 6 XSS regression tests on render_safe (javascript:/data:/vbscript:
  schemes via autolink, reference link, and mixed-case + positive
  http/https/mailto + noopener noreferrer rel)
- 3 byte-cap tests (truncation + UTF-8 boundary + under-cap pass-through)
- 1 cache eviction stress test (>256 marketplaces -> bounded at cap)
- 1 truthy-vs-presence resolver-contract test
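
For the byte-cap tests, one way to truncate at a UTF-8 boundary is sketched below. `truncate_utf8` is a hypothetical helper, not the function in the codebase, and the real implementation also logs a warning, which is omitted here:

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Truncate text to at most max_bytes of UTF-8 without splitting a character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text  # under the cap: pass through unchanged
    # The input was valid UTF-8, so the only invalid bytes after slicing are an
    # incomplete trailing sequence; errors="ignore" drops exactly those.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


sample = "é" * 10  # each "é" is 2 bytes in UTF-8
cut = truncate_utf8(sample, 5)  # 5 bytes would split the third character
```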

Release-cut
-----------
- pyproject.toml 0.49.1 -> 0.50.0 (minor; BREAKING file rename per
  pre-1.0 CHANGELOG note: "breaking changes called out under Changed
  or Removed with the BREAKING marker")
- CHANGELOG [Unreleased] -> [0.50.0] - 2026-05-12, new empty
  [Unreleased] on top.

---------

Co-authored-by: Minas Arustamyan <arustamyan.minas@gmail.com>
Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
2026-05-12 08:38:39 +00:00


"""Asset allowlists + validators shared between the curated marketplace mirror
flow (``src/marketplace_asset_mirror.py``) and the Flea / Store upload flow
(``app/api/store.py``).

Two allowlists are exposed:

* **Documents**: PDF, Markdown, plain text. The set is deliberately narrow so
  that what we serve back to users is something a browser can render directly
  or download cleanly. HTML and DOCX are rejected (HTML has unbounded
  external-asset dependencies and looks broken offline; DOCX is opaque to
  most readers).
* **Images**: PNG, JPEG, WEBP. SVG is rejected because inline ``<script>``
  inside an SVG is a ready-made XSS vector when the file is served with the
  ``image/svg+xml`` Content-Type.

Validators come in two shapes:

* :func:`validate_doc_file` / :func:`validate_image_file`: for **already
  downloaded** bytes (Flea uploads, mirror cache writes).
* :func:`accept_doc_response` / :func:`accept_image_response`: for **HTTP
  responses** during external-URL mirroring, where the body may not yet be
  in memory and the decision needs to be made from the URL + Content-Type.

No function raises on rejection. Validators return a
:class:`ValidationResult` carrying ``ok`` and ``reason``; the ``parse_*``
helpers return an ``(ok, value_or_reason)`` tuple. The caller decides whether
a rejection is an HTTP 400 (Flea) or a silent log (curated mirror).
"""

from __future__ import annotations
import re
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Tuple
from urllib.parse import urlparse
# ---------------------------------------------------------------------------
# Allowlist constants
# ---------------------------------------------------------------------------
DOC_EXTENSIONS = (".pdf", ".md", ".markdown", ".txt")
"""Lowercase extensions accepted as documents. Used as both the MIME-fallback
hint (when servers send ``application/octet-stream``) and the Flea ``accept``
attribute source-of-truth."""

DOC_CONTENT_TYPES = (
    "application/pdf",
    "text/markdown",
    "text/x-markdown",
    "text/plain",
)
"""Content-Types unambiguously accepted for documents."""

DOC_GENERIC_CONTENT_TYPES = (
    "application/octet-stream",
    "application/x-download",
    "binary/octet-stream",
)
"""Generic Content-Types that need an extension match before acceptance.
Real-world CDNs frequently send these for ``.md`` / ``.pdf`` files when the
MIME database doesn't have a hit."""

IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".webp")
IMAGE_CONTENT_TYPES = (
    "image/png",
    "image/jpeg",
    "image/webp",
)

# Magic-bytes prefix for PDF files — first 4 bytes are always ``%PDF`` regardless
# of PDF version. We don't try to validate Markdown / plain text by sniffing
# (every byte sequence is "valid" Markdown) — for those we rely on extension.
_PDF_MAGIC = b"%PDF"
# Magic-bytes prefixes for image formats. Used as a belt-and-suspenders check
# alongside Content-Type so an attacker can't trivially smuggle an SVG through
# a renamed ``.png`` file.
_PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
_JPEG_MAGIC = b"\xff\xd8\xff"
# WEBP is RIFF-wrapped: first 4 bytes "RIFF", bytes 8-12 "WEBP".
_WEBP_RIFF = b"RIFF"
_WEBP_TAG = b"WEBP"


@dataclass(frozen=True)
class ValidationResult:
    ok: bool
    reason: str = ""

    def __bool__(self) -> bool:  # convenience for `if result:`
        return self.ok


_OK = ValidationResult(True)


def _ext(path_or_url: str) -> str:
    """Return the lowercase trailing extension of ``path_or_url`` (with dot).

    Works on bare filenames, paths, and URLs (query strings are dropped).
    Returns ``""`` when there is no extension.
    """
    if not path_or_url:
        return ""
    # Strip query / fragment so URLs like https://x/file.pdf?token=… are
    # classified by their visible extension.
    cleaned = path_or_url.split("?", 1)[0].split("#", 1)[0]
    suffix = PurePosixPath(cleaned).suffix
    return suffix.lower() if suffix else ""


def _normalize_content_type(value: str) -> str:
    """Strip parameters (``; charset=…``) and lowercase. Returns ``""`` for None."""
    if not value:
        return ""
    return value.split(";", 1)[0].strip().lower()

# ---------------------------------------------------------------------------
# External URL detection
# ---------------------------------------------------------------------------
_HTTP_URL_RE = re.compile(r"^https?://", re.IGNORECASE)


def is_external_url(value: str) -> bool:
    """Return True when ``value`` looks like an absolute http(s) URL.

    Used to discriminate between ``cover_photo: ".agnes/cover.png"`` (internal
    git-tree path) and ``cover_photo: "https://cdn.example.com/cover.png"``
    (external URL, eligible for the asset mirror).
    """
    return bool(value) and bool(_HTTP_URL_RE.match(value.strip()))

# ---------------------------------------------------------------------------
# Body-based validators (Flea uploads, mirror cache writes)
# ---------------------------------------------------------------------------
def validate_doc_file(filename: str, body: bytes) -> ValidationResult:
    """Accept iff filename has an allowed extension AND (for PDF) magic bytes match.

    Markdown and plain text aren't sniffed, since any byte sequence is
    technically valid text; we rely on the extension for those. PDF gets the
    magic-byte check because mislabeled ``.pdf`` files (someone renamed an EXE)
    are a real concern.
    """
    ext = _ext(filename)
    if ext not in DOC_EXTENSIONS:
        return ValidationResult(
            False,
            f"unsupported_doc_extension: {ext or '(none)'} not in {DOC_EXTENSIONS}",
        )
    if ext == ".pdf" and not body.startswith(_PDF_MAGIC):
        return ValidationResult(False, "pdf_magic_bytes_mismatch")
    return _OK


def validate_image_file(filename: str, body: bytes) -> ValidationResult:
    """Accept iff extension is in the image allowlist AND magic bytes match.

    SVG is not in the allowlist, so a ``.svg`` filename is rejected on its
    extension alone. For allowed extensions, magic bytes are the authoritative
    signal: a renamed ``payload.png`` carrying SVG XML fails this check.
    """
    ext = _ext(filename)
    if ext not in IMAGE_EXTENSIONS:
        return ValidationResult(
            False,
            f"unsupported_image_extension: {ext or '(none)'} not in {IMAGE_EXTENSIONS}",
        )
    if ext == ".png" and not body.startswith(_PNG_MAGIC):
        return ValidationResult(False, "png_magic_bytes_mismatch")
    if ext in (".jpg", ".jpeg") and not body.startswith(_JPEG_MAGIC):
        return ValidationResult(False, "jpeg_magic_bytes_mismatch")
    if ext == ".webp":
        # WEBP is "RIFF" + 4 bytes size + "WEBP". Need at least 12 bytes.
        if len(body) < 12 or body[:4] != _WEBP_RIFF or body[8:12] != _WEBP_TAG:
            return ValidationResult(False, "webp_magic_bytes_mismatch")
    return _OK

# ---------------------------------------------------------------------------
# Response-based validators (curated mirror — pre-download checks)
# ---------------------------------------------------------------------------
def accept_doc_response(url: str, content_type: str) -> ValidationResult:
    """Should we mirror this external doc URL based on its HTTP HEAD response?

    Resolution order:

    1. Content-Type matches an unambiguous doc allowlist entry → accept.
    2. Content-Type is generic (octet-stream / x-download) AND URL extension
       matches → accept (real-world CDN behavior for ``.md`` / ``.pdf``).
    3. Otherwise reject.

    HTML page links (Confluence, Notion, GitHub Wiki) don't survive this
    filter: ``text/html`` is not in either list. The caller's contract for
    rejected entries is to skip the mirror but keep the original URL as a
    plain external link in the served ``doc_links`` (b1 fallback).
    """
    ct = _normalize_content_type(content_type)
    if ct in DOC_CONTENT_TYPES:
        return _OK
    if ct in DOC_GENERIC_CONTENT_TYPES and _ext(url) in DOC_EXTENSIONS:
        return _OK
    return ValidationResult(
        False, f"doc_content_type_rejected: {ct or '(empty)'}"
    )


def accept_image_response(url: str, content_type: str) -> ValidationResult:
    """Should we mirror this external image URL based on its HTTP HEAD response?

    Stricter than docs: an image must report an explicit ``image/png``,
    ``image/jpeg``, or ``image/webp`` Content-Type. Generic octet-stream is
    NOT accepted for images because the downstream renderer needs to know
    the format and ``<img src>`` won't sniff the body.
    """
    ct = _normalize_content_type(content_type)
    if ct in IMAGE_CONTENT_TYPES:
        return _OK
    return ValidationResult(
        False, f"image_content_type_rejected: {ct or '(empty)'}"
    )

# ---------------------------------------------------------------------------
# Convenience helpers used by the marketplace-metadata.json parser
# ---------------------------------------------------------------------------
@dataclass(frozen=True)
class DocLinkRef:
    """A single ``doc_links[]`` entry resolved into one of three shapes:

    * ``kind="internal"`` → ``path`` set, points at a file in the cloned repo.
    * ``kind="external"`` → ``url`` set to the original URL. This is the state
      before mirroring, and it is also what remains when mirroring failed and
      we link out.
    * ``kind="mirrored"`` → ``url`` is the original; ``mirrored_key`` is set
      to the cache lookup key.

    The marketplace-metadata parser only produces ``internal`` and
    ``external``; the mirror layer flips ``external`` to ``mirrored`` after a
    successful fetch.
    """

    name: str
    kind: str  # "internal" | "external" | "mirrored"
    path: str = ""
    url: str = ""
    mirrored_key: str = ""


def parse_doc_link(entry: dict) -> Tuple[bool, DocLinkRef | str]:
    """Validate one ``doc_links[]`` dict from marketplace-metadata.json.

    Returns ``(True, DocLinkRef)`` on accept, ``(False, reason)`` on reject.
    Rejection reasons surface to the sync log so the curator can fix them.

    Schema rules:

    - ``name`` required (non-empty string).
    - Exactly one of ``path`` or ``url``. Both (ambiguous) or neither → reject.
    - ``path`` (when present) must not start with ``/`` and must not contain
      ``..`` segments. The asset endpoint enforces this again at serve time,
      but rejecting at parse time means the entry never reaches the cache.
    - ``url`` (when present) must be ``http(s)://``.
    """
    if not isinstance(entry, dict):
        return False, "doc_link_not_object"
    name = entry.get("name")
    if not isinstance(name, str) or not name.strip():
        return False, "doc_link_missing_name"
    has_path = "path" in entry
    has_url = "url" in entry
    if has_path == has_url:
        return False, "doc_link_must_have_exactly_one_of_path_or_url"
    if has_path:
        path = entry["path"]
        if not isinstance(path, str) or not path.strip():
            return False, "doc_link_path_empty"
        if path.startswith("/") or ".." in PurePosixPath(path).parts:
            return False, "doc_link_path_traversal_or_absolute"
        # Internal paths must point at an allowlisted document type (PDF /
        # Markdown / plain text). The serve endpoint enforces this again at
        # download time, but rejecting at parse time means the entry never
        # reaches the served `doc_links` list at all. That is exactly the
        # user-facing contract: "any URL Agnes can't render as a real document
        # is treated as if it weren't there."
        if _ext(path) not in DOC_EXTENSIONS:
            return False, (
                f"doc_link_path_unsupported_extension: {_ext(path) or '(none)'} "
                f"not in {DOC_EXTENSIONS}"
            )
        return True, DocLinkRef(name=name.strip(), kind="internal", path=path)
    url = entry["url"]
    if not isinstance(url, str) or not is_external_url(url):
        return False, "doc_link_url_must_be_http_or_https"
    # External URLs whose final extension is unambiguously NOT in the doc
    # allowlist are dropped early, which saves the mirror layer a wasted HEAD
    # request on something we'd never accept anyway. URLs without a clear
    # extension still pass through (e.g. CDN pretty paths) and the mirror's
    # Content-Type check decides at fetch time.
    ext = _ext(url)
    if ext and ext not in DOC_EXTENSIONS:
        return False, (
            f"doc_link_url_unsupported_extension: {ext} not in {DOC_EXTENSIONS}"
        )
    return True, DocLinkRef(name=name.strip(), kind="external", url=url)


def parse_cover_photo_ref(value: object) -> Tuple[bool, Tuple[str, str] | str]:
    """Resolve a ``cover_photo`` value into ``(kind, target)``.

    Accepts:

    * external URL (``http(s)://...``) → ``("external", url)``.
    * internal git-tree path → ``("internal", path)``.
    * empty / whitespace-only / None / non-string → reject silently (callers
      tolerate absence).

    The internal-path branch validates against directory traversal at parse
    time. The serving endpoint validates again with ``Path.resolve()`` so the
    parser-time check is defense-in-depth, not the only gate.
    """
    if value is None or value == "":
        return False, "cover_photo_empty"
    if not isinstance(value, str):
        return False, "cover_photo_not_string"
    v = value.strip()
    if not v:
        # Whitespace-only input would otherwise resolve to ("internal", "").
        return False, "cover_photo_empty"
    if is_external_url(v):
        return True, ("external", v)
    if v.startswith("/") or ".." in PurePosixPath(v).parts:
        return False, "cover_photo_path_traversal_or_absolute"
    return True, ("internal", v)