agnes-the-ai-analyst/src/marketplace_asset_mirror.py
minasarustamyan dc5e0e0d11
Marketplace UX overhaul: rich plugin/skill/agent detail + filename rename (#251)
* Rename agnes-metadata.json to marketplace-metadata.json

Curated marketplace enrichment file (.claude-plugin/agnes-metadata.json)
becomes marketplace-metadata.json. Clean cut, no fallback — curators of
upstream marketplace repos must rename the file on their side.

Python API renames mirror the file rename: read_agnes_metadata →
read_marketplace_metadata, AGNES_METADATA_REL → MARKETPLACE_METADATA_REL,
AGNES_METADATA_MAX_BYTES → MARKETPLACE_METADATA_MAX_BYTES. Synth Claude
Code marketplace strip rule (.agnes/** + the metadata file) follows the
new filename.

* Marketplace detail polish: window cover + 715:310 aspect + helper alignment

- Plugin & item (skill/agent) detail hero: 160x160 square cover replaced
  with a macOS-style window frame (3 traffic-light dots + titlebar label
  showing the entity name). Body is constrained to 715:310 so curator-
  uploaded covers no longer crop to a square. Window is 380px wide; meta
  column and absolutely-positioned top-right install/remove actions stay
  put. Fallback when no cover_photo_url (translucent gradient + PL/SK/AG
  initials) is unchanged, just inside the window body.

- Inner skill/agent cards in the plugin detail's Internal structure
  section adopt the same 715:310 aspect (was fixed 78px tall). No window
  chrome on inner cards — just the matching proportions so covers read
  consistently across hero, grid tiles, and listing cards.

- Curated nested item helper text ("This skill is part of ... — add the
  bundle to your stack to use it") now stacks UNDER the "Open parent
  plugin" button instead of being a side-by-side flex sibling in the
  actions-row. Added align-self: flex-end so the 260px helper box
  anchors at the right edge of the 300px actions column, matching the
  button's right edge.

* Marketplace My tab: surface the same category + type filters as Flea

- Frontend: mp-cat-row and mp-type-row now show on tab=my (previously
  hidden — type was flea-only, category was flea/curated-only). Curated
  browse stays plugin-only and continues to hide the type pills.
  fetchOne() sends the `type` param for tab=my too, so the items
  endpoint's existing my-branch filter actually receives it.

- Backend categories endpoint, tab=my branch: when the type filter is
  set to skill/agent, skip counting curated subscriptions. Curated
  plugins are always type='plugin', so they wouldn't survive the items
  endpoint's type filter; including them in the category counts made
  the pill numbers overstate what users could actually see in the
  grid. type=None or type='plugin' keeps the previous behaviour.

- CHANGELOG entry under [Unreleased].

* Marketplace plugin detail: render rich content from marketplace-metadata.json

Adds five optional plugin-level fields to marketplace-metadata.json and
renders them on the curated plugin detail page + listing card:

* display_name — friendly h1 / listing-card name / mac-window titlebar
  label (overrides the technical plugin id)
* tagline — punchy 1-line value prop for the hero subtitle and the
  listing card description (replacing the verbose marketplace.json
  description on cards)
* description — multi-paragraph markdown body, server-side rendered
  through markdown-it-py and sanitized through nh3 with a
  description-scoped allowlist (no iframes / no raw HTML / no
  javascript: links). Powers the "What it does" panel.
* use_cases[] — {title, description, prompt} entries that render as a
  3-column "When to use it" card grid; each card shows the literal
  prompt as a code chip so users can copy-paste into Claude Code.
* sample_interaction — {user, assistant} dialog rendered in a Claude
  Code-style dark Catppuccin Mocha transcript panel: monospace user
  row with a green ">" prompt indicator + sans-serif assistant body
  with markdown formatting (peach bold, yellow italic, pink inline
  code, mantle-dark fenced code blocks).

All five fields are optional; UI sections only render when populated,
so plugins without enrichment look identical to before. Fields are
read on-demand from the working tree (cached by mtime per marketplace
slug) so curator edits land at the next request without waiting for
a sync cycle — same pattern as the existing inner-skill/agent
enrichment path. No DB schema bump.
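
A minimal sketch of the mtime-keyed on-demand read described above. The helper name `read_cached` and the module-level `_CACHE` dict are hypothetical stand-ins for illustration; the project's own `_read_metadata_cached` may be structured differently.

```python
import json
from pathlib import Path
from typing import Dict, Tuple

# Hypothetical cache: (marketplace slug, file mtime in ns) -> parsed JSON.
_CACHE: Dict[Tuple[str, int], dict] = {}


def read_cached(slug: str, path: Path) -> dict:
    """Re-parse the metadata file only when its mtime changes."""
    try:
        mtime_ns = path.stat().st_mtime_ns
    except OSError:
        return {}  # missing file: no enrichment, the page renders as before
    key = (slug, mtime_ns)
    if key not in _CACHE:
        _CACHE[key] = json.loads(path.read_text(encoding="utf-8"))
    return _CACHE[key]
```

Because the mtime is part of the key, a curator edit produces a new key on the next request, so no sync cycle is needed to see it. This sketch never evicts old keys; a real cache would need a bound.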

Skill / agent rich-content rendering is deferred to a later phase
(needs a source-of-truth decision: extend plugin.yml? LLM-generate
from SKILL.md / agent.md?). The schema accepts the same fields at
skill/agent level today for forward compatibility but the UI ignores
them for now.

Also: stripped a stale `background-color: var(--bg)` from the global
`code` rule in style.css (was making inline code visually disappear
on the page background).

* Skill / agent detail: render rich content from marketplace-metadata.json

Brings the skill/agent detail pages to parity with the plugin detail
page. Same rich-content schema (display_name, tagline, description as
markdown, use_cases[], sample_interaction) plus two per-item additions:

* invocation — curator-provided literal command string. When set,
  overrides the computed "<manifest_name>:<inner_name>" chip and
  cleanly supports both "/" skill prefix and "@" agent prefix (the
  hardcoded "/" in the chip markup is hidden when the curator provides
  the invocation, so /grpn-eng:query <q> and @grpn-eng:cto-architect
  both render correctly).
* when_to_use — markdown disambiguation block ("Use this for X. For
  similar Y, see /other-skill") rendered into a new "When to use this"
  panel below the Example section.

Skill / agent category is now per-item overridable in
marketplace-metadata.json. When absent, the API keeps the parent
plugin's category as the badge so existing items don't lose their
category until curators opt in to per-item categorization.

The new "Example" Q&A panel uses the same Claude Code-style dark
Catppuccin Mocha transcript treatment as the plugin detail —
monospace user row with a green ">" prompt indicator + sans-serif
assistant body with markdown formatting.

All new fields are optional and read on-demand from the working tree.
Skills / agents whose marketplace-metadata.json doesn't carry rich
content render exactly the same way they did before (frontmatter
description + computed slash command + cover from existing v32
enrichment). No DB schema bump.

* Fix TypeError in skill / agent detail when curator sets per-item category

`curated_skill_detail` and `curated_agent_detail` were passing both
`**parent` (from `_curated_inner_parent_fields`, which returns the
parent plugin's category as a fallback) and `**enrichment` (from
`_curated_inner_enrichment`, which returns the per-item category
override when the curator set one) into `InnerDetailResponse(...)`.

Python function-call kwargs unpacking with overlapping keys raises
`TypeError: got multiple values for keyword argument 'category'`
— it doesn't merge like a literal dict does. The bug only surfaced
when the marketplace-metadata.json carried a `category` field at
skill / agent level (curator opting into per-item categorization);
items without that override hit the endpoint cleanly because only
parent provided the key.

Fix: build `merged = {**parent, **enrichment}` first (literal-dict
syntax DOES merge, with the right-hand-side winning) and unpack the
merged dict. Curator override still wins via the merge order, and
the same pattern is future-proof for any other field that lands in
both layers later.

Plus a regression test in test_marketplace_metadata.py asserting
that the inner-resolver carries `category` for downstream merging.
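
The failure and the fix can be reproduced in isolation. This is a sketch only: `make_response` is a hypothetical stand-in for `InnerDetailResponse(...)`, and the dict contents are invented for illustration.

```python
def make_response(**fields):
    """Stand-in for InnerDetailResponse(...); it just returns its kwargs."""
    return fields


parent = {"category": "engineering", "plugin": "demo"}  # parent fallback
enrichment = {"category": "data"}                       # per-item override

# Unpacking both dicts into one call raises, because 'category' arrives twice.
try:
    make_response(**parent, **enrichment)
    raised = False
except TypeError:
    raised = True

# Merging first is fine; the right-hand dict wins on overlapping keys.
merged = {**parent, **enrichment}
response = make_response(**merged)
```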

* Marketplace detail: tolerate partial curator JSON

Server constructed UseCase / SampleInteraction via raw dict indexing
(uc["title"], sample["assistant"]), so a curator commit missing any
required Pydantic field crashed the whole plugin / skill / agent detail
endpoint with a 500. Route both constructions through _safe_use_case /
_safe_sample_interaction helpers — partial input silently drops the
malformed card / section instead of breaking the page.

Regression test in test_marketplace_api.py covers the three shapes:
use_case missing a key, use_case with an empty string, and
sample_interaction with only user (no assistant). Sibling rich fields
still render.
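
A sketch of how such a tolerant helper might look. It uses a dataclass as a stand-in for the Pydantic model and assumes that `title`, `description`, and `prompt` are all required and must be non-empty strings; the names `safe_use_case` and `safe_use_cases` are hypothetical, not the project's `_safe_use_case`.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class UseCase:
    # Stand-in for the Pydantic model; field names come from the commit text.
    title: str
    description: str
    prompt: str


def safe_use_case(raw: Any) -> Optional[UseCase]:
    """Return a UseCase, or None when any required field is missing or empty."""
    if not isinstance(raw, dict):
        return None
    values = [raw.get(k) for k in ("title", "description", "prompt")]
    if not all(isinstance(v, str) and v.strip() for v in values):
        return None
    return UseCase(*values)


def safe_use_cases(raw_list: Any) -> List[UseCase]:
    """Keep well-formed cards and silently drop malformed ones."""
    if not isinstance(raw_list, list):
        return []
    return [uc for uc in map(safe_use_case, raw_list) if uc is not None]
```

The point of the design is that one malformed card costs only that card: its siblings and the rest of the page still render.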

* Address PR-251 review (must-fixes + S2/S3 polish) + release-cut 0.50.0

Five must-fixes from the review pass (3 from @cvrysanek's two-stage
review, 2 from my independent pass), plus the 0.50.0 release-cut as the
last commit on this PR, per the CLAUDE.md "Release-cut belongs to the
PR" rule added in v0.49.1.

Must-fixes
----------

1. Cache eviction: bounded LRU instead of per-marketplace predicate.
   The previous predicate (`k[0] == marketplace_id and k[1] != mtime_ns`)
   only swept stale entries for the CURRENT marketplace; with N>100
   distinct marketplaces each holding one mtime key, the cap silently
   failed and memory grew linearly. Replaced with OrderedDict-backed
   bounded LRU at cap=256, drop oldest insert on overflow.
   Cache stress test pinned in test_marketplace_metadata.py.

2. Render CPU cap: per-field byte cap on description / when_to_use /
   sample_interaction.assistant via MARKETPLACE_METADATA_FIELD_MAX_BYTES
   (= 64 KiB). Without this, a 1 MiB curator markdown body × QPS =
   curator-controlled CPU burn through pure-Python markdown-it-py.
   Truncation respects UTF-8 boundaries and logs a warning so the
   curator sees the cap fire on the next sync. Test for cap +
   UTF-8-boundary preservation.

3. Inner-detail bypassed the metadata cache. _curated_inner_enrichment,
   _curated_inner_cover, and curated_detail all called
   read_marketplace_metadata directly, defeating the mtime cache the
   plugin listing already shared. Routed all three through
   _read_metadata_cached so skill/agent detail hits are O(1) re-parses
   per marketplace per mtime instead of O(QPS).

4. Truthy-vs-presence trap in plugin/inner enrichment merge. API-layer
   writers used `if resolved.get(k):` which silently dropped any
   future falsy-but-valid resolver field (bool featured=False, int
   priority=0, str category=''). Switched to presence check
   (`if k in resolved`) so the resolver is the authority on field
   presence; `{**parent, **enrichment}` merge respects whatever the
   resolver decided to ship.

5. Vendor-agnostic OSS cleanup. Removed operator-specific token
   references (/grpn-eng:, @grpn-eng:, .foundryai/) from
   src/marketplace_metadata.py docstring, app/web/templates/
   marketplace_item_detail.html JS comment, docs/curated-marketplace-
   format.md, and tests/test_marketplace_metadata.py fixtures. Replaced
   with generic /my-plugin:tool / @my-agent:role / .example/ placeholders.
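
Must-fixes 1, 2, and 4 can be sketched in a few lines each. Everything below is illustrative: `BoundedCache`, `truncate_utf8`, and `merge_present` are hypothetical names, and only the cap of 256 and the 64 KiB limit come from the text above.

```python
from collections import OrderedDict
from typing import Any, Dict, Hashable, Iterable

CACHE_CAP = 256               # cap quoted in must-fix 1
FIELD_MAX_BYTES = 64 * 1024   # 64 KiB, quoted in must-fix 2


class BoundedCache:
    """Insertion-ordered cache that drops the oldest entry on overflow."""

    def __init__(self, cap: int = CACHE_CAP) -> None:
        self._cap = cap
        self._data: "OrderedDict[Hashable, Any]" = OrderedDict()

    def put(self, key: Hashable, value: Any) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self._cap:
            self._data.popitem(last=False)  # evict the oldest insert

    def get(self, key: Hashable, default: Any = None) -> Any:
        return self._data.get(key, default)

    def __len__(self) -> int:
        return len(self._data)


def truncate_utf8(text: str, max_bytes: int = FIELD_MAX_BYTES) -> str:
    """Cap the encoded size without splitting a multi-byte character."""
    raw = text.encode("utf-8")
    if len(raw) <= max_bytes:
        return text
    # errors="ignore" drops the partial sequence left at the cut point.
    return raw[:max_bytes].decode("utf-8", errors="ignore")


def merge_present(
    target: Dict[str, Any], resolved: Dict[str, Any], keys: Iterable[str]
) -> Dict[str, Any]:
    """Copy keys by presence, so falsy-but-valid values survive."""
    out = dict(target)
    for k in keys:
        if k in resolved:  # presence check, not truthiness
            out[k] = resolved[k]
    return out
```

The eviction bound holds regardless of how many distinct marketplaces appear, which is the property the per-marketplace predicate lacked.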

CHANGELOG
---------
- New "### Fixed (PR #251 follow-ups)" section documenting all 4
  code-side must-fixes
- New "### Internal" section noting the vendor cleanup + new tests
- BREAKING bullet for the file rename now covers operator-side
  migration: running instances see plugin enrichment disappear from
  the UI until upstream curator renames + nightly sync overwrites the
  working tree; POST /api/marketplaces/{id}/sync forces refresh sooner
- Stripped /grpn-eng: leaks from the existing skill/agent rich-content
  bullet

Tests
-----
128 targeted tests pass (test_marketplace_metadata, test_marketplace_api,
test_marketplace, test_markdown_render, test_marketplace_synth_strip,
test_marketplace_filter). New tests added:
- 6 XSS regression tests on render_safe (javascript:/data:/vbscript:
  schemes via autolink, reference link, and mixed-case + positive
  http/https/mailto + noopener noreferrer rel)
- 3 byte-cap tests (truncation + UTF-8 boundary + under-cap pass-through)
- 1 cache eviction stress test (>256 marketplaces -> bounded at cap)
- 1 truthy-vs-presence resolver-contract test

Release-cut
-----------
- pyproject.toml 0.49.1 -> 0.50.0 (minor; BREAKING file rename per
  pre-1.0 CHANGELOG note: "breaking changes called out under Changed
  or Removed with the BREAKING marker")
- CHANGELOG [Unreleased] -> [0.50.0] - 2026-05-12, new empty
  [Unreleased] on top.

---------

Co-authored-by: Minas Arustamyan <arustamyan.minas@gmail.com>
Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
2026-05-12 08:38:39 +00:00


"""External-asset mirror cache for curated marketplaces.

The curator's ``.claude-plugin/marketplace-metadata.json`` may reference cover
photos and doc files by external HTTP(S) URL. Linkrot would then mean the
Agnes web UI starts showing broken images / dead links the moment the
upstream CDN serves a 404. This module mirrors those URLs to disk at sync
time and serves the local copy thereafter.

**On-disk layout** (per marketplace slug)::

    ${DATA_DIR}/marketplace-cache/<slug>/
    ├── manifest.json          # url → cache entry
    └── <plugin>/
        ├── cover.<ext>
        └── docs/<sha8>-<filename>

**Re-fetch logic per URL on every sync:**

1. URL not yet in manifest → unconditional GET, save body + record
   ETag / Last-Modified / sha256.
2. URL already mirrored → conditional GET (``If-None-Match`` /
   ``If-Modified-Since``):

   - 304 Not Modified → keep cached file, refresh ``fetched_at`` only.
   - 200 OK with same sha256 → keep file, refresh validators.
   - 200 OK with new sha256 → overwrite local file.
3. URL removed from marketplace-metadata.json → ``cleanup_unused`` removes the
   manifest entry and the local file.

**Failure modes** (b1 fallback per the design discussion):
fetch failure (timeout, 4xx/5xx, allowlist reject, oversized, SSRF block)
keeps the **last good copy** intact in the cache, sets ``status = "failed_*"``
on the manifest entry, and logs a warning. The caller surfaces "mirror failed"
in the admin UI but never breaks the served plugin detail.

**SSRF guards:** only ``http(s)://`` schemes accepted, DNS resolution rejects
private / loopback / link-local / metadata IPs, 60-second timeout, 10 MB cap,
max 4 concurrent fetches per sync.
"""
from __future__ import annotations
import concurrent.futures
import hashlib
import ipaddress
import json
import logging
import re
import shutil
import socket
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, List, Optional, Tuple
from urllib.parse import urlparse
import httpx
from src.marketplace_asset_validation import (
    DOC_EXTENSIONS,
    IMAGE_EXTENSIONS,
    accept_doc_response,
    accept_image_response,
    validate_doc_file,
    validate_image_file,
)
logger = logging.getLogger(__name__)
# Hardcoded operational caps. The plan deferred making these configurable —
# the comment in `instance.yaml` would be one line if/when an operator hits
# a real limit (today nothing in our org has cover images > 10 MB).
HTTP_TIMEOUT_SEC = 60
"""Per-request timeout for outgoing mirror fetches.
Larger PDFs from slow CDNs (e.g. Adobe support, government archives)
routinely exceed 30s on a residential connection — bumped from 30 → 60.
The sync runs nightly under a thread pool with bounded concurrency so
worst-case sync time grows linearly, not multiplicatively, with this
value. Operators can still cap a runaway curator by trimming
``MAX_BODY_BYTES`` (10 MB) — the timeout only matters for slow tails."""
MAX_BODY_BYTES = 10 * 1024 * 1024 # 10 MB
MAX_CONCURRENT_FETCHES = 4
USER_AGENT = (
    "Agnes-Marketplace-Mirror/1.0 "
    "(+https://github.com/keboola/agnes-the-ai-analyst; agnes-mirror)"
)
"""HTTP User-Agent for outgoing mirror fetches.
Wikipedia / Wikimedia commons strictly enforces a User-Agent policy and
returns HTTP 400 to clients with generic strings (see
https://meta.wikimedia.org/wiki/User-Agent_policy). The format below
includes a contact URL + descriptor which satisfies their parser. Other
strict CDNs (e.g. arXiv, some news sites) similarly require a non-trivial
UA — using the same string everywhere keeps debugging simple."""
MANIFEST_FILENAME = "manifest.json"
@dataclass
class MirrorEntry:
    """One row in ``manifest.json`` — keyed by external URL."""

    url: str
    kind: str  # "cover" | "doc"
    plugin_name: str
    local: str  # relative path inside the marketplace cache dir
    etag: str = ""
    last_modified: str = ""
    sha256: str = ""
    fetched_at: str = ""  # ISO timestamp of last successful body write
    last_checked_at: str = ""  # ISO timestamp of last fetch attempt
    status: str = "unknown"  # "ok" | "failed_recent" | "failed_first" | "rejected"
    error: str = ""

    def to_json(self) -> dict:
        return asdict(self)

    @classmethod
    def from_json(cls, d: dict) -> "MirrorEntry":
        return cls(
            url=d.get("url", ""),
            kind=d.get("kind", ""),
            plugin_name=d.get("plugin_name", ""),
            local=d.get("local", ""),
            etag=d.get("etag", ""),
            last_modified=d.get("last_modified", ""),
            sha256=d.get("sha256", ""),
            fetched_at=d.get("fetched_at", ""),
            last_checked_at=d.get("last_checked_at", ""),
            status=d.get("status", "unknown"),
            error=d.get("error", ""),
        )
@dataclass
class MirrorReport:
    """Per-sync summary returned to the caller."""

    requested: int = 0
    fetched: int = 0
    not_modified: int = 0
    failed: int = 0
    rejected: int = 0
    removed: int = 0
    entries: Dict[Tuple[str, str], MirrorEntry] = field(default_factory=dict)
# ---------------------------------------------------------------------------
# SSRF / safety helpers
# ---------------------------------------------------------------------------
def _resolve_safe(url: str) -> Tuple[bool, str, str]:
    """Reject URLs we shouldn't follow and return the IP the caller MUST connect to.

    Returns ``(ok, reason, pinned_ip)``. On rejection ``pinned_ip`` is empty.

    Why the pinned IP matters: ``urllib`` would otherwise re-resolve the
    hostname at connection time, and an attacker-controlled DNS server can
    return a public IP for the validation lookup and ``127.0.0.1`` /
    ``169.254.169.254`` for the connection lookup (DNS rebinding). Resolving
    once here and connecting to that exact IP defeats the rebind. ALL
    addresses returned by ``getaddrinfo`` are validated — round-robin DNS
    that mixes public + private IPs is treated as unsafe regardless of which
    one we'd have picked first.
    """
    try:
        parts = urlparse(url)
    except ValueError as e:
        return False, f"bad_url: {e}", ""
    if parts.scheme not in ("http", "https"):
        return False, f"unsupported_scheme: {parts.scheme}", ""
    host = parts.hostname or ""
    if not host:
        return False, "missing_host", ""
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror as e:
        return False, f"dns_failure: {e}", ""
    chosen_ip = ""
    for info in infos:
        sockaddr = info[4]
        ip_str = sockaddr[0]
        try:
            addr = ipaddress.ip_address(ip_str)
        except ValueError:
            return False, f"unparseable_address: {ip_str}", ""
        if (
            addr.is_private
            or addr.is_loopback
            or addr.is_link_local
            or addr.is_multicast
            or addr.is_reserved
            or addr.is_unspecified
        ):
            return False, f"address_in_blocked_range: {ip_str}", ""
            # AWS / GCP / Azure metadata endpoints fall under is_link_local
            # (169.254.169.254) above — explicit additional check for IPv6
            # ULA + the broad metadata-style catchall would be belt-and-
            # suspenders only.
        # Prefer the first IPv4 result for connection pinning (broader CDN
        # compatibility); fall back to the first record otherwise.
        if not chosen_ip and info[0] == socket.AF_INET:
            chosen_ip = ip_str
    if not chosen_ip and infos:
        chosen_ip = infos[0][4][0]
    if not chosen_ip:
        return False, "no_address", ""
    return True, "", chosen_ip
def _is_safe_url(url: str) -> Tuple[bool, str]:
    """Backwards-compatible 2-tuple wrapper over :func:`_resolve_safe`.

    Existing tests (and any external callers that only care about the
    accept/reject decision) keep working unchanged. The pinned IP returned
    by ``_resolve_safe`` is consumed internally by the connection-pinning
    handlers below.
    """
    ok, reason, _ = _resolve_safe(url)
    return ok, reason
# ---------------------------------------------------------------------------
# SSRF-aware httpx transport + shared client
#
# Two threats against the simple "validate URL, then GET" pattern:
# 1. Redirect bypass — without revalidation, an attacker 302s to
# http://169.254.169.254/... and we mirror cloud metadata.
# 2. DNS rebinding — without IP pinning, the connect-time DNS lookup
# can return a different IP than the validation lookup.
#
# httpx makes both defences collapse into a single custom Transport:
# httpx invokes ``handle_request()`` on EVERY outgoing request — including
# every redirect hop — so re-running SSRF validation in the transport
# closes the redirect bypass for free. Within ``handle_request`` we also
# rewrite the URL host to the IP we just validated and stash the original
# hostname in the ``Host`` header + the ``sni_hostname`` extension so TLS
# SNI / cert verification still bind to the curator-supplied hostname.
# ---------------------------------------------------------------------------
class _SSRFRejected(Exception):
    """Raised inside ``_SSRFGuardTransport`` when the SSRF allowlist rejects
    the (initial or redirected) URL.

    Distinct from ``httpx.RequestError`` so ``_fetch_url`` maps this to
    ``status='rejected'`` (terminal — security decision, never retry).
    """

    def __init__(self, reason: str) -> None:
        self.reason = reason
        super().__init__(reason)
class _SSRFGuardTransport(httpx.HTTPTransport):
    """Transport that re-validates SSRF rules on every outgoing request and
    pins the connection to the IP we just resolved.

    Redirect re-validation comes for free because httpx invokes
    ``handle_request()`` once per redirect hop (when the client is
    configured with ``follow_redirects=True``). DNS-rebinding defence
    comes from rewriting the URL host to the validated IP — httpcore
    no longer re-resolves the hostname at connect time.
    """

    def handle_request(self, request: httpx.Request) -> httpx.Response:
        ok, reason, ip = _resolve_safe(str(request.url))
        if not ok:
            raise _SSRFRejected(reason)
        original_host = request.url.host
        # Rewrite the URL host to the validated IP. httpcore opens the
        # connection to whatever ``request.url.host`` says, so this is what
        # actually pins the connection.
        request.url = request.url.copy_with(host=ip)
        # Preserve the original hostname for vhost routing + TLS SNI / cert
        # verification. ``sni_hostname`` is a documented httpx extension
        # honored by the TLS layer in 0.24+.
        request.headers["Host"] = original_host
        request.extensions = {
            **request.extensions,
            "sni_hostname": original_host,
        }
        return super().handle_request(request)
_CLIENT: Optional[httpx.Client] = None


def _get_client() -> httpx.Client:
    """Lazy module-level ``httpx.Client`` shared across the fetch pool.

    Same lifecycle pattern as ``cli/client.py``'s ``_get_shared_client``:
    build once on first use, reuse for the process lifetime. ``httpx.Client``
    is thread-safe for concurrent ``send()`` / ``stream()`` calls so a
    ``ThreadPoolExecutor`` can hammer it without external locking.
    """
    global _CLIENT
    if _CLIENT is None:
        _CLIENT = httpx.Client(
            transport=_SSRFGuardTransport(),
            timeout=HTTP_TIMEOUT_SEC,
            follow_redirects=True,
            # Tightened from the httpx default of 20. Legitimate CDN chains
            # (S3 → presigned, DOI → publisher) routinely use 3-4 hops;
            # 5 leaves headroom without giving attackers many hops to scan.
            max_redirects=5,
            headers={"User-Agent": USER_AGENT},
        )
    return _CLIENT
def _safe_filename(url: str, default_ext: str) -> str:
    """Derive a stable, FS-safe filename from a URL.

    Format: ``<sha8(url)>-<basename>``. The hash prefix means two URLs with
    the same trailing filename don't collide; the human-readable basename
    helps when an operator browses the cache dir directly.
    """
    parts = urlparse(url)
    base = Path(parts.path).name or "download"
    base = re.sub(r"[^a-zA-Z0-9._-]", "_", base)[:64]
    if not base or base.startswith("."):
        base = f"download{default_ext}"
    sha8 = hashlib.sha256(url.encode("utf-8")).hexdigest()[:8]
    return f"{sha8}-{base}"
# ---------------------------------------------------------------------------
# Manifest persistence
# ---------------------------------------------------------------------------
def _load_manifest(cache_dir: Path) -> Dict[Tuple[str, str], MirrorEntry]:
    """Read the on-disk manifest into an in-memory ``(plugin_name, url) → entry`` map.

    The composite key is what makes the manifest RBAC-safe: two plugins in
    the same marketplace can reference the same external URL (shared CDN
    icon, common cover image) and each gets its own entry pointing under
    its own plugin subdir, so an analyst with grant on plugin B never
    receives a URL pointing under plugin A's tree.

    On-disk format is a list of self-describing entries (each carries
    ``plugin_name`` + ``url`` fields), not a JSON dict — JSON keys can't
    be tuples and concatenating ``"plugin::url"`` would just shift the
    parsing burden.
    """
    path = cache_dir / MANIFEST_FILENAME
    if not path.is_file():
        return {}
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError) as e:
        logger.warning("mirror manifest %s unreadable, starting fresh: %s", path, e)
        return {}
    entries = data.get("entries") if isinstance(data, dict) else None
    if not isinstance(entries, list):
        return {}
    out: Dict[Tuple[str, str], MirrorEntry] = {}
    for raw in entries:
        if not isinstance(raw, dict):
            continue
        entry = MirrorEntry.from_json(raw)
        if not entry.url or not entry.plugin_name:
            continue
        out[(entry.plugin_name, entry.url)] = entry
    return out
def _write_manifest(
    cache_dir: Path,
    entries: Dict[Tuple[str, str], MirrorEntry],
) -> None:
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / MANIFEST_FILENAME
    body = {
        "version": 2,
        "entries": [e.to_json() for e in entries.values()],
    }
    tmp = path.with_suffix(".json.tmp")
    tmp.write_text(json.dumps(body, indent=2), encoding="utf-8")
    tmp.replace(path)
# ---------------------------------------------------------------------------
# HTTP fetch
# ---------------------------------------------------------------------------
@dataclass
class FetchOutcome:
    status: str  # "ok" | "not_modified" | "failed" | "rejected"
    body: bytes = b""
    content_type: str = ""
    etag: str = ""
    last_modified: str = ""
    error: str = ""
def _fetch_url(
    url: str,
    *,
    prior: Optional[MirrorEntry],
    expect_kind: str,
) -> FetchOutcome:
    """Single HTTP GET (with conditional headers when ``prior`` provides them).

    SSRF + size + allowlist enforcement happen here. Any rejection produces
    ``status="rejected"`` (terminal — caller doesn't retry); any transient
    network error produces ``status="failed"`` (caller may surface and try
    again next sync).

    Pre-flight ``_resolve_safe`` here gives us a fast, type-safe rejection
    *before* httpx is invoked. The transport will revalidate again (and
    perform the IP pin), but bailing out early avoids the cost of building
    a request object for an obviously bad URL.
    """
    safe, reason, _ip = _resolve_safe(url)
    if not safe:
        return FetchOutcome(status="rejected", error=reason)
    headers: Dict[str, str] = {}
    if prior:
        if prior.etag:
            headers["If-None-Match"] = prior.etag
        if prior.last_modified:
            headers["If-Modified-Since"] = prior.last_modified
    client = _get_client()
    try:
        with client.stream("GET", url, headers=headers) as resp:
            status_code = resp.status_code
            if status_code == 304:
                return FetchOutcome(
                    status="not_modified",
                    etag=prior.etag if prior else "",
                    last_modified=prior.last_modified if prior else "",
                )
            if status_code >= 400:
                return FetchOutcome(status="failed", error=f"http_{status_code}")
            content_type = resp.headers.get("Content-Type", "") or ""
            etag = resp.headers.get("ETag", "") or ""
            last_modified = resp.headers.get("Last-Modified", "") or ""
            # Allowlist gate based on Content-Type (cheaper than reading body
            # before deciding). For docs we additionally accept generic types
            # backed by a URL-extension match.
            if expect_kind == "cover":
                check = accept_image_response(url, content_type)
            else:
                check = accept_doc_response(url, content_type)
            if not check.ok:
                return FetchOutcome(
                    status="rejected",
                    content_type=content_type,
                    error=check.reason,
                )
            # Stream with a hard cap so a misbehaving server can't OOM us.
            # Bail out as soon as the cap is exceeded — don't read the
            # rest of the body just to discard it.
            body = bytearray()
            for chunk in resp.iter_bytes(chunk_size=65536):
                body.extend(chunk)
                if len(body) > MAX_BODY_BYTES:
                    return FetchOutcome(
                        status="rejected",
                        error=f"body_exceeds_cap: > {MAX_BODY_BYTES} bytes",
                    )
            return FetchOutcome(
                status="ok",
                body=bytes(body),
                content_type=content_type,
                etag=etag,
                last_modified=last_modified,
            )
    except _SSRFRejected as e:
        return FetchOutcome(status="rejected", error=e.reason)
    except httpx.TooManyRedirects:
        return FetchOutcome(status="failed", error="too_many_redirects")
    except httpx.TimeoutException:
        return FetchOutcome(status="failed", error="timeout")
    except httpx.HTTPError as e:
        # Catches ConnectError, ReadError, RemoteProtocolError, and the
        # rest of the httpx transport-error hierarchy. Same shape as
        # ``cli/client.py:_translate_transport_error`` — collapse all
        # transient failures into one ``failed`` outcome with an error tag
        # the operator can grep for.
        return FetchOutcome(status="failed", error=f"http_error: {e!r}")
    except Exception as e:  # noqa: BLE001 — defensive, never abort the sync
        logger.exception("mirror fetch crashed for %s", url)
        return FetchOutcome(status="failed", error=f"crash: {e!r}")
# ---------------------------------------------------------------------------
# Body-side validation + write
# ---------------------------------------------------------------------------
def _validate_body(filename: str, body: bytes, kind: str):
    if kind == "cover":
        return validate_image_file(filename, body)
    return validate_doc_file(filename, body)


def _local_relpath(plugin_name: str, kind: str, fname: str) -> str:
    if kind == "cover":
        return f"{plugin_name}/{fname}"
    return f"{plugin_name}/docs/{fname}"


def _write_body(cache_dir: Path, relpath: str, body: bytes) -> None:
    """Write ``body`` to ``cache_dir/relpath`` atomically (tmp + rename)."""
    full = cache_dir / relpath
    full.parent.mkdir(parents=True, exist_ok=True)
    tmp = full.with_suffix(full.suffix + ".tmp")
    tmp.write_bytes(body)
    tmp.replace(full)
# ---------------------------------------------------------------------------
# Public API — one entry point per plugin
# ---------------------------------------------------------------------------
def sync_assets(
*,
cache_dir: Path,
requests: List[Tuple[str, str, str]],
) -> MirrorReport:
"""Mirror every URL in ``requests`` into ``cache_dir``.
``requests`` is a list of ``(plugin_name, kind, url)`` tuples produced by
:func:`src.marketplace_metadata.collect_external_urls`. Returns a
:class:`MirrorReport` summarising the outcome plus the resulting manifest
so the caller can build a ``url → served_path`` lookup for the DB write.
Exceptions inside :func:`_fetch_url` and :func:`_write_body` are caught
by the surrounding ``except`` so one bad URL never aborts the rest of the
sync. URLs absent from ``requests`` but present in the existing manifest
are removed from disk + manifest (the curator dropped them upstream).
"""
cache_dir.mkdir(parents=True, exist_ok=True)
manifest = _load_manifest(cache_dir)
report = MirrorReport(requested=len(requests))
requested_keys = {(plugin_name, url) for plugin_name, _, url in requests}
# Phase 1 — dedup fetches by URL. Two plugins referencing the same
# external image share one HTTP fetch (saves bandwidth, avoids the
# rate-limit pressure on slow CDNs the previous version would have
# caused). We pick any owning plugin's prior MirrorEntry as the source
# of conditional-GET headers — if it has an etag, all owning plugins
# benefit from the 304; if their etags diverge (rare), worst case is
# one full re-download instead of an optimal mix.
fetch_inputs: Dict[str, Tuple[str, Optional[MirrorEntry]]] = {}
for plugin_name, kind, url in requests:
if url in fetch_inputs:
continue
fetch_inputs[url] = (kind, manifest.get((plugin_name, url)))
def _do_one(item: Tuple[str, Tuple[str, Optional[MirrorEntry]]]) -> Tuple[str, FetchOutcome]:
url, (kind, prior) = item
return url, _fetch_url(url, prior=prior, expect_kind=kind)
outcome_by_url: Dict[str, FetchOutcome] = {}
with concurrent.futures.ThreadPoolExecutor(
max_workers=MAX_CONCURRENT_FETCHES
) as pool:
for url, outcome in pool.map(_do_one, list(fetch_inputs.items())):
outcome_by_url[url] = outcome
# Phase 2 — process outcomes per (plugin, url) pair so each owner gets
# its own manifest entry pointing under its own plugin subdir.
now_iso = datetime.now(timezone.utc).isoformat()
for plugin_name, kind, url in requests:
outcome = outcome_by_url[url]
key = (plugin_name, url)
prior = manifest.get(key)
if outcome.status == "not_modified" and prior:
prior.last_checked_at = now_iso
prior.error = ""
prior.status = "ok" # 304 means the cached file is still valid
manifest[key] = prior
report.not_modified += 1
continue
if outcome.status == "rejected":
entry = prior or MirrorEntry(url=url, kind=kind, plugin_name=plugin_name, local="")
entry.status = "rejected"
entry.last_checked_at = now_iso
entry.error = outcome.error
manifest[key] = entry
report.rejected += 1
logger.warning(
"mirror rejected plugin=%s url=%s kind=%s reason=%s",
plugin_name, url, kind, outcome.error,
)
continue
if outcome.status == "failed":
entry = prior or MirrorEntry(url=url, kind=kind, plugin_name=plugin_name, local="")
# Distinguish first-time failures from failures where we already hold a cached copy.
entry.status = "failed_recent" if prior and prior.local else "failed_first"
entry.last_checked_at = now_iso
entry.error = outcome.error
manifest[key] = entry
report.failed += 1
logger.warning(
"mirror fetch failed plugin=%s url=%s kind=%s reason=%s (keep_prior=%s)",
plugin_name, url, kind, outcome.error, bool(prior and prior.local),
)
continue
# outcome.status == "ok" — body present
# Pick the filename extension: prefer the URL suffix, else map the
# Content-Type, else fall back to a per-kind default (.png for covers,
# .txt for everything else).
default_ext = ".bin"
if kind == "cover":
for e in IMAGE_EXTENSIONS:
if url.lower().endswith(e):
default_ext = e
break
else:
ct = outcome.content_type.split(";", 1)[0].strip().lower()
default_ext = {
"image/png": ".png",
"image/jpeg": ".jpg",
"image/webp": ".webp",
}.get(ct, ".png")
else:
for e in DOC_EXTENSIONS:
if url.lower().endswith(e):
default_ext = e
break
else:
ct = outcome.content_type.split(";", 1)[0].strip().lower()
default_ext = {
"application/pdf": ".pdf",
"text/markdown": ".md",
"text/x-markdown": ".md",
"text/plain": ".txt",
}.get(ct, ".txt")
fname = _safe_filename(url, default_ext)
# Ensure the chosen filename has the resolved extension so body
# validation (which looks at the suffix) accepts it.
if not fname.lower().endswith(default_ext):
fname = fname + default_ext
validation = _validate_body(fname, outcome.body, kind)
if not validation.ok:
entry = prior or MirrorEntry(url=url, kind=kind, plugin_name=plugin_name, local="")
entry.status = "rejected"
entry.last_checked_at = now_iso
entry.error = f"body_validation: {validation.reason}"
manifest[key] = entry
report.rejected += 1
logger.warning(
"mirror body rejected plugin=%s url=%s reason=%s",
plugin_name, url, validation.reason,
)
continue
new_sha = hashlib.sha256(outcome.body).hexdigest()
relpath = _local_relpath(plugin_name, kind, fname)
# Only rewrite the body when the hash changed or the cached file is
# missing. This saves disk IO and keeps "fetched_at" correlated with
# real content changes.
if not prior or prior.sha256 != new_sha or not (cache_dir / prior.local).is_file():
try:
_write_body(cache_dir, relpath, outcome.body)
except OSError as e:
logger.warning("mirror write failed url=%s: %s", url, e)
report.failed += 1
continue
entry = MirrorEntry(
url=url,
kind=kind,
plugin_name=plugin_name,
local=relpath,
etag=outcome.etag,
last_modified=outcome.last_modified,
sha256=new_sha,
fetched_at=now_iso,
last_checked_at=now_iso,
status="ok",
)
manifest[key] = entry
# Persist the manifest BEFORE unlinking the old body. A kill -9
# between body-write and the end-of-batch persist would otherwise
# leave on-disk files the next sync's manifest never references —
# disk bloats over time as URLs come and go from the curator's
# marketplace-metadata.json. Per-iteration persist narrows the crash
# window from "all of Phase 2" to "between persist and unlink"
# (microseconds). Cost: ~one tmp+rename per body write; manifest
# is a few KB so the overhead is negligible vs. the HTTP fetches.
try:
_write_manifest(cache_dir, manifest)
except OSError as e:
# Body is on disk but the manifest didn't commit. Don't
# unlink the old body — the on-disk manifest still
# references it, and serving a stale-but-existing file
# beats serving a 404.
logger.warning(
"mirror manifest persist failed mid-batch url=%s: %s", url, e,
)
report.failed += 1
continue
# If the previous local file lived at a different path, drop it.
if prior and prior.local and prior.local != relpath:
try:
(cache_dir / prior.local).unlink(missing_ok=True)
except OSError:
pass
else:
prior.etag = outcome.etag or prior.etag
prior.last_modified = outcome.last_modified or prior.last_modified
prior.last_checked_at = now_iso
prior.status = "ok"
prior.error = ""
manifest[key] = prior
report.fetched += 1
# Phase 3: drop manifest entries the curator removed upstream, plus
# their on-disk bodies. Same persist-before-unlink discipline as
# Phase 2: collect the relpaths to delete, persist the manifest with
# the entries already gone, *then* unlink. A crash between the persist
# and the unlink can leave a file on disk that the manifest no longer
# names. The next sync reads the (now-correct) manifest, so the orphan
# is never served and the served state stays consistent; the orphaned
# file itself is not cleaned up.
removed_paths: List[str] = []
for key in list(manifest.keys()):
if key in requested_keys:
continue
entry = manifest.pop(key)
if entry.local:
removed_paths.append(entry.local)
report.removed += 1
_write_manifest(cache_dir, manifest)
for relpath in removed_paths:
try:
(cache_dir / relpath).unlink(missing_ok=True)
except OSError:
pass
report.entries = manifest
return report
def delete_cache_dir(cache_dir: Path) -> bool:
"""Remove the entire mirror cache for one marketplace. True iff removed."""
if cache_dir.exists():
shutil.rmtree(cache_dir, ignore_errors=True)
return True
return False