* Rename agnes-metadata.json to marketplace-metadata.json
Curated marketplace enrichment file (.claude-plugin/agnes-metadata.json)
becomes marketplace-metadata.json. Clean cut, no fallback — curators of
upstream marketplace repos must rename the file on their side.
Python API renames mirror the file rename: read_agnes_metadata →
read_marketplace_metadata, AGNES_METADATA_REL → MARKETPLACE_METADATA_REL,
AGNES_METADATA_MAX_BYTES → MARKETPLACE_METADATA_MAX_BYTES. Synth Claude
Code marketplace strip rule (.agnes/** + the metadata file) follows the
new filename.
* Marketplace detail polish: window cover + 715:310 aspect + helper alignment
- Plugin & item (skill/agent) detail hero: 160x160 square cover replaced
with a macOS-style window frame (3 traffic-light dots + titlebar label
showing the entity name). Body is constrained to 715:310 so curator-
uploaded covers no longer crop to a square. Window is 380px wide; meta
column and absolutely-positioned top-right install/remove actions stay
put. Fallback when no cover_photo_url (translucent gradient + PL/SK/AG
initials) is unchanged, just inside the window body.
- Inner skill/agent cards in the plugin detail's Internal structure
section adopt the same 715:310 aspect (was fixed 78px tall). No window
chrome on inner cards — just the matching proportions so covers read
consistently across hero, grid tiles, and listing cards.
- Curated nested item helper text ("This skill is part of ... — add the
bundle to your stack to use it") now stacks UNDER the "Open parent
plugin" button instead of being a side-by-side flex sibling in the
actions-row. Added align-self: flex-end so the 260px helper box
anchors at the right edge of the 300px actions column, matching the
button's right edge.
* Marketplace My tab: surface the same category + type filters as Flea
- Frontend: mp-cat-row and mp-type-row now show on tab=my (previously
hidden — type was flea-only, category was flea/curated-only). Curated
browse stays plugin-only and continues to hide the type pills.
fetchOne() sends the `type` param for tab=my too, so the items
endpoint's existing my-branch filter actually receives it.
- Backend categories endpoint, tab=my branch: when the type filter is
set to skill/agent, skip counting curated subscriptions. Curated
plugins are always type='plugin', so they wouldn't survive the items
endpoint's type filter; including them in the category counts made
the pill numbers overstate what users could actually see in the
grid. type=None or type='plugin' keeps the previous behaviour.
- CHANGELOG entry under [Unreleased].
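
The category-count rule for tab=my can be sketched as follows. This is a minimal illustration, not the endpoint's code: `count_my_categories`, the dict-shaped rows, and the `category` / `type` keys are all hypothetical names chosen for the example.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def count_my_categories(
    own_items: Iterable[dict],        # hypothetical: the user's flea items
    curated_plugins: Iterable[dict],  # hypothetical: curated subscriptions
    type_filter: Optional[str],
) -> Dict[str, int]:
    """Count items per category, mirroring what the grid would show."""
    counts: Counter = Counter(
        item["category"]
        for item in own_items
        if type_filter is None or item["type"] == type_filter
    )
    # Curated subscriptions are always type='plugin', so they only belong in
    # the counts when the filter is unset or explicitly 'plugin'.
    if type_filter in (None, "plugin"):
        counts.update(p["category"] for p in curated_plugins)
    return dict(counts)
```

With a skill filter active, only the user's own skills contribute to the pill numbers, which keeps the counts in line with the grid.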
* Marketplace plugin detail: render rich content from marketplace-metadata.json
Adds five optional plugin-level fields to marketplace-metadata.json and
renders them on the curated plugin detail page + listing card:
* display_name — friendly h1 / listing-card name / mac-window titlebar
label (overrides the technical plugin id)
* tagline — punchy 1-line value prop for the hero subtitle and the
listing card description (replacing the verbose marketplace.json
description on cards)
* description — multi-paragraph markdown body, server-side rendered
through markdown-it-py and sanitized through nh3 with a
description-scoped allowlist (no iframes / no raw HTML / no
javascript: links). Powers the "What it does" panel.
* use_cases[] — {title, description, prompt} entries that render as a
3-column "When to use it" card grid; each card shows the literal
prompt as a code chip so users can copy-paste into Claude Code.
* sample_interaction — {user, assistant} dialog rendered in a Claude
Code-style dark Catppuccin Mocha transcript panel: monospace user
row with a green ">" prompt indicator + sans-serif assistant body
with markdown formatting (peach bold, yellow italic, pink inline
code, mantle-dark fenced code blocks).
All five fields are optional; UI sections only render when populated,
so plugins without enrichment look identical to before. Fields are
read on-demand from the working tree (cached by mtime per marketplace
slug) so curator edits land at the next request without waiting for
a sync cycle — same pattern as the existing inner-skill/agent
enrichment path. No DB schema bump.
Skill / agent rich-content rendering is deferred to a later phase
(needs a source-of-truth decision: extend plugin.yml? LLM-generate
from SKILL.md / agent.md?). The schema accepts the same fields at
skill/agent level today for forward compatibility but the UI ignores
them for now.
Also: stripped a stale `background-color: var(--bg)` from the global
`code` rule in style.css (was making inline code visually disappear
on the page background).
* Skill / agent detail: render rich content from marketplace-metadata.json
Brings the skill/agent detail pages to parity with the plugin detail
page. Same rich-content schema (display_name, tagline, description as
markdown, use_cases[], sample_interaction) plus two per-item additions:
* invocation — curator-provided literal command string. When set,
overrides the computed "<manifest_name>:<inner_name>" chip and
cleanly supports both "/" skill prefix and "@" agent prefix (the
hardcoded "/" in the chip markup is hidden when the curator provides
the invocation, so /grpn-eng:query <q> and @grpn-eng:cto-architect
both render correctly).
* when_to_use — markdown disambiguation block ("Use this for X. For
similar Y, see /other-skill") rendered into a new "When to use this"
panel below the Example section.
Skill / agent category is now per-item overridable in
marketplace-metadata.json. When absent, the API keeps the parent
plugin's category as the badge so existing items don't lose their
category until curators opt in to per-item categorization.
The new "Example" Q&A panel uses the same Claude Code-style dark
Catppuccin Mocha transcript treatment as the plugin detail —
monospace user row with a green ">" prompt indicator + sans-serif
assistant body with markdown formatting.
All new fields are optional and read on-demand from the working tree.
Skills / agents whose marketplace-metadata.json doesn't carry rich
content render exactly the same way they did before (frontmatter
description + computed slash command + cover from existing v32
enrichment). No DB schema bump.
* Fix TypeError in skill / agent detail when curator sets per-item category
`curated_skill_detail` and `curated_agent_detail` were passing both
`**parent` (from `_curated_inner_parent_fields`, which returns the
parent plugin's category as a fallback) and `**enrichment` (from
`_curated_inner_enrichment`, which returns the per-item category
override when the curator set one) into `InnerDetailResponse(...)`.
Python function-call kwargs unpacking with overlapping keys raises
`TypeError: got multiple values for keyword argument 'category'`
— it doesn't merge like a literal dict does. The bug only surfaced
when the marketplace-metadata.json carried a `category` field at
skill / agent level (curator opting into per-item categorization);
items without that override hit the endpoint cleanly because only
parent provided the key.
Fix: build `merged = {**parent, **enrichment}` first (literal-dict
syntax DOES merge, with the right-hand-side winning) and unpack the
merged dict. Curator override still wins via the merge order, and
the same pattern is future-proof for any other field that lands in
both layers later.
Plus a regression test in test_marketplace_metadata.py asserting
that the inner-resolver carries `category` for downstream merging.
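
The difference between the two unpacking forms can be reproduced in isolation. The sketch below uses a hypothetical `InnerDetail` dataclass as a stand-in for `InnerDetailResponse`, and made-up category values; only the call-versus-literal behaviour is the point.

```python
from dataclasses import dataclass


@dataclass
class InnerDetail:  # hypothetical stand-in for InnerDetailResponse
    name: str
    category: str


parent = {"category": "engineering"}  # parent plugin's fallback category
enrichment = {"category": "data"}     # curator's per-item override


def build_unmerged() -> InnerDetail:
    # Overlapping keys in call-site unpacking raise TypeError.
    return InnerDetail(name="query", **parent, **enrichment)


def build_merged() -> InnerDetail:
    # A dict literal merges; the right-hand side wins on duplicate keys.
    merged = {**parent, **enrichment}
    return InnerDetail(name="query", **merged)
```

Calling `build_unmerged()` raises the TypeError described above, while `build_merged()` returns an object whose `category` is the curator override.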
* Marketplace detail: tolerate partial curator JSON
Server constructed UseCase / SampleInteraction via raw dict indexing
(uc["title"], sample["assistant"]), so a curator commit missing any
required Pydantic field crashed the whole plugin / skill / agent detail
endpoint with a 500. Route both constructions through _safe_use_case /
_safe_sample_interaction helpers — partial input silently drops the
malformed card / section instead of breaking the page.
Regression test in test_marketplace_api.py covers the three shapes:
use_case missing a key, use_case with an empty string, and
sample_interaction with only user (no assistant). Sibling rich fields
still render.
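
A tolerant constructor of this kind might look like the sketch below. It is an assumption-laden illustration: the real helpers build Pydantic models, whereas this version uses a plain dataclass to stay dependency-free, and `safe_use_case` / `safe_use_cases` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class UseCase:  # simplified stand-in for the Pydantic model
    title: str
    description: str
    prompt: str


def safe_use_case(raw: Any) -> Optional[UseCase]:
    """Return a UseCase, or None when any required field is missing or empty."""
    if not isinstance(raw, dict):
        return None
    values = []
    for key in ("title", "description", "prompt"):
        value = raw.get(key)
        if not isinstance(value, str) or not value.strip():
            return None  # drop the malformed card instead of raising
        values.append(value)
    return UseCase(*values)


def safe_use_cases(raw_list: Any) -> List[UseCase]:
    """Keep only the well-formed entries; never raise on curator input."""
    if not isinstance(raw_list, list):
        return []
    parsed = (safe_use_case(raw) for raw in raw_list)
    return [uc for uc in parsed if uc is not None]
```

One malformed entry then costs a single card rather than the whole detail page.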
* Address PR-251 review (must-fixes + S2/S3 polish) + release-cut 0.50.0
Five must-fixes from the review pass (3 from @cvrysanek's two-stage
review, 2 from my independent pass), plus the 0.50.0 release-cut as the
last commit on this PR, per the CLAUDE.md "Release-cut belongs to the
PR" rule added in v0.49.1.
Must-fixes
----------
1. Cache eviction: bounded LRU instead of per-marketplace predicate.
The previous predicate (`k[0] == marketplace_id and k[1] != mtime_ns`)
only swept stale entries for the CURRENT marketplace; with N>100
distinct marketplaces each holding one mtime key, the cap silently
failed and memory grew linearly. Replaced with OrderedDict-backed
bounded LRU at cap=256, drop oldest insert on overflow.
Cache stress test pinned in test_marketplace_metadata.py.
2. Render CPU cap: per-field byte cap on description / when_to_use /
sample_interaction.assistant via MARKETPLACE_METADATA_FIELD_MAX_BYTES
(= 64 KiB). Without this, a 1 MiB curator markdown body × QPS =
curator-controlled CPU burn through pure-Python markdown-it-py.
Truncation respects UTF-8 boundaries and logs a warning so the
curator sees the cap fire on the next sync. Test for cap +
UTF-8-boundary preservation.
3. Inner-detail bypassed the metadata cache. _curated_inner_enrichment,
_curated_inner_cover, and curated_detail all called
read_marketplace_metadata directly, defeating the mtime cache the
plugin listing already shared. Routed all three through
_read_metadata_cached so skill/agent detail hits are O(1) re-parses
per marketplace per mtime instead of O(QPS).
4. Truthy-vs-presence trap in plugin/inner enrichment merge. API-layer
writers used `if resolved.get(k):` which silently dropped any
future falsy-but-valid resolver field (bool featured=False, int
priority=0, str category=''). Switched to presence check
(`if k in resolved`) so the resolver is the authority on field
presence; `{**parent, **enrichment}` merge respects whatever the
resolver decided to ship.
5. Vendor-agnostic OSS cleanup. Removed operator-specific token
references (/grpn-eng:, @grpn-eng:, .foundryai/) from
src/marketplace_metadata.py docstring, app/web/templates/
marketplace_item_detail.html JS comment, docs/curated-marketplace-
format.md, and tests/test_marketplace_metadata.py fixtures. Replaced
with generic /my-plugin:tool / @my-agent:role / .example/ placeholders.
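
The bounded cache from must-fix 1 can be sketched with `OrderedDict`. This is a minimal illustration under stated assumptions: `BoundedMetadataCache` and `get_or_load` are hypothetical names, and the sketch refreshes recency on hits (classic LRU), which may differ from the shipped eviction order.

```python
from collections import OrderedDict
from typing import Any, Callable, Tuple

CacheKey = Tuple[str, int]  # (marketplace_id, mtime_ns)


class BoundedMetadataCache:
    """Size-capped cache keyed by (marketplace_id, mtime_ns)."""

    def __init__(self, cap: int = 256) -> None:
        self._cap = cap
        self._data: "OrderedDict[CacheKey, Any]" = OrderedDict()

    def get_or_load(
        self, marketplace_id: str, mtime_ns: int, loader: Callable[[], Any]
    ) -> Any:
        key = (marketplace_id, mtime_ns)
        if key in self._data:
            self._data.move_to_end(key)  # mark as most recently used
            return self._data[key]
        value = loader()
        self._data[key] = value
        # Evict from the oldest end until the cap holds, regardless of which
        # marketplace the stale entries belong to.
        while len(self._data) > self._cap:
            self._data.popitem(last=False)
        return value

    def __len__(self) -> int:
        return len(self._data)
```

Because eviction no longer depends on the current marketplace id, the cap holds however many distinct marketplaces pass through.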
CHANGELOG
---------
- New "### Fixed (PR #251 follow-ups)" section documenting all 4
code-side must-fixes
- New "### Internal" section noting the vendor cleanup + new tests
- BREAKING bullet for the file rename now covers operator-side
migration: running instances see plugin enrichment disappear from
the UI until upstream curator renames + nightly sync overwrites the
working tree; POST /api/marketplaces/{id}/sync forces refresh sooner
- Stripped /grpn-eng: leaks from the existing skill/agent rich-content
bullet
Tests
-----
128 targeted tests pass (test_marketplace_metadata, test_marketplace_api,
test_marketplace, test_markdown_render, test_marketplace_synth_strip,
test_marketplace_filter). New tests added:
- 6 XSS regression tests on render_safe (javascript:/data:/vbscript:
schemes via autolink, reference link, and mixed-case + positive
http/https/mailto + noopener noreferrer rel)
- 3 byte-cap tests (truncation + UTF-8 boundary + under-cap pass-through)
- 1 cache eviction stress test (>256 marketplaces -> bounded at cap)
- 1 truthy-vs-presence resolver-contract test
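
The UTF-8-boundary behaviour exercised by the byte-cap tests can be illustrated as below. This is a sketch, not the shipped helper: `truncate_utf8` is a hypothetical name, and dropping the dangling partial character via `errors="ignore"` is one possible way to respect the boundary.

```python
import logging

logger = logging.getLogger(__name__)

FIELD_MAX_BYTES = 64 * 1024  # 64 KiB, matching the cap described above


def truncate_utf8(text: str, max_bytes: int = FIELD_MAX_BYTES) -> str:
    """Cap ``text`` at ``max_bytes`` of UTF-8 without splitting a character."""
    raw = text.encode("utf-8")
    if len(raw) <= max_bytes:
        return text  # under the cap: pass through untouched
    logger.warning("field truncated from %d to %d bytes", len(raw), max_bytes)
    # Slicing bytes can cut a multi-byte sequence in half; decoding with
    # errors="ignore" discards the incomplete tail instead of raising.
    return raw[:max_bytes].decode("utf-8", errors="ignore")
```

The result is always valid UTF-8 and never longer than the cap, though it may be a byte or two shorter when the cut lands inside a character.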
Release-cut
-----------
- pyproject.toml 0.49.1 -> 0.50.0 (minor; BREAKING file rename per
pre-1.0 CHANGELOG note: "breaking changes called out under Changed
or Removed with the BREAKING marker")
- CHANGELOG [Unreleased] -> [0.50.0] - 2026-05-12, new empty
[Unreleased] on top.
---------
Co-authored-by: Minas Arustamyan <arustamyan.minas@gmail.com>
Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
750 lines
30 KiB
Python
"""External-asset mirror cache for curated marketplaces.

The curator's ``.claude-plugin/marketplace-metadata.json`` may reference cover
photos and doc files by external HTTP(S) URL. Linkrot would then mean the
Agnes web UI starts showing broken images / dead links the moment the
upstream CDN serves a 404. This module mirrors those URLs to disk at sync
time and serves the local copy thereafter.

**On-disk layout** (per marketplace slug)::

    ${DATA_DIR}/marketplace-cache/<slug>/
    ├── manifest.json          # url → cache entry
    └── <plugin>/
        ├── cover.<ext>
        └── docs/<sha8>-<filename>

**Re-fetch logic per URL on every sync:**

1. URL not yet in manifest → unconditional GET, save body + record
   ETag / Last-Modified / sha256.
2. URL already mirrored → conditional GET (``If-None-Match`` /
   ``If-Modified-Since``):
   - 304 Not Modified → keep cached file, refresh ``fetched_at`` only.
   - 200 OK with same sha256 → keep file, refresh validators.
   - 200 OK with new sha256 → overwrite local file.
3. URL removed from marketplace-metadata.json → ``cleanup_unused`` removes the
   manifest entry and the local file.

**Failure modes** (b1 fallback per the design discussion):
fetch failure (timeout, 4xx/5xx, allowlist reject, oversized, SSRF block)
keeps the **last good copy** intact in the cache, sets ``status = "failed_*"``
on the manifest entry, and logs a warning. The caller surfaces "mirror failed"
in the admin UI but never breaks the served plugin detail.

**SSRF guards:** only ``http(s)://`` schemes accepted, DNS resolution rejects
private / loopback / link-local / metadata IPs, 60-second timeout, 10 MB cap,
max 4 concurrent fetches per sync.
"""

from __future__ import annotations

import concurrent.futures
import hashlib
import ipaddress
import json
import logging
import re
import shutil
import socket
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, List, Optional, Tuple
from urllib.parse import urlparse

import httpx

from src.marketplace_asset_validation import (
    DOC_EXTENSIONS,
    IMAGE_EXTENSIONS,
    accept_doc_response,
    accept_image_response,
    validate_doc_file,
    validate_image_file,
)

logger = logging.getLogger(__name__)

# Hardcoded operational caps. The plan deferred making these configurable;
# the comment in `instance.yaml` would be one line if/when an operator hits
# a real limit (today nothing in our org has cover images > 10 MB).
HTTP_TIMEOUT_SEC = 60
"""Per-request timeout for outgoing mirror fetches.

Larger PDFs from slow CDNs (e.g. Adobe support, government archives)
routinely exceed 30s on a residential connection, hence the bump from 30 → 60.
The sync runs nightly under a thread pool with bounded concurrency so
worst-case sync time grows linearly, not multiplicatively, with this
value. Operators can still cap a runaway curator by trimming
``MAX_BODY_BYTES`` (10 MB); the timeout only matters for slow tails."""
MAX_BODY_BYTES = 10 * 1024 * 1024  # 10 MB
MAX_CONCURRENT_FETCHES = 4

USER_AGENT = (
    "Agnes-Marketplace-Mirror/1.0 "
    "(+https://github.com/keboola/agnes-the-ai-analyst; agnes-mirror)"
)
"""HTTP User-Agent for outgoing mirror fetches.

Wikipedia / Wikimedia Commons strictly enforces a User-Agent policy and
returns HTTP 400 to clients with generic strings (see
https://meta.wikimedia.org/wiki/User-Agent_policy). The format below
includes a contact URL + descriptor which satisfies their parser. Other
strict CDNs (e.g. arXiv, some news sites) similarly require a non-trivial
UA; using the same string everywhere keeps debugging simple."""

MANIFEST_FILENAME = "manifest.json"


@dataclass
class MirrorEntry:
    """One row in ``manifest.json``, keyed by ``(plugin_name, url)``."""
    url: str
    kind: str  # "cover" | "doc"
    plugin_name: str
    local: str  # relative path inside the marketplace cache dir
    etag: str = ""
    last_modified: str = ""
    sha256: str = ""
    fetched_at: str = ""  # ISO timestamp of last successful body write
    last_checked_at: str = ""  # ISO timestamp of last fetch attempt
    status: str = "unknown"  # "ok" | "failed_recent" | "failed_first" | "rejected"
    error: str = ""

    def to_json(self) -> dict:
        return asdict(self)

    @classmethod
    def from_json(cls, d: dict) -> "MirrorEntry":
        return cls(
            url=d.get("url", ""),
            kind=d.get("kind", ""),
            plugin_name=d.get("plugin_name", ""),
            local=d.get("local", ""),
            etag=d.get("etag", ""),
            last_modified=d.get("last_modified", ""),
            sha256=d.get("sha256", ""),
            fetched_at=d.get("fetched_at", ""),
            last_checked_at=d.get("last_checked_at", ""),
            status=d.get("status", "unknown"),
            error=d.get("error", ""),
        )


@dataclass
class MirrorReport:
    """Per-sync summary returned to the caller."""
    requested: int = 0
    fetched: int = 0
    not_modified: int = 0
    failed: int = 0
    rejected: int = 0
    removed: int = 0
    entries: Dict[Tuple[str, str], MirrorEntry] = field(default_factory=dict)


# ---------------------------------------------------------------------------
# SSRF / safety helpers
# ---------------------------------------------------------------------------


def _resolve_safe(url: str) -> Tuple[bool, str, str]:
    """Reject URLs we shouldn't follow and return the IP the caller MUST connect to.

    Returns ``(ok, reason, pinned_ip)``. On rejection ``pinned_ip`` is empty.

    Why the pinned IP matters: the HTTP client would otherwise re-resolve the
    hostname at connection time, and an attacker-controlled DNS server can
    return a public IP for the validation lookup and ``127.0.0.1`` /
    ``169.254.169.254`` for the connection lookup (DNS rebinding). Resolving
    once here and connecting to that exact IP defeats the rebind. ALL
    addresses returned by ``getaddrinfo`` are validated: round-robin DNS
    that mixes public + private IPs is treated as unsafe regardless of which
    one we'd have picked first.
    """
    try:
        parts = urlparse(url)
    except ValueError as e:
        return False, f"bad_url: {e}", ""
    if parts.scheme not in ("http", "https"):
        return False, f"unsupported_scheme: {parts.scheme}", ""
    host = parts.hostname or ""
    if not host:
        return False, "missing_host", ""
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror as e:
        return False, f"dns_failure: {e}", ""

    chosen_ip = ""
    for info in infos:
        sockaddr = info[4]
        ip_str = sockaddr[0]
        try:
            addr = ipaddress.ip_address(ip_str)
        except ValueError:
            return False, f"unparseable_address: {ip_str}", ""
        if (
            addr.is_private
            or addr.is_loopback
            or addr.is_link_local
            or addr.is_multicast
            or addr.is_reserved
            or addr.is_unspecified
        ):
            return False, f"address_in_blocked_range: {ip_str}", ""
        # AWS / GCP / Azure metadata endpoints fall under is_link_local
        # (169.254.169.254) above; an explicit additional check for IPv6
        # ULA + the broad metadata-style catchall would be belt-and-
        # suspenders only.
        # Prefer the first IPv4 result for connection pinning (broader CDN
        # compatibility); fall back to the first record otherwise.
        if not chosen_ip and info[0] == socket.AF_INET:
            chosen_ip = ip_str
    if not chosen_ip and infos:
        chosen_ip = infos[0][4][0]
    if not chosen_ip:
        return False, "no_address", ""
    return True, "", chosen_ip


def _is_safe_url(url: str) -> Tuple[bool, str]:
    """Backwards-compatible 2-tuple wrapper over :func:`_resolve_safe`.

    Existing tests (and any external callers that only care about the
    accept/reject decision) keep working unchanged. The pinned IP returned
    by ``_resolve_safe`` is consumed internally by the connection-pinning
    handlers below.
    """
    ok, reason, _ = _resolve_safe(url)
    return ok, reason


# ---------------------------------------------------------------------------
# SSRF-aware httpx transport + shared client
#
# Two threats against the simple "validate URL, then GET" pattern:
#   1. Redirect bypass: without revalidation, an attacker 302s to
#      http://169.254.169.254/... and we mirror cloud metadata.
#   2. DNS rebinding: without IP pinning, the connect-time DNS lookup
#      can return a different IP than the validation lookup.
#
# httpx makes both defences collapse into a single custom Transport:
# httpx invokes ``handle_request()`` on EVERY outgoing request (including
# every redirect hop), so re-running SSRF validation in the transport
# closes the redirect bypass for free. Within ``handle_request`` we also
# rewrite the URL host to the IP we just validated and stash the original
# hostname in the ``Host`` header + the ``sni_hostname`` extension so TLS
# SNI / cert verification still bind to the curator-supplied hostname.
# ---------------------------------------------------------------------------


class _SSRFRejected(Exception):
    """Raised inside ``_SSRFGuardTransport`` when the SSRF allowlist rejects
    the (initial or redirected) URL.

    Distinct from ``httpx.RequestError`` so ``_fetch_url`` maps this to
    ``status='rejected'`` (terminal: security decision, never retry).
    """

    def __init__(self, reason: str) -> None:
        self.reason = reason
        super().__init__(reason)


class _SSRFGuardTransport(httpx.HTTPTransport):
    """Transport that re-validates SSRF rules on every outgoing request and
    pins the connection to the IP we just resolved.

    Redirect re-validation comes for free because httpx invokes
    ``handle_request()`` once per redirect hop (when the client is
    configured with ``follow_redirects=True``). DNS-rebinding defence
    comes from rewriting the URL host to the validated IP, so httpcore
    no longer re-resolves the hostname at connect time.
    """

    def handle_request(self, request: httpx.Request) -> httpx.Response:
        ok, reason, ip = _resolve_safe(str(request.url))
        if not ok:
            raise _SSRFRejected(reason)
        original_host = request.url.host
        # Rewrite the URL host to the validated IP. httpcore opens the
        # connection to whatever ``request.url.host`` says, so this is what
        # actually pins the connection.
        request.url = request.url.copy_with(host=ip)
        # Preserve the original hostname for vhost routing + TLS SNI / cert
        # verification. ``sni_hostname`` is a documented httpx extension
        # honored by the TLS layer in 0.24+.
        request.headers["Host"] = original_host
        request.extensions = {
            **request.extensions,
            "sni_hostname": original_host,
        }
        return super().handle_request(request)


_CLIENT: Optional[httpx.Client] = None


def _get_client() -> httpx.Client:
    """Lazy module-level ``httpx.Client`` shared across the fetch pool.

    Same lifecycle pattern as ``cli/client.py``'s ``_get_shared_client``:
    build once on first use, reuse for the process lifetime. ``httpx.Client``
    is thread-safe for concurrent ``send()`` / ``stream()`` calls so a
    ``ThreadPoolExecutor`` can hammer it without external locking.
    """
    global _CLIENT
    if _CLIENT is None:
        _CLIENT = httpx.Client(
            transport=_SSRFGuardTransport(),
            timeout=HTTP_TIMEOUT_SEC,
            follow_redirects=True,
            # Tightened from the httpx default of 20. Legitimate CDN chains
            # (S3 → presigned, DOI → publisher) routinely use 3–4 hops;
            # 5 leaves headroom without giving attackers many hops to scan.
            max_redirects=5,
            headers={"User-Agent": USER_AGENT},
        )
    return _CLIENT


def _safe_filename(url: str, default_ext: str) -> str:
    """Derive a stable, FS-safe filename from a URL.

    Format: ``<sha8(url)>-<basename>``. The hash prefix means two URLs with
    the same trailing filename don't collide; the human-readable basename
    helps when an operator browses the cache dir directly.
    """
    parts = urlparse(url)
    base = Path(parts.path).name or "download"
    base = re.sub(r"[^a-zA-Z0-9._-]", "_", base)[:64]
    if not base or base.startswith("."):
        base = f"download{default_ext}"
    sha8 = hashlib.sha256(url.encode("utf-8")).hexdigest()[:8]
    return f"{sha8}-{base}"


# ---------------------------------------------------------------------------
# Manifest persistence
# ---------------------------------------------------------------------------


def _load_manifest(cache_dir: Path) -> Dict[Tuple[str, str], MirrorEntry]:
    """Read the on-disk manifest into an in-memory ``(plugin_name, url) → entry`` map.

    The composite key is what makes the manifest RBAC-safe: two plugins in
    the same marketplace can reference the same external URL (shared CDN
    icon, common cover image) and each gets its own entry pointing under
    its own plugin subdir, so an analyst with grant on plugin B never
    receives a URL pointing under plugin A's tree.

    On-disk format is a list of self-describing entries (each carries
    ``plugin_name`` + ``url`` fields), not a JSON dict: JSON keys can't
    be tuples and concatenating ``"plugin::url"`` would just shift the
    parsing burden.
    """
    path = cache_dir / MANIFEST_FILENAME
    if not path.is_file():
        return {}
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError) as e:
        logger.warning("mirror manifest %s unreadable, starting fresh: %s", path, e)
        return {}
    entries = data.get("entries") if isinstance(data, dict) else None
    if not isinstance(entries, list):
        return {}
    out: Dict[Tuple[str, str], MirrorEntry] = {}
    for raw in entries:
        if not isinstance(raw, dict):
            continue
        entry = MirrorEntry.from_json(raw)
        if not entry.url or not entry.plugin_name:
            continue
        out[(entry.plugin_name, entry.url)] = entry
    return out


def _write_manifest(
    cache_dir: Path,
    entries: Dict[Tuple[str, str], MirrorEntry],
) -> None:
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / MANIFEST_FILENAME
    body = {
        "version": 2,
        "entries": [e.to_json() for e in entries.values()],
    }
    tmp = path.with_suffix(".json.tmp")
    tmp.write_text(json.dumps(body, indent=2), encoding="utf-8")
    tmp.replace(path)


# ---------------------------------------------------------------------------
# HTTP fetch
# ---------------------------------------------------------------------------


@dataclass
class FetchOutcome:
    status: str  # "ok" | "not_modified" | "failed" | "rejected"
    body: bytes = b""
    content_type: str = ""
    etag: str = ""
    last_modified: str = ""
    error: str = ""


def _fetch_url(
    url: str,
    *,
    prior: Optional[MirrorEntry],
    expect_kind: str,
) -> FetchOutcome:
    """Single HTTP GET (with conditional headers when ``prior`` provides them).

    SSRF + size + allowlist enforcement happen here. Any rejection produces
    ``status="rejected"`` (terminal; caller doesn't retry); any transient
    network error produces ``status="failed"`` (caller may surface and try
    again next sync).

    Pre-flight ``_resolve_safe`` here gives us a fast, type-safe rejection
    *before* httpx is invoked. The transport will revalidate again (and
    perform the IP pin), but bailing out early avoids the cost of building
    a request object for an obviously bad URL.
    """
    safe, reason, _ip = _resolve_safe(url)
    if not safe:
        return FetchOutcome(status="rejected", error=reason)

    headers: Dict[str, str] = {}
    if prior:
        if prior.etag:
            headers["If-None-Match"] = prior.etag
        if prior.last_modified:
            headers["If-Modified-Since"] = prior.last_modified

    client = _get_client()
    try:
        with client.stream("GET", url, headers=headers) as resp:
            status_code = resp.status_code
            if status_code == 304:
                return FetchOutcome(
                    status="not_modified",
                    etag=prior.etag if prior else "",
                    last_modified=prior.last_modified if prior else "",
                )
            if status_code >= 400:
                return FetchOutcome(status="failed", error=f"http_{status_code}")

            content_type = resp.headers.get("Content-Type", "") or ""
            etag = resp.headers.get("ETag", "") or ""
            last_modified = resp.headers.get("Last-Modified", "") or ""
            # Allowlist gate based on Content-Type (cheaper than reading body
            # before deciding). For docs we additionally accept generic types
            # backed by a URL-extension match.
            if expect_kind == "cover":
                check = accept_image_response(url, content_type)
            else:
                check = accept_doc_response(url, content_type)
            if not check.ok:
                return FetchOutcome(
                    status="rejected",
                    content_type=content_type,
                    error=check.reason,
                )
            # Stream with a hard cap so a misbehaving server can't OOM us.
            # Bail out as soon as the cap is exceeded; don't read the
            # rest of the body just to discard it.
            body = bytearray()
            for chunk in resp.iter_bytes(chunk_size=65536):
                body.extend(chunk)
                if len(body) > MAX_BODY_BYTES:
                    return FetchOutcome(
                        status="rejected",
                        error=f"body_exceeds_cap: > {MAX_BODY_BYTES} bytes",
                    )
            return FetchOutcome(
                status="ok",
                body=bytes(body),
                content_type=content_type,
                etag=etag,
                last_modified=last_modified,
            )
    except _SSRFRejected as e:
        return FetchOutcome(status="rejected", error=e.reason)
    except httpx.TooManyRedirects:
        return FetchOutcome(status="failed", error="too_many_redirects")
    except httpx.TimeoutException:
        return FetchOutcome(status="failed", error="timeout")
    except httpx.HTTPError as e:
        # Catches ConnectError, ReadError, RemoteProtocolError, and the
        # rest of the httpx transport-error hierarchy. Same shape as
        # ``cli/client.py:_translate_transport_error``: collapse all
        # transient failures into one ``failed`` outcome with an error tag
        # the operator can grep for.
        return FetchOutcome(status="failed", error=f"http_error: {e!r}")
    except Exception as e:  # noqa: BLE001 - defensive, never abort the sync
        logger.exception("mirror fetch crashed for %s", url)
        return FetchOutcome(status="failed", error=f"crash: {e!r}")


# ---------------------------------------------------------------------------
|
||
# Body-side validation + write
|
||
# ---------------------------------------------------------------------------
|
||
|
||
|
||
def _validate_body(filename: str, body: bytes, kind: str):
|
||
if kind == "cover":
|
||
return validate_image_file(filename, body)
|
||
return validate_doc_file(filename, body)
|
||
|
||
|
||
def _local_relpath(plugin_name: str, kind: str, fname: str) -> str:
|
||
if kind == "cover":
|
||
return f"{plugin_name}/{fname}"
|
||
return f"{plugin_name}/docs/{fname}"
|
||
|
||
|
||
def _write_body(cache_dir: Path, relpath: str, body: bytes) -> None:
|
||
"""Write ``body`` to ``cache_dir/relpath`` atomically (tmp + rename)."""
|
||
full = cache_dir / relpath
|
||
full.parent.mkdir(parents=True, exist_ok=True)
|
||
tmp = full.with_suffix(full.suffix + ".tmp")
|
||
tmp.write_bytes(body)
|
||
tmp.replace(full)
|
||
|
||
|
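The tmp-plus-rename pattern used by `_write_body` can be sketched in isolation. The version below is only an illustration, not the code in this change: `atomic_write_bytes` is a hypothetical name, and the `fsync` call and the cleanup on error are extras that `_write_body` does not perform.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(target: Path, data: bytes) -> None:
    """Write data to target via a sibling temp file plus a rename (hypothetical helper)."""
    target.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file next to the target so the rename never crosses
    # filesystems; a cross-filesystem rename would not be atomic.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
            fh.flush()
            os.fsync(fh.fileno())  # push the bytes to disk before the rename
        # Readers see either the old file or the new one, never a partial write.
        os.replace(tmp_name, target)
    except BaseException:
        # Best-effort removal of the temp file if the write or rename failed.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

Using `mkstemp` instead of a fixed `.tmp` suffix avoids two concurrent writers clobbering each other's temp file, at the cost of a slightly longer helper.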
# ---------------------------------------------------------------------------
# Public API: one entry point per plugin
# ---------------------------------------------------------------------------


def sync_assets(
    *,
    cache_dir: Path,
    requests: List[Tuple[str, str, str]],
) -> MirrorReport:
    """Mirror every URL in ``requests`` into ``cache_dir``.

    ``requests`` is a list of ``(plugin_name, kind, url)`` tuples produced by
    :func:`src.marketplace_metadata.collect_external_urls`. Returns a
    :class:`MirrorReport` summarising the outcome plus the resulting manifest
    so the caller can build a ``url → served_path`` lookup for the DB write.

    Exceptions inside :func:`_fetch_url` and :func:`_write_body` are caught
    by the surrounding ``except`` so one bad URL never aborts the rest of the
    sync. URLs absent from ``requests`` but present in the existing manifest
    are removed from disk + manifest (the curator dropped them upstream).
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    manifest = _load_manifest(cache_dir)
    report = MirrorReport(requested=len(requests))
    requested_keys = {(plugin_name, url) for plugin_name, _, url in requests}

    # Phase 1: dedup fetches by URL. Two plugins referencing the same
    # external image share one HTTP fetch (saves bandwidth, avoids the
    # rate-limit pressure on slow CDNs the previous version would have
    # caused). We pick any owning plugin's prior MirrorEntry as the source
    # of conditional-GET headers: if it has an etag, all owning plugins
    # benefit from the 304; if their etags diverge (rare), worst case is
    # one full re-download instead of an optimal mix.
    fetch_inputs: Dict[str, Tuple[str, Optional[MirrorEntry]]] = {}
    for plugin_name, kind, url in requests:
        if url in fetch_inputs:
            continue
        fetch_inputs[url] = (kind, manifest.get((plugin_name, url)))

    def _do_one(item: Tuple[str, Tuple[str, Optional[MirrorEntry]]]) -> Tuple[str, FetchOutcome]:
        url, (kind, prior) = item
        return url, _fetch_url(url, prior=prior, expect_kind=kind)

    outcome_by_url: Dict[str, FetchOutcome] = {}
    with concurrent.futures.ThreadPoolExecutor(
        max_workers=MAX_CONCURRENT_FETCHES
    ) as pool:
        for url, outcome in pool.map(_do_one, list(fetch_inputs.items())):
            outcome_by_url[url] = outcome

    # Phase 2: process outcomes per (plugin, url) pair so each owner gets
    # its own manifest entry pointing under its own plugin subdir.
    now_iso = datetime.now(timezone.utc).isoformat()
    for plugin_name, kind, url in requests:
        outcome = outcome_by_url[url]
        key = (plugin_name, url)
        prior = manifest.get(key)
        if outcome.status == "not_modified" and prior:
            prior.last_checked_at = now_iso
            prior.error = ""
            prior.status = "ok"  # 304 means the cached file is still valid
            manifest[key] = prior
            report.not_modified += 1
            continue
        if outcome.status == "rejected":
            entry = prior or MirrorEntry(url=url, kind=kind, plugin_name=plugin_name, local="")
            entry.status = "rejected"
            entry.last_checked_at = now_iso
            entry.error = outcome.error
            manifest[key] = entry
            report.rejected += 1
            logger.warning(
                "mirror rejected plugin=%s url=%s kind=%s reason=%s",
                plugin_name, url, kind, outcome.error,
            )
            continue
        if outcome.status == "failed":
            entry = prior or MirrorEntry(url=url, kind=kind, plugin_name=plugin_name, local="")
            # Distinguish first-time failures ("failed_first") from failures
            # where we already hold a previously mirrored copy ("failed_recent").
            entry.status = "failed_recent" if prior and prior.local else "failed_first"
            entry.last_checked_at = now_iso
            entry.error = outcome.error
            manifest[key] = entry
            report.failed += 1
            logger.warning(
                "mirror fetch failed plugin=%s url=%s kind=%s reason=%s (keep_prior=%s)",
                plugin_name, url, kind, outcome.error, bool(prior and prior.local),
            )
            continue
        # outcome.status == "ok": body present.
        # Pick filename: extension comes from URL preferentially, else from
        # Content-Type. Fall back to a kind-default when neither is helpful.
        default_ext = ".bin"
        if kind == "cover":
            for e in IMAGE_EXTENSIONS:
                if url.lower().endswith(e):
                    default_ext = e
                    break
            else:
                ct = outcome.content_type.split(";", 1)[0].strip().lower()
                default_ext = {
                    "image/png": ".png",
                    "image/jpeg": ".jpg",
                    "image/webp": ".webp",
                }.get(ct, ".png")
        else:
            for e in DOC_EXTENSIONS:
                if url.lower().endswith(e):
                    default_ext = e
                    break
            else:
                ct = outcome.content_type.split(";", 1)[0].strip().lower()
                default_ext = {
                    "application/pdf": ".pdf",
                    "text/markdown": ".md",
                    "text/x-markdown": ".md",
                    "text/plain": ".txt",
                }.get(ct, ".txt")
        fname = _safe_filename(url, default_ext)
        # Ensure the chosen filename has the resolved extension so body
        # validation (which looks at the suffix) accepts it.
        if not fname.lower().endswith(default_ext):
            fname = fname + default_ext
        validation = _validate_body(fname, outcome.body, kind)
        if not validation.ok:
            entry = prior or MirrorEntry(url=url, kind=kind, plugin_name=plugin_name, local="")
            entry.status = "rejected"
            entry.last_checked_at = now_iso
            entry.error = f"body_validation: {validation.reason}"
            manifest[key] = entry
            report.rejected += 1
            logger.warning(
                "mirror body rejected plugin=%s url=%s reason=%s",
                plugin_name, url, validation.reason,
            )
            continue
        new_sha = hashlib.sha256(outcome.body).hexdigest()
        relpath = _local_relpath(plugin_name, kind, fname)
        # Only rewrite the body when the hash actually changed: saves disk
        # IO + lets future-us correlate "fetched_at" with content changes.
        if not prior or prior.sha256 != new_sha or not (cache_dir / prior.local).is_file():
            try:
                _write_body(cache_dir, relpath, outcome.body)
            except OSError as e:
                logger.warning("mirror write failed url=%s: %s", url, e)
                report.failed += 1
                continue
            entry = MirrorEntry(
                url=url,
                kind=kind,
                plugin_name=plugin_name,
                local=relpath,
                etag=outcome.etag,
                last_modified=outcome.last_modified,
                sha256=new_sha,
                fetched_at=now_iso,
                last_checked_at=now_iso,
                status="ok",
            )
            manifest[key] = entry
            # Persist the manifest BEFORE unlinking the old body. A kill -9
            # between body-write and the end-of-batch persist would otherwise
            # leave on-disk files the next sync's manifest never references;
            # disk bloats over time as URLs come and go from the curator's
            # marketplace-metadata.json. Per-iteration persist narrows the crash
            # window from "all of Phase 2" to "between persist and unlink"
            # (microseconds). Cost: ~one tmp+rename per body write; manifest
            # is a few KB so the overhead is negligible vs. the HTTP fetches.
            try:
                _write_manifest(cache_dir, manifest)
            except OSError as e:
                # Body is on disk but the manifest didn't commit. Don't
                # unlink the old body: the on-disk manifest still
                # references it, and serving a stale-but-existing file
                # beats serving a 404.
                logger.warning(
                    "mirror manifest persist failed mid-batch url=%s: %s", url, e,
                )
                report.failed += 1
                continue
            # If the previous local file lived at a different path, drop it.
            if prior and prior.local and prior.local != relpath:
                try:
                    (cache_dir / prior.local).unlink(missing_ok=True)
                except OSError:
                    pass
        else:
            prior.etag = outcome.etag or prior.etag
            prior.last_modified = outcome.last_modified or prior.last_modified
            prior.last_checked_at = now_iso
            prior.status = "ok"
            prior.error = ""
            manifest[key] = prior
        report.fetched += 1

    # Phase 3: drop manifest entries the curator removed upstream, plus
    # their on-disk bodies. Same persist-before-unlink discipline as
    # Phase 2: collect the relpaths to delete, persist the manifest with
    # the entries already gone, *then* unlink. A crash between the persist
    # and the unlinks can leave a file on disk that the manifest no longer
    # names. The next sync re-reads the (now-correct) manifest; the orphan
    # stays on disk but is never served, so the served state stays
    # consistent.
    removed_paths: List[str] = []
    for key in list(manifest.keys()):
        if key in requested_keys:
            continue
        entry = manifest.pop(key)
        if entry.local:
            removed_paths.append(entry.local)
        report.removed += 1

    _write_manifest(cache_dir, manifest)

    for relpath in removed_paths:
        try:
            (cache_dir / relpath).unlink(missing_ok=True)
        except OSError:
            pass

    report.entries = manifest
    return report


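The persist-before-unlink ordering that Phase 3 describes can be illustrated with a small standalone sketch. Everything here is an assumption made for the example: `prune_removed` is a hypothetical helper, the manifest is simplified to a plain `url -> relpath` dict stored as `manifest.json`, and the real `_write_manifest` format may differ.

```python
import json
import os
from pathlib import Path
from typing import Dict, Set


def prune_removed(cache_dir: Path, manifest: Dict[str, str], keep: Set[str]) -> int:
    """Drop entries whose URL is not in keep, persist, then unlink (hypothetical helper)."""
    # Collect the paths to delete while removing the entries in memory.
    doomed = [manifest.pop(url) for url in list(manifest) if url not in keep]
    # Step 1: commit the smaller manifest first, via tmp + rename.
    tmp = cache_dir / "manifest.json.tmp"
    tmp.write_text(json.dumps(manifest, sort_keys=True))
    os.replace(tmp, cache_dir / "manifest.json")
    # Step 2: only now delete the bodies. A crash here can leave orphan
    # files, but never a manifest entry that points at a missing file.
    for relpath in doomed:
        (cache_dir / relpath).unlink(missing_ok=True)
    return len(doomed)
```

Reversing the two steps would make the failure mode worse: a crash after the unlink but before the persist leaves the manifest naming a file that no longer exists.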
def delete_cache_dir(cache_dir: Path) -> bool:
    """Remove the entire mirror cache for one marketplace.

    Returns True if ``cache_dir`` existed when called. Removal is
    best-effort (``ignore_errors=True``), so True does not guarantee that
    every file was deleted.
    """
    if cache_dir.exists():
        shutil.rmtree(cache_dir, ignore_errors=True)
        return True
    return False