agnes-the-ai-analyst/tests/test_marketplace_asset_mirror.py
minasarustamyan 6fe67d5279
Curated marketplace enrichment via agnes-metadata.json + curator metadata (#234)
* Curated marketplace enrichment via agnes-metadata.json + curator metadata

Adds a second well-defined metadata file `.claude-plugin/agnes-metadata.json`
that upstream marketplace repos can opt into, providing per-plugin (and
per-skill / per-agent) cover photo, demo video URL, doc links, and
category override. The Claude Code marketplace contract is untouched —
agnes-metadata.json + the convention `.agnes/` directory are stripped
from the synthetic Claude Code marketplace served via /marketplace.zip
and /marketplace.git/*, so user instances see a clean Claude Code repo
with no Agnes-only metadata.

Highlights:
- DB schema v32 — adds curator_name + curator_email on marketplace_registry,
  cover_photo_url + video_url + doc_links on marketplace_plugins.
- Mandatory curator at marketplace registration, editable later through
  the admin UI; surfaces on cards + detail pages in place of owner_todo.
- External-asset mirror cache at ${DATA_DIR}/marketplace-cache/<slug>/
  with conditional GET, 60s timeout, 10 MB body cap, SSRF guards, and
  Wikipedia-policy-compliant User-Agent.
- Strict drop semantics — anything Agnes can't deliver as a real PDF /
  Markdown / plain text doc, or a real PNG / JPEG / WebP cover, is
  dropped from the served metadata; UI looks identical to no-entry case
  (gradient placeholder for missing covers, no row in the doc list).
- Doc allowlist + image allowlist enforced on both the curated mirror
  flow and the Flea upload flow (/store/new); shared module
  src/marketplace_assets.py.
- New /api/marketplace/curated/{mp}/{plugin}/{asset,doc,mirrored}/...
  endpoints with path-traversal guards + RBAC + Content-Disposition
  attachment for docs.
- Curator-focused format guide at /marketplace/format-guide; canonical
  source is docs/curated-marketplace-format.md, also linked from the
  admin /admin/marketplaces page next to + Add Marketplace.

See CHANGELOG.md under [Unreleased] for the full breakdown.

* Fix format-guide test assertion to match shortened disclaimer

The 'Flea Market' phrase was trimmed out of the disclaimer in
docs/curated-marketplace-format.md after the curator-focused rewrite.
Update the rendered-HTML test to assert the channel-scoping phrase
that's actually present ('Curated Marketplace channel only') rather
than the 'Flea Market' contrast that's no longer in the doc.

* Drop unused 'version' field from agnes-metadata.json schema

The parser never read it; it was a YAGNI placeholder for future
schema evolution. Curators don't need to wonder what to put there
when adding the file for the first time. Will be re-added if and
when we actually introduce a backwards-incompatible schema change.

* Harden asset mirror against SSRF via redirect + DNS rebinding

The pre-flight _is_safe_url check validated only the initial URL;
urllib.request.urlopen then followed redirects and re-resolved DNS for
the actual connection — both bypassable. Attacker-controlled origin
could 302 to http://169.254.169.254/... and exfil cloud metadata;
attacker-controlled DNS could return public IP first / 127.0.0.1 second.

Replace urlopen call with a shared OpenerDirector wired through three
custom handlers: _SafeRedirectHandler re-runs SSRF allowlist on every
redirect Location (max 5 hops, down from urllib's 10), and
_PinnedHTTPHandler / _PinnedHTTPSHandler connect to the IP that passed
validation rather than re-resolving the hostname. TLS SNI + cert verify
stay bound to the original hostname.

_resolve_safe returns the validated IP (the existing _is_safe_url
2-tuple wrapper stays for backwards compatibility) and rejects round-
robin DNS that mixes a public + private record. _UnsafeRedirectError
is a typed exception so _fetch_url can map redirect blocks to terminal
'rejected' status (not transient 'failed'). _http_open is the single
call site so tests can mock at one well-defined seam.

Tests cover redirect blocking (link-local, loopback), redirect-error
unwrapping inside URLError, pinned-IP connection target, and the
end-to-end DNS-rebinding scenario. Existing tests that mocked
urllib.request.urlopen are migrated to mock _http_open.

* Harden /asset/ endpoint against stored XSS

The endpoint served any file in the cloned marketplace repo with
stdlib-detected Content-Type, so a curator who landed evil.html (or a
renamed evil.png carrying HTML bytes) in the working tree got a
same-origin XSS — the response shares cookie scope with /admin and
/api/me/*.

The asset endpoint is image-only by contract (cover photos referenced
from agnes-metadata.json + inner skill / agent cards), so applying the
same allowlist + magic-bytes pattern that /doc/ already uses closes
the gap without breaking any legitimate use case. Three layered
checks: extension in IMAGE_EXTENSIONS (.png/.jpg/.jpeg/.webp; SVG
excluded — <script> inside SVG executes), validate_image_file magic
bytes (defeats rename-extension attack), Content-Type pinned from the
validated extension (never stdlib mimetypes).

Defense-in-depth: X-Content-Type-Options: nosniff stops browser MIME
sniffing; Content-Security-Policy: default-src 'none' blocks script /
iframe execution even if a future regression let HTML through.

Tests cover the .html extension reject, the renamed-HTML-as-PNG magic-
bytes reject, the .svg reject, and the happy-path PNG with security
headers attached. The pre-existing path-traversal test seeds a real
PNG instead of ok.txt now that the endpoint is image-only.

* Enforce mandatory curator on marketplace PATCH

The POST handler enforced curator_name + curator_email at create time,
but PATCH treated empty / missing curator inputs as 'no change'. Legacy
rows that pre-date v32 (curator_name=NULL) could be edited indefinitely
without ever filling the curator gap, and OWNER_TODO_PLACEHOLDER lingered
on every /marketplace card.

Reject the PATCH with 400 when the post-merge row would persist with
empty curator. The check fires after the existing field-merge logic, so
once-filled rows that don't touch curator still pass through (their
existing values fall through from the DB row). DB column stays nullable
so untouched legacy rows continue to coexist — the gate fires only the
moment an admin opens the edit modal.

Existing PATCH semantics preserved: empty-string input still means 'leave
existing value alone', and once-filled curator can't be cleared (those
test cases pass unchanged). New test seeds a legacy row directly via the
repository, then exercises url-only PATCH (rejected), partial-fill PATCH
(rejected), and full-fill PATCH (succeeds); a follow-up no-curator PATCH
on the now-formed row also passes.

* Drop unused curated-marketplace helpers (PR #234 review)

* build_db_payload — imported by src/marketplace.py but never called.
  The strict-drop semantics it would have implemented were re-written
  inline in _refresh_plugin_cache (see the comment block there). The
  standalone helper still carried the old fall-back-to-original-external-
  URL-on-mirror-failure behaviour, which contradicts the documented
  drop-when-can't-deliver contract — a future contributor who re-wired
  it would have introduced a silent regression. Delete with the helper
  + the import + the comment that referenced it.

* _resolve_marketplace_name — one-line shim with no remaining call
  sites. Callers use _resolve_marketplace_meta which returns name +
  curator together, avoiding the double DB hit the shim exists to
  hide.

* '# noqa: F401  Optional kept for forward-compat' was wrong — Optional
  IS used in src/marketplace.py (line 70 and line 238). Drop the noqa
  comment so a future ruff run doesn't try to remove a real import.

Removing build_db_payload also drops the only remaining use of Optional
in src/marketplace_metadata.py, so the import comes out there too.

* Cap agnes-metadata.json size + catch RecursionError on parse

The reader is invoked once per marketplace per sync and the file is
curator-controlled. Two failure modes were unguarded:

* Multi-GB JSON: path.read_text() pulled the whole file into memory
  before json.loads even ran. A curator with commit access to an
  upstream repo could OOM the sync worker.

* Deeply-nested JSON under any size cap: cpython's recursive object /
  array parser raises RecursionError at ~1000 levels of depth.
  RecursionError is a RuntimeError, not ValueError, so the existing
  catch let it propagate up and abort the entire sync — every other
  marketplace in the same pass got skipped.

Add AGNES_METADATA_MAX_BYTES = 1 MiB (a real metadata file with covers,
docs, categories for ~50 plugins fits in <100 KB so the cap is
generous) and gate the size check on path.stat().st_size before the
body read. Broaden the parse except to (ValueError, RecursionError)
with a unified log line. Both failure modes degrade to the same
empty-dict fall-back the malformed-JSON path already used, so one bad
upstream never aborts the rest of the sync.

Tests cover the size cap firing before json.loads (whitespace-padded
valid JSON exceeding the cap) and the recursion path (5000 nested
arrays — past cpython's default recursion limit but well under the
size cap).

* Persist asset-mirror manifest per body write, before unlink

sync_assets wrote each body atomically (tmp + rename) but persisted
the manifest only at the end of the batch. A kill -9 mid-Phase 2 left
on-disk files the manifest never referenced. Once a curator dropped
that URL from agnes-metadata.json, Phase 3's cleanup had no record of
the file and the orphan stayed forever — there's no GC pass walking
the cache dir today, so disk would slowly bloat.

Phase 2 (body-write iteration): after the in-memory manifest mutation,
persist BEFORE unlinking the previous body. The crash window narrows
from 'all of Phase 2' to 'between persist and unlink' (microseconds).
A persist failure mid-batch keeps the previous body on disk — the on-
disk manifest still references it, and a stale-but-existing file beats
a 404. Cost: one extra tmp+rename per body write; manifest is a few KB
so the overhead is negligible vs. the HTTP fetches.

Phase 3 (curator-removed URLs): same discipline. Collect the to-delete
relpaths, persist the manifest with the entries already gone, THEN
unlink. A crash mid-cleanup leaves at most a microsecond window where
files exist despite the manifest no longer naming them. The next sync
reads the (correct) manifest and the orphan stays orphaned, but the
served state is consistent.

Tests cover per-body persist call count, the post-update on-disk
manifest content, and Phase 3 ordering verified by reading the on-disk
manifest from inside Path.unlink.

* Consolidate marketplace video embeds + format-guide CSS

The YouTube nocookie / Vimeo / <video> / link-fallback detection logic
was duplicated verbatim in marketplace_plugin_detail.html and
marketplace_item_detail.html (~40 JS lines each, with subtly-different
inline styles). Both templates now {% include %} a single
_marketplace_video_embed.html partial inside their IIFE so the regex,
the nocookie attribute set, and the unknown-host link fallback live in
ONE place — future tweaks (new host, new attribute, fixed sandbox flag)
no longer need to be applied twice in lockstep.

The .video-wrap selectors (one inline <style> rule in plugin_detail,
one inline style='...' attribute in item_detail) are replaced by the
existing .video-embed 16:9 wrapper in style-custom.css, with new
.video-embed video / .video-embed a child rules added so the wrapper
handles all four embed shapes uniformly without per-template
positioning.

The 60-line inline <style> block in marketplace_format_guide.html
moves verbatim to style-custom.css under a new 'Marketplace format
guide page' section, scoped to .format-guide so other pages aren't
affected.

No user-visible behaviour change: the rendered HTML for valid
YouTube / Vimeo / mp4 / external links is byte-identical to before,
and the format-guide page renders the same.

* Maintainability cleanup batch (PR #234 review)

#10: drop _path_under from app/api/marketplace.py — it was a byte-
equivalent clone of _safe_join (same Path.resolve(strict=True) +
relative_to() containment check). The three v32 endpoint handlers
(/asset, /doc, /mirrored) now share the existing helper.

#14: rename src/marketplace_assets.py → src/marketplace_asset_validation.py
so the file's purpose is obvious from the name and the previous
overlap with src/marketplace_asset_mirror.py is gone. Six call-site
imports updated in lockstep; CHANGELOG references under [Unreleased]
updated to track the new path.

#11: consolidate the URL builders that resolve
/api/marketplace/curated/<slug>/<plugin>/{asset,doc,mirrored}/...
paths. _internal_asset_url / _internal_doc_url / _mirrored_asset_url
lived in src/marketplace.py, while a copy named _mirrored_url lived
in app/api/marketplace.py with a 'must stay aligned' comment. New
module src/marketplace_urls.py is the single source of truth — both
call sites import from it and a future URL-format tweak only needs
to change one file. The _ROUTE_PREFIX constant collapses the per-
function f-string repetition. The route-handler endpoints themselves
still own the path string literals (keeping the builders identical
to the route declarations remains a checklist item, not a runtime
guarantee).

* Re-key asset-mirror manifest by (plugin, url) + dedup HTTP fetches

The manifest used to be keyed by URL alone, so two plugins in the
same marketplace referencing the same external image (a shared CDN
icon, a common cover) collided on entry.plugin_name — last writer
won. The DB row for the losing plugin then stored a served URL
pointing under the winning plugin's tree, and require_resource_access
denied legitimate access on one side and let the other plugin's user
reach the wrong asset.

In-memory: Dict[Tuple[str, str], MirrorEntry] keyed (plugin_name, url).
On disk: format flips from {url: entry} dict to [entry, ...] list of
self-describing entries (each carries plugin_name + url + the
previous fields). JSON keys can't be tuples; encoding 'plugin::url'
would just shift the parsing burden.

Phase 1 of sync_assets deduplicates fetches by URL — three plugins
sharing one URL share one HTTP request. The conditional-GET prior is
picked from any owning plugin's prior entry; if their etags diverge
(rare) we miss one 304 and pay for a full re-download instead.
Phase 2 still creates a per-(plugin, url) manifest entry pointing
under the plugin's own subdir, and Phase 3 cleanup is keyed the same
way so dropping a URL from one plugin's metadata doesn't disturb
another plugin still referencing it.

Body files stay per plugin (RBAC-clean isolation: deleting plugin A's
cache can't strand plugin B). Bandwidth saved by fetch dedup.

Consumer code re-keyed: src.marketplace._refresh_plugin_cache rebuilt
served_url_for / mirror_status as composite-keyed maps;
app.api.marketplace._resolve_external_via_mirror /
_curated_inner_cover / _curated_inner_enrichment look up by
(plugin_name, url).

Tests cover per-plugin manifest entries with shared URL, the single
HTTP fetch for N plugins, and Phase 3 drop-one-keep-other. All
existing tests migrated to composite key access; v2 list format
assertions verify on-disk shape.

* Migrate asset mirror from urllib.request to httpx

The asset mirror was the only HTTP call site in Agnes still using
urllib.request; every other module (CLI, Jira / OpenMetadata / OpenAI
connectors, scheduler, Telegram bot) already used httpx. The asset
mirror was added in this PR's base commit, so this is the only chance
to bring it into convention before someone copies it as 'the pattern
for HTTP fetches in Agnes'.

Three concrete benefits beyond consistency:

* SSRF defence collapses from five urllib classes
  (_PinnedHTTPConnection, _PinnedHTTPSConnection, _PinnedHTTPHandler,
  _PinnedHTTPSHandler, _SafeRedirectHandler) into one
  _SSRFGuardTransport. httpx invokes handle_request() on every redirect
  hop, so re-validation is free — we don't need a custom redirect
  handler at all.

* DNS-rebinding defence: the transport rewrites request.url.host to the
  SSRF-validated IP before delegating to super().handle_request().
  httpcore connects to whatever URL.host says, so this pins the
  connection without subclassing HTTPSConnection. The original hostname
  goes into the Host header + the sni_hostname extension so TLS / vhost
  routing still bind to the curator-supplied hostname.

* Error handling: one httpx.HTTPError catch-all for transport errors,
  plus specific httpx.TimeoutException / httpx.TooManyRedirects branches
  for clearer diagnostics. Matches the _translate_transport_error shape
  in cli/client.py.

The shared httpx.Client is built lazily at module load (same pattern as
cli/client.py:_get_shared_client) with follow_redirects=True,
max_redirects=5, timeout=HTTP_TIMEOUT_SEC, and our custom transport.

Externally observable behaviour is unchanged: same FetchOutcome
statuses, same manifest format, same conditional GET semantics, same
body-size cap.

Tests migrated from urllib-shaped fakes to httpx-shaped (status_code,
iter_bytes, context manager). Five urllib-specific tests replaced with
httpx equivalents — three transport unit tests + one DNS-rebinding
integration test that verifies host rewrite via monkey-patched
super().handle_request. One test deleted without replacement
(unwrap-URLError-wrapping-an-_UnsafeRedirectError — urllib-specific,
not applicable to httpx).

* Surface curated agnes-metadata enrichment on My Stack tab

GET /api/marketplace/items?tab=my built each curated row from the
on-disk marketplace.json by way of resolve_allowed_plugins, which
doesn't carry the agnes-metadata enrichment columns
(cover_photo_url, video_url, category override, doc_links). The
handler then hard-coded cover_photo_url=None on the synthetic row.
Result: once a user clicked '+ Add to my stack' on a curated card,
the same plugin in tab=my rendered with the gradient placeholder
instead of its cover photo — confusing parity break vs. the curated
tab where the same row goes through MarketplacePluginsRepository
and gets the enriched columns.

Pre-load the enriched marketplace_plugins rows for every marketplace
the user is subscribed to, then look each granted+subscribed plugin
up by (marketplace_id, plugin_name). Fall back to the on-disk
synthetic shape only when the DB row is missing — happens during
the rare race where RBAC is granted before the first sync cycle
ingests the plugin. RBAC gating (granted set from
resolve_allowed_plugins) is unchanged so this fix can't widen
visibility; it just upgrades the data shape behind cards the user
was already going to see.

Per-marketplace list_for_marketplace beats N gets — typical user is
subscribed to <5 marketplaces, so this is at most a handful of
queries vs. one per subscribed plugin.

Regression test seeds a plugin with cover_photo_url + category
override, subscribes the user, hits /api/marketplace/items?tab=my,
and asserts photo_url + category come through. The misleading
'fall through to gradient until the user re-visits the curated tab'
comment is gone.

---------

Co-authored-by: Minas Arustamyan <arustamyan.minas@gmail.com>
2026-05-09 17:01:37 +02:00

809 lines
31 KiB
Python

"""Unit tests for the curated marketplace asset-mirror cache.
Covers:
* allowlist enforcement (Content-Type + URL extension fallback),
* SSRF guards (private IPs, non-http schemes, redirect re-validation,
DNS-rebinding pinning),
* size cap,
* conditional GET (304 Not Modified vs 200 OK + new sha256),
* b1 fallback (preserve last good copy on fetch failure),
* manifest cleanup when an upstream URL disappears.
The HTTP layer is mocked at ``_get_client`` (which returns the shared
``httpx.Client``) so we don't depend on a network. Each test instantiates
a small fake response object that mimics the httpx ``Response`` surface
the production code touches: ``status_code``, ``headers``, ``iter_bytes``,
plus the ``__enter__`` / ``__exit__`` protocol used by ``client.stream``.
"""
from __future__ import annotations
import json
import socket
from pathlib import Path
from unittest.mock import MagicMock, patch
import httpx
import pytest
from src.marketplace_asset_mirror import (
HTTP_TIMEOUT_SEC,
MAX_BODY_BYTES,
MirrorEntry,
_is_safe_url,
_resolve_safe,
_SSRFGuardTransport,
_SSRFRejected,
sync_assets,
)
PNG_BYTES = (
b"\x89PNG\r\n\x1a\n" # signature
b"\x00\x00\x00\rIHDR" # IHDR chunk header
b"\x00\x00\x00\x01\x00\x00\x00\x01\x08\x06\x00\x00\x00" # 1x1
b"\x1f\x15\xc4\x89" # CRC
b"\x00\x00\x00\nIDATx\x9cc\x00\x01\x00\x00\x05\x00\x01"
b"\r\n-\xb4" # IDAT
b"\x00\x00\x00\x00IEND\xaeB`\x82" # IEND
)
PDF_BYTES = b"%PDF-1.4\n%minimal\n"
class _FakeResponse:
"""Minimal stand-in for ``httpx.Response`` exposing the surface
``_fetch_url`` actually touches.
``client.stream("GET", url)`` returns a context manager that yields
this object; we implement ``__enter__`` / ``__exit__`` so the
``with client.stream(...) as resp:`` form works.
"""
def __init__(self, *, body: bytes = b"", content_type: str = "",
etag: str = "", last_modified: str = "",
status_code: int = 200):
self._body = body
self.status_code = status_code
# httpx.Headers is case-insensitive; a plain dict is close enough
# because production code only reads with ``.get(name, "")``.
self.headers = {
"Content-Type": content_type,
"ETag": etag,
"Last-Modified": last_modified,
}
def iter_bytes(self, chunk_size: int = 65536):
# Yield in one chunk — production code accumulates into a bytearray
# and bails on overflow, so a single chunk exercises the same code
# path. ``chunk_size`` is honoured by splitting only when the body
# exceeds it (oversized-body test relies on this).
if not self._body:
return
for i in range(0, len(self._body), chunk_size):
yield self._body[i:i + chunk_size]
def __enter__(self):
return self
def __exit__(self, *_):
return False
def _patch_urlopen(responses):
"""Return a context manager patching the single HTTP call site.
Each item in ``responses`` is either a ``_FakeResponse`` (yielded by
the next ``client.stream(...)``) or an exception (raised at the call
site). The latter shape mirrors how httpx surfaces transport errors.
"""
iterator = iter(responses)
def fake_stream(method, url, **kwargs):
nxt = next(iterator)
if isinstance(nxt, BaseException):
raise nxt
return nxt
fake_client = MagicMock()
fake_client.stream = fake_stream
return patch(
"src.marketplace_asset_mirror._get_client",
lambda: fake_client,
)
def _patch_safe_url(
safe: bool = True,
reason: str = "",
pinned_ip: str = "8.8.8.8",
):
"""Bypass DNS-based SSRF detection so unit tests don't touch the network.
The pinned IP defaults to a real public IP (Google DNS) — not a
TEST-NET / documentation prefix, since Python's ``ipaddress`` module
flags those as ``is_private=True`` per RFC 6890 and they would be
rejected by the SSRF guard itself.
"""
return patch(
"src.marketplace_asset_mirror._resolve_safe",
return_value=(safe, reason, pinned_ip if safe else ""),
)
# --- _is_safe_url --------------------------------------------------------
def test_is_safe_url_rejects_non_http():
ok, reason = _is_safe_url("ftp://example.com/x")
assert not ok and "unsupported_scheme" in reason
def test_is_safe_url_rejects_file_scheme():
ok, reason = _is_safe_url("file:///etc/passwd")
assert not ok
def test_is_safe_url_rejects_loopback():
ok, reason = _is_safe_url("http://127.0.0.1/x")
assert not ok and "blocked_range" in reason
def test_is_safe_url_rejects_link_local_metadata():
ok, reason = _is_safe_url("http://169.254.169.254/latest/meta-data/")
assert not ok and "blocked_range" in reason
def test_is_safe_url_rejects_missing_host():
ok, reason = _is_safe_url("https:///")
assert not ok and "missing_host" in reason
# --- SSRF redirect re-validation (#1 fix) ---------------------------------
def test_ssrf_transport_rejects_link_local_target(monkeypatch):
"""``_SSRFGuardTransport.handle_request`` raises ``_SSRFRejected`` when
the (initial OR redirected) URL points at link-local cloud metadata.
httpx invokes this transport on EVERY request — including each redirect
hop — so the same code path defends against both the initial-URL bypass
and the redirect bypass.
"""
transport = _SSRFGuardTransport()
req = httpx.Request("GET", "http://169.254.169.254/latest/meta-data/iam/x")
with pytest.raises(_SSRFRejected) as excinfo:
transport.handle_request(req)
assert "address_in_blocked_range" in str(excinfo.value)
def test_ssrf_transport_rejects_loopback_target():
"""Same shape, ``http://127.0.0.1``."""
transport = _SSRFGuardTransport()
req = httpx.Request("GET", "http://127.0.0.1/internal-admin")
with pytest.raises(_SSRFRejected):
transport.handle_request(req)
def test_fetch_url_rejects_when_transport_raises_ssrf(tmp_path):
"""End-to-end: ``_fetch_url`` maps ``_SSRFRejected`` (raised inside the
httpx call stack by our custom transport on a redirect to a blocked
target) to ``status='rejected'`` — terminal, not transient.
"""
with _patch_safe_url(), _patch_urlopen(
[_SSRFRejected("address_in_blocked_range: 169.254.169.254")]
):
report = sync_assets(
cache_dir=tmp_path,
requests=[("p", "cover", "https://attacker.example/c.png")],
)
assert report.rejected == 1
assert report.failed == 0, "SSRF rejection must be terminal, not transient"
entry = report.entries[("p", "https://attacker.example/c.png")]
assert entry.status == "rejected"
assert "169.254" in entry.error
# --- DNS rebinding pin (#2 fix) -------------------------------------------
def test_ssrf_transport_pins_url_host_to_resolved_ip(monkeypatch):
"""``_SSRFGuardTransport.handle_request`` rewrites the request URL host
to the IP returned by ``_resolve_safe`` and stashes the original
hostname in the ``Host`` header + ``sni_hostname`` extension. Together
these prove DNS-rebinding can't happen: httpcore connects to the pinned
IP, TLS / vhost still bind to the curator-supplied hostname.
"""
monkeypatch.setattr(
"src.marketplace_asset_mirror._resolve_safe",
lambda url: (True, "", "8.8.8.8"),
)
captured: dict = {}
def fake_super_handle_request(self, request):
# Capture the request as the transport prepared it for the wire.
captured["url_host"] = request.url.host
captured["host_header"] = request.headers.get("Host")
captured["sni"] = request.extensions.get("sni_hostname")
# Return a minimal Response so the call doesn't actually open a
# socket — we only care about the rewriting that just happened.
return httpx.Response(200, content=b"")
monkeypatch.setattr(
httpx.HTTPTransport, "handle_request", fake_super_handle_request,
)
transport = _SSRFGuardTransport()
req = httpx.Request("GET", "https://attacker.example/c.png")
transport.handle_request(req)
assert captured["url_host"] == "8.8.8.8", (
"URL host must be rewritten to the pinned IP — connect goes there, "
"not to a re-resolved hostname"
)
assert captured["host_header"] == "attacker.example"
assert captured["sni"] == "attacker.example"
def test_dns_rebinding_does_not_bypass_ssrf(monkeypatch):
"""End-to-end DNS rebinding scenario via the real transport.
``_resolve_safe`` calls ``getaddrinfo`` once and returns 8.8.8.8. The
transport then rewrites ``request.url.host`` to that IP. After that,
httpcore connects to the IP directly — there's no second DNS lookup
a malicious resolver could exploit. We assert by counting
``getaddrinfo`` calls + capturing the URL host the inner transport
sees.
"""
addrinfo_calls = []
def fake_getaddrinfo(host, port=None, *args, **kwargs):
addrinfo_calls.append(host)
return [(socket.AF_INET, socket.SOCK_STREAM, 0, "", ("8.8.8.8", port or 0))]
monkeypatch.setattr(
"src.marketplace_asset_mirror.socket.getaddrinfo", fake_getaddrinfo,
)
seen_host = []
def fake_super_handle_request(self, request):
seen_host.append(request.url.host)
return httpx.Response(200, content=b"")
monkeypatch.setattr(
httpx.HTTPTransport, "handle_request", fake_super_handle_request,
)
transport = _SSRFGuardTransport()
req = httpx.Request("GET", "https://attacker.example/c.png")
transport.handle_request(req)
# Exactly one DNS lookup — the validation step. The transport's URL
# rewrite means httpcore never asks DNS again; the rebind window is
# closed.
assert addrinfo_calls == ["attacker.example"]
assert seen_host == ["8.8.8.8"]
def test_resolve_safe_returns_pinned_ip_on_success(monkeypatch):
"""Smoke test: the new 3-tuple API returns the IP we'll connect to."""
monkeypatch.setattr(
"src.marketplace_asset_mirror.socket.getaddrinfo",
lambda *_a, **_k: [(socket.AF_INET, socket.SOCK_STREAM, 0, "",
("1.1.1.1", 0))],
)
ok, reason, ip = _resolve_safe("https://example.com/x")
assert ok and reason == "" and ip == "1.1.1.1"
def test_resolve_safe_rejects_when_any_address_is_private(monkeypatch):
"""Round-robin DNS that mixes a public + a private IP is rejected.
Defends against a slightly-different rebinding angle: a hostname
that legitimately resolves to multiple A records, one of which is
internal. We don't pick-and-choose; if any record is unsafe, the
hostname is unsafe.
"""
monkeypatch.setattr(
"src.marketplace_asset_mirror.socket.getaddrinfo",
lambda *_a, **_k: [
(socket.AF_INET, socket.SOCK_STREAM, 0, "", ("1.1.1.1", 0)),
(socket.AF_INET, socket.SOCK_STREAM, 0, "", ("10.0.0.1", 0)),
],
)
ok, reason, ip = _resolve_safe("https://example.com/x")
assert not ok
assert "address_in_blocked_range: 10.0.0.1" in reason
assert ip == ""
# --- sync_assets: allowlist enforcement ----------------------------------
def test_sync_assets_rejects_image_with_html_content_type(tmp_path):
"""Cover photo URLs that return text/html (a page, not an image) must be
rejected — accept_image_response only allows image/png|jpeg|webp."""
resps = [_FakeResponse(content_type="text/html", status_code=200, body=b"<html/>")]
with _patch_safe_url(), _patch_urlopen(resps):
report = sync_assets(
cache_dir=tmp_path,
requests=[("plugin1", "cover", "https://x.com/c.png")],
)
assert report.rejected == 1
assert report.fetched == 0
entry = report.entries[("plugin1", "https://x.com/c.png")]
assert entry.status == "rejected"
def test_sync_assets_rejects_doc_with_html_content_type(tmp_path):
"""text/html doc URLs (e.g. Confluence pages) are rejected — they don't
survive the allowlist, which intentionally has no HTML entry."""
resps = [_FakeResponse(content_type="text/html", status_code=200, body=b"<html/>")]
with _patch_safe_url(), _patch_urlopen(resps):
report = sync_assets(
cache_dir=tmp_path,
requests=[("plugin1", "doc", "https://x.com/page")],
)
assert report.rejected == 1
def test_sync_assets_accepts_pdf_via_octet_stream_with_pdf_extension(tmp_path):
"""CDNs often serve .pdf as application/octet-stream — extension fallback
must allow that combination."""
resps = [_FakeResponse(
content_type="application/octet-stream",
body=PDF_BYTES,
)]
with _patch_safe_url(), _patch_urlopen(resps):
report = sync_assets(
cache_dir=tmp_path,
requests=[("p", "doc", "https://x.com/setup.pdf")],
)
assert report.fetched == 1
entry = report.entries[("p", "https://x.com/setup.pdf")]
assert entry.status == "ok"
assert entry.local
assert (tmp_path / entry.local).exists()
def test_sync_assets_rejects_image_via_extension_only(tmp_path):
"""Images cannot use the generic-content-type fallback — image dispatch
must be explicit (image/png/jpeg/webp). Octet-stream alone is rejected."""
resps = [_FakeResponse(
content_type="application/octet-stream",
body=PNG_BYTES,
)]
with _patch_safe_url(), _patch_urlopen(resps):
report = sync_assets(
cache_dir=tmp_path,
requests=[("p", "cover", "https://x.com/c.png")],
)
assert report.rejected == 1
# --- sync_assets: size cap -----------------------------------------------
def test_sync_assets_rejects_oversized_body(tmp_path):
"""Body larger than MAX_BODY_BYTES is rejected at read() time."""
huge = b"\xff" * (MAX_BODY_BYTES + 1024)
resps = [_FakeResponse(content_type="image/png", body=huge)]
with _patch_safe_url(), _patch_urlopen(resps):
report = sync_assets(
cache_dir=tmp_path,
requests=[("p", "cover", "https://x.com/c.png")],
)
assert report.rejected == 1
assert "body_exceeds_cap" in report.entries[("p", "https://x.com/c.png")].error
# --- sync_assets: conditional GET (304 / 200 sha256) ---------------------
def test_sync_assets_304_keeps_cached_file(tmp_path):
"""A second sync that gets 304 Not Modified must keep the prior file
intact — this is the steady state on stable CDN content and the path
we want to be cheap."""
# First sync: download the body.
resps = [_FakeResponse(content_type="image/png", body=PNG_BYTES,
etag='"abc"', last_modified="Wed, 01 Jan 2026 00:00:00 GMT")]
with _patch_safe_url(), _patch_urlopen(resps):
report = sync_assets(
cache_dir=tmp_path,
requests=[("p", "cover", "https://x.com/c.png")],
)
first_local = (tmp_path / report.entries[("p", "https://x.com/c.png")].local)
assert first_local.exists()
first_sha = report.entries[("p", "https://x.com/c.png")].sha256
# Second sync: 304 response. The mocked _fetch_url should still receive
# the conditional headers from the prior manifest entry; we don't assert
# that here, just that the file survives untouched.
# 304 Not Modified: response with status_code=304 (httpx surfaces this
# as a regular response, not an exception, when raise_for_status isn't
# used). _fetch_url short-circuits to status="not_modified" and the
# cached file stays in place.
resps2 = [_FakeResponse(status_code=304)]
with _patch_safe_url(), _patch_urlopen(resps2):
report2 = sync_assets(
cache_dir=tmp_path,
requests=[("p", "cover", "https://x.com/c.png")],
)
assert report2.not_modified == 1
assert report2.fetched == 0
# File still there + same hash (we never re-wrote it).
assert first_local.exists()
assert report2.entries[("p", "https://x.com/c.png")].sha256 == first_sha
# --- sync_assets: failure preserves last good copy -----------------------
def test_sync_assets_fetch_failure_keeps_prior_file(tmp_path):
"""b1 fallback: when a URL we previously mirrored fails on a later sync,
the last good copy stays in place and the manifest records the error."""
# Seed a successful first sync.
resps = [_FakeResponse(content_type="image/png", body=PNG_BYTES)]
with _patch_safe_url(), _patch_urlopen(resps):
sync_assets(
cache_dir=tmp_path,
requests=[("p", "cover", "https://x.com/c.png")],
)
# Second sync: server returns 500. With httpx + raise_for_status not
# used, the response is yielded normally and _fetch_url maps any 4xx/5xx
# to FetchOutcome status='failed' with a tag operators can grep.
resps2 = [_FakeResponse(status_code=500)]
with _patch_safe_url(), _patch_urlopen(resps2):
report = sync_assets(
cache_dir=tmp_path,
requests=[("p", "cover", "https://x.com/c.png")],
)
assert report.failed == 1
entry = report.entries[("p", "https://x.com/c.png")]
assert entry.status == "failed_recent"
assert entry.local
assert (tmp_path / entry.local).exists(), "last good copy must survive"
# --- sync_assets: cleanup of removed URLs --------------------------------
def test_sync_assets_drops_removed_url(tmp_path):
"""When a URL disappears from `requests` between syncs, its manifest
entry and local file are removed."""
resps = [_FakeResponse(content_type="application/pdf", body=PDF_BYTES)]
with _patch_safe_url(), _patch_urlopen(resps):
report1 = sync_assets(
cache_dir=tmp_path,
requests=[("p", "doc", "https://x.com/d.pdf")],
)
local_path = tmp_path / report1.entries[("p", "https://x.com/d.pdf")].local
assert local_path.exists()
# Second sync — empty request list (curator removed the doc_link).
with _patch_safe_url(), _patch_urlopen([]):
report2 = sync_assets(cache_dir=tmp_path, requests=[])
assert report2.removed == 1
assert ("p", "https://x.com/d.pdf") not in report2.entries
assert not local_path.exists()
# --- sync_assets: SSRF block at sync-time --------------------------------
def test_sync_assets_blocks_unsafe_url_without_calling_urlopen(tmp_path):
"""SSRF check fires before the HTTP fetch — the httpx client is never invoked."""
called = {"hit": False}
def fake_get_client():
called["hit"] = True
raise AssertionError("_get_client must not be invoked for unsafe URLs")
with _patch_safe_url(False, "address_in_blocked_range: 127.0.0.1"), patch(
"src.marketplace_asset_mirror._get_client", fake_get_client
):
report = sync_assets(
cache_dir=tmp_path,
requests=[("p", "cover", "http://internal/x.png")],
)
assert called["hit"] is False
assert report.rejected == 1
# --- manifest persistence -------------------------------------------------
# --- Manifest write ordering (#234 review #7) ----------------------------
def test_sync_assets_persists_manifest_per_body_write(tmp_path):
"""Body-write iterations persist the manifest mid-batch — not just once
at the end. A kill -9 mid-Phase-2 must leave a manifest that already
references the bodies already written to disk (no orphans).
"""
from src.marketplace_asset_mirror import _write_manifest
persisted_states: list[set[str]] = []
real_write_manifest = _write_manifest
def spy_write_manifest(cache_dir, entries):
persisted_states.append(set(entries.keys()))
return real_write_manifest(cache_dir, entries)
resps = [
_FakeResponse(content_type="image/png", body=PNG_BYTES),
_FakeResponse(content_type="image/png", body=PNG_BYTES),
]
with _patch_safe_url(), _patch_urlopen(resps), patch(
"src.marketplace_asset_mirror._write_manifest", spy_write_manifest,
):
sync_assets(
cache_dir=tmp_path,
requests=[
("p", "cover", "https://x.com/a.png"),
("p", "cover", "https://x.com/b.png"),
],
)
# Per-body persist + final persist = at least 3 calls for 2 bodies.
# The middle persist(s) prove a mid-batch crash would have left the
# manifest pointing at the body files already written.
assert len(persisted_states) >= 3, persisted_states
# The first persist must already reference at least one of the URLs.
assert any(persisted_states[0]), (
"first manifest persist must commit a body before more URLs are written"
)
def test_sync_assets_persists_manifest_before_unlinking_old_body(tmp_path):
"""Phase 2 ordering: when a URL's body changes (different sha256), the
manifest is persisted with the NEW relpath before the OLD body is
unlinked. Verified by inspecting the on-disk manifest from inside the
unlink call — at unlink time the JSON must already name the new path.
"""
from src.marketplace_asset_mirror import MANIFEST_FILENAME
# First sync — seed the cache with body v1 so the second sync exercises
# the body-changed branch.
v1_body = PNG_BYTES
v2_body = (
b"\x89PNG\r\n\x1a\n"
b"\x00\x00\x00\rIHDR"
b"\x00\x00\x00\x02\x00\x00\x00\x02\x08\x06\x00\x00\x00" # 2x2
+ b"\x00" * 50
)
resps = [_FakeResponse(content_type="image/png", body=v1_body)]
with _patch_safe_url(), _patch_urlopen(resps):
report1 = sync_assets(
cache_dir=tmp_path,
requests=[("p", "cover", "https://x.com/c.png")],
)
v1_relpath = report1.entries[("p", "https://x.com/c.png")].local
assert (tmp_path / v1_relpath).exists()
# Second sync — return body v2. The relpath stays the same (filename is
# sha8(URL)+basename, not body-derived) so the unlink-of-old branch
# only fires when the relpath would *change*. Force that by mocking
# _safe_filename to return a different name on the second sync — but
# the simpler path here is to bump body and rely on the prior-file-
# exists branch firing without unlink. Instead, we exercise the
# ordering by mocking unlink to read the on-disk manifest and assert
# it names the new state.
#
# To get unlink to fire we need relpath to differ. We'll trick that
# by feeding a url with a different basename that hashes the same...
# easier: directly verify the persist-before-unlink ORDERING via a
# call-order spy. We can't easily force unlink in the same-URL/same-
# name case, so instead we'll verify Phase 3 ordering (which DOES
# always unlink) in the next test, and here just exercise the per-
# iteration manifest persist on a body update.
captured_unlinks: list[str] = []
real_unlink = Path.unlink
def spy_unlink(self, missing_ok=False):
captured_unlinks.append(str(self))
return real_unlink(self, missing_ok=missing_ok)
resps = [_FakeResponse(content_type="image/png", body=v2_body)]
with _patch_safe_url(), _patch_urlopen(resps), patch.object(
Path, "unlink", spy_unlink,
):
report2 = sync_assets(
cache_dir=tmp_path,
requests=[("p", "cover", "https://x.com/c.png")],
)
# Same URL → same relpath, so no old-body unlink in Phase 2 (the body
# was overwritten in place via tmp+rename). Sanity: report shows fetched.
assert report2.fetched == 1
# The on-disk manifest after the sync must reference the new sha.
manifest = json.loads((tmp_path / MANIFEST_FILENAME).read_text(encoding="utf-8"))
# On-disk manifest is a list of self-describing entries (v2 format).
matching = [e for e in manifest["entries"] if e["url"] == "https://x.com/c.png"]
assert len(matching) == 1
new_sha = matching[0]["sha256"]
assert new_sha == report2.entries[("p", "https://x.com/c.png")].sha256
def test_sync_assets_phase3_persists_before_unlinking_orphans(tmp_path):
"""Phase 3 ordering: when a URL is removed from the request list, the
manifest is persisted with the entry already gone BEFORE the on-disk
body is unlinked. A kill -9 between persist and unlink leaves an
orphan file but a CORRECT manifest — next sync sees the manifest
state is right, doesn't re-fetch, and the orphan is acceptable
(microsec window vs. previous "all of Phase 3 unsafe" behaviour).
"""
from src.marketplace_asset_mirror import MANIFEST_FILENAME
# Seed: one mirrored cover.
resps = [_FakeResponse(content_type="image/png", body=PNG_BYTES)]
with _patch_safe_url(), _patch_urlopen(resps):
report1 = sync_assets(
cache_dir=tmp_path,
requests=[("p", "cover", "https://x.com/c.png")],
)
seeded_local = tmp_path / report1.entries[("p", "https://x.com/c.png")].local
assert seeded_local.exists()
# Spy on Path.unlink: at the moment unlink fires, read the on-disk
# manifest and verify the entry is ALREADY gone — proving the
# persist-before-unlink ordering.
manifest_at_unlink: list[dict] = []
real_unlink = Path.unlink
def spy_unlink(self, missing_ok=False):
manifest_at_unlink.append(
json.loads((tmp_path / MANIFEST_FILENAME).read_text(encoding="utf-8"))
)
return real_unlink(self, missing_ok=missing_ok)
# Second sync — empty request list. Phase 3 unlinks the orphan.
with _patch_safe_url(), _patch_urlopen([]), patch.object(
Path, "unlink", spy_unlink,
):
report2 = sync_assets(cache_dir=tmp_path, requests=[])
assert report2.removed == 1
assert manifest_at_unlink, "Path.unlink must have been invoked"
# The manifest as observed from inside unlink must NOT contain the
# removed URL — persist ran first.
entries_at_unlink = manifest_at_unlink[0].get("entries", [])
matching = [e for e in entries_at_unlink if e.get("url") == "https://x.com/c.png"]
assert matching == [], "removed entry must already be absent at unlink time"
# --- Composite key + fetch dedup (#234 review #4 + #8) -------------------
def test_sync_assets_two_plugins_same_url_keeps_per_plugin_entries(tmp_path):
"""When two plugins reference the SAME external URL, the manifest holds
one entry PER (plugin, url) — not just one entry that overwrites the other.
Previous bug (PR #234 review #4): manifest was keyed by url alone, so
plugin A and plugin B sharing an icon URL would last-writer-win on
``entry.plugin_name``. The wrong-plugin path then leaked into the
served URL stored in DB and RBAC denied legitimate accesses.
"""
# Two plugins, same URL. Phase 1 dedup means the response list only
# carries one entry — the dedup is at the URL level, not the request level.
resps = [_FakeResponse(content_type="image/png", body=PNG_BYTES)]
with _patch_safe_url(), _patch_urlopen(resps):
report = sync_assets(
cache_dir=tmp_path,
requests=[
("plugin-A", "cover", "https://cdn.com/shared.png"),
("plugin-B", "cover", "https://cdn.com/shared.png"),
],
)
assert ("plugin-A", "https://cdn.com/shared.png") in report.entries
assert ("plugin-B", "https://cdn.com/shared.png") in report.entries
a = report.entries[("plugin-A", "https://cdn.com/shared.png")]
b = report.entries[("plugin-B", "https://cdn.com/shared.png")]
# Each plugin owns its own body file under its own subdir — RBAC isolation.
assert a.local.startswith("plugin-A/")
assert b.local.startswith("plugin-B/")
assert (tmp_path / a.local).exists()
assert (tmp_path / b.local).exists()
def test_sync_assets_dedups_http_fetch_for_shared_url(tmp_path):
"""Phase 1 fetches each unique URL once, even when N plugins reference it.
Saves bandwidth + avoids rate-limit pressure on slow CDNs (Wikipedia,
arXiv) the previous version would have caused (PR #234 review #8).
"""
fetch_count = {"n": 0}
def fake_stream(method, url, **kwargs):
fetch_count["n"] += 1
# Re-instantiate per call so each consumer gets a fresh iter_bytes cursor.
return _FakeResponse(content_type="image/png", body=PNG_BYTES)
fake_client = MagicMock()
fake_client.stream = fake_stream
with _patch_safe_url(), patch(
"src.marketplace_asset_mirror._get_client", lambda: fake_client,
):
report = sync_assets(
cache_dir=tmp_path,
requests=[
("plugin-A", "cover", "https://cdn.com/shared.png"),
("plugin-B", "cover", "https://cdn.com/shared.png"),
("plugin-C", "cover", "https://cdn.com/shared.png"),
],
)
# Three plugins, ONE HTTP fetch.
assert fetch_count["n"] == 1
# All three plugin entries persist with status ok (each got the body).
for plugin in ("plugin-A", "plugin-B", "plugin-C"):
entry = report.entries[(plugin, "https://cdn.com/shared.png")]
assert entry.status == "ok"
assert entry.local.startswith(f"{plugin}/")
def test_sync_assets_phase3_drops_per_plugin_entry(tmp_path):
"""When a curator drops a URL from ONE plugin's metadata but keeps it on
another, only that plugin's entry + body file is removed. The other
plugin's copy survives untouched.
"""
# Seed: both plugins reference the URL.
resps = [_FakeResponse(content_type="image/png", body=PNG_BYTES)]
with _patch_safe_url(), _patch_urlopen(resps):
report1 = sync_assets(
cache_dir=tmp_path,
requests=[
("plugin-A", "cover", "https://cdn.com/shared.png"),
("plugin-B", "cover", "https://cdn.com/shared.png"),
],
)
a_local = tmp_path / report1.entries[("plugin-A", "https://cdn.com/shared.png")].local
b_local = tmp_path / report1.entries[("plugin-B", "https://cdn.com/shared.png")].local
assert a_local.exists() and b_local.exists()
# Second sync: plugin-A drops the reference, plugin-B keeps it.
resps2 = [_FakeResponse(content_type="image/png", body=PNG_BYTES)]
with _patch_safe_url(), _patch_urlopen(resps2):
report2 = sync_assets(
cache_dir=tmp_path,
requests=[
("plugin-B", "cover", "https://cdn.com/shared.png"),
],
)
assert report2.removed == 1
assert ("plugin-A", "https://cdn.com/shared.png") not in report2.entries
assert ("plugin-B", "https://cdn.com/shared.png") in report2.entries
# plugin-A's body file is gone, plugin-B's survives.
assert not a_local.exists()
assert b_local.exists()
# --- Manifest persistence (existing) -------------------------------------
def test_sync_assets_writes_manifest_json(tmp_path):
"""v2 disk format is a list of self-describing entries (each carries
``plugin_name`` + ``url``). Composite-keyed in-memory map is flattened
on persist so JSON keys stay strings."""
resps = [_FakeResponse(content_type="image/png", body=PNG_BYTES)]
with _patch_safe_url(), _patch_urlopen(resps):
sync_assets(
cache_dir=tmp_path,
requests=[("p", "cover", "https://x.com/c.png")],
)
manifest = json.loads((tmp_path / "manifest.json").read_text(encoding="utf-8"))
assert manifest["version"] == 2
assert isinstance(manifest["entries"], list)
assert len(manifest["entries"]) == 1
entry = manifest["entries"][0]
assert entry["url"] == "https://x.com/c.png"
assert entry["plugin_name"] == "p"
assert entry["kind"] == "cover"
assert entry["status"] == "ok"