agnes-the-ai-analyst/app/api
minasarustamyan 6fe67d5279
Curated marketplace enrichment via agnes-metadata.json + curator metadata (#234)
* Curated marketplace enrichment via agnes-metadata.json + curator metadata

Adds a second well-defined metadata file `.claude-plugin/agnes-metadata.json`
that upstream marketplace repos can opt into, providing per-plugin (and
per-skill / per-agent) cover photo, demo video URL, doc links, and
category override. The Claude Code marketplace contract is untouched —
agnes-metadata.json + the convention `.agnes/` directory are stripped
from the synthetic Claude Code marketplace served via /marketplace.zip
and /marketplace.git/*, so user instances see a clean Claude Code repo
with no Agnes-only metadata.

Highlights:
- DB schema v32 — adds curator_name + curator_email on marketplace_registry,
  cover_photo_url + video_url + doc_links on marketplace_plugins.
- Mandatory curator at marketplace registration, editable later through
  the admin UI; surfaces on cards + detail pages in place of owner_todo.
- External-asset mirror cache at ${DATA_DIR}/marketplace-cache/<slug>/
  with conditional GET, 60s timeout, 10 MB body cap, SSRF guards, and
  Wikipedia-policy-compliant User-Agent.
- Strict drop semantics — anything Agnes can't deliver as a real PDF /
  Markdown / plain text doc, or a real PNG / JPEG / WebP cover, is
  dropped from the served metadata; UI looks identical to no-entry case
  (gradient placeholder for missing covers, no row in the doc list).
- Doc allowlist + image allowlist enforced on both the curated mirror
  flow and the Flea upload flow (/store/new); shared module
  src/marketplace_assets.py.
- New /api/marketplace/curated/{mp}/{plugin}/{asset,doc,mirrored}/...
  endpoints with path-traversal guards + RBAC + Content-Disposition
  attachment for docs.
- Curator-focused format guide at /marketplace/format-guide; canonical
  source is docs/curated-marketplace-format.md, also linked from the
  admin /admin/marketplaces page next to + Add Marketplace.

See CHANGELOG.md under [Unreleased] for the full breakdown.

* Fix format-guide test assertion to match shortened disclaimer

The 'Flea Market' phrase was trimmed out of the disclaimer in
docs/curated-marketplace-format.md after the curator-focused rewrite.
Update the rendered-HTML test to assert the channel-scoping phrase
that's actually present ('Curated Marketplace channel only') rather
than the 'Flea Market' contrast that's no longer in the doc.

* Drop unused 'version' field from agnes-metadata.json schema

The parser never read it; it was a YAGNI placeholder for future
schema evolution. Curators don't need to wonder what to put there
when adding the file for the first time. Will be re-added if and
when we actually introduce a backwards-incompatible schema change.

* Harden asset mirror against SSRF via redirect + DNS rebinding

The pre-flight _is_safe_url check validated only the initial URL;
urllib.request.urlopen then followed redirects and re-resolved DNS for
the actual connection — both bypassable. Attacker-controlled origin
could 302 to http://169.254.169.254/... and exfil cloud metadata;
attacker-controlled DNS could return public IP first / 127.0.0.1 second.

Replace urlopen call with a shared OpenerDirector wired through three
custom handlers: _SafeRedirectHandler re-runs SSRF allowlist on every
redirect Location (max 5 hops, down from urllib's 10), and
_PinnedHTTPHandler / _PinnedHTTPSHandler connect to the IP that passed
validation rather than re-resolving the hostname. TLS SNI + cert verify
stay bound to the original hostname.

_resolve_safe returns the validated IP (the existing _is_safe_url
2-tuple wrapper stays for backwards compatibility) and rejects round-
robin DNS that mixes a public + private record. _UnsafeRedirectError
is a typed exception so _fetch_url can map redirect blocks to terminal
'rejected' status (not transient 'failed'). _http_open is the single
call site so tests can mock at one well-defined seam.

Tests cover redirect blocking (link-local, loopback), redirect-error
unwrapping inside URLError, pinned-IP connection target, and the
end-to-end DNS-rebinding scenario. Existing tests that mocked
urllib.request.urlopen are migrated to mock _http_open.

* Harden /asset/ endpoint against stored XSS

The endpoint served any file in the cloned marketplace repo with
stdlib-detected Content-Type, so a curator who landed evil.html (or a
renamed evil.png carrying HTML bytes) in the working tree got a
same-origin XSS — the response shares cookie scope with /admin and
/api/me/*.

The asset endpoint is image-only by contract (cover photos referenced
from agnes-metadata.json + inner skill / agent cards), so applying the
same allowlist + magic-bytes pattern that /doc/ already uses closes
the gap without breaking any legitimate use case. Three layered
checks: extension in IMAGE_EXTENSIONS (.png/.jpg/.jpeg/.webp; SVG
excluded — <script> inside SVG executes), validate_image_file magic
bytes (defeats rename-extension attack), Content-Type pinned from the
validated extension (never stdlib mimetypes).

Defense-in-depth: X-Content-Type-Options: nosniff stops browser MIME
sniffing; Content-Security-Policy: default-src 'none' blocks script /
iframe execution even if a future regression let HTML through.

Tests cover the .html extension reject, the renamed-HTML-as-PNG magic-
bytes reject, the .svg reject, and the happy-path PNG with security
headers attached. The pre-existing path-traversal test seeds a real
PNG instead of ok.txt now that the endpoint is image-only.

* Enforce mandatory curator on marketplace PATCH

The POST handler enforced curator_name + curator_email at create time,
but PATCH treated empty / missing curator inputs as 'no change'. Legacy
rows that pre-date v32 (curator_name=NULL) could be edited indefinitely
without ever filling the curator gap, and OWNER_TODO_PLACEHOLDER lingered
on every /marketplace card.

Reject the PATCH with 400 when the post-merge row would persist with
empty curator. The check fires after the existing field-merge logic, so
once-filled rows that don't touch curator still pass through (their
existing values fall through from the DB row). DB column stays nullable
so untouched legacy rows continue to coexist — the gate fires only the
moment an admin opens the edit modal.

Existing PATCH semantics preserved: empty-string input still means 'leave
existing value alone', and once-filled curator can't be cleared (those
test cases pass unchanged). New test seeds a legacy row directly via the
repository, then exercises url-only PATCH (rejected), partial-fill PATCH
(rejected), and full-fill PATCH (succeeds); a follow-up no-curator PATCH
on the now-formed row also passes.

* Drop unused curated-marketplace helpers (PR #234 review)

* build_db_payload — imported by src/marketplace.py but never called.
  The strict-drop semantics it would have implemented were re-written
  inline in _refresh_plugin_cache (see the comment block there). The
  standalone helper still carried the old fall-back-to-original-external-
  URL-on-mirror-failure behaviour, which contradicts the documented
  drop-when-can't-deliver contract — a future contributor who re-wired
  it would have introduced a silent regression. Delete with the helper
  + the import + the comment that referenced it.

* _resolve_marketplace_name — one-line shim with no remaining call
  sites. Callers use _resolve_marketplace_meta which returns name +
  curator together, avoiding the double DB hit the shim exists to
  hide.

* '# noqa: F401  Optional kept for forward-compat' was wrong — Optional
  IS used in src/marketplace.py (line 70 and line 238). Drop the noqa
  comment so a future ruff run doesn't try to remove a real import.

Removing build_db_payload also drops the only remaining use of Optional
in src/marketplace_metadata.py, so the import comes out there too.

* Cap agnes-metadata.json size + catch RecursionError on parse

The reader is invoked once per marketplace per sync and the file is
curator-controlled. Two failure modes were unguarded:

* Multi-GB JSON: path.read_text() pulled the whole file into memory
  before json.loads even ran. A curator with commit access to an
  upstream repo could OOM the sync worker.

* Deeply-nested JSON under any size cap: cpython's recursive object /
  array parser raises RecursionError at ~1000 levels of depth.
  RecursionError is a RuntimeError, not ValueError, so the existing
  catch let it propagate up and abort the entire sync — every other
  marketplace in the same pass got skipped.

Add AGNES_METADATA_MAX_BYTES = 1 MiB (a real metadata file with covers,
docs, categories for ~50 plugins fits in <100 KB so the cap is
generous) and gate the size check on path.stat().st_size before the
body read. Broaden the parse except to (ValueError, RecursionError)
with a unified log line. Both failure modes degrade to the same
empty-dict fall-back the malformed-JSON path already used, so one bad
upstream never aborts the rest of the sync.

Tests cover the size cap firing before json.loads (whitespace-padded
valid JSON exceeding the cap) and the recursion path (5000 nested
arrays — past cpython's default recursion limit but well under the
size cap).

* Persist asset-mirror manifest per body write, before unlink

sync_assets wrote each body atomically (tmp + rename) but persisted
the manifest only at the end of the batch. A kill -9 mid-Phase 2 left
on-disk files the manifest never referenced. Once a curator dropped
that URL from agnes-metadata.json, Phase 3's cleanup had no record of
the file and the orphan stayed forever — there's no GC pass walking
the cache dir today, so disk would slowly bloat.

Phase 2 (body-write iteration): after the in-memory manifest mutation,
persist BEFORE unlinking the previous body. The crash window narrows
from 'all of Phase 2' to 'between persist and unlink' (microseconds).
A persist failure mid-batch keeps the previous body on disk — the on-
disk manifest still references it, and a stale-but-existing file beats
a 404. Cost: one extra tmp+rename per body write; manifest is a few KB
so the overhead is negligible vs. the HTTP fetches.

Phase 3 (curator-removed URLs): same discipline. Collect the to-delete
relpaths, persist the manifest with the entries already gone, THEN
unlink. A crash mid-cleanup leaves at most a microsecond window where
files exist despite the manifest no longer naming them. The next sync
reads the (correct) manifest and the orphan stays orphaned, but the
served state is consistent.

Tests cover per-body persist call count, the post-update on-disk
manifest content, and Phase 3 ordering verified by reading the on-disk
manifest from inside Path.unlink.

* Consolidate marketplace video embeds + format-guide CSS

The YouTube nocookie / Vimeo / <video> / link-fallback detection logic
was duplicated verbatim in marketplace_plugin_detail.html and
marketplace_item_detail.html (~40 JS lines each, with subtly-different
inline styles). Both templates now {% include %} a single
_marketplace_video_embed.html partial inside their IIFE so the regex,
the nocookie attribute set, and the unknown-host link fallback live in
ONE place — future tweaks (new host, new attribute, fixed sandbox flag)
no longer need to be applied twice in lockstep.

The .video-wrap selectors (one inline <style> rule in plugin_detail,
one inline style='...' attribute in item_detail) are replaced by the
existing .video-embed 16:9 wrapper in style-custom.css, with new
.video-embed video / .video-embed a child rules added so the wrapper
handles all four embed shapes uniformly without per-template
positioning.

The 60-line inline <style> block in marketplace_format_guide.html
moves verbatim to style-custom.css under a new 'Marketplace format
guide page' section, scoped to .format-guide so other pages aren't
affected.

No user-visible behaviour change: the rendered HTML for valid
YouTube / Vimeo / mp4 / external links is byte-identical to before,
and the format-guide page renders the same.

* Maintainability cleanup batch (PR #234 review)

#10: drop _path_under from app/api/marketplace.py — it was a byte-
equivalent clone of _safe_join (same Path.resolve(strict=True) +
relative_to() containment check). The three v32 endpoint handlers
(/asset, /doc, /mirrored) now share the existing helper.

#14: rename src/marketplace_assets.py → src/marketplace_asset_validation.py
so the file's purpose is obvious from the name and the previous
overlap with src/marketplace_asset_mirror.py is gone. Six call-site
imports updated in lockstep; CHANGELOG references under [Unreleased]
updated to track the new path.

#11: consolidate the URL builders that resolve
/api/marketplace/curated/<slug>/<plugin>/{asset,doc,mirrored}/...
paths. _internal_asset_url / _internal_doc_url / _mirrored_asset_url
lived in src/marketplace.py, while a copy named _mirrored_url lived
in app/api/marketplace.py with a 'must stay aligned' comment. New
module src/marketplace_urls.py is the single source of truth — both
call sites import from it and a future URL-format tweak only needs
to change one file. The _ROUTE_PREFIX constant collapses the per-
function f-string repetition. The route-handler endpoints themselves
still own the path string literals (keeping the builders identical
to the route declarations remains a checklist item, not a runtime
guarantee).

* Re-key asset-mirror manifest by (plugin, url) + dedup HTTP fetches

The manifest used to be keyed by URL alone, so two plugins in the
same marketplace referencing the same external image (a shared CDN
icon, a common cover) collided on entry.plugin_name — last writer
won. The DB row for the losing plugin then stored a served URL
pointing under the winning plugin's tree, and require_resource_access
denied legitimate access on one side and let the other plugin's user
reach the wrong asset.

In-memory: Dict[Tuple[str, str], MirrorEntry] keyed (plugin_name, url).
On disk: format flips from {url: entry} dict to [entry, ...] list of
self-describing entries (each carries plugin_name + url + the
previous fields). JSON keys can't be tuples; encoding 'plugin::url'
would just shift the parsing burden.

Phase 1 of sync_assets deduplicates fetches by URL — three plugins
sharing one URL share one HTTP request. The conditional-GET prior is
picked from any owning plugin's prior entry; if their etags diverge
(rare) we miss one 304 and pay for a full re-download instead.
Phase 2 still creates a per-(plugin, url) manifest entry pointing
under the plugin's own subdir, and Phase 3 cleanup is keyed the same
way so dropping a URL from one plugin's metadata doesn't disturb
another plugin still referencing it.

Body files stay per plugin (RBAC-clean isolation: deleting plugin A's
cache can't strand plugin B). Bandwidth saved by fetch dedup.

Consumer code re-keyed: src.marketplace._refresh_plugin_cache rebuilt
served_url_for / mirror_status as composite-keyed maps;
app.api.marketplace._resolve_external_via_mirror /
_curated_inner_cover / _curated_inner_enrichment look up by
(plugin_name, url).

Tests cover per-plugin manifest entries with shared URL, the single
HTTP fetch for N plugins, and Phase 3 drop-one-keep-other. All
existing tests migrated to composite key access; v2 list format
assertions verify on-disk shape.

* Migrate asset mirror from urllib.request to httpx

The asset mirror was the only HTTP call site in Agnes still using
urllib.request; every other module (CLI, Jira / OpenMetadata / OpenAI
connectors, scheduler, Telegram bot) already used httpx. The asset
mirror was added in this PR's base commit, so this is the only chance
to bring it into convention before someone copies it as 'the pattern
for HTTP fetches in Agnes'.

Three concrete benefits beyond consistency:

* SSRF defence collapses from five urllib classes
  (_PinnedHTTPConnection, _PinnedHTTPSConnection, _PinnedHTTPHandler,
  _PinnedHTTPSHandler, _SafeRedirectHandler) into one
  _SSRFGuardTransport. httpx invokes handle_request() on every redirect
  hop, so re-validation is free — we don't need a custom redirect
  handler at all.

* DNS-rebinding defence: the transport rewrites request.url.host to the
  SSRF-validated IP before delegating to super().handle_request().
  httpcore connects to whatever URL.host says, so this pins the
  connection without subclassing HTTPSConnection. The original hostname
  goes into the Host header + the sni_hostname extension so TLS / vhost
  routing still bind to the curator-supplied hostname.

* Error handling: one httpx.HTTPError catch-all for transport errors,
  plus specific httpx.TimeoutException / httpx.TooManyRedirects branches
  for clearer diagnostics. Matches the _translate_transport_error shape
  in cli/client.py.

The shared httpx.Client is built lazily at module load (same pattern as
cli/client.py:_get_shared_client) with follow_redirects=True,
max_redirects=5, timeout=HTTP_TIMEOUT_SEC, and our custom transport.

Externally observable behaviour is unchanged: same FetchOutcome
statuses, same manifest format, same conditional GET semantics, same
body-size cap.

Tests migrated from urllib-shaped fakes to httpx-shaped (status_code,
iter_bytes, context manager). Five urllib-specific tests replaced with
httpx equivalents — three transport unit tests + one DNS-rebinding
integration test that verifies host rewrite via monkey-patched
super().handle_request. One test deleted without replacement
(unwrap-URLError-wrapping-an-_UnsafeRedirectError — urllib-specific,
not applicable to httpx).

* Surface curated agnes-metadata enrichment on My Stack tab

GET /api/marketplace/items?tab=my built each curated row from the
on-disk marketplace.json by way of resolve_allowed_plugins, which
doesn't carry the agnes-metadata enrichment columns
(cover_photo_url, video_url, category override, doc_links). The
handler then hard-coded cover_photo_url=None on the synthetic row.
Result: once a user clicked '+ Add to my stack' on a curated card,
the same plugin in tab=my rendered with the gradient placeholder
instead of its cover photo — confusing parity break vs. the curated
tab where the same row goes through MarketplacePluginsRepository
and gets the enriched columns.

Pre-load the enriched marketplace_plugins rows for every marketplace
the user is subscribed to, then look each granted+subscribed plugin
up by (marketplace_id, plugin_name). Fall back to the on-disk
synthetic shape only when the DB row is missing — happens during
the rare race where RBAC is granted before the first sync cycle
ingests the plugin. RBAC gating (granted set from
resolve_allowed_plugins) is unchanged so this fix can't widen
visibility; it just upgrades the data shape behind cards the user
was already going to see.

Per-marketplace list_for_marketplace beats N gets — typical user is
subscribed to <5 marketplaces, so this is at most a handful of
queries vs. one per subscribed plugin.

Regression test seeds a plugin with cover_photo_url + category
override, subscribes the user, hits /api/marketplace/items?tab=my,
and asserts photo_url + category come through. The misleading
'fall through to gradient until the user re-visits the curated tab'
comment is gone.

---------

Co-authored-by: Minas Arustamyan <arustamyan.minas@gmail.com>
2026-05-09 17:01:37 +02:00
..
__init__.py feat: add FastAPI server with auth, RBAC, and all API endpoints 2026-03-27 15:19:18 +01:00
_metadata_models.py release: 0.47.0 — source-agnostic catalog metadata + cache discipline (#223) 2026-05-07 18:33:55 +02:00
access.py Add /marketplace browse page + Model B opt-in stack composition (#230) 2026-05-08 14:22:19 +02:00
admin.py Flea-market upload guardrails + soft delete + JOIN-based admin queue (#233) 2026-05-09 17:32:53 +04:00
admin_bigquery_test.py feat(admin): #160 BQ test-connection endpoint + billing_project placeholder UI 2026-05-04 10:31:35 +02:00
cache_warmup.py release: 0.47.0 — source-agnostic catalog metadata + cache discipline (#223) 2026-05-07 18:33:55 +02:00
catalog.py feat(rbac): drop dataset_permissions + users.role + is_public; v19 migration (#150) 2026-04-30 22:02:16 +02:00
claude_md.py chore(cli-rename): replace stale da verbs in active code paths 2026-05-04 21:10:43 +02:00
cli_artifacts.py chore: rename stale 'da' references to 'agnes' + CHANGELOG 2026-05-06 23:23:59 +02:00
data.py feat(caddy): file_server for parquet downloads — bypass uvicorn 2026-05-05 16:41:33 +02:00
health.py Extract session-pipeline framework + UsageProcessor skeleton (#232) 2026-05-08 19:47:46 +02:00
jira_webhooks.py fix(security): close Jira webhook fail-open + path traversal (#83) (#93) 2026-04-27 19:53:55 +02:00
marketplace.py Curated marketplace enrichment via agnes-metadata.json + curator metadata (#234) 2026-05-09 17:01:37 +02:00
marketplaces.py Curated marketplace enrichment via agnes-metadata.json + curator metadata (#234) 2026-05-09 17:01:37 +02:00
me.py feat(home): state-aware /home + /setup-advanced + schema v26 (#228) 2026-05-08 18:28:47 +02:00
me_debug.py feat(auth): /me/debug self-only auth diagnostic page (#116) 2026-04-29 06:36:28 +02:00
memory.py feat(memory): admin Edit + MEMORY_DOMAIN RBAC + ai-section UI (#141) 2026-04-30 11:04:41 +02:00
metadata.py feat(rbac+marketplace): RBAC v13 + Claude Code marketplace + #81/#83/#44 hardening 2026-04-28 14:25:04 +02:00
metrics.py feat(rbac+marketplace): RBAC v13 + Claude Code marketplace + #81/#83/#44 hardening 2026-04-28 14:25:04 +02:00
my_stack.py Flea-market upload guardrails + soft delete + JOIN-based admin queue (#233) 2026-05-09 17:32:53 +04:00
news.py feat(home): state-aware /home + /setup-advanced + schema v26 (#228) 2026-05-08 18:28:47 +02:00
query.py release: 0.47.2 — restore #218 + #219 fixes silently reverted by #217 (#225) 2026-05-07 19:57:18 +02:00
query_hybrid.py feat(rbac+marketplace): RBAC v13 + Claude Code marketplace + #81/#83/#44 hardening 2026-04-28 14:25:04 +02:00
scripts.py feat(scheduler): re-wire sync_schedule + script.schedule; tune via env; OpenMetadata TLS (#135) 2026-04-29 22:06:30 +02:00
settings.py feat(rbac): drop dataset_permissions + users.role + is_public; v19 migration (#150) 2026-04-30 22:02:16 +02:00
store.py Curated marketplace enrichment via agnes-metadata.json + curator metadata (#234) 2026-05-09 17:01:37 +02:00
sync.py release: 0.47.1 — Keboola connector v27 (incremental, partitioned, where_filters, typed parquet) (#217) 2026-05-07 19:01:27 +02:00
telegram.py feat: complete system — web UI, all API endpoints, governance, admin, CLI commands 2026-03-27 16:52:22 +01:00
tokens.py chore(lint): final ruff fixes 2026-05-04 19:32:52 +02:00
upload.py fix(security+ops) + release(0.12.1): #82 #85 #87 hardening + cut 0.12.1 (#104) 2026-04-28 19:57:30 +02:00
users.py feat(rbac): drop dataset_permissions + users.role + is_public; v19 migration (#150) 2026-04-30 22:02:16 +02:00
v2_arrow.py feat(v2): claude-driven fetch primitives + 0.14.0 (#102) 2026-04-29 01:07:19 +02:00
v2_cache.py feat(v2): claude-driven fetch primitives + 0.14.0 (#102) 2026-04-29 01:07:19 +02:00
v2_catalog.py release: 0.47.0 — source-agnostic catalog metadata + cache discipline (#223) 2026-05-07 18:33:55 +02:00
v2_quota.py refactor(quota): #160 relocate _build_quota_tracker to v2_quota.py 2026-05-04 10:31:35 +02:00
v2_sample.py release: 0.46.5 — agnes describe -n parses, server sanitizes NaN (#224) 2026-05-07 18:16:21 +02:00
v2_scan.py perf: Tier 1 event-loop unblocking — async def → def on BQ-bound handlers 2026-05-05 17:44:08 +02:00
v2_schema.py release: 0.47.0 — source-agnostic catalog metadata + cache discipline (#223) 2026-05-07 18:33:55 +02:00
welcome.py fix(devin-review): dashboard CTA respects override; PUT validates anon path 2026-05-03 21:45:32 +02:00
where_validator.py feat(v2): claude-driven fetch primitives + 0.14.0 (#102) 2026-04-29 01:07:19 +02:00