* feat(store): flea-market upload guardrails + soft delete + JOIN-based admin queue
Adds an end-to-end guardrails pipeline for store uploads (manifest +
static-security + LLM review), persists blocked bundles for forensics,
introduces soft-delete (Archive) semantics, consolidates the legacy
/store/{id} surface into /marketplace/flea/{id}, and reworks the admin
queue so lifecycle filters read live entity visibility via LEFT JOIN
rather than a denormalized submission column.
Schema v29 → v35:
* v29 store_submissions table + store_entities.visibility_status
* v30 file_size, bundle_sha256, bundle_purged_at on submissions
* v31 reshape store_submissions (drop legacy unique on entity_id)
* v32 store_entities.archived_at/by + 'archived' visibility value
* v33 drop store_submissions.retry_count (unused)
* v34 ensure idx_store_submissions_entity exists post column-drop
* v35 broaden visibility_status enum + JOIN architecture cutover
Pipeline (src/store_guardrails/):
* Inline checks: manifest_check, static_scan, quality_check
* LLM review configurable haiku|sonnet|opus (default haiku)
* BackgroundTasks-driven async path with structured-output JSON
* Per-submitter daily quota (default 50)
* 30-day TTL purge job (POST /api/admin/run-blocked-purge)
* Bundle SHA256 + size persisted; sha256 survives purge for forensics
Visibility model:
* pending | approved | hidden | archived
* _enforce_visibility returns 404 (no leak) for non-owner non-admin
* Owner sees own non-approved entries via include_owner_id widening
* Install refused with 409 entity_not_approved when not approved
Soft-delete (DELETE /api/store/entities/{id}):
* Default = soft (visibility_status='archived'); existing installs
keep getting served the bundle so users don't lose the plugin
* ?hard=true admin-only: drops bundle + cascades user_store_installs
* Hard-delete preserves entity_id on submission as tombstone so
audit_log linkage survives for the activity timeline
Admin queue lifecycle (the JOIN refactor):
* Verdict (store_submissions.status) is immutable forensic record
* Lifecycle (store_entities.visibility_status) is live state
* /admin/store/submissions Archived chip translates to
`e.visibility_status='archived'` via LEFT JOIN — any path that
flips visibility surfaces in the queue immediately
* Detail page renders Status (verdict) and Entity lifecycle side by
side so admins see "approved at review, now archived" at a glance
URL consolidation:
* /store/{id} deleted (no redirect, stale bookmarks 404)
* /marketplace/flea/{id} is the canonical detail surface
* Three in-tree callers (upload-success, my-stack card, store
listing card) updated to point at the new URL
* Quarantine banner extracted to _quarantine_banner.html partial,
self-guarded, included from both flea detail templates
* Banner JS auto-refreshes when the verdict lands by polling
/api/marketplace/flea/{id}/detail (visibility_status +
submission_status — the latter is needed because blocked_llm
keeps the entity at visibility_status='pending')
Audit log resource format:
* runner.py emits prefixed `store_submission:{id}` (post-fix)
* Detail-page timeline query handles three patterns: prefixed
submission, helper-emitted `store_entity:{sub_id}`, and bare-id
legacy rows — all surface in the activity timeline
UX fixes:
* Owner sees Under review / Quarantined / Hidden banner with status
* Install button gray-disabled (not blue) when non-approved
* Owner cannot delete quarantined entries (403); admin can
* Admin queue: filter chips, sortable columns, paging, page-size
* Auto-refresh queue every 5s while pending rows are visible
* Store upload page file picker no longer opens twice (label →
input default action collided with explicit JS handler)
Tests: 168 passed across the guardrails suites (admin submissions,
store API, inline / LLM / purge guardrails, store repositories,
marketplace filter, schema version). New regression coverage
includes: archive surfaces via JOIN even when API path is bypassed;
deleted submission renders activity timeline (tombstone); flea
detail surfaces submission_status only for owner/admin; detail page
renders Entity lifecycle row; audit log resource format covers both
helper and runner paths.
* fix(store-guardrails): PR #233 follow-up — prompt injection, atomic PUT, BG race, schema, reaper, sort whitelist
Addresses 9 of the 23 findings from the PR #233 review (spec at
docs/superpowers/specs/2026-05-09-pr233-guardrails-fixes-spec.md).
Merge-gate items #1-#6 plus high-value mediums #7, #9-#12, #23.
Architectural items (#8 enum split, #14 factory) and pure
maintainability (#15-#22) deferred to follow-ups.
Security:
* #1 prompt injection — SYSTEM_PROMPT now passed via the SDK's
dedicated system= parameter; bundle wrapped in <bundle>...</bundle>
sentinels declared data-only by the system prompt; literal
sentinel strings in user content are escaped so an adversarial
README can't forge a close tag.
* #6 static scan honesty — module docstring + admin copy + docs
declare static scan as signal not gate; .md/.txt/.rst/.html/.json/
.yaml/.yml/.toml skipped to avoid false positives on prose.
AST mode for Python deferred (separate flag, FP comparison work).
Correctness:
* #2 PUT atomicity — bundles bake into plugin.staging-<rand>/
alongside live, atomic-rename on success; failed checks leave
live tree byte-for-byte intact.
* #3 BG-task race — set_visibility_if_pending guards verdict flips
to the (pending, hidden) review window; admin archives during
review survive; skipped flips audit-logged.
* #4 v35 NOT NULL/DEFAULT — schema v35→v36 re-applies them on
store_entities.visibility_status. CHECK constraint enforced
application-side (DuckDB ADD CHECK on existing column unsupported).
* #7 stuck-review reaper — reap_stuck_llm_reviews flips pending_llm
rows older than guardrails.stuck_review_grace_seconds (default
1800) to review_error. Scheduler runs every 15 min via new
/api/admin/run-reap-stuck-reviews. Set knob to 0 to disable.
* #9 quota counter — count_blocked_for_submitter_since now counts
blocked_inline + blocked_llm + review_error so a submitter
triggering only LLM-blocked verdicts is bounded.
* #10 missing risk_level — surfaces as review_error with
error='missing_risk_level' instead of silently defaulting to
'medium' (which looked like a model-decided block).
* #11 archived_at clear — set_visibility nulls archived_at +
archived_by when transitioning out of 'archived' so a future
read doesn't show stale archive forensics on an approved row.
Maintainability:
* #12 FSM doc comment — accurate insert/transition/lifecycle
description in src/db.py near store_submissions schema.
* #23 sort-key whitelist — admin queue rejects unknown sort keys
with 400 invalid_sort_key; substring-replace footgun removed.
Deferred (separate PRs):
* #5 quota race — proper fix requires asyncio.Lock spanning the
full pipeline; threading.Lock blocks event loop, DuckDB MVCC
doesn't help. API-level slowapi bounds worst case for now.
* #6 part 3 (AST static scan), #8 (enum split), #13 (import
bundle docs), #14 (factory consolidation), #15-#22 (maint).
Tests:
* New: tests/test_store_guardrails_prompt_injection.py (corpus +
trust-boundary invariants), tests/test_store_put_atomic.py,
tests/test_store_guardrails_reaper.py.
* Extended: test_store_guardrails_llm.py (system param, missing
risk_level, BG race), test_admin_store_submissions.py (quota
counter widening, sort whitelist 400), test_store_repositories.py
(un-archive metadata clear), test_db_schema_version.py (v36).
* Full suite: 3738 passed; 17 pre-existing baseline failures
unchanged (db migration tests, cli binary rename, catalog export,
user mgmt v5 backfill — confirmed by stash + rerun on clean tree).
383 lines
15 KiB
Python
383 lines
15 KiB
Python
"""LLM-review wiring tests — Anthropic call mocked.
|
|
|
|
These tests cover the runner's persistence behavior on each verdict
|
|
shape (safe / risky / error). The actual prompt engineering lives in
|
|
``src/store_guardrails/prompts.py`` and is exercised at integration
|
|
time, not here.
|
|
"""
|
|
|
|
from __future__ import annotations
|
|
|
|
import shutil
|
|
import tempfile
|
|
from pathlib import Path
|
|
from unittest.mock import patch
|
|
|
|
import duckdb
|
|
import pytest
|
|
|
|
from connectors.llm.exceptions import LLMTimeoutError
|
|
from src import db as src_db
|
|
from src.repositories.store_entities import StoreEntitiesRepository
|
|
from src.repositories.store_submissions import StoreSubmissionsRepository
|
|
from src.store_guardrails.runner import LlmResult, run_llm_review
|
|
|
|
|
|
@pytest.fixture
|
|
def conn(tmp_path, monkeypatch):
|
|
monkeypatch.setenv("DATA_DIR", str(tmp_path))
|
|
src_db._system_db_conn = None
|
|
src_db._system_db_path = None
|
|
c = src_db.get_system_db()
|
|
yield c
|
|
c.close()
|
|
|
|
|
|
@pytest.fixture
|
|
def plugin_dir():
|
|
d = Path(tempfile.mkdtemp(prefix="agnes_llm_test_"))
|
|
(d / "SKILL.md").write_text("# Test\nbody " * 30)
|
|
yield d
|
|
shutil.rmtree(d, ignore_errors=True)
|
|
|
|
|
|
def _seed_pending_submission(conn, plugin_dir: Path) -> tuple[str, str]:
|
|
"""Stage a store_entities row + a pending_llm submission.
|
|
|
|
Returns ``(entity_id, submission_id)`` so the test can assert against
|
|
final state.
|
|
"""
|
|
from src.repositories.users import UserRepository
|
|
UserRepository(conn).create(id="u1", email="alice@x.com", name="alice")
|
|
ents = StoreEntitiesRepository(conn)
|
|
ents.create(
|
|
id="e1", owner_user_id="u1", owner_username="alice",
|
|
type="skill", name="probe", description="probe skill",
|
|
category=None, version="1.0.0", file_size=10,
|
|
visibility_status="pending",
|
|
)
|
|
subs = StoreSubmissionsRepository(conn)
|
|
sub_id = subs.create(
|
|
submitter_id="u1", submitter_email="alice@x.com",
|
|
type="skill", name="probe", version="1.0.0",
|
|
status="pending_llm", entity_id="e1",
|
|
inline_checks={"manifest": {"status": "pass"}},
|
|
)
|
|
return "e1", sub_id
|
|
|
|
|
|
# The runner closes its own cursor in `finally`. Hand it a *fresh* cursor
|
|
# each call (mirrors the production `get_system_db` behavior) so closing
|
|
# it doesn't poison the test's primary cursor used for assertions.
|
|
def _conn_factory(_unused):
|
|
def _f():
|
|
return src_db.get_system_db()
|
|
return _f
|
|
|
|
|
|
# ---------------------------------------------------------------------------
|
|
# run_llm_review outcomes
|
|
# ---------------------------------------------------------------------------
|
|
|
|
|
|
class TestLlmReviewRunner:
|
|
def test_safe_verdict_approves_entity(self, conn, plugin_dir):
|
|
eid, sub_id = _seed_pending_submission(conn, plugin_dir)
|
|
|
|
verdict = {
|
|
"risk_level": "safe", "summary": "OK", "findings": [],
|
|
"template_placeholders_found": 0, "reviewed_by_model": "claude-haiku-4-5-20251001",
|
|
"error": None,
|
|
}
|
|
with patch(
|
|
"src.store_guardrails.runner.llm_review.review_bundle",
|
|
return_value=verdict,
|
|
):
|
|
result = run_llm_review(
|
|
sub_id, plugin_dir=plugin_dir,
|
|
conn_factory=_conn_factory(conn),
|
|
api_key_loader=lambda: "sk-test",
|
|
model_loader=lambda: "claude-haiku-4-5-20251001",
|
|
)
|
|
|
|
assert isinstance(result, LlmResult)
|
|
assert result.passed
|
|
sub = StoreSubmissionsRepository(conn).get(sub_id)
|
|
assert sub["status"] == "approved"
|
|
ent = StoreEntitiesRepository(conn).get(eid)
|
|
assert ent["visibility_status"] == "approved"
|
|
|
|
def test_high_risk_verdict_blocks(self, conn, plugin_dir):
|
|
eid, sub_id = _seed_pending_submission(conn, plugin_dir)
|
|
|
|
verdict = {
|
|
"risk_level": "high", "summary": "exfil",
|
|
"findings": [{
|
|
"severity": "high", "category": "exfiltration",
|
|
"file": "run.py", "explanation": "ships token to remote",
|
|
"fix_hint": "remove the POST",
|
|
}],
|
|
"template_placeholders_found": 0, "reviewed_by_model": "claude-haiku-4-5-20251001",
|
|
"error": None,
|
|
}
|
|
with patch(
|
|
"src.store_guardrails.runner.llm_review.review_bundle",
|
|
return_value=verdict,
|
|
):
|
|
run_llm_review(
|
|
sub_id, plugin_dir=plugin_dir,
|
|
conn_factory=_conn_factory(conn),
|
|
api_key_loader=lambda: "sk-test",
|
|
model_loader=lambda: "claude-haiku-4-5-20251001",
|
|
)
|
|
|
|
sub = StoreSubmissionsRepository(conn).get(sub_id)
|
|
assert sub["status"] == "blocked_llm"
|
|
ent = StoreEntitiesRepository(conn).get(eid)
|
|
# Entity stays pending — not visible until override.
|
|
assert ent["visibility_status"] == "pending"
|
|
|
|
def test_low_risk_with_high_finding_blocks(self, conn, plugin_dir):
|
|
"""Pass condition requires BOTH risk_level<=low AND no high findings.
|
|
A 'low' verdict with a high finding still blocks."""
|
|
eid, sub_id = _seed_pending_submission(conn, plugin_dir)
|
|
|
|
verdict = {
|
|
"risk_level": "low", "summary": "mostly ok",
|
|
"findings": [{
|
|
"severity": "high", "category": "credentials",
|
|
"file": "creds.py", "explanation": "key in source",
|
|
}],
|
|
"template_placeholders_found": 0, "reviewed_by_model": "claude-haiku-4-5-20251001",
|
|
"error": None,
|
|
}
|
|
with patch(
|
|
"src.store_guardrails.runner.llm_review.review_bundle",
|
|
return_value=verdict,
|
|
):
|
|
run_llm_review(
|
|
sub_id, plugin_dir=plugin_dir,
|
|
conn_factory=_conn_factory(conn),
|
|
api_key_loader=lambda: "sk-test",
|
|
model_loader=lambda: "claude-haiku-4-5-20251001",
|
|
)
|
|
|
|
sub = StoreSubmissionsRepository(conn).get(sub_id)
|
|
assert sub["status"] == "blocked_llm"
|
|
|
|
def test_medium_finding_with_safe_risk_passes(self, conn, plugin_dir):
|
|
"""Medium findings shouldn't block when overall risk is safe — that's
|
|
the 'noise but no exploit' band the operator opted into when picking
|
|
Haiku as the tier. Operators who want stricter pin Sonnet/Opus."""
|
|
eid, sub_id = _seed_pending_submission(conn, plugin_dir)
|
|
|
|
verdict = {
|
|
"risk_level": "safe", "summary": "noise",
|
|
"findings": [{
|
|
"severity": "medium", "category": "code_quality",
|
|
"file": "x.py", "explanation": "could be cleaner",
|
|
}],
|
|
"template_placeholders_found": 1, "reviewed_by_model": "claude-haiku-4-5-20251001",
|
|
"error": None,
|
|
}
|
|
with patch(
|
|
"src.store_guardrails.runner.llm_review.review_bundle",
|
|
return_value=verdict,
|
|
):
|
|
run_llm_review(
|
|
sub_id, plugin_dir=plugin_dir,
|
|
conn_factory=_conn_factory(conn),
|
|
api_key_loader=lambda: "sk-test",
|
|
model_loader=lambda: "claude-haiku-4-5-20251001",
|
|
)
|
|
|
|
sub = StoreSubmissionsRepository(conn).get(sub_id)
|
|
assert sub["status"] == "approved"
|
|
|
|
def test_review_error_keeps_pending(self, conn, plugin_dir):
|
|
eid, sub_id = _seed_pending_submission(conn, plugin_dir)
|
|
|
|
# llm_review.review_bundle catches LLMError and returns a dict with
|
|
# error set; the runner records review_error.
|
|
verdict = {
|
|
"risk_level": None, "summary": None, "findings": [],
|
|
"template_placeholders_found": 0, "reviewed_by_model": "claude-haiku-4-5-20251001",
|
|
"error": "LLMTimeoutError: Anthropic connection error",
|
|
}
|
|
with patch(
|
|
"src.store_guardrails.runner.llm_review.review_bundle",
|
|
return_value=verdict,
|
|
):
|
|
run_llm_review(
|
|
sub_id, plugin_dir=plugin_dir,
|
|
conn_factory=_conn_factory(conn),
|
|
api_key_loader=lambda: "sk-test",
|
|
model_loader=lambda: "claude-haiku-4-5-20251001",
|
|
)
|
|
|
|
sub = StoreSubmissionsRepository(conn).get(sub_id)
|
|
assert sub["status"] == "review_error"
|
|
ent = StoreEntitiesRepository(conn).get(eid)
|
|
assert ent["visibility_status"] == "pending"
|
|
|
|
def test_missing_plugin_dir_records_review_error(self, conn, tmp_path):
|
|
eid, sub_id = _seed_pending_submission(conn, tmp_path / "exists")
|
|
# Point at a path that doesn't exist.
|
|
ghost = tmp_path / "ghost-plugin-dir"
|
|
with patch(
|
|
"src.store_guardrails.runner.llm_review.review_bundle"
|
|
) as m:
|
|
run_llm_review(
|
|
sub_id, plugin_dir=ghost,
|
|
conn_factory=_conn_factory(conn),
|
|
api_key_loader=lambda: "sk-test",
|
|
model_loader=lambda: "claude-haiku-4-5-20251001",
|
|
)
|
|
m.assert_not_called()
|
|
|
|
sub = StoreSubmissionsRepository(conn).get(sub_id)
|
|
assert sub["status"] == "review_error"
|
|
|
|
def test_safe_verdict_skipped_when_admin_archived_during_review(self, conn, plugin_dir):
|
|
"""#3 — BG-task race: admin archives entity while LLM review is
|
|
in flight. Pre-fix the verdict's set_visibility('approved')
|
|
clobbered the archive. Post-fix the guarded variant refuses
|
|
and audit-logs the skip."""
|
|
eid, sub_id = _seed_pending_submission(conn, plugin_dir)
|
|
# Admin archives BEFORE the LLM verdict lands.
|
|
StoreEntitiesRepository(conn).archive(eid, by_user_id="admin")
|
|
|
|
verdict = {
|
|
"risk_level": "safe", "summary": "OK", "findings": [],
|
|
"template_placeholders_found": 0,
|
|
"reviewed_by_model": "claude-haiku-4-5-20251001",
|
|
"error": None,
|
|
}
|
|
with patch(
|
|
"src.store_guardrails.runner.llm_review.review_bundle",
|
|
return_value=verdict,
|
|
):
|
|
run_llm_review(
|
|
sub_id, plugin_dir=plugin_dir,
|
|
conn_factory=_conn_factory(conn),
|
|
api_key_loader=lambda: "sk-test",
|
|
model_loader=lambda: "claude-haiku-4-5-20251001",
|
|
)
|
|
|
|
ent = StoreEntitiesRepository(conn).get(eid)
|
|
assert ent["visibility_status"] == "archived", (
|
|
"BG verdict must NOT clobber an admin archive"
|
|
)
|
|
# Submission still flips to 'approved' (verdict is forensic
|
|
# record); the lifecycle stays archived because the admin won.
|
|
sub = StoreSubmissionsRepository(conn).get(sub_id)
|
|
assert sub["status"] == "approved"
|
|
|
|
# And we wrote an audit row explaining the skip.
|
|
audits = conn.execute(
|
|
"SELECT action, params FROM audit_log "
|
|
"WHERE resource = ? AND action = ?",
|
|
[f"store_submission:{sub_id}",
|
|
"store.submission.bg_verdict_skipped"],
|
|
).fetchall()
|
|
assert audits, "missing skip audit row"
|
|
|
|
def test_config_loader_failure_records_review_error(self, conn, plugin_dir):
|
|
eid, sub_id = _seed_pending_submission(conn, plugin_dir)
|
|
|
|
def boom():
|
|
raise RuntimeError("no api key")
|
|
|
|
run_llm_review(
|
|
sub_id, plugin_dir=plugin_dir,
|
|
conn_factory=_conn_factory(conn),
|
|
api_key_loader=boom,
|
|
model_loader=lambda: "claude-haiku-4-5-20251001",
|
|
)
|
|
|
|
sub = StoreSubmissionsRepository(conn).get(sub_id)
|
|
assert sub["status"] == "review_error"
|
|
|
|
|
|
# ---------------------------------------------------------------------------
|
|
# llm_review.review_bundle — single-shot transport-error path
|
|
# ---------------------------------------------------------------------------
|
|
|
|
|
|
class TestReviewBundleErrorTransport:
|
|
def test_anthropic_timeout_returns_error_dict(self, plugin_dir):
|
|
from src.store_guardrails import llm_review
|
|
|
|
with patch(
|
|
"src.store_guardrails.llm_review.AnthropicExtractor"
|
|
) as MockEx:
|
|
inst = MockEx.return_value
|
|
inst.extract_json.side_effect = LLMTimeoutError("connection error")
|
|
result = llm_review.review_bundle(
|
|
plugin_dir, type_="skill", name="x", version="1.0.0",
|
|
description="x" * 30,
|
|
api_key="sk-test", model="claude-haiku-4-5-20251001",
|
|
)
|
|
assert result["error"]
|
|
assert result["risk_level"] is None
|
|
assert result["reviewed_by_model"] == "claude-haiku-4-5-20251001"
|
|
|
|
def test_missing_risk_level_surfaces_as_review_error(self, plugin_dir):
|
|
"""#10 — model returns a structured response with no/empty
|
|
risk_level. Pre-fix this defaulted to 'medium' and looked like a
|
|
model-decided block. Post-fix it surfaces as
|
|
``error='missing_risk_level'`` so the runner persists
|
|
``status='review_error'`` (admin gets the retry button)."""
|
|
from src.store_guardrails import llm_review
|
|
|
|
with patch(
|
|
"src.store_guardrails.llm_review.AnthropicExtractor"
|
|
) as MockEx:
|
|
inst = MockEx.return_value
|
|
inst.extract_json.return_value = {
|
|
"summary": "model didn't fill risk_level",
|
|
"findings": [],
|
|
"template_placeholders_found": 0,
|
|
}
|
|
result = llm_review.review_bundle(
|
|
plugin_dir, type_="skill", name="x", version="1.0.0",
|
|
description="x" * 30,
|
|
api_key="sk-test", model="claude-haiku-4-5-20251001",
|
|
)
|
|
assert result["risk_level"] is None
|
|
assert result["error"] == "missing_risk_level"
|
|
|
|
def test_system_prompt_passed_via_dedicated_parameter(self, plugin_dir):
|
|
"""#1 — the system prompt MUST be passed via the SDK's separate
|
|
``system=`` parameter, not concatenated into user content. This
|
|
is the trust boundary that keeps a crafted README inside the
|
|
bundle from overriding the reviewer rules."""
|
|
from src.store_guardrails import llm_review
|
|
from src.store_guardrails.prompts import SYSTEM_PROMPT
|
|
|
|
with patch(
|
|
"src.store_guardrails.llm_review.AnthropicExtractor"
|
|
) as MockEx:
|
|
inst = MockEx.return_value
|
|
inst.extract_json.return_value = {
|
|
"risk_level": "safe", "summary": "ok", "findings": [],
|
|
"template_placeholders_found": 0,
|
|
}
|
|
llm_review.review_bundle(
|
|
plugin_dir, type_="skill", name="x", version="1.0.0",
|
|
description="x" * 30,
|
|
api_key="sk-test", model="claude-haiku-4-5-20251001",
|
|
)
|
|
call = inst.extract_json.call_args
|
|
assert call.kwargs.get("system") == SYSTEM_PROMPT, (
|
|
"SYSTEM_PROMPT must be passed via system= so it lands "
|
|
"in the SDK's system role — not in user content"
|
|
)
|
|
user_payload = call.kwargs.get("prompt") or ""
|
|
assert "<bundle>" in user_payload and "</bundle>" in user_payload, (
|
|
"user payload must wrap the bundle files in trust-boundary "
|
|
"sentinel tags"
|
|
)
|
|
assert SYSTEM_PROMPT not in user_payload, (
|
|
"SYSTEM_PROMPT must NOT be inlined into user content"
|
|
)
|