* docs(spec): admin observability spec + Activity Center MVP plan
Parent spec (480 lines) + executable plan (2295 lines, 14 TDD tasks).
Covers Activity Center rebuild (/admin/activity), with /admin/sessions
and /admin/feedback deferred to follow-up plans.
Already incorporates reviewer-pass revisions across three angles
(security, production resilience, code architecture):
- _get_db import path corrected to app.auth.dependencies
- Test fixtures aligned with seeded_app / admin_user / get_system_db
- All new audit writes wrapped in try/except + logger.exception
- Filename sanitization on session uploads
- DuckDB DESC index behavior documented; upgrade window flagged
- Migration idempotency + evolved-DB test cases
- reveal_raw + shared-cache multi-worker explicitly deferred
Targets schema v40 (audit_log gains params_before, client_ip,
client_kind, correlation_id + 3 indices).
* feat(db): schema v40 — audit_log gains params_before, client_ip, client_kind, correlation_id + 3 indices
* chore(test): clean up Task 1 — drop unused import, rename stale test
* feat(audit): AuditRepository.log() accepts params_before/client_ip/client_kind/correlation_id
* test(audit): strengthen params_before assertion to round-trip JSON content
* feat(audit): AuditRepository.query() rich filters + keyset cursor pagination
* feat(sync): SyncStateRepository.list_recent() cross-table feed
* feat(audit): POST /api/sync/trigger writes audit_log row
* feat(audit): POST /api/scripts/run-due writes audit_log row
* feat(audit): POST /api/upload/sessions writes audit_log row + sanitizes filename
* feat(audit): GET /api/data/{table_id}/download writes audit_log row
* feat(activity): /api/admin/activity timeline + /health + /sync endpoints
* feat(ui): /admin/activity rebuilt — health pulse, timeline, sync grid; /activity-center → 308 redirect
BREAKING: removed demo executive-pulse / maturity-roadmap content from activity_center.html.
The page now reflects real audit_log + sync_history data.
* feat(ui): admin nav + dashboard widget point at /admin/activity
* feat(activity): recursive-audit suppression for AC read endpoints (60s window per actor+filter)
* feat(activity): emit PostHog events when integration enabled (no-op default)
* fix(audit): move v40 indices out of _SYSTEM_SCHEMA + update test_repositories to unpack query() tuple
_SYSTEM_SCHEMA CREATE INDEX on audit_log(timestamp) failed when migration
tests hand-roll a bare audit_log (id, action) without the timestamp column.
Fix: remove indices from _SYSTEM_SCHEMA; add ADD COLUMN IF NOT EXISTS guards
for timestamp and other pre-v40 columns in _v39_to_v40() so the upgrade path
is safe on any hand-rolled schema; call _v39_to_v40 explicitly in the
fresh-install (current==0) path to restore index creation there.
Also unpack the (rows, next_cursor) tuple from AuditRepository.query() in
the three TestAuditRepository tests that still treated it as a list.
* docs: CHANGELOG entry for Activity Center MVP
* chore: refresh stale module docstring in app/api/activity.py
* feat(cli): agnes admin activity — terminal access to Activity Center (timeline + health + sync)
* fix(db): _v39_to_v40 — add IF NOT EXISTS guard for 'action' column
The v39→v40 ladder step adds defensive ADD COLUMN IF NOT EXISTS for
every audit_log column so a hand-rolled bare audit_log (id only) is
safe through the ladder. 'action' was missing from the guard list,
causing CREATE INDEX idx_audit_action_time to fail on tests that
stub audit_log with only an id column (tests/test_e2e_extract.py::
TestSchemaMigration::test_migration_preserves_and_extends).
Local 6/6 schema tests + the previously-failing CI test pass.
* docs(spec): platform telemetry epic — Boss directive + Activity Monitoring plan rebased onto v40 (stacked on zs/spec-activity-center)
* feat(db): schema v41 — 7 usage_* tables for telemetry (events, summary, rollups, attribution)
* chore(db): tighten v41 — usage_session_summary.session_id NOT NULL + upgrade test asserts all 7 tables
* feat(usage): UsageAttributionRepository — replace/delete/lookup over usage_attribution_* tables
* refactor(marketplace): extract list_inner_skills/agents/commands to src/marketplace_listing.py for reuse
* feat(usage): explode plugin attribution on marketplace sync + store entity write; backfill script
* refactor(marketplace): finish src/marketplace_listing.py extraction — drop duplicate _list_inner_* + _parse_frontmatter from app/api/marketplace.py
* feat(usage): promote attribution helpers to src/usage_attribution_helpers.py; hook update_entity rename + bundle-swap; clarify best-effort semantics
* feat(usage): UsageProcessor real extraction + rollup rebuild + 10 fixture-driven tests
* fix(usage): include tool_id in event hash + executemany + rollup transaction (critical multi-tool-turn drop fix)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat(marketplace): popularity stats — invocations_30d + trend + sort=most_used|trending + Most Popular section
* feat(admin): /admin/users/<id> Sessions section — list + single-file + bulk-zip downloads (audit-logged)
* feat(usage): admin export endpoint + CLI — csv/json/parquet streaming, filters, audit-logged
* feat(usage): agnes admin ask — LLM Text-to-SQL over usage_events with SELECT-only validator (audit-logged)
* feat(usage): reprocess + prune endpoints + scheduler daily prune job + CLI
* docs: PLATFORM_SETUP.md operator playbook + HOWTO/ cookbook (5 guides + index)
Adds docs/PLATFORM_SETUP.md as a consolidated operator playbook covering
bootstrap, TLS, marketplaces (curated + flea), scheduler env vars, telemetry
extraction/export/ask/prune, privacy posture, and daily routine.
Adds docs/HOWTO/ with 5 analyst cookbook guides: first query, snapshots for
remote tables, private sessions, feedback + admin ask, and customizing skills.
Existing setup docs (QUICKSTART, DEPLOYMENT, ONBOARDING, HEADLESS_USAGE)
get a one-line cross-reference at the top pointing to PLATFORM_SETUP.md.
* docs(changelog): platform telemetry epic — usage_* foundation + surfaces + admin access + docs
Comprehensive [Unreleased] entry covering: usage_events/session_summary/
tool_daily/plugin_daily tables (v41), attribution lookup tables, backfill
script, marketplace Most Popular + invocation chips + sort, admin Sessions
section, export/ask/reprocess/prune endpoints + CLI mirrors, Activity Center
(v40), PLATFORM_SETUP.md + HOWTO/ docs, and operations notes for v41 upgrade.
* fix(security): block DuckDB read_*/http_*/glob functions in usage_ask validator + symlink escape guard in session zip + clarify mark-private semantics
* fix(admin): parquet export tempfile cleanup on COPY failure + correct processed-first sort on /admin/users/<id>/sessions
* feat(audit): close 8 production audit gaps — query (local/remote/hybrid), catalog/schema/sample, snapshot estimate/create, check-access
* feat(ui): /admin/usage summary dashboard + per-user activity tab on /admin/users/<id>
* fix(audit): cap error messages at 200 chars + audit user_activity reads + recursion guard on usage.summary
* fix(audit): catalog.list audits on error path + clean up deferred json import
* fix(ux): client_kind=cli for PAT auth + timeline empty state + email-instead-of-uuid + nav reorder + help text + loading indicators + ask doc
* feat(observability): unify /admin/activity into single page with saved views
- KPI cards (events, users, error rate, p95) clickable as quick-filters
- Faceted filter dropdowns populated from audit_log in the current window
- Sortable audit table, cursor pagination, per-row JSON side panel
- Saved views (schema v43: user_observability_views) — per-user state
- Top bar: window selector + 30s Live toggle + saved views dropdown
- /admin/scheduler-runs → 308 redirect (source=scheduler filter)
- New endpoints: /api/admin/observability/{facets,kpis,views}
* test: update activity + scheduler-runs tests for unified page
- test_admin_activity_page_renders asserts new structural anchors
- test_admin_scheduler_runs_page_admin_only asserts 308 redirect
* fix(observability): respect [hidden] on modal + side panel
CSS `display: flex` on .obs-modal beat the [hidden] attribute's UA
display:none, so the save-view modal rendered on page load and Cancel
clicks couldn't dismiss it. Gate the modal's flex layout on
:not([hidden]); add the same display:none guard prophylactically to
.obs-panel and .obs-views-panel.
* feat(observability): user enrichment in audit + interactive /admin/usage
Activity:
- /api/admin/activity now joins users for user_email + user_name per row
- User column renders "name (id-prefix)" or "email (id-prefix)" instead
of an opaque truncated UUID; falls back to id when the user record is
missing
Usage:
- /admin/usage rewritten as the same filter/group-by/search pattern as
/admin/activity. Faceted dropdowns (User / Tool / Source / Event type)
populated from usage_events; debounced free-text search across
tool_name / skill_name / subagent_type / command_name
- New endpoints /api/admin/usage/{facets,kpis,query}; the query endpoint
supports group_by in {day, username, tool_name, source, ref_id} with
sort + offset pagination, plus an ungrouped raw-events mode
- 4 KPI cards (events, distinct users, distinct tools, error rate) are
clickable quick-filters; clicking a grouped row applies the bucket as
a filter
- Old static `?window=7d|30d|all` server preload removed; all state is
client-side via since_minutes + group_by + filters in the URL
* fix(observability): clearer labels, all-column sort, drop saved views UI
- Rename page titles: "Activity" → "Server activity", "Usage" → "Tool usage"
with a one-line subtitle on each explaining what the page covers and
linking the other one. The two pages source different data (audit_log
vs usage_events) and the previous labels conflated them.
- Drop the saved-views dropdown + save modal from /admin/activity. The
modal pop-open bug was the trigger; the value wasn't there yet. The
/api/admin/observability/views CRUD + DuckDB table stay in place.
- Rename "Live (30s)" to "Auto-refresh (30s)" with a tooltip clarifying
that it's the re-fetch rate, not the time range. Time range now
labeled "Time range" instead of "Window".
- All audit-table columns are sortable (User, Source, Action, Resource,
Result added); sort is page-local with a Jinja comment explaining the
trade-off. Same for raw usage rows.
- Fix duplicate sort-arrow bug — the literal "▼" in the Time th HTML was
rendering alongside the CSS ::before arrow. Removed the literal; CSS
is the single source of truth.
* feat(observability): global Sessions browser + transcript viewer + CLI
Web:
- /admin/sessions — list every collected session JSONL across all users
with time-range, user, model, errors-only and free-text filters. Default
sort surfaces error-heavy sessions first. KPI cards (sessions, distinct
users, sessions w/ errors, tool error rate) clickable as quick-filters.
- /admin/sessions/<username>/<file> — transcript viewer rendering the
JSONL chronologically: user prompts, assistant text, tool calls (with
JSON input) and tool results (with flattened output). Errors get a red
border + chip and a "Next error" navigation button at the top.
- Admin dropdown gains a "Sessions" link.
API:
- GET /api/admin/sessions/{list,kpis,facets} — filtered cross-user reads
off usage_session_summary
- GET /api/admin/sessions/{username}/{file}/transcript — parses JSONL via
the existing services.session_pipeline.lib, returns chronological events
- GET /api/admin/sessions/{username}/{file}/download — JSONL stream, same
path-safety guards as the per-user endpoint, audit-logged
CLI:
- `agnes admin sessions list [--user X] [--errors] [--since 7d]` — table
output with `!` prefix on rows that hit a tool error
- `agnes admin sessions show <username> <file>` — transcript dump, with
`--errors` to print only the failed tool_result blocks
- `agnes admin sessions download <username> <file> [-o path]`
- `agnes admin sessions kpis` — top-level numbers
* feat(internal): expose telemetry tables to agnes query with row-level RBAC
Three new registered tables backed by system.duckdb, queryable through
the same /api/query plumbing analysts use for Keboola / BigQuery /
local sources:
agnes_sessions → usage_session_summary (filter: username)
agnes_usage → usage_events (filter: username)
agnes_audit → audit_log (filter: user_id)
RBAC is per-row, not per-table: admins see every user's rows; non-admins
see only their own. The filter is built server-side from the auth user
dict; non-admin filter values are regex-validated before SQL interpolation.
Implementation:
- new connector connectors/internal/ with access (filter+exec) + registry
(idempotent table_registry seed at startup)
- /api/query detects internal table refs and short-circuits to a CTE
wrapper that prepends "WITH agnes_x AS (SELECT * FROM <src> WHERE …),
…" then "SELECT * FROM (<user_sql>) AS _q". DuckDB cursor on the
shared system.duckdb handle — opening parallel handles / ATTACH on the
same file is blocked process-wide.
- mixing internal + BQ / registered local tables in one SELECT is
rejected (v1 limitation)
- src.rbac.can_access_table waves internal tables through for all
authenticated users; row scoping is the actual security control
- /api/v2/schema and /api/v2/sample gained internal branches; sample
intentionally skips its cache because rows are RBAC-scoped per caller
- audit row written as action='query.internal' with is_admin flag
Tests: connectors/internal/access — RBAC, filter clause, schema, CTE
wrapper coexistence with user-supplied aggregations, unsafe-username
rejection. 16/16 passing.
Motivating queries this enables:
SELECT tool_name, COUNT(*) FROM agnes_usage
WHERE is_error GROUP BY 1 ORDER BY 2 DESC
-- analyst self-introspection: which tools fail for me?
SELECT user_id, COUNT(*) FROM agnes_audit
WHERE action = 'session.transcript_view' GROUP BY 1
-- admin: who's been looking at whose session transcripts?
* feat(admin): group dropdown into 5 named sections + internal tables in /catalog
Admin dropdown gains section headers so admins can land on the right
page without re-reading the full menu:
Activity Center Server activity / Tool usage / Sessions
Users & Access Users / Groups / Resource access / Tokens
Data Tables
Agent Experience Curated Marketplaces / Flea Submissions /
Agent Setup Prompt / Agent Workspace Prompt
Server Server config
"Agent Experience" frames the curated content + prompts as one cluster
— it's all admin-controlled material that shapes what an analyst's AI
agent encounters. "Configuration" → "Server" since only one item lives
there now.
Renamed the section's first two items:
"Activity" → "Server activity" (matches page H1)
"Usage" → "Tool usage"
Also fixes /catalog visibility of the internal tables (agnes_sessions /
_usage / _audit) for non-admin users: ``app.auth.access.can_access``
short-circuits to True for resource_type='table' + an internal-table id.
Without this, non-admins saw the tables in /api/v2/catalog (which uses
the same RBAC bypass) but not on the /catalog HTML page (which calls
can_access directly, requiring a resource_grants row internal tables
don't have).
CSS for `.app-nav-menu-section`: small caps, muted, non-clickable; first
section trims top padding so the panel doesn't open with an awkward gap.
* refactor(admin): move corporate memory into Admin > Agent Experience
Memory link was the only admin-only entry in the primary nav (gated by
session.user.is_admin). Moves it into the Admin dropdown under Agent
Experience, alongside Curated Marketplaces / Flea Submissions / Prompts
— all admin-curated content that shapes what an analyst's AI agent
encounters.
Renamed the nav label to "Shared Knowledge" to match what the page
actually is (admin-curated organisational knowledge from session
verification, surfaced to agents). URL stays at /corporate-memory; the
route still gates on require_admin per the existing comment.
Side effect: primary nav (Home / Marketplace / Data Packages) is now
uniform for every authenticated user — no conditional admin-only entry.
* ui: rename admin entries to Curated Knowledge / Init Prompt / Workspace Prompt
- "Shared Knowledge" → "Curated Knowledge" (parallel with "Curated
Marketplaces" in the same Agent Experience section; "curated" tells
the admin what they do there — review + approve)
- "Agent Setup Prompt" → "Init Prompt" (matches the `agnes init` flow
it actually drives)
- "Agent Workspace Prompt" → "Workspace Prompt" (the "Agent" prefix
was redundant — every item in the section is agent-facing)
Renames page titles + H1s on /admin/agent-prompt and
/admin/workspace-prompt to match.
* refactor: rename Usage → Telemetry across user-facing surfaces
External surfaces all switch; internal Python module / file names and the
physical DB tables (usage_events, usage_session_summary, usage_tool_daily,
usage_plugin_daily) stay — renaming them would force a schema migration
+ a redo of the LLM Text-to-SQL prompt for no analyst-visible win.
Changes:
- Admin dropdown: "Tool usage" → "Telemetry"
- Page H1 / <title>: same
- URL: /admin/usage → /admin/telemetry; old URL 308-redirects
- API prefix: /api/admin/usage/* → /api/admin/telemetry/*
- CLI: primary command `agnes admin telemetry …`; `agnes admin usage` kept
as a deprecated alias so existing operator scripts keep working
- Internal data-source table id: agnes_usage → agnes_telemetry. The
registry seed now evicts any stale internal-source row whose id no
longer matches INTERNAL_TABLES, so the old `agnes_usage` row is
removed from table_registry on next app boot
- All tests + JS endpoint paths updated
* test(rbac): include auto-appended internal tables in expectations
get_accessible_tables now appends agnes_sessions / agnes_telemetry /
agnes_audit to every authenticated user's accessible-tables list so the
internal data source shows up in /catalog. The two existing rbac tests
asserted hardcoded list shapes that pre-dated the change.
Rewritten to assert "granted tables + the canonical internal-table set"
instead of literal lists, so the test stays correct if the internal
table roster changes again later.
* ui: visual dividers between admin-dropdown sections
Adds a 1px top border + 6px top margin to every section header except
the first, so the five named groups (Activity Center, Users & Access,
Data, Agent Experience, Server) read as visually separated clusters.
The header itself stays small-caps + muted as before — the border is
additive.
* ui(memory): match obs-topbar visual on /corporate-memory
The Curated Knowledge page (linked from the admin dropdown's Agent
Experience section) opened straight into the stats bar — no title,
no subtitle, no shared chrome with the other admin pages. Adds an
obs-topbar-style header at the top of .container-memory:
- H1 "Curated Knowledge"
- subtitle explaining what the page is + how AI agents pull from it
The `.ck-*` class set duplicates the inline obs-* styles from
/admin/activity etc. for this one page; promoting the obs-* class set
to style-custom.css for shared reuse is the obvious next step (4 pages
already inline the same CSS), tracked as a follow-up.
Page <title> also renamed from "Corporate Memory" → "Curated Knowledge".
* ui(tables): list Agnes internal tables in /admin/tables + group in /catalog
/admin/tables previously rendered three per-source-type listings
(BQ / Keboola / Jira) and dropped any row whose source_type didn't
match — so the agnes_sessions / agnes_telemetry / agnes_audit rows
seeded into table_registry were invisible. Adds a fourth read-only
section "Agnes internal tables" that filters source_type === 'internal'
and renders the same registry-table layout the other sections use,
with two changes:
- no Register button (these rows are seeded on every app boot from
connectors/internal/registry.py)
- Edit + Delete actions hidden (any change would be reverted on the
next start). Manage access stays so admins can still inspect.
Mode badge picks up a new mode-internal CSS class (teal accent) so the
display doesn't lie and call it "local".
In /catalog, internal tables now group under an "agnes" accordion
section (bucket="agnes" on seed) instead of falling into the catch-all
"default". Single source of truth for which tables exist; admins find
them where they expect.
* ui(tables): Agnes internal as a 4th tab next to BQ/Keboola/Jira
Previous iteration mounted the internal-table listing as a separate
standalone card under the tab strip. Reshapes it to a proper
tab-content section so admins switch between data sources via one
consistent nav (BigQuery / Keboola / Jira / Agnes internal).
- New tab button "Agnes internal" in the tab-nav.
- The listing card becomes <section id="tab-content-internal"
class="tab-content">; switchTab() already routes by id so no JS
change beyond extending the hash allowlist for direct #internal
links.
- Tab content keeps the read-only treatment from the previous commit
(no Register button, no Edit / Delete in renderRegistryListing).
* ui: rename Curated Knowledge → Curated Memory
Settles the naming back on "Curated Memory" — parallel structure with
"Curated Marketplaces" in the same Agent Experience section, and zero
rename ripple: URL (/corporate-memory), API (/api/memory/*), CLI
(agnes admin memory), and Python modules all stay on "memory" so the
admin label finally lines up with the underlying surfaces.
The "Curated" prefix still tells admins what they do on the page
(review pending → approve / mandate / reject) and reads as a sibling
of "Curated Marketplaces" right next to it in the dropdown.
Touches: admin dropdown label, page <title>, page H1. DB tables stay
on knowledge_* (already the canonical naming for the data shape).
* ui: rename "Server activity" → "Audit log"
"Audit log" is what the page actually is — server-side audit_log table
rendered with KPI cards + filter bar + sortable table. The "Server
activity" label confused the term with Claude Code session telemetry
(Telemetry page) and didn't make the source/concept clear.
Touches:
- Admin dropdown nav label
- /admin/activity page H1 + subtitle
- /admin/telemetry subtitle cross-link
- test_activity_api page-renders assertion
URL (/admin/activity) and API (/api/admin/activity/*) stay — the
"activity" name has stuck at the route layer for a year; rerouting
those would churn dashboards/bookmarks for zero analyst-visible win.
* ui(admin-nav): gray band on each section header for clearer separation
Previous iteration used a 1px top border between section labels — the
labels still blended into the items above/below at a glance. Switches
to a light gray background band per section header, extended edge-to-
edge inside the panel via negative horizontal margins. Bolder
font-weight (700) reinforces the separation; bumping the font color
isn't needed because the band itself does the work.
First section's header tucks into the panel's top border-radius so the
band reaches the corners without a gap.
* ui(catalog): rename internal-table category to "Agnes Internal"
`bucket` is what /catalog renders as the accordion category header
verbatim — "agnes" lowercase didn't read as a real category name and
got confused with a system identifier. Bumps to "Agnes Internal".
Seed re-applies on every app boot so existing rows pick up the new
bucket value via `ON CONFLICT (id) DO UPDATE`.
* ui(catalog): split Agnes Internal into its own card on /catalog
Previously the three internal tables landed inside the "Core Business
Data" card under an "Agnes Internal" accordion alongside Keboola / BQ
buckets — readers conflated system telemetry with business datasets,
and the data_stats header counter ("3 tables · ~X rows total") only
ever counted synced rows so internal tables looked invisible.
Split the catalog page into two cards:
- Core Business Data: only non-internal source_types (Keboola, BQ,
Jira). Accordions group by bucket as before. Stats counter reflects
this card's tables.
- Agnes Internal: a dedicated card with its own visual treatment
(teal accent matching the mode-internal badge in /admin/tables).
Flat list (no accordion — only 3 rows, never grows here), each
row carries the canonical `agnes query` snippet. Read-only — no
profiler click, no In-stack toggle, no sync metadata.
Route adds `internal_card` context object; template renders the new
card only when it's non-None.
* fix(rbac): hide internal tables from /admin/access + drop "my" framing
Two related cleanups for the Agnes-internal tables:
1. /admin/access (resource grants) no longer lists them. The
`can_access` check has a hardcoded internal-table bypass — security
is row-level (per-request view filter), so a table-grain
`resource_grants` row would do nothing. Surfacing them in the UI
let admins set up grants that silently no-op. Filter at the
`_table_blocks` projection so the UI tree never sees them.
2. Display names drop the analyst-perspective "my" framing:
"Agnes — my sessions" → "Agnes sessions"
"Agnes — my telemetry events" → "Agnes telemetry events"
"Agnes — my audit log" → "Agnes audit log"
The "my" only makes sense from the querying analyst's seat
(`SELECT … FROM agnes_sessions` returns *their* rows); on /admin/*
pages where admin sees / configures them across users, the
pronoun was misleading. Description text now spells out the
row-level RBAC contract explicitly.
Display names update via TableRegistryRepository.register's ON CONFLICT
UPDATE on next app boot; no manual cleanup needed.
* ui: subtitle notes about agnes_* tables on each Activity Center page
The recursive observability story — Agnes serves its own audit /
telemetry / session data through the same `agnes query` plumbing
analysts use for business data — wasn't surfaced anywhere on the
admin pages that show that data. Three pages get a one-liner with
the canonical `agnes query` snippet + the RBAC contract (analysts
see their own rows, admin sees all):
- /admin/activity (Audit log) → agnes_audit
- /admin/telemetry (Tool usage) → agnes_telemetry
- /admin/sessions → agnes_sessions
Sets up the discovery moment for admins: they're reading the page,
they see "you can query this from Claude Code", they remember it
when an analyst asks "how do I find my own failed tool calls?".
* ui(tables): explain "Show log" empty-state on /admin/tables
Cache warmup log <pre> renders with a dark background and is only
populated by the SSE stream during a Re-warm all run. Opening the
page cold + clicking Show log just revealed a black bar with no
context — admins couldn't tell what they were looking at.
Adds an inline paragraph above the <pre> explaining what the log is,
the row format, when it fills in, and where to find the historical
audit trail (/admin/activity). The actual <pre> stays empty until
SSE events arrive, but the surrounding copy carries the meaning.
* ui(tables): auto-open cache-warmup log on Re-warm all click
A Re-warm all run takes ~24s per remote BQ row. With the <details>
collapsed by default, operators saw the button disable, watched a
quiet ~24s pass, and assumed nothing had happened — the streaming
log was hidden behind a closed disclosure.
Two small JS tweaks:
- cacheWarmupRun() opens the details on click, so streamed lines
appear without an extra interaction
- cacheWarmupOnStart() hides the inline hint paragraph the moment
real log content lands, so the dark log block isn't competing
with redundant context
Hint paragraph also clarifies that only `query_mode='remote'` BQ
rows are warmed — operators with only materialized/internal tables
would see total=0 and the page would "do nothing" by spec.
* ui: trim Agnes internal copy across surfaces
Descriptions had grown to explain the extraction pipeline ("parsed
out of session JSONLs"), the underlying table ("Backed by
usage_session_summary"), the RBAC mechanic ("row-level RBAC at query
time — analysts see their own; admin sees all"), and the SQL snippet.
Every implementation detail meant another rewrite on the next iter.
Strips to one stable line per surface: what the data is, plus
"Also available locally for analysis". Mechanics live in code +
docs; the page copy says what the user needs to know.
Touched:
- connectors/internal/access.py: INTERNAL_TABLES descriptions
- activity_center.html / admin_usage.html / admin_sessions.html
subtitles
- catalog.html Agnes Internal card description + row strip
- admin_tables.html "Agnes internal" tab hint
* fix(internal): is_user_admin arity bugs + + saved-view payload cap
Round-1 code review (PR #278) caught two blocking bugs and three nits.
Blocking — both `is_user_admin(user)` (single dict arg) calls raised
TypeError. is_user_admin signature is `(user_id, conn)`. Affected:
- app/api/query.py:_run_internal_query — every POST /api/query that
references agnes_sessions / agnes_telemetry / agnes_audit blew up
with a 500. The headline analyst-facing feature of this PR was
unusable through the API.
- app/api/v2_sample.py — same shape; `GET /api/v2/sample/agnes_*`
returned 500.
Both fixed to call `is_user_admin(user.get("id"), conn)`. Added two
FastAPI-level tests in test_internal_data_source.py that go through
the TestClient — the existing unit tests on `execute_internal_query`
and `build_filter_clause` skipped the request-handler layer where the
bugs lived, which is why this landed.
Nits also closed:
- connectors/internal/access.py: `+` allowed in _USERNAME_RE /
_USER_ID_RE so RFC 5321 email local-parts (alice+test@x) resolve
correctly without hitting InternalAccessError.
- app/api/observability.py: saved-view payload capped at 64 KiB to
prevent an admin from bloating system.duckdb with a malformed save.
* fix(security): close non-admin data-leak via underlying-table refs
PR #278 R2 review surfaced a non-admin-exploitable bypass: SQL whose
string literal contains 'agnes_sessions' routed into the privileged
internal-query path, then queried the underlying physical table
(usage_session_summary / usage_events / audit_log) directly, escaping
the CTE wrapper's row filter. Two reinforcing defenses:
1. find_internal_refs() now strips single-quoted string literals
before scanning for alias names — a literal alone no longer
routes the request into the privileged code path.
2. execute_internal_query() rejects non-admin SQL that references
the underlying physical tables (usage_*, audit_log). The CTE
wrapper only scopes the agnes_* aliases; a direct FROM on the
base table — or a shadowing inner WITH that still has to read
the base table — bypasses RBAC. Block before execution with an
actionable error pointing to the agnes_* alias. Admins are
unaffected (god-mode short-circuit on the filter clause).
3. tests/test_internal_data_source.py — three new negative tests
covering literal-only matches, direct-table refs, and CTE
shadow attempts.
Also tightens usage_ask.py's SELECT-only validator: pragma_table_info,
pragma_storage_info, pragma_database_*, and duckdb_tables / columns /
views / indexes / schemas are reflection functions that leak metadata
the analyst question shouldn't reach. \bPRAGMA\b in _FORBIDDEN never
matched the function-call form (word-boundary between `A` and `_`).
* fix(security): dynamic denylist for non-admin internal queries
R3 review (PR #278) caught a wider data-leak than R2: the underlying-
physical-table guard listed only the 7 usage_* + audit_log tables,
but system.duckdb has 30+ other sensitive tables — users (emails +
ids), personal_access_tokens, resource_grants, user_groups,
user_observability_views, store_*, marketplace_*, knowledge_*, etc.
A non-admin SQL like
SELECT * FROM agnes_sessions
UNION ALL SELECT email, id, … FROM users LIMIT 1
would leak every user's row.
Replaces the hardcoded denylist with a **dynamic allowlist** —
non-admin SQL may reference ONLY the registered agnes_* aliases.
Every other table in `information_schema.tables` (main schema) is
rejected. Future migrations that add a new sensitive table are
automatically covered without re-editing this module.
Also strips SQL comments (`/* */` and `--`) before the identifier
scan so a comment-wrapped table name (`/**/users/**/`) can't slip
past the regex.
Four new negative tests pin: `users`, `personal_access_tokens`,
block-comment wrap, line-comment wrap.
Plus: per-user view-count cap (100) on /api/admin/observability/views
so an admin can't fill system.duckdb with thousands of saved views.
* release: 0.54.0 — Activity Center + Telemetry + Sessions + internal datasource
Cuts the work shipped across this PR (Activity Center build, recursive
internal data source) into a versioned release. Bumps pyproject.toml
to 0.54.0; renames the top of CHANGELOG.md from [Unreleased] to
[0.54.0] — 2026-05-12 with a header summary; opens a fresh
[Unreleased] section for the next round.
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat(store-guardrails): enforce per-component description quality
Two-tier hard guardrail on flea-market submissions. Empty / placeholder /
single-word descriptions now block before any LLM call; vague-but-passes-
floor descriptions block on the substantive LLM review layer.
Tier 1 — inline mechanical check (src/store_guardrails/content_check.py).
Walks the baked plugin tree, evaluates each component (plugin manifest,
agents, skills, commands) plus the submission-level form description
against a 60-char / 25-char (commands) / 5-distinct-word / 200-char-body
floor with a placeholder denylist (TODO, TBD, {{var}}, etc.). Floors
calibrated against real ecosystem norms: Claude / superpowers /
compound-engineering skill packs cluster 150–220 chars, npm / Docker /
VS Code at 100–120. InlineResult.passed now ANDs in content.status.
Tier 2 — LLM review extension (prompts.py + llm_review.py). System
prompt gains a content-quality criterion; REVIEW_JSON_SCHEMA carries a
content_quality {verdict, issues[]} object alongside the existing
security findings. is_safe() requires content_quality.verdict == 'pass'.
Single LLM call covers both dimensions. MAX_RESPONSE_TOKENS bumped
2000 → 2500 for the extra payload. Verdicts missing content_quality
treated as pass (backwards compat with already-recorded rows).
Submitter UX:
- /store/new wizard now carries a "Before you upload — what passes
review" collapsible disclosure on both step 1 and step 2 with the
bar + patterns that work. Live char counter on the description
field. Per-component preview table (green/red dots from the new
summarize_for_preview helper) renders after the ZIP /preview round
trip, scoping each finding to its file.
- New /store/examples page with rejected/passes pairs for skill /
agent / plugin / command plus a "Why these limits" research table.
Anchored sections (#skill / #agent / #plugin / #command) so the
rejection banner can deep-link by component_type.
- Quarantine banner _content_findings.html groups findings by file
(one "See <type> example ↗" per component, not per field) and
translates field codes (frontmatter.description / body / etc.) to
plain-English labels. _content_howto_fix.html surfaces a static
"Re-upload as new version" + "See examples" action row beneath any
content failure on the entity detail page.
- _parse_frontmatter moved to src/store_guardrails/_frontmatter.py so
the new check module shares the parser without inverting the
app → src dependency direction.
Tests:
- New tests/test_store_guardrails_content.py (29 cases) covering
every failure code per component type plus submission-level checks
and the summarize_components / summarize_for_preview helpers.
- Extended test_store_guardrails_inline.py for the new
InlineResult.content field + aggregate behaviour.
- Extended test_store_guardrails_llm.py for the new
content_quality verdict pathways (fail blocks, missing field passes).
- Backfilled fixture descriptions across test_store_api.py,
test_store_entity_versions.py, test_store_put_atomic.py,
test_admin_store_submissions.py, test_marketplace_api.py,
test_marketplace_v32_endpoints.py so existing happy-path tests
clear the new 60-char floor.
* fix(content-guardrail): align agents walker with preview + drop import-time .format()
Two cleanups from the takeover review on #276 (vr/guardrails-content).
1) `_iter_components` for agents now skips files lacking frontmatter
(no `name` AND no `description`). Pre-fix the walker greedily
evaluated every `*.md` under `agents/` — `agents/README.md` and
helper docs got flagged as "frontmatter.description empty"
rejections. Worse: `summarize_for_preview` for `type=agent` ALREADY
filters the same shape, so the upload preview gave a green dot
while the post-bake check gave a red rejection on submit. Two new
regression tests in TestAgentsWalkerSkipsNonAgentFiles pin both
shapes (README + _NOTES.md) so the preview/check parity stays
aligned.
2) `body_too_short` hints now use the same runtime-kwarg substitution
pattern as every other hint in the table. Pre-fix the skill +
agent body_too_short hints called `.format(min_chars=_MIN_BODY_CHARS)`
at module-load time, but the call site `_hint_for(type_,
"body_too_short")` didn't pass `min_chars=`, so the format() was
just baking the constant at import. Cosmetic inconsistency; pass
`min_chars=_MIN_BODY_CHARS` at the call site instead and let
`_hint_for` do the substitution like it does for `too_short`.
Verified end-to-end:
- New TestAgentsWalkerSkipsNonAgentFiles cases fail on the unfixed
walker (verified by reverting to the pre-fix file and re-running);
pass cleanly after the fix.
- Full content-guardrail suite: 25/25 (23 existing + 2 new).
- Full pytest: 4189 passed, 25 skipped.
* release: 0.53.5 — content guardrail (flea-market submitter UX) + catalog ENTITY column + BQ hint dispatch
Bundles three threads landed in [Unreleased]:
- Vojta's flea-market content guardrail (two-tier mechanical + LLM)
- Zdeněk's `agnes catalog` ENTITY column replacement for FLAVOR
- Zdeněk's `/api/query` remote_estimate_failed hint dispatch fix
Plus the takeover hygiene from #276 review (agents walker preview/check
parity + body_too_short hint runtime kwarg consistency) and the
backslash-escape fix follow-up to v0.53.4 #275.
No DB migration; no API change. Patch upgrade lands transparently.
Upload form's new "Before you upload" disclosure + per-component preview
table appear on the next dev-VM auto-pull. Quarantine banner now groups
findings by file with "See <type> example ↗" deep-links to the new
/store/examples reference page.
---------
Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
PR #274 (just merged) introduced `\`AS \\\`rows\\\`\`` in the syntax-error
branch of _hint_for_bq_bad_request. Python doesn't recognize \\\` as an
escape sequence, so the literal backslashes survived into the JSON
`hint:` field. Analyst reading the CLI error saw:
Backtick the alias (`AS \`rows\``) or rename it ...
with visible backslashes — exactly the misleading shape this dispatcher
exists to clean up. Self-review caught it; this PR replaces the
problematic substring with plain prose ("rename the alias to a
non-reserved word (AS row_count) or backtick-quote it BQ-style
(AS `rows` with literal backticks around the identifier)") that needs
no escape gymnastics.
New regression test test_no_hint_branch_leaks_literal_backslashes pins
every dispatch branch against `\\\`` and `\\\\` substrings — pytest now
catches this class on the next regression instead of waiting for an
analyst to spot it. CHANGELOG bullet rephrased to match (the same
broken backslashes leaked into the [Unreleased] entry).
Verified: 4162 tests pass; 26 in test_api_query_guardrail.py green;
demo print of the syntax-error branch shows clean output.
Two analyst-UX papercuts surfaced by the v0.53.4 onboarding smoke test.
1) /api/query remote_estimate_failed hint now branches on the BigQuery
error class instead of always claiming a column doesn't exist. The
previous hardcoded "Most often this means a column referenced …
doesn't exist" misled analysts whenever BigQuery actually rejected
on syntax — concretely, `SELECT COUNT(*) AS rows FROM …` fails with
`Syntax error: Unexpected keyword ROWS at [1:20]` (`rows` is a BQ
reserved word) and the hint pointed at non-existent columns.
New _hint_for_bq_bad_request() helper dispatches:
- "Syntax error" / "Unexpected keyword" → reserved-keyword alias hint
with `AS row_count` workaround
- "Unrecognized name" / "not found inside" → `agnes schema <id>`
- "Table not found" → `agnes catalog`
- fallback → enumerate all three
4 unit tests in TestHintForBqBadRequest pin each branch. Existing
guardrail tests (test_fallback_fails_fast_on_pure_duckdb_syntax,
test_remote_estimate_failed_surfaces_first_error_when_attempts_differ)
continue to pass — both hint substrings they assert on still appear in
the relevant branches.
2) `agnes catalog` replaces the FLAVOR column with ENTITY. FLAVOR
rendered t['sql_flavor'] which duplicated SOURCE for any catalog
dominated by one source type — analysts saw `SOURCE=bigquery
FLAVOR=bigquery` on every row. ENTITY instead surfaces the upstream
BigQuery entity_type (BASE TABLE / VIEW / MATERIALIZED_VIEW) for
remote rows; non-remote rows render `-`. The distinction matters
operationally: views don't support predicate pushdown, so `agnes
query --remote` against a view trips the cost guardrail where the
same query against a BASE TABLE pushes down cleanly. The
entity_type field has been in the v2 catalog response since 0.51.0;
this PR just stops hiding it behind a column header that conveyed
no information.
JSON output (`agnes catalog --json`) is unchanged — only the human-
readable column changed. No DB migration; no API change.
Verified: 4161 tests pass locally; 25 in test_api_query_guardrail.py
green; the 4 new TestHintForBqBadRequest cases pin each branch.
* fix(cli-install): move kbcstorage to [server] extra so wheel installs cleanly
The 0.53.3 wheel served at /cli/wheel/ is unsatisfiable on a clean machine:
analyst runs `uv tool install <wheel-url>` per the published /setup
instructions and the resolver immediately fails with
Because kbcstorage<=0.9.5 depends on urllib3<2.0.0 and
agnes-the-ai-analyst==0.53.3 depends on kbcstorage>=0.9.0 and
urllib3>=2.7.0, we can conclude that agnes-the-ai-analyst==0.53.3
cannot be used.
The `[tool.uv] override-dependencies = ["urllib3>=2.7.0"]` in pyproject.toml
masked the conflict in workspace contexts (Dockerfile + dev install) but
does NOT propagate to the wheel — wheel METADATA is plain PEP 621
Requires-Dist, and a fresh resolver context (uv tool install <wheel-url>)
never sees the override. Every existing test passed because the dev venv
already has kbcstorage 0.9.5 + urllib3 2.7.0 coexisting under workspace
overrides; the break only surfaces on the next analyst's first install.
Fix: kbcstorage moved out of [project] dependencies into
[project.optional-dependencies].server, since it is server-side only
(connectors/keboola/client.py is the sole import site, called from admin
endpoints, server connectors, and integration tests — never from the CLI
install path). Server install picks it up via Dockerfile's
`uv pip install --system --no-cache ".[server]"`. CI installs `.[dev,server]`
so workspace tests still cover the kbcstorage path. Analyst CLI wheel
METADATA now lists `kbcstorage>=0.9.0; extra == 'server'` (gated) and
`uv tool install <wheel>` resolves cleanly.
Verified end-to-end:
- Built wheel locally; inspected METADATA — kbcstorage line is now `; extra == 'server'`.
- `docker run --rm python:3.13-slim` + `uv tool install <wheel>`: agnes 0.53.4 installs, `agnes --version` works, `agnes catalog --help` renders, kbcstorage absent from CLI venv, urllib3 = 2.7.0.
- Same container with `.[server]` install path: kbcstorage present, urllib3 = 2.7.0 (override applies in workspace context).
- Full pytest suite green locally (4157 passed, 25 skipped).
* release: 0.53.4 — analyst CLI install hotfix (urllib3/kbcstorage resolver conflict)
Patch bump shipping the [server] extra split + new clean-install CI lane.
No DB migration; no API change; no operator-facing config change.
Operator side (Dockerfile path) auto-picks `.[server]` so the production
image gains kbcstorage transparently. Analyst onboarding (uv tool install
<wheel>) starts working again.
Three bundled improvements:
- #244 — new `agnes diagnose` check compares SessionStart events
(~/.claude/projects/<encoded>/*.jsonl) against agnes-push uploaded
log entries inside a 7-day window. Surfaces a warning when the gap
exceeds 3, hinting at silently-broken capture-session — previously
detectable only weeks after the fact.
- Dependabot — bumps transitive urllib3 from 1.26.20 to 2.7.0 to close
5 advisories (4 high, 1 medium). kbcstorage 0.9.5 still pins
urllib3<2.0.0 upstream; overridden via [tool.uv] override-dependencies
since the SDK works fine against 2.x in practice (Client + Tables
both flow through requests, which supports both lines).
- #252 — fix flaky test_scratch_dir_cleaned_up_after_failed_extraction
by redirecting tempfile.tempdir to a per-test tmp_path. Pre-#252 the
test scanned the shared system tmp dir and a sibling store test in
another pytest-xdist worker could trip the assertion mid-window.
Closes#244. Closes#252.
Cut [Unreleased] → [0.53.2]. Bumps pyproject from 0.53.1 to 0.53.2.
Contents: configurable instance.brand + workspace_dir, setup-script
polish (explicit mkdir step, final "restart Claude Code" step,
uniform connector markers), Asana revert from MCP to PAT+REST,
Atlassian longest-expiry guidance, and the BREAKING removal of
agnes query --register-bq.
The flag ran RemoteQueryEngine in-process on the caller's machine and
required local BigQuery credentials (BIGQUERY_PROJECT + ADC). Analysts
don't have those, so calling --register-bq from an analyst workspace
surfaced as a confusing not_configured error chain ("Could not load
static instance.yaml" + "BigQuery project not configured"). An agent
following CLAUDE.md's hybrid-queries guidance would land in exactly
that trap.
The underlying engine was originally designed server-side (commit
d180b201, "Step 28: Remote query architecture"); the CLI port (commit
d605e7d9) silently assumed parity with the server. Server-side hybrid
already exists as an admin-only POST /api/query/hybrid endpoint
(app/api/query_hybrid.py) and is untouched here.
Analysts combining local + remote data now have two documented paths:
agnes snapshot create a filtered slice and join locally, or run the
join server-side via agnes query --remote. CLAUDE.md, the agent skill,
docs/DATA_SOURCES.md, and connectors.md updated accordingly.
Three client-side fixes in admin_tables.html plus a regression test
file pinning the server-side PUT contract the new JS relies on.
Bug 1 — saveBqTabEdit (synced/custom) nulled bucket/source_table on
every save; the null was supposed to clear stale state on a true
remote→materialized mode flip but fired on every save, silently
wiping persisted bucket/source_table when admin edited only the
description on an already-materialized row. Now gated by
_editOriginalQueryMode !== 'materialized'.
Bug 2/3 — _buildBigQueryPayload (synced/whole) at register time did
not send bucket/source_table — only source_query — so whole-table
materialized rows persisted with bucket=NULL. Edit modal then loaded
empty Dataset/Table inputs over a SELECT * SQL. Register now sends
both fields; _openEditBqModal additionally parses source_query as
a fallback for rows that registered pre-0.53.1.
Closes#266.
- instance.brand (env AGNES_INSTANCE_BRAND, default "Agnes") +
instance.workspace_dir replace hard-coded "Agnes" / "~/Agnes" across
/home, /setup, /setup-advanced, /login, /install, /me/debug, and the
Claude Code clipboard setup script. Terraform-friendly env override;
defaults preserve existing Agnes branding.
- Explicit "create workspace folder" step on /home (OS-tabbed mkdir+cd)
+ same step baked into the clipboard script as step 2. Drops the
implicit assumption that `agnes init --workspace .` lands in a
sensibly-cd'd shell.
- Final "Restart Claude Code" step in the setup script (unconditional,
between connectors and Confirm) so freshly-installed plugins, MCP
servers, and SessionStart hooks load on the next Claude Code session.
- Asana reverted from hosted Remote MCP back to PAT + raw REST against
app.asana.com/api/1.0. MCP envelope shape consumed ~5x tokens per
call; the PAT path lets the agent read flat REST fields. Existing
MCP registration is detected and the user is asked whether to remove
it (default Y, with benefits listed: token cost, no third-party hop,
no OAuth refresh dance, deterministic envelope shape).
- Atlassian connector instructs picking the longest API-token expiry
(today "1 year") to cut re-mint friction. No public query-parameter
hook exists on id.atlassian.com to pre-select expiry, so the prompt
documents the manual click and acknowledges that limitation.
- Uniform ✅ / ❌ per-connector marker contract (Asana, GWS, Atlassian)
for the Confirm summary to grep. Each connector now ends with a
Claude-driven end-to-end test that uses Claude Code's own bash to
exercise the stored credential and prints
"✅ <Connector> integration verified — ..." (or the failure variant).
* release: 0.53.0 — Tier B trackers + admin UI bugfix
Closes#259 (init resume sentinel), #260 (startup parquet-lock sweep),
#261 (materialized schema uses local parquet, not BQ), #265 (admin
tables apostrophe → HTML-entity escape).
Tracker notes: #262 closed as obsolete (pre-empted by 0.51.0 changes),
#266 left open pending UX clarification.
* fix(init): move resume sentinel from .agnes/ to .claude/
The clean-install integration test (test_clean_install_integration.py)
forbids creating .agnes/ in the workspace root via its
forbidden_unconditional list — that path is reserved for ~/.agnes/ in
the user's HOME (marketplace clone, CA bundle).
.claude/ is already created by agnes init for settings.json + hooks,
so dropping init-complete next to those keeps the resume sentinel
consistent with the rest of Claude Code's workspace surface and lets
the clean-install assertions pass.
Issue #259.
* docs(changelog): point #259 entry at new .claude/init-complete path
Follows the sentinel move from .agnes/ → .claude/ to keep the changelog
in sync with what 0.53.0 actually ships.
* fix(memory-admin): pass RBAC user_groups (not YAML config) to /corporate-memory/admin
`GET /corporate-memory/admin` was passing `corporate_memory.groups` from
instance.yaml (a dict, default `{}`) as the template's `groups=`. The
template's `renderItemCard` does `GROUPS.map(g => ...)` to build the
mandate-form audience picker, which throws `{}.map is not a function`
the moment any pending item renders. The thrown TypeError bubbles up
through `renderReviewItems` and gets swallowed by the `loadReviewQueue`
catch block, which paints "Error loading pending items." over a
perfectly valid `/api/memory/admin/pending` response.
Bug was dormant since the initial system commit because `renderItemCard`
only runs when at least one pending item exists, so test fixtures and
empty queues never tripped it.
Mandate form actually targets RBAC user_groups (audience strings like
`group:Admin`), not the unrelated `corporate_memory.groups` YAML
section. Route now passes the user_groups list shaped as
`[{name, members_count}]`. Template additionally guards the `.map`
call with `Array.isArray(GROUPS) ? GROUPS : []` so a future shape
regression degrades to "no group options" instead of crashing the list.
No DB migration; no API change.
* test(memory-admin): assert /corporate-memory/admin renders GROUPS as array
Server-side regression for the bug Vojta's previous commit fixed: the route
used to pass `corporate_memory.groups` (a dict, default `{}`) into the
template as `groups=`, so `const GROUPS = {{ groups | tojson }}` rendered
as `{}` and `GROUPS.map(...)` inside renderItemCard threw at runtime.
The bug was dormant because renderItemCard only runs when at least one
pending item exists — empty queues never tripped it. Test seeds one
pending knowledge_item and asserts the rendered HTML contains
`const GROUPS = [` (array prefix), so any future regression that flips
the variable back to a dict shape fails immediately rather than waiting
for an operator to seed the queue and notice the broken admin page.
* release: 0.51.1 — corporate-memory admin pending-banner hotfix
Patch bump: ships the /corporate-memory/admin GROUPS-shape fix
(dict → array of {name, members_count}) and its server-side regression
test. No DB migration, no API change, no operator action required —
upgrades land transparently.
---------
Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
* Rename agnes-metadata.json to marketplace-metadata.json
Curated marketplace enrichment file (.claude-plugin/agnes-metadata.json)
becomes marketplace-metadata.json. Clean cut, no fallback — curators of
upstream marketplace repos must rename the file on their side.
Python API renames mirror the file rename: read_agnes_metadata →
read_marketplace_metadata, AGNES_METADATA_REL → MARKETPLACE_METADATA_REL,
AGNES_METADATA_MAX_BYTES → MARKETPLACE_METADATA_MAX_BYTES. Synth Claude
Code marketplace strip rule (.agnes/** + the metadata file) follows the
new filename.
* Marketplace detail polish: window cover + 715:310 aspect + helper alignment
- Plugin & item (skill/agent) detail hero: 160x160 square cover replaced
with a macOS-style window frame (3 traffic-light dots + titlebar label
showing the entity name). Body is constrained to 715:310 so curator-
uploaded covers no longer crop to a square. Window is 380px wide; meta
column and absolutely-positioned top-right install/remove actions stay
put. Fallback when no cover_photo_url (translucent gradient + PL/SK/AG
initials) is unchanged, just inside the window body.
- Inner skill/agent cards in the plugin detail's Internal structure
section adopt the same 715:310 aspect (was fixed 78px tall). No window
chrome on inner cards — just the matching proportions so covers read
consistently across hero, grid tiles, and listing cards.
- Curated nested item helper text ("This skill is part of ... — add the
bundle to your stack to use it") now stacks UNDER the "Open parent
plugin" button instead of being a side-by-side flex sibling in the
actions-row. Added align-self: flex-end so the 260px helper box
anchors at the right edge of the 300px actions column, matching the
button's right edge.
* Marketplace My tab: surface the same category + type filters as Flea
- Frontend: mp-cat-row and mp-type-row now show on tab=my (previously
hidden — type was flea-only, category was flea/curated-only). Curated
browse stays plugin-only and continues to hide the type pills.
fetchOne() sends the `type` param for tab=my too, so the items
endpoint's existing my-branch filter actually receives it.
- Backend categories endpoint, tab=my branch: when the type filter is
set to skill/agent, skip counting curated subscriptions. Curated
plugins are always type='plugin', so they wouldn't survive the items
endpoint's type filter; including them in the category counts made
the pill numbers overstate what users could actually see in the
grid. type=None or type='plugin' keeps the previous behaviour.
- CHANGELOG entry under [Unreleased].
* Marketplace plugin detail: render rich content from marketplace-metadata.json
Adds five optional plugin-level fields to marketplace-metadata.json and
renders them on the curated plugin detail page + listing card:
* display_name — friendly h1 / listing-card name / mac-window titlebar
label (overrides the technical plugin id)
* tagline — punchy 1-line value prop for the hero subtitle and the
listing card description (replacing the verbose marketplace.json
description on cards)
* description — multi-paragraph markdown body, server-side rendered
through markdown-it-py and sanitized through nh3 with a
description-scoped allowlist (no iframes / no raw HTML / no
javascript: links). Powers the "What it does" panel.
* use_cases[] — {title, description, prompt} entries that render as a
3-column "When to use it" card grid; each card shows the literal
prompt as a code chip so users can copy-paste into Claude Code.
* sample_interaction — {user, assistant} dialog rendered in a Claude
Code-style dark Catppuccin Mocha transcript panel: monospace user
row with a green ">" prompt indicator + sans-serif assistant body
with markdown formatting (peach bold, yellow italic, pink inline
code, mantle-dark fenced code blocks).
All five fields are optional; UI sections only render when populated,
so plugins without enrichment look identical to before. Fields are
read on-demand from the working tree (cached by mtime per marketplace
slug) so curator edits land at the next request without waiting for
a sync cycle — same pattern as the existing inner-skill/agent
enrichment path. No DB schema bump.
Skill / agent rich-content rendering is deferred to a later phase
(needs a source-of-truth decision: extend plugin.yml? LLM-generate
from SKILL.md / agent.md?). The schema accepts the same fields at
skill/agent level today for forward compatibility but the UI ignores
them for now.
Also: stripped a stale `background-color: var(--bg)` from the global
`code` rule in style.css (was making inline code visually disappear
on the page background).
* Skill / agent detail: render rich content from marketplace-metadata.json
Brings the skill/agent detail pages to parity with the plugin detail
page. Same rich-content schema (display_name, tagline, description as
markdown, use_cases[], sample_interaction) plus two per-item additions:
* invocation — curator-provided literal command string. When set,
overrides the computed "<manifest_name>:<inner_name>" chip and
cleanly supports both "/" skill prefix and "@" agent prefix (the
hardcoded "/" in the chip markup is hidden when the curator provides
the invocation, so /grpn-eng:query <q> and @grpn-eng:cto-architect
both render correctly).
* when_to_use — markdown disambiguation block ("Use this for X. For
similar Y, see /other-skill") rendered into a new "When to use this"
panel below the Example section.
Skill / agent category is now per-item overridable in
marketplace-metadata.json. When absent, the API keeps the parent
plugin's category as the badge so existing items don't lose their
category until curators opt in to per-item categorization.
The new "Example" Q&A panel uses the same Claude Code-style dark
Catppuccin Mocha transcript treatment as the plugin detail —
monospace user row with a green ">" prompt indicator + sans-serif
assistant body with markdown formatting.
All new fields are optional and read on-demand from the working tree.
Skills / agents whose marketplace-metadata.json doesn't carry rich
content render exactly the same way they did before (frontmatter
description + computed slash command + cover from existing v32
enrichment). No DB schema bump.
* Fix TypeError in skill / agent detail when curator sets per-item category
`curated_skill_detail` and `curated_agent_detail` were passing both
`**parent` (from `_curated_inner_parent_fields`, which returns the
parent plugin's category as a fallback) and `**enrichment` (from
`_curated_inner_enrichment`, which returns the per-item category
override when the curator set one) into `InnerDetailResponse(...)`.
Python function-call kwargs unpacking with overlapping keys raises
`TypeError: got multiple values for keyword argument 'category'`
— it doesn't merge like a literal dict does. The bug only surfaced
when the marketplace-metadata.json carried a `category` field at
skill / agent level (curator opting into per-item categorization);
items without that override hit the endpoint cleanly because only
parent provided the key.
Fix: build `merged = {**parent, **enrichment}` first (literal-dict
syntax DOES merge, with the right-hand-side winning) and unpack the
merged dict. Curator override still wins via the merge order, and
the same pattern is future-proof for any other field that lands in
both layers later.
Plus a regression test in test_marketplace_metadata.py asserting
that the inner-resolver carries `category` for downstream merging.
* Marketplace detail: tolerate partial curator JSON
Server constructed UseCase / SampleInteraction via raw dict indexing
(uc["title"], sample["assistant"]), so a curator commit missing any
required Pydantic field crashed the whole plugin / skill / agent detail
endpoint with a 500. Route both constructions through _safe_use_case /
_safe_sample_interaction helpers — partial input silently drops the
malformed card / section instead of breaking the page.
Regression test in test_marketplace_api.py covers the three shapes:
use_case missing a key, use_case with an empty string, and
sample_interaction with only user (no assistant). Sibling rich fields
still render.
* Address PR-251 review (must-fixes + S2/S3 polish) + release-cut 0.50.0
Five must-fixes from the review pass (3 from @cvrysanek's two-stage
review, 2 from my independent pass), plus the 0.50.0 release-cut as the
last commit on this PR per CLAUDE.md (CLAUDE.md "Release-cut belongs
to the PR" rule added in v0.49.1).
Must-fixes
----------
1. Cache eviction: bounded LRU instead of per-marketplace predicate.
The previous predicate (`k[0] == marketplace_id and k[1] != mtime_ns`)
only swept stale entries for the CURRENT marketplace; with N>100
distinct marketplaces each holding one mtime key, the cap silently
failed and memory grew linearly. Replaced with OrderedDict-backed
bounded LRU at cap=256, drop oldest insert on overflow.
Cache stress test pinned in test_marketplace_metadata.py.
2. Render CPU cap: per-field byte cap on description / when_to_use /
sample_interaction.assistant via MARKETPLACE_METADATA_FIELD_MAX_BYTES
(= 64 KiB). Without this, a 1 MiB curator markdown body × QPS =
curator-controlled CPU burn through pure-Python markdown-it-py.
Truncation respects UTF-8 boundaries and logs a warning so the
curator sees the cap fire on the next sync. Test for cap +
UTF-8-boundary preservation.
3. Inner-detail bypassed the metadata cache. _curated_inner_enrichment,
_curated_inner_cover, and curated_detail all called
read_marketplace_metadata directly, defeating the mtime cache the
plugin listing already shared. Routed all three through
_read_metadata_cached so skill/agent detail hits are O(1) re-parses
per marketplace per mtime instead of O(QPS).
4. Truthy-vs-presence trap in plugin/inner enrichment merge. API-layer
writers used `if resolved.get(k):` which silently dropped any
future falsy-but-valid resolver field (bool featured=False, int
priority=0, str category=''). Switched to presence check
(`if k in resolved`) so the resolver is the authority on field
presence; `{**parent, **enrichment}` merge respects whatever the
resolver decided to ship.
5. Vendor-agnostic OSS cleanup. Removed operator-specific token
references (/grpn-eng:, @grpn-eng:, .foundryai/) from
src/marketplace_metadata.py docstring, app/web/templates/
marketplace_item_detail.html JS comment, docs/curated-marketplace-
format.md, and tests/test_marketplace_metadata.py fixtures. Replaced
with generic /my-plugin:tool / @my-agent:role / .example/ placeholders.
CHANGELOG
---------
- New "### Fixed (PR #251 follow-ups)" section documenting all 4
code-side must-fixes
- New "### Internal" section noting the vendor cleanup + new tests
- BREAKING bullet for the file rename now covers operator-side
migration: running instances see plugin enrichment disappear from
the UI until upstream curator renames + nightly sync overwrites the
working tree; POST /api/marketplaces/{id}/sync forces refresh sooner
- Stripped /grpn-eng: leaks from the existing skill/agent rich-content
bullet
Tests
-----
128 targeted tests pass (test_marketplace_metadata, test_marketplace_api,
test_marketplace, test_markdown_render, test_marketplace_synth_strip,
test_marketplace_filter). New tests added:
- 6 XSS regression tests on render_safe (javascript:/data:/vbscript:
schemes via autolink, reference link, and mixed-case + positive
http/https/mailto + noopener noreferrer rel)
- 3 byte-cap tests (truncation + UTF-8 boundary + under-cap pass-through)
- 1 cache eviction stress test (>256 marketplaces -> bounded at cap)
- 1 truthy-vs-presence resolver-contract test
Release-cut
-----------
- pyproject.toml 0.49.1 -> 0.50.0 (minor; BREAKING file rename per
pre-1.0 CHANGELOG note: "breaking changes called out under Changed
or Removed with the BREAKING marker")
- CHANGELOG [Unreleased] -> [0.50.0] - 2026-05-12, new empty
[Unreleased] on top.
---------
Co-authored-by: Minas Arustamyan <arustamyan.minas@gmail.com>
Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
Three behavioural improvements driven by the sub-agent end-to-end test
findings, plus scheduler tweaks to prevent the post-deploy contention
burst we measured.
CATALOG (catalog-side bugs the test agents tripped on):
- new entity_type field per remote row (BASE TABLE / VIEW /
MATERIALIZED VIEW). For views, rows + size_bytes return null
instead of the misleading 0 that __TABLES__ reports.
- where_examples now validates against the table's actual schema
(cached known_columns from refresh). The pre-fix behavior
blindly advertised `country_code = 'CZ'` on tables with no
country_code column — the sub-agent tests reliably hit this on
unit_economics.
- new known_columns + entity_type columns on bq_metadata_cache;
populated by bq_metadata_refresh.refresh_one from the same
fetch_bq_columns_full call (no extra BQ roundtrip) plus a
cheap INFORMATION_SCHEMA.TABLES lookup for table_type.
QUERY COST-GUARD:
- remote_scan_too_large suggestion now names views explicitly:
`Target(s) <ids> are VIEW or MATERIALIZED VIEW. BigQuery does
not push LIMIT into the view body — SELECT * FROM <view>
LIMIT 1 still runs the full underlying scan.` Programmatic
consumers get a new view_targets field on the error detail.
SCHEDULER HYGIENE (the post-deploy 1-minute window where
concurrent parquet downloads dropped to ~1 MB/s):
- SCHEDULER_STARTUP_GRACE_SECONDS (default 60) holds the first
tick so the burst doesn't overlap cache_warmup writes.
- SCHEDULER_BQ_METADATA_INITIAL_OFFSET_MAX_SECONDS (default 900)
randomises bq-metadata-refresh's first-fire offset.
TESTS:
- test_bq_metadata_cache_repo: entity_type + known_columns round-trip
- test_v2_catalog_remote_metadata: where_examples validation, views
return null rows/size_bytes, cold rows have empty examples
- test_api_query_guardrail: VIEW-aware suggestion text + view_targets
- test_connectors_bigquery_metadata: entity_type lookup mock + new
fields in TableMetadata expectations
- test_scheduler_sidecar: grace + jitter env-var resolution
- DuckDB 1.5.1 regressed: rejected `ALTER TABLE … ADD COLUMN IF NOT EXISTS`
with `Cannot alter entry … because there are entries that depend on it`
when the target was FK-referenced from another table. Hit on `internal_roles`
(v8→v9) and `user_groups` (v11→v12) during migration replay. 1.5.2 fixes it.
CI already runs 1.5.2; this pins the same floor for local devs.
- tests/test_cli_binary_rename now skips with an actionable message instead
of failing when the local venv has no `agnes` on PATH (fresh checkout) or
has a stale shim from a prior editable install whose `cli` layout shifted.
CI installs fresh and still asserts the real contract.
Since 0.47.0 GET /api/v2/catalog enriched each remote BigQuery row by
fetching INFORMATION_SCHEMA.TABLE_STORAGE + COLUMNS through the DuckDB
BigQuery extension *inside the request*. On cold caches that fanned out
to O(N) sequential BQ jobs-API roundtrips — easily 90 s+ on partitioned
/ view-backed tables — and reliably blew the CLI's 30 s httpx
ReadTimeout. Reproduced with py-spy: three AnyIO worker threads stuck
inside connectors/bigquery/metadata._fetch_via_legacy_tables.
Refactor: enrichment is read exclusively from a new persistent
bq_metadata_cache DuckDB table (schema v40), populated by a scheduler-
driven refresh job at SCHEDULER_BQ_METADATA_REFRESH_INTERVAL (default
4 h). Cold catalog response on a fresh container is now tens of
milliseconds with metadata_freshness=never_fetched for unwarmed rows.
New surface:
- POST /api/admin/run-bq-metadata-refresh (scheduler-driven, full)
- POST /api/v2/metadata-cache/refresh?table=<id> (admin, single)
- GET /api/v2/metadata-cache/status (auth, non-admin)
- metadata_freshness field per catalog row
Removed (internal API): v2_catalog._size_hint_for_row,
_resolve_remote_metadata, _metadata_provider_for,
_build_metadata_request, _materialized_size_hint, in-memory
_metadata_cache. Response shape unchanged for external consumers.
991 tests passing; 2 pre-existing failures (test_db v3→v4 ladder,
test_cli_binary_rename) unrelated to this change.
* Make /home install-hero links readable against blue background
The Claude license-options link added in the previous commit inherited
the default `<a>` style (`var(--hp-primary)` blue), which renders as
blue-on-blue and is unreadable inside the blue install-hero. Add a
scoped `.install-hero a` rule that uses white with an underline
(matching the existing lead-paragraph contrast pattern) so any link
nested in the hero stays legible.
* Reorder /home install flow: auto-mode is now Step 2, Agnes install becomes Step 3
Step 3 (was Step 2) pastes a ~20-command bash bootstrap into a fresh
Claude Code session. Without auto-mode enabled first, each Bash/edit
command needs a manual approve click — bad UX for first-time users.
Move auto-mode from the outside-hero `<details>` reference block into
the install-hero as a real Step 2, between "install Claude Code" and
"install Agnes". Content is the persistent `acceptEdits` snippet
(write to ~/.claude/settings.json) plus a one-liner pointing at
Shift+Tab for users who are already inside a running Claude Code
session. YOLO mode for full Bash auto-approve stays on
/setup-advanced behind the existing link.
The outside-hero `setup-collapsible[data-section="step3"]` block is
dropped — auto-mode is no longer reference content, it's a real
install step, and duplicating it would just diverge over time.
Onboarded users no longer see the auto-mode block at all (consistent
with Steps 1 + 3 also hiding post-onboarding).
Completion banner copy updated: "Step 1, 2 & 3 done — Claude Code
installed, auto-mode set, Agnes ready". Dashboard CTA partial and
other templates don't reference step numbers for this flow, so no
adaptation needed there.
* Simplify /home Step 2 to Shift+Tab only — drop the JSON snippet
Operator pointed out two issues with the prior Step 2:
1. The settings.json snippet is redundant. Claude Code's first
Shift+Tab cycle to auto-accept mode already prompts the user
whether to persist it as default — Claude writes the config
itself, no manual file edit needed.
2. The snippet only showed the POSIX path `~/.claude/settings.json`,
which doesn't translate to native Windows.
Replace the snippet + copy button with a plain Shift+Tab instruction,
explicitly call out the first-time "make this the default?" prompt,
and note that Claude handles the config write itself — same flow on
macOS / Linux / WSL / Windows. Adds a fallback line for users who
already closed the post-OAuth session.
* Tighten /home Step 2 install-note to two paragraphs
Operator: drop the 'Claude writes the setting itself, so this works
the same on macOS / Linux / WSL / Windows...' line plus the
'auto-approves file edits going forward; Bash commands stay gated
— that's the safe default' line. Both were filler — the make-default
prompt already implies persistence, and gated Bash is the obvious
default users won't be surprised by.
Result: paragraph 1 carries Shift+Tab + first-time make-default
say-yes + closed-session fallback in one breath; paragraph 2 keeps
the verbatim YOLO link. Same affordances, less vertical space.
* Capture session paths via SessionStart hook + lock parallel pushes
Replace the encoding-based scan of ~/.claude/projects/<encoded-cwd>/ with
a queue file populated by a new `agnes capture-session` SessionStart hook.
The hook reads the documented `transcript_path` field from Claude Code's
hook stdin JSON, sidestepping the cwd-to-folder encoding (which is an
internal implementation detail and varies by Claude Code version).
- New `agnes capture-session` subcommand appends transcript_path to
<workspace>/.claude/agnes-sessions.txt. Silent on all malformed input
so a hook chain failure doesn't clutter Claude Code startup.
- `agnes push` now consumes the queue: atomic snapshot rename guards
against hooks writing during the push window, successful uploads land
in agnes-sessions-uploaded.txt (TSV: timestamp + path), failed paths
are requeued.
- Cross-platform single-instance lock via the filelock package (fcntl
on POSIX, msvcrt on Windows). Concurrent SessionEnd hooks — common
when the user closes several sessions at once — silent-exit on the
losing side instead of all racing the upload.
- Recovery: pre-existing snapshot files from a crashed push are picked
up and processed before the live queue.
- The SessionStart `agnes push` self-heal entry is dropped — it became
redundant once the queue persists across runs (orphans from headless /
crashed sessions ship out on the next interactive SessionEnd push).
Existing workspaces auto-migrate via the marker-based replace logic.
- Legacy encoding scan stays available behind `--legacy-scan` for one-
off backfills of sessions predating the queue.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Add /agnes-private + statusLine indicator for private sessions
Users handling sensitive data inside Claude Code can now opt a session
out of the Agnes upload pipeline, either proactively (right after session
start) or reactively (mid-session). The `/agnes-private` slash command
runs `agnes mark-private` deterministically via `!`-prefix direct bash —
no AI in the loop. A workspace-installed statusLine surfaces a
`🔒 agnes-private` indicator in Claude Code's status bar so the user
sees the state at a glance.
Authoritative source of "do not upload" is a separate file
`<workspace>/.claude/agnes-sessions-private.txt` (one session_id per
line). Both `capture-session` (queue writer) and `push` (queue reader)
consult the list. This makes the slash-command / SessionStart-hook race
impossible by construction: whichever runs first, the session is correctly
filtered out.
- `agnes mark-private` reads `CLAUDE_CODE_SESSION_ID` from env (set by
Claude Code in every bash subprocess it spawns — stable documented API)
and appends to the private list.
- `agnes statusline` reads the session JSON Claude Code pipes on stdin,
checks the private list, and emits the indicator or nothing. Optimized
for the high call frequency of statusLine renders.
- `capture-session` extracts session_id from hook stdin and skips queue
write when the ID is already on the private list (race protection).
- `push` filters snapshot entries by the private list and appends to a
per-workspace audit log `agnes-sessions-private-skipped.txt`.
- Queue format migrated from `<path>` to `<session_id>\t<path>`; legacy
one-column lines still parse (empty session_id, still upload, can't be
marked private retroactively — fine, they pre-date the feature).
- `install_claude_hooks` writes a workspace statusLine unless the user
already has a custom one (warn + preserve). Idempotent re-init.
- `install_claude_commands` ships `agnes-private.md` alongside
`update-agnes-plugins.md`. Per-template fallback so a missing template
doesn't get clobbered with the wrong content.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Fix setup-prompt + CLAUDE.md marketplace copy + drop skills step
Three issues against the post-PR-#240 / post-PR-#237 state:
1. Setup prompt's marketplace block trailer (both has-stack and
empty-stack variants) claimed the SessionStart hook keeps the
marketplace clone in sync via `agnes refresh-marketplace --quiet`
on every session and that admin grants land automatically — both
false since PR #237 (0.47.x) moved the install/update path out of
the hook into the `/update-agnes-plugins` slash command. The hook
is `--check`-only: detects server-side changes, prompts the user
to run the slash command, which does the full reconcile
interactively with output visible in the transcript.
2. The empty-stack variant framed composition as "admin grants only",
missing the actual three-source served stack:
(admin RBAC ∩ /marketplace subscriptions)
∪ system-mandatory plugins (admin-pinned, auto-applied)
∪ Flea market installs (skills/agents bundled, plugins standalone)
Updated copy spells out all three sources so analysts know where
their stack picks live, and what the SessionStart hook actually
does on change detection.
3. CLAUDE.md template's "Agnes Marketplace" section conflated
eligibility (`resolve_allowed_plugins` — what's listed) with served
stack (`resolve_user_marketplace` — what actually reaches Claude
Code). The two are different: a user can be RBAC-eligible for a
plugin without having subscribed to it on /marketplace. Rewrote
the section to distinguish the eligibility set from the served
stack and to describe the `--check`-only hook accurately.
Plus: deleted the setup prompt's interactive Skills step (final step
before Confirm). The named-opinion question — "do you want me to
bulk-copy every skill into ~/.claude/skills/agnes/ or pull on-demand
via `agnes skills show <name>`?" — had no obvious right answer for
new users at the tail end of a wall of technical steps. On-demand
lookup is the one-size-fits-all default; `agnes skills list/show`
remain discoverable and the CLAUDE.md template references specific
skills inline (e.g. agnes-data-querying in the BigQuery section)
where they're relevant. Layout: Confirm shifts from step 9 to step 8.
Tests updated, full setup/marketplace/welcome surface green (115
passed). Remaining full-suite failures are pre-existing (BQ/Keboola
fixtures, Windows charmap collection error in test_v26_keboola_e2e)
— verified against a clean stash, unrelated to this diff.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Fix session-queue race + snapshot PID-reuse data loss
Two blocker fixes from the PR #242 review:
1. Concurrent SessionStart hooks could corrupt the queue file on
Windows. Python's `open(path, "a")` is not atomic there — the CRT
does not pass FILE_APPEND_DATA to CreateFile, so concurrent
appenders (user opening several Claude Code windows simultaneously)
could interleave bytes mid-line. The malformed lines then silently
fail the parser and the entries are dropped.
Fix: wrap append_to_queue, requeue_failed, and snapshot_queue in a
short-lived FileLock on a dedicated `agnes-queue.lock`. Separate
from `agnes-push.lock` so capture-session hooks don't block on the
push command. New test_append_concurrent_threads_no_corruption
reproduces the race with 4 threads x 50 appends.
2. Snapshot filenames embedded only the PID (`agnes-sessions.snapshot.
<PID>.txt`). After a crashed push left a snapshot on disk and the
OS recycled the PID for a new push, `os.rename` would atomically
overwrite the recovery snapshot — every entry in it lost, silently.
Fix: append a uuid8 hex tail (`agnes-sessions.snapshot.<PID>.
<uuid8>.txt`). find_recovery_snapshots already globs the prefix
so it picks up both old and new format. New
test_snapshot_filename_is_unique_per_call asserts two consecutive
snapshots under the same PID don't collide.
Targeted tests green (47/47 in session_queue/capture_session/cli_push).
Full suite failures unchanged from baseline (pre-existing BQ/Keboola
fixture issues per CLAUDE.md).
* Auto-refresh workspace hooks + bash-wrap all hook entries (Windows)
Fixes from PR #242 second review (ZdenekSrotyr):
1. `uv.lock` regenerated to include `filelock 3.29.0` (declared in
pyproject.toml but missing from the lock file — CI's
lockfile-consistency check would fail; `uv pip install` on a clean
cache would silently miss the dep).
2. `agnes self-upgrade` now auto-refreshes the workspace Claude Code
hooks via the new `cli.lib.hooks.maybe_refresh_claude_hooks`. Closes
the silent-stop migration gap: a v0.48 workspace would auto-upgrade
the CLI from its existing SessionStart self-upgrade entry but never
pick up the new `agnes capture-session` SessionStart hook, leaving
the queue empty and `agnes push` uploading nothing.
The refresh fires on both the "info is None" fast path (CLI already
current — catches the second SessionStart after a prior upgrade)
and the install-success path. Guarded by `workspace_has_agnes_hooks`
so it never writes `.claude/settings.json` into directories that
aren't Agnes workspaces (e.g. `agnes self-upgrade` invoked from
`~/`). Errors are surfaced on stderr but never flip the upgrade exit
code.
3. All Agnes-managed hooks are now wrapped in `bash -c "..."`. The
self-upgrade+pull chained SessionStart entry was the only one still
shipping unwrapped — Claude Code on Windows runs hook commands
directly without a shell, so the `;` chain + `2>/dev/null` +
`|| true` shell syntax silently no-op'd on native Windows installs
without Git Bash on PATH. Workspaces still on the old form
auto-upgrade via the refresh path above.
Tests: +12 in test_lib_hooks.py (guard semantics, v0.48→v0.49
migration end-to-end, third-party-hook preservation, bash-wrap
invariant). +5 in test_self_upgrade.py (refresh fires on info=None,
fires on install success, skipped on failure, skipped on --check-only,
refresh failure never flips exit code).
130 targeted tests green. The 2 pre-existing Windows path-separator
failures in `test_smoke_test_detects_version_mismatch[uv|pip]` are
unrelated (path mismatch `\fake\uv\bin\agnes` vs `/fake/uv/bin/agnes`
in test asserts, pre-PR baseline).
* CHANGELOG: document PR-242 main features
Closes ZdenekSrotyr #4: the [Unreleased] block was missing entries for
the PR's primary surface — only the post-merge fix bullets and the
unrelated setup-prompt copy change were captured. Adds:
- ### Added: 6 bullets covering the session capture queue + new
`agnes capture-session` subcommand, `/agnes-private` slash + `agnes
mark-private`, `agnes statusline` + statusLine wiring, `--legacy-scan`
opt-in fallback, single-instance push lock, and the new `filelock`
runtime dep.
- ### Changed: BREAKING bullet on the SessionStart / SessionEnd hook
wire format change (capture-session as first SessionStart entry,
push self-heal removed, SessionEnd push detached via nohup, all
entries bash-wrapped). Folds the prior standalone bash-wrap bullet
into this consolidated entry — Z's review flagged the layout shift
as BREAKING, and grouping the related sub-changes makes the
migration story readable in one place.
- Operator migration is auto-handled by `maybe_refresh_claude_hooks`
invoked from `agnes self-upgrade` (separate Changed entry below).
No `agnes init` re-run required. Pre-queue session jsonls on
upgrading workspaces still need a one-off `agnes push --legacy-scan`
— flagged in the BREAKING bullet.
No code change; doc only.
* Drop permanent 4xx uploads instead of requeueing forever
Closes ZdenekSrotyr #5. Previously the push retry path requeued any
non-200 response except the literal "file not found on disk", so 401
(token expired), 403 (RBAC denial), 413 (payload too large), 400
(server-side validation) cycled through every push run forever — the
queue grew without bound and each run re-bombarded the server with the
same deterministically-failing upload.
Now 4xx (except 408 Request Timeout + 429 Too Many Requests, which the
HTTP spec marks as transient) is dropped and audit-logged to
`<workspace>/.claude/agnes-sessions-failed.txt`:
<iso_ts>\t<session_id>\t<status>\t<transcript_path>
5xx and network errors continue to requeue — those reflect server /
transport state that can change between runs, so retry is the right
behavior.
The audit log piggybacks on the push single-instance lock
(agnes-push.lock) — push is the only writer to this file, same as the
existing `mark_uploaded` and `mark_private_skipped` paths, so no
separate filelock is needed.
`agnes push --json` surfaces a new `dropped_permanent` counter; non-
quiet stdout mentions the audit-log path so operators tailing the
output have a pointer to the forensic trail.
Tests: +7 in test_cli_push.py (401/400/403/413 → drop; 408/429 →
requeue; 500/502/503 → requeue; network exception → requeue;
--json `dropped_permanent` counter; stdout audit-log pointer). +1 in
test_session_queue.py (mark_failed_permanent TSV format).
127/129 targeted tests green. The 2 pre-existing Windows
path-separator failures in `test_smoke_test_detects_version_mismatch
[uv|pip]` are unrelated (path mismatch `\fake\uv\bin\agnes` vs
`/fake/uv/bin/agnes` in test asserts, pre-PR baseline).
* Catch OSError in push lock acquisition
Closes ZdenekSrotyr #8. `acquire_or_skip` in `cli/lib/push_lock.py`
previously caught only `filelock.Timeout`. Any `OSError` from
`FileLock.acquire` — read-only filesystem, permission denied on
`.claude/`, disk full, hardware I/O failure — propagated as an
unhandled traceback.
Two visible failure modes:
- SessionEnd hook: `|| true` in the wrapper swallowed the error, so
daily pushes silently never ran. Operator had no signal.
- Manual `agnes push`: ugly Python traceback dumped to the terminal
instead of a clean exit.
Now `OSError` is treated the same as `Timeout` — yield `None`, caller
returns cleanly with rc=0. The operator's environment in these
scenarios has bigger problems than missing session uploads, so we
swallow rather than retry-loop or surface a noisy warning.
Test: `test_push_silent_exit_when_filelock_raises_oserror` patches
the `FileLock` used inside `push_lock` to raise OSError on acquire,
verifies push exits 0 with no traceback and the queue is preserved
for the next attempt.
* Address remaining S2 items from PR-242 review
Four items from ZdenekSrotyr's S2 list:
S2.10 — `_install_statusline` truthy check (cli/lib/hooks.py): replace
`if existing:` with explicit `if existing is None or existing == "":`.
Documents and tests the behavior for both edge cases (explicit-null
and empty-string `statusLine`) — both treated as "not configured"
rather than "explicit user opt-out", so we install ours. Two new
tests in test_lib_hooks.py pin the contract.
S2.6 — onboarding docs for /agnes-private. New "Private sessions"
subsection in `config/claude_md_template.txt` (next to Data Sync)
covering the slash command, statusbar indicator, and audit-log
location. One-line tip in `app/web/setup_instructions.py` so the
feature is discoverable at onboarding.
S2.9 — e2e privacy test (tests/test_e2e_privacy.py). Wires
capture_session → mark_private → push against a recording fake
api_post and asserts zero session uploads for the marked one.
Three cases: mark-before-capture (queue write skipped),
mark-after-capture (push-side filter catches it + audit-logs),
control (unmarked sessions upload normally).
David #8 — `--legacy-scan` help text now documents the
private-list gap (legacy entries carry empty session_id, so
the filter is not consulted). The practical impact is bounded —
pre-queue sessions cannot have been marked private since the
private list is a queue-era feature — but the disclaimer in the
help text means an operator running a backfill is not surprised.
68 targeted tests green (3 new e2e + 2 new truthy edge tests +
existing). 2 pre-existing Windows path-separator failures in
test_smoke_test_detects_version_mismatch[uv|pip] unchanged.
Remaining S2 items (statusline mkdir push-back, capture-session
silent-fail follow-up) handled in PR comment + follow-up issue
respectively.
* Address remaining S2 follow-ups (David #8, S2.7, David #11)
Three items left over from Mina's bbf63472 batch — that commit
addressed S2.6/S2.9/S2.10 + documented David #8 in help text but
deferred the actual implementations of S2.7, David #11, and the real
David #8 fix to follow-ups. This commit closes them.
David #8 — `agnes push --legacy-scan` now consults the private list.
Claude Code names jsonls `<session-id>.jsonl`, so the file stem IS
the session id; the legacy-scan path can apply the same private filter
the queue path uses. Both the dry-run and live-upload code paths fixed.
Help text updated (no longer warns the filter is bypassed). Two new
tests in test_cli_push.py cover the upload-skip path + the dry-run
`would_skip_private` segregation.
S2.7 — `statusline`/`is_private` no longer mkdir-pollutes arbitrary
workdirs. Split `_claude_dir` into `_claude_dir_writable` (used only
from `add_private`) and `_claude_dir_readonly` (no mkdir). The
read-only public helpers (`private_list_path`, `read_all_private`,
`is_private`) compose the no-mkdir variant by default; `add_private`
opts in via `writable=True`. Added a process-local mtime-keyed cache
around `read_all_private` so in-process callers (push doing one stat
per upload candidate, future `agnes diagnose`) don't re-parse the
file on every check. Cache eviction on `add_private` so a sub-second
write+read sequence doesn't see stale data even on coarse-mtime
filesystems. Two new tests pin the no-mkdir contract + the
in-same-second add+read consistency.
David #11 — `agnes capture-session` writes a breadcrumb log on every
invocation. New `<workspace>/.claude/agnes-capture-session.log` TSV:
`<iso_ts>\t<outcome>\t<detail>` where outcome covers every silent-
exit path (`ok`, `private_skip`, `empty_stdin`, `bad_json`,
`not_object`, `no_transcript_path`, `stdin_read_error`,
`write_error`). Gives operators a signal to detect "hook fires but
queue stays empty" — without it, an upstream Claude Code stdin-
contract change is invisible because the hook always exits 0. Log
rolls at 256 KiB so it doesn't grow unbounded on long-lived
workspaces. Best-effort: a breadcrumb-write failure is itself
swallowed so the hook contract stays "exit 0 always". Skipped in
non-Agnes workdirs (no `.claude/` exists) so opening Claude Code
in `~/` doesn't pollute it. Five new tests in test_capture_session.py
cover the success / bad_json / no_transcript_path / private_skip /
no-pollute paths.
115 targeted tests green (test_cli_push, test_capture_session,
test_private_list, test_session_queue, test_e2e_privacy,
test_lib_hooks, test_statusline, test_mark_private).
---------
Co-authored-by: Minas Arustamyan <arustamyan.minas@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
* System plugin tier with mark/unmark fanout (schema v39)
Adds a mandatory plugin tier so admins can pin a small set of curated
plugins into every user's stack from day one. Marking a plugin via the
new toggle on /admin/marketplaces materializes resource_grants for every
group and user_plugin_optouts subscriptions for every user, so the
existing resolver pulls the plugin into every served set without a new
filter layer. Hooks on user-create (Google OAuth, magic-link, admin
POST, scheduler) and group-create propagate the same materialization to
new principals. UI locks: /admin/access disables the checkbox with a
SYSTEM pill; /marketplace cards swap the "In stack" green pill for an
amber "Required" badge with shield icon; the plugin detail install
button reads "Required by your org"; /my-ai-stack toggle is disabled.
Bypass paths return 409 (DELETE /api/admin/grants for system grants,
PUT /api/my-stack/curated/.../{enabled:false}, DELETE
/api/marketplace/curated/.../install). Unmark only flips the flag —
materialized rows persist so admins curate cleanup at their leisure
through the now-unlocked /admin/access checkboxes.
* Marketplace UX polish + drop legacy /store and /my-ai-stack pages
Two-part cleanup post-v39:
(1) Page deletion. /store and /my-ai-stack were already replaced by
/marketplace?tab=flea and /marketplace?tab=my respectively, but the
standalone routes lingered. Hard delete in dev mode — no redirects,
stale bookmarks 404. The /store/new upload wizard, the flea
detail/edit pages, the admin queue, and all /api/store/* +
/api/my-stack endpoints (CLI consumers) stay. Internal hardcoded
hrefs in the upload wizard's Cancel button and the advanced-setup
page repointed to the marketplace tabs.
(2) Detail-page install button rework. The single button that morphed
between "+ Add to my stack" and "✓ In your stack" did not
communicate uninstall affordance. The installed state now renders an
inline white status label *before* a separate red-bordered
"✕ Remove from stack" button on the same row, both at identical
height to avoid layout shift. System plugins keep their locked amber
"✓ Required by your org" pill (no Remove button — API refuses 409).
The post-action hint panel now fires on remove too with the title
flipped to "✓ Removed from your stack" — Claude Code needs the same
/update-agnes-plugins refresh either way.
Also: /admin/marketplaces Details modal "Mark as system" toggle
redesigned. The button was near-invisible (matched neutral row
metadata). It's now a balanced amber-toned chip with shield icon
and a structured confirm modal replacing the native confirm() dialog
that summarizes fanout consequences before commit.
* Move stack-hint inside hero with glass-on-gradient styling
The post-action hint card ("✓ Added to your stack" with the
/update-agnes-plugins recipe) used to live below the hero in
panel-what (gray card on white page body). Clicking add/remove
inserted/removed it between the hero and content, shifting the
panels below — a noticeable scroll jump.
The hint is now anchored inside the hero's top-right corner alongside
the install/remove buttons, both as flex children of an absolutely
positioned .actions container. The card uses a translucent
white-on-glass treatment that adopts the hero's kind color (blue for
plugin, green for skill, purple for agent) without per-kind branching.
Hero is always tall enough (160px photo) to contain the action+hint
stack without overflow, so toggling the hint visibility doesn't grow
the hero or shift body content.
The hero-head grid reserves a third 300px column for the absolute
actions overlay so meta gets the proper 1fr free space instead of
being squeezed by a padding-right hack. Responsive breakpoint at
1100px reflows the actions stack below hero-head when the viewport
isn't wide enough to keep meta + actions side-by-side comfortably.
* Add optional -DataPath bind mount to run-local-dev.ps1
When the operator wants to inspect DuckDB files (system.duckdb, extracts,
marketplaces, store/, …) directly from Windows Explorer, the named volume
inside the Docker Desktop WSL VM isn't reachable. The new -DataPath param
generates a transient compose override that rebinds /data on app, scheduler,
extract (and Caddy's /srv:ro mirror) to a Windows host folder.
Fully additive — when -DataPath is omitted everything behaves exactly as
before: no override file is generated, $composeFiles array is unchanged,
finally cleanup is a no-op. Existing positional invocations
(.\run-local-dev.ps1 up | down | logs) keep binding to $Action because
$DataPath is a named-only parameter with no Position attribute.
The override is written via [System.IO.File]::WriteAllText so the YAML is
BOM-less across PS 5.1 / 7+ — Compose rejects BOM-prefixed YAML on Windows.
The override file is unique per PID and removed in the script's finally
block so concurrent invocations and crashes don't leak files.
* factor mark_system fanout into UserCuratedSubscriptionsRepository
The endpoint imported UserCuratedSubscriptionsRepository, ignored it
(noqa: F841), then duplicated the user-side fanout SQL inline. Adds
fanout_system_for_plugin() symmetric to the existing
fanout_system_for_user() and routes mark_plugin_system through it —
removes the dead import + 14 lines of inline SQL, returns the same
`affected_users` delta count, no behavior change.
* drop customer-specific path from .ps1 example
Per CLAUDE.md vendor-agnostic OSS rule: replaced
C:\\Business\\Groupon\\Agnes\\agnes-data with the generic
C:\\Users\\<you>\\agnes-data placeholder so the docstring
example reads cleanly on any reviewer's box.
* release: 0.48.0 + parallelize Release-workflow pytest
Cuts the release shipped via #228#230#231#232#233#234#236#237#238#239#240 plus this PR (#241). Major changes:
- System plugin tier (schema v39) — admins mark a plugin mandatory; fans
out RBAC grants + subscriptions to every existing user/group plus
hooks for new principals
- BREAKING: removed standalone /store + /my-ai-stack page routes
(replaced by /marketplace?tab=flea + /marketplace?tab=my)
- Setup-prompt + bootstrap recovery fixes (#240)
- DuckDB CHECKPOINT-on-shutdown + 60s compose grace (#235)
- Marketplace + flea-market UX polish, agnes-metadata.json enrichment
Bonus: switch release.yml test step to `-n auto` (matches ci.yml).
Single-threaded was 15-20 min and frequently the bottleneck on PR
mergeability — now ~6 min.
---------
Co-authored-by: Minas Arustamyan <arustamyan.minas@gmail.com>
Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
* fix(duckdb): CHECKPOINT on shutdown + 60s compose grace to prevent WAL corruption
Default 10s stop_grace_period + missing CHECKPOINT on shutdown produced a
class of WAL-replay failures during agnes-auto-upgrade recreates. Sequence:
1. New image digest detected → docker compose up -d → SIGTERM to app
2. App's lifespan close_system_db() called .close() but never CHECKPOINT,
so any uncheckpointed ops stayed in system.duckdb.wal
3. Container didn't exit within 10s → dockerd SIGKILL (verified in journal:
"Container failed to exit within 10s of signal 15 - using the force")
4. New container started with possibly-different DuckDB version, replay
hit "Failure while replaying WAL ... GetDefaultDatabase with no
default database set" assertion → 500 on every authed request
Observed on foundryai-dev-vrysanek 2026-05-05; recovered by removing the
WAL manually. _try_open_system_db already exists as a recovery net but
requires a system.duckdb.pre-migrate snapshot, which doesn't exist
outside migration windows.
Two-part prevention:
- src/db.py::close_system_db: execute CHECKPOINT before .close() so the
WAL is empty when the file is released. Best-effort (try/except), so a
locked or full-disk CHECKPOINT does not block close.
- docker-compose.yml: stop_grace_period: 60s on app + scheduler, gives
uvicorn + lifespan room to run shutdown handlers under load before
Docker's SIGKILL fires.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* log CHECKPOINT outcome on system DB close
Silent except: pass on both CHECKPOINT and close() left operators
without any signal when the WAL-flush safety net actually saved them
(or didn't). Add logger.warning on CHECKPOINT failure (operator-actionable
- recovery via _try_open_system_db kicks in next start) and debug-level
trace on success / close exception.
* drop customer-specific token + add CHANGELOG entries
Per CLAUDE.md vendor-agnostic OSS rule: nothing customer-specific in
shipped code/comments. Replaced "foundryai-dev-vrysanek 2026-05-05"
references in docker-compose.yml and src/db.py docstring with generic
"Docker image upgrade window where DuckDB versions differ" framing.
The original incident date + host live in the commit history / PR body,
not in the tree.
Adds CHANGELOG entries under Unreleased:
- Fixed: close_system_db CHECKPOINT-on-shutdown semantics + WAL-replay
failure mode the fix protects against
- Changed: docker-compose stop_grace_period 60s on app + scheduler
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
* Setup-prompt + bootstrap fixes from David's 2026-05-10 init report
Three issues from clean-machine bootstrap evidence:
1. `agnes refresh-marketplace --bootstrap` failed to recover when the
local clone existed but Claude Code's marketplace registry had lost
the `agnes` entry. Bootstrap path now parses
`claude plugin marketplace list`, re-runs
`claude plugin marketplace add ~/.agnes/marketplace` when missing,
and treats `add` failures as fatal (was warn-and-continue, root cause
of the cascade into "Marketplace 'agnes' not found" plugin install
errors).
2. Setup prompt now always emits the marketplace-registration block,
even when the operator has zero plugin grants. Pre-wires the
SessionStart hook so future admin grants land automatically without
re-running setup. Block copy adapts: empty list shows
"no plugins granted yet", populated list shows "install plugins".
3. Setup prompt registers the Atlassian Remote MCP server unattended
(`claude mcp add --transport sse atlassian
https://mcp.atlassian.com/v1/sse`). Hosted Remote MCP, OAuth handled
automatically by Claude Code on first use. Asana / GWS stay on the
/home connector cards (PAT/keychain flows don't fit unattended
bootstrap).
Confirm step nudges the user toward the /home connector cards for the
PAT-flow services. CLAUDE.md template renames the marketplace section
to "Agnes Marketplace" and documents that all plugins are addressed as
`<plugin>@agnes` regardless of upstream slug.
Layout: Confirm shifts from step 6/8 to step 9 across all variants
(preflight, marketplace, MCP all unconditional). Tests updated.
* Link Claude license options from /home install pane
Step-1 Claude install on /home pointed users to OAuth without
explaining what to do if they don't have a Pro/Max subscription. Add
a one-line follow-up link to the plan-tier section on /setup-advanced
(new `#claude-plan` anchor) so first-time users discover the
subscription tiers rather than bouncing on the OAuth screen.
* Add idempotent + no-TLS-bypass guardrails to /home connector prompts
The Asana / Google Workspace / Atlassian connector prompts on /home
already shipped a precheck step that short-circuits when the service
is already wired, but they didn't carry the same idempotency +
surface-errors-verbatim + don't-disable-TLS-verification guardrails
the bash bootstrap prompt has. Add a one-paragraph 'Ground rules'
block at the top of each prompt so a connector failure doesn't
tempt the model into bypass workarounds, matching the same posture
David's 2026-05-10 init report flagged for the bash flow.
* skip Source: lines in marketplace registry detector
`claude plugin marketplace list` prints a `Source: <local path>` line
under each registered marketplace; the local clone almost always lives
under a path containing the marketplace name itself
(`~/.agnes/marketplace`). A naive \\bagnes\\b match over the full
stdout therefore false-positives whenever ANY unrelated marketplace
sits under `~/.agnes-…/` or similar. Filter Source: lines out before
matching so the recovery path actually re-adds when needed instead of
silently falling through to a broken `marketplace update agnes`.
Adds regression test covering the substring-only case.
* drop customer-specific tokens from CHANGELOG entries
Per CLAUDE.md vendor-agnostic OSS rule ("nothing customer-specific
... in changelogs"):
- "agnes-vrysanek.groupondev.com" -> "a private-CA Agnes deployment"
- "Groupon Marketplace / groupon-marketplace" -> "<Org> Marketplace /
<org>-marketplace" (placeholder example)
- Removed "David flagged" attribution language; init-report context
stays intact, just stripped of the named host + brand
---------
Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
* feat(store): flea-market entity edit feature with version history (schema v38)
Owner + admin can now edit a store entity from a real Edit page at
/marketplace/flea/{id}/edit, replacing the prior "coming soon"
placeholder. Editable: display name, description, category, video
URL, cover photo, and an optional new bundle. Type is locked (400
type_locked). Display-name change renames the on-disk slug for both
live plugin/ and version dirs (reuses rename-on-archive helper).
Schema v38 (originally drafted as v37; renumbered after rebase onto
main where v37 was taken by the curated marketplace enrichment).
Versioning model:
* Each bundle update bakes into ${DATA_DIR}/store/<id>/versions/v<N+1>/plugin/
and runs the standard guardrails pipeline.
* DEFERRED PROMOTION: live plugin/ + entity.version_no stay at the
prior approved version through the LLM review window so existing
installers keep receiving the previously approved bundle. Live swap
+ version_no/version/file_size bump happen only on LLM approval.
Blocked verdicts leave the prior version serving forever.
* store_entities gains version_no INTEGER + version_history JSON.
Each version_history entry carries hash, sha256, size, submission_id,
created_at, created_by.
* Existing entities backfill to v1 with a single-entry history seeded
from the row's current `version` hash. Initial create also seeds
versions/v1/plugin/ so future restore can copy v1 bytes forward.
Concurrency:
* Block-while-pending: an in-flight LLM review blocks any further edit
with 409 prior_version_pending. Owner waits 5-30s; Edit button on
detail page renders disabled in the same window via the new
edit_in_flight flag (decoupled from quarantine_sub since the
deferred-promotion flow keeps visibility='approved').
Rollback:
* New endpoint POST /api/store/entities/{id}/versions/{n}/restore
(owner + admin). Copies vN bundle forward as v<max+1> and re-runs
guardrails (rules tighten over time; pre-approved bundles re-validate).
Forward-only history. Same deferred-promotion semantics — live stays
at prior version until LLM approves the restored copy.
UI:
* New /marketplace/flea/{id}/edit page (owner + admin gated).
* Versions card on plugin + item detail templates (owner/admin only)
via shared _flea_versions.html partial.
* Admin queue gains v# column with current badge + separate Hash
column. Submission detail surfaces Version + Bundle hash rows.
* Activity timeline split into per-submission + entity-wide cards;
entity-wide rows render vN chips when audit row params reference
a specific version.
* Section headers (Manifest / Static / Quality / LLM review) tag
with vN chip via shared macro.
* Reviewed-by-model field surfaces explanatory text per status.
* Banner upload-failure now redirects to detail page on
submission_blocked instead of staying stuck.
Tests: 24 in tests/test_store_entity_versions.py covering metadata-
only edit, bundle-edit version bump, type lock, block-while-pending,
name change disk rename, restore flow + 404/400/403 paths, edit page
404 for non-owner, versions card visibility gating, admin queue v#
column, admin detail Version/Hash rows, deferred-promotion installer
contract (pending review doesn't break installer / blocked verdict
keeps prior / approved promotes), admin can edit/restore non-owned,
restore deferred promotion, audit log per-version params. 214 tests
green across guardrails + edit + admin + repo + schema suites.
* docs(store): refresh update_entity docstring to match deferred-promotion + submission-status gate
Bring the docstring in sync with the actual fixes from the prior
commit. The pre-fix wording said the gate read
visibility_status='pending' AND submission status — under deferred
promotion that would never fire for v2+ edits. Now describes:
- Block-while-pending gates on submission.status DIRECTLY,
independent of visibility (so v2+ deferred-promotion edits don't
slip through).
- Display-name + bundle change defers the live rename to promotion;
metadata-only renames stay immediate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Move marketplace plugin updates from hook to /update-agnes-plugins skill
The SessionStart hook used to run `agnes refresh-marketplace --quiet`,
which performed a full fetch+reset+install cycle on every Claude Code
session start. That work was invisible to the user, slowed session
startup, and was unrecoverable interactively when something failed.
Split the responsibility:
- `agnes refresh-marketplace --check` is a new lightweight detector:
`git fetch` only, compares local HEAD with remote FETCH_HEAD, emits
a Claude Code hook JSON message pointing the user at
`/update-agnes-plugins` when the marketplace has changes. No reset,
no plugin install/update side effects.
- `/update-agnes-plugins` is a new slash command (installed by
`agnes init` into `<workspace>/.claude/commands/`) that runs
`agnes refresh-marketplace` (default chatty path). Output streams
into the Claude Code transcript so the user sees install/update
progress and can react to errors interactively.
- The SessionStart hook now runs `--check`. Existing workspaces
auto-upgrade on next `agnes init` (substring marker matches both
the old `--quiet` entry and the new `--check` one).
BREAKING: `agnes refresh-marketplace --quiet` is removed. Old hooks
calling it silent-noop after the CLI upgrade (the hook's `|| true`
swallows the unknown-flag error) until re-init rewrites them.
* Point marketplace 'Added to your stack' hint at /update-agnes-plugins
The post-install green panel on plugin and skill/agent detail pages
referenced the SessionStart auto-install path and a shell-prompt
`agnes refresh-marketplace` invocation. With the hook now being
detect-only, that copy was misleading — the actual install path is
the new slash command.
Condensed to a single instruction: "Open a new Claude Code session
and run:" followed by `/update-agnes-plugins` in a copy-chip.
JS clipboard string updated to match.
---------
Co-authored-by: Minas Arustamyan <arustamyan.minas@gmail.com>
The marketplace step (step 5) emitted `git config --global
http.<host>/.sslVerify false` on AGNES_DEBUG_AUTH=1 instances when no
ca_pem was readable from AGNES_TLS_FULLCHAIN_PATH. Two problems:
1. Claude Code auto-mode classifiers ("do not disable TLS verification"
rule) block the line, breaking hands-free setup.
2. It silently masked operator misconfiguration — a debug-auth instance
without a fullchain on disk fell through to a TLS-disabled clone
instead of surfacing the missing cert.
After the cross-platform trust block (#137), self-signed and private-CA
servers are fully covered by step 0 reading the fullchain via
_read_agnes_ca_pem; publicly-trusted certs need no bootstrap at all.
The legacy fallback no longer covers a real scenario — verified by
running step 5 without sslVerify=false against a self-signed instance.
BREAKING: drops `self_signed_tls` parameter from
app.web.setup_instructions.resolve_lines and render_setup_instructions
(only consumed by the deleted block). The AGNES_DEBUG_AUTH env var
itself is unchanged — still gates /api/me_debug and the dropdown link.
Co-authored-by: Minas Arustamyan <arustamyan.minas@gmail.com>
* Curated marketplace enrichment via agnes-metadata.json + curator metadata
Adds a second well-defined metadata file `.claude-plugin/agnes-metadata.json`
that upstream marketplace repos can opt into, providing per-plugin (and
per-skill / per-agent) cover photo, demo video URL, doc links, and
category override. The Claude Code marketplace contract is untouched —
agnes-metadata.json + the convention `.agnes/` directory are stripped
from the synthetic Claude Code marketplace served via /marketplace.zip
and /marketplace.git/*, so user instances see a clean Claude Code repo
with no Agnes-only metadata.
Highlights:
- DB schema v32 — adds curator_name + curator_email on marketplace_registry,
cover_photo_url + video_url + doc_links on marketplace_plugins.
- Mandatory curator at marketplace registration, editable later through
the admin UI; surfaces on cards + detail pages in place of owner_todo.
- External-asset mirror cache at ${DATA_DIR}/marketplace-cache/<slug>/
with conditional GET, 60s timeout, 10 MB body cap, SSRF guards, and
Wikipedia-policy-compliant User-Agent.
- Strict drop semantics — anything Agnes can't deliver as a real PDF /
Markdown / plain text doc, or a real PNG / JPEG / WebP cover, is
dropped from the served metadata; UI looks identical to no-entry case
(gradient placeholder for missing covers, no row in the doc list).
- Doc allowlist + image allowlist enforced on both the curated mirror
flow and the Flea upload flow (/store/new); shared module
src/marketplace_assets.py.
- New /api/marketplace/curated/{mp}/{plugin}/{asset,doc,mirrored}/...
endpoints with path-traversal guards + RBAC + Content-Disposition
attachment for docs.
- Curator-focused format guide at /marketplace/format-guide; canonical
source is docs/curated-marketplace-format.md, also linked from the
admin /admin/marketplaces page next to + Add Marketplace.
See CHANGELOG.md under [Unreleased] for the full breakdown.
* Fix format-guide test assertion to match shortened disclaimer
The 'Flea Market' phrase was trimmed out of the disclaimer in
docs/curated-marketplace-format.md after the curator-focused rewrite.
Update the rendered-HTML test to assert the channel-scoping phrase
that's actually present ('Curated Marketplace channel only') rather
than the 'Flea Market' contrast that's no longer in the doc.
* Drop unused 'version' field from agnes-metadata.json schema
The parser never read it; it was a YAGNI placeholder for future
schema evolution. Curators don't need to wonder what to put there
when adding the file for the first time. Will be re-added if and
when we actually introduce a backwards-incompatible schema change.
* Harden asset mirror against SSRF via redirect + DNS rebinding
The pre-flight _is_safe_url check validated only the initial URL;
urllib.request.urlopen then followed redirects and re-resolved DNS for
the actual connection — both bypassable. Attacker-controlled origin
could 302 to http://169.254.169.254/... and exfil cloud metadata;
attacker-controlled DNS could return public IP first / 127.0.0.1 second.
Replace urlopen call with a shared OpenerDirector wired through three
custom handlers: _SafeRedirectHandler re-runs SSRF allowlist on every
redirect Location (max 5 hops, down from urllib's 10), and
_PinnedHTTPHandler / _PinnedHTTPSHandler connect to the IP that passed
validation rather than re-resolving the hostname. TLS SNI + cert verify
stay bound to the original hostname.
_resolve_safe returns the validated IP (the existing _is_safe_url
2-tuple wrapper stays for backwards compatibility) and rejects round-
robin DNS that mixes a public + private record. _UnsafeRedirectError
is a typed exception so _fetch_url can map redirect blocks to terminal
'rejected' status (not transient 'failed'). _http_open is the single
call site so tests can mock at one well-defined seam.
Tests cover redirect blocking (link-local, loopback), redirect-error
unwrapping inside URLError, pinned-IP connection target, and the
end-to-end DNS-rebinding scenario. Existing tests that mocked
urllib.request.urlopen are migrated to mock _http_open.
* Harden /asset/ endpoint against stored XSS
The endpoint served any file in the cloned marketplace repo with
stdlib-detected Content-Type, so a curator who landed evil.html (or a
renamed evil.png carrying HTML bytes) in the working tree got a
same-origin XSS — the response shares cookie scope with /admin and
/api/me/*.
The asset endpoint is image-only by contract (cover photos referenced
from agnes-metadata.json + inner skill / agent cards), so applying the
same allowlist + magic-bytes pattern that /doc/ already uses closes
the gap without breaking any legitimate use case. Three layered
checks: extension in IMAGE_EXTENSIONS (.png/.jpg/.jpeg/.webp; SVG
excluded — <script> inside SVG executes), validate_image_file magic
bytes (defeats rename-extension attack), Content-Type pinned from the
validated extension (never stdlib mimetypes).
Defense-in-depth: X-Content-Type-Options: nosniff stops browser MIME
sniffing; Content-Security-Policy: default-src 'none' blocks script /
iframe execution even if a future regression let HTML through.
Tests cover the .html extension reject, the renamed-HTML-as-PNG magic-
bytes reject, the .svg reject, and the happy-path PNG with security
headers attached. The pre-existing path-traversal test seeds a real
PNG instead of ok.txt now that the endpoint is image-only.
* Enforce mandatory curator on marketplace PATCH
The POST handler enforced curator_name + curator_email at create time,
but PATCH treated empty / missing curator inputs as 'no change'. Legacy
rows that pre-date v32 (curator_name=NULL) could be edited indefinitely
without ever filling the curator gap, and OWNER_TODO_PLACEHOLDER lingered
on every /marketplace card.
Reject the PATCH with 400 when the post-merge row would persist with
empty curator. The check fires after the existing field-merge logic, so
once-filled rows that don't touch curator still pass through (their
existing values fall through from the DB row). DB column stays nullable
so untouched legacy rows continue to coexist — the gate fires only the
moment an admin opens the edit modal.
Existing PATCH semantics preserved: empty-string input still means 'leave
existing value alone', and once-filled curator can't be cleared (those
test cases pass unchanged). New test seeds a legacy row directly via the
repository, then exercises url-only PATCH (rejected), partial-fill PATCH
(rejected), and full-fill PATCH (succeeds); a follow-up no-curator PATCH
on the now-formed row also passes.
* Drop unused curated-marketplace helpers (PR #234 review)
* build_db_payload — imported by src/marketplace.py but never called.
The strict-drop semantics it would have implemented were re-written
inline in _refresh_plugin_cache (see the comment block there). The
standalone helper still carried the old fall-back-to-original-external-
URL-on-mirror-failure behaviour, which contradicts the documented
drop-when-can't-deliver contract — a future contributor who re-wired
it would have introduced a silent regression. Delete with the helper
+ the import + the comment that referenced it.
* _resolve_marketplace_name — one-line shim with no remaining call
sites. Callers use _resolve_marketplace_meta which returns name +
curator together, avoiding the double DB hit the shim exists to
hide.
* '# noqa: F401 Optional kept for forward-compat' was wrong — Optional
IS used in src/marketplace.py (line 70 and line 238). Drop the noqa
comment so a future ruff run doesn't try to remove a real import.
Removing build_db_payload also drops the only remaining use of Optional
in src/marketplace_metadata.py, so the import comes out there too.
* Cap agnes-metadata.json size + catch RecursionError on parse
The reader is invoked once per marketplace per sync and the file is
curator-controlled. Two failure modes were unguarded:
* Multi-GB JSON: path.read_text() pulled the whole file into memory
before json.loads even ran. A curator with commit access to an
upstream repo could OOM the sync worker.
* Deeply-nested JSON under any size cap: cpython's recursive object /
array parser raises RecursionError at ~1000 levels of depth.
RecursionError is a RuntimeError, not ValueError, so the existing
catch let it propagate up and abort the entire sync — every other
marketplace in the same pass got skipped.
Add AGNES_METADATA_MAX_BYTES = 1 MiB (a real metadata file with covers,
docs, categories for ~50 plugins fits in <100 KB so the cap is
generous) and gate the size check on path.stat().st_size before the
body read. Broaden the parse except to (ValueError, RecursionError)
with a unified log line. Both failure modes degrade to the same
empty-dict fall-back the malformed-JSON path already used, so one bad
upstream never aborts the rest of the sync.
Tests cover the size cap firing before json.loads (whitespace-padded
valid JSON exceeding the cap) and the recursion path (5000 nested
arrays — past cpython's default recursion limit but well under the
size cap).
* Persist asset-mirror manifest per body write, before unlink
sync_assets wrote each body atomically (tmp + rename) but persisted
the manifest only at the end of the batch. A kill -9 mid-Phase 2 left
on-disk files the manifest never referenced. Once a curator dropped
that URL from agnes-metadata.json, Phase 3's cleanup had no record of
the file and the orphan stayed forever — there's no GC pass walking
the cache dir today, so disk would slowly bloat.
Phase 2 (body-write iteration): after the in-memory manifest mutation,
persist BEFORE unlinking the previous body. The crash window narrows
from 'all of Phase 2' to 'between persist and unlink' (microseconds).
A persist failure mid-batch keeps the previous body on disk — the on-
disk manifest still references it, and a stale-but-existing file beats
a 404. Cost: one extra tmp+rename per body write; manifest is a few KB
so the overhead is negligible vs. the HTTP fetches.
Phase 3 (curator-removed URLs): same discipline. Collect the to-delete
relpaths, persist the manifest with the entries already gone, THEN
unlink. A crash mid-cleanup leaves at most a microsecond window where
files exist despite the manifest no longer naming them. The next sync
reads the (correct) manifest and the orphan stays orphaned, but the
served state is consistent.
Tests cover per-body persist call count, the post-update on-disk
manifest content, and Phase 3 ordering verified by reading the on-disk
manifest from inside Path.unlink.
* Consolidate marketplace video embeds + format-guide CSS
The YouTube nocookie / Vimeo / <video> / link-fallback detection logic
was duplicated verbatim in marketplace_plugin_detail.html and
marketplace_item_detail.html (~40 JS lines each, with subtly-different
inline styles). Both templates now {% include %} a single
_marketplace_video_embed.html partial inside their IIFE so the regex,
the nocookie attribute set, and the unknown-host link fallback live in
ONE place — future tweaks (new host, new attribute, fixed sandbox flag)
no longer need to be applied twice in lockstep.
The .video-wrap selectors (one inline <style> rule in plugin_detail,
one inline style='...' attribute in item_detail) are replaced by the
existing .video-embed 16:9 wrapper in style-custom.css, with new
.video-embed video / .video-embed a child rules added so the wrapper
handles all four embed shapes uniformly without per-template
positioning.
The 60-line inline <style> block in marketplace_format_guide.html
moves verbatim to style-custom.css under a new 'Marketplace format
guide page' section, scoped to .format-guide so other pages aren't
affected.
No user-visible behaviour change: the rendered HTML for valid
YouTube / Vimeo / mp4 / external links is byte-identical to before,
and the format-guide page renders the same.
* Maintainability cleanup batch (PR #234 review)
#10: drop _path_under from app/api/marketplace.py — it was a byte-
equivalent clone of _safe_join (same Path.resolve(strict=True) +
relative_to() containment check). The three v32 endpoint handlers
(/asset, /doc, /mirrored) now share the existing helper.
#14: rename src/marketplace_assets.py → src/marketplace_asset_validation.py
so the file's purpose is obvious from the name and the previous
overlap with src/marketplace_asset_mirror.py is gone. Six call-site
imports updated in lockstep; CHANGELOG references under [Unreleased]
updated to track the new path.
#11: consolidate the URL builders that resolve
/api/marketplace/curated/<slug>/<plugin>/{asset,doc,mirrored}/...
paths. _internal_asset_url / _internal_doc_url / _mirrored_asset_url
lived in src/marketplace.py, while a copy named _mirrored_url lived
in app/api/marketplace.py with a 'must stay aligned' comment. New
module src/marketplace_urls.py is the single source of truth — both
call sites import from it and a future URL-format tweak only needs
to change one file. The _ROUTE_PREFIX constant collapses the per-
function f-string repetition. The route-handler endpoints themselves
still own the path string literals (keeping the builders identical
to the route declarations remains a checklist item, not a runtime
guarantee).
* Re-key asset-mirror manifest by (plugin, url) + dedup HTTP fetches
The manifest used to be keyed by URL alone, so two plugins in the
same marketplace referencing the same external image (a shared CDN
icon, a common cover) collided on entry.plugin_name — last writer
won. The DB row for the losing plugin then stored a served URL
pointing under the winning plugin's tree, and require_resource_access
denied legitimate access on one side and let the other plugin's user
reach the wrong asset.
In-memory: Dict[Tuple[str, str], MirrorEntry] keyed (plugin_name, url).
On disk: format flips from {url: entry} dict to [entry, ...] list of
self-describing entries (each carries plugin_name + url + the
previous fields). JSON keys can't be tuples; encoding 'plugin::url'
would just shift the parsing burden.
Phase 1 of sync_assets deduplicates fetches by URL — three plugins
sharing one URL share one HTTP request. The conditional-GET prior is
picked from any owning plugin's prior entry; if their etags diverge
(rare) we miss one 304 and pay for a full re-download instead.
Phase 2 still creates a per-(plugin, url) manifest entry pointing
under the plugin's own subdir, and Phase 3 cleanup is keyed the same
way so dropping a URL from one plugin's metadata doesn't disturb
another plugin still referencing it.
Body files stay per plugin (RBAC-clean isolation: deleting plugin A's
cache can't strand plugin B). Bandwidth saved by fetch dedup.
Consumer code re-keyed: src.marketplace._refresh_plugin_cache rebuilt
served_url_for / mirror_status as composite-keyed maps;
app.api.marketplace._resolve_external_via_mirror /
_curated_inner_cover / _curated_inner_enrichment look up by
(plugin_name, url).
Tests cover per-plugin manifest entries with shared URL, the single
HTTP fetch for N plugins, and Phase 3 drop-one-keep-other. All
existing tests migrated to composite key access; v2 list format
assertions verify on-disk shape.
* Migrate asset mirror from urllib.request to httpx
The asset mirror was the only HTTP call site in Agnes still using
urllib.request; every other module (CLI, Jira / OpenMetadata / OpenAI
connectors, scheduler, Telegram bot) already used httpx. The asset
mirror was added in this PR's base commit, so this is the only chance
to bring it into convention before someone copies it as 'the pattern
for HTTP fetches in Agnes'.
Three concrete benefits beyond consistency:
* SSRF defence collapses from five urllib classes
(_PinnedHTTPConnection, _PinnedHTTPSConnection, _PinnedHTTPHandler,
_PinnedHTTPSHandler, _SafeRedirectHandler) into one
_SSRFGuardTransport. httpx invokes handle_request() on every redirect
hop, so re-validation is free — we don't need a custom redirect
handler at all.
* DNS-rebinding defence: the transport rewrites request.url.host to the
SSRF-validated IP before delegating to super().handle_request().
httpcore connects to whatever URL.host says, so this pins the
connection without subclassing HTTPSConnection. The original hostname
goes into the Host header + the sni_hostname extension so TLS / vhost
routing still bind to the curator-supplied hostname.
* Error handling: one httpx.HTTPError catch-all for transport errors,
plus specific httpx.TimeoutException / httpx.TooManyRedirects branches
for clearer diagnostics. Matches the _translate_transport_error shape
in cli/client.py.
The shared httpx.Client is built lazily at module load (same pattern as
cli/client.py:_get_shared_client) with follow_redirects=True,
max_redirects=5, timeout=HTTP_TIMEOUT_SEC, and our custom transport.
Externally observable behaviour is unchanged: same FetchOutcome
statuses, same manifest format, same conditional GET semantics, same
body-size cap.
Tests migrated from urllib-shaped fakes to httpx-shaped (status_code,
iter_bytes, context manager). Five urllib-specific tests replaced with
httpx equivalents — three transport unit tests + one DNS-rebinding
integration test that verifies host rewrite via monkey-patched
super().handle_request. One test deleted without replacement
(unwrap-URLError-wrapping-an-_UnsafeRedirectError — urllib-specific,
not applicable to httpx).
* Surface curated agnes-metadata enrichment on My Stack tab
GET /api/marketplace/items?tab=my built each curated row from the
on-disk marketplace.json by way of resolve_allowed_plugins, which
doesn't carry the agnes-metadata enrichment columns
(cover_photo_url, video_url, category override, doc_links). The
handler then hard-coded cover_photo_url=None on the synthetic row.
Result: once a user clicked '+ Add to my stack' on a curated card,
the same plugin in tab=my rendered with the gradient placeholder
instead of its cover photo — confusing parity break vs. the curated
tab where the same row goes through MarketplacePluginsRepository
and gets the enriched columns.
Pre-load the enriched marketplace_plugins rows for every marketplace
the user is subscribed to, then look each granted+subscribed plugin
up by (marketplace_id, plugin_name). Fall back to the on-disk
synthetic shape only when the DB row is missing — happens during
the rare race where RBAC is granted before the first sync cycle
ingests the plugin. RBAC gating (granted set from
resolve_allowed_plugins) is unchanged so this fix can't widen
visibility; it just upgrades the data shape behind cards the user
was already going to see.
Per-marketplace list_for_marketplace beats N gets — typical user is
subscribed to <5 marketplaces, so this is at most a handful of
queries vs. one per subscribed plugin.
Regression test seeds a plugin with cover_photo_url + category
override, subscribes the user, hits /api/marketplace/items?tab=my,
and asserts photo_url + category come through. The misleading
'fall through to gradient until the user re-visits the curated tab'
comment is gone.
---------
Co-authored-by: Minas Arustamyan <arustamyan.minas@gmail.com>
The original list-form _V34_TO_V35_MIGRATIONS ran four ALTER
statements in sequence:
ADD _vis_v35 → UPDATE _vis_v35 = visibility_status →
DROP visibility_status → RENAME _vis_v35 TO visibility_status
If the RENAME failed for any reason after the DROP succeeded — DuckDB
lock contention at startup, scheduler-vs-app race opening
system.duckdb, container kill mid-migration, etc. — the DB was
stranded with _vis_v35 populated and visibility_status missing. The
schema_version row never bumped because the UPDATE at the bottom of
the migration ladder runs only when every step succeeded. Subsequent
restarts then hit DROP visibility_status again with no IF EXISTS
guard and looped on the same error; the only recovery was hand-
editing the DB.
Replace the list with a Python function _v34_to_v35_migrate that
inspects the table's columns up front and dispatches into one of
three paths:
* clean v34 (visibility_status present, _vis_v35 absent) — run the
full rebuild
* partial v35 (_vis_v35 present, visibility_status absent) — finish
the RENAME alone, data is already in _vis_v35 from the prior
UPDATE
* both columns present (rare; aborted before DROP) — drop the temp
and keep visibility_status
The audit columns (archived_at, archived_by) ship first behind
IF NOT EXISTS so they're safe in all states. Operators stranded by
the original bug now recover automatically on next startup.
Tests cover the three direct paths plus an end-to-end scenario where
_ensure_schema walks a schema_version=32 DB with the half-applied
state up through to v36.
Co-authored-by: Minas Arustamyan <arustamyan.minas@gmail.com>
* feat(store): flea-market upload guardrails + soft delete + JOIN-based admin queue
Adds an end-to-end guardrails pipeline for store uploads (manifest +
static-security + LLM review), persists blocked bundles for forensics,
introduces soft-delete (Archive) semantics, consolidates the legacy
/store/{id} surface into /marketplace/flea/{id}, and reworks the admin
queue so lifecycle filters read live entity visibility via LEFT JOIN
rather than a denormalized submission column.
Schema v29 → v35:
* v29 store_submissions table + store_entities.visibility_status
* v30 file_size, bundle_sha256, bundle_purged_at on submissions
* v31 reshape store_submissions (drop legacy unique on entity_id)
* v32 store_entities.archived_at/by + 'archived' visibility value
* v33 drop store_submissions.retry_count (unused)
* v34 ensure idx_store_submissions_entity exists post column-drop
* v35 broaden visibility_status enum + JOIN architecture cutover
Pipeline (src/store_guardrails/):
* Inline checks: manifest_check, static_scan, quality_check
* LLM review configurable haiku|sonnet|opus (default haiku)
* BackgroundTasks-driven async path with structured-output JSON
* Per-submitter daily quota (default 50)
* 30-day TTL purge job (POST /api/admin/run-blocked-purge)
* Bundle SHA256 + size persisted; sha256 survives purge for forensics
Visibility model:
* pending | approved | hidden | archived
* _enforce_visibility returns 404 (no leak) for non-owner non-admin
* Owner sees own non-approved entries via include_owner_id widening
* Install refused with 409 entity_not_approved when not approved
Soft-delete (DELETE /api/store/entities/{id}):
* Default = soft (visibility_status='archived'); existing installs
keep getting served the bundle so users don't lose the plugin
* ?hard=true admin-only: drops bundle + cascades user_store_installs
* Hard-delete preserves entity_id on submission as tombstone so
audit_log linkage survives for the activity timeline
Admin queue lifecycle (the JOIN refactor):
* Verdict (store_submissions.status) is immutable forensic record
* Lifecycle (store_entities.visibility_status) is live state
* /admin/store/submissions Archived chip translates to
`e.visibility_status='archived'` via LEFT JOIN — any path that
flips visibility surfaces in the queue immediately
* Detail page renders Status (verdict) and Entity lifecycle side by
side so admins see "approved at review, now archived" at a glance
URL consolidation:
* /store/{id} deleted (no redirect, stale bookmarks 404)
* /marketplace/flea/{id} is the canonical detail surface
* Three in-tree callers (upload-success, my-stack card, store
listing card) updated to point at the new URL
* Quarantine banner extracted to _quarantine_banner.html partial,
self-guarded, included from both flea detail templates
* Banner JS auto-refreshes when the verdict lands by polling
/api/marketplace/flea/{id}/detail (visibility_status +
submission_status — the latter is needed because blocked_llm
keeps the entity at visibility_status='pending')
Audit log resource format:
* runner.py emits prefixed `store_submission:{id}` (post-fix)
* Detail-page timeline query handles three patterns: prefixed
submission, helper-emitted `store_entity:{sub_id}`, and bare-id
legacy rows — all surface in the activity timeline
UX fixes:
* Owner sees Under review / Quarantined / Hidden banner with status
* Install button gray-disabled (not blue) when non-approved
* Owner cannot delete quarantined entries (403); admin can
* Admin queue: filter chips, sortable columns, paging, page-size
* Auto-refresh queue every 5s while pending rows are visible
* Store upload page file picker no longer opens twice (label →
input default action collided with explicit JS handler)
Tests: 168 passed across the guardrails suites (admin submissions,
store API, inline / LLM / purge guardrails, store repositories,
marketplace filter, schema version). New regression coverage
includes: archive surfaces via JOIN even when API path is bypassed;
deleted submission renders activity timeline (tombstone); flea
detail surfaces submission_status only for owner/admin; detail page
renders Entity lifecycle row; audit log resource format covers both
helper and runner paths.
* fix(store-guardrails): PR #233 follow-up — prompt injection, atomic PUT, BG race, schema, reaper, sort whitelist
Addresses 9 of the 23 findings from the PR #233 review (spec at
docs/superpowers/specs/2026-05-09-pr233-guardrails-fixes-spec.md).
Merge-gate items #1-#6 plus high-value mediums #7, #9-#12, #23.
Architectural items (#8 enum split, #14 factory) and pure
maintainability (#15-#22) deferred to follow-ups.
Security:
* #1 prompt injection — SYSTEM_PROMPT now passed via the SDK's
dedicated system= parameter; bundle wrapped in <bundle>...</bundle>
sentinels declared data-only by the system prompt; literal
sentinel strings in user content are escaped so an adversarial
README can't forge a close tag.
* #6 static scan honesty — module docstring + admin copy + docs
declare static scan as signal not gate; .md/.txt/.rst/.html/.json/
.yaml/.yml/.toml skipped to avoid false positives on prose.
AST mode for Python deferred (separate flag, FP comparison work).
Correctness:
* #2 PUT atomicity — bundles bake into plugin.staging-<rand>/
alongside live, atomic-rename on success; failed checks leave
live tree byte-for-byte intact.
* #3 BG-task race — set_visibility_if_pending guards verdict flips
to the (pending, hidden) review window; admin archives during
review survive; skipped flips audit-logged.
* #4 v35 NOT NULL/DEFAULT — schema v35→v36 re-applies them on
store_entities.visibility_status. CHECK constraint enforced
application-side (DuckDB ADD CHECK on existing column unsupported).
* #7 stuck-review reaper — reap_stuck_llm_reviews flips pending_llm
rows older than guardrails.stuck_review_grace_seconds (default
1800) to review_error. Scheduler runs every 15 min via new
/api/admin/run-reap-stuck-reviews. Set knob to 0 to disable.
* #9 quota counter — count_blocked_for_submitter_since now counts
blocked_inline + blocked_llm + review_error so a submitter
triggering only LLM-blocked verdicts is bounded.
* #10 missing risk_level — surfaces as review_error with
error='missing_risk_level' instead of silently defaulting to
'medium' (which looked like a model-decided block).
* #11 archived_at clear — set_visibility nulls archived_at +
archived_by when transitioning out of 'archived' so a future
read doesn't show stale archive forensics on an approved row.
Maintainability:
* #12 FSM doc comment — accurate insert/transition/lifecycle
description in src/db.py near store_submissions schema.
* #23 sort-key whitelist — admin queue rejects unknown sort keys
with 400 invalid_sort_key; substring-replace footgun removed.
Deferred (separate PRs):
* #5 quota race — proper fix requires asyncio.Lock spanning the
full pipeline; threading.Lock blocks event loop, DuckDB MVCC
doesn't help. API-level slowapi bounds worst case for now.
* #6 part 3 (AST static scan), #8 (enum split), #13 (import
bundle docs), #14 (factory consolidation), #15-#22 (maint).
Tests:
* New: tests/test_store_guardrails_prompt_injection.py (corpus +
trust-boundary invariants), tests/test_store_put_atomic.py,
tests/test_store_guardrails_reaper.py.
* Extended: test_store_guardrails_llm.py (system param, missing
risk_level, BG race), test_admin_store_submissions.py (quota
counter widening, sort whitelist 400), test_store_repositories.py
(un-archive metadata clear), test_db_schema_version.py (v36).
* Full suite: 3738 passed; 17 pre-existing baseline failures
unchanged (db migration tests, cli binary rename, catalog export,
user mgmt v5 backfill — confirmed by stash + rerun on clean tree).
* Extract session pipeline framework, refactor verification, add UsageProcessor skeleton
Pluggable framework under services/session_pipeline/ (contract + lib + per-processor
runner) so multiple processors can read /data/user_sessions/<key>/*.jsonl on their
own cadence with full failure isolation. Verification flow becomes the first plugin;
a no-op UsageProcessor reserves the second slot pending a separate brainstorm on
extraction logic + storage shape.
Schema v28→v29: rename session_extraction_state → session_processor_state with
composite PK (processor_name, session_file). Existing rows copied over with
processor_name='verification'; legacy table dropped. Migration is idempotent and
no-ops the copy step on fresh installs that came up at the new schema.
Endpoint: /api/admin/run-verification-detector replaced by parametrized
/api/admin/run-session-processor?processor=<name>. Audit action format follows.
Scheduler JOBS: verification-detector entry split into session-processor:verification
+ session-processor:usage. SCHEDULER_VERIFICATION_DETECTOR_INTERVAL retained for
operator compatibility (drives both cadence and health-check grace window);
SCHEDULER_USAGE_PROCESSOR_INTERVAL added.
* Address PR #232 review: scan dead branch + per-processor lock
- `SessionProcessorStateRepository.scan_unprocessed_for` dead else: both
branches surfaced every jsonl, the SELECT was unused, runner MD5-rehashed
every stable session per tick. Replaced with an mtime precheck — stable
sessions (mtime <= processed_at) are filtered at scan; modified files
still surface for the runner's authoritative `file_hash` invalidation.
Naive-local comparison matches the existing health-check idiom (DuckDB
TIMESTAMP strips tz on storage).
- Per-processor advisory lock around `_run_processor` in
`/api/admin/run-session-processor`. Scheduler tick + manual admin POST
could otherwise both run, both call create_evidence on overlapping
detections, and accumulate duplicate verification_evidence rows (the
dedup short-circuit only covers create+contradiction, not evidence per
ADR Decision 3). Non-blocking acquire → 409 Conflict on concurrent
invocation; release in finally so a runner exception doesn't wedge the
processor.
Tests: two new scan unit tests (mtime filter + post-mark mtime bump), 409
endpoint test, lock-released-on-exception test. Two existing tests updated
for the new "filtered at scan" stat shape (previously asserted skipped == 1,
now scanned == 0).
* Address PR #232 review #2: parallel scheduler tick + last_run on terminal state
Two pre-existing scaffold bugs in services/scheduler/__main__.py amplified
by adding more session-pipeline jobs:
1. Serial for-loop over jobs with synchronous httpx.post(timeout=900) — a
10-minute verification run blocked every other job (data-refresh,
health-check, usage, corporate-memory) for the whole window. The PR's
stated isolation guarantee held inside the runner but broke at the
scheduler dispatch layer.
2. last_run advanced only when _call_api returned True. Permanent-failure
jobs hot-looped on every tick (30s) instead of cadence (15min).
Fix: ThreadPoolExecutor.submit per due job + per-job in_flight set so a
long-running job can't be re-launched on subsequent ticks. last_run
advances unconditionally in finally; errors still surface via _call_api
logging + audit_log on the receiving side.
_run_job extracted to module-level for unit testing. New tests:
- TestRunJobBookkeeping: advances on success / failure / unhandled raise
- TestRunLoopParallelism: in_flight protection prevents duplicate
launches across ticks for a single slow job
---------
Co-authored-by: Minas Arustamyan <arustamyan.minas@gmail.com>
* feat(home+news): state-aware /home + /news + admin-edited news section
Squash of the vr/home-page feature work for clean rebase onto main.
Original 18-commit history preserved in branch backup/vr-home-page-pre-rebase.
What's in this PR:
**State-aware /home page**
- New `/home` route with hero + auto-mode + connectors (Asana / GWS /
Atlassian) + lookarounds. Onboarded vs not-onboarded state-machine
branches a single template (`home_not_onboarded.html`); the install
steps, "Setup a new Claude Code" CTA (90-day PAT mint), and per-
connector setup prompts hide once `users.onboarded=TRUE`. A
completion badge replaces them.
- "Mark me as offboarded" button reverses the flag without an SQL UPDATE.
- `users.onboarded BOOLEAN` column added; default FALSE; flipped by the
CLI's `agnes init` post-success POST and the `/admin/users` API.
- Connector setup prompts pre-check whether the tool is already
installed/connected before re-running setup.
- GWS scope set widened to include Google Chat (`chat.spaces`,
`chat.messages`).
**Single template + design tokens**
- `dashboard.html` now extends `base.html` via the new
`{% block layout %}` opt-out (full-width pages skip the 800px
`.container`). Net: every page shares one shell.
- `style-custom.css` `:root` extended with `--space-{7,9,10,12}`,
`--radius-2xl`, `--shadow-{card,elevated}`, `--text-{muted,disabled}`,
`--focus-ring`, `--transition-*`, `--width-{narrow,app,wide}` so
inline page styles can migrate incrementally.
**Auth redirects honor AGNES_HOME_ROUTE**
- `safe_next_path` resolves the configured home route when no `default=`
is passed; OAuth callbacks, magic-link clicks, password form, and
LOCAL_DEV_MODE shortcuts now land on `/home` (or whatever the operator
picked) instead of always /dashboard.
**News section + /news permalink + /admin/news editor**
- Schema-bumped `news_template` table (single versioned entity, draft +
publish gate). `published BOOLEAN` distinguishes draft from public;
monotonically-increasing `version` per save; rows >30d pruned on
save except the currently-displayed published version.
- `/home` bottom-of-page renders the latest published intro with a
"Read more →" link to `/news` (which renders the full body).
- `/admin/news` editor with sandboxed live preview, versions table,
per-row Unpublish, Format-help cheatsheet.
- `agnes admin news show / draft / edit / publish / unpublish /
versions / export` (CLI). Talks to the live server via the
`/api/admin/news/*` endpoints (PAT-authed) — no direct DB access
so it coexists with a running uvicorn.
- **Optimistic-lock guard**: `agnes admin news publish --version N` and
PUT/PATCH endpoints accept `expected_version` and 409 with structured
`{error: "version_conflict", expected, actual, actual_by}` when a
concurrent admin replaced the draft. Edit refuses to overwrite a
draft authored by someone else without `--force` or
`--expect-version`.
- nh3 (Rust-backed ammonia) HTML sanitizer; iframe pre-pass strips
any iframe whose src is not on the YouTube/Vimeo/Loom allowlist;
javascript:/data: schemes blocked everywhere.
- Author CSS vocabulary: `.news-hero` (blue gradient hero block),
`.callout`/`.callout-{info,warn,success,danger}`,
`.video-embed`, `.news-section`, `.news-grid-{2,3}`, `.news-cta` —
all consolidated in `style-custom.css` under "News content
vocabulary (shared)" so /home perex, /news body, and /admin/news
preview share one source of styling.
- Code-inside-`<pre>` contrast fix (was unreadable amber-on-silver).
- `.news-content` table styling (border, header band, row-hover).
**`scripts/dev/run-local.sh`** — local uvicorn launcher. Pulls Google
OAuth client id/secret from GCP Secret Manager
(`AGNES_OAUTH_GCP_PROJECT`-driven, no vendor defaults), points
`AGNES_CLI_DIST_DIR` at `./dist` so the wheel endpoint resolves, and
`--dev` flips `LOCAL_DEV_MODE=1` + `AGNES_HOME_ROUTE=/home` for one-
command iteration. `LOCAL_DEV_MODE=1` also enables the FastAPI debug
toolbar.
**CLAUDE.md "Run tests before every push" section** codifies
`pytest tests/ -n auto -q` as non-negotiable before each push.
**Tests**: 51 + 14 + 8 = 73 new tests across news-template repo,
sanitizer, API, web, CLI; plus updated home/auth/template tests for
the new shared-shell architecture.
Origin docs (gitignored, customer-fork content):
docs/brainstorms/home-page-requirements.md,
docs/plans/2026-05-07-001-feat-home-page-plan.md.
* feat(cli): agnes onboarded {on,off,status} — self-scoped flag toggle
User-facing equivalent of the in-page "Mark me as (off)boarded" button
on /home. POSTs /api/me/onboarded with {onboarded, source}; --source
overrides the audit-log marker so flips made from the CLI vs the web
button vs agnes init automation stay distinguishable.
`status` reads via /api/me/profile (when present); falls back to a
quick body-marker scan of /home so the read path doesn't write an
audit_log row. PAT-authed via cli.client.api_post — same convention
as agnes admin news / agnes admin add-user etc.
Tests: 5 covering on/off/status round-trip, idempotency, and
audit-log source recording. Full suite holds at 12 pre-existing
failures (same set as before).
* ui(nav+home): primary nav reorg + green What's new band + /marketplace link fix
Primary nav (post-rebase audit + per-user feedback):
- Items: Home → Marketplace → Data Packages → Memory. Admin dropdown
for admins only. The "Dashboard" label was renamed Home — point still
resolves through `home_route` so customer instances on /dashboard
still land there.
- Activity Center moved into the Admin dropdown. Per-team adoption
analytics is admin-consumed in practice; the route still allows
any authed user for direct deep-links so existing /home tile +
bookmarks keep working.
- Memory link added (→ /corporate-memory) — was previously buried in
the /home "Look around" tiles.
- Setup local agent + My Stack dropped from main nav. Setup is the
/home install flow's home now; My Stack lives as a tab inside
/marketplace.
/home tweaks:
- Plugin marketplace tile now points at /marketplace (was /store —
legacy from before the marketplace rebrand landed in #230).
- "What's new" section header gets a green band (success-flavored
D1FAE5 background, A7F3D0 border, darker green title) so the
bottom-of-page news block visibly distinguishes from the blue
install-hero at the top. Header strip only — body stays white.
Test fix: test_home_route_resolution renamed `dashboard_link_uses_home_route`
→ `home_link_uses_home_route` and asserts `href="/home">Home` instead
of `href="/home">Dashboard` after the label change.
* fix(home): decouple Step 3 + Connect-tools collapse from server onboarded flag
The server-side `users.onboarded` flip happens through two paths:
1. Explicit user click on "Mark me as onboarded" or `agnes onboarded on`.
2. Implicit `agnes init` POST → /api/me/onboarded on success.
Path 2 produced a UX surprise: an analyst running `agnes init` mid-flow
reloaded /home and saw Step 3 (auto-mode) + Connect-your-tools auto-
collapse to summary bars. They were actively working through those
sections — the install POST never signalled "I'm done with the rest
of setup", just "Agnes itself is installed".
Decouple the section-collapse decision from the server flag:
- Step 1 + Step 2 install blocks: still hidden on `onboarded=TRUE`
(their completion is a hard server signal — Agnes IS installed).
- Step 3 + Connect-your-tools: render flat by default in BOTH states.
Wrapped in `<details class="setup-collapsible" open>` so the
browser's native disclosure handles per-section toggle without JS,
but the `<summary>` is CSS-hidden until the page-level
`data-setup-minimized="1"` attribute is set on `.home-mock`.
- New "Minimize setup view" toggle inside the blue install-hero,
rendered only when onboarded. Click flips the data-attr on
`.home-mock` AND removes the `open` attribute from each
`<details>`. State persists in `localStorage["agnes_home_setup_minimized"]`
so the choice survives reloads but is per-device.
- "Show full setup view" (the same button when minimized) re-opens
both `<details>` and clears localStorage.
When minimized, each `<details>` still has its own native expand/
collapse — click the gray summary bar to peek at one section without
toggling the page-level minimize off.
Tests:
- test_step3_and_connectors_render_flat_when_onboarded_by_default —
asserts `<details class="setup-collapsible" ... open>` for both
sections post-onboarding and the absence of any server-rendered
`data-setup-minimized` attribute on the `.home-mock` root.
- test_minimize_toggle_visible_only_when_onboarded — toggle button
rendered only when onboarded.
Full pytest holds at 12 pre-existing failures (same set).
* feat(observability): optional PostHog integration (errors, LLM traces, replay, flags)
Off by default. Activates when POSTHOG_API_KEY is set in env. Defaults
to PostHog Cloud EU; override host for US Cloud or self-hosted.
Coverage:
- FastAPI 500 handler captures unhandled exceptions
- src/orchestrator.py rebuild + rebuild_source failures
- services/scheduler/ HTTP-job failures
- cli/main.py uncaught CLI errors (Typer.Exit/SystemExit/KeyboardInterrupt
skipped; flushes before re-raise so short-lived CLI invocations don't
drop events)
- connectors/llm/anthropic_provider.py + openai_compat.py emit
$ai_generation events with provider, model, latency, token counts
(prompt/completion bodies stay off unless POSTHOG_LLM_PAYLOADS=1
because LLM prompts here routinely include customer SQL/data)
- Browser snippet injected into every text/html response by
PosthogInjectionMiddleware — registered inside the GZip layer so it
sees uncompressed HTML before compression. Many templates are
standalone (their own DOCTYPE) and never extend base.html, so a
per-template include would miss them.
- Frontend: $pageview, $pageleave, JS error capture via window.error
and unhandledrejection handlers, masked session replay
(maskAllInputs: true plus CSS-selector mask for known data surfaces),
feature flags (browser posthog.isFeatureEnabled + server-side
feature_enabled with fallback for older SDKs).
Identification mode operator-configurable: none / id / email / full.
Default email ships user.id + email but never name. CLI entry point
moves from cli.main:app to cli.main:main (Typer wrapper).
Files:
- src/observability/posthog_client.py — lazy singleton, no network
when disabled, single-process flush on shutdown
- src/observability/llm_tracing.py — trace_generation context manager
- app/middleware/posthog_inject.py — HTML rewrite middleware
- app/web/templates/_posthog.html — browser snippet template
- docs/observability.md — operator guide
- config/.env.template — documented POSTHOG_* knobs
- tests/test_posthog_disabled.py + tests/test_posthog_client.py +
tests/test_llm_tracing.py — 18 tests covering disabled state,
identify-mode payloads, $ai_generation shape, error variant.
CHANGELOG entry under [Unreleased] Added.
* feat(observability): tag every PostHog event with environment + release
Splits PostHog dashboards cleanly between localhost / dev / staging /
production without manual tagging on every capture call.
- POSTHOG_ENVIRONMENT explicit override; auto-resolves to "local" when
LOCAL_DEV_MODE=1, else RELEASE_CHANNEL, else AGNES_DEPLOYMENT_ENV,
else "unknown".
- AGNES_VERSION → RELEASE_CHANNEL fallback feeds the `release` property
for "is this error new in this release?" cohorting.
- Backend gets both via the PostHog SDK's super_properties constructor
arg (every captured event picks them up automatically).
- Browser snippet calls posthog.register({environment, release}) inside
the loaded callback so $pageview, $exception, autocapture, etc. all
carry the same labels.
- request.state.user now populated by auth dependencies so the snippet
can actually call posthog.identify(user_id, {email}) for logged-in
users (previously the user block always resolved to None because
nothing wrote to request.state.user).
4 new tests cover env resolution: explicit > LOCAL_DEV_MODE > channel
> unknown, plus super-properties forwarding into the SDK constructor.
* feat(observability): inline user attrs on every PostHog event + debug throw route
PostHog's UI shows person properties on the Person profile page, not
inline on each event — so a reviewer triaging an exception couldn't tell
which user hit the bug without clicking through. Fix it on both sides.
- Backend capture_exception merges user_id / user_email / user_name into
the event properties (gated by POSTHOG_IDENTIFY_PII: none/id/email/full).
Backed by a new _user_props_for_event helper on PosthogClient.
- Browser snippet registers user_id + user_email + user_name as super-
properties via posthog.register({...}) so every $exception, $pageview,
and custom event coming from posthog.captureException() carries them
inline. Mirrors the backend so cross-referencing client/server events
doesn't require a person-profile lookup.
- /api/debug/throw — debug-only endpoint gated by DEBUG=1 (404 in prod).
Runs Depends(get_current_user) first so request.state.user is set when
the unhandled-exception handler captures the event. Lets operators
exercise the full observability path end-to-end without hand-rolling
a TestClient script. Configurable via ?kind=ValueError&msg=...
7 new tests cover: backend user-attr merge across identify modes,
anonymous request fall-through, browser snippet super-prop emission for
logged-in / anonymous / id-only / full-name cases.
* fix(observability): address minasarustamyan PR #231 review
Two bugs caught in review.
1. PosthogInjectionMiddleware dropped Response.background on every
return path. BaseHTTPMiddleware materialises the body and asks
subclasses to return a fresh Response — three paths in dispatch()
omitted background=, silently cancelling any BackgroundTask /
BackgroundTasks the route attached (audit logging, async webhooks,
email sends) with no log line. Fix: route every return through a
_passthrough() helper that forwards background.
Also adds a _MAX_BUFFER_BYTES (4 MB) cap so a streamed-HTML response
can't balloon RSS during buffering. Bigger bodies short-circuit
through with a warning rather than being injected.
Regression tests in tests/test_posthog_inject_middleware.py exercise
four return paths (snippet present, render-fail, double-injection
guard, non-HTML passthrough) plus the streaming-guard short-circuit.
2. $ai_input / $ai_output_choices were emitted without truncation, so
POSTHOG_LLM_PAYLOADS=1 silently dropped events past PostHog's ~32 KB
per-event ingest limit — exactly the calls (large prompts with
schemas / sample rows / SQL) an operator would want to inspect.
Fix: clip both at POSTHOG_LLM_PAYLOAD_MAX_CHARS (default 30000) with
an explicit "…[truncated N chars]" marker so readers don't mistake
truncated captures for complete ones. Metadata (provider, model,
tokens, latency, error) flows regardless. Three new tests cover
default-cap clipping, env-override, and pass-through under the cap.
37 PostHog tests pass.
* Add /marketplace browse page + Model B opt-in stack composition
New /marketplace browse surface unifies the curated marketplaces
(admin-managed git mirrors) and the community Flea Market behind
three tabs — Curated / Flea / My Stack — with per-tab category
filter, search across both sources with scope checkboxes, and
numeric pagination, all driven by URL query state. Plugin detail
at /marketplace/curated/<slug>/<plugin> and /marketplace/flea/<id>;
nested skill / agent detail at /marketplace/curated/<slug>/<plugin>/
{skill,agent}/<name> and the flea-side single-page detail.
Model B opt-in: an RBAC grant on a curated plugin is now only
*eligibility*. The user must click "Add to my stack" for it to
enter their served Claude Code marketplace. Composition flips
from (rbac ∖ opt_outs) ∪ store_installs to
(rbac ∩ subscriptions) ∪ store_installs. The legacy
user_plugin_optouts table is renamed user_curated_subscriptions
(schema v27) — same table shape, inverted semantic, repository
methods become subscribe / unsubscribe / is_subscribed.
UX vocabulary: Install → Add to my stack, Installed → In your
stack, card "Installed" badge → "In stack" (amber pill), tab
"My Subscriptions" → "My Stack". Bridges the two-step model
(server-side bookmark vs. on-laptop install) the previous label
hid. Click triggers an inline post-add hint panel under the
description with the agnes refresh-marketplace recipe + Copy
chip, dismissible per-browser via localStorage.
Per-tab info blocks above the filter row:
- Curated: trust signal — "Each plugin here has a named curator
accountable for it." (blue accent + See-all-curators link)
- Flea: open-shelf signal — "Anyone in the company can upload
here." (purple accent + Tips-for-sharing link)
- My Stack: personal-shelf orientation — "Your AI stack —
everything you've added." (slate accent, no link)
Tabs carry per-tab Heroicons (shield-check / building-storefront
/ rectangle-stack) tinted to match each tab's accent; flips white
when the tab is active for contrast.
Hero illustration anchored to the right of the blue hero panel
(absolute, 47% wide, behind the search row content). Hidden
under 900px viewport.
Action-row CTAs realigned to publication intent: curated
"How to add new content" → "Submit a plugin" (links to the
guide page); flea button removed since +Upload sits next to it.
Empty-state CTAs match. /marketplace/guide/{curated,flea}
routes now host publication-flow guide pages with placeholder
ledes — full copy to be authored separately.
Categories: Heroicons-based icons mapped per category in
src/category_icons.py (zero new dependencies; SVG path strings
inlined). Marketplace cards, filter pills, and detail pages
read from the same source.
API endpoints under /api/marketplace:
- GET /items per-tab listing (curated / flea / my)
- GET /categories per-tab non-zero counts
- GET /curated/{slug}/{plugin} plugin detail
- POST/DELETE /curated/{slug}/{plugin}/install subscribe toggle
- GET /curated/{slug}/{plugin}/{skill,agent}/{name} inner item
The tab=my branch reads directly from
user_curated_subscriptions ∪ user_store_installs (not
resolve_user_marketplace, which bundles flea skills/agents into
a single store-bundle synthetic entry useful for serving the
Claude Code marketplace ZIP/git but wrong for browsing where
each item should appear as its own card).
Detail pages: plugin detail surfaces inner skills/agents as
clickable nested cards; commands/hooks/MCPs render as plain
name lists. Skill/agent detail mirrors the plugin layout with
kind-tinted accents (skill = green, agent = purple), Description
+ Details sidebar, Files + Docs sections, and the "How to call
it" copy-able invocation chip showing /<plugin>:<inner-name>
exactly as Claude Code namespaces it post-install. Curated
nested has no install button — links back to the parent plugin.
Navbar: standalone "My AI Stack" relabelled "My Stack" and
points at /marketplace?tab=my; "Store" link removed (Store
flow is reachable via the Flea Market tab's +Upload button).
The standalone /my-ai-stack and /store routes still work for
old bookmarks.
Tests cover the new browse / categories / install / RBAC paths
under tests/test_marketplace_api.py; existing marketplace and
store tests updated for Model B (explicit subscribe in fixtures).
Schema bumped v26 → v27 with idempotent migration that wipes
existing user_plugin_optouts rows on flip and adds
marketplace_plugins.created_at with registered_at backfill.
* Fix v28 migration + post-rebase test fallout
v28 ALTER TABLE marketplace_plugins ADD COLUMN created_at conflicted with
_SYSTEM_SCHEMA's earlier CREATE that already includes the column on fresh
installs (test fixtures starting at any pre-v28 version trip on it).
Switch to ADD COLUMN IF NOT EXISTS — same idiom as the upstream v27
Keboola sync-strategy migration on the same ladder.
Two test patches needed after the rebase bumped SCHEMA_VERSION 27 → 28:
- test_keboola_v27_migration.py: test_schema_version_constant_is_27 was
pinning ==27. Loosened to >=27 (the test's purpose is to verify the
v27 Keboola migration, not to pin the current SCHEMA_VERSION).
- test_setup_page_unified.py: was monkeypatching resolve_allowed_plugins
but compute_default_agent_prompt now reads from resolve_user_marketplace
(Model B-aware). Stub the right function so the test exercises the
v28 served-set path.
* Harden curated skill/agent inner endpoints against path traversal
`_read_inner`, the `skill_dir` walk in `curated_skill_detail`, and the
`agent_path.stat` in `curated_agent_detail` joined URL path-params onto
`plugin_root` without verifying the resolved candidate stayed inside it.
Starlette's `[^/]+` on `{skill_name}` / `{agent_name}` blocks the direct
URL exploit (encoded `/` 404s before the handler), but a curator-planted
symlink inside a curated marketplace's git mirror could still dereference
outside the plugin tree on read.
Adds `_safe_join(plugin_root, *parts)` doing
`Path.resolve(strict=True)` + `relative_to(plugin_root.resolve())`, used
by all three call sites so the boundary is enforced once and consistently.
Tests cover the helper directly (normal path resolves, escaping `..`
returns None, escaping symlink returns None, missing file returns None)
plus an end-to-end check that the symlink case actually 404s on the
HTTP endpoint. Symlink tests skip on Windows where symlink creation
needs elevated permissions; they run on Linux CI.
---------
Co-authored-by: Minas Arustamyan <arustamyan.minas@gmail.com>
## Summary
Two minimum-viable fixes after today's 0.44.0 → 0.47.3 release train and the production 30-user launch. Devil's advocate review of a 3-PR / 7-item plan cut scope to these 2 — the rest is deferred to a separate "operate-first, instrument-second" backlog item.
### B2 — Docker session_collector log skip
`services/session_collector` was logging `Collection complete: 0 users, 0 files copied` + `WARNING: Group 'data-ops' not found, using default group` every 10 minutes in the Docker layout (where `/home/*/user/sessions/` doesn't exist). New env var `AGNES_SKIP_LEGACY_COLLECTOR=1` set by default in `docker-compose.yml` short-circuits the collector pass.
The bare-VM deployment path (where /home/* IS populated by Claude Code) leaves the env var unset and continues to scan normally — including the data-ops warning, which is load-bearing for catching missing-group mis-deploys.
### O2 — FIFO check in `_check_session_pipeline`
The existing check compares `MAX(processed_at)` to newest jsonl mtime — catches "detector hasn't run lately" but blind to "old file was skipped while newer ones were processed". New code finds the oldest FS jsonl that's NOT in `session_extraction_state.session_file` and flags if its mtime is older than `SESSION_PIPELINE_STUCK_FILE_GRACE_SECONDS` (default 4× the existing grace = 2h).
Severity intentionally starts at `info` so we can collect prod data on false-positive rate before tightening to `warning`. The aggregator already treats `info` as non-promoting (see the severity vocabulary docstring at the top of `app/api/health.py`), so the headline `status` stays at `healthy` even when this fires — the operator sees the entry in the per-check breakdown but no spurious `degraded` overall.
## Test plan
- [x] `pytest tests/test_session_collector.py` — 17 tests pass (existing 9 + new 8 covering env-set/unset, truthy variants, falsy non-skip).
- [x] `pytest tests/test_health_session_pipeline.py` — 8 tests pass (existing 4 + new 4 FIFO tests covering stuck-file, under-threshold, all-processed, env-override).
<!-- devin-review-badge-begin -->
---
<a href="https://app.devin.ai/review/keboola/agnes-the-ai-analyst/pull/229" target="_blank">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1">
<img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open in Devin Review">
</picture>
</a>
<!-- devin-review-badge-end -->
## Summary
`agnes self-upgrade` without `--force` previously short-circuited on the local 24h `update_check.json` cache. After a server-side version bump within that window, the explicit command exited silently as a no-op — empirically observed today when prod 0.47.1 → 0.47.2 didn't propagate.
Fix: always invalidate the cache in `_resolve_info`. The cache still gates the implicit warning loop in the root callback (correctly — that runs on every `agnes <anything>` and can't hammer `/cli/latest`).
## Test plan
- [x] New `test_self_upgrade_bypasses_24h_cache_without_force` — stale cache claims current; mocked server reports newer; assert UpdateInfo carries the newer version, not the cached one.
- [x] Existing self-upgrade tests pass (including `--force` semantics — force is now downstream-only, behavior preserved).
## Summary
Smoke-testing the just-shipped 0.47.1 against production exposed two regressions:
1. `agnes query --remote "SELECT FROM unit_economics WHERE bad_col=1"` returned `Table "unit_economics" must be qualified` (the OLD error) instead of `Unrecognized name: bad_col` (the #218 fix's intended behavior).
2. `agnes query "DESCRIBE unit_economics"` showed only DuckDB's misleading `Did you mean order_economics?` with no Agnes hint paragraph (the #219 fix is missing).
Root cause: PR #217's squash merge (`506a378c`) carried stale snapshots of `app/api/query.py` and `cli/commands/query.py` from before #218 and #219 merged. The rebase-and-merge auto-merged those files cleanly (no conflict markers) but the result silently reverted both fixes.
Restore the two changes verbatim. Tests for both fixes already on main and continue to pass against the restored code.
## Test plan
- [x] `pytest tests/test_api_query_guardrail.py tests/test_cli_query.py` — clean
- [x] Manual repro against prod after deploy: both flows now surface the intended diagnostic.
<!-- devin-review-badge-begin -->
---
<a href="https://app.devin.ai/review/keboola/agnes-the-ai-analyst/pull/225" target="_blank">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1">
<img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open in Devin Review">
</picture>
</a>
<!-- devin-review-badge-end -->
## Summary
Brings the Keboola connector to feature parity with the legacy internal data-analyst's per-table sync strategies. Closes the four documented gaps from the spec branch (`zs/keboola-connector-specs`):
- **Typed parquet** in the legacy SDK extraction path — column types from Keboola Storage metadata (provider cascade `user > ai-metadata-enrichment > keboola.snowflake-transformation`) survive the CSV → parquet roundtrip; invalid date strings (`'0000-00-00'`) and invalid numeric strings (`'Non-Manager'`) become NULL while keeping the column's typed schema. Pre-fix everything was VARCHAR.
- **Incremental sync** via Storage API `changedSince` — opt-in per table; pulls only delta rows, merges into the existing parquet by `primary_key` (drop_duplicates with keep='last'). Cuts daily extraction from O(full table) to O(delta).
- **Partitioned sync** — flat per-partition layout `data/<table>/<key>.parquet` (e.g. `2026_05.parquet`), per-affected-partition merge for daily updates, chunked initial load with 1-day overlap and 2-empty-chunk stop heuristic.
- **`where_filters`** — server-side row filter with date placeholders (`{{today}}`, `{{last_3_months}}`, `{{start_of_3_months_ago}}`, etc.) resolved at sync time. Force the SDK path; reject `incremental + where_filters` combination at API layer (changedSince already filters temporally).
## Architecture
- **Schema migration v25 → v26**: 7 new columns on `table_registry`. Existing `sync_strategy` column reused (pre-v26 it was inert catalog metadata; post-v26 the extractor dispatches off it).
- **Per-table dispatcher** in `extractor.run()` routes to one of `_extract_via_extension` (full_refresh + extension), `_extract_via_legacy` (full_refresh + filters or extension fallback), `extract_incremental`, or `extract_partitioned`.
- **API conflict policy**: `incremental + where_filters` → 422; `partitioned + query_mode='remote'` → 422; `partitioned ⇒ partition_by required`.
- **Admin UI**: third "Direct extract (Storage API)" radio in the Keboola Register / Edit modals, alongside existing "Whole table (extension)" and "Custom SQL". When selected, exposes a v26 sync-strategy panel with conditional fields per strategy.
## Test plan
- [x] **Unit + module** — 134 v26 tests covering migration, repo, parquet_io, where_filters, incremental (compute_changed_since + merge_parquet + extract_incremental E2E), partitioned (key derivation + merge_partition + chunked windows + extract_partitioned E2E), extractor dispatcher, admin API validators, PUT field clearing, registry-shape → dispatcher bridge
- [x] **HTML form structure** — all v26 inputs + visibility classes + JS payload fields verified in rendered template
- [x] **Real Keboola roundtrip** — registered a small test table as `sync_strategy='incremental'` against a test Storage project, triggered two syncs:
- Sync 1: `changedSince=None` → full pull → 9 rows typed parquet
- Sync 2: `changedSince=last_sync - 1d window` → 9 delta rows merged with 9 existing → 9 after dedup on primary_key (PK merge confirmed)
- [x] **Browser UX** — agent-browser session against a local uvicorn: login → admin/tables → register modal → switch radios → verify field visibility per strategy → submit → edit existing row → switch to Direct/Incremental → save → confirm DB persistence
- [x] **Regression** — no regressions in the broader 3252-test suite (3 pre-v26 tests updated for the deprecation-marker removal + schema-version bump; 2 pre-existing environment-sensitive test failures unrelated to this change)
## Bugs caught + fixed during E2E
The browser + real-Keboola roundtrip exposed four bugs the unit tests missed:
1. **JS visibility race** — two competing `forEach` loops set `display=''` then `display='none'` on form elements sharing `kb-strategy-incremental kb-strategy-partitioned` classes (window_days + max_history_days are reused across strategies). Fix: single-pass selector with class-based visibility resolver.
2. **PUT cannot clear field** — pre-v26 `updates = {k: v ... if v is not None}` collapsed "omitted from body" and "sent as null" into the same case, so admin couldn't switch a partitioned row back to full_refresh and have stale `partition_by` clear. Fix: `model_dump(exclude_unset=True)`.
3. **Subprocess DB lock conflict** — `_read_last_sync` reopened `system.duckdb` while the parent server held the write lock (subprocess contract at `app/api/sync.py:_run_sync` line 260). Fix: parent injects `__last_sync__` into table_config before subprocess spawn.
4. **Wrong KBC table_id** — `extract_incremental` / `extract_partitioned` built the Storage API table_id from the registry row's slugified `id` (`circle_inc`) instead of `bucket.source_table` (`in.c-finance.circle`), producing 404s. Fix: prefer `bucket+source_table`; fall back to `id` only when bucket empty.
## Operator notes
- Existing tables stay on `full_refresh` after migration; admins opt individual tables in via `agnes admin register-table --sync-strategy ...`, the Keboola Edit modal, or `POST/PUT /api/admin/registry`.
- `merge_parquet` and `merge_partition` use `pd.concat + drop_duplicates`, loading both existing and delta into pandas RAM. For tables in the multi-million-row range this may OOM — switch to `partitioned` strategy for those (per-partition merge keeps memory bounded). Documented in `### Internal` of the changelog entry.
- Date placeholders are resolved at **sync time**, not register time — a typo'd `{{lasst_week}}` is accepted at register and surfaces only when the next sync runs. By design (rolling windows need late-binding).
## Spec source
The four corresponding plans on the `zs/keboola-connector-specs` branch under `docs/superpowers/plans/2026-05-07-0[1-4]-*.md` capture the design rationale and link back to internal repo references for each subsystem.
<!-- devin-review-badge-begin -->
---
<a href="https://app.devin.ai/review/keboola/agnes-the-ai-analyst/pull/217" target="_blank">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1">
<img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open in Devin Review">
</picture>
</a>
<!-- devin-review-badge-end -->
## Summary
- Catalog enrichment for `query_mode='remote'` rows: `rows`, `size_bytes`, `partition_by`, `clustered_by` per table (BQ + Keboola providers).
- `/api/v2/schema/{id}` cache miss: 2 BQ jobs → 1 (-50%) via shared `fetch_bq_columns_full`.
- All four catalog/schema/sample/metadata caches flush on registry change; single-row re-warm scheduled.
- Automatic cache warmup at server startup (bounded concurrency, opt-out via `AGNES_SKIP_CACHE_WARMUP=1`).
- SSE-driven freshness toolbar on `/admin/tables` with progress bar, log, and per-row badge.
- New admin doc `docs/admin/query-modes.md` — single source of truth on `local` / `remote` / `materialized` choice.
Closes#155.
Closes#156.
## Test plan
- [x] 65+ targeted tests pass across 11 new test modules + 3 modified ones.
- [x] No DB migration; no wire-break; `MIN_COMPAT_CLI_VERSION` unchanged.
- [ ] Reviewer: register a remote BQ table via `/admin/tables`, observe the toolbar populates within ~2 s and the per-row badge transitions warming → fresh.
- [ ] Reviewer: trigger `Re-warm all`, verify SSE log scrolls and `cacheWarmupBar` progresses.
- [ ] Reviewer: edit a registered row's bucket, verify `agnes schema <id>` returns updated columns immediately (no 1-hour staleness).
- [ ] Reviewer: confirm `agnes admin register-table --query-mode remote` prints the new IAM-smoke-check hint.
## Notable design decisions
- BigQuery `INFORMATION_SCHEMA.TABLE_STORAGE` is the only valid scope for size+rows (verified live 2026-05-07; dataset-scoped doesn't exist). Region resolved from `instance.yaml.data_source.bigquery.location` → `bq.client().get_dataset(...)` → fall back to legacy `__TABLES__`.
- VIEW handling: TABLE_STORAGE returns no rows for views, fall through to `__TABLES__` (also empty) → `TableMetadata(rows=None, size_bytes=None, partition_by=..., clustered_by=...)`. Null size signals analyst Claude to apply existing CLAUDE.md guidance.
- `size_bytes` is `active_logical_bytes + long_term_logical_bytes` — full BQ scan reads both; reporting only active undercounts aged partitioned tables.
- Source-agnostic provider seam: per-source `connectors/<source>/metadata.py:fetch(MetadataRequest)`; dispatcher in `app/api/v2_catalog.py:_metadata_provider_for` lazily imports per source_type so a Keboola-only deployment doesn't pay the BQ-extension import cost.
- Warmup non-blocking: FastAPI `lifespan` schedules `asyncio.create_task(_warm_catalog_caches_bg)` before `yield`. Per-row failures isolated.
## Out of scope
- Profile / column histograms / dimension cardinality for remote tables (separate issue).
- Onboarding nudge ("you have 0 remote tables, consider registering some BQ ones") — separate UX call.
- Provider plug-in registration via entry-points (the dispatch table is a hardcoded if-tree today; one line per future source).
## Release
Bumps `pyproject.toml` 0.46.1 → 0.47.0 (main shipped 0.46.0 + 0.46.1 during this PR — see commit `d98976ec`). New CHANGELOG section under `## [0.47.0] — 2026-05-07`.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- devin-review-badge-begin -->
---
<a href="https://app.devin.ai/review/keboola/agnes-the-ai-analyst/pull/223" target="_blank">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1">
<img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open in Devin Review">
</picture>
</a>
<!-- devin-review-badge-end -->
## Summary
Two bugs in `agnes describe` surfaced from a real analyst session following the CLAUDE.md agent-rails discovery workflow. Together they break `agnes describe` end-to-end for any analyst (or analyst-AI) who follows the documented form.
### A) CLI parsing
`agnes describe TABLE -n 5` failed with `Missing argument 'TABLE_ID'`. Root cause: the command was registered as a `Typer.Typer` subcommand group via `app.add_typer(describe_app, name="describe")` + `@describe_app.callback(invoke_without_command=True)`, and that pattern mis-parses positional + short-int option in some orderings. Same pattern in `cli/commands/schema.py` works only because schema has no INTEGER short option. Fix: switch to flat `@app.command("describe")`.
### B) Server NaN
`/api/v2/sample/<id>` (called by `agnes describe`) returned HTTP 500 with `ValueError: Out of range float values are not JSON compliant: nan` whenever a row contained NaN. Fix: sanitize NaN/±inf to None before JSON serialization.
## Test plan
- [x] `pytest tests/test_cli_describe*.py` — added regression tests pinning `-n` parsing on either side of the positional.
- [x] `pytest tests/test_api_v2_sample*.py` — added regression test for NaN row → JSON `null` (not 500).
<!-- devin-review-badge-begin -->
---
<a href="https://app.devin.ai/review/keboola/agnes-the-ai-analyst/pull/224" target="_blank">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1">
<img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open in Devin Review">
</picture>
</a>
<!-- devin-review-badge-end -->
## Summary
`claude -p` (headless mode) gives SessionEnd hook subprocesses ~1 second before SIGTERM, regardless of work in progress. `agnes push` for a typical workspace takes 5-30s. The current synchronous SessionEnd hook (`agnes push --quiet 2>/dev/null || true`) was therefore being killed mid-first-upload — `|| true` masks the SIGTERM as exit 0, so this regression was invisible until I traced it via a wrapper script and Claude's `~/.claude/debug/<sid>.txt` log.
Fix: wrap SessionEnd push in `bash -c "( nohup agnes push --quiet </dev/null >/dev/null 2>&1 & ) ; true"`. The subshell exits immediately, orphaning the upload child to init so it survives the hook subprocess kill. Same `bash -c` pattern as the existing `refresh-marketplace` SessionStart entry (for Windows compatibility).
End-to-end verified against production: claude exited in 5s, detached child completed the upload, file `491e3a23-...jsonl` landed on the server within 30s with mtime 14:30 UTC.
## Test plan
- [x] `pytest tests/test_lib_hooks.py` — added `test_session_end_push_is_detached` regression test asserting `nohup`, `&`, `</dev/null` are all present.
- [x] `pytest tests/test_setup_hooks_template.py` — assertions loosened from `==` to `in` where necessary.
- [x] Verified end-to-end against production with the detached wrapper before opening this PR (manual probe).
<!-- devin-review-badge-begin -->
---
<a href="https://app.devin.ai/review/keboola/agnes-the-ai-analyst/pull/222" target="_blank">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1">
<img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open in Devin Review">
</picture>
</a>
<!-- devin-review-badge-end -->
## Summary
Verified against production: `claude -p` headless mode doesn't fire SessionEnd hooks (proven via `--output-format stream-json --include-hook-events`: zero `SessionEnd` events), so any session JSONLs from `-p` invocations stay orphaned locally and never reach the server. Fix: add `agnes push --quiet` as a third SessionStart entry — symmetric self-heal alongside the existing `agnes pull` entry. Existing workspaces pick this up on their next `agnes init` via the marker-based migration already in `cli/lib/hooks.py`.
Separately: a colleague's fresh install showed `agnes diagnose` warning "uploads are not being processed", which led them to suspect their `agnes push` was broken. The warning is actually about the LLM-based `verification-detector` backlog (uploads themselves were arriving fine — confirmed by 23+3 JSONLs landed on the server while the warning was firing). Reword the warning to "verification-detector backlog" + add `last_processed` to the diagnose dict so operators don't have to grep logs to confirm.
## Test plan
- [x] `pytest tests/test_lib_hooks.py` — updated count + added `agnes push in SessionStart` assertion.
- [x] `pytest tests/test_setup_hooks_template.py` — updated.
- [x] `pytest tests/test_clean_install_integration.py` — updated.
- [x] `pytest tests/test_health_session_pipeline.py` — updated warning text + asserted `last_processed` field.
<!-- devin-review-badge-begin -->
---
<a href="https://app.devin.ai/review/keboola/agnes-the-ai-analyst/pull/220" target="_blank">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1">
<img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open in Devin Review">
</picture>
</a>
<!-- devin-review-badge-end -->
## Summary
`agnes query "DESCRIBE unit_economics"` (where `unit_economics` is `query_mode='remote'`) previously returned DuckDB's nearest-name suggestion (`Did you mean "order_economics"`?), sending users down the wrong path. Now appends a friendly hint about remote tables.
Reproduced from a real analyst session — colleague spent ~30s diagnosing what was actually "this is a remote table, not materialized locally".
## Test plan
- [x] New test: `_query_local("DESCRIBE unit_economics", ...)` against an empty local DuckDB triggers the new hint, original DuckDB error still echoed.
- [x] Negative test: a syntax-error query does NOT trigger the hint (regex only matches "Table with name X does not exist").
- [x] `pytest tests/test_cli_query*.py` clean.
<!-- devin-review-badge-begin -->
---
<a href="https://app.devin.ai/review/keboola/agnes-the-ai-analyst/pull/219" target="_blank">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1">
<img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open in Devin Review">
</picture>
</a>
<!-- devin-review-badge-end -->
## Summary
When `agnes query --remote` references a column that doesn't exist on the FROM table, users were seeing `Table "<id>" must be qualified with a dataset` instead of the actually-useful `Unrecognized name: <column>` from BigQuery. Surface the first-attempt diagnostic now; keep the second-attempt context as `underlying_original`.
Reproduced against production:
```
$ agnes query --remote "SELECT COUNT(*) FROM unit_economics WHERE authorize_date = DATE '2025-05-06'"
Error: remote_estimate_failed (HTTP 400)
message: Could not estimate scan size for this query.
underlying: 400 ... Table "unit_economics" must be qualified with a dataset.
```
(`unit_economics` has `authorize_timestamp`, not `authorize_date`.)
## Test plan
- [x] New `test_remote_estimate_failed_surfaces_first_error_when_attempts_differ` asserts the first-attempt message wins, second-attempt is preserved as `underlying_original`, hint points to `agnes schema`.
- [x] Existing `test_guardrail_returns_400_remote_estimate_failed_on_double_parse_error` still passes (both attempts mocked to identical error).
- [x] `pytest tests/test_api_query_guardrail.py` clean.
<!-- devin-review-badge-begin -->
---
<a href="https://app.devin.ai/review/keboola/agnes-the-ai-analyst/pull/218" target="_blank">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1">
<img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open in Devin Review">
</picture>
</a>
<!-- devin-review-badge-end -->
Cuts the [Unreleased] section into [0.46.0] in CHANGELOG.md and
bumps pyproject.toml. The user-visible content was already on main
via PR #190 (commit 28430ced); this is the release-cut commit that
should have been the last commit on that PR — splitting it out so
the operator-facing release artifact (tag + GitHub Release) lines up
with what's already deployed at :stable.
* fix: cutover regressions + parallel Keboola legacy fallback
Bundled fixes from a fresh-deploy run on a Keboola Storage backend with
the block-shared-snowflake-access feature flag — DuckDB Keboola
extension's per-table scan can't access bucket schemas, so the legacy
kbcstorage Storage-API client is the only working path.
CUTOVER REGRESSIONS
- agnes pull hash mismatch on every Keboola local-mode table —
src/orchestrator.py:_update_sync_state stored md5(mtime+size)[:12]
while the CLI compares against full 32-char content MD5. Now stores
the same content MD5 the materialized SQL path already used.
- Trailing-slash sanitization in connectors/keboola/access.py and
extractor.py — DuckDB Keboola extension's ATTACH fails when the URL
ends in / (canonical form).
- src/profiler.py:TableInfo.description becomes optional — two call
sites instantiated without it, crashing the profiler pass.
- scripts/ops/agnes-auto-upgrade.sh: chown on UID change — older images
ran as root, current runs as agnes (uid 999). Reads target uid:gid
from /etc/passwd inside the new image and chowns ${STATE_DIR},
/data/extracts, /data/analytics when the digest moves.
- POST /api/sync/trigger is now singleton per process — two
near-simultaneous trigger calls each forked an extractor subprocess,
fought for extract.duckdb's file lock, starved uvicorn, flipped the
container to unhealthy. Trigger now returns 409
(sync_already_in_progress) when held; _run_sync acquires non-blocking.
PARALLEL LEGACY FALLBACK
- Process pool fan-out for the _extract_via_legacy queue (default 8
workers, override via AGNES_KEBOOLA_PARALLELISM). Process pool, not
thread pool, because connectors/keboola/client.py:export_table does
os.chdir(temp_dir) — process-global, so threads raced and slice files
landed in the wrong directory ("[Errno 2] No such file or directory:
'<job_id>.csv_X_Y_Z.csv'").
- Extractor subprocess timeout 1800s -> 3600s (configurable via
AGNES_EXTRACTOR_TIMEOUT_SEC). 28+ tables × multi-minute Keboola export
jobs need the headroom on telemetry-class projects.
- Process group cleanup on timeout — Popen(start_new_session=True) puts
the extractor in its own group. On timeout the parent SIGTERMs the
group (10s grace) then SIGKILLs stragglers. Without this, the pool
workers were reparented to PID 1 and continued holding open Keboola
Storage export jobs. Inline extractor script also installs a SIGTERM
-> sys.exit(143) handler so the with ProcessPoolExecutor(...) block
__exit__ runs cleanly.
Tests: existing tests that patched subprocess.run updated to patch
subprocess.Popen with a _FakePopen stand-in (same exit-code-injection
contract). Two tests that exercised the parallel path forced
AGNES_KEBOOLA_PARALLELISM=1 to keep mocks alive (mocks don't ride into
ProcessPoolExecutor subprocesses).
Squashed onto current main (was 7 commits + multi-commit CHANGELOG +
agnes-auto-upgrade.sh conflicts; squash avoids per-commit conflict
resolution against main's flat-mount STATE_DIR refactor and 0.38.0
release cut).
* feat(keboola): Storage API direct extract path; drop extension data path
The DuckDB Keboola extension's COPY routes through Keboola QueryService,
which is unreliable on linked-bucket projects (extension v0.1.6 fixes
that case but isn't yet in the community CDN, and pre-fix any project
with the block-shared-snowflake-access feature flag couldn't see bucket
schemas at all). Move the extract path off the extension entirely and
talk to the Storage API directly via signed-URL download — works on any
project, regardless of extension state.
connectors/keboola/storage_api.py (NEW)
Lightweight client built on requests.Session. Three endpoints:
- POST /v2/storage/tables/{id}/export-async (kicks off job)
- GET /v2/storage/jobs/{id} (poll until done)
- GET /v2/storage/files/{id}?federationToken=1 (signed URL detail)
- GET <signed_url> (download bytes)
Supports sliced exports (manifest + per-slice signed URLs) and gzipped
payloads. ExportFilter dataclass mirrors the Keboola filter spec
(whereFilters / columns / changedSince / limit) and handles JSON
round-trip with the registry's source_query column. Token redaction
in error messages. Bounded exponential backoff on job polling.
No cloud-SDK dependency on the data path; thread-safe.
connectors/keboola/extractor.py
- materialize_query() rewritten: takes bucket/source_table/source_query
(JSON filter spec), exports via KeboolaStorageClient, converts CSV
to parquet via DuckDB, atomic os.replace. Same return shape so
sync.py downstream code stays uniform with the BQ branch.
- _extract_via_legacy() also moved to Storage API direct (kept the
name for caller compatibility with _legacy_worker / the parallel
batch extractor). Per-call temp directories — no os.chdir, threads
don't race.
app/api/sync.py
_run_materialized_pass for source_type='keboola' rows now constructs a
KeboolaStorageClient (replaces KeboolaAccess) and passes
bucket/source_table/source_query to materialize_query. Reuses one
client across rows for HTTP keep-alive. Sources keboola URL from env
too (KEBOOLA_STACK_URL) when instance.yaml doesn't have stack_url
configured.
cli/commands/admin.py
discover-and-register defaults Keboola rows to query_mode='materialized'
(NULL source_query = full table), matching the v26 migration's
unification of the local/materialized split for Keboola. BigQuery and
Jira keep their per-source defaults.
src/db.py
Schema bump 25 → 26. Migration: UPDATE table_registry SET
query_mode='materialized' WHERE source_type='keboola' AND
query_mode='local'. NULL source_query on those rows means "full table
export" — same effective behavior the local mode provided, but now
via Storage API instead of the extension.
pyproject.toml
kbcstorage dep stays (admin-side bucket/table list still uses the
SDK in app/api/admin.py / connectors/keboola/client.py); only the
data path is migrated off the SDK. Comment updated to reflect the
new boundary.
tests
- test_keboola_storage_api.py (NEW, 19 tests): ExportFilter parsing,
HTTP client (token redaction, retry logic, polling), download_file
(single, gzipped, sliced), end-to-end export_table_to_csv.
- test_keboola_materialize.py rewritten: mocks KeboolaStorageClient
instead of FakeAccess; same atomic-write + zero-rows + unsafe-id
contracts.
- test_sync_trigger_keboola_materialized.py: registry rows now carry
bucket+source_table+JSON-shape source_query.
114+ Keboola-impacted tests green locally.
* test: schema version assertion bumped to 26 alongside the keboola query_mode migration
* fix(keboola): cutover hot-patches surfaced on agnes-dev
Five small fixes that were applied as in-container hot-patches during
agnes-dev cutover and need to be on the source-of-truth image so a fresh
upgrade does not undo them.
- app/api/sync.py: auto-discover gate considers the WHOLE registry (any
source, any mode), not just rows where source matches and query_mode
is local. After the v25→v26 keboola materialized migration an
instance can have 30 materialized rows and zero local rows; the
previous gate kept re-firing _discover_and_register_tables every
scheduler tick, creating duplicate auto-discovered rows with the
wrong bucket prefix every time.
- app/api/admin.py: _discover_and_register_tables reassembles the
bucket as <stage>.<bucket-id> (e.g. in.c-finance) instead of
dropping the stage prefix; default query_mode for keboola is now
materialized (the v26 contract); validator allows NULL source_query
for keboola materialized rows (full-table export via Storage API
export-async, no SQL needed).
- cli/commands/admin.py: register-table mirrors the server validator
(NULL source_query allowed for source_type=keboola); --bucket help
text generalized to cover both BQ dataset and Keboola bucket id.
- connectors/keboola/extractor.py: max_line_size=64 MiB on
read_csv_auto so embedded JSON / SQL cells (kbc_component_configuration
in particular) do not trip the default 2 MiB ceiling.
- connectors/keboola/storage_api.py: GCP backend support — when the
Storage API returns a manifest whose slice URLs are gs://
references with a gcsCredentials block, rewrite to the JSON REST
download endpoint and authenticate with the issued OAuth bearer
token; redact tokens in any surfaced error string.
* test: align with new keboola materialized + auto-discover-gate contracts
- test_admin_keboola_materialized: rename
test_register_keboola_materialized_rejects_missing_source_query →
test_register_keboola_materialized_accepts_missing_source_query.
v25→v26 introduced 'keboola materialized with NULL source_query
means full-table export via Storage API export-async' as the
default registration shape; the rejection case is no longer the
contract.
- test_sync_filter: add list_all() to _StubRegistry. The auto-discover
gate in _run_sync now keys off the WHOLE registry (not just local
rows) so materialized-only Keboola instances do not re-trigger
discovery on every tick.
* feat(keboola): native parquet export — skip CSV roundtrip
Storage API export-async accepts fileType={csv,parquet}. Switching the
materialized sync to parquet eliminates the CSV → DuckDB COPY → parquet
roundtrip that pinned a single uvicorn worker over 4 GiB on multi-GB
tables (read_csv with all_varchar + max_line_size=64MB has to
materialize the whole CSV in memory before COPY can stream out a
parquet). Snowflake UNLOAD on Keboola's side already produces typed,
self-contained parquet files; the extractor downloads them and renames
into place.
Two cases:
- **Single-file** export (small table): file_info.url points at one
signed URL; download_file streams chunks straight to .parquet.tmp
and we're done. No DuckDB.
- **Sliced** export (Snowflake UNLOAD respects MAX_FILE_SIZE — 16 MiB
default — so anything larger arrives as N parquet slices): each
slice is a complete parquet file with its own footer; naive concat
would corrupt them. download_file_slices keeps the slices as
separate files in a tempdir, then DuckDB COPY (SELECT * FROM
read_parquet([slice0, slice1, ...])) merges them into one
consolidated parquet. DuckDB streams row groups during this — peak
memory bounded to one row group (~1 MiB) regardless of source size.
The legacy CSV path stays as the explicit opt-in via source_query=
'{"file_type":"csv"}' for projects whose backend can't UNLOAD
parquet (none known today; cheap escape hatch). Backward-compat alias
KeboolaStorageClient.export_table_to_csv kept.
Also fixes a latent bug in download_file's gzip detection: previous
heuristic flagged any unencrypted file as gzipped, which would have
corrupted parquet downloads at gunzip time. Name-suffix-only now.
* fix: tempdir leak cleanup, every 0m schedule, /sync/trigger body shapes
Three small self-contained fixes uncovered during agnes-dev cutover.
- connectors/keboola/extractor.py: tempfile.TemporaryDirectory now uses
ignore_cleanup_errors=True so a worker death mid-write doesn't leave
multi-GiB stale slice trees on the boot disk. (12 GiB seen after a
disk-full crash where TemporaryDirectory's own cleanup also raised
and got swallowed.)
- src/scheduler.py: is_valid_schedule accepts 'every 0m' (interval=0
= always due). Force-resync of an errored row no longer requires
waiting out the default 'every 1h' interval — admin can flip the
schedule, trigger, then flip back.
- app/api/sync.py: POST /api/sync/trigger accepts both ['table_id']
(legacy bare-array body) and {'tables': ['table_id']} (matches the
response payload shape, more discoverable for clients building
requests by hand). Malformed bodies return 422 with a structured
detail; null/missing means 'sync everything' as before.
Tests cover: tempdir cleanup on raise (sliced parquet path),
is_valid_schedule + is_table_due 'every 0m' acceptance, and trigger
body parametrized matrix (8 valid shapes + 6 rejection cases).
* fix: targeted-trigger filter in materialized pass + auto-upgrade defer
Two operational gaps observed during agnes-dev cutover, in the same
sync-routing area.
- _run_materialized_pass now takes a 'tables' arg and skips rows not in
the target set with reason='not_in_target'. POST /api/sync/trigger
with a body of tables previously only scoped the legacy extractor
subprocess — the materialized pass kept iterating every due
materialized row, so an admin asking to re-sync kbc_job re-ran
every other due materialized row alongside it. Match on registry id
OR name (admins commonly pass either form). tables=None preserves
the no-filter behavior.
- New GET /api/sync/status (public, no auth) returns {locked: bool}
off _sync_lock.locked(). agnes-auto-upgrade.sh probes this before
docker compose up -d and exits 0 with a 'deferred recreate' log
line if a sync is in flight — the next 5-min cron tick retries.
Pre-fix, an auto-upgrade triggered mid-sync would recreate the
uvicorn worker and kill the in-flight extractor / Snowflake-UNLOAD
download (observed when kbc_job's first 7-day retry got SIGKILLed).
Connection failures in the probe fall through to the upgrade —
being stuck on a wedged image is worse than interrupting a
hypothetical sync.
* fix: auto-discover protects admin overrides + surfaces drift
Two real-world incidents on agnes-dev drove this:
1. kbc_job was registered manually with the correct
(in.c-kbc_telemetry, kbc_job) coordinates. A naive auto-discover
re-run would have inserted a SECOND kbc_job row at the slugified
id 'in_c-keboola-storage_kbc_job' (where Keboola's discovery
places it) — and that row's Storage API export-async 404s.
2. An earlier auto-discover bug stripped the stage prefix from
bucket ids ('c-finance' instead of 'in.c-finance'), inserting
137 rows whose syncs all failed.
Fix:
- _discover_and_register_tables now builds a plan first
(_build_keboola_discovery_plan) classifying each discovered table
into one of new / existing_match / existing_drift / invalid, then
executes only the 'new' bucket. Drift rows are reported with both
sides of the disagreement plus drift_kind:
- same_id_diff_coords: registry has the same id but different
bucket / source_table (admin migrated coords inline).
- name_collision: discovery's slugified id differs from any
registry id, but the discovered .name matches an existing row's
.name (case-insensitive). Catches the kbc_job case.
- Bucket detection now prefers the API's authoritative bucket_id
field (separate field on the Keboola tables.list response,
normalised by KeboolaClient.discover_all_tables). Falls back to
id-string parsing only when bucket_id is missing (older fallback
path inside discover_all_tables).
- Endpoint POST /api/admin/discover-and-register?dry_run=true
returns the plan without writing — would_register, drift,
invalid lists. Lets an operator audit before merging discovery
with a registry that has admin overrides.
Removed 'every 0m' from test_register_request_rejects_malformed_sync_schedule
— the runtime started accepting it in the previous commit (force-resync
override) and the validator follows suit.
* feat(keboola): AGNES_TEMP_DIR routes tempfiles off overlayfs /tmp
The container's /tmp lives on the boot disk's overlayfs (29 GiB on
agnes-dev, shared with /var). Snowflake UNLOAD of a wide table writes
slices into per-call /tmp tempdirs that fill multi-GiB / many-slice
exports long before the dedicated data disk fills. agnes-dev hit
100% boot-disk while the 20 GiB data disk had 15 GiB free.
connectors.keboola.storage_api.get_temp_root() reads AGNES_TEMP_DIR;
mkdirs the target on first use; unset / empty / unwritable falls
back to None (system tempdir, OSS-pre-fix behaviour). Both
materialize_query (parquet path) and _extract_via_legacy (CSV
fallback) and the sliced-CSV concat path in storage_api use the
helper now.
docker-compose.yml defaults AGNES_TEMP_DIR=/data/tmp on app, scheduler,
and extract services. The data volume is the dedicated disk in
production layouts and a plain docker volume in single-disk
dev/laptop setups — same blast radius as the previous /tmp default
on the latter, no regression.
Operator-and-analyst quality bundle: a security fix for the optional
Telegram bot, two CLI gaps closed, and three rounds of UX polish on
`agnes diagnose` and `agnes pull` so non-TTY consumers (CI runners,
Claude Code SessionStart hooks, sub-agent watchdogs) get readable,
actionable signal.
- Pairing-code RNG: random.choices -> secrets.choice (CSPRNG).
- Telegram script runner: refuse out-of-shape usernames before sudo -u.
CLAUDE.md.bak.<ISO-timestamp> before regenerating.
- agnes admin unregister-table <id> -> DELETE /api/admin/registry/{id}
- agnes admin update-table <id> --field=value ... -> PUT /api/admin/registry/{id}
response but never promotes the headline. BQ billing-equals-data check
downgraded warning -> info.
default (5 s / 1 MiB vs 30 s / 10%) so sub-agent watchdogs don't kill
the pull as a hung process. New env knobs:
AGNES_PULL_PROGRESS_INTERVAL_{SECONDS,BYTES}.
--include-schema (or ?include=schema) to opt back in.
Tests: 120 passed across the touched modules, including new tests for
each fix. Pre-existing failures on main (DB migration v1->v9, binary
rename) are unrelated and not introduced here.
The startup script runs on every boot but the metadata_startup_script
field is in lifecycle.ignore_changes — so a TF apply that changed
image_tag does NOT reach a long-lived VM until someone explicitly
recreates it. Meanwhile, operators commonly hand-edit /opt/agnes/.env
to pin a specific image (custom branch builds, staged rollouts).
Pre-fix, every boot rewrote .env from the baked-in template and
clobbered the operator's choice — concretely, a stop+start triggered
by a machine_type change would reset AGNES_TAG to whatever was in
the template at first provision, regardless of the operator's
intervening edit.
Now the script reads the existing .env (when present) for AGNES_TAG
and AGNES_TEMP_DIR; when those keys are set, the existing values
win over the template-computed ones. Logged on stdout when AGNES_TAG
disagrees with $IMAGE_TAG so an operator audit-trails the boot.
Fresh provisions are unchanged (no .env yet → template values land).
To force a TF-driven reset on an existing VM: rm /opt/agnes/.env
and reboot. Cut as infra-v1.8.0 — additive, downstream consumers
opt in by bumping the module ref.
The 'Add to group' dropdown on /admin/users/{id} silently filtered out
every Google-Workspace-managed group (rightly — the API would 409 on
POST). On deployments where Admin and Everyone are both Workspace-mapped
via AGNES_GROUP_{ADMIN,EVERYONE}_EMAIL and no custom Agnes groups exist
yet (FoundryAI prod + dev today), the picker showed only the literal
'— Pick a group —' option with the 'Add' button disabled. Operator had
no indication that they needed to create a custom group first.
Three states surface a hint below the picker now:
- user is already in every group (literally nothing left)
- every remaining group is Google-Workspace-managed (link to
/admin/groups + admin.google.com explainer)
- no groups exist at all
The skip-google-managed logic stays — POST would still 409 on those
rows, this just stops the empty-state from being a silent dead end.
Add allow_stopping_for_update=true on google_compute_instance.vm. Without
it, a TF change to machine_type triggers ForceNew (destroy + recreate);
with it, the provider stops + mutates + restarts the VM in place, which
is what an operator resizing a running deployment expects. Tag as
infra-v1.7.0; consumers opt in by bumping the module ref.
See CHANGELOG.md for the full entry. (Bumped from 0.42.0 to 0.43.0 since
0.42.0 was taken by PR #208's backtick-rewriter fix during this branch's
review cycle.)
0.40.0 added _persist_materialized_inner_view in materialize_query, which
tried to open extract.duckdb from a fresh DuckDB handle to write the _meta
row + inner view. In production this conflicts with the same uvicorn
process's existing read-only ATTACH (orchestrator's analytics conn holds
extract.duckdb ATTACHed as <source_name> alias), and DuckDB single-process
file-handle uniqueness rejects with:
Binder Error: Unique file handle conflict: Cannot attach "extract"
— already attached by database "<source>"
The helper logs WARNING fail-soft, parquet stays canonical, but the
master view never appears via the meta path.
Fix: at the end of _attach_and_create_views, scan
<extract_dir>/data/*.parquet and CREATE OR REPLACE VIEW <id> AS
SELECT * FROM read_parquet('<path>') for any parquet whose <id> is not
already in the per-source tables list (= meta path didn't pick it up).
Decoupled from materialize_query open-handle race. Honors the same
view_ownership cross-connector collision rules as the meta path
(first-come-first-served via view_repo.claim).
Tests:
- filesystem-fallback fires when _meta row missing
- skipped when meta path already created the view (no shadow)
- skips invalid identifiers (e.g. parquet stem starting with a digit)
- doesn't crash when source has no data/ subdir
Pre-fix flow:
1. extractor subprocess writes _meta with N remote rows + creates N inner
views in extract.duckdb (rebuild_from_registry skips materialized rows
per design — explicit `continue` at line 389)
2. _run_materialized_pass calls materialize_query, which writes parquet
atomically + returns stats — but never updates _meta
3. orchestrator.rebuild scans _meta, finds only the N remote rows, creates
master views only for them. Materialized parquet is on disk but
invisible to /api/query → 400 'not yet materialized'
Symptom appears after every container recreate (the previous run's _meta
state is wiped because docker compose down nukes the named volume that
backs extract.duckdb on some compose layouts; even on volumes that
persist, the next extractor pass calls _create_meta_table which DROPs
+ CREATEs _meta cleanly).
Fix: after os.replace(tmp_path, parquet_path) in materialize_query, open
extract.duckdb (read-write), DELETE existing _meta row for table_id,
INSERT new one with query_mode='materialized', and CREATE OR REPLACE
VIEW <table_id> AS SELECT * FROM read_parquet(<path>). All inside a
single transaction so concurrent reads see either old or new state, not
torn rows. Fail-soft on lock contention or schema drift — parquet
remains canonical, next sync pass recovers.
Tests: 3 new in test_bq_materialize.py covering:
- meta + inner view registered after materialize, alongside existing
remote rows
- re-run replaces (not duplicates) the meta row
- skips inner-view registration when extract.duckdb doesn't exist yet
(fresh BQ-only deployment edge case)
Address code-reviewer findings on the bigquery_query() rewrite path:
1. Outer LIMIT wrap — bigquery_query() materialises BQ result into DuckDB
before fetchmany sees it (vs ATTACH-catalog Storage Read API streaming).
A user 'SELECT *' against a billion-row remote table would buffer the
entire result before request.limit applied. Wrap rewritten SQL in an
outer 'LIMIT N+1' so the cap pushes into the BQ job itself.
2. Dollar-quoted inner SQL — naive replace("'", "''") doubling missed
DuckDB backslash-escape sequences (\\, \\n, \\t, …). A predicate
like 'WHERE name = ''O\\'Brien''' was unsafe under the doubling
path. DuckDB $bqq_inner$ … $bqq_inner$ form takes the inner SQL
verbatim with no escapes whatsoever. Falls back to legacy doubling
if user SQL improbably contains the literal tag.
3. Parse-error fallback — when the rewritten path fails with a BQ-side
parse / validation error (DuckDB-only syntax like ::INT cast that
survives identifier rewrite but BQ refuses), retry the user's
original SQL via the legacy ATTACH-catalog path so the request still
succeeds. Mirrors the existing dry-run fallback contract.
4. CHANGELOG — delete duplicate CLI bullets that landed under
already-released [0.38.1] (file corruption from merge — entries are
correctly under [0.39.0]).
Two improvements to `agnes pull` progress reporting:
1. **Aggregated per-file progress across chunked downloads**: the
existing Rich progress bar already used one task per file, but the
chunked-download contract (one file = N parallel chunk callbacks
summing to file size) meant we needed to verify that all chunk
threads advance the same task. They do — the per-file callback is
constructed once per tid and routes every chunk's byte delta to the
same task / textual entry, so the bar shows one aggregated bytes-
downloaded total rather than N separate sub-bars.
2. **Textual fallback for non-TTY stderr**: when stderr is not a
terminal (SessionStart hook, CI runner, Docker log capture), Rich
either suppresses output (silent multi-minute pull on a 5 GB
parquet) or emits raw control sequences. The new `_TextualProgress`
helper instead emits one plain-text line per file at most every
10%-of-total-bytes or 30 s, plus a final `100% done` line per file.
Format: `[N/T files] <tid>: 25% (16 MB / 66 MB) at 1.5 MB/s`.
The TTY path is unchanged. Detection uses `sys.stderr.isatty()` —
`show_progress=True` flips into the textual fallback when that returns
False. `show_progress=False` (the SessionStart hook) still emits no
progress text in either mode.
translate_bq_error previously mapped BQ's responseTooLarge failure mode
to bq_bad_request (HTTP 400 with the raw upstream message). The user-
facing implication ('your SQL has a syntax error') is wrong -- the root
cause is query shape (BQ refused to return the result inline because
it exceeded the response size limit), and the actionable remediation is
'narrow the WHERE clause, aggregate further, or use a materialized
table'.
Add bq_response_too_large as a first-class BqAccessError kind (also 400)
with a canonical hint message; original BQ message preserved in details
for operator debugging. Detection is substring-based on 'response too
large' and fires before the generic BadRequest path so the dedicated
mapping always wins. Affects every BQ-touching path since they all
share translate_bq_error -- /api/query, /api/v2/{scan,sample,schema},
materialize.
Pool the httpx.Client used by `stream_download` so N parquet downloads
share a single TLS handshake instead of one handshake each. With the
optional `h2` package installed, HTTP/2 multiplexing further lets all
chunk Range requests share a single TCP connection — synergizes with
the range-chunked download path added in the previous commit.
The shared client is created lazily on first stream-download call, kept
alive for the duration of the process via a module-level slot, and
closed at exit via `atexit.register`. Construction wraps in a
try/except: when `h2` is unavailable (slim install), httpx raises
ImportError on `http2=True` and we transparently fall back to an
HTTP/1.1 client — pooling alone still amortizes TLS handshakes.
`agnes pull` must never crash on a missing optional package, so the
fallback path is non-negotiable. `h2>=4.1.0` is added to the core
dependency set; downstream slim installs that drop it lose the HTTP/2
benefit but keep correctness.
Each BqAccess.duckdb_session() acquire previously created a fresh
in-memory DuckDB conn and ran INSTALL bigquery; LOAD bigquery;
CREATE SECRET; ATTACH on it -- costing ~0.5 s per request even before
any BQ work. Add a process-local pool (deque + lock) of pre-warmed
sessions; acquire reuses a warm entry when available, refreshing the
auth SECRET so a long-lived pool entry doesn't keep a stale GCE
metadata token past its TTL. Liveness probe (cheap SELECT 1) drops
broken entries before handing them to callers.
On exception inside the with-block the conn is closed instead of
returned to pool (session may carry dirty state). Pool size is
data_source.bigquery.session_pool_size (default 4; sentinel 0
disables pooling). Process-cached, not fork-safe (single uvicorn
worker is the supported deployment shape per CLAUDE.md).
All call sites get faster automatically: /api/query, /api/v2/{scan,
sample,schema}, materialize, the orchestrator's remote-attach, and
the BQ dry-run cap-guard.
When the server advertises `accept-ranges: bytes` and a parquet exceeds
`AGNES_PULL_CHUNK_THRESHOLD_BYTES` (default 50 MB), `stream_download`
now splits the file into N parallel HTTP Range requests
(`AGNES_PULL_CHUNK_PARALLELISM`, default 4, capped 1..16) and
assembles the parts into the destination atomically.
Targets the per-flow-shaped network (corp VPN with per-TCP-connection
rate-limiting) where single-stream throughput is throttled but N parallel
streams over the same connection scale roughly linearly. Manifests with
1 large materialized parquet + N remote tables previously left the
existing across-files `AGNES_PULL_PARALLELISM=4` pool with 1 active
worker = single-stream throughput; this fixes that.
Falls back to single-stream when:
- HEAD doesn't advertise `accept-ranges: bytes`
- Server returns 200 instead of 206 to a Range probe
- File size below the threshold
Cleanup discipline: every part file removed before return (success or
failure); destination written via `<target>.tmp` and renamed atomically.
Per-chunk retry on transient network blips (bounded by AGNES_STREAM_RETRIES).
User SQL hitting query_mode='remote' BigQuery rows was 50-100x slower
than the equivalent direct bigquery_query() call because DuckDB's master
view (CREATE VIEW … AS SELECT * FROM bigquery.<ds>.<tbl>) does not push
WHERE/SELECT/LIMIT into BQ in ATTACH-catalog mode. The BQ extension opens
a Storage Read API session over the entire upstream table; on >100M-row
sources this was 70-150s and frequently failed with 'Response too large
to return'.
Extract the existing dry-run rewriter's core (table-name → BQ-native
backtick path) into a shared helper. Add an execution-path rewriter
that wraps the whole user SQL in bigquery_query('<project>', '<inner>')
so the BQ planner sees the full query and engages partition pruning +
projection pushdown server-side.
Conservative fall-through: cross-source JOINs (BQ ↔ Keboola/Jira local),
queries already containing bigquery_query(, and unconfigured BQ project
all skip the rewrite and run the original SQL via ATTACH-catalog so
behavior degrades gracefully.
The flat-mount overlay's caddy `volumes: !override` block listed only
three mounts, but the base docker-compose.yml caddy service has five.
`!override` (compose-spec semantics) replaces the entire list, so two
mounts were silently dropped under the flat layout:
- `data:/srv:ro` — Caddy's read-only view of the agnes data dir, used
by the `@download` file_server handler in Caddyfile (added in v0.36.0
as the perf bypass for multi-GB parquet downloads). Without this
mount, `try_files /bigquery/data/<id>.parquet …` finds no file and
every parquet download falls through to the app's uvicorn worker —
defeating the bypass entirely.
- `caddy_config:/config` — Caddy's autosave/ACME state. Less critical
(we feed certs in via /certs) but loses the autosaved adapter config
across container recreates.
Restated both mounts with a comment block explaining the !override
caveat for any future overlay author.
Plus: CHANGELOG entries for the host-mount.yml direct-bind fix and
the STATE_DIR + flat-mount overlay under [Unreleased].
Devin Review on PR #188 (15:53Z): the renamed [0.36.0] section had
two separate ### Added blocks and two separate ### Changed blocks,
which violates Keep-a-Changelog grouping (and CLAUDE.md's explicit
'group by section' rule). Merged each set into a single ordered
block: Added, Changed, Fixed. No content removed; only reflowed.
Renames the [Unreleased] section to [0.36.0] in CHANGELOG, adds the
top-level summary, drops a fresh empty [Unreleased] above, and bumps
pyproject from 0.35.1.
Also fixes the third Devin Review finding on this PR: the CLI
ReadTimeout message hardcoded QUERY_TIMEOUT_S (300s) so a 30s-default
call (agnes catalog, agnes auth, …) reported a wait window that
didn't match reality. _translate_transport_error now takes the actual
httpx timeout from the calling helper; the BQ-job advisory only
appears for calls where the timeout was set ≥ 60s.
Three first-try-failure-surface fixes from Pavel's #185 trace + the
template guidance question, all under PR #188's umbrella so they land
together with the file_server / parallel pull / Tier 1 work.
1. CLI clean-error wrapper — new AgnesTransportError raised by the
api_*/stream_download helpers when httpx times out / drops /
refuses, plus a top-level Typer wrapper (cli/main.py) that prints
one-line "Error: …" + actionable hint and exits non-zero. Full
traceback goes to ~/.config/agnes/last-error.log for support
forwarding. Unhandled Exceptions are caught at the same boundary
so no Python traceback ever leaks to the analyst's terminal.
Pavel's #185 Phase 3B: a 30-frame httpx traceback from a slow BQ
--remote query made it look like a CLI bug. Now: clean message +
hint pointing at `agnes snapshot create` / partition-column
guidance.
Entry point in pyproject.toml flipped from `cli.main:app` →
`cli.main:_run_with_clean_errors` so the wrapper actually runs
under the installed `agnes` binary.
2. agnes init / agnes pull --skip-materialize + progress bar.
--skip-materialize omits query_mode='materialized' rows from the
download set so a first init doesn't spend 44 minutes silently
pulling a single 6 GB parquet (Pavel's #185 Phase 1). Rich-driven
per-file progress bar with label/bytes/rate/ETA renders to stderr
when not --quiet and not --json. Aggregates across the parallel
ThreadPoolExecutor workers added earlier in this PR.
3. config/claude_md_template.txt: explicit one-line snippet pointing
at `agnes catalog --json | jq '.tables[] | select(.id=="<id>")'`
for per-table descriptions + restated invariant: "the description
field on each catalog row is the authoritative business-rules
text — re-read live, never copy into this file." Resolves the
regression-or-feature debate between Pavel (wants annotations)
and the user feedback that landed in the prior commit (don't
embed table-specific content; tables change). Catalog command
stays the source of truth.
Five hottest BQ-touching endpoints were `async def` but invoked synchronous
DuckDB / BQ-extension calls inside the body. Under uvicorn's single event
loop that meant a single heavy `agnes query --remote` (waiting up to
~200 s for BQ's jobs.query) froze EVERY other request — /api/health,
dashboard, auth, even another query — for the full BQ wait. Operators
saw "VM idle, app frozen" during PR #188's testing.
Convert to plain `def` so FastAPI auto-offloads the body to the anyio
thread pool. Event loop stays free for non-BQ requests.
- app/api/query.py:execute_query
- app/api/v2_scan.py:scan_estimate_endpoint, scan_endpoint
- app/api/v2_sample.py:sample
- app/api/v2_schema.py:schema
Audit: 0 `await` statements in any converted handler (verified file-by-
file), so the rename is safe. Tests in tests/test_v2_*.py called the
handlers via `asyncio.run(...)` which now fails on a non-coroutine return;
swapped for direct calls (asyncio.run( -> ( ) — keeps paren balance).
Plus AGNES_THREADPOOL_SIZE env var (default 200, was anyio's stock 40)
in app/main.py:lifespan. Set via
anyio.to_thread.current_default_thread_limiter().total_tokens. 200 is
comfortable headroom for <50 concurrent analysts; bump for more.
480/480 impacted tests pass (the 2 remaining errors are a pre-existing
fixture setup issue in test_reader_smoke_matrix.py unrelated to this
change).
Three concrete changes addressing the "analyst Claude misuses the CLI"
class of bugs (image.png table — issues #3, #5, plus the recurrent
"how big is this table" guesswork):
1. config/claude_md_template.txt — the template agnes init writes to
<workspace>/CLAUDE.md. Surfaces every catalog-row field with a why,
adds a query_mode-based decision tree, explicit --estimate scoping
(snapshot create ONLY — was the #1 first-try error), an agnes fetch
→ agnes snapshot create rename note, and a 6-row failure-mode table
that maps each common error wording to its right next step.
2. app/api/v2_catalog.py — populate rough_size_hint for local +
materialized rows from the on-disk parquet size, bucketed
small/medium/large/very_large. Was hardcoded null with a TODO; AI
couldn't tell "is this 6.8 GB" without a failed --remote round-trip.
3. cli/update_check.py — the [update] banner survived the da→agnes
rename and printed "[update] da X is out of date" on every command,
training analysts to associate the binary with the old name.
Verified by rendering the template against representative contexts
(33/33 tests pass) and running every use case from the original
screenshot through the real CLI against a dev VM.
The download loop in cli/lib/pull.py was strictly serial — N tables took
Σ stream_download(t_i). With the Caddy file_server change in this PR,
the server can now sustain many parallel sendfile transfers without
blocking app workers, so the client-side serialization became the new
bottleneck.
Switch to ThreadPoolExecutor capped by AGNES_PULL_PARALLELISM (default 4,
set 1 to restore pre-PR serial). 4 matches typical home-broadband
saturation without over-subscribing the analyst's NIC. Drops to serial
when len(to_download) <= 1 to avoid executor overhead in the common
single-table case.
Per-table error semantics preserved via (tid, entry, err) tuple — a
failure on one parquet doesn't abort the rest of the batch.
Verified end-to-end against a dev VM with the new Caddy file_server
deployed: 2-table pull through agnes CLI works under the new concurrency.
Sibling change to the Caddy file_server PR (#182). Without this,
existing long-uptime VMs would pull the new agnes image on auto-upgrade
but keep their stale Caddyfile + docker-compose.yml — leaving the
file_server route + the data:/srv:ro mount inert. Confirmed live
2026-05-05 when the file_server change merged in main but stayed
unreachable on a running dev VM until /opt/agnes/* was scp'd by hand.
agnes-auto-upgrade.sh now hashes the bind-mounted config files
(Caddyfile + every docker-compose overlay) on every 5 min tick and
triggers a `docker compose up -d` recreation when the hash drifts
— same trigger path as an image-digest change. Fail-soft via the
.new-then-mv pattern: a curl 404 / network blip leaves the existing
file untouched.
Self-update at the bottom of the script: re-fetch
/usr/local/bin/agnes-auto-upgrade.sh itself so the very fix that
watches config files lands on running VMs without a manual ssh-and-
curl cycle. Otherwise we'd have a self-perpetuating "old script
problem" — the watch-config logic never propagating to the VMs that
need it.
Operators no longer need to ssh + scp Caddyfile/compose changes.
A single analyst's multi-GB `agnes pull` held the only uvicorn worker
for the duration of the stream, starving UI / /api/health / every other
API endpoint. Container flipped to `unhealthy`. Triggered while a
6.8 GB `order_economics` pull was in-flight on prod 2026-05-05.
Caddy now intercepts `GET /api/data/{table_id}/download` and serves
the parquet directly via sendfile from the data volume (mounted r-o
at /srv inside the caddy container). RBAC enforced by `forward_auth`
to a new lightweight `GET /api/data/{table_id}/check-access` endpoint
(returns 204 / 403) — the bulk transfer never reaches uvicorn.
Path discovery via `try_files` over the known extract.duckdb v2 source
subdirs. Anything not at a static path falls through to the existing
app handler so legacy `src_data/parquet` and future connectors still
work without a Caddyfile change. Non-Caddy deployments are unchanged.
Stage 1 (multi-worker uvicorn) was considered but blocked by the
single-writer DuckDB lock on system.duckdb — workers > 1 would crash
at startup on "Could not set lock on file", the same race that pushed
the scheduler from in-process writes to HTTP-via-app. Multi-reader
workers + single-writer coordination is out of scope for this PR.
DuckDB BigQuery extension defaults `bq_query_timeout_ms` to 90 s, which
is too tight for analyst-scale queries against view-backed BQ datasets.
`agnes query --remote` HTTP 400'd with `Binder Error: Query execution
exceeded the timeout. Job ID: ...` whenever the underlying BQ job ran
longer than 90 s, even though the job itself was healthy.
Add `data_source.bigquery.query_timeout_ms` (default 600 000 ms = 10 min,
sentinel 0 falls through to the extension default). Applied via
`SET bq_query_timeout_ms` after every `LOAD bigquery` on every BQ-touching
DuckDB session: orchestrator's `_remote_attach` ATTACH path, BqAccess
session factory, and the standalone extractor. Configurable via
`/admin/server-config` UI.
Fail-soft: extension versions that don't recognise the setting silently
keep the default rather than poisoning the session.
* fix(keboola): per-table fallback to legacy Storage-API client
The DuckDB Keboola extension's per-table COPY fails with
`Schema '..."in.c-..."' does not exist or not authorized` on
projects whose Snowflake backend doesn't expose bucket schemas
to the storage-token-derived QueryService role
(keboola/duckdb-extension#17). ATTACH itself succeeds, so the
existing extension-level fallback in `_try_attach_extension`
never triggers — the table is just marked failed.
- Promote `kbcstorage>=0.9.0` from optional to core dep so the
legacy client import in `_extract_via_legacy` doesn't crash
default installs with `ModuleNotFoundError`.
- Wrap `_extract_via_extension` in a per-table try/except so a
scan failure retries via `_extract_via_legacy` instead of
recording `tables_failed` and moving on.
Slower than the extension path, but produces correct parquets
on affected projects while the upstream extension fix lands.
* test(keboola): cover per-table extension→legacy fallback
Two existing tests mocked _extract_via_extension to throw and asserted
the original message survived in result["errors"]. With per-table
fallback, the new flow retries via _extract_via_legacy — which on the
mock URLs would throw a different (404 / DNS-fail) error, replacing the
asserted message.
- Mock _extract_via_legacy alongside _extract_via_extension in
test_network_timeout_during_extraction +
test_partial_failure_continues +
test_all_tables_fail_returns_full_failure_stats so the assertion
observes the final propagated error from the fallback chain.
- Add test_extension_per_table_failure_falls_back_to_legacy that
exercises the new behavior directly: extension scan fails with the
QueryService schema-not-authorized message
(keboola/duckdb-extension#17), legacy succeeds, parquet ends up
queryable.
Patch release bundling the only Unreleased change: bump httpx client
timeout for agnes query --remote from 30s to 300s (configurable via
AGNES_QUERY_TIMEOUT). Renames CHANGELOG [Unreleased] section to
[0.35.1] and bumps pyproject version to match.
The httpx client behind 'agnes query --remote' used the default 30s
timeout, killing every BigQuery SELECT that took longer than half a
minute — i.e. most non-trivial remote queries.
cli/client.py now exposes QUERY_TIMEOUT_S (default 300s, override via
AGNES_QUERY_TIMEOUT) and propagates a kw-only 'timeout' through
api_get/post/delete/patch. _query_remote passes QUERY_TIMEOUT_S so only
the long-running /api/query path gets the bump; every other CLI call
keeps the 30s default.
Server-side has no read deadline on /api/query, so the client cap was
the sole bottleneck.
Backup-orchestration commands were split across two namespaces (pull in
agnes store, push in agnes admin store), which broke the operator
mental model — pull/push are a paired operation and should sit
together.
Move pull + info into agnes admin store so all bulk operations share
one help screen. Add agnes store mine as the user-facing equivalent —
calls the same /api/store/bundle.zip endpoint with ?owner=me, which
the server resolves to the caller's user_id. Authors can archive
their own uploads without admin role; whole-Store bulk reads stay
admin-flavored as a discoverability hint.
Server: 3-line addition to export_bundle handles owner='me' as a
magic alias for the caller. No new endpoint.
Tests updated: pull/info expectations move from agnes store to
agnes admin store; new tests cover agnes store mine and the
?owner=me server resolution. 69/69 store tests green locally.
Adds whole-Store backup/restore primitives so an external CI/CD job can
mirror the Store to a git repo (and restore back from one).
REST:
- GET /api/store/bundle.zip — deterministic ZIP of all (filtered) Store
entities. Layout: manifest.json + entities/<id>/{plugin,assets}/.
Manifest carries owner_email for cross-instance restore. Auth: any
authenticated user (Store is community-open).
- POST /api/store/import-bundle — admin-only restore. Modes
merge|replace|skip; owner resolution by email with stub-disabled-user
fallback when the email is unknown on the target instance.
CLI:
- agnes store update <id> [--description X] [--zip PATH] ... — in-place
edit (server PUT permits owner OR admin per F4). Closes the missing
edit affordance for analysts who want to fix a typo or push a new
ZIP without losing install_count.
- agnes store pull [-o store.zip] [--unpack DIR] — download the bundle.
--unpack streams + extracts so an external git-backup workflow can
drop the tree straight into a repo and `git add .`.
- agnes store info [--json] — counts + size summary.
- agnes admin store push <zip-or-dir> [--mode ...] — admin-only restore.
Auto-zips a directory client-side so a working-tree → server
round-trip is one command.
cli/v2_client.py gains api_get_stream helper for binary downloads.
Tests: 5 new server tests (bundle shape + filters + round-trip + stub
user creation + skip mode + admin-only gate) + 11 new CLI tests
(update, pull/unpack, info, admin push). 66/66 store-related tests
green locally.
The previous gather used `sorted(glob, key=lambda p: p.stat().st_mtime)`.
A transient OSError (race with delete, permission flicker, EBADF on a
weird filesystem) on any single file raised through the lambda and 500-ed
the whole page.
Reworked: stat each path under try/except into a (path, stat) list, sort
the already-statted entries. Bad files drop silently from the listing.
Regression test test_profile_sessions_page_tolerates_stat_failures
patches Path.stat to raise on one of two files, asserts the page returns
200 with the good row rendered and the bad row dropped.
Devin flagged that run_session_collector still had the same audit-skip
gap I fixed in run_verification_detector and run_corporate_memory in
the previous two rounds — a PermissionError walking /home, an OSError
on /data/user_sessions mkdir, or any other unhandled exception from
collector.run() would skip the audit_log row and only show in docker
logs.
Same try/except + unhandled_error pattern as the sibling endpoints.
All three LLM-pipeline run-* endpoints now record their failures the
same way; /admin/scheduler-runs sees them. Regression test in
tests/test_admin_run_endpoints.py::TestRunSessionCollector::test_unhandled_exception_still_audits.
User feedback during e2e of #179: the listing page is nice but I want
to grab the raw jsonl and look at what's inside.
Adds GET /profile/sessions/<filename>:
- Auth via get_current_user (owner-only).
- Path safety: rejects "/", "\", "..", leading ".", and any non-".jsonl"
filename. The served path resolves under
${DATA_DIR}/user_sessions/<caller.id>/; if resolution escapes that
base directory, returns 404 (never 403, so existence of other users'
files isn't leaked).
- FileResponse with Content-Disposition: attachment.
UI: Download button per row in profile_sessions.html.
Tests in test_web_ui.py: path-traversal / nested / dotfile / non-jsonl
all 404 for owner; unauthenticated 302/401/403; authenticated owner
gets 200 + correct Content-Disposition.
Devin flagged that run_corporate_memory still had the same audit-skip
gap I just fixed in run_verification_detector — if collect_all() throws
anything other than the already-translated ValueError (DuckDB lock,
network blip, unexpected SDK error), the audit_log row was never
written and /admin/scheduler-runs missed the failure.
Same try/except + unhandled_error pattern as the verification_detector
fix from 4c4dfee8. Regression test in
tests/test_admin_run_endpoints.py::TestRunCorporateMemory::test_unhandled_exception_still_audits.
Three changes addressing user feedback during e2e test of #179 + Devin Review on e86dd5ed.
1) /profile/sessions — new self-service user page in the user menu.
Lists all session jsonls the caller uploaded via `agnes push` joined
against session_extraction_state. Each row shows uploaded_at, file
size, status badge (pending/processed/extracted), processed_at, and
items_extracted. The page docstring + help text explicitly call out
that items_extracted=0 means the verification detector ran fine but
the LLM found no claims to track — that's the documented "no items"
outcome, not a broken pipeline. Closes the gap surfaced during the
e2e test of #176 where a user could see their sessions on disk and
process them through the LLM but had no UI to inspect what happened.
2) run_verification_detector audits unhandled exceptions (Devin #1).
If detector.run() threw anything other than the already-translated
ValueError, the audit_log row was never written. The endpoint now
wraps detector.run in try/except, records the exception in
audit_params["unhandled_error"], then re-raises as 500 after audit.
The /admin/scheduler-runs page surfaces the failure row with the
error type + message.
3) SCHEDULER_AUDIT_ACTIONS list corrected (Devin #2). Previous list
had "marketplaces_sync_all" (wrong — actual is "marketplace.sync_all")
plus "data_refresh" and "scripts_run_due" which app/api/sync.py and
app/api/scripts.py don't write to audit_log. Fixed to the four
actually-logged strings; comment points at the missing audit calls
as a follow-up.
Tests: tests/test_web_ui.py adds TestAdminRoleGuards::test_profile_sessions_page_no_admin_required and tightens test_admin_scheduler_runs_page_admin_only to assert the correct marketplace.sync_all string.
create_entity + update_entity created the `scratch` temp dir inside one
try/finally but cleaned it up in a separate one. Validation HTTPExceptions
raised by _safe_zip_extract (zip_unsafe_path, zip_too_large_uncompressed)
or the BadZipFile→422 conversion exited the first scope, and the second
finally was never entered → temp dir leaked on every failed upload.
Devin flagged this on the F2 commit. The leak pre-existed (zip_unsafe_path
was the original vector); F2 added zip_too_large_uncompressed to the same
broken cleanup path. Fixed by collapsing scratch creation + cleanup into
one outer try/finally that covers both extraction AND metadata/bake; the
inner try/except/finally still handles BadZipFile→422 + tmp file cleanup.
Same restructure in update_entity. Regression test
`test_scratch_dir_cleaned_up_after_failed_extraction` triggers a
zip_unsafe_path 422 and asserts tmp/agnes_store_* contains no leaked
dirs.