* Add /marketplace browse page + Model B opt-in stack composition
New /marketplace browse surface unifies the curated marketplaces
(admin-managed git mirrors) and the community Flea Market behind
three tabs — Curated / Flea / My Stack — with per-tab category
filter, search across both sources with scope checkboxes, and
numeric pagination, all driven by URL query state. Plugin detail
at /marketplace/curated/<slug>/<plugin> and /marketplace/flea/<id>;
nested skill / agent detail at /marketplace/curated/<slug>/<plugin>/
{skill,agent}/<name> and the flea-side single-page detail.
Model B opt-in: an RBAC grant on a curated plugin is now only
*eligibility*. The user must click "Add to my stack" for it to
enter their served Claude Code marketplace. Composition flips
from (rbac ∖ opt_outs) ∪ store_installs to
(rbac ∩ subscriptions) ∪ store_installs. The legacy
user_plugin_optouts table is renamed user_curated_subscriptions
(schema v27) — same table shape, inverted semantic, repository
methods become subscribe / unsubscribe / is_subscribed.
UX vocabulary: Install → Add to my stack, Installed → In your
stack, card "Installed" badge → "In stack" (amber pill), tab
"My Subscriptions" → "My Stack". Bridges the two-step model
(server-side bookmark vs. on-laptop install) the previous label
hid. Click triggers an inline post-add hint panel under the
description with the agnes refresh-marketplace recipe + Copy
chip, dismissible per-browser via localStorage.
Per-tab info blocks above the filter row:
- Curated: trust signal — "Each plugin here has a named curator
accountable for it." (blue accent + See-all-curators link)
- Flea: open-shelf signal — "Anyone in the company can upload
here." (purple accent + Tips-for-sharing link)
- My Stack: personal-shelf orientation — "Your AI stack —
everything you've added." (slate accent, no link)
Tabs carry per-tab Heroicons (shield-check / building-storefront
/ rectangle-stack) tinted to match each tab's accent; the icon turns
white when its tab is active, for contrast.
Hero illustration anchored to the right of the blue hero panel
(absolute, 47% wide, behind the search row content). Hidden
under 900px viewport.
Action-row CTAs realigned to publication intent: curated
"How to add new content" → "Submit a plugin" (links to the
guide page); flea button removed since +Upload sits next to it.
Empty-state CTAs match. /marketplace/guide/{curated,flea}
routes now host publication-flow guide pages with placeholder
ledes — full copy to be authored separately.
Categories: Heroicons-based icons mapped per category in
src/category_icons.py (zero new dependencies; SVG path strings
inlined). Marketplace cards, filter pills, and detail pages
read from the same source.
API endpoints under /api/marketplace:
- GET /items per-tab listing (curated / flea / my)
- GET /categories per-tab non-zero counts
- GET /curated/{slug}/{plugin} plugin detail
- POST/DELETE /curated/{slug}/{plugin}/install subscribe toggle
- GET /curated/{slug}/{plugin}/{skill,agent}/{name} inner item
The tab=my branch reads directly from
user_curated_subscriptions ∪ user_store_installs (not
resolve_user_marketplace, which bundles flea skills/agents into
a single store-bundle synthetic entry useful for serving the
Claude Code marketplace ZIP/git but wrong for browsing where
each item should appear as its own card).
Detail pages: plugin detail surfaces inner skills/agents as
clickable nested cards; commands/hooks/MCPs render as plain
name lists. Skill/agent detail mirrors the plugin layout with
kind-tinted accents (skill = green, agent = purple), Description
+ Details sidebar, Files + Docs sections, and the "How to call
it" copy-able invocation chip showing /<plugin>:<inner-name>
exactly as Claude Code namespaces it post-install. Curated
nested has no install button — links back to the parent plugin.
Navbar: standalone "My AI Stack" relabelled "My Stack" and
points at /marketplace?tab=my; "Store" link removed (Store
flow is reachable via the Flea Market tab's +Upload button).
The standalone /my-ai-stack and /store routes still work for
old bookmarks.
Tests cover the new browse / categories / install / RBAC paths
under tests/test_marketplace_api.py; existing marketplace and
store tests updated for Model B (explicit subscribe in fixtures).
Schema bumped v26 → v27 with idempotent migration that wipes
existing user_plugin_optouts rows on flip and adds
marketplace_plugins.created_at with registered_at backfill.
* Fix v28 migration + post-rebase test fallout
v28 ALTER TABLE marketplace_plugins ADD COLUMN created_at conflicted with
_SYSTEM_SCHEMA's earlier CREATE that already includes the column on fresh
installs (test fixtures starting at any pre-v28 version trip on it).
Switch to ADD COLUMN IF NOT EXISTS — same idiom as the upstream v27
Keboola sync-strategy migration on the same ladder.
Two test patches needed after the rebase bumped SCHEMA_VERSION 27 → 28:
- test_keboola_v27_migration.py: test_schema_version_constant_is_27 was
pinning ==27. Loosened to >=27 (the test's purpose is to verify the
v27 Keboola migration, not to pin the current SCHEMA_VERSION).
- test_setup_page_unified.py: was monkeypatching resolve_allowed_plugins
but compute_default_agent_prompt now reads from resolve_user_marketplace
(Model B-aware). Stub the right function so the test exercises the
v28 served-set path.
* Harden curated skill/agent inner endpoints against path traversal
`_read_inner`, the `skill_dir` walk in `curated_skill_detail`, and the
`agent_path.stat` in `curated_agent_detail` joined URL path-params onto
`plugin_root` without verifying the resolved candidate stayed inside it.
Starlette's `[^/]+` on `{skill_name}` / `{agent_name}` blocks the direct
URL exploit (encoded `/` 404s before the handler), but a curator-planted
symlink inside a curated marketplace's git mirror could still dereference
outside the plugin tree on read.
Adds `_safe_join(plugin_root, *parts)` doing
`Path.resolve(strict=True)` + `relative_to(plugin_root.resolve())`, used
by all three call sites so the boundary is enforced once and consistently.
Tests cover the helper directly (normal path resolves, escaping `..`
returns None, escaping symlink returns None, missing file returns None)
plus an end-to-end check that the symlink case actually 404s on the
HTTP endpoint. Symlink tests skip on Windows where symlink creation
needs elevated permissions; they run on Linux CI.
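
A minimal sketch of the boundary check described above, for reviewers. The
name `safe_join` and its exact exception handling are assumptions for
illustration; the real `_safe_join` may differ in detail.

```python
from pathlib import Path
from typing import Optional


def safe_join(plugin_root: Path, *parts: str) -> Optional[Path]:
    """Join parts onto plugin_root; return None if the result escapes it."""
    try:
        root = plugin_root.resolve(strict=True)
        # strict=True raises when the candidate (or a symlink target) is missing,
        # and resolve() follows symlinks, so a planted link is dereferenced here.
        candidate = root.joinpath(*parts).resolve(strict=True)
        # relative_to raises ValueError when candidate lies outside root.
        candidate.relative_to(root)
    except (OSError, ValueError, RuntimeError):
        return None
    return candidate
```

Resolving before comparing is what makes the symlink case fail closed: the
check runs against the real target, not the path as written.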
---------
Co-authored-by: Minas Arustamyan <arustamyan.minas@gmail.com>
# AI Data Analyst

Open-source data distribution platform for AI analytical systems. Extracts data from sources into DuckDB, serves via FastAPI, and distributes parquets to analysts who use Claude Code for local analysis.

## First-Time Setup

When a user opens this project for the first time, guide them through interactive setup:

### Step 1: Gather Information
Ask the user for:
1. Company domain (e.g., "acme.com") - used for Google OAuth
2. Data source type: keboola / bigquery / csv
3. Instance name (e.g., "Acme Data Analyst")

### Step 2: Generate Configuration
1. Copy `config/instance.yaml.example` to `config/instance.yaml`
2. Fill in values from Step 1
3. If Keboola: ask for Storage API token, stack URL, project ID
4. Create `.env` from `config/.env.template`

### Step 3: Register Tables
1. Use the FastAPI admin API (`POST /api/admin/register-table`, then `PUT /api/admin/registry/{id}` for updates) or webapp UI to register tables
2. Tables are stored in DuckDB `table_registry` with source_type, bucket, source_table, query_mode
3. For migration from old format: `python scripts/migrate_registry_to_duckdb.py`

### Step 4: Docker Deployment
```bash
docker compose up                     # Start app + scheduler
docker compose --profile full up      # Include telegram bot

# HTTPS mode — Caddy + corporate-CA certs at /data/state/certs
docker compose -f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.tls.yml \
  --profile tls up -d
```

See `docs/DEPLOYMENT.md` → **TLS** for cert provisioning + `scripts/ops/agnes-tls-rotate.sh` (daily refetch from `TLS_FULLCHAIN_URL`, `SIGUSR1` reload on diff, no-op when unchanged). The infra repo's `startup.sh` installs this as a systemd timer automatically.

## Project Structure

```
├── src/                    # Core engine
│   ├── db.py               # DuckDB schema (system.duckdb, analytics.duckdb)
│   ├── orchestrator.py     # SyncOrchestrator — ATTACHes extract.duckdb files
│   ├── repositories/       # DuckDB-backed CRUD (sync_state, table_registry, users, etc.)
│   ├── profiler.py         # Data profiling
│   └── catalog_export.py   # OpenMetadata catalog export
├── app/                    # FastAPI application
│   ├── main.py             # App setup, router registration
│   ├── api/                # REST API (sync, data, catalog, admin, auth)
│   └── web/                # HTML dashboard routes
├── connectors/             # Data source connectors (extract.duckdb contract)
│   ├── keboola/            # Keboola: extractor.py (DuckDB extension) + client.py (fallback)
│   ├── bigquery/           # BigQuery: extractor.py (remote-only via DuckDB BQ extension)
│   └── jira/               # Jira: webhook + incremental parquet → extract.duckdb
├── cli/                    # CLI tool (`agnes pull`, `agnes query`, `agnes admin`)
├── app/auth/               # Authentication (FastAPI-based providers)
├── services/               # Standalone services (scheduler, telegram_bot, ws_gateway, etc.)
├── server/                 # Legacy deployment infrastructure
├── scripts/                # Utility + migration scripts
├── config/                 # Configuration templates (instance.yaml.example)
├── docs/                   # Documentation + metric YAML definitions
└── tests/                  # Test suite (633 tests)
```

## Architecture: extract.duckdb Contract

Every data source produces the same output:
```
/data/extracts/{source_name}/
├── extract.duckdb          ← _meta table + views
└── data/                   ← parquet files (local sources only)
```

### Remote table support (`_remote_attach`)

Extractors with remote/passthrough tables (query_mode='remote') include a `_remote_attach` table
in extract.duckdb so the orchestrator can re-ATTACH the external DuckDB extension at query time:

```sql
CREATE TABLE _remote_attach (
    alias VARCHAR,      -- DuckDB alias used in views, e.g. 'kbc'
    extension VARCHAR,  -- Extension name, e.g. 'keboola'
    url VARCHAR,        -- Connection URL
    token_env VARCHAR   -- Env-var name holding the auth token, OR empty for
                        -- extensions with built-in auth (e.g. BigQuery uses the
                        -- GCE metadata server — see `connectors/bigquery/auth.py`).
);
```

The orchestrator reads this table, installs/loads the extension, fetches the token
(via `token_env` lookup, or via the extension-specific auth path when `token_env=''`),
creates a session-scoped DuckDB SECRET when the extension requires one (BigQuery), and
ATTACHes the external source. Views referencing `kbc."bucket"."table"` then resolve correctly.
This mechanism is generic — any connector can plug in.
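
The token lookup described above can be sketched as follows. This is an illustration only: the `RemoteAttach` dataclass and `resolve_token` helper are hypothetical names, not the orchestrator's actual code, and the sketch stops before any extension-specific ATTACH call.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RemoteAttach:
    """One row of the _remote_attach table (hypothetical in-memory shape)."""
    alias: str
    extension: str
    url: str
    token_env: str  # empty string means the extension handles auth itself


def resolve_token(row: RemoteAttach,
                  environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the token for a row, or None when the extension has built-in auth."""
    if not row.token_env:
        return None  # e.g. BigQuery: auth comes from the extension-specific path
    token = environ.get(row.token_env)
    if not token:
        # Fail early with the env-var name only; the secret is never in the table.
        raise RuntimeError(
            f"env var {row.token_env!r} is not set for alias {row.alias!r}"
        )
    return token
```

Keeping only the env-var name in `extract.duckdb` means the file can be copied or inspected without exposing credentials.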

The SyncOrchestrator scans `/data/extracts/*/extract.duckdb`, ATTACHes each into master `analytics.duckdb`, and creates views.

```
┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│   Keboola    │  │   BigQuery   │  │     Jira     │
│  extractor   │  │  extractor   │  │   webhooks   │
│ (DuckDB ext) │  │ (remote BQ)  │  │ (incremental)│
└──────┬───────┘  └──────┬───────┘  └──────┬───────┘
       │                 │                 │
       ▼                 ▼                 ▼
  extract.duckdb    extract.duckdb    extract.duckdb
  + data/*.parquet  (views → BQ)      + data/*.parquet
       │                 │                 │
       └─────────────────┼─────────────────┘
                         ▼
              SyncOrchestrator.rebuild()
              ATTACH → master views in analytics.duckdb
                         │
              ┌──────────┼──────────┐
              ▼          ▼          ▼
           FastAPI      CLI
           (serve)  (agnes pull)
```

Source modes:
- **Batch pull** (Keboola, `query_mode='local'`): DuckDB extension downloads to parquet, scheduled
- **Remote attach** (BigQuery, `query_mode='remote'`): DuckDB BQ extension, no download, queries go to BQ
- **Materialized SQL** (BigQuery, `query_mode='materialized'`): scheduler runs admin-registered SQL through DuckDB BQ extension (via `BqAccess` from `connectors/bigquery/access.py`) and writes the result to `/data/extracts/bigquery/data/<id>.parquet`. Distributed via the same manifest + `agnes pull` flow as Keboola tables. Cost guardrail via `data_source.bigquery.max_bytes_per_materialize` (default 10 GiB; set `0` to disable — YAML `null` falls through to the default).
- **Real-time push** (Jira): Webhooks update parquets incrementally
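
The `0` versus `null` distinction in the materialize guardrail is easy to get wrong, so here is a small sketch of the documented semantics. The function names are hypothetical, and whether the real check is inclusive at the cap is an assumption.

```python
from typing import Optional

DEFAULT_MAX_BYTES = 10 * 1024**3  # 10 GiB, the documented default


def effective_cap(configured: Optional[int]) -> Optional[int]:
    """Return the byte cap to enforce, or None when the guardrail is disabled."""
    if configured is None:   # YAML null (or key absent) falls through to the default
        return DEFAULT_MAX_BYTES
    if configured == 0:      # explicit 0 switches the guardrail off
        return None
    return configured


def within_cap(estimated_bytes: int, configured: Optional[int]) -> bool:
    """True when a materialize run may proceed (inclusive bound is an assumption)."""
    cap = effective_cap(configured)
    return cap is None or estimated_bytes <= cap
```

For example, `within_cap(11 * 1024**3, None)` is rejected under the default, while the same estimate passes with `configured=0`.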

## Configuration

Instance-specific config: `config/instance.yaml` (see example).
Environment variables: `.env` (never committed).
Table definitions: DuckDB `table_registry` table in `system.duckdb`.

## Development

```bash
# Setup
python3 -m venv .venv && source .venv/bin/activate
uv pip install ".[dev]"

# Run FastAPI locally
uvicorn app.main:app --reload

# Run tests
pytest tests/ -v

# Trigger sync manually
curl -X POST http://localhost:8000/api/sync/trigger

# Docker
docker compose up
```

### Local sync & Claude Code hooks

`agnes pull` is the canonical analyst-side distribution path: pulls the RBAC-filtered manifest from the server, downloads parquets whose MD5 changed (skipping `query_mode='remote'` rows), rebuilds local DuckDB views over them. `agnes push` mirrors it for the upload direction (sessions, CLAUDE.local.md).
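
To make the "MD5 changed" step concrete, here is a sketch of how a client could decide which parquets to fetch. The manifest shape (`id`, `md5`, `query_mode` keys) and the `<id>.parquet` local naming are assumptions for illustration, not the actual `agnes pull` implementation.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_md5(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large parquets are not read into memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tables_to_download(manifest: List[Dict[str, str]], data_dir: Path) -> List[str]:
    """Return ids whose local parquet is missing or differs from the manifest MD5."""
    stale = []
    for entry in manifest:
        if entry.get("query_mode") == "remote":
            continue  # remote rows have no parquet to fetch
        local = data_dir / f"{entry['id']}.parquet"
        if not local.exists() or file_md5(local) != entry["md5"]:
            stale.append(entry["id"])
    return stale
```

Comparing digests instead of timestamps keeps the pull idempotent: running it twice in a row downloads nothing the second time.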

`agnes init` writes two hooks into `<workspace>/.claude/settings.json`:

- `SessionStart` → `agnes pull --quiet` — pulls fresh parquets at the start of every Claude Code session
- `SessionEnd` → `agnes push --quiet` — uploads session jsonl + `CLAUDE.local.md` to the server

Both pass `--quiet` so they don't pollute Claude Code stdout, and trail with `|| true` so a server outage never blocks a session. Workspace-level (not user-home) so the hooks fire only when Claude Code opens this analyst workspace, not in unrelated sessions on the same machine.
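
As a rough illustration of the two hooks, the snippet below builds a settings fragment of the kind described. The nesting (`hooks` → event name → list of entries, each with a `hooks` list of `type: command` items) follows Claude Code's hook settings layout as commonly documented; the exact JSON that `agnes init` writes is not shown in this document, so treat the shape as an assumption.

```python
import json


def command_hook(command: str) -> list:
    """Wrap one shell command in the assumed Claude Code hook entry shape."""
    return [{"hooks": [{"type": "command", "command": command}]}]


def build_hooks() -> dict:
    """Settings fragment with the two workspace-level hooks described above."""
    return {
        "hooks": {
            # `|| true` keeps a server outage from blocking the session.
            "SessionStart": command_hook("agnes pull --quiet || true"),
            "SessionEnd": command_hook("agnes push --quiet || true"),
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_hooks(), indent=2))
```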

Admin RBAC for auto-sync: `query_mode IN ('local', 'materialized')` plus a `resource_grants` row for one of the analyst's groups → table appears in their manifest → `agnes pull` downloads it. No per-user sync config; the admin layer is the single source of truth.

## Business Metrics

Standardized metric definitions live in DuckDB (`metric_definitions` table). Import starter pack:

```bash
agnes admin metrics import docs/metrics/
```

### For AI agents analyzing data:
Before computing any business metric, look up the canonical definition:
1. `agnes catalog --metrics` — find the relevant metric
2. `agnes catalog --metrics --show revenue/mrr` — read the SQL and business rules
3. Use the SQL from the metric definition, adapt to the specific question

Never invent metric calculations — always use the canonical definitions.

## Querying Agnes data — agent rails

When asked about ANY data in Agnes, follow this protocol.

### Discovery first

Before writing ANY query against a table, run:

    agnes catalog --json | jq <filter>     # know what's available
    agnes schema <table>                   # learn columns + types
    agnes describe <table> -n 5            # see real values for shape

NEVER write `SELECT * FROM <table>` blindly. For local-mode tables it's
wasteful; for remote-mode tables it can blow up at 225M rows.

### Choose the right tool

Tables in `agnes catalog` have a `query_mode`:

- **`local`**: data is on the laptop as parquet (synced via `agnes pull`).
  Query directly with `agnes query "SELECT … FROM <table>"`.

- **`remote`** (typically BigQuery): the parquet does NOT exist on the laptop.
  You MUST either:
  1. **`agnes snapshot create`** a filtered subset → query the local snapshot, OR
  2. **`agnes query --remote`** for one-shot server-side execution. Works on
     all `query_mode='remote'` rows regardless of upstream BQ entity type
     (BASE TABLE → Storage Read API with predicate pushdown; VIEW /
     MATERIALIZED_VIEW → BQ jobs API, no pushdown). Cost-guarded by a
     5 GiB scan cap (configurable in /admin/server-config). Direct
     `bq."<dataset>"."<table>"` paths are registry-gated — unregistered
     paths return 403 `bq_path_not_registered`.
  3. **`agnes query --register-bq`** for hybrid joins (rarely needed).

### `agnes snapshot create` workflow (preferred for remote tables)

    # 1. estimate first
    agnes snapshot create web_sessions_example \
        --select event_date,country_code,session_id \
        --where "event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
                 AND country_code = 'CZ'" \
        --estimate
    # → "estimated_scan_bytes: 4.2 GB, result: ~250k rows, 12 MB locally"

    # 2. if reasonable, fetch
    agnes snapshot create web_sessions_example ... --as cz_recent

    # 3. query the local snapshot
    agnes query "SELECT event_date, COUNT(*) FROM cz_recent GROUP BY 1 ORDER BY 1"

### Heuristics for `agnes snapshot create`

- ALWAYS list specific columns in `--select`. Avoid implicit SELECT *.
- ALWAYS include a `--where` for remote tables; otherwise add `--limit`.
- ALWAYS run `--estimate` first when:
  - You're not sure of the data shape
  - The table has `partition_by` or `clustered_by` set (per `agnes schema`)
  - The fetch could plausibly exceed 1 GB local bytes
- Reuse `agnes snapshot list` before fetching — if a snapshot covers your
  query already, skip the fetch.

### BigQuery SQL flavor for `--where`

For `source_type=bigquery` (per `agnes catalog`):

- Date literal: `DATE '2026-01-01'` (NOT `'2026-01-01'::date`)
- Timestamp literal: `TIMESTAMP '2026-01-01 00:00:00 UTC'`
- Now: `CURRENT_DATE()`, `CURRENT_TIMESTAMP()`
- Date arithmetic: `DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)`
- Regex: `REGEXP_CONTAINS(col, r'pattern')` (raw string!)
- NULL: `col IS NOT NULL` (standard)
- Cast: `CAST(x AS INT64)` (NOT `INT`)

For `source_type=keboola` / `source_type=jira` (local), use DuckDB SQL flavor
in your `agnes query` calls — there's no `--where` on local since fetch is implicit.

### Snapshot hygiene

- Reuse snapshots across questions in the same conversation.
- Use descriptive names: `cz_recent`, `orders_q1_us`, `sessions_today`.
- Drop with `agnes snapshot drop <name>` when done with a topic.
- `agnes disk-info` to see total cache size.

### When NOT to use `agnes snapshot create`

- Single aggregate on remote BASE TABLE (`SELECT COUNT(*) FROM remote`):
  use `agnes query --remote "SELECT COUNT(*) FROM web_sessions_example"`.
  Storage Read API pushes the COUNT into BQ — cheap, no materialization.
- Single aggregate on remote VIEW/MATERIALIZED_VIEW: same syntax works
  (#160), but the BQ jobs API can't push WHERE/COUNT into the view body.
  Cost guardrail (default 5 GiB) catches expensive scans → 400
  `remote_scan_too_large` with `agnes snapshot create` suggestion. Pivot to
  `agnes snapshot create <id> --where '<predicate>'` if the cap is hit.
- Throwaway exploration: `agnes query --remote "SELECT … FROM <registered_id>"`.
  Direct `bq."<dataset>"."<table>"` paths are now registry-gated — register
  first or use the catalog id.
- Cross-table JOIN with both tables remote: combine `agnes snapshot create` for one
  side + `agnes query --remote` for the other; full cross-remote JOIN
  requires more thought (see #101 for design space).

## Marketplace Repositories

Admin-managed git repos cloned nightly to `${DATA_DIR}/marketplaces/<slug>/`
so FastAPI can read their contents from disk.

- Register via `/admin/marketplaces` (admin UI) or `POST /api/marketplaces`.
- Scheduler calls `POST /api/marketplaces/sync-all` (admin-only, authed via `SCHEDULER_API_TOKEN`) at `daily 03:00` UTC. Routing through HTTP keeps the app the sole writer to `system.duckdb` — the previous in-process call from the scheduler container raced the app's long-lived DB handle and 500-ed on `Could not set lock on file`.
- Manual re-sync from the UI ("Sync now") hits `POST /api/marketplaces/{id}/sync`.
- PATs for private repos persist to `${DATA_DIR}/state/.env_overlay` (chmod 600) as `AGNES_MARKETPLACE_<SLUG>_TOKEN`. DuckDB stores only the env-var name (`token_env`), never the secret.
- Registry lives in DuckDB table `marketplace_registry` (schema v9).
- After each successful sync, `src/marketplace.py` parses `.claude-plugin/marketplace.json`
  from the cloned repo and caches the plugin list in `marketplace_plugins`
  (keyed by `(marketplace_id, plugin_name)`).
- `src/marketplace.py` handles clone/fetch/reset with token redaction in any surfaced error message.

## Access control (v13)

Two layers, no role hierarchy. Full reference: [`docs/RBAC.md`](docs/RBAC.md).

- `user_groups` — named groups. Two seeded as `is_system=TRUE` at startup:
  `Admin` (god-mode short-circuit on every authorization check) and
  `Everyone` (auto-membership for every user).
- `user_group_members` — `(user_id, group_id, source)`. `source ∈
  {admin, google_sync, system_seed}` so each writer only manipulates its own
  rows; Google's nightly DELETE+INSERT does not clobber admin-added members.
- `resource_grants` — generic `(group, resource_type, resource_id)` triple.
  Replaces `plugin_access` from v12; the same shape now covers any future
  entity-scoped grant (datasets, knowledge categories, …).
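
The two-layer model can be summarized in a few lines of plain Python. This is a conceptual sketch only, with grants held in an in-memory set; the helper names are hypothetical and the real checks live in `app.auth.access`.

```python
from typing import Iterable, Set, Tuple

Grant = Tuple[str, str, str]  # (group, resource_type, resource_id)


def effective_groups(member_of: Iterable[str]) -> Set[str]:
    """Every user is implicitly a member of the system group `Everyone`."""
    return set(member_of) | {"Everyone"}


def can_access(member_of: Iterable[str], grants: Set[Grant],
               resource_type: str, resource_id: str) -> bool:
    """Admin short-circuits; otherwise any group grant on the resource suffices."""
    groups = effective_groups(member_of)
    if "Admin" in groups:
        return True
    return any((group, resource_type, resource_id) in grants for group in groups)
```

Note that the Claude Code marketplace feed deliberately skips the Admin short-circuit, as described in the marketplace endpoint section, so this sketch covers only the general authorization path.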

Resource types are an `app.resource_types.ResourceType` `StrEnum` paired with
a `ResourceTypeSpec` registered in `RESOURCE_TYPES` — adding a new one is one
enum member, one `list_blocks(conn)` delegate (projects domain tables into the
`(block → items)` shape the /admin/access tree renders), and one spec entry.
No DB migration, no second wiring step. Endpoints gate with either
`require_admin` (app-level) or `require_resource_access(ResourceType.X,
"{path}")` (entity-level), both from `app.auth.access`.

Admin UI: `/admin/access`. CLI: `agnes admin group {list,create,delete,members,
add-member,remove-member}` and `agnes admin grant {list,create,delete}`.

## Claude Code marketplace endpoint

Agnes serves a single aggregated Claude Code marketplace over two channels,
both gated by PAT auth and filtered per caller:

- `GET /marketplace.zip` — deterministic ZIP download with `ETag` /
  `If-None-Match` (304 when content unchanged). Consumed by a client-side
  SessionStart hook.
- `GET /marketplace.git/*` — git smart-HTTP (dulwich via a2wsgi). Registered
  in Claude Code once, then Claude Code owns the clone/fetch cycle.
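
A deterministic ZIP is what makes the `ETag` useful: identical content must yield identical bytes. The sketch below shows one way to get there with the standard library (sorted entries, fixed timestamps, fixed permissions). It is an illustration under those assumptions, not the server's actual packaging code, and all function names are hypothetical.

```python
import hashlib
import io
import zipfile
from typing import Dict, Optional, Tuple

FIXED_DATE = (1980, 1, 1, 0, 0, 0)  # constant mtime so bytes depend on content only


def build_zip(files: Dict[str, bytes]) -> bytes:
    """Pack files in sorted order with fixed metadata."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in sorted(files):
            info = zipfile.ZipInfo(name, date_time=FIXED_DATE)
            info.compress_type = zipfile.ZIP_DEFLATED
            info.external_attr = 0o644 << 16  # fixed permissions
            zf.writestr(info, files[name])
    return buf.getvalue()


def etag_for(payload: bytes) -> str:
    """Strong ETag: quoted sha256 of the archive bytes."""
    return '"' + hashlib.sha256(payload).hexdigest() + '"'


def respond(files: Dict[str, bytes],
            if_none_match: Optional[str]) -> Tuple[int, bytes, str]:
    """Return (status, body, etag); 304 with empty body when the client is current."""
    payload = build_zip(files)
    etag = etag_for(payload)
    if if_none_match == etag:
        return 304, b"", etag
    return 200, payload, etag
```

A SessionStart hook that stores the last `ETag` and sends it back as `If-None-Match` then pays for a download only when the caller's filtered view actually changed.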

Auth: ZIP uses `Authorization: Bearer <PAT>`. Git uses HTTP Basic where the
password field carries the PAT (`https://x:<PAT>@host/marketplace.git/`) —
git CLI does not speak Bearer.
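
For reference, the two credential forms differ only in how the PAT is wrapped. The helpers below are hypothetical and use standard HTTP Basic encoding; the placeholder username `x` mirrors the URL form shown above.

```python
import base64


def bearer_header(pat: str) -> dict:
    """Header used by the ZIP channel."""
    return {"Authorization": f"Bearer {pat}"}


def basic_header(pat: str, username: str = "x") -> dict:
    """Header git sends for https://x:<PAT>@host/...; the PAT is the password."""
    raw = f"{username}:{pat}".encode("utf-8")
    return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}


def pat_from_basic(header_value: str) -> str:
    """Server-side view: recover the PAT from a Basic credential, ignoring the user."""
    scheme, _, encoded = header_value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic credential")
    _, _, password = base64.b64decode(encoded).decode("utf-8").partition(":")
    return password
```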

Content: filtered via `src.marketplace_filter.resolve_allowed_plugins` which
joins `resource_grants ↔ marketplace_plugins` (matching
`mp.marketplace_id || '/' || mp.name = rg.resource_id`) scoped to the
caller's `user_group_members`. Admin is treated as a regular group here —
no god-mode shortcut for the marketplace feed, so admins curate their own
view by granting plugins to the Admin group (or any group they belong to).
On-disk layout in the served ZIP / git tree uses a slug-prefixed directory
(`plugins/<slug>-<plugin>/`) so two marketplaces shipping a same-named
plugin don't overwrite each other's files. The synth marketplace.json's
`name` field, however, is the plugin's authoritative name from its own
`.claude-plugin/plugin.json` (with a fallback to the upstream
marketplace.json `name`) — Claude Code's `/plugin` UI resolves a loaded
plugin back to its catalog entry by `plugin.json` name, so the catalog
entry's `name` must match. Same-named plugins from two upstream
marketplaces therefore collide in the catalog by design; admin RBAC
(which grants survive the filter) decides which one wins, identical to
how Claude Code behaves when a user adds two upstream marketplaces with
overlapping plugin names directly. `/marketplace/info` exposes both
`name` and `prefixed_name` so operators can disambiguate.
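
The grant join described above can be tried out in isolation. The sketch below uses an in-memory SQLite database as a stand-in for DuckDB (both support `||` concatenation); the column sets are trimmed to what the join needs, and the `resource_type` value `'marketplace_plugin'` is an assumption, not confirmed by this document.

```python
import sqlite3
from typing import List, Tuple


def allowed_plugins(conn: sqlite3.Connection, user_id: str) -> List[Tuple[str, str]]:
    """Plugins granted to any group the user belongs to (no Admin shortcut)."""
    rows = conn.execute(
        """
        SELECT DISTINCT mp.marketplace_id, mp.name
        FROM marketplace_plugins AS mp
        JOIN resource_grants AS rg
          ON mp.marketplace_id || '/' || mp.name = rg.resource_id
        JOIN user_group_members AS ugm
          ON ugm.group_id = rg.group_id
        WHERE rg.resource_type = 'marketplace_plugin'  -- assumed type name
          AND ugm.user_id = ?
        ORDER BY mp.marketplace_id, mp.name
        """,
        (user_id,),
    ).fetchall()
    return [tuple(row) for row in rows]


def demo_connection() -> sqlite3.Connection:
    """Tiny fixture: two plugins, one granted to the `analysts` group."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE marketplace_plugins (marketplace_id TEXT, name TEXT);
        CREATE TABLE resource_grants (group_id TEXT, resource_type TEXT, resource_id TEXT);
        CREATE TABLE user_group_members (user_id TEXT, group_id TEXT, source TEXT);
        INSERT INTO marketplace_plugins VALUES ('core', 'sql-helper'), ('core', 'charts');
        INSERT INTO resource_grants VALUES ('analysts', 'marketplace_plugin', 'core/sql-helper');
        INSERT INTO user_group_members VALUES ('alice', 'analysts', 'admin');
        INSERT INTO user_group_members VALUES ('root', 'Admin', 'system_seed');
        """
    )
    return conn
```

In this fixture `alice` sees only `core/sql-helper`, and `root` sees nothing until a plugin is granted to a group `root` belongs to, which matches the "Admin is a regular group here" rule.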

Cache: content-addressed bare repos at `${DATA_DIR}/marketplaces/git-cache/`
keyed by sha256(filtered content). Two users with the same RBAC view share
one repo; content change → new repo next to the old one. No TTL / prune yet.

User registration inside Claude Code:

```
# ZIP channel (typically via a SessionStart hook that unpacks into ./marketplace/)
curl -H "Authorization: Bearer $AGNES_PAT" https://agnes.example.com/marketplace.zip

# Git channel — one-time registration. Two paths; pick the first that works.

# (a) Direct registration — preferred when it works.
/plugin marketplace add https://x:$AGNES_PAT@agnes.example.com/marketplace.git/

# (b) Two-step fallback — required when (a) fails. Bun-compiled `claude` on
#     macOS / Windows ignores the OS trust store and CA env vars on the
#     marketplace HTTPS path, so direct add can fail with TLS errors against
#     a private-CA Agnes instance even when system tools work fine. System
#     `git` honors GIT_SSL_CAINFO + the OS trust store, so cloning manually
#     and pointing Claude Code at the local clone sidesteps the Bun TLS path
#     entirely.
git clone https://x:$AGNES_PAT@agnes.example.com/marketplace.git/ ~/agnes-marketplace
claude plugin marketplace add ~/agnes-marketplace
# Optional hardening: strip the PAT from the cloned repo's origin so it
# doesn't sit in plaintext at ~/agnes-marketplace/.git/config — re-clone via
# the dashboard's setup flow when the PAT rotates.
git -C ~/agnes-marketplace remote set-url origin https://agnes.example.com/marketplace.git/
```

The dashboard-served setup payload (see `app/web/setup_instructions.py`) already
branches between (a) and (b) automatically based on platform when a private CA
is in play. The block above is the manual equivalent for users registering
outside that flow (e.g. operators bringing up a new instance, or
analysts whose first attempt failed and need to retry by hand).

## Hybrid Queries (BigQuery + Local)

For tables too large to sync locally, use hybrid queries that JOIN local data with on-demand BigQuery results:

```bash
agnes query --sql "SELECT o.*, t.views FROM orders o JOIN traffic t ON o.date = t.date" \
  --register-bq "traffic=SELECT date, SUM(views) as views FROM dataset.web WHERE date > '2026-01-01' GROUP BY 1"
```

The `--register-bq` flag executes a BigQuery subquery, loads the result into memory, and makes it available as a DuckDB view for the final SQL. Multiple `--register-bq` flags can be used for multiple BQ sources.

For complex SQL, use stdin mode:
```bash
echo '{"register_bq": {"traffic": "SELECT ..."}, "sql": "SELECT ..."}' | agnes query --stdin
```
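
When the SQL contains quotes or newlines, building the stdin payload programmatically avoids shell-escaping mistakes. The sketch below only constructs the JSON document in the shape shown above; the helper name is hypothetical, and piping the result into `agnes query --stdin` is left to the caller.

```python
import json
from typing import Dict


def build_stdin_payload(sql: str, register_bq: Dict[str, str]) -> str:
    """Serialize a hybrid query as the JSON document expected on stdin."""
    return json.dumps({"register_bq": register_bq, "sql": sql})


if __name__ == "__main__":
    payload = build_stdin_payload(
        sql="SELECT o.*, t.views FROM orders o JOIN traffic t ON o.date = t.date",
        register_bq={
            "traffic": (
                "SELECT date, SUM(views) AS views FROM dataset.web "
                "WHERE date > '2026-01-01' GROUP BY 1"
            )
        },
    )
    print(payload)  # pipe this into: agnes query --stdin
```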

## Extensibility

### Data Sources (extract.duckdb contract)
New connector = `connectors/<name>/extractor.py` producing `extract.duckdb + data/`.
Must create `_meta` table with columns: table_name, description, rows, size_bytes, extracted_at, query_mode.
Orchestrator ATTACHes it automatically.
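
A connector author can sanity-check the `_meta` contract before handing a file to the orchestrator. The helper below is a hypothetical pre-flight check over a list of column names (however they were obtained); it is not part of the codebase.

```python
from typing import Iterable, List

# Columns the contract above requires in every extract.duckdb `_meta` table.
REQUIRED_META_COLUMNS = (
    "table_name",
    "description",
    "rows",
    "size_bytes",
    "extracted_at",
    "query_mode",
)


def missing_meta_columns(columns: Iterable[str]) -> List[str]:
    """Return required `_meta` columns absent from `columns` (case-insensitive)."""
    present = {name.lower() for name in columns}
    return [name for name in REQUIRED_META_COLUMNS if name not in present]
```

An empty result means the table satisfies the column part of the contract; extra connector-specific columns are tolerated by this check.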

### Authentication
Auth providers in `app/auth/` (FastAPI-based):
- **Google**: OAuth via Google (Workspace group memberships pulled at sign-in — see `docs/auth-groups.md` for the GCP setup checklist + the `security` label gotcha)
- **Email**: Email magic link (itsdangerous token)
- **Desktop**: JWT for API

### RBAC

See **[Access control (v13)](#access-control-v13)** above and [`docs/RBAC.md`](docs/RBAC.md) for the full reference. TL;DR for module authors: gate endpoints with `Depends(require_admin)` for app-level mutations or `Depends(require_resource_access(ResourceType.X, "{path}"))` for entity-scoped grants. Add a new resource type by extending the `ResourceType` `StrEnum` and registering a `ResourceTypeSpec` (with a `list_blocks` projection delegate) in `app/resource_types.py`.

## Release & deploy workflows

Two separate release.yml-style workflows produce GHCR images. Pick the one that matches what you're shipping.

### `release.yml` — auto-build on every push
Runs on **every** push to **every** branch.
- Push to `main` → `:stable`, `:stable-YYYY.MM.N` (CalVer).
- Push to non-main `<prefix>/<branch>` → `:dev`, `:dev-YYYY.MM.N`, `:dev-<branch-slug>`, and (when prefix isn't a Git Flow convention) `:dev-<prefix>-latest` alias.

VMs that pin to a floating tag (`:dev`, `:dev-<prefix>-latest`) auto-upgrade within ~5 min via the cron in `agnes-auto-upgrade.sh`. Convenient for per-developer dev VMs; **footgun for shared dev VMs** (last pusher wins, regardless of who).

### `keboola-deploy.yml` — tag-triggered, explicit deploy only
Runs **only** on git tags matching `keboola-deploy-*`. Publishes:
- `:keboola-deploy-<git-tag-suffix>` — immutable, tied to the exact commit
- `:keboola-deploy-latest` — floating alias the consumer pins to

**Operator workflow:**
```bash
git checkout <commit-or-branch>
git tag keboola-deploy-<descriptive-name>
git push origin keboola-deploy-<descriptive-name>
# → workflow builds + publishes both tags
# → VM cron picks up :keboola-deploy-latest within ~5 min
# → manual cron trigger (skip the wait): sudo /usr/local/bin/agnes-auto-upgrade.sh on the VM
```

Use this when the consumer (e.g. a customer dev VM) needs **deploy-when-I-decide** semantics — no surprise rollouts from upstream branch pushes by other contributors. The infra repo pins `image_tag = "keboola-deploy-latest"` on the relevant VM.

### Module versioning
The customer-instance Terraform module under `infra/modules/customer-instance/` is published as `infra-vMAJOR.MINOR.PATCH` git tags (separate from app CalVer tags). Bump on any module-API change; downstream infra repos pin to the tag in their `source = "github.com/keboola/agnes-the-ai-analyst//infra/modules/customer-instance?ref=infra-v1.X.Y"`.

After merging a module change to `main`:
```bash
git tag infra-vX.Y.Z origin/main
git push origin infra-vX.Y.Z
```

### Replacing a VM after a startup-script change
Module sets `lifecycle { ignore_changes = [metadata_startup_script] }` on `google_compute_instance.vm` so normal `terraform apply` doesn't churn running VMs. To propagate a startup-script update, trigger the consumer's apply workflow manually with the VM resource address — typical workflow_dispatch input is `recreate_targets='module.agnes.google_compute_instance.vm["<vm-name>"]'`.

## Key Implementation Details

### DuckDB Schema (src/db.py)
- Schema v28 with auto-migration v1→…→v28 (see CHANGELOG and docs/RBAC.md):
  - v5 adds `users.active`.
  - v6 adds `personal_access_tokens`.
  - v7 adds `personal_access_tokens.last_used_ip`.
  - v8/v9 added the legacy internal_roles/role-grants tables.
  - v10 added `view_ownership` for cross-connector view-name collision detection (issue #81 Group C).
  - v11 added marketplace_registry + marketplace_plugins + user_groups + plugin_access.
  - v12 added users.groups JSON + user_groups.is_system.
  - **v13 replaces internal_roles/group_mappings/user_role_grants/plugin_access with user_group_members + resource_grants and drops users.groups JSON.**
  - v14 adds FK constraints on user_group_members + resource_grants after orphan cleanup.
  - v15 adds knowledge_items context-engineering columns + contradictions + session_extraction_state.
  - v16 adds verification_evidence.
  - v17 adds knowledge_item_relations.
  - v18 drops stranded non-google memberships from google-managed groups.
  - **v19 drops legacy `dataset_permissions`, `access_requests` tables and `users.role`, `table_registry.is_public` columns; table access is now exclusively per-group via `resource_grants(resource_type='table')`.**
  - **v20 adds `source_query` TEXT to `table_registry` to back `query_mode='materialized'` (BigQuery scheduled-query parquet path).**
  - **v21 adds `welcome_template` singleton table backing the Agent Setup Prompt admin override (`/admin/agent-prompt`).**
  - **v22 reserves the `setup_banner` table (feature dropped mid-development; table retained for forward compatibility with already-migrated instances).**
  - **v23 adds `claude_md_template` singleton table backing the Agent Workspace Prompt admin override (`/admin/workspace-prompt`).**
  - **v24 rewrites materialized BQ `source_query` from DuckDB-flavor `bq."ds"."t"` to BQ-native `` `<project>.ds.t` `` so the new wrapping path accepts them; idempotent + warns when project unconfigured.**
  - **v25 adds `store_entities` + `user_store_installs` + `user_plugin_optouts` backing the `/store` and `/my-ai-stack` pages; the served marketplace is now `(admin_granted ∖ opt_outs) ∪ store_installs`.**
  - **v26 unifies Keboola `query_mode='local'` rows into `'materialized'`: the old local mode (DuckDB Keboola extension's COPY through QueryService) is replaced by the new Storage API export-async path which works regardless of project flags; existing `local` Keboola rows are flipped, NULL `source_query` means full-table export.**
  - **v27 adds 7 columns to `table_registry` for Keboola per-table sync-strategy support: `incremental_window_days`, `max_history_days`, `incremental_column`, `where_filters`, `partition_by`, `partition_granularity`, `initial_load_chunk_days`. Layered on top of v26: admins can opt specific tables back to `query_mode='local'` (via the Direct extract Edit-modal radio) to enable the new dispatcher. The pre-existing `sync_strategy` column (default `'full_refresh'`) is reused: pre-v27 it was inert catalog metadata; post-v27 the Keboola extractor dispatches off it (`full_refresh` | `incremental` | `partitioned`). All new columns NULL on existing rows; meaningful only when paired with the matching strategy.**
  - **v28 introduces explicit-install (Model B) for curated marketplace plugins: the served set flips from `(rbac ∖ user_plugin_optouts)` to `(rbac ∩ subscriptions)`. The `user_plugin_optouts` table+columns are reused (no DDL rename) so existing operator instances skip migration churn; row PRESENCE flips meaning from "excluded" to "subscribed", and the migration wipes existing rows so the inverted reading starts from a clean baseline. Also adds `marketplace_plugins.created_at` (per-plugin newest-first sort on /marketplace), backfilled from parent `marketplace_registry.registered_at` so existing plugins get a sensible date until the next sync overwrites with `CURRENT_TIMESTAMP`.**
|
||
- `table_registry`: id, name, source_type, bucket, source_table, query_mode, sync_schedule, etc.
|
||
- `sync_state`, `sync_history`: track extraction progress
|
||
- `users`, `audit_log`: account state + audit trail. RBAC lives in `user_groups` + `user_group_members` + `resource_grants`.
|
||
- System DB at `{DATA_DIR}/state/system.duckdb`
|
||
- Analytics DB at `{DATA_DIR}/analytics/server.duckdb`
|
||
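The served-set flip recorded in the v25 and v28 schema notes above is plain set algebra, and a small sketch may make the inversion easier to see. The function names and the sample plugin ids below are hypothetical; the real composition lives in the repository layer and may differ in shape.

```python
def served_legacy(rbac: set[str], opt_outs: set[str], store_installs: set[str]) -> set[str]:
    """Pre-v28 reading: every granted plugin is served unless the user opted out."""
    return (rbac - opt_outs) | store_installs


def served_model_b(rbac: set[str], subscriptions: set[str], store_installs: set[str]) -> set[str]:
    """v28 reading (Model B): a grant is only eligibility; the user must subscribe."""
    return (rbac & subscriptions) | store_installs


# Hypothetical ids, for illustration only.
rbac = {"curated/lint", "curated/docs"}
rows = {"curated/docs"}          # same table rows, read two different ways
store = {"flea/42"}

legacy = served_legacy(rbac, rows, store)    # row present => excluded
model_b = served_model_b(rbac, rows, store)  # row present => subscribed
```

Reading the same rows both ways yields different served sets, which is presumably why the v28 migration wipes existing rows instead of reinterpreting them.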
|
||
### SyncOrchestrator (src/orchestrator.py)
|
||
- `rebuild()`: scans extracts dir, ATTACHes all, creates master views, updates sync_state
|
||
- `rebuild_source(name)`: single source (used after Jira webhooks)
|
||
- Thread-safe via `_rebuild_lock`
|
||
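As an illustration of the locking shape described above, here is a minimal sketch. The class is a hypothetical stand-in, not the code in `src/orchestrator.py`, and it assumes that `rebuild()` and `rebuild_source(name)` share the same `_rebuild_lock`, which the notes above do not state explicitly.

```python
import threading
import time


class RebuildCoordinator:
    """Hypothetical stand-in for the orchestrator; only the locking shape matters."""

    def __init__(self) -> None:
        self._rebuild_lock = threading.Lock()
        self.active = 0          # rebuilds currently inside the critical section
        self.max_active = 0      # highest concurrency ever observed
        self.completed: list[str] = []

    def rebuild(self) -> None:
        with self._rebuild_lock:
            self._run("all")

    def rebuild_source(self, name: str) -> None:
        # Assumption: single-source rebuilds serialize on the same lock.
        with self._rebuild_lock:
            self._run(name)

    def _run(self, label: str) -> None:
        self.active += 1
        self.max_active = max(self.max_active, self.active)
        time.sleep(0.001)        # placeholder for the ATTACH / view work
        self.completed.append(label)
        self.active -= 1


coordinator = RebuildCoordinator()
workers = [threading.Thread(target=coordinator.rebuild_source, args=(f"src{i}",)) for i in range(8)]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
```

With the lock held for the whole rebuild, a webhook-triggered `rebuild_source` call waits for a running rebuild to finish instead of interleaving with it.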
|
||
### Connector Pattern
|
||
- **Keboola**: `connectors/keboola/extractor.py` uses the DuckDB Keboola extension, falling back to `client.py`
|
||
- **BigQuery**: `connectors/bigquery/extractor.py` uses DuckDB BQ extension (remote-only, no download)
|
||
- **Jira**: `connectors/jira/webhook.py` → `incremental_transform.py` → `extract_init.py` updates `_meta`
|
||
- `connectors/keboola/client.py`: legacy Keboola Storage API wrapper (kept as fallback)
|
||
|
||
### Config Loading
|
||
1. `config/loader.py` loads `instance.yaml`
|
||
2. `app/instance_config.py` exposes `get_data_source_type()`, `get_value()`
|
||
3. Table config lives in DuckDB `table_registry` (not markdown files)
|
||
|
||
### Files NOT to modify (stable infrastructure)
|
||
- `connectors/jira/file_lock.py` - Advisory file locking
|
||
- `connectors/jira/transform.py` - Core Jira transform logic
|
||
- `services/ws_gateway/` - WebSocket notification gateway
|
||
|
||
## Vendor-agnostic OSS — no customer-specific content
|
||
|
||
This repo is the public OSS distribution. **Nothing customer-specific belongs in code, configuration defaults, comments, docs, commit messages, PR titles, or PR bodies.** That includes:
|
||
|
||
- Specific deployments or brands (private VM names, internal product brands, organization names that aren't already public sponsors).
|
||
- Cloud project IDs, internal hostnames, runbook paths from a particular install (`/opt/<deployment>`, `<host>.<internal-domain>`, `prj-<org>-…`, internal SA emails).
|
||
- Cross-references to private repos (`<private-org>/<private-repo>#NN`). Describe the integration in generic terms or link to public examples instead.
|
||
|
||
When you motivate a change, frame it abstractly ("behind a TLS-terminating reverse proxy", "in containerized deploys") rather than naming a specific operator. When you show examples, use placeholders (`example.com`, `<your-host>`, `<install-dir>`). When config has reasonable defaults pulled from one deployment's habits, generalize them or surface them as documented examples — not hard-coded assumptions.
|
||
|
||
Customer-specific automation, hostnames, and identities live in private infra repos that *consume* this OSS. The OSS describes capabilities, defaults, and configuration knobs — not how a specific operator wired them up.
|
||
|
||
## Changelog discipline — non-negotiable
|
||
|
||
**Every PR that adds, removes, or changes user-visible behavior MUST update `CHANGELOG.md` in the same PR.** No exceptions, no follow-ups, no "I'll do it after merge". User-visible = anything an operator, end-user, or downstream integrator can observe: CLI flags / output / exit codes, REST endpoints / payloads / status codes, web UI, `instance.yaml` schema, env vars, `extract.duckdb` contract, Docker / compose / Caddyfile knobs, default behaviors, breaking changes, security fixes.
|
||
|
||
**How:**
|
||
- Add a bullet under the topmost `## [Unreleased]` heading (create one if missing — it sits above the latest released version).
|
||
- Group by `### Added` / `### Changed` / `### Fixed` / `### Removed` / `### Internal` (Keep-a-Changelog sections).
|
||
- Mark breaking changes with `**BREAKING**` at the start of the bullet — operators grep for that string before bumping the pin.
|
||
- Reference the relevant doc/runbook if one exists (e.g. `see docs/auth-groups.md`) rather than restating it.
|
||
- Internal-only changes (refactors, test additions, dependency bumps without behavior change) go under `### Internal` — still log them, just keep them terse.
|
||
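Because operators grep for `**BREAKING**`, it can help to check an entry mechanically before requesting review. The following is a rough sketch of such a check, not an existing repo script; the helper names and the sample changelog text are made up for illustration.

```python
import re


def unreleased_section(changelog: str) -> str:
    """Return the body of the first `## [Unreleased]` section, or '' if absent."""
    match = re.search(
        r"^## \[Unreleased\][^\n]*\n(.*?)(?=^## \[|\Z)",
        changelog,
        flags=re.MULTILINE | re.DOTALL,
    )
    return match.group(1) if match else ""


def breaking_bullets(changelog: str) -> list[str]:
    """List the Unreleased bullets that start with the **BREAKING** marker."""
    return [
        line.strip()
        for line in unreleased_section(changelog).splitlines()
        if line.lstrip().startswith("- **BREAKING**")
    ]


# Hypothetical changelog content.
sample = (
    "# Changelog\n\n"
    "## [Unreleased]\n\n"
    "### Changed\n"
    "- **BREAKING** Renamed a CLI flag.\n"
    "- Tweaked log output.\n\n"
    "## [1.2.0]\n\n"
    "### Added\n"
    "- **BREAKING** An older entry.\n"
)
pending = breaking_bullets(sample)
```

Only bullets under the Unreleased heading are reported; entries in already-released sections are ignored.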
|
||
**When you cut a release:**
|
||
- Rename `## [Unreleased]` → `## [X.Y.Z] — YYYY-MM-DD`.
|
||
- Add a new empty `## [Unreleased]` section at the top so the next PR has somewhere to land.
|
||
- Bump `version` in `pyproject.toml` to match `X.Y.Z`.
|
||
- Tag the merge commit as `vX.Y.Z` and push the tag.
|
||
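The first two release steps above amount to a single text transformation on `CHANGELOG.md`. A minimal sketch, assuming the heading formats shown above; `cut_release` is a hypothetical helper, and the `pyproject.toml` bump and the tag are still done separately.

```python
import re
from datetime import date


def cut_release(changelog: str, version: str, released: date) -> str:
    """Rename the Unreleased heading to a dated version and add a fresh one above it."""
    # "\u2014" is the dash used in the `## [X.Y.Z] ... YYYY-MM-DD` heading format.
    heading = f"## [{version}] \u2014 {released.isoformat()}"
    new_text, count = re.subn(
        r"^## \[Unreleased\][ \t]*$",
        lambda _match: "## [Unreleased]\n\n" + heading,
        changelog,
        count=1,
        flags=re.MULTILINE,
    )
    if count == 0:
        raise ValueError("no `## [Unreleased]` heading found")
    return new_text


# Hypothetical input and version number.
before = "# Changelog\n\n## [Unreleased]\n\n### Added\n- New thing.\n"
after = cut_release(before, "1.3.0", date(2025, 1, 2))
```

Raising when the heading is missing keeps a release from silently shipping without its notes.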
|
||
**If you find yourself opening a PR without a CHANGELOG entry, stop and add one before requesting review.** Reviewers should bounce PRs that touch user-visible behavior without a changelog update — same way they'd bounce a PR with no test changes for new logic.
|
||
|
||
## Git Commits & Pull Requests
|
||
|
||
- Keep commit messages clean and concise
|
||
- Do not include AI attribution in commits or PRs
|
||
- Before opening a PR, scan the diff and the PR body for the customer-specific tokens listed above (`grep -niE '<token1>|<token2>|...'`). If anything matches, generalize or remove it.
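The pre-PR scan can also be scripted. The sketch below mimics a case-insensitive grep with line numbers; `find_tokens` and the placeholder token are hypothetical, and, unlike `grep -E`, tokens are matched literally rather than as regular expressions.

```python
import re


def find_tokens(text: str, tokens: list[str]) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs for lines containing any token, ignoring case."""
    if not tokens:
        return []
    pattern = re.compile("|".join(re.escape(token) for token in tokens), re.IGNORECASE)
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if pattern.search(line)
    ]


# Placeholder token and diff text; substitute the real list privately, never in the repo.
diff_text = "Add retry to sync\nDeploy notes for Internal-Host-01\n"
hits = find_tokens(diff_text, ["internal-host"])
```

An empty result means the diff or PR body is clear of the listed tokens; any hit should be generalized or removed before the PR is opened.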
|